There were these stuffed dog plushies in a meeting room at an old workplace that were SUPER enhanced via the power of googly eyes.

The call of LLMs is strong, we get to pick up the pieces later

Jun 18, 2024

Update! The slides for my talk at QuantUXCon 2024 about making better experiments by doing fewer useless crappy ones are up! If you registered for the event, the recording of the session is available in the UI. If you didn't register and want to see a clandestine recording I sneakily made of myself, send me a message. Also, the stress of house hunting is finally over!

Uh oh, looks like another group of people thinks that LLMs are the answer to their applied research problems. Instead of people who want to get rid of user research participants by using LLMs to create "synthetic users", this time it's pollsters who want to replace their ever-dwindling survey panels with synthetic respondents.

Using AI for Political Polling – Ash Center

What I find interesting about this is that the people proposing these LLM synthetic-data schemes are being more nuanced about it? As in, they've already seen various people get skewered for saying patently ridiculous things like "get rid of the user in user testing!" Since user testing is specifically about being surprised by how users react to a given product, using an LLM to simulate a generalized average experience rather misses the point. It's cheaping out on data collection in a way that undermines the purpose of collecting the data in the first place.

But the people proposing LLMs for polling have likely seen those efforts and have attempted to craft arguments for why the methodology might still work in their particular domain. I'm not particularly convinced by the arguments, but they at least make the attempt.

So the problem facing pollsters right now is that it's increasingly difficult to get people to respond to polls. The old method of "randomly dialing numbers out of the phone book" obviously stopped working once cell phones replaced home landlines. Pew has put out a bit of research (linked below) on how pollsters are shifting to other methods to fill out their panels. The current trend seems to lean toward opt-in online panels and text-message outreach.

How Public Polling Has Changed in the 21st Century
A new study found that 61% of national pollsters used different methods in 2022 than in 2016. And last year, 17% of pollsters used multiple methods to sample or interview people – up from 2% in 2016.

So the draw of these synthetic LLM panels is that they will spit out as many survey responses as you want. They'll also answer all the probing follow-up questions and alternate scenarios you care to throw at them. Great for an industry struggling to reach actual humans to measure. People who don't know better, and people who probably should know better, are all piling in on using these tools to further their confirmation biases.

But the argument for whether these LLMs are in any way effective is much murkier. Apparently if you feed them a set of survey questions, they'll do "a fairly good job" at simulating what humans do. But the article notes one place where it breaks down: the model gave incorrect responses to questions about the war in Ukraine because its training data predated the actual political discourse on the topic. In effect, the large language model isn't really predictive of future political opinions; at best it captures a bunch of the political discussion zeitgeist that gets spewed onto Reddit, Twitter, and news sites. If you want your survey panel to be weighted to regurgitate the generalized political discourse of the times, go ahead? By the way, what's the correlation between online political posturing and actual voting behavior anyway? I don't have any interest in polling, so I don't know whether this is a desirable state for them or not.
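
To make the mechanics concrete: a "synthetic respondent" is usually not much more than a persona prompt plus the survey question. Here's a minimal sketch of the general shape of the idea, assuming the OpenAI Python client; the persona, question wording, and model name are all made up for illustration, and this is not the pipeline from the article.

```python
# A minimal sketch of a "synthetic respondent": stuff a demographic persona
# into the system prompt and ask it a survey question. The persona fields,
# question text, and model name are invented for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

persona = (
    "You are a 45-year-old registered independent living in suburban Ohio "
    "who follows the news a few times a week."
)
question = (
    "Do you approve or disapprove of the way the president is handling the "
    "economy? Answer with one of: Approve, Disapprove, Unsure."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": question},
    ],
    temperature=1.0,  # some randomness so repeated "respondents" don't all agree
)

print(response.choices[0].message.content)
```

Nothing in that prompt can know about events after the model's training cutoff, which is exactly the Ukraine failure mode described above.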

The future is in better bullshit detection

One thing is becoming quite clear in this space of "let's make an LLM generate DATA for us" efforts: regardless of whether it actually works on a conceptual or practical level, people who make decisions will most definitely attempt to apply the technology to every domain they can. Stopping them ahead of time is likely an exercise in futility until they see it break for themselves.

So it falls upon people like us to figure out whether the methods work for our situations. Luckily, model evaluation is somewhat in our wheelhouse, however awkward it can be. I know countless data scientists who are currently working on evaluating whether some LLM-based methodology is effective or not. Maybe they're using LLMs to summarize and classify input text, or as a helpful UI element, or to generate fake data for other purposes. Either way, the work itself is unavoidable.
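
If you find yourself doing that kind of evaluation, the starting point is often refreshingly boring: line the model's labels up against human labels on a sample and measure agreement. A tiny sketch, assuming scikit-learn is available and using invented labels:

```python
# Tiny sketch of evaluating an LLM-as-classifier against human labels.
# The labels below are made up; in practice they'd come from a labeled
# sample of real input text.
from sklearn.metrics import accuracy_score, cohen_kappa_score, classification_report

human_labels = ["bug", "feature", "bug", "praise", "feature", "bug"]
llm_labels   = ["bug", "feature", "praise", "praise", "feature", "bug"]

print("accuracy:", accuracy_score(human_labels, llm_labels))
# Cohen's kappa corrects for chance agreement, which matters when one
# class dominates the data.
print("kappa:   ", cohen_kappa_score(human_labels, llm_labels))
print(classification_report(human_labels, llm_labels))
```

From there it's the usual judgment calls about sample size, which classes actually matter, and how much disagreement you can live with.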

So while many people are enamored with the glory of creating new models and making them do things we probably shouldn't be letting them do, my opinion is that a chunk of my career for the next couple of years will be learning the growing toolbox that allows us to judge whether a model is actually, finally, delivering what people claim it can do. I think I'm going to need to learn to poke holes in these things more and more.

My AI-pessimistic view of things is that the hype is going to fade as more and more spectacular failures pop into the general public's view of the technology. Once these things stop looking like magic and more like the misshapen blobs of distilled Reddit memes they are, people are going to start asking what these models can actually do. They're going to want some way to assure themselves that AI-Model-V-1000 will actually do what's promised.

As people who sit on the other side of the table, developing products and evaluating models, at some point we will be tasked with figuring out what promises we can put on our products once the blowback from putting out a rushed AI release becomes too painful to risk. Hopefully we'll have figured out a thing or two by then.


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

  • randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted, so support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!