Unstructured data is still a pain in the butt (but less impossible now)
Quick announcement. For the next few weeks I will be traveling on a much-needed family vacation. I intend to keep writing on the road (the 7yr streak must go on), but due to time zones I might accidentally be off by half a day or something. I also expect to be writing completely from my phone.
For some reason, one of THE most popular quantitative UX research-y questions I get asked about now involves mostly unstructured user data. This is the "go analyze thousands of rows of unstructured text and 'make sense' of it all" type of question. The whole field of Natural Language Processing has been nibbling at these problems for many decades now, but lately the demand for the work seems to be at an all-time high thanks to the LLM boom putting it in people's heads that these language-based problems can be tackled with "AI" that does language convincingly.
For many of these problems, modern LLMs do represent state-of-the-art performance. I'm thinking about problems like "theme extraction" (pulling out a short summary of a text's primary theme) or "sentiment detection" (labeling whether a comment is happy, angry, or sarcastic). While I did study some NLP back in my student days, that was literally in the age when "bag of words" was one of the dominant models of language, and Markov chain language models were considered a little bit "fancy" due to how resource and data intensive the method was. So the fact that a big LLM, some prompting, and practically zero labeled data can let anyone coax out almost state-of-the-art results seems like unbelievable magic.
Overall these analysis pipelines take very similar shapes:
- Do a bunch of data cleaning, PII stripping, etc. This part tends to be super important and extremely fussy
- Try to explain what you're doing in a system prompt
- For each row of data, run it through the LLM with that prompt to get a summary or label
- Shove everything through some kind of topical clustering system based on embeddings, cosine similarity, or whatever seems reasonable and then make clusters
- Inspect the various clusters and try to assign a label to them. You can do this manually or with more LLM summarization pixie dust
- Try to eyeball whether they look right or not as a sniff test. Guess if the clusters are "the right size" or not – you don't want over-broad categories catching most of the output, but you also don't want a bunch of singletons
- Try to read through the individual result classifications to determine "did the analysis do its intended job?" There are often so many that your sanity starts slipping
- Find all sorts of weirdness that gets more intriguing the more you dig. It's rarely completely wrong, but never quite right either. You find yourself doubting your design choices or even your own command of language
- Give up going fully solo and try to find even more humans to look at the results carefully and label whether they agree or not with the output
- Loop back to the beginning and pull the slot machine lever again, making somewhat haphazard, somewhat educated guesses as to what specific parts of the process need changes.
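To make the clustering and "right size" sniff-test steps above a bit more concrete, here's a minimal Python sketch. Everything in it is an assumption for illustration: `bag_of_words` is a toy stand-in for real embeddings, `greedy_cluster` is a deliberately naive clustering scheme, the 0.5 threshold is arbitrary, and the sample comments are made up. A real pipeline would likely swap in an embedding model and a proper clustering library.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(texts, threshold=0.5):
    """Put each text in the first cluster whose seed is similar enough,
    otherwise start a new cluster. Returns one cluster id per text."""
    seeds = []   # vector of the first text in each cluster
    labels = []
    for text in texts:
        vec = bag_of_words(text)
        for cluster_id, seed in enumerate(seeds):
            if cosine_similarity(vec, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster was close enough: start a new one.
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


def size_report(labels):
    """Sniff-test numbers: cluster count, share held by the biggest
    cluster, and how many clusters contain a single item."""
    sizes = Counter(labels)
    return {
        "clusters": len(sizes),
        "largest_share": max(sizes.values()) / len(labels),
        "singletons": sum(1 for n in sizes.values() if n == 1),
    }


# Made-up feedback comments, purely for illustration.
comments = [
    "the app keeps crashing on login",
    "app keeps crashing when I login",
    "love the new dark mode",
    "the new dark mode is great",
    "please add export to csv",
]
labels = greedy_cluster(comments)
report = size_report(labels)
print(labels, report)
```

A large `largest_share` hints at an over-broad catch-all category, while a high `singletons` count hints at the opposite problem. Neither number tells you whether the clusters mean anything; that part is still on a human reading the rows.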
Now, in more formal studies you'd do things like build a validated test set of data hand-labeled by humans beforehand, then compare the model's output against the human labels to calculate an inter-rater reliability (IRR) score. But since those labeled datasets are extremely expensive in terms of time to create, most of us in industry employ the "spot check and vibe" evaluation method.
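For a sense of what that comparison looks like, here's a small sketch using Cohen's kappa, which is one common IRR statistic for two raters. Picking kappa is my assumption for the example; there are other IRR measures and the right one depends on your setup. The `human` and `model` label lists are made up, and returning 1.0 in the degenerate case is a choice made for this sketch.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for everything, so kappa is
        # undefined. Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels for the same eight comments.
human = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
model = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
kappa = cohens_kappa(human, model)
print(f"kappa = {kappa:.2f}")
```

In this toy data the two raters agree on 6 of 8 items, but because half of that agreement would be expected by chance, kappa comes out much lower than the raw 75% agreement suggests.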
The problem with the whole situation is that stakeholders have learned that this overall text-based method is plausible. And if things are plausible, then with the amazing advances in technology the analysis must be easier than ever to do now! Since much of the product development world is messy unstructured data like feedback responses, open ended question responses, interview transcripts, user created reviews and comments, there is a LOT of this data sitting around. It seems so perfectly obvious to leverage the new hot tech to rapidly "summarize" giant blobs of quirky text data into headlines.
Insofar as an LLM will give you text back if you shove text at it, then yes, it's "trivial" to do analyses on unstructured text data. At the very least, the difficulty went from "nearly impossible" in the bag-of-words age to "possible" now. But while generating a result is easy enough, validating the result and building confidence that it is actually doing the intended analysis is my ongoing nightmare.
Yes, LLMs are prone to ridiculous hallucinations and are unlikely to ever be free of them. But even if they aren't making things up from whole cloth, they exhibit really weird behavior when "summarizing" in that their attention mechanisms carry a bias as to what is important. That importance may have little to do with your actual research question. Whatever quirks are going on under the LLM hood, in the end it still lands on human raters and labelers to bridge the results to reality.
And therein lies the tension that we face. Having a result and having a validated result are completely different things. Some stakeholders are sensitive to the difference, while others aren't (or they find it convenient to not care). Labeling is slow and expensive in a world that is increasingly pressuring for speed and shipping. You start fielding questions like "what if we generate synthetic users and data?" or "can we use an LLM-as-judge system?" While there's plenty of emerging literature about these methods right now, none of them are necessarily grounded in human experience either. It all still falls back onto grounding against humans eventually, but few people want to pay that hefty cost.
So in the meantime, spot-checking is the go-to method in practice for small teams. We all know that it's not the correct way to do things, but given the time and resources available it's often what we have. The labeling guidelines barely exist because no one's gone through the data to form opinions about what the coding scheme should look like yet either. Getting workable IRR out of a group of humans under these conditions is a bit of a pipe dream: it pretty much violates every single best practice for the whole evaluation method. How much better is inserting a machine that is known to make stuff up in various situations into this mess? It's taking an already hot mess of wild west barely-labeled unstructured data analysis and throwing in a different hot mess of LLMs on top. Are we really learning anything?
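One way to put a number on how little a small spot check tells you is a confidence interval around the agreement rate. The sketch below uses the Wilson score interval, which is my choice for illustration and not something prescribed above. The counts (27 agreements out of 30 rows checked) are made up, and it assumes the checked rows were sampled at random, which spot checks often aren't.

```python
import math


def wilson_interval(agreements, checked, z=1.96):
    """Wilson score interval for a proportion.

    agreements: rows where the reviewer agreed with the model's label
    checked:    total rows the reviewer looked at
    z:          normal quantile (1.96 is roughly a 95% interval)
    """
    if checked <= 0 or not 0 <= agreements <= checked:
        raise ValueError("need 0 <= agreements <= checked and checked > 0")
    p = agreements / checked
    denom = 1 + z * z / checked
    center = (p + z * z / (2 * checked)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / checked + z * z / (4 * checked * checked)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical spot check: a reviewer agreed with 27 of 30 sampled rows.
low, high = wilson_interval(27, 30)
print(f"agreement is plausibly anywhere from {low:.0%} to {high:.0%}")
```

With only 30 rows checked, a 90% observed agreement rate is compatible with a true rate anywhere from the mid 70s to the mid 90s, which is a wide enough range that "looks fine to me" isn't saying much.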
Anyway, the lesson here is that when you're facing a mess of unstructured data, there is a path to analyzing it. It's often many orders of magnitude slower than what our stakeholders think it should be, but a relatively rigorous set of methods exists if we can convince them to go along with the project.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.