A picture of some ruins in, I believe, Greece

Labeling things by hand when everyone's trying not to

Jan 14, 2025

This week I've got a lot of half-baked things going on that need another couple of weeks to fully bake, including something about data fundamentals and classes. And my gamedev conference is in 2 weeks and the house is still far from done... So I'm riffing off some stuff I read last week.

A reader, Michael Mullarkey, just launched their Data Dash newsletter (yay) and made their first post about their experiences with hand labeling data. I've long been a proponent of resisting the temptation to automate and of doing things by hand at least a bit, so I'm gonna take the opportunity to amplify a viewpoint I agree with, while also updating some of my views for the new age of LLMs.

The newer age of labeling data

Right now, the hot stuff with labeling data, especially semi-structured and unstructured data, is the use of Large Language Models for zero-shot or few-shot classification. The basic idea is that, IF you accept the premise that LLMs do acceptably well for this class of problem on the whole and for your particular problem in the specific, THEN you can use an LLM to help automate the labeling of data much faster and cheaper than getting humans to do it.
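
To make the idea concrete, here's a minimal sketch of what zero-shot labeling can look like in practice. This is a sketch only: the label taxonomy, prompt wording, and model name are placeholders I made up, and it assumes the OpenAI Python client, though any LLM API works along the same lines.

```python
# Minimal zero-shot labeling sketch. The taxonomy, prompt, and model name are
# illustrative placeholders -- swap in whatever fits your own task.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

LABELS = ["bug report", "feature request", "praise", "other"]  # made-up taxonomy

def label_record(text: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM to pick exactly one label for a piece of text."""
    prompt = (
        "Classify the following user feedback into exactly one of these "
        f"categories: {', '.join(LABELS)}.\n"
        "Respond with the category name only.\n\n"
        f"Feedback: {text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep labeling as repeatable as the API allows
    )
    answer = response.choices[0].message.content.strip().lower()
    # Guard against the model answering with something outside the taxonomy.
    return answer if answer in LABELS else "UNKNOWN"

# e.g. label_record("The export button crashes the app")  ->  "bug report"
```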

This is still a very active area of research, and thus far my high-level summary of the vibe of the literature is "it seems to work decently well, until it doesn't, and humans aren't that great either". While the literature is ever-shifting and there's a glut of "we tried this on our problem and it sorta worked!" papers, the methodology seems to work well enough that you can't rule it out immediately, even if it might be garbage for your particular task due to the specific intersection of quirks of your project.

An AI joke from the last year or so (?) suggested that we all replace claims of "AI" with "clueless intern" to get an accurate assessment of what LLMs could do. Amusingly (or not), since a lot of menial human labeling tasks get kicked down to students or Mechanical Turkers, both of whom may have little training or context and are prone to making errors, the AI version isn't too far of a stretch from current practice. So, maybe the method isn't completely BS. Maybe.

As with all things "AI" these days, I'm fairly skeptical of the broad, overly optimistic claims, but I think there are specific problem scopes where their use is likely fine. The problem is we don't have a way to identify what those scopes are yet. So I continue to spend time trying out the method on smaller-scale projects on the side to get familiar with it. I'm always lured in by the unfounded promise of potentially freeing myself from hours and hours of toil, while at the same time being faced with the very difficult problem of coming up with an effective LLM output evaluation strategy.

But even with the lure of a potential method to automate away a large part of the toil of labeling data, I'm going to continue to advocate for finding opportunities to hand label.

Why hand labeling is still super important

My preference for hand labeling comes from two separate avenues. First, without good hand labeling, we continue to have little basis for doing model evaluation. Second, I think there's a benefit to us as researchers in understanding our data directly that we cannot get if we hand the work off to someone else, even another human.

The annoying thing about LLMs is that they're obviously these imperfect black boxes. We know they can break, but we can't peer into their inner workings in a way that will give us confidence as to where they're going to break. More importantly, we don't know if those breakpoints overlap with whatever particular task we're trying to use them for. Since it's a bad idea to simply hope that our use cases aren't breaking, we need to have a way to evaluate the model output and make a judgement for ourselves. Our particular data and use cases are essentially unique to us, and I certainly don't have faith that evaluations done in other contexts will generalize yet.

So far, the methods I've seen for understanding if a particular LLM is fit for task revolve around various forms of spot checking or sampling of outputs, using trained humans as a reference point. There are already robust methods for calculating things like inter-rater reliability, and treating an LLM system as "yet another labeler" isn't particularly difficult; many studies that look into this problem do exactly that. But at the end of the day, it means you've still got humans labeling things by hand. It's just a significantly smaller number of humans.
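
In practice, the "yet another labeler" framing can be pretty mechanical. Here's a small sketch, with made-up labels and an arbitrary agreement threshold, of comparing a human-labeled spot-check sample against LLM output using Cohen's kappa from scikit-learn:

```python
# Treat the LLM as "yet another labeler": compare its labels against a
# human-labeled spot-check sample using Cohen's kappa (scikit-learn).
from sklearn.metrics import cohen_kappa_score

# Made-up example: both raters labeled the same 8 sampled records.
human_labels = ["bug", "praise", "bug", "other", "feature", "bug", "other", "praise"]
llm_labels   = ["bug", "praise", "other", "other", "feature", "bug", "other", "bug"]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa (human vs. LLM): {kappa:.2f}")

# Arbitrary illustrative cutoff -- decide what agreement is acceptable for
# *your* task before you look at the number.
if kappa < 0.7:
    print("Agreement too low; don't trust the LLM labels for this task yet.")
```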

As a bit of an aside, I've seen multiple instances where different versions of the same model can behave differently enough on identical input that you have to treat them as different beasts. So keep your evaluation code handy.

While evaluation is important for doing production work, I think that hand labeling is still extremely important for us as researchers even if we're not doing prod stuff. Going through lots of data by hand exposes us to a broad range of data points, and in my experience that tends to give us new and interesting ideas as we go. Anyone who has labeled data by hand will have experienced the need (and desire) to go back and adjust their label taxonomy because they learn new things as they go. A label becomes overused and needs to be broken up with more nuance. Another group of labels actually should be combined into a related theme instead. You find some completely unexpected thread that leads you down a new research question.

Reading the data forces us to develop an evolving model and viewpoint of what the data is in our minds, and we learn from that effort. It's not wasted effort, no matter how much people pushing for "productivity" claim it to be. So any push to delegate this work to another human, let alone an unthinking machine, should be looked upon with suspicion.

I don't think there's another way to get this insight into our brains. Even before we had "AI" to do things like "summarize" or "extract themes" (tasks which some people claim LLMs can do, but again, I'm still skeptical as to how good the output is for our needs), we knew that the amount of learning we get from reading a summary of a book from Wikipedia or Cliff's Notes is significantly less than from reading the book ourselves. That should have been a lesson we learned in high school, without having to reinvent it with LLMs at the office.

But here we are.


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you're interested in writing a data-related post to show off work or share an experience, or want help coming up with a topic, please contact me. You don't need any special credentials or credibility to do so.

"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

  • randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted, and support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Send a one time tip (feel free to change the amount)
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
