Data work in the fast fashion code era
Software engineers are having a bit of an existential moment right now because, regardless of your opinions about the AI hype cycle, LLMs as code-writing tools have gotten to the point where a lot of engineers I respect are noting that they've become very good of late. And these are folks who are still quite critical of the whole AI nonsense going on in general.
The growing sentiment seems to be that code has become "cheap", very much like how fast fashion clothes are cheap – it's of dubious quality, is meant to fall apart and be thrown away in a couple of seasons, and can't really be maintained in any meaningful sense. But it gets the immediate job done at a price (in time) that is very hard to beat.
For software engineers who are supposed to be writing software for systems that have a long shelf life, this presents a problem that they're collectively trying to figure out. What processes need to be in place to get the most benefit with the least risk? What has to change? How are you going to hire new developers, or even train up younger, newer talent?
But for the vast majority of us who work in data science and quantitative research, especially those of us whose work has a significant ad hoc component, the prospect that the code generated by such tools is barely-maintainable throwaway code is not a huge barrier to adoption. Hell, personally speaking, the majority of the code I write professionally gets thrown away pretty damn quickly. Most probably doesn't even get run more than a single time to wrangle some unholy data atrocity into a usable presentation. I'm sure many people have had similar experiences.
So, if code-generating LLMs are supposed to be pretty damn good now, what can we do with them in our work?
The one thing that I'm pretty sure still doesn't work
So the first thing that engineering types THINK we would want, for some reason, is SQL generation. And while I haven't tried it in the past six months or so, I'm still decently confident that LLM-to-SQL generation is hot garbage. There continue to be massive issues with transmitting "enough context" to an LLM to generate anything beyond the most basic SQL. I've had multiple SWEs tell me that these systems can "generate SQL for their work," and upon probing it's inevitably some short 10-50 line query that maybe juggles a couple of tables for a UI view. As an analyst, I'm used to dealing with the 500-line, "why did you nest these CTEs in more CTEs?" monster queries full of CASE statements, and we all know how much unspoken domain knowledge gets crammed into those.
The many ways of getting/creating data
LLMs can be used to create data. That sentence can be interpreted in at least two ways, one of which I find an utter atrocity, and the other is something that I'm actually pretty excited about.
The atrocity part is all the nonsense about using LLMs to generate synthetic data. These are the folks who are in the business of selling the snake oil of "why do user testing or surveys when you can just have an LLM generate the user testing responses for you!" If I see the term "silicon sample" used unironically anymore I'm going to lose it. The fact that some people are willing to pay other people to use these samples for "research" completely boggles my mind. It's like paying someone to make up lies you want to hear.
Suffice it to say, if you're in the UX world, or even just doing data science work that impacts or studies humans, you should collect your data directly from actual humans. Fancypants auto-complete isn't a rational substitute.
Let's jump to the part about cheap code that makes me excited.
Cheap code means we can get more data
Everyone who makes a career out of working with data has collected a lot of domain knowledge about the systems we work with. We learn how our data is collected and generated, and we use all sorts of tricks and hacks to make the best of the data we have available.
There's always been more data that we believe would be useful but haven't had the ability to get access to. For example, I once had to extract data out of a bunch of poorly formatted PDF files released by a local government office. While I know in theory that there are software packages out there I could use to read data from PDFs, it's a lot of work to find the appropriate Python package, read tutorials on how to use the library, then try to apply it to the specific flavor of madness in the files I was working with. It could easily take me a week of nighttime coding to figure out. An LLM cranking out cheap code could have written the Python to do much of that work.
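Just to make the "cheap code" point concrete, here's a minimal sketch of the kind of throwaway PDF-scraping script an LLM can hand you in a few minutes. I'm assuming the pdfplumber library here; the file name, header text, and output format are hypothetical stand-ins for whatever your particular pile of PDFs looks like.

```python
# Throwaway-grade sketch: pull tables out of a messy PDF and dump them to CSV.
# The file name and the "Case No." header are hypothetical examples.
import csv

import pdfplumber

rows = []
with pdfplumber.open("county_filings_2023.pdf") as pdf:  # hypothetical file
    for page in pdf.pages:
        for table in page.extract_tables():
            for row in table:
                # Skip blank rows and the header row repeated on every page
                if row and row[0] and row[0] != "Case No.":
                    rows.append(row)

with open("county_filings_2023.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

It's not pretty, it won't survive next year's report format, and that's fine – it only has to run once.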
But it doesn't just stop at coding up PDF extractors. LLMs can help with stuff like stringing together machine vision libraries to do video analysis. Some of the fancy models have video analysis functionality built in that can extract frames and objects and other details. What used to require a lot of specialized software knowledge to make working implementations now only requires us to be able to give clear specifications and have a keen eye for evaluating outcomes.
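Same idea for video: here's a minimal sketch of the frame-sampling glue code that used to mean an afternoon of reading OpenCV docs. The input file, output directory, and one-frame-per-second sampling rate are all assumptions purely for illustration.

```python
# Sketch: sample roughly one frame per second from a video for later analysis.
import os

import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("site_visit.mp4")   # hypothetical input video
fps = cap.get(cv2.CAP_PROP_FPS) or 30      # fall back if FPS isn't reported

frame_idx, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % int(fps) == 0:          # keep ~1 frame per second
        cv2.imwrite(f"frames/frame_{saved:05d}.png", frame)
        saved += 1
    frame_idx += 1

cap.release()
```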
So nowadays, I'm on the lookout for situations where I can actively collect, or extract more data than I could ages ago – because the code to create new data points out of previously "too expensive to bother" unstructured data has become so much cheaper.
The key to all this comes back to the domain knowledge we have in our individual industries. Only you are able to think of what new data could be useful in your given context. If you can imagine a software-based path to getting what you want, modern LLM throwaway code is likely to get you there.
But evaluation is the problem of our times
The downside to LLMs' prolific code generation is that they can generate nonsense. Software engineers guard against a bunch of that nonsense by having very expansive test suites that make sure whatever crazy black box an LLM spits out will still conform to their specs where it counts. Data extraction code can't be boxed in as cleanly.
Yes, a lot of our analysis and data extraction code can have some unit tests built around it – for example, making sure that certain derived fields are computed correctly, and that certain example scenarios come out right on specified test data. But our lives are practically defined by "new and exciting edge cases" – precisely the things you can't really write a test for.
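For illustration, here's a tiny sketch of the kind of test we can write – pinning down a made-up derived field (compute_tenure_days is hypothetical, not from any real codebase) on hand-built inputs. The edge case nobody anticipated is, by definition, not in this file.

```python
# Sketch: unit tests around a hypothetical derived field, runnable with pytest.
from datetime import date


def compute_tenure_days(signup: date, churn: date | None, today: date) -> int:
    """Days from signup to churn, or to today for still-active users."""
    end = churn if churn is not None else today
    return (end - signup).days


def test_tenure_for_churned_user():
    assert compute_tenure_days(date(2024, 1, 1), date(2024, 1, 31), date(2024, 6, 1)) == 30


def test_tenure_for_active_user():
    assert compute_tenure_days(date(2024, 1, 1), None, date(2024, 1, 11)) == 10
```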
Moreover, much of the "turn unstructured stuff into data" tooling is inherently probabilistic. Machine vision and video analysis tools all work in probabilities. Text classifiers have always been varying degrees of "hit or miss," whether they're built on a classical architecture or on LLMs. Just because we magically summoned the Python code to point such tools at our problems quickly doesn't change that fact, and we need to incorporate those probabilities into our evaluations of such systems and of the data being extracted.
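One way to put a number on that trust, sketched below with assumed names: draw a reproducible random sample of the extracted records, hand-label it, and measure how often the extractor agrees with a human. The sample size and helper functions here are hypothetical; the right evaluation design depends entirely on your domain.

```python
# Sketch: spot-check extracted data against human labels to estimate trustworthiness.
import random


def sample_for_review(records: list[dict], k: int = 200, seed: int = 42) -> list[dict]:
    """Draw a reproducible random sample of extracted records for manual labeling."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def agreement_rate(reviewed: list[tuple[str, str]]) -> float:
    """Fraction of (extracted_value, human_label) pairs that match."""
    matches = sum(1 for extracted, truth in reviewed if extracted == truth)
    return matches / len(reviewed)
```

Even a crude agreement rate over a couple hundred hand-checked rows tells you far more about the new data than the extraction code ever will.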
I feel like this evaluation of "to what extent we can/can't trust this new data we're generating" is going to be the defining challenge for us for the next couple of years.
We also need less SWE help now
One thing I've always seen across many companies is how data work is at least somewhat limited by the lack of engineering resources. Sometimes it takes the form of needing Eng help to spin up an internal server for something because the internal IT infrastructure is really hard to navigate. Sometimes it's partnering with SWEs who can decipher the arcane codebase to figure out how a piece of telemetry was actually logged. Other times it's getting their buy-in to implement some tracking. A lot of that isn't extremely difficult work, but it requires a lot of domain knowledge we don't have time to acquire.
But now, those barriers can be much lower because the tools for understanding and generating code have gotten so much better. We often have the basic skills to do this low-risk work, but rarely the time to get up to speed. It's another place to consider unblocking yourself... assuming you want to take on the maintenance of whatever code you ship. Which, not many of us want to do.
But it's a thought.
Data stuff to share
This link to "All the views" was bouncing around our Discord today, where people located the longest lines of sight on the planet, including a 530 km sight line stretching from western China almost into Kyrgyzstan. There's also a map of a bunch of other calculated long sightlines for places all over the globe. And before you ask, yes, it does factor in terrain heights, the curvature of the Earth as an oblate spheroid, and atmospheric refraction.
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing something, a data-related post to either show off work, share an experience, or want help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
- randyau.com — homepage, contact info, etc.
Supporting the newsletter
All Tuesday posts to Counting Stuff are always free. The newsletter is self hosted. Support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:
- Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions, and you get access to the subscribers' area in the top nav of the site too
- Send a one time tip (feel free to change the amount)
- Share posts you like with other people!
- Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
- Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!