Vibe coding is delayed pain
The past week, I had the chance to do something data-related outside of work! I wanted to go through my town's property tax records to get a sense of how they dealt with things over time. The problem was that all the records were published in annual PDF files in a gnarly table format. Since I've never actually had to extract data out of PDF files before, it seemed like leaning on an LLM to vibe code an extractor could get me close enough to what I needed. Between learning how Python's PDF handling libraries work from scratch, or letting an LLM put together something that would at least run imperfectly, it seemed easier to go the vibe route.
My general philosophy these days for applying modern generative AI tools to life is that I will only use them in situations where I can relatively quickly and unambiguously verify that the output is correct. Code has the nice property of having external behavior that can be verified against tests and spot checks, which makes evaluation relatively easy. I also don't have to worry about maintaining this code or integrating it into a larger system, because this is a small one-off hobby project. Since the risks are low, let's go!
5 Minutes to "wow"
So in product development, one idea that comes in and out of popularity is the concept of "X minutes to wow": how many minutes does it take before a user gets a good sense, in an impressive and positive way, of the value the product brings to their life? A lot of products will try to optimize their onboarding processes to make sure that the user can "get to the good stuff" as quickly as possible. The thing they don't want is a lot of tedious steps that let users get distracted, because very likely those users will never come back.
Admittedly, a lot of AI experiences are very good at getting users to this point because their innate functionality can seem pretty magical. Even if you're aware of the underlying technology and statistics involved, it's still quite uncanny that a couple of billion parameters can somehow generate compelling things.
For my toy project, I spent a couple of minutes uploading example PDFs and describing the sort of Python code I wanted it to generate, and within a handful of minutes I had something that appeared to do what I wanted. Moreover, since I was asking for code to do the text extraction rather than for the extracted text itself, I could verify that the thing was working as I expected and not just hallucinating garbage at me.
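For the curious, the kind of thing I was asking for looked roughly like the sketch below. This is a hedged reconstruction, not the actual generated code: it assumes the pdfplumber library, and the file name, field names, and regex are placeholders for whatever the real records look like.

```python
import re

import pandas as pd
import pdfplumber

# Hypothetical pattern: lot number, address, assessed value on one line.
RECORD_RE = re.compile(r"^(?P<lot>\d+-\d+)\s+(?P<address>.+?)\s+(?P<assessment>[\d,]+)$")

def extract_records(pdf_path):
    """Pull matching lines out of every page of a simple, single-column PDF."""
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            for line in text.splitlines():
                match = RECORD_RE.match(line.strip())
                if match:
                    rows.append(match.groupdict())
    return pd.DataFrame(rows)

# e.g. df = extract_records("tax_records_2018.pdf")
```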
On the whole, choosing to have it write code seemed like the correct choice in the long run. I'm pretty good at describing what I need and how I think the text extraction should go, so the code it was generating largely fit my expectations. It was also very nice to not have to worry about syntax and implementation details and to concentrate on the high-level logic. So overall, it had been a positive experience.
But then, I started looking at the results a bit closer...
And 15 minutes to "omfg what BS"
Not every aspect of a product can be easy and magical. Sometimes the software tool just turns something extremely difficult into something that's merely possible to do. And so, as users get past the wow point and start using the product more, they're bound to hit rough spots.
The rough spot I hit in the project was when I noticed that the data was perfectly extracted for 2018 through 2023, but 2024 and 2025 were missing. Further digging showed that it wasn't really the LLM's fault – somehow the town changed how they were publishing the PDF files, and while the pages appeared identical visually, the text was formatted wildly differently, so the assumptions the code made were invalid. Typical "parsing data from a government source" BS.
The new data format is a nightmare for computer processing. While the 2018-2023 data was nice and simple text on a page that you could highlight like a normal text document out of notepad.exe, 2024 effectively had two "columns" cutting the page vertically. Given that the data itself ran in rough rows across the page, these columns effectively chop the data into chunks that are a huge pain to correlate. 2025 made things even worse by splitting everything into four vertical columns that you read in a ridiculous snaking pattern, top to bottom, left to right. Also, just for funsies, the 2024 and 2025 PDFs were somehow 10x the size of the old format, so testing a full run takes over twenty minutes. Whatever generator the town used to make these exports has utterly broken PDF functionality.
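To make the layout problem concrete, here's a rough sketch of the kind of column-splitting the newer files force on you, using word coordinates from pdfplumber. The column count, the pixel tolerance, and the snaking read order are assumptions about my town's particular export, not a general-purpose fix.

```python
import pdfplumber

def read_snaking_columns(page, n_columns=4):
    """Read a page as vertical columns, top to bottom, then left to right."""
    words = page.extract_words()          # each word carries x0/x1/top coords
    col_width = page.width / n_columns
    columns = [[] for _ in range(n_columns)]
    for w in words:
        col = min(int(w["x0"] // col_width), n_columns - 1)
        columns[col].append(w)

    lines = []
    for col in columns:
        col.sort(key=lambda w: (round(w["top"]), w["x0"]))
        current_top, current_line = None, []
        for w in col:
            # start a new line when the word's vertical position jumps
            if current_top is None or abs(w["top"] - current_top) > 3:
                if current_line:
                    lines.append(" ".join(current_line))
                current_line, current_top = [], w["top"]
            current_line.append(w["text"])
        if current_line:
            lines.append(" ".join(current_line))
    return lines

# e.g. with pdfplumber.open("tax_records_2025.pdf") as pdf:
#          lines = read_snaking_columns(pdf.pages[0], n_columns=4)
```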

Obviously, this isn't the fault of the LLM, but the government doing government things. My challenge now was to get the system to generate code that ideally processes all this data simultaneously, extracts the fields I want, then spits them out into a longitudinal dataset across all the years.
The problem, of course, was that with the data files cut up and shuffled like this, it is a challenge, even for a human, to stitch the data back together correctly. Some columns of data have identifiers that can be used as keys (addresses and lot numbers), but other columns don't have any of that identifying information. While a human would use heuristics to figure these things out, it's quite difficult to explain the heuristics clearly, and they very likely don't apply universally. Honestly, I'm not 100% convinced the problem is even cleanly solvable as is, because I've seen some weird examples where the order of some lines seemed very slightly shuffled. It's a raging mess.
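A tiny, made-up example of why this stitching is so fragile: when a chunk carries a lot number, you can join on it; when it doesn't, the only fallback is row order, which is exactly what the shuffling breaks. The column names here are hypothetical.

```python
import pandas as pd

# One chunk has usable keys...
keyed = pd.DataFrame({
    "lot": ["12-001", "12-002", "12-003"],
    "address": ["1 Main St", "3 Main St", "5 Main St"],
})
# ...the other chunk has values but no identifiers at all.
unkeyed = pd.DataFrame({"assessment": [412000, 389500, 501200]})

# Best case: merge on a real key (not possible here, there isn't one).
# Fallback: assume both chunks are in the same row order and join positionally.
stitched = keyed.join(unkeyed)   # silently wrong if either chunk is shuffled
print(stitched)
```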
And so, proceeding onward requires going into a loop: generate code, test code, view outputs and debug logging, try to articulate what could be improved, then repeat. This was when I was formally introduced to the... nightmarish... experience that is 'modifying existing code with a prompt while hoping other stuff doesn't get changed'. Since the code has a decent amount of interdependency, it's more convenient to have the system regenerate the whole file each time (it seems to largely try to keep the code stable across passes anyway). But sometimes, making tweaks to the parsing algorithm for 2025 would set off a refactor that bled into other parsing strategies and would have to be reverted.
To make things worse, the messed-up file structure meant that even I had a lot of trouble thinking of potential strategies to piece the correct data together. In fact, as of this writing, I still haven't figured out a reliable process for matching the data correctly. So far, the LLMs have managed to get somewhat close for the 2024 data, and are utterly hopeless for 2025. At this point, if I choose to go on, I'd have to figure out whether I want to literally split the handling of each file format out into separate methods to further prevent the code from modifying things I want to keep stable.
Overall, this experience is significantly turning me off the idea of blanket code-generation help. I can see where it's very useful in generating new code, especially tests and other housekeeping bits. But it's so, so, so painful to use it to solve a tricky, nuanced, hard-to-articulate problem. My feeling is that it's like I was hill climbing an optimization problem: the first few moves were huge improvements in output quality. Now we're in the nasty situation where I request a new direction/algorithm/pattern, the LLM generates code, and it's even odds whether the output will be better or worse than the previous version for the new formats. Essentially, evaluation has become very difficult. I can't tell if we're moving closer to the target or not.
To make matters worse, there's also a roll-of-the-dice (maybe 1-in-6? 1-in-10?) chance that the LLM generates a regression in code I don't want changed. Sometimes it's because a new strategy requires a bit of refactoring, but other times it's not clear why things change. Essentially, if I want to continue in a steady way, I might have to split the code into separate files so that I can keep certain bits out of the blast range. The problem seems to get increasingly worse the longer the code file gets, so slicing things up might also help keep things under control there too... maybe.
As of right now, I've been busy and haven't gone back to trying to make the code work. The technique continues to walk a knife's edge of being sometimes useful, but always threatening to derail itself unexpectedly. It's like sitting in some kind of "autonomous" car and being forced to constantly pay attention so you can catch it attempting to jump a cliff or something. Not the greatest feeling, but the utility for when it actually does work is pretty nice, so it's hard to throw it all out.
Either way, as I worked, I did pick up a few quirky habits that seemed to make the process better. I'd love to hear what other tricks people have developed while using such systems.
- Telling the system to save out a verbose debug file explaining how it evaluated and chose the numbers that it did helped a ton, especially when I fed the log file right back at the system. (A minimal sketch of what I mean follows this list.)
- It seems coding-aid LLMs have been trained to stick with a proposed code solution so as not to constantly churn out new attempts that might break pre-existing code. That said, it can still happen.
- On the flip side, if I thought the proposed strategy didn't show promise, I needed to explicitly tell the machine to change things.
- When things got really long and I was concerned I'd run out of context window, given the number of files and logs I was uploading, I had decent luck requesting a summary of the context "for the next session" along with code comments laying out the logic and the specific edge cases being handled. Pasting that context into a new LLM session usually worked quite well.
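As promised above, here's a minimal sketch of the debug-file habit, using Python's standard logging module. The field names and the tie-breaking rule are placeholders; the point is that every chosen value gets a line explaining where it came from, and the resulting file can be pasted straight back into the chat.

```python
import logging

logging.basicConfig(
    filename="extraction_debug.log",
    filemode="w",
    level=logging.DEBUG,
    format="%(levelname)s %(message)s",
)

def pick_assessment(candidates, lot):
    """Choose one value and record exactly why, so the log explains itself."""
    logging.debug("lot=%s candidates=%s", lot, candidates)
    chosen = max(candidates)  # placeholder heuristic, not the real rule
    logging.debug("lot=%s chose=%s (rule: take the largest candidate)", lot, chosen)
    return chosen

# e.g. pick_assessment([412000, 41200], lot="12-001")
```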
I honestly have no idea how I'd bring generative AI methods into my work... this little situation highlighted just how awkward things can get when starting to figure out how to put the tool to effective use. I imagine that with time and practice, I'd stumble upon more tricks that help things go smoother when I do have to write code.
But these things still write fairly garbage SQL. =P
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing something, a data-related post to either show off work, share an experience, or want help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
- randyau.com — homepage, contact info, etc.
Supporting the newsletter
All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted. Support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:
- Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions, get access to the subscriber's area in the top nav of the site too
- Send a one time tip (feel free to change the amount)
- Share posts you like with other people!
- Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
- Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!