Grand Central Terminal can look really, really, really empty when you just photograph a closed-off area

No one works with clean slates; we shouldn't write as if we do, either

May 6, 2025

Talk to any data person at a sufficiently large company (any multinational, or anything that's been around a few decades) and ask them "what database do you use?". The answer is most likely a variation of "yes". Ancient MySQL install? Maybe. PostgreSQL somewhere? Sure, why not. Hadoop, SQL Server, Redshift, BigQuery? Probably at least one, maybe multiple. Oracle? If they've got deep pockets. None of these technologies are particularly sexy or exciting to talk about. Some, like a random Access database, are flat-out against widely understood best practice.

All this legacy stuff piles up because companies merge and acquire new, incompatible tech stacks. They make technical decisions that don't pan out. They hire vendors to do one thing and later in-house the capability. They take years to do big migrations because they can't tolerate outages. Sometimes there's just no budget to spend on modernization, and if a critical system just works, why try to "fix" it?

The only people who can afford to use the latest and best tech all the time in production are small teams, new teams, students, hobbyists, and folks who are actively testing out new tech for potential adoption. While not a small population, this group is already hyper-catered to by the news- and launch-fueled tech and data blogging environment. We all know they're in a good spot.

What about all the folks who are stuck using legacy tech? I'm increasingly aware of a growing knowledge gap around older stuff. Everyone's first thought about older tech is that surely there would have been blog posts and tutorials published back when that tech was new. Why would there be a need to publish that stuff now? Surely the preexisting stuff would still be relevant and useful today. Right?

Out of curiosity, I did a search for "how to write a mapreduce job". While rare now, I'm sure someone out there is still writing MR. The first Google hit was the official Apache Hadoop tutorial for writing MR. It is written in Java, which is excusable since Hadoop itself is built on Java and it'd be preinstalled. But we all know doing data science-y, string-manipulation stuff in Java is... verbose. I know for a fact that you can run MR jobs using arbitrary languages, even shell scripts, via the streaming protocol, because I did it myself ages ago. The keyword you'd have to know is "streaming", which would bring you to the official streaming tutorial. But someone wanting to learn from scratch would first have to come up with the idea of using a non-Java language, then search for how to use their preferred language with Hadoop MapReduce. The core documentation heads that section "Hadoop Streaming", listed second under "MapReduce Tutorial". But if I were a clueless newcomer, I'd probably be intimidated long before getting to that point and completely skip what I needed.
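The whole streaming contract, by the way, is just lines of text on stdin and stdout, with tab-separated key/value pairs in between a sort step. Here's a minimal word-count sketch in Python to show the shape of it – the file name and the hadoop invocation in the comments are assumptions (the streaming jar's exact path varies by Hadoop version and distribution), so treat it as an illustration, not gospel:

```python
#!/usr/bin/env python3
# wordcount.py -- sketch of the Hadoop Streaming protocol (hypothetical file name).
# The mapper reads raw lines from stdin and emits tab-separated "key\tvalue"
# lines; Hadoop sorts those by key, and the reducer reads the sorted lines
# and aggregates per key. No Java required.
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'word\t1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts for each word; input must already be sorted by key."""
    parsed = (line.split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hadoop invokes the map and reduce stages as separate commands,
    # roughly like (jar path is an assumption):
    #   hadoop jar hadoop-streaming.jar \
    #     -input /data/in -output /data/out \
    #     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
    #     -file wordcount.py
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

Because it's all plain text streams, you can dry-run the whole thing locally with ordinary shell plumbing – `cat input.txt | ./wordcount.py map | sort | ./wordcount.py reduce` – before going anywhere near a cluster.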

If you look at the legacy posts about the topic, many have the dated feel of what blogs looked like in the 2000s, which is appropriate given the era they come from. The information contained therein, which mostly consists of the magic incantation needed to invoke the Hadoop streaming job with the appropriate flags, seems correct? Probably? The question is whether it remains correct today, a decade later. Did the Hadoop API change? Is there a new, better way of doing things that didn't exist in 2012? For example, Python 3 wasn't the standard back then, so does that change affect how to use it with Hadoop? Probably not, but there might be issues.

There are also important questions of how things can be done without a clean slate. A lot of tech writing assumes a clean slate to keep things simple for the audience – anyone who has to figure out how to make a weird "heavily modified open source" legacy FTP-based data ingestion pipeline work with the latest DAG orchestration tool is surely paid enough to devote the time to figure things out. Anyone who has been given admin rights to a small replicated database instance must surely be able to negotiate new software installation with their IT team. Right? Our lived experience screams that this isn't true, and we all know it. But because such problems are "too niche", few people feel incentivized to create this content.

It goes back to what I wrote a while ago about how we should be making "Intermediate+" content. That is, content where you start peeling away the white lies of implementation – like the lie that we can all start from clean slate installs – and get into some meaty details about making stuff work, bugs, hacks, and roadblocks included.

Making “intermediate+” content
It’s what the internet needs, and we’re the only people who can create it for ourselves and our peers.

At the time, I was writing that post primarily from the perspective of just defining what the heck "Intermediate+" means. My hope was that people would want to write more of that stuff because they'd want it for themselves. But now, I want to really emphasize that the internet as a whole needs us to write this stuff for ourselves. Because no one else is incentivized to do so.

  • Manage to do something clever to get two systems to finally work after a week of frustration? Write it down, for your future self to reference if for no one else.
  • Solve a stupid repeating problem at work using a piece of software that maybe has 100 users? Write it down!
  • Hit upon a really weird bug that you can't figure out and don't have a solution for? Write that down too!
  • Manage to teach an old tool new tricks? Or just reimplement something in a not-as-sexy language? Share it with folks!
  • Make your own version of the "I hit a problem, googled for a solution, found out I posted the solution years ago" story!

It's never been cheaper to store and publish static text on the internet. Web hosting plans in the early 2000s used to charge maybe $5-15 a month for a few megabytes of disk space and static hosting. Nowadays, you can do it for free, or for essentially pennies a year, on any number of large, reliable cloud vendors, on GitHub, or plenty of other places.

The reason I feel more strongly about this now is my logs class, which I had designed from the start to show that log analysis can be done with literally no new software installed on a stock MacBook, without an internet connection. Multiple students were thankful we took that thorny path of working from nothing. They weren't in a position to get new software installed at work. Some are at the beginning of the long organizational battle where they're trying to show the value of using quantitative methods but aren't in a position to access anything yet. Those people need very scrappy tricks to find the leverage they need to initiate change.

These folks and many others looking to learn something don't have the luxury of starting from scratch and using the most cutting edge methods and tooling. They've got to work with whatever dated infrastructure they're given. Even others who had more organizational leeway still had to deal with a bunch of legacy systems where solving those challenges with modern tooling might take weeks or months of work, when they need to show results quickly.

One further thing that I'd like to point out is that on the surface, it seems like these people need beginner level content. Getting a basic job to run or talking to a database from some code sounds like trivial intro level stuff. But I'd argue that they actually need intermediate+ level complexity stuff about those same "intro level" things. A beginner level tutorial would just show the simplest way possible to get new users started. An intermediate level tutorial would strip away the clean slate and start looking at the complexity involved in getting software to play well with other pre-existing software.

Going forward, I'm going to try to touch on this kind of topic more. But if you're reading this and have a story or tutorial that you'd like to share, reach out and become a guest writer here. I promise the process doesn't hurt and you get a free editor (me).


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing something, a data-related post to either show off work, share an experience, or want help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.

"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

  • randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted. Support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Send a one time tip (feel free to change the amount)
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!

Intermediate+ content is an infinitely wide space since there's so many possible combinations of topics, tools, techniques, and platforms.