While house hunting a while back, I came across this boiler room and I have to admit I was pretty jealous. My actual boiler room looks absolutely nothing like this.

So you've been asked to "take over" some old data pipeline...

Apr 21, 2026

Work anywhere long enough and you're going to receive "gifts" from various people at work in the form of "here's a data processing thing, it is yours now". The gift giver is gone for whatever reason. Maybe there is documentation, but at best it's incomplete or outdated, if it ever existed to begin with. The thing has very likely already broken, because no one but the owner would have remembered it existed, let alone noticed when it stopped working.

Whatever the reasons or situation for being handed this data pipeline-like thing, you will have the same set of questions as everyone else. What is this thing? What is it supposed to do? How does it work? How do people use it?

These situations come up all the time. I'm sure everyone reading this can probably think of some situation, maybe big, maybe very tiny, where they were just handed a data process and were told to make sense of it. But what I find very fascinating about this situation is that it is wide open to interpretation. Depending on the exact specifics of the situation, you have the freedom to do some really interesting work based on what lenses you choose to apply to the problem.

Today's post is a bit about the various techniques that are commonly brought to bear on the problem. It is also a bit of an attempt to remind people who've done this repeatedly before that the paths we actually take in the end are probably not the most obvious ones.

Assessing the situation

No matter what happens, the first order of business is always going to be gathering basic context about what you just got. The vast majority of this information is not technical. While you might be holding a blob of code in your hands, very little of the critical context has to do with the code. You've got to first figure out who wanted to use this data, and for what purpose. Unless you know this thing was supposed to count widgets in a certain way for a very specific report used by this one specific person, it'll make no sense when you delve into the code.

While this is the very first step, the amount of time people spend on it tends to be indicative of how much experience they have working with data. People who are fairly new tend to gloss over this step because it takes time. You have to track down people and ask questions that may not have an answer. Meanwhile, the code is right there for inspection, so why not just dive in first to help formulate what questions we want to ask?

Meanwhile, experienced folk spend a LOT of time in this step. They'll take a cursory scan of the code to see if there's any documentation written down, but even if it exists they won't take it 100% at face value. Instead they want to track down the backstory of the code – who, exactly, was using this output? What specifically was it used for? More importantly, why do we even care now?

The recognition here is that the code blob is a relic of the past. It existed in a previous context, and given the ever-changing nature of the world, it is probably ill suited to the current context. The reason why we care any bit about the history behind a blob of arbitrary data processing code is because it helps us set the stage for what this code should be in the future – and that includes whether the future should be the trash bin.

If you ask around and find out the pipeline was created to measure the wrong thing – trash bin. Find out the people who requested it had no clue what they were doing and were using it to make outlandish claims – trash bin. Find out that no one even uses it anymore but someone thought they did? Yup. Data pipelines aren't quite as disposable as dashboards, but they are just one step removed from being ignored like any other dashboard due to lack of organizational attention.

Even if you find out that the data pipeline is actually being used downstream for legitimate purposes (gasp!), there are always opportunities to take a critical look at the environment the code sits in. Should someone else be owning and maintaining the code? Is it built in a hacky way that predates existing infrastructure and should be modernized? Has the world changed enough that maybe we should have a big discussion over whether we need to change how we measure things? These are big, meaty questions that might not specifically apply to your particular problem, but they are always worth considering while you're preparing to do the work. Knowing the answers ahead of time lets you look at the code with a different set of eyes than purely from a "review this code" perspective.

And just look at all the things you can question at this phase. Everything at the high level from the original premise, the organizational needs, the human processes that likely surround the pipeline, and more! You have to be very thoughtful about which avenue you want to explore, of course, but within the parameters of your unique situation there's a lot to be learned.

Then, after all that initial context setting, once we're satisfied that we're going to be doing work that will actually be useful in the current context we can dive into the nitty gritty of the actual code.

Finally in the code

As far as I know, there are two ways you can attempt to figure out how someone else's analysis code works – you either start from the beginning and trace the logic forwards, or you start from the end and work backwards. I know it sounds stupid to have to write that obvious fact out, but in my experience, at most one, and sometimes neither, of the starting points is immediately obvious. I don't know why that's almost always the case, but I don't think I've ever encountered a data pipeline of any consequence where I could clearly understand how things start and how things end. Maybe I'm just cursed?

As I mentioned in the code review post last week, LLMs have made this significantly easier by being pretty darn good at converting walls of code into human-sensible language that you can query against. But. Even. Then. Only one end of a pipeline is fully visible. The other end is obscured by some external tool interface you can't see that kicks off the pipeline with magical environment variable settings, or the output can't be seen until you execute the pipeline, but no one knows how to run it.

But regardless of your preferred way in, the goal is the same: come to a working understanding of how it all works. What design choices were made along the way in terms of data handling, data cleaning, and analysis? Were those choices justified or arbitrary? Correct or confused? Does it even "do what it says it does"? There's actually no guarantee. A human or an LLM could have written all sorts of broken nonsense into an analysis process, and all of it needs pretty careful verification if you plan on signing your own name to the code going forward.

And "code adoption" is the name of the game once you get as far as deeply inspecting an inherited data pipeline's code for resurrection purposes. While you're examining it, it's someone else's code. The mistakes, bugs, and gross inaccuracies are always the fault of someone else. That whole situation changes once you go through it and deem things "good enough to run". From that point on, it's your code, because you're the last poor sap to have touched it. You can try to git blame your way out of anything that happens, but the target of the blame is long gone. You're likely going to have to maintain it for an indefinite amount of time until you can find someone else to own it. Maybe you can convince an existing data team, or the team that actually runs the thing being measured – so they can own the metric that's used to measure their work instead of some random data person on the side.
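For what it's worth, git itself is a decent archaeology tool for tracking down that backstory. A minimal sketch of the kind of digging involved is below; the scratch repo and file name here are made up stand-ins for your real repository, and the commands are just the usual `git blame` / `git log` incantations, nothing exotic:

```shell
# Hypothetical scratch repo standing in for the real one you inherited.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Original Owner"
git config user.email "owner@example.com"
echo "count = widgets * 2" > pipeline.py
git add pipeline.py
git commit -qm "initial widget counting logic"

# Who wrote each line, and when?
git blame --date=short pipeline.py

# Full history of the file, with commit messages, following renames.
git log --follow --oneline -- pipeline.py

# Everyone who ever touched it (often reveals the long-gone owner).
git log --follow --format='%an' -- pipeline.py | sort -u
```

The commit messages are frequently more informative than any documentation, since they were written while the context was still fresh.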

All those repercussions are why the cranky old folks ask lots of questions up front. They're trying to gain clarity on "what will happen if I agree to own this code?". They're trying pretty hard to avoid putting energy into a single person's pet project that will eat their soul while simultaneously not having any useful impact.

And so, you trace through the code bit by bit, bringing how it works and fits together into your head. Even if you don't review every single line of code for correctness by hand (who's got that time?), you can usually get the gist of what's going on at the function level – this is filtering out [this sort of data], that's normalizing the dates, this is pulling data from a database, that's joining rows together. When you know what the code will be tasked with doing in the immediate future, you can have opinions about whether the code is doing the correct thing.
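That function-level gist often ends up looking like a mental sketch of the pipeline with one-line summaries attached. A purely illustrative reconstruction, with made-up function names, fields, and logic, might read something like this – the comments are the "traced behavior" notes you'd jot down as you go:

```python
from datetime import datetime

def filter_valid(rows):
    # Traced behavior: silently drops any row missing a widget count.
    return [r for r in rows if r.get("count") is not None]

def normalize_dates(rows):
    # Traced behavior: coerces two different date formats into ISO strings.
    out = []
    for r in rows:
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                r = {**r, "date": datetime.strptime(r["date"], fmt).date().isoformat()}
                break
            except ValueError:
                continue
        out.append(r)
    return out

def join_labels(rows, labels):
    # Traced behavior: left-joins a lookup table of region labels,
    # defaulting to "unknown" for unmatched ids.
    return [{**r, "region": labels.get(r["region_id"], "unknown")} for r in rows]

rows = [
    {"date": "2026-04-21", "count": 3, "region_id": 1},
    {"date": "04/20/2026", "count": None, "region_id": 2},  # dropped: no count
    {"date": "04/19/2026", "count": 5, "region_id": 9},     # unmatched region
]
labels = {1: "east", 2: "west"}

result = join_labels(normalize_dates(filter_valid(rows)), labels)
```

Even a rough sketch like this surfaces the judgment calls you'll want opinions on: is silently dropping null counts correct, and should an unmatched region really become "unknown" instead of an error?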

And as you slog through, you'll likely be taking notes, and those can potentially turn into the documentation that you wish you had but no one ever bothered to write up.

And don't rush, too much

Usually these projects exist because something has gone wrong somewhere badly enough that it caught the attention of someone important. That usually translates to "hi, we need this yesterday because $EXECUTIVE asked questions about it". The instinct is to just dive right in because it's a fire drill. But be aware that there's no need to do all the above questioning and thinking under intense time pressure. It's perfectly acceptable to temporarily debug the code just enough to make things run – the equivalent of a broken-down car rolling into your shop in an emergency, where you do the bare minimum to patch the flat tires before sending it out the door with a half-dead engine and parts falling off. Then, with the time you buy by getting "the existing code" to run, you can figure out everything else.

In a perfect world, even with just patching up the code temporarily, you'd want to start the process of pushing back on some of the deeper issues early while you have the attention of $EXECUTIVE, but do so only to the extent you're able to get answers to the questions you have. Rushing to a conclusion here tends to be more painful than doing it thoughtfully. That said, if you patch the code up and find a significant issue that doesn't seem intentional, it's a way to start a whole conversation around "what are we doing, and should we be doing it?".


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing something, a data-related post to either show off work, share an experience, or want help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.

"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self hosted. Support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions, get access to the subscriber's area in the top nav of the site too
  • Send a one time tip (feel free to change the amount)
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!