Curly, twisty, branches

Things to consider when modifying your data systems

Jan 28, 2025

Last week, I had to do something that I am usually extremely loath to do – I had to use regular expressions in a SQL query to extract a piece of text data I needed. This is usually a sign that something has gone horribly wrong, because regexp processing is computationally expensive compared to most other operations you can run on a SQL database. There would have been a way to design the db schema such that the regexp could have been replaced with a simpler, cleaner solution.
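To make the tradeoff concrete, here's a minimal sketch using Python's sqlite3. The table, the `raw` string format, and the `REGEXP_EXTRACT` helper are all invented for illustration (SQLite has no regexp function built in, which is exactly why we have to register one). The point is that parsing once at write time turns a per-query regexp into a plain column read:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, raw TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "browser=firefox;os=linux"), (2, "browser=chrome;os=mac")])

# The expensive path: register a regexp extractor and run the pattern
# against every row, on every query that needs the value.
conn.create_function(
    "REGEXP_EXTRACT", 2,
    lambda pat, s: (m.group(1) if (m := re.search(pat, s)) else None))
slow = conn.execute(
    "SELECT REGEXP_EXTRACT('browser=([a-z]+)', raw) FROM events ORDER BY id"
).fetchall()

# The schema fix: parse once into its own column, then queries become
# a simple (and indexable) column read.
conn.execute("ALTER TABLE events ADD COLUMN browser TEXT")
conn.execute("UPDATE events SET browser = REGEXP_EXTRACT('browser=([a-z]+)', raw)")
fast = conn.execute("SELECT browser FROM events ORDER BY id").fetchall()

assert slow == fast  # same answer, but the parsing cost is now paid once
```

Same result either way; the difference is whether the parsing cost is paid once at write time or on every read.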

Now, while this particular situation came up because a bug had been introduced that destroyed pre-parsed data, the broader situation where we analysts have to force our data into contortions isn't new. There's always been a tension between databases optimized for transactional use and ones optimized for analytical use. That's why we have production business databases on one end and analytical data warehouses on the other. Even if you're using a data warehouse, as I was in this instance, there's always a question of whether it's time to move an expensive operation into a new pre-calculated field or table.

The question of what to do, of when to move to a data warehouse or when to update the schema, is a messy, complex one that can only be answered with "it depends". But since I hate such unsatisfying answers, here's an attempt at jotting down what one process for figuring things out looks like.

Moving between time and space

While it's likely not the root of all data warehousing decision points, it's convenient to frame the problem of whether to make changes in a data warehouse as the question "do we need to change our time-space tradeoffs?"

The gist of the tradeoff is that for a given computation, you can trade memory space for processing time and vice versa. A database index is a really cheap way to trade a relatively small amount of disk and memory space for much faster access to your rows. Even the most fundamental data warehouse does this by pre-aggregating transaction data into data cubes – you pay for all the extra space of storing these pre-computed results, but your queries are much faster since you don't have to read all the rows of raw transaction data.
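The index version of this tradeoff can be seen directly by asking SQLite how it plans to run a query before and after an index exists. This is a toy illustration, not a benchmark, and the `orders` table is made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, float(i)) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports how SQLite intends to execute the query;
    # the last column of the first row is the human-readable detail.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

q = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"
before = plan(q)   # full scan: reads every row

# Spend a little extra space on an index...
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(q)    # ...and the query jumps straight to the matching rows

print(before)  # e.g. "SCAN orders"
print(after)   # e.g. "SEARCH orders USING INDEX idx_orders_customer (customer_id=?)"
```

The exact plan wording varies between SQLite versions, but the shift from a scan to an index search is the space-for-time trade in miniature.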

In the beginning of any company, that initial transaction log data is small, so you can get away with running your analytics directly on the production database. Things eventually slow down, so people add indexes. Eventually you move to a replicated clone of the production data so that your analytical queries don't block production workloads with locks. Then your queries start taking too long because getting the answers you want involves complex joins and aggregations the database wasn't optimized to handle, and you have to figure out what to do next.

When faced with solving such issues, you're essentially being asked to make a design decision. Given your existing environment and present and future needs, design a solution that fits your requirements. Since only you know what your constraints are, good luck!

Things to think about when designing a solution

As with any project, we're trying to balance the amount of time we invest against how much value we get out of it. This is why it's sometimes worth spending a year building out a whole analytics platform complete with a data warehouse, and other times that's completely out of the question. So here's some of the stuff I juggle in trying to come to that decision.

Cost/return of pre-computation – As I mentioned, there are a lot of benefits to pre-computation. Adding an index or a smart data partitioning scheme can turn a query that never finishes into one that runs in seconds. Same goes for building a pre-calculated data cube in the data warehouse. The question is whether you're going to read that information enough to offset the cost it takes to maintain the pre-calculation. For often-used queries and tables, this is an easy question to answer. What most data analysts I've seen struggle with is when a request has been repeated for the second or third time and it's not clear whether it's a long-term repeating request or a temporary fad. For those situations, I look at how expensive it will be to build and maintain the pre-calculation as my tie-breaker.
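A stripped-down sketch of what "maintaining the pre-calculation" means in practice, again in sqlite3 with an invented `sales` schema: the summary table answers the repeated question cheaply, but it's one more thing that has to be refreshed whenever new rows land.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2025-01-01", "east", 10.0), ("2025-01-01", "west", 5.0),
    ("2025-01-02", "east", 7.0),  ("2025-01-02", "east", 3.0),
])

# The pre-computation: a tiny "cube" of daily totals per region.
# This is the storage we pay for up front.
conn.execute("""
    CREATE TABLE sales_daily AS
    SELECT day, region, SUM(amount) AS total
    FROM sales
    GROUP BY day, region
""")

# The often-repeated question now reads the small summary table instead
# of re-aggregating the raw transaction log on every request.
row = conn.execute(
    "SELECT total FROM sales_daily WHERE day = '2025-01-02' AND region = 'east'"
).fetchone()
print(row[0])  # 10.0
```

The maintenance cost is the part that's easy to underestimate: every new row in `sales` makes `sales_daily` stale until something rebuilds or incrementally updates it, which is exactly the ongoing expense I weigh as a tie-breaker.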

How "elegant" I want to be – We all love to use systems that have had lots of thought put into them. The databases with minimal waste, with clear relationships that are easy to query, with few weird gotcha flags. A system that's clearly designed to handle a specific set of use cases, and has opinions on how those use cases should be accomplished, is usually nicer to use than a sprawling data system that grew organically with little coordination. Given all the other constraints, and especially time constraints, I make a decision on how much energy I'm going to expend trying to do the data modeling well.

Future use cases – It's almost impossible to justify the work to build a data warehouse based on existing use cases alone; all the time it takes to design, build, and integrate things, even for a small scrappy system, makes it more trouble than it's worth. So the natural thing is to look forward to the stuff it can enable for stakeholders – more reports! Better dashboards! Answers to questions we couldn't answer before! Even setting aside the things we use to sell the idea to executives, seeing clear future uses makes it easier to justify spending time now.

Utility for others – Related to future uses, precomputing things sometimes enables people who previously couldn't make use of the data to do so. People sometimes frame this as "democratizing data analysis", but honestly, only very specific people and teams are usually motivated enough to make use of the infrastructure. Even so, enabling other people to do cool things can often lead to a larger overall impact than enabling what you alone can do.

Implementation difficulty – Easy stuff is better, but some stuff is necessarily hard. Whatever the case, this factors into my willingness to dive into a project.

Appetite from team and sponsors – you can make the best case for a system that you want, but if there's no interest from the people who make the big decisions and sign the purchase orders, you can't really get very far. Sometimes the best you can do is make incremental improvements that get you closer to your ultimate goal.

You smash all these considerations together in your head and then come to a design that you think works. On the bright side, your specific solution will probably be unique to your situation, so there's no right or wrong about it. All that matters is that things are functional in the end.

Like I wrote about before, design problems are tricky, so focus on solving the problems in front of you.


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing something, a data-related post to either show off work, share an experience, or want help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.

"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

  • randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted. Support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Send a one time tip (feel free to change the amount)
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!