Storage is cheap, but not thinking about logging is expensive
We stand at the beginning of a dark period of US history. I wish everyone safety and peace in the coming days. Please take care of yourselves. Meanwhile, as I mentioned two months ago, we'll keep counting on here...
It's 2025 and, amusingly enough, the contours of this story remain eternal.
The average cost of storage on a per-gigabyte basis continues to trend downward, as it has for decades. Despite this trend, businesses and even individual consumers generate ever more data, which means they wind up spending significantly more on data storage in absolute terms as time goes by. I'd love to cite studies and statistics on this, but every source I can find comes from biased industry sources and it's hard to get an objective number.
So, even to this day, despite plenty of data folk warning about this over the years, I still see engineering teams decide that they will "log everything and figure things out later" when it comes to measuring telemetry from their software. After all, storage is cheap! What if we throw away data that we might need?
Later, a data analyst is asked to come in and make use of the data, and they discover a thousand issues that make it barely usable. The data isn't clean, the team hadn't actually logged "everything," and there are gaps that make counting things impossible. The schema changed constantly and the business logic wasn't clear. Worst of all, there's so much data that processing even a single day's worth takes a ton of effort, so while the storage part was 'cheap', every analysis query against the data racks up compute and (sometimes) network costs.
But, if that data person is very persistent, they can chip away at the problem until they finally get usable information out of their data – another successful "migration to a data-driven development process," right?
Well, not really.
This is because even if teams eventually learn the lesson that they need a plan for the questions they want answered before their data has any value, that lesson alone doesn't stop them from collecting way too much data.
The reason is that if you go up to a product lead and ask what they would like to know to help them decide what to build and optimize, you're going to get a huge unordered list of ideas. I was once asked to review a success metrics plan for a product and was greeted with a document that had over 100 proposed metrics. I wish I were exaggerating, because I had to read through all of them to make comments.
After a detailed review and meeting, we whittled the list down to about three "most important" metrics and a handful of "things we should keep an eye on." The rest were classified as either unnecessary (and thus unimplemented) or easy enough to implement that they were put in for future use, but without any reporting. The conversation made the team ask themselves what they would actually do with each metric if it came in high or low. It took them some time to figure out that most of their questions would just satisfy their curiosity about how the product worked while not really being a catalyst for changing their plans. Things got a lot clearer once they adopted a more decision-oriented stance towards their metrics plan.
But if not for that metrics review process, that team would have easily generated a hundred metrics, instrumented dozens of widgets logging endless data, and very few of those numbers would have been used for anything. It's ridiculously cheap for engineering to log more user events (it's probably two lines of extra code per event), and the storage costs are a rounding error in their operational expense tables. But even such teams realize the monumental amount of work it would take to develop a dashboard that showed even a fraction of those metrics. More importantly, even the least data-savvy engineers can usually recognize that a dashboard of 100 charts would go unused and ignored due to information overload. But those same teams won't think that far ahead about their metrics plans unless asked.
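To make the "two lines per event" point concrete, here's a minimal sketch of what adding one more user event tends to look like. The track_event() helper, the file-based sink, and the event names are hypothetical stand-ins rather than any particular analytics SDK:

```python
import json
import time


def track_event(name, properties=None):
    """Append a single telemetry event to a local log file, a stand-in for
    whatever event pipeline a real product would ship events to."""
    record = {
        "event": name,
        "ts": time.time(),
        "properties": properties or {},
    }
    with open("events.log", "a") as f:
        f.write(json.dumps(record) + "\n")


# Adding "just one more" event really is about two lines wherever it happens:
track_event("widget_expanded", {"widget_id": "sidebar_promo"})
track_event("widget_collapsed", {"widget_id": "sidebar_promo"})
# ...and nothing here forces anyone to say what decision these events inform.
```

That's the whole temptation in a nutshell: the cost of emitting the event is trivial, while the cost of figuring out whether anyone should care about it is deferred indefinitely.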
Cheap storage enables this data hoarding behavior. It's an open invitation for the creation of data-flavored tech debt. By letting you declare that everything is important with respect to data collection, you get to delay thinking about what is actually important. Ship first and think later! Surely we can data mine what the actually important metrics are once we have everything recorded. All the pain involved in using the resulting datasets is a testament to how ineffective that strategy is.
And while the cost of lazy logging is probably most evident in shoddy metrics and painful reporting, it isn't limited to that. I've seen tons of systems that eventually were scaled out to handle ridiculous volumes of data, utilizing thousands of processing and storage nodes, in order to answer relatively simple business questions. These systems started out because "logging is cheap" and any latency issues could be resolved by bigger hardware and horizontal scaling.
The cost and processing times are extremely high relative to the information you get. Teams get frustrated at the latency and start building preprocessing pipelines and DAGs to bring latency down – now you've got multiple processing layers and people inevitably start interlinking pipeline jobs. It becomes a giant brittle web of data pipelines that no one can untangle. The system becomes too big, too distributed to seriously refactor and simplify without a huge multi-year effort that is usually doomed to failure before the refactor even starts. You're paying for precious computing resources every single step of the way. Finally, because the architecture is set in stone this way, any new team that comes around gets to make the same mistakes again!
We need to get teams to stop speculating about what information would be useful
Cheap hardware, whether it's storage or compute, is a temptation to be lazy that won't be going away any time soon. Eyeing my giant pile of hobbies in the background, I'm not particularly great at, or knowledgeable about, resisting temptation. But it probably boils down to having the discipline and processes in place that force people to be clear about what they're going to do with their data once they have it. Data isn't a collector's item; it's a cost, a burden, a potential legal risk, a target for hackers, and a general nuisance.
So here's the relatively handwavy process I use to engage with teams and have them figure out what's important.
- For every metric, ask them what they're going to do if the value comes in really high, or really low – if their behavior won't change, then the metric isn't useful for decision making, so implementing the underlying tracking is low priority
- If a metric going off the rails will cause them to get really concerned and do a really deep investigation to find out what's wrong (like if revenue dropped 50%), you probably have a high-level metric that's important
- If the metric is used to make a one-off decision, like the launch go/no-go, then it's a temporary metric, but the underlying instrumentation is often useful to have over the long term
- All the metrics that are just "good to know" are low priority. Only implement them if there's an actual hypothesis that the metric is important and relevant to something going on
- Freely implement new metrics and their instrumentation as needs arise: new decisions to make, new hypotheses to test. There's never a problem adding things you need to answer actual questions, and this process is usually slow enough that it doesn't grow things too much
- Finally, it's really hard to DELETE logging instrumentation because you never know what relies on it downstream. So don't worry too much about it
You can see my rules of thumb don't really put a stop to collecting plenty of data. They merely place barriers to limit the unthinking growth of tracking. It's entirely possible (and maybe desirable) to design things to be much more strict. But it's a start.
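If it helps to see the idea written down, here's a rough sketch of what that triage could look like as code. The metric names, the categories, and the little triage() helper are illustrative assumptions on my part, not a tool or any real team's plan:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposedMetric:
    name: str
    decision_it_informs: Optional[str]  # what would we do differently if it's very high or low?
    one_off: bool = False               # e.g. a launch go/no-go check


def triage(metric: ProposedMetric) -> str:
    """Apply the rules of thumb above: no decision attached means low priority."""
    if metric.decision_it_informs is None:
        return "good to know: skip until there's an actual hypothesis"
    if metric.one_off:
        return "temporary metric: keep the instrumentation, drop the reporting"
    return "decision metric: instrument, report, and keep an eye on it"


proposals = [
    ProposedMetric("weekly_active_users", "a 50% drop triggers a deep investigation"),
    ProposedMetric("launch_day_error_rate", "launch go/no-go", one_off=True),
    ProposedMetric("avg_clicks_per_widget", None),  # nobody could name a decision it changes
]

for m in proposals:
    print(f"{m.name}: {triage(m)}")
```

The code itself isn't the point. The point is that every proposed metric has to carry the decision it informs, and anything that can't fill in that field defaults to low priority.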