What happens if you decide "oh, we don't need that chimney" and learn it's all facade

Analyzing logs takes less and less custom code nowadays

Apr 15, 2025

With only two weeks left to write up all the notes for the logs analysis class coming up, there's little room in my head for much side research this week (and probably next week).

This week, here are some thoughts about an interesting(?) hiccup I ran into while trying to write out part of the class – specifically, the "using a general purpose programming language to do logs analysis" part of it.

The irony is that... like with any other modern general purpose programming language, you can do anything you want. Your code is primarily constrained by your imagination and skill, not anything particular about your task. And obviously, analyzing text logs is much simpler than a lot of other tasks we can be asked to do with a programming language.

Have some really funky complex labeling task that needs to read the entire log file three times to fully trigger? Go right ahead. Want to do something extremely efficiently with a bespoke algorithm leveraging bespoke log behavior? Or want to just implement the slowest thing to ever make you think about the Halting Problem? You're free to do so. All of those and everything in between can be created within your analysis code – and that is such a fundamentally useless insight that I'd be rightly thrown out of a window were I to say it. It's neither interesting nor helpful. It certainly wouldn't justify spending energy to try to learn the obvious.

It's much the same problem as when I tried, unsuccessfully, to learn Linux in the early 2010s. Once I had the thing installed, I asked what I could do with it, and people's response was "anything you want!". But I didn't know what was even possible. I certainly didn't have any actual need or project to work on. So I got bored with staring at a command line and a quirky window manager, and went back to using Windows for its games and better hardware support. I certainly don't want my class to replicate such a dreadful experience.

And so, the better question is "what possibilities can I give examples of to illustrate why anyone, in the modern data age we live in now, would opt out of a SQL-based solution and write analysis code themselves".

Wait, why is SQL the assumed default?

Because, for the vast majority of use cases, the mostly structured format of log files is just so much easier to work with at scale using SQL. The data is also often stored within large distributed databases nowadays, so even if you wanted to run custom code on the data, you'd still need to first pull the dataset down using basic SQL. If that's not the case and you're working with some kind of obscure data format that doesn't lend itself to being placed into a structured table, then the whole project was already doomed to being a painful custom analysis job anyways.

When the primary use case is finding and counting rows of events that match a set of conditions, SQL is practically purpose-built for that kind of work. It's just so much faster to leverage SQL's built-in features from a development perspective (writing, debugging, and iterating on queries) than it is to spin up some code to do the same thing with a normal language. Plus, SQL's functionality has been constantly expanded by demand from analysts over the years, so "specialized" things like regular expressions, analytic/window functions, and JSON support have made their way into the language. There's increasingly less reason to have to drop out to a non-SQL method.
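To make the comparison concrete, here's a minimal sketch of that "find and count matching rows" pattern, using Python's standard-library sqlite3 module. The database file, the `events` table, and its columns are all assumptions made up for illustration, not any particular logging setup:

```python
import sqlite3

# Hypothetical setup: parsed log lines already loaded into an `events`
# table with (ts, path, status) columns. Table and column names are
# assumptions for illustration.
conn = sqlite3.connect("logs.db")

# The whole "find and count rows matching conditions" job is one query:
query = """
    SELECT path, COUNT(*) AS hits
    FROM events
    WHERE status >= 500        -- only server errors
    GROUP BY path
    ORDER BY hits DESC
    LIMIT 10
"""
for path, hits in conn.execute(query):
    print(f"{hits:>8}  {path}")

conn.close()
```

The hand-rolled equivalent isn't hard to write, but every filter, grouping, and sort order becomes code you have to debug and re-run yourself, while here each change is a one-line query edit.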

But even though SQL's awesome for the majority of use cases, no tool is perfect for such a generic task. That's why we occasionally find ourselves doing dubious things like writing custom UDFs and recursive CTEs to solve our problems. Some problems are just easier to reason about with a different computational paradigm than what relational algebra affords us. Some such problems can be solved using SQL with a bit of effort and cleverness, but others are impossible without extensions to the language.

For the class, I've decided to showcase one class of problems that's tricky to pull off in SQL – marking specific rows of data based on each row's relationship with prior rows of data. A classic example of one such problem is generating user sessions. There are pure-SQL solutions to the problem, though the solutions are not at all obvious and take some cleverness and careful thought to arrive at. It's also likely possible to do it with vendor-specific language extensions that do stuff like add looping and variables. None of those solutions are something you would get right immediately from scratch without some hints. Meanwhile, a code-based labeling pass over the same data, assuming it can read all the data in order, is extremely easy to write and reason about since you're mostly just keeping track of state. Language paradigms do still matter sometimes.
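As a minimal sketch of why the code version is easy to reason about: assuming events arrive as (user, timestamp) pairs already sorted by user and then time, and using the common (but arbitrary) 30-minute inactivity cutoff, session labeling is one loop with a couple of variables of carried state. All names and data below are made up:

```python
from datetime import datetime, timedelta

# Assumed inactivity cutoff for splitting sessions; 30 minutes is a
# common convention, not a standard.
SESSION_GAP = timedelta(minutes=30)

def label_sessions(events):
    """Yield (user, timestamp, session_id) triples, assuming `events`
    is an iterable of (user, timestamp) pairs sorted by user, then time."""
    session_id = 0
    prev_user, prev_ts = None, None
    for user, ts in events:
        # New session whenever the user changes or too much time passed.
        if user != prev_user or ts - prev_ts > SESSION_GAP:
            session_id += 1
        yield user, ts, session_id
        prev_user, prev_ts = user, ts

# Tiny worked example on made-up data:
events = [
    ("alice", datetime(2025, 4, 15, 9, 0)),
    ("alice", datetime(2025, 4, 15, 9, 10)),  # 10 min gap -> same session
    ("alice", datetime(2025, 4, 15, 11, 0)),  # long gap -> new session
    ("bob",   datetime(2025, 4, 15, 9, 5)),   # new user -> new session
]
for row in label_sessions(events):
    print(row)
```

The whole trick is that one `if` statement carrying state forward from the previous row; in pure SQL, the same idea typically takes a `LAG()` over timestamps, a new-session flag, and a running `SUM()` of that flag to express.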

It's my hope that by showing some memorable use cases, I'm not wasting people's time teaching a method they're unlikely to work with on a typical day. It'll be something for special occasions and unavoidable one-offs. But hey, I'm also teaching students how to do analysis straight off the Unix command line, so "future utility" isn't exactly a strong point of my course design.

Regardless, I really like framing things within historical context, and it's good to show people how we started in darker, more inconvenient times hacking models and boilerplate together in Python to do what is essentially a simple count(*) with some group bys. Maybe it builds some appreciation for what we have, however awkward it is.
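For a sense of what that boilerplate looks like, here's a rough sketch of a hand-rolled count-and-group-by in Python. The file name and field positions assume a hypothetical space-delimited log format, made up for illustration:

```python
from collections import Counter

# A hand-rolled GROUP BY: count requests per (status, path). In SQL
# this would just be:
#   SELECT status, path, COUNT(*) FROM logs GROUP BY status, path;
counts = Counter()
with open("access.log") as f:
    for line in f:
        fields = line.split()
        path, status = fields[5], fields[6]
        counts[(status, path)] += 1

for (status, path), n in counts.most_common(20):
    print(f"{n:>8}  {status}  {path}")
```

It works, but it's a dozen lines of parsing and bookkeeping for what a single declarative query expresses.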


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing a data-related post (to show off work or share an experience), or want help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.

"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

  • randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self hosted. Support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Send a one time tip (feel free to change the amount)
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!