Everyone is capable of doing logs analysis
I'm slowly adapting highlights from my logs analysis class into posts, this being the first of the series. Realistically only a fraction fits into my normal post format, so I've quickly hacked together a page for paid subscribers to get a link to the full class slides. The link is also in the blog site's top nav for easy access. The actual class had a lot of improvisation on my part and discussions amongst students, so I don't feel that sharing the slides takes much away from the value of actually attending the class.
In class, the first and most fundamental point I wanted to make about logs analysis is that most people have the ability to do it. Humans have been doing it for ages. We just happen to give the people who do it strange titles like... "historian" and "accountant", and sometimes "data scientist". The act of looking at a bunch of "records of things that happened" and pulling out meaning from those records, whether via quantitative or qualitative methods, is probably a skill as old as storytelling itself.
As an example, if you take a look at this simple log file below, you should be able to make some sense out of it with absolutely no prior training in any logs analysis technique.
7:00 - Bedroom
7:30 - Kitchen
8:00 - Bedroom
8:20 - School
8:45 - Train Station
9:50 - Grand Central Terminal
9:55 - Subway Station
10:25 - Subway Station
10:30 - Office Building
10:35 - Classroom
Most people would at least venture a guess as to what this log is talking about. They'd also probably notice what is recorded, and likely imagine some of the things that aren't.
The main reason you should be able to make at least some educated guesses about what this log file means is that you have domain knowledge about what humans do. If you have any familiarity with NYC, or with how adults behave in general, the log is surprisingly legible. I would even venture a guess that any random adult walking down the street in Manhattan could make a reasonable guess as to what kind of person this log is describing.
Meanwhile, if I showed random people the logs of an obscure bespoke computer program, or the daily financial transaction logs of a giant corporation, those same people (myself included) would likely have no idea how to even start making sense of the data, let alone draw any kind of meaning from it.
This highlights that logs analysis is primarily a problem of domain knowledge. The tech is a secondary problem that we adopt as we deal with the primary one. Nowadays the discussion around logs analysis centers on technology because so many data logs are generated by computer systems, and due to the volumes involved, we use computers to increase the speed and efficiency of the analysis. But before the modern usage of the term, people did, and still do, logs analysis with paper, pencil, and brainpower.
At the end of the day, computers and math can tabulate and calculate things for us, but we humans are the only ones with the domain knowledge to determine what statements make sense to say given the data available. Only humans get to decide if a number going up, down, or sideways is "a good thing", "a bad thing" or "doesn't even matter".
The realization that tech is secondary to the process came to me as I was drafting the narrative arc of the course, and it's honestly what made me decide to put so much emphasis on the tech-agnostic style I aimed for in the class. Everyone in class comes from a specific domain where they are more than qualified to judge what meaning a given number carries. They already know how to measure something in a way that is appropriate for the problem they have. I don't need to teach that impossible-to-teach skill. Instead, I can focus on teaching them how to get at the number they need with nothing more than a stock-install MacBook and some log files. The CLI, Python, and even SQL code given in the examples acted as points of discussion about the core logic and reasoning that could be applied much more broadly. The tools that everyone has available at their own workplace will vary wildly, and I wanted to make sure the tools wouldn't matter because they'd understand the underlying motivation and reasoning. They should be able to reproduce the methods as needed using whatever they have.
Ultimately, there's a handful of things you need to do in order to make sense of most data logs at scale. In my mind, they break down like this:
- Identify 'events/items of interest'; these may be single entries in the log or a sequence of entries. You may be generating the log entries yourself, or someone else may be.
- Have a way to reliably pick out your items of interest from all other entries, typically done with code in some fashion.
- Count how often your items of interest occur, using all sorts of different tools and methods to get that job done.
- Use your analysis skills to compare, slice, and dice the counts to tell your story.
Obviously there's plenty of depth and nuance that goes into each step. But it is my sincere belief that getting people to the point of "being able to clearly count the things they are interested in" is easily 60% of the journey to mastery. That opinion is probably no surprise to anyone, since I literally named this newsletter "Counting Stuff". Anyone who gets to that point can already answer a lot of very interesting and important questions using just the tools built into a standard Unix/Linux installation, like the sketch below.
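To make that concrete, here's a minimal sketch using only stock Unix tools. The file name (app.log) and the pattern (ERROR) are invented for illustration; substitute whatever your own logs and items of interest actually look like.

# Count every line containing our hypothetical item of interest
grep 'ERROR' app.log | wc -l

# Slice the same count by day, assuming each line starts with a YYYY-MM-DD date
grep 'ERROR' app.log | cut -d' ' -f1 | sort | uniq -c | sort -rn

The first pipeline is the entire "count the things you're interested in" step. The second shows how little extra it takes to start comparing counts across a dimension.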
From there, the class heads into deep nerd territory. We go through using standard Unix/Linux command line tools to count some pretty involved stuff. We then use programs to count more complex stuff that doesn't neatly fit into a command line. And we cap it all off by showing how SQL can do the majority of the previous technologies' work all at once, while opening up new possibilities like joining in metadata.
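As a taste of that last step, here's a hedged sketch of what the SQL version of the same count-and-slice might look like, run through the sqlite3 CLI that ships on a stock MacBook. The events.db database and its events and users tables are entirely hypothetical; the point is the shape of the query, not the specifics.

# Hypothetical: filter, count, and join in metadata in a single query
sqlite3 events.db "
  SELECT u.team, COUNT(*) AS error_count
  FROM events e
  JOIN users u ON u.user_id = e.user_id
  WHERE e.message LIKE '%ERROR%'
  GROUP BY u.team
  ORDER BY error_count DESC;
"

The filtering, counting, and metadata join that took several separate tools above all collapse into one statement.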
I don't have a fully formed plan for adapting those topics into posts yet. As usual I'm largely writing on vibes and what feels like a digestible chunk, and even explaining how something like cat, grep, and wc -l can be used to do basic counting needed a pretty significant buildup. I'm sure future-me will figure something out. The most likely direction is that I'll write about a specific situation and its solution as a way of illustrating a method.
But for now, my advice is this: if you've always wanted to do logs analysis but were too afraid to try, just look at some data manually, filtered down to your items of interest, and see what ideas jump out at you. You'd be surprised.
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing a data-related post, whether to show off work, share an experience, or get help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
- randyau.com — homepage, contact info, etc.
Supporting the newsletter
All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted, and support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:
- Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions, and you'll get access to the subscriber's area in the site's top nav too
- Send a one time tip (feel free to change the amount)
- Share posts you like with other people!
- Join the Approaching Significance Discord — where data folk hang out and talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
- Get merch! If shirts and stickers are more your style, there’s a survivorship bias shirt!