I'll be teaching a logs analysis course in April
I'm somewhat thankful I have a dozen urgent things to do, like writing a class, taking up my attention, so I only want to scream curses into the void a couple of times a day when I see what's going on in the news.
Fresh off of running a gamedev conference, and still camping on an inflatable mattress almost half a year on, somehow I thought it was a good idea to agree to teach a class 🤣. Link to the class here, hosted by the Quant UX Association, right after Chris Chapman's Choice Modeling class. The class will be in person in Manhattan, April 28-30, in the afternoons. Yes, I'll be going outdoors for this! Shocking!
The topic will be logs analysis, something that was probably the foundation of my career for over 15 years. This means that I'm going to be spending the coming weeks taking a very hard look at the dense interconnected web of knowledge in my head and figuring out a way to turn it into a linear, teachable experience. This is the first post of what is bound to be an unordered set of posts as the topic eats up my attention. I promise not every post from now until April will be about logs, but there'll be one here or there.
What do I mean by "logs analysis"?
Within a tech and IT context, logs analysis means analyzing data that's recorded in various telemetry and system logs for some purpose. Very often these logs are long, machine-generated records of events in chronological order – web server access logs, file change logs, access audit logs, event logs, etc. Analyzing these logs tends to imply the use of technological tools because the sheer volume of data is beyond what any human can read and process unaided.
But to be very honest about it, the basic idea of logs analysis isn't unique to data or technology at all. Historians and other researchers have been looking at chronological records and using the information to draw inferences for centuries, and the basic ideas are very similar. What everyone does is look at the sequence of events in the logs, use extensive domain knowledge to fill in the gaps that aren't logged, then use the story that emerges to find trends and patterns.
The newer "Big Data" method of logs analysis just effectively scaled the process up. Since the data is in a semi-structured format, we can do things like count how often interesting sequences come up and put a number to observations that used to be difficult to quantify. We could also use various methods to surface the most common patterns (for some definition of 'common' and 'pattern') and discover novel behaviors.
Whatever the goal, the thing to note is that domain knowledge is the most important skill for this process. It is also the skill I cannot teach. There's a host of very simple counting methods that can be applied to a dataset to get some pretty interesting results, and those methods (like a simple cat datafile | grep kittens | wc -l) don't take a lot of time to learn. But it'd take years of experience and study to know what things are interesting and worth counting.
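To make that concrete, here's a minimal sketch of the kind of counting I mean. The file name, the search term, and the assumption that the first field of each line is a date are all hypothetical stand-ins; the point is how far a handful of standard tools will take you:

```bash
# Count lines mentioning 'kittens' (grep -c is shorthand for grep | wc -l)
grep -c 'kittens' datafile

# Hypothetical: if the first whitespace-delimited field is a date,
# this counts matching events per day, busiest days first
grep 'kittens' datafile | awk '{print $1}' | sort | uniq -c | sort -rn | head
```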
So that's why I'm always very excited to teach people how to work with logs from scratch: a lot of really interesting questions can be answered with simple technological tools paired with sharp domain knowledge. It's really rewarding for the students because, with a bit of practice, they can see a path to answering the questions they want to ask by simply counting the right things. We'll definitely be going into different methods and ideas for counting things.
Also, what's with the "no data infrastructure" part of the description?
So the course description mentions learning how to do logs analysis with no data infrastructure. I'm a bit of a fan of the old-school pedagogical method of having people learn how to do long division at least once before learning how to use a calculator, so the point is to start out doing logs analysis with nothing but a basic command line, whether on macOS, a cheap Linux virtual machine, or even Windows WSL2. There's no SQL server, no massive data pipeline framework, no Tableau license, no data warehouse. It's truly starting in the dark ages, like when I started out and all I had was a CSV file and Excel.
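As a taste of what that looks like, here's a minimal sketch against a hypothetical web server access log in the common log format (the file name and field positions are assumptions about your particular logs, not gospel):

```bash
# Top 20 requested paths: in the common log format, the request path
# happens to be the 7th whitespace-delimited field
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -20

# Count distinct client IPs (the 1st field)
awk '{print $1}' access.log | sort -u | wc -l
```

That's the whole infrastructure: a shell and a file.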
Aside from a fossilized teaching style, the other reason for taking this route is that I'm sure some of the people learning this process are going to be the first person in their organization to ever consider doing it. They won't have the benefit of any infrastructure in place to help them. It's going to be them, their laptop, and a bunch of data. So I want to give them at least the ability to start there. The idea is that once they have initial results and get familiar with their particular existing data sources, they can work with more experienced data engineering folk to figure out how to wrangle their unique systems into producing what they need.
Anyways, I gotta write the class
I've got a decent sketch of what I want to teach in my head already. I've even got some fun datasets and examples in mind. Now I just need to translate all those ideas into a concrete set of lessons to spread across three days.
Obviously, the hardest part is going to be slowing down. Teaching is not something to be rushed, and that stands in direct opposition to my NYC upbringing, which means I tend to talk very quickly. Add that to my general excitement for the topic and I just know I need to actively stop myself from flying into firehose mode. If I have my way, we're gonna start in the dark ages and speedrun all the way to showing how the same task can be done in SQL, which can then be sent off to massive distributed systems. But that's ambitious, so we'll see if I can get there in time.
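For a taste of where that speedrun ends up, the same "count the interesting thing" task translates almost word for word into SQL. Here's a minimal sketch using SQLite's command-line shell, assuming a hypothetical events.csv extracted from the logs with a timestamp,user,action header row:

```bash
sqlite3 :memory: <<'SQL'
.mode csv
.import events.csv events
-- GROUP BY + COUNT is the database world's sort | uniq -c
SELECT action, COUNT(*) AS n
FROM events
GROUP BY action
ORDER BY n DESC
LIMIT 10;
SQL
```

Same counting, different dialect; the command-line version just teaches the idea without any servers to set up.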
Anyways, if you're in the area during the spring and it sounds interesting, convince your employer or someone to pay for the class! Even if you don't take the class, I can probably sneak out to grab lunch those days.
Either way, I'm sure it's going to be fun.
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you're interested in writing a data-related post, whether to show off work, share an experience, or get help coming up with a topic, please contact me. You don't need any special credentials or credibility to do so.
"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
- randyau.com — Curated archive of evergreen posts. Under reconstruction thanks to *waves at everything*
Supporting the newsletter
All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted, and support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:
- Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
- Send a one time tip (feel free to change the amount)
- Share posts you like with other people!
- Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
- Get merch! If shirts and stickers are more your style, there's a survivorship bias shirt!