Making fake data for class, then making them worse🙃
This marks the last week before I have to teach my logs class and as you can imagine, I'm very busy right now ... playing Slay the Sp— no, I mean, writing up code to generate example datasets to be analyzed. Time is just ZOOMING BY. halp.
While I initially wanted to pull example datasets of stuff like Apache web logs for a taste of realism, it became increasingly apparent that, given the limited time allotted to the class and the many side rabbit holes that exist in explaining the ins and outs of a real log file, it was better for everyone involved that we used a bespoke, contrived dataset. That way, students have data they can understand and work with without the aid of a bunch of domain knowledge that I'd have to impart, and they can confidently assume that any failure of code is most likely from their own mistakes.
Another thing I realized while preparing the code is that generating this dataset is probably easiest to do with bespoke code. Sure, there are synthetic dataset libraries out there that could generate something similar, but surprisingly, they're just not necessary in this case.
For a typical "e-commerce site web log" application, the potential actions to log are pretty limited, so it's actually pretty easy to simulate the entire flow of a random user visiting, browsing items, putting things into their cart, checking out, and making payment. Since it's so easy to come up with a relatively full accounting of all the actions involved, there's not even a need to look at historical examples to derive distributions for generating new data. Just model the process using code and go! Plus, even if the rows don't make 100% sense within context of each other, it'll still work out since we'd be unlikely to be using the file in a way that will uncover the weirdness.
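To give a flavor of what "just model the process" looks like, here's a minimal sketch of one simulated session. The column names, action names, probabilities, and timing ranges are all illustrative guesses on my part, not the actual ones used for the class:

```python
import random
import uuid
from datetime import datetime, timedelta

PRODUCTS = ["widget", "gadget", "doohickey", "thingamajig"]

def simulate_session(start_time):
    """Walk one fake user through the funnel, emitting log rows.

    Each row is (timestamp, session_id, action, product, price).
    These fields are hypothetical, just for illustration.
    """
    session_id = uuid.uuid4().hex[:8]
    t = start_time
    rows = [(t, session_id, "visit", None, None)]
    cart = []
    # Browse a few items; each view has a chance of an add-to-cart.
    for _ in range(random.randint(1, 5)):
        t += timedelta(seconds=random.randint(5, 120))
        product = random.choice(PRODUCTS)
        rows.append((t, session_id, "view_item", product, None))
        if random.random() < 0.4:
            t += timedelta(seconds=random.randint(2, 30))
            rows.append((t, session_id, "add_to_cart", product, None))
            cart.append(product)
    # Only sessions with a non-empty cart can check out, and not all do.
    if cart and random.random() < 0.6:
        t += timedelta(seconds=random.randint(10, 300))
        rows.append((t, session_id, "checkout", None, None))
        t += timedelta(seconds=random.randint(5, 60))
        total = round(sum(random.uniform(5, 50) for _ in cart), 2)
        rows.append((t, session_id, "payment", None, total))
    return rows

rows = simulate_session(datetime(2024, 1, 1, 9, 0, 0))
```

The nice property of generating rows this way is that the bookkeeping comes for free: timestamps only move forward, payments only happen after checkouts, and checkouts only happen when something is in the cart.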
For now, you can see the code I'm using to generate this stuff in this repo. I have a lot left to do in the next few days, so expect things to just constantly change...
Another interesting thing about the dev process this time is that I knew enough about the specifics of my problem that I had relatively little trouble using an LLM tool to generate code that came close to my requirements. It was a pretty nice blessing to not have to manually fuss with all the bookkeeping steps needed to make sure all the entries at least passed the initial sniff test.
But then, there's data cleaning
It wouldn't be a logs analysis class if it didn't at least discuss what can go horribly wrong with logs data. After all, while business logic and launches can always be rolled back when an issue arises, logs data is usually intended to be immutable. Store it and forget about updating it.
The problem is that juggling all the little interactions needed to make a believable dataset is already quite difficult to do. Even for a toy dataset with 5 columns, there was still a large number of subtle little interactions to keep track of. Trying to generate a subtle set of errors that works within the boundaries of those interactions sounds like a lot of work for little payoff.
So instead I went the lazy way and slapped together some code that just goes and chaos-monkeys a preexisting log file. Drop a few rows! Duplicate some! Screw with some timestamps to throw things out of order! Go nuts and change some random characters in one of the fields for fun (even though this error doesn't really happen in real life).
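A lazy corruption pass like that can be surprisingly short. This is a rough sketch of the idea, not the actual class code; the function name, error rate, and corruption choices are all my own illustrative assumptions:

```python
import random

def chaos_monkey(lines, seed=None, error_rate=0.05):
    """Corrupt a clean log (a list of text lines) with a few of the
    error types mentioned above: drops, duplicates, reordering, and
    random character mangling. All rates here are arbitrary.
    """
    rng = random.Random(seed)
    out = []
    for line in lines:
        roll = rng.random()
        if roll < error_rate:             # drop this row entirely
            continue
        out.append(line)
        if roll < 2 * error_rate:         # duplicate this row
            out.append(line)
    # Shuffle a small window to knock timestamps out of order.
    if len(out) > 3:
        i = rng.randrange(len(out) - 3)
        window = out[i:i + 3]
        rng.shuffle(window)
        out[i:i + 3] = window
    # Mangle one random character in a surviving line, just for fun.
    if out:
        j = rng.randrange(len(out))
        line = out[j]
        if line:
            k = rng.randrange(len(line))
            out[j] = line[:k] + "?" + line[k + 1:]
    return out

clean = [f"2024-01-01T09:00:{s:02d} event{s}" for s in range(20)]
dirty = chaos_monkey(clean, seed=42)
```

Taking a seed also means every student can get the same "broken" file, which matters when you want the whole class debugging the same weirdness.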
While these surface errors do occasionally appear in daily work, the space for bugs is nearly infinite, unlike my very finite brain and fixed deadline. I have this itch in the back of my mind right now about how there's at least a couple of classes of insidious bugs that I just haven't remembered. The worst part is I'm not sure WHAT I'm missing, other than a vague sense of insidiousness. Because data bugs are traumatic.
One of these days I really need to sit down and try to figure out how to properly teach data cleaning in some kind of class setting.
So what happens after the classes?
The classes have been a really good excuse to get me to kick around a bunch of thoughts in my mind. I think after actually doing the teaching and figuring out the rough spots, I'll have a clearer picture and be able to turn it into something I can share.
Maybe I'll do it as a subscriber thing, see how that goes. I dunno. Obviously I do not plan things ahead very far. I'll report back next week because I've got like 5 datasets to slap together!
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing something, a data-related post to either show off work, share an experience, or want help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
- randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything
Supporting the newsletter
All Tuesday posts to Counting Stuff are always free. The newsletter is self hosted. Support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:
- Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
- Send a one time tip (feel free to change the amount)
- Share posts you like with other people!
- Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
- Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!