Basic logs analysis via code
Phew, since there's a slight pause in the chaos while the call-for-speakers period for DataBS-Conf is open, I can take a couple of hours to add to the pool of supporters' content. Thanks to everyone's support we can plan for a fun conference, since everyone here is helping pay the cost of the software needed for the event!
In the previous post of this series(?), "Logs analysis from a bygone era, the command line", I went over the idea of using just Unix command line tools to do pretty common logs analysis tasks. The conclusion was that CLI tools can, with some cleverness and by pushing complex logic up into the log generation code, handle most analysis questions that fit the pattern of "one independent log line = one interesting event".
While a lot of logs analysis fits that pattern, not everything does. The most common example of a log problem that doesn't is any calculation of durations. Examples of this problem include "time on site or page", "time to complete a task", "length of a user's browsing session", and so on. Durations typically require two distinct log entries, a start and an end, to calculate a value. While it's possible to move the duration calculation up into the log generation part of the measurement framework, there are a lot of other technical hurdles that come with that method too. Oftentimes it is more convenient and feasible to simply do the math after the data is collected.
CLI tools typically fail at this because these problems effectively require memory and custom data models. CLI tools were designed to accomplish their basic tasks by reading one line at a time, applying a specific algorithm, and throwing away previously seen data. They're certainly not going to make use of custom data models. So the only way to get our work done is to leave the CLI and head to something that gives us the ability to remember things: an actual programming language.
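To make the "memory" point concrete, here is a minimal sketch in Python. It assumes a made-up log format of `timestamp user_id event`, where the event is either `start` or `end`; the format, the `session_durations` name, and the sample data are all hypothetical and only there for illustration. The dictionary is the part a line-at-a-time CLI tool doesn't give us: it remembers each user's unmatched start until the matching end shows up.

```python
from datetime import datetime


def session_durations(lines):
    """Pair each 'start' event with the next 'end' event for the same user."""
    open_starts = {}  # user_id -> timestamp of the still-unmatched start event
    durations = []    # (user_id, seconds) for every completed start/end pair
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that don't match the assumed format
        timestamp, user_id, event = parts
        when = datetime.fromisoformat(timestamp)
        if event == "start":
            # Remember the start; a later start for the same user overwrites it.
            open_starts[user_id] = when
        elif event == "end" and user_id in open_starts:
            started = open_starts.pop(user_id)
            durations.append((user_id, (when - started).total_seconds()))
    return durations


# Hypothetical sample data in the assumed format.
sample_log = [
    "2024-01-01T10:00:00 alice start",
    "2024-01-01T10:00:05 bob start",
    "2024-01-01T10:02:30 alice end",
    "2024-01-01T10:10:05 bob end",
]

for user_id, seconds in session_durations(sample_log):
    print(user_id, seconds)
```

Note that the two users' events are interleaved, which is exactly the case where "look at one line, then forget it" breaks down. An end event with no remembered start is silently dropped here; real logs would need an explicit decision about those cases.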