Book Review: Solve Any Data Analysis Problem
Know that standing offer I mention in every post about sharing cool stuff people share with me? Requests don't come in very often, but I'm always super happy to put effort into sharing stuff. There are two this week: the big book review up top, and a smaller one at the end. So don't miss them.
About two whole months ago David reached out about a data book he was writing and offered to send me the early access copy to get my thoughts on it. I of course happily agreed to at least take a look. But aside from getting the book, there weren't any strings attached: I got to decide whether I wanted to share it with others, and I could say whatever I wanted. Then it sat in my pile of unread work for two months 🙃. But I finally managed to get off my butt and read the 6 chapters currently available (of 10 listed in the contents) in the early access program.
So the book is titled "Solve Any Data Analysis Problem: Eight projects that show you how" by David Asboth. The book's stated goal is to take 8 very common data projects and show the reader how a data analyst would systematically work through the projects from start to finish. Here, "start" is roughly when a request comes in – meaning the process includes a bit of the familiar "what the heck do you actually mean by this?" debate that happens with stakeholders. "Finish" largely entails getting to a point where you can show some preliminary results to someone and consider the next step, since our work doesn't really have a well-defined "end".
Here are the projects from the chapters I had access to, along with my own rough summary of what each one covers:
- Project 1 - Identifying customer geographies - working with messy address data to make an analysis
- Project 2 - Who are your customers? - modeling data and pulling information from different, conflicting, data sources
- Project 3 - Metrics matter - setting metrics
- Project 4 - Getting creative with data sources - reading data from PDFs
- Project 5 - Handling categorical data - working with labels, missing values, classifying, etc.
If you've been working as a data analyst for even just a couple of years, many of the projects should sound extremely familiar to you. They're representative examples of the work we do every day, accompanied by realistic data sets and code, ugly data warts and all.
For example, the first project involves figuring out how much money customers from London spend as opposed to customers from the rest of the United Kingdom. The problem is that you have data from the sales database, which primarily has a full text address field. The chapter then goes into showing examples of how figuring out "London" from an address is non-trivial – starting with just basic spelling/formatting issues, but also going deeper. The author then leads the reader step-by-step through identifying the issues and showing ways to deal with them (along with explanations for why a given method was chosen).
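To give a flavor of why this is non-trivial, here is a minimal sketch in Python. It is my own illustration, not the method used in the book. The function name `looks_like_london`, the sample addresses, and the decision to treat only the eight London postal district areas (E, EC, N, NW, SE, SW, W, WC) as "London" are all assumptions made for the example; a real project would need to settle that definition with the stakeholder first.

```python
import re

# Assumption: "London" means the London postal district only.
# Outer boroughs with other postcode areas are deliberately ignored here.
LONDON_POSTCODE_AREAS = {"E", "EC", "N", "NW", "SE", "SW", "W", "WC"}

# Rough UK postcode shape: 1-2 letters (the area), a digit, an optional
# letter or digit, optional whitespace, then a digit and two letters.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2})\d[A-Z\d]?\s*\d[A-Z]{2}\b")


def looks_like_london(address: str) -> bool:
    """Guess whether a free-text address is in London (hypothetical helper)."""
    # Normalize case and collapse repeated whitespace before matching.
    text = re.sub(r"\s+", " ", address.upper()).strip()

    # Prefer the postcode when there is one: the word "London" can show up
    # in street names far away from the city.
    match = POSTCODE_RE.search(text)
    if match:
        return match.group(1) in LONDON_POSTCODE_AREAS

    # Fall back to looking for the word itself. Misspellings slip through.
    return re.search(r"\bLONDON\b", text) is not None


if __name__ == "__main__":
    samples = [
        "12 High St, london   SW1A 1AA",
        "1 London Road, Manchester M1 1AA",
        "Flat 2, 5 Baker Street, Londn",
    ]
    for address in samples:
        print(f"{address!r}: {looks_like_london(address)}")
```

Even this toy version has to choose between two imperfect signals, and the misspelled "Londn" address still gets misclassified, which is exactly the kind of rabbit hole the chapter walks the reader through.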
For someone who may have never encountered this specific problem space before, it's a really smooth and curated guide on how to tackle the nitty gritty details of "doing the work". It first shows you WHY something is an issue, WHEN it would actually bite you, and only then does it show a potential solution. Most "guides" you read out there just throw a method at you and let you figure it out.
In short, this is exactly the kind of book I would've liked to write myself if I were trying to teach people how to do something. It's often how I write many of the posts on this newsletter. Except David has beaten me to writing the book, so I don't have to.
But this brings up the biggest problem I have, one that has nothing to do with the book and everything to do with me – I AM VERY MUCH NOT THE TARGET AUDIENCE. With so many years of data analysis behind me, I've done multiple variations of all the projects provided. I have opinions, many of which are quite ingrained into my soul. I have visceral reactions to certain problem setups because I've had them go wrong so often.
For example, I'm literally one sentence into the first project description, at "you’ve been asked to report on spending volumes for London-based customers versus those based in the rest of the United Kingdom." and I'm already ready to shout "what do you mean by 'London'?" to a nonexistent executive. Is it the official city borders, or the metropolitan area that probably includes nearby towns that have merged together over time, or some other definition? I can also anticipate how massively deep a rabbit hole "figure out the city by analyzing the address" can go. If I were writing this book, I could easily spend a chapter on each of these subproblems. So the whole time I was reading through the book, across all the scenarios, I found places where alarm bells would go off in my head over these issues.
But David smartly chooses to not go too far into these rabbit holes because showing a fresh analyst the depths of the abyss isn't the point of this book (nor should it be for any book). The problem has to be set up and shown to the reader before it is solved. The provided solution to the problem is also not "complete" and leaves a lot of the more subtle details out because, again, they're a distraction. You are supposed to learn how a problem is found, diagnosed, and overcome. You are not supposed to commit a solution to rote memory.
While I know to the depths of my soul that these omissions and white lies MUST exist for a coherent book to be written, it doesn't prevent me from having that internally horrified feeling of watching a car that is going to crash in one of five ways, but not knowing exactly when and which way it will crash. I've got opinions about which parts of the rabbit holes should/shouldn't be shared. Those opinions will differ from those of everyone else who has opinions on this same topic because I'll weight them differently based on past experiences. There's a huge gap between intellectually accepting a disagreement of viewpoint, and emotionally accepting one.
Putting my silly gripe aside, I do want to make it clear that I did learn stuff from the book. We all have our own ways of thinking through and solving problems, and watching another person who clearly knows what they're doing use alternative tools or methods to get to a similar endpoint is valuable even for veterans. This is why I write the many posts about methods that I write, and also why I enjoy watching other people "do the same work" because I can spot and appreciate the subtle differences in methodology.
Also, there are problems that I'm just less familiar with than others. For example, I very rarely have to work with PDFs, so it's nice to just see someone who knows what they're doing explain how they go about it. It's certainly easier than having to piece the process together from a patchwork of blog posts of unknown date and compatibility.
Clojure Tidy Tuesdays
A different reader, Kira McLean, sent in something they were working on – a collection of Tidy Tuesday explorations in the Clojure language. I have to admit that I am a complete novice when it comes to functional programming, so I can barely make heads or tails of it. But I do think it's cool to see data work done in languages other than R and Python.
For those unfamiliar, Clojure is a dynamic functional language that is based on Lisp and runs on the Java runtime. Functional languages typically feel alien to people like me, who learned to program in the more familiar imperative/object-oriented programming paradigms. Recursion also features prominently in a lot of functional programming, and that alone proves to be a hurdle.
That said, Python and other languages pull in concepts that originated in functional languages, such as list comprehensions, lambda expressions, map, reduce, and filter. If you're somewhat familiar with those concepts, you probably have a small taste of what it is like to use a functional language and can perhaps appreciate the power. It's not enough to easily pick up the languages without effort, but at least there are some familiar faces out there.
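As a quick refresher on those familiar faces, here is a small Python sketch. The order amounts and the names `order_totals_pence` and `recursive_sum` are made up for illustration, and none of this comes from the Clojure explorations themselves. It just shows the same few ideas (comprehensions, lambdas, map, filter, reduce, and a recursive function) side by side.

```python
from functools import reduce

# Made-up order amounts in pence, purely for illustration.
order_totals_pence = [1250, 8000, 725, 4100]

# List comprehension: keep only orders of at least 40 pounds.
large_orders = [t for t in order_totals_pence if t >= 4000]

# The same selection with filter and a lambda expression.
large_orders_via_filter = list(filter(lambda t: t >= 4000, order_totals_pence))

# map applies a function to every element: convert pence to pounds.
in_pounds = list(map(lambda t: t / 100, order_totals_pence))

# reduce folds the list into one value: the grand total.
grand_total = reduce(lambda acc, t: acc + t, order_totals_pence, 0)


def recursive_sum(values):
    """Sum a list without a loop, the way functional code often does it."""
    if not values:  # base case: an empty list sums to zero
        return 0
    head, *tail = values  # split into the first element and the rest
    return head + recursive_sum(tail)


if __name__ == "__main__":
    print(large_orders, large_orders_via_filter)
    print(in_pounds)
    print(grand_total, recursive_sum(order_totals_pence))
```

In a functional language the last two styles (folding with reduce and recursing on head and tail) are the everyday way to work, rather than the occasional trick they tend to be in Python.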
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
- randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything*
Supporting the newsletter
All Tuesday posts to Counting Stuff are free and will remain so. Generous support from subscribers keeps the servers running and makes everything possible. If you love the content, consider supporting the newsletter in any of the following ways:
- Share posts you like with other people
- Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
- Consider a paid subscription to help pay for the servers and get occasional supporter-only posts
- Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!