Who doesn't love a giant inflatable duck?

Everything's small data again

Apr 30, 2024

Anyone old enough to remember when "Big Data" was the buzzword of the day? Everyone was hyped up about processing data so big it couldn't fit into memory. We were going to find patterns and correlations that would revolutionize business... or something of that sort. I think we can all admit that the so-called "big data revolution" has changed the world somewhat, but not nearly as much as was dreamed during its heyday. Unfortunately, that same unfounded optimism and hype has shifted to "AI".

While having huge amounts of data seems to have useful implications in a machine learning context, one place where big data seems to have been especially useless is in situations where we're trying to draw inferences about some research question. In my grad student days in social science, we could only dream of having even a sample size of 100, let alone 100,000, in a controlled experiment. Oh, the things we could find if we had access to that many people!

But years later, despite working on products that potentially have millions or billions of users and data points, I don't think I've ever encountered a situation where I came even close to needing that many data points. While you could theoretically find all sorts of weird statistically significant differences by comparing a few million users on two sides of a comparison, few of those questions are all that interesting.

For example, you might be able to show that men prefer to click the left button on a page over the identical right one by a tiny amount, while the reverse is true for women. By the sheer brute force of shoving an 8-figure sample into a formula not designed for the application, you could call that statement "statistically significant". But I can't think of a situation where anyone would care to know the result, even if it wasn't a weird statistical artifact.
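To make that brute-force effect concrete, here's a minimal sketch of a two-proportion z-test. The click rates (10.1% vs 10.0%) are made-up numbers, not from any real experiment; the point is that the exact same trivial difference flips from "noise" to "significant" purely because of sample size.

```python
import math

def two_proportion_z(p1, p2, n):
    """z-statistic for the difference between two proportions,
    with n observations per group (pooled standard error)."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    return (p1 - p2) / se

# A 0.1 percentage point difference in click rate (10.1% vs 10.0%):
z_small_n = two_proportion_z(0.101, 0.100, n=1_000)        # ~0.07, nowhere near "significant"
z_huge_n  = two_proportion_z(0.101, 0.100, n=10_000_000)   # ~7.4, wildly "significant"
```

Same effect, same formula; only n changed. Nothing about the difference became more interesting along the way.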

There are limitations on the kinds of interesting questions you can ask of massive, "zoomed-out", population-scale data sets. Think of how many questions are answerable if you have 1 experimental variable and a 50/50 split of male/female gender info. You can certainly find some interesting relationships, like differences in life expectancy, but you'd be helpless to answer the inevitable follow-up question of why that is.

We're more able to find meaning by comparing groups to one another to isolate effects. Segmenting is critical to that job: think about how many times you've had to control for variables just to make sense of a given effect. Each control winds up dividing your data further, until you might be forced to stop controlling for certain factors simply because you've run out of sample size to subdivide. Even if you don't subdivide by demographics, you can subdivide by time and pull shorter-duration samples. Alternatively, teams will sometimes choose to expose a smaller proportion of the population to one experiment in order to "reserve" more of the base population for other experiments that they fear might have interactions.
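The arithmetic behind "running out of sample" is simple enough to sketch. Assuming each control is a binary split (a simplification; real controls often have more levels, which only makes it worse), every control halves the average cell:

```python
# Each binary control (e.g. male/female, new/returning user) splits the
# data again, so the average cell holds n / 2**k observations.
def avg_cell_size(n, num_binary_controls):
    return n / 2 ** num_binary_controls

n = 10_000_000                       # a "big data" sized experiment
print(avg_cell_size(n, 3))           # 3 controls  -> 1,250,000 per cell
print(avg_cell_size(n, 20))          # 20 controls -> ~9.5 per cell: out of data
```

Twenty binary controls is all it takes to grind ten million rows down to single-digit cells.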

So, paradoxically, having large data sets doesn't mean we use them as-is; it means we can make more useful small datasets.

A lot of the big data hype was based on the fact that our tools, the computation and storage and algorithm parts, could finally handle these large-at-the-time datasets. New horizons were opening and people were excited. I distinctly remember how amazed people were that Excel went from supporting 65k rows to 1 million in 2007. You could maybe max out a consumer desktop at about 4GB of RAM for the first time around then. And since companies make money off selling tools, the hype and marketing followed. Look how much DATA we can work with, said the big-iron computing companies that sold mega databases and mainframes of the time.

Now the hype cycle has faded to the point where using the term would mark you as dated. Horizontally scaled data warehouse solutions are now easily available on your favorite cloud. Laptops can pack 64GB of RAM at modest cost and run for a workday. "Big" has stopped being a selling point, and I rarely see people flexing on how many rows they had to wrangle any more. Even the online discussions about how to "work with large data sets" are slowly fading into the background, because our tools have pushed the boundary of what "fits in memory" to extents few people will ever need to contend with. The most common answer to "how do you analyze a trillion rows of data" is now "load it into your data warehouse vendor of choice, or maybe open DuckDB/Polars/etc and do it there". And guess what: once you load the data into either solution, your next move is very likely going to be finding a way to extract the interesting subsets of rows to draw your inferences from.

So why am I harping on something that we all pretty much know and do in practice to the point where we don't even think about it any more? It's because we're returning to the age of data work prior to the Big Data fad, when people working with data didn't have to think about whether their tools could do the needed work. We now simply work with "data" again, no size qualifiers.

The rare-for-2010 skills of writing custom code and algorithms to handle large datasets are barely differentiators now. Students learning R or Python for their class projects can use almost identical tools to work with a trillion rows if needed. Even an analyst who stays entirely in Excel-land has access to features in that ecosystem that let them pull slices of data from databases through things like Power BI or ODBC drivers. The fundamentals of data system architectures, like streaming systems and batch processing systems, are increasingly well known, so no one has to reinvent those wheels any more. For any of those functions, there are a couple of products and solutions, and learning to use one is much more accessible than inventing it from scratch.

Over the past 20-30 years, analysts of all stripes have finally reached the point where tools don't have to dominate the conversation, partly because the vendors are tripping over themselves selling AI stuff now. That doesn't mean we've completely reverted to how things were before. My opinion is that processes, applications, and methods are going to take on more importance in the conversation over time. There's only so much arguing about our favorite database or library that can be done.

In addition to a shifting conversation, we've successfully plowed a new career path where "mere data analysts" can develop coding ability and grow toward an engineering discipline if they want, whether in data engineering, ML engineering, or similar. Who could've imagined that such new options would come into existence? That's pretty cool.


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

  • randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self hosted, so support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
