[Image caption: A picture from almost a decade ago when I visited Shanghai. My understanding is that you see somewhat less of this now, depending on location.]

You're Collecting Too Much Data!

Guest Post, Jun 4, 2024

[Note from Randy: This week is a guest post from Alan about a topic that I very much believe in – how lazy data collection practices are often worse than no data. Big thanks to him, because this is a hyper-stressful week with zero mental bandwidth for me. Also, my Quant UX Con 2024 talk about effective experiments is June 13th, 7pm EDT if you're interested! Come watch even if you need to use the no-questions-asked free academic ticket!]

The idea of "big data" is still alive and well, with companies still trying to collect as much data as they can about everything that users do. But should they be collecting all of this data? More specifically, should you be collecting all of this data? I think it's all too much. It's too much conceptually. It's too much operationally. And most of it isn't useful. You should probably just get rid of it.

Conceptually, isn't collecting more data better? 

You probably want to know whether you should do X or Y, or maybe do nothing at all. Those are all valid options, and data can help you make that decision. And while conventional wisdom suggests that it's better to have too much data rather than not enough, you probably aren't using all of it. You may be tempted to ask "what if we later decide we want to filter or sort our data according to X or Y?" to which I would respond "why aren't you doing that right now?" There are a few obvious answers to that question. Maybe you don't have the right data, in which case collecting more of the wrong data definitely won't help. Maybe you just haven't gotten around to it yet. Why not? I can't actually answer that for you, because it turns out this might not be a data problem at all, so collecting more data isn't going to help.

One common pitfall associated with having lots of data is that it can invite speculation about causal relationships; this is a trap that can lead to fishing expeditions that aren't attached to a clear action plan. The biggest issue is that causality is difficult to establish. Mostly this is because people want to look at behavioral outcomes, but people do things for all sorts of reasons, and data can only capture some of them. Human decision-making is a complex process informed by preexisting knowledge, and data cannot really tell you what the user already knows. This doesn't mean that causality is impossible to determine; an A/B test, where you randomly assign users to variants and compare outcomes, can help you establish cause and effect. Unfortunately, it isn't always possible to set this up, especially if you are trying to make changes to a live product. Instead, you may have to settle for collecting observational data and making inferences. However, this is risky, because maybe you guess incorrectly and can't find a relationship between your hypothesized causes and visible effects. In that situation, collecting more data is not going to help you.
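
To make that concrete, here's a minimal sketch of what analyzing a simple A/B test might look like, using a two-proportion z-test written with just the standard library. The variant counts are made up for illustration, and the normal approximation is only one of several reasonable ways to analyze this kind of experiment.

```python
# Minimal sketch: two-proportion z-test for an A/B test (normal approximation).
# The counts below are invented for illustration.
from math import sqrt, erfc

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for the difference
    between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))                    # two-sided tail probability
    return z, p_value

# Hypothetical experiment: users randomly assigned to variant A or variant B.
z, p = two_proportion_ztest(conv_a=480, n_a=4000, conv_b=520, n_b=4000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

If p is small, the difference between the variants is unlikely to be noise; the key point is that the random assignment, not the sheer volume of data, is what lets you call it cause and effect.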

The idea of being data driven is that you aren't just guessing, and this should be reflected in the data you choose to collect and examine. Instead of trying to collect everything, focus on collecting the data that's meaningful and get rid of stuff you aren't using.

There are operational reasons to collect less data.

The conceptual challenge of collecting relevant data is probably the most important consideration, but all of that data also needs to live somewhere. Storage is not prohibitively expensive, but it isn't free. And you aren't just storing the data, you're trying to do something with it. That means eventually you also need to look at it. As it turns out, iterating over data is slow because disks are slow. Even just moving the data around is slow, especially when the data won't fit into memory. There are distributed solutions available, but that just turns the whole thing into an infrastructure project before you can hope to extract anything from the data itself.

There are no shortcuts here; at best, the time and space required are directly proportional to the amount of data you have. Realistically, you're probably going to want to filter or sort the data, and that's extra overhead on top of pushing the bits around. If you care about things like time and cost, you should be thinking about how to reduce the amount of data you're dealing with. 
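
As a rough illustration, one way to keep the working set manageable is to stream the raw log in chunks and keep only the rows and columns you actually use, rather than loading everything into memory at once. The file name and field names here are hypothetical.

```python
# Sketch: stream a large event log in chunks, filtering as we go,
# so the full raw file never has to fit in memory.
# "events.csv" and its columns are hypothetical stand-ins.
import pandas as pd

wanted_cols = ["timestamp", "user_id", "event"]
kept_chunks = []

for chunk in pd.read_csv("events.csv", usecols=wanted_cols, chunksize=100_000):
    # Filter early: only the rows this analysis cares about survive.
    kept_chunks.append(chunk[chunk["event"] == "purchase"])

purchases = pd.concat(kept_chunks, ignore_index=True)
print(f"kept {len(purchases)} rows out of the raw log")
```

Even so, this only manages the cost; it doesn't remove it, which is the argument for not accumulating the data in the first place.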

Surely data aggregation can help with some of these issues? Absolutely! The tradeoff is that you lose some granularity in exchange for more efficient use of space. You're also trading time now for time later, because any processing you do now does not need to be repeated later; you still had to do it in the first place, though. And if you're going to lose that granularity anyway, it's possible you never needed it in the first place. That's great news, because it means you may not need to collect some of that stuff at all if you're just going to throw it out anyway.
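
Here's a hedged sketch of what that rollup might look like: raw per-event rows aggregated into daily counts per event type, which take far less space but lose the per-user, per-timestamp detail. The field names are illustrative rather than any particular product's schema.

```python
# Sketch: roll raw event rows up into daily aggregates.
# Granularity is lost on purpose; that's the trade-off.
import pandas as pd

raw_events = pd.DataFrame({
    "user_id":   [101, 102, 101, 103, 102],
    "event":     ["click", "click", "purchase", "click", "purchase"],
    "timestamp": pd.to_datetime([
        "2024-06-01 09:15", "2024-06-01 11:30", "2024-06-02 08:05",
        "2024-06-02 14:45", "2024-06-02 16:20",
    ]),
})

# One row per (day, event) instead of one row per raw event.
daily = (
    raw_events
    .assign(day=raw_events["timestamp"].dt.date)
    .groupby(["day", "event"])
    .agg(events=("event", "size"), users=("user_id", "nunique"))
    .reset_index()
)
print(daily)
```

If the daily table answers every question you actually ask, the raw per-event rows may not need to be kept, or even collected, at all.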

I'd be remiss if I didn't also talk briefly about the regulatory aspects of keeping everything. If you're holding on to user data, you now have to spend time and energy on compliance. Are you going to discover something in your data that will justify the effort? Maybe, but now you're weighing a possible outcome against a real cost, which seems like it isn't such a great deal. If you aren't using personally identifiable data for a good reason right now, maybe just get rid of it and save yourself a lot of trouble later.

What do you keep and what do you discard? 

When it comes to deciding what to keep and what to throw out, there's no single answer that will work for all situations. Going back to data fundamentals, the main thing you can do with data is detect changes. But this requires that you know your baseline and can isolate your intervention. People screw these up all the time. Without a baseline, you just have numbers. So if you want to detect a change, start by keeping a baseline. Maybe you look and see that 61% of users interacted with some new feature you just introduced. Is that good or bad? Who knows? You check the numbers again next month and now the number is 59% of users. So what? Even if you decide that the change is meaningful, you still don't know what to do about it. This is not a problem that can be solved by just collecting more data, because quantity is not the actual issue. 
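
As a toy illustration with assumed numbers, whether that drop from 61% to 59% is even distinguishable from noise depends entirely on how many users each percentage was computed from, which is exactly the kind of context a baseline needs to carry.

```python
# Sketch: approximate 95% confidence intervals for the two monthly
# percentages, at two different (assumed) sample sizes.
from math import sqrt

def proportion_ci(p, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

for n in (200, 20_000):
    last_lo, last_hi = proportion_ci(0.61, n)    # last month: 61%
    now_lo, now_hi = proportion_ci(0.59, n)      # this month: 59%
    overlap = last_lo <= now_hi                  # crude check: do the intervals overlap?
    print(f"n={n:>6}: last month ({last_lo:.3f}, {last_hi:.3f}), "
          f"this month ({now_lo:.3f}, {now_hi:.3f}), overlap={overlap}")
```

With a few hundred users the two numbers are statistically indistinguishable; with tens of thousands the drop is probably real, and even then the data alone still doesn't tell you what to do about it.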

All of this does, however, serve as a reminder that you should really only collect data about things you can control. That principle is often overlooked or lost amid a deluge of other data points, but ultimately it's the one that matters the most when you're trying to turn your data into something actionable. It can also be tempting to track external factors, but you should try to avoid making recommendations about things you don't control!

Outcomes can also be tricky to measure. Too often businesses are interested in user sentiment or some other qualitative characteristic. But if you don't ask that question directly, it's hard to infer it from other data. As a result, you end up trying to guess at your outcome, and this is not ideal. If there's an outcome you are interested in, you should try to measure it directly. Figure out what you can change, and then collect data that tells you what outcome you might expect.

There are real costs associated with collecting too much data.

I feel like the push towards "big data" and "larger models" has obscured the real value of data in the first place. The goal is not to accumulate data in hopes of extracting insights. Data quantity comes at a cost. Not only is there a very real cost to storage and processing, but an excess of data can fool you into wasting time on speculative fishing expeditions with imagined causes leading to unmeasurable outcomes. And the data doesn't always account for the things you cannot control. There's almost never a single thing you can change that will cause someone to choose one thing instead of another. Sure, maybe if you collect everything, you can try to guess at what the influencing factors might be. Or maybe it's all just noise, and you're paying the cost in time and money. What we all really need is better data, not just more of it.


Alan Au is a data nerd at large. Previous careers include experience as a freelance journalist, an enterprise software developer, a bioinformatics researcher, a data scientist, and an independent board game designer. He's trying to figure out what's next!
