Just some ducks floating amongst cherry blossom petals

Democratizing data might not be about skills

May 14, 2024

Last Thursday I needed to vent some thoughts about training and teaching data skills to subscribers, but putting those ideas down really helped pave the way to much more clarity this week. Thanks for letting me use the space for that kind of thinking.

Let's imagine that you have set up a system/process to democratize data access (we'll leave the definition of what that means for another time). People who are not "professional data people" now have access to various bits of data to use as they want. To what extent do you as an organization trust that these people are going to make effective use of the data?

That question has many conflicting answers, which is why there are so many different opinions about whether companies should give their employees access to data, and to what extent. The answer obviously depends strongly on the context: who's making the decisions, what sorts of decisions are going to be made, the potential costs of failure, the overall culture, etc.

My personal opinion is that broad access to raw data is a bad idea: raw data has so many quirks that the odds that people draw "bad conclusions" are much greater than any benefit. So the more reasonable alternative is that access to data needs to be gated by some minimum competency in working with data. That is to say, metrics carefully curated to be "clean" can be shown to everyone; cubes of cleaned data that allow slicing in a BI tool go to people with better number-interpretation skills; and increasingly raw data goes to people the closer they get to having a data analyst skill set. That sounds pretty reasonable, right? Except I don't know how to determine the bars between those layers. How would I, or anyone else, know when people are capable enough to be set loose upon data? For that matter, how do I know that even I myself should be looking at the data that I'm looking at, in the way I'm looking at it?
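To make that tiering idea a bit more concrete, here's a minimal sketch in Python of what gating by competency could look like. The tier names, the score thresholds, and the very existence of a single "competency score" are all invented for illustration — this isn't a system I've built or am recommending as-is.

```python
# Hypothetical sketch: map an assessed "data competency" score to the
# rawest layer of data a person is allowed to touch. The layers mirror
# the tiers described above; the scoring thresholds are entirely made up.
from enum import Enum


class DataLayer(Enum):
    CURATED_METRICS = "curated metrics"   # clean, vetted numbers anyone can see
    CLEANED_CUBES = "cleaned data cubes"  # sliceable cubes in a BI tool
    RAW_TABLES = "raw tables"             # warehouse tables, quirks and all


def allowed_layer(competency_score: int) -> DataLayer:
    """Gate data access on some (yet-to-be-defined) competency score."""
    if competency_score >= 80:
        return DataLayer.RAW_TABLES
    if competency_score >= 50:
        return DataLayer.CLEANED_CUBES
    return DataLayer.CURATED_METRICS


print(allowed_layer(65).value)  # cleaned data cubes
```

The sketch mostly just highlights the problem: I have no idea what that competency score should actually measure, which is exactly where the next paragraph gets stuck.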

The problem of measuring skill bars just seems unsolvable because the complex, interconnected web of knowledge needed to do data analysis resists simple measurement. Even analyzing something as simple as revenue numbers requires knowing something about financial reporting conventions, business logic, and the principles of forecasting. Depending on the specific question, inferential statistics may come into play, or programming simulations might be needed. And that's just one family of problems for one metric. Webs of knowledge differ in small or large ways when moving between problem domains or spaces. So I suppose the only way to determine a list of "minimal data skills" is to endlessly list out the concepts used for various tasks and see what the common denominators wind up being – for example, the concepts of bias, sampling, hypothesis testing, projecting, counting, classifying...

Since I'm not able to come up with a "complete minimal skill list for data analysts", I feel like I might just be barking up the wrong tree entirely.

Where do we learn analysis anyways?

I don't think I've ever taken a class at any school level, from elementary school all the way to college, that specifically taught me how to "do data analysis". It's just something we learned along the way while trying to make sense of the world. We learn how to solve math word problems all throughout school. We are forced to collect, summarize, and analyze bits of data for science projects. Later in life, we get projects where we see data in various guises and have to draw conclusions from them. Finally, if you get into graduate-level research, you're exposed to the hard work of trying to pull knowledge out of uncooperative data.

We picked up all these analysis skills through practice along the way. More importantly, we learned those skills in an environment where people would point out our mistakes and help us learn. Our teachers in school did this. We did this for our fellow students. When you hit academia-level research work, you get to "join the scientific community", which just means you've hit the edge of what people know to be correct and instead rely on peer review – a.k.a. having peers collectively try to point out how bad your work is – for this function.

To put it more clearly, we learn to be good data analysts by being in an environment, a community, of people who will demand that our analysis and reasoning be solid and justified. This translates even into the industry world, where I know that my peers and stakeholders will attempt to call BS on things that I say. Those folk aren't dumb; even if they don't have the same technical skills, they can follow a thread of reasoning. That fear of getting called out by all those folk pushes me to constantly ask if I'm doing things right, to constantly wonder if there are flaws to catch.

Meanwhile, what's the biggest fear about democratizing data? That people will come to bad conclusions and subsequently make bad decisions without anyone calling out the BS before it's too late. Essentially, people who aren't trained in data haven't developed the reflex where finding a result that confirms their biases alerts them to a potential error. Instead they gleefully follow their confirmation bias. The environment they exist in hasn't instilled that fear into them.

The same applies to why I somehow "know" when I'm treading on dangerous ground analyzing data I'm not familiar with. If math is the same all around, why do I instinctively feel uneasy analyzing data outside the spaces I'm familiar with? It's because I don't know how to anticipate what objections peer experts would throw at me. That fear of being called out keeps me from making bold claims without doing extra sanity checking. Even when I do make a bold claim, it's often presented as "hey, this is weird and I don't know why".

Perhaps the whole reason it's so hard to democratize data is that it's hard to build an environment and culture around working with data. It's hard to find enough people who can work well enough with data and who have the time and energy to call out the flaws in each other's work.

Build the environment through people

Anyone who's ever tried to affect the culture of a group will know that changing culture is very difficult because it means changing many people over time. Just think of all the times you've seen some executive announce a new cultural initiative only for the endeavor to fall flat within a few months.

So my current thinking is that tutoring people on how to pull and manipulate data is not quite enough to get them to a place where they can be trusted to do analysis on their own. There's never going to be some checklist of skills that will bestow that trust on those folks either. Instead, what I need to do is have those people share their work with similar folk so that they can get the feedback they need to grow while also giving feedback to others. It can't be a simple one-way street where they have me review and comment on their work, because that makes me the single point of analytical failure.

So perhaps the best way to democratize data is to make retaining access conditional on joining a group of peers with similar access and sharing their explorations with that group. Force them to develop the habit of having their analyses pass a minimal peer review process.

I dunno if this will actually work in practice. It still seems like a very heavy-handed process. But given that I've failed to create such communities at work over the past 15+ years, I'd be willing to give it a try in the future.


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

  • randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted, so support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — there’s a survivorship bias shirt!