Even stats-savvy people can get confused by Standard Error visualizations
Stats talk? Here? What a rarity!
I don't read stats papers often since they typically go over my head, but this one came across my feed and it's worth pointing out, because I'm pretty sure I've fallen prey to the confusion it describes too.
First off, the paper: https://www.pnas.org/doi/10.1073/pnas.2302491120
The core of the paper is that humans, even experts, interpret commonly used visualizations of Standard Error and Standard Deviation in qualitatively different ways, even though the two statistics are deeply related through sample size. Here's a copy of their core figure:

The problem is that in most scientific literature, we are used to reporting the uncertainty of the inference, that is, the standard error. In visualizations, our error bars show where we think the true population mean is. The more samples we use in our study, the narrower the bars become. The chart above on the left shows a clear, significant difference between the two classes. Papers want to show this because it demonstrates that a statistically significant effect exists and the means are unlikely to be equal.
Meanwhile, the uncertainty of individual outcomes (a.k.a. the individual observations) is characterized by the standard deviation, and that does NOT change with increasing sample sizes (duh). Every time a subject of a study is observed, they can show up randomly all sorts of distances from the sample mean. Plotting the distribution of the sample onto the same chart does two things. First, it expands the y-axis scale because there's so much variation it can't fit otherwise. Second, it shows that a huge amount of overlap exists between individual observations even though the means differ.
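To make that relationship concrete, here's a minimal sketch (my own made-up numbers, not the paper's data, assuming normally distributed outcomes) of why the two behave so differently: the standard error is just the standard deviation divided by √n, so it shrinks as you collect more samples, while the spread of the individual observations stays put.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: individual outcomes vary with SD ~10 around a mean of 50
for n in [10, 100, 1000, 10000]:
    sample = rng.normal(loc=50, scale=10, size=n)
    sd = sample.std(ddof=1)      # spread of individual observations: stays around 10
    se = sd / np.sqrt(n)         # uncertainty of the estimated mean: shrinks as n grows
    print(f"n={n:>5}  SD={sd:5.2f}  SE={se:5.2f}")
```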
What's notable is that when presented with the left-side SE graphs, my guess is people make the flawed assumption that the samples would sorta cluster in a normal-ish random distribution around the mean, scaled to sit near the error bars, even though that is clearly an incorrect way of interpreting the standard error, since we can shrink that range by simply adding more samples. This misinterpretation can happen especially if the reader is not statistically savvy (or just not paying attention). It translated to subjects in the study estimating the effect size to be larger when they saw only the Standard Error charts without seeing visualizations of the Standard Deviation. When shown identical data with Standard Deviation information included (though subjects were told it was a different study), subjects consistently made lower, better guesses of the strength of the effect.
Here, effect size estimation was operationalized as subjects stating their guess of the probability that a random treated subject would have a higher or lower value than a control subject. This estimate of the effect size is useful because a true value can be calculated and used as a reference point.
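For two roughly normal groups, this "probability of superiority" style of effect size has a simple closed form, so here's a quick sketch of how such a reference value could be computed (my own made-up numbers, not the paper's stimuli, and assuming normal distributions with a shared SD):

```python
import numpy as np
from scipy.stats import norm

# Made-up example values, not the paper's stimuli
mu_treat, mu_control, sigma = 52.0, 50.0, 10.0

# For two independent normals with a shared SD:
# P(treat > control) = Phi((mu_treat - mu_control) / (sigma * sqrt(2)))
p_superiority = norm.cdf((mu_treat - mu_control) / (sigma * np.sqrt(2)))

# Monte Carlo sanity check
rng = np.random.default_rng(0)
treat = rng.normal(mu_treat, sigma, 100_000)
control = rng.normal(mu_control, sigma, 100_000)
p_mc = (treat > control).mean()

print(f"closed form: {p_superiority:.3f}, simulated: {p_mc:.3f}")
```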
This effect of being misled by Standard Error graphs isn't novel to these authors or the field. Others have noted the problem and suggested alternative visualizations to mitigate it, and previous studies had shown laypeople SE and SD plots and found that laypeople get misled by the SE charts. The current paper's contribution to the discussion is that they showed these charts to people who are all supposed to know better – medical professionals with experience reading research studies, data scientists, and tenure-track professors – and these experts still made a similar pattern of mistakes. This occurred even when subjects were prompted to think about the meanings of SE and confidence intervals, which you'd think would make them a bit more self-aware of their mistake.
(Incidentally, the paper also measured a bunch of other things, like whether subjects thought the paper should be published in a journal. That's less interesting to us, so I skipped it.)
Luckily, studies like this one have also shown that the bias can be mitigated quite well by simply plotting the distribution of observations alongside the SE results.
The paper then goes on to discuss how science in general has focused on emphasizing statistical inference results, perhaps because prediction of individual outcomes is so difficult. The suggestion the authors give is to just make sure that BOTH SE and SD information is communicated to readers in close proximity so that readers can make better interpretations.
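If you want to do that in practice, one minimal way (my own sketch with simulated data, not a prescription from the paper) is to jitter the raw observations next to each group's mean-and-SE error bar, something like:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Simulated example data: two groups whose individual outcomes overlap heavily
groups = {"control": rng.normal(50, 10, 200), "treatment": rng.normal(53, 10, 200)}

fig, ax = plt.subplots()
for i, (name, values) in enumerate(groups.items()):
    # Jittered raw observations show the spread of individuals (the SD story)
    ax.scatter(i + rng.uniform(-0.15, 0.15, len(values)), values, s=8, alpha=0.3)
    # Mean with an SE error bar shows the uncertainty of the inference (the SE story)
    se = values.std(ddof=1) / np.sqrt(len(values))
    ax.errorbar(i, values.mean(), yerr=se, fmt="o", color="black", capsize=5)

ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups.keys())
ax.set_ylabel("outcome")
plt.show()
```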
Turning to industry, I think we perhaps go one step further in reporting findings and simply put a star next to findings that are significant, without concerning our audience too much about the effect size. If we do, it's common to convert the effect into some tangible optimistic projection like "this 3% lift would translate to an extra $10 million in revenue a year". Since most stakeholders treat statistical significance as a mere box to check, we aren't really incentivized to create charts at all. On average, I guess not accidentally misleading a stakeholder with a chart because they don't care is better? Maybe?
That said, we do show similar charts to our peers, probably out of habit from being former academics. In those situations these interpretation-failure studies are pretty important to us. I'm 100% sure that I've misinterpreted similar charts before and am quite confident that I'll do it again whenever I'm the slightest bit distracted. It's just so damn easy to make this mistake.
So let this be a lesson to us, and everyone else!
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing a data-related post, whether to show off work, share an experience, or because you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
- randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything
Supporting the newsletter
All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted, so support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:
- Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
- Share posts you like with other people!
- Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
- Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!