Spring has come to the NYC area and the cherry trees have been going all out the past few weeks

Counterfactual analysis, an (organizationally) unloved method

data-culture May 5, 2026

The past couple of days, I've been pondering over counterfactual analysis. For those who might not be familiar with the name, the gist is that it's a backwards-looking what-if analysis that attempts to answer the question "if this change had not happened, what would we expect the present to look like?" It's often used in the social sciences and related fields where it's not possible (or ethical) to do a normal experiment for any number of reasons.

The basic idea of counterfactuals is that you have to construct a special selection of data points that lets you make an argument about what would have happened. For example, if you made a change to the registration page of a web site to get more user sign-ups, you could argue that the number of users visiting the page is your special group (because you know you didn't touch that process), the old sign-up rate was 1% (from lots of historical data), and the current one is showing 2% since the change went out. Make some assumptions, do some math, and you can attribute an extra 5000 new users in the past month to the change.
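
To make that arithmetic concrete, here's a minimal sketch of the bookkeeping, using made-up numbers that line up with the example above (the 500,000 visits figure is invented purely so the attribution comes out to roughly 5,000):

```python
# Hypothetical numbers, invented to match the example above.
visits_this_month = 500_000     # signup-page visits, assumed unaffected by the change
baseline_signup_rate = 0.01     # long-run historical rate before the change
observed_signups = 10_000       # what actually happened this month (a 2% rate)

# Counterfactual: what we'd have expected if the change had never shipped.
expected_signups = visits_this_month * baseline_signup_rate  # 5,000

# The attribution is the gap between reality and the counterfactual.
attributed = observed_signups - expected_signups
print(f"Sign-ups attributed to the change: {attributed:,.0f}")  # ~5,000
```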

As you can imagine, building a counterfactual can range from relatively straightforward to completely impossible. You need to identify some magical group of data points that you can assume to be consistent both before and after the treatment, and that you can also make a strong claim is tied to your metric in question. For user registrations, it's pretty straightforward to consider all the new users who visit the signup page as the group you can track against. Now imagine trying to build a comparable data set for an experiment that involves some messy population like "people who saw a particular TV ad in a certain market". The setup might not be amenable to the method, or the data doesn't exist, or the data that does exist isn't nearly enough to draw conclusions from.

The allure of counterfactuals is that they're more concrete than the kind of hypothetical projections you normally get from an A/B test. While testing your change, the observed lift has all sorts of error built into the measurement just by the nature of sample sizes alone: the observed rate might have been 2%, or it could've been 3%, or 1.5%. If you take whatever point estimate you get from the experiment and just run with it, you can land in a position where someone asks you the uncomfortable question of why your projection didn't quite reconcile with reality 6 months later. Sometimes your estimate could have been spot on, but factors outside of the model may have completely changed things.
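
As a rough illustration of that sampling noise (with entirely invented numbers), a plain normal-approximation confidence interval around an observed 2% rate in a modestly sized experiment arm is already wide enough to cover a big range of outcomes:

```python
import math

# Hypothetical experiment arm: 10,000 visitors, 200 sign-ups observed (2%).
n, conversions = 10_000, 200
p_hat = conversions / n

# Normal-approximation 95% confidence interval on the observed rate.
se = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"Observed rate {p_hat:.1%}, 95% CI roughly {low:.1%} to {high:.1%}")
# Prints something like: Observed rate 2.0%, 95% CI roughly 1.1% to 2.9%
```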

Yet, counterfactuals are rarely used in industry

I've done probably a handful of such analyses in my career. They're genuinely useful in certain circumstances – especially when no one bothered to do an A/B test at all. It's often not a lot of extra work to go from a post-hoc quasi-experiment analysis to a counterfactual analysis; there's quite a bit of overlap in the analysis work when you're hunting for things that changed. If you go so far as to calculate a possible "change from baseline", you're essentially doing the same sort of analysis under a different name, as the sketch below shows.
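
For what it's worth, that "change from baseline" version is usually just the same calculation run over a time series. Here's a toy sketch with invented daily numbers; a real analysis would also have to worry about trend and seasonality:

```python
# Hypothetical daily sign-up counts; the change shipped after day 7.
pre_period  = [103, 98, 101, 97, 105, 99, 102]     # before the change
post_period = [118, 121, 115, 124, 119, 122, 117]  # after the change

# Project the pre-period average forward as the counterfactual baseline.
baseline_per_day = sum(pre_period) / len(pre_period)
expected_total = baseline_per_day * len(post_period)
observed_total = sum(post_period)

print(f"Counterfactual total:  {expected_total:.0f}")
print(f"Observed total:        {observed_total}")
print(f"Attributed to change:  {observed_total - expected_total:.0f}")
```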

But I can barely think of other instances where I've been asked to do a counterfactual. Usually, industry doesn't adopt a method for simple reasons like "it's too complex" or "it doesn't work". Counterfactual analysis is a bit peculiar because while the method is somewhat limited in the types of situations it can be applied to, those situations come up fairly consistently for many business problems. Moreover, the method is pretty effective within those situations because there are often few alternatives. And despite that, in my experience it's rarely applied.

The primary reason that I can identify is that the incentives of modern-day product development pretty much work against utilizing counterfactual analysis.

First, a lot of people who don't work with data regularly likely aren't aware that the method exists. While we're all taught not to let our stakeholders dictate what methods we apply, this lack of awareness means that people don't come to us at a time when doing a counterfactual analysis makes sense. Since the timing of the analysis is "after something has gone out for a sufficiently long period of time", the chances are very slim that a stakeholder will come at the right time, with the right sort of question, that leads to us even suggesting that this is an analysis we should be doing.

Secondly, a counterfactual analysis feels like redoing old work. In the eyes of a stakeholder, we already figured out the "lift" during the experiment phase. In many organizations, the A/B test result was one of the primary gating functions used when deciding whether to ship a feature. Industry has spent the past 20+ years teaching product managers and engineers that hypothesis testing is The Gold Standard for making product decisions – an oversimplification to be sure, but one that stuck.

But the most salient reason is probably that various people on the feature team have likely put those results on their performance reviews and have moved on to other things. While I haven't quite sat down to do the math, my intuition is that if you have a hypothesis test result and wait 3-6 months to come back and reconfirm that result via a counterfactual, there's a much bigger chance that you'll find a result that is smaller than the original test's effect size. It's hard to say whether it's regression to the mean, some unaccounted-for system interaction, the populations the treatment works on being limited, or maybe the whole previous experiment was a statistical fluke.
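
Here's a quick, entirely invented simulation of that intuition – the selection-effect piece of it, anyway. When only experiments that clear a significance bar get shipped, the observed effects of shipped experiments overstate the true effect, so a later counterfactual check tends to come back smaller:

```python
import random
import statistics

random.seed(0)

# All numbers invented: many experiments with the same small true lift,
# each measured with noise, and only clear "wins" get shipped.
true_lift, noise_sd, n_experiments = 0.5, 1.0, 10_000

shipped = []
for _ in range(n_experiments):
    observed = true_lift + random.gauss(0, noise_sd)  # noisy experiment estimate
    if observed > 1.96 * noise_sd:                    # rough significance cutoff
        shipped.append(observed)

print(f"True lift:                       {true_lift:.2f}")
print(f"Mean observed lift when shipped: {statistics.mean(shipped):.2f}")
# The shipped experiments' average observed lift is well above the true lift,
# which is one reason a later counterfactual check usually looks smaller.
```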

Regardless of the actual cause, people and organizations generally don't like it when you go back and tell them that the decision they made didn't work out the way they thought it would. The experience also tends to stick in their memories. So why would they fund work that does nothing but produce those uncomfortable results?

In theory, there's supposed to be one group of stakeholders who are interested in doing counterfactual checks on launch success – people who are responsible for actual performance and not just the appearance of performance within a snapshot of time. Oftentimes these are executives and people who have a particular axe to grind around decision-making metrics. Everyone else is pretty much aligned on not doing such analyses because it can't benefit them even in the best circumstances, while potentially harming them if results contradict earlier work.

And so this method sits in a very peculiar position in the toolbox of data techniques. It's undoubtedly useful for the situations where the stars align, so it's worth learning how to do such analyses at some point. The basic idea is simple enough that many practitioners wind up reinventing the method from first principles. At the same time, it's also an analysis that gets overlooked, whether from ignorance or all the other organizational reasons. I find myself pondering over this kind of work exactly because it pulls into sharp focus a certain kind of dysfunction in organizations.

Fixing the organizational problems is going to be a much more difficult project. In theory, having a way to help show "did we make the right decision based on our data" would be a welcome tool for an organization that purports to be "data-driven". In practice, many organizations practice a form of "data-driven theater", and methods that go against the overall narrative of "we always make number go up!" aren't rewarded.

While it's part of our jobs as data scientists and UX researchers to be "the bearer of bad news" to organizations that don't want to hear it, it's rarely a good idea to make that our sole personality. If anything, constantly saying "this didn't work" doesn't help suggest things that may actually work. Given the extended timelines needed to make new features and changes, you can't rely on pure trial and error, exhausting all other possibilities, as a product strategy.

But still, it's worth keeping the method in mind just in case the opportunity to use it does arise.


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you're interested in writing something – a data-related post to show off work or share an experience – or want help coming up with a topic, please contact me. You don't need any special credentials or credibility to do so.

"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self hosted. Support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions, get access to the subscriber's area in the top nav of the site too
  • Send a one time tip (feel free to change the amount)
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
