Talk summary: Designing experiments for maximizing getting things done

TopPost Jun 20, 2024

Yay. Now that the house-hunting is over, I have more brain space to do some writing. Plus motivation to write more because mortgage payments 🫠

If I have to choose between text and video, I'm a text sort of person. The problem is that even though the slides to my QuantUXCon talk are available, my presentation style tends to include a lot of ad-libbed voiceover on the slides. So the only real way to capture what I say at a talk is either to listen to the recording... or for me to write up a summarized transcript. This post is pretty much an adaptation of the talk contents and the slides, so that busy people who prefer text can reference it.

Probably the most interesting bit is the Q&A section, because (to my surprise) we had a bunch of good questions that were fun to answer. Having hosted a bunch of conferences before, I know Q&A time is when things can fly off the rails, but we had a good session.

We all do experiments

Experimental setups are a foundation of science, and we in industry have adapted those methodologies to our work. The great thing about experiments is that they let us "try things and find out". We can apply the method very broadly and still get useful results.

What's notable is that industry's use of experiments is very much atheoretical. We don't have a "theory of idealized checkout pages"; such an idealized theory doesn't exist. Instead, industry just cares that a treatment caused a result. Industry tries out a new thing, and if it provides a desirable benefit, industry goes with it without worrying much about the mechanism.

This theory-less worldview is very convenient for industry. Results are results. Even if we don't have a theory or mechanism in mind, the things that work better may be studied later by academics and incorporated into a theory somewhere.

More importantly, experimentation means that every idea can stand before an impartial judge. Anyone can propose an experiment and see if their idea beats the alternatives along whatever dimension we care about. Experimentation allows for the democratization of testing ideas. (Caveat: this assumes the experiments are run correctly and without bias, which can be a problem.)

The arc of using data

This talk specifically addresses issues that tend to be encountered at a certain phase of data use maturity. So, as a brief aside, I show the arc of how companies tend to use data. I specifically call it an "arc" because it goes up... and then down.

  • Start by collecting existing data
  • Use analysis methods to find correlations, make decisions
  • Aspire to be data-driven, start doing experiments at great cost
  • Widespread adoption, tooling improvements, data improvements
  • We can test anything!
  • We should test everything!
  • OK, OK. Test the important stuff

The thing to remember about this arc is that progress through it is not linear. At any point things can change and we fall back a few steps. People change, systems get updated, processes change, priorities change, and all of that can reset the whole data-driven enterprise.

Today's talk is primarily focused on the phase that happens once it becomes cheap and easy to run experiments. The tooling for testing has been refined so that it minimizes the number of mistakes that can be made with an experiment at the mechanical level. This is the phase where you start hearing people say "we should test everything because we can test anything".

Bad experiments

And so, when companies hit this level of data usage maturity, it becomes increasingly likely that they start engaging in bad experiments. The word "bad" here is specifically for the experiments that are mechanically correct but somehow miss the point. They're the experiments where all the sampling and randomization and treatments are functioning as needed, but you're still looking at the test wondering why we're even running it in the first place.

Examples of such bad experiments include:

  • Experiments as a cure for decision paralysis – test everything because we can't decide what to do
  • Tests must be very realistic – when people over-index on ecological validity and insist that only live tests are good because lab-based studies in controlled environments aren't realistic enough to get good results
  • Rejection of self-report metrics – when people insist that only "hard stats" from logs and activity matter, because surveys and other self-report measures are subjective and bad
  • Eye exams pretending to be A/B tests – tiny copy changes, objects shifted over 3 pixels, maybe changing the wording of a paragraph in the EULA.
  • Tests that will run until the heat death of the universe – due to a mix of tiny effect sizes, difficult to acquire samples, and just unreasonable expectations of statistical power
  • Testing amnesia – teams forget whether an idea has been tested before

All these examples of bad experiments are costly, not just in the time and attention needed to create them, but also in user experience and trust. No experiment comes free.

Making things better

Overall, my suggestions for making things better involve three themes:

  • Do bigger, bolder (a.k.a. more noticeable) treatments with tangible effects
  • Have experiments that aim to learn, not just decide
  • And then remember the lessons learned over time

This means we should run fewer, more meaningful experiments, because running a ton of tiny experiments is just p-hacking with more steps. Just because a test is easy to run doesn't mean it's worth doing. As the stewards of experimentation in our workplaces, we should be advocating against testing things that aren't worth testing.
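To make the "p-hacking with more steps" point concrete, here's a quick back-of-the-envelope sketch (my own illustration, assuming independent tests at the usual α = 0.05) of how fast chance "wins" accumulate when you run a pile of small experiments:

```python
# Back-of-the-envelope: run many independent tests at alpha = 0.05 and the
# chance of at least one false positive climbs fast.
alpha = 0.05
for k in (1, 5, 10, 20, 50):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests -> P(at least one false positive) = {p_any_false_positive:.2f}")
# 20 tests -> ~0.64, 50 tests -> ~0.92
```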

Don't treat a gradual rollout that happens to leverage the testing infrastructure as an experiment. Teams very often use the testing framework to launch a new feature while keeping an eye on key metrics, and they get used to calling these launches experiments because the same tooling is involved. But we never learn anything from these launches unless we screw up badly.

Stop testing on tiny populations. They're going to take too long to run, and they're likely not worth doing until you're operating at a bigger scale. This also applies to situations like eye-exam tests, where a web page might have a ton of traffic but the change is so obscure that the number of people who will notice it is tiny, so the effective sample size is also tiny.

Stop running tests that have tiny effect sizes. Sometimes, even if a test goes 100% in your favor, the end effect will be tiny and not worth the time until the business scales up more.
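To put rough numbers on both of those points, here's a small sketch with made-up figures (the traffic, notice rate, and baseline are invented purely for illustration):

```python
# Eye-exam test: lots of traffic, but only a sliver of visitors will ever
# notice the change, so the effective sample is much smaller than it looks.
visitors = 1_000_000
notice_rate = 0.02                      # assume only 2% of visitors notice
effective_n = int(visitors * notice_rate)
print(effective_n)                      # 20,000 users doing the real statistical work

# Tiny effect size: even a 10% relative win on a rarely used flow is a small
# absolute payoff.
baseline_rate = 0.004                   # 0.4% of visitors complete the flow today
relative_lift = 0.10                    # the test wins by 10%, relatively speaking
extra_conversions = visitors * baseline_rate * relative_lift
print(extra_conversions)                # 400 extra conversions per million visits
```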

Have an experimental memory. This is an extremely difficult task, but it's important to keep records of what tests have been run and what the results were. This prevents teams from accidentally redoing an experiment, and also serves as a way to see how concepts and ideas play out over time.

Test ideas, not individual designs. Most designers have a concept in their head that provides the inspiration for the design that actually gets tested. Sometimes, the specific implementation doesn't work out, but perhaps the idea is sound and we need to iterate. It's important to make sure we have discussions about tests at this more abstract level because it will generalize much better.

We should talk about theory a bit more. People are very much intimidated by the word "Theory" and feel like they aren't qualified to engage with the topic. But everyone has an idea in their mind about how the world works, and that idea provides the inspiration for the designs they test. People have no problem engaging with "design patterns" or "user experience" or "user goals", even though all of those are just more concrete versions of a more academic theory.

These are organizational problems

The use of experiments, both good and bad, happens at the organization level. The fixes to these problems will also have to be at the organization level. That means it is a long and difficult road to embark on. But it's very much worth doing.

Q&A

Q: How do you decide if a sample size is enough?

A: So in Research 101 in grad school, we learned how to do a sample size calculation. Grab a textbook and a calculator, somehow figure out your effect size, decide on the statistical power you need, plug everything into a formula, and out comes the needed sample size. Fields like clinical trials have rigorous processes for this. But in industry... anyone who's asked a stakeholder what their expected effect size is will get a shrug in response. So we guess: 1%, or 10%, etc. My personal rule of thumb is that ~1000 is a good place to start. We're mostly testing different formatting on a web page, so it doesn't warrant being too serious about it.
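For the curious, here's roughly what the textbook version looks like in code. This is a minimal sketch using statsmodels, with the baseline rate, hoped-for lift, and power picked arbitrarily for the example:

```python
# Minimal power-analysis sketch (numbers are made up for illustration):
# baseline conversion 10%, we want to detect a lift to 11%, alpha = 0.05, power = 0.8.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.10, 0.11)          # Cohen's h for the two proportions
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(round(n_per_arm))                             # on the order of 7,000+ users per arm
```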

Q: What do you recommend if the sample size is small due to budget or participant pool?

A: Yes, we love big samples because it means our error bars are tight. But we often forget that in social science, samples of 100, 50, 20 are still valid. The error bars might be wide, but you can have good conversations about whether the boundaries of the 95% confidence interval are anywhere near what you need them to be. Sometimes that conversation is enough.
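As a sketch of what that conversation can look like (made-up numbers, interval computed with statsmodels):

```python
# Small-sample study: say 14 of 20 participants completed the task.
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=14, nobs=20, alpha=0.05, method="wilson")
print(f"70% success rate, 95% CI roughly {low:.0%} to {high:.0%}")
# The interval is wide (about 48% to 85%), but you can still ask whether even
# the pessimistic end clears the bar you care about.
```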

Q: We're starting to implement A/B testing in prod. What sorts of guardrails do you recommend?

A: Guardrails are tricky. The first guardrail will be the people running and designing the experiment. The most dangerous thing is when experiments become biased: maybe the sampling is broken and participants see both treatments, or there's a skew in the demographics, and so on. Metrics also need to be designed fairly, because it's easy to accidentally bias things. Other guardrails you will find as you come across issues in practice.
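One guardrail that's worth automating early (not something I covered in the talk, and the counts below are invented) is a sample ratio mismatch check: if an experiment was configured as a 50/50 split but the observed assignment counts are way off, the randomization is probably broken and the results shouldn't be trusted. A rough sketch using scipy:

```python
from scipy.stats import chisquare

observed = [50_812, 49_180]                # made-up user counts per arm
expected = [sum(observed) / 2] * 2         # what a true 50/50 split would give
stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:                        # deliberately strict threshold
    print(f"Possible sample ratio mismatch (p = {p_value:.2g}); investigate before trusting results")
```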

Q: Can you give an example of what you call "theory"?

A: So a big-T "Theory" that's relevant for web design would be something like Information Seeking Theory [note: this is actually a whole group of theories about information seeking behavior]. All that finds applications in designing webpage navigation. But asking a designer with no research background to engage with the theory is very intimidating for them.

Meanwhile, small-t "theory" would be something like the understanding of how users check out when making a purchase. It draws on things like the mental model of the user, the understanding of the goals of the user, etc. All of that can be tied back to formal academic theories, but you don't talk about them directly. You essentially build a story around how people are using the checkout page, and stakeholders have a much easier time digesting and engaging with that story. These more abstract discussions are useful to have because they can lead us to creating new, interesting designs.

Q: How can we help designers work with theory?

A: Designers have ways of understanding how people use things, but they have a different language for it that has nothing to do with the academic theory. It's often useful to show how their ideas have parallels in formal theories. Those parallels can often yield predictions that can turn into new design ideas. But it's up to us to bridge the gap.
