Talk writing: doing experiments better
Bi-weekly Thursday posts are primarily for subscribers and feature more in-progress work and thoughts from Randy. This week is about experiments.
Why do I do this to myself? I somewhat impulsively submitted a talk idea to Quant UX Con 2024 in June and... whelp, I got accepted and now have to start the process of writing up the talk. It's a whole thirty minutes, and once again I'm regretting the fact that I'm a fast-paced speaker who's going to wind up with extra time.
This week's Subscriber post will be the first inklings of what will become my talk. I'm partly using this space to vent a bit about industrial experimentation practices. The idea is that if I just vent and breathe fire all over the place, some useful diamonds will emerge out of the charred rubble. What, that's not how diamonds are formed – get a better analogy? Yeah, I'm aware, and I'm not gonna.
So the base talk proposal is providing strategies for teams to run better, more effective experiments. Ever since grad school, I've gotten to watch people propose experiments to try to understand whatever it was they were interested in, be it academic or industry research. Some ideas were clever and effective, while many others were bad for all sorts of reasons. Over time, I started seeing patterns in the badness, hence the talk.
As an example, I have a distinct memory from my Communications graduate seminar where we were reading papers about understanding how people responded to avatars versus face-to-face media. (It was the mid-2000s, before even smartphones; video chat was expensive enterprise stuff.) Since we were grad students, we were asked to discuss the validity of the studies, the generalizability of the results, etc. We were also asked to discuss improvements, extensions, and critiques of the experiments.
During the discussion, I noticed that one thing multiple students kept focusing on was the ecological validity of the results. "Effect A was found in a lab setting, wouldn't it be better and more generalizable if we checked things in a more realistic setting?" Seasoned researchers are probably shaking their heads right now, since it's very common that small social science effects that are just barely found in lab settings are almost impossible to reproduce in less controlled settings. But at the time, all the students in the room went along with the idea that this would be an improvement.
That experience has stuck with me all this time. It's a very common sort of critique for new researchers to raise, because "make it more realistic!" can be applied to almost anything. And I continue to see parallels to that behavior in industry work. This time, instead of grad students, it's non-researchers who are either proposing experiments or critiquing results. There's a very strong bias to "make stuff realistic" in some way, to help it pass a certain sniff test.
That's just one of the things I want to vent about and explore in my talk.
Overall, experimental work is already akin to coming up with a plan for finding a p<0.05-sized needle in a haystack. Lots of knee-jerk reactions wind up dumping more hay (noise) onto the stack.
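To make the haystack metaphor a bit more concrete, here's a tiny simulation sketch (my own illustration with made-up numbers, not anything from a real experiment): a two-sample t-test hunting for the same fixed effect while the noise gets cranked up. The effect never changes; only the hay does, and the detection rate at p<0.05 falls apart.

```python
# A minimal sketch with made-up numbers: the same effect gets harder to
# find at p < 0.05 as extra noise ("realism") is piled onto the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def detection_rate(effect=0.2, noise_sd=1.0, n=200, trials=2000):
    """Fraction of simulated experiments where a two-sample t-test gets p < 0.05."""
    hits = 0
    for _ in range(trials):
        control = rng.normal(0.0, noise_sd, n)
        treatment = rng.normal(effect, noise_sd, n)
        _, p = stats.ttest_ind(control, treatment)
        hits += p < 0.05
    return hits / trials

for sd in (1.0, 2.0, 4.0):  # "realism" usually shows up as extra variance
    print(f"noise sd = {sd}: detection rate ~ {detection_rate(noise_sd=sd):.2f}")
```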
Aside from the impulse to make effects harder to detect by adding "realism", I've also seen other things... in no particular order.
One that drives me up a wall is the excessively timid experiment, the kind that ends with people questioning whether users even noticed the change. This is usually done to "avoid breaking things". Let's add a bit of extra explanation text to see how it affects sales. Let's change the color of the checkout button. Let's incrementally test specific elements of a new design before we put together our redesign. There are lots of factors that go into how boldly you craft your experimental treatments, but I've repeatedly seen teams get to a point where their experiments are "too costly to fail" and so... their experiments rarely fail. Or more precisely, it doesn't matter either way if they fail.
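To put a rough number on the timidity problem, here's a back-of-the-envelope power calculation (a sketch using statsmodels with hypothetical effect sizes, not results from any real test). The sample size needed per arm grows roughly with the inverse square of the effect size, so a button-color-sized effect needs orders of magnitude more traffic than a bold change does.

```python
# Back-of-the-envelope sketch (hypothetical effect sizes, not real data):
# users needed per arm to detect a standardized effect d at 80% power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.5, 0.2, 0.05, 0.01):  # from "bold redesign" down to "button color tweak"
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"d = {d}: ~{n:,.0f} users per arm")
```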
I've also repeatedly seen teams use the A/B testing framework as a rollout service – that is, they use the ability to show a variation to 5%, 10%, 50%, or 100% of users as a way to slowly roll out new changes. This is probably harmless behavior overall, but it leaves a bunch of legacy cruft in the codebase unless people clean it out.
Then there's the pathological case where "we must test this [extremely minor] change because testing is part of the launch process", even when the risk of not testing is minuscule.
I'm sure over the next few weeks I'll come up with more ideas, but these are a decent starting point. If readers have their own fun gripes, I'd love to hear them.