The "art" part of setting metrics
One of the most common things I see in a quantitative UX researcher job posting is the need to "help teams set metrics". While the request seems relatively straightforward, I've always found the whole process very messy to describe. Metrics are just numbers – measurements embedded in some kind of interpretive context. What could be proposed as a metric is completely open ended, but the metrics that are actually useful for a team are much rarer. Helping teams define the best metrics to use feels almost as much like an art as a science.
The most naive interpretation of helping a team set metrics resembles working with a team to help them pick some numbers out of a hat. If the team is building some product, they have an idea of what numbers they want to report to show how they are doing. In such a simple context, the job of the data analyst or quant researcher working with that team can amount to reviewing that list of candidates and deciding which of those metrics are "the best".
Anyone who has been working for more than a couple of years will instantly see problems. Teams are motivated to report numbers that make them look good, and to hide or explain away numbers that make them look bad. The team's desired outcome may not actually be aligned with the goals of the larger organization. Similarly, teams often don't have access to a data practitioner to help them understand what they should be measuring, and so they tend to count things that are convenient to count – new users, total accounts, sign ups, orders made, etc. More complicated measures like longer term retention, interaction funnel dropoffs, and targeted usage telemetry are often forgotten.
What often happens in these situations is that a less experienced analyst or researcher doesn't know enough to push back on poor metric definitions. It doesn't help that more junior data folks get put into these situations with more senior product leads because there are usually fewer data folks to go around. It's easy to get swept up with the flow and pick out overly simplistic metrics that are barely more useful than vanity metrics. I know I've done this plenty of times in my younger days.
For more experienced practitioners, a metrics setting exercise becomes less a "find a workable metric" exercise and more of a co-ideation session. The assumption is that the team very likely doesn't have the ideal metrics on their potential wishlist and it's our job to talk to them and figure out what works best for their constraints.
This is where we get a lot of the more established recommendations on how to set metrics. We ask questions like "What does success actually look like?", "What decision would this metric force you to make?", and "What's the broader business goal we're trying to achieve here?" These are guideposts in a conversation with product stakeholders that can honestly go in any direction. I've had conversations where someone has put lots of thought into the questions before I even have a chance to ask them. I've also had conversations where someone hadn't thought about them at all before I asked.
The heavy lifting work for having these conversations is to juggle a couple of things. First is obviously listening to the team and what they're building, what their goals and decision points are. Another is keeping in mind what broader goals exist and to what extent the metrics and goals point in the same direction even in the face of potential gaming down the line.
Finally, while keeping all these details in mind, we have to make sure that the metrics have useful properties. I think this is something that isn't talked about much when discussing metrics. A lot of times, we don't discuss these properties directly with stakeholders because it can get unnecessarily technical, but they're floating in the back of our minds as the discussion unfolds because they'll affect our recommendations.
We want ideal metrics to be sensitive to important changes while insensitive to unimportant ones. We also want them to react relatively quickly to potential bad news by being a leading indicator (like paid subscription sign-ups) as opposed to a lagging indicator (like subscription renewal cancellations). They ideally shouldn't be too affected by random, unexplainable noise. They should also reflect the broader population or system in question and not be overweighted on the behavior of a small subgroup. They also shouldn't be easily broken by bad behavior from users (or over-aggressive PMs).
Neither these details nor a million others are immediate deal-breakers – every production metric is going to be susceptible to one or more issues at any point. But the balance of issues can make a proposed metric more or less appealing to use. The question for us to figure out in our heads is whether these little imbalances outweigh the sheer utility of a given metric.
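Two of these properties are easy to sanity check with a few lines of code. What follows is only a rough sketch under my own assumptions: the function names, the toy numbers, and the choice of "standard deviation of recent daily values" as the stand-in for noise are all made up for illustration, not a standard recipe.

```python
from statistics import stdev


def signal_to_noise(daily_values, meaningful_change):
    """Ratio of a change we care about to the metric's day-to-day spread.

    Assumption: the sample standard deviation of recent daily values is a
    reasonable proxy for "random, unexplainable noise".
    """
    return meaningful_change / stdev(daily_values)


def top_user_share(per_user_totals, top_fraction=0.01):
    """Share of the metric's total that comes from the heaviest users.

    A value near 1.0 suggests the metric mostly reflects a small subgroup.
    """
    ranked = sorted(per_user_totals, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:n_top]) / total if total else 0.0


# Hypothetical data: five days of order counts, and per-user order totals
# where a single account dominates.
daily_orders = [98, 102, 98, 102, 100]
orders_per_user = [900] + [1] * 99

print(signal_to_noise(daily_orders, meaningful_change=5))
print(top_user_share(orders_per_user, top_fraction=0.01))
```

A low ratio from the first function hints that the change you care about would be hard to tell apart from ordinary wobble, and a high share from the second hints that one whale account could move the metric by itself. Neither number settles anything, but both are cheap to look at before recommending a metric.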
As an example, we might have a metric on user satisfaction that is strong and reliable, collected by having our sales team ask some simple questions during routine client check-ins. Those surveys can be highly correlated with important business outcomes like renewals. But if asked whether we should take that satisfaction metric and raise its importance to be amongst the company's north star metrics, the decision isn't so simple. Maybe the metric is a poor indicator because angry customers who are about to cancel their contract would refuse to take a sales check-in call. Maybe we find that the survey works very well in one country but translates very poorly in a different country with different norms. Maybe we just don't do enough sales calls and its effectiveness as a metric starts to drop if we scale it up. What if leadership wants to tie sales bonuses to the outcomes of this metric, thereby invoking Goodhart's Law on it?
Throughout this process, we're being asked to use our experience and predictive imaginations in order to peer into the future and see how a given metric may affect the decision-making behavior of an organization. It's largely an impossible task. Stakeholders can misinterpret a metric much more creatively than we can imagine failure modes.
But regardless of whether we can peer into the future or not, we have to pick something. And here I think is where a lot of us wind up leaning on our past experience and domain knowledge. You're rarely going to get into trouble if metrics that directly tie into revenue are elevated to being most important – because capitalistic endeavors eventually demand revenue. The same goes for basic metrics around user adoption and retention for a product – you can't make money without users who are willing to pay and continue paying. Alternatively, making sure users can get through an important sequence of actions (like paying) is also pretty safe. All of these are decent first-pass metrics in that, if we know nothing else about how anything works, they will usually give an indication of how well a product is performing. None of these potential metrics really explain why something is performing well or not, but they're enough to trigger an investigation. Having a metric in place and ready to target specific issues requires you to predict exactly where issues are going to show up, and that's a very tall order for any product that is trying to do something new. Maybe it's possible if you've seen a similar-ish example before, but very often even past experience is of little help.
This is why metrics setting is so hard to describe. The initial stuff is straightforward, but everything past that point devolves into a ton of specifics, with a lot of smaller concerns nipping at the edges. For many projects, I don't think there's a singular correct answer to what the best metric is. The best I hope for is setting up a couple of sufficiently sensitive alarm bells that trigger when things need attention. The real work begins with investigating what is causing the alarms to go off.
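For what it's worth, an "alarm bell" in this sense can be very simple. Here is a minimal sketch, assuming a plain z-score check against recent history. The function name, the threshold of 3, and the sample numbers are hypothetical choices for illustration, and real metrics with trends or weekly seasonality would need something less naive.

```python
from statistics import mean, stdev


def needs_attention(history, latest, z_threshold=3.0):
    """Return True when `latest` is unusually far from recent history.

    Assumption: "unusual" means more than `z_threshold` sample standard
    deviations away from the mean of `history`.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is worth a look.
        return latest != history[0]
    return abs(latest - mean(history)) / spread > z_threshold


# Hypothetical daily checkout completions.
recent_days = [98, 102, 98, 102, 100]
print(needs_attention(recent_days, latest=101))  # within normal wobble
print(needs_attention(recent_days, latest=90))   # far outside recent range
```

The check only says that something moved. It says nothing about why, which is exactly the point where the investigation starts.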
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.