[Photo: just a set of amusing lucky beers at the local grocery]

Antagonistic data systems

Nov 19, 2024

Nothing says "post-moving" like the kid bringing some plague back from school to share with the family x_x...

When we discuss working with data, it is often an unspoken assumption that the systems we work with are "friendly" or simply neutral. It's typically data generated within a company, and while bugs and errors may exist, they weren't put there intentionally and can be corrected with effort. On average, everyone's goals and intentions are aligned. A pretty big chunk of the tech world operates under these conditions. Even if data isn't generated in-house, it might come from a non-profit or government agency whose integrity is generally trusted – you assume that despite its many documented quirks and foibles, the US Census data is generally reliable to use.

But today, I would like to remind folks that there are plenty of situations where such trust doesn't, or shouldn't, exist.

In my mind, one of the clearest examples in relatively recent memory of a data system that was outright antagonistic to its users was Facebook's misrepresentation of video metrics in the mid-2010s, which sparked a lot of controversy about a "pivot to video". Some argue that it hurt news publishers, who reconfigured their media staffing to produce expensive video that ultimately wasn't what users wanted. By the time people realized video wasn't working out, long-term damage had been done to organizations and careers.

Other examples include claims that certain countries and government agencies manipulate their economic reports for various geopolitical reasons. The accusations themselves often have political motivations behind them, but regardless of their merit or lack thereof, the fact remains that not everyone is on the same team in these situations.

The difference between "friendly" systems and "antagonistic" or "adversarial" ones is primarily the degree of trust we give to the system and the extent to which we can verify the results coming out of it. While it is standard procedure to check for errors in our data sets and try to triangulate things, we usually afford friendly systems the assumption that things are working as intended towards a shared goal. If a few metrics disagree slightly, we might just overlook them, assuming the discrepancy is the result of an innocent bug. Meanwhile, antagonistic systems are assumed to be producing results that benefit one side and possibly harm the other. This change in perception has deep effects on how we handle our work.

Facebook's video metrics scandal is a good example of what an antagonistic system looks like. Facebook had every incentive to encourage content on its platform that keeps users there longer – video that plays in the feed instead of links to articles that go off-site. Most importantly, Facebook largely owned all the data that could be analyzed. Since videos were playing in the main feed and not on the publisher's own web properties or servers, it was extremely difficult to verify whether the metrics reported by Facebook were true. Most users and brands would take the stated metrics at face value because what other option was there? So Facebook had both the motive and the ability to misrepresent things (willfully or not) and direct its users to do things that were good for Facebook.

I have another example from very early in my career at a rather sleazy, third-rate ad-tech company. It was 2009 and I needed a job after the financial meltdown, so I got hired as an ad operations analyst, essentially deciding which advertising sources would be connected to which web properties. Ostensibly, my job was fairly objective, since I would be analyzing activity and conversion behavior to make sure that advertisers were getting a decent return for their marketing dollars.

But when I'd talk to the finance manager to see how ad money was being received and paid out to various publishers, I learned that throughout the advertising chain, various providers would declare some percentage of ad clicks to be "fraudulent activity" that they weren't going to pay for. Sometimes that was my doing, because there'd been a sudden flood of sketchy click traffic or similar. Other times, it was just because the COO or CFO was taking an extra 10% off of payments because the upstream advertiser was complaining and had demanded a 10% refund. Most of the justification was just "proprietary fraud detection system". What's worse, this sort of behavior was pretty rampant amongst lower-tier providers.

You would think that advertisers would be able to verify the number of clicks, traffic, and conversions they were getting from ads, and to some extent they could. But thanks to the unreliable nature of technology, everyone expected 5-15% of traffic to just be broken for various reasons. Maybe a user simply closes a tab. Maybe there's a browser error or a malformed URL. Other times people disable cookies and Javascript. And sometimes there really is fraudulent behavior that needs to be dealt with. All of this was presented to the end advertisers and publishers identically, under a single blanket justification. People definitely took advantage of the power to declare clicks good or bad with little in the way of checks or balances.

Yeah, working in such an environment for ten months pretty much sealed it: I never worked on ads again in the following 15+ years of my career.

Paranoia is healthy

So the primary thing to do when dealing with an adversarial system is to figure out what you have to trust, and what you can triangulate and verify. Ideally, the thing you're verifying against is something under your own control that can be trusted – your own servers or measurements. In the case of internet ads, the verification point is traffic landing on your web server. If the ad service says 100 people clicked on the ad, you ideally expect 100 users arriving with that source attribution. Given the randomness of internet tech, you'd very likely accept that 5-10% of traffic breaks or cancels itself before the request completes, so significant deviations from that range are red flags.
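To make that concrete, here's a minimal sketch of what such a reconciliation check might look like. The function name, the numbers, and the 10% loss threshold are all hypothetical illustrations, not anyone's real pipeline:

```python
# Minimal sketch: reconcile the click count an ad network reports against
# the sessions that actually landed on your own server with that source
# attribution. All names, numbers, and thresholds here are hypothetical.

# Some traffic always gets lost to closed tabs, browser errors, disabled
# Javascript, etc. – the 5-10% breakage band mentioned above.
EXPECTED_MAX_LOSS = 0.10

def reconcile(reported_clicks: int, landed_sessions: int) -> str:
    """Compare a vendor's click count to our own server-side count."""
    if reported_clicks == 0:
        return "no reported traffic to check"
    loss = 1 - (landed_sessions / reported_clicks)
    if loss < 0:
        # More landings than reported clicks: attribution is broken
        # somewhere, or someone is undercounting.
        return f"RED FLAG: {-loss:.1%} more landings than reported clicks"
    if loss <= EXPECTED_MAX_LOSS:
        return f"OK: {loss:.1%} loss, within the normal breakage band"
    return f"RED FLAG: {loss:.1%} of reported clicks never reached us"

# Vendor says 100 clicks, but our logs only show 62 attributed sessions.
print(reconcile(reported_clicks=100, landed_sessions=62))
# -> RED FLAG: 38.0% of reported clicks never reached us
```

The important design point is that one side of the comparison comes from a system you control; a check that compares two numbers from the same untrusted vendor tells you very little.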

Sometimes you have to go out of your way to verify a piece of data. Maybe you don't trust traffic or visitation data for a physical store, so you might be justified in doing actual field research and sending someone to manually count people going in and out of the building. Or you run special promo codes for tracking.

Obviously, systems that don't let you independently verify metrics readouts are the most suspicious ones. For example, Twitter's analytics metrics are almost completely within Twitter's control because they own all the data and the analysis pipelines. About the only way to see your own numbers is to own a server and count the clicks coming to it from a tweet. An unscrupulous owner could easily manipulate the reports to say whatever they want. A clever person might even manipulate the numbers to make them internally consistent. Our only recourse is to decide whether we trust the numbers or not.
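As a thought experiment, here's a toy sketch of the kind of internal-consistency check you could run on a self-reported metrics export. The field names, numbers, and tolerance are all hypothetical:

```python
# Toy sketch: sanity-check a platform's self-reported metrics for internal
# consistency. Field names, numbers, and the tolerance are hypothetical.

def consistency_problems(report: dict) -> list[str]:
    """Return any internal-consistency violations found in a metrics report."""
    problems = []
    # A click requires an impression; a conversion requires a click.
    if report["clicks"] > report["impressions"]:
        problems.append("more clicks than impressions")
    if report["conversions"] > report["clicks"]:
        problems.append("more conversions than clicks")
    # A stated rate should match the counts it was supposedly derived from.
    implied_ctr = report["clicks"] / report["impressions"]
    if abs(implied_ctr - report["reported_ctr"]) > 0.001:
        problems.append(
            f"reported CTR {report['reported_ctr']:.3f} "
            f"!= implied CTR {implied_ctr:.3f}"
        )
    return problems

metrics = {"impressions": 50_000, "clicks": 1_200,
           "conversions": 90, "reported_ctr": 0.031}
print(consistency_problems(metrics))
# -> ['reported CTR 0.031 != implied CTR 0.024']
```

Checks like these catch sloppy manipulation and honest bugs, but as noted above, a careful adversary can fabricate numbers that pass all of them – which is why the ultimate recourse is still deciding whether you trust the source.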

What's frustrating, or fun, about working with adversarial data systems is that it really forces you to think very hard about everything: what is potentially suspect (and why), and what clever ways exist to triangulate observations and determine whether a metric is reliable. It stretches muscles we don't normally have to exercise unless we have an interest in looking at things sideways, like a security researcher does.

I'm sure that if you stop and think about it, you have access to some kind of data that is at least slightly suspect in its truthfulness. I highly recommend spending a bit of time on the thought experiment of figuring out how you would prove to yourself whether that data source is reliable or not.


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing a data-related post, whether to show off work, share an experience, or because you want help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.

"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

  • randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted. Support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Send a one time tip (feel free to change the amount)
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!