Almost a decade ago, I visited Chengdu and this is the view from the taxi we took.

Crime-ing with Data Science

Apr 1, 2025

About two years ago I wrote up a post looking into a criminal complaint from JPMorgan Chase (at the time the post was for paid subscribers only, but I just opened it up to free subscribers too) accusing Charlie Javice, founder of the company Frank, of committing fraud by generating synthetic lists of users to make the company seem to have many more users than it actually had. JPMorgan Chase had acquired the startup for $175 million.

Watching data science in a fraud lawsuit filing is fun!
And slightly infuriating

I was reading the criminal filing at the time for two reasons:

  1. The accused had hired a data scientist to generate millions of fake-but-realistic user emails in order to inflate their user counts
  2. The original complaint cited detailed email records of the requests as they were being made. If there's anything to learn from this, it's this: don't conspire to defraud a bank and then hand over the email servers, with all the conspiratorial conversations saved under mandatory retention processes, to the bank you're defrauding.

I'm coming back to this over the weekend because the case has concluded with a jury finding the defendants guilty of multiple counts of conspiracy, fraud, etc.

The end of the NPR article I linked to happened to say this: "Prosecutors said Javice ended up paying a college friend $18,000 to create millions of fake names with pedigree information." I had mentioned two years ago that I was surprised that the data scientist hadn't been dragged into this mess as a conspirator since they were a healthy sneeze away from knowingly participating in the fraud instead of just being a hired gun. But hey, I don't know how criminal prosecution works.

Anyways, data talk.

The acquisition of Frank happened in 2021, meaning all the drama, fraud, and faking of data would have been happening during the leadup to the deal. This means it all happened just a couple of years before LLMs took the world by storm.

I can't help but think to myself that had this all happened just two, maybe three years later, instead of hearing about how a data scientist was paid $18k to build and tweak a model to fake a giant list of customer emails, we would've seen a subpoena to OpenAI with GPT prompts asking for code that does the same thing. At the time news of the case broke, I had been taking it as an example of how being a data scientist can sometimes put us in danger of crossing ethical, and legal, lines. That of course still remains true for us regardless of what age we live in.

But now, I'm sure that if an unscrupulous CEO wants a dirty data task done, an LLM could be pretty easily led into doing the task. People have been working on models that help with data analysis tasks, and it's easy enough to couch a request so that it isn't obviously illegal. No need to deal with pesky humans who might have the context to question the ethics. And there are fewer witnesses, unless you count logs at the API provider. Heck, LLMs aren't even all that necessary in some situations since there are legitimate packages out there that can help generate synthetic data for you.

(The lack of) Opportunities for fraud

Another interesting thing I noticed about this whole story is that the fraud fell apart because it eventually had to face reality – JPMorgan tried to send an email campaign to the fake user list and the majority of the emails bounced so hard it set off an investigation. Honestly, this sort of fraud could have been perpetrated by anyone with programming skills and some time. So to call it a crime perpetrated, or even enabled, by data science is a giant stretch. It took a very unusual set of circumstances to even enable this feeble attempt.
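To make the "face reality" point concrete, here's a minimal sketch of the kind of sanity check that a test email campaign amounts to. Everything in it is my own assumption for illustration: the status labels, the `bounce_rate` and `looks_suspicious` helpers, the 10% threshold, and the sample data are hypothetical and not taken from the case filings.

```python
from collections import Counter


def bounce_rate(statuses):
    """Return the fraction of send attempts that bounced (0.0 for an empty list)."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return counts["bounced"] / total if total else 0.0


def looks_suspicious(statuses, threshold=0.10):
    """Flag a list for review when its bounce rate exceeds the threshold.

    The 10% default is an arbitrary, hypothetical cutoff.
    """
    return bounce_rate(statuses) > threshold


# Hypothetical campaign results: most of the sends bounced.
sample = ["delivered"] * 3 + ["bounced"] * 7

print(f"bounce rate: {bounce_rate(sample):.0%}")
print("flag for review:", looks_suspicious(sample))
```

Nothing here needs data science, which is kind of the point: a fake list only survives until someone actually tries to use it.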

That lack of criminal effectiveness doesn't seem to be unique to this situation either. If you search around, it's really very difficult to find examples of a data scientist finding ways to unethically profit from their work in the field. At most, our jobs allow us to enable our employers to do unethical things like deploy dark patterns, enable discrimination, collude and fix prices, etc., but we ourselves can't do much on our own. It is pretty easy for us to do harm to one or more individuals given our position of influence on large-scale systems that can potentially touch the lives of millions of people. It's a completely different matter to find a way to personally gain from bad actions.

Since our role is largely one of force multiplication – enabling existing things to be done faster, better, cheaper – we're typically attached to a team that has the actual power of execution. Much of the work of a data job is to build up enough trust with teams to be able to persuade people to follow our models and recommendations. So while we might have to struggle with ethical issues like "what's the best way to hide the unsubscribe button" or "should I build the murderous AI bot", rerouting money into our own bank account is usually not in the cards.

The closest I can come up with is finding some way to leverage privileged access to confidential data and using it for gain somehow, either selling for industrial espionage or perhaps to enable insider trading. I suppose it's possible for a data scientist to be part of a larger conspiracy of fraud – having broad data access and control over reporting can be very useful in enabling and hiding criminal activity. I'm not exactly a criminal mastermind here, but that's about all I can come up with.

So this week, maybe on a break or something, ponder over whether there's actually any opportunity for some data scientist somewhere to get rich doing some data crimes. I'd love to hear ideas if anyone's got fun ones.


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or you want help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.

"Data People Writing Stuff" webring: Welcomes anyone with a personal site/blog/newsletter/book/etc that is relevant to the data community.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

  • randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted. Support from subscribers is what makes everything possible. If you love the content, consider supporting the newsletter in any of the following ways:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Send a one time tip (feel free to change the amount)
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!