[Image: word cloud featuring "excel is evil", "version control", and "don't use fivetran" in the center]

One of the Mentimeter word clouds from the event about things attendees learned

Summary: Data Mishaps Night 2024

Newsletter Mar 12, 2024

Data Mishaps Night is one of my favorite data conferences and you've heard me rave about it multiple times over the years. For 2024, I planned ahead (gasp) and did a reporter-y thing by asking for permission to write a summary – and I got it! So while the event is not recorded to provide a safe space for people to talk about their mistakes, there'll be a record!

Now, some caveats. With the sole exception of the keynote speaker, all talks are lightly anonymized. The talks are summarized from notes I was taking during the whole event so please excuse any misunderstandings/errors on my part. Two hours of lightning talks is a lot of notes and concentration. Finally, to help preserve the anonymous nature of the event, I've randomly shuffled the order of talks.

Anyways, here we go, every talk from that night.


Counting Stuff exists 100% thanks to reader subscriptions. If you enjoy the weekly posts about data, process, tech, and hobbies, show your support with a subscription.


In the keynote, Hadley Wickham spoke about how he almost broke the Tidyverse by creating a feature that solved a problem, but was confusing to many in the community, to the point where some said the system was "hellishly" confusing to use.

The feature addresses a technical wrinkle within R: two different concepts are both commonly called "variables". Columns within a dataframe and objects in the executing R environment are different sorts of variables and work differently under the hood. The Tidyverse often lets you mix operations on the two in a generally nice way thanks to the local context available. But R will throw an error if you try to build abstract functions using those exact same patterns.

The mishap came through the creation of the tidyeval framework, with an implementation that involved declaring things with a very unintuitive !!enquo(var) syntax for operations like group-by. Using the new syntax told R that var referred to a column inside the dataframe, instead of trying to operate on var itself. Understanding the syntax required understanding how R handles code at an advanced level most users never reach. But since the code was interesting from a programming language design perspective, he kept trying to convince people to accept it, eventually giving 13 public talks trying to explain it to the community.

In the end, all that effort failed to gain traction, and the team came around to building a new syntax that was much easier for people to understand, using {{ }} braces. Along the way, they learned to get feedback from smaller communities first, implemented a "tidyups" process modeled after Python's PEP system, and learned to listen to users when they struggle with the documentation and use cases.
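For readers who haven't seen the two generations of syntax, here's a minimal sketch of my own (not from the keynote) of wrapping a dplyr group-by inside a function; the function and column names are made up for illustration.

```r
library(dplyr)

# Old tidyeval style: capture the argument with enquo(), then
# unquote it with !! so dplyr looks the name up inside the dataframe.
count_by_old <- function(df, group_var) {
  group_var <- enquo(group_var)
  df %>%
    group_by(!!group_var) %>%
    summarise(n = n())
}

# Newer {{ }} ("curly-curly") style: one step, no extra concepts needed.
count_by_new <- function(df, group_var) {
  df %>%
    group_by({{ group_var }}) %>%
    summarise(n = n())
}

count_by_new(mtcars, cyl)   # e.g. count rows of mtcars by cylinder count
```

The second version needs nothing beyond normal dplyr usage to explain, which is a big part of why it was so much easier for people to understand.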

Also, you have to be willing to kill your darlings at times – the sunk cost fallacy is real.


One person presented a museum of their data visualization failures. They showed how data viz can often help with either exploring data or explaining insights, but can rarely do both at the same time.

They once made a visualization of how a government benefit program affected individual families as a timeline: a family would sign up, get aid, events would happen, they'd graduate out. This resonated very well with people like case workers, so the speaker decided to do a mega timeline with ALL the people in the program. The resulting diagram was a bunch of stacked bars across time, with an overlapping color scheme to indicate when someone was in two programs (blue + red = purple!). That chart was put into presentations and they would get tons of questions about it, so the speaker thought that was great positive engagement... until months later they realized it was all clarifying questions. The chart was eventually retired and turned into a data science interview question: "can you explain what this chart is trying to do?"

Another time, they wanted to visualize the reasons why cases in a program were closed and used a network diagram to show them. They spent a lot of time on it, but luckily realized immediately that it was terrible and could've been replaced with the sentence "the top three reasons are..."

Quote of the talk: "I'm from the Midwest, and when someone says 'Maybe I'm dumb, but could you walk me through this again?' you know you've burnt a bridge that will take many bar charts to repair."


One person was working with geographic data in R. In a classic case of "How hard could it be?", they had set out to use a GeoJSON file from a data source, put it into the sf library that they were very familiar with, and be done in a short while...

Then they kept hitting errors, with the file seemingly not doing anything. After much struggling, they eventually learned that the GeoJSON file was malformed in numerous ways. For example, they needed to provide a reference to the coordinate system being used, the shape geometry was specified as "ring" when it was supposed to be "Polygon", and other fields were missing and had to be corrected.

Finally, after all those fixes were implemented, the file worked. Some libraries are like magic because they hide so much complexity away from users that you don't even have to learn about all that they do, but when things break, it's a very, very steep learning curve to fix them.
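For a sense of what that kind of debugging looks like, here's a minimal sketch in R with sf (my own illustration, not the speaker's code); the file name and EPSG code are assumptions, and the geometry-type fix itself has to happen in the raw JSON.

```r
library(sf)

# Hypothetical file name; st_read() errors or warns if the GeoJSON is malformed.
shapes <- st_read("regions.geojson")

# The file didn't declare which coordinate reference system it used,
# so tell sf explicitly (EPSG code assumed here).
st_crs(shapes)                      # NA when the CRS is missing
shapes <- st_set_crs(shapes, 4326)

# After correcting the raw JSON (e.g. "ring" -> "Polygon"), repair any
# geometries that still don't satisfy the spec.
shapes <- st_make_valid(shapes)
```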


One person was working with a non-profit in India. In India, there are tons of tiny clothes-ironing businesses all over, and those businesses use very old-style coal box irons to do the ironing: they are literally metal boxes with burning coals inside. The non-profit developed an improved electric (I think) iron to distribute to such businesses, which improves the owners' health and raises their income thanks to lower maintenance and better performance. The non-profit wanted to collect data to prove their impact in getting the market to adopt the product. While doing the study, they realized that their numbers showed they had somehow impacted more people than had actually been impacted; something was wrong in the data.

To do the study, they surveyed a sample of businesses in an area to see what they were using. And since surveys have error bars, they also went to the distributors of the irons and asked them how many businesses had gotten the new product. This gave them two numbers to check whether the survey data was good. Except... over the course of the months needed to collect the data, something got confused and they ADDED the two numbers together instead of comparing them, effectively double-counting every business affected. They only found out by talking to the actual manufacturers of the irons and getting a final reference dataset.


One person was ingesting data from Stripe; their setup had both Stitch and Fivetran pipelines pulling data for various reasons. The data was used for important monthly revenue reporting, among other things. One day, the CEO noticed that some bills that had been paid weren't showing up in revenue reports. That sparked a big search through the very complex financial business logic. Eventually they found bugs in BOTH vendors' Stripe implementations because a change had happened upstream. Neither vendor considered the issue urgent enough to fix during the December holiday season. While they could manually patch things for now, they needed a permanent solution, so the team went in to update the pipeline systems and the dbt schema. They even remembered to test their changes in a testing environment first. But then on the production launch... the VP of sales chimes in that they're still seeing weird things... so they work on a second fix... and now somehow the revenue numbers are doubled.

Eventually, after more fixes, they found another bug in how they were recording the data and that finally got rid of the duplications. Touching complex revenue code is hard, and having a VP asking if things would be fixed within the day is extra hard.


One person spoke about how they, as a PhD student, got asked to do a data project pulling data from a UN source for analysis. The trouble was that the data had been collected over many years, and the industry codes and such had changed during the course of collection. Luckily, the UN had also provided concordance data to help researchers navigate the changes, but it still took a ton of work (obviously way more than estimated) to do the cleaning. They meticulously followed best practices by documenting everything, writing unit tests, and checking in their code, so they finally sent their work in. And then they checked their results against a known-good dataset and discovered that about 30k rows were missing from the data, and had to go debug that. If not for the known-good reference, they wouldn't have known that they had made an error somewhere.
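To make the concordance cleaning and the reference check concrete, here's a rough sketch (my own illustration, not the speaker's code); the table and column names are hypothetical.

```r
library(dplyr)

# Hypothetical tables: `trade` uses the old industry codes,
# `concordance` maps each old code to its current equivalent.
cleaned <- trade %>%
  left_join(concordance, by = c("industry_code" = "old_code")) %>%
  mutate(industry_code = coalesce(new_code, industry_code)) %>%
  select(-new_code)

# Checks against a known-good reference dataset: row counts should match,
# and no reference rows should be missing from the cleaned data.
nrow(cleaned) == nrow(reference)
anti_join(reference, cleaned, by = c("country", "year", "industry_code"))
```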

In the end, after some apologetic emails, the client was happy and used their work to complete their research, while the speaker kept thinking about what would happen in this process if they didn't have known-good data to compare against. How could anyone be confident in their work? And part of exploring those questions has become the starting chapters of their PhD thesis.


One person was an ML Engineer; their team had a data warehouse in Redshift... and to put it nicely, it was a complex, interconnected, possibly sometimes circular, pipelining system. No one fully understood all the details. It was such a mess that the speaker was always trying to simplify things over time.

So one day, they found a query on the internet that helped identify tables in Redshift that didn't use much disk space and were thus probably empty. They dropped two empty tables and called it a day. They happened to be on-call that day, so they soon got notified of an issue... then more and more issues popped up until everything seemed on fire. It turned out that 6 years of data in AWS Glue had been deleted... data that was foundational to the business.

What had happened was that Redshift lets you define tables that are backed by Glue, and they're effectively just pointer references – aka, they don't use any disk space. Even more "helpfully", if you drop such a table in Redshift, the underlying Glue table also gets deleted. So maybe they had just deleted the whole business by mistake.

So this person had to reach out to teams that owned the affected systems and figure stuff out. Luckily, they learned that dropping the table only deleted a bunch of Glue's metadata, so the raw base data still existed. They just had to recreate the metadata... which they had to do by replaying the raw data back into the system over the course of two days.

So yeah, beware of dropping tables.


One person told a story about how they spent almost $800k on Snowflake.

At a small company, they moved to Snowflake back when it was a tiny company. Since they were migrating to a new database, they needed to build a new event data collection system to move the data into Snowflake.

So they built a 24hr batch-update Airflow job. But no one wanted to wait 24 hours for data. So they did some work to cut it down to 12hrs, then 6hrs, finally 4hrs... where they hit a limit on batch loading due to file size limitations. But obviously things could still be better – so why not just build a streaming ETL system using 24 parallel workers to stream data into the database within a few minutes, if not seconds? The system worked – amazing.

Where's the mistake? Well, the tech worked, so how much did it cost? All the S3 operations and storage were decently cheap. But Snowflake? Well... the original system was configured with a giant batch warehouse to process 24 hours' worth of files at a time. The new system used 24 parallel workers to move the data quickly. The configuration file was never changed, so there were 24 giant-sized machines running in parallel almost constantly. The bill would've come to $800k a year.

Luckily, the stream processors handled tiny amounts of data, so once the mistake was caught via the in-house spend tracking system (which had raised the warning with the executives), everything was sized appropriately and the Snowflake part of the system cost maybe $1200/yr. We're all so used to not thinking about costs while doing analysis that it can come as a surprise when you start building infrastructure that other people use and that scales.

But... all this wasn't the true mishap. The true mishap is that the speaker inadvertently became the single point of failure for a 24/7/365 streaming ETL system. Never, ever, do that.


One person took on a new role at a startup as the solo data person. They were using Fivetran to move data into Snowflake to do an analysis.

A bit later, the CTO said they needed to monitor the Fivetran spend better because it was getting very expensive. Whut? Apparently the new table had been synced on a 14-day free trial. There was a budget estimation tool, and the active row volume looked okay at the time, so they launched it. But spend suddenly spiked after the trial. There was an alert email that went out, but nothing killed the job that was spending the money, so it ran up a bill of 25x the ETL budget within a 24hr period.

The root cause was that the table being synced was big and it was regularly being updated, so that kept the active row count really high despite what the estimation tool was saying. And the billing account didn't have a cap on it. Plus Fivetran itself is expensive and their estimation tool just didn't do a good job estimating how bad things would be.

So the lesson was to better understand the underlying table behavior, don't give ETL tools blank checks to spend, and... ask for at least a partial refund (because the estimator was so wrong).


One person told the story of a project from their PhD student days.

They had a bunch of data collected about how highway paving was affecting vegetation growth in the Amazon. Data was sourced from fellow researchers, other students, online, tons of places. At the time, the person wasn't very proficient in R or Python yet, so they... decided it'd be faster to do it all in Excel. Yeah.

A quirk in the analysis meant that there could be no missing values, so they manually merged the data together and then fixed the missing values in Excel by hand using various methods. They didn't take clear notes on what they did, though unclear ones were taken. Moreover, they saved over their raw data files. Even now, they aren't sure what they were thinking back then.

When the time came to formally write up the research, they obviously couldn't remember all the details of their changes. By luck, they were able to recover some of the raw data from email archives; other pieces were on USB drives. But the biggest issue was that the core dataset everything was based on was lost, the PhD student who had generated it had left, and the person didn't know how to extract the needed data from the original sources.

Luckily, their advisor later independently decided they should be using a different vegetation dataset, thus saving the whole project. This time around, everything was documented, and the raw data files were kept safe and sacred.

The speaker now tries to teach their students this lesson. "Don't view your raw data as the enemy – to be vanquished".


One speaker told a story about how they confused business expectations with their professional passions.

They were (and continue to be) super excited to work on modeling and analysis problems as a data scientist. They were working at a startup on a project they were passionate about, but were asked to drop it and work on a project that had zero data analysis work or any other data-science-related stuff that got them excited. They thought it was ridiculous and pushed back.

But before long, they were pulled aside and told to either work on the new project or leave the company. They worked on the project.

The episode taught them to evaluate whether their work is aligned with the goals of the company, as well as with their own professional goals.


One person spoke about how they did a bunch of modeling work, and the model was wrong, and it didn't matter.

They were a new hire who inherited an ML model and was super excited about working on it with clients, testing parameters, and putting it into production. They weren't very excited about the logic doing the table joins. Since they didn't know the data at the time, they sorta didn't worry about it; they had so much else to do. Either way, they did all this work and it was ready to go... but no one was using it. A year later, the model still wasn't being used, but clients were pointing out some weirdness they had observed. The speaker realized that the SQL was wrong, but surprisingly the model was super robust and was doing fine regardless of the bad SQL, so it didn't practically matter.

Later, looking into the data closely, they realized the joins weren't even possible for some of the data. So the model wasn't working from a perfectly clean information state. But the model continued to work fine regardless; if you called it, it would happily do its job.

But none of that even matters because no one was actually using the model anyways.

So yeah, make sure to work on things that get used.


One person was leading an analytics team at a hospital. They were running a large data aggregation pipeline that collected data from over 2,500 hospitals, processed it, and populated dashboards. They took the data and produced benchmark reports every month that hospitals would then use to monitor for and identify potential issues.

Eventually they started observing errors in the data. The errors were random, not systematic, and very weird. But there were so many data sources, hospitals, and quirks that they didn't trust their data. The problem seemed to come up in a metric for a specific condition. They weren't sure why it was being weird, so they started trying to isolate the problem. Except there are a lot of steps of data handling along the chain. So, being experts at working with healthcare data, they launched a huge project to review everything from the bottom up. They talked to the people who were collecting the data, looked at all the data quality management practices, and did analyses to look for errors, systematic causes, regional causes, etc. They couldn't find anything.

They had talked to almost everyone... except the vendor that aggregated the data being entered into spreadsheets. The vendor had made a UI change to the system that various hospital staff used to upload their data. Some people figured out how to do it right, some didn't. Some people within a hospital did it right, some didn't. The crazy errors they were observing were caused by that minor UI change.

So the lesson is that there are assumptions baked into even where you look for errors, and sometimes you really have to talk to everyone.


And that's it for Data Mishaps Night 2024!


Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you're interested in writing a data-related post, whether to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don't need any special credentials or credibility to do so.


About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

Supporting the newsletter

All Tuesday posts to Counting Stuff are always free. The newsletter is self-hosted, so support from subscribers is what makes everything possible. If you love the content, consider any of the following ways to support the newsletter:

  • Consider a paid subscription – the self-hosted server/email infra is 100% funded via subscriptions
  • Share posts you like with other people!
  • Join the Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
  • Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
