A bunch of mushrooms growing on a rotting log. Sadly, no tasty mushrooms grow on rotting code.

ML vs code rot

Dec 17, 2024

Next week will be the last Thursday post of 2024, so I figured it might be fun to answer any questions readers send in. If you happen to have any random questions you'd like me to answer, send them to me via email or on BlueSky.

As a UX researcher and former data analyst, I've never had to work with machine learning beyond studying its high-level concepts. I personally find it more interesting to understand what users are doing than to build models and infrastructure, so it has never been a problem for me to watch from the distant sidelines.

Sadly, all that had to change over the past couple of weeks because, as part of some infrastructure UX research work, I needed to stand up some realistic AI training workloads. The specific details of what I'm training don't actually matter since I'm more interested in the user and developer experience of setting those workflows up than in the actual model output, so I figured it'd be easy to use some kind of public benchmark model. Just about any recent benchmark would work since I just want to read in a bunch of data and use a bunch of CPUs and GPUs. Plus, a benchmark is probably set up to be portable enough for other people to run. Perfect plan. I had even taken a cue from someone at work who had written up notes on doing a similar thing about a year ago, using some training benchmarks from mlcommons. They didn't leave examples of the changes they made to their code, but I was sure I could figure it out.

And wouldn't you know it, I failed. Hard.

First I tried the "retired" unet3d benchmark, since that was the one that had been run successfully before. And for some reason I can't fully understand, I can get it to run on one GPU, but the moment I increase the world size and put it into distributed mode, it locks up with the workers trying to connect to the main process. No amount of debugging around networking, firewalls, and permissions seemed to unblock it. There might be something broken in how it spins up job threads, though I can't for the life of me understand why that would be the case, nor am I interested in doing surgery on the code to force it to work.
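For context, here's a minimal sketch of what that distributed setup typically looks like in PyTorch-based benchmarks (not the unet3d code itself, just the general pattern). A hang like mine usually means every worker is sitting inside init_process_group, waiting on a rendezvous with the main process that never completes:

```python
# Minimal sketch of a typical PyTorch distributed init; this is the
# general pattern, not the benchmark's actual launch code.
import os

import torch.distributed as dist


def init_distributed() -> tuple[int, int]:
    # These env vars are normally set by the launcher (torchrun, mpirun,
    # or a custom script). If MASTER_ADDR/MASTER_PORT point at the wrong
    # host or a blocked port, every worker blocks right here.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(
        backend="nccl",        # GPU collective backend
        init_method="env://",  # rendezvous via MASTER_ADDR / MASTER_PORT
        rank=rank,
        world_size=world_size,
    )
    return rank, world_size
```

The annoying part is that when the rendezvous can't complete, you usually don't get an error right away, just silence until a very long timeout finally fires.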

So after about a week and a half of banging on this in between meetings and other work, I gave up. Maybe the "not retired" stable_diffusion benchmark would run cleanly. After following all the setup instructions, I found that the code flat out errors due to a trainer failing to initialize – because somehow an HTTP library isn't behaving as expected? Whut? There's even an open issue on GitHub about it breaking.

Finally, when I tried the "single_stage_detector" benchmark, I just started laughing when the provided script to download the dataset immediately threw an error because the argument denoting the dataset name/URL to download had changed in the function signature. Fixing that minor error helped, but it died again a few steps further down.
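Purely to illustrate the failure mode (these names are made up, not the actual download script), it was the classic case of a helper's signature drifting out from under its caller:

```python
# Hypothetical illustration of the breakage, not the benchmark's real code:
# the helper's signature changed, but the call site feeding it never did.

def download_dataset(dataset_url: str, output_dir: str) -> None:
    """Newer signature: expects a full URL."""
    ...


# An older call site still passes the old keyword, so this raises:
# TypeError: download_dataset() got an unexpected keyword argument 'dataset_name'
download_dataset(dataset_name="openimages", output_dir="/data")
```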

Didn't we literally invent the whole involved dance with Python venvs and Docker containers to AVOID such errors? There's a curated requirements.txt file and everything in these benchmarks. In theory they should "just work" if you follow the instructions in the README files – but they clearly don't, whether because of some quirk in my virtual machine environment or because some dependency upstream of the pinned ones had somehow shifted. It's hard to know what's broken when the whole point is to modify as little code as possible. Whatever the root cause, the pace of bitrot on these benchmarks is pretty astounding.
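A curated requirements.txt only locks down what's listed in it, though; anything transitive, or anything living below Python entirely (drivers, CUDA libraries), can still drift. One small way to at least see the Python side of the gap is to dump the fully resolved environment on each machine and compare; a sketch:

```python
# Sketch: print every installed package and version, transitive deps
# included, so two "identical" setups can actually be compared.
from importlib.metadata import distributions

resolved = sorted(f"{d.metadata['Name']}=={d.version}" for d in distributions())
for line in resolved:
    print(line)
```

Diffing that output across two supposedly identical environments is often how you find the one transitive package that quietly jumped a major version.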

Now, I'm moderately sure that if I were sufficiently motivated, or pressured to ship something, I could figure out what was critically different between my machine setups and whatever the benchmarks assume. I could also dig into these reference implementations and rewrite the job launch bits to conform to a known working test example, slowly stripping away whatever bit of code broke and caused the issue. But that's a nontrivial amount of labor for something that should've been turnkey easy.

The fact that I slammed into this situation with three benchmarks in a row probably indicates just how common these difficulties are. I don't think I'm particularly unlucky, and the authors of the README files in the repository are very sincere in their efforts to provide documentation and scripts to download and prepare the datasets needed for running the benchmarks. By all rights, this should have worked. So the problem must lie somewhere within the many layers upon layers of software packages, drivers, frameworks, dataset provider APIs, and whatever else is going on. It's enough of a mess that even honest efforts to lock down those variables with Docker containers and virtualenvs can't fully keep a grip on all of it.

In other forms of software, one way to solve dependency hell is to statically link everything and package it all into a single binary to distribute. There's obviously a cost to doing this, but you'd at least get some clarity around the problem. But right now I haven't seen an equivalent that redistributes every single line of required code to run a model from a bare machine install, probably because there are a bunch of hurdles to overcome, including redistribution licensing issues and the simple question of "who's being paid to host these giant datasets and code?". On one level I can understand this: these training datasets come in at multiple gigabytes or terabytes, which costs serious money to host. Nvidia drivers are closed-source binaries, and not everyone uses those GPUs. Open source packages often follow a handful of compatible licenses, but you technically can't guarantee redistribution is allowed unless you check everything, and no one has the energy and time for that. And so we get theoretical approximations of redistributable packages that sorta work until they get stale.
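The closest approximation I can picture on the Python side is vendoring every wheel next to the code so the install step never touches the network. It still wouldn't cover the datasets, the drivers, or the CUDA libraries, but as a sketch of the idea:

```python
# Sketch of the "static link everything" analogue for the Python layer:
# download every pinned package (and its transitive dependencies) into a
# local vendor/ directory that gets shipped alongside the code.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "pip", "download",
     "-r", "requirements.txt", "--dest", "vendor/"],
    check=True,
)

# On the target machine, install strictly from the vendored wheels:
#   pip install --no-index --find-links vendor/ -r requirements.txt
```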

Incantations over standards

While this post contains a lot of my complaints about this poor developer experience, I think the overall pattern shows just how far we are from having a clean way to do these arguably not-that-complex things.

If you break down the actions I'm having so much trouble with here, it's simply (sketched in code right after this list):

  1. Download a dataset from a remote location, possibly pre-processing it with some scripts
  2. Define a bunch of configuration options relevant to ML training, ideally provided by the owners of the benchmark
  3. Run the ML training code, ideally a self-contained black box that only has abstractions for talking to my local GPUs
  4. Spit the resulting model out to disk
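Sketched as (entirely hypothetical) Python, the whole thing has roughly this shape:

```python
# A deliberately naive sketch of the four steps above. Every name here
# (Config, download_dataset, train, ...) is made up; the point is the
# shape of the batch job, not any particular framework's API.
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Config:
    dataset_url: str
    data_dir: str
    output_dir: str


def download_dataset(url: str, dest: str) -> str:
    """Step 1: fetch the data and run any preprocessing scripts."""
    ...
    return dest


def train(data_dir: str, config: Config) -> bytes:
    """Step 3: the self-contained training loop that talks to the local GPUs."""
    ...
    return b"model-weights"


def run_benchmark(config: Config) -> None:
    data_dir = download_dataset(config.dataset_url, config.data_dir)
    weights = train(data_dir, config)  # the config itself is step 2, supplied by the benchmark
    out_path = Path(config.output_dir) / "model.bin"
    out_path.write_bytes(weights)  # step 4: spit the result out to disk
```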

It's a batch job. Humans have been running batch jobs since we invented the term back in the postwar period when people shoved stacks of punch cards into giant mainframes. The problem isn't that we don't know how to do this, it's that there are nearly infinite ways to do them.

A lot of the benchmarks I'm failing to use have each adopted their own strategy for doing this, whether it's invocation arguments to the model, a bunch of shell scripts, a custom launcher built in Python, or some messy mix of all of the above. Some do the work inside a Docker container, some do it outside the container, some don't bother with Docker at all. They resemble magical incantations more than the organized, logical inevitability of "software". I'm pretty sure that once someone figures out an incantation that works for their situation, they reuse as much of it as possible in their subsequent projects instead of going through the pain repeatedly. I don't blame them because I would totally do the same thing. But it does mean everyone just does things their own way, and it persists for ages because there's no real force pushing for convergence.

I'm not going to call for someone to create a standard because calling for standards rarely works out. But what I will hope for is some kind of market pressure for us to standardize on better practices. The benchmark suite I'm currently failing to use at the very least puts forth some minimal requirements for the reference implementations:

Each reference implementation provides the following:
* Code that implements the model in at least one framework.
* A Dockerfile which can be used to run the benchmark in a container.
* A script which downloads the appropriate dataset.
* A script which runs and times training the model.
* Documentation on the dataset, model, and machine setup.

Maybe something like better requirements for publication, or better industry tooling and expectations, will push things towards a more consistent experience for people on the developer side. It's going to take an incentive structure to fix this in the long term.

But until then, it's a giant load of BS.

