The remains of a pier, or platform thing on the west side of Manhattan. It existed for a reason that probably few remember.

Things are simple until they are complex

technical-culture Jun 2, 2026

Story time! Over a decade ago, I joined Bitly as a data analyst. It was a small company back then (maybe 50-100ish people), so I got to work closely with the engineers and listen in when they explained how the tech worked.

At an extremely high level, Bitly's core tech offering was just redirects using short links, something that mattered a lot when Twitter was becoming an increasingly important platform and still only allowed a max of 140 characters. A link shortener system is very simple in concept.

  1. Take the user's URL and assign it to a shortened, ideally randomized value (to prevent guessing attacks and other stuff)
  2. Make a table/mapping of URL<->short_url
  3. Whenever someone requests the short_url from your web server, look up where it should map to, and use a 302 redirect to redirect the browser to the correct long URL
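
To make the "homework project" concrete, here is a minimal sketch of the three steps above using only Python's standard library. The names (`shorten`, `resolve`, `serve`), the 7-character code length, and the in-memory dict standing in for a database are all my own assumptions for illustration, not how Bitly actually built it.

```python
import secrets
import string
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Optional

ALPHABET = string.ascii_letters + string.digits  # 62 URL-safe characters
CODE_LENGTH = 7  # hypothetical length, picked only for this sketch

# Step 2: the short code -> long URL mapping (a dict standing in for a real table)
links: Dict[str, str] = {}


def shorten(long_url: str) -> str:
    """Step 1: assign the URL a random short code that is not already taken."""
    while True:
        code = "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
        if code not in links:
            links[code] = long_url
            return code


def resolve(code: str) -> Optional[str]:
    """Look up the long URL for a short code, or None if it is unknown."""
    return links.get(code)


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Step 3: map the requested path back to the long URL and 302 to it
        target = resolve(self.path.lstrip("/"))
        if target is None:
            self.send_error(404, "Unknown short link")
            return
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the redirect server until interrupted."""
    HTTPServer(("localhost", port), RedirectHandler).serve_forever()
```

Calling `shorten("https://example.com/some/long/path")` and then `serve()` gives you a working toy shortener on localhost. Everything that follows in this post is about why the real thing looks nothing like this.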

Just about any student in programming, if allowed to get some help from libraries that handle the messy HTTP/web aspects, can code up one of these things as a homework problem. You can literally have any AI tool write up a basic system to do the basic functionality in a night or two of work. Now, let's assume someone manages to build a business out of this (which they did and as far as I know continue to do), why would they need a bunch of really smart engineers to build out this "homework project"?

The reason is that very simple ideas become extremely complex when you try to do them at scale.

Take, for example, the URL shortening process. Most programmers would immediately think of using some form of a hashmap/dictionary type data structure. You take an input URL and run it through your hash function, which gives you a bunch of bytes; take the first X bits of those and you have your shortened URL. You get the added benefit that attempts to map identical URLs can point to the same shortened link (if you want to let it). Lookups are speedy O(1) operations and even with a relational database, simple indexes make it trivial. For millions upon millions of links, this overall method works without a hitch.
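
As a rough sketch of that hash-and-truncate idea: the snippet below hashes a URL with SHA-256, keeps the leading 36 bits, and writes the result in base 62 so it fits in a handful of URL-safe characters. The choice of SHA-256, the 36-bit width, and the function names are assumptions I picked for the example.

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def truncated_hash(url: str, bits: int = 36) -> int:
    """Hash the URL and keep only the leading `bits` bits of the digest."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def to_base62(n: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def short_code(url: str, bits: int = 36) -> str:
    """Identical URLs always produce the identical short code."""
    return to_base62(truncated_hash(url, bits))
```

Because the code is a pure function of the URL, deduplication comes for free: shortening the same URL twice lands on the same code. The flip side is that two different URLs can also land on the same code, which is where the trouble starts.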

But then things get "webscale". How many links get created and clicked on the internet every second? What if something goes super viral within a short time period, like a big sporting event or New Year's celebrations?

At some point, if you keep creating link after link, you're going to find two URLs that collide on the same truncated hash result... and so you'd have to put in a rule to handle those situations, like finding a second 'slot' for the URL. If you keep adding links at internet speed, you might start running into the very ugly possibility that your hash space becomes "full", where every unique value the truncated hash function can spit out has already been mapped to a URL. Now you somehow have to expand the number of bits available to do this mapping without breaking everything else. Or do you allow old things to break?
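
Here is a small sketch of both problems. `collision_probability` uses the standard birthday-problem approximation to estimate the chance that at least two links share a code, and `assign` shows one possible "second slot" rule: re-hash the URL with an attempt counter until a free slot turns up, and give up when none does. The retry scheme, the names, and the `max_attempts` limit are hypothetical choices of mine, not a description of what Bitly did.

```python
import hashlib
import math


def collision_probability(n_links: int, bits: int) -> float:
    """Birthday approximation: chance that any two of n_links share a code."""
    space = 2 ** bits
    return 1.0 - math.exp(-n_links * (n_links - 1) / (2 * space))


def slot(url: str, attempt: int, bits: int) -> int:
    """Leading `bits` bits of SHA-256 over the attempt counter plus the URL."""
    digest = hashlib.sha256(f"{attempt}:{url}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


class CodeSpaceExhausted(Exception):
    """Raised when no free slot is found; a hint that more bits are needed."""


def assign(table: dict, url: str, bits: int, max_attempts: int = 8) -> int:
    """Store url in the first free slot among its candidate slots."""
    for attempt in range(max_attempts):
        code = slot(url, attempt, bits)
        existing = table.get(code)
        if existing is None:
            table[code] = url  # free slot: claim it
            return code
        if existing == url:
            return code  # same URL seen before: reuse its code
        # a different URL already owns this slot: try the next candidate
    raise CodeSpaceExhausted(f"no free slot for {url!r} at {bits} bits")
```

With a deliberately tiny space, say `bits=4`, there are only 16 slots, so `assign` is guaranteed to raise `CodeSpaceExhausted` somewhere within the first 17 distinct URLs. Real systems hit the same wall, just with much bigger numbers and with years of old links depending on the current width.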

Meanwhile, as they're trying to get the math and hash functions right enough to make a product that doesn't fall over, the engineers also have to contend with performance. No one would use a link redirect service if it were dodgy, unreliable, or slow. There's very real pressure to do all the serving as quickly as possible, to the point no one notices. At the time, Eng was running a very large cluster of tuned nginx web servers to handle all the traffic they were getting as quickly as possible. The architecture behind that whole thing is its own collection of optimizations layered upon optimizations as the team had to overcome the various problems it faced along the way.

Then you layer on top of this any kind of data collection/analytics infrastructure that you may want. How is abuse and spam handled? Product wants some special enterprise feature now. All these require more complexity, more branches of code, more layers of abstraction. By the time a new data analyst joins the team many years later, the original "simple problem" is barely visible anymore, buried beneath dozens of layers of situation-specific engineering.

Something that gets missed very often about the complexity of man-made systems, be it in software or physical life, is that systems accumulate and shed complexity based on the context they're in. SWEs eventually learn that a system suited for 100 users falls apart long before 100 million users; system architecture problems of this shape appear in more senior interviews. But because data can be collected from each evolution of the system with very little change in schema, all that complexity can be hidden from us until, at the very last moment, an architectural change ruins an analysis.

This is one of those things that I have trouble explaining to newer data folks. We typically have to work with very complicated man-made systems in order to measure what is going on, and no one put in crazy complexity for the sake of having more stuff to maintain. Production software is forever in a state of falling apart in novel ways. Sometimes it's important to zoom in and understand what all the complexity is about, and other times it's sufficient to zoom out and gloss over all the complexity. But regardless of what strategy you feel is appropriate for your project, it's necessary to have an understanding, however rough, of the whole context around the complexity. Sometimes the complexity exists to solve a problem that is Very Important to your work, and other times the complexity was introduced to solve a problem that isn't relevant today. Having at least a cursory understanding of the history surrounding a system is often very important in gaining this understanding.

Another thing to remember is that tech and software systems are all very "young" in the grand scheme of things. At most, software complexity only dates back as far as the invention of computers and programs some 70-odd years ago. There are plenty of human systems that have existed for centuries or millennia before that – for example most economic and governmental systems. Having an appreciation for the sheer potential for complexity helps us stay humble when we're asked to work with such systems. They don't create literal courses about the US Census data to show how the many aspects of that data work for the mere fun of it. It's ridiculously easy to draw incorrect conclusions about the data when you gloss over the literal centuries of statistical history built up in that system over time.

One final thing to note is that complexity doesn't necessarily have to continuously increase. It's not like entropy and the second law of thermodynamics. Analytics platforms used to rely heavily on NoSQL datastores to do Big Data, with all the inherent complexity and mess, but that all got tucked away under the neat SQL-compatible hood of modern scaled analytics database systems. Sometimes when problems get big and important enough, people become motivated to find a good solution to get rid of them.



About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.
