Networking basics for data work

Jan 23, 2024

For my day job, I work on the UX team of a major cloud provider. Obviously, one of the recurring questions we pose to our users is what makes our stuff hard to use. And while there are all sorts of complex, difficult concepts involved in using The Cloud as a developer, there is one thing that I hear about on a very consistent basis – networking. It is a unique engineering discipline that can be confusing even to seasoned developers – just look at some textbooks on the topic.

For data science work, we can usually go decently far without worrying about networking. We use the infrastructure provided by other teams that have done the hard work of making sure everything can talk to everything else. But there's a precarious tipping point where you start having to create things like servers and services, and suddenly you need to learn a lot in short order to get anything working.

I'm writing this whole thing with a fair amount of trepidation because I don't think I'm anywhere near an expert in the topic myself. But here are the concepts and simplifying white lies that I lean on when doing lower-level infrastructure work.

You're working with very old distributed protocols

A lot of networking protocols have a very long history. They were created long ago, put into use, and then accumulated largely backwards compatible updates to handle issues as things arose.

These protocols are also meant to work in a distributed fashion where there normally isn't a central authority. In theory, with enough cables and hardware, you could build a completely separate internet-like network. Some people build "home labs" with used enterprise network gear to get practice working with networks. Universities built Internet2 when they wanted better bandwidth to each other for research use. Large companies may even deploy their own private networks, built on the same protocols and concepts, that never touch the open internet.

This highlights that convention can be just as important as technical details. The tech can be set up to operate in all sorts of ways, but maybe your network administrator has reserved certain IP addresses for specific functions and set stuff up with those assumptions. Very often settings like firewalls and routing rules are spread across a bunch of places, and you wouldn't know unless you were aware of those conventions or did the legwork to figure out the details.

So here's the bare minimum that I use to make sense of the world.

Most things work in bits

While the vast majority of computing works in bytes and other much larger units, networking is one of the few places where bits are commonplace. Bandwidth is typically quoted in bits per second. Network addresses are typically displayed in decimal (IPv4) or hexadecimal (IPv6) for human convenience, but the systems actually work on a bit level. Things like subnet masks and CIDR notation just seemed like numeric incantations to me that made no sense – and they make absolutely no sense in decimal because they primarily carry meaning in binary.

Luckily, whenever I need to work with these concepts, there are lots of tables, calculators and tools available to help me make sense of things.
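As a rough illustration of why these numbers only click in binary, here is a small Python sketch using the standard library's ipaddress module. The helper names (to_bits, megabits_to_megabytes_per_second) and the sample addresses are made up for this example and are not from any particular tool.

```python
import ipaddress


def to_bits(address: str) -> str:
    """Render an IPv4 address as four dot-separated groups of 8 bits."""
    packed = ipaddress.IPv4Address(address).packed  # the 4 raw bytes
    return ".".join(f"{byte:08b}" for byte in packed)


def megabits_to_megabytes_per_second(mbps: float) -> float:
    """Bandwidth is quoted in bits; file sizes are usually in bytes (8 bits)."""
    return mbps / 8


print(to_bits("192.168.1.10"))   # 11000000.10101000.00000001.00001010
print(to_bits("255.255.255.0"))  # 11111111.11111111.11111111.00000000
print(megabits_to_megabytes_per_second(100))  # 12.5
```

Seen this way, a mask like 255.255.255.0 is just "24 ones followed by 8 zeros", which is all that the /24 in CIDR notation is saying.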

Everything has a bunch of addresses

I'm sure you've heard of IP addresses. You've probably heard of IPv4 and IPv6. You've also probably occasionally heard about MAC addresses. Under the TCP/IP framework that the internet is largely built upon, addresses are the identifiers that tell network gear exactly which wire or radio to transmit on. There are lots of arcane methods used to disseminate this information, which we will largely ignore.

MAC addresses can also be ignored by us in normal work. They're what lets your Ethernet switch know which physical port to use when sending packets to your computer. That's why those MAC addresses sometimes pop up when debugging your home network. MACs don't let you send stuff over the broader routed internet.

IP addresses are what we typically work with and are familiar with. We're usually assigned an IP address by some higher network authority, usually a DHCP server set up by whoever owns our network, or a static IP given to us by someone, be it the IT department or maybe your service provider.

What's important to understand here is that an IP address is NOT associated with a single computer like we casually think of it. Instead, it's associated with a specific network interface. As far as networks are concerned, anything that can communicate on the network either has an address to allow things to be directed to it, or it doesn't exist. This is what lets you have computers with multiple network cards (or Wi-Fi) connect to the same or different network and yet still be able to send/receive stuff without getting too confused.
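One way to see the "addresses belong to interfaces" idea from code is to look at what a program binds to. This is a minimal sketch, assuming a machine with the usual IPv4 loopback interface; bound_address is a hypothetical helper name. Binding to 127.0.0.1 attaches to the loopback interface only, while 0.0.0.0 is the wildcard meaning "every IPv4 interface on this machine".

```python
import socket


def bound_address(host: str) -> tuple[str, int]:
    """Bind a TCP socket to `host` on an OS-chosen free port and report it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 asks the OS to pick any free port
        return sock.getsockname()  # (ip, port) the socket is attached to


print(bound_address("127.0.0.1"))  # loopback interface only
print(bound_address("0.0.0.0"))    # wildcard: all IPv4 interfaces
```

A service bound only to 127.0.0.1 is invisible to other machines even if the computer has a perfectly good network address on another interface, which is a common reason a freshly started server "works locally" but nowhere else.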

Addresses aren't enough: ports

In casual network talk (is that even a thing?), we pretend that if we have a computer's IP address, we can communicate with it. But that's not the complete truth – under the TCP/IP framework we at least need to use a port number too.

A port number is an additional "location", kinda like a door number in an apartment building, ranging from 1 to 65,535. Port numbers exist because we want computers to do multiple things.

Imagine a computer that listened on the network for messages to its IP address. If we wanted a server to do more than one thing, when we send an HTTP web request to it, it needs to respond in a certain way. SSH connections need to be handled another way. It would be a giant pain to require the OS to identify every possible protocol up front so that it can route every request correctly. Plus how would you handle new custom protocols in such a setup?

Instead, we split the address into 65k ports and just have our software listen at specific ports. The OS networking code treats the full address+port combination as a socket and will just forward the data coming in to the correct software listening to the socket without having to worry about the data being transmitted. The software listening will know how to handle whatever data is coming in. This is much more flexible.

Only one application on a computer is allowed to listen on a given socket, and in the common situation where the computer only has one visible IP address, that means ports determine what is being connected to. Since certain port numbers are either officially reserved for specific use (port numbers under 1024 are governed by the IANA) or are well known to be in use by popular software, the port number is often omitted and we forget it exists. For example, our browsers know to connect to port 80 for HTTP and 443 for HTTPS requests. SSH clients know to go to port 22.
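The "one listener per address+port" rule is easy to trip over in practice, usually as an "Address already in use" error. Here is a small sketch of that behavior under default socket options, using only the loopback interface; second_bind_fails is a hypothetical helper written for this example.

```python
import socket


def second_bind_fails() -> bool:
    """Return True if a second socket cannot bind an address+port already in use."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        first.listen()
        address = first.getsockname()  # the (ip, port) actually assigned
        try:
            second.bind(address)       # same IP and same port as `first`
            second.listen()
        except OSError:
            return True                # typically "Address already in use"
        return False
    finally:
        first.close()
        second.close()


print(second_bind_fails())
```

With default settings this should report True. Some socket options relax the rule in specific ways, so treat it as the common case rather than an absolute law.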

We only have to deal with port numbers when we run custom stuff that deviates from the common port numbers... Like when we run a Jupyter notebook locally and it defaults to port 8888.

Routing

99.99% of the time, we do not have to understand how packets are routed around on the internet. We can often trust that if we want data to go to an address, the infrastructure will attempt to get the data there unless a barrier stops the traffic.

Instead, we just need to be aware that there are special IPv4 address ranges that are designated for "private network use". That means the wider global internet is not supposed to route those packets, and anyone can use them for any purpose within networks they control.

The private IP ranges:
10.0.0.0 – 10.255.255.255
172.16.0.0 – 172.31.255.255
192.168.0.0 – 192.168.255.255

You'll probably find them familiar because most home networks use the 192.168.x.x range, while a lot of cloud operators may give you "internal IPs" that live on the 10.x.x.x range.
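If you want to check an address against those three ranges programmatically, here is a minimal sketch with Python's ipaddress module. The ranges are written in CIDR form (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), which covers the same addresses as the list above. The names PRIVATE_RANGES and is_private_range are made up for this example. I spell the ranges out rather than using the module's built-in is_private attribute, since that attribute also counts other special ranges such as loopback.

```python
import ipaddress

# The three private IPv4 ranges listed above, in CIDR form.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),      # 10.0.0.0 to 10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 to 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.0.0 to 192.168.255.255
]


def is_private_range(address: str) -> bool:
    """True if `address` falls inside one of the three private IPv4 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in PRIVATE_RANGES)


print(is_private_range("192.168.1.10"))  # True
print(is_private_range("172.32.0.1"))    # False, just past the 172.16/12 block
print(is_private_range("8.8.8.8"))       # False
```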

The important part to know is that if you're using a computer that has a private IP address, you can probably talk to the other computers in the same private address range because they're likely on the same network (assuming a bunch of white lies about subnets in that address space and internal routing setups). If that same computer has an external IP address, things on the global internet should also be able to communicate with that computer. If the computer has BOTH, it can do fancy router-like things like take messages from one side and transmit things to the other – the OS figures out which interface to send data to in order to reach the correct host addresses.

As briefly mentioned, these large address ranges can be divided into subnetworks ("subnets") using a string of bits that marks how many bits at the start of the IP address designate the "network" part versus the rear bits that designate the specific host. That's the "subnet mask", and it often looks something like 255.255.255.0, which translates to binary 11111111.11111111.11111111.00000000. This is how admins can isolate one group of computers from another despite using the same private IP range. Sending traffic from one subnet to another requires a router that has been set up with a route between them; otherwise the traffic is dropped as undeliverable. This is why sometimes you can have two machines with IPs that start with 192.168.x.x but still can't talk to each other.
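To make the "same range, different subnet" situation concrete, here is a short sketch with the ipaddress module. The addresses and the same_subnet helper are invented for illustration, and it only answers "do these two addresses share a network part under this mask", not whether a route exists between them.

```python
import ipaddress


def same_subnet(address_a: str, address_b: str, prefix_length: int) -> bool:
    """True if both addresses share the same network part under the given mask."""
    network_a = ipaddress.ip_interface(f"{address_a}/{prefix_length}").network
    network_b = ipaddress.ip_interface(f"{address_b}/{prefix_length}").network
    return network_a == network_b


# A subnet mask and a CIDR prefix length are two spellings of the same thing.
subnet = ipaddress.ip_network("192.168.1.0/255.255.255.0")
print(subnet.prefixlen)  # 24
print(subnet.netmask)    # 255.255.255.0

print(same_subnet("192.168.1.10", "192.168.1.200", 24))  # True
print(same_subnet("192.168.1.10", "192.168.2.10", 24))   # False, needs a router
print(same_subnet("192.168.1.10", "192.168.2.10", 16))   # True under a wider mask
```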

Firewalls and NAT and stuff

At this point we know enough to understand the causes of some common network trouble spots – the edges where one network connects to another. Stuff can be frustratingly unable to connect because something is not allowing communication even when you know the correct address and ports to use.

Firewalls exist on networks to (among other things) prevent traffic flow to/from certain addresses and ports under the theory that if you block communication for ports not in use, then you lower the risk someone can find a way to exploit and breach your systems. But this security setting is also often the reason you spin up a server and later realize that a firewall somewhere (because they can exist in a few places) is preventing you from connecting to it. The solution is often to figure out where the firewall is (on the computer itself, or somewhere else), and get the person who controls it to adjust the firewall to allow traffic for that address+port.

Another potential source of issues is Network Address Translation (NAT). At a high level, it's a computer (often a router) that sits at the border between two networks and manages a mapping table of incoming-ip:port -> another-ip:port. It can let one address, like your home network's public IP address, serve multiple computers by forwarding requests appropriately. It also allows you to map incoming port numbers to other port numbers. For example, you could have all incoming messages for port 1234 go to 1.2.3.4:80, while all messages for port 2345 go to 2.3.4.5:80. That way, you can have multiple servers appear like they're running at the same address on different ports. It's less common to find NATs out in work networks, but you sometimes come across them handling special situations.
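The mapping-table idea can be written down as a toy lookup. This is only a mental model: real NAT happens inside the router or OS network stack, not in Python, and the names FORWARDING_TABLE and translate are hypothetical. The entries reuse the example ports and addresses from the paragraph above.

```python
from typing import Optional

# Toy forwarding table: incoming public port -> (internal ip, internal port).
FORWARDING_TABLE = {
    1234: ("1.2.3.4", 80),
    2345: ("2.3.4.5", 80),
}


def translate(incoming_port: int) -> Optional[tuple[str, int]]:
    """Look up where traffic arriving on `incoming_port` should be forwarded.

    Returns None when no mapping exists, i.e. the traffic has nowhere to go.
    """
    return FORWARDING_TABLE.get(incoming_port)


print(translate(1234))  # ('1.2.3.4', 80)
print(translate(2345))  # ('2.3.4.5', 80)
print(translate(9999))  # None
```

The None case is the one that bites in practice: if nobody added a mapping for your port, the traffic never reaches your server, no matter how correctly the server itself is configured.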

Hostnames, DNS, etc

Finally, we get one more very confusing layer on top of the already confusing world of IP networking – naming things. All of the internet can work with just IPs and numbers, but our poor human brains can't handle it. So the Domain Name System (DNS) was invented to create a distributed system for associating names with numeric addresses.

There's a bit of a running joke that for many network outages, "it's always DNS". While the idea of having a big distributed database that maps hostnames to IPs "can't be that difficult", there's tons of room for misconfiguration, DNS servers dying, and many layers of independent machines that cache DNS records and thus send traffic to unexpected places. And that's ignoring the myriad of security issues associated with DNS over the decades that had to be papered over, leading to even more ways to misconfigure stuff.

From a DS point of view, we don't want to be mucking with DNS beyond using tools given to us by people who know better to point a name to a stable IP. Most computers have a local /etc/hosts file to manually map a name to an IP for convenience or testing reasons, but that's only valid for that machine and tends to cause name collision issues down the line once you've forgotten it exists. Doing anything more complex generally requires configuring a DNS server and understanding stuff much more than we do here.
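When a name isn't going where you expect, it can help to ask what your own machine thinks the name resolves to. Here is a minimal sketch using socket.getaddrinfo from the standard library, which goes through the local resolver and so typically reflects the hosts file as well as DNS. The resolve helper is a made-up name, and "localhost" is used only because it should resolve without touching the network.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the unique IP addresses the local resolver gives for `hostname`."""
    results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    return sorted({sockaddr[0] for *_, sockaddr in results})


print(resolve("localhost"))
```

Running the same lookup on two different machines and comparing the answers is a cheap way to spot a stale cache or a forgotten hosts file entry.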

Debugging issues

So if you shoved all this trivia into your head... how does one actually use it to resolve issues?

For me, I spend a lot of time trying to imagine what the heck the network infrastructure between the computer I'm controlling and my target computer is. Are the machines I'm using actually connected to the same network? If not, what could potentially be between them? This can get very convoluted. I recently had to hop through 3 computers in as many different networks with firewalls in order to finally connect to a server I was testing.

There are lots of tools like ping, traceroute, dig, and whois that can help you probe different bits of a network in order to figure out where the problem is – is it a machine under my control or something else?
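When those tools aren't installed, a quick TCP connection attempt from Python can answer the basic question of "can I reach this address+port from here". This is a rough sketch; probe is a hypothetical helper, and the interpretation of each outcome is a rule of thumb rather than a guarantee. A refusal usually means the host answered but nothing is listening on that port, while a timeout often (not always) points at a firewall silently dropping traffic or a routing problem.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and summarize what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"      # something accepted the connection
    except ConnectionRefusedError:
        return "refused"       # host reachable, nothing listening on the port
    except socket.timeout:
        return "timeout"       # no answer at all within `timeout` seconds
    except OSError as exc:
        return f"error: {exc}"  # e.g. name resolution failure, no route


# Example: is anything listening locally on Jupyter's default port?
print(probe("127.0.0.1", 8888))
```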

One important thing to remember here is that asking other people to test things can sometimes help. This newsletter had a redirect loop issue that didn't affect me, but affected someone else across the country – it eventually turned out to be a CDN config that hadn't fully propagated yet due to caching. Things can legitimately work for some people and not for you, and then magically change without warning.

Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.

Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.

All photos/drawings used are taken/created by Randy unless otherwise credited.

randyau.com — Curated archive of evergreen posts. Under re-construction thanks to *waves at everything
Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.

