Friday, August 28, 2009

The Internet, from Bottom to Top

This is a post I've been meaning to write for a while. It's a high-level overview of how the Internet works, starting at Ethernet cables and working our way up.

I'll be following something called the OSI model throughout this post, or at least the relevant bits. It divides the functioning of a network into seven layers, much like a seven-layer burrito at Taco Bell. Unlike a burrito, though, only the first four layers are generally relevant. (Also, hot sauce does nothing to make networks better. But I digress.) This kind of separation into independent layers is really important, as we'll see later on.

Layer 1 - Physical layer

The physical layer is mainly concerned with the specifics of wiring. Ethernet cable (more properly called CAT-5 or 6, since we're talking specifically about the cabling) falls under this layer, as do fiber, coaxial cable, telephone lines, and various other things. Wireless radios also fall under this category, if we're talking about Wi-Fi.

It's important to know of this layer, and to be able to tell the different kinds of cables apart I guess, but this is far from the most interesting layer in the stack, so I'm not going to go into any great detail about the signaling rate of CAT-5 versus CAT-6, or anything like that. It's interesting to somebody, I'm sure, but it's just not my cup of tea.

Layer 2 - Link layer

The link layer is responsible for sending messages between computers that are on the same local network. Link-layer protocols are so closely tied to the physical layers they run on that it almost doesn't make sense to separate the two: each link-layer protocol depends heavily on the properties of its physical layer.

I'm going to be talking mainly about Ethernet here, because it's simple, and because I know just enough about other link-layer protocols to embarrass myself. >_> Wi-Fi, for instance, has to deal with situations where there's interference that neither transmitting party can detect, because they're both within the range of the receiver, but not of each other. Madness!

Ethernet is designed to work when all the computers in a network are just sort of plugged into each other on a shared line (the "Ether"). This has the benefit of making basic networking equipment extremely cheap, since a simple hub can get away with doing next to nothing: it just repeats whatever it hears to everybody. It works using a protocol that's really similar to talking with a group of people. Each computer just starts transmitting to the entire network when it's ready, as long as it doesn't see anybody else transmitting. If another computer starts transmitting at the same time and there's a collision, they both wait a random amount of time and then try again.
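The talk, collide, back off, retry loop described above can be sketched as a tiny simulation. This is a toy model and not real Ethernet: time is chopped into slots, the names `simulate_shared_line` and `max_wait` are made up for this example, and real Ethernet uses a smarter backoff scheme than a fixed random range.

```python
import random

def simulate_shared_line(hosts, max_wait=8, seed=0):
    """Toy model of hosts sharing one line.

    Returns (slot, host) pairs in the order the hosts managed to
    transmit without colliding with anybody else.
    """
    rng = random.Random(seed)           # seeded so runs are repeatable
    next_try = {h: 0 for h in hosts}    # slot in which each host will next try to send
    sent = []
    slot = 0
    while next_try:
        talkers = [h for h, t in next_try.items() if t == slot]
        if len(talkers) == 1:
            # Only one host spoke in this slot: its message gets through.
            sent.append((slot, talkers[0]))
            del next_try[talkers[0]]
        elif len(talkers) > 1:
            # Collision: everybody who spoke waits a random number of slots.
            for h in talkers:
                next_try[h] = slot + 1 + rng.randrange(max_wait)
        slot += 1
    return sent

if __name__ == "__main__":
    for slot, host in simulate_shared_line(["alice", "bob", "carol"]):
        print(f"slot {slot}: {host} transmits")
```

All three hosts start talking in slot 0, so nobody gets through until the random waits have spread them out. That wasted time is the cost of keeping the hardware dumb.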

In practice, detecting collisions and waiting a bit to recover from them hurts network performance if there's more than one host on the network trying to communicate. A modern network switch avoids most of this: it reads each packet's destination address and sends the packet only to its designated recipient, so collisions are significantly reduced.
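Here's a rough sketch of how a switch can figure out who the "designated recipient" is. The post doesn't go into this, so treat it as an illustration of the usual learning trick and nothing more: the switch remembers which port each sender's address showed up on, and only floods a packet to everybody when it hasn't seen the destination yet. The class name `LearningSwitch` and the short fake addresses are invented for the example.

```python
class LearningSwitch:
    """Toy switch that learns which port each MAC address lives on."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.table = {}  # MAC address -> port it was last seen on

    def handle_frame(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame should be sent out of."""
        # Learn (or refresh) where the sender is plugged in.
        self.table[src_mac] = in_port
        if dst_mac in self.table:
            out_port = self.table[dst_mac]
            # Destination is on the port the frame came from: nothing to forward.
            return [] if out_port == in_port else [out_port]
        # Unknown destination: fall back to sending it everywhere else.
        return [p for p in range(self.num_ports) if p != in_port]

if __name__ == "__main__":
    switch = LearningSwitch(num_ports=4)
    print(switch.handle_frame(0, "aa", "bb"))  # "bb" unknown, so flood
    print(switch.handle_frame(1, "bb", "aa"))  # "aa" was learned on port 0
    print(switch.handle_frame(0, "aa", "bb"))  # "bb" was learned on port 1
```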

Every Ethernet device has an address hardcoded into it, the "MAC address", which is supposed to be globally unique: no other device in the world should have the same one. Some hardware will let you change its MAC address, but if you change it to the same address as another device on the network, weirdness will ensue. This is in contrast to IP addresses, which I will get to in a second.

Layer 3 - Network Layer

This layer is where real magic starts to happen. Layer 2 lets us communicate with computers that we're connected directly to, but how do we communicate with systems that are anywhere on Earth (or even off Earth)?

In the IP (Internet Protocol) layer, the packets of data from layer 2 can be shuttled between local network segments. Given an IP address, every computer on the Internet either knows the layer 2 address of the computer that has that IP, or knows the layer 2 address of a "router" - basically another computer that's connected to multiple networks, and knows which addresses are on each one. Iterate this across enough networks, usually a dozen or two, and a data packet can reach any computer in the world that has an IP address. (Well, mostly - more on the abomination that is NAT later on.)

"So wait," you may be asking, "why add another set of addresses? Don't we already have layer 2 addresses we could use?"

But there are a few problems with that. First, not everybody uses the same layer 2 protocol, and not all layer 2 protocols have the same kinds of addresses. A new layer is needed on top for everybody to be able to communicate. Second, layer 2 addresses are distributed pretty randomly. If you want to reach somebody with a given MAC address, it's easy if they're on the same network, since you can just send out a broadcast message to the network, but it's impossible if they're on another network.

IP gets around that by having each router keep track of which IP addresses can be found where, in a routing table. On a home network, it's pretty simple, since all IP addresses except the ones on the home network itself are out on the Internet, and can be reached through whatever type of connection the router is using. For big routers out on the Internet, the situation is a lot more complicated, and beyond the scope of this post. This is why IP addresses are assigned, rather than just being attached to the hardware when it's manufactured - whoever assigns the address is responsible for also remembering where packets meant for that address should go.
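To make the home-network case concrete, here's a minimal sketch of a routing table lookup using Python's standard `ipaddress` module. The table contents, the `next_hop` helper, and the example addresses are all made up for illustration. The rule it uses, picking the most specific matching entry, is how IP routing tables are normally consulted.

```python
import ipaddress

def next_hop(routing_table, destination):
    """Return where to send a packet for `destination`, or None if no route matches."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routing_table if dest in net]
    if not matches:
        return None
    # The most specific network (longest prefix) wins.
    _net, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop

# A hypothetical home router: one local network, everything else goes to the ISP.
HOME_ROUTER = [
    (ipaddress.ip_network("192.168.1.0/24"), "deliver directly on the local network"),
    (ipaddress.ip_network("0.0.0.0/0"), "send to the ISP"),
]

if __name__ == "__main__":
    print(next_hop(HOME_ROUTER, "192.168.1.42"))
    print(next_hop(HOME_ROUTER, "203.0.113.7"))
```

A big Internet router does the same kind of lookup, just with a vastly larger table that changes all the time.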

IP addresses are 32 bits long, which means there are 2^32 of them, or a little over four billion. It turns out there are a lot more than four billion people in the world, and most of them want to be on the Internet. Who could have seen this coming? D:
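For the skeptical, the arithmetic is easy to check. The population figure below is a rough 2009 estimate that I'm assuming for the sake of the example. It isn't a number from anywhere official.

```python
import ipaddress

# Every possible IPv4 address, written as one giant network.
ALL_IPV4 = ipaddress.ip_network("0.0.0.0/0")
ipv4_addresses = ALL_IPV4.num_addresses

WORLD_POPULATION_2009 = 6.8e9  # rough estimate, assumed for illustration
addresses_per_person = ipv4_addresses / WORLD_POPULATION_2009

if __name__ == "__main__":
    print(ipv4_addresses)                  # 4294967296, i.e. 2**32
    print(round(addresses_per_person, 2))  # less than one address each
```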

Currently, we're solving this problem mainly with NAT - network address translation. With NAT, you have a router with a single IP, which assigns arbitrary IP addresses to the computers it routes for. If you combine this with some fiddling around with layer 4, you can have a lot of computers behind one real IP address, and it mostly works! Except that none of the computers knows their real IP address, which makes network programming kind of a pain. Oh, and you can't connect to any of them from the Internet without manually setting that up. Aside from those problems, it works!
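The "fiddling around with layer 4" mostly comes down to a lookup table keyed on port numbers. Here's a minimal sketch, assuming the simplest possible scheme: the `Nat` class, its method names, and the starting port are invented for this example, and real NAT boxes also track the protocol, the remote address, and timeouts.

```python
class Nat:
    """Toy NAT: many private (ip, port) pairs share one public IP address."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.out = {}   # (private_ip, private_port) -> public_port
        self.back = {}  # public_port -> (private_ip, private_port)

    def outbound(self, private_ip, private_port):
        """Rewrite the source of an outgoing packet; returns (public_ip, public_port)."""
        key = (private_ip, private_port)
        if key not in self.out:
            # First packet from this source: hand it the next unused public port.
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.out[key]

    def inbound(self, public_port):
        """Find who a reply belongs to; None means nobody asked for it, so it's dropped."""
        return self.back.get(public_port)

if __name__ == "__main__":
    nat = Nat("203.0.113.5")
    print(nat.outbound("192.168.1.10", 5000))
    print(nat.outbound("192.168.1.11", 5000))
    print(nat.inbound(40001))
    print(nat.inbound(12345))
```

The last lookup is the "you can't connect to any of them from the Internet" problem in miniature: without an existing table entry, the router has no idea which inside computer an incoming packet is for.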

(Actually, as far as NAT goes, we've got it pretty good here in the US. American companies and organizations managed to snap up a huge fraction of the available IP addresses in the early days of the Internet, so our ISPs can afford to give out public IP addresses to all their customers. You should hear some of the horror stories I've heard about NAT in places like India...)

The long-term solution to address space depletion is IPv6. (Internet Protocol, version 6. Current IP is actually IPv4.) IPv6 has 128-bit addresses, which means that not only can every man, woman, and child on Earth have an IPv6 address, but their constituent atoms can have unique addresses too. In other words, it's comfortably large enough that every computer can actually be on the public Internet, and we won't have to play dirty tricks to make it last longer anytime soon. IPv6 also has some other nice features, like getting rid of header checksums and of packet fragmentation by routers (only the sender is allowed to fragment), which makes things simpler for routers (which, in the Internet core, are pretty overworked these days).
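The atoms claim sounds like an exaggeration, so here's a quick sanity check. Both the population and the atoms-per-person figures are rough, commonly quoted estimates that I'm assuming for illustration, so take the result as a ballpark figure and nothing stronger.

```python
ipv4_addresses = 2 ** 32
ipv6_addresses = 2 ** 128

WORLD_POPULATION_2009 = 6.8e9  # rough estimate, assumed for illustration
ATOMS_PER_PERSON = 7e27        # commonly quoted ballpark, also an assumption

ipv6_per_person = ipv6_addresses / WORLD_POPULATION_2009

if __name__ == "__main__":
    print(ipv6_addresses)                      # the exact count
    print(ipv6_addresses // ipv4_addresses)    # how many whole IPv4 Internets fit: 2**96
    print(ipv6_per_person > ATOMS_PER_PERSON)  # does everybody's every atom get an address?
```

With these numbers the comparison comes out true, though by less of a margin than you might expect.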

So why isn't everybody (or, realistically speaking, anybody) using IPv6? Technologies are in place for the transition to happen gradually, but unless your ISP is actively deploying it, the only way to get IPv6 is to get a tunnel to a tunnel broker, which is kind of a pain. ISPs aren't actively deploying it, because websites don't use it yet. Websites don't use it yet, because end-users won't see any benefit from it. The IPv6 transition could happen at any time, since all the pieces are there, but it's one of those things that nobody has any reason to do unless somebody else does it first. (I'd call it a chicken-and-egg situation, but everybody knows the egg came first. Really, I'm not even sure why people still use that phrase.)

On the other hand, the projection is that in about two years, we'll run out of new IP addresses to give out, after which a market in IP addresses will probably form, and when the price goes high enough, IPv6 will sort of just happen. This is another case in which the layered architecture of the Internet is really useful. The entire layer 3 protocol, arguably the most important one in the entire stack, can just be switched out with no changes to layer 2 and minimal changes to layer 4. Cool, huh?

Layer 4 - Transport layer

IP is pretty neat, but it's not the last word in this. IP operates on a best-effort basis, with extremely lax guarantees, to reflect the realities of computer networks. The Internet will do its best to get your packets from A to B, but it's allowed to lose them, or mess them up, or send them out of order, or even send multiple copies if it really wants to mess with your head. In computer science, if you don't like the properties a system gives you, standard operating procedure is to build another layer on top of it that gives you what you want. :D

TCP is the most widely used transport layer protocol, for good reason. It gives you a stream that you can read data from and write data to, without having to worry about packets and the funny things that they're allowed to do. It does this in three ways: it numbers packets, so that dropped, duplicated, or out-of-order packets can be detected; it waits for acknowledgement from the other side and resends anything that goes unacknowledged, to guarantee delivery; and it adds data checksums, so that corrupted data can be detected. (This is definitely an oversimplification, but TCP can get pretty complicated.)
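Here's a small sketch of the receiving side of that idea: given packets that may arrive out of order, duplicated, or corrupted, rebuild as much of the stream as can be trusted. This is not how TCP is actually implemented. Real TCP numbers bytes instead of packets and uses its own 16-bit checksum, and `zlib.crc32` is only standing in for it here. The function names and the dictionary layout are made up for the example.

```python
import zlib

def make_segment(seq, data):
    """Bundle a chunk of data with its sequence number and a checksum."""
    return {"seq": seq, "data": data, "checksum": zlib.crc32(data)}

def reassemble(segments):
    """Return (bytes deliverable in order so far, next sequence number needed)."""
    good = {}
    for seg in segments:
        if zlib.crc32(seg["data"]) != seg["checksum"]:
            continue  # corrupted in transit: ignore it and wait for a resend
        good.setdefault(seg["seq"], seg["data"])  # keep the first copy, ignore duplicates
    stream = b""
    expected = 0
    # Hand data to the application only while there are no gaps.
    while expected in good:
        stream += good[expected]
        expected += 1
    return stream, expected

if __name__ == "__main__":
    s0, s1, s2 = (make_segment(i, d) for i, d in enumerate([b"he", b"ll", b"o!"]))
    mangled = {"seq": 1, "data": b"XX", "checksum": s1["checksum"]}
    arrived = [s2, s0, s0, mangled]       # out of order, duplicated, corrupted
    print(reassemble(arrived))            # stuck waiting for segment 1
    print(reassemble(arrived + [s1]))     # after a resend, the whole stream
```

The second value returned is what the receiver would acknowledge: "I have everything before this number, please send me this one next."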

The interplay between TCP and IP is kind of interesting in itself. It reflects the "dumb network with smart edges" principle - the Internet at large only sees IP packets, which are easy to deal with, and so can be dealt with in incredible volume. Only the edges of the network, where connections are actually being made from and to, have to deal with the much more complex TCP protocol. At the inception of the Internet, there were other protocols which made much more stringent demands on their network layers. TCP/IP wiped the floor with them, largely because it made almost no demands of its network layer other than that it usually work. This allowed just about anybody to connect to the Internet using just about any method they could send packets over. (Those links are a prime example of what network engineers consider funny. Bonus: somebody actually built it!)

UDP is the next most commonly used layer 4 protocol, and the only other one you're at all likely to have heard of. It offers very similar guarantees to IP, so it's not usually that useful, except when people want just a bit more performance than they can get from TCP. There are also newer protocols, such as SCTP or DCCP, which would probably be really useful if anybody actually used them. I'm basically only mentioning them here for completeness, though.
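To show how little ceremony UDP involves, here's a minimal sketch that sends one datagram to itself over the loopback interface using Python's standard `socket` module. The helper name `udp_round_trip` is made up for the example. Note what's missing compared to TCP: no connection setup, no acknowledgement, and no retry if the datagram vanishes (the timeout is there so the example gives up instead of waiting forever).

```python
import socket

def udp_round_trip(message, timeout=2.0):
    """Send one UDP datagram to ourselves on 127.0.0.1 and return what arrives."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(timeout)     # raises socket.timeout if nothing shows up
        # No handshake: just address the datagram and fire it off.
        sender.sendto(message, receiver.getsockname())
        data, _sender_address = receiver.recvfrom(4096)
        return data
    finally:
        sender.close()
        receiver.close()

if __name__ == "__main__":
    print(udp_round_trip(b"hello, is anybody there?"))
```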

Layers 5-8 - Meh.

Layers 5-7 (Session, Presentation, and Application) are defined in the OSI stack, but they don't really apply to networks, so I'm just going to leave them out. Layer 5 is sometimes used in some protocols where persistent sessions are useful, but layers 6 and 7 are almost universally ignored.

Layer 8 is actually a network geek in-joke - following the progression of the OSI stack, layer 8 could be interpreted to be the user, so a "layer-8 network failure" is basically equivalent to PEBKAC, or an ID-10-T error.

Hmm, this post went on for a lot longer than I'd expected. >_> The most important thing to take from it is that the Internet runs on many different layers, each of which builds upon the last, allowing it to run on a wide variety of hardware, and under a wide variety of conditions. It also allows for upgrades, since changing out a layer usually needs changes in at most one layer directly above it.
