The Internet is a Series of Tubes

What happens when you enter a website address into a browser

The average person visits 17.2 websites per day, according to figures I’ve just completely invented. But how do we reach the banal and the blissful sites that we numbly scroll from dawn till dusk? What are the arcane mechanics of this infernal machine, this series of tubes known as the interwebs? Let me take you on a journey from keyboard to server and back. Out of deference to my school, let’s use https://www.holbertonschool.com as our example. Type it into the address bar of your preferred browser, press Enter, and let’s get going. Don’t forget your towel.

It will be fun; don’t panic.

Before we get going, though, a couple of things must be made clear. What we are looking for is not the physical location of a server, but an IP address to which we can direct our request. That is part of the magic of TCP/IP. TCP (Transmission Control Protocol) breaks the data being sent into packets, each with a header and a body, while IP (Internet Protocol) is the portion that actually handles the routing. An apt analogy would be using a moving service to pack and move your apartment. The person who packs the boxes is like TCP, while IP is more like the driver/pilot/gondola captain who has to follow a map to the address. TCP then unpacks the boxes at the destination. But before we can get to all that, we have to know the IP address that corresponds to the much more human-readable https://www.holbertonschool.com.
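
To make the packing and unpacking half of the analogy concrete, here is a minimal sketch in Python. It is a toy model, not real TCP: the Segment class, the pack and unpack helpers, and the 8-byte chunk size are all made up for illustration. It only shows the idea of numbering chunks so they can be put back in order at the destination.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int        # byte offset of this chunk within the original message
    payload: bytes  # the chunk of data itself


def pack(message: bytes, size: int = 8) -> list[Segment]:
    """Split a message into numbered segments, roughly as TCP does before sending."""
    return [Segment(seq=i, payload=message[i:i + size])
            for i in range(0, len(message), size)]


def unpack(segments: list[Segment]) -> bytes:
    """Reassemble segments by sequence number, whatever order they arrived in."""
    return b"".join(s.payload for s in sorted(segments, key=lambda s: s.seq))


message = b"GET / HTTP/1.1\r\nHost: www.holbertonschool.com\r\n\r\n"
segments = pack(message)
random.shuffle(segments)  # packets may arrive out of order
assert unpack(segments) == message
```

The sequence numbers are what let the receiving side rebuild the message even when the boxes come off the truck in the wrong order.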

To match the URL text to an IP address, the first thing that happens when you press Enter is that your web browser checks its cache. It asks, “Have I been there before?” If yes, the IP is off to the races. If not, it asks the operating system whether it has been there before. Again, if yes, the IP routing begins. You might ask why the OS is consulted when the browser cache has already been checked. Consider that you might have more than one browser, or some other point of contact with the internet. It’s all about minimizing load on servers by stopping the DNS request from cascading any further than it needs to.
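
The cache-then-ask pattern can be sketched in a few lines of Python. Everything here is hypothetical: the lookup function, the cache names, and the IP address (203.0.113.10 comes from a range reserved for documentation, so it is a placeholder and not Holberton’s real address). Real browsers and operating systems are far more involved; this only illustrates the order of the questions.

```python
from typing import Callable

# Placeholder address from a range reserved for documentation, not a real lookup result.
PLACEHOLDER_IP = "203.0.113.10"


def lookup(hostname: str,
           caches: list[tuple[str, dict[str, str]]],
           ask_resolver: Callable[[str], str]) -> tuple[str, str]:
    """Check each cache in order; fall through to the resolver only on a full miss."""
    for name, cache in caches:
        if hostname in cache:
            return cache[hostname], name
    ip = ask_resolver(hostname)
    for _name, cache in caches:
        cache[hostname] = ip  # remember the answer for next time
    return ip, "resolver"


caches = [("browser cache", {}), ("OS cache", {})]
first = lookup("www.holbertonschool.com", caches, lambda host: PLACEHOLDER_IP)
second = lookup("www.holbertonschool.com", caches, lambda host: PLACEHOLDER_IP)
print(first)   # ('203.0.113.10', 'resolver')
print(second)  # ('203.0.113.10', 'browser cache')
```

The second call never reaches the resolver, which is exactly the load-saving behavior described above.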

Joey Mousepad doesn’t have a great memory for names

After checking locally, your machine will send a request to your ISP. The ISP will check its own cache to see whether the domain name has been resolved to an IP before. More astute readers might again notice that checking the cache could greatly cut down on the number of requests that get past this stage. Pooling known IP addresses from previous resolutions into a codex will cut out all but the most obscure requests. For the purposes of a full illumination of the process, however, let’s assume that neither you nor anyone in your ISP’s service area has ever been to the requested URL. Then it’s time to meet the resolver!

Think of the resolver as an intrepid research assistant. It doesn’t know the IP address, but it has a set of rules to follow to find it. The first of these is to ask a root server. The root servers sit at the top of the DNS hierarchy. A root server will direct the resolver to the next level of the search, which is the Top Level Domain (TLD) server for whatever suffix is requested, .com in this case. The TLD server will then point to the name servers that actually handle requests for that domain. But wait, you’re thinking. How can it know them all? There are so many .com sites. The answer is through a registrar. Sites like GoDaddy or HostGator allow for the purchase and registration of domain names. These registrars must be registered with the various TLDs for which they sell names. Once a name is purchased, the registrar provides a means (usually a web interface) to supply one or more name servers. The registrar reports these architect-supplied name servers to the TLD service. That’s how they are able to connect the dots.
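
Here is the resolver’s scavenger hunt as a toy Python sketch. The three dictionaries stand in for the root server, the TLD server, and the name server; every server name and the IP address in them are invented placeholders, and a real resolver speaks the DNS protocol over the network instead of reading dictionaries.

```python
# Hypothetical stand-ins for the three levels of the DNS hierarchy.
ROOT_SERVER = {"com": "com-tld-server"}
TLD_SERVERS = {
    "com-tld-server": {"holbertonschool.com": "ns1.example-host.test"},
}
NAME_SERVERS = {
    "ns1.example-host.test": {"www.holbertonschool.com": "203.0.113.10"},
}


def resolve(hostname: str) -> str:
    """Walk the hierarchy: root -> TLD server -> name server -> IP address."""
    labels = hostname.split(".")
    tld = labels[-1]                 # "com"
    domain = ".".join(labels[-2:])   # "holbertonschool.com"

    tld_server = ROOT_SERVER[tld]                   # 1. root points to the TLD server
    name_server = TLD_SERVERS[tld_server][domain]   # 2. TLD server points to the name server
    return NAME_SERVERS[name_server][hostname]      # 3. name server answers with the IP


print(resolve("www.holbertonschool.com"))  # 203.0.113.10
```

Note that neither the root nor the TLD level ever knows the final answer. Each one only knows who to ask next.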

What is a name server, though? A name server holds the DNS records for a domain and answers the question of which IP address goes with which name. If a site is small, the name server might sit on the same machine that actually serves the web pages to the client. In most cases, though, it’s the front door to something much larger, like a cloud hosting service such as AWS or DigitalOcean. If it’s a cloud service, they will know how to direct the request to the right server. Once the name server has found the matching record, it provides the IP address. At this point the resolver can return the result all the way back to the client, that is, your home machine.

Once the IP has been resolved, a TCP/IP request is sent from the client for the web page. Let’s just assume that Holberton is using a big cloud hosting service like AWS. AWS will direct the request to the appropriate server within its network. That server will almost certainly be a load balancer with a firewall. What does a load balancer do, you ask? Well, its primary function is in its name. It manages incoming requests and distributes them across servers according to whatever algorithm optimizes the load. The second thing it often does is handle the SSL/TLS side of HTTPS requests. This allows incoming and outgoing traffic to be safely encrypted for data privacy. Doing this on the load balancer removes processing load from the web servers that are driving the websites. Let’s make another assumption: that they have more than one load balancer grouped together to handle requests. Having more than one load balancer adds a layer of redundancy in case one fails. And what’s a firewall, you ask? A firewall permits or denies incoming and outgoing traffic to a server.
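
As a rough illustration, here is a sketch of a firewall check followed by a load balancer. Round robin (taking turns) is just one possible distribution algorithm, chosen here because it is the simplest; the post does not say which one any real service uses. The server names, the allowed ports, and the class itself are assumptions made for the example.

```python
from itertools import cycle
from typing import Optional

ALLOWED_PORTS = {80, 443}  # assumption: only HTTP and HTTPS traffic is permitted


def firewall_allows(port: int) -> bool:
    """Permit or deny traffic based on the destination port."""
    return port in ALLOWED_PORTS


class RoundRobinBalancer:
    """Hands each permitted request to the next server in the rotation."""

    def __init__(self, servers: list[str]) -> None:
        self._servers = cycle(servers)

    def route(self, port: int) -> Optional[str]:
        if not firewall_allows(port):
            return None  # denied by the firewall, never reaches a web server
        return next(self._servers)


balancer = RoundRobinBalancer(["web-01", "web-02"])  # hypothetical server names
print([balancer.route(443) for _ in range(3)])  # ['web-01', 'web-02', 'web-01']
print(balancer.route(22))                       # None
```

Other strategies, such as sending each request to the server with the fewest active connections, would only change the body of route.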

After being directed to the server that will handle our request and passing through its own firewall, our request is dropped off at the web server. The web server is the software that actually generates and returns web page information as requested. It listens for requests on port 80 (HTTP) and port 443 (HTTPS). If the website were static, that would be all that is needed. It could serve the text, images, HTML, CSS, and JS but wouldn’t be able to take in or return any information. For that, we need an application server.
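
To show what serving a static site boils down to, here is a minimal sketch of the request-to-response step. It is not how Apache or any real web server is implemented: the STATIC_FILES dictionary and the handle_request function are hypothetical, and a real server would read files from disk and listen on a socket.

```python
# Hypothetical in-memory "files" standing in for the site's static assets.
STATIC_FILES = {
    "/": b"<h1>Hello from the web server</h1>",
    "/style.css": b"h1 { color: crimson; }",
}


def handle_request(raw_request: bytes) -> bytes:
    """Turn a raw HTTP request into a raw HTTP response for a static site."""
    request_line = raw_request.split(b"\r\n", 1)[0].decode("ascii")
    method, path, _version = request_line.split(" ")

    if method != "GET":
        return b"HTTP/1.1 405 Method Not Allowed\r\nContent-Length: 0\r\n\r\n"

    body = STATIC_FILES.get(path)
    if body is None:
        return b"HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n"

    headers = f"HTTP/1.1 200 OK\r\nContent-Length: {len(body)}\r\n\r\n"
    return headers.encode("ascii") + body


response = handle_request(b"GET /style.css HTTP/1.1\r\nHost: www.holbertonschool.com\r\n\r\n")
print(response.decode("ascii"))
```

Notice that the same request always gets the same bytes back. Nothing is stored or computed, which is the limitation the application server exists to remove.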

An application server is the software that allows for the manipulation and storage of data. A framework such as Flask or Django uses templating (Jinja, in Flask’s case) and views to render the appropriate front-end pages. It actually drives the website. If the front end is the frame of a car, the application server is the engine. The application server also talks to a database, which allows for the storage and retrieval of data in an organized fashion.
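
For a feel of the engine under the hood, here is a small sketch using only Python’s standard library: a WSGI application (the interface frameworks like Flask and Django are built on) backed by an in-memory SQLite database. The visits table and the page it renders are invented for the example, and SQLite stands in for whatever database a real site would use.

```python
import html
import sqlite3

# Assumption: an in-memory SQLite database stands in for the site's real database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (page TEXT)")


def application(environ, start_response):
    """A WSGI app: store the visit, read the count back, render a page from it."""
    page = environ.get("PATH_INFO", "/")
    db.execute("INSERT INTO visits (page) VALUES (?)", (page,))
    (count,) = db.execute(
        "SELECT COUNT(*) FROM visits WHERE page = ?", (page,)
    ).fetchone()

    body = f"<p>{html.escape(page)} has been visited {count} time(s).</p>".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Unlike the static case, the response changes with every request because it depends on stored data. To try it in a browser, the app could be served locally with wsgiref.simple_server.make_server("localhost", 8000, application).serve_forever().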

Once the application server has received any information it needs from the database, it returns a rendered view to the web server. The web server then hands the result to TCP, which packs it into packets; IP routes them back to the client’s IP address, and TCP unpacks the packets for the browser to render. Below is a schema of the structure we just traversed.

And there you have the workings of one type of website. This type is known as a LAMP stack (Linux, Apache, MySQL, and PHP, Perl, or Python; Python in this case). There are other web stacks like the MEA(R)N stack, but those will have to wait for another post.

Until then, Culpae Non Carborundum.

I am a software student from Tulsa, OK.