CS50 for Lawyers 2019 – Internet Technologies, Cloud Computing


[MUSIC PLAYING] SPEAKER: You type an address into
a browser, you send an email, you perhaps have a video
conference or a chat online. Have you ever stopped
to consider what exactly is going on underneath the hood, so
to speak, of those pieces of software? And really the entire
infrastructure that somehow connects you to the person or persons
with whom you’re communicating. Well it turns out that there’s
a whole stack, so to speak, of internet technologies that underlie
the software that you and I use these days, every day. And indeed, the software
that we use, browsers and email clients and the
like, are really abstractions, very user friendly abstractions, on
top of some lower level implementation details. And these days, too, we have built
abstractions even above those, known as the cloud, an abstraction on top
of this underlying infrastructure that enables us to do most anything
we want computationally without even having that hardware locally. So let’s see if we can’t distill what
goes on when you do type an address or a URL into the address bar
of a browser and then hit Enter. Or you type out an email– specify someone’s email
address and then hit Enter. What exactly is going
on underneath the hood? Well, at the end of the day, I dare say
that what your laptop, and my laptop, and our desktops, and even our
servers are capable of really is just sending messages in envelopes
back and forth across the internet. Virtual envelopes, if you will. Now in our human world, an envelope
needs a few things on the outside. If you want to send a letter or a card
or something old school to someone, you need to address it, of course. And you need to put, perhaps in
the middle, the recipient’s name, and address, and other details. You might put in the top left hand
corner, by convention, your own name and or address. You might even put a
little memo in the bottom that specifies what’s inside or
fragile or some other annotation. So this metaphor of the
physical world is actually pretty apt for what’s going on
underneath the hood in computers. When you have a computer
plugged into a network or connected wirelessly
to a network, it really is just sending and receiving
envelopes, virtual envelopes, that at the end of the day are
just patterns of zeros and ones, but collectively, those zeros and ones
represent your email or the request that you’ve made of a web server,
the response you’re getting back from that web server. So let’s see if we can’t formalize
exactly what these lower level primitives are, consider exactly how
they’re layered on top of one another, because thereafter we can
build almost anything we want on top of this infrastructure once
we understand what those underlying building blocks actually are. So let’s consider how
we actually address this envelope in the first place. After all, when I turn on my
laptop or turn on my phone or open up my desktop in the morning,
how does that computer or that phone even know what its own
address is on the internet? Because just as in our
human world, wherein you need to be uniquely
addressable in the physical world in order to even receive an
envelope or a card or a package, so do computers need to be uniquely
identifiable on the internet. Now for our purposes, we
can consider the internet just to be an internetworked
collection of computers connected via wires, connected wirelessly. There’s some kind of interconnectivity
among all of these devices and these days our phones and internet
of things devices and other things still. So let’s just stipulate that
somehow or other there’s a physical connection, or
even a wireless connection, between all of these various devices. So those devices all
need unique addresses, just like a building in the
human world needs an address. For instance, the computer science
building here on campus is at 33 Oxford Street, Cambridge,
Massachusetts, 02138, USA. With that precise information, you can
send us real mail or a package or anything else through the
physical world in order for it to arrive on our doorstep? But what if you, instead,
wanted to send us an email and get it to that
building, or really me, wherever I am physically in the
world on my internetworked device? You need to know my
computer’s address, you need to know my phone’s address,
or at least the mail server that’s responsible for
receiving that message from you. Well, it turns out that most any
network on a campus, in a corporation, even at home these
days has a DHCP server. That stands for Dynamic Host
Configuration Protocol, and that’s just a fancy way of
describing a server that is constantly listening for new laptops, new desktops,
new phones, new other devices, to wake up or be turned on and to shout out
the digital equivalent of hello, world, what is my address? Because the purpose
in life of these DHCP servers is to answer that question. To say, David you’re going to go
ahead and be address 1.2.3.4 today. Or David, you’re going
to be 4.5.6.7 or 5.6.7.8. Any number of possibilities
can be used to represent uniquely my particular device. So DHCP servers are run by the
system administrators on a campus, in a company, in an
internet service provider. More generally, they’re
run by whoever provides us with our internet connectivity. They just exist on our network. But these DHCP servers also
give us other information. After all, it’s not really sufficient
just to know what my own address is. How do I know where anyone
else in the world is? Well, it turns out that
the internet is filled with devices called routers
whose purpose in life, as their name suggests, is to
route information from point A to point B to point C and so on. And those routers, similarly,
need to know these addresses so that they know upon receiving
some packet of information, some virtual envelope, in
which direction to send it off. So these DHCP servers also tell me not
just my address, but also the address of the next hop, so to speak. I, as a little old laptop
or phone or a desktop, I have no idea where 99.999 percent
of the computers in the world are, even higher than that perhaps. But I do need to know where the
next computer is on the internet, so that if I want to send
information that leaves this room, it needs to go to a router
whose purpose in life is to, again, route it further along. And generally there might be
one, two, maybe even 30 steps or hops in between me and my destination
for that email or virtual envelope, and those routers are all
configured by people who aren’t me, system administrators beyond
this, beyond these walls to know how to route that data. So we can actually see evidence
of this that you yourself have had underneath your
fingertips all this time and you might not have
ever poked around. For instance, if you want
to see your own address, keep an eye out for a
number of this form. It’s a number dot number dot number dot
number, and each of those placeholders represents a specific value
ranging from 0 to 255. In other words, each of these hashes
can be any value between 0 and 255, and that range, 0 to 255, well,
that's 256 total possible values. That's eight bits. Ergo, each of these placeholders
represents 8 bits, 8 more bits, 8 more, 8 more. So an IP address, by
definition, is 32 bits. And there it is. IP, an acronym you’ve
probably seen somewhere, even if you’ve not thought
hard about what it is, stands for Internet Protocol. Internet Protocol mandates that
every computer on the internet, at the risk of oversimplification, has
a unique address called an IP address. And those IP addresses look like this. If these IP addresses are composed of
32 bits, how many possible IPs are there and therefore how many possible
machines can we have on our internet? Well, 2 times 2 times 2, 2 to the 32,
so that’s four billion, give or take. By design of IP addresses,
you can have four billion, give or take, possible permutations of
zeros and ones if you have 32 in total, and that gives you four
billion, maximally, computers and phones and internet
of things devices, and the like. Now that sounds big,
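You can verify that arithmetic with Python's standard ipaddress module, which treats a dotted-quad address as exactly this kind of 32-bit number (the address 1.2.3.4 here is just the lecture's example):

```python
# An IPv4 address is a 32-bit number written as four 8-bit chunks,
# so there are 2**32 (about four billion) possibilities.
import ipaddress

total = 2 ** 32
print(total)                            # 4294967296, i.e., ~4 billion

addr = ipaddress.IPv4Address("1.2.3.4")
print(int(addr))                        # 16909060: the same address as one 32-bit int
print(ipaddress.IPv4Address(16909060))  # and back again: 1.2.3.4
```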
but not when each of us personally probably carries one IP
address in our pocket in our phone, maybe another on our wrist these
days, one or more computers in our life, not to mention all of the
other devices and servers in the world that need these addresses, too. So long story short,
this is version 4 of IP. It’s decades old, but there’s
also a newcomer on the field called IPv6, version 6. There never really was a version 5. And IPv6 is only
finally gaining traction because we’re running so
short on IPs that it’s becoming a problem for campuses,
for companies, and beyond. But IPv6 will use 128
bits instead of 32, which gives us many, many, many,
many, more possibilities, bigger numbers than I can even pronounce. So that should cut it
for quite some time. But not every computer on the internet
needs a public IP address, only those envelopes, so to speak, that
need to leave my pocket, or my home, or my campus, or my company. It turns out, as a short term mechanism
to squeeze a bit more utility out of our 32-bit addresses,
which are still omnipresent and the most popular
among the versions, well we can actually distinguish
between public IP addresses that do actually go out on
the internet and private addresses. And indeed, if your
own IP address happens to start with the number 10
and then a dot, or the number 172.16 and then a dot, or the
number 192.168 and then a dot, and then something else, well, odds are,
your computer has a private IP address. And this is just a feature of the little
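Python's standard ipaddress module knows these reserved private ranges, so you can test any address yourself (the sample addresses are the ones used throughout this lecture, plus one well-known public address):

```python
# Check whether addresses fall in the private (non-routable) ranges
# like 10.x.x.x, 172.16.x.x, and 192.168.x.x.
import ipaddress

for ip in ["10.254.16.242", "172.16.0.5", "192.168.1.139", "8.8.8.8"]:
    print(ip, ipaddress.ip_address(ip).is_private)
# The first three are private (routable only within a home or campus);
# 8.8.8.8 is a public address out on the internet.
```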
router that’s probably in your home, or the bigger router on your
campus or corporate network, that enables you to have an IP address
that’s only used within the company, only used within your home,
and cannot, by definition, be routed publicly beyond your
company, beyond your home, because the router will stop it. And so here we actually have the
beginnings of a firewalling mechanism, if you will. In the real world, a
firewall is a device that prevents fire from going from
one part of a building to another, for instance. In the virtual world, a
firewall is a piece of software that prevents zeros and ones from
going from one place to another. And in this case we
already have a mechanism via public and private addresses
of keeping some data securely, or with high probability
securely, within our company versus allowing it to
go out on the internet. So we’ll see now some screenshots
of some actual computers from Mac OS and Windows alike that
reveal their IP addresses, and you yourself can see
this on your own machines. For instance, here on
Windows 10 is a screenshot of what your Network Preferences,
so to speak, might look like. And if you focus down here, it’s
a bit arcane at first glance, but IPv4 address is 192168.1.139
when we took that screenshot. And indeed, it starts with
192.168, which means it's private, and indeed, I took this screenshot
while we were within a home network, and so that suggests it can be used
to route among computers in that home but not beyond. You’ll see, too, if we
move on to the next screen where you see more advanced
network properties, you can also see the mention
of this default gateway, which is synonymous with router, default router: 192.168.1.1. So a default router or default
gateway is that first hop, so that if I want to send
an email outside of my home, I want to visit a web page
outside of my company, all I need do is hand that virtual
envelope containing that email or that web request off to the
machine on the local network that has that IP address. I have no idea where it’s going to
go thereafter, to hops two and three and beyond, but that’s why
we have this whole internet and even more routers out there. They, the routers, intercommunicate
and relay that data, hop to hop to hop, until it
finally reaches its destination. Now where did I get
my IPv4 address from, where did I get my default gateway from? From the DHCP server in my home,
in my company, or whatever network I happen to be on. And Mac OS is the same. If these screens are unfamiliar,
you might recognize this, under System Preferences in Mac OS. Here, while connected to
Harvard University’s network, you can actually see that my
IP address was 10.254.16.242. That number, too, starting with one
of those internal or private prefixes, indicative of the fact
that even within Harvard, we're keeping all of our Harvard
traffic internal to Harvard, and not exposing that externally. And indeed, if we look in the
more advanced preferences here, we can see that the router
for my Mac was 10.254.16.1. Which is to say this Mac, when it’s
ready to send something off campus, simply hands that envelope off
to this particular router here. And the router’s job, ultimately, that
first hop, a border gateway or border router, literally
referring to a computer that physically or metaphorically
is on the edge of a campus or company, its purpose in
life is to simply change what’s on that envelope
initially from the private IP address to one or more
public IP addresses, thereby maintaining this mapping. So this might result in everyone
else in the world thinking that I and you and everyone
else in my company or campus are all actually at the same IP
address, but that’s not true. Each of our personal devices
has a private IP address and that router can actually
translate via something called network address translation,
or NAT, from a private address to public and back. And so in this way, too,
a company can help mask the origin or the identity of whoever it is that's
accessing some internet-based service. Of course, that same company could log
what it is that’s leaving the company and coming back in, so via
subpoena or another mechanism, could someone certainly
figure out who was accessing that service at a particular
time, but the outside world would need help knowing that. And so here, even within Harvard,
it’s done perhaps for that reason. But also perhaps in order
to use one public IP address among hundreds or
thousands of university affiliates, so that frankly we just don’t
need as many IP addresses. So what might be both a
technological motivation can also have these policy
side effects as well. So IP itself. Protocol. Well, what does that actually mean? A protocol– it’s not a language, per
se, it’s not a programming language. It’s really just a set
of conventions that govern how computers intercommunicate. IP, specifically, says that if you
want to send a message on the internet, you shall write a sender address on
the envelope and a recipient address on the envelope, and that will
ensure that the routers know what to do with it, and
they’ll send it back and forth in the appropriate directions. IP gives us some other features,
as well, fragmentation among them. It turns out for efficiency if you’ve
got a really big email or a really big file, whether a
PowerPoint file or video file, it’s not really fair to everyone else
to kind of jam that onto the network and to the exclusion of other people’s
data at any given point in time. And so IP tends to fragment
big files into smaller pieces and send them in multiple
envelopes that eventually get reassembled at the other end, so
that there is a compelling feature as well. But this leads, of course, to a
slippery slope of implications for net neutrality and for
companies or governments to actually then start to
distinguish between quality of service of this type of data
and this other type of data. Why can they do that? Well, it’s all quantized at
a very small unit of measure, and within these packets
is additional information. Not just those addresses, but hints as
to what type of data is in the packet. Is it an email, is it a web page, is
it a video conference, is it Netflix, is it some competitor service? And so ISPs or companies or
governments can certainly distinguish among these types of packets
and treat them, theoretically and all too often in reality these days, differently. And those capabilities derive simply
from these basic primitives. Now we can very quickly
go pretty low level. If you actually look back
at the formal definition that humans crafted decades ago for
what IP is, this is how they drew it. You might call this ASCII art,
to borrow a phrase from our look at computational thinking. It’s sort of an artist’s
rendition of some structure just by using the keys
on his or her keyboard. And so these dashes and
pluses, really, just are meant to draw a rectangular
picture, nothing more. The numbers on top represent
units of 10 bits at a time. Here’s bit 0, here’s 10, here’s 20,
and over here is the 32nd such bit. So start at zero, and you count as
high as 31, so that’s our 32nd bit. And we can see a few
details within here. We can see details like
the source address, and it’s the whole width of this
picture indicating that indeed this is a 32-bit value that composes the
source address or the sender address. Destination address is just as
wide, so there’s another 32 bits. There’s options and other time to live. You can specify just how many
routers this can be handed off to before the router should
say, we just don’t know where this destination is, we shall give up. And there’s other
fields as well in here. Now what are we really looking at? This is just an artist’s
rendition of what it means to send a pattern of bits. The first few bits
somehow relate to version. The next few bits relate to IHL and
type of service and total length. Eventually, the pattern
of bits represents source address and destination address. So any computer that’s receiving
just a series of bits wirelessly or over the wire in the form of
wavelengths of light or of electricity on a wire, simply needs to realize,
oh, once I’ve received this many bits, I can infer that those bits
were my source address, those were my destination address. But again, this is so low level, it’s
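To make that inference concrete, here is a short Python sketch that packs and then unpacks a hand-made 20-byte IPv4 header, pulling the source and destination addresses out of fixed bit positions, just as the diagram lays out. The addresses and field values are invented for illustration, not captured from a real network.

```python
# Build a 20-byte IPv4 header and read its fields back out by position.
import struct
import socket

# version/IHL, TOS, total length, id, flags/fragment, TTL, protocol,
# checksum, then source 1.2.3.4 and destination 5.6.7.8 (4 bytes each).
header = struct.pack("!BBHHHBBH4s4s",
                     (4 << 4) | 5, 0, 20, 0, 0, 64, 6, 0,
                     socket.inet_aton("1.2.3.4"),
                     socket.inet_aton("5.6.7.8"))

fields = struct.unpack("!BBHHHBBH4s4s", header)
print(socket.inet_ntoa(fields[8]))   # source address: 1.2.3.4
print(socket.inet_ntoa(fields[9]))   # destination address: 5.6.7.8
print(fields[5])                     # time to live: 64
```

A receiving computer does essentially this: count off bits at known offsets and interpret them as the fields the protocol promises will be there.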
a lot more pleasant to sort of think about things at the virtual level. An envelope that just has this
information written on it, and let’s not worry about an abstraction
level below this one, wherein we get into the weeds of this data. But it turns out that IP is not the
only protocol that drives the internet. In fact there’s several, but
perhaps the other most common one that you’ve heard of is that one here. TCP. Transmission Control Protocol. Now this is just a protocol
that solves a different problem. Rather than simply focus on
addressing computers on the internet and ensuring data gets
from one point to another, TCP is about, among other
things, guaranteeing delivery. TCP adds some additional zeros and ones
to that envelope on the outside of it that helps us get that envelope
to its destination with much higher probability. In other words, the
internet’s a busy place. Servers are constantly
getting new users, routers are receiving any number
of packets at any given time, and sometimes there are
spikes in connectivity. People might all be tuning into
some news broadcast online streaming lots of video, or downloading
the latest news all at once, or everyone’s playing
the latest game online, and so there can be
these bursts of activity. And honestly humans don’t necessarily
engineer with those bursts of activity in mind, and so routers get
busy, computers get busy. And when they get busy, they might
receive an envelope of information and realize, wait a minute, I
don’t have enough hands for this, and packets get dropped, so to speak. In fact that’s a term of ours, to
drop a packet just means to ignore it. You don’t have enough memory,
enough RAM inside of your system to hang onto it for any length
of time, so you just ignore it. Now this would be pretty darn
frustrating if you send an email and only with some probability
does it go through. Now in practice that might
feel like it happens, especially when things get
caught up in spam and the like, but in practice you really do want
emails that are sent to be received. When you request a web page,
you want the entire web page. And even if those are big emails
or big web pages that are therefore chopped into fragments, you really
want to receive all of the fragments and not just only some of
the paragraphs in the email, or only some sections of the web page. So TCP ensures that you get all of
that data at the end of the day. Well hopefully not at the end
of the day, but ultimately. And so what TCP adds to the envelope
is essentially a little mental note that this is packet number one of
two, or one of three, or one of four, in the case of an even larger file. And so when the recipient of this email
or this web request gets the envelope and realizes, wait a minute, I’ve got
numbers two and three and four, wait a minute, I’m missing
the first envelope. TCP tells that Mac or
PC or other computer, go ahead and send a message
back to the sender saying, hey, I got everything except
packet one, please resend. That’s going to take a
little bit of extra time, but that packet can be
resent and TCP knows how to reassemble them
in the proper order so that the human ultimately sees their
entire email or that entire web page and not just some portion thereof. So what does TCP really look like? Well, let’s just take a quick peek
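The reassembly idea can be sketched in a few lines of Python. The fragments and packet numbers here are invented for illustration; real TCP uses byte-level sequence numbers rather than simple packet counts.

```python
# Packets arrive numbered but out of order, and one is missing.
# The receiver notices the gap, asks for a resend, then sorts by
# sequence number to rebuild the original message.
received = {2: "wor", 3: "ld", 1: "Hello, "}   # packet number -> fragment
expected = {1, 2, 3, 4}

missing = expected - received.keys()
print("please resend:", sorted(missing))       # please resend: [4]

received[4] = "!"                              # the retransmitted packet
message = "".join(received[n] for n in sorted(received))
print(message)                                 # Hello, world!
```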
underneath the hood here, too. And here we see a similar pattern of
bits but not addresses, that, again, is handled by IP itself, but
you see mention of source port, and destination port. Sequence number, which
helps with the delivery, and then other options
as well, all of which relate to the delivery
of that information. But these two up here look
like 16 bits each, source port and destination port. Those two have value, because
TCP does something else. It doesn’t just guarantee that data
gets from one point to another, it also helps servers distinguish one
type of data from another, and in turn allows companies and universities
and internet service providers or governments to distinguish
different types of data because it’s right there on
the outside of the envelope. In particular, TCP
specifies what protocol is being used to convey this packet
of information from one computer to another. In other words, there’s lots of
internet services these days. There’s email, there’s chat,
there’s video conferencing, there’s web browsers, and more. So that’s a lot of possibilities,
a lot of patterns of zeros and ones that can be in these envelopes. So how, upon receiving an envelope, does
a server know what type of information is in it? Especially big companies. Google, for instance, supports
all of those services. Video conferencing,
email, chat, and more. So when Google’s servers
receive a packet of information, how does Google know that
this is an email from you, as opposed to a chat message from
you, as opposed to a video from you that you’re uploading to YouTube? You need to be able to distinguish
these various services because at the end of the day,
they’re just patterns of bits. Well, if we reserve some
of those bits, or really some of the markings on this virtual
envelope, for just one more number we can distinguish
services pretty easily. In fact, HTTP, an acronym
that you might not know what it means but
you’ve surely seen it a lot, since our hypertext
transfer protocol and it’s the conventions via which browsers and
servers send web pages back and forth. Well, by convention,
humans decided years ago to call that service number 80. TCP port 80, so to speak. And the secure version of that, HTTPS,
they decided to number that 443, just because they’d already used quite
a few numbers in between those two values. IMAP is the protocol via which
you can receive emails or check your email. It uses different ports,
like 143 or 993, depending on whether you're using it insecurely or securely. SMTP, which is outbound email,
can use similarly 25, 465, or 587. And then, if familiar, there’s
something called SSH, secure shell. This is what developers
might use at a lower level to connect from one computer,
say a laptop, to a remote server. That tends to use port 22. And there’s hundreds,
there’s actually thousands of others, as many as 65,000
possibilities, but only some of those are actually standardized. So this is to say what
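Many of these conventional port numbers are registered with the operating system itself, so, assuming your system ships the usual services database, you can look them up by name:

```python
# Look up well-known TCP ports by service name from the OS's
# services database (e.g., /etc/services on Unix-like systems).
import socket

print(socket.getservbyname("http", "tcp"))    # 80
print(socket.getservbyname("https", "tcp"))   # 443
print(socket.getservbyname("ssh", "tcp"))     # 22

# Ports are 16-bit numbers, hence 65,535 possibilities:
print(2 ** 16 - 1)
```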
ultimately is going on the outside of an envelope
is not just a user’s address but when I as a computer send
a message to some other server and for instance my address
is 5.6.7.8 I’ll write that in the top corner of the envelope. If the recipients of this envelope
are supposed to be 1.2.3.4 I do write that in the
middle of the envelope, but I need to further specify IP
address 1.2.3.4 but port number, let’s say, 80, if it’s a
request for a web page. So conventionally you would do
:80 to distinguish that service. And then of course because of TCP
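In Python's socket API, this pairing of IP address and port number is written literally as a tuple, mirroring the 1.2.3.4:80 notation. The address below is hypothetical:

```python
# A TCP destination is an (IP address, port) pair.
import socket

destination = ("1.2.3.4", 80)   # hypothetical web server: IP, then port
ip, port = destination
print(f"{ip}:{port}")           # 1.2.3.4:80

# A real connection would then be:
# s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# s.connect(destination)        # TCP handshake to port 80 at that IP
```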
I need to number these things, so if it’s a big request or a big
response I better write one of two, one of three, or one
of four, or the like. And so the envelope I’m
ultimately left with is something a little more like this. On the outside is this recipient’s
address, on the outside is the sender’s address,
and on the outside is the sequence number of
some sort that specifies how many packets I’ve actually sent
and hopefully will be received. So TCP then allows the recipient to
see this envelope, realize, oh this is for my web server. Google can hand it off to the
appropriate piece of software that governs its web
servers and so it’s not confused for something else like
an email, a chat message, a voice conference, or the like. And again, all of these
features derive quite simply from these patterns of
bits that esoterically happen to be laid out in this way, but
if we abstract away from that and stipulate that just think about it
like the real world with an envelope, it’s really just these numeric
values that somehow help us get data from one point to another. Collectively now, these two protocols,
which are so often used hand in hand, are generally abbreviated as TCP/IP. It's two separate protocols,
two separate conventions used in conjunction. Some of this information is just
written in different places, if you will, on the virtual
envelope, but TCP/IP settings are what you might look for
on a Mac or PC or server to actually configure
this level of detail. But of course, I’ve taken
some liberties here. If my goal is to send a
message from one computer to another, a chat message,
an email, anything else, you know what, I’m pretty
sure I have no idea what the IP address is of any colleague. And I have no idea what the IP
address is of Google or Facebook or any number of popular websites
that I might even visit daily. I don’t even know people’s phone numbers
anymore but that’s another matter. In the context of words, though,
on the internet all of us, of course, type words, not numbers,
when we want to reach some destination. We go to facebook.com or gmail.com
or google.com or bing.com or any number of other
domain names, so to speak. And of course, that’s
what you would write on the outside of an
envelope in the human world, ideally as many words as possible,
not just numbers let alone bits alone. And ideally our
computers would similarly express exactly what we humans
know, which is these domain names that are part of URLs. So it turns out we need the help of
at least one more service among all of these internet technologies. We need the help of a service
called DNS, domain name system. A DNS server is a server that quite
simply translates domain names like gmail.com and bing.com and
google.com into their corresponding IP addresses. We, the humans, might have
no idea what they are, but odds are there’s at least one
human or more in the world, probably who works for those
companies, that does know. And provided he or she
configures their DNS servers to know that
association of domain name to IP address, the equivalent of just an
Excel file with one column of names and the other column of numbers, IP
addresses, well, their server can then answer questions from little old me. And indeed what my phone knows how to
do these days, what my Mac, my PC knows how to do is when my human types
in gmail.com and hits enter, the very first thing that my
browser, and in turn my operating system like Mac OS or Windows
does, is it asks the local DNS server for the conversion of
whatever I typed in, gmail.com, to the corresponding IP address. And hopefully, my own network be
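You can ask that same question programmatically with Python's standard library. Looking up a real domain requires a working network connection and DNS server, so the runnable line here sticks to localhost, which resolves locally:

```python
# Ask the system's resolver to convert a name to an IPv4 address,
# just as the browser does before making any connection.
import socket

# "localhost" resolves even without a network connection:
print(socket.gethostbyname("localhost"))      # 127.0.0.1

# For a real domain (requires internet access and a DNS server),
# the answer will vary by time and location:
# print(socket.gethostbyname("gmail.com"))    # e.g., 172.217.3.37
```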
it at home or on campus or in work, has the answer to that question. But the world’s a big
place, and odds are my home does not know the IP address
of every server in the world. Odds are my campus or company doesn’t
know the IP address of every server in the world, especially since
they’re surely changing continually as new sites are coming online
and others are going offline. So how do we know? Well DNS is a whole
hierarchical system whereby you might have a small DNS server, so
to speak conceptually here on site. But then your internet service
provider or ISP, Comcast, Verizon, or some other entity, they probably have
a bigger DNS server with more memory, with a longer list of domain
names and IP addresses. And you know what, even if
they don’t know everyone, there are probably what are
called root servers in the world, that much like the root of a
tree, is where everything starts. And indeed, you can find
out from these actual root servers on the internet,
the mapping, effectively, between all of the dot coms
and their IP addresses. All of the dot govs or the dot
nets and their IP addresses. And frankly, even if they don’t know
the answer by definition of root server they will be configured
to know who knows. And so DNS is very hierarchical,
and it’s also recursive. You might ask a local server, which
might ask a more remote server, which might ask an even further away server. That server might say, wait a minute,
I know, this server knows, and then the answer eventually
bubbles its way back to you. And long story short,
we can be efficient. We don’t have to constantly
ask this question. We can cache those results locally. Remember them in my
browser, in my Mac or my PC. There’s downsides there, though, too. By remembering that mapping
of domain name to IP address, I can save myself the trouble of asking
that same question multiple times a day or even per week or even per minute. The catch, though, is that if Google
changes something, or Facebook reconfigures something
and that IP changes, caching might actually be a bad thing. And so here, too, even at
the level of the internet do we see these series of trade-offs. You might save time by caching, but
you might sacrifice correctness, because now the server's recollection of
that IP address might become outdated. And so this is a whole
can of worms, ultimately, and speaks to what it really
means to be an engineer in the world of internet technologies
to anticipate, to think about, and ultimately to solve these problems. There is no surefire solution other
than to expect that you’ll need to accommodate these changes over time. So in Windows, can
you see this yourself? Well, if you open up those
same Wi-Fi properties or wired properties that you have, you’ll see
again, not only your IPv4 address but also, there all this
time, your IPv4 DNS servers, one or more IP addresses. It turns out that on this computer,
by coincidence but also by design, it's exactly the same as
my router or my default gateway, 192.168.1.1. Which is to say that if this PC needs
to know the answer to the question, what is gmail.com's IP address, it is
simply going to ask the local server that has that address. And that
DNS server, and this is important, cannot itself have a name. We need to know what its IP
address is, otherwise, of course, we get into an endless loop. If we know only the name of our
DNS server but only the DNS server can convert that to an IP address, we’ll
never actually answer that question. It’s more of a catch-22. And even if it does
have a name, you need to learn manually, or via your DHCP server
somehow, what its IP address actually is. Mac OS, the same. And here on campus, Harvard happens to
have redundancy like most any company. They don’t have just one DNS server;
they have at least three here, 128.103.1.1, and a couple
of others, as well. And again, I got these automatically
when I turned on my Mac or my phone or my PC via that local DHCP server. So let’s see if we can’t
mimic what it is my Mac, your PC, your phone is doing
every day, all day long, but rather unbeknownst to us. Here I have what’s
called a terminal window. This is just a textual
interface to my computer here. It can exist on Macs, or PCs, or
other operating systems, as well. And it allows me to execute
by typing commands textually, only at my keyboard, no mouse,
exactly the types of commands that your browser and other software
are effectively executing or running for you. For instance, suppose
I genuinely do want to know the IP address of gmail.com. I can ask this program as follows. nslookup, for name server look
up, and then I can go ahead and type literally
gmail.com and hit Enter. Here, visually, we see on the screen
one answer that it’s 172.217.3.37. And this comes from a server
whose IP address in this room is 10.0.0.2, which we know now
to be a private IP address, and indeed, here on
campus we have servers that are local only to this room, this
building, or this set of buildings here. Now this is a little
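Incidentally, you can check whether an address like 10.0.0.2 is private yourself; Python’s standard ipaddress module knows the reserved ranges, and this small sketch just wraps it:

```python
import ipaddress

def is_private(ip):
    """True for addresses reserved for local networks, like 10.x.x.x or 192.168.x.x."""
    return ipaddress.ip_address(ip).is_private
```
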
interesting because I’m pretty sure business is good
for Google, and surely they don’t have just one server
and therefore one IP address. Well, it turns out that there’s a whole
hierarchy of servers out there, most likely, that my data goes to and
thereafter through on Google’s end. The one IP address that they’re
telling me is theirs is 172.217.3.37, but once my packet of information gets
there to Mountain View, California, or wherever their servers
happen to be closest to me, then they might have any number of
servers, dozens, hundreds, thousands, that can actually
receive that packet next. This just happens to be the outward
facing IP that my own Mac or PC or phone actually sees. Well, let’s see if we can’t trace the
route to gmail.com via another command, literally traceroute, whereby I can see
the packets of information line by line leaving my computer and making
their way, ultimately, to Google. I’m going to go ahead and
do this once, so dash q1 means do one query, please, at a time. And then I’m going to go
ahead and say, quite simply, gmail.com, and then Enter. And we will see, line by line,
the sequence of IP addresses of every router, that is to
say every hop, between me and Gmail. On occasion we’ll see
these asterisks instead, which indicates that that
router isn’t having any of this, it’s not responding to my requests,
so we can’t see its IP or anything else about it. But we can see that in 17
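Under the hood, traceroute discovers one router per probe by sending packets whose time to live, or TTL, is 1, then 2, then 3, and so on; the router at which the TTL expires reveals itself, or stays silent. The following is just a toy simulation of that logic in Python, with a made-up list standing in for the real path:

```python
def trace(path, max_hops=30):
    """Simulate traceroute over a known list of routers.

    Probe 1 has TTL 1 and so expires at the first router, probe 2 at the
    second, and so forth. A router of None models one that declines to
    respond, which traceroute prints as an asterisk."""
    hops = []
    for ttl in range(1, min(max_hops, len(path)) + 1):
        router = path[ttl - 1]  # the router sitting ttl hops away
        hops.append(router if router is not None else "*")
    return hops
```

For example, trace(["192.168.1.1", None, "172.217.3.37"]) reports the home router, an asterisk for the silent hop, and then the destination.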
steps does data leave my laptop and end up at gmail.com,
and along the way it encounters all of these routers that
have these unique IP addresses but not names, it seems, and
the amount of time it takes for my data to get
from my laptop to gmail.com is, oh my, 0.967 milliseconds. Less than one millisecond is
required to get data or an email from my computer to gmail.com itself. Now what about all of these other
measurements of time up above? Each of these represents
the number of milliseconds it took during this process for data
to go from my laptop to this router, then to this router,
then to this router. Now, of course, it seems
strange that it takes more time to get to these close routers
than it does to those farther away. But there, too, if I
ran this all day long I would get different
numbers continually, it depends how busy those routers
are at that moment in time. It depends what else
everyone here on campus is doing, or other people in the
world at that moment in time. Routers might be a little slow
to respond because they’re busy doing something else. My data might get
dropped in other contexts and need to be resent, which
is just going to take time, and I don’t even see that
happening on the screen. But it’s fair to say that these give
us a sense of the range of times it might take to go from
a point A to point B, and let’s say 1 to 20 milliseconds
or even 32 milliseconds, somewhere in there is our average,
and that can vary over time. But that’s pretty fast, and indeed,
even though it took a moment to run this whole test, this is why an
email can be sent from your computer and be received nearly
instantly by someone around the world, because at the end
of the day, we’re limited, really, ultimately, by the speed
of light and little more. Well, to be fair, hardware and
cost and everything in between, but you can certainly transmit your
data faster than you yourself can travel. But what if we want to go
farther away than gmail.com? Odds are they probably do
have servers in California, but probably here on the east coast
of the US as well, let alone abroad. What if I deliberately try
to access a domain that is, in fact, abroad and go there? Well, let me go ahead
and visit via traceroute, say, www.cnn.co.jp, the domain
name for CNN’s Japanese website. And then we’ll add
just dash q1 this time at the end, which is fine, too,
to query the server just once. And here we see the
sequence of steps, one after another, whereby the data’s
leaving my laptop and in turn campus, and then we see some
anonymous routers in between. But the 30th hop there seems to be
just in time, because within, it seems, 178 milliseconds do
we make our way to Japan. Now that’s quite a few milliseconds
more, but that rather makes sense. Whereas it might take one
to 20 to 32 milliseconds to get from here to Gmail either
on the east coast or west coast, I’m kind of not surprised that
it takes an order of magnitude more, almost a factor of 10, to get
to Japan, because there’s not only a whole continent between us
here in Cambridge and Japan, there’s also an entire
Pacific Ocean between us. And indeed, there are transatlantic,
transpacific, and other transoceanic cables all around the world these days
that actually transmit our data, not to mention all of the wireless
technologies we have, satellites and below. And so it does stand to reason that
even though none of these routers were paying attention to me at
that moment for privacy’s sake, this last one indicates that 200
milliseconds later we can get halfway across the world digitally. And so that does rather speak to
just how quickly these low level primitives operate, and
we can talk far longer about how these things work
than it actually takes to get the data there. So then, together, we have
TCP/IP. Via DHCP can we get the addresses that we need to use
to address my envelopes and others, as well. Via DNS can we convert those domain
names into IP addresses and even back. And those internet
technologies are ultimately what govern how our data gets from point
A to point B. But what is the data? Indeed, everything thus far
is really just metadata. Information that helps our actual
data that we care about get from one point to another. But it’s the data at the end of
the day that I really care about. The contents of my email, the
contents of my chat message, the voice that I’m sending over
a video conference, or even just the contents of a web page. Indeed, perhaps the most popular
service that you and I use every day is just that, pulling
up pages on the web. So just how is a web page
specifically requested and received? Well, it turns out that http://
that you’ve surely seen, but probably not typed for some
time, because your browser, odds are, just inserts it automatically
or even invisibly for you. That HTTP is yet another protocol in
this stack of internet technologies. Hypertext transfer protocol. A set of conventions that browsers and
web servers have agreed upon long ago to use when intercommunicating. And to be clear, then,
what exactly is a protocol? Well, it’s just a convention. We humans have protocols even
though we might not call them such. When I meet someone new on the
street I might reach out to him or her and say, hello, my name is David. And that protocol results
in that other person, if polite, in extending their
hand too, reaching into mine and probably saying as well, hello,
nice to meet you or how are you. That’s a human protocol that
we were taught some time ago, and culturally we have all agreed
here in the US to, generally speaking, greet each other in that manner. Computers, similarly,
have standardized what goes not only on the outside of these
envelopes but what goes on the inside, as well. And so if, for instance,
the goal at hand is to request a web page of a
canonical website like www.example.com, let’s consider exactly what
is inside of this envelope. Well, first of all here we have a
proper URL, uniform resource locator. These days, your browser, whether it’s
Chrome or Safari or Edge or Firefox, probably doesn’t even show
you all of this information. In the interests of simpler
user interfaces or UIs, browsers have started to hide
this so-called protocol here at the left, even the www here,
the hostname in the middle, leaving you oftentimes with just
example.com or the equivalent somewhere at the top of your screen. But if you click on that
address, typically you’ll see more information such as that here. And sometimes there’s more
information that’s just implicit. It turns out if you try to
visit http://www.example.com or any similar domain name, what
you’re likely reaching for is a very specific file on that server. But how do we reach it? Well, highlighted in yellow here is
what’s called the domain name itself, example.com. This is something that
you buy, or really rent, on an annual basis via an internet
registrar, a company, that via the associations on the
internet that govern IP addresses and domain names has been
authorized to sell, or really rent, you and anyone else a domain
name for some amount of time, usually one year or two
years or 10 or anywhere in between, for some dollar amount. And what you get, then, is the
ability, for that amount of time renewable thereafter, to use
that specific domain name. It might be dot com,
or dot net or dot org, or any of hundreds of other
TLDs, or top level domains. Indeed, that suffix there is what
represents the type of website, at least historically, that it is. Dot com for commercial, dot net
for network, dot edu for education, or dot gov for government. Of course, all of those
TLDs, or top level domains, were very US centric by
design, insofar as it was generally a cohort of Americans
that designed a lot of this system initially. Of course, other countries
have shorter TLDs. Country codes, dot US, dot
JP and others that signify the specific country they’re in. And these days anyone can
buy a dot com or dot net, but not everyone can buy a dot gov or
dot edu, or several other top level domains, as well. It depends on whoever controls
that particular suffix. This here we might call the hostname,
the name of the specific server that you were trying to visit that
lives within that domain name. In other contexts, you
might call this a subdomain, indicating what subdivision of a
company or university you’re actually trying to access. And then down here on the right,
implicitly so to speak, is a file name. It is human convention, but
not required, that the name of the file that contains the web page
that a server serves up by default, happens to be traditionally index.html. It could also be index.htm or any
number of other names or extensions, but this is among the most common. So if you don’t mention a
file and type just a slash, it’s implied, and it’s that file, or
any other file that’s implied or even specified explicitly,
that is inside of this envelope. That’s the whole point of this
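You can take a URL apart into exactly these pieces, protocol, host, and file path, with Python’s standard urllib; the URL here is of course just the canonical example from above:

```python
from urllib.parse import urlparse

parts = urlparse("http://www.example.com/index.html")
protocol = parts.scheme  # "http"
host = parts.netloc      # "www.example.com", hostname plus domain name
path = parts.path        # "/index.html", the file being requested
```

And a bare trailing slash parses to a path of just "/", the case where the server falls back to its default file.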
virtual packet of information, to encapsulate the request for a
page and the actual page itself. At the end of the day, it’s
HTML, Hypertext Markup Language, an actual language in which pages are
written, that’s inside that envelope, but it’s transmitted there via HTTP. The protocol, the set of conventions
via which browser and server agree to send and receive that information. So what does that information look like? And just what have these
computers agreed on? It turns out that
inside of this envelope, when it represents a request for
a web page like my URL there, are these lines here. GET / HTTP/1.1, where
GET is clearly a verb, by definition in all
caps in this protocol, and slash means the default page of the
website, index.html or something else. And then often a mention of host
colon and then the name of the host that you’re actually looking for. Because it turns out servers
can do so many things. Not just Google servers with
voice and chat and other services, one web server can actually
serve up multiple websites. Example.com, acme.com,
Harvard.edu, google.com, all of us can actually have shared tenancies, so
to speak, on the same server in theory. And so by mentioning what actual
website you want inside of the envelope, the recipient of this envelope can make
sure that it serves you my home page and not someone else’s. But beyond that, there needs to be
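That decision, which of several co-hosted sites to serve, can be sketched as little more than a lookup keyed on the Host header; the hostnames and pages below are invented purely for illustration:

```python
# Hypothetical table mapping Host header values to each tenant's home page.
SITES = {
    "www.example.com": "<html>example's home page</html>",
    "www.acme.com": "<html>acme's home page</html>",
}

def serve(host_header):
    """One server, many websites: the Host header decides whose page goes back."""
    page = SITES.get(host_header)
    if page is None:
        return "404 Not Found", None
    return "200 OK", page
```
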
additional information, as well. You might explicitly specify
the name of the file. And again, we humans have nothing
to do with any of this, ultimately, we have just typed that URL. But it’s our browser, on
Mac OS or Windows or phones, that’s packaging up this information
inside of a virtual envelope and sending it out,
ultimately, on our behalf. And indeed, if all goes well and
that envelope reaches point B and it’s opened up and it represents
the name of a web page that does, in fact, exist, the response
that I hope to get back in another envelope
from point B to point A is going to contain an
HTTP message like this. Literally the name of the protocol
again, HTTP/1.1, and then a number, and optionally a phrase. 200 is perhaps a number
you’ve never actually seen, even though it is the best
possible response to get. 200 means, quite literally, OK. The web page you
requested has been found and has been delivered in
this response envelope, OK. The type of content you’ve
received is in this case text/html. Which is to say inside of that
envelope is a clue to your browser what kind of content is inside deeper. Is it text/html, like the
contents of a web page? Is it an image/png like a graphic,
or image/gif, something animated, or video/mp4, an actual video file,
this so-called MIME type or content type is inside of the envelope for
your browser so as to provide a hint, so that it knows how to
display it on the screen. There’s so many other headers,
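Servers commonly guess that content type from the file’s extension, a convention that Python’s standard mimetypes module happens to encode as well, as this small sketch shows:

```python
import mimetypes

def content_type(filename):
    """Guess the Content-Type header a server might send for this file."""
    mime, _ = mimetypes.guess_type(filename)
    return mime
```
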
as well, but these two alone really specify almost
as much information as you need in order to render
that response for the user. Now as an aside, there
are other versions. And increasingly in vogue,
though not yet omnipresent, is HTTP/2, which has additional
features, particularly for performance and getting data to
you even more quickly. It simply replaces that 1.1
with a 2, and the response, though, comes back almost the same. So let’s consider an example
then, such as harvard.edu. It turns out that http://harvard.edu
is not where Harvard wants you to be. In fact, let me go ahead
and pull up my browser here and visit precisely that URL. http://harvard.edu, Enter. And within seconds do I find myself
not at harvard.edu, but rather at www.harvard.edu and moreover
at https://www.harvard.edu. In other words, even though I
specified a protocol of HTTP, a domain name of harvard.edu,
and no hostname, so to speak, I have actually been whisked
away, seemingly magically, to this URL instead, for reasons both
technical and perhaps marketing alike. For today, though, let’s focus
on exactly how this came to pass. Well, it turns out that inside of
the envelope with which Harvard, or any server, replies to me can
be additional metadata, as well. Not just 200 OK, but really
the equivalent of uh-uh, there’s nothing to see
here, go here instead. So let me go ahead and run a program,
again in that black and white window known as my terminal
window, whereby I can pretend to be a browser
without all of the graphics and without all of the
distraction and focus only on the contents of
those digital envelopes. Here the program I’m going to run
is called curl for connect to a URL, and I’m going to specify dash I which
is to say I only want the HTTP headers. I’m going to go ahead now and say
http://harvard.edu, nothing more. When I hit Enter now, here
are the complete headers that come back from the server. No dot dot dot this time, we
see everything, in fact, here, but notice the first line. It’s not 200 OK, but rather
301 moved permanently. Like, where did Harvard go? Well, it turns out that Harvard
has specified its new location down here as https://www.harvard.edu. Now there’s other
lines of headers there, HTTP headers as they’re called,
each of which starts with a word, perhaps with some punctuation, and
a colon, followed by the value. Location, value, go to this location
is the general paradigm there. But why might Harvard not want to
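That name, colon, value shape makes headers easy to take apart mechanically. Here’s a minimal Python sketch that splits a raw response into its status line and a dictionary of headers; the sample text in the usage below is abbreviated for illustration, not Harvard’s complete response:

```python
def parse_response(raw):
    """Split raw HTTP headers into the status line and a dict of name/value pairs."""
    lines = raw.strip().splitlines()
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip()] = value.strip()
    return status_line, headers
```

A browser receiving a 301 would then look up headers["Location"] to learn where to go instead.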
show me their web page at the address that I typed? Well, it turns out that HTTP
is by definition insecure. The message, to the extent
it’s encoded at all, is quite literally in English
or English-like syntax, such as what we’ve been looking at here. It’s just that text that’s
inside the envelope. If instead, though, you want to encrypt
those contents so that no one knows what web page you’re
requesting or receiving, and your employer and your university
administrator or your internet service provider or country does
not know what you’re doing, be it for personal reasons,
financial, or otherwise, well then you want to use HTTPS. And Harvard University, like
so many companies today, is insistent that you
actually visit them securely, if only because it’s best
practice, but it also prevents potentially private
information from leaking. And so here with this location
line is Harvard saying, no, we will not respond
to you with OK via HTTP, we have moved permanently to
a secure address at HTTPS, where the S denotes secure. But why the www? Back in the day, you
probably did have to type for many companies, www.example.com
instead of just going to example.com and hoping that you end
up in the right place. Well, humans have gotten more
comfortable with the internet over the past years, over
the past decades, and indeed, whereas years ago, in order to advertise
yourself effectively on the web, you might have indeed needed to
go to press on your business card or advertisement with
http://www.something.com. But all of us have kind of seen
HTTP enough, if not HTTPS as well, that you don’t need to tell me to type it. And indeed, my browser no
longer requires me to type that, so now you see business
cards and advertisements with just www.something.com. But you know what, I’m
not new to the internet. I know what www is, and I
know what dot com is as well, don’t even bother showing me or telling
me on your card or your website or ad that it’s www.something.com,
just tell me something.com. And so browsers have been
getting more user friendly and humans have been
getting more familiar, and so we tend not to see
those prefixes anymore. But it turns out that for technical
reasons, for security reasons, it tends to be useful
to have a subdomain. As an aside, for things
like cookies, it’s useful to keep cookies
in a subdomain as opposed to the domain itself just to narrow the
scope via which they can be accessed. But also for marketing
sake, it would be nice if everyone in the world, whether they
type harvard.edu or www.harvard.edu, ultimately ends up in the same
location just because that’s how we want to present ourselves to the world. And so for both technical and marketing
and security reasons alike might Harvard or a company want to
redirect to a URL like this one here. Now what does your browser know to do? Well, when your browser receives
not 200 OK, in which case it just shows you the page, but
instead receives 301 moved permanently, it instead looks for that location
line and takes you there instead, at which point then
you’ll get that 200 OK. And so this, again, is what browsers do. HTTP is what they understand, and
know by definition of that protocol how to handle these cases. But not everything is always OK and not
always has something moved permanently. Sometimes something’s just not found. And in fact, of all of these
numbers we’ve seen thus far, odds are you’ve not seen or cared
about 200 or even 301, but most of us have probably at least once seen 404. Why? Why in the world is that
the number we somehow see anytime you visit a web page that’s
gone, or anytime you mistype an address and you reach a dead end? Well, for better or for worse the
designers of websites for years have exposed this value to end users
even though it’s not all that useful. But it’s indeed the
unique value that humans decided some years ago would uniquely
represent the notion of a page not being found. So if inside of that virtual envelope
comes back a message 404 not found, the browser can say that
literally or perhaps display a cute message to that effect, but
the reason that you’re seeing that 404 is because quite literally
and mind-numbingly, that is just the low-level status code that
has come back from an HTTP server. And there’s more of these, too. In fact, 200 OK is the
best you might get. 301 moved permanently we’ve seen. 302 found is another
form of redirection, but a temporary one instead. 304 not modified is a response that
a server can send for efficiency. If you visited a web
page just a moment ago and you happen to hit
reload or click on a link and get back the same content again,
it’s not terribly efficient or good business for a company to incur
the time and perhaps financial cost to retransmit all of those bits
to you, and so it might instead respond with an envelope more
succinctly with 304 not modified without anything else deeper in that
envelope, no additional content. And so this way your browser will just
rely on its own cache, its own copy, so to speak, of the original request. Meanwhile, if you’re not allowed to
visit some web page because you’ve not logged in or you don’t
have authorization there, too, well 401 unauthorized
might instead come back. As might 403 forbidden. 404 not found means
there’s just nothing there. 418 I’m a teapot was an
April Fool’s joke some years ago where someone went to the
lengths of actually writing a formal specification for what a server
should say when it is in fact a teapot. But the worst error you might see,
and most users would never see this, but developers of software would
is five zero zero, 500, which represents an internal server
error, and almost always represents a logical or a syntactic error in
the code that someone has written, be it in Python or any
number of other languages. And now a fun example, perhaps,
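A browser’s handling of all of these codes boils down to a small decision table. This Python sketch pairs the codes just discussed with their standard reason phrases and a simplified next step; the function and its return values are my own framing for illustration, not anything mandated by the protocol:

```python
# The status codes discussed above, with their standard reason phrases.
REASONS = {
    200: "OK",
    301: "Moved Permanently",
    302: "Found",
    304: "Not Modified",
    401: "Unauthorized",
    403: "Forbidden",
    404: "Not Found",
    418: "I'm a teapot",
    500: "Internal Server Error",
}

def next_step(status, headers):
    """What a browser, roughly speaking, does next for a given status code."""
    if status in (301, 302):
        return ("redirect", headers.get("Location"))  # go where Location points
    if status == 200:
        return ("render", None)  # show the page in the envelope
    return ("error", REASONS.get(status, "Unknown"))
```
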
to bring all this home. It turns out that safetyschool.org
is an actual address on the web. And indeed, it happens to
have been bought or rented for years now by some Harvard alum. And indeed, if you
visit safetyschool.org, you shall find yourself
at this website here. http://safetyschool.org. We find ourselves whisked
away to www.yale.edu. But how is that implemented? Well, let’s again turn
to our terminal window, where we can see really the
contents of that virtual envelope. And if in here in my terminal
window I again type curl dash I, http://safetyschool.org,
well I see all of the headers that are exactly coming back. And indeed, here, safetyschool.org
has permanently moved for years now to this location, http://www.yale.edu. A fun jab at our rivals that
some alum has been paying now for years on an annual basis. So we now have a pair of
protocols, TCP and IP, via which we can get data,
any data, from point A to point B on the internet. Sometimes that data is itself HTTP
data that is a request for a web page or a response with a web page. But what if there are so many others
trying to access data at point B– that is to say, business is good, and
a web server out there is receiving so many packets per second that the
server cannot quite yet keep up? The routers in between might very well
be able to handle that load perfectly because those are much bigger servers,
conceptually and physically, with far more CPUs and RAM and
therefore can handle that load, but some business’ server out
there is only finite in capacity. And so what happens when you need
to scale to handle more users? Well, you might have
initially just one server such as that Dell server pictured here. This is what’s called
a rack server, insofar as it’s designed to exist on a rack
that you slide this thing into, and it happens to be one
rack unit or 1.75 inches, which is simply a
standardization thereof. Inside of this rack server
is its hard drive, and RAM, and CPU, and more pieces, but
it’s exactly the same technology that you might have in
a box under your desk or even in the form factor of a
laptop, just bigger and faster. And, to be fair, more expensive. But it’s only so big, indeed,
it’s only 1.75 inches tall and some number of inches deep,
which is to say there’s only a finite amount of RAM in there. There’s only a fixed
number of CPUs in there, and there’s only so many gigabytes,
presumably, of disk storage space. At some point or other, we’re
going to run out of one or more of those resources. And even though we’ve not really gotten
into the weeds of how a server handles and reads these envelopes,
it certainly stands to reason that it can only
read with finite resources some finite number of
packets per unit of time, be it second or minutes or days. And so at some point
if business is booming, we might receive at any
given point more packets of information than we can handle and
indeed, like some routers, if they’re overwhelmed, we might just
drop these incoming packets, or worse yet, not expect them and
just crash or freeze or somehow behave unpredictably. And that’s probably not
good for our business. So how can we go about
solving this problem? Well, the easiest way, quite simply,
is to scale vertically, so to speak. That is, don’t use that server. Instead, buy one that’s bigger with more
RAM and more CPUs and more disk space and faster internet
connectivity, and really just avoid that problem altogether. Why is this compelling? Well, cost of it aside, you
don’t have to change your code, you needn’t change your
configuration in software, you need only throw hardware and in
turn, to be fair, money at the problem. Now that in and of itself might alone
be a deal breaker, the money alone, but at some point if we want to handle
that business, we’ve got to scale up, but even this is shortsighted. Because at the end of the day, Dell only
sells servers that operate so quickly and have so much disk space. Those resources, too,
are ultimately finite. And while next year there
might be an even bigger version of this same machine
out there, this year you might have the top of the line. So at some point, one server,
even with so many resources, might not be able to handle all of the
packets and business you’re getting. So what do you then do? Well, there is an opportunity to
scale not vertically, so to speak, but horizontally instead. Focusing not on the top
tier machines, but instead, two of the smaller ones, or as needed
three or four or more of the same. In other words, spending
lower on that cost curve, getting more hardware,
hopefully, for your money, but such that the net effect
is even more CPU power and more disk space and more RAM than
you might have gotten with that one souped up machine itself. And heck, if you really
need the capacity, you can buy any number
of these big servers, but you do somehow ultimately
have to interconnect them. And here now is where
there’s a trade-off. Whereas money was
really the only barrier to solving this problem initially,
though easier said than done, now we have to re-engineer
our system, because no longer are packets of internet data
coming in from our customers and ending up in one place,
they now have to somehow be spread across multiple servers. So how might we do this back in the day? Well, back in the late 90s, when
Larry and Sergey of Google fame built out their first
cluster of servers, they didn’t have those
pretty Dell boxes, rather, this was, now in Google’s
museum, reportedly one of their first racks of servers. Notice there’s no shiny
cases, let alone logos, but instead, lots of circuit boards on
which are hard drive after hard drive after hard drive and suffice to say
so many wires connecting everything. And even though, ironically, this
picture seems to be vertical, this is, perhaps, one of the earliest
examples in our internet era of scaling horizontally. Each of these servers, which is
represented by each of these boards, is somehow interconnected in
such a way that those servers can intercommunicate. But how? Well, let’s consider, with
the proverbial engineering hat on, how servers might
somehow intercommunicate. If up here, for instance, is
just some artist’s rendition of someone’s laptop, a potential
customer who’s sending us packets, that customer might previously have
been accessing our server, which we’ll represent here with a
box and just call it A. Server A. There’s no
other servers involved, but there is some internet in between
us here, so we’ll assume that this is the so-called cloud, so to speak. And I, as this laptop or the
customer, have a connection, and so does that one and
only server. This picture is fairly straightforward. Now you request a web
page via this browser, it somehow traverses the
internet via routers, and then ultimately ends
up at that server A. But what if, instead,
there’s not just A, even if it’s top of the line,
because that’s not enough, but instead their servers A and
B together here for this website. Well, here we might now have two boxes,
the same size or bigger or smaller, but ultimately finite, as well. And somehow we need to now
decide how to route information from customer to server
A or B. In other words, there is now virtually
a fork in the road, left or right, that
packets need to traverse. So how can we implement
this building block? Well, again, as always, go
back to first principles. We know from our stack
of internet technologies that we already have a
mechanism via which to translate domain names into IP addresses. And if each of these
servers, by definition of IP, has its own IP address, why not
just use DNS to solve this problem? When a customer requests example.com,
perhaps answer that request with the IP address of A. And then when
a second customer somewhere else out there on his or her laptop
asks for example.com next, return to them the IP address of
B and vice versa, again and again. Literally adopting a
round robin technique of sorts, whereby one time you
answer A, the next time you answer B, and back and forth you go. On average, you would like to think that
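Round robin itself takes only a few lines; in Python, for instance, itertools.cycle does the rotating for us (the addresses here are placeholders, not real servers):

```python
from itertools import cycle

def make_round_robin(addresses):
    """DNS-style round robin: hand out the servers' IP addresses in rotation."""
    rotation = cycle(addresses)
    return lambda: next(rotation)
```

So make_round_robin(["1.1.1.1", "2.2.2.2"]) yields a resolver that answers 1.1.1.1, then 2.2.2.2, then 1.1.1.1 again, and so on, alternately.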
this uniform distribution of answers will give you 50% load,
that is to say traffic, on one server and 50% on the other. But perhaps this customer is
more of a shopper than this one, and they end up imposing even
more load on A than on B, so there, with that simple
heuristic, you can get skew. You might not even use round
robin, you could just use random, but there on average, yes, you’ll
send 50% of traffic left and 50% right, but some of those users might
be heavier users than others. So perhaps we should have
some form of feedback loop, and DNS alone might not be sufficient. We really need there to be a
middle man, such as this dot here, that decides more intelligently
whether to send data to A or to B. And we’ll call this thing here,
this dot now, a load balancer. Aptly named, insofar as
it balances load that's incoming across multiple servers. But how? Well, suppose these connections between A and B and this load balancer are not unidirectional but bidirectional somehow, literally a cable that allows bits to flow left and right. Could we perhaps have A just continually
report back to that load balancer saying, I have capacity,
I have capacity? Whereas B might say, I’ve
got too many customers. And logically, then, this load balancer
can just start sending no traffic to B and send all of it to A or vice versa. Of course, logically,
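One way to picture that feedback loop in code: each server continually reports its load, and the balancer simply picks whichever server reports the least. Every name and number below is invented for illustration:

```python
# Hypothetical load reports, updated continually by each server:
# 0.0 means idle, 1.0 means saturated.
reported_load = {"A": 0.30, "B": 0.95}

def choose_server(loads):
    """Send the next customer to whichever server reports the least load."""
    return min(loads, key=loads.get)

print(choose_server(reported_load))  # "A", since B is nearly saturated

# When B reports that it has freed up, the decision flips.
reported_load["B"] = 0.10
print(choose_server(reported_load))  # "B"
```

Unlike round-robin DNS, this decision reacts to the servers' actual state, so one unusually heavy customer no longer skews the split.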
we could find ourselves in a situation where both A and B
are too busy. What then do we do? Well, at some point we have
to throw money at the problem and solve it by just adding hardware. And so C might be added to
the mix with that same logic, but the load balancer
just has to know about it. So, all well and good;
we seem to have solved the problem in a very straightforward
way, but as with computer science more generally, there’s probably
a price paid and a trade-off, and not just financial. Unfortunately, even though I have two,
maybe even three servers now, therefore seemingly having high
availability of service, that is, any one of these servers
theoretically could go down and I’ve still got 2/3 of my capacity. But there’s a single point of
failure here, an SPOF so to speak, that could really derail
the whole process. What happens if this load balancer,
which while pictorial is just a dot, is actually a server
underneath the hood itself? What if that load balancer
goes down, or what if that load balancer itself gets overwhelmed? It does not matter how many servers you
have here, A through Z, if none of them can be reached. So this simple architecture
alone is not a solution. And indeed, this is what is meant
by architecting network itself. This design is probably not the
best, especially for business. And so let’s start anew, at least
down here inside this company, and consider if one load balancer is
not great, what’s better than one? Well, honestly, two. And so let’s now draw
them a bit bigger, where here we have a load balancer on
the left, and here on the right, and we’ll number them 1 and 2. Whereas our servers we’ll continue to
name A and B and C and perhaps even through Z. And now we just have
to ensure that we have connections to both load balancers,
and that each load balancer can connect to each server
in this sort of mesh network here. It’s wonderfully redundant
now, albeit a bit complex. But because we have all of
these interconnections now, we can ensure that even if one or two go
down, data can still reach A, B, or C. But how to know whether load balancer
1 should be doing all of this or load balancer 2? You know what, why don’t we draw
another connection between 1 and 2? And a very common paradigm in
systems is heartbeats, quite simply. Much like you and I have every second
or so a heartbeat saying we are alive, we are alive, hello world,
hello world, if you will, so here might load balancers 1
and 2, themselves just servers, say I am alive, I am alive,
hello number 2, hello number 2. And if 2 does not hear from 1 eventually, or if 1 does not hear from 2, the other can just commandeer that role. By default, only 1 will be load
balancing, but if it goes offline, 2 will presume to take over. And now we have this property
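A heartbeat of this sort can be sketched with little more than timestamps. This is a hypothetical miniature, with an invented timeout, not how any particular product implements failover:

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before a peer is presumed dead

# When did we last hear "I am alive" from load balancer 1?
last_heartbeat = {"balancer_1": time.time()}

def record_heartbeat(peer):
    """Note the arrival of an 'I am alive' message from a peer."""
    last_heartbeat[peer] = time.time()

def active_balancer(now=None):
    """Balancer 1 stays in charge while its heartbeat is fresh;
    otherwise balancer 2 commandeers the role."""
    now = time.time() if now is None else now
    alive = now - last_heartbeat["balancer_1"] < HEARTBEAT_TIMEOUT
    return "balancer_1" if alive else "balancer_2"

print(active_balancer())                  # "balancer_1": heartbeat is fresh
print(active_balancer(time.time() + 10))  # "balancer_2": 1 has gone silent
```

So long as balancer 1 keeps announcing itself, it stays in charge; the moment it misses the timeout, 2 presumes to take over, which is precisely the failover just described.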
generally known as high availability. Even if we lose one or more
servers can we still stay up, and there is no single point of
failure, at least here in this picture, because we now have that
second load balancer. But if we look a little
higher, it would seem that we do actually have another
single point of failure in here, and now we go down the rabbit hole. If this line here to
the cloud, the internet, represents my internet
connection, my ISP, what if that ISP, Comcast, Verizon, or any other, itself goes down? A big storm and a loss of power might take my whole business offline. Well, the best way to
solve that would be to access someone else’s
internet connectivity and make sure you’re connected to that. And in fact, if we keep going
with load balancer 3 or even 4 or server D, E, and F, this picture very
quickly starts to get so intertwined. But this is how you do it. And not too long ago was this done
entirely with wires and hardware. But these days this topology,
if you will, this architecture, is increasingly done in software. And indeed, the whole
thing is done in the cloud. Less frequently do staff of
companies find themselves crawling along the floor and in wiring
closets and in data centers, so to speak, making these
connections possible, but rather they do it
virtually in software. And indeed, thus was born the cloud. Well, it turns out that as
Moore’s law, so to speak, helps us in each passing year, we
seem to have computers that are half as expensive and twice as fast. Can we ride that sort of curve
of innovation in such a way that we can solve even more
problems each year more quickly? And yet, with each
passing year, I the human am not getting any better or
faster at checking my email or using the web, so we increasingly
have on our laptops and desks and our server rooms
more computational power, frankly, than we really know what to do with. And so increasingly in vogue these
days is to virtualize that hardware, and to take physical hardware with so
many CPUs and so much RAM and so much disk space and to write
software that runs on it that creates the illusion that one
computer is two, or one computer is 10. That is to say, through
software can you write code that virtualizes that hardware,
thereby creating the illusion that you can have one server per customer
but all 10 of those customers are on the same machine. Virtualization includes products like VMware and Parallels, from these and other companies as well, and it's just software that runs on
top of the hardware and creates this illusion, which
then is all the better for business. You can sell one piece of hardware multiple times, not necessarily in a way that over-provisions it to multiple customers, but rather in a way that isolates each of those customers from one another, giving them not only the illusion of their own machine but indeed the constraint whereby my data can't be accessed by another customer who only has cloud access there, too. And indeed, this is really in
part, why we now have this cloud. The cloud is more of a buzzword than anything technical. Indeed, using the cloud just
means using servers somewhere else that someone else is managing. No longer do companies with as much
frequency have their own server room in their office, or their own data
center in some warehouse somewhere. Rather, they have virtualized even
that piece of their product using Amazon or Microsoft
or Google or others out there that provide
you with access to servers that they themselves
control, but they provide you with access to the illusion of your very
own servers known as virtual machines. And via this process can
we take ever more advantage of so many of those new
CPUs and disk space and RAM that otherwise might
frankly go to waste, because there’s only so much we can
typically do with one such machine. Hence you might now think of this
design, this stack, so to speak, as follows. In green here pictured is
infrastructure, the physical hardware that you have bought. Here in blue is the
hypervisor, the software called VMware or Parallels or something
else, that virtualizes this hardware and creates the illusion
that you actually have three machines, for instance, on one. And within each of those
machines, which you can think of as just a separate window on
that computer, double-click to open computer A, so to speak, and computer B and computer C. In each of those virtual machines you can install your own operating system, a different one in each. Some version of Windows
in A, another in B, and maybe Linux or Unix or something
else in C. And then within A, B, and C can you install your own apps, your
own software, or so can your customers, thereby being isolated, not so
much physically but virtually, from everyone else. But of course, there’s always a price. While this might take better
advantage of the increasing computational resources
that we have in these boxes, there seems to be some duplication here. And indeed, in computer
science, anytime you start duplicating resources
or efforts there’s probably an opportunity
for better design. And while this technology
itself is still nascent, there’s a newcomer to the
field called containerization, and it exists in multiple forms. But containerization shares
more software, in some sense, underneath the hood, so that you might
install an operating system not three times but once, and share it across
those machines, but in such a way that one cannot access the other. And in this layered design, with Docker here as one of the most popular incarnations thereof, you have as before your
infrastructure, the actual hardware, on top of which is your own operating
system, be it Windows or Linux. On top of that is this program
called Docker that provides you then with the ability to run apps A through F instead of, say, just three, because the overhead, so to speak, computationally is not quite as
much as with virtual machines. Here we have three operating systems, each installed independently on the same hardware, which surely consumes resources, whereas here you have just one operating system, theoretically, and then more room for more apps thereupon. So whereas containerization
allows you ultimately to isolate one app from
another, virtual machines allow you to isolate one
machine from another, they do this through different
techniques and with disparate overhead. And surely in the years to
come will this overhead only get chipped away at, as we humans get better about running more and more software on less and less, but more capable, hardware. There, then, we have these internet
technologies all the way up to cloud computing itself, whereas
the technologies we’ve looked at are fairly low level protocols that
simply get zeros and ones from point A to point B. Once we have that ability
and we can stipulate that we can do it, we can build any number of
abstractions on top of it. In HTTP, for instance, do we
have effectively an application, known as web browsing, via which we
can transmit text and images and sounds and so much more. And via the cloud itself
do we have the ability now to slice up individual machines
as though they are multiple and that picture before can be
implemented not with two load balancers and three servers physically,
but maybe, just maybe, with just one. One server that’s been
so virtualized or in turn containerized so that you can have
different parts of its hardware each implementing different pieces of
functionality that collectively implement that architecture. And so whereas back in the day
might you actually physically wire all of those disparate
types of machines together, now can you do it virtually in software
literally with keystrokes and mouse clicks because someone has written
software that abstracts away that underlying hardware in such a way
that you can think about it virtually. Now at the end of the day, the
servers in Google’s, Microsoft’s, and Amazon’s closets are still
completely physical themselves with so many cables, but you can reroute
information, those zeros and ones, different ways virtually
thanks to these layers that we’ve built on top of
these internet technologies.
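To make that rerouting in software concrete one last time, here is a hedged sketch that combines two ideas from this section, a heartbeat between two load balancers and a round-robin pool of servers. Every name and number is invented; real software-defined networking is far more involved:

```python
from itertools import cycle
import time

SERVERS = cycle(["A", "B", "C"])  # invented pool of back-end servers

HEARTBEAT_TIMEOUT = 3.0  # seconds before balancer 1 is presumed offline
last_heartbeat = {"balancer_1": time.time()}

def route(now=None):
    """Pick which balancer is in charge, then which server gets the request.
    Balancer 1 routes while its heartbeat is fresh; otherwise 2 takes over.
    Either way, servers are handed out round-robin."""
    now = time.time() if now is None else now
    alive = now - last_heartbeat["balancer_1"] < HEARTBEAT_TIMEOUT
    balancer = "balancer_1" if alive else "balancer_2"
    return balancer, next(SERVERS)

print(route())                  # ('balancer_1', 'A'): 1 is alive, A is next
print(route(time.time() + 10))  # ('balancer_2', 'B'): 1 silent, 2 takes over
```

The point is not the particulars but the fact that the whole topology, who routes, and to where, is expressed in a few lines of software rather than in cables on a wiring-closet floor.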
