Always on a quest to make the Web faster (see Ping Brigade), I've
been thinking about
ways to speed up fetching web pages. My investigation started with the fact that many popular web servers are backed by multiple
IP addresses. For example:
$ host google.com google.com has address 22.214.171.124 google.com has address 126.96.36.199 google.com has address 188.8.131.52 google.com has address 184.108.40.206 google.com has address 220.127.116.11 google.com has address 18.104.22.168 google.com has IPv6 address 2607:f8b0:4002:c01::8a
That's 6 IPv4 and one IPv6 address. Looking at this list, I don't know
which one is closer/faster/less loaded. How does, say,
a browser pick one of these? Well, typically one of these will be picked at random and a connection will be attempted. The browsers
are smart enough to know whether the given machine has access to IPv6, IPv4, or both and usually will prefer IPv6 if it is available.
After one address is picked, it is cached by the browser as the address
for the given hostname. The connection is attempted with a
fairly large timeout (30-45 seconds). If the connection succeed, the browser proceeds as usual to fetch the web page over HTTP. If it fails,
the browser uses the next address in the list, making another attempt.
Notice how terrible this is for the user experience. If one of the 7 of
Google's servers is having an issue, then the user has to wait for
30-45 seconds before the next attempt is made. This is bizzarre, especially since Google has no way to control the timeout of the browser
(well, maybe Google does since they ship Chrome, but any other server operator does not). This is definitely a setting you would want to control:
saying that your web servers are expected to respond within, say 5 seconds, would give you control over the browser behavior. We do, in fact have
something similar server-side, when we set up reverse proxies: you can define how long to wait for a response from each upstream server before
moving onto the next one and marking the timed-out one as failed. However, we have no such system to control browser behavior. I imagine if we
were to add it, we could leverage DNS to supply such information.
I digress. The good news is that browser vendors can fix this with no
need for new infrastructure! All they'd have to do is change the
they use to select the server they connect to. After playing around with different ways of initiating TCP connections, I created a prototype of
a Python framework for that attempts to make multiple connections and selects the first one that succeeds. Some details:
Let's start with the idea that there are multiple A/AAAA records for the
hostname we are trying to connect to. This is certainly
the case with some very popular websites (Google, Facebook, etc.) My hypothesis is that if we make multiple simultaneous connection requests
to each IP address, the first server to successfully complete the three-way TCP handshake will fit one or more of the following criteria:
- The server is physically closer to the client than others.
- The server is closer to the client than others in terms of network locality (e.g.: if you are in San Francisco and the server is in LA, but your packets are routed through NYC, you are going to have a bad time.)
- The server has a faster, lower-latency internet connection than others.
- The server is less loaded.
- The server is up, while others are down.
The last condition is especially important: we don't want our users
waiting for 30-45 seconds just to realize that one of the servers is
I expect that the other conditions will have a marginal decrease in the time it takes to fetch a resource over HTTP as well. Having this hypothesis,
an obvious solution becomes apparent: let's make simultaneous connection attempts!
connie-experiment is the result of my work on this topic. It is a
for making and measuring simultaneous TCP connections. It provides several useful interfaces on top of Python's socket API (which is really just
a thin wrapper around the BSD socket API). First, it provides the
connie_connect() function which returns a connected socket to a
host/port combination as requested by the caller. Under the hood,
connie_connect() does the following:
getaddrinfo()to get a list of IPs associated with a given host/port.
- Caches the association between host/port and the list of IPs (currently for 60 seconds).
- Creates one non-blocking socket per IP address. Calls
- Uses Linux's
epoll()to look for the first connected socket.
- If after the specified connection timeout no connection is established, it raises an exception.
- If one of the sockets successfully connects,
close()is called on the rest of the sockets. The connected socket is returned.
Additionally, the codebase includes a similar implementation of a TCP
connection function called
cached_dns_connect() where the
host/port => IP addresses mapping is also cached for 60 seconds, but only a single IP address is picked and a connection is attempted. This
function is provided to help create benchmarks against
connie_connect() and the regular
Built on top of
cached_dns_connect(), are two
which establish their connections using the two newly implemented connection functions respectively.
Lastly, there are two benchmark files: one to test a TCP
connect/disconnect sequence, and another to test fetching web resources
after establishing connections using the different strategies provided.
I encourage everyone to try these out for yourselves. I was able to try
these on a few different setups and the results are below. As you will
connie_connect() almost wins every time, sometimes by a significant
margin. In the cases where one of the servers is not reachable,
(think IPv6 address from a client with no IPv6 connectivity), that server is simply ignored by
connie_connect(). Let's take a look:
The test was run 300 times per function, choosing to connect to one of: google.com, apple.com, yandex.ru, facebook.com, maps.google.com, google.cn Using a public Wi-Fi. IPv4 only. $ python conbench.py connie_connect: count 300, total 18.6567, per call: 0.0622, std dev: 0.003 cached_dns_connect: count 300, total 25.5700, per call: 0.0852, std dev: 0.003 plain_connect: count 300, total 32.9295, per call: 0.1098, std dev: 0.016 $ python httpbench.py connie_fetch: count 300, total 61.3235, per call: 0.2044, std dev: 0.009 cached_dns_fetch: count 300, total 62.3081, per call: 0.2077, std dev: 0.005 plain_fetch: count 300, total 77.4779, per call: 0.2583, std dev: 0.009 Time Warner residential connection. Native IPv4 + tunneled IPv6 from he.net. $ python conbench.py connie_connect: count 300, total 21.7207, per call: 0.0724, std dev: 0.0032 cached_dns_connect: count 300, total 28.0278, per call: 0.0934, std dev: 0.0038 plain_connect: count 300, total 39.2406, per call: 0.1308, std dev: 0.0036 $ python httpbench.py connie_fetch: count 300, total 62.5270, per call: 0.2084, std dev: 0.0107 cached_dns_fetch: count 300, total 63.9759, per call: 0.2133, std dev: 0.0063 plain_fetch: count 300, total 98.9979, per call: 0.3300, std dev: 0.0100 From a 100Mbit/s dedicated server. IPv4 only $ python conbench.py connie_connect: count 300, total 12.3907, per call: 0.0413, std dev: 0.004 cached_dns_connect: count 300, total 12.2159, per call: 0.0407, std dev: 0.004 plain_connect: count 300, total 13.4622, per call: 0.0449, std dev: 0.003 $ python httpbench.py connie_fetch: count 300, total 32.6758, per call: 0.1089, std dev: 0.0060 cached_dns_fetch: count 300, total 36.5850, per call: 0.1219, std dev: 0.0064 plain_fetch: count 300, total 35.2439, per call: 0.1175, std dev: 0.0058 From a 1000Mbit/s dedicated server. Native IPv4+IPv6 $ python conbench.py connie_connect: count 300, total 9.6421, per call: 0.0321, std dev: 0.0028 cached_dns_connect: count 300, total 14.9740, per call: 0.0499, std dev: 0.0039 plain_connect: count 300, total 14.0381, per call: 0.0468, std dev: 0.0028 $ python httpbench.py connie_fetch: count 300, total 29.4031, per call: 0.0980, std dev: 0.0059 cached_dns_fetch: count 300, total 33.9132, per call: 0.1130, std dev: 0.0053 plain_fetch: count 300, total 32.2257, per call: 0.1074, std dev: 0.0056
As you can see, on residential connections
connie_connect() makes a
significant difference, so the hypothesis holds up. When run
in server environments with solid internet connections, it still makes a significant difference when fetching web resources, but the time to
open/close a connection is very comparable to the canonical connection strategy. Lastly,
connie_connect() works as promised when
one of the servers is down.
Pro's and Con's
There are extra benefits to using an abstraction like the one provided
connie_connect(). For one, IPv4 vs IPv6 is abstracted
away. Instead of looking at whether the system has a global IPv6 address and worrying whether the network is configured properly, you can
rest assured that the connected socket returned to you works, regardless of the underlying IP protocol. This way the transition to IPv6
becomes a little more seamless. Secondly, DNS entries are cached in memory for a small amount of time, preventing you from having to do
extra DNS lookups. Third, this connection strategy works for any code, not just browser to webserver. For example, it can be used when
implementing client-side code for connecting to a REST or SOAP API, or establishing any TCP connection.
There are some major negatives to using this project in production.
First, creating extra sockets and attempting to connect them will
result in more open files for your kernel to keep track of. Depending on the workload, you may quickly run out of file descriptors which is a
bad thing. One way to remedy this would be to add a limit on how many simultaneous connections will be attempted (my gut feeling is that 2-3
would yield good results).
Second, with every call to
close() you are introducing
a syscall, which will result in a context switch from your userspace code to the kernel code. On a mostly idling workstation running a browser,
this overhead will be minor compared to the latency of establishing a TCP connection, but on a busy server with many processes all competing for
system resources it can become more important. Additionally, memory usage will be higher since more RAM is needed to keep track of TCP connections
and polling objects.
Thirdly, the current implementation works on Linux only. This is due to
my choice to use
epoll() instead of the more portable,
but more limited
select(). This is an implementation shortcoming
which can be addressed by implementing the system-specific
polling mechanisms for the popular OS's of today.
Lastly, this code is experimental. It was written to run benchmarks, and
has not been vetted in long-running processes, etc. It is also not
thread-safe (the cache class has no locking). I do not believe it would require much to get it polished enough to run on mission critical
systems, but that was not the goal of the project, so I am holding off saying anything else until the project is properly reviewed by people who,
unlike me, know what they are talking about.
You can get the source for connie-experiment from the GitHub
repo. Please feel
free to comment and contribute. I would love to hear what your thoughts are. I would especially love to hear from the Google Chrome and Firefox
teams regarding the feasibility of getting this strategy implemented in browsers.