GitHub Pages, our static site hosting service, has always had a very simple architecture. From launch up until around the beginning of 2015, the entire service ran on a single pair of machines (in active/standby configuration), with all user data stored across 8 DRBD-backed partitions. Every 30 minutes, a cron job would run, generating an nginx map file that mapped hostnames to on-disk paths.
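For a sense of what that looked like, the generated file was a standard nginx map; here's a hypothetical excerpt (the variable name and paths are invented for illustration):

```nginx
# Hypothetical excerpt of the generated map: hostname -> on-disk path.
map $host $pages_root {
    default             "";
    username.github.io  /data/pages/disk3/username/username.github.io;
    example.com         /data/pages/disk7/someuser/example.com;
}
```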
There were a few problems with this approach: new Pages sites did not appear until the map was regenerated (potentially up to a 30-minute wait!); cold nginx restarts would take a long time while nginx loaded the map off disk; and our storage capacity was limited by the number of SSDs we could fit in a single machine.
Despite these problems, this simple architecture worked remarkably well for us — even as Pages grew to serve thousands of requests per second to over half a million sites.
When we started approaching the storage capacity limits of a single pair of machines and began to think about what a rearchitected GitHub Pages would look like, we made sure to stick with the same ideas that made our previous architecture work so well: using simple components that we understand and avoiding prematurely solving problems that aren’t yet problems.
The new Pages infrastructure has been in production serving Pages requests since January 2015, and we thought we'd share a little bit about how it works.
After making it through our load balancers, incoming requests to Pages hit our frontend routing tier. This tier comprises a handful of Dell C5220s running nginx. An ngx_lua script examines each incoming request and decides which fileserver to route it to. This involves querying one of our MySQL read replicas to look up which backend storage server pair the Pages site has been allocated to.
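A hypothetical sketch of that lookup, using the lua-resty-mysql driver. The schema (a `pages_routes` table with `host`, `fileserver`, and `disk_path` columns) and the credentials are invented for illustration; the post doesn't describe them:

```lua
local mysql = require "resty.mysql"

-- Look up which fileserver pair serves `host`, asking the given replica.
local function lookup_fileserver(host, replica)
    local db, err = mysql:new()
    if not db then
        return nil, nil, err
    end
    db:set_timeout(500)  -- ms; routing lookups need to be fast

    local ok, err = db:connect{
        host     = replica,   -- one of the MySQL read replicas
        port     = 3306,
        database = "pages",
        user     = "pages_ro",
        password = "secret",
    }
    if not ok then
        return nil, nil, err
    end

    local sql = "SELECT fileserver, disk_path FROM pages_routes WHERE host = "
        .. ngx.quote_sql_str(host)
    local res, err = db:query(sql)
    if not res or not res[1] then
        return nil, nil, err or "no route for host"
    end

    db:set_keepalive(10000, 50)  -- hand the connection back to the pool
    return res[1].fileserver, res[1].disk_path
end
```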
Once our Lua router has made a routing decision, we just use nginx’s stock proxy_pass feature to proxy back to the fileserver. This is where ngx_lua’s integration with nginx really shines, as our production nginx config is not much more complicated than:
```nginx
location / {
    set $gh_pages_host "";
    set $gh_pages_path "";

    access_by_lua_file /data/pages-lua/router.lua;

    proxy_set_header X-GitHub-Pages-Root $gh_pages_path;
    proxy_pass http://$gh_pages_host$request_uri;
}
```
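The post doesn't show router.lua itself, but given the config above, its job is clear: populate those two variables during the access phase, before proxy_pass runs. A minimal sketch, reusing the hypothetical lookup_fileserver helper from earlier:

```lua
-- router.lua (sketch): fills in the variables the location block
-- declared with `set`, so proxy_set_header and proxy_pass can use them.
local host = ngx.var.host

-- lookup_fileserver and the replica address are the hypothetical pieces
-- sketched above, not GitHub's actual code.
local fileserver, pages_root = lookup_fileserver(host, "replica-1.internal")
if not fileserver then
    return ngx.exit(ngx.HTTP_NOT_FOUND)  -- no Pages site for this hostname
end

ngx.var.gh_pages_host = fileserver  -- an internal fileserver address
ngx.var.gh_pages_path = pages_root  -- forwarded as X-GitHub-Pages-Root
```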
One of our major concerns with querying MySQL for routing was that it introduces an availability dependency on MySQL: if our MySQL cluster is down, so is GitHub Pages. The reliance on external network calls also adds extra failure modes, since a MySQL query performed over the network can fail in ways that a simple in-memory hashtable lookup cannot.
This is a tradeoff we accepted, but we have mitigations in place to reduce user impact when we do have issues. If the router experiences any error during a query, it retries the query a number of times, reconnecting to a different read replica on each attempt. We also use ngx_lua's shared memory zones to cache routing lookups on the pages-fe node for 30 seconds, which reduces load on our MySQL infrastructure and lets us tolerate brief blips a little better.
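Here's a sketch of how those two mitigations might fit together using ngx_lua's shared dictionary API. The dictionary name and the replica list are placeholders; only the 30-second TTL and the retry-across-replicas behavior come from the post:

```lua
-- Assumes the nginx config declares a shared memory zone, e.g.:
--   lua_shared_dict route_cache 16m;
local cache = ngx.shared.route_cache
local CACHE_TTL = 30  -- seconds, per the post

-- Placeholder replica addresses; each retry targets a different one.
local REPLICAS = {
    "replica-1.internal",
    "replica-2.internal",
    "replica-3.internal",
}

local function cached_lookup(host)
    -- Fast path: answer from the per-node shared-memory cache.
    local cached = cache:get(host)
    if cached then
        return cached:match("([^|]+)|(.*)")
    end

    -- Slow path: query MySQL, moving to a different replica on error.
    for _, replica in ipairs(REPLICAS) do
        local fileserver, pages_root = lookup_fileserver(host, replica)
        if fileserver then
            cache:set(host, fileserver .. "|" .. pages_root, CACHE_TTL)
            return fileserver, pages_root
        end
        ngx.log(ngx.WARN, "route lookup failed on ", replica, ", retrying")
    end
end
```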