Recent Load Balancer Problems
Over the past few weeks, we’ve had a few service interruptions that can all be traced back to one cause – instability in our high-availability load balancer setup. Here’s a brief summary of our existing load balancing setup, a look at its problems, and what we’re doing to fix it.
Load Balancing at GitHub
To handle all of GitHub’s incoming HTTP, SSH, and Git traffic, we run quite a few frontend servers. In front of these, we run an IPVS load balancer that distributes incoming traffic across them, while reply traffic goes back to clients via direct routing and never passes through the balancer. When that one load balancer server fails, GitHub is down.
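For readers unfamiliar with IPVS, the shape of such a setup is roughly as follows. The VIP, real-server addresses, and scheduler here are placeholders, not our actual configuration; the -g flag selects direct routing.

```
# Illustrative only -- addresses, ports, and scheduler are placeholders.
ipvsadm -A -t 203.0.113.10:443 -s wlc                      # define the virtual service on the VIP
ipvsadm -a -t 203.0.113.10:443 -r 10.0.0.21:443 -g -w 100  # add a frontend; -g = direct routing
ipvsadm -a -t 203.0.113.10:443 -r 10.0.0.22:443 -g -w 100  # replies go straight from frontend to client
```

With direct routing, only inbound packets pass through the director and the frontends answer clients directly, which is why the balancer itself doesn’t need much hardware.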
IPVS doesn’t require beefy hardware, so in our initial deployment we set it up on a small Xen virtual server. We run a handful of Xen hosts inside our private network to power utility servers that don’t need a dedicated machine of their own: SMTP, monitoring, serving HTML for GitHub Pages, and so on. A few of these services, Pages among them, require high availability.
HA Virtual Servers using Linux-HA, DRBD, and Xen
To achieve high availability for the virtual servers that require it, we combine LVM, DRBD, Pacemaker, and Heartbeat to run them as pairs of Xen virtual servers. For example, right now GitHub Pages are being served from a virtual server running on xen3. That server has a DRBD mirror on xen1. If Heartbeat detects that the Pages virtual server on xen3 isn’t responding, it automatically shuts it down, adjusts the DRBD configuration to make the LVM volume on xen1 the primary device, then starts the virtual server on xen1.
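As a rough sketch of how these pieces fit together, the Pacemaker side of such a pair can be expressed in crm shell syntax along the following lines. The resource names, the Xen config path, and the timeouts are invented for illustration; they are not our actual cluster configuration.

```
# Illustrative crm shell sketch -- names, paths, and timeouts are placeholders.
primitive drbd_pages ocf:linbit:drbd \
    params drbd_resource="pages" \
    op monitor interval="30s"
ms ms_drbd_pages drbd_pages \
    meta master-max="1" clone-max="2" notify="true"
primitive pages_vm ocf:heartbeat:Xen \
    params xmfile="/etc/xen/pages.cfg" \
    op monitor interval="60s" timeout="120s" \
    op stop interval="0" timeout="300s"
colocation vm_with_drbd_master inf: pages_vm ms_drbd_pages:Master
order drbd_before_vm inf: ms_drbd_pages:promote pages_vm:start
```

The colocation and order constraints are what drive the failover described above: the virtual server may only run where the DRBD volume is primary, so a failed monitor check leads Pacemaker to stop the VM, promote the mirror on the peer, and start the VM there.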
Enter STONITH
However, in recent weeks we’ve had a few occasions where load on our Xen servers spiked significantly, causing these Heartbeat checks to time out repeatedly even when the services they were checking were working correctly. In a few cases, the repeated timeouts set off a dramatic downward spiral in service as the following sequence of events unfolded:
- Heartbeat checks time out for the Pages virtual server on xen3.
- Pacemaker starts the process of transitioning the virtual server to xen1. It begins by attempting to stop the virtual server on xen3, but that also times out due to high load.
- Pacemaker now determines that xen3 is dead, since a management command has failed, and decides that the only way to regain control of the cluster is to remove the node completely. xen3 is STONITH’d via an IPMI command that powers down the server through its out-of-band management card (a fencing configuration sketch follows this list).
- Once xen3 is confirmed powered off, Pacemaker starts the virtual servers previously running on the now-dead xen3 on xen1 instead, and notifies us that we’ll need to intervene manually to get xen3 back up and running.
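For context, the fencing piece of a cluster like this is typically declared as a STONITH resource. The plugin, IP address, and credentials below are placeholders for illustration, not our actual fencing setup:

```
# Illustrative fencing sketch -- plugin choice, IP, and credentials are placeholders.
primitive st-xen3 stonith:external/ipmi \
    params hostname="xen3" ipaddr="192.0.2.33" userid="fence" passwd="secret" interface="lan"
location st-xen3-placement st-xen3 -inf: xen3    # a node should never run its own fencing agent
property stonith-enabled="true"
```

Fencing like this is exactly what you want when a node has genuinely died; our problem was that load-induced timeouts made perfectly healthy nodes look dead.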
If the Xen server that gets killed was running our load balancer at the time, HTTPS and Git traffic to GitHub stays down until it comes back up. To make matters worse, our load balancers occasionally require manual intervention after a reboot to get back into their proper state, thanks to a bug in their init scripts.
A Path to Stability
After recovering from the outage early Saturday morning, we came to the realization that our current HA configuration was causing more downtime than it was preventing. We needed to make it less aggressive, and to isolate the load balancers from the fallout of any other service’s failure.
Over the weekend we made the following changes to make our HA setup less aggressive:
- Significantly reduce the frequency of Heartbeat checks between virtual server pairs
- Significantly increase the timeouts of these Heartbeat checks
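Concretely, these knobs live in Heartbeat’s ha.cf (per-resource monitor timeouts live in the Pacemaker configuration). The values below only illustrate the direction of the change; they are not our production numbers:

```
# /etc/ha.d/ha.cf -- illustrative values only
keepalive 5      # send heartbeats less frequently than before
warntime 30      # log a warning when heartbeats arrive late
deadtime 60      # wait much longer before declaring a peer dead
initdead 120     # extra allowance while a node is still booting
```

The trade-off is a slower reaction to a genuine failure, but a few extra seconds of failover time is far cheaper than fencing a healthy node.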
These changes alone have reduced the average load and load variance across our Xen cluster by a good bit.
More importantly, there hasn’t been a single false Heartbeat alert since these changes went in, and we don’t anticipate any more.
We’re also ordering a pair of dedicated servers that will run our load balancers as their own HA pair. Once these are in place, the load balancers will be completely isolated from any HA Xen virtual server failure, legitimate or not.
Of course, we’re also working on improving the configuration of the load balancers themselves to reduce the mean time to recovery (MTTR) in the event of any legitimate load balancer failure.
We’ve just recently brought on a few new sysadmins (myself included), and are doubling down on stability and infrastructure improvements in the coming months. Thanks for your patience as we work to improve the GitHub experience as we grow!