Context aware MySQL pools via HAProxy

At GitHub we use MySQL as our main datastore. While repository data lies in git, metadata is stored in MySQL. This includes Issues, Pull Requests, Comments, etc. We also authenticate against MySQL via a custom git proxy (babeld). To be able to serve under the high load GitHub operates at, we use MySQL replication to scale out read load.

We have different clusters that provide different types of services, but the single-writer-multiple-readers design applies to them all. Depending on growth of traffic, on application demand, on operational tasks or other constraints, we take replicas in or out of our pools. Depending on workloads, some replicas may lag more than others.

Displaying up-to-date data is important. We have tooling that helps us ensure we keep replication lag at a minimum, and typically it doesn’t exceed 1 second. However sometimes lags do happen, and when they do, we want to put aside those lagging replicas, let them catch their breath, and avoid sending traffic their way until they are caught up.

We set out to create a self-managing topology that will exclude lagging replicas automatically, handle disasters gracefully, and yet allow for complete human control and visibility.

HAProxy for load balancing replicas

We use HAProxy for various tasks at GitHub. Among others, we use it to load balance our MySQL replicas. Our applications connect to HAProxy servers at :3306 and are routed to replicas that can serve read requests. Exactly what makes a replica able to “serve read requests” is the topic of this post.

MySQL load balancing via HAProxy is commonly used, but we wanted to tackle a few operational and availability concerns:

  • Can we automate exclusion and inclusion of backend servers based on replication status?
  • Can we automate exclusion and inclusion of backend servers based on server role?
  • How can we react to a scenario where too many servers are excluded, and we are only left with one or two “good” replicas?
  • Can we always serve?
  • How easy would it be to override pool membership manually?
  • Will our solution survive a service haproxy reload/restart?

With these criteria in mind, the standard mysql-check commonly used in HAProxy-MySQL load balancing will not suffice.

This simple check merely tests whether a MySQL server is live and gains no insight into its internal replication state (lag/broken) or its operational state (maintenance/ETL/backup jobs, etc.).

Instead, we make our HAProxy pools context aware. We let the backend MySQL hosts make an informed decision: “should I be included in a pool or should I not?”

Context aware MySQL pools

In its simplest form, context awareness begins with asking the MySQL backend replica: “are you lagging?” We will go far beyond that, but let’s begin by describing this commonly used setup.

In this situation, HAProxy no longer uses a mysql-check but rather an http-check. The MySQL backend server provides an HTTP interface, responding with HTTP 200 or HTTP 503 depending on replication lag. HAProxy will interpret these as “good” (UP) or “bad” (DOWN), respectively. On the HAProxy side, it looks like this:

backend mysql_ro_main
  option httpchk GET /
  balance roundrobin
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 86400000

  default-server port 9876 fall 2 inter 5000 rise 1 downinter 5000 on-marked-down shutdown-sessions weight 10
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
  server my-db-0003 my-db-0003.heliumcarbon.com:3306 check
  server my-db-0004 my-db-0004.heliumcarbon.com:3306 check
  server my-db-0005 my-db-0005.heliumcarbon.com:3306 check
  server my-db-0006 my-db-0006.heliumcarbon.com:3306 check

The backend servers need to provide an HTTP service on :9876. That service would connect to MySQL, check for replication lag, and return with 200 (say, lag <= 5s) or 503 (lag > 5s or replication is broken).
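
For illustration, this is roughly what HAProxy’s health check sees when probing a healthy replica versus a lagging one (simulated here with curl; hostnames are taken from the config above, and the exact response body depends on your HTTP service):

$ curl -i http://my-db-0001.heliumcarbon.com:9876/
HTTP/1.1 200 OK

$ curl -i http://my-db-0005.heliumcarbon.com:9876/
HTTP/1.1 503 Service Unavailable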

Some reflections

This commonly used setup automatically excludes or includes backend servers based on replication status. If the server is lagging, the specialized HTTP service will report 503, which HAProxy will interpret as DOWN, and the server will not serve traffic until it recovers.

But, what happens when two, three, or four replicas are lagging? We are left with less and less serving capacity. The remaining replicas are receiving two or three times more traffic than they’re used to receiving. If this happens, the replicas might succumb to the load and lag as well, and the solution above might not be able to handle an entire fleet of lagging replicas.

What’s more, some of our replicas have special roles. For example, each cluster has a node running continuous logical or physical backups; other nodes might serve a purely analytical workload or be partially weighted to help verify a newer MySQL version.

In the past, we would update the HAProxy config file with the list of servers as they came and went. As we grew in volume and in number of servers this became an operational overhead. We’d rather take a more dynamic approach that provides increased flexibility.

We may perform a MySQL master failover. This may be a planned operation (e.g. upgrading to the latest release) or an unplanned one (e.g. automated failover on hardware failure). The new master must be excluded from the read-pool. The old master, if available, may now serve reads. Again, we wish to avoid the need to update HAProxy’s configuration with these changes.

Static HAProxy configuration, dynamic decisions

In our current setup the HAProxy configuration does not regularly change. It may change when we introduce new hardware now and then, but otherwise it is static, and HAProxy reacts to ongoing instructions from the backend servers telling it:

  • I’m good to participate in a pool (HTTP 200)
  • I’m in bad state; don’t send traffic my way (HTTP 503)
  • I’m in maintenance mode. No error on my side, but don’t send traffic my way (HTTP 404)

The HAProxy config file lists each and every known server. The list includes the backup server. It includes the analytics server. It even includes the master. And the backend servers tell HAProxy whether they wish to participate in taking read traffic or not.

Before showing you how to implement this, let’s consider availability.

Graceful failover of pools

HAProxy supports multiple backend pools per frontend, and provides Access Control Lists (ACLs). ACLs often use incoming connection data (headers, cookies, etc.) but are also able to observe backend status.

The scheme is to define two (or more) backend pools:

  • The first (“main”/”normal”) pool consists of replicas with acceptable lag that are able to serve traffic, as above
  • The second (“backup”) pool consists of valid replicas which are allowed to be lagging

We use an acl that observes the number of available servers in our main backend. We then set a rule to use the backup pool if that acl applies:

frontend mysql_ro
  ...
  acl mysql_not_enough_capacity nbsrv(mysql_ro_main) lt 3
  use_backend mysql_ro_backup if mysql_not_enough_capacity
  default_backend mysql_ro_main

In the example above we choose to switch to the mysql_ro_backup pool when left with less than three active hosts in our mysql_ro_main pool. We’d rather serve stale data than stop serving altogether. Of course, by this time our alerting system will have alerted us to the situation and we will already be looking into the source of the problem.

Remember that it’s not HAProxy that makes the decision “who’s in and who’s out” but the backend server itself. To that effect, HAProxy sends a check hint to the server. We choose to send the hint in the form of a URI, as this makes for readable, clear code:

backend mysql_ro_main
  option httpchk GET /check-lag
  http-check disable-on-404
  balance roundrobin
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 86400000

  default-server port 9876 fall 2 inter 5000 rise 1 downinter 5000 on-marked-down shutdown-sessions weight 10
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
  server my-db-0003 my-db-0003.heliumcarbon.com:3306 check
  server my-db-0004 my-db-0004.heliumcarbon.com:3306 check
  server my-db-0005 my-db-0005.heliumcarbon.com:3306 check
  server my-db-0006 my-db-0006.heliumcarbon.com:3306 check

backend mysql_ro_backup
  option httpchk GET /ignore-lag
  http-check disable-on-404
  balance roundrobin
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 86400000

  default-server port 9876 fall 2 inter 10000 rise 1 downinter 10000 on-marked-down shutdown-sessions weight 10
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
  server my-db-0003 my-db-0003.heliumcarbon.com:3306 check
  server my-db-0004 my-db-0004.heliumcarbon.com:3306 check
  server my-db-0005 my-db-0005.heliumcarbon.com:3306 check
  server my-db-0006 my-db-0006.heliumcarbon.com:3306 check

Both backend pools list the exact same servers. The major difference between the pools is the check URI:

option httpchk GET /check-lag

vs.

option httpchk GET /ignore-lag

As the URI suggests, the first, main pool is looking for backend servers that do not lag (and we wish to also exclude the master, the backup server, etc.). The backup pool is happy to take servers that actually do lag, but it still wishes to exclude the master and other special servers.

HAProxy’s behavior is to use the main pool for as long as at least three replicas are happy to serve data. If only two or fewer replicas are in good shape, HAProxy switches to the backup pool, where we re-introduce the lagging replicas; serving more stale data, but still serving.

Also noteworthy in the above is http-check disable-on-404, which puts a server responding with HTTP 404 into NOLB (“no load balancing”) state. We will discuss this in more detail soon.

Implementing the HTTP service

Any HTTP service implementation will do. At GitHub, we commonly use shell and Ruby scripts that integrate well with our ChatOps. We have many reliable shell building blocks, and our current solution is a shell-oriented service built around xinetd.

xinetd makes it easy to “speak HTTP” via shell. A simplified setup looks like this:
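
The configuration below is an illustrative sketch rather than our exact file; the service name and most settings are assumptions, with only the port (:9876) and the handler script path taken from this post.

service mysql-check
{
  type        = UNLISTED
  port        = 9876
  socket_type = stream
  protocol    = tcp
  wait        = no
  user        = nobody
  server      = /path/to/scripts/xinetd-mysql
  flags       = REUSE
  disable     = no
}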

In the above, we’re particularly interested in the fact that xinetd serves on :9876 and calls upon /path/to/scripts/xinetd-mysql to respond to HAProxy’s check requests.

Implementing the check script

The xinetd-mysql script routes the request to an appropriate handler. Recall that we asked HAProxy to send a hint with each check. The hint URI, such as /check-lag, is intercepted by xinetd-mysql, which then invokes a dedicated handler for that check. Thus, we have different handlers for /check-lag, /ignore-lag, /ignore-lag-and-yes-please-allow-backup-servers-as-well, etc.
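
The following dispatcher is a minimal sketch, not our actual script; the handlers directory, response texts and exit-code convention are assumptions. It reads the HTTP request line that xinetd pipes to stdin, maps the URI to a handler, and translates the handler’s exit code into an HTTP status:

#!/bin/bash
# Sketch of an xinetd-mysql dispatcher: xinetd hands us the raw HTTP request
# on stdin and sends whatever we print on stdout back to HAProxy.
read -r _method uri _protocol
uri="${uri%$'\r'}"                        # strip the trailing CR from the request line

# Map the hint URI to a handler script; real code should sanitize the URI.
handler="/path/to/scripts/checks${uri}"   # e.g. .../checks/check-lag (assumed layout)

if [[ ! -x "$handler" ]]; then
  printf 'HTTP/1.1 400 Bad Request\r\n\r\nunknown check: %s\r\n' "$uri"
  exit 0
fi

"$handler"
case "$?" in
  0) printf 'HTTP/1.1 200 OK\r\n\r\nOK\r\n' ;;                            # include in pool
  2) printf 'HTTP/1.1 404 Not Found\r\n\r\nnot participating\r\n' ;;      # maintenance / role exclusion
  *) printf 'HTTP/1.1 503 Service Unavailable\r\n\r\ncheck failed\r\n' ;; # lagging or broken
esac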

The real magic happens when running this handler script. This is where the server makes the decision: “Should I be included in the read-pool or not?” The script bases its decision on the following factors (a minimal sketch follows the list):

  • Did a human suggest that this server be explicitly included/excluded?
    This is just a matter of touching a file
  • Is this the master server? A backup server? Something else?
    The server happens to know its own role via service discovery or even via puppet. We check for a hint file
  • Is MySQL lagging? Is it alive at all?
    This (finally) executes a self-check on MySQL.
    For lag we use a heartbeat mechanism, but your mileage may vary.
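
Here is a minimal sketch of a /check-lag handler following that decision flow. The hint-file paths, the heartbeat table, the lag threshold and the exit-code convention (matching the dispatcher sketch above) are all assumptions, not our exact implementation:

#!/bin/bash
# Sketch of a /check-lag handler. Exit codes (assumed convention):
#   0 -> healthy, include in pool (HTTP 200)
#   2 -> deliberately excluded, not an error (HTTP 404 -> NOLB)
#   1 -> lagging or broken (HTTP 503)

# 1. Did a human explicitly take this server out? (hint-file path is an assumption)
[[ -e /etc/mysql-checks/force-disable ]] && exit 2

# 2. Does this server have a special role? Here we assume puppet / service
#    discovery writes the role into a hint file.
role=$(cat /etc/mysql-checks/role 2>/dev/null)
case "$role" in
  master|backup|analytics) exit 2 ;;
esac

# 3. Is MySQL alive and caught up? We assume a heartbeat row updated every
#    second on the master; replica lag = now() minus the replicated timestamp.
lag=$(mysql --defaults-file=/etc/mysql-checks/checker.cnf -NBe \
  "SELECT TIMESTAMPDIFF(SECOND, MAX(ts), NOW()) FROM meta.heartbeat" 2>/dev/null)

[[ -z "$lag" ]] && exit 1   # can't read the heartbeat: treat as broken
(( lag <= 5 )) && exit 0    # within threshold: serve traffic
exit 1                      # lagging: step aside until caught up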

This xinetd/shell implementation means we do not use persistent MySQL connections; each check generates a new connection on the backend server. While this seems wasteful, the rate of incoming check requests is not high, and is negligible at the scale of our busy servers. Moreover, it actually strengthens our trust in the system: a hogged server may be able to serve existing connections but refuse new ones; we’re happy to catch this scenario.

404 or error?

Servers that just don’t want to participate send a 404, causing them to go NOLB. Lagging, broken or dead replicas send a 503. This makes it easier on our alerting system and makes it clearer when we have a problem.

One outstanding issue is that HAProxy never transitions a server directly from DOWN to NOLB; its state machine requires the server to first go UP. This is not an integrity problem, but it causes more alerting. We work around it by cross-checking servers and refreshing if need be. This is a rare situation for us and thus of no significant concern.

Operations

This small-building-blocks design permits simple unit testing. Control and visibility are easily gained: disabling and enabling servers is a matter of creating a file, whether forced into existence by a human or implied by server role.
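
For instance, forcing a server out of the pools by hand can be as simple as the following (the hint-file path is the same assumption used in the handler sketch above):

touch /etc/mysql-checks/force-disable    # take this replica out of the read pools
rm -f /etc/mysql-checks/force-disable    # restore automated behavior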

These scripts integrate well with our chatops. We are able to see the exact response HAProxy sees via simple chatops commands:

shlomi-noach: .mysql xinetd my-db-0004 /check-lag
Hubot: 200 ; OK

Or we can interfere and force backends in/out the pools:

shlomi-noach: .mysqlproxy host force-disable my-db-0004
Hubot: Host my-db-0004 disabled by @shlomi-noach at 2016-06-30 15:22:07
shlomi-noach: .mysqlproxy host restore my-db-0004
Hubot: Host my-db-0004 restored to normal state

We have specialized monitoring for these HAProxy boxes, but we don’t wish to be notified if a single replica starts to lag. Rather, we’re interested in the bigger picture: a summary of the errors found across the pools. This means there’s a difference between a half-empty main pool and a completely empty one. In the event of problems, we get a single alert that summarizes the status across the cluster’s pools. As always, we can also check from chatops:

shlomi-noach: .mysqlproxy sup myproxy-0001
Hubot:
OK
mysql_ro_main OK
  3/10 servers are nolb in pool
mysql_ro_backup OK
  3/10 servers are nolb in pool

We’ve stripped our script and config files to decouple them from GitHub’s specific setup and flow. We’ve also open sourced them in the hope that you’ll find them useful, and that they’ll help you implement your own solution with context-aware MySQL replica pools.
