Improving large monorepo performance on GitHub

Every day, GitHub serves the needs of over 56M developers, working on over 200M code repositories. All but a tiny fraction of those repositories are served with amazing performance, for customers from around the world.

Like any system as large as GitHub, some of the gaps in our coding and architecture are only discovered when they’re pushed to their limits, such as having thousands of developers updating the same repo every day. GitHub received feedback from a handful of our largest monorepo customers that they were seeing performance problems that were impacting their ability to complete push operations.

And so was GitHub. github/github is our monorepo, and we were experiencing occasional push failures ourselves.

To start our investigation, we worked with internal teams and engaged with customers to understand and optimize their usage of GitHub. This included coalescing multiple pushes into single pushes, which reduced the number of locking write transactions they were incurring, and even helping rework some of their DevOps processes to eliminate unnecessary push operations. Still, we were seeing unusual levels of push failure.

In response, GitHub Engineering created Project Cyclops: a multi-month effort across our Git Systems org (which includes our Git Storage, Git Protocols, and Git Client teams), working with GitHub’s upstream Git contributors and with engineers on our Web front-end teams to find solutions to these problems. After a lot of collaboration, hard work, and multiple improvements, we’ve driven push errors down to nearly zero, even for our largest monorepo customers with thousands of contributors.

Multiple improvements

Project Cyclops resulted in many different approaches to improving our monorepo push performance. Together, they improved our ability to handle push traffic by at least an order of magnitude.

Improving repository maintenance

By default, GitHub runs a repository maintenance routine after every 50 git push operations, or after we receive 40MB of unpacked files. This maintenance makes sure we have up-to-date packfiles for great clone/fetch performance, and it cleans up and de-duplicates data in the repo. Depending on the size of the repository, maintenance takes anywhere from a few seconds to a few minutes.
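As a rough sketch of that trigger, the condition looks something like the following. The type and field names are invented for illustration and are not GitHub’s actual scheduler; only the two thresholds come from the description above.

```go
package maintenance

// repoStats tracks per-repository counters that reset after each
// maintenance run. The names here are illustrative only.
type repoStats struct {
	pushesSinceMaintenance int
	unpackedBytes          int64
}

const (
	pushThreshold     = 50               // maintenance after every 50 pushes...
	unpackedThreshold = 40 * 1024 * 1024 // ...or after 40MB of unpacked files
)

// needsMaintenance reports whether a repository should be scheduled for
// the repack/cleanup routine described above.
func needsMaintenance(s repoStats) bool {
	return s.pushesSinceMaintenance >= pushThreshold ||
		s.unpackedBytes >= unpackedThreshold
}
```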

For a large monorepo with a lot of developers, 50 push operations don’t take long to accumulate, so maintenance gets scheduled frequently, often while developers are still pushing changes to the repo. We had a number of those repos failing to complete maintenance within our maximum time window. When a repo fails maintenance, performance for both push and reference updates suffers, which can lead to manual toil for our engineers to sort out those repositories again. We’ve reduced those maintenance failures to nearly zero, specifically by making improvements in git repack and in how we schedule maintenance retries.

Making git repack faster

During repo maintenance, we run git repack to compress loose objects and prepare for fast clones and fetches.

To compress a set of objects into a single pack, Git tries to find pairs of objects which are related to one another. Instead of storing all of the object’s contents verbatim, some objects are stored as deltas against other related ones. Finding these deltas takes time, and comparing every object to every other object gets infeasible quickly. Git solves this problem by searching for delta/base pairs within a sliding window over an array of all objects being packed.

Some delta candidates within the window can be rejected quickly by heuristics, but some require CPU-intensive comparisons.
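To make the idea concrete, here is a minimal, illustrative sketch of that sliding-window search in Go. None of these names come from Git’s actual source; cheapReject and tryDelta stand in for the real heuristics and the real delta computation.

```go
package pack

// object is an illustrative stand-in for an object being packed, not
// Git's real internal representation.
type object struct {
	kind string
	size int64
	data []byte
}

// cheapReject applies inexpensive heuristics, such as mismatched types
// or wildly different sizes, to discard hopeless candidates before any
// expensive work happens.
func cheapReject(target, base object) bool {
	return target.kind != base.kind || base.size > 2*target.size
}

// findDeltaBase scans a sliding window of recently considered objects,
// looking for the base that yields the smallest delta. tryDelta stands
// in for the CPU-intensive byte-level comparison.
func findDeltaBase(target object, window []object,
	tryDelta func(target, base object) (deltaSize int64, ok bool)) int {

	best, bestSize := -1, target.size // storing verbatim is the fallback
	for i, base := range window {
		if cheapReject(target, base) {
			continue // rejected cheaply, no expensive comparison needed
		}
		if size, ok := tryDelta(target, base); ok && size < bestSize {
			best, bestSize = i, size
		}
	}
	return best
}
```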

We’ve implemented a parameter to limit the number of expensive comparisons we’re willing to make. By tuning this value, we’ve reduced the CPU time we spend during git repack, while only marginally increasing the resulting packfile size. This one change eliminated nearly all of our maintenance failures.
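Continuing the sketch above, the cap can be threaded through the same loop. The budget knob and names below are illustrative; the actual parameter and tuning value we shipped aren’t shown here.

```go
// findDeltaBaseCapped is like findDeltaBase, but stops attempting the
// expensive comparisons once a per-object budget is spent. maxTries is
// an illustrative knob, not the real tuning value.
func findDeltaBaseCapped(target object, window []object, maxTries int,
	tryDelta func(target, base object) (int64, bool)) int {

	best, bestSize := -1, target.size
	tries := 0
	for i, base := range window {
		if cheapReject(target, base) {
			continue // cheap rejections don't count against the budget
		}
		if tries >= maxTries {
			break // budget exhausted: spend no more CPU on this object
		}
		tries++
		if size, ok := tryDelta(target, base); ok && size < bestSize {
			best, bestSize = i, size
		}
	}
	return best
}
```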

Spurious failures and more frequent retries

Before Project Cyclops, when a repo failed maintenance for any reason, we wouldn’t schedule it to run again for seven days. For many years, this rhythm served our customers well enough, but monorepos today can’t wait that long. We introduced a new spurious-failure state for specific repository maintenance failures—the ones that generally come from lots of push traffic happening during maintenance—that allows us to retry maintenance every four hours, up to three times. This means that we’ll get to retry during a customer’s off-hours, when many fewer pushes are happening. This change eliminated the remaining maintenance failures, and therefore eliminated more toil from our on-call engineers.
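A minimal sketch of that retry policy, with invented names and assuming a simple per-repository attempt counter:

```go
package maintenance

import "time"

// failureKind distinguishes the spurious failures described above
// (typically caused by heavy push traffic during maintenance) from all
// other failures. The type and its values are illustrative.
type failureKind int

const (
	failureOther failureKind = iota
	failureSpurious
)

// nextRetry returns when maintenance should run again after a failure:
// spurious failures retry every four hours, up to three times, while
// everything else falls back to the old seven-day wait.
func nextRetry(kind failureKind, attempts int, now time.Time) time.Time {
	if kind == failureSpurious && attempts < 3 {
		return now.Add(4 * time.Hour)
	}
	return now.Add(7 * 24 * time.Hour)
}
```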

Removing an artificial limit

On our file servers—the servers that actually hold git repositories—we have had a parameter in place for years that slowed down the rate of push operations we processed on each one. GitHub is a multi-tenant service, and, originally, this parameter was meant to ensure that writes from one customer wouldn’t monopolize resources on a server and interfere with traffic from all of the other customers on that server. In effect, this was an artificial cap on the amount of work our servers could do.
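Conceptually, the gate behaved something like this sketch. The parameter name and mechanism here are hypothetical stand-ins for illustration, not the code we actually ran.

```go
package throttle

import "math/rand"

// immediatePushPercent is a hypothetical stand-in for the parameter
// described above: the share of incoming pushes allowed to proceed
// immediately on a given file server, with the remainder delayed.
var immediatePushPercent = 80

// allowImmediately decides whether a push runs right away. Setting the
// parameter to 100 turns the gate into a no-op.
func allowImmediately() bool {
	return rand.Intn(100) < immediatePushPercent
}
```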

After an investigation where we slowly raised the value of this parameter to allow 100% of push operations to run immediately, we found that performance with our current architecture was more than good enough, and not only did we raise the limit, we removed the parameter from our code. This immediately improved our monorepo performance and eliminated many push-related errors.

Precomputing checksums

GitHub, by default, writes five replicas of each repository across our three data centers to protect against failures at the server, rack, network, and data center levels. When we need to update Git references, we briefly take a lock across all of the replicas in all of our data centers, and release the lock when our three-phase-commit (3PC) protocol reports success.
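In rough outline, a replicated reference update has the following shape. The types and function names are invented for illustration and are not our real API.

```go
package replication

// refUpdate and replica are illustrative stand-ins, not GitHub's
// actual types.
type refUpdate struct {
	ref    string
	oldOID string
	newOID string
}

type replica interface {
	Lock() error
	Unlock()
}

// threePhaseCommit stands in for the 3PC protocol that coordinates the
// update across every replica; its internals are out of scope here.
func threePhaseCommit(replicas []replica, u refUpdate) error { return nil }

// updateRefs sketches the flow: take a lock on every replica across all
// data centers, run the three-phase commit, and release the locks once
// it reports success (or fails).
func updateRefs(replicas []replica, u refUpdate) error {
	for _, r := range replicas {
		if err := r.Lock(); err != nil {
			return err // replicas locked so far are released by defer
		}
		defer r.Unlock()
	}
	return threePhaseCommit(replicas, u)
}
```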

During that lock, we compute a checksum on each replica, to ensure that they match and that all replicas are in sync. We use incremental checksums to make this faster, and during normal operation this takes less than 50ms. During repair operations, however, where we recompute the checksum from scratch, it takes much longer: for large monorepos, the lock was held for 20-30 seconds.

We made a change to compute these replica checksums prior to taking the lock. By precomputing the checksums, we’ve been able to reduce the time the lock is held to under 1 second, allowing more write operations to succeed immediately.
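A minimal sketch of the reordering, again with invented names: the expensive from-scratch work happens before the lock is taken, so the critical section only covers the fast incremental comparison.

```go
package replication

import "errors"

// checksummer is an illustrative interface for a replica that maintains
// an incremental checksum of its references.
type checksummer interface {
	FullChecksum() (string, error)        // slow: recompute from scratch
	IncrementalChecksum() (string, error) // fast: kept up to date as refs change
	Lock() error
	Unlock()
}

// verifyReplicas sketches the "precompute first, lock briefly" order.
func verifyReplicas(replicas []checksummer) error {
	// Expensive: bring each replica's checksum up to date outside the lock.
	for _, r := range replicas {
		if _, err := r.FullChecksum(); err != nil {
			return err
		}
	}

	// Cheap: the lock only covers the incremental comparison.
	for _, r := range replicas {
		if err := r.Lock(); err != nil {
			return err
		}
		defer r.Unlock()
	}

	var want string
	for i, r := range replicas {
		sum, err := r.IncrementalChecksum()
		if err != nil {
			return err
		}
		if i == 0 {
			want = sum
			continue
		}
		if sum != want {
			return errors.New("replica checksums do not match")
		}
	}
	return nil
}
```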

Our customers have noticed

One of our large monorepo customers keeps their own internal metrics on git performance, and their numbers show what ours do: their push operations have shown no failures for months.

Another monorepo customer who was experiencing too many failures was planning a migration to start with a fresh repository, minimizing the number of references in the repo, in an attempt to improve push success. After these changes, our metrics showed their push failures at nearly zero, and a survey of their developers in December found no reports of recent push failures at all. They cancelled the migration, and continue running with great performance.

Want graphs? Here are graphs from these customers showing push failures dropping to nearly zero as we rolled out fixes.

[Graph: git push failures at one monorepo customer dropping to zero]

[Graph: git push failures at a second monorepo customer dropping to zero]

What’s next?

Like we said, we’ve got push failures down to nearly zero. Some of those failures are caused by random Internet networking issues, and are beyond our control. As for the rest, we’re looking at ways to eliminate those last annoying failures where we can, and to continue to make GitHub faster.

In the Git Systems world, we’re refreshing our storage hardware to make it faster. We’re also in the middle of a significant refactoring effort, doing our part to decompose GitHub’s famous Ruby monolith, and writing a new microservice in Go that will improve repository performance for every single user on GitHub.

Results

Project Cyclops has led to better performance and the elimination of failures for customers with large monorepos, reduced wasted CPU cycles on our fleet of file servers, and significantly improved the experience of using GitHub for thousands of developers at some of our largest customers, including those using GitHub Enterprise Server.

It has also made small but noticeable improvements for everyone who uses GitHub.

We’ve improved the rate of update traffic a single repository can handle by at least an order of magnitude. We now have years of headroom on our current architecture to handle the growth of even the largest monorepos.

Special thanks

We want to say how incredibly grateful 💚 we are to our monorepo customers who have collaborated with us to make GitHub better. Their help in reporting, and sometimes even diagnosing, problems was instrumental to addressing them. ✨ Sparkles all around! ✨
