GitHub Availability Report: November 2021
In November, we experienced one incident resulting in significant impact and degraded state of availability for multiple services.
In November, we experienced one incident resulting in significant impact and degraded state of availability for core GitHub services, including GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks.
November 27 20:40 UTC (lasting 2 hours and 50 minutes)
We encountered a novel failure mode when processing a schema migration on a large MySQL table. Schema migrations are a common task at GitHub and often take weeks to complete. The final step in a migration is to perform a rename to move the updated table into the correct place. During the final step of this migration a significant portion of our MySQL read replicas entered a semaphore deadlock. Our MySQL clusters consist of a primary node for write traffic, multiple read replicas for production traffic, and several replicas that serve internal read traffic for backup and analytics purposes. The read replicas that hit the deadlock entered a crash-recovery state causing an increased load on healthy read replicas. Due to the cascading nature of this scenario, there were not enough active read replicas to handle production requests which impacted the availability of core GitHub services.
During the incident mitigation, in an effort to increase capacity, we promoted all available internal replicas that were in a healthy state into the production path; however, the shift was not sufficient for full recovery. We also observed that read replicas serving production traffic would temporarily recover from their crash-recovery state only to crash again due to load. Based on this crash-recovery loop, we chose to prioritize data integrity over site availability by proactively removing production traffic from broken replicas until they were able to successfully process the table rename. Once the replicas recovered, we were able to move them back into production and restore enough capacity to return to normal operations.
Throughout the incident, write operations remained healthy and we have verified there was no data corruption.
To address this class of failure and reduce time to recover in the future, we continue to prioritize our functional partitioning efforts. Partitioning the cluster adds resiliency given migrations can then be run in canary mode on a single shard—reducing the potential impact of this failure mode. Additionally, we are actively updating internal procedures to increase the amount each cluster is over-provisioned.
As next steps, we’re continuing to investigate the specific failure scenario, and have paused schema migrations until we know more on safeguarding against this issue. As we continue to test our migration tooling, we are classifying opportunities to improve it during such scenarios.
In summary
We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.
Tags:
Written by
Related posts
GitHub and JFrog partner to unify code and binaries for DevSecOps
This partnership between GitHub and JFrog enables developers to manage code and binaries more efficiently on two of the most widely used developer platforms in the world.
2024 GitHub Accelerator: Meet the 11 projects shaping open source AI
Announcing the second cohort, delivering value to projects, and driving a new frontier.
Introducing GitHub Copilot Extensions: Unlocking unlimited possibilities with our ecosystem of partners
The world of Copilot is getting bigger, improving the developer experience by keeping developers in the flow longer and allowing them to do more in natural language.