GitHub Availability Report: August 2023
In August, we experienced two incidents that resulted in degraded performance across GitHub services.
August 15 16:58 UTC (lasting 4 hours 29 minutes)
On August 15 at 16:58 UTC, GitHub started experiencing increasing delays in an internal job queue used to process webhooks. We statused GitHub Webhooks to yellow at 17:24 UTC. During this incident, customers experienced webhook delivery delays of up to 4.5 hours.
We determined that the delays were caused by a significant and sustained spike in webhook deliveries, which backed up our webhook delivery queue. We mitigated the issue by blocking events from the sources of the increased load, which allowed the system to gradually recover as we processed the backlog of events. In response to this and other recent webhooks incidents, we made improvements that allow us to handle a higher volume of traffic and absorb load spikes without increasing delivery latency. We also improved our ability to manage load sources so we can prevent, and more quickly mitigate, any impact to our service.
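As a rough illustration of this kind of mitigation, the sketch below shows per-source load shedding in front of a delivery queue: a source whose event rate exceeds a threshold within a sliding window is temporarily blocked so the rest of the backlog can drain. The window size, threshold, and source identifiers are illustrative assumptions, not details of GitHub's webhook pipeline.

```python
import time
from collections import defaultdict, deque

# Minimal sketch of per-source load shedding in front of a delivery queue.
# Window size and threshold are illustrative assumptions.
WINDOW_SECONDS = 60
MAX_EVENTS_PER_WINDOW = 10_000

recent_events = defaultdict(deque)  # source_id -> timestamps of recent events
blocked_sources = set()

def should_enqueue(source_id: str, now: float | None = None) -> bool:
    """Return False for sources that exceed the per-window threshold."""
    now = now or time.time()
    window = recent_events[source_id]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_EVENTS_PER_WINDOW:
        blocked_sources.add(source_id)
    return source_id not in blocked_sources
```

In this sketch, a blocked source stays blocked until it is explicitly cleared, which mirrors the manual "block, recover, then unblock" sequence described above.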
August 29 02:36 UTC (lasting 49 minutes)
On August 29 at 02:36 UTC, GitHub systems experienced widespread delays in background job processing. This prevented webhook deliveries, GitHub Actions, and other asynchronously triggered workloads throughout the system from running as promptly as usual. While workloads were delayed by up to an hour, no data was lost, and systems ultimately recovered and resumed timely operation.
The component of our job queueing service responsible for dispatching jobs to workers failed due to an interaction between unexpected CPU throttling and short session timeouts for a Kafka consumer group. The Kafka consumer ended up stuck in a rebalance loop, unable to stabilize quickly enough before its session timed out and the group coordination process restarted. While the service continued to accept and record incoming work, it was unable to pass jobs on to workers until we mitigated the issue by shifting the load to the standby service and redeploying the primary service. We have extended our monitoring to allow quicker diagnosis of this failure mode, and we are pursuing additional changes to prevent recurrence.
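For context on this failure mode, the minimal sketch below shows how a Kafka consumer can be configured with more generous session and poll timeouts so that brief CPU-throttling pauses do not expire its group session and trigger repeated rebalances. It uses the open source kafka-python client purely for illustration; the topic name, group id, and timeout values are assumptions, not GitHub's actual configuration.

```python
from kafka import KafkaConsumer  # kafka-python client, used here for illustration

def dispatch_to_worker(job: bytes) -> None:
    # Placeholder for handing the recorded job to a worker pool.
    print(f"dispatching a {len(job)}-byte job")

# Give the consumer enough headroom that short pauses (e.g. from CPU throttling)
# do not expire its group session and force a rebalance.
consumer = KafkaConsumer(
    "job-dispatch",                    # hypothetical topic
    group_id="dispatcher",             # hypothetical consumer group
    bootstrap_servers=["kafka:9092"],
    session_timeout_ms=45_000,         # longer than any expected throttling pause
    heartbeat_interval_ms=15_000,      # typically about one third of the session timeout
    max_poll_interval_ms=300_000,      # allow slow batches without leaving the group
)

for message in consumer:
    dispatch_to_worker(message.value)
```

The trade-off is that a longer session timeout also delays detection of a genuinely dead consumer, which is one reason monitoring for this specific failure mode matters.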
Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.