In February, we experienced two incidents that resulted in degraded performance across GitHub services.
February 26 18:34 UTC (lasting 53 minutes)
February 29 09:32 UTC (lasting 142 minutes)
On February 26 and February 29, we had two incidents related to a background job service that caused processing delays across GitHub services. The incident on February 26 lasted for 63 minutes, while the incident on February 29 lasted for 142 minutes.
The incident on February 26 was related to capacity constraints with our job queuing service and a failure of our automated failover system. Users experienced delays in Webhooks, GitHub Actions, and UI updates (for example, a delay in UI updates on pull requests). We mitigated the incident by manually failing over to our secondary cluster. No data was lost in the process.
The incident on February 29 also caused processing delays to Webhooks, GitHub Actions, and GitHub Issues services, with 95% of the delays occurring in a 22-minute window between 11:05 and 11:27 UTC. At 09:32 UTC, our automated failover successfully routed traffic. However, an improper restoration to the primary at 10:32 UTC caused a significant increase in queued jobs. A correction was made at 11:21 UTC, after which healthy services began burning down the backlog, with full restoration at 11:27 UTC.
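To make the February 29 timeline easier to follow, the intervals between the reported timestamps can be worked out directly. The snippet below is only an illustrative sketch: the timestamps come from the description above, while the event names and the `minutes_between` helper are hypothetical labels chosen for this example and are not part of any GitHub tooling.

```python
from datetime import datetime

FMT = "%H:%M"

# Timestamps (UTC) quoted in the February 29 incident description.
# The keys are hypothetical labels chosen for this example.
events = {
    "automated_failover": "09:32",
    "improper_restore_to_primary": "10:32",
    "peak_delay_window_start": "11:05",
    "correction_applied": "11:21",
    "full_restoration": "11:27",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Time spent on the failover path before the restoration attempt.
failover_to_restore = minutes_between(
    events["automated_failover"], events["improper_restore_to_primary"]
)

# Time during which queued jobs grew before the correction.
restore_to_correction = minutes_between(
    events["improper_restore_to_primary"], events["correction_applied"]
)

# Window in which 95% of the delays occurred.
peak_window = minutes_between(
    events["peak_delay_window_start"], events["full_restoration"]
)

# Time taken to burn down the backlog after the correction.
backlog_burn_down = minutes_between(
    events["correction_applied"], events["full_restoration"]
)

print(failover_to_restore, restore_to_correction, peak_window, backlog_burn_down)
```

Read this way, the backlog grew for 49 minutes after the improper restoration and was cleared within 6 minutes of the correction, and the 22-minute peak window matches the figure quoted above.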
To prevent recurrence of these incidents in the short term, we have completed three significant improvements informed by them: better automation, a more reliable fallback process, and expanded capacity for our background job queuing services. For the longer term, we have a more significant effort already in progress to improve the overall scalability and reliability of our job processing platform.
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.
The post GitHub Availability Report: February 2024 appeared first on The GitHub Blog.