GitHub Availability Report: May 2025

wccwcc
Jun 12, 2025 - 05:00

In May, we experienced three incidents that resulted in degraded performance across GitHub services.

May 1 22:09 UTC (lasting 1 hour and 4 minutes)

On May 1, 2025, from 22:09 UTC to 23:13 UTC, the Issues service was degraded and users weren’t able to upload attachments. The root cause was a new feature that added a custom header to all client-side HTTP requests, which caused CORS errors when uploading attachments to our provider. We estimate that ~130k users were impacted by the incident for ~45 minutes.

We mitigated the incident at 22:56 UTC by rolling back the feature flag that added the new header. To prevent this from happening again, we are adding new metrics to monitor and ensure the safe rollout of changes to client-side requests. We have since deployed an augmented version of the feature, based on learnings from this incident, that is performing well in production.
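
To illustrate the failure mode in general terms: a custom request header makes a cross-origin request subject to a CORS preflight, and the browser blocks the request unless the target lists that header in Access-Control-Allow-Headers. The sketch below is a simplified model of that check, not the code involved in the incident. The header name X-Example-Feature is hypothetical, and the model ignores the wildcard value and the value restrictions on safelisted headers such as Content-Type.

```python
from typing import Iterable, List

# Header names that never need to be listed in Access-Control-Allow-Headers
# (simplified: the real rules also restrict the values these headers may carry).
SAFELISTED = {"accept", "accept-language", "content-language", "content-type"}


def blocked_headers(request_headers: Iterable[str], allow_headers: str) -> List[str]:
    """Return the request header names a preflight response would not permit.

    allow_headers is the raw Access-Control-Allow-Headers value returned by
    the cross-origin server. Header names are compared case-insensitively.
    """
    allowed = {h.strip().lower() for h in allow_headers.split(",") if h.strip()}
    return sorted(
        name.lower()
        for name in request_headers
        if name.lower() not in SAFELISTED and name.lower() not in allowed
    )


# A provider that only expects "authorization" rejects the new header.
print(blocked_headers(["Content-Type", "X-Example-Feature"], "authorization"))
```

A non-empty result corresponds to the case where the browser refuses to send the upload. Attaching such a header only to requests bound for origins known to accept it is one way to avoid the problem.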

May 28 09:45 UTC (lasting 5 hours)

On May 28, 2025, from approximately 09:45 UTC to 14:45 UTC, GitHub Actions experienced delayed job starts for workflows in public repos using Ubuntu-24 standard hosted runners. This was caused by a misconfiguration in backend caching behavior after a failover, which led to duplicate job assignments that reduced overall capacity in the impacted hosted runner pools. Approximately 19.7% of Ubuntu-24 hosted runner jobs on public repos were delayed. Other hosted runners, self-hosted runners, and private repo workflows were unaffected.

By 12:45 UTC, the configuration issue was fixed through updates to the backend cache. The pools were also scaled up to more quickly work through the backlog of queued jobs until queuing impact was fully mitigated at 14:45 UTC. We are improving failover resiliency and validation to reduce the likelihood of similar issues in the future.
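
The report does not describe how jobs are assigned internally, so the following is only a generic sketch of why duplicate assignments cost capacity and how an assignment record can guard against them. The AssignmentTable class and the job and runner IDs are hypothetical stand-ins, with an in-memory dictionary in place of a backend cache.

```python
from typing import Dict


class AssignmentTable:
    """Records which runner holds each job (hypothetical stand-in for a cache)."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def claim(self, job_id: str, runner_id: str) -> bool:
        """Assign job_id to runner_id unless a different runner already holds it."""
        # setdefault stores runner_id only when the job has no owner yet.
        current = self._owner.setdefault(job_id, runner_id)
        return current == runner_id

    def busy_runners(self) -> int:
        """Number of distinct runners currently holding at least one job."""
        return len(set(self._owner.values()))


table = AssignmentTable()
print(table.claim("job-1", "runner-a"))  # first claim succeeds
print(table.claim("job-1", "runner-b"))  # duplicate assignment is refused
print(table.busy_runners())
```

Without a check like this, the second runner would also pick up job-1, and the pool would spend two runners on one job's worth of work.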

May 30 08:10 UTC (lasting 7 hours and 50 minutes)

On May 30, 2025, between 08:10 UTC and 16:00 UTC, the Microsoft Teams GitHub integration service experienced a complete service outage.

During this period, the integration was unable to process user requests or deliver notifications, resulting in a 100% error rate across all functionality, with the exception of link previews. This outage was caused by an authentication issue with our downstream authentication provider.

While the appropriate monitoring was in place, the alerting thresholds were not sufficiently sensitive to trigger a timely response, resulting in a delay in incident detection and engagement. Once engaged, our team worked closely with the downstream provider to diagnose and resolve the authentication failure. However, longer-than-expected response times from the provider contributed to the extended duration of the outage.

We mitigated the incident by working with our provider to restore service functionality, and we are working to migrate to more durable authentication methods to reduce the risk of similar issues in the future.
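
As a rough illustration of how threshold sensitivity affects detection time, the sketch below evaluates an error rate over a sliding window of recent requests. It is a minimal example under assumed parameters, not the monitoring used for the integration. The ErrorRateAlert class, the window size, and the threshold values are all hypothetical.

```python
from collections import deque
from typing import Deque


class ErrorRateAlert:
    """Fires when the failure rate over the last `window` requests reaches `threshold`."""

    def __init__(self, window: int, threshold: float) -> None:
        self.window = window
        self.threshold = threshold
        self.results: Deque[bool] = deque(maxlen=window)

    def record(self, ok: bool) -> bool:
        """Record one request outcome and return True if the alert should fire."""
        self.results.append(ok)
        if len(self.results) < self.window:
            return False  # not enough data yet to judge the rate
        failures = sum(1 for r in self.results if not r)
        return failures / self.window >= self.threshold


# Every request fails: the alert fires as soon as the window is full.
alert = ErrorRateAlert(window=5, threshold=0.5)
print([alert.record(False) for _ in range(5)])
```

A smaller window or a lower threshold fires sooner, at the cost of more false alarms. Tuning that trade-off is what adjusting alert sensitivity amounts to.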


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: May 2025 appeared first on The GitHub Blog.
