
10 Critical Updates on GitHub Availability and Scaling

GitHub has faced two recent service interruptions that have disrupted workflows for millions of developers. In a candid update, the platform’s leadership acknowledged these failures and outlined a comprehensive plan to handle the explosive growth in software development activity. This listicle breaks down the key details from that announcement, including the causes of the outages, the scaling strategy, and the engineering changes being implemented to ensure reliability for years to come.

1. Two Recent Outages: What Happened?

GitHub experienced two distinct availability incidents in the past month, both of which fell short of the platform’s internal standards. While the exact origins differ—one tied to database load, the other to a cascading failure in a dependency—the common thread is the dramatic increase in traffic from automated and agentic workflows. These disruptions caused partial or full downtime for core services like repository creation, pull requests, and Actions. GitHub’s transparency in sharing root causes marks a shift toward more proactive communication with its user base. The incidents serve as a reminder that even the most robust distributed systems can buckle under unexpected growth patterns when not properly isolated.

[Image] Source: github.blog

2. The Apology and Commitment to Reliability

GitHub’s leadership issued a direct apology, stating unequivocally that the outages were “not acceptable.” This message goes beyond a simple mea culpa; it signals a strategic pivot. Reliability is now the top priority, surpassing new feature development. The company acknowledges that past investments in capacity planning did not keep pace with the real-world usage explosion, especially from AI-driven tooling. To rebuild trust, GitHub has pledged to provide more detailed post-mortems and real-time status updates going forward. The apology also underscores a cultural change—moving from reactive fixes to a proactive reliability engineering mindset that anticipates failure modes before they impact users.

3. Massive Scaling Plan: From 10X to 30X Capacity

In October 2025, GitHub launched a plan to increase infrastructure capacity tenfold. By February 2026, that target was already obsolete. The new goal is to scale to 30 times the current capacity within 18 months. This revised target reflects the velocity of change in the software industry, particularly the rise of autonomous coding agents that trigger hundreds of actions per second. The scaling strategy involves moving away from custom data centers toward public cloud infrastructure, adopting horizontally scalable services, and redesigning databases to handle write-heavy workloads. GitHub is also investing in chaos engineering to validate that the new architecture can withstand unexpected demand spikes.

4. The Driver: Explosive Growth in Agentic Development

Since December 2025, the way software is built has undergone a quiet revolution. Agentic development—where automated agents create, review, and merge code—has skyrocketed. Metrics such as repository creation rates, pull request volumes, API calls, automation scripts, and large-repository workloads have all doubled or tripled in just a few quarters. This isn’t a temporary trend but a structural shift in developer workflows. GitHub’s systems were originally designed for human-paced interactions; now they must handle machine-speed bursts. The platform is rearchitecting to treat this agentic traffic as a first-class workload, with dedicated queuing, rate limiting, and priority lanes to prevent cascading failures.
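The article doesn't describe GitHub's actual implementation of these "priority lanes," but the idea can be sketched as separate token buckets per traffic class: agentic traffic gets its own, tighter bucket, so a machine-speed burst exhausts that lane without starving interactive requests. Everything below (lane sizes, refill rates) is illustrative, not GitHub's configuration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Lane is a token bucket. Each traffic class gets its own lane, so a burst
// in one class can only drain that class's tokens.
type Lane struct {
	mu     sync.Mutex
	tokens float64
	max    float64
	rate   float64 // tokens refilled per second
	last   time.Time
}

func NewLane(max, rate float64) *Lane {
	return &Lane{tokens: max, max: max, rate: rate, last: time.Now()}
}

// Allow refills the bucket for the elapsed time, then spends one token
// if available; otherwise the request is rejected (or queued).
func (l *Lane) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	l.tokens += now.Sub(l.last).Seconds() * l.rate
	if l.tokens > l.max {
		l.tokens = l.max
	}
	l.last = now
	if l.tokens >= 1 {
		l.tokens--
		return true
	}
	return false
}

func main() {
	human := NewLane(100, 50) // generous interactive lane
	agent := NewLane(10, 5)   // tighter lane for automated agents

	granted := 0
	for i := 0; i < 50; i++ { // a machine-speed burst of 50 agent requests
		if agent.Allow() {
			granted++
		}
	}
	// The agent bucket caps the burst; the human lane is untouched.
	fmt.Println("agent requests granted:", granted)
	fmt.Println("human request allowed:", human.Allow())
}
```

The key property is isolation: the human lane's tokens are never consumed by agent traffic, so "legitimate developer traffic isn't starved" falls out of the structure rather than relying on global limits.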

5. The Domino Effect: How a Single Pull Request Strains the System

A single pull request today touches over a dozen subsystems: Git storage, merge checks, branch protection rules, Actions runners, search indexes, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. At high scale, even small inefficiencies compound. A slow cache miss turns into a database query; a database query that times out triggers a retry storm; that storm overwhelms the queue, which then delays other services. GitHub engineers describe this as “hidden coupling,” where one slow dependency can degrade multiple product experiences. The fix involves breaking these chains—implementing circuit breakers, dedicated resource pools, and graceful degradation so that a hiccup in webhooks doesn’t take down Git operations.

6. New Priorities: Availability, Then Capacity, Then Features

GitHub has formally reordered its engineering priorities: availability first, capacity second, and new features third. This means feature releases may slow down while reliability improvements take precedence. The team is focused on “reducing unnecessary work”—for example, eliminating redundant permission checks and optimizing notification delivery. They are also improving caching strategies to keep read-heavy workloads off the database. Another key initiative is moving performance-sensitive paths out of the monolithic Ruby codebase into Go services that are easier to scale independently. This new order of priorities is already showing results in reduced p95 latency for core Git operations.

7. Short-Term Fixes: Tackling Bottlenecks

Immediate actions included moving webhooks from a MySQL-backed system to a more scalable backend, redesigning the user session cache to reduce database hits, and rewriting authentication and authorization flows to cut database load by orders of magnitude. GitHub also took advantage of its ongoing migration to Azure to quickly provision additional compute resources—spinning up thousands of VMs in hours rather than weeks. These short-term fixes were necessary to stop the bleeding while longer-term architectural changes are designed. The team also implemented more aggressive rate limiting on API endpoints that were being hammered by automated agents, ensuring that legitimate developer traffic isn’t starved.


8. Isolating Critical Services to Reduce Blast Radius

One of the key lessons from the outages was that systems like Git hosting and GitHub Actions were too tightly coupled to other workloads. Engineers are now isolating these critical services into dedicated compute clusters with separate databases, caches, and load balancers. This minimizes the blast radius when a non-critical service fails. The isolation strategy involved a careful dependency analysis—mapping every call between services and categorizing them by risk tier. Once the coupling points were identified, the team introduced circuit breakers and bulkheads to prevent failures in one tier from cascading. This work is ongoing but has already prevented a repeat of the worst incidents.
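The "bulkhead" mentioned above has a very small core: cap the concurrency any one dependency tier may consume, so a stall there ties up a fixed number of workers rather than the whole fleet. A minimal sketch (sizes and service names are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBulkheadFull = errors.New("bulkhead full: shedding load")

// Bulkhead caps concurrent calls into one dependency tier. If that tier
// stalls, at most `size` callers are stuck in it; everyone else is
// rejected quickly and can degrade gracefully.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(size int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, size)}
}

// Run executes fn if a slot is free, otherwise rejects immediately
// instead of queuing behind the stall.
func (b *Bulkhead) Run(fn func()) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }()
		fn()
		return nil
	default:
		return ErrBulkheadFull
	}
}

func main() {
	critical := NewBulkhead(2) // e.g. a small dedicated pool for one tier

	blocked := make(chan struct{})
	ready := make(chan struct{})
	// Occupy both slots with calls that hang, simulating a slow dependency.
	for i := 0; i < 2; i++ {
		go critical.Run(func() { ready <- struct{}{}; <-blocked })
	}
	<-ready
	<-ready

	// Both slots are busy: the next caller is shed, not stalled.
	fmt.Println(critical.Run(func() {}))
	close(blocked)
}
```

Paired with the circuit breaker from section 5, this is how a hiccup in one tier stays a hiccup: the breaker stops the retries, and the bulkhead bounds how much capacity the slow tier can hold hostage.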

9. Modernizing Code: From Ruby to Go

GitHub’s core application has long been a Ruby on Rails monolith, which is notoriously difficult to scale under high concurrency. The platform has accelerated the migration of performance-critical and scaling-sensitive code into Go, a language known for its efficient handling of concurrent requests. This rewrite affects the most heavily trafficked endpoints—pull request merge logic, repository search, and webhook delivery. Early benchmarks show 5x improvements in throughput and 20x reductions in memory usage. The migration is happening incrementally to avoid massive rewrites, using strangler fig patterns to route traffic to new Go services while the legacy Ruby code remains as a fallback.
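The strangler fig pattern mentioned above comes down to a routing decision made per request. The prefixes below are invented for illustration (the article names the endpoint families but not the paths); the point is that migration happens one route at a time, with the monolith as the default.

```go
package main

import (
	"fmt"
	"strings"
)

// Paths already rewritten in Go; everything else still hits the monolith.
// These prefixes are illustrative, not GitHub's actual routing table.
var migrated = []string{"/webhooks/", "/search/", "/merge-queue/"}

// backendFor routes a request path: migrated prefixes go to the new Go
// service, and the legacy Rails app remains the fallback for the rest.
// In production this check would live in a proxy or edge router.
func backendFor(path string) string {
	for _, p := range migrated {
		if strings.HasPrefix(path, p) {
			return "go-service"
		}
	}
	return "legacy-rails"
}

func main() {
	for _, path := range []string{"/webhooks/deliver", "/settings/profile"} {
		fmt.Printf("%-20s -> %s\n", path, backendFor(path))
	}
}
```

Because the routing table is the only thing that changes per migration step, a misbehaving Go service can be rolled back instantly by removing its prefix, which is what makes the incremental approach safe.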

10. Multi-Cloud Migration: The Path Forward

GitHub was already moving out of its smaller custom data centers and into public cloud infrastructure; it has now accelerated that timeline to achieve true multi-cloud readiness. The plan involves running active workloads across both Azure and AWS, with automated failover between them. This reduces dependence on any single provider and adds a layer of geographic redundancy. Multi-cloud also allows GitHub to tap into the best pricing and capacity from different clouds. However, it introduces new challenges: data consistency across regions, latency management, and complex network topologies. Engineers are building abstractions to make the underlying cloud provider transparent to services, so that a zone outage doesn’t require manual intervention.
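An abstraction that makes "the underlying cloud provider transparent to services" typically means an interface that callers depend on, with a failover wrapper behind it. The sketch below is a generic illustration; the interface, provider names, and key are all invented, and a real system would add health checks, replication lag handling, and consistency guarantees.

```go
package main

import (
	"errors"
	"fmt"
)

// BlobStore abstracts object storage so callers never name a cloud provider.
type BlobStore interface {
	Get(key string) ([]byte, error)
}

// Failover reads from the primary provider and falls back to the secondary,
// so a provider or zone outage doesn't require manual intervention.
type Failover struct {
	Primary, Secondary BlobStore
}

func (f Failover) Get(key string) ([]byte, error) {
	if b, err := f.Primary.Get(key); err == nil {
		return b, nil
	}
	return f.Secondary.Get(key)
}

// fakeStore stands in for a real Azure or AWS client.
type fakeStore struct {
	name string
	down bool
}

func (s fakeStore) Get(key string) ([]byte, error) {
	if s.down {
		return nil, errors.New(s.name + ": unavailable")
	}
	return []byte("blob from " + s.name), nil
}

func main() {
	store := Failover{
		Primary:   fakeStore{name: "azure", down: true}, // simulate an outage
		Secondary: fakeStore{name: "aws"},
	}
	b, err := store.Get("some/object/key")
	fmt.Println(string(b), err) // prints "blob from aws <nil>"
}
```

The hard part, as the article notes, is not this wrapper but keeping the two providers' data consistent enough that a failover read returns something correct.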

Conclusion

GitHub’s candid update reveals a platform in the midst of a major transformation. The two recent incidents were painful but sparked a necessary rethinking of priorities. By scaling capacity 30x, isolating critical services, modernizing the codebase, and embracing multi-cloud, GitHub is building a foundation that can handle the explosive growth of agentic development. While there is still work to do—particularly around latency and failover automation—the roadmap is clear. For the millions of developers who depend on GitHub daily, the promise is a more resilient, faster, and ultimately more reliable platform. The coming months will test whether these architectural changes translate into consistent uptime, but the direction is undoubtedly the right one.
