How the Google Cloud Outage Crashed the Internet

This post dissects the large-scale Google Cloud Platform (GCP) outage of June 12, 2025, and the technical details behind the global service disruption. It covers the root cause, an insufficiently tested bug in the Service Control system, along with the mitigation steps and the subsequent recovery, and draws lessons for improving the stability and reliability of cloud services.

💥 **Trigger:** On June 12, 2025, Google Cloud's Service Control system crashed because of a bug in a newly introduced quota-checking feature, causing API requests to fail worldwide and setting off a large-scale outage.

⚙️ **Technical details:** The bug mishandled an incomplete or blank policy update in the Spanner database, triggering a null pointer error in the Service Control code and crashing the system. The feature was not protected by a feature flag, so the faulty code was live in production.

⏱️ **Response and recovery:** Google engineers quickly activated an internal "red button" mechanism to disable the problematic code and gradually restored service. However, a restart storm in the us-central1 region made recovery there take significantly longer.

📢 **Communication issues:** Because the Cloud Service Health dashboard was itself affected, Google's first public acknowledgment was delayed, leaving customers with stale information and compounding the impact.

7 Key Insights from the State of DevSecOps Report (Sponsored)

Datadog analyzed data from tens of thousands of orgs to uncover 7 key insights on modern DevSecOps practices and application security risks.

Highlights:

Plus, learn proven strategies to implement infrastructure as code, automated cloud deploys, and short-lived CI/CD credentials.

Get the report


Disclaimer: The details in this post have been derived from the details shared online by the Google Engineering Team. All credit for the technical details goes to the Google Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

On June 12, 2025, a significant portion of the internet experienced a sudden outage. What started as intermittent failures on Gmail and Spotify soon escalated into a global infrastructure meltdown. For millions of users and hundreds of companies, critical apps simply stopped working.

At the heart of it all was a widespread outage in Google Cloud Platform (GCP), which serves as the backend for a vast ecosystem of digital services. The disruption began at 10:51 AM PDT, and within minutes, API requests across dozens of regions were failing with 503 errors. Over a few hours, the ripple effects became undeniable.

Among consumer platforms, the outage took down:

The failure was just as acute for enterprise and developer tools:

In total, more than 50 distinct Google Cloud services across over 40 regions worldwide were affected. 

Perhaps the most significant impact came from Cloudflare, a company often viewed as a pillar of internet reliability. While its core content delivery network (CDN) remained operational, Cloudflare's authentication systems, reliant on Google Cloud, failed. This led to issues with session validation, login workflows, and API protections for many of its customers. 

The financial markets also felt the impact: Alphabet (Google's parent) saw its stock fall by nearly 1 percent. The incident raises an obvious question: how did a platform built for global scale suffer such a cascading collapse?

Let's dig into what happened.


Special Event: Save 20% on Top Maven Courses (Sponsored)

Your education is expiring faster than ever. What you learned in college won’t help you lead in the age of AI.

That's why Maven specializes in live courses with practitioners who have actually done the work and shipped innovative products:

This week only: Save 20% on Maven’s most popular courses in AI, product, engineering, and leadership to accelerate your career.

Explore Event (Ends Sunday)


Inside the Outage

To understand how such a massive outage occurred, we need to look under the hood at a critical system deep inside Google Cloud’s infrastructure. It’s called Service Control.

The Key System: Service Control

Service Control is one of the foundational components of Google Cloud's API infrastructure. 

Every time a user, application, or service makes an API request to a Google Cloud product, Service Control sits between the client and the backend. It is responsible for tasks such as verifying that the caller is authorized, enforcing quota limits, applying policy checks, and recording usage telemetry for billing and auditing.

The diagram below shows how Service Control works at a high level:

In short, Service Control acts as the gatekeeper for nearly all Google Cloud API traffic. If it fails, most of Google Cloud fails with it.
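To make the gatekeeper role concrete, here is a minimal sketch of the kind of admission pipeline such a component runs before a request reaches a backend. This is not Google's actual implementation; the types and helper functions below are hypothetical stand-ins for the real auth, quota, and policy subsystems.

```go
package main

import (
	"errors"
	"fmt"
)

// APIRequest is a hypothetical, simplified view of an incoming API call.
type APIRequest struct {
	Consumer string // the calling project or service account
	Method   string // e.g. "storage.objects.get"
}

// admit runs the checks a Service Control-style gatekeeper performs
// before letting a request through to the backend service.
func admit(req APIRequest) error {
	if !isAuthorized(req.Consumer, req.Method) {
		return errors.New("permission denied")
	}
	if !quotaAvailable(req.Consumer, req.Method) {
		return errors.New("quota exceeded")
	}
	if err := checkPolicies(req); err != nil {
		return fmt.Errorf("policy check failed: %w", err)
	}
	// Only after every check passes is the request forwarded.
	return nil
}

// Stubs standing in for calls to real auth, quota, and policy subsystems.
func isAuthorized(consumer, method string) bool   { return true }
func quotaAvailable(consumer, method string) bool { return true }
func checkPolicies(req APIRequest) error          { return nil }

func main() {
	req := APIRequest{Consumer: "projects/demo", Method: "storage.objects.get"}
	if err := admit(req); err != nil {
		fmt.Println("503 Service Unavailable:", err) // roughly what callers saw during the outage
		return
	}
	fmt.Println("request forwarded to backend")
}
```

Because every API call funnels through this admission step, a crash in the gatekeeper is indistinguishable, from the caller's point of view, from the backend itself being down.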

The Faulty Feature

On May 29, 2025, Google introduced a new feature into the Service Control system. This feature added support for more advanced quota policy checks, allowing finer-grained control over how quota limits are applied.

The feature was rolled out across regions in a staged manner. However, it contained a bug: a new code path could trigger a null pointer error, and that path was never exercised during the rollout. The feature relied on a specific type of policy input to activate, and because that input had not yet been introduced during testing, the bug went undetected.

Critically, this new logic was also not protected by a feature flag, which would have allowed Google to safely activate it in a controlled way. Instead, the feature was present and active in the binary, silently waiting for the right (or in this case, wrong) conditions to be triggered.
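For contrast, here is a hedged sketch of what a feature-flag guard around the new quota path could look like. The flag name and helpers are hypothetical; the point is that the risky branch ships disabled by default and is only enabled deliberately, region by region.

```go
package main

import "fmt"

// flagEnabled stands in for a lookup against a dynamic flag service.
// In this sketch, the new quota logic ships disabled everywhere.
func flagEnabled(name, region string) bool {
	// Hypothetical: flags are flipped on region by region after validation.
	return false
}

// checkQuota routes to the new logic only when the flag is on for this region.
func checkQuota(policy map[string]string, region string) error {
	if flagEnabled("advanced_quota_checks", region) {
		return advancedQuotaCheck(policy) // new, riskier code path
	}
	return legacyQuotaCheck(policy) // proven code path
}

// Stubs for the two quota implementations.
func advancedQuotaCheck(policy map[string]string) error { return nil }
func legacyQuotaCheck(policy map[string]string) error   { return nil }

func main() {
	err := checkQuota(map[string]string{}, "us-central1")
	fmt.Println("quota check result:", err)
}
```

With a guard like this, a malformed policy would have hit the dormant branch nowhere (or only in the regions where the flag had been turned on), containing the blast radius.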

The Triggering Event

Those conditions arrived on June 12, 2025, at approximately 10:45 AM PDT, when a new policy update was inserted into Google Cloud’s regional Spanner databases. This update contained blank or missing fields that the new quota-checking logic did not expect.

As Service Control read this malformed policy, the new code path was activated, triggered a null pointer error, and crashed the Service Control binary in that region.
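Google has not published the failing code, but the failure mode it describes is a familiar one: dereferencing a field that a malformed record left unset. The sketch below, with entirely hypothetical types, contrasts that crashing pattern with a defensive check that rejects the bad policy instead of taking down the process.

```go
package main

import (
	"errors"
	"fmt"
)

// QuotaPolicy is a hypothetical stand-in for the policy record read from Spanner.
type QuotaPolicy struct {
	Name   string
	Limits *LimitSpec // nil when the policy row arrives with blank fields
}

type LimitSpec struct {
	RequestsPerMinute int
}

// unsafeCheck mirrors the failure mode: it assumes Limits is always set,
// so a malformed policy dereferences a nil pointer and crashes the process.
func unsafeCheck(p *QuotaPolicy) int {
	return p.Limits.RequestsPerMinute // panics when Limits is nil
}

// safeCheck validates the input and degrades gracefully instead of crashing.
func safeCheck(p *QuotaPolicy) (int, error) {
	if p == nil || p.Limits == nil {
		return 0, errors.New("malformed quota policy: missing limits")
	}
	return p.Limits.RequestsPerMinute, nil
}

func main() {
	bad := &QuotaPolicy{Name: "projects/demo/policy"} // Limits is nil
	if _, err := safeCheck(bad); err != nil {
		// A real system would log the error and fall back to a default
		// behavior rather than taking down the whole binary.
		fmt.Println("rejecting policy:", err)
	}
	_ = unsafeCheck // shown for contrast; calling it with `bad` would panic
}
```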

Because Google Cloud’s policy and quota metadata is designed to replicate globally in near real time, a core capability of Spanner, the corrupted policy data propagated to every region within seconds.

Here’s a representative diagram of how replication works in Google Spanner:

As each regional Service Control instance attempted to process the same bad data, it crashed in the same way. This created a global failure of Service Control.

Since this system is essential for processing API requests, nearly all API traffic across Google Cloud began to fail, returning HTTP 503 Service Unavailable errors.

The speed and scale of the failure were staggering. One malformed update, combined with an unprotected code path and global replication of metadata, brought one of the most robust cloud platforms in the world to a standstill within minutes.

How Google Responded

Once the outage began to unfold, Google’s engineering teams responded with speed and precision. Within two minutes of the first crashes being observed in Service Control, Google’s Site Reliability Engineering (SRE) team was actively handling the situation. 

The sequence of events that followed is as follows:

The Red Button Fix

Fortunately, the team that introduced the new quota checking feature had built in a safeguard: an internal “red-button” switch. This kill switch was designed to immediately disable the specific code path responsible for serving the new quota policy logic. 

While not a complete fix, it offered a quick way to bypass the broken logic and stop the crash loop.

The red-button mechanism was activated within 10 minutes of identifying the root cause. By 40 minutes after the incident began, the red-button change had been rolled out across all regions, and systems began to stabilize. Smaller and less complex regions recovered first, as they required less infrastructure coordination.

This kill switch was essential in halting the worst of the disruption. However, because the feature had not been protected by a traditional feature flag, the issue had already been triggered in production globally before the red button could be deployed. 

Delayed Recovery in us-central1

Most regions began to recover relatively quickly after the red button was applied. However, one region (us-central1), located in Iowa, took significantly longer to stabilize.

The reason for this delay was a classic case of the “herd effect.” 

As Service Control tasks attempted to restart en masse, they all hit the same underlying infrastructure: the regional Spanner database that held policy metadata. Without any form of randomized exponential backoff, the system became overwhelmed by a flood of simultaneous requests. Rather than easing into recovery, it created a new performance bottleneck.
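A standard mitigation for this kind of restart stampede is randomized exponential backoff, which spreads retries out over time instead of letting every task hit the shared dependency in lockstep. Here is a minimal sketch of the idea (not Google's internal retry policy):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffWithJitter returns the wait time before retry attempt n,
// doubling each time and adding random jitter so that thousands of
// restarting tasks do not hammer the shared backend simultaneously.
func backoffWithJitter(attempt int, base, max time.Duration) time.Duration {
	d := base << attempt // exponential growth: base, 2*base, 4*base, ...
	if d > max || d <= 0 {
		d = max
	}
	// Full jitter: pick a random duration in [0, d).
	return time.Duration(rand.Int63n(int64(d)))
}

func main() {
	base := 500 * time.Millisecond
	max := 30 * time.Second
	for attempt := 0; attempt < 5; attempt++ {
		wait := backoffWithJitter(attempt, base, max)
		fmt.Printf("attempt %d: waiting %v before retrying startup read\n", attempt, wait)
		// In a real task, this is where the process would sleep and then
		// retry its read of policy metadata from the regional database.
	}
}
```

The jitter is the important part: without it, all tasks retry on the same schedule and the load spike simply repeats at each backoff interval.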

Google engineers had to carefully throttle task restarts in us-central1 and reroute some of the load to multi-regional Spanner databases to alleviate pressure. This process took time. Full recovery in us-central1 was not achieved until approximately 2 hours and 40 minutes after the initial failure, well after other regions had already stabilized.

Communication Breakdown

While the technical team worked to restore service, communication with customers proved to be another challenge.

Because the Cloud Service Health dashboard itself was hosted on the same infrastructure affected by the outage, Google was unable to immediately post incident updates. The first public acknowledgment of the problem did not appear until nearly one hour after the outage began. During that period, many customers had no clear visibility into what was happening or which services were affected.

To make matters worse, some customers relied on Google Cloud monitoring tools, such as Cloud Monitoring and Cloud Logging, that were themselves unavailable due to the same root cause. This left entire operations teams effectively blind, unable to assess system health or respond appropriately to failing services.

The breakdown in visibility highlighted a deeper vulnerability: when a cloud provider's observability and communication tools are hosted on the same systems they are meant to monitor, customers are left without reliable status updates in times of crisis.
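One practical takeaway is to keep at least one health probe that does not depend on the provider's own dashboard or monitoring stack. The sketch below is a hypothetical example: a small checker, run from independent infrastructure, that calls a critical endpoint directly and raises an alert over a separate channel. The URL and threshold are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe hits a critical endpoint directly, bypassing the cloud provider's
// status dashboard and monitoring stack. Run it from infrastructure that is
// not hosted on the provider being checked.
func probe(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("endpoint unreachable: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return fmt.Errorf("endpoint unhealthy: HTTP %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Placeholder URL; in practice this would be your own API's health route.
	if err := probe("https://api.example.com/healthz", 5*time.Second); err != nil {
		// Alert over a channel that does not depend on the same provider
		// (pager, SMS gateway, a second cloud, etc.).
		fmt.Println("ALERT:", err)
		return
	}
	fmt.Println("endpoint healthy")
}
```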

The Key Engineering Failures

The Google Cloud outage was not the result of a single mistake, but a series of engineering oversights that compounded one another. Each failure point, though small in isolation, played a role in turning a bug into a global disruption.

Here are the key failures that contributed to the incident:

Conclusion

In the end, it was a simple software bug that brought down one of the most sophisticated cloud platforms in the world. 

What might have been a minor error in an isolated system escalated into a global failure that disrupted consumer apps, developer tools, authentication systems, and business operations across multiple continents. This outage is a sharp reminder that cloud infrastructure, despite its scale and automation, is not infallible. 

Google acknowledged the severity of the failure and issued a formal apology to customers. In its public statement, the company committed to making improvements to ensure such an outage does not happen again. The key actions Google has promised are as follows:

Reference:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com



