TechCrunch News, December 13, 2024
OpenAI blames its massive ChatGPT outage on a ‘new telemetry service’

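OpenAI's ChatGPT, Sora, and API services suffered a major outage on Wednesday that lasted roughly three hours. A postmortem found that the failure was not caused by a security incident or a new product launch, but by a misconfigured, newly deployed telemetry service that overloaded the Kubernetes API servers and in turn disrupted critical services such as DNS resolution. Although OpenAI detected the problem a few minutes before users were affected, the fix was slow because the Kubernetes servers were overloaded. The incident exposed how multiple systems can fail at once and interact in unexpected ways. OpenAI has pledged to improve monitoring and phased rollouts for infrastructure changes to prevent a recurrence.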

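⚠️ The OpenAI outage was not caused by a security incident or a new product launch, but by a misconfigured, newly deployed telemetry service. The service was meant to collect Kubernetes metrics, but its configuration unintentionally triggered resource-intensive Kubernetes API operations.

⚙️ The telemetry service failure overloaded the Kubernetes API servers, which in turn affected DNS resolution. DNS resolution is the critical service that converts domain names to IP addresses, and its failure left many OpenAI services unable to function.

⏱️ OpenAI detected the problem a few minutes before users were affected, but the fix was slow because the Kubernetes servers were overloaded. In addition, DNS caching delayed visibility of the problem, allowing the failure to spread.

🛡️ OpenAI has pledged to improve monitoring and phased rollouts for infrastructure changes, and to build new mechanisms that ensure engineers can reach the Kubernetes API servers under any circumstances, to prevent a recurrence.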

OpenAI is blaming one of the longest outages in its history on a “new telemetry service” gone awry.

On Wednesday, OpenAI’s AI-powered chatbot platform, ChatGPT; its video generator, Sora; and its developer-facing API experienced major disruptions starting at around 3 p.m. Pacific. OpenAI acknowledged the problem soon after — and began working on a fix. But it’d take the company roughly three hours to restore all services.

In a postmortem published late Thursday, OpenAI wrote that the outage wasn’t caused by a security incident or recent product launch, but by a telemetry service it deployed Wednesday to collect Kubernetes metrics. Kubernetes is an open source program that helps manage containers, or packages of apps and related files that are used to run software in isolated environments.

“Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused … resource-intensive Kubernetes API operations,” OpenAI wrote in the postmortem. “[Our] Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large [Kubernetes] clusters.”

That’s a lot of jargon, but basically, the new telemetry service affected OpenAI’s Kubernetes operations, including a resource that many of the company’s services rely on for DNS resolution. DNS resolution converts domain names to IP addresses; it’s the reason you’re able to type “Google.com” instead of “142.250.191.78.”

OpenAI’s use of DNS caching, which holds info about previously-looked-up domain names (like website addresses) and their corresponding IP addresses, complicated matters by “delay[ing] visibility,” OpenAI wrote, and “allowing the rollout [of the telemetry service] to continue before the full scope of the problem was understood.”
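To make the "delayed visibility" point concrete, here is a minimal Python sketch of a resolver with a time-to-live cache. It is an illustration only, not OpenAI's setup: the `CachingResolver` class, the hostname `api.example.internal`, the address `10.0.0.7`, and the 60-second TTL are all made up for the example. The upstream lookup breaks at the 30-second mark, but callers keep getting cached answers until the entry expires.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class ResolutionError(Exception):
    """Raised when the (hypothetical) upstream resolver cannot answer."""


@dataclass
class CachingResolver:
    """Toy resolver that caches answers for a fixed TTL."""
    upstream: Callable[[str], str]
    ttl_seconds: float
    _cache: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def resolve(self, name: str, now: float) -> str:
        entry = self._cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit: upstream health is never consulted
        address = self.upstream(name)  # may raise ResolutionError
        self._cache[name] = (address, now + self.ttl_seconds)
        return address


def simulate() -> List[Tuple[float, str]]:
    """Break the upstream at t=30s and record what callers observe."""
    state = {"healthy": True}
    records = {"api.example.internal": "10.0.0.7"}  # assumed sample data

    def upstream(name: str) -> str:
        if not state["healthy"]:
            raise ResolutionError(name)
        return records[name]

    resolver = CachingResolver(upstream, ttl_seconds=60.0)
    timeline = []
    for now in (0.0, 30.0, 59.0, 61.0):
        if now == 30.0:
            state["healthy"] = False  # the outage starts here
        try:
            resolver.resolve("api.example.internal", now)
            timeline.append((now, "ok"))
        except ResolutionError:
            timeline.append((now, "failed"))
    return timeline


if __name__ == "__main__":
    for when, outcome in simulate():
        print(f"t={when:>4.0f}s  lookup {outcome}")
```

In this toy run the lookups at 30 and 59 seconds still succeed even though the upstream is already down; the first visible failure arrives at 61 seconds, after the cached entry has expired. A rollout watched only through caller-facing lookups would look healthy for that whole window.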

OpenAI says that it was able to detect the issue “a few minutes” before customers ultimately started seeing an impact, but that it wasn’t able to quickly implement a fix because it had to work around the overwhelmed Kubernetes servers.

“This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways,” the company wrote. “Our tests didn’t catch the impact the change was having on the Kubernetes control plane [and] remediation was very slow because of the locked-out effect.”

OpenAI says that it’ll adopt several measures to prevent similar incidents from occurring in the future, including improvements to phased rollouts with better monitoring for infrastructure changes and new mechanisms to ensure OpenAI engineers can access the company’s Kubernetes API servers in any circumstances.
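For readers unfamiliar with the term, a phased rollout with a health gate can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not OpenAI's tooling: `phased_rollout`, the cluster names, and the health rule are hypothetical, and a real gate would query control-plane metrics and wait for a soak period before moving on.

```python
from typing import Callable, List, Sequence, Tuple


def phased_rollout(
    stages: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    control_plane_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Apply a change stage by stage.

    After each stage, check every cluster in that stage. On the first
    unhealthy result, roll back everything applied so far and stop.
    Returns the clusters that still carry the change.
    """
    applied: List[str] = []
    for stage in stages:
        for cluster in stage:
            apply_change(cluster)
            applied.append(cluster)
        if not all(control_plane_healthy(c) for c in stage):
            for cluster in reversed(applied):
                rollback(cluster)
            return []
    return applied


def demo() -> Tuple[List[str], List[str], List[str]]:
    touched: List[str] = []
    rolled_back: List[str] = []
    # Hypothetical stages: one small canary, one large canary, then the fleet.
    stages = [["canary-small"], ["canary-large"], ["large-1", "large-2", "large-3"]]

    def healthy(cluster: str) -> bool:
        # Assumed rule for the demo: the change only overloads large clusters.
        return "large" not in cluster

    remaining = phased_rollout(stages, touched.append, healthy, rolled_back.append)
    return touched, rolled_back, remaining


if __name__ == "__main__":
    touched, rolled_back, remaining = demo()
    print("touched:", touched)
    print("rolled back:", rolled_back)
    print("still applied:", remaining)
```

In the demo the change passes on the small canary, fails the gate on the large canary, and is rolled back before it reaches any of the three fleet clusters. Including a canary that resembles the largest clusters matters here, since the postmortem says the control plane went down in most of the large clusters.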

“We apologize for the impact that this incident caused to all of our customers – from ChatGPT users to developers to businesses who rely on OpenAI products,” OpenAI wrote. “We’ve fallen short of our own expectations.”
