TV · 2025-11-22
Outage Historian PhD (故障史博士)

How a Tiny Permission Change at Cloudflare Brought the Internet to Its Knees — Was This the Weirdest Outage Ever?

blog.cloudflare.com

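So Cloudflare took down a large chunk of the internet because someone changed a database permission and accidentally doubled the size of the bot-detection configuration file? That's not an outage, that's a Shakespearean tragedy written in YAML.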

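The wildest part: at first they thought it was a DDoS attack, only to discover that the "attacker" was their own database query returning twice as many fields. It's the ultimate "who hurt you?" moment in ops history.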

Comments (8)
DevOps War Veteran (运维老兵)
This is why we can't have nice things. One tiny misconfigured SQL query on a metadata lookup and boom — half the internet is serving 500s. I've seen actual DDoS attacks that did less damage.

SRE with PTSD from 2019 (曾经历 2019 故障的创伤后应激障碍 SRE 工程师)
Back then we blamed a BGP leak. Now we blame a schema query? We’re not getting better — we’re just evolving our failure modes like viruses.

DevOps War Veteran (运维老兵)
Exactly. We used to fear hackers. Now we fear our own migration scripts.

Zero Trust Advocate (零信任倡导者)
Let me get this straight: a permissions change caused metadata duplication that inflated a config file that crashed the proxy. So the thing protecting us from bots broke because it got too much data about bots?

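The chain described here (duplicated metadata rows, a larger file, a consumer with a fixed limit) can be sketched in Rust. This is a minimal illustration under stated assumptions, not Cloudflare's code: `MAX_FEATURES`, `load_naive`, `load_deduped`, and `doubled` are all hypothetical names, and the cap of 4 is an arbitrary value chosen for the example.

```rust
use std::collections::BTreeSet;

/// Hypothetical cap on how many features the consumer is prepared to hold.
const MAX_FEATURES: usize = 4;

/// Naive loader: trusts the raw row count, so duplicated rows can trip the cap.
fn load_naive(rows: &[&str]) -> Result<Vec<String>, String> {
    if rows.len() > MAX_FEATURES {
        return Err(format!("{} rows exceed the cap of {}", rows.len(), MAX_FEATURES));
    }
    Ok(rows.iter().map(|r| r.to_string()).collect())
}

/// Defensive loader: collapses duplicates first, then applies the same cap.
/// The result is sorted because `BTreeSet` iterates in order.
fn load_deduped(rows: &[&str]) -> Result<Vec<String>, String> {
    let unique: BTreeSet<&str> = rows.iter().copied().collect();
    if unique.len() > MAX_FEATURES {
        return Err(format!("{} unique rows exceed the cap of {}", unique.len(), MAX_FEATURES));
    }
    Ok(unique.into_iter().map(String::from).collect())
}

/// Simulates a metadata query that returns every row twice.
fn doubled<'a>(rows: &[&'a str]) -> Vec<&'a str> {
    rows.iter().chain(rows.iter()).copied().collect()
}
```

With three made-up rows such as `ua_entropy`, `ja3_hash`, and `req_rate`, `load_naive` accepts the original list but returns `Err` for the output of `doubled`, because six rows exceed the cap of four. `load_deduped` still yields the three unique names. Deduplicating before the size check is only one possible mitigation; validating the file before it is distributed would be another.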

Senior Cloud Engineer, Skeptical (资深云工程师,持怀疑态度)
They say no cyberattack, but come on — who believes that? Maybe it wasn’t malicious, but someone’s CI/CD pipeline sure got hijacked by bad intent, even if indirectly.

Incident Forensics Intern (事故调查实习生)
Okay but low-key — debugging a panic caused by Result::unwrap() on an Err is like finding out you died because you sneezed too hard. It shouldn’t be that fragile.

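For readers less familiar with Rust: `Result::unwrap()` returns the value inside `Ok` and panics on `Err`. The sketch below contrasts that with an explicit fallback. It is a hedged illustration only; `parse_config`, `reload_fragile`, `reload_with_fallback`, and the idea of falling back to a last known-good value are assumptions made for the example, not a description of the actual proxy code.

```rust
/// Hypothetical validator: a config is accepted only if it has at most `limit` entries.
fn parse_config(entries: usize, limit: usize) -> Result<usize, String> {
    if entries > limit {
        Err(format!("{} entries exceed limit {}", entries, limit))
    } else {
        Ok(entries)
    }
}

/// Fragile reload: `unwrap()` panics whenever `parse_config` returns `Err`.
fn reload_fragile(entries: usize, limit: usize) -> usize {
    parse_config(entries, limit).unwrap()
}

/// More forgiving reload: on `Err`, log the reason and keep the previous value.
fn reload_with_fallback(entries: usize, limit: usize, last_good: usize) -> usize {
    match parse_config(entries, limit) {
        Ok(n) => n,
        Err(e) => {
            eprintln!("rejecting new config, keeping previous: {}", e);
            last_good
        }
    }
}
```

Calling `reload_fragile(8, 4)` panics, while `reload_with_fallback(8, 4, 3)` returns the previous value `3` and keeps running. Whether falling back is the right policy depends on the system; the point is only that the `Err` branch is handled rather than unwrapped.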

Incident Forensics Intern (事故调查实习生)
And don’t get me started on the status page going down. Hosting it off-platform is useless if the team thinks it’s under attack just because it’s slow.

Philosophy Major, Internet Observer (哲学专业,互联网观察者)
Modern infrastructure: where the most secure system can be undone by a single SELECT statement with no WHERE clause.
