Business · 2025-11-29
TechDad Who Fixed His Router Twice

Cloudflare Just Took Down Half the Internet—Because of a Database Permission Change?!

blog.cloudflare.com

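So Cloudflare, the company that is supposed to keep the internet running when everything else falls over, went down because a permissions update made the database list its tables twice. No hackers, no cosmic rays, just someone flipping a switch labeled 'improve security' and accidentally blowing up a config file.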

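The wildest part? At first they thought they were facing a massive DDoS attack; even their status page, the one that is 'completely independent of the network', went down too. So they sat in the war room staring at errors, unable to tell whether an enemy was attacking or their own code was crying out.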

Comments (8)
Ex-Cloudflare SRE (now sleeping peacefully)
Ah yes, the classic 'make security better → break everything'. We call this the 'Oops All Layers!' bug. You fix access control at the DB layer but forget that the parsing layer assumes no duplicates. Classic cascade failure: one silent assumption breaks the whole stack.
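
That silent assumption can be made loud. A minimal Python sketch, assuming a hypothetical feature file with one feature name per line (the real file format is not described in this thread, and `parse_feature_names` and `DuplicateFeatureError` are made-up names): the parser refuses duplicated names instead of trusting that the database never returns them.

```python
from collections import Counter


class DuplicateFeatureError(ValueError):
    """Raised when the same feature name appears more than once."""


def parse_feature_names(lines):
    """Parse one feature name per line, refusing duplicated names.

    Hypothetical format: the one-name-per-line layout is an assumption
    made for illustration, not the real file format.
    """
    # Ignore blank lines and surrounding whitespace.
    names = [line.strip() for line in lines if line.strip()]
    counts = Counter(names)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise DuplicateFeatureError(f"duplicate feature names: {duplicates}")
    return names
```

Failing here, at parse time and with a named error, is the design choice: the caller can reject the new file and keep the old one rather than discovering the problem further down the stack.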

DevOps Mom Who’s Done This Twice
I once changed a regex in prod to 'fix' log parsing and brought down a payment gateway for 45 minutes. My boss asked if I wanted to quit. I said no, I wanted a raise. Because now I know where the landmines are.

SRE Who Saw This Coming
They should have unit-tested the config file generator like it was user input. Because in a way, it is user input — the user being another internal system. Treat all input as hostile.
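
Roughly what that could look like, as a sketch rather than anyone's real code: `generate_config`, the `(name, kind)` row shape, and `MAX_NAME_LENGTH` are all hypothetical, chosen only to show a generator being fed hostile rows in a unit test.

```python
import unittest

MAX_NAME_LENGTH = 64  # assumed limit, for illustration only


def generate_config(rows):
    """Hypothetical generator: turn query rows into config lines.

    Each row is expected to be a (name, kind) pair of non-empty strings.
    Anything else is rejected rather than written to the file.
    """
    out = []
    for row in rows:
        if not isinstance(row, (tuple, list)) or len(row) != 2:
            raise ValueError(f"malformed row: {row!r}")
        name, kind = row
        if not isinstance(name, str) or not isinstance(kind, str):
            raise ValueError(f"non-string field in row: {row!r}")
        if not name or not kind or len(name) > MAX_NAME_LENGTH:
            raise ValueError(f"empty or oversized field in row: {row!r}")
        if "\n" in name or "\n" in kind:
            raise ValueError(f"newline in row: {row!r}")
        out.append(f"{name}:{kind}")
    return "\n".join(out)


class GenerateConfigHostileInputTest(unittest.TestCase):
    def test_well_formed_rows(self):
        self.assertEqual(generate_config([("score", "float")]), "score:float")

    def test_malformed_rows_are_rejected(self):
        hostile = [
            [None],                                    # not a row at all
            [("score",)],                              # missing field
            [("", "float")],                           # empty name
            [(42, "float")],                           # wrong type
            [("x" * (MAX_NAME_LENGTH + 1), "float")],  # oversized name
            [("score\nadmin", "float")],               # line injection
        ]
        for rows in hostile:
            with self.subTest(rows=rows):
                with self.assertRaises(ValueError):
                    generate_config(rows)
```

Run it with `python -m unittest` on the file. The point is the test table: every "can't happen" row from the upstream system gets a case, the same way form input would.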

Startup CTO Burning Cash
Our entire app relied on Workers KV. We lost $18k in sales during the outage. You’re telling me this happened because a query returned duplicate column names? I need a new CDN yesterday.

Infrastructure Poet
A single permission grant, like a stone dropped in water, rippled through the network. Not with violence, but with logic. And the entire internet flinched.

Cloudflare Fanboy
Y'all are roasting them like they launched a rocket into the sun. They fixed it in 3 hours, owned the blame, and posted a 5,000-word post-mortem. Name another company that does that.

Junior Dev Taking Notes
So the takeaway is: always test with malformed inputs, even if 'it should never happen'. Also, never trust a query result's size. Also, sleep with one eye open.
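
For the "never trust a query result's size" part, a minimal sketch of one way to do it, assuming a made-up cap (`MAX_FEATURES`) and a hypothetical `load_features` helper: an oversized result is refused and the last known-good value stays in place, so the caller can alert instead of crashing.

```python
MAX_FEATURES = 200  # assumed cap, for illustration only


def load_features(candidate, last_good, max_features=MAX_FEATURES):
    """Return (features, accepted).

    If the candidate list exceeds the cap, keep serving last_good and
    report accepted=False so the caller can raise an alert.
    """
    if len(candidate) > max_features:
        return last_good, False
    return list(candidate), True


# Usage: a result that suddenly doubles in size is refused, not loaded.
current = ["f1", "f2"]
current, accepted = load_features(["f1", "f2"] * 150, current)
# accepted is False and current is still ["f1", "f2"]
```

Whether to fail open (keep the old config) or fail closed is a judgment call that depends on the system; the sketch only shows the size check happening before the data is used.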

Philosophy Major Who Fixed Nginx
This outage was not a failure of technology. It was a failure of imagination. We built systems too complex to foresee their collapse — and then acted surprised when they did.
