LessWrong, June 3, 02:57
SRE's review of Democracy

 

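This article tells the story of a team that takes over a legacy system called "Democracy" and tries to maintain it. The system is structurally complex and its modules are tightly coupled, and the maintenance team faces challenges such as a lack of system knowledge and insufficient monitoring. Through code review and deeper investigation, the team identifies the system's key components: the "Separation of Powers" and "Election" mechanisms and the "FreePress" monitoring system. The article also exposes problems in the system, such as the absence of a failure-analysis culture and misleading metrics. In the end, the team tries to introduce a new metric and considers how to fix the system's problems.

🔍 System overview: The team takes over a legacy system called "Democracy". It is structurally complex, its modules are highly coupled, and the maintenance team lacks a deep understanding of it, which makes maintenance challenging.

💡 Key mechanisms: "Separation of Powers" is not a redundancy system; each branch runs different business logic, and the branches monitor one another. The "Election" mechanism is a distributed consensus algorithm that ensures only one control system runs as master and that activates the backup system when things go wrong.

📊 Monitoring and metrics: The "FreePress" monitoring system is tightly intertwined with the production system, and its behavior during a failure is unknown. The "GDP" metric is used as the main indicator of system health but is really more of a throughput metric, which raises doubts about its validity.

⚠️ Known problems: The system has no failure-analysis culture, so its failure modes are poorly understood. It is also unclear how affected components are repaired once they have been isolated, which poses a potential risk to the system's stability.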

Published on June 2, 2025 6:50 PM GMT

Day One

We've been handed this old legacy system called "Democracy". It's an emergency. The old maintainers are saying it has been misbehaving lately, but they have no idea how to fix it. We've had a meeting with them to find out as much as possible about the system, but it turns out that all the original team members left the company a long time ago. The current team doesn't have much understanding of the system beyond some basic operational knowledge.

We've conducted a cursory code review, focusing not so much on business logic but rather on the stuff that could possibly help us to tame it: Monitoring, reliability characteristics, feedback loops, automation already in place. Our first impression: Oh, God, is this thing complex! Second impression: The system is vaguely modular. Each module is strongly coupled with every other module though. It's an organically grown legacy system at its worst.

That being said, we've found a clue as to why the system may have worked fine for so long. There's a redundancy system called "Separation of Powers". It reminds me of the Tandem computers from back in the '70s.

Day Two

We were wrong. "Separation of Powers" is not a system for redundancy. Each part of the system ("branch") has different business logic. However, each also acts as a watchdog process for the other branches. When it detects misbehavior it tries to apply corrective measures using its own business logic. Gasp!
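A minimal Python sketch of the mutual-watchdog arrangement as we currently understand it. Everything here is an assumption for illustration: the branch names other than "Judiciary", the corrective actions, and the shared `state` dictionary are placeholders, not anything found in the actual system.

```python
from typing import Callable, Dict, List

# Hypothetical shared health state: branch name -> is it behaving?
State = Dict[str, bool]


def make_branch(name: str, corrective_note: str) -> Callable[[State, List[str]], None]:
    """Build a branch that acts as a watchdog for every *other* branch."""
    def watch(state: State, log: List[str]) -> None:
        for other, behaving in state.items():
            if other != name and not behaving:
                # The correction is applied using this branch's own business logic.
                log.append(f"{name} -> {other}: {corrective_note}")
    return watch


# Placeholder branches and corrective measures (assumptions, see lead-in).
branches = {
    "Legislature": make_branch("Legislature", "impeach"),
    "Executive": make_branch("Executive", "veto"),
    "Judiciary": make_branch("Judiciary", "strike down"),
}

state = {"Legislature": True, "Executive": False, "Judiciary": True}
log: List[str] = []
for watch in branches.values():
    watch(state, log)
print(log)
```

Note that a single misbehaving branch gets two different corrections at once, one from each of its peers. That is exactly the part that makes us nervous.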

Things are not looking good. We're still searching for monitoring.

Day Three

Hooray! We've found the monitoring! It turns out that "Election" is conducted once every four years. Each component reports its health (1 bit) to the central location. The data rate is so low that we overlooked it until now. We are considering shortening the reporting period, but the subsystem is so deeply coupled with other subsystems that doing so could easily lead to a cascading failure.

In other news, there seems to be some redundancy after all. We've found a full-blown backup control system ("Shadow Cabinet") that is inactive at the moment, but might be able to take over in case of a major failure. We're investigating further.

Day Four

Today, we've found yet another monitoring system called "FreePress." As the name suggests, it was open-sourced some time ago, but the corporate version has evolved quite a bit since then, so the documentation isn't very helpful. The bad news is that it's badly intertwined with the production system. The metrics look more or less okay as long as everything is working smoothly. However, it's unclear what will happen if things go south. It may distort the metrics or even fail entirely, leaving us with no data whatsoever at the moment of crisis.

By the way, the "Election" process may not be a monitoring system after all. I suspect it might actually be a feedback loop that triggers corrective measures in case of problems.

Day Five

The most important metric seems to be this big graph labeled "GDP". As far as we understand, it's supposed to indicate the overall health of the system. However, drilling into the code suggests that it's actually a throughput metric. If throughput goes down there's certainly a problem, but it's not clear why increasing throughput should be considered the primary health factor...

More news on the "Election" subsystem: We've found a floppy disk with the design doc, and it turns out that it's not a feedback loop after all. It's a distributed consensus algorithm (think Paxos)! The historical context is that they used to run several control systems in parallel (for redundancy reasons maybe?), which resulted in numerous race conditions and outages. "Election" was put in place to ensure that only one control system acts as a master at any given time. The consensus algorithm is based on the PTP (Peaceful Transfer of Power) protocol. The gist is that when most components report being unhealthy, it is treated as a failure of the control system and the backup ("Shadow Cabinet") is automatically activated. The main control system then becomes the backup. It's unclear how it is supposed to be fixed while in backup mode, though.
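The failover rule described in the design doc can be sketched roughly as follows. This is our reading, not the real implementation: the `ControlPlane` class, the name "Cabinet" for the active control system, and the strict-majority threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControlPlane:
    """Hypothetical model: one active control system and one backup."""
    active: str
    backup: str

    def run_election(self, health_bits: List[bool]) -> bool:
        """Swap active and backup if most components report unhealthy.

        Each entry in health_bits is one component's 1-bit report
        (True means healthy). Returns True if a failover happened.
        """
        unhealthy = sum(1 for healthy in health_bits if not healthy)
        if unhealthy * 2 > len(health_bits):  # strict majority unhealthy (assumed)
            # The old master becomes the backup; nothing here repairs it.
            self.active, self.backup = self.backup, self.active
            return True
        return False


plane = ControlPlane(active="Cabinet", backup="Shadow Cabinet")
failed_over = plane.run_election([False, False, True, False, True])
print(plane.active, failed_over)
```

The sketch makes the open question visible: the swap is the only state change, so the demoted control system re-enters the pool exactly as broken as it left.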

Day Six

I met some guys from Theocracy Inc. in a bar last night and complained about the GDP metric. They suggested using the GNH ("Gross National Happiness") metric instead. I'll tell Ethan to add such a console first thing on Monday.

We've also dug into the operational practices for "Democracy". It turns out there was no postmortem culture. Outages were followed by cover-ups and blame-shifting. The most damaging consequence is that we have no clear understanding of the system's failure modes.

We now have a better understanding of the "Judiciary" branch, one of the "Separation of Powers" branches. It evaluates whether components are behaving according to a predefined set of rules. If they are not, they are removed from production and put into a suspended state ("BSD jails"). It's unclear how they are supposed to be fixed while suspended. (We've seen a similar problem with the backup control system, so we might be missing something essential here.)
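Here is a rough sketch of the "Judiciary" evaluation loop as we understand it so far. The `judge` function, the component fields such as `error_rate`, and the example rule are hypothetical stand-ins, not the real rule set.

```python
from typing import Callable, Dict, List, Set

Rule = Callable[[dict], bool]  # returns True when the component complies


def judge(components: Dict[str, dict], rules: List[Rule],
          production: Set[str], suspended: Set[str]) -> None:
    """Move every rule-breaking component from production to suspended."""
    for name in sorted(production):  # sorted() copies, so mutating the set is safe
        if not all(rule(components[name]) for rule in rules):
            production.remove(name)
            suspended.add(name)
    # Note: nothing in this loop ever moves a component back to production.


# Hypothetical components and a single made-up rule.
components = {"a": {"error_rate": 0.01}, "b": {"error_rate": 0.4}}
rules: List[Rule] = [lambda c: c["error_rate"] < 0.1]
production = {"a", "b"}
suspended: Set[str] = set()

judge(components, rules, production, suspended)
print(sorted(production), sorted(suspended))
```

As with the backup control system, there is an entry path into the suspended set but no obvious repair path out of it.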

It's Sunday tomorrow. I am taking a day off to think about how to solve the mess. Hopefully, nothing will blow up while I'm away.


Author's note: Originally posted in 2017. Still funny. Reposting.


