Meta tests instant power loss across production regions without warning

Meta deploys region-wide power-loss testing in production

Meta introduced Instantaneous PowerLoss Storm, a testing protocol that de-energizes entire production data center regions without advance warning to validate recovery behavior. The test injects a power supply fault, triggers asynchronous shutdown signals across millions of services, and measures time to recovery against real incident baselines.

The effort exposed two architectural failure modes. The first: circular dependencies among Twine control-plane services (Scheduler, Allocator, Broker, Zelos) that manifest only during region-wide bootstrapping, when millions of services attempt to start and discover each other simultaneously. Meta solved this with Belljar continuous integration tests to catch dependency cycles early, plus a purpose-built Twine recovery kit that can "jumpstart" orchestration services without external input.

The second failure: a "boomerang" scenario where unavailability event signals, intended to orchestrate service shutdown, ended up shutting down the orchestrator itself, orphaning services that could never receive cleanup signals. Meta chose the simpler fix: allowing control-plane services to ignore power-related shutdown signals rather than maintain a hardcoded exclusion list.

Meta established explicit tradeoff boundaries. Data loss, permanent facility damage, and sustained multi-region impact are unacceptable. Transient service errors, individual rack failures within thresholds, and bounded staleness in routing tables are tolerable if they can be remediated within typical mean time to respond (MTTR) windows. Testing began in pre-production regions, moved to "shadow" replicas of live production, then progressed to the company's smallest live production regions before scaling to critical storage, AI, and data warehouse workloads.

Regional resilience is now a scaling constraint

Meta's data centers operate at two scales: sub-regional fault domains and multi-building regions. Previous Disaster Readiness testing validated fault domains reliably. A region is 50–60 times larger and involves coordinated recovery across hundreds of thousands of servers sharing common power and network paths. At that scale, design assumptions that hold in fault domains break.

The constraint is autonomy. A powered-off region cannot rely on external signals or manual intervention to recover. Millions of services must discover each other, establish leadership, and converge on a consistent state without external coordination. The circular dependency and boomerang problems reveal how hidden assumptions in control-plane architecture can prevent that autonomy entirely.

This matters because capacity deployment and AI workload scaling are outpacing the ability to validate them. Meta notes that reliability and velocity are "two facets of the same coin." Without this testing foundation, rapid region expansion introduces hidden failure modes that only surface during real outages.

Build region-wide recovery into initial architecture, not retrofit

Control-plane circular dependencies are not exotic. They emerge naturally when services depend on schedulers to start, schedulers depend on service discovery to function, and service discovery depends on schedulers to replicate. Test for these in CI/CD with dependency graph analysis (Belljar-style) before they reach production. If they slip through, you need a recovery mechanism that does not depend on the systems it is recovering.

Asynchronous signaling systems (like Meta's unavailability events) must exclude the orchestrator itself from its own shutdown signals when those signals are power-related. This is a one-line policy fix that prevents entire classes of cascading failures.

Plan region testing incrementally. Pre-production is safe but unrealistic. Shadow replicas add fidelity. Smallest production regions add real-world blast radius. Only after validating smaller regions should you attempt de-energization of regions holding critical workloads. Each iteration teaches the infrastructure and the team.

Meta tests instant power loss across production regions without warning

Our Take

Why it matters

Do this week

Meta deploys region-wide power-loss testing in production

Regional resilience is now a scaling constraint

Build region-wide recovery into initial architecture, not retrofit

One daily brief. Every story gets a hype verdict.

Related stories

The 30-Day AI-Native Challenge: a free/freemium roadmap to real AI skills

Your AI compliance gap is wider than your governance framework

Compliance teams ditch spreadsheets for unified EDD software