Chaos Engineering: Principles and Practice
Distributed systems fail in ways unit tests never simulate. A replica lags behind its primary, a dependency times out under load, a deployment rolls out to half the cluster before someone notices the new health check is wrong. On the client, a payment API returns an empty body after thirty seconds of latency and checkout silently confirms $0.00. Monitoring tells you that something broke, but only after the fact. Load tests tell you how the system behaves when everything is working but busy. Chaos engineering asks a sharper question: when this specific component fails, does the rest of the system absorb it? ...
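
The checkout failure described above can be sketched as a small fault-injection experiment. This is a minimal Python illustration under stated assumptions, not a real payment client: `healthy_payment_api`, `empty_body_fault`, `naive_checkout`, `guarded_checkout`, and `run_experiment` are hypothetical names invented for this example, and the thirty seconds of latency is left out so the sketch runs instantly. It injects one specific fault (an empty response body) and checks whether checkout absorbs it or confirms a wrong amount.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


class PaymentUnavailable(Exception):
    """Raised when the payment response cannot be trusted."""


@dataclass
class Response:
    status: int
    body: Optional[dict]  # None models an empty response body


PayFn = Callable[[int], Response]


def healthy_payment_api(amount_cents: int) -> Response:
    # Hypothetical stand-in for a real payment call that succeeds.
    return Response(status=200, body={"charged_cents": amount_cents})


def empty_body_fault(_call: PayFn) -> PayFn:
    # Fault injector: replaces the wrapped call with one that reports
    # success but returns an empty body (latency omitted; an assumption).
    def faulty(amount_cents: int) -> Response:
        return Response(status=200, body=None)
    return faulty


def naive_checkout(pay: PayFn, amount_cents: int) -> str:
    # Treats a missing body as "charged nothing" and confirms anyway.
    resp = pay(amount_cents)
    charged = (resp.body or {}).get("charged_cents", 0)
    return f"confirmed ${charged / 100:.2f}"


def guarded_checkout(pay: PayFn, amount_cents: int) -> str:
    # Refuses to confirm unless the response carries the expected charge.
    resp = pay(amount_cents)
    if resp.status != 200 or not resp.body or "charged_cents" not in resp.body:
        raise PaymentUnavailable("payment response missing charge amount")
    charged = resp.body["charged_cents"]
    if charged != amount_cents:
        raise PaymentUnavailable("charged amount does not match order total")
    return f"confirmed ${charged / 100:.2f}"


def run_experiment(checkout: Callable[[PayFn, int], str],
                   amount_cents: int = 1999) -> Tuple[str, str]:
    # Inject the fault, then record whether checkout confirmed or rejected.
    faulty_pay = empty_body_fault(healthy_payment_api)
    try:
        return ("confirmed", checkout(faulty_pay, amount_cents))
    except PaymentUnavailable as exc:
        return ("rejected", str(exc))


if __name__ == "__main__":
    print(run_experiment(naive_checkout))
    print(run_experiment(guarded_checkout))
```

In this sketch the naive checkout confirms the order under the injected fault, while the guarded one rejects it. Whether checkout confirms or rejects is the outcome such an experiment is meant to surface before it happens in production.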