The most expensive word in an incident is «recovered». In one real incident — a loss of network aliases in Docker Swarm — recovery was announced twice: after the first announcement the regression came back within an hour, and after the second, silent damage kept accumulating for another four days, because the control plane confidently reported «everything is up». A premature announcement costs more than the downtime itself: it stands the team down from war mode and burns trust in every «it works now» that follows.
The framework that keeps an incident from turning into chaos has three parts: severity classification, the Incident Command System roles, and a formal definition of recovered.
Severity is assigned in a minute — and it decides everything
SEV1: full prod outage, data loss, or a security breach. Page within five minutes, all-hands, status page, leadership in the loop.
SEV2: major degradation of a key feature. Page within 15 minutes, on-call plus manager.
SEV3: degradation with a workaround. Ticket within an hour, team channel.
SEV4: cosmetic. Next business day, backlog.
The point of this table is not the classification itself but that it lifts the first decision off a tired human at 3 a.m.: severity determines who gets woken up, which channel opens, and how much time is allotted to respond. The argument «is this really a SEV1?» is deferred to the postmortem — downgrading an incident after the fact is cheap, sleeping through an escalation is expensive.
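The same table can live as data, so that paging tooling and humans read one source of truth. What follows is a minimal sketch, not a reference implementation: the names SeverityPolicy, POLICIES and policy_for are hypothetical, the labels in notify are illustrative, and «next business day» is approximated as one day.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    respond_within: timedelta  # deadline for the first page or ticket
    notify: tuple[str, ...]    # who or what gets engaged
    uses_ics: bool             # whether Incident Command System roles are staffed


# Values mirror the severity table; "next business day" is approximated
# as one day, which is an assumption of this sketch.
POLICIES = {
    "SEV1": SeverityPolicy(timedelta(minutes=5), ("all-hands", "status page", "leadership"), True),
    "SEV2": SeverityPolicy(timedelta(minutes=15), ("on-call", "manager"), True),
    "SEV3": SeverityPolicy(timedelta(hours=1), ("team channel",), False),
    "SEV4": SeverityPolicy(timedelta(days=1), ("backlog",), False),
}


def policy_for(severity: str) -> SeverityPolicy:
    """Look up the response policy; an unknown label raises KeyError."""
    return POLICIES[severity.upper()]


print(policy_for("sev2").notify)
```

An unknown label raises an error instead of falling back to a default, so a typo at 3 a.m. cannot quietly turn a SEV1 into a backlog item.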
ICS: the commander does not touch the keyboard
For SEV1/SEV2 the Incident Command System kicks in — a set of roles borrowed from the world of fire services. The Incident Commander makes decisions and coordinates — and does no hands-on work. The Operations Lead runs the technical diagnosis and mitigation. The Communications Lead updates the status page and keeps customers and leadership informed. The Scribe keeps the timeline in the #incident-NNN channel. SMEs are pulled in by domain — database, network, a specific service.
One rule above all: the IC cannot dig and coordinate at the same time. An overloaded commander loses the whole picture, the team loses synchronisation — and an hour later it turns out two engineers were rolling out incompatible mitigations in parallel.
Mitigate ≠ Fix
The incident lifecycle: Detect → Triage → Mitigate → Resolve → Learn. Mitigate means stopping the bleeding: a rollback, a scale-up, a disabled feature flag. Fix means resolving the root cause afterwards, in calm conditions. The classic mistake is debugging the root cause at the peak of an incident when a rollback would have restored the service in two minutes. By the rule of thumb that roughly 90% of incidents are tied to a recent change, «what shipped in the last two hours» is checked before the first stack trace.
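The «what shipped in the last two hours» check is easy to automate if deploys and flag flips are recorded somewhere. A minimal sketch, assuming each change is a dict with a name and a timezone-aware shipped_at timestamp; this record format and the function name recent_changes are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recent_changes(changes, now, window=timedelta(hours=2)):
    """Return the changes shipped within `window` before `now`, newest first."""
    cutoff = now - window
    recent = [c for c in changes if cutoff <= c["shipped_at"] <= now]
    return sorted(recent, key=lambda c: c["shipped_at"], reverse=True)


# Illustrative data only; a real list would come from the deploy log.
now = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
changes = [
    {"name": "api release", "shipped_at": now - timedelta(minutes=40)},
    {"name": "schema migration", "shipped_at": now - timedelta(hours=5)},
    {"name": "flag: new-checkout", "shipped_at": now - timedelta(minutes=10)},
]

for change in recent_changes(changes, now):
    print(change["name"])
```

The newest change comes first because it is the first rollback candidate.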
The three gates before the word «recovered»
The lesson of that incident formalises into three checks; until all of them pass, the incident does not close.
The first gate is stable-for-N-minutes under production traffic. N scales with severity: SEV1 — 30 minutes or more, SEV2 — 15 or more. A single green dot on the graph is not a trend.
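A minimal sketch of this gate, assuming health samples arrive as (timestamp, healthy) pairs taken under production traffic. The function name stable_for, the sample format and the max_gap tolerance are illustrative and not tied to any particular monitoring tool.

```python
from datetime import datetime, timedelta, timezone

# Minimum stable windows from the text; SEV3 and SEV4 are not specified there.
STABLE_WINDOW = {"SEV1": timedelta(minutes=30), "SEV2": timedelta(minutes=15)}


def stable_for(samples, severity, now, max_gap=timedelta(minutes=2)):
    """Return True if the unbroken healthy streak ending at `now` covers the window.

    samples: iterable of (timestamp, healthy) pairs, in any order.
    The streak ends at the first unhealthy sample (walking back in time)
    or at the first hole in the data longer than `max_gap`.
    """
    window = STABLE_WINDOW[severity]
    newest_first = sorted(samples, key=lambda s: s[0], reverse=True)
    previous = now
    streak_start = None
    for ts, healthy in newest_first:
        if not healthy or previous - ts > max_gap:
            break
        streak_start = previous = ts
    return streak_start is not None and now - streak_start >= window


now = datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc)
# One healthy sample per minute for the last 31 minutes.
samples = [(now - timedelta(minutes=i), True) for i in range(32)]
print(stable_for(samples, "SEV1", now))
```

Treating a hole in the data as a broken streak is deliberate: missing samples are not evidence of health.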
The second is downstream-artefact verification. For every critical output of the affected services — sitemap, RSS, public API, scheduled exports — freshness and correctness are checked, not just the health of the service itself. The combination «service up, artefact stale» is a classic hit to credibility.
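The freshness half of this gate can be sketched as follows, assuming the last-updated time of each artefact is already known (from a Last-Modified header, a file mtime, the newest RSS item, and so on). The artefact names and age limits are illustrative, and correctness of the content still needs its own check.

```python
from datetime import datetime, timedelta, timezone


def stale_artefacts(artefacts, now):
    """Return the names of artefacts older than their allowed age.

    artefacts: mapping of name -> (last_updated, max_age).
    """
    return sorted(
        name
        for name, (last_updated, max_age) in artefacts.items()
        if now - last_updated > max_age
    )


now = datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc)
# Illustrative values; real limits depend on how often each artefact is rebuilt.
artefacts = {
    "sitemap.xml": (now - timedelta(hours=30), timedelta(hours=24)),
    "rss": (now - timedelta(minutes=20), timedelta(hours=1)),
    "nightly-export": (now - timedelta(hours=3), timedelta(hours=26)),
}

print(stale_artefacts(artefacts, now))
```

An empty result is what lets the gate pass; any name in the list is a «service up, artefact stale» case.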
The third is the post-recovery discovery audit, and it deserves a section of its own.
Until all three gates have passed, the incident status is «stabilised, verification ongoing». Not «closed».
The control plane lies: declared vs actual
The most insidious failure mode is common to any orchestrator with internal DNS: the control plane reports «everything is up» while the actual registrations are partially missing or stale. In Kubernetes a pod can be Running yet never make it into Endpoints, for example because of a failing readiness probe or a node taint. In Docker Swarm overlay aliases drop, and reconnecting the network in place does not restore the bookkeeping. In Consul the cause can be agent split-brain or an expired registration TTL. Downstream consumers, meanwhile, silently get 504 or NXDOMAIN.
The answer is a recurring «declared vs actual» audit: what is declared (the service selector, the aliases from the compose stack, the service definition) is compared against what is actually registered (EndpointSlice, overlay DNS, consul catalog). The result is pushed to an external receiver — an Uptime Kuma push monitor or healthchecks.io: a downed cluster breaks its own alerting too, so a local cron with an alert is no good here.
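At its core the audit is a set difference plus a push to the outside. A minimal sketch, assuming the declared and actual names have already been collected from the orchestrator; the function names, the example URL and the status/msg query parameters are assumptions of this sketch, so check the documentation of the receiver in use for its real push format.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def audit(declared, actual):
    """Compare declared names against the ones actually registered."""
    declared, actual = set(declared), set(actual)
    return {
        "missing": sorted(declared - actual),     # declared, never registered
        "unexpected": sorted(actual - declared),  # registered, possibly stale
        "ok": declared == actual,
    }


def build_push_url(base_url, report):
    """Encode the audit result as query parameters (assumed receiver format)."""
    params = {
        "status": "up" if report["ok"] else "down",
        "msg": f"missing={len(report['missing'])} unexpected={len(report['unexpected'])}",
    }
    return f"{base_url}?{urlencode(params)}"


def push(base_url, report, timeout=10):
    """Send the result to the external receiver; not called in this example."""
    with urlopen(build_push_url(base_url, report), timeout=timeout) as response:
        return response.status


# Hypothetical names: aliases declared in the stack vs. names seen in DNS.
report = audit(["web", "api", "worker"], ["web", "api", "old-web"])
print(report["missing"], report["unexpected"])
print(build_push_url("https://example.invalid/push/abc", report))
```

Because the receiver sits outside the cluster, it can also raise an alert when the push simply stops arriving, which covers the case where the audit job died together with everything else.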
The same failure mode yields a practical observation about resets: for an orchestrator with opaque internal state, scaling to zero and back is stronger than an in-place restart, because it releases the scheduler's bookkeeping. If «restart the pod» did not help, the state is corrupted deeper than it looks from the surface.
«Recovered» is a contract
The definition of recovered is the team's contract with its users: not «the metric went green» but «stable for N minutes under traffic, downstream artefacts fresh, declared matches actual». Writing it into a runbook takes half an hour. Not having it costs four days of silent damage after the second premature announcement.