Error Budget as a Stop Button: SLOs Without Panic

Error budget turns reliability into a resource you can spend — and multi-burn-rate alerts turn it into a page that's actually worth waking up for.

Aleksandr Khomutov May 25, 2026 ≈ 4 min

A minor bug pushed the error rate up by 0.05%. The on-call engineer's first instinct is to roll everything back at once. But a rollback under pressure introduces risk of its own: a new deploy, a new failure surface, a decision made in panic mode. The seasoned SRE did something else — they looked at the remaining error budget. With 98% of the monthly budget still left, there was plenty of room for a calm fix-forward, and an emergency rollback would only have added instability. That isn't a failure — it's planned imperfection, which is exactly what the budget exists for.

The entire value of the SLO approach rests on a single idea: before you react to a deviation, look at the budget. A high remaining balance means time for a thoughtful fix. A low one means mitigate now, write the postmortem later. A number instead of an argument.

A budget for risk, not for perfection

An SLI is a measurable metric of what the user actually feels: the share of successful requests, the share of requests faster than 200 ms. An SLO is the target value of that metric over a window (say, 99.9% over 30 days). And error budget = 1 − SLO is the permissible "share of bad." For 99.9% over a 30-day month, the budget works out to 0.001 × 43,200 minutes, or roughly 43 minutes of downtime.
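The arithmetic behind that figure is easy to check. A minimal sketch, assuming a time-based availability SLO counted in wall-clock minutes; the function name error_budget_minutes is illustrative and does not come from any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes for a time-based SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60  # 43,200 for 30 days
    return (1.0 - slo) * window_minutes


# Compare a few common targets over the same 30-day window.
for target in (0.99, 0.999, 0.9999):
    minutes = error_budget_minutes(target)
    print(f"{target:.2%} over 30 days: about {minutes:.1f} min of budget")
```

Each extra nine cuts the budget by a factor of ten, which is why the choice of target matters more than any later tuning.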

The key mental shift: those 43 minutes are not something to be ashamed of — they are a resource you can spend. On risky releases, on experiments, on migrations. Burning the budget down to zero is a signal, not a catastrophe.

It also follows that chasing "five nines" (99.999%) is almost always counterproductive. The cost of reliability climbs steeply with every additional nine, and above 99.9% you run into the SLOs of your own dependencies: the cloud load balancer, DNS, external APIs. Realistic targets are more grounded: an internal admin tool lives at 99%, a public website at 99.9%, a critical API at 99.95%, payments at 99.99%. A goal of "99.999% and faster" doesn't inspire a team; it demoralises it.
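The dependency ceiling can be made concrete with a rough calculation. This is a sketch under two simplifying assumptions that real systems only approximate: every dependency is required for a request to succeed, and failures are independent. The dependency figures are hypothetical.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Best-case availability when all dependencies are hard requirements.

    Assumes independent failures, so the result is the product of the parts.
    """
    return prod(availabilities)


# Hypothetical stack: load balancer, DNS and one external API at 99.95% each.
deps = [0.9995, 0.9995, 0.9995]
ceiling = serial_availability(deps)
print(f"best case for the service: {ceiling:.4%}")
print("can promise 99.9%:", ceiling >= 0.999)
```

Under these assumptions the service cannot honestly promise 99.9% even if its own code never fails, let alone 99.999%.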

The Error Budget Policy — a lever, not a decoration

On its own, the budget is just a number. What gives it force is a policy: actions agreed in advance, keyed to how much budget remains.

Above 50% remaining — ship features freely.
20–50% — require a soak in staging before prod (a day, say).
Below 20% — freeze non-critical changes, focus on reliability.
Budget exhausted — full freeze, the whole sprint goes to reliability work.

The point of these thresholds is that they end the eternal argument over "should we freeze right now." The decision is tied to a number, not to who's loudest in the chat. The business gets a tool to manage the speed-versus-stability trade-off without having to read PromQL.
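Because the thresholds above are plain numbers, the policy fits in a few lines of code. A minimal sketch: the function name policy_action and the returned strings are illustrative, and treating exactly 20% and exactly 50% as part of the middle band is an assumption about how the boundaries are read.

```python
def policy_action(budget_remaining: float) -> str:
    """Map the remaining budget fraction (1.0 = untouched) to the agreed action."""
    if budget_remaining <= 0.0:
        return "full freeze, sprint goes to reliability work"
    if budget_remaining < 0.20:
        return "freeze non-critical changes, focus on reliability"
    if budget_remaining <= 0.50:
        return "require a staging soak before prod"
    return "ship features freely"


# The opening story: 98% of the monthly budget is still left.
print(policy_action(0.98))
print(policy_action(0.35))
print(policy_action(0.05))
print(policy_action(0.0))
```

The same function can feed a dashboard banner or a release checklist, so that the answer to "can we ship?" is looked up rather than debated.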

But a policy with no teeth is useless. If the budget is exhausted and the team keeps shipping features because of "the deadline," the budget is decorative and the whole construct loses its meaning. A policy works only as far as the organisation is willing to follow it.

Multi-burn-rate: how the budget becomes an alert

A typical mistake is to wire up an alert for "the SLO dropped below 99.9%." That threshold trips on brief dips that consume only a sliver of the budget and wakes the on-call for nothing. Far more precise is to measure the burn rate, the speed at which the budget is consumed, across several windows at once.

The logic is simple. If we burn 2% of the monthly budget in an hour, that's critical and a middle-of-the-night page is justified (fast burn: burn_rate_1h > 14.4, i.e. an error ratio above 14.4 × 0.001 over the last hour, confirmed by a short 5-minute window). If 5% goes in six hours, that's serious but a daytime ticket will do (slow burn: burn_rate_6h > 6, i.e. an error ratio above 6 × 0.001 over six hours, with a 30-minute confirmation). The dual window, a long one plus a short confirming one, filters out false alarms from a random spike.
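The two rules above can be sketched as a small evaluator. This is an illustration of the logic, not the PromQL a real setup would run: it assumes the error ratios per window have already been computed elsewhere, and the names BurnRule, RULES and firing are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    name: str
    factor: float        # burn-rate multiplier
    long_window: str     # window that detects the burn
    short_window: str    # window that confirms it is still happening
    severity: str


RULES = (
    BurnRule("fast burn", 14.4, "1h", "5m", "page"),
    BurnRule("slow burn", 6.0, "6h", "30m", "ticket"),
)


def firing(error_ratio: dict[str, float], slo: float = 0.999) -> list[BurnRule]:
    """Return rules whose long AND short windows both exceed factor * budget."""
    budget = 1.0 - slo
    return [
        rule for rule in RULES
        if error_ratio[rule.long_window] > rule.factor * budget
        and error_ratio[rule.short_window] > rule.factor * budget
    ]


# Hypothetical readings: a sustained outage versus a short spike.
outage = {"5m": 0.02, "30m": 0.02, "1h": 0.016, "6h": 0.004}
spike = {"5m": 0.05, "30m": 0.01, "1h": 0.003, "6h": 0.001}
print([r.name for r in firing(outage)])
print([r.name for r in firing(spike)])
```

In the spike case the 5-minute ratio is high, but the 1-hour ratio stays under the threshold, so nothing fires. Requiring both windows is what keeps a momentary blip from becoming a page.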

The multipliers 14.4 and 6 aren't magic; they come from the standard table in the SRE Workbook for a 30-day window: 14.4 = 2% × 720 h / 1 h and 6 = 5% × 720 h / 6 h. The 0.001 they are multiplied by is the budget of a 99.9% SLO. Generators like Sloth emit these rules from a short YAML spec, adding recording rules for the 5m/30m/1h/6h windows so that Grafana and the alerts read ready-made metrics instead of computing heavy PromQL on every render.
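The multipliers can be rederived from the budget share and the window lengths. A minimal sketch, assuming a 30-day SLO window; the function name burn_rate_factor is illustrative.

```python
def burn_rate_factor(budget_fraction: float, alert_window_hours: float,
                     slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `alert_window_hours`."""
    slo_window_hours = slo_window_days * 24  # 720 for a 30-day window
    return budget_fraction * slo_window_hours / alert_window_hours


fast = burn_rate_factor(0.02, 1)   # 2% of the budget in one hour
slow = burn_rate_factor(0.05, 6)   # 5% of the budget in six hours
print(f"fast burn factor: {fast:.1f}")
print(f"slow burn factor: {slow:.1f}")
```

The SLO target itself does not appear in the formula: the factor depends only on the window lengths and the share of budget, while the target sets the budget that the factor multiplies.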

The same burn rate also works as a deploy gate: a rule of "halt rollouts at a 10× burn rate" is easy to enforce in CI/CD. The arithmetic is convincing: at 10×, a 30-day budget is gone in three days (30 / 10); ship a regression at that moment and full burnout arrives in hours. This dovetails with the policy: an exhausted budget means a full freeze, a high burn rate a pre-emptive one.
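A gate of this kind reduces to one comparison plus the time-to-exhaustion arithmetic. A minimal sketch, assuming the current burn rate is supplied by the monitoring system; days_to_exhaustion and rollout_allowed are hypothetical names, and the 10× default mirrors the rule quoted above.

```python
def days_to_exhaustion(burn_rate: float, budget_remaining: float = 1.0,
                       slo_window_days: int = 30) -> float:
    """Days until the remaining budget is gone if the burn rate holds."""
    if burn_rate <= 0:
        return float("inf")  # nothing is being burned
    return budget_remaining * slo_window_days / burn_rate


def rollout_allowed(burn_rate: float, max_burn_rate: float = 10.0) -> bool:
    """Halt rollouts once the burn rate reaches the agreed ceiling."""
    return burn_rate < max_burn_rate


print(days_to_exhaustion(10.0))        # full budget at a 10x burn
print(days_to_exhaustion(10.0, 0.5))   # half the budget already spent
print(rollout_allowed(2.0), rollout_allowed(10.0))
```

A pipeline step could call rollout_allowed before promoting a release and fail the job when it returns False, which makes the pre-emptive freeze automatic instead of a judgment call.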

Taken together, SLO, error budget and multi-burn-rate aren't three separate practices but a single chain: the metric measures the user's pain, the budget turns it into a resource, and the policy and alerts turn that resource into decisions. The budget becomes a shared language in which the business and the engineers can agree — without panic and without arguments.