SRE & Observability
Engineering notes and deep-dives on SRE & Observability, with practical examples and lessons from experience.
-
An incident closed twice: severity, ICS roles and three gates
Why "recovered" is the most expensive word in an incident, and the three checks that must pass before you say it.
-
Four golden signals: what they actually catch and why the stack is VictoriaMetrics + Loki
What each of the four signals really catches, and three traps where "we have monitoring" turns out to be green checkmarks above a broken service.
-
Error budget as a stop button: SLOs without panic
An error budget turns reliability into a resource you can spend, and multi-burn-rate alerts turn it into a page that's actually worth waking up for.