SRE & Observability

Engineering notes and deep dives on SRE and observability, with practical examples and lessons from experience.

Note 13.06.2026 ≈ 4 min

An incident closed twice: severity, ICS roles and three gates

Why "recovered" is the most expensive word in an incident, and the three checks that must pass before you say it.
Note 06.06.2026 ≈ 4 min

Four golden signals: what they actually catch and why the stack is VictoriaMetrics + Loki

What each of the four signals really catches, and three traps where "we have monitoring" turns out to be green checkmarks above a broken service.
Note 25.05.2026 ≈ 4 min

Error budget as a stop button: SLOs without panic

An error budget turns reliability into a resource you can spend, and multi-burn-rate alerts turn spending it too fast into a page that's actually worth waking up for.