Google SRE's «four golden signals» (latency, traffic, errors, saturation) are the baseline at every architecture interview. In practice engineers recite them from memory and check the box, and a year later the incident postmortem shows that none of the four caught what it was supposed to.
What each signal actually catches
Latency. Not average — average hides the tail. Only p95 / p99. A request that takes five seconds one time in ten is invisible in the average, but the unlucky user leaves. Split latency of successes from latency of errors: a 500 in 50 ms and a 200 in 5 seconds are two different fires.
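A minimal sketch of why the average hides the tail, and why successes and errors are measured separately. The sample data and the nearest-rank percentile helper are illustrative assumptions, not output from a real system.

```python
from statistics import mean

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q % of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceil(q * n / 100)
    return ordered[rank - 1]

# Hypothetical traffic as (status, seconds): fast 200s, a slow tail of 200s, fast 500s.
requests = [(200, 0.1)] * 85 + [(200, 5.0)] * 10 + [(500, 0.05)] * 5

all_latencies = [seconds for _, seconds in requests]
ok_latencies = [seconds for status, seconds in requests if status < 500]
error_latencies = [seconds for status, seconds in requests if status >= 500]

print(f"mean (all)   = {mean(all_latencies):.3f}s")        # looks healthy
print(f"p50 (2xx)    = {percentile(ok_latencies, 50)}s")
print(f"p99 (2xx)    = {percentile(ok_latencies, 99)}s")    # the tail the mean hides
print(f"p99 (errors) = {percentile(error_latencies, 99)}s")  # fast failures, a different fire
```

The mean stays well under a second while the p99 of successful requests sits at five seconds, and the error series tells a separate story entirely.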
Traffic. Not «how many RPS» but where they go. Per-target imbalance on the load balancer above 40 % means hidden degradation: one node becomes a digital black hole, 99 % of traffic is served, 1 % hangs without an alert. Slowness is worse than an outage — retention drops silently.
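One possible way to turn «per-target imbalance» into a number, as a sketch. The definition used here (each target's deviation from the mean request count) and the node names are assumptions; the text does not fix a formula, only the 40 % threshold.

```python
from statistics import mean

IMBALANCE_THRESHOLD = 0.40  # the 40 % figure from the text

def imbalance(per_target):
    """Relative deviation of each target's request count from the mean across targets."""
    avg = mean(list(per_target.values()))
    return {target: abs(count - avg) / avg for target, count in per_target.items()}

def outliers(per_target, threshold=IMBALANCE_THRESHOLD):
    """Targets whose share of traffic deviates from the mean by more than the threshold."""
    return sorted(t for t, dev in imbalance(per_target).items() if dev > threshold)

# Hypothetical completed-request counts per target over the same window.
counts = {"node-a": 1000, "node-b": 1000, "node-c": 400}
print(outliers(counts))
```

Total RPS for this window looks normal; only the per-target view shows that node-c completes half of what its peers do.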
Errors. Explicit 5xx is half the picture. Implicit errors — timeouts, dropped upstreams, retries that «succeeded» on the sixth attempt — never reach the standard 5xx counter. Error budget is built on success rate, not on 5xx.
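A sketch of the gap between a 5xx counter and a success-rate SLI. The Request record, the MAX_ATTEMPTS budget and the traffic mix are hypothetical; what counts as a «good» request is a decision each team has to write down.

```python
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 2  # hypothetical retry budget: more attempts than this is not a success

@dataclass
class Request:
    status: Optional[int]  # None when the call timed out or the upstream dropped
    attempts: int = 1

def rate_by_5xx(requests):
    """What a plain 5xx counter reports: everything that is not an explicit 5xx is fine."""
    bad = sum(1 for r in requests if r.status is not None and r.status >= 500)
    return 1 - bad / len(requests)

def success_rate(requests):
    """Success-rate SLI: answered, not a 5xx, and within the retry budget."""
    good = sum(
        1 for r in requests
        if r.status is not None and r.status < 500 and r.attempts <= MAX_ATTEMPTS
    )
    return good / len(requests)

traffic = (
    [Request(200)] * 90             # clean successes
    + [Request(500)] * 2            # explicit errors
    + [Request(None)] * 5           # timeouts, never counted as 5xx
    + [Request(200, attempts=6)] * 3  # "succeeded" on the sixth attempt
)
print(f"5xx-based: {rate_by_5xx(traffic):.2%}, success rate: {success_rate(traffic):.2%}")
```

On this sample the 5xx counter reports 98 % while the success rate that an error budget should be built on is 90 %.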
Saturation. The most misleading metric. An I/O-bound service (waiting on the database or an external API) at p99 latency of 8 seconds shows CPU at 12 %. Pods are not loaded, they are waiting. An autoscaling/v2 HPA driven only by CPU/memory is completely blind to this; for the autoscaler the picture looks like «a system with no traffic», so it scales down. Saturation is measured via request rate, queue depth, SLO burn rate: that is, via custom or external metrics, typically fed to the HPA by KEDA, not via the default CPU/memory targets.
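The blind spot can be sketched with the replica formula from the Kubernetes HPA documentation, desired = ceil(current × metric / target). This is a simplification that ignores tolerance and stabilization windows; the replica count, the 60 % CPU target and the queue-depth numbers are made-up assumptions.

```python
import math

def desired_replicas(current_replicas, current_value, target_value):
    """Core HPA scaling rule, without tolerance or stabilization behaviour."""
    return math.ceil(current_replicas * current_value / target_value)

replicas = 10

# CPU view: pods are idle because they wait on I/O (12 % used, 60 % target).
by_cpu = desired_replicas(replicas, current_value=12, target_value=60)

# Queue view: 40 requests waiting per pod against a target of 10 per pod.
by_queue = desired_replicas(replicas, current_value=40, target_value=10)

print(f"CPU says {by_cpu} replicas, queue depth says {by_queue}")
```

The same formula gives opposite answers depending on which metric feeds it: scale down to 2 on CPU, scale up to 40 on queue depth.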
Cause-and-effect chain of a failure: traffic↑ → saturation↑ → latency↑ → errors↑. Tracking it catches the problem before user impact lands. Reciting signals from memory does not.
Why the stack is VictoriaMetrics + Loki
Prometheus is still the standard reference implementation for the four golden signals. In production in 2026 the more common choice is VictoriaMetrics: a claimed order-of-magnitude gain in storage density and in active series per node (figures that depend heavily on workload), MetricsQL as a largely backwards-compatible extension of PromQL, built-in long-term storage with no separate Thanos/Cortex. vmagent is lighter than Prometheus as a scraper, vmauth covers multi-tenancy.
Logs: Loki. Not because it «beats OpenSearch», but because full-text indexing is expensive and rarely needed. Loki indexes labels only and stores the log body cheaply. Correlation between a metric and a log goes through a shared trace_id or request_id, not a full-text join. Promtail has moved to maintenance mode: on fresh clusters teams now ship Grafana Alloy, a single pipeline collector configured in the Alloy syntax (formerly River) that can replace Promtail + node-exporter + OTel Collector with one DaemonSet.
Explaining «we have monitoring» through tool names is a weak signal. A strong one is the data flow:
Application → OTel SDK / OBI
→ Collection (vmagent / Alloy)
→ Storage (VictoriaMetrics / Loki)
→ Visualization (Grafana)
→ Alerting (Alertmanager)
The schema holds under node swaps: Loki for OpenSearch, VictoriaMetrics for Datadog — the structure is the same. An architect explains the flow and the trade-offs at each node; an engineer lists tool names.
Three traps behind «we have monitoring»
Monitoring the producer, not the output. The ALB target group is green for four days while sitemap.xml is stale, because the background service that generates it lost its alias. The producer's health check never catches this. Every externally-consumed artifact gets a separate freshness probe, for example a Last-Modified age check. Monitor the output, not only the producer.
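A freshness probe can be as small as the sketch below. The function names and the six-hour budget are assumptions; the only real inputs are the artifact's URL and its Last-Modified header.

```python
import urllib.request
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def is_stale(last_modified_header, max_age, now=None):
    """True when the artifact is older than max_age, judged by its Last-Modified header."""
    now = now or datetime.now(timezone.utc)
    modified = parsedate_to_datetime(last_modified_header)
    if modified.tzinfo is None:  # a "-0000" zone parses as naive; treat it as UTC
        modified = modified.replace(tzinfo=timezone.utc)
    return now - modified > max_age

def fetch_last_modified(url, timeout=5.0):
    """HEAD request against the published artifact; returns the header or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")

# Offline example with a fixed clock: the file was written 24 hours ago.
header = "Wed, 21 Oct 2015 07:28:00 GMT"
clock = datetime(2015, 10, 22, 7, 28, tzinfo=timezone.utc)
print(is_stale(header, max_age=timedelta(hours=6), now=clock))
```

A missing header should be treated as a failed probe rather than as «fresh»; the probe alerts on the artifact regardless of what the producer's health check says.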
Cardinality explosion in labels. user_id, session_id, request_id, pod_uid, path without normalization: each of these multiplies the number of series and kills the TSDB within weeks. These labels are blocked at metrics code review with the same discipline as production code review. High-cardinality data lives in trace attributes (sampled) and log fields, not in metric labels.
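A review-time guard might look like the sketch below. The blocklist mirrors the labels named above; the helper names and the :id placeholder are assumptions, and a real check would hook into whatever metrics library the service uses.

```python
import re

FORBIDDEN_LABELS = {"user_id", "session_id", "request_id", "pod_uid"}

# A path segment that is a bare integer or a UUID identifies one entity, not a route.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

def normalize_path(path):
    """Collapse identifiers so the 'path' label has one value per route, not per entity."""
    segments = path.split("?")[0].split("/")
    return "/".join(":id" if _ID_SEGMENT.match(s) else s for s in segments)

def forbidden_labels(labels):
    """Labels that must not appear on a metric; an empty list means the metric passes."""
    return sorted(set(labels) & FORBIDDEN_LABELS)

print(normalize_path("/users/42/orders/9f1c2d3e-0000-4000-8000-123456789abc?page=2"))
print(forbidden_labels(["method", "status", "user_id"]))
```

The rejected values do not disappear: they move to sampled trace attributes and log fields, where cardinality is cheap.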
Local alerting on the same cluster. If the cluster goes down, so do the alerts that were supposed to warn about the outage. Control plane observability — a separate cluster, a managed service, or an external receiver. Push, not pull. If the observed cluster falls, the observer sees the silence as a signal, not as missing data.
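The «silence as a signal» idea is a dead man's switch: the observed cluster pushes a heartbeat, and the external observer alerts when the heartbeat stops. A minimal sketch, assuming a hypothetical HeartbeatWatcher running outside the observed cluster and a three-minute silence budget:

```python
from datetime import datetime, timedelta, timezone

class HeartbeatWatcher:
    """Runs outside the observed clusters and treats a missing heartbeat as an alert."""

    def __init__(self, max_silence):
        self.max_silence = max_silence
        self.last_seen = {}

    def record(self, cluster, at):
        """Called whenever a cluster pushes its heartbeat."""
        self.last_seen[cluster] = at

    def silent_clusters(self, now):
        """Clusters whose last heartbeat is older than the allowed silence."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )

watcher = HeartbeatWatcher(max_silence=timedelta(minutes=3))
start = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
watcher.record("prod-eu", start)
watcher.record("prod-us", start + timedelta(minutes=9))

# Ten minutes in, prod-eu has been quiet for ten minutes and prod-us for one.
print(watcher.silent_clusters(now=start + timedelta(minutes=10)))
```

The watcher only knows about clusters that have reported at least once, so a cluster that never comes up needs to be registered explicitly.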
The minimum that «we have monitoring» actually means
Not 50 alerts but 5–8 actionable ones. Every alert carries a runbook_url in annotations. Every critical alert is an SLO burn rate with a multi-window setup, not a threshold flap. Every service ships structured JSON logs with trace_id, service, team, env. And once a quarter — a disaster-visibility test: a simulated cluster failure. If the alerts never arrive and the dashboards stay dark, observability is one or two stages behind reality, and a real incident will show that more expensively than a drill.
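A sketch of the multi-window burn-rate condition. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); the 14.4 threshold is the commonly used value at which a 30-day budget loses 2 % in one hour. The window pairing (for example 1 h and 5 min) and the sample ratios are assumptions.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    return error_ratio / (1 - slo_target)

def should_page(long_window_errors, short_window_errors, slo_target, threshold=14.4):
    """Page only when both windows burn fast: the long one proves it matters,
    the short one proves it is still happening."""
    return (
        burn_rate(long_window_errors, slo_target) >= threshold
        and burn_rate(short_window_errors, slo_target) >= threshold
    )

SLO = 0.999  # hypothetical 99.9 % success-rate objective

print(should_page(0.02, 0.03, SLO))     # 2 % over 1 h and 3 % over 5 min: page
print(should_page(0.02, 0.0005, SLO))   # already recovered in the short window: no page
print(should_page(0.0005, 0.03, SLO))   # brief spike, long window still healthy: no page
```

Requiring both windows is what removes the threshold flap: a short spike or an already-resolved burst does not wake anyone up.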