OOMKilled forensics: from pmap to cgroups memory.stat

Exit code 137 leaves no stack trace and no final log line, while the dashboard swears there was plenty of memory. These are the tools that answer «where did the memory go» while the pod is still alive.

A container restarted with Reason: OOMKilled, exit code 137. The log cuts off mid-sentence: the SIGKILL comes from the kernel, and no handler can catch it. The dashboard, meanwhile, shows a calm 60% memory usage. This combination is the most common «silent» restart scenario in Kubernetes, and the «raise the limit and hope» reaction treats the symptom only until the next traffic spike.

The chain worth knowing by heart

It is not Kubernetes that kills: the verdict comes from the kernel OOM killer. The memory limit from the pod spec is translated into a cgroup limit (memory.max on cgroup v2); when the container's processes try to allocate beyond it and the kernel cannot reclaim enough, the kernel picks a victim inside the cgroup and sends SIGKILL. The container exits with code 137 (128 + 9), the kubelet marks it OOMKilled and, with restartPolicy: Always, restarts only the container, not the pod: the IP stays, sidecars keep running, volumes remain mounted. Repeat the loop and you get CrashLoopBackOff with exponential backoff up to the five-minute cap.
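The 128 + 9 arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming the usual convention that an exit code above 128 means «terminated by signal (code − 128)»; the helper name describe_exit_code is hypothetical, and signal names are resolved by the platform the script runs on.

```python
import signal

def describe_exit_code(code: int) -> str:
    """Decode a container exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Resolve the number to a symbolic name known to this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited on its own with status {code}"

for code in (0, 1, 137, 143):
    print(code, "->", describe_exit_code(code))
```

Code 137 alone only says «SIGKILL»; the OOMKilled reason in the container status is what ties it to the memory limit.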

A useful contrast: Kubernetes does not kill for exceeding CPU — it throttles via cgroup bandwidth. CPU overruns degrade latency; memory overruns cut the process off without warning. These are two different failure modes, and mixing them in one alert is a mistake.

Why the dashboard «saw nothing»

The classic case: pods get killed regularly while the memory graph never climbs above 65%. The root cause is spikes 200–500 ms long: long enough to punch through the hard limit, too short for a 15–30 s scrape interval to catch. Smoothing finishes the job: avg_over_time() over a long window shows 320 Mi where the peak hit 600 Mi.
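A toy simulation makes the blind spot concrete. It is a sketch under stated assumptions: usage is modelled as a flat 320 Mi baseline with one 300 ms spike to 600 Mi, sampled every 100 ms, and the scraper is modelled as seeing only the samples that fall exactly on its interval. The function names and the spike timing are made up for illustration.

```python
def simulate(duration_s=60, step_ms=100, base_mi=320, spike_mi=600,
             spike_start_ms=20_300, spike_len_ms=300):
    """Return (t_ms, usage_mi) samples with one short spike above the baseline."""
    samples = []
    for t in range(0, duration_s * 1000, step_ms):
        in_spike = spike_start_ms <= t < spike_start_ms + spike_len_ms
        samples.append((t, spike_mi if in_spike else base_mi))
    return samples

def scrape(samples, interval_s):
    """Keep only the values a scraper would observe at the given interval."""
    return [value for t, value in samples if t % (interval_s * 1000) == 0]

series = simulate()
true_peak = max(value for _, value in series)
scraped = scrape(series, interval_s=15)

print("true peak, Mi:            ", true_peak)
print("peak seen by scraper, Mi: ", max(scraped))
print("average seen, Mi:         ", sum(scraped) / len(scraped))
```

In this model even max() over the scraped series misses the spike, which is why a shorter sampling interval matters as much as the choice of aggregation.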

The third trap is the wrong metric. The kernel enforces the limit on what the cgroup is charged for, not on the application heap, and the closest exported proxy is the working set (usage minus inactive file cache), so the anchor is container_memory_working_set_bytes. Spike detection:

max_over_time(container_memory_working_set_bytes[30s])
# correlate with:
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

Spikes are not bugs: they come from parsing a large JSON document, GC pauses, or a burst of concurrent requests. Combined with limits trimmed to the bone, they are the trigger.

Surgery on a live pod

While the pod is alive — between restarts — forensics runs through an ephemeral container with a shared process namespace:

kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>

--target opens up the main container's processes: cat /proc/1/status | grep VmRSS shows the current footprint, and ls /proc/1/fd | wc -l catches a file descriptor leak. Next, diff two pmap -x 1 snapshots taken five minutes apart to see which segments grew. If the diff shows no obvious growth, the eBPF-based memleak tool from BCC prints the stack traces behind allocations that were never freed. All of it happens without a restart or an image rebuild, and it works on distroless too.
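Diffing two pmap -x snapshots by eye gets tedious, so here is one way it could be scripted. This is a sketch, assuming the procps-ng column layout (Address, Kbytes, RSS, Dirty, Mode, Mapping); other pmap builds, BusyBox included, print different columns and would need a different pattern. The function names and the sample snapshots are invented for illustration.

```python
import re

# One data row of `pmap -x` in the assumed procps-ng layout:
# Address  Kbytes  RSS  Dirty  Mode  Mapping
ROW = re.compile(r"^([0-9a-f]{8,16})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s*(.*)$")

def parse_pmap_x(text):
    """Return {(address, mapping): rss_kib} for every mapping row."""
    rss = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            addr, _kbytes, rss_kib, _dirty, _mode, name = match.groups()
            rss[(addr, name.strip())] = int(rss_kib)
    return rss

def rss_growth(before_text, after_text, min_kib=1024):
    """List mappings whose RSS grew by at least min_kib, largest first."""
    before = parse_pmap_x(before_text)
    after = parse_pmap_x(after_text)
    grown = [(key, kib - before.get(key, 0)) for key, kib in after.items()]
    grown = [(key, delta) for key, delta in grown if delta >= min_kib]
    return sorted(grown, key=lambda item: item[1], reverse=True)

# Made-up sample snapshots, for illustration only.
BEFORE = """\
1:   /app/server
Address           Kbytes     RSS   Dirty Mode  Mapping
000055d000000000    4096    2048       0 r-x-- server
00007f0000000000   65536   10240   10240 rw---   [ anon ]
---------------- ------- ------- -------
total kB           69632   12288   10240
"""
AFTER = BEFORE.replace("65536   10240   10240", "65536   51200   51200")

for (addr, name), delta in rss_growth(BEFORE, AFTER):
    print(f"{name} @ {addr}: +{delta} KiB RSS")
```

Keying on the start address is a simplification: mappings that were merged, split, or newly created between snapshots show up as brand-new entries, which still flags them as growth.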

Ground truth — cgroups memory.stat

cAdvisor metrics mislead on shared pages: libraries such as libc and libssl that are shared between containers get counted in full in each container's RSS. The honest picture comes from the cgroup itself:

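cat /sys/fs/cgroup/memory.stat   # cgroup v2
# anon:  heap and stacks, the container's «real» memory
# file:  page cache (includes shmem); the cache part is reclaimed before anyone is killed
# shmem: tmpfs and shared memory, shared between processes

The real OOM risk is what the kernel cannot drop under pressure: roughly anon plus shmem, that is approximately memory.current − (file − shmem) with kernel memory set aside. On cgroup v2, shmem is counted inside file, and without swap it cannot be reclaimed, so it must not be subtracted as if it were cache. PSS (what smem reports) splits shared pages evenly between the processes that map them, which makes it an honest measure of per-container footprint. And a frequent false alarm: pgmajfault counts major page faults, a symptom of an I/O bottleneck, not an OOM signal; alerts built on it catch the wrong thing.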
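The breakdown can be computed directly from the file. This is a minimal sketch, assuming the cgroup v2 definitions from the kernel documentation, where file includes tmpfs and shared memory, so shmem is treated as a subset of file. Kernel memory (slab, kernel stacks, socket buffers) is deliberately ignored, and the function names and sample numbers are illustrative.

```python
MIB = 1024 * 1024

def parse_memory_stat(text):
    """Parse cgroup v2 memory.stat ('key value' per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def breakdown(stats):
    """Split usage into what can be dropped and what cannot (assumption: no swap)."""
    anon = stats.get("anon", 0)
    file_total = stats.get("file", 0)
    shmem = stats.get("shmem", 0)
    return {
        "anon": anon,
        "shmem": shmem,
        # Page cache that the kernel can reclaim before invoking the OOM killer.
        "reclaimable_file": max(file_total - shmem, 0),
        # Memory that stays charged to the cgroup under pressure.
        "hard_to_reclaim": anon + shmem,
    }

# Made-up sample contents, for illustration only.
SAMPLE = f"""anon {400 * MIB}
file {150 * MIB}
shmem {30 * MIB}
pgmajfault 12
"""

result = breakdown(parse_memory_stat(SAMPLE))
for key, value in result.items():
    print(f"{key:18s} {value // MIB} MiB")
```

Inside a container on cgroup v2, the same functions could be fed with the contents of /sys/fs/cgroup/memory.stat read via open().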

Fix it, don't pray

It comes down to four moves. First, derive limits from the p99 working set: limit = p99 × 1.5, or × 2.0 if the workload has an OOM history, and aim for a steady-state baseline at 50% of the limit instead of 70%. Second, alert on max_over_time over short windows instead of avg_over_time(). Third, load-test with realistic bursts, not smooth synthetic traffic. And fourth, since Kubernetes 1.36 there is an early signal: kubelet_psi_memory_full (PSI, GA), the share of time in which all threads are stalled waiting for memory; a rising value marks an OOM candidate before the first SIGKILL.
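The sizing rule is simple enough to script. This is a sketch, assuming working-set samples in Mi collected at a resolution fine enough to contain the spikes, and a nearest-rank p99; the function names and the sample series are hypothetical.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]

def recommend_limit(working_set_mi, had_oom=False):
    """Apply limit = p99 x 1.5 (x 2.0 with an OOM history), baseline at 50% of the limit."""
    factor = 2.0 if had_oom else 1.5
    peak = p99(working_set_mi)
    limit = peak * factor
    return {"p99_mi": peak, "limit_mi": limit, "baseline_target_mi": limit * 0.5}

# Made-up series: mostly 300 Mi, with two higher readings.
samples = [300] * 98 + [400, 600]

print(recommend_limit(samples))
print(recommend_limit(samples, had_oom=True))
```

Note that p99 deliberately ignores the single 600 Mi outlier here; with an OOM history, the × 2.0 factor is what brings the limit back above it.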

OOMKilled stops being a mystery when the investigation goes layer by layer: kubectl describe — what happened; PromQL at the right resolution — when; pmap and memory.stat — where the memory went; right-sizing — so it does not happen again.

© 2026 axyi.ru · CC BY 4.0