Commentary

A Kubernetes Debugging Agent: Query Templates or Scripts?

Zinchenko hands the LLM MetricsQL templates; my VM skill feeds the agent a finished aggregate. I dissect the flexibility-versus-reproducibility axis and why read-only by blacklist is weaker than an allowlist.

Aleksandr Khomutov | Original author: Arseny Zinchenko | May 23, 2026 | ≈ 4 min

I run a heavily customized Claude Code setup, and over the past year I did two things with it that bear directly on what Arseny Zinchenko describes in his piece on a Kubernetes debugging agent: I built a VictoriaMetrics skill on top of six bash scripts, which I exercised against a live cluster at vm.ethinking.xyz, and I ran a multi-week EKS rightsizing engagement (Phases A/B: hunting OOMs, tuning limit classes and JAVA_OPTS). So I didn't read his text as a technology overview; I read it as a description of a fork I'd stood at myself.

Zinchenko assembles a read-only Pod-debugging agent, packages it into a Claude Code plugin with its own marketplace, wires in the official VictoriaMetrics skills for metrics, logs and alerts — and deliberately drops MCP in favour of Skills. He states the goal with admirable bluntness: have an agent a developer can ask «why the hell did that Kubernetes Pod crash». And he admits up front that, instruction-wise, this is still a PoC. That's the right frame, and I'm not here to argue with it — I want to take apart one engineering axis the article touches but doesn't follow to the end.

My position: both schemes are correct, but at different points

The thesis is simple. Zinchenko hands the LLM MetricsQL and LogsQL templates and lets the agent compose the concrete queries on the fly. In my own VM skill I did exactly the opposite: every HTTP call runs through bash scripts, and the agent only ever sees pre-aggregated output and reasons over it. This isn't "better" versus "worse" — it's two ends of one axis: flexibility for ad-hoc investigation versus reproducibility and auditability.

His approach wins where the query is unknown in advance: you sit down to debug "why is Grafana restarting" (which is, incidentally, how he found the cause of his own restarts), and you need the freedom to reformulate MetricsQL five times a minute. My approach wins where the run has to be repeatable and explainable after the fact: the same script yields the same aggregate, the reviewer sees exactly what the agent saw, and a week later you can reproduce the number. That is precisely why my scheme survived the A/B rightsizing phases — there you cannot let the agent invent a fresh query each time: you need one methodology that produces comparable numbers across dozens of services against an OOM=0 baseline.
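To make the script side of that axis concrete, here is a minimal sketch, written in Python rather than bash and not taken from the actual scripts behind my skill. It assumes the Prometheus-compatible instant-query JSON shape that VictoriaMetrics returns; the function names, the pod label and the sample values are hypothetical.

```python
import json


def aggregate(response, label="pod"):
    """Collapse an instant-vector response into rows with a stable order."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    rows = []
    for series in response["data"]["result"]:
        _timestamp, raw_value = series["value"]  # value arrives as a string
        rows.append({
            label: series["metric"].get(label, "<none>"),
            "value": float(raw_value),
        })
    # Largest value first, ties broken by label, so the order never
    # depends on the order in which the backend returned the series.
    rows.sort(key=lambda row: (-row["value"], row[label]))
    return rows


def render(rows):
    """Serialize rows into the exact text the agent is shown."""
    return json.dumps(rows, sort_keys=True, separators=(",", ":"))


# Hypothetical sample response, used only to exercise the functions.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "b"}, "value": [1700000000, "2"]},
            {"metric": {"pod": "a"}, "value": [1700000000, "5"]},
        ],
    },
}

if __name__ == "__main__":
    print(render(aggregate(SAMPLE)))
```

Because the rows are sorted and serialized with fixed separators, two runs over the same response produce identical text, so a reviewer can diff what the agent saw last week against what it sees today.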

Where he's right — and it isn't trivial

I share the article's central argument wholesale. Zinchenko nails the point that MCP doesn't carry the knowledge you're actually after. His wording lands: «MCP only defines how to execute the query - not the query structure itself», and then — «the agent/LLM still decides on the query and the filters». A typed query(query: string) validates that you passed a string, but the MetricsQL string itself is still composed by the model. A skill, by contrast, can carry what MCP simply cannot: which exact stream filter applies in our VictoriaLogs, which mandatory cluster label applies in our VictoriaMetrics, which correlations distinguish CrashLoopBackOff from OOMKilled. That's environment context, not a function signature.
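What "environment context" means can be sketched as a query template that refuses to render without the mandatory label. This is an illustration under assumptions, not Zinchenko's skill: the label name cluster, the helper names and the choice of the kube-state-metrics restart counter are mine.

```python
import re

# Assumption: the environment requires this label on every selector.
REQUIRED_LABEL = "cluster"

# Conservative whitelist for label values so a value cannot break out
# of the quoted string inside the selector.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_.\-]+")


def selector(metric, **labels):
    """Build a metric selector, insisting on the mandatory label."""
    if REQUIRED_LABEL not in labels:
        raise ValueError(f"missing mandatory label: {REQUIRED_LABEL}")
    for name, value in labels.items():
        if not SAFE_VALUE.fullmatch(value):
            raise ValueError(f"unsafe value for label {name}: {value!r}")
    body = ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric}{{{body}}}"


def restarts_query(cluster, namespace, pod):
    """Hypothetical template: container restarts over the last hour."""
    sel = selector(
        "kube_pod_container_status_restarts_total",
        cluster=cluster,
        namespace=namespace,
        pod=pod,
    )
    return f"increase({sel}[1h])"


if __name__ == "__main__":
    print(restarts_query("dev", "monitoring", "grafana-0"))
```

A typed query(query: string) signature would accept the string with or without the cluster label; the check above is the kind of knowledge that has to live in the skill.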

He's right about kubectl too: wrapping it in MCP buys little, because every model knows kubectl syntax by heart, and it's enough to put the cluster specifics into SKILL.md. And he draws the use-case boundary honestly, «we're building a read-only agent that debugs, not deploys», which is why an off-the-shelf deploy skill doesn't fit him. His summary of the choice, «Bash + curl + our own skill with our context + official VictoriaMetrics skills», is fully justified for exploratory debugging.

Where the scheme falls short: read-only by blacklist is weaker than it looks

Here's where my main objection begins, and it's a technical one. Zinchenko's read-only guarantee rests on deny-tools, a glob blacklist: it forbids kubectl delete, *curl* -X *, pipes into sh/bash, redirects >/>>, and rm/mv/cp. The list looks solid, but a glob-pattern blacklist is a game the defender always loses. A file can be written without the > character via tee. A POST can be sent without -X, because curl switches to POST on its own as soon as it is given --data; a here-string or command substitution can likewise assemble a request in a form the pattern missed. xargs and env carry a forbidden command inside a wrapper the blacklist never inspects. Every missed vector is a hole, and you can keep plugging them one at a time forever.
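The gap is easy to reproduce. The sketch below uses Python's fnmatch as a stand-in for the plugin's glob matcher, which is an assumption about its semantics, and the patterns are my own approximations of the categories listed above rather than copies of Zinchenko's deny-tools entries.

```python
from fnmatch import fnmatchcase

# Hypothetical deny patterns modelled on the listed categories.
DENY_PATTERNS = [
    "kubectl delete*",
    "*curl* -X *",
    "*| sh*",
    "*| bash*",
    "*>*",
    "rm *",
    "mv *",
    "cp *",
]


def is_denied(command):
    """Return True if any deny pattern matches the whole command line."""
    return any(fnmatchcase(command, pattern) for pattern in DENY_PATTERNS)


# Commands the blacklist was written to stop.
CAUGHT = [
    "kubectl delete pod demo",
    "curl -X POST http://example.invalid/api",
    "cat script.txt | sh",
    "echo data > /tmp/out.txt",
    "rm /tmp/out.txt",
]

# Same effects, spelled in ways none of the patterns anticipate.
MISSED = [
    "echo data | tee /tmp/out.txt",                # write without >
    "curl --data x=1 http://example.invalid/api",  # POST without -X
    "echo /tmp/out.txt | xargs rm",                # rm behind xargs
    "env kubectl delete pod demo",                 # delete behind env
]

if __name__ == "__main__":
    for command in CAUGHT + MISSED:
        verdict = "DENY" if is_denied(command) else "PASS"
        print(f"{verdict}  {command}")
```

Each entry in MISSED could be closed with one more pattern, which is exactly the plug-one-hole-at-a-time loop.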

In my own setup I trust the inverse logic: an allowlist (what's explicitly permitted is allowed, everything else is denied by default) plus an intent-level gate on production actions — any kubectl exec/port-forward, SSH into prod, or terraform apply requires explicit, named approval in the current session. An allowlist doesn't suffer from the "forgot to forbid one more vector" problem, because by default everything is forbidden.
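The inverse logic fits in a few lines. This is a simplified sketch, not my actual gate: the allowed and gated command sets are hypothetical, approvals are modelled as a plain set, and SSH is left out for brevity.

```python
import shlex

# Hypothetical read-only commands, keyed by (binary, subcommand).
ALLOWED = {
    ("kubectl", "get"),
    ("kubectl", "describe"),
    ("kubectl", "logs"),
    ("kubectl", "top"),
}

# Hypothetical gated actions: usable only with an approval that names
# the action and was granted in the current session.
GATED = {
    ("kubectl", "exec"),
    ("kubectl", "port-forward"),
    ("terraform", "apply"),
}

# Any shell operator disqualifies the command before parsing, so pipes,
# redirects, chaining and substitutions never reach the allowlist check.
SHELL_OPERATORS = ("|", ";", "&", ">", "<", "`", "$(")


def decide(command, approvals=frozenset()):
    """Return 'allow', 'needs-approval' or 'deny' for one command line."""
    if any(op in command for op in SHELL_OPERATORS):
        return "deny"
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return "deny"
    if len(argv) < 2:
        return "deny"
    key = (argv[0], argv[1])
    if key in ALLOWED:
        return "allow"
    if key in GATED:
        return "allow" if key in approvals else "needs-approval"
    return "deny"  # default: anything not listed is refused


if __name__ == "__main__":
    print(decide("kubectl get pods -n monitoring"))
    print(decide("echo data | tee /tmp/out.txt"))
    print(decide("kubectl exec -it grafana-0 -- sh"))
```

The sketch is deliberately strict: even a harmless kubectl -n monitoring get pods is refused, because the second token is not a listed subcommand. Loosening it means adding an entry, not discovering a hole.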

But let me state plainly a nuance the article is entitled not to dwell on: we're working under different threat models. His agent is read-only debug on a dev cluster, where the cost of a mistake is low and the blacklist catches the model's coarse slips. My gate is built for write/prod, where the cost of a mistake is an incident. So this isn't a takedown of his design — it's pointing at the ceiling: deny-tools is good as defence-in-depth and poor as the sole guarantee once your trust threshold for the environment rises.

What to do in practice

I'd reduce the choice between the two schemes to a single question: is the run one-off or repeatable? For investigative debugging — "why did this Pod crash right now" — take Zinchenko's variant: query templates plus the model's freedom to combine them is faster and doesn't require writing a script for every new slice of data. For anything you need to reproduce, review, or defend to a client — wrap the HTTP in scripts and feed the agent a finished aggregate; then the output is deterministic and the audit is trivial.

The A/B phases convinced me of this concretely: when rightsizing runs for weeks, across dozens of services, with limit classes and JAVA_OPTS tuning toward an OOM-free baseline, an agent that composes its own queries becomes a source of numbers that can't be compared with one another. The script is that cheap, boring, reliable part I already wrote about in "The Harness Is the Cheap Part": the expensive, interesting bit is the model's reasoning, while reproducibility comes from exactly what you pinned down in the harness.

And dropping MCP in favour of Skills for kubectl is a decision I'll co-sign without reservation. Where the model already knows the syntax, a typed wrapper adds installation hassle and adds no knowledge. Zinchenko picked the right tool; I'm just filling in the other half of the map — the point where you start charging flexibility a reproducibility tax.

