Commentary

Saving Tokens in Agents: What the Survey Gets Right and What It Over-sells

A map of four token-cost techniques — caching, lazy tools, routing, compaction. It buries the cache's real consequence in a footnote, while two counter-intuitive facts are worth the whole map.

Aleksandr Khomutov | Original author: Ida Silfverskiöld | May 23, 2026 | ≈ 4 min

Every day I live inside one number this survey barely talks about: roughly five minutes. That is the lifetime of the prompt cache, and it silently sets the rhythm of all my work with long agent loops. When I schedule a deferred wake-up for an agent, I am not asking "how often can I poll for status"; I am asking "will the cache still be warm when the agent wakes up." Stay under five minutes and the cache is hot, so I pay pennies. Step past the window and I pay full price for the prefix on a cache miss. The entire question of a deferred task comes down to this: when does it even make sense to wake up.

Ida Silfverskiöld's survey "Agentic AI: How to Save on Tokens" is a map of four token-cost techniques: caching (prompt and semantic), lazy-loading of tools and MCPs, routing and cascading across models, and context cleanup via compaction. The map is tidy and honest — but it never carries the one load it should: the real consequence of the cache.

The cache is not a performance detail — it is the clock rate of the loop

The article frames prompt caching exactly as a "quick win" for long system prompts: put the static part first, stop overpaying to re-process it. All true. The author even names the key constraint — TTL: «Cached K/V takes up memory on the serving side which is why a lot of providers have a TTL window of around 5–10 minutes». It is delivered as a footnote about server-side memory.

For me it is not a footnote — it is a schedule. I run long loops with deferred wake-ups: the agent takes a step, goes off to wait for an external event, comes back. If it returns within the TTL, the system-prompt prefix and the accumulated context are still cached, and continuing costs as cached input. If it returns later, I pay for the whole prefix again at full rate. So I design the cadence of wake-ups around the cache window, not around polling speed. The survey never takes that step — it treats TTL as a storage characteristic, not as the variable that dictates the architecture of long-running tasks. That is the gap worth flagging: the most load-bearing consequence of the cache is hidden inside a subordinate clause.
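The arithmetic behind that schedule can be sketched in a few lines. This is a minimal illustration, not any provider's billing logic: the prices, the 300-second TTL, the 30-second safety margin, and the names `CachePricing`, `wakeup_cost`, and `latest_warm_wakeup` are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePricing:
    """Hypothetical numbers; real rates and TTLs vary by provider."""
    full_rate: float = 3.00      # assumed $ per million uncached input tokens
    cached_rate: float = 0.30    # assumed $ per million cached input tokens
    ttl_seconds: float = 300.0   # assumed cache lifetime, roughly five minutes


def wakeup_cost(prefix_tokens: int, idle_seconds: float, pricing: CachePricing) -> float:
    """Dollar cost of re-reading the prefix after the agent was idle for idle_seconds."""
    cache_is_warm = idle_seconds < pricing.ttl_seconds
    rate = pricing.cached_rate if cache_is_warm else pricing.full_rate
    return prefix_tokens / 1_000_000 * rate


def latest_warm_wakeup(pricing: CachePricing, margin_seconds: float = 30.0) -> float:
    """Longest sleep that still lands inside the cache window, minus a safety margin."""
    return max(0.0, pricing.ttl_seconds - margin_seconds)


if __name__ == "__main__":
    pricing = CachePricing()
    for idle in (240, 360):
        cost = wakeup_cost(100_000, idle, pricing)
        print(f"idle {idle}s -> ${cost:.2f} for the prefix")
    print(f"schedule wake-ups no later than {latest_warm_wakeup(pricing):.0f}s")
```

Under these assumed rates, a 100k-token prefix costs about $0.03 on a warm wake-up and about $0.30 on a cold one, which is why the wake-up interval, not the polling frequency, ends up being the design variable.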

The same mechanism explains why any "small thing" at the start of a prompt irritates me. The author lists the traps precisely: «a new space added, a reordered tool definition, a timestamp in the wrong place». Any one of them is enough: the exact-prefix match fails and the cache misses. For a one-off request that costs seconds. For a loop with hundreds of turns it multiplies across every turn and every wake-up.
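One way to guard against those traps is to build the prefix deterministically and fingerprint it between turns. The sketch below is an assumption-laden illustration: `build_prompt` and `prefix_fingerprint` are hypothetical helpers, and the SHA-256 hash is only a local stand-in for detecting drift, since real providers match on the prompt prefix itself rather than on any hash you compute.

```python
import hashlib
import json


def build_prompt(system: str, tools: list[dict], timestamp: str, user_msg: str) -> tuple[str, str]:
    """Return (stable_prefix, volatile_tail).

    Static material goes first and is serialized deterministically;
    anything that changes per turn (timestamp, user message) goes last.
    """
    ordered = sorted(tools, key=lambda t: t["name"])  # tool order no longer matters
    tools_blob = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    prefix = f"{system.strip()}\n{tools_blob}\n"
    tail = f"[now: {timestamp}]\n{user_msg}"
    return prefix, tail


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the prefix; if it changes between turns, expect a cache miss."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    tools = [{"name": "read", "description": "open file"},
             {"name": "grep", "description": "search"}]
    first, _ = build_prompt("You are an agent.", tools, "10:00", "step 1")
    second, _ = build_prompt("You are an agent.", list(reversed(tools)), "10:07", "step 2")
    print(prefix_fingerprint(first) == prefix_fingerprint(second))
```

Logging the fingerprint on every turn makes an accidental prefix change visible immediately instead of showing up later as an unexplained bill.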

What is genuinely useful in the survey

Two things I would cut out and tape to the wall — precisely because they cut against the prevailing hype.

First, on learned routers. The author actually dug into the benchmark and reports that, per LLMRouterBench, «many learned routers barely beat simple baselines, such as keyword/heuristic routing». In other words, expensive learned routing often barely outruns a dumb keyword heuristic. That is exactly the counter-intuitive fact that saves you weeks: before you build a router model, try if-else on request difficulty — there may be no difference. The survey reports this but does not foreground it; I would.
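The "if-else on request difficulty" baseline really is this small. The sketch below is a hypothetical heuristic, not something taken from the survey or from LLMRouterBench: the keyword list, the length cutoff, and the tier names `small` and `large` are placeholders to tune against your own logs.

```python
import re

# Hypothetical signals of a "hard" request; tune these against real traffic.
HARD_KEYWORDS = frozenset({"refactor", "prove", "debug", "architecture", "migrate"})


def route(request: str, long_request_chars: int = 2000) -> str:
    """Return the model tier ('small' or 'large') for a request."""
    if len(request) > long_request_chars:
        return "large"  # long inputs tend to need more capacity
    words = set(re.findall(r"[a-z]+", request.lower()))
    if words & HARD_KEYWORDS:
        return "large"  # whole-word keyword hit
    return "small"


if __name__ == "__main__":
    for req in ("What time zone is UTC+2?", "Please debug this race condition"):
        print(f"{route(req):5s} <- {req}")
```

Matching on whole words keeps "improve" from triggering on "prove", at the price of missing inflected forms such as "debugging". A baseline like this is the thing a learned router has to beat before it earns its maintenance cost.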

Second, on subagents. They are sold as a way to cut cost, yet by the author's own numbers subagents shave «around 11% from the “no routing” option». Eleven percent is not a cost story. The reason is stated plainly: «the orchestrator often still stays in the loop for planning, synthesis, and retries». The orchestrator stays in the loop, so the large model still burns tokens. The real win of subagents is context isolation, not price. That is exactly what I keep them for: a separate subagent does not pollute the main loop's working state with its grep exhaust and logs. Anyone spinning up subagents to save money bought the wrong tool.

Among the context-cleanup techniques, one claim is useful. The author frames it as two tiers of work: it is not enough to "compress the chat"; you also have to «keep things clean as you add them to the working state». This matches what I do by hand: raw exhaust goes to an archive, and only what the next step needs goes into active context. Here she also cites Jia et al. on SWE-bench Verified: per the article, at 6x compression you get «a 5.0–9.2% improvement in issue resolution rates». I take that as one paper's reported result, not as a law: compaction can help quality too, but the number comes from a single experiment.
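The archive-versus-active split can be sketched as a small data structure. This is an illustration under stated assumptions: `WorkingState` is a hypothetical class, and using the first line of the raw output as the note is a placeholder for whatever summarization step a real harness would run.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingState:
    active: list[str] = field(default_factory=list)   # sent with the next prompt
    archive: list[str] = field(default_factory=list)  # kept on the side, never sent

    def record(self, raw_output: str, max_note_chars: int = 200) -> None:
        """Archive the raw tool output; keep only a short pointer note in active context."""
        self.archive.append(raw_output)
        ref = len(self.archive) - 1
        stripped = raw_output.strip()
        # Placeholder summary: the first line of the output, truncated.
        first_line = stripped.splitlines()[0] if stripped else ""
        self.active.append(f"[archive#{ref}] {first_line[:max_note_chars]}")

    def prompt_context(self) -> str:
        """Text that actually goes into the next model call."""
        return "\n".join(self.active)


if __name__ == "__main__":
    state = WorkingState()
    log = "grep: 3 matches in src/app.py\n" + "\n".join(f"line {i}" for i in range(50))
    state.record(log)
    print(state.prompt_context())
    print(len(state.archive[0]), "characters kept out of the prompt")
```

The pointer note lets a later step ask for the full archived entry by index if it turns out to matter, so nothing is lost, it is just not paid for on every turn.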

Where it is thin and where it is over-sold

The figure to treat with the most skepticism is cascade savings. The author mentions the open-source CascadeFlow, which «claims 69% savings and 96% quality retention». It sounds like a jackpot, but the honest caveat follows immediately: «the prompts they tested had verifiable ground truth, such as math answers and multiple choice». That is the catch. 69% is a benchmark-flattered number: a cascade shines exactly where there is a verifiable ground truth (math, multiple choice) and a cheap "checker" can confidently reject a bad answer. In my work (code, review, ambiguous text) there is no verifiable ground truth, the cheap model is "confidently wrong," and the threshold has to be set conservatively, which eats the savings. I do not treat 69% as an expectation for my own workloads.
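Why a conservative threshold eats the savings is easier to see in code. The sketch below is a generic confidence-gated cascade, not CascadeFlow's implementation: the `Model` signature, the self-reported confidence score, and the sample confidences are all hypothetical.

```python
from typing import Callable

# Hypothetical model interface: prompt -> (answer, confidence in [0, 1]).
Model = Callable[[str], tuple[str, float]]


def cascade(prompt: str, cheap: Model, strong: Model, threshold: float) -> tuple[str, str]:
    """Try the cheap model first; escalate when its confidence is below threshold."""
    answer, confidence = cheap(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = strong(prompt)  # note: the cheap call has already been paid for
    return answer, "strong"


def escalation_rate(confidences: list[float], threshold: float) -> float:
    """Fraction of requests that would fall through to the strong model."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    return sum(1 for c in confidences if c < threshold) / len(confidences)


if __name__ == "__main__":
    sample = [0.95, 0.9, 0.8, 0.7, 0.6]  # made-up cheap-model confidences
    for threshold in (0.65, 0.92):
        rate = escalation_rate(sample, threshold)
        print(f"threshold {threshold}: {rate:.0%} escalated")
```

On this made-up sample, moving the threshold from 0.65 to 0.92 takes escalations from one request in five to four in five, and every escalated request pays for both models. Without a verifiable ground truth there is little else to lean on besides that threshold.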

The section on lazy-loading tools is correct, but the author herself grounds expectations: prompt caching and lazy-loading together are «not a huge change» on savings, and the value of tool search «isn’t just about savings» — it is about context cleanliness. Agreed; for money this is not the lever.

Semantic caching is described more honestly than usual. The author flatly calls it «a bit of a project», with a pile of pitfalls: similarity thresholds, TTL, multi-turn, separating users. Her advice is sound: enable it «after you see repetition in the logs rather than at the start». For a Q&A bot, maybe; for an agent firing unique requests, almost certainly not.
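Three of those pitfalls (similarity threshold, TTL, user separation) show up even in a toy version. The sketch below is an assumption throughout: `SemanticCache` is a hypothetical class, and the bag-of-words `embed` function stands in for a real embedding model purely so the example runs without dependencies. Multi-turn context is deliberately not handled.

```python
import math
import re
import time
from collections import Counter
from typing import Callable, Optional


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # Entries are partitioned per user so answers never leak across users.
        self._entries: dict[str, list[tuple[Counter, str, float]]] = {}

    def put(self, user_id: str, query: str, answer: str) -> None:
        self._entries.setdefault(user_id, []).append((embed(query), answer, self.clock()))

    def get(self, user_id: str, query: str) -> Optional[str]:
        now, q = self.clock(), embed(query)
        best_answer, best_score = None, self.threshold
        for vec, answer, stored_at in self._entries.get(user_id, []):
            if now - stored_at > self.ttl_seconds:
                continue  # expired entry
            score = cosine(q, vec)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer
```

A brief usage example: after `cache.put("alice", "How do I reset my password", "Use the reset link.")`, a call to `cache.get("alice", "how do i reset my password?")` returns the stored answer, while the same query for a different user or after the TTL returns `None`. Each of those three checks is a knob that can be set wrong, which is the substance of "a bit of a project".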

Verdict: good map, shallow terrain

This is a solid survey with a rare virtue — it is honest about trade-offs, it does not sell a silver bullet, and it deflates its own loud numbers. For someone building their first harness it is a reasonable starting point: four techniques, four calculators, clear boundaries of applicability.

But if you already have a working harness, the value here is not the catalogue but the two counter-intuitive facts: learned routers barely beat heuristics, and subagents save on the order of 11% (and you take them for isolation, not for money). Those I keep. Plus, for myself, the thought the survey never followed through: cache TTL is not a line about server memory; it is the clock rate against which you tune the rhythm of long loops and deferred wake-ups.

