Commentary

A Local LLM for Coding: Not «Which» but «Where»

A review of a practical method for picking a local coding model by hardware, and of where local models have a real niche while precision-critical work stays in the cloud.

Aleksandr Khomutov · Original author: Anubhav · May 23, 2026 · ≈ 3 min

Anubhav frames his article cleanly: the best local coding model is not the one with the highest benchmark score, but the one your machine can actually run without freezing. He argues you should pick a local coding model by hardware, latency and privacy rather than by leaderboard screenshots: «You cannot pick a model based on a leaderboard. You have to pick a model based on your silicon.» I agree with that methodology almost entirely. But I have a prior question, and I think it matters more.

The question is not «which local model wins.» It is where a local model belongs at all. I keep LM Studio and a local MCP running, I exercise these models daily — and I still treat them as unfit for production. In «The Harness Is the Cheap Part» I argued that a local LLM must not sit where production precision is required. Real work I hand to the cloud. This article is a good occasion to sharpen that into a single line.

Where the boundary runs

The boundary is not «local versus cloud» as a worldview. It runs along the nature of the task. There are agentic loops dominated by token cost: a script reads a failing test, suggests a fix, compiles, repeats — fifty times a day. Anubhav describes exactly this scenario and draws the right conclusion: «You buy the hardware once and run infinite tokens.» Those iterations are cheap, high-volume and error-tolerant — a single bad attempt costs nothing, because the loop will try again anyway. This is where a local model fits perfectly.
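The loop described above can be sketched roughly as follows. This is an illustration, not the script from Anubhav's article: `run_tests` and `suggest_fix` are hypothetical callables standing in for your test runner and your local model endpoint, and `max_attempts` is an assumed cap.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    passed: bool
    output: str  # test or compiler output that is fed back to the model


def fix_loop(
    source: str,
    run_tests: Callable[[str], CheckResult],
    suggest_fix: Callable[[str, str], str],
    max_attempts: int = 5,
) -> Optional[str]:
    """Ask the model for fixes until the tests pass or attempts run out.

    Makes at most `max_attempts` model calls. Returns the passing source,
    or None if no attempt passed.
    """
    candidate = source
    for attempt in range(max_attempts + 1):
        result = run_tests(candidate)
        if result.passed:
            return candidate
        if attempt == max_attempts:
            break  # budget spent; a failed attempt cost only local tokens
        candidate = suggest_fix(candidate, result.output)
    return None
```

The point of the shape is that a wrong suggestion is never accepted on trust: the test run is the judge, so an unreliable model only costs another iteration.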

Then there is the output that the end result depends on: the final patch, the architectural decision, the code that goes into review and into production. Here the cost of a mistake is not «an extra token» but a broken release. That output goes to the cloud; for me, that means Claude. Cheap high-volume loops run local; precision-critical output goes to the cloud. That is the boundary, and it matters more than any model ranking.
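The boundary can be written down as a routing rule. The sketch below is one possible encoding, not an established API: the `Task` fields `retryable` and `ships_to_production` are names I am assuming for illustration, and real routing would likely need more signals.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Task:
    name: str
    retryable: bool            # will a loop simply try again if this is wrong?
    ships_to_production: bool  # does the output go into review or a release?


def route(task: Task) -> Backend:
    """Send error-tolerant loop work local, everything else to the cloud."""
    if task.ships_to_production or not task.retryable:
        return Backend.CLOUD
    return Backend.LOCAL
```

The rule deliberately defaults to the cloud: a task has to prove it is cheap to get wrong before it is allowed to run locally.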

What in the methodology is worth keeping

The most valuable thing in the piece is not the model list but the engineering method: «choose by your silicon, not by the leaderboard.» That I would keep in full, because it does not rot with the next release.

Q4_K_M as the practical quality floor. Anubhav warns plainly: «The tradeoff is a slight loss in reasoning capability.» Below Q4 for code, a model will «forget basic syntax and hallucinate variables.» That matches my experience.
A smaller model at Q8 beats a bigger one at Q2. «I always recommend running a smaller parameter model at Q8 rather than a bigger model at Q2.» Chasing parameter count and then crushing it with aggressive quantization is self-deception.
The KV cache competes with the weights for unified memory. Context occupies physical RAM alongside the model weights; the sum of the two, not weight size alone, is what usually hits the memory ceiling.
Swap-to-disk kills throughput. Run out of unified memory and your «generation speed will immediately drop from 30 tokens a second to 1 token a second.» That is not degradation, that is the end.
Target ~15 tok/s for chat and ~40 tok/s for autocomplete. «Latency will kill your focus faster than a slightly wrong answer will» is a formulation I would sign my name to. Below the threshold, take a smaller model or quantize harder, but not below the Q4 floor.
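The memory and throughput points in the list above reduce to simple arithmetic. Here is a back-of-the-envelope sketch: the model shape (32 layers, 8 KV heads, head dimension 128), the fp16 cache and the 6 GiB reserved for the OS are illustrative assumptions, not figures from the article. Nominal bit widths also understate real files, since quantized formats store some tensors at higher precision.

```python
GIB = 1024 ** 3

# Throughput targets quoted in the text, in tokens per second.
TARGET_TOK_PER_S = {"chat": 15.0, "autocomplete": 40.0}


def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Nominal size of the weights at a given quantization width."""
    return n_params * bits_per_weight // 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per layer, head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def fits_in_memory(total_bytes: int, weights: int, kv_cache: int,
                   reserved_bytes: int) -> bool:
    """True if weights plus cache fit beside whatever the OS and apps need."""
    return weights + kv_cache + reserved_bytes <= total_bytes


def meets_target(tokens_generated: int, seconds: float, use: str) -> bool:
    """Compare a measured generation rate with the target for this use."""
    return tokens_generated / seconds >= TARGET_TOK_PER_S[use]


# Hypothetical 8B model at 4 bits on a 16 GiB machine.
weights = weight_bytes(8_000_000_000, 4)
short_ctx = kv_cache_bytes(32, 8, 128, context_tokens=32_768)
long_ctx = kv_cache_bytes(32, 8, 128, context_tokens=131_072)
print(fits_in_memory(16 * GIB, weights, short_ctx, reserved_bytes=6 * GIB))
print(fits_in_memory(16 * GIB, weights, long_ctx, reserved_bytes=6 * GIB))
```

With these assumed numbers the 32k-token cache is 4 GiB, about the size of the weights themselves, and the 128k-token cache alone is 16 GiB. The same arithmetic shows why the Q8-versus-Q2 advice is a real choice: a 14B model at 8 bits and a 56B model at 2 bits have the same nominal footprint.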

And the hardware tiers — three practical classes (~16 GB unified, 32–64 GB or 24 GB VRAM, 128 GB and multi-GPU) — are an honest frame for the «what can I even run» conversation, with no illusions.
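The three tiers can be turned into a small classifier. In this sketch the tier labels and the cutoffs between the named points are my assumptions; the article only gives the anchor values (~16 GB unified, 32–64 GB unified or 24 GB VRAM, 128 GB and multi-GPU).

```python
def hardware_tier(unified_gb: float = 0, vram_gb: float = 0,
                  gpu_count: int = 1) -> str:
    """Map a machine onto one of the three tiers named in the text.

    The labels "entry", "mid" and "large" are hypothetical, and machines
    that fall between the named anchor points are rounded down.
    """
    if unified_gb >= 128 or gpu_count > 1:
        return "large"
    if unified_gb >= 32 or vram_gb >= 24:
        return "mid"
    return "entry"


print(hardware_tier(unified_gb=16))   # entry
print(hardware_tier(vram_gb=24))      # mid
print(hardware_tier(unified_gb=128))  # large
```

Rounding down is the conservative choice here: a 24 GB unified-memory laptop is treated as entry tier, which matches the spirit of picking what the machine can run without freezing.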

Where I would slow down

What deserves caution is the 2026 specifics. The article names very specific models and numbers: a headline 80B MoE model of which it says «it scores 58.7% on SWE-bench Verified», a family of 27B models, an April release from Google under Apache 2.0, a coding family from Mistral folded into a single 128B model, a separate reasoning model and a narrow autocomplete specialist. I cannot confirm those names or those figures. The open-weight ecosystem, as the author himself writes, «moves incredibly fast»: some of these models may have been renamed, and several SWE-bench numbers I have no way to verify. So I treat the names and scores as illustration, not fact. They show how to reason (by tier and quantization), but the specific chart will go stale by the next release. Lean on the principles, not on the table.

Verdict

As a hardware-tier framework the article is useful: it honestly ties «what is in your machine» to «what you can run» and sets the right latency thresholds. As a model leaderboard it should not be trusted: the names and SWE-bench numbers are fast-moving and partly unverifiable. The one keeper here is this: choose by your silicon and your niche, not by a SWE-bench screenshot. And the niche for local models is cheap, high-volume agentic loops where a mistake is cheap. Precision still lives in the cloud.
