Note

Three levers against Kubernetes overspend: idle, scale-to-zero, right-sizing

The average cluster runs at 20–30% utilization. You pay for the rest. Here are three levers that recover the money without rewriting your architecture.

The average Kubernetes cluster uses 20–30% of its provisioned capacity. The other 70–80% sits idle, and on the cloud bill that translates to up to 35% in preventable waste. On a $1M annual bill, that is up to $350K spent not on resilience but on fear.

The root cause isn't negligence; it's defensive overprovisioning. A service OOM-crashed once, so an engineer doubled the memory request: locally rational, globally wasteful. Blaming people is pointless: as long as the template baseline is inflated, every new service inherits the overspend. The fix is systemic, and it has three levers. None of them requires rewriting your applications.

Lever 1. Right-sizing: bring requests back to reality

Inflated requests are the primary driver of waste. The scheduler packs pods by requests, not by actual usage: a service asks for 500m CPU, uses 95m — and the cluster autoscaler spins up extra nodes to hold air. Most services carry CPU limits 3–5× their real p99, the result of copy-paste from a template with no review.

The baseline rule is to collect 30 days of usage percentiles and recompute: new CPU limit = max(p99 × 1.5, max × 1.1, floor), new memory limit = max(p99 × 1.3, max × 1.1, floor), where max is the observed peak and floor is a per-resource minimum.

The key asymmetry: cut CPU aggressively (throttling is reversible — the pod just slows down), cut memory cautiously (an OOMKill drops the pod into a restart loop). The CPU floor is around 100m for GC spikes and startup; the memory floor depends on the runtime (JVM ~256Mi, Go ~32Mi). You don't compute percentiles by hand: Robusta KRR pulls the data straight from Prometheus and distinguishes bursty batch jobs from steady-state services, tuning how conservative its recommendations are. Reported savings: 35–50% of compute over 60 days.
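The recompute rule can be sketched in a few lines of Python. This is a minimal illustration, not what Robusta KRR runs internally: the function names are hypothetical, the memory floor is passed in because the text makes it runtime-dependent, and the sample usage figures other than the 95m CPU p99 are invented for the example.

```python
# Sketch of the percentile rule from the text. Integer millicores / MiB and
# percent factors keep the arithmetic exact (no float rounding surprises).
CPU_FLOOR_M = 100  # ~100m floor from the text (GC spikes, startup)


def scale_up(value: int, percent: int) -> int:
    """Multiply by percent/100, rounding up to the next whole unit."""
    return (value * percent + 99) // 100


def right_size(p99: int, peak: int, p99_percent: int, floor: int) -> int:
    """max(p99 * factor, observed peak * 1.1, floor)."""
    return max(scale_up(p99, p99_percent), scale_up(peak, 110), floor)


def recommend(cpu_p99_m: int, cpu_max_m: int,
              mem_p99_mi: int, mem_max_mi: int, mem_floor_mi: int) -> dict:
    """Apply the CPU factor (1.5) and the more cautious memory factor (1.3)."""
    cpu = right_size(cpu_p99_m, cpu_max_m, p99_percent=150, floor=CPU_FLOOR_M)
    mem = right_size(mem_p99_mi, mem_max_mi, p99_percent=130, floor=mem_floor_mi)
    return {"cpu": f"{cpu}m", "memory": f"{mem}Mi"}


if __name__ == "__main__":
    # Hypothetical JVM service: CPU p99 95m / peak 140m, memory p99 180Mi / peak 210Mi.
    print(recommend(95, 140, 180, 210, mem_floor_mi=256))
```

With these sample inputs the CPU value is driven by the observed peak (140m × 1.1), while the memory value is held up by the JVM floor rather than by usage.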

Lever 2. Scale-to-zero: stop paying for idle

Right-sizing shrinks a running pod; scale-to-zero removes the pod entirely when there's no work. Three typical targets:

  • Async workers. CPU-based HPA is the wrong tool here: CPU is a poor proxy for backlog, and by default HPA cannot go below one replica, so idle workers never disappear. You have to scale on demand, that is, on queue depth. KEDA takes pods from 0 to N based on SQS / Kafka / RabbitMQ queue length, and cooldownPeriod damps the flapping.
  • Preview / PR environments. Scale-to-zero on inbound request rate yields around 60% savings on short-lived environments. Overnight, a feature namespace genuinely costs zero.
  • Scheduled non-prod. Stage/demo running 24/7 versus 8/5 is a 4× difference on the bill. The tool is kube-downscaler (the maintained fork is py-kube-downscaler): the downscaler/uptime="Mon-Fri 08:00-20:00 UTC" annotation on a namespace shuts environments down outside working hours, while downscaler/exclude keeps what must stay — a database for nightly tests, say. Real result: 168 → 60 compute hours a week and $2,700/month saved with one change; the morning scale-up takes under three minutes. KEDA scales on events; this is pure scheduling — for non-prod that is enough.
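The schedule arithmetic behind the non-prod numbers is easy to check. The sketch below is plain Python and does not use kube-downscaler itself; the helper names are hypothetical, and it assumes an uptime window that starts and ends on the same day.

```python
from datetime import datetime

HOURS_PER_WEEK = 7 * 24  # 168 hours for an always-on environment


def weekly_uptime_hours(days_per_week: int, start: str, end: str) -> float:
    """Hours per week inside a same-day window such as 08:00-20:00."""
    fmt = "%H:%M"
    window = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return days_per_week * window.total_seconds() / 3600


def compute_hours_saved_fraction(uptime_hours: float) -> float:
    """Share of always-on compute hours that the schedule removes."""
    return 1 - uptime_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    hours = weekly_uptime_hours(5, "08:00", "20:00")  # Mon-Fri 08:00-20:00
    print(hours, round(compute_hours_saved_fraction(hours), 3))
```

A Mon-Fri 08:00-20:00 window gives the 60 hours quoted above, roughly 64% fewer compute hours than 24/7; a strict 8/5 window gives 40 hours, which is where the roughly 4× figure comes from (168 / 40 = 4.2).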

The entry cost is cold start (~30 seconds to the first response). For latency-sensitive queues you keep one warm pod (minReplicaCount: 1 on the ScaledObject, rather than the paused-replicas annotation, which freezes autoscaling at a fixed count); for heavy long-running tasks you use a ScaledJob instead of a long-running worker. Native HPA can also scale to zero on External/Object metrics behind the HPAScaleToZero feature gate (alpha since Kubernetes 1.16), but KEDA stays the default choice: production-ready, with dozens of scalers out of the box.
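To make the queue-depth idea concrete, here is a simplified model of the replica calculation. It is an assumption-laden sketch, not KEDA's implementation: it uses the HPA-style rule of "backlog divided by a per-replica target, rounded up" and ignores polling intervals, activation thresholds, and cooldownPeriod. All names are hypothetical.

```python
def desired_replicas(queue_length: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Replicas needed so each one handles at most target_per_replica messages."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if queue_length < 0:
        raise ValueError("queue_length must not be negative")
    wanted = -(-queue_length // target_per_replica)  # integer ceiling division
    return max(min_replicas, min(wanted, max_replicas))


if __name__ == "__main__":
    print(desired_replicas(0, 5))                  # empty queue, scale to zero
    print(desired_replicas(0, 5, min_replicas=1))  # one warm pod kept
    print(desired_replicas(12, 5))                 # backlog of 12, 5 per replica
```

Setting min_replicas to 1 models the warm-pod trade-off: the service never pays the cold start, but it also never reaches zero cost.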

Lever 3. Node-level idle: bin-packing, consolidation, spot

Even with honest requests, idle cost remains: (node capacity − sum of requests) × node price. A static node pool pins the instance type and bin-packs poorly. Karpenter flips the model: it provisions nodes to fit specific pods (bin-packing), continuously re-evaluates the layout and collapses underutilized nodes (consolidationPolicy: WhenEmptyOrUnderutilized), and moves stateless workloads with a PDB onto spot, which is 60–90% cheaper than on-demand. The typical effect versus Cluster Autoscaler with managed node groups is a 30–60% reduction.
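The idle-cost formula and the effect of packing can be illustrated with a toy model. This is a one-dimensional first-fit-decreasing heuristic over CPU requests only, written as an assumption for illustration; it is not Karpenter's algorithm, which weighs several resources and many instance types. Node size, price, and pod requests are invented sample values.

```python
def first_fit_decreasing(requests_m: list[int], node_capacity_m: int) -> list[list[int]]:
    """Pack CPU requests (millicores) onto identical nodes, largest pod first."""
    nodes: list[list[int]] = []
    for req in sorted(requests_m, reverse=True):
        if req > node_capacity_m:
            raise ValueError(f"request {req}m does not fit on any node")
        for node in nodes:
            if sum(node) + req <= node_capacity_m:
                node.append(req)  # fits on an existing node
                break
        else:
            nodes.append([req])   # no node had room, open a new one
    return nodes


def idle_cost_per_hour(nodes: list[list[int]], node_capacity_m: int,
                       node_price_per_hour: float) -> float:
    """(capacity - sum of requests), as a share of a node, times node price."""
    idle_m = sum(node_capacity_m - sum(node) for node in nodes)
    return idle_m / node_capacity_m * node_price_per_hour


if __name__ == "__main__":
    pods = [500, 1500, 1200, 700, 300, 2000]          # hypothetical requests
    layout = first_fit_decreasing(pods, node_capacity_m=4000)
    print(len(layout), layout)
    print(idle_cost_per_hour(layout, 4000, node_price_per_hour=0.20))
```

Even in this toy case the second node ends up nearly half empty, which is the residue that consolidation onto a smaller instance type would target.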

Spot isn't for everything: databases, replica-less singletons, and stateful workloads without graceful shutdown stay on on-demand. The working pattern is a baseline on on-demand plus burst on spot.

Savings Plans: buy them last

Commitment discounts reach 66% versus on-demand — and lock your baseline for one or three years. Bought before right-sizing, they cement the inflated spend: a discount on hot air. The working order: the three levers first, then 30 days of stabilisation, then a commit at 80% of average hourly spend — the buffer covers growth and seasonality. No-upfront gives away 3–5% of discount versus full-upfront but keeps your cash; Compute Savings Plans cover EC2, Fargate and Lambda with no instance-type lock-in.
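The sizing heuristic above (commit at 80% of average hourly spend after stabilisation) is simple enough to sketch. The function name and sample figures are hypothetical, and the sketch assumes you already have hourly compute spend for the stabilised period; it does not call any billing API.

```python
from statistics import mean


def commit_per_hour(hourly_spend: list[float], coverage: float = 0.80) -> float:
    """Hourly commitment as a share of average hourly spend (80% in the text)."""
    if not hourly_spend:
        raise ValueError("need at least one hourly spend sample")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return coverage * mean(hourly_spend)


if __name__ == "__main__":
    # Hypothetical post-right-sizing spend samples, in dollars per hour.
    samples = [10.0, 12.0, 8.0, 10.0]
    print(commit_per_hour(samples))
```

The remaining 20% stays on-demand or spot, which is the buffer for growth and seasonality that the text describes.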

Closing the loop: a process, not a one-off setup

The cardinal mistake is "set up the autoscaler and forget it." Load and traffic shift, and optimization without continuous detection rots. So the levers only work on top of two things. First, visibility: OpenCost allocates real cost by namespace/label and turns "EC2 = $X" into "payments-api = $Y." Second, policy: Kyverno or Gatekeeper block pods without requests, otherwise the overprovisioning story repeats within a month.
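The policy half of that loop boils down to one check: does every container declare CPU and memory requests? The sketch below expresses that check in plain Python over a parsed Pod manifest. It is not Kyverno or Gatekeeper syntax (those use their own policy languages), and the function name and sample manifest are hypothetical.

```python
def containers_missing_requests(pod: dict) -> list[str]:
    """Names of containers that lack a CPU or memory request."""
    missing = []
    spec = pod.get("spec") or {}
    for container in spec.get("containers") or []:
        requests = (container.get("resources") or {}).get("requests") or {}
        if "cpu" not in requests or "memory" not in requests:
            missing.append(container.get("name", "<unnamed>"))
    return missing


if __name__ == "__main__":
    pod = {
        "metadata": {"name": "payments-api"},
        "spec": {
            "containers": [
                {"name": "app",
                 "resources": {"requests": {"cpu": "150m", "memory": "256Mi"}}},
                {"name": "sidecar",
                 "resources": {"requests": {"cpu": "50m"}}},  # no memory request
            ]
        },
    }
    print(containers_missing_requests(pod))  # an admission policy would reject this pod
```

An admission controller enforcing the same rule rejects the pod at creation time, so the gap never reaches the scheduler.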

Three cheap habits keep the savings in place:

  • A weekly automated per-namespace report in Slack: engineers see their own trend, with no dedicated FinOps team needed.
  • Infracost in PRs: the cost delta of a requests change is visible before merge.
  • A VPA-recommendation review as an on-call handoff item.

A real case run on this playbook: $67K → $31K a month (−54%) in eight weeks, using the same levers plus discipline in logging and data transfer.

Basic measures deliver 15–30% with not a single architectural change. The rest is discipline, not magic.

© 2026 axyi.ru · CC BY 4.0