Note

Cilium and eBPF: replacing kube-proxy and removing iptables from Kubernetes networking

Everyone sees eBPF as a speedup, but the real value is a decision rule: where replacing kube-proxy pays off, and where iptables still wins.

Aleksandr Khomutov June 19, 2026 ≈ 3 min

kube-proxy is the least visible component in Kubernetes: it quietly turns a Service's virtual IP into the address of a concrete pod. As long as you have a few dozen Services, nobody thinks about it. The trouble starts at scale, and in 2026 there is finally a serious reason to replace this old mechanism.

Why the iptables mode hits a ceiling

In its default mode, kube-proxy programs iptables rule chains that the kernel walks linearly for the first packet of every new connection; later packets of that connection reuse the conntrack entry. Below ~1000 Services that is invisible; above it, CPU cost grows in proportion to the catalog size. But worse than scale is blindness. iptables picks a backend pod at random (the kernel "tosses a weighted coin"), and the routing decision sees neither latency, nor connection-queue depth, nor the pod's health. Traffic can land on an overloaded pod or cross an availability zone even when a healthy pod sits right next to it on the same node. Under load this degrades into retries → cascading failures → P99 spikes.
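The "weighted coin" can be made concrete. A minimal sketch, assuming the usual layout of one rule per backend with cascading match probabilities 1/n, 1/(n-1), ..., 1; the pod names and the simulation are illustrative, not kube-proxy code:

```python
import random
from collections import Counter


def chain_probabilities(n):
    """Per-rule match probabilities for n backends: 1/n, 1/(n-1), ..., 1."""
    return [1.0 / (n - i) for i in range(n)]


def pick_backend(backends, rng):
    """Walk the rules in order; the first rule whose coin toss succeeds wins."""
    for backend, p in zip(backends, chain_probabilities(len(backends))):
        if rng.random() < p:
            return backend
    return backends[-1]  # never reached: the last rule matches with probability 1


rng = random.Random(42)  # fixed seed so the run is repeatable
backends = ["pod-a", "pod-b", "pod-c", "pod-d"]  # hypothetical pod names
counts = Counter(pick_backend(backends, rng) for _ in range(40_000))
print(counts)
```

The cascading probabilities work out to a uniform pick across backends. Note what the function never receives: latency, queue depth, or which node or zone the pod lives in.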

1.36 dropped IPVS, and nftables fixes the wrong thing

Kubernetes 1.36 removed kube-proxy's IPVS mode, so for anyone who once escaped iptables into IPVS for the sake of constant-time lookups, the options just narrowed. kube-proxy's nftables backend (beta since 1.31, stable since 1.33) fixes rule-update scaling and lookup speed, but it leaves the root cause untouched: decisions are still made at L4 and stay blind to application state.

How eBPF changes the rules

Cilium replaces kube-proxy with a single flag — kubeProxyReplacement: true. Instead of walking a rule chain, you get a constant-time lookup inside a compiled eBPF program in the kernel. The mental model is simple: iptables asks "which rule matches this packet?", eBPF answers "I already know what to do".
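The difference between "which rule matches?" and "I already know" is the difference between a list scan and a keyed lookup. A toy model, assuming 5000 Services keyed by (ClusterIP, port, protocol); the addresses and backend names are made up, and this is not how the kernel datapath is actually written:

```python
def linear_lookup(rules, key):
    """Scan rules in order, as a rule chain does; return (backend, comparisons)."""
    for i, (match, backend) in enumerate(rules, start=1):
        if match == key:
            return backend, i
    return None, len(rules)


def hash_lookup(table, key):
    """One keyed lookup, the way a hash map answers regardless of catalog size."""
    return table.get(key)


# Hypothetical catalog: 5000 Services, each with a single backend for brevity.
services = [
    ((f"10.96.{i // 256}.{i % 256}", 80, "TCP"), f"backend-{i}")
    for i in range(5000)
]
table = dict(services)

last_key = services[-1][0]
backend, comparisons = linear_lookup(services, last_key)
print(backend, comparisons)          # worst case: the scan touches every rule
print(hash_lookup(table, last_key))  # same answer from a single keyed lookup
```

Doubling the catalog doubles the worst-case scan, while the keyed lookup stays roughly flat.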

The key trick is socket-LB. Classic kube-proxy does DNAT on the packet after connect(), and every packet passes through conntrack. Cilium rewrites the destination at the syscall level — before the packet even exists. No conntrack, no SNAT overhead, lower latency. On top of that, Maglev consistent hashing pins one 5-tuple (src/dst IP, src/dst port, protocol) to the same pod — critical for cache layers like Redis, where random backend selection scatters requests and kills the hit rate. A debugging side effect: tcpdump on the pod side shows the already-rewritten destination instead of the Service ClusterIP — the first thing that throws people off.
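To see why Maglev keeps a flow on one pod, here is a minimal sketch of the table-population step from the published Maglev algorithm. The hash function (SHA-256), the small table size of 251, and the pod names are assumptions chosen for readability; Cilium's actual implementation differs in those details:

```python
import hashlib


def _h(data, seed):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(f"{seed}:{data}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_maglev_table(backends, size=251):
    """Fill a lookup table of prime `size`; backends take turns claiming slots."""
    if not backends or len(backends) > size:
        raise ValueError("need between 1 and `size` backends")
    offsets = [_h(b, "offset") % size for b in backends]
    skips = [_h(b, "skip") % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        for i, backend in enumerate(backends):
            # Step through this backend's own permutation until a free slot appears.
            while True:
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
                next_idx[i] += 1
                if table[slot] is None:
                    break
            table[slot] = backend
            filled += 1
            if filled == size:
                return table


def select_backend(table, flow):
    """Map a 5-tuple to a table slot; the same flow always hits the same slot."""
    key = "|".join(str(field) for field in flow)
    return table[_h(key, "flow") % len(table)]


table = build_maglev_table(["pod-a", "pod-b", "pod-c"])  # hypothetical pods
flow = ("10.0.1.5", "10.96.0.10", 43512, 6379, "TCP")     # src, dst, sport, dport, proto
print(select_backend(table, flow))
```

Because the slot depends only on the flow's hash and the table contents, repeated packets of one 5-tuple resolve to the same pod, and the turn-taking fill keeps the backends' shares of the table nearly equal.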

Where it pays off and where it doesn't

The honest counterpoint matters more than the marketing. On a 1–3 node cluster, kube-proxy on iptables can beat Cilium on latency for pure pod-to-pod traffic: with only a couple hundred Services and a small conntrack table, there is too little rule-walking cost for eBPF to win back its own overhead. Cilium also costs more to operate: it requires a Linux kernel ≥5.x, adds more knobs, and eBPF incidents are harder to debug than reading iptables rules.

The decision rule: replacing kube-proxy with eBPF is justified at ≥5 nodes, ≥100 Services, or when you explicitly need L7 policies, observability through Hubble, or Cluster Mesh. It is a conscious choice for load, not a "default always".
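The rule is small enough to write down as a function. A sketch that reads the list above as "any one of these is enough"; the `Cluster` type and its field names are hypothetical, and the thresholds are the ones stated in this article:

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    nodes: int
    services: int
    needs_l7_policy: bool = False
    needs_hubble: bool = False
    needs_cluster_mesh: bool = False


def ebpf_replacement_justified(c: Cluster) -> bool:
    """Apply the article's rule: scale thresholds OR an explicit feature need."""
    by_scale = c.nodes >= 5 or c.services >= 100
    by_feature = c.needs_l7_policy or c.needs_hubble or c.needs_cluster_mesh
    return by_scale or by_feature


print(ebpf_replacement_justified(Cluster(nodes=3, services=40)))
print(ebpf_replacement_justified(Cluster(nodes=3, services=40, needs_hubble=True)))
print(ebpf_replacement_justified(Cluster(nodes=12, services=800)))
```

If you read the thresholds as a conjunction instead (both ≥5 nodes and ≥100 Services), change the `or` in `by_scale` to `and`.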

Where to start

First check the kernel version (≥5.x) on every node: that is a hard prerequisite, not a recommendation. Roll it out through canary nodes, because kernel-level network problems are scarier than debugging iptables. Turn on Hubble from day one: flow visibility turns "the packet vanished somewhere" into "the datapath dropped the packet for this specific reason", and it pays off on the very first incident. On a bare-metal stack (kubeadm + Cilium with kubeProxyReplacement=true) this combination has long been the standard.
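The kernel check is easy to script before touching any node. A minimal sketch, assuming you have already collected each node's kernel release string (for example from the `kernelVersion` field in a node's `status.nodeInfo`); the node names are hypothetical, and the `(5, 0)` floor only mirrors the "≥5.x" above, since the exact minimum depends on the Cilium release and on distro backports:

```python
import re


def kernel_major_minor(release):
    """Parse '5.15.0-91-generic' into (5, 15)."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def nodes_below_minimum(node_kernels, minimum=(5, 0)):
    """Return the sorted names of nodes whose kernel is older than `minimum`."""
    return sorted(
        name
        for name, release in node_kernels.items()
        if kernel_major_minor(release) < minimum
    )


# Hypothetical inventory: node name -> kernel release string.
inventory = {
    "node-1": "5.15.0-91-generic",
    "node-2": "4.18.0-553.el8_10.x86_64",
    "node-3": "6.8.0-31-generic",
}
print(nodes_below_minimum(inventory))
```

Any node this reports is a blocker to resolve before the canary rollout. A plain version comparison will also flag enterprise kernels that carry backported eBPF features, so treat a hit as a prompt to check the Cilium requirements for that distro rather than a final verdict.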

And the reverse: if you don't need L7 policies, a mesh, or advanced observability, Calico + kube-proxy stays perfectly sufficient. The guiding principle for the network layer is the same as everywhere in infrastructure — don't pay in operations for capabilities you never use.
