Antonio's blog

Istio to Cilium: a grand yak-shave

In late March I went on the longest yak-shave of my life, all because I wanted to stop using Google Keep.

I took shoddy notes, so there will be the occasional dropped bowling ball of a sentence like "Cilium couldn't find the kernel modules it needed, so I had NixOS load them". Rest assured that that "so" clause required vast amounts of searching the web, grepping the Cilium codebase, and general flailing-about, as did many other "so" clauses to follow.

Dramatis personae

Because this document will, by nature, become outdated:

Install Joplin

I installed the Joplin server in my NixOS-based k3s cluster, behind an Istio- and MetalLB-powered Kubernetes Gateway. I use a single Gateway as a catch-all for HTTP services in the cluster, with a wildcard DNS record.

The iOS client could connect to the Joplin server, but the Windows desktop client threw getaddrinfo ENOTFOUND. While fiddling with the desktop client's settings and watching Wireshark, I could see the client sending A and AAAA queries simultaneously—yet it stubbornly preferred the IPv6 non-result to the IPv4 result.

Rather than figure out what was wrong with the client, I decided it was time to resolve an obvious shortcoming: my homelab had IPv6 enabled, but my Kubernetes cluster didn't.

Enable dual-stack on Istio

I destroyed and rebuilt my k3s cluster with dual-stack enabled, reinstalled Istio, added an AAAA wildcard record to my DNS zone, and configured the Gateway to request both addresses...
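For reference, the dual-stack part of the rebuild boils down to handing k3s a second CIDR for pods and for services. A minimal sketch, assuming the services.k3s NixOS module and using placeholder ULA prefixes rather than my real ones:

    # Sketch: dual-stack pod and service CIDRs for k3s.
    # The IPv6 prefixes are placeholders; substitute your own.
    # (Depending on your NixOS version, extraFlags wants a string or a list.)
    services.k3s = {
      enable = true;
      role = "server";
      extraFlags = toString [
        "--cluster-cidr=10.42.0.0/16,fd00:42::/56"
        "--service-cidr=10.43.0.0/16,fd00:43::/112"
      ];
    };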

...only to find out that Istio programmed the Gateway with the IPv4 address only, because Istio's dual-stack support is pre-alpha and off by default.

I could have gone for it anyway, but I'd been looking for an excuse to try Cilium: it appears to have robust IPv6 support, and a coworker had been encouraging me to give it a shot.

Install Cilium

Because Cilium's Gateway API support requires that Cilium be installed as a kube-proxy replacement, I destroyed and rebuilt my cluster again to disable kube-proxy. I then installed Cilium with the dual-stack, kube-proxy replacement, and Gateway API features enabled.
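On the k3s side, "rebuilt my cluster again to disable kube-proxy" amounts to a few more flags, so that k3s ships neither its bundled CNI nor kube-proxy and Cilium can take over both jobs. A sketch extending the earlier list (the Cilium install itself lives outside NixOS, in whatever Helm values or cilium-cli flags you prefer):

    # Sketch: same extraFlags as before, plus the hand-over to Cilium.
    services.k3s.extraFlags = toString [
      "--cluster-cidr=10.42.0.0/16,fd00:42::/56"
      "--service-cidr=10.43.0.0/16,fd00:43::/112"
      "--flannel-backend=none"       # no flannel; Cilium will be the CNI
      "--disable-network-policy"     # drop the bundled network-policy controller
      "--disable-kube-proxy"         # Cilium's eBPF replacement takes over
    ];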

Two problems came up before Cilium would even start. The first: Cilium couldn't find the kernel modules it needed, so I had NixOS load them.
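For the record, "had NixOS load them" is a one-liner; the only trick is knowing which modules to name, and the agent's startup errors spell that out. A sketch with example module names (take the real list from your own logs):

    # Sketch: load the modules the Cilium agent complained about.
    # The names below are examples, not a canonical list.
    boot.kernelModules = [ "iptable_mangle" "iptable_raw" "ip6table_mangle" "ip6table_raw" ];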

The second: there's a safety valve preventing Cilium from handing out node CIDRs if they're much smaller than the cluster CIDR. So I made the node masks bigger.

Fix Cilium

At this point, Cilium was healthy, but no other pods were: they consistently failed their health checks and couldn't connect to anything outside the cluster. I spent several days on this before giving up and filing an issue. (Don't read it to the bottom or it'll spoil this story.)

With tcpdump and hubble I could see the SYN from a given node's kubelet, the SYN-ACK from a given pod, and...no ACK from kubelet. I was vaguely aware that NixOS's default iptables rules could be an issue, but I wanted a solution other than disabling the nodes' firewalls.

I remembered pwru, another program by the Cilium team. I'd packaged the then-current version for myself some months ago, but dropped it for lack of interest. I rescued that commit from git reflog, installed the program on my k3s nodes, logged into one of them, and ran pwru: it clarified instantly, with the magic word SKB_DROP_REASON_NETFILTER_DROP, that there was a DROP rule ruining my day somewhere.
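(If you'd like to follow along without the packaging detour: pwru is in nixpkgs these days, so putting it on a node can be as simple as the sketch below. It takes pcap-style filter expressions on the command line.)

    # Sketch: install pwru from nixpkgs on the node.
    environment.systemPackages = [ pkgs.pwru ];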

There were only four DROP rules in the tables, one of them mentioning "rpfilter". That term sounded familiar from my days of feverish Cilium-related Googling: there were reports of a sysctl that Cilium required to be disabled, net.ipv4.<...>.rp_filter. So I deleted the rule. Bam, cluster's healthy again.

It turned out NixOS enables reverse-path filtering by default. So I added rules for Cilium-related packets to bypass that filter.
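My notes don't record the exact rules, so here's a hedged sketch of the two NixOS-level ways to get there, assuming the default iptables-based firewall. The coarse knob is networking.firewall.checkReversePath; targeted exceptions can go through networking.firewall.extraCommands, though the chain and table names below are an assumption, so check iptables-save on your own node before copying:

    # Option 1, the blunt knob: relax only the reverse-path check and leave
    # the rest of the firewall alone.
    networking.firewall.checkReversePath = "loose";

    # Option 2, the targeted one: keep the strict check but let Cilium's
    # interfaces (cilium_* and the lxc* pod veths) skip it.
    networking.firewall.extraCommands = ''
      ip46tables -t mangle -I nixos-fw-rpfilter -i cilium_+ -j RETURN 2>/dev/null || true
      ip46tables -t mangle -I nixos-fw-rpfilter -i lxc+ -j RETURN 2>/dev/null || true
    '';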

"Since I'm using Cilium in kube-proxy mode", I thought, "I can use the L2 feature, like this and this, and spare myself reinstalling MetalLB, right?"

Wrong. Cilium's L2 feature doesn't support IPv6. I had to reinstall MetalLB.

Back to Joplin

Now there was a different error: the desktop client distrusted my homelab's root CA, which I'd already installed in Windows' CA store. I've been bitten before by NodeJS using its own CA bundle, but I didn't have a solution handy for Windows and didn't care to go looking for one.

I exported the homelab CA's certificate somewhere and put the path in Joplin's "Custom TLS certificates" setting.

The Joplin desktop client, praise be, began syncing with my server.

And that's it

That's how, over the course of a week, I turned my Istio-powered k3s cluster into a Cilium-powered one, just because I wanted to store my own notes.

Epilogue

I had a WireGuard tunnel between a VPS and a NodePort on my NAS so I could access my music server without exposing my homelab. This worked with Istio and k3s' kube-proxy, but with Cilium the NodePort socket isn't actually published to the OS (not in a way ss -lnt can see, anyway), so packets from the VPS got bounced with a TCP reset and SKB_DROP_REASON_NO_SOCKET.

So, I moved the tunnel: I created a WG interface on my router, made that interface (rather than the NAS') my VPS' peer, and added firewall and port-forwarding rules to pass WG packets to <NAS>:<NodePort>. The music proxy works again.

A few days later, I noticed the message "ps_bpf_recvmsg: No buffer space available" being logged every minute by each node's dhcpcd. The "bpf" in the name hinted at more Cilium trouble, and the usual net.core.[rw]mem_max incantations recommended on the Internet were no use. Then it hit me: Cilium interfaces don't need DHCP! I told dhcpcd not to bother with them. The messages stopped.
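In configuration terms that's one line; the patterns below assume the usual names for the devices Cilium creates (cilium_* plus the lxc* pod veths):

    # Sketch: keep dhcpcd away from Cilium's virtual interfaces.
    networking.dhcpcd.denyInterfaces = [ "cilium*" "lxc*" ];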