Thursday, March 7, 2024

Why Kubernetes Takes a Long Time

The Problem

Let's test something simple in Kubernetes on a fresh RKE2 cluster running on bare metal (under Proxmox): deploy the classic intro app "numbers" from the book "Learn Kubernetes in a Month of Lunches".  Other simple test apps, such as the "google-samples/hello-app" application, will behave identically for the purposes of this blog post.

If you look at the YAML files, you'll see a "kind: Service" with a spec "type: LoadBalancer" and some port info.  After an apparently successful application deployment, if you run "kubectl get svc numbers-web" you will see a TYPE of LoadBalancer with an EXTERNAL-IP listed as "<pending>".  It will never exit the pending state, and the service will be inaccessible from the outside world.
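
For illustration, a Service of this kind looks roughly like the sketch below.  This is not copied from the book's manifests: the selector label and the targetPort are assumptions, and only the name, the type, and port 8080 come from this post.

```yaml
# Sketch of a LoadBalancer Service; label and targetPort are assumed.
apiVersion: v1
kind: Service
metadata:
  name: numbers-web
spec:
  type: LoadBalancer      # EXTERNAL-IP stays <pending> until something hands out an address
  selector:
    app: numbers-web      # assumed label; must match the labels on the app's pods
  ports:
    - port: 8080          # port served on the external IP
      targetPort: 80      # assumed container port
```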

NodePorts do work out of the box with no extra software and no extra configuration, but you don't have to be limited to NodePorts forever.

The Solution

Kubernetes is a container orchestrator and it is willing to cooperate with external load-balancing systems, but it does not implement a load balancer.

That's OK.

If K8S can virtualize anything, why not virtualize its external load balancer?  This is not as bad an idea as a VMware cluster getting its IP addresses from a DHCP server running inside the cluster; if the cluster is impaired or down enough that the LB isn't working, the app probably isn't working either, so no loss.

We can install MetalLB in Kubernetes, which implements a virtual external load balancer system. https://metallb.universe.tf/

The Implementation

  1. Let's read about how to install MetalLB.  https://metallb.universe.tf/installation/  I see we are strongly encouraged to use IPVS instead of iptables.
  2. Research why we're encouraged to use IPVS instead of iptables.  I found https://www.tigera.io/blog/comparing-kube-proxy-modes-iptables-or-ipvs/ which explains that IPVS connection processing scales roughly O(1), constant regardless of the number of services, where the iptables version scales roughly O(n) with the number of services and endpoints.  OK, we have to use IPVS, which is an in-kernel load balancer that kube-proxy programs instead of writing iptables rules.  Additionally, the K8S docs discussing kube-proxy are at https://kubernetes.io/docs/reference/networking/virtual-ips/
  3. Next, research IPVS.  Aggravating that every Google search for IPVS is autocorrected to IPv6, EVERY TIME.  Found http://www.linuxvirtualserver.org/software/ipvs.html
  4. Will this work with RKE2?  It's reported that both iptables and IPVS work fine with Calico.  RKE2 runs Canal by default which is Flannel between nodes and Calico for network policies, so I guess it's OK?  https://docs.rke2.io/install/network_options
  5. Time to set up IPVS on all RKE2 nodes.  The usual song and dance with automation: set up the first node completely manually, then set it up in Ansible, test on the second node, then roll out slowly and carefully.  First IPVS setup step: install ipvsadm so I can examine the operation of the overall IPVS system, "apt install ipvsadm".  Not much to test in this step; success would be running "ipvsadm" and seeing nothing weird.
  6. IPVS needs a kernel module, so without rebooting, modprobe the kernel ip_vs module, then try "ipvsadm" again, then if it works, create a /etc/modules-load.d/ip_vs.conf file to automatically load the ip_vs module during node reboots.
  7. Finally, add the IPVS config for kube-proxy to the end of the RKE2 config.yaml: merely use kube-proxy-arg to select ipvs mode and to enable strict ARP, which MetalLB requires when kube-proxy runs in IPVS mode.
  8. After a node reboot, RKE2 should have kube-proxy running in an IPVS compatible mode.  Success looks like running "ipvsadm" outputs sane-appearing mappings and "ps aux | grep kube-proxy" shows the options --proxy-mode=ipvs and --ipvs-strict-arp=true.  None of this manual work was straightforward, and it took some time to nail down.
  9. Set up automation in Ansible to roll out to the second node.  This was pretty uneventful and the branch merge on Gitlab can be seen here: https://gitlab.com/SpringCitySolutionsLLC/ansible/-/commit/65445fd473e5421461c4e20ae5d6b0fe1fe28dc4
  10. Finally, complete the IPVS conversion by rolling out and testing each node in the RKE2 cluster.  The first node done manually with a lot of experimentation took about half a day, the second took an hour, and the remaining nodes took a couple minutes each.  Cool, I have an RKE2 cluster running kube-proxy in IPVS mode, exactly what I wanted.
  11. Do I run MetalLB in BGP or L2 mode?  https://metallb.universe.tf/concepts/  I don't have my BGP router set up so it has to be L2 for now.  In the long run, I plan to set up BGP but I can spare a /24 for L2 right now.  Note that dual-stack IPv4 and IPv6, which I plan to eventually use, requires FRR-mode BGP connections, which is a problem for future-me, not today.
  12. Allocate some IP space in my IPAM.  I use Netbox as an IPAM.  Reserve an unused VLAN and allocate L2 and future BGP prefixes.  I decided to use IPv4 with 10.10.150.0/24 from my RFC1918 address space; I will add IPv6 "later".  I do almost all of my Netbox configuration automatically via Ansible, which has a great plugin for Netbox.  Ansible's Netbox integration can be seen at https://netbox-ansible-collection.readthedocs.io/en/latest/ The Ansible branch merge to allocate IP space looks like this: https://gitlab.com/SpringCitySolutionsLLC/ansible/-/commit/1d9a1e6298ce6f041ab4e98ad374850faf4a1412
  13. It is time to actually install MetalLB.  I use Rancher to wrangle my K8S clusters, it's a nice web UI, although I could do all the helm work with a couple lines of CLI work.  Log into Rancher, RKE cluster, "Apps", "Charts", search for metallb and click on it, "Install", "Install into Project" "System", "Next", "Install", and watch the logs. It'll sit in Pending-Install for a while.
  14. Verify the operation of MetalLB.  "kubectl get all --namespace metallb-system" should display a reasonable output.  Using Rancher, "RKE" cluster, "Apps", "Installed Apps", namespace metallb-system should contain a metallb with reasonable status results.
  15. Configure an IPAddressPool for MetalLB as per the IPAM allocation in Netbox.  Here is a link to the docs for IPAddressPools: https://metallb.universe.tf/apis/#ipaddresspool Currently, I only have a "l2-pool" but I will eventually have to add a "bgp-pool".
  16. Configure an L2Advertisement for MetalLB to use the IPAddressPool above.  Here is a link to the docs for L2Advertisements: https://metallb.universe.tf/apis/#l2advertisement  Currently, I'm feeding "default" to "l2-pool" which will probably default to "bgp-pool" after I get BGP working.
  17. Try provisioning an application using a Service type LoadBalancer.  I used numbers-web as per the intro.  In the CLI, "kubectl get svc numbers-web" should show a TYPE "LoadBalancer" and an "EXTERNAL-IP" in your L2 IPAM allocation, and even list the PORT(S) mapping.
  18. Check operation in Rancher.  "RKE", "Service Discovery", "Services", click thru on numbers-web, the top of the page should contain a "Load Balancer" IP address, the tab "Recent Events", should see nodeAssigned and IPAllocated events, and the tab "Ports" should tell you the ports in use.
  19. Test in a web browser from the desktop.  Remember that the numbers-web app runs on port 8080 not the default 80.
  20. You can specify statically assigned IP addresses using a custom annotation described at: https://metallb.universe.tf/usage/#requesting-specific-ips  This is useful because I can add DNS entries in Active Directory using Ansible pointing to addresses of my choice.
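
Steps 5 through 7 could be sketched as an Ansible playbook along these lines.  This is an illustration, not the playbook from the GitLab commit linked in step 9: the inventory group "rke2_nodes" is hypothetical, /etc/rancher/rke2/config.yaml is assumed to be the config file in use (it is RKE2's default location), and the last task assumes config.yaml does not already contain a kube-proxy-arg key.

```yaml
# Hypothetical playbook; the group name "rke2_nodes" is an assumption.
- name: Prepare RKE2 nodes for kube-proxy in IPVS mode
  hosts: rke2_nodes
  become: true
  tasks:
    - name: Install ipvsadm to inspect the IPVS tables (step 5)
      ansible.builtin.apt:
        name: ipvsadm
        state: present

    - name: Load the ip_vs kernel module without rebooting (step 6)
      community.general.modprobe:
        name: ip_vs
        state: present

    - name: Load ip_vs automatically on every boot (step 6)
      ansible.builtin.copy:
        dest: /etc/modules-load.d/ip_vs.conf
        content: "ip_vs\n"
        owner: root
        group: root
        mode: "0644"

    - name: Append the kube-proxy IPVS settings to the RKE2 config (step 7)
      ansible.builtin.blockinfile:
        path: /etc/rancher/rke2/config.yaml
        marker: "# {mark} ANSIBLE MANAGED: kube-proxy IPVS"
        block: |
          kube-proxy-arg:
            - proxy-mode=ipvs
            - ipvs-strict-arp=true
```

The two kube-proxy-arg entries correspond to the --proxy-mode=ipvs and --ipvs-strict-arp=true flags checked in step 8, after the node reboot.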

For reference, a bare-bones ipaddresspool.yaml looks like this:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: l2-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.10.150.0/24

And an equally bare-bones l2advertisement.yaml looks like this:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - l2-pool
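
To tie step 20 to the pool above, a Service can ask MetalLB for a fixed address out of l2-pool with an annotation.  A sketch under stated assumptions: 10.10.150.10 is a made-up example address inside the pool, the selector label and targetPort are assumed, and the annotation key shown may differ between MetalLB versions, so check the usage page linked in step 20 for the key your version expects.

```yaml
# Sketch only; address, label, and targetPort are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: numbers-web
  annotations:
    # Request a specific address instead of the next free one in the pool.
    metallb.universe.tf/loadBalancerIPs: 10.10.150.10
spec:
  type: LoadBalancer
  selector:
    app: numbers-web      # assumed label; must match the labels on the app's pods
  ports:
    - port: 8080          # numbers-web is reached on 8080, not the default 80
      targetPort: 80      # assumed container port
```

With a fixed address like this, the matching DNS entry can be created ahead of time and never has to chase whatever address the pool hands out.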

The Summary

This took 20 logically distinct steps.  Don't get me wrong, K8S is awesome, MetalLB is awesome, and RKE2 is awesome; however, everything takes longer with Kubernetes...  On the bright side, so far, operation and reliability have been flawless, so it's worth every minute of deployment effort.

Trivia

There are only two types of K8S admins, the ones who admit that at least one time they thought metallb was spelled with only one letter "L", and the ones who are liars LOL haha.  This is right up there in comedic value with RKE2 pretending that .yml files are invisible and only processing .yaml files.