Monday, April 8, 2024

Troubleshooting Strategies for Kubernetes, ndots: 5, and DNS blackholes

Based solely on the title, some K8S old-timers already know the problem, and how I solved it.  Regardless, it's useful to discuss DNS troubleshooting strategies in K8S.

Troubleshooting strategy 1: Clearly define the problem as a goal.

I am trying to bring up a new RKE2 K8S cluster and roll some new workload onto it.  The cluster had no workload because it was brand new.  Three days ago I upgraded to K8S 1.28.8 and Rancher v2.8.3.  I deployed several Node-RED systems; the deployments, pods, services, PVCs, and PVs all look great and I can access the systems.  However, Node-RED is a node.js application, and when I try to install additional packages into the palette, I get a long timeout followed by an installation failure in the GUI.  The goal: I would like to be able to install node-red-node-aws and node-red-contrib-protobuf in Node-RED without errors.

Troubleshooting strategy 2: Gather related data.

The GUI is not much help; all we know is that it didn't work.  Open a shell on the K8S pod in Rancher and poke around in /data/.npm/_logs until I find the exact error message: "FetchError: request to https://registry.npmjs.org/node-red-contrib-protobuf failed, reason: getaddrinfo ENOTFOUND registry.npmjs.org".  This looks bad, very bad; the pod can't resolve DNS for registry.npmjs.org.

Troubleshooting strategy 3: Think about nearby components to isolate the problem.

OK, DNS does not work.  However, the rest of the LAN is up, and the RKE2 system resolved hub.docker.com to initially pull the container images, so at a reasonably high level, DNS appears to be working.  The problem seems isolated to the pods (not the entire LAN) and to exactly one website.  Next, stretch out and try pinging Google.

ping www.google.com

It fails to resolve: "no IP address found".  Let's try some other tools; sometimes nslookup reports better errors.  Ping asks the OS resolver to look up a domain name, while nslookup asks a specific DNS server directly, a subtle difference.

nslookup www.google.com

Wow, that worked.  Let's combine the observations so far: the operating system DNS resolver (the local mechanism driven by /etc/resolv.conf) failed, yet asking the K8S DNS server directly worked.  Local DNS resolution (inside the cedar.mulhollon.com domain) works perfectly all the time.  Interesting.

Troubleshooting strategy 4: Hope this is not your first rodeo.

This is not my first rodeo for solving sysadmin and DNS problems.  Always try the FQDN and non-FQDN hostname.  Surprisingly:

ping www.google.com.

That worked.  A DNS hostname that does not end in a period (plus or minus the ndots option) will check the search path before hitting the internet.  That's how a DNS lookup in a web browser on the LAN for nodered1 ends up pointing to the same place as nodered1.cedar.mulhollon.com: there are not two entries with different names, there's just cedar.mulhollon.com in the DNS search path and a nodered1 A record.

When working through a puzzling IT problem, if this is your first rodeo you will be in for a rough grind, but I've seen a lot over the years, which makes life a little easier.  My hunch about FQDN vs non-FQDN hostnames paid off.

Back to troubleshooting strategy 2, let's gather more data.  Inside the pod, its /etc/resolv.conf looks like this:

search nodered.svc.cluster.local svc.cluster.local cluster.local cedar.mulhollon.com mulhollon.com
nameserver 10.43.0.10
options ndots:5

This is the longest search path and highest ndots option I've ever seen, which based on some internet research is apparently normal in K8S situations, and the nameserver IP looks reasonable.
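
To make that ndots:5 setting concrete: any name with fewer than five dots gets every search domain appended and tried first, and only if all of those attempts fail does the resolver try the name as-is.  A rough sketch of the lookups a pod generates for registry.npmjs.org with the resolv.conf above (treat the exact ordering details as an approximation):

registry.npmjs.org.nodered.svc.cluster.local    # tried first
registry.npmjs.org.svc.cluster.local            # then this
registry.npmjs.org.cluster.local                # and this
registry.npmjs.org.cedar.mulhollon.com          # then the LAN search domains
registry.npmjs.org.mulhollon.com                # last search domain
registry.npmjs.org.                             # the absolute name, only tried if every search attempt failed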

Troubleshooting strategy 5: Enumerate your possibilities.

  1. Is 10.43.0.10 a working coredns resolver?
  2. The search path is long.  UDP packet size limit?  Some limit in the DNS resolver?  Could it be timing out trying so many possible search paths?
  3. What's up with the huge ndots option setting?  Problem?  Symptom?  Cause?  Effect? Irrelevant?
At this stage of the troubleshooting process, I had not yet guessed the root cause of the problem, but I had some reasonable possibilities to follow up upon.

Troubleshooting strategy 6: Search the internet after you have something specific to search for.

When you have described your problem in enough detail, it's time to find out how "the internet" solved it; someone else has probably already solved the same problem and is happily telling everyone, similar to this blog post.  I found a couple of leads:
  1. Some people with weird ndots settings have massive DNS leaks causing lots of extraneous DNS traffic, and running the famous K8S nodelocal DNS cache helped them cut the DNS traffic overload.  Is that my problem?  Not sure, but it's "a problem".  I probably need to set up nodelocal DNS caching eventually, according to everything I read.  Make a note that because my RKE2 uses IPVS, I need a slightly modified nodelocal DNS cache config.
  2. Some people with weird ndots settings had DNS queries time out because they generated too many DNS search queries.  Note that the overall Node-RED palette installer fails after 75 seconds of trying, and the command-line "npm install node-red-contrib-protobuf" times out and fails in about 90 seconds, but my DNS test queries at the shell fail instantly, so their problem is likely not my problem.  Also, I have very low traffic on this brand-new cluster, while they have large clusters with huge DNS traffic loads, which I do not have.  Finally, there is a large amount of opposition from the general K8S community to 'fixing' high ndots settings, due to various K8S-internal traffic issues, so we're not doing that.  I think this is a dead end.
  3. Rancher and RKE2 and K8S all have pretty awesome wiki DNS troubleshooting guides.  I plan to try a few!

Troubleshooting strategy 7: Know when to cowboy and when not to cowboy.

If the system is in production, or you're operating during a maintenance window, then you have a written, documented, meeting-approved change management and risk plan and a window to work in, along with written, prepared back-out plans.  However, if it's a pre-production experimental burn-in test system, then it's the Wild West; just make sure to write good docs about what you do to justify your learning time.  This particular example was an experimental new RKE2 cluster, perfectly safe to take some risks on; I need to set up nodelocal DNS sooner or later anyway, and less DNS traffic might help this problem, or at least can't make it worse.  So I talked myself into cowboy-style installing nodelocal DNS caching on this RKE2 cluster using IPVS while working the "issue".  I felt it was related "enough" and safe "enough", and in the end I was correct, even though it did not solve the immediate problem.


The DNS cache worked, with no change in results (at least nothing got worse, I think).  Note you have to add the ipvs option if you have IPVS enabled (which I do).
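
For the record, enabling nodelocal DNS caching on RKE2 boiled down to a HelmChartConfig roughly like the following.  This is a sketch based on my reading of the rke2-coredns chart values, so treat the exact keys (and the file name) as assumptions and check the RKE2 docs for your version; the ipvs flag is the option I mentioned above:

# /var/lib/rancher/rke2/server/manifests/rke2-coredns-config.yaml (file name is my choice)
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    nodelocal:
      enabled: true
      ipvs: true    # needed because kube-proxy runs in IPVS mode on this cluster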

I switch my troubleshooting strategy back to "Gather related data".  If my first and only real workload of Node-RED containers fails, let's try a different workload.  I spin up a linuxserver.io dokuwiki container.  It works perfectly, except it can't resolve external DNS either.  At least it's consistent, and consistent problems get fixed quickly.  This rules out the containers themselves as the cause; it is unlikely to be a problem unique to Node-RED containers if the identical problem appears in a Dokuwiki container from another vendor...

Back to Troubleshooting strategy 6: when in doubt, search the internet.  I methodically worked through Rancher's DNS troubleshooting guide as found at:


kubectl -n kube-system get pods -l k8s-app=kube-dns

This works.

kubectl -n kube-system get svc -l k8s-app=kube-dns

This works.

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default

"nslookup: can't resolve 'kubernetes.default'"
This fails, which seems to be a big problem.

Time to circle back around to Troubleshooting Strategy 7 and try some cowboy stuff.  This is just a testing, burn-in, and experimentation cluster, so let's restart some pods; it's plausible they have historical or out-of-date config issues.  The two rke2-coredns-rke2-coredns pods are maintained by the replicaset rke2-coredns-rke2-coredns, and they were last restarted 3 days ago during the K8S upgrade; restarting CoreDNS pods on a test cluster should be pretty safe, and it was.  I restarted one pod, and nothing interesting happened.  The good news is the logs look normal on the newly started pod.  The bad news is the busybox DNS query for kubernetes.default still fails.  I restarted the other pod, so now I have two freshly restarted CoreDNS pods.  Logs look normal and boring on the second restarted pod.  The pod images are rancher/hardened-coredns:v1.11.1-build20240305.  The busybox query for kubernetes.default continues to fail the same as before.  Nothing worse, nothing better.

I return to troubleshooting strategy 2, "Gather more data".

kubectl -n kube-system get pods -l k8s-app=kube-dns

NAME                                         READY   STATUS    RESTARTS   AGE
rke2-coredns-rke2-coredns-864fbd7785-5lmgs   1/1     Running   0          4m1s
rke2-coredns-rke2-coredns-864fbd7785-kv5zq   1/1     Running   0          6m26s

Looks normal: I have two pods, the ones I just restarted in the CoreDNS replicaset.

kubectl -n kube-system get svc -l k8s-app=kube-dns

NAME                        TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns-upstream           ClusterIP   10.43.71.75   <none>        53/UDP,53/TCP   51m
rke2-coredns-rke2-coredns   ClusterIP   10.43.0.10    <none>        53/UDP,53/TCP   50d

OK, yes, I set up nodelocal caching as part of this troubleshooting session about 51 minutes ago, so this is very reasonable output.

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default

Server:    10.43.0.10
Address 1: 10.43.0.10 rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
nslookup: can't resolve 'kubernetes.default'
pod "busybox" deleted
pod default/busybox terminated (Error)

Time to analyze the new data.  After upgrading and restarting "everything" it's still not working, so it's probably not cached data or old configurations or similar; there's something organically wrong with the DNS system itself that only presents in pods running on RKE2, and everything else is OK.  It's almost like there's nothing wrong with RKE2... which eventually turned out to be correct...

Time for some more of Troubleshooting strategy 6, trust the internet.  Methodically going through the "CoreDNS specific" DNS troubleshooting steps:

kubectl -n kube-system logs -l k8s-app=kube-dns

.:53
[INFO] plugin/reload: Running configuration SHA512 = c18591e7950724fe7f26bd172b7e98b6d72581b4a8fc4e5fc4cfd08229eea58f4ad043c9fd3dbd1110a11499c4aa3164cdd63ca0dd5ee59651d61756c4f671b7
CoreDNS-1.11.1
linux/amd64, go1.20.14 X:boringcrypto, ae2bbc29
.:53
[INFO] plugin/reload: Running configuration SHA512 = c18591e7950724fe7f26bd172b7e98b6d72581b4a8fc4e5fc4cfd08229eea58f4ad043c9fd3dbd1110a11499c4aa3164cdd63ca0dd5ee59651d61756c4f671b7
CoreDNS-1.11.1
linux/amd64, go1.20.14 X:boringcrypto, ae2bbc29

kubectl -n kube-system get configmap coredns -o go-template={{.data.Corefile}}

Error from server (NotFound): configmaps "coredns" not found

That could be a problem?  No coredns configmap?  Then again, these are somewhat old instructions, I have a new cluster with a fresh caching node-local DNS resolver, and DNS is mostly working, so a major misconfiguration like this can't be the real problem.  I poke around on my own a little.

I checked the node-local-dns configmap and that looks reasonable.

kubectl -n kube-system get configmap node-local-dns -o go-template={{.data.Corefile}}

(It would be a very long cut and paste, but it seems to forward to 10.43.0.10, which admittedly doesn't work, and it ended up being irrelevant to the story anyway, so it's not included in this blog post.)

Ah, I see that in the installed Helm app for rke2-coredns the configmap is actually named rke2-coredns-rke2-coredns.  OK, that makes sense now.

kubectl -n kube-system get configmap rke2-coredns-rke2-coredns -o go-template={{.data.Corefile}}

.:53 {
    errors 
    health  {
        lameduck 5s
    }
    ready 
    kubernetes   cluster.local  cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus   0.0.0.0:9153
    forward   . /etc/resolv.conf
    cache   30
    loop 
    reload 
    loadbalance 
}

The above seems reasonable.

The docs suggest checking the upstream nameservers:

kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'

The nameserver info matches the configuration successfully used by the 78 Ansible-configured hosts operating on this LAN, and it superficially looks good (insert ominous music here; not to ruin the story, but the cause of the problem was that the config was not good after all, it just mostly worked everywhere except K8S, for reasons I'll get to later).

Let's contemplate Troubleshooting strategy 7: Know when to cowboy and when not to cowboy.  On one hand, I feel like enabling query logging and watching one specific query get processed at a time.  On the other hand, I feel pretty confident in my ability to enable logging, less confident in the infrastructure's ability to survive an accidental unintended log flood, and very unconfident about my ability to shut OFF query logging instantly if some crazy flood occurs.  Overall, I feel that going cowboy by enabling query logging is a net negative risk/reward ratio at this time.  Someday I will experiment safely with query logging, but today is not that day.

Well, I seem stuck.  Not out of ideas, but the well is starting to run dry.  I haven't tried "Troubleshooting Strategy 1: Clearly define the problem as a goal" in a while, so I will try that again.  I re-documented the problem.  It looks like the story I placed at the beginning of this blog post, no real change.  It was still a good idea to focus on the goal and what I've done so far; maybe, unconsciously, this review time helped me solve the problem.

Let's try some more of Troubleshooting strategy 2 and gather more data.  As previously discussed, I did not feel like going full cowboy by enabling cluster-wide DNS query logging.  As per Troubleshooting Strategy 4, hope this is not your first rodeo, I am quite skilled at analyzing individual DNS queries, so let's try what I'm good at: we will pretend to be a K8S pod on a VM and try all the search paths just to see what they look like.

From a VM unrelated to all this K8S stuff, let's try the google.com.cedar.mulhollon.com search path.  That zone is served by my Active Directory domain controller, and it returns NXDOMAIN; this is normal and expected.

Following troubleshooting strategy 3, think about nearby components to isolate the problem, let's try the last DNS search path.  This will be google.com.mulhollon.com.  That domain is hosted by Google, and it returns a valid NOERROR response with no answer.
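
Concretely, the two checks looked something like this, run from that unrelated VM with dig installed; only the status in the response header matters here:

dig google.com.cedar.mulhollon.com    # status: NXDOMAIN, so a searching resolver would move on to the next suffix
dig google.com.mulhollon.com          # status: NOERROR with an empty ANSWER section, so a searching resolver stops here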

Wait what, is that even legal according to the DNS RFCs?

Following troubleshooting strategy 5, enumerate your possibilities, I think it's quite plausible this weird "NOERROR header but empty data" DNS response from Google could be the problem.  (For the record, a NOERROR response with an empty answer section, the "NODATA" case, is legal; it means the name exists but has no records of the requested type.)  This isn't my first rodeo troubleshooting DNS, and I know the resolver's search algorithm takes the first answer it gets, so when internal resolution fails, the last search path tried for host "whatever" will be whatever.mulhollon.com; Google blackholes that query with an empty NOERROR answer, so the resolver never goes on to try external resolution.  This certainly seems to fit the symptoms.  As a cowboy experiment on the test cluster, I could remove that domain from the DNS search path in /etc/resolv.conf and try again.  In summary, I can now repeatedly and reliably replicate a problem directly related to the issue on a VM, and I have a reasonable experiment plan to try.

Before I change anything, let's gather some more data under Troubleshooting Strategy 2: can I replicate the problem, and the fix, in a K8S pod?  Not having root, I can't edit the /etc/resolv.conf file in my Node-RED containers, which is mildly annoying; it's just how those Docker containers are designed.

I found a container that I can successfully log into as root and modify the /etc config files.  With mulhollon.com (hosted at Google) still in the search path, if I try to ping www.google.com I get "bad address", because Google's domain hosting blackholes missing A records; so weird, but so true.
If I edit /etc/resolv.conf in this container and remove mulhollon.com from the search path, SUCCESS AT LAST!  I can now resolve and ping www.google.com immediately with no problems.  I can also ping registry.npmjs.org, which implies I can probably use it (although this test container isn't a NodeJS container or a Node-RED container).

Well, my small cowboy experiment worked; let's try a larger-scale experiment next.  But first, some explanation of why the system has this design.  In the old days I had everything in the domain mulhollon.com, then I gradually rolled everything internal into the Active Directory-hosted cedar.mulhollon.com, and now I have nothing but external internet services on mulhollon.com.  In the interim, while I was setting up AD, I needed both domains in my DNS search path for internal hosts, but I don't think I need that any longer; it hasn't been needed for many years.

Time for some more troubleshooting strategy 7, cowboy changes on a test cluster.  Some quality Ansible time resulted in the entire LAN having its DNS search path "adjusted".  I had Ansible apply the /etc/resolv.conf changes to the entire RKE2 cluster in a couple of minutes.  I verified the changes at the RKE2 host level; the changes look good, and DNS continues to work at the RKE2 and host OS level, so nothing has been made worse.

I ran a "kubectl rollout restart deployment -n nodered" which wiped and recreated the NodeRED container farm (without deleting the PVs or PVCs, K8S is cool).  Connect to the shell of a random container, the container's /etc/resolv.conf inherited "live" from the host /etc/resolv.conf without any RKE2 reboot or other system software level restart required or anything weird, looks like at container startup time it simply copies in the current host resolv.conf file, simple and effective.  "ping www.google.com" works now that the DNS blackhole is no longer in the search path.  And I can install NodeJS nodes into Node-RED from the CLI and the web GUI and containers in general in the new RKE2 cluster have full outgoing internet access, which was the goal of the issue.

Troubleshooting strategy 8: If you didn't document it, it didn't happen.

I saved a large amount of time by keeping detailed notes in the Redmine issue, using it as a constantly up-to-date project plan for fixing the problem and reaching the goal.  Ironically I spent twice as much time writing this beautiful blog post as I spent initially solving the problem.

I will list my troubleshooting strategies below.  These overall strategies will get you through some dark times.  Don't panic, keep grinding, switch to a new strategy when progress on the old strategy slows, and eventually, things will start working.
  • Clearly define the problem as a goal.
  • Gather related data.
  • Think about nearby components to isolate the problem.
  • Hope this is not your first rodeo.
  • Enumerate your possibilities.
  • Search the internet after you have something specific to search for.
  • Know when to cowboy and when not to cowboy.
  • If you didn't document it, it didn't happen.
Good Luck out there!

Thursday, March 7, 2024

Why Kubernetes Takes a Long Time

The Problem

Let's test something simple in Kubernetes on a fresh new bare-metal (running under Proxmox) RKE2 cluster, and deploy the classic intro app "numbers" from the book "Kubernetes in a Month of Lunches".  Other simple test apps will behave identically for the purposes of this blog post, such as the "google-samples/hello-app" application.

If you look at the YAML files, you'll see a "kind: Service" with "type: LoadBalancer" in its spec and some port info.  After an apparently successful application deployment, if you run "kubectl get svc numbers-web" you will see a TYPE of LoadBalancer with an EXTERNAL-IP listed as "<pending>" that will never exit the pending state, and the service will be inaccessible from the outside world.

NodePorts do work out of the box with no extra software and no extra configuration, but you don't have to be limited to NodePorts forever.

The Solution

Kubernetes is a container orchestrator and it is willing to cooperate with external load-balancing systems, but it does not implement a load balancer.

That's OK.

If K8S can virtualize anything, why not virtualize its external load balancer?  This is not as bad an idea as a VMware cluster getting its DHCP addresses from a DHCP server running inside the cluster; if the cluster is impaired or down enough that the LB isn't working, the app probably isn't working either, so no loss.

We can install MetalLB in Kubernetes, which implements a virtual external load balancer system. https://metallb.universe.tf/

The Implementation

  1. Let's read about how to install MetalLB.  https://metallb.universe.tf/installation/  I see we are strongly encouraged to use IPVS instead of iptables.
  2. Research why we're encouraged to use IPVS instead of iptables.  I found https://www.tigera.io/blog/comparing-kube-proxy-modes-iptables-or-ipvs/ which explains that IPVS lookups stay roughly O(1) as the number of services grows, whereas iptables rule processing scales roughly O(n).  OK, we have to use IPVS, which is an in-kernel load balancer that runs in front of, or with, kube-proxy and MetalLB.  Additionally, the K8S docs discussing kube-proxy are at https://kubernetes.io/docs/reference/networking/virtual-ips/
  3. Next, research IPVS.  Aggravating that every Google search for IPVS is autocorrected to IPv6, EVERY TIME.  Found http://www.linuxvirtualserver.org/software/ipvs.html
  4. Will this work with RKE2?  It's reported that both iptables and IPVS work fine with Calico.  RKE2 runs Canal by default which is Flannel between nodes and Calico for network policies, so I guess it's OK?  https://docs.rke2.io/install/network_options
  5. Time to set up IPVS on all RKE2 nodes.  The usual song and dance with automation: set up the first node completely manually, then automate it in Ansible, test on the second node, then roll out slowly and carefully.  First IPVS setup step: install ipvsadm so I can examine the operation of the overall IPVS system, "apt install ipvsadm".  Not much to test in this step; success is running "ipvsadm" and seeing nothing weird.
  6. IPVS needs a kernel module, so without rebooting, modprobe the kernel ip_vs module, then try "ipvsadm" again, then if it works, create a /etc/modules-load.d/ip_vs.conf file to automatically load the ip_vs module during node reboots.
  7. Finally, add the IPVS config for kube-proxy to the end of the RKE2 config.yaml: merely tell kube-proxy-arg to use ipvs mode, and note that IPVS needs strict ARP (see the sketch after this list).
  8. After a node reboot, RKE2 should have kube-proxy running in IPVS mode.  Success looks like "ipvsadm" outputting sane-appearing mappings, and "ps aux | grep kube-proxy" showing the options --proxy-mode=ipvs and --ipvs-strict-arp=true.  None of this manual work was straightforward, and it took some time to nail down.
  9. Set up automation in Ansible to roll out to the second node.  This was pretty uneventful and the branch merge on Gitlab can be seen here: https://gitlab.com/SpringCitySolutionsLLC/ansible/-/commit/65445fd473e5421461c4e20ae5d6b0fe1fe28dc4
  10. Finally, complete the IPVS conversion by rolling out and testing each node in the RKE2 cluster.  The first node done manually with a lot of experimentation took about half a day, the second took an hour, and the remaining nodes took a couple minutes each.  Cool, I have an RKE2 cluster running kube-proxy in IPVS mode, exactly what I wanted.
  11. Do I run MetalLB in BGP or L2 mode?  https://metallb.universe.tf/concepts/  I don't have my BGP router set up so it has to be L2 for now.  In the long run, I plan to set up BGP but I can spare a /24 for L2 right now.  Note that dual-stack IPv4 and IPv6, which I plan to eventually use, requires FRR-mode BGP connections, which is a problem for future-me, not today.
  12. Allocate some IP space in my IPAM.  I use Netbox as an IPAM.  Reserve an unused VLAN and allocate L2 and future BGP prefixes.  I decided to use IPv4 on subnet 150 in my RFC1918 address space, and I will add IPv6 "later".  I do almost all of my Netbox configuration automatically via Ansible, which has a great plugin for Netbox.  Ansible's Netbox integration can be seen at https://netbox-ansible-collection.readthedocs.io/en/latest/ The Ansible branch merge to allocate IP space looks like this: https://gitlab.com/SpringCitySolutionsLLC/ansible/-/commit/1d9a1e6298ce6f041ab4e98ad374850faf4a1412
  13. It is time to actually install MetalLB.  I use Rancher to wrangle my K8S clusters; it's a nice web UI, although I could do all the helm work with a couple of lines of CLI.  Log into Rancher, RKE cluster, "Apps", "Charts", search for metallb and click on it, "Install", "Install into Project" "System", "Next", "Install", and watch the logs.  It'll sit in Pending-Install for a while.
  14. Verify the operation of MetalLB.  "kubectl get all --namespace metallb-system" should display a reasonable output.  Using rancher, "RKE" cluster, "Apps", "Installed Apps", namespace metallb-system should contain a metallb with reasonable status results.
  15. Configure an IPAddressPool for MetalLB as per the IPAM allocation in Netbox.  Here is a link to the docs for IPAddressPools: https://metallb.universe.tf/apis/#ipaddresspool Currently, I only have a "l2-pool" but I will eventually have to add a "bgp-pool".
  16. Configure an L2Advertisement for MetalLB to use the IPAddressPool above.  Here is a link to the docs for L2Advertisements: https://metallb.universe.tf/apis/#l2advertisement  Currently, I'm feeding "default" to "l2-pool" which will probably default to "bgp-pool" after I get BGP working.
  17. Try provisioning an application using a Service type LoadBalancer.  I used numbers-web as per the intro.  In the CLI, "kubectl get svc numbers-web" should show a TYPE "LoadBalancer" and an "EXTERNAL-IP" in your L2 IPAM allocation, and even list the PORT(S) mapping.
  18. Check operation in Rancher.  "RKE", "Service Discovery", "Services", click thru on numbers-web, the top of the page should contain a "Load Balancer" IP address, the tab "Recent Events", should see nodeAssigned and IPAllocated events, and the tab "Ports" should tell you the ports in use.
  19. Test in a web browser from the desktop.  Remember that the numbers-web app runs on port 8080 not the default 80.
  20. You can specify statically assigned IP addresses using a custom annotation described at: https://metallb.universe.tf/usage/#requesting-specific-ips  This is useful because I can add DNS entries in Active Directory using Ansible pointing to addresses of my choice (a Service sketch using this annotation follows the YAML below).
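
To make steps 5 through 8 concrete, the per-node IPVS work boils down to something like the following.  The kube-proxy-arg snippet reflects my understanding of RKE2's /etc/rancher/rke2/config.yaml format, so double-check it against the RKE2 docs for your version:

# On each RKE2 node (Ubuntu/Debian assumed)
apt install ipvsadm                           # step 5: the inspection tool
modprobe ip_vs                                # step 6: load the module now, without a reboot
echo ip_vs > /etc/modules-load.d/ip_vs.conf   # step 6: and automatically on every future boot
ipvsadm                                       # should run cleanly, even if the table is still empty

# Step 7: append to /etc/rancher/rke2/config.yaml, then reboot the node (step 8)
kube-proxy-arg:
  - proxy-mode=ipvs
  - ipvs-strict-arp=true
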
For reference, a bare-bones ipaddresspool.yaml looks like this:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: l2-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.10.150.0/24

And an equally bare-bones l2advertisement.yaml looks like this:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - l2-pool
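
And to tie steps 17 and 20 together, a Service that asks MetalLB for a specific address out of the pool might look roughly like this.  The annotation name comes from the MetalLB usage docs linked above; the selector, ports, and the specific IP are illustrative placeholders, not copied from my cluster:

apiVersion: v1
kind: Service
metadata:
  name: numbers-web
  annotations:
    metallb.universe.tf/loadBalancerIPs: 10.10.150.10   # placeholder: any free address in l2-pool
spec:
  type: LoadBalancer
  selector:
    app: numbers-web      # placeholder: match whatever labels the deployment actually uses
  ports:
    - port: 8080          # the numbers-web app is reached on 8080, not 80
      targetPort: 80      # placeholder: the container port from the book's YAML may differ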

The Summary

This took 20 logically distinct steps.  Don't get me wrong: K8S is awesome, MetalLB is awesome, RKE2 is awesome; however, everything takes longer with Kubernetes...  On the bright side, so far, operation and reliability have been flawless, so it's worth every minute of deployment effort.

Trivia

There are only two types of K8S admins: the ones who admit that at least one time they thought metallb was spelled with only one letter "L", and the ones who are liars LOL haha.  This is right up there in comedic value with RKE2 pretending that .yml files are invisible and only processing .yaml files.

Friday, December 8, 2023

Proxmox VE Cluster - Chapter 020 - Architecture 1.0 Review and Future Directions


A voyage of adventure, moving a diverse workload running on OpenStack, Harvester, and RKE2 K8S clusters over to a Proxmox VE cluster.


What worked

So far, everything.  And it all works better than I expected, and was generally less of a headache than I anticipated.  Performance is vastly better than OpenStack for various design and overall architectural reasons.

How long it took

I am writing these blog posts in a non-linear fashion, so the final editing of this post is being done on a cluster with CEPH, HA, the new software-defined networking system, and quite a few other interesting items, most of which already have rough-draft blog posts.

However, if you believe Clockify, over the course of half a year of hobby-scale effort I have logged 125 hours, 57 minutes, and 52 seconds getting to this point.  So, around "three weeks of full-time labor" to convert a small OpenStack cluster to a medium-size, half-way-configured Proxmox VE cluster.  I believe this is about half the time it took to convert from VMware to OpenStack a couple of years back.

Future Adventures

This is just a list of topics you can expect to see in blog posts and Spring City Solutions Youtube videos, probably after the holidays or in late winter / early spring: setting up and upgrading CEPH.  Optimizing memory, CPU, and storage.  Adding the new SDN feature.  Open vSwitch and Netgear hardware QoS.  Connecting Ansible, and probably Terraform, to Proxmox.  Monitoring using Observium, Zabbix, and Elasticsearch.  Setting up Rancher and RKE2 production clusters on top of Proxmox.  Backups using the Proxmox Backup Server product.  Cloud-init: will I ever get it working the way I want it to work?  HA High Availability: unfortunately, I can verify this software feature works excellently during hardware failures.  USB pass-thru.

Most of the stuff listed above is done or in process, and already partially documented in rough draft blog posts.  CEPH integration, for example, has been unimaginably cool.


Anyway, thanks for reading and have a great day!

Wednesday, December 6, 2023

Proxmox VE Cluster - Chapter 019 - Proxmox Operations


A voyage of adventure, moving a diverse workload running on OpenStack, Harvester, and RKE2 K8S clusters over to a Proxmox VE cluster.


Proxmox Operations is a broad and complicated topic.


Day-to-day operations are performed ALMOST entirely in the web GUI, with very few visits to the CLI.  I have years of experience with VMware and OpenStack, and weeks, maybe even months, of experience with Proxmox, so let's compare the experience:

  • VMware:  vSphere is installed on the cluster as an image, and as an incredibly expensive piece of licensed software, you get one (maybe two, depending on HA success) installation of vSphere and you get to hope it works.  Backup, restore, upgrades, and installation work about as well as you expect for "enterprise" grade software.
  • OpenStack: Horizon is installed on the controller, and the controller is NOT part of the cluster.  It's free, so feel free to install multiple controllers, although I never operated that way.  It's expensive in terms of hardware, as the core assumptions of the design presume you're running a rather large cloud, not a couple of hosts in a rack.  Upgrades are a terrifying, moderately painful, and long process.  The kolla-ansible solution of running it all in containers is interesting, although it replaces the un-troubleshoot-able complication of a bare metal installation with an equal level of un-troubleshoot-able complication in Docker containers.
  • Proxmox VE: Every VE node has a web front end to do CRUD operations against the shared cluster configuration database.  The VE system magically synchronizes the hardware to match the configuration database.  Very cool design and 100% reliable so far.  Scalability is excellent; whereas OpenStack assumes you're rolling in with a minimum of a dozen or so nodes, Proxmox works from as low as one isolated node.
An interesting operational note is that the UI on Proxmox is more "polished", "professional", and "complete" than either alternative.  Usually FOSS has a reputation for inadequate UIs, but Proxmox has the best UI of the three.

Upgrades

Let's consider one operational task: upgrades.  Proxmox is essentially a Debian Linux installation with a bunch of Proxmox-specific packages installed on top of it, not all that different from installing Docker or Elasticsearch from upstream.  I try to upgrade every node in the cluster at least monthly; the less stuff that changes per upgrade, the less "exciting" the upgrade.  The level of excitement, drama, and stress scales exponentially with the number of upgraded software packages on Debian-based operating systems in general.

The official Proxmox process for upgrades is basically: hit it, maybe reboot, all good.
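
In CLI terms that's roughly the following; the node's "Updates" pane in the web UI does the same thing:

apt update           # or click "Refresh" in the node's Updates pane
apt full-upgrade     # or click "Upgrade"; review the package list before agreeing
pveversion           # confirm the node reports the expected Proxmox VE version
# reboot the node if a new kernel or other low-level component came in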

As you'd expect, there are complications, IRL.

First I make a plan: upgrade all the hosts in one sitting, because I don't want cross-version cluster compatibility issues, and start with the least sensitive cluster host.  Note that if you log into proxmox001 and upgrade/reboot proxmox002, you stay logged into the cluster.  However, if you log into proxmox001 and upgrade and reboot proxmox001, you lose web access to the rest of the cluster during the reboot (as a workaround, simply log into the proxmox002 web UI while rebooting proxmox001).

Next I verify the backups of the VMs on the node and generally poke thru the logs.  If I'm getting hardware errors or something, I want to know before I start changing software.  Yes, this blog post series is non-linear and I haven't mentioned backups or the Proxmox Backup Server product yet, but those posts are coming soon.

I generally shut down clustered VMs and unimportant VMs, and migrate "important" VMs to other hosts.

There are special notes about the Beelink DKMS process for the custom ethernet driver using non-free firmware.  Basically, Proxmox 8.0 shipped with a Linux kernel that could be modified to use the DKMS driver for the broken Realtek ethernet hardware; however, the DKMS driver does NOT seem compatible with the kernel shipped with Proxmox 8.1, so after some completely fruitless hours of effort, I simply removed my three Beelink microservers from the cluster.  "Life's too short to use Realtek."  You'd think Linux compatibility would be better in 2023 than in 1993 when I got started, but really there isn't much difference between 2023 and 1993, and plenty of stuff just doesn't work.  So, here's a URL to remove nodes from a cluster, which is a bit more involved than adding nodes LOL:

Other than fully completing and verifying operation of exactly one node at a time, I have no serious advice.  Upgrades on Proxmox generally just work, somehow with even less drama than VMware upgrades, and light-years less stress than an OpenStack upgrade.  Don't forget to update the Runbook docs and the due date in Redmine after each node upgrade.

Note that upgrading the Proxmox VE software is only half the job; once that's done across the entire cluster it's time to look at CEPH.  Again, these blog posts are being written long after the action, and I haven't covered CEPH in a blog post yet.  Those posts are on the way.


Shortly after I rough-drafted these blog posts, Proxmox 8.1 dropped, along with an upgrade from CEPH Quincy to CEPH Reef.  AFAIK any CEPH upgrade, even a minor version bump, is basically the same as a major upgrade, just much less exciting and stressful.  I do everything for a minor upgrade in the same order and process, more or less, as a major CEPH version upgrade, and that may even be correct.  It does work, at least so far.

Next post, a summary and evaluation of "Architecture Level 1.0" where we've been and where we're going.

Monday, December 4, 2023

Proxmox VE Cluster - Chapter 018 - Moving the remainder of workload to the full size cluster


A voyage of adventure, moving a diverse workload running on OpenStack, Harvester, and RKE2 K8S clusters over to a Proxmox VE cluster.


Some notes on moving the remainder of the old OpenStack workload to the full size Proxmox cluster.  These VMs were "paused" for a couple days and recreated on Proxmox.


Elasticsearch cluster members es04, es05, es06

This is the other half of the six-host Elasticsearch cluster.  Rather than storing the disk images on CEPH (foreshadowing of future posts...) or enabling HA high availability (more foreshadowing of adventures to come...), I use local 100 GB LVM disks, because the Proxmox VE system only uses a couple of gigs of my 1 TB SSD OS install drives.

Adding more cluster members to an existing Elasticsearch cluster is no big deal.  Create a temporary cluster enrollment token on any existing cluster member, install a blank, unused Elasticsearch binary on the VM, run the cluster-mode reconfiguration script with the previously mentioned token, and wait until it's done.  The main effort is adjusting the config for kibana, filebeat, and metricbeat in Ansible so I can push out config changes to all hosts to use the additional three cluster members.  It 'just works'.  Currently, I have index lifecycle management set to store only a couple of days of logs and metrics, because it seems 600 gigs of logs fills up faster than it did back in the 'old days'.
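
In command form, the add-a-node dance is roughly this, assuming the standard .deb install layout of Elasticsearch 8.x:

# On any existing cluster member: mint a short-lived node enrollment token
/usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s node

# On the new VM, after installing the elasticsearch package but before first start
/usr/share/elasticsearch/bin/elasticsearch-reconfigure-node --enrollment-token <token from above>
systemctl enable --now elasticsearch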

jupyter, mattermost, navidrome, pocketmine, tasmoadmin, ttrss, others..

These are just Docker hosts that run Docker containers.  The scripts to set up the Docker containers, and the Docker volumes, are stored on the main NFS server, so re-deployment amounts to installing an Ubuntu server, letting Ansible set it up (join the AD domain, install Docker for me, set up autofs, etc.), then simply running my NFS-mounted scripts to run Docker containers accessing NFS-mounted Docker volumes.

booksonic, others...

Another Docker host like the above paragraph.  I had set up Active Directory authentication for a couple of applications running in Docker containers, and I had some "fun" reconfiguring them to use the new domain controller IP addresses.  No big deal; however, AD auth reconfiguration was an unexpected additional step.  If "everything" is configured automatically in Ansible, but it's not REALLY "everything", then it's easy to forget that some application-level configuration remains necessary.  Every system that's big enough has a couple of loose ends somewhere.

kapua, kura, hawkbit, mqttrouter (containing Eclipse Mosquitto)

This is my local install of the Eclipse project Java IoT suite that I use for microcontroller experimentation and applications.

Kapua is a web-based server for IoT that does everything except firmware updates.  The software is run via a complicated shell script written for version 1 docker-compose, which works fine with version 2 docker compose after exporting some shell environment variables to force the correct Kapua version and editing the startup script to run v2 "docker compose" instead of v1 "docker-compose".  Kapua overall is a bit too complicated to explain in this blog post.
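
The v1-to-v2 tweak was basically the shape of the following; the variable name and script name here are placeholders rather than Kapua's real ones, so treat this purely as an illustration:

export KAPUA_VERSION=<whatever release the deploy scripts expect>    # placeholder variable name: pin the image version the scripts read
sed -i 's/docker-compose /docker compose /g' ./docker-deploy.sh      # placeholder script name: swap the v1 binary for the v2 subcommand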

Kura is an example Java IoT device framework running locally in Docker instead of on real hardware, for testing Kapua and generally messing around.

Hawkbit is a firmware updater and it works great, anything with wifi/ethernet and MCUboot can upgrade itself very reliably, or recover from being bricked.  Works great with STM32 boards.

Finally, as for mqttrouter: simply start it with the NFS-hosted config and Eclipse Mosquitto works.

The Eclipse project Java-based IoT suite is REALLY cool, and once upon a time I planned a multi-video Youtube series using it and Zephyr, but I ran out of RAM on my STM32 boards before implementing more than 50% of the Kapua/Kura protocol.  Nowadays I'd just install Kura on a Raspberry Pi, if not Node-RED, or on the smaller end install one of the microcontroller Python implementations and call it good; maybe some day I'll get back into Eclipse Java IoT.

win11

This was a gigantic struggle.  The Proxmox side works perfectly, with an emulated TPM, and the install went perfectly smoothly.  The problem was I have a valid Windows license on microsoft.com for this VM, but the image refused to 'activate'.  I paid list price for this license that I can't even use; I can see why people have a bad attitude about Microsoft...  Nonetheless, via various technical means I now have a remotely accessible, domain-joined Windows 11 image that I can access via Apache Guacamole's rdesktop feature from any modern web browser (including my Chromebook) to run Windows "stuff" remotely.  It works pretty well, aside from the previously mentioned license activation problem.  Everything 'Microsoft' is a struggle all the time.

ibm7090, pdp8, rdos, rsx11, tops10, mvs, a couple others

These run the latest OpenSIMH retrocomputing emulator in a tmux window.  The MVS host has the "famous" MVS/370 Turnkey 5 installed, along with a console 3270 emulator.  The disk images are normally stored over NFS along with all the configs.  All data is stored in projects on Redmine.  I have login entries in Apache Guacamole, so I have full access to my retrocomputing environment via any web browser.


Next blog post:  Various operations issues.  Upgrading Proxmox VE software, daily stuff like that.

Wednesday, November 29, 2023

Proxmox VE Cluster - Chapter 017 - Install Proxmox VE on the old OS2 cluster hardware


A voyage of adventure, moving a diverse workload running on OpenStack, Harvester, and RKE2 K8S clusters over to a Proxmox VE cluster.


Some notes on installing Proxmox VE on the old OS2 cluster hardware.  The main difference between installation on OS1 and OS2 is adding NTP serving to OS2, more or less as per Chapter 014 NTP notes.  The main reference document:

https://pve.proxmox.com/pve-docs/chapter-pve-installation.html

The plan to work around the networking challenges: get everything working on a single, plain, temporary 1G ethernet connection, use that as the management web interface to get the dual 10G LAG with VLANs up and running, then use the new "20G" connected management web interface to reconfigure the dual 1G LAG / VLAN ethernet ports, at which point everything will be working.

First Steps

Install using the IPMI KVM and a USB key, so find the USB key and plug in the IPMI ethernet.


On boot, hit DEL for setup, alter the boot options to include the USB drive, reboot, hit F11 for the boot menu, then boot off the USB.

Proxmox "OS" install process

  • I have a habit of using the console install environment.
  • The installer wants to default to installing on the M.2 drive, although I am using the SATA drive.
  • Country: United States
  • Timezone: The "timezone" field will not let me enter a timezone, only city names, none of which are nearby.  Super annoying I can't just enter a timezone like a real operating system.  I ended up selecting a city a thousand miles away.  This sucks.  Its a "timezone" setting not "name a far away city that coincidentally is in the same timezone".  I expect better from Proxmox.
  • Keyboard Layout: U.S. English
  • Password: (mind your caps-lock)
  • Administrator email: vince.mulhollon@springcitysolutions.com
  • Management Interface: the first 1G ethernet (eno1, aka the "bottom left corner")
  • Hostname FQDN: as appropriate, as per the sticker on the device
  • IP address (CIDR): as appropriate, as per the sticker on the device / 016
  • Gateway address: 10.10.1.1
  • DNS server address: 10.10.8.221
  • Note you can't set up VLANs in the installer, AFAIK.
  • Hit enter to reboot, yank the USB flash install drive, yank the USB keyboard, watch the monitor... seems to boot properly...
  • The web interface is on port 8006.  Log in as root.  Note I installed 8.0-2 and on the first boot the web GUI reports version 8.0.3; it must have auto-updated as part of the install process?

Upgrade the new Proxmox VE node

  1. Double check there's no production workload on the server; it's a new install so there shouldn't be anything, but it's a good habit.
  2. Select the "Server View" then node name, then on the right side, "Updates", "Repositories", disable both enterprise license repos.  Add the community repos as explained at https://pve.proxmox.com/wiki/Package_Repositories
  3. Or in summary, click "Add", select "No-Subscription", "Add", then repeat for the "Ceph Quincy No-Subscription" repo (the resulting repo lines are sketched after this list).
  4. In the right pane, select "Updates", then "Refresh", and watch the update.  Click "Upgrade" and watch the upgrade.
  5. Optimistically get a nice message on the console of "Your system is up-to-date" and a request to reboot.
  6. Reboot and verify operation.
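
For reference, those GUI clicks end up writing repository lines roughly like these two (Proxmox VE 8 is based on Debian bookworm; double-check the exact component names against the Package_Repositories wiki page above):

# /etc/apt/sources.list.d/pve-no-subscription.list, or wherever the GUI chooses to put it
deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription
# Ceph Quincy no-subscription repo
deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription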

Install hardware in permanent location with temporary ethernet cables

  1. Perform some basic operation testing
  2. In the web UI "Shutdown" then wait for power down.
  3. Reinstall in permanent location.
  4. Connect eno1 to any untagged "Prod" VLAN 10 access-only ethernet port, temporarily, for remote management via the web interface.
  5. Connect the 10G ethernets eno3 and eno4 to the LAG'd and VLAN'd 10G ethernet switch ports.

Move the Linux Bridge from single 1 gig eno1 to dual 10 gig LAG on eno3 and eno4

You are going to need this:
  1. Modify eno3 and eno4, checkmark "Advanced", change MTU to 9000.
  2. Create a Linux Bond named bond1, Checkmark "Advanced", change MTU to 9000, Mode "balance-xor", slaves "eno3 eno4" (note space in between, not comma etc).  Note bond0 will eventually be the 1G LAG, and the old OpenStack used "balance-xor" so I will start with that on the Proxmox.
  3. Create Linux VLAN named bond1.10 with MTU 9000, can create the other VLANs now if you want.
  4. Edit vmbr0 Linux bridge to have a MTU of 9000 and Bridge Ports of bond1.10
  5. Double check everything then "Apply Configuration", and after about twelve to thirteen heart stopping seconds it should be up and working.
At some later date I will try some LAG bond modes more interesting than "balance-xor".

Note the network interfaces do not have "VLAN aware" checked.  Everything works.  I will research this later in a dedicated advanced networking post.
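
For the curious, the GUI steps above boil down to an /etc/network/interfaces fragment roughly like this.  The address is a placeholder, and since Proxmox uses ifupdown2 the exact option names the GUI writes may differ slightly:

auto bond1
iface bond1 inet manual
        bond-slaves eno3 eno4
        bond-mode balance-xor
        bond-miimon 100
        mtu 9000

auto bond1.10
iface bond1.10 inet manual
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 10.10.x.y/16        # placeholder: the node's sticker IP
        gateway 10.10.1.1
        bridge-ports bond1.10
        bridge-stp off
        bridge-fd 0
        mtu 9000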

Convert the single 1 gig eno1 to dual 1 gig LAG on eno1 and eno2

  1. Edit eno1 and eno2 and set MTU to 9000
  2. Create a Linux Bond named bond0, Checkmark "Advanced", change MTU to 9000, Mode "balance-xor", slaves "eno1 eno2" (space in between).
  3. Create VLAN interfaces now on bond0, or create them later.

Configure NTP

  1. Create (or copy) the clock source files into "/etc/chrony/sources.d".  I put exactly one clock in each file.  Files in sources.d can be re-read without restarting the entire service by running "chronyc reload sources".  If successful, you should see the other clocks are now accessible when running "chronyc sources".  (A sketch of these files and commands follows this list.)
  2. Remove the default clocks shipped by Proxmox and enable NTP serving.  Edit /etc/chrony/chrony.conf, comment out the "pool" directive, and add a line underneath: "allow 10.0.0.0/8".  This requires a service restart, not a mere reload, so "service chrony restart" and verify Chrony operation after a few minutes using "chronyc sources".
  3. Edit DNS for ntp4 (or as appropriate) to point to the new proxmox node IP address.
  4. Edit NTP on ALL THE OTHER NODES to reflect the presence of this new NTP server.
  5. Test NTP from various nodes, VMs, and hardware to verify NTP is working.
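
A minimal sketch of the chrony changes described in steps 1 and 2, with a placeholder clock hostname:

# /etc/chrony/sources.d/clock1.sources -- one clock per file, hostname is a placeholder
server ntp1.cedar.mulhollon.com iburst

# picked up without a full service restart
chronyc reload sources
chronyc sources

# /etc/chrony/chrony.conf -- comment out the Debian/Proxmox default pool, allow LAN clients
#pool 2.debian.pool.ntp.org iburst
allow 10.0.0.0/8

# this part does need a restart
service chrony restart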

Final Installation Tasks

  1. Join the new node(s) to the existing cluster.  In "Datacenter" on any cluster member, "Cluster", "Join Information", "Copy Information"; cut and paste into "Datacenter" on the new node, "Join Cluster", enter the peer's root password, "Join 'Proxmox'".  You will have to log back into the web UI after the SSL certs update (a quick CLI sanity check is sketched after this list)...
  2. Verify information in Netbox to include MAC, serial number, ethernet cabling, platform should be Proxmox VE, remove old Netbox device information.
  3. Add new hosts to Zabbix.
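
If you want a CLI sanity check after the join, these two commands run on any member should show a quorate cluster that includes the new node:

pvecm status    # quorum info; the expected votes should now include the new node
pvecm nodes     # membership list; the new node should appear with its own node id
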
The next post will be about adding the remaining "paused" workload to the now "full sized" Proxmox VE cluster.

Monday, November 27, 2023

Proxmox VE Cluster - Chapter 016 - Hardware Prep Work on the OS2 Cluster


A voyage of adventure, moving a diverse workload running on OpenStack, Harvester, and RKE2 K8S clusters over to a Proxmox VE cluster.


This will be similar to Chapter 012 although different hardware.

These microservers are three old SuperMicro SYS-E200-8D that were used for Homelab workloads.  They will become Proxmox cluster nodes proxmox004, proxmox005, and proxmox006.  

This server hardware was stereotypical for a late-2010s "VMware ESXi Eval Experience"-licensed cluster, and it later worked very well under OpenStack.  Each has a 1.90 GHz Xeon D-1528 with six cores and 96 GB of RAM, a 1 TB SATA SSD for boot and local storage, and a new 1 TB M.2 NVMe SSD for eventual CEPH cluster storage.

Hardware reliability history

Proxmox004 had its AC power brick replaced on 2022-07-10.

Proxmox005 had an NVMe failure on 2020-05-24; I took advantage of that outage to also upgrade its SSD to a new 1 TB drive on 2020-05-27 (on suspicion; the old one was working fine, although the wearout measurement was getting to a high percentage per SMART reports).

Proxmox006 had an NVMe failure on 2021-02-20, and had its AC power brick replaced on 2022-06-10.

Previously, in Chapter 012, I claimed that 5/6 of the power supplies had failed on my E800 microservers, but I made a mistake; it seems "only" TWO THIRDS of the power supplies have failed as of late 2023.  Currently proxmox001 and proxmox005 are still running on their original mid-2010s power supplies.  I will keep a close eye on the output voltages (monitorable via IPMI using Observium, and probably Zabbix, and maybe somehow via Elasticsearch).

FIVE Ethernet ports

Even the official manufacturer's operating manual fails to explain the layout of the five ethernet ports on this server.  Looking at the back of the server, the lone port on the left side is the IPMI, then:

eno1 1G ethernet bottom left corner, 9000 byte MTU

eno2 1G ethernet top left corner, 9000 byte MTU

eno3 10G ethernet bottom right corner, 9000 byte MTU

eno4 10G ethernet top right corner, 9000 byte MTU

eno1 and eno2 are combined into bond12, which uses balance-xor mode to provide 2 Gbps of bandwidth.

eno3 and eno4 are combined into bond34, which uses balance-xor mode to provide 20 Gbps of bandwidth.  20 Gbps ethernet is pretty fast!

I run the VLANs as subinterfaces of the bond interfaces.  So "Production", VLAN 10, has an interface name of "bond34.10".

Hardware Preparation task list

  1. Clean and wipe old servers, both installed software and physical dusting.
  2. Relabel ethernet cables and servers.
  3. Update port names in the managed Netgear ethernet switch.  VLAN and LAG configs remain the same, making installation "exciting" and "interesting".
  4. Remove monitoring of old server in Zabbix.
  5. Verify IPAM information in Netbox.
  6. Test and verify new server DNS entries.
  7. Install new 1TB M.2/NVME SSDs.
  8. Replace the old CMOS CR2032 battery, as it's probably 5 to 7 years old.  This is child's play compared to replacing the battery on a hyper-compact Intel NUC.
  9. Reconfigure the BIOS in each server.  For a variety of reasons, PXE netboot requires UEFI and BIOS initialization of the network, so I used that in the OpenStack era, which was installed on top of Ubuntu.  However, I could not force the UEFI BIOS to boot the SATA SSD; it insisted on booting the M.2 only, which is odd because it worked fine under the older, USB-stick-installed Ubuntu.  Another problem with the BIOS config: "something" about pre-initializing the ethernet system for PXE boot messes up the bridge configuration on Proxmox's Debian OS, resulting in traffic not flowing.  I experimented with manually adding other interfaces to the bridge; no go.  Symptoms were no packets flowing in and no packets out ("brctl showmacs" is essentially the bridge's ARP table), although the link light was up and everything looked OK.  Anyway, in summary: disable PXE boot entirely and convert from UEFI to Legacy BIOS booting.  This was typical of the UEFI experience in the late 2010s; it doesn't really work most of the time, but Legacy BIOS booting always works.  Things are better now.

Next post will be about installing Proxmox VE on the old OS2 cluster hardware.