Monday, April 8, 2024

Troubleshooting Strategies for Kubernetes, ndots: 5, and DNS blackholes

Based solely on the title, some K8S old-timers already know the problem, and how I solved it.  Regardless, it's useful to discuss DNS troubleshooting strategies in K8S.

Troubleshooting strategy 1: Clearly define the problem as a goal.

I am trying to bring up a new RKE2 K8S cluster and roll some new workload onto it.  The cluster had no workload because it was brand new.  Three days ago I upgraded to K8S 1.28.8 and Rancher v2.8.3.  I deployed several Node-RED systems; the deployments, pods, services, PVCs, and PVs all look great, and I can access the systems.  However, Node-RED is a node.js application, and when I try to install additional packages into the palette, I get a long timeout followed by an installation failure in the GUI.  The goal: I would like to be able to install node-red-node-aws and node-red-contrib-protobuf in Node-RED without errors.

Troubleshooting strategy 2: Gather related data.

The GUI is not much help; all we know is that it didn't work.  Open a shell on the K8S pod in Rancher and poke around in /data/.npm/_logs until I find the exact error message: "FetchError: request to https://registry.npmjs.org/node-red-contrib-protobuf failed, reason: getaddrinfo ENOTFOUND registry.npmjs.org".  This looks bad, very bad; the pod can't resolve DNS for registry.npmjs.org.

Troubleshooting strategy 3: Think about nearby components to isolate the problem.

OK, DNS does not work.  However, the rest of the LAN is up, and the RKE2 system resolved hub.docker.com to initially install the container images, so at a reasonably high level DNS appears to be working.  The problem seems isolated to the pods (not the entire LAN) and exactly one website.  Next, stretch out and try pinging Google.

ping www.google.com

Fails to resolve: "no IP address found".  Let's try some other tools; sometimes nslookup will report better errors.  Ping asks the OS resolver to resolve a domain name, while nslookup asks a specific DNS server to resolve it directly, a subtle difference.

nslookup www.google.com

Wow, that worked.  Let's combine the observations so far: the operating system DNS resolver (the local machinery using /etc/resolv.conf) failed, yet asking the K8S DNS server directly worked.  Local DNS resolution (inside the cedar.mulhollon.com domain) works perfectly all the time.  Interesting.

Troubleshooting strategy 4: Hope this is not your first rodeo.

This is not my first rodeo solving sysadmin and DNS problems.  Always try both the FQDN and the non-FQDN hostname.  Surprisingly:

ping www.google.com.

That worked.  A DNS hostname that does not end in a period (plus or minus some ndots options) will check the search path before hitting the internet.  That's how a DNS lookup in a web browser on the LAN for nodered1 ends up pointing to the same place as nodered1.cedar.mulhollon.com: there are not two entries with different names, there's just cedar.mulhollon.com in the DNS search path and a nodered1 A record.

When working through a puzzling IT problem, if this is your first rodeo you will be in for a rough grind, but I've seen a lot over the years, which makes life a little easier.  My hunch about FQDN vs non-FQDN hostnames paid off.

Back to troubleshooting strategy 2, let's gather more data.  Inside the pod, its /etc/resolv.conf looks like this:

search nodered.svc.cluster.local svc.cluster.local cluster.local cedar.mulhollon.com mulhollon.com
nameserver 10.43.0.10
options ndots:5

This is the longest search path and the highest ndots option I've ever seen, which, based on some internet research, is apparently normal in K8S setups, and the nameserver IP looks reasonable.
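
To make the ndots behavior concrete, here is roughly what the resolver does with this configuration; this is standard resolver search-list behavior sketched from the resolv.conf above, not captured from a query log:

# "www.google.com" contains only 2 dots, fewer than ndots:5, so the resolver
# walks the entire search list before trying the name as-is:
#   www.google.com.nodered.svc.cluster.local
#   www.google.com.svc.cluster.local
#   www.google.com.cluster.local
#   www.google.com.cedar.mulhollon.com
#   www.google.com.mulhollon.com
#   www.google.com
# A trailing dot marks the name as fully qualified and skips the search list:
ping www.google.com.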

Troubleshooting strategy 5: Enumerate your possibilities.

  1. Is 10.43.0.10 a working coredns resolver?
  2. The search path is long.  UDP packet size limit?  Some limit in the DNS resolver?  Could it be timing out trying so many possible search paths?
  3. What's up with the huge ndots option setting?  Problem?  Symptom?  Cause?  Effect? Irrelevant?

At this stage of the troubleshooting process, I had not yet guessed the root cause of the problem, but I had some reasonable possibilities to follow up on.

Troubleshooting strategy 6: Search the internet after you have something specific to search for.

When you have described your problem in enough detail, it's time to find out how "the internet" solved it; someone else probably already solved the problem and is happily telling everyone, similar to this blog post.  I found a couple of leads:
  1. Some people with weird ndots settings have massive DNS leaks causing lots of extraneous DNS traffic, and running the famous K8S nodelocal DNS cache helped them cut the DNS traffic overload.  Is that my problem?  Not sure, but it's "a problem".  According to everything I read, I probably need to set up nodelocal DNS caching eventually.  Make a note that because my RKE2 has IPVS installed, I need a slightly modified nodelocal DNS cache config.
  2. Some people with weird ndots settings had DNS queries time out because they made too many DNS search queries.  Note that the overall Node-RED palette installer fails after 75 seconds of trying, and the command line "npm install node-red-contrib-protobuf" times out and fails in about 90 seconds, but my DNS test queries at the shell fail instantly, so their problem is likely not my problem.  Also, I have very low traffic on this new cluster, while they have large clusters with huge DNS traffic loads.  Furthermore, there is a large amount of opposition from the general K8S community to 'fixing' high ndots option settings, due to various K8S-internal traffic concerns, so we're not doing that.  I think this is a dead end.
  3. Rancher and RKE2 and K8S all have pretty awesome wiki DNS troubleshooting guides.  I plan to try a few!

Troubleshooting strategy 7: Know when to cowboy and when not to cowboy.

If the system is in production, or you're operating during a maintenance window, then you have a written, documented, meeting-approved change management and risk plan, a window to work in, and written, prepared back-out plans.  However, if it's a pre-production experimental burn-in test system, then it's the Wild West; just make sure to write good docs about what you do to justify your learning time.  This particular example was an experimental new RKE2 cluster, perfectly safe to take some risks on.  I need to set up nodelocal DNS sooner or later anyway, and less DNS traffic might help this problem, or at least can't make it worse.  So I talked myself into cowboy-style installing nodelocal DNS caching on this RKE2 cluster using IPVS, during the process of working an "issue".  I felt it was related "enough" and safe "enough", and in the end I was correct, even though it did not solve the immediate problem.


The DNS cache worked, with no change in results (though nothing got worse, I think).  Note you have to add the IPVS option if you have IPVS enabled (which I do).
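
For reference, the upstream NodeLocal DNSCache install boils down to filling placeholders in a stock manifest.  This is a minimal sketch of the IPVS variant based on the Kubernetes docs, not the exact commands I ran; on RKE2 the kube-dns service is named rke2-coredns-rke2-coredns, so the cluster IP lookup is adjusted accordingly:

# fetch the stock manifest from the kubernetes repo
curl -LO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
kubedns=$(kubectl -n kube-system get svc rke2-coredns-rke2-coredns -o jsonpath={.spec.clusterIP})
domain=cluster.local
localdns=169.254.20.10   # link-local address the per-node cache will listen on
# IPVS mode: the cache cannot also bind the kube-dns cluster IP, so __PILLAR__DNS__SERVER__ is dropped
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/,__PILLAR__DNS__SERVER__//g; s/__PILLAR__CLUSTER__DNS__/$kubedns/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml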

I switch my troubleshooting strategy back to "Gather related data".  If my first and only real workload of Node-RED containers fails, let's try a different workload.  I spin up a linuxserver.io dokuwiki container.  It works perfectly, except it can't resolve external DNS either.  At least it's consistent, and consistent problems get fixed quickly.  This removes the containers themselves from the list of suspects; it is unlikely to be a problem unique to Node-RED containers if the identical problem appears in a Dokuwiki container from another vendor.

Back to Troubleshooting strategy 6: when in doubt, search the internet.  I methodically worked through Rancher's DNS troubleshooting guide:

kubectl -n kube-system get pods -l k8s-app=kube-dns

This works.

kubectl -n kube-system get svc -l k8s-app=kube-dns

This works.

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default

"nslookup: can't resolve 'kubernetes.default'"
This fails, which seems to be a big problem.

Time to circle back around to Troubleshooting Strategy 7 and try some cowboy stuff.  This is just a testing, burn-in, and experimentation cluster, so let's restart some pods; it's plausible they have historical or out-of-date config issues.  The two rke2-coredns-rke2-coredns pods are maintained by the replicaset rke2-coredns-rke2-coredns, and they were last restarted 3 days ago during the K8S upgrade; restarting the CoreDNS pods on a test cluster should be pretty safe, and it was.  I restarted one pod, and nothing interesting happened.  The good news is the logs look normal on the newly started pod.  The bad news is the busybox DNS query for kubernetes.default still fails.  I restarted the other pod, so now I have two freshly restarted CoreDNS pods.  Logs look normal and boring on the second restarted pod.  The pod images are rancher/hardened-coredns:v1.11.1-build20240305.  The busybox query for kubernetes.default continues to fail the same as before.  Nothing worse, nothing better.
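
In kubectl terms, "restarting" the CoreDNS pods is just deleting them one at a time and letting the replicaset recreate each one; a rough sketch, where <pod-suffix> is whatever the pod listing shows at the time:

kubectl -n kube-system get pods -l k8s-app=kube-dns
# delete one pod; the replicaset immediately schedules a replacement
kubectl -n kube-system delete pod rke2-coredns-rke2-coredns-<pod-suffix>
# check the replacement's logs before touching the second pod
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=20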

I return to troubleshooting strategy 2, "Gather more data".

kubectl -n kube-system get pods -l k8s-app=kube-dns

NAME                                         READY   STATUS    RESTARTS   AGE
rke2-coredns-rke2-coredns-864fbd7785-5lmgs   1/1     Running   0          4m1s
rke2-coredns-rke2-coredns-864fbd7785-kv5zq   1/1     Running   0          6m26s

Looks normal, I have two pods, these are the pods I just restarted in the CoreDNS replicaset.

kubectl -n kube-system get svc -l k8s-app=kube-dns

NAME                        TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns-upstream           ClusterIP   10.43.71.75   <none>        53/UDP,53/TCP   51m
rke2-coredns-rke2-coredns   ClusterIP   10.43.0.10    <none>        53/UDP,53/TCP   50d

OK, yes, I set up nodelocal caching as part of the troubleshooting about 51 minutes ago, so this is very reasonable output.

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default

Server:    10.43.0.10
Address 1: 10.43.0.10 rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
nslookup: can't resolve 'kubernetes.default'
pod "busybox" deleted
pod default/busybox terminated (Error)

Time to analyze the new data.  After upgrading and restarting "everything" it's still not working, so it's probably not cached data or old configuration or similar; there's something organically wrong with the DNS system itself that only presents in pods running on RKE2, and everything else is OK.  It's almost like there's nothing wrong with RKE2... which eventually turned out to be correct...

Time for some more of Troubleshooting strategy 6, trust the internet.  Methodically going through the "CoreDNS specific" DNS troubleshooting steps:

kubectl -n kube-system logs -l k8s-app=kube-dns

.:53
[INFO] plugin/reload: Running configuration SHA512 = c18591e7950724fe7f26bd172b7e98b6d72581b4a8fc4e5fc4cfd08229eea
58f4ad043c9fd3dbd1110a11499c4aa3164cdd63ca0dd5ee59651d61756c4f671b7
CoreDNS-1.11.1
linux/amd64, go1.20.14 X:boringcrypto, ae2bbc29
.:53
[INFO] plugin/reload: Running configuration SHA512 = c18591e7950724fe7f26bd172b7e98b6d72581b4a8fc4e5fc4cfd08229eea
58f4ad043c9fd3dbd1110a11499c4aa3164cdd63ca0dd5ee59651d61756c4f671b7
CoreDNS-1.11.1
linux/amd64, go1.20.14 X:boringcrypto, ae2bbc29

kubectl -n kube-system get configmap coredns -o go-template={{.data.Corefile}}

Error from server (NotFound): configmaps "coredns" not found

That could be a problem?  No coredns configmap?  Of course, these are somewhat old instructions, and I have a new cluster with a fresh caching node-local DNS resolver, and DNS is mostly working, so a major misconfiguration like this can't be the problem.  I poke around on my own a little.

I checked the node-local-dns configmap and that looks reasonable.

kubectl -n kube-system get configmap node-local-dns -o go-template={{.data.Corefile}}

(It would be a very long cut and paste, but it forwards to 10.43.0.10, which admittedly doesn't work, and this ended up being irrelevant to the story anyway, so it's not included in this blog post.)

Ah, I see that in the installed helm app for rke2-coredns the configmap is actually named rke2-coredns-rke2-coredns; OK, that makes sense now.

kubectl -n kube-system get configmap rke2-coredns-rke2-coredns -o go-template={{.data.Corefile}}

.:53 {
    errors 
    health  {
        lameduck 5s
    }
    ready 
    kubernetes   cluster.local  cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus   0.0.0.0:9153
    forward   . /etc/resolv.conf
    cache   30
    loop 
    reload 
    loadbalance 
}

The above seems reasonable.

The docs suggest checking the upstream nameservers:

kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'

The nameserver info matches the configuration successfully used by the 78 hosts configured by Ansible on this LAN, and superficially it looks good.  (Insert ominous music here; not to ruin the story, but the cause of the problem was that the config was not good after all, it just mostly worked everywhere except K8S, for reasons I'll get to later.)

Let's contemplate Troubleshooting strategy 7: Know when to cowboy and when not to cowboy.  On one hand, I feel like enabling query logging and watching one specific query at a time.  On the other hand, I feel pretty confident in my ability to enable logging, less confident in the infrastructure's ability to survive an accidental unintended log flood, and very unconfident about my ability to shut OFF query logging instantly if some crazy flood occurs.  Overall, going cowboy by enabling query logging is a net negative risk/reward ratio at this time.  Someday I will experiment safely with query logging, but today is not that day.

Well, I seem stuck.  Not out of ideas, but the well is starting to run dry.  I haven't tried "Troubleshooting Strategy 1: Clearly define the problem as a goal" in a while, so I will try that again.  I re-documented the problem.  It looks like the story I placed at the beginning of this blog post, no real change.  It was still a good idea to focus on the goal and what I've done so far.  Maybe, unconsciously, this review time helped me solve the problem.

Let's try some more Troubleshooting strategy 2 and gather more data.  As previously discussed, I did not feel like going full cowboy by enabling cluster-wide DNS query logging.  As per Troubleshooting Strategy 4, hope this is not your first rodeo, I am quite skilled at analyzing individual DNS queries, so let's try what I'm good at: we will pretend to be a K8S pod on a VM and try all the search paths, just to see what they look like.

From a VM unrelated to all this K8S stuff, let's try the google.com.cedar.mulhollon.com search path.  That query goes to my Active Directory domain controller, which returns NXDOMAIN; this is normal and expected.

Following troubleshooting strategy 3, think about nearby components to isolate the problem, let's try the last entry in the DNS search path: google.com.mulhollon.com.  That domain is hosted by Google, and it returns a valid NOERROR but no answer.

Wait what, is that even legal according to the DNS RFCs?
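
If you want to reproduce this kind of check, dig shows the response status clearly; a quick sketch from any Linux VM with dig installed (the NXDOMAIN and NOERROR results described above are what I observed, not dig output reproduced verbatim):

# internal AD-hosted search domain: expect status NXDOMAIN
dig google.com.cedar.mulhollon.com A
# Google-hosted external domain: expect status NOERROR with zero answer records
dig google.com.mulhollon.com A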

Following troubleshooting strategy 5, enumerate your possibilities, I think it's quite plausible this weird "NOERROR header but empty data" DNS response from Google could be the problem.  This isn't my first rodeo troubleshooting DNS, and I know the DNS search protocol takes the first answer it gets.  When internal resolution fails, the last search path tried for host "whatever" is whatever.mulhollon.com, and Google blackholes that query with an empty answer, so the resolver never tries external resolution.  This certainly seems to fit the symptoms.  As a cowboy experiment on the test cluster, I could remove that domain from the DNS search path in /etc/resolv.conf and try again.  In summary, I can now repeatedly and reliably replicate a problem directly related to the issue in a VM, and I have a reasonable experiment plan to try.

Before I change anything, gather some more data under Troubleshooting Strategy 2, and try to replicate the problem and the proposed fix in a K8S pod.  Not having root, and therefore not being able to edit /etc/resolv.conf in my Node-RED containers, is mildly annoying; it's just how those docker containers are designed.

I found a container that I can log into as root and modify the /etc config files.  With mulhollon.com (hosted at Google) still in the search path, if I try to ping www.google.com I get "bad address", because Google's domain hosting blackholes queries for missing A records; so weird, but so true.
If I edit /etc/resolv.conf in this container and remove mulhollon.com from the search path, SUCCESS AT LAST!  I can now resolve and ping www.google.com immediately with no problems.  I can also ping registry.npmjs.org, which implies I could probably use it (although this test container isn't a NodeJS container or a Node-RED container).
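
The in-container test was roughly the following; the sed pattern is a sketch that assumes mulhollon.com is the last entry on the search line, as in the resolv.conf shown earlier:

# inside the test container, as root
grep ^search /etc/resolv.conf
# drop the blackholed domain from the end of the search line
sed -i 's/ mulhollon\.com$//' /etc/resolv.conf
ping -c 1 www.google.com
ping -c 1 registry.npmjs.org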

Well, my small cowboy experiment worked; let's try a larger-scale experiment next.  But first, some explanation of why the system has this design.  In the old days, I had everything in the domain mulhollon.com, then I gradually rolled everything internal into the Active Directory-hosted cedar.mulhollon.com, and now I have nothing but external internet services on mulhollon.com.  In the interim, while I was setting up AD, I needed both domains in my DNS search path for internal hosts, but I don't need that any longer, and it hasn't been needed for many years.

Time for some more troubleshooting strategy 7, cowboy changes on a test cluster.  Some quality Ansible time resulted in the entire LAN having its DNS search path "adjusted", and Ansible applied the /etc/resolv.conf changes to the entire RKE2 cluster in a couple of minutes.  I verified the changes at the RKE2 host level; they look good, and DNS continues to work at the RKE2 and host OS level, so nothing has been made worse.
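
I won't reproduce the actual Ansible role here, but the change amounts to something like this hypothetical task, assuming /etc/resolv.conf is a plain file under Ansible's control rather than generated by resolvconf or systemd-resolved:

- name: Remove the blackholed domain from the DNS search path
  ansible.builtin.lineinfile:
    path: /etc/resolv.conf
    regexp: '^search '
    line: 'search cedar.mulhollon.com'
  become: true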

I ran "kubectl rollout restart deployment -n nodered", which wiped and recreated the Node-RED container farm (without deleting the PVs or PVCs; K8S is cool).  Connect to the shell of a random container: the container's /etc/resolv.conf inherited the change "live" from the host's /etc/resolv.conf, without any RKE2 reboot or other system-software-level restart; it looks like at container startup time it simply copies in the current host resolv.conf file, simple and effective.  "ping www.google.com" works now that the DNS blackhole is no longer in the search path.  And I can install NodeJS nodes into Node-RED from both the CLI and the web GUI, and containers in general in the new RKE2 cluster have full outgoing internet access, which was the goal of the issue.
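
The verification loop, roughly; the deployment name below is a placeholder for whichever Node-RED deployment you pick:

kubectl rollout restart deployment -n nodered
# once the new pods are up, spot-check one of them
kubectl -n nodered exec -it deploy/<nodered-deployment> -- cat /etc/resolv.conf
kubectl -n nodered exec -it deploy/<nodered-deployment> -- ping -c 1 www.google.com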

Troubleshooting strategy 8: If you didn't document it, it didn't happen.

I saved a large amount of time by keeping detailed notes in the Redmine issue, using it as a constantly up-to-date project plan for fixing the problem and reaching the goal.  Ironically I spent twice as much time writing this beautiful blog post as I spent initially solving the problem.

I will list my troubleshooting strategies below.  These overall strategies will get you through some dark times.  Don't panic, keep grinding, switch to a new strategy when progress on the old strategy slows, and eventually, things will start working.
  • Clearly define the problem as a goal.
  • Gather related data.
  • Think about nearby components to isolate the problem.
  • Hope this is not your first rodeo.
  • Enumerate your possibilities.
  • Search the internet after you have something specific to search for.
  • Know when to cowboy and when not to cowboy.
  • If you didn't document it, it didn't happen.

Good luck out there!

Thursday, March 7, 2024

Why Kubernetes Takes a Long Time

The Problem

Let's test something simple in Kubernetes on a fresh new bare-metal (running under Proxmox) RKE2 cluster, and deploy the classic intro app "numbers" from the book "Kubernetes in a Month of Lunches".  Other simple test apps will behave identically for the purposes of this blog post, such as the "google-samples/hello-app" application.

If you look at the YAML files, you'll see a "kind: Service" whose spec has "type: LoadBalancer" and some port info.  After an apparently successful application deployment, if you run "kubectl get svc numbers-web" you will see a TYPE LoadBalancer with an EXTERNAL-IP listed as "<pending>" that will never exit the pending state, and the service will be inaccessible from the outside world.
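
For illustration, the relevant part of such a Service looks roughly like this; the selector and port numbers are illustrative rather than copied from the book's manifests:

apiVersion: v1
kind: Service
metadata:
  name: numbers-web
spec:
  type: LoadBalancer        # requires an external load balancer implementation
  selector:
    app: numbers-web
  ports:
    - port: 8080
      targetPort: 80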

NodePorts do work out of the box with no extra software and no extra configuration, but you don't have to be limited to NodePorts forever.

The Solution

Kubernetes is a container orchestrator; it is willing to cooperate with external load-balancing systems, but it does not implement a load balancer itself.

That's OK.

If K8S can virtualize anything, why not virtualize its own external load balancer?  It sounds like the classic bad idea of a VMware cluster getting its DHCP addresses from a DHCP server running inside that same cluster, but it's not as bad as that: if the cluster is impaired or down enough that the LB isn't working, the app probably isn't working either, so no loss.

We can install MetalLB in Kubernetes, which implements a virtual external load balancer system. https://metallb.universe.tf/

The Implementation

  1. Let's read about how to install MetalLB.  https://metallb.universe.tf/installation/  I see we are strongly encouraged to use IPVS instead of iptables.
  2. Research why we're encouraged to use IPVS instead of iptables.  I found https://www.tigera.io/blog/comparing-kube-proxy-modes-iptables-or-ipvs/ which explains that IPVS scales roughly O(1) as the cluster grows, where the iptables mode scales roughly O(n) with the number of services.  OK, we have to use IPVS, an in-kernel load balancer that kube-proxy can use and that MetalLB sits in front of.  Additionally, the K8S docs discussing kube-proxy are at https://kubernetes.io/docs/reference/networking/virtual-ips/
  3. Next, research IPVS.  Aggravating that every Google search for IPVS is autocorrected to IPv6, EVERY TIME.  Found http://www.linuxvirtualserver.org/software/ipvs.html
  4. Will this work with RKE2?  It's reported that both iptables and IPVS work fine with Calico.  RKE2 runs Canal by default which is Flannel between nodes and Calico for network policies, so I guess it's OK?  https://docs.rke2.io/install/network_options
  5. Time to set up IPVS on all RKE2 nodes.  The usual song and dance with automation: set up the first node completely manually, then set it up in Ansible, test on the second node, then roll out slowly and carefully.  First IPVS setup step, install ipvsadm so I can examine the operation of the overall IPVS system: "apt install ipvsadm".  Not much to test in this step; success is running "ipvsadm" and seeing nothing weird.
  6. IPVS needs a kernel module, so without rebooting, modprobe the kernel ip_vs module, then try "ipvsadm" again, then if it works, create a /etc/modules-load.d/ip_vs.conf file to automatically load the ip_vs module during node reboots.
  7. Finally, add the IPVS config for kube-proxy to the end of the RKE2 config.yaml: tell kube-proxy-arg to use ipvs mode, and note that IPVS needs strict ARP (see the sketch of these files after this list).
  8. After a node reboot, RKE2 should have kube-proxy running in an IPVS compatible mode.  Success looks like running "ipvsadm" outputs sane-appearing mappings and "ps aux | grep kube-proxy" should show the options --proxy-mode=ipvs and --ipvs-strict-arp=true.  None of this manual work was straightforward and required some time to nail down.
  9. Set up automation in Ansible to roll out to the second node.  This was pretty uneventful and the branch merge on Gitlab can be seen here: https://gitlab.com/SpringCitySolutionsLLC/ansible/-/commit/65445fd473e5421461c4e20ae5d6b0fe1fe28dc4
  10. Finally, complete the IPVS conversion by rolling out and testing each node in the RKE2 cluster.  The first node done manually with a lot of experimentation took about half a day, the second took an hour, and the remaining nodes took a couple minutes each.  Cool, I have an RKE2 cluster running kube-proxy in IPVS mode, exactly what I wanted.
  11. Do I run MetalLB in BGP or L2 mode?  https://metallb.universe.tf/concepts/  I don't have my BGP router set up so it has to be L2 for now.  In the long run, I plan to set up BGP but I can spare a /24 for L2 right now.  Note that dual-stack IPv4 and IPv6, which I plan to eventually use, requires FRR-mode BGP connections, which is a problem for future-me, not today.
  12. Allocate some IP space in my IPAM.  I use Netbox as an IPAM.  Reserve an unused VLAN and allocate L2 and future BGP prefixes.  I decided to use IPv4 and the 150 subnet in my RFC1918 address space; I will add IPv6 "later".  I do almost all of my Netbox configuration automatically via Ansible, which has a great plugin for Netbox.  Ansible's Netbox integration can be seen at https://netbox-ansible-collection.readthedocs.io/en/latest/ The Ansible branch merge to allocate IP space looks like this: https://gitlab.com/SpringCitySolutionsLLC/ansible/-/commit/1d9a1e6298ce6f041ab4e98ad374850faf4a1412
  13. It is time to actually install MetalLB.  I use Rancher to wrangle my K8S clusters, it's a nice web UI, although I could do all the helm work with a couple lines of CLI work.  Log into Rancher, RKE cluster, "Apps", "Charts", search for metallb and click on it, "Install", "Install into Project" "System", "Next", "Install", and watch the logs. It'll sit in Pending-Install for a while.
  14. Verify the operation of MetalLB.  "kubectl get all --namespace metallb-system" should display a reasonable output.  Using rancher, "RKE" cluster, "Apps", "Installed Apps", namespace metallb-system should contain a metallb with reasonable status results.
  15. Configure an IPAddressPool for MetalLB as per the IPAM allocation in Netbox.  Here is a link to the docs for IPAddressPools: https://metallb.universe.tf/apis/#ipaddresspool Currently, I only have a "l2-pool" but I will eventually have to add a "bgp-pool".
  16. Configure an L2Advertisement for MetalLB to use the IPAddressPool above.  Here is a link to the docs for L2Advertisements: https://metallb.universe.tf/apis/#l2advertisement  Currently, I'm feeding "default" to "l2-pool" which will probably default to "bgp-pool" after I get BGP working.
  17. Try provisioning an application using a Service type LoadBalancer.  I used numbers-web as per the intro.  In the CLI, "kubectl get svc numbers-web" should show a TYPE "LoadBalancer" and an "EXTERNAL-IP" in your L2 IPAM allocation, and even list the PORT(S) mapping.
  18. Check operation in Rancher.  "RKE", "Service Discovery", "Services", click thru on numbers-web, the top of the page should contain a "Load Balancer" IP address, the tab "Recent Events", should see nodeAssigned and IPAllocated events, and the tab "Ports" should tell you the ports in use.
  19. Test in a web browser from the desktop.  Remember that the numbers-web app runs on port 8080 not the default 80.
  20. You can specify statically assigned IP addresses using a custom annotation described at: https://metallb.universe.tf/usage/#requesting-specific-ips  This is useful because I can add DNS entries in Active Directory using Ansible, pointing to addresses of my choice.  (An example Service using this annotation appears after the YAML snippets below.)
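
As referenced in step 7, here is a sketch of the IPVS-related host configuration from steps 6 and 7.  These are small files; the exact spellings below are reconstructed from my notes, so treat them as a starting point rather than gospel:

# load the IPVS kernel module now, and automatically on every reboot
modprobe ip_vs
echo ip_vs > /etc/modules-load.d/ip_vs.conf

And the kube-proxy settings appended to /etc/rancher/rke2/config.yaml on each node:

kube-proxy-arg:
  - proxy-mode=ipvs
  - ipvs-strict-arp=true
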
For reference, a bare-bones ipaddresspool.yaml looks like this:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: l2-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.10.150.0/24

And an equally bare-bones l2advertisement.yaml looks like this:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - l2-pool
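
And, as mentioned in step 20, a Service that pins its load balancer address via the MetalLB annotation looks roughly like this; the IP is an example from my L2 pool, and the selector and ports are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: numbers-web
  annotations:
    metallb.universe.tf/loadBalancerIPs: 10.10.150.80
spec:
  type: LoadBalancer
  selector:
    app: numbers-web
  ports:
    - port: 8080
      targetPort: 80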

The Summary

This took 20 logically distinct steps.  Don't get me wrong, K8S is awesome, MetalLB is awesome, RKE2 is awesome; however, everything takes longer with Kubernetes...  On the bright side, so far, operation and reliability have been flawless, so it's worth every minute of deployment effort.

Trivia

There are only two types of K8S admins, the ones who admit that at least one time they thought metallb was spelled with only one letter "L", and the ones who are liars LOL haha.  This is right up there in comedic value with RKE2 pretending that .yml files are invisible and only processing .yaml files.