Monday, April 8, 2024

Troubleshooting Strategies for Kubernetes, ndots: 5, and DNS blackholes

Based solely on the title, some K8S old-timers will already know the problem and how I solved it.  Regardless, it's useful to discuss DNS troubleshooting strategies in K8S.

Troubleshooting strategy 1: Clearly define the problem as a goal.

I am trying to bring up a new RKE2 K8S cluster and roll some new workload onto it.  The cluster had no workload because it was brand new.  Three days ago I upgraded to K8S 1.28.8 and Rancher v2.8.3.  I deployed several Node-RED systems; the deployments, pods, services, PVCs, and PVs all look great, and I can access the systems.  However, Node-RED is a Node.js application, and when I try to install additional packages into the palette, I get a long timeout followed by an installation failure in the GUI.  The goal: I would like to be able to install node-red-node-aws and node-red-contrib-protobuf in Node-RED without errors.

Troubleshooting strategy 2: Gather related data.

The GUI is not much help; all we know is that it didn't work.  I opened a shell on the K8S pod through Rancher and poked around in /data/.npm/_logs until I found the exact error message: "FetchError: request to https://registry.npmjs.org/node-red-contrib-protobuf failed, reason: getaddrinfo ENOTFOUND registry.npmjs.org".  This looks bad, very bad; the pod can't resolve DNS for registry.npmjs.org.
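
For anyone who wants to reproduce that kind of digging, the shell steps are roughly the following; the namespace and deployment name here (nodered and nodered1) are stand-ins for whatever your workload is actually called:

kubectl -n nodered exec -it deploy/nodered1 -- sh
# then, inside the pod, search the npm debug logs for resolver errors
grep -r ENOTFOUND /data/.npm/_logs/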

Troubleshooting strategy 3: Think about nearby components to isolate the problem.

OK, DNS does not work.  However, the rest of the LAN is up, and the RKE2 system resolved hub.docker.com to initially install the container images, so at a reasonably high level, DNS appears to be working.  The problem seems isolated to the pods (not the entire LAN) and, so far, exactly one website.  Next, stretch out and try pinging Google.

ping www.google.com

It fails to resolve: "no IP address found".  Let's try some other tools; sometimes nslookup will report better errors.  Ping asks the OS resolver to look up a domain name, while nslookup asks a specific DNS server to resolve it, a subtle but important difference.

nslookup www.google.com

Wow, that worked.  Let's combine the observations so far: the operating system DNS resolver (the local mechanism using /etc/resolv.conf) failed, yet asking the K8S DNS server directly worked.  Local DNS resolution (inside the cedar.mulhollon.com domain) works perfectly all the time.  Interesting.
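
If you want to poke at that difference yourself, one rough way to compare the two paths looks like this; 10.43.0.10 is the cluster DNS IP from my resolv.conf (adjust to yours), and getent only exists in glibc-style images:

getent hosts www.google.com          # uses the C library resolver, honoring the resolv.conf search list and ndots
nslookup www.google.com 10.43.0.10   # queries the named server directly, with nslookup's own simpler name handling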

Troubleshooting strategy 4: Hope this is not your first rodeo.

This is not my first rodeo solving sysadmin and DNS problems.  Always try both the FQDN and the non-FQDN hostname.  Surprisingly:

ping www.google.com.

That worked.  A DNS hostname that does not end in a period (plus or minus some ndots options) will check the search path before hitting the internet.  That's how a DNS lookup in a web browser on the LAN for nodered1 ends up pointing to the same place as nodered1.cedar.mulhollon.com: there are not two entries with different names, there's just cedar.mulhollon.com in the DNS search path and a nodered1 A record.

When working through a puzzling IT problem, if this is your first rodeo, you will be in for a rough grind; I've seen a lot over the years, which makes life a little easier.  My hunch about FQDN vs non-FQDN hostnames paid off.

Back to troubleshooting strategy 2, let's gather more data.  Inside the pod, its /etc/resolv.conf looks like this:

search nodered.svc.cluster.local svc.cluster.local cluster.local cedar.mulhollon.com mulhollon.com
nameserver 10.43.0.10
options ndots:5

This is the longest search path and the highest ndots option I've ever seen, which, based on some internet research, is apparently normal in K8S environments, and the nameserver IP looks reasonable.
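
If you want to watch what that search path plus ndots:5 actually does to a query, dig can walk the search list and show each candidate it tries.  dig isn't in most minimal images, so either install it in a throwaway pod or copy the pod's search and ndots settings onto a test VM; a sketch:

dig +search +showsearch +ndots=5 registry.npmjs.org
# with ndots:5, any name with fewer than five dots gets every search suffix tried first,
# e.g. registry.npmjs.org.nodered.svc.cluster.local through registry.npmjs.org.mulhollon.com,
# before the resolver ever tries the bare name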

Troubleshooting strategy 5: Enumerate your possibilities.

  1. Is 10.43.0.10 a working coredns resolver?
  2. The search path is long.  UDP packet size limit?  Some limit in the DNS resolver?  Could it be timing out trying so many possible search paths?
  3. What's up with the huge ndots option setting?  Problem?  Symptom?  Cause?  Effect? Irrelevant?

At this stage of the troubleshooting process, I had not yet guessed the root cause of the problem, but I had some reasonable possibilities to follow up on.
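
Possibility 1 is cheap to check from any cluster node or throwaway pod, with something roughly like this (10.43.0.10 being my cluster DNS service IP):

dig @10.43.0.10 kubernetes.default.svc.cluster.local +short   # should print the API server's ClusterIP
dig @10.43.0.10 registry.npmjs.org +short                     # should print public A records if upstream forwarding works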

Troubleshooting strategy 6: Search the internet after you have something specific to search for.

When you have described your problem in enough detail, it's time to find out how "the internet" solved it; someone else has probably already solved the problem and is happily telling everyone, much like this blog post.  I found a couple of leads:
  1. Some people with weird ndots settings have massive DNS leaks causing lots of extraneous DNS traffic, and running the famous K8S nodelocal DNS cache helped them cut the overload.  Is that my problem?  Not sure, but it's "a problem".  According to everything I read, I probably need to set up nodelocal DNS caching eventually.  Make a note that because my RKE2 uses IPVS, I need a slightly modified nodelocal DNS cache config.
  2. Some people with weird ndots settings had DNS queries time out because they generated too many search queries.  The overall Node-RED palette installer fails after about 75 seconds of trying, and the command-line "npm install node-red-contrib-protobuf" times out and fails in about 90 seconds, but my DNS test queries at the shell fail instantly, so their problem is likely not my problem; I also have very low traffic on this new cluster, while they have large clusters with huge DNS traffic loads.  Also, there is a lot of opposition from the general K8S community to 'fixing' high ndots settings, due to various K8S-internal traffic issues, so we're not doing that.  I think this is a dead end.
  3. Rancher and RKE2 and K8S all have pretty awesome wiki DNS troubleshooting guides.  I plan to try a few!

Troubleshooting strategy 7: Know when to cowboy and when not to cowboy.

If the system is in production or you're operating during a maintenance window, then you have a written, documented, meeting-approved change-management and risk plan, a window to work in, and prepared back-out plans.  However, if it's a pre-production experimental burn-in test system, then it's the Wild West; just make sure to write good docs about what you do to justify your learning time.  This particular example was an experimental new RKE2 cluster, perfectly safe to take some risks on: I need to set up nodelocal DNS sooner or later anyway, and less DNS traffic might help this problem, or at least can't make it worse.  So I talked myself into cowboy-installing nodelocal DNS caching on this IPVS-based RKE2 cluster while working the "issue".  I felt it was related "enough" and safe "enough", and in the end I was correct, even though it did not solve the immediate problem.


The DNS cache worked, with no change in results (although nothing is worse, I think).  Note you have to add the ipvs option if you have IPVS enabled (which I do).
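
For reference, on RKE2 the nodelocal cache can be turned on through a HelmChartConfig for the bundled rke2-coredns chart; the sketch below is from memory of the RKE2 networking docs, so double-check the nodelocal values keys (especially the ipvs flag) against the docs for your release:

kubectl apply -f - <<'EOF'
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    nodelocal:
      enabled: true
      ipvs: true
EOF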

I switched my troubleshooting strategy back to "Gather related data."  If my first and only real workload of Node-RED containers fails, let's try a different workload.  I spun up a linuxserver.io DokuWiki container.  It works perfectly, except it can't resolve external DNS either.  At least it's consistent, and consistent problems get fixed quickly.  This removes the containers themselves as the cause of the problem; it is unlikely to be an issue unique to Node-RED containers if the identical problem appears in a DokuWiki container from another vendor...

Back to Troubleshooting strategy 6: when in doubt, search the internet.  I methodically worked through Rancher's DNS troubleshooting guide.

kubectl -n kube-system get pods -l k8s-app=kube-dns

This works.

kubectl -n kube-system get svc -l k8s-app=kube-dns

This works.

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default

"nslookup: can't resolve 'kubernetes.default'"
This fails, which seems to be a big problem.

Time to circle back around to Troubleshooting Strategy 7 and try some cowboy stuff.  This is just a testing, burn-in, and experimentation cluster; let's restart some pods, since it's plausible they are carrying historical or out-of-date config.  The two rke2-coredns-rke2-coredns pods are maintained by the rke2-coredns-rke2-coredns replicaset and were last restarted 3 days ago during the K8S upgrade, and restarting CoreDNS pods on a test cluster should be pretty safe, and it was.  I restarted one pod, and nothing interesting happened.  The good news is that the logs look normal on the newly started pod; the bad news is that the busybox DNS query for kubernetes.default still fails.  I restarted the other pod, so now I have two freshly restarted CoreDNS pods.  Logs look normal and boring on the second restarted pod too.  The pod images are rancher/hardened-coredns:v1.11.1-build20240305.  The busybox query for kubernetes.default continues to fail, same as before.  Nothing worse, nothing better.
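
A restart here is nothing fancier than deleting the CoreDNS pods one at a time and letting the replicaset recreate them, roughly:

kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system delete pod <one-of-the-rke2-coredns-pods>   # one at a time; the replicaset spins up a replacement
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=20          # sanity-check the fresh pods' logs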

I return to troubleshooting strategy 2, "Gather more data".

kubectl -n kube-system get pods -l k8s-app=kube-dns

NAME                                         READY   STATUS    RESTARTS   AGE
rke2-coredns-rke2-coredns-864fbd7785-5lmgs   1/1     Running   0          4m1s
rke2-coredns-rke2-coredns-864fbd7785-kv5zq   1/1     Running   0          6m26s

Looks normal: I have two pods, and these are the pods I just restarted in the CoreDNS replicaset.

kubectl -n kube-system get svc -l k8s-app=kube-dns

NAME                        TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns-upstream           ClusterIP   10.43.71.75   <none>        53/UDP,53/TCP   51m
rke2-coredns-rke2-coredns   ClusterIP   10.43.0.10    <none>        53/UDP,53/TCP   50d

OK, yes, I set up nodelocal caching as part of the troubleshooting about 51 minutes ago, so this is very reasonable output.

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default

Server:    10.43.0.10
Address 1: 10.43.0.10 rke2-coredns-rke2-coredns.kube-system.svc.cluster.local
nslookup: can't resolve 'kubernetes.default'
pod "busybox" deleted
pod default/busybox terminated (Error)

Time to analyze the new data.  After upgrading and restarting "everything" it's still not working, so it's probably not cached data, old configuration, or similar; there's something organically wrong with the DNS system itself that only presents in pods running on RKE2, while everything else is OK.  It's almost like there's nothing wrong with RKE2 itself... which eventually turned out to be correct...

Time for some more of Troubleshooting strategy 6, trust the internet.  Methodically going through the "CoreDNS specific" DNS troubleshooting steps:

kubectl -n kube-system logs -l k8s-app=kube-dns

.:53
[INFO] plugin/reload: Running configuration SHA512 = c18591e7950724fe7f26bd172b7e98b6d72581b4a8fc4e5fc4cfd08229eea
58f4ad043c9fd3dbd1110a11499c4aa3164cdd63ca0dd5ee59651d61756c4f671b7
CoreDNS-1.11.1
linux/amd64, go1.20.14 X:boringcrypto, ae2bbc29
.:53
[INFO] plugin/reload: Running configuration SHA512 = c18591e7950724fe7f26bd172b7e98b6d72581b4a8fc4e5fc4cfd08229eea
58f4ad043c9fd3dbd1110a11499c4aa3164cdd63ca0dd5ee59651d61756c4f671b7
CoreDNS-1.11.1
linux/amd64, go1.20.14 X:boringcrypto, ae2bbc29

kubectl -n kube-system get configmap coredns -o go-template={{.data.Corefile}}

Error from server (NotFound): configmaps "coredns" not found

That could be a problem?  No coredns configmap?  Then again, these are somewhat old instructions, I have a new cluster with a fresh caching node-local DNS resolver, and DNS is mostly working, so a major misconfiguration like this can't be the problem.  I poked around on my own a little.
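
Poking around mostly meant listing which DNS-related configmaps actually exist in kube-system, something like:

kubectl -n kube-system get configmaps | grep -i dns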

I checked the node-local-dns configmap and that looks reasonable.

kubectl -n kube-system get configmap node-local-dns -o go-template={{.data.Corefile}}

(It would be a very long cut and paste, but it seems to forward to 10.43.0.10, which admittedly doesn't work; this ended up being irrelevant to the story anyway, so it's not included in this blog post.)

Ah, I see in the installed Helm app for rke2-coredns that the configmap is actually named rke2-coredns-rke2-coredns.  OK, that makes sense now.

kubectl -n kube-system get configmap rke2-coredns-rke2-coredns -o go-template={{.data.Corefile}}

.:53 {
    errors 
    health  {
        lameduck 5s
    }
    ready 
    kubernetes   cluster.local  cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus   0.0.0.0:9153
    forward   . /etc/resolv.conf
    cache   30
    loop 
    reload 
    loadbalance 
}

The above seems reasonable.

The docs suggest checking the upstream nameservers:

kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'

The nameserver info matches the configuration successfully used by the 78 Ansible-managed hosts on this LAN, and superficially it looks good.  (Insert ominous music here; not to ruin the story, but the cause of the problem was that the config was not good after all, it just mostly worked everywhere except K8S, for reasons I'll get to later.)

Let's contemplate Troubleshooting strategy 7: Know when to cowboy and when not to cowboy.  On one hand, I feel like enabling query logging and watching one specific query get processed at a time.  On the other hand, I feel pretty confident in my ability to enable logging, less confident in the infrastructure's ability to survive an accidental, unintended log flood, and very unconfident about my ability to shut OFF query logging instantly if some crazy flood occurs.  Overall, I feel that going cowboy by enabling query logging is a net-negative risk/reward at this time.  Someday I will experiment safely with query logging, but today is not that day.
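
For the record, turning it on would just mean adding the log plugin to the Corefile in the rke2-coredns-rke2-coredns configmap shown above (and being ready to rip it right back out); a sketch of that experiment, should you be braver than me:

kubectl -n kube-system edit configmap rke2-coredns-rke2-coredns
# add a line containing just "log" inside the .:53 { ... } block; the reload plugin
# already in the Corefile picks up the change, and every query then gets logged,
# which is exactly the flood risk weighed above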

Well, I seem stuck.  Not out of ideas, but the well is starting to run dry.  I haven't tried "Troubleshooting Strategy 1: Clearly define the problem as a goal" in a while, so I tried that again and re-documented the problem.  It looks like the story I placed at the beginning of this blog post, no real change.  It was still a good idea to refocus on the goal and on what I've done so far.  Maybe this review time unconsciously helped me solve the problem.

Let's try some more Troubleshooting strategy 2 and gather more data.  As previously discussed, I did not feel like going full cowboy by enabling cluster-wide DNS query logging.  Per Troubleshooting Strategy 4, hope this is not your first rodeo, I am quite skilled at analyzing individual DNS queries, so let's try what I'm good at: we will pretend to be a K8S pod, but on a VM, and walk the search paths by hand just to see what they look like.

From a VM unrelated to all this K8S work, let's try the google.com.cedar.mulhollon.com search candidate.  That zone is served by my Active Directory domain controller, and it returns an NXDOMAIN; this is normal and expected.

Following troubleshooting strategy 3, think about nearby components to isolate the problem, let's try the last DNS search path, which will be google.com.mulhollon.com.  That domain is hosted by Google, and the query returns a valid NOERROR but no answer.
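
Replaying those two candidates by hand looks roughly like this from any machine with dig; the interesting part is the status line and the answer count, not the rest of the output:

dig google.com.cedar.mulhollon.com   # expect status: NXDOMAIN from the internal Active Directory DNS
dig google.com.mulhollon.com         # expect status: NOERROR but ANSWER: 0 from the Google-hosted zone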

Wait what, is that even legal according to the DNS RFCs?

Following troubleshooting strategy 5, enumerate your possibilities, I think it's quite plausible this weird "NOERROR header but empty data" DNS response from Google could be the problem.  This isn't my first rodeo troubleshooting DNS, and I know the resolver's search algorithm stops at the first answer it gets: when internal resolution fails, the last search candidate for host "whatever" is whatever.mulhollon.com, Google blackholes that query with an empty NOERROR, and the resolver never goes on to try external resolution of the bare name.  This certainly seems to fit the symptoms.  As a cowboy experiment on the test cluster, I could remove that domain from the DNS search path in /etc/resolv.conf and try again.  In summary, I can now repeatedly and reliably replicate a problem directly related to the issue on a VM, and I have a reasonable experiment plan to try.

Before I change anything, let's gather some more data under Troubleshooting Strategy 2 and try to replicate the fix in a K8S pod.  It's mildly annoying that I don't have root and can't edit the /etc/resolv.conf file in my Node-RED containers, but that's just how those Docker containers are designed.

I found a container that I can successfully log into as root and modify the /etc config files.  With mulhollon.com (hosted at Google) still in the search path, trying to ping www.google.com gives "bad address", because Google domain hosting blackholes missing A records; so weird, but so true.  If I edit /etc/resolv.conf in this container and remove mulhollon.com from the search path: SUCCESS AT LAST!  I can now resolve and ping www.google.com immediately with no problems.  I can also ping registry.npmjs.org, which implies I could probably use it (although this test container isn't a Node.js or Node-RED container).
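
A quick way to make that edit from a shell, assuming the search line shown earlier, is a sketch like this; note that /etc/resolv.conf is usually a bind mount inside a container, so overwrite it in place rather than letting sed -i rename over it:

sed 's/ mulhollon\.com$//' /etc/resolv.conf > /tmp/resolv.conf.new
cat /tmp/resolv.conf.new > /etc/resolv.conf   # overwrite in place; renaming over a bind mount fails
ping -c 3 www.google.com
ping -c 3 registry.npmjs.org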

Well, my small cowboy experiment worked; let's try a larger-scale experiment next.  But first, some explanation of why the system has this design.  In the old days, I had everything in the domain mulhollon.com; then I gradually rolled everything internal into the Active Directory-hosted cedar.mulhollon.com, and now I have nothing but external internet services on mulhollon.com.  In the interim, while I was setting up AD, I needed both domains in my DNS search path for internal hosts, but that hasn't actually been needed for many years.

Time for some more Troubleshooting strategy 7, cowboy changes on a test cluster.  Some quality Ansible time resulted in the entire LAN having its DNS search path "adjusted", and Ansible applied the /etc/resolv.conf changes to the entire RKE2 cluster in a couple of minutes.  I verified the changes at the RKE2 host level: they look good, and DNS continues to work at the RKE2 and host OS level, so nothing has been made worse.

I ran a "kubectl rollout restart deployment -n nodered", which wiped and recreated the Node-RED container farm (without deleting the PVs or PVCs; K8S is cool).  Connecting to the shell of a random container, the container's /etc/resolv.conf inherited the change "live" from the host's /etc/resolv.conf, with no RKE2 reboot or other system-software-level restart required; it looks like at container startup time it simply copies in the current host resolv.conf file, simple and effective.  "ping www.google.com" works now that the DNS blackhole is no longer in the search path.  I can install Node.js nodes into Node-RED from both the CLI and the web GUI, and containers in general in the new RKE2 cluster have full outgoing internet access, which was the goal of the issue.
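
The verification boils down to a few commands like these; the nodered namespace is real, but the nodered1 deployment name is a stand-in for whichever deployment you pick:

kubectl rollout restart deployment -n nodered
kubectl -n nodered exec -it deploy/nodered1 -- cat /etc/resolv.conf      # mulhollon.com should be gone from the search line
kubectl -n nodered exec -it deploy/nodered1 -- ping -c 3 www.google.com  # external names should now resolve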

Troubleshooting strategy 8: If you didn't document it, it didn't happen.

I saved a large amount of time by keeping detailed notes in the Redmine issue, using it as a constantly up-to-date project plan for fixing the problem and reaching the goal.  Ironically I spent twice as much time writing this beautiful blog post as I spent initially solving the problem.

I will list my troubleshooting strategies below.  These overall strategies will get you through some dark times.  Don't panic, keep grinding, switch to a new strategy when progress on the old strategy slows, and eventually, things will start working.
  • Clearly define the problem as a goal.
  • Gather related data.
  • Think about nearby components to isolate the problem.
  • Hope this is not your first rodeo.
  • Enumerate your possibilities.
  • Search the internet after you have something specific to search for.
  • Know when to cowboy and when not to cowboy.
  • If you didn't document it, it didn't happen.

Good luck out there!
