Friday, November 3, 2023

Proxmox VE Cluster - Chapter 006 - Install Proxmox VE on the old Rancher cluster


A voyage of adventure, moving a diverse workload running on OpenStack, Harvester, and RKE2 K8S clusters over to a Proxmox VE cluster.


Today, installing Proxmox VE on the old Rancher cluster hardware.  The main reference document:

https://pve.proxmox.com/pve-docs/chapter-pve-installation.html


On the Beelink N5095, my BIOS boot order defaults to the hard drive (and its old RKE2 K8S installation).  This version of the BIOS has a handy feature, though: boot, hit ESC to enter the BIOS, press the left arrow to wrap around to the rightmost BIOS menu, then select a one-time UEFI boot from the USB key containing the Proxmox VE installation media.  Why doesn't PXE boot work over the LAN like everything else?  Well, that's a long story; booting from USB for right now.


Notes from the install process:

  • Note the USB keyboard I have did not work in the Proxmox graphical install environment, so I used the console install environment.
  • Country: United States
  • Timezone: The "timezone" field will not let me enter a timezone, only city names, none of which are nearby.  Super annoying that I can't just enter a timezone like in a real operating system.  I ended up selecting a city a thousand miles away.  This sucks.  It's a "timezone" setting, not "name a faraway city that coincidentally is in the same timezone".  I expect better from Proxmox.
  • Keyboard Layout: U.S. English
  • Password: (mind your caps-lock)
  • Administrator email: vince.mulhollon@springcitysolutions.com
  • Management Interface: enp2s0 (not the wifi)
  • Hostname FQDN: as appropriate, as per the sticker on the device
  • IP address (CIDR): as appropriate, as per the sticker on the device, with a /16 prefix
  • Gateway address: 10.10.1.1
  • DNS server address: 10.10.7.21 (my "old" dns22, which will probably get readdressed soon)
  • Note you can't set up VLANs in the installer, AFAIK.  I intend to use VLANs in the distant future.
  • Hit enter to reboot, yank the USB flash install drive, yank the USB keyboard, watch the monitor... seems to boot properly...
  • Web interface is on port 8006.  Log in as root.  Note I installed 8.0-2, and on the first boot the web GUI reports version 8.0.3; it must have auto-updated as part of the install process?
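
To double check which version actually got installed, the node's console can be asked directly.  A minimal sketch, assuming the `pveversion` command prints a line of the form `pve-manager/<version>/<hash> (running kernel: ...)`; the helper name `pve_manager_version` is made up for illustration:

```shell
# Hypothetical helper: extract the pve-manager version from
# `pveversion` output supplied on stdin.
pve_manager_version() {
    # The second slash-separated field of
    # "pve-manager/<version>/<hash> (running kernel: ...)" is the version.
    sed -n 's|^pve-manager/\([^/]*\)/.*|\1|p'
}

# Usage on the node (not run here):
#   pveversion | pve_manager_version
```

The installer ISO release and the pve-manager package carry separate version numbers, so comparing against the package version is the more reliable check.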


Upgrade the new Proxmox VE node

  1. Double check there's no production workload on the server; it's a new install so there shouldn't be anything, but it's a good habit.
  2. Select the "Server View" then node name, then on the right side, "Updates", "Repositories", disable both enterprise license repos.  Add the community repos as explained at https://pve.proxmox.com/wiki/Package_Repositories
  3. Or in summary, click "add", select "No-subscription", "add", then repeat for the "Ceph Quincy No-Subscription" repo.
  4. In right pane, select "Updates" then "Refresh" and watch the update.  Click "Upgrade" and watch the upgrade.
  5. Optimistically get a nice message on the console of "Your system is up-to-date" and a request to reboot.
  6. Reboot and verify operation.
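
For anyone who prefers the console over the web UI, the repository change can be sketched roughly as follows.  The two "deb" lines are the Proxmox VE 8 (bookworm) no-subscription entries from the Package_Repositories wiki page linked above.  The function names, the target file names, and the locations of the enterprise repo files (assumed here to be /etc/apt/sources.list.d/pve-enterprise.list and /etc/apt/sources.list.d/ceph.list) are assumptions, so check them on your own node first:

```shell
# Disable a repo file by commenting out every active "deb" line in it.
disable_repo_file() {
    if [ -f "$1" ]; then
        sed -i 's/^deb /# deb /' "$1"
    fi
}

# Write the no-subscription repo entries into the given directory.
# The .list file names are hypothetical; any name ending in .list works.
write_no_subscription_repos() {
    dir="$1"
    printf '%s\n' \
        'deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription' \
        > "$dir/pve-no-subscription.list"
    printf '%s\n' \
        'deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription' \
        > "$dir/ceph-no-subscription.list"
}

# Usage on the node as root (not run here):
#   disable_repo_file /etc/apt/sources.list.d/pve-enterprise.list
#   disable_repo_file /etc/apt/sources.list.d/ceph.list
#   write_no_subscription_repos /etc/apt/sources.list.d
#   apt update && apt dist-upgrade
```

Taking the directory as a parameter makes it easy to try the functions against a scratch directory before touching /etc/apt.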

Increase the Ethernet MTU

The plan is to change the MTU of the ethernet physical port and the Proxmox internal bridge to 9000.  The Netgear ethernet switch was set to 9200+ a long time ago.

  1. Select the node, "System", "Network", select the ethernet port, edit, change the MTU from 1500 to 9000, "OK", "Apply Configuration".
  2. Repeat process selecting the bridge instead of the ethernet port. 
  3. On the right pane in "Updates" along the top there is a "Reboot" button, hit it.
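
After the reboot it is worth confirming that jumbo frames really pass end to end.  A minimal sketch, assuming the bridge is named vmbr0 (the Proxmox default) and that there is some other jumbo-enabled host on the LAN to ping; both helper names are made up for illustration:

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet for a
# given MTU: MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
ping_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Print the MTU the kernel currently has configured for an interface.
iface_mtu() {
    cat "/sys/class/net/$1/mtu"
}

# Usage on the node (not run here); vmbr0 and the target host are assumptions:
#   iface_mtu enp2s0
#   iface_mtu vmbr0
#   ping -M do -c 3 -s "$(ping_payload_for_mtu 9000)" <another-jumbo-enabled-host>
```

The "-M do" option tells ping to forbid fragmentation, so the ping only succeeds if every hop on the path accepts the full 9000 byte packet.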

R8168 Ethernet Driver Problem in Debian Linux

There is a driver problem in Debian Linux / Proxmox VE 8 with R8168 ethernet hardware.  Every couple of seconds, depending on network load, the syslog will report some variation upon:

kernel: pcieport 0000:00:1c.5: AER: Multiple Corrected error received: 0000:00:1c.5
kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
kernel: pcieport 0000:00:1c.5:   device [8086:4dbd] error status/mask=00000041/00002000
kernel: pcieport 0000:00:1c.5:    [ 0] RxErr                  (First)
kernel: pcieport 0000:00:1c.5:    [ 6] BadTLP

It will also crash the ethernet connectivity entirely approximately once per day, depending on network use level.  The system keeps running and spamming the syslog with the error messages above; however, the ethernet driver stops processing packets.  The link light will be on and flashing.  A power cycle will fix it for "about a day", depending on network load.
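
To get a rough feel for how badly a node is affected, the corrected PCIe errors in the kernel log can be counted.  A minimal sketch, assuming the messages look like the ones quoted above; the helper name `count_aer_corrected` is made up for illustration:

```shell
# Count "corrected PCIe bus error" lines in kernel log text read from stdin.
count_aer_corrected() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c 'PCIe Bus Error: severity=Corrected' || true
}

# Usage on the node (not run here), kernel messages from the current boot:
#   journalctl -k -b | count_aer_corrected
```

A count that keeps climbing between runs suggests the node is still suffering from the problem.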

This problem is documented in the Proxmox Forums at:

https://forum.proxmox.com/threads/pve8-netdev-watchdog-enp1s0-r8169-transmit-queue-0-timed-out-fix-to-some-extent.133752/

Which references a Medium article at:

https://medium.com/@pattapongj/how-to-fix-network-issues-after-upgrading-proxmox-from-7-to-8-and-encountering-the-r8169-error-d2e322cc26ed

Which references a RealTechTalk article at:

https://realtechtalk.com/Ubuntu_Debian_Linux_Mint_r8169_r8168_Network_Driver_Problem_and_Solution-2253-articles

To paraphrase and summarize the above articles, and implement the work-around:

Verify the Beelink has an R8168 network card with an installed R8169 driver which "almost" works most of the time, by using "lspci -nnk":

02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
        Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
        Kernel driver in use: r8169
        Kernel modules: r8168

It has an r8168 on board and an r8169 driver auto-installed.  Well, that's not going to work.
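
With several nodes to check, eyeballing lspci output gets old quickly.  A minimal sketch that pulls out just the driver name, assuming the "lspci -nnk" layout shown above (device header at the start of a line, indented detail lines below it); the helper name `ethernet_driver_in_use` is made up for illustration:

```shell
# Print the "Kernel driver in use" value for every Ethernet controller
# found in `lspci -nnk` output supplied on stdin.
ethernet_driver_in_use() {
    awk '
        # A device header starts in column one with the hex bus address.
        /^[0-9a-f]/ { in_eth = ($0 ~ /Ethernet controller/) }
        # Detail lines are indented; only report those under an Ethernet device.
        in_eth && /Kernel driver in use:/ { print $NF }
    '
}

# Usage on the node (not run here):
#   lspci -nnk | ethernet_driver_in_use
```

On an affected node this prints r8169; after the work-around it should print r8168.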

Access the console; either ssh or the web UI works.

Then use vi to edit /etc/apt/sources.list to add non-free and non-free-firmware to BOTH the bookworm and bookworm-updates lines.
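
If vi is not your thing, the same edit can be sketched with sed.  This assumes the stock sources.list has lines of the form "deb <mirror> bookworm main contrib" and "deb <mirror> bookworm-updates main contrib"; the function name is made up for illustration, and it is worth trying on a copy of the file first:

```shell
# Append "non-free non-free-firmware" to the bookworm and bookworm-updates
# lines of an APT sources file, leaving every other line alone.
add_nonfree_components() {
    sed -i -E \
        -e '/^deb .* bookworm(-updates)? /{' \
        -e '/non-free/!s/$/ non-free non-free-firmware/' \
        -e '}' "$1"
}

# Usage on the node as root (not run here):
#   cp /etc/apt/sources.list /root/sources.list.bak
#   add_nonfree_components /etc/apt/sources.list
```

Lines that already mention non-free are skipped, so running it twice does no harm.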

"apt update"

"apt install pve-headers"

"apt install r8168-dkms"

"dkms status" should show the r8168 dkms module is installed.  Not just ready or something, it needs to report as "installed".
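
That check can be scripted as well.  A minimal sketch, assuming "dkms status" prints one line per module that starts with the module name and ends in ": installed" once the build has succeeded; the helper name `dkms_module_installed` is made up for illustration:

```shell
# Succeed only if `dkms status` output on stdin shows the named module
# in the "installed" state (not merely "added" or "built").
dkms_module_installed() {
    grep -Eq "^$1[/,].*: installed"
}

# Usage on the node (not run here):
#   dkms status | dkms_module_installed r8168 && echo "r8168 is installed"
```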

vi /etc/modprobe.d/r8168-dkms.conf and uncomment the blacklist line, such that the r8169 driver will not load.
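
The uncommenting can also be done non-interactively.  A minimal sketch, assuming the file contains a commented-out line of the form "# blacklist r8169"; the function name is made up for illustration:

```shell
# Uncomment a "blacklist r8169" line in the given modprobe config file,
# leaving all other comments untouched.
enable_r8169_blacklist() {
    sed -i -E \
        's/^[[:space:]]*#+[[:space:]]*(blacklist[[:space:]]+r8169)[[:space:]]*$/\1/' \
        "$1"
}

# Usage on the node as root (not run here):
#   enable_r8169_blacklist /etc/modprobe.d/r8168-dkms.conf
#   grep blacklist /etc/modprobe.d/r8168-dkms.conf
```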

Reboot and it should be fixed.

After rebooting, in the console, run "lspci -nnk", it should now report:

02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
        Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
        Kernel driver in use: r8168
        Kernel modules: r8168

So, now we have an r8168 on board ethernet hardware device, using the newly installed r8168 driver.  Cool.

Is it working better?  Immediately I noticed the syslog is not being spammed anymore.  I ran it under load for a couple of days and it continues to work.  I'd consider this fixed.  With the wrong driver installed, none of the nodes would ever run more than, perhaps, 24 hours.  I'm sure this mandatory DKMS driver will make future Proxmox upgrades "more exciting".


Final Installation Checklist after the install and post-installation tasks above:

  1. Perform some basic operation testing
  2. In the web UI "Shutdown" then wait for power down.
  3. Reinstall in permanent location, power up.
  4. Verify information in Netbox to include MAC, serial number, ethernet cabling, platform should be Proxmox VE, remove old Netbox device information.
  5. Add new hosts to Zabbix.
  6. Verify operation one final time.

Next post will be about preparing the old Harvester cluster hardware, before installation of Proxmox VE.
