Tuesday, July 5, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 007 - OpenStack Infrastructure Prep - Netboot.XYZ

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

So... my netboot / PXE infrastructure was not working.

In the old days, before the turn of the century, when I was setting up VOIP phones and turning piles of P75 desktops into clusters, I had some handwritten ISC-DHCP server configs and handwritten TFTP configs, so this is all "old stuff" to me.

The thing that trips up newer sysadmins about network booting is the utter lack of transparency.  Configure some rando DHCP server with mysterious incantations, then configure some rando TFTP server with similarly weird configurations, then toss strange never-before-seen files into an unusual directory hierarchy, change some BIOS options on the host that most mere mortals fear to change, slap the reset button, and either it works and you celebrate, or it just freezes up and that's pretty much all you know until you whip out the ethernet protocol analyzer, or find someone else who has previously "been there, done that".  Set up Active Directory, eh, that's hard and a lot of work, but there are entire books written about it.  Set up network booting, and not just "set up" but it actually works?  Now that's hard core, as I see things.

The way it's supposed to work, from a very high level, is your DHCP server sends some interesting option codes to let devices know it can do this stuff.  Various BIOS settings, traditionally undocumented but sometimes you'll be surprised, tell your device to netboot after the hard drive boot fails, or maybe when you press F12 while booting or something more random and nonstandard, or maybe netboot every time, or whatever.  Anyway, assuming your device is set up correctly (and how do you know?), when the DHCP server responds to a netbooting device, the device is told the IP address of a TFTP server to follow up with later, and what file to grab from that TFTP server.  The file probably depends on some option codes the device sent to the server to indicate whether it should boot UEFI or old-fashioned BIOS (or I suppose you could hard-code which one based on the device MAC address, or lay down the law: it's 2022, we only boot UEFI nowadays and forevermore).  Then your TFTP server tosses the netbooting device some magic file that does absolutely everything else.  So a problem could be at the DHCP server level, the TFTP server level, or a higher "everything else" application level.  Without a protocol analyzer and someone who knows how to use it, best of luck to you, you'll need it.  As that famous Australian electrical engineer YouTube guy says, "it's all a bit how-ya-doin", and when it works, it's nice, but it's not fun to troubleshoot as an overall system.
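
To make the "which file do I hand out" decision a little less mysterious, here is a rough Python sketch of the logic a DHCP server applies when a netbooting device announces its architecture.  The architecture codes are the ones from RFC 4578 (DHCP option 93).  The boot filenames are an assumption on my part, so check what actually sits in your own TFTP root, and the function names and the 192.0.2.10 address are made up purely for illustration.  This is not how any particular DHCP server is implemented, just the idea.

```python
# Client architecture codes a PXE client sends in DHCP option 93 (RFC 4578).
LEGACY_BIOS = 0            # Intel x86PC, i.e. old-fashioned BIOS
UEFI_ARCHES = {6, 7, 9}    # EFI IA32, EFI BC, EFI x86-64

# ASSUMPTION: these filenames are illustrative. Use whatever files
# really exist in your TFTP server's root directory.
BIOS_BOOTFILE = "netboot.xyz.kpxe"
UEFI_BOOTFILE = "netboot.xyz.efi"


def pick_bootfile(arch_code: int) -> str:
    """Return the boot filename for the architecture the client announced."""
    if arch_code == LEGACY_BIOS:
        return BIOS_BOOTFILE
    if arch_code in UEFI_ARCHES:
        return UEFI_BOOTFILE
    raise ValueError(f"no boot file configured for architecture {arch_code}")


def build_boot_reply(arch_code: int, tftp_server: str) -> dict:
    """Hypothetical helper: the two facts a netbooting device needs back."""
    return {
        "next-server": tftp_server,            # IP address of the TFTP server
        "filename": pick_bootfile(arch_code),  # file to fetch from it
    }


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address, not a real server.
    print(build_boot_reply(0, "192.0.2.10"))
    print(build_boot_reply(7, "192.0.2.10"))
```

If the device never gets both of those values, or gets a filename the TFTP server does not have, it just sits there, which is exactly the failure mode described above.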

Anyway, specifically, Netboot.XYZ is a pretty awesome "everything else" for a netboot system, one that can netboot-install various operating systems and testing images.  It's really cool.  As an additional and much appreciated effort, the project provides extras like the magic files required on your TFTP server to bootstrap into running netboot.xyz, and a Docker image holding a TFTP server and the higher-level stuff for the netboot system (menus and stuff).

https://netboot.xyz/

The system has been around for a while, and technology has changed over the years.  The docs as shipped for the Docker image had "cut and paste" instructions that referred to files that are no longer used for netbooting and are no longer shipped in the Docker container.  So netboots would fail, and in the logs of the Docker-provided TFTP server I'd see download requests for filenames that do not exist in the container, or at least no longer exist.
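
If you don't have a protocol analyzer handy, one cheap way to check for this kind of missing-file problem is to ask the TFTP server for the file yourself.  Below is a minimal Python sketch that builds a TFTP read request (RFC 1350) and reports whether the server answers with data or with an error.  The function names are my own invention, and it deliberately does not download the file or acknowledge the first block, so expect the server to log a timed-out transfer afterwards.  Treat it as a quick probe under those assumptions, not a real TFTP client.

```python
import socket
import struct

# TFTP opcodes from RFC 1350.
OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5


def build_rrq(filename: str) -> bytes:
    """Build a read request: opcode, filename, NUL, transfer mode, NUL."""
    return struct.pack("!H", OP_RRQ) + filename.encode("ascii") + b"\0octet\0"


def describe_reply(packet: bytes) -> str:
    """Summarize the first packet the server sent back."""
    (opcode,) = struct.unpack("!H", packet[:2])
    if opcode == OP_DATA:
        return "exists"
    if opcode == OP_ERROR:
        (code,) = struct.unpack("!H", packet[2:4])
        message = packet[4:].split(b"\0", 1)[0].decode("ascii", "replace")
        return f"error {code}: {message}"
    return f"unexpected opcode {opcode}"


def probe(server: str, filename: str, port: int = 69, timeout: float = 3.0) -> str:
    """Send one read request and describe whatever comes back, if anything."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (server, port))
        try:
            # The reply comes from a new server-side port, so use recvfrom
            # on an unconnected socket. 516 = 4 header bytes + 512 data bytes.
            reply, _ = sock.recvfrom(516)
        except socket.timeout:
            return "no reply"
    return describe_reply(reply)
```

Calling something like `probe("192.0.2.10", "some-boot-file")` with your own server address and the filename from your DHCP config tells you in a few seconds whether the problem is at the TFTP level or somewhere else.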

So I figured this all out, reverse engineered how the system should work based on other netboot systems I've used in the past, simplified the formerly rather elaborate ISC-DHCP server config down to about five minimal lines, and it all works, both legacy and UEFI netbooting!  I later went to submit a bug and a GitHub pull request to improve the docs for everyone else, and ended up meeting a guy trying to do the same thing as me and running into the same problem.  I walked through how to debug the innards of a Docker container to demonstrate the problem and the fix I came up with (wasn't Docker invented to eliminate the need to do exactly what we end up doing all the time anyway?).  To summarize a long story, I met some really cool folks and everyone's netbooting is working now, both UEFI and legacy, AFAIK, and the docs for Netboot.xyz are now up to date for anyone who wants to install the overall system.

Netboot.XYZ is definitely the finest netbooting software I've ever used or even seen, far exceeding the abilities of anything I ever hand-coded into working temporarily.  And the people who develop it are super cool to work with.  Overall an excellent experience.  And it's FOSS.  They do have an OpenCollective, so send them money at:

https://opencollective.com/netbootxyz/donate

So, after a long day's work, I can install Ubuntu 20.04 (LTS) over the LAN.

And everyone lived happily ever after.   Oh wait, I still have to finish converting my VMware cluster over to OpenStack.

Stay tuned for the next chapter!
