Thursday, June 30, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 003 - Starting Conditions

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 003 - Starting Conditions

To get where you're trying to go, first you have to figure out where you are.

The good news is, I'm the guy who stood up the VMware cluster, so I have a pretty good idea how it works.  And I like having good documentation, so I have a local installation of Netbox, which is the best IPAM solution I've ever seen, or even dreamed of.

There are six identical ESXi hosts, all a couple years old, the "famous" SuperMicro SYS-E200-8D model that is so popular in home labs around the world.  96 gigs of RAM each, because during my short-lived, wild and experimental VMware NSX era, NSX was incredibly memory hungry, to the point where you'd think the world's memory manufacturers bribed VMware to find some way to use more memory.  Like, what is it even doing with all those gigabytes of RAM on each host?

As for networking, each ESXi host has an IPMI port with a pretty good web-accessible HTML KVM, plus dual one-gig ethernet ports and dual ten-gig ethernet ports.  The VMware networking concept, at least pre-NSX, is to set up distributed switches across the hosts and uplink each distributed switch to the VLANs across all the ethernet ports, preferably spread over different physical switches.  So I have a one-gig ethernet switch with 12 connections and a ten-gig ethernet switch (which was admittedly expensive some years ago...) with 12 connections.  All the ports are configured the same way and identically trunked, although I tried to follow VMware guidelines to prefer one dedicated 10G port for the vMotion vmk and a different dedicated 10G port for the vSAN vmk.  In theory (and in practice, a couple times...) it was possible to yank out any three of the four ethernet ports and the system would keep working, although perhaps slowly.  Well, it was possible to REALLY fool VMware in the old days by admin-downing ports or blocking VLANs on the switch side, but in general it was pretty bulletproof.  NSX was somewhat less bulletproof due to extreme complexity, but I had given up on NSX many years ago.  OpenStack does things quite a bit differently than VMware; we'll get to that later...

As for storage, there's vSAN across all six ESXi hosts, all SSD, which for years has been one hundred percent reliable; I've never personally experienced data loss or frankly even much of a problem with vSAN.  I also have a huge iX Systems hardware NAS of many terabytes, and a couple smaller TrueNAS boxes based on Intel NUCs from a couple years ago with a fraction of a TB each.  VMware works reliably over NFS, although obviously the vSAN is enormously faster.  Over the years I upgraded the storage on each host, starting with cheap small HDDs, then small SSDs, and later larger SSDs as prices fell.  Those SSDs will be used for the bare metal OS and Cinder storage.  Each host also has an internal M.2 NVMe SSD that was used for vSAN cache and that I intend to use for OpenStack Swift.  Storage will also work a lot differently under OpenStack Cinder than it did under VMware...
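To make that a little more concrete, the rough idea (just a sketch of intent, and /dev/sdb is purely a placeholder for whatever the data SSD actually enumerates as) is to hand each host's data SSD to LVM and point the Cinder LVM backend at the resulting volume group:

    # Assumption: the data SSD shows up as /dev/sdb on this host
    sudo pvcreate /dev/sdb
    sudo vgcreate cinder-volumes /dev/sdb
    # cinder.conf then gets an [lvm] backend pointed at that volume group:
    #   volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
    #   volume_group = cinder-volumes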

Generally the easiest way to load balance, back up, replicate, and administer Docker workloads on VMware was to set up a host for every project (or even every container!) and let vMotion and DRS and HA do their magic.  With Ansible to automate the Linux side and products like Orchestrator to automate the VMware side, it only takes minutes to spin up a new Active Directory connected Docker host on the old VMware system.  I can likely replicate that on OpenStack using Heat (or Magnum with really small Docker Swarms?), or I could move to Zun on OpenStack, eventually.
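For what it's worth, the Heat version of "spin up another project Docker host" should boil down to roughly a one-liner per project.  The template name and parameter names here are purely hypothetical, just to show the shape of it:

    # Hypothetical template and parameter names, just to show the shape of it
    openstack stack create -t docker-host.yaml \
        --parameter image=ubuntu-20.04 \
        --parameter flavor=m1.medium \
        project-foo-docker01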

The biggest changes were infrastructural in nature.  Aside from the previously mentioned Zun to handle Docker, I would be using Heat instead of Orchestrator, and 'something' in place of Log Insight.  Probably a homemade ELK stack?  In the old days you'd install an ELK stack on bare metal, or a bare metal image anyway, just like you'd install something like Apache Guacamole, but now there are Docker containers for seemingly all services.  So somehow this conversion project is already morphing into a larger conversion; not just dragging and dropping existing images, but changing entire software architectures to use more Docker containerization and so forth.  Not exactly the first IT project in history to experience massive scope creep over time...

Stay tuned for the next chapter!

Wednesday, June 29, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 002 - The Plan, v 1.0

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 002 - The Plan, v 1.0

First, I need an operations and logistical plan.  What's going to go where, when, the usual project planning puzzle.

Before the big hardware crash, I had some experience with OpenStack and researched my options, so I was not going in blind.  Also, if you're creative enough, and have enough resources at hand, it's hard to get yourself boxed in.  At least so far in life, I've always managed to figure my way out of whatever I got myself into.  So far, LOL.

The existing VMware cluster has six hosts, and vSAN really needs a minimum of three hosts, and the cluster workload was somewhat less than fifty percent of capacity, probably less than a third, which would be "tolerable" for a while.  This would imply I could split the six-host cluster, convert half of it, the first three hosts, to OpenStack, move everything from VMware to OpenStack, then convert and add the other half of the cluster to the now-larger OpenStack, and I'd be done.

No big deal, probably a long weekend's work.  With respect to installing OpenStack, I'm sure the experience and docs have improved over the years, so a couple of "apt-get install" lines on a fresh Ubuntu install and I'll be running.  Looking back, I was sooooooo optimistic.  "In the old days" when I was experimenting with OpenStack many years ago, this new-fangled "Kolla-Ansible" project was too new to use, so I figured I'd use the OpenStack repos and just install components one at a time by hand, just like the good ole days, with some help from my existing Ansible infrastructure to replicate across all the hosts and keep the cluster identical.
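On Ubuntu 20.04 the hand-install route starts with enabling the Ubuntu Cloud Archive, roughly like this (a sketch of the very first step, not the whole walkthrough):

    # Enable the Ubuntu Cloud Archive for Yoga on a fresh 20.04 host
    sudo add-apt-repository cloud-archive:yoga
    sudo apt update && sudo apt upgrade
    # Then install components one at a time, e.g. the CLI and the identity service first
    sudo apt install python3-openstackclient keystone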

So in more detail, my plan, version 1.0, looked like this:

Decommission as many legacy cruft/junk/obsolete VMs as possible.  If it's not there anymore, I don't have to move it or worry about making it work.  All clusters accumulate old junk; what a glorious time for a spring cleaning marathon!

Shut down vSAN, for safety's sake, and run everything off the slow NAS over NFS.  I've done this before and have some HUGE storage servers that run off this NAS; it's just a matter of some Storage vMotion moves, then shutting off the vSAN.  NFS to a spinning-rust HDD NAS is slow, a lot slower than vSAN on an all-SSD cluster with 10G ethernet, but it's "fast enough for a while".

Infrastructure prep before starting.  Why is my netboot.xyz infrastructure not working for PXE netboot/installation of software?  I've certainly installed ESXi over the network before, along with other OSes.  Get a head start on documenting everything in Netbox, tidy up the ethernet switch VLANs and stuff, prep Ansible for future hosts and services...

Shut down ESXi hosts 1, 2, and 3 safely and cleanly.  Clean, dust, and relabel the hardware, and update the docs in Netbox and in the Active Directory DNS (which is actually a cluster of Samba servers acting as DCs; it has worked great for many years now).

Bare metal Ubuntu installs on hosts 1, 2, and 3.  The docs imply the OpenStack "Yoga" release works best on Ubuntu 20.04.  I plan to have Bind DNS running on all bare metal hosts for OS Designate to dynamically configure DNS for the cluster; it will be interesting to see how that interacts with the existing Active Directory install (spoiler: it was challenging and needed a lot of reworking...).  Originally I planned to move my DHCP servers to bare metal installs, but that plan changed along the way also.  One plan that worked pretty well was setting up the entire OpenStack cluster as a giant NTP cluster.  My innovative and creative solution to OpenStack not handling USB passthru like VMware was to just set up Docker on one of the hosts and not virtualize those USB-requiring applications at all.  This step also includes proving out the LAN, and in retrospect my MTU testing was not careful enough, leading to some considerable trouble later on.
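In hindsight, the careful version of that MTU test is cheap: do-not-fragment pings at the payload sizes that matter, from every host to every other host (the addresses here are placeholders):

    # Jumbo frames: 9000-byte MTU minus 28 bytes of IP/ICMP header = 8972-byte payload
    ping -M do -s 8972 -c 3 10.100.0.12
    # And a standard 1500-byte path should still pass a 1472-byte payload
    ping -M do -s 1472 -c 3 10.100.0.12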

OpenStack has a wildly different project architecture than VMware.  VMware sets up simple hypervisors on all hosts, then runs "the cool stuff" as virtualized hosts, so vCenter lives as just another image.  OpenStack kind of inverts that architecture: you need a controller host (or, ideally, cluster) that runs the database (MySQL/MariaDB), the message queue (RabbitMQ), and similar services on bare metal, while the hypervisors ONLY run production workload.  So there's some infrastructure to set up, those applications along with memcached, etcd, etc.
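On Ubuntu that controller-side plumbing amounts to roughly the following (a sketch following the pattern of the official install guide; the RabbitMQ password is obviously a placeholder):

    # Controller node infrastructure services
    sudo apt install mariadb-server python3-pymysql rabbitmq-server memcached etcd
    # Give the OpenStack services their own message queue user
    sudo rabbitmqctl add_user openstack CHANGE_ME
    sudo rabbitmqctl set_permissions openstack ".*" ".*" ".*"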

At this point I planned to follow the online OpenStack "Installation Tutorial" documentation.  I was a little nervous that most of the docs referenced Ubuntu 18 or even Ubuntu 14...  I designed a sensible dependency tree: Keystone first, Swift before Glance, Placement before Nova, etc.  I figured I'd set up the basics now, and experiment with more advanced features like Magnum and Mistral and Zun and Trove later.
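The nice thing about going service by service is that the smoke test after each one is basically the same: source the admin credentials and poke the API (the admin-openrc file name follows the install guide's convention):

    # Quick sanity checks after each service comes up
    . admin-openrc
    openstack token issue      # is Keystone answering?
    openstack service list     # which endpoints are registered so far?
    openstack image list       # once Glance is in place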

"Move everything from VMware to OpenStack".  Sounds simple.  In retrospect, as usual, the last ten percent of any effort takes ninety percent of the time, recursively...

Shut down and clean up and re-install the last three hosts, 4, 5, and 6.  This will be the end of the VMware cluster, so I don't have to be quite so careful, and with the previous experience this should be pretty smooth.

Add hosts 4, 5, and 6 to the existing cluster consisting of hosts 1, 2, and 3.  I have some experience messing around with OS Swift, so I know it'll take some effort, but it's quite possible.
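On the Nova side, growing the cluster should mostly be a matter of installing nova-compute on the new hosts and telling the controller to discover them; the Swift side additionally means adding the new devices to the rings and rebalancing.  Roughly:

    # On the controller, once nova-compute is running on hosts 4, 5, and 6
    openstack compute service list --service nova-compute
    su -s /bin/sh -c "nova-manage cell_v2 discover_hosts --verbose" nova
    # The Swift rings then need the new devices added and a rebalance, e.g.
    # swift-ring-builder object.builder rebalance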

Add "cool new extra" services to the larger capacity full size OpenStack.  Maybe try out Trove for databases instead of spawning off more Docker containers, that sort of thing.  At the time, I planned on setting up an ELK stack, which I've done before, to replace Log Insight.

I expected to do one last sweep thru the entire system to update docs, update hardware labels, verify and maybe even test new backup strategies.

The plan sounded great...  However, IT work is similar to my military experience, in that no matter how well designed a plan is, the plan never survives contact with the enemy.  The mental effort of making a plan provides the "virtual experience" to improve the odds of success, so it's not wasted time.

Stay tuned for the next chapter!

Tuesday, June 28, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 001 - Intro

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 001 - Intro

"A long time ago in a galaxy far, far away..."

I personally set up a small VMware cluster of a mere six ESXi hosts a couple years ago.  I enjoyed the "cheap" ESXi-Experience as a VMware Users Group member for many years.  For those unfamiliar with the VMUG Eval-Experience, or whatever it's called now: for a very modest annual fee you get the "keys to the kingdom" of everything VMware has, under a license only permitting educational use, and I learned a lot about VMware over the years.

I enjoyed the experience.  I have to admit, after years of use, that vSAN is the best SAN I've ever used.  I scripted things that don't need to be scripted with Orchestrator, followed up on hardening suggestions from VMware Operations Manager, and experimented with vCloud Director and Integrated Containers.  Log Insight is not necessarily better or worse than a FOSS ELK stack, but it is different.  Like many people, I tried setting up NSX without understanding the basic concepts and thus obtained an education under fire in how NSX works; cool system, but a lot of mental gymnastics and work.  Upgraded vCenter many a time, had to reinstall it a couple times over the years, LOL.  Remediated quite a few ESXi hosts.  Like all the rest of you, I was wondering how I was going to keep using a Flash-based web interface for vCenter when everyone was removing Flash support from their web browsers in a couple months, but VMware came thru seemingly at the last minute and everything worked perfectly on the new HTML5 web client.

Like everyone else who's honest about their cluster admin experience, I had plenty of comedy over the years.  Take a snapshot of a server before upgrading, upgrade, and wonder why there's a fraction-of-a-terabyte snapshot file slowly filling up my NAS months later.  The usual fun of putting host 2 in maint mode to shut it down and upgrade the memory, but instead pulling the plug on host 3, sticking the memory in it, and wondering why the cluster went nuts.  The odds of having five identical ethernet cables on six hosts across five VLANs all properly labeled and installed ... are about as low as you'd think, leading to fun with primary and backup links when upgrading the vSAN and vMotion interfaces to a new 10G ethernet switch.  Back in the old days (vers 5.x ?) you had to install special custom hardware drivers for 10G ethernet, which hilariously crashed ESXi when you upgraded ESXi far enough past those old drivers.  Much like how "everyone knows" that UPS hardware is less reliable than wall outlet power in the long run (well, depending where you live, LOL), VCHA sounds like a good idea but was always less reliable than a single VC install; I blew up VCHA quite a few times.  It seems that if something crazy could theoretically happen, it happened to me, I figured it out, and I fixed it.  I have this "VMware Sysadmin" stuff figured out.  Every time, eventually, everything always ended up working really well!

But all good things eventually end; VMware has been sold, I'm not sure about the future, and there's not as much money to be made in VMware consulting as I would have hoped (although you know where to reach me if you need skilled, experienced help with a VMware cluster).  It sounds like newer versions of ESXi will require a writable boot drive for logging, which I don't want anyway and which can't be wedged into my existing hardware without extensive effort.  The Eval-Experience program is a lot cheaper than list price, but it's still about half the annual cost of electricity to operate the cluster.  I am just done with VMware.

I had some really positive experiences with OpenStack over the last decade or so, both in my home lab and when working for other people.  Eventually, I will start the project to convert my VMware cluster to OpenStack.  Eventually.  For a couple years now it has been "Eventually".  Meanwhile, other things to do.

Sometimes, "Eventually" happens faster than you'd expect.  One fine afternoon this winter, the power supply for host number three failed.  No big deal, that's why we keep spares.  About five minutes after replacing the power supply, I see in the IPMI KVM window, that host number three's boot device also failed some time in the past (no alarms in vCenter?), so now it can't boot ESXi.  No big deal, that's why we keep spares.  Next surprise discovery is the vastly underutilized yet very important netboot.xyz installation for PXE based operating system installation over the LAN is also not working.  I'm getting perturbed by that point.

I guess this is as good a point as any to initiate the big conversion project, replacing VMware with OpenStack?

Stay tuned for the next chapter!