Sunday, July 31, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 033 - Basic Kolla-Ansible Installation

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 033 - Basic Kolla-Ansible Installation

Here are the main references I used to install the basic Kolla-Ansible system:

https://docs.openstack.org/kolla-ansible/yoga/user/quickstart.html

https://docs.openstack.org/kolla-ansible/yoga/reference/index.html

I'm not going to reword the entire installation process; the docs are pretty good. I'm only going to comment on undocumented or special issues from my installation. Read the official docs, then skim this post, then you're ready.

First thing to note is that I use my LAN Ansible system to install the OS packages listed in the instructions. I began my "for reals" installation at the VENV stage of the instructions.

The docs advise setting some common-sense Ansible config options, which I agree are good options, but the docs only explain how to set them in a non-VENV environment. In my VENV environment I put my config options into /root/ansible.cfg and that seems to work.
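
For reference, the options in question are roughly these (this is what the quickstart lists; Ansible reads ./ansible.cfg from the current working directory, which is presumably why dropping the file in /root works when you run everything from /root):

# /root/ansible.cfg
[defaults]
host_key_checking=False
pipelining=True
forks=100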

My specific flavor of Kolla-Ansible was "kolla_base_distro: ubuntu" and "kolla_install_type: source"

After auto-generating my deployment-grade passwords, I manually edited a few entries in the /etc/kolla/passwords.yml file, like "keystone_admin_password", because the auto-generated web UI password is, I'm sure, very high security, but also incredibly inconvenient. This will also, BTW, be your "skydive" web user interface password. Ditto for Kibana's password.
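
The sequence is roughly this; kolla-genpwd fills in every blank entry in passwords.yml, and then you hand-edit the handful you actually type by hand:

kolla-genpwd
# then change the entries you care about, for example:
grep keystone_admin_password /etc/kolla/passwords.yml
vi /etc/kolla/passwords.yml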

Remember to edit /etc/kolla/ml2_conf.ini, which has the concept of a limited range of VLAN IDs (why not all of them? I don't know). I used bond12:1:1000 to allow VLANs 1 thru 1000 on the physical network attached to interface bond12, although I only use 10, 20, 30, ... up to 60.
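
As a sketch, the relevant stanza looks something like this; the label to the left of the range is the physical network name (I happen to name mine after the bond interface, so adjust it to whatever your physnet is called):

[ml2_type_vlan]
network_vlan_ranges = bond12:1:1000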

The docs, and many or at least some users, strongly encourage editing the single enormous globals.yml file, which seems unmaintainable to me. So I do not edit or comment out ANYTHING in globals.yml (disclaimer: see the end of the post); I only override in /etc/kolla/globals.d files with names like neutron.yml.

So, comments about my /etc/kolla/globals.d override files:

kolla_ansible.yml - Don't forget the distro and VIP addresses, and the keepalived virtual router ID needs to be different from Cluster 1's keepalived virtual router ID, because the two clusters are on the same LAN and will "fight" each other if the IDs match (which is hilarious to watch, although non-productive).

cinder.yml - I enabled the LVM backend, and I use a custom SSD volume group name with swift as my backup driver. I go back and forth on the idea of using NFS or swift (and rclone, etc) as my backup driver. It's so slow I end up not using backups and rely on backing up the NAS and using Heat/Ansible to deploy new instances faster than I can restore a backup, LOL.

glance.yml - I disable the file backend, and enable the swift backend.

neutron.yml - Let's just say Neutron is going to get a VERY long post later on. But initially, remember to set your network_interface, neutron_external_interface, and kolla_internal_vip_address. Note that Kolla-Ansible uses OpenVSwitch, whereas a hand-rolled install per the instructions uses LinuxBridge, so that was exciting later on.
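
To make the first four concrete, here is a condensed sketch of those override files. The variable names are real Kolla-Ansible options, but the values (interfaces, addresses, router ID, volume group) are examples from my setup, not recommendations. swift.yml is covered separately below.

# /etc/kolla/globals.d/kolla_ansible.yml
kolla_base_distro: "ubuntu"
kolla_install_type: "source"
kolla_internal_vip_address: "10.10.20.60"
keepalived_virtual_router_id: "52"

# /etc/kolla/globals.d/cinder.yml
enable_cinder: "yes"
enable_cinder_backend_lvm: "yes"
cinder_volume_group: "cinder-ssd"
cinder_backup_driver: "swift"

# /etc/kolla/globals.d/glance.yml
glance_backend_file: "no"
glance_backend_swift: "yes"

# /etc/kolla/globals.d/neutron.yml
network_interface: "bond34.10"
neutron_external_interface: "bond12"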

swift.yml - Remember to set up your rings by hand using something similar to the procedure in:

https://docs.openstack.org/kolla-ansible/yoga/reference/storage/swift-guide.html

although it doesn't have to be that elaborate and complicated. Helpfully, the hand-rolled install docs, the kolla-ansible docs, and the swift docs all do it somewhat differently, so by looking at all three at once you can get the idea of what's going on quicker than if they all documented the process in a single way.
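
As a very rough sketch of the object ring, using bare swift-ring-builder commands (the Kolla guide wraps these in a docker run of its swift image, but the builder syntax is the same); the IP, port, device name, and builder parameters are placeholders for my disks and hosts, you repeat the add for every disk on every storage host, and you repeat the whole thing for the account and container rings:

swift-ring-builder /etc/kolla/config/swift/object.builder create 10 3 1
swift-ring-builder /etc/kolla/config/swift/object.builder add r1z1-10.10.20.54:6000/d0 1
swift-ring-builder /etc/kolla/config/swift/object.builder rebalance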

Bootstrapping bug:

Because I'm not modifying the globals.yml file and am doing all configuration in individual yml files in globals.d, that triggers a bug:

https://bugs.launchpad.net/kolla-ansible/+bug/1970638

Instead of setting a dummy variable, I make ONE edit to /etc/kolla/globals.yml to set the base distro to ubuntu. Now bootstrapping works without error...

There are certainly a lot of possible targets for the kolla-ansible command, but mostly all you really have to remember is "bootstrap-servers", "prechecks", "deploy", and "post-deploy", run in that order.
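
In other words, the happy path boils down to roughly this, using the venv's kolla-ansible and whatever inventory file you copied and edited (mine lives in /root):

kolla-ansible -i /root/multinode bootstrap-servers
kolla-ansible -i /root/multinode prechecks
kolla-ansible -i /root/multinode deploy
kolla-ansible -i /root/multinode post-deploy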

Shockingly, it all pretty much worked on the first day.  Cool!

I'm not really into using the demonstration data pack, as I have to extensively edit it to work on my provider network and I end up deleting most of it once I know what I'm doing.  However, for complete OpenStack noobs, this is a working config:

/root/kolla-ansible/share/kolla-ansible/init-runonce

Note that Kolla-Ansible is incredibly, unbelievably slow compared to a hand-rolled installation. For example, if you have no idea how to set up a provider VLAN ID range, on a manual install you'll find docs referencing the ml2_conf.ini file, and you can hand-edit that file on a three-host cluster, restart services, and try it; the cycle time might be below one minute per experiment, so even if you get lost and confused you can get it up and working in three, five, maybe ten minutes worst case. Even a blind dog finds a bone once in awhile. However, on Kolla-Ansible, a reconfigure, even on a small cluster with SSDs and 10G ethernet, is going to take maybe a half hour per experimental cycle, so even the simplest task like setting the network_vlan_range might take hours or an entire day unless you already know how to do it.

This is not even counting the enormous delays in figuring out, in general, whether you edit /etc/kolla/globals.yml, or /etc/kolla/globals.d/something.yml, or /etc/kolla/neutron/ml2_conf.ini (that doesn't work, BTW), or create an ml2_conf.ini file in /etc/kolla/config or in /etc/kolla/config/neutron, or something else. Eventually you learn to use tags like --tags neutron to save some time, and you learn everything is configured with about the same strategy, so what worked in a general sense on Swift will probably work in a general sense on Neutron WRT file names and things. So, it gets better after awhile.
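
For example, once the cluster exists, a scoped reconfigure is the main time saver; same inventory file as before:

kolla-ansible -i /root/multinode reconfigure --tags neutron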

If you would like to see the files I use to configure my Kolla-Ansible, take a look at the backup directory in:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts

That link above is intentionally public because I want people to read it; take a look, poke around, learn something, borrow some ideas.  Most importantly, have fun!

Tomorrow, the frustrating adventure of Centralized Logging.

Stay tuned for the next chapter!

Saturday, July 30, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 032 - Bare Metal Install of OpenStack Hosts 4, 5, 6

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 032 - Bare Metal Install of OpenStack Hosts 4, 5, 6

By "bare metal install" I mean a bootable Ubuntu 20.04 server installation on bare metal hardware (not inside the OpenStack that isn't even installed yet LOL).  Kolla-Ansible is installed on top of a cluster of working Ubuntu servers.  So this chapter is where I install, configure, and test those bare metal hardware servers, tomorrow to become controllers and compute nodes and network nodes and stuff like that.

This is somewhat repetitive of Chapter 010 where I installed Ubuntu 20.04 on hosts 1, 2 and 3.

https://springcitysolutions.blogspot.com/2022/07/adventures-of-small-time-openstack_01240235463.html

First, here are the Kolla-Ansible reference documents I used for the "Yoga" release. By the time you read this retrospective, maybe "Zed" or something newer will be the current release, I don't know.

https://docs.openstack.org/kolla-ansible/yoga/user/quickstart.html

https://docs.openstack.org/kolla-ansible/yoga/reference/index.html

In retrospect, I should have left the M2 SSD alone; I had to reconfigure it later based upon:

https://docs.openstack.org/kolla-ansible/yoga/reference/storage/swift-guide.html

There's a "kind of bug" in that doc, admittedly it is written for a different OS; anyway the partition on Ubuntu in the mkfs ends in p1, not just 1, which I think would be obvious to an Ubuntu sysadmin?  Anyway, pay close attention to that.

I'm continuing in Plan 2.0 with my strategy from Plan 1.0 of using swap partitions instead of the as-installed swap files. Writing this as a retrospective, I still do this even in Plan 3.0 and it works great!

I use Ansible (not Kolla-Ansible, just an install on my LAN) to configure hosts for common configs, think of things like Active Directory SSO or NTP or even just a sensible option file for the VIM editor or the agent for Zabbix monitoring.  This was all uneventful.  I still worry if I run my LAN Ansible against my Kolla-Ansible configured hosts, that'll mess something up, but so far, so good!  I wish Ansible as a tool had some functionality to compare two recipes and report if there's any conflict between them.

Originally, in Plan 2.0 I intended to connect my existing Portainer Docker monitoring system to my own Portainer-Agent containers on the Kolla-Ansible docker installation, as I thought it would be interesting to monitor Kolla-Ansible Docker containers via the magic of Portainer.  Eventually when setting up Kuryr networking and Zun container management "something" messed up Portainer Agent connectivity; maybe in the future if I run something like a Zun container of Portainer Agent connected directly to a provider network, it would work.  I simply haven't explored this alternative in depth, not yet anyway.  It would be VERY helpful as a front end for Kolla troubleshooting, if I could make it work.

https://www.portainer.io/

This Kuryr/Zun effect also seems to stop me from running "bare metal" containers outside the Zun ecosystem; I have some hardware devices plugged into USB ports connected to containers which VMware did quite easily, as it supports USB passthru.  However I will need a new solution for USB hardware when I roll out Plan 3.0.  Will discuss that later; it involves an Intel NUC and the unusual direction of taking software loads OFF the cluster and putting onto bare metal.

The physical hardware has five ethernet ports, two 1G, two 10G, and an IPMI port.  The way I'm doing physical networking on Plan 2.0 is to bond the two 1G ethernets and use those for provider network VLAN trunking, and the two 10G ethernets together as a simple 802.1 Access single VLAN management port, and then let Kolla-Ansible run "everything" over the resulting 20G management interface, which has worked out VERY well.

This results in a couple installation issues.

First, there is the interface dance. PXE boot by default does not speak 802.1Q VLAN tagging, so you have to configure eth0 as a simple access port on the management LAN, do a full install over that single access port, then configure the bonded 10G ports and move all management traffic to them, then reconfigure the 1G ports as a bonded provider-LAN trunk instead of a single access port. It sounds more complicated than it is; it's the "interface dance". It just takes some time.

The second installation issue is the stereotypical NetGear managed ethernet switch headache with bonded interfaces.  Logically one would think you admin up the individual ports, configure them as members of a LAG, then admin up the LAG and you're up and running.  And it will look great in the monitoring, and will simply not pass any traffic at all, very frustrating.  For some reason that likely seemed important, configuring a LAG on this old version of NetGear firmware will admin down the LAG-as-a-port.  So the complete NetGear checklist includes admin up the port as a port, admin up the LAG as a (virtual?) port, configure the LAG and admin up as a LAG, double check everything is STILL admined up, then it'll likely work and pass traffic, assuming you set your VLAN correctly and the other side (the E200 server hardware) is also configured correctly.  Don't forget other NetGear hilarity like being able in the web user interface to select a MTU of 9198 bytes or whatever it is, but the hardware only passes a little over 9000 bytes.  NetGear certainly puts the excitement back into networking!  Although not necessarily exciting in a good way LOL.

The third installation issue is that there is not much documentation out there on configuring multiple bonded ethernets with VLANs for Ubuntu. You'd think that, Ubuntu being a very popular OS, someone would have posted a solution for every possible network config; however, that is not so.

Here's what /etc/netplan/os6.yaml looks like on my sixth OpenStack host as of the conclusion of Plan 2.0 but before rolling out Plan 3.0. As you can see, it's still using the Plan 1.0 AD domain controllers for DNS instead of the freshly installed Plan 2.0 AD domain controllers; or rather it would be, if I didn't fix it in my resolv.conf and then "chattr +i /etc/resolv.conf" like most sysadmins end up having to do, LOL:

# os6.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      mtu: 9000
    eno2:
      mtu: 9000
    eno3:
      mtu: 9000
    eno4:
      mtu: 9000
  bonds:
    bond12:
      mtu: 9000
      dhcp4: false
      dhcp6: false
      interfaces: [ eno1, eno2 ]
      parameters:
        mode: balance-xor
        mii-monitor-interval: 100
    bond34:
      mtu: 9000
      dhcp4: false
      dhcp6: false
      interfaces: [ eno3, eno4 ]
      parameters:
        mode: balance-xor
        mii-monitor-interval: 100
  vlans:
    bond34.10:
      id: 10
      link: bond34
      mtu: 9000
      addresses: [ 10.10.20.56/16 ]
      gateway4: 10.10.1.1
      critical: true
      nameservers:
        addresses:
        - 10.10.250.168
        - 10.10.249.196
        search:
        - cedar.mulhollon.com
        - mulhollon.com
    bond34.30:
      id: 30
      link: bond34
      mtu: 9000
      addresses: [ 10.30.20.56/16 ]
    bond34.60:
      id: 60
      link: bond34
      mtu: 9000
      addresses: [ 10.60.20.56/16 ]
#

And don't forget that netplan file is a .yaml file, so every space is critical; get one off and nothing works. This works for me; best of luck to you!

A final note: in the subinterfaces like bond34.30, the VLAN is selected by the id: parameter (in this case, 30), not by the name of the subinterface, as is commonly and incorrectly believed. I would advise never deliberately making them mismatch, but the id does override the subinterface name if they differ.

Tomorrow, the Cluster 2.0 hardware meets Kolla-Ansible, and it will be a LONG post.

Stay tuned for the next chapter!

Friday, July 29, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 031 - Prepare OpenStack Hosts 4, 5, 6

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 031 - Prepare OpenStack Hosts 4, 5, 6

This is somewhat repetitive of Chapter 009 where I prepped hosts 1, 2 and 3.

https://springcitysolutions.blogspot.com/2022/07/adventures-of-small-time-openstack_0394771970.html

I had a new NTP infrastructure, so I had to configure my IPMI NTP settings to match.

Likewise, the DNS settings are different, now that I'm bootstrapping off the OpenStack Cluster 1 infrastructure instead of the old VMware cluster.

Over the years I've found it to be a false economy to cheap out on cable labels, so I purchased a refill for my Brady cable labeler, which worked well.

Sorry for the short post, but that's how projects are, sometimes.

Tomorrow is bare metal install on Cluster 2.0.

Stay tuned for the next chapter!

Thursday, July 28, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 030 - Decommission ESXi hosts 4, 5, 6

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 030 - Decommission ESXi hosts 4, 5, 6

This is somewhat repetitive of Chapter 008 where I decommissioned hosts 1, 2 and 3.

https://springcitysolutions.blogspot.com/2022/07/adventures-of-small-time-openstack_01472243191.html

The main process improvement I applied was to very carefully inspect each ethernet cable to make sure they are labeled properly and plugged into the correct ports.

It was a strange feeling shutting down vCenter for the last time.  I've had a vCenter installation since at least the v5 era.  There were occasional issues, upgrades "resolved" by reinstalling vCenter, etc.  But overall it was a bittersweet moment to flip the switch.

The power supply for server os4 blew up on July 10th. I distinctly remember scanning it and the rest of os4 with my thermal camera mere weeks before, and no unusual heating patterns were seen. Sometimes it's just a capacitor's time to go. My records show I've now replaced 60% of the E200 series SuperMicro power supplies; they only last a couple years and then fail dead with no output. Aftermarket replacements have never failed... so far. I use Observium for monitoring and it can graph various SNMP IPMI parameters automatically, including the incoming 12 volts from the external power supply, which was rock steady right until it failed. Monitoring and keeping good records sometimes helps, sometimes does not. After the first couple supplies burned out, I started keeping a stock of aftermarket replacements on hand, so a repair only takes five minutes.

Tomorrow we prep the hosts for OpenStack Cluster 2.0.

Stay tuned for the next chapter!

Wednesday, July 27, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 029 - The Plan, v 2.0

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 029 - The Plan, v 2.0

In the "good old days" back when Fred Brooks from "The Mythical Man-Month" and I were kids, there was the concept of the dreaded "Second System Effect". I am very familiar with this hazardous illness that affects software projects and aim to avoid it all costs while designing my new Plan 2.0 for the second cluster.

https://en.wikipedia.org/wiki/Second-system_effect

Many aspects of Plan 2.0 are reactions to experiences implementing Plan 1.0.

One exciting event during the middle of the Plan 1.0 era was my Unifi hardware firewall died completely after a reboot, in the process of moving it from a docker container hosted controller on VMware to a new docker container host for the controller on OpenStack.  Nothing to do with OpenStack directly, it was just that piece of hardware's time to go.  So it went.  At a most inopportune time.  My old VMware cluster had access to all VLANs on my network, including the layer 2 link VLAN between my cablemodem and the dearly departed hardware firewall.  With respect to networking, this is not my first rodeo and in the old days I used to set up Linux based internet firewalls using ipchains, later iptables, and a variety of software router appliance solutions.  Normally I'd spin up a linux image with two interfaces, one connected to the cablemodem using DHCP the other static IP address on the main LAN, and use about three lines of iptables rules to NAT between those interfaces and my entire LAN would be back on line.  I cannot do that with Plan 1.0 of OpenStack because I set up a single flat provider network on my LAN, so I set this up on the old VMware cluster, and it worked fine until I acquired and installed a new internet firewall a couple days later.  The summary of this long story is Plan 2.0 will include VLAN access to ALL of my VLANs just like VMware had, instead of a single flat provider interface.  Then I could implement this workaround on OpenStack just like I did on VMware.  Flexibility is the key to a happy infrastructure solution.  VLANs on OpenStack ended up requiring a remarkable amount of extra Neutron Service work, details in a later post...

Everything in Plan 1.0 was about hand configuring instances then automating with Ansible.  That got tiring pretty fast and I planned to automate with the CLI, which changed during the Plan 2.0 era into orchestrating with OpenStack Heat Templates which work AMAZINGLY flawlessly well.

One interesting problem with OpenStack networking is that the default port security groups are semi-smart and "know" what IP addresses the instance should be using. That's great if you properly configure the IP address at initial port configuration time, but if you slam ports around between self-service and provider networks while experimenting and learning, or convert on-the-fly from DHCP to static addresses, or make similar modifications, built-in port security will oft get confused, block all that "fake" traffic, and your instances will be unreachable. Or you can completely shut off port security, which isn't any worse than using bare VMware (without NSX). As a reaction to that experience, part of Plan 2.0 is configuring and using security groups for each instance. For some applications this is pretty trivial: OK, allow TCP port 80 for HTTP based services, fine, whatever, no great effort there. For other applications like Samba or old versions of NFS, it's a bit more work, LOL. Details in a later post.

In Plan 1.0, hand rolling my own homemade installation "worked", but poorly and at great effort. Looks like Plan 2.0 will use Kolla-Ansible; everyone implies it "just works". Turns out there's a lot that's undocumented about it, or poorly documented, anyway. It turns out that taking a VERY complicated system and then layering another VERY complicated automation system on top of it doesn't make anything simpler or easier to learn, although eventually I was successful. Details later on.

I figured it would be exciting to watch the Ansible on my network and Kolla-Ansible try to "fight it out" WRT OpenStack server configuration.  Despite writing this series as a retrospective, even months later I'm still a little apprehensive about the two systems fighting each other.  So far, mostly so good?  Nothing terrible has happened ... yet.  I still worry that some completely well intentioned change in my LAN Ansible config, maybe for NTP or iSCSI or something, will totally blow up my Kolla-Ansible and it will be a bear to fix something I did not write and is extremely large and complicated.

My DNS design will be completely revamped based on experiences with the Designate service in Plan 1.0. Basically, instead of trying to run six bare metal auth and resolution hosts on all six hosts, I will go back to having four resolver instances for my four AD domain controllers; auth for Designate in Kolla-Ansible is handled by a Docker container set up by Kolla-Ansible, serving a new domain name, and my resolvers forward to those auth servers for that one specific domain. Which seems complicated, but it is, in some ways, simpler, and it works quite well. Details in a later post.

There were some other changes in Plan 2.0 required while implementing Plan 2.0, but to "keep things fair" I will stop here, with the plan as it was optimistically designed at the start of the Plan 2.0 project, and explain what had to be "emergency changed" later on, as it happened in the narrative. Turns out Kolla-Ansible has some interesting opinions, some in direct opposition to the documentation for hand-rolled installations. I thought Kolla-Ansible would simply be "automated manual installation instructions" but it's definitely its own separate flavor of OpenStack, which led to some interesting conflicts later on.

Tomorrow we shut down ESXi and vCenter and decommission the hardware on the old VMware cluster.

Stay tuned for the next chapter!

Tuesday, July 26, 2022

A helpful trio of OpenStack gitlab repos for reproducible infrastructure

Here is a helpful trio of OpenStack related gitlab repos.

They are scripts and config files and demonstrations I use for OpenStack that might be useful or inspirational or at least interesting for others to see.

The following three Spring City Solutions LLC GitLab repos work best when used together as a set:

https://gitlab.com/SpringCitySolutionsLLC/glance-loader

"glance-loader" repo provides repeatable installable images for OpenStack.

I'm not saying you have to use MY script to load your images.  But you DO need to use A script to reproducibly load your images, so at least take some inspiration from mine.

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts

"openstack-scripts" repo provides various OpenStack Heat Template IaaS orchestration templates.

All of my infrastructure is orchestrated, from creating the projects to creating the networks and subnets and everything installed on them, like virtual instances and Docker containers in Zun, all of it. I do not manually create anything in the Web UI; it's all orchestrated, which means it's documented and reproducible with zero percent chance of human error. Also, it's very fast compared to configuring by hand.

https://gitlab.com/SpringCitySolutionsLLC/ansible

"ansible" repo provides Ansible configs for the above Heat orchestration templates.

Pretty much my entire home LAN is completely orchestrated and scripted.

The Ansible handles configuration options ranging from the fairly trivial like VIM editor options being consistent across all systems, thru mid-range tasks like connecting all my servers to Zabbix monitoring system, up to complex tasks such as taking a bare Ubuntu or FreeBSD server and having it successfully connect to my Active Directory domain, including full SSO and file server access, the only manual part of the process is running "net ads join" with the domain admin password. 

What this trio provides is consistent, reproducible configuration across all my systems, and extreme speed, for low effort. I can do more with OpenStack orchestration and automation in an hour than I could do by hand in a week's labor.

Adventures of a Small Time OpenStack Sysadmin Chapter 028 - Move everything from Legacy VMware cluster to OpenStack Cluster 1

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 028 - Move everything from Legacy VMware cluster to OpenStack Cluster 1

Moving on to brighter, happier tasks.  Such as moving everything off the old VMware cluster to the new OpenStack Cluster 1 under Plan 1.0, so that when I'm done the old VMware cluster hardware can be repurposed into OpenStack Cluster 2 under Plan 2.0.

As usual with migration projects, this finds the technical debt in automation and orchestration and backup systems... so the last 10% of systems end up taking 90% of the time, as is usual with IT projects. In the end everything was moved by 28-June-2022. It seemed to take forever because I took the opportunity to bring everything up to sensible standards WRT testing and verifying backup processes "for reals", updating documentation in Netbox, cleaning up DNS entries, etc.

As an example, there were domain controllers set up before 2017, upgraded periodically, backed up, tested, monitored, always operational, but never reinstalled in over five years. Samba and FreeBSD "just work!" and normally that's good news. So I got my first experience in half a decade of transferring FSMO roles using Samba, or even just adding and removing domain controllers at the command line. Also had fun replicating my SYSVOL volumes, and setting users' home directories to new fileserver hostnames for the first time in half a decade, LOL. Fun, educational, sometimes exciting, but mostly it was time consuming. Everything worked on OpenStack Cluster 1, and nothing remained on the old VMware cluster except vCenter, in the end...

Tomorrow we talk at length about Plan 2.0 and how it differs from Plan 1.0

Stay tuned for the next chapter!

Monday, July 25, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 027 - OpenStack Freezer Backup Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 027 - OpenStack Freezer Backup Service

My reference for installing Freezer:

https://docs.openstack.org/freezer/yoga/install/

Install notes:

I like the idea, because I've tried to back up Cinder volumes using Swift and RClone, and for some odd reason I've never figured out, it runs at like 500K/sec (and no, it's not an MTU setting problem, LOL). So an automated backup system sounds really handy... as long as it works. Which it doesn't.

Freezer is another one of those dead projects where the marketing material is fresh and exciting and it's part of the recent Yoga release and all that, but it apparently hasn't worked since the Ubuntu 16.04 era, the docs haven't been touched since Python3 was released, and there are bugs on StoryBoard about it being uninstallable, dating back to OpenStack releases from years ago, etc.

The main problem with Freezer seems to be the docs suggest it runs best on ElasticSearch 2.3.0 which was released on March 30th 2016.  I'm not comfortable running a DB that old, nor am I sure the software would be compatible with a DB that's six years more recent.  So I used the MySQL option.  Or tried to.

"freezer-manage db update-settings" does not seem to work: "Option update-settings not found"

The docs in general are so old and unmaintained that the URLs across the entire set of docs pre-date the existence of opendev.org, as near as I can tell that would be four years ago.

The current Ubuntu packages start freezer-api using "service freezer-api restart"

The administrator manual explains you need to install the agent before you install the api, and the install manual explains you have to install the api and ignores the topic of installing an agent at all.

I eventually got everything installed, although it didn't work, and I figured my time was best spent rapidly moving on to Plan 2.0, where maybe Kolla-Ansible will figure it out, or whatever.

I contemplated filing a large pile of bugs on the ancient installation documentation, but why bother improving the docs of apparently abandoned software? It'll never get used anyway. So I worked on more productive topics instead...

I really like the idea of Freezer Service, and I wish OpenStack had it, or an installable version of it, anyway. 

Tomorrow we stop installing software, and do other things.

Stay tuned for the next chapter!

Sunday, July 24, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 026 - OpenStack Watcher Optimization Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 026 - OpenStack Watcher Optimization Service

My reference for installing Watcher:

https://docs.openstack.org/watcher/yoga/

Install notes:

The usual problem where the Python2.7 CLI package name is python-watcherclient as displayed in the docs, but the project moved to Python3 some time ago and the new Python3 CLI package name is python3-watcherclient. Seriously, it's like ALL the abandoned projects have this minor installation bug.

The Yoga docs advise installing on Ubuntu 16.04 (LTS), which was, IIRC, released six years ago, and its five-year security maintenance ended over a year ago, so clearly Watcher has been a completely abandoned project for some years now. The marketing in the docs always reads so optimistic and upbeat and feature-full, and then when you install a dead project the detailed instructions are all about how you require a Commodore 64 floppy drive as part of the install process, resulting in that sinking feeling in the stomach: this isn't going to be good. And it wasn't.

Default region_name in the template is regionOne and needs to be changed to RegionOne, another ongoing "OpenStack-wide" problem.

There was some craziness where su -s /bin/sh -c "watcher-db-manage --config-file /etc/watcher/watcher.conf upgrade" did not work nearly as well as watcher-db-manage --config-file /etc/watcher/watcher.conf create_schema, with numerous logs about tables not existing, but in the end I got it to 'work'.

I guess Watcher has been part of a "hazing" thing for new OpenStack admins for about a decade now, where it's traditional to ask Watcher to optimize your cluster and let it try to migrate some instances; because the new admin never tested migration, and would never guess that OpenStack would let itself destroy the instances, the instances are destroyed in failed migrations. Hope you have good backups! Apparently this hazing "joke" has been going on since at least the mid-2010s. So, yeah, I found out I have good backups too, just like every other new OpenStack admin who tries the Watcher optimization service.

Into the cool sounding Freezer tomorrow.

Stay tuned for the next chapter!

Saturday, July 23, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 025 - OpenStack Aodh Alarm Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 025 - OpenStack Aodh Alarm Service

My reference for installing Aodh:

https://docs.openstack.org/aodh/yoga/install/

Install notes:

The usual problem where the Python2.7 CLI package name is python-aodhclient as displayed in the docs, but the project has moved into Python3 some time ago and the new Python3 CLI package name is python3-aodhclient.  

I had a weird issue with aodh-dbthreshold resulting in a message "Could not load threshold" although it seems OK?  Whatever.

I believe at the end of installation process you should also restart the aodh-expirer service along with the other services, but I never had live data to test and verify, so I don't really know.  Restarting it along with the other services didn't seem to hurt anything?

It was a relatively painless installation.

With respect to Ceilometer being dead and not recording any data, I therefore have minimal operational experience with Aodh, although I was able to look at its UI and imagine how it might report alarms if it had any incoming data stream to alert upon.  This is one of those services I planned to return to after fixing other things including, obviously, Ceilometer, but I never did, and moved on to later phases and plans. 

I wish I had more to say about Aodh.  Looks nice.  I think it would have been nice to use?  Cool idea.  Oh well?

Tomorrow, its Watcher.

Stay tuned for the next chapter!

Friday, July 22, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 024 - OpenStack Ceilometer Telemetry Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 024 - OpenStack Ceilometer Telemetry Service

My reference for installing Ceilometer:

https://docs.openstack.org/ceilometer/yoga/install/

Install notes:

The usual problem where the Python2.7 CLI package name is python-gnocchiclient as displayed in the docs, but the project moved to Python3 some time ago and the new Python3 CLI package name is python3-gnocchiclient. No big deal. I planned to file the doc bugs I found once I got Ceilometer working; I never got Ceilometer working, so I never filed any bugs. I wonder if other people have a similar workflow; there are a lot of obvious doc bugs.

The gnocchi-api Ubuntu package does not seem to install a gnocchi-api service to restart. There appears to be some uWSGI-based workaround that is OK, or maybe not?

This seems to be handled, possibly successfully, by /etc/apache2/sites-available/gnocchi-api.conf as per:

https://stackoverflow.com/questions/45374863/devstackceilometergnocchi-error-403

or maybe not as per:

https://bugs.launchpad.net/ceilometer/+bug/1949305

/var/log/apache2/gnocchi_error.log shows "unable to initialize coordination driver" when trying to run "gnocchi status"

coordination_url in gnocchi.conf is supposed to move to [DEFAULT] section in the config file.  Seems obvious.

Also, if I try to telnet to port 6379 on my controller, nothing answers; I don't have a Redis database set up at this time, so the docs should make it clear that a Redis DB is necessary.

/etc/ceilometer/pipeline.yaml seems to be a new file, not an edited file.  See:

https://github.com/openstack/kolla-ansible/blob/master/ansible/roles/ceilometer/templates/pipeline.yaml.j2

You can tell by this point in the process of setting up Plan 1.0 I was already fed up with doing a hand installation of OpenStack and was starting to use the docs for Plan 2.0 era Kolla-Ansible as my doc source for Plan 1.0, LOL.

On Ubuntu, you can determine the path to "cinder-volume-usage-audit" by running "which cinder-volume-usage-audit"; it should be /usr/bin/cinder-volume-usage-audit, so edit the cron line accordingly (see the sketch below).
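
A minimal sketch of the resulting crontab line, assuming that path (the five minute interval and the --send_actions flag are what the Ceilometer install docs show; adjust as needed):

*/5 * * * * /usr/bin/cinder-volume-usage-audit --send_actions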

There is no cinder-api as part of the default install.

The proper way to handle the compute IPMI sudoers entry is to create a new file in sudoers.d, not edit the sudoers file directly, as the other components do.

For swift metrics, python-ceilometermiddleware should be python3-ceilometermiddleware, the usual Python2.7 to Python3 transition package rename issue run into so many other times.

Upon install, my /etc/ceilometer/pipeline.yaml file on Ubuntu was missing/empty.

Obviously adding only the gnocchi publisher section will result in nothing being stored because there is no input section.

I found a sample pipeline.yaml online and adapted it to my needs.

To convert Gnocchi's storage to swift, see:

https://github.com/openstack/kolla-ansible/blob/master/ansible/roles/gnocchi/templates/gnocchi.conf.j2

So, um, yeah, a little more challenging to install than any other service in the entire project.  All that work for mere telemetry.

Everything seems to work although no data is being stored.  It seems actually accessing the data is even more complex than the extremely complex task of installing the telemetry service.

IT professionals used to the VMware or ELK stack or Zabbix or Nagios or Observium or LibreNMS or even smokeping experience may be a little disappointed when they meet OpenStack and find there are two monitoring stacks: the legacy Ceilometer, which is unmaintained, reportedly does not scale, and is very difficult to make work; or its replacement Monasca, which supposedly scales well but is also unmaintained and is inoperable in Kolla-Ansible because the project's required version of ElasticSearch is incompatible with the overall system version, so it boot-loops the container, LOL. It's just a different world if you have used something like VMware Ops Manager in the past, or even just bare "as-installed" vCenter.

My personal solution to OpenStack not providing a viable monitoring/telemetry service was to leverage my existing systems. I continue to use Observium talking to the IPMI SNMP controller to report data like CPU temps and fan RPMs, and I continue to use Zabbix to report operating-system-level data like CPU use percentage on the bare metal OS and memory use and so forth. And, honestly, it works great!

Additionally, for awhile, I was trying to use Portainer to monitor the Docker containers in Kolla-Ansible, but Portainer got agitated by some changes made by Zun/Kuryr so I stopped using that, although it was pretty cool for awhile.

It would be pretty awesome if something like Observium or Zabbix were integrated into OpenStack.  But even un-integrated, it works pretty well.

Tomorrow we experiment with Aodh.

Stay tuned for the next chapter!

Thursday, July 21, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 023 - OpenStack Magnum Container Infrastructure Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 023 - OpenStack Magnum Container Infrastructure Service

My reference for installing Magnum:

https://docs.openstack.org/magnum/yoga/install/

Installation of Magnum was uneventful and everything went per the manual.  That was the good news.  Now, for the rest of the story...

One minor bug in the Magnum install docs is the Python package name for the client is listed as python-magnumclient but it's actually python3-magnumclient. I opened a bug and provided the exact file to change and the solution at:

https://storyboard.openstack.org/#!/story/2010156

I also ran into the template problem I've run into other times, where the default is "region_name = regionOne" but region names are case sensitive, so that needs to be changed to match everything else: "region_name = RegionOne". It gets to the point where, instead of filing a bug on seemingly every OpenStack service, I should file a bug on Keystone suggesting people doing manual installs set it to "region_name = regionOne" to match every other service in the overall project. Whatever, so over that now, did not feel like filing a bug. Just posting this so people know it's an issue.

My thought process in deciding to use Magnum went like this: On my old VMware cluster I have CONSIDERABLE experience setting up virtual hosts, in fact bunches of virtual hosts, to act as Docker Container-containers, and Docker Swarm members, and K8S cluster members and stuff like that, all via various forms of manual labor up to extensive orchestration and automation using various VMware products and Ansible etc. So the idea of being able to spawn off a Docker Swarm at the click of a button, instead of taking maybe thirty minutes to mostly automate and orchestrate a swarm, seemed appealing.

Also, I tried setting up Zun/Kuryr container infrastructure by hand and pretty much failed, and decided if I want container infra I need to wait for Plan 2.0 and Kolla-Ansible, where it was supposedly effortless (which turned out to be pretty accurate). So in the interim I had the brainstorm that anything that runs on simple Docker containers can be wedged into running on a Docker Swarm, even a crazy one-host Docker Swarm. So I could use Magnum and its Docker Swarms to replace the uninstallable (by hand) Zun/Kuryr. This seemed like a great way to host Docker containers during the Plan 1.0 era. During the Plan 2.0 era I'd just stick containers in Zun, so Magnum would be a temporary band-aid. I also had a third reason to install it: Magnum might be fun for deploying temporary clusters for educational and testing purposes.

So... I know online I've read a lot of stuff about people using Magnum to spin up K8S test clusters and nobody uses it for Docker anymore, etc.  But I had my reasons or justifications for trying to make it work.

It did not work.

"ERROR: Property error: : resources.api_lb.properties: : Property allowed_cidrs not assigned"

Great, just great...  "Push button receive cluster" sounds like a useful productive service, "Push button receive mysterious failure message" not so useful of a service.

The problem, oft seen in OpenStack, of trying to hide and abstract a standard service away behind custom code, is that it's very difficult to troubleshoot someone else's hidden abstraction code. So it's easier for me to polish up my orchestration and automation to a fine shine, and use that to spin up my own Docker Swarms and Docker container-containers and K3S clusters, than to troubleshoot someone else's efforts to abstract and automate. I ran into this problem later with Trove and Manila and some other service-services; I already have orchestration and automation of certain services, and I fully understand my orchestration and automation and my backup systems and troubleshooting access, and it's not really worth the investment of large amounts of my time to understand someone else's opaque system, especially when theirs doesn't work at all and, in comparison, my systems work great.

So I got VERY frustrated with Magnum after a couple days struggle, sent a post to the OpenStack users mailing list asking if anyone else has a working Magnum installation and can collaborate to get it working and/or improve the software and docs, heard crickets, tossed it, and moved on with life.

If I need a Docker Swarm I have Heat templates and Ansible and they can spin one up faster than I can fix Magnum, and even if Magnum worked, it's possible my scripts might work just as fast as Magnum anyway. It's kind of like the pragmatic way to fix bugs in Trove, where the fastest way to fix a Trove bug is to toss a Docker mysql:latest container into Zun and no longer use Trove, LOL; I can be up and running in less than a minute, and fixing Trove would take a lot longer than that minute. The entire Trove project can be replaced by a single Zun resource line in a template. Much like Magnum.

Yeah, I got a little frustrated with this one.  Well, maybe a lot frustrated.  Bye Magnum, I'll replace you with a very small shell script!

As a change of pace, tomorrow is not Friday, its Ceilometer Day.

Stay tuned for the next chapter!

Wednesday, July 20, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 022 - OpenStack Heat Orchestration Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 022 - OpenStack Heat Orchestration Service

My reference for installing Heat:

https://docs.openstack.org/heat/yoga/install/

Installation of Heat was uneventful and everything went per the manual.

Heat is probably the most fun I've had with OpenStack. For something named "Heat" it's very cooooool.

Originally, in Plan 1.0 timeframe, I automated all the infrastructure and some instances using the openstack cli and scripts and frankly just experimented doing stuff by hand in the web UI.  Slow, tedious, prone to human error, but it "works".

The problem I rapidly ran into is that scripting is very slow; if you change one thing but have to apply a script of a hundred things, it takes a while. Also it's kind of unpredictable, from memory, which components are idempotent and therefore safe, and which are not. I vaguely recall project-create scripting was safe, because project names are unique, so running the script over and over is quite safe, at least WRT creating projects. However, it was something like Neutron subnets where the system simply doesn't care, and you can create multiple identical-ish subnets by re-running the config script, one extra subnet created each time you re-run the script. Well, maybe it wasn't Neutron subnets, but it was some "big problem", whatever it was, anyway. The stereotypical script way to fix that situation is to delete the component before adding it, which is great during initial development, but once stuff depends on it, it's not funny anymore to delete and create things that other components depend upon.

Which is why people invented orchestration a long time ago. Orchestration looks about the same as scripting, to a first approximation, and on an empty green-field deployment it behaves exactly the same. However, the algorithm for scripting is that the computer reads the script top to bottom, one line at a time, and does exactly what that line tells it to do, exactly, each time the script is run. And sometimes that's what you need for a process, but not everything is a process. Whereas orchestration repeatedly re-reads thru the template, trying its best each time to make real-world things match what it reads, skipping over everything that already matches, until it can scan the entire template and find no mismatch with reality, at which point the terminology gets weird and it sets its status to "created" or "updated" depending on starting conditions, which is kind of dumb, but whatever. The status of a successfully completed orchestration should use a more intuitively descriptive status word like "matches", not "created".

In Plan 2.0 everything is orchestrated except setting my personal user's password and a couple microscopic things like that. It's really cool being able to add or fine-tune some minor detail and, in seconds, reality matches my updated template. And of course templates are source-code-like text, so you can store them on GitLab and so forth.

If you're experienced with Ansible and related technologies, at a very detailed level they are scripting in the sense of "in order, top to bottom" but at a distant high level Ansible is used to reach orchestration-ish goals very successfully.

After installation, before you know what you're doing, you should probably examine this vast collection of useful Heat template examples; if you can't find something cool here, you need to recalibrate your concept of cool:

https://opendev.org/openstack/heat-templates

During actual production use, I spend a lot of time at this URL, titled "OpenStack Resource Types"; it's a really nice reference-class document:

https://docs.openstack.org/heat/yoga/template_guide/openstack.html

Pretty much everything in the Web UI, REST API, CLI, Python library, and Heat resources matches in some way; even if it's displayed or described somewhat differently, each conceptual activity has a matching doppelgänger in the other interfaces. I like that aspect of OpenStack, it's so versatile: it doesn't really matter if you're writing REST API calls or clicking the web UI by hand or templating, it's always possible to do everything and it's all interoperable.

Today's blogpost might have been a sappy emotional love-letter to the Heat Orchestration Service, but tomorrow's topic is Magnum, which is, um, not.  That should sound ominous, because it is, LOL.

Stay tuned for the next chapter!

Tuesday, July 19, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 021 - OpenStack Nova Compute Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 021 - OpenStack Nova Compute Service

My reference for installing Nova:

https://docs.openstack.org/nova/yoga/install/

Nova-conductor is required on a controller, and can not be installed on a compute node. That seems to be what prevents the controller node os3 from hosting compute services.  Until I discovered this, I planned on making my controller a compute node, in at least a limited sense.  The docs alluded to controllers not being compute nodes but I thought that was merely a best practice or some kind of system-load limitation.

For some reason the default template provides region_name = regionOne, and I found out the hard way that region names are case sensitive. So that was weird and moderately annoying and definitely time consuming. I did not file a bug report as I'm not entirely clear whether it's reproducible or what caused the weird template region_name.

There is a LOT of config in Nova that relates to Cinder and Neutron, also some config in Cinder and Neutron that relates to Nova.  So the configuration process is a little circular over the last couple days.  Most services in OpenStack do not have these circular configuration dependencies; it does get better.

As a practical example of the above paragraph, if you mismatch your metadata_proxy_shared_secret passwords, you will have a difficult time troubleshooting the resulting cloud-init problem. However, there is surprisingly little configuration needed for a minimal system, so I strongly recommend administering OpenStack on a multi-monitor desktop; it's much easier when you can display the configurations that need to match simultaneously and notice when they don't match.

I ran into an interesting bug related to OpenStack Nova and FreeBSD:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255231

There's an incompatibility between FreeBSD 13.0 and OpenStack WRT UEFI, such that OpenStack can successfully signal the operating system to shut down, but after the OS halts, the instance itself never powers off. I was able to verify that yes indeed, this does not work at all on FreeBSD 13.0, necessitating manually shutting down the instance in OpenStack. However, I was also able to verify it's fixed in FreeBSD 13.1. FreeBSD 13.1 was released on 12 May 2022, right around the time I was doing this OpenStack project, so that's convenient. After I verified it was fixed, someone else double-checked and closed the FreeBSD bug report. Perhaps I'm biased, but to me, it seems FreeBSD people are the nicest people, no sarcasm, it's just how it is with everyone I've interacted with over the years. FreeBSD is the chillest software project.

If you want OpenStack compute migrations to work, you need to manually enable SSH between the hosts as detailed in:

https://docs.openstack.org/nova/yoga/admin/ssh-configuration.html

This is not mentioned in the simplified instructions above.  Even more excitingly, if you attempt to migrate an instance, and the migration fails, OpenStack kills the instance permanently.  Hope you have backups or extensive orchestration or it was just a test image!  The above paragraph relating to SSH not being configured will permanently kill any instance you try to migrate.  However, I also permanently killed a couple migrating instances due to a temporary Neutron problem that went away (probably was some weird startup order dependency, but that's just a slightly educated guess). 

"ERROR oslo_messaging.rpc.server nova.exception.InternalError: Failure running os_vif plugin plug method: No VIF plugin was found with the name linux_bridge"

Seriously, are you kidding? It's there, right there, look, and it seems to work all the rest of the time that it's not migrating something. After multiple days of frustration (sadly not exaggerating) I moved on to other struggles temporarily, which is a legit troubleshooting technique, and when I returned to this problem, after not knowingly changing anything related to it, migrations were working perfectly. Sometimes having things work perfectly can be Mildly Frustrating. Did not file a bug because I could not even remotely reproduce the problem. Never after did I have even the smallest problem with migrations; fun and trouble-free beyond this point. OpenStack migrations, especially live ones, are not nearly as smooth as VMware vMotion, but they certainly do work. I have a gut-level guess that some tangential distant software component, maybe in Glance or Placement, got out of sync with the configs or live data in Neutron and they fought it out and this was the result, but I have no proof. I had focused my troubleshooting HARD on the big three, Neutron, Nova, and Cinder, and found absolutely nothing wrong, so I'm just assuming the problem was somewhere else. Who knows, maybe Designate locked some data while experimenting with setting DNS names, causing a sync issue in Neutron. I really don't think the problem was in the big three.

So, yeah.  Migrations.  Either they work flawlessly, or they zorch your image permanently and I hope you keep good backups because you're gonna need them.  Not exactly the VMware vMotion flawless experience you might hope for.

I did not document it, but I burned several days kind of having fun with Nova, just pushing thru various operating systems and trying stuff.  Nova is fun.  Maybe not as fun as Heat, but its fun.

Speaking of Heat, that's tomorrow's topic.

Stay tuned for the next chapter!

Monday, July 18, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 020 - OpenStack Neutron Network Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 020 - OpenStack Neutron Network Service

My references for installing Neutron:

Administration Guide aka OpenStack Networking Guide

https://docs.openstack.org/neutron/yoga/admin/config.html

Configuration and Policy References

https://docs.openstack.org/neutron/yoga/configuration/

Networking service Installation Guide

https://docs.openstack.org/neutron/yoga/install/

Note that in the usual circular fashion Nova has to be configured to contact Neutron later on.

A manual install like this uses LinuxBridge, while Kolla-Ansible uses OpenVSwitch.  So experience with LinuxBridge will not transfer over to Kolla-Ansible.

I was confused by the local_ip setting in /etc/neutron/plugins/ml2/linuxbridge_agent.ini for VXLAN; the answer, in the end, is that it should be the host's IP address on my overlay network.
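
For illustration, a hedged sketch of the relevant stanza, with 10.30.0.11 standing in for whatever your host's overlay network address actually is:

# /etc/neutron/plugins/ml2/linuxbridge_agent.ini
[vxlan]
enable_vxlan = true
local_ip = 10.30.0.11
l2_population = true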

I configured my MTU as per:

https://docs.openstack.org/neutron/yoga/admin/config-mtu.html

This was fairly uneventful; just remember to set the MTU in both neutron.conf AND ml2_conf.ini.  And don't forget to test using long ping packets.  I did run into a web UI problem with old firmware on an old NetGear managed ethernet switch where it's possible to configure the UI to a longer packet length than the device can actually pass, which was "funny" although easy to detect and fix.
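
As a hedged sketch of what "both files" means (option names per the config-mtu doc; 9000 is just my jumbo frame choice, and the ping target is whatever sits on the same segment):

# /etc/neutron/neutron.conf
[DEFAULT]
global_physnet_mtu = 9000

# /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2]
path_mtu = 9000

# test with long pings that forbid fragmentation
# (9000 byte MTU minus 28 bytes of IP/ICMP headers = 8972 byte payload)
ping -M do -s 8972 10.10.1.1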

I followed the instructions here to configure my overlay networks, it was uneventful and just works:

https://docs.openstack.org/neutron/yoga/admin/deploy-lb-selfservice.html

A pretty good reference for metadata is:

https://docs.openstack.org/nova/yoga/user/metadata.html

I experimented with linking Designate to Neutron, which technically worked, but in the long run I prefer organizing my DNS somewhat differently than my project organization.  Some DNS entries should be under project DNS zones, others global; it can be complicated.  It's a nice idea to have automatic DNS entries.

For the manual install, I made one flat provider access network, which turned out to be a big mistake when my hardware firewall crashed: I wanted to spawn a software firewall, but had to do that on the old VMware cluster instead, delaying the deployment of OpenStack Cluster 2 until the firewall was replaced.  This is why Plan/Cluster 2.0 has provider networks for all my VLANs, so I could run pfSense as a firewall if I had to.

Configuring my flat provider network looked like this:

openstack network create --share --provider-physical-network provider --provider-network-type flat provider1 --mtu 9000

openstack subnet create --subnet-range 10.10.0.0/16 --gateway 10.10.1.1 --network provider1 --allocation-pool start=10.10.248.2,end=10.10.251.253 --dns-nameserver 10.10.200.41 provider1-v4 --no-dhcp

I shudder to imagine what happens if you leave "--no-dhcp" off a provider subnet.  I suspect the result would be very bad indeed if two DHCP servers, the one in OpenStack and whatever existing external solution you have, get into a fight on a LAN.

The Neutron metadata server, as configured by default, will put nonsense in the /etc/resolv.conf file.  Nothing some hand editing and chattr +i /etc/resolv.conf can't fix.  Most automation invented to "make /etc/resolv.conf simpler" just makes it harder.

I experimented with installing Kuryr because I wanted to do OpenStack containers, and this went very poorly.  I am uncertain if the docs or I were in error, but somehow I installed local Python packages for iproute2 that were compatible with Kuryr and INCOMPATIBLE with Neutron; in fact, Neutron was impossible to restart until I wiped my Kuryr install.  So, that was moderately painful and inadvisable.  This was about when I started thinking Plan 2.0 would involve Kolla-Ansible rather than hand installation, as Kuryr and Zun work out of the box on Kolla but are seemingly uninstallable by hand as I currently understand things.

One of the big problems with cookbook solutions like the installation guide is they encourage people to not learn how the system works. Then pile on multiple layers of abstraction, with no or minimal debugging facility, and problems develop later.

At the time of cutting and pasting, I was setting up a rather unambitious "flat" provider network using one unbonded ethernet connection and LinuxBridge as my virtual switch, and the cookbook instructions ask me to add a single line to linuxbridge_agent.ini, "physical_interface_mappings = provider:eno1".  How nice of OpenStack to let me tag my provider interface as a provider.  I never noticed, later on when I cut and pasted the network configuration line, that I was attaching my OpenStack network to "provider1", not to eno1 or something like that.  Have to admit, it works great although it confuses sysadmins.
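
To make the name soup explicit, here's a hedged sketch of how the pieces relate: "provider" is the physical network label, eno1 is the actual interface, and "provider1" is merely the name of the Neutron network object itself:

# /etc/neutron/plugins/ml2/linuxbridge_agent.ini
[linux_bridge]
physical_interface_mappings = provider:eno1

# the network create references the "provider" label, never eno1 directly
openstack network create --share --provider-physical-network provider --provider-network-type flat provider1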

This vast simplification of how the very complicated network system works on OpenStack made life VERY exciting later on, when I was using Kolla-Ansible and its default OpenVSwitch, trying to set up multiple VLANs on a bonded pair of ethernet ports, and Kolla-Ansible doesn't use the name "provider1".  Oh, and don't forget, this is OpenStack, where the only easily accessible error message you'll get is that instance deploys fail, and of course Centralized Logging is cool when it works, but there's an ElasticSearch version incompatibility in Kolla-Ansible Yoga (or, was an incompatibility, maybe fixed by the time you read this?) so your only log access is logging into Docker containers and poking around as best you can.  But that "fun" is a story for another day.

Anyway, today, Neutron is working.  Tomorrow is Nova day.

Stay tuned for the next chapter!

Sunday, July 17, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 019 - OpenStack Cinder Block Device Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 019 - OpenStack Cinder Block Device Service

My primary reference for installing Cinder is here:

https://docs.openstack.org/cinder/yoga/install/

I ran into some notes online suggesting I should expect to run into problems requiring "lvm_suppress_fd_warnings = False", yet I did not run into any problems like that; a false alarm.  Might have been a problem fixed long ago in an older release of OpenStack.

The /var/log/cinder/cinder-volume.log file was filling up with volume name errors so as per:

https://docs.openstack.org/cinder/yoga/admin/ts-cinder-config.html

I ended up configuring my /etc/tgt/targets.conf to look like this:

include /etc/tgt/conf.d/cinder.conf

default-driver iscsi

That seemed to fix the problem some months ago; I declined to enter a bug report at that time, and I don't remember why (sorry).

I was getting some interesting error messages in the cinder log along the lines of Cinder trying to run lsscsi and nvme-cli as part of its processes.  The docs do not specify installing those packages, so they are not installed, which results in error messages in the logs.  I installed those packages and no more errors appear in the logs.  I began the process to file a bug, and found one already exists from about one year ago:

https://bugs.launchpad.net/kolla/+bug/1942038
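
The workaround is trivial; on the cinder-volume host it's something like:

sudo apt-get install lsscsi nvme-cli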

It's becoming a common theme in this blogpost series that reading bug reports needs to be part of the OpenStack installation process, because most mistakes and problems were long ago documented with fixes and workarounds.

If you follow the instructions at this URL:

https://docs.openstack.org/cinder/yoga/install/cinder-storage-install-ubuntu.html

You will be instructed toward the end to "service tgt restart", but depending on what workload you've applied to tgt in the process of experimenting and testing, you MAY need to "service tgt forcedrestart".  I could not justify filing a bug on this because it depends on what the admin has done historically, and it's pretty obvious how to work around the problem if you encounter it.

This next Cinder situation is out of order, but it does not seem you can configure the OpenStack region for a Cinder install in cinder.conf, and the default template for Nova comes up with the wrong default os_region_name (NOTE it's case sensitive) for Cinder.  You'll be fixing that later in nova.conf when you set up Nova.  This is a Nova problem that is Cinder-adjacent, so I mention it in the Cinder notes.  You can get led down a false path if Nova reports it cannot talk to Cinder, but Cinder seems fine, and it's all because Nova has the wrong os_region_name in the default template.  Although it's not really bug-worthy for Nova, in that you shouldn't be relying on the template to configure your region name anyway.
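
The eventual fix, as a hedged sketch (RegionOne is just the common default; use whatever region name your Keystone endpoints actually report):

# /etc/nova/nova.conf
[cinder]
os_region_name = RegionOne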

Anyway, tomorrow is Neutron day.

Stay tuned for the next chapter!

Saturday, July 16, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 018 - OpenStack Glance Image Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 018 - OpenStack Glance Image Service

My primary reference for installing Glance is here:

https://docs.openstack.org/glance/yoga/install/

I did not bother setting the oslo_limit sections, as the documentation was not wrong so much as confusing, and I don't intend to implement quotas on myself for fun.

Surprisingly, for such a scriptable task, I don't see any formal systems to upload Glance images.  So I wrote one.  Feel free to use this:

https://gitlab.com/SpringCitySolutionsLLC/glance-loader

The way it works is you enter a directory for your OS distro, then the version for that distro, then run download.sh, optionally verify the image is good via your own custom processes and methods, then run upload.sh, then optionally clean.sh to save disk space if you want.

Why use my glance scripts?  Well, first of all, for downloads, I use resumable download methods where possible, so an interrupted download can be restarted.  Then I calculate the md5sum or whatever checksum the project publishes, display a link to the OS project's list of checksums, and display the checksum as of the time I downloaded the image I actually use.  In my upload script, I actually bother to set the metadata values that most admins do not set, which I think will be helpful in the long run; why bother looking up the minimum RAM setting yourself if I've already done the work for you?
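
For a flavor of what "bothering to set the metadata" means, here is a hedged, generic example (not literally what my upload.sh runs; the file name, sizes, and property values are illustrative):

openstack image create "debian-11" --file debian-11-generic-amd64.qcow2 --disk-format qcow2 --container-format bare --min-disk 2 --min-ram 512 --property os_distro=debian --property os_version=11 --public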

Even if for "security reasons" or whatever, you can't use my scripts, at least they might provide some inspiration for your own internal glance loading automation.

Tomorrow is Cinder day, and Cinder is huge and complicated so I will have plenty to write about.

Stay tuned for the next chapter!

Friday, July 15, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 017 - OpenStack Swift Object Store

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 017 - OpenStack Swift Object Store

My primary reference for installing Swift is here:

https://docs.openstack.org/swift/yoga/install/

Note there is quite a difference between disk partitioning and mounting for hand installed Swift and Kolla-Ansible installed Swift.  For hand installed Swift, format a disk as XFS and mount it at /srv/node/swift as per the online docs; just expect to do it completely differently for the Plan 2.0 Kolla-Ansible installation later on.
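
For the hand install, the disk prep boils down to something like this (the device name /dev/sdb and mount options are illustrative; the Swift install guide lists its exact recommended options):

mkfs.xfs -f /dev/sdb
mkdir -p /srv/node/swift
mount -o noatime /dev/sdb /srv/node/swift
# plus a matching /etc/fstab entry so it survives reboots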

The controller instructions for Ubuntu on page:

https://docs.openstack.org/swift/yoga/install/controller-install-ubuntu.html

Suggest installing a Python 2.7 library that no longer exists; simply specify python3-keystoneclient instead.  After figuring it out on my own and attempting to file a documentation improvement bug report, I found someone else already filed a doc-improvement bug a mere two years ago.  Such is life when administering OpenStack; sometimes the best docs are the five hundred open bug reports, so read them all before starting installation...

https://bugs.launchpad.net/swift/+bug/1893769

When I was testing and experimenting, partially following along with the docs at this URL:

https://docs.openstack.org/swift/yoga/install/verify.html

I ran into a situation where the above link recommends using demo-openrc, but that user didn't have access to see status, so I ended up using admin-openrc instead.  I messed around with things enough that I'm not sure who did it wrong, myself or the docs, so I didn't file a bug.  As these blogposts are being written as a retrospective based on my extensive project documentation notes, by the implementation of Plan 3.0 I no longer have access to a hand installed Swift, so I can't replicate it.  Nonetheless, for those playing along at home: if demo-openrc does not work due to permissions issues, try admin-openrc, because that will possibly work and certainly can't make the situation worse.

I encountered some MTU mismatch problems when experimenting with Swift.  Container creates will work because they fit in one packet, but object uploads time out; it's all very tedious but typical of an MTU mismatch, a problem I've seen before.  If you're following along at home, this is the first time the system gets pushed hard enough to run into MTU problems.  It's easy to diagnose with some simple long ping packet commands, and then force your interface and switch MTUs to working values.  I could complain and file a bug on NetGear firmware from 2017, but I'm not even using that old managed switch anymore, thus I can't replicate it.  It was something "hilarious" like the web UI for the switch permitted me to set an MTU as high as 9198 bytes, but experimental tests with the ping command showed no packets longer than 9018 bytes would actually pass (or something like that), so I set it all to 9000, switch port, interface, everything, and later ended up replacing that switch anyway, so I no longer have access.  Anyway, the moral of the story is to always test different length ping packets during initial testing; it'll save you time in the long run.

The list of Swift-associated projects is very cool to browse.

https://docs.openstack.org/swift/yoga/associated_projects.html

rclone is not in the list, although I plan to make use of it later.  In the past I've successfully used rclone in non-Swift situations to do cloudy file transfer stuff, so when I get around to setting up rclone Swift access, I expect it'll work at least as well as it did in those past situations.

https://rclone.org/swift/

I submitted a doc improvement bug to the swift-launchpad as you can see at this link:

https://bugs.launchpad.net/swift/+bug/1981617

Tomorrow is Glance day.

Stay tuned for the next chapter!

Thursday, July 14, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 016 - OpenStack Horizon UI

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 016 - OpenStack Horizon UI

Horizon is a pretty easy install.  Note that the docs on page:

https://docs.openstack.org/horizon/yoga/install/install-ubuntu.html

Refer to OPENSTACK_KEYSTONE_URL on port 80; if you've been following along in the instructions, you set it up on port 5000, and there is a note in the docs on that topic.

I note that Kolla-Ansible also installs Keystone on port 5000.  So it's just kind of odd to see a suggestion of running Keystone on port 80; I have not seen that anywhere else.

On identical hardware, Kolla-Ansible's Horizon is vastly faster.  I suspect something to do with different WSGI configurations.

You will need the "Horizon Plugin Registry" later on when installing add-on services.

https://docs.openstack.org/horizon/yoga/install/plugin-registry.html

When I first set up Horizon and Keystone I placed pretty much the entire LAN into one project, which is slow to use and slow to page thru multiple screens in Horizon.  My solution for Plan 2.0 of breaking everything into smaller projects makes Horizon much more usable.  My architecture advice would be: if you find yourself with multiple pages of anything in Horizon, you probably need to break your projects down into smaller components, for ease of use reasons if nothing else.

Stay tuned for the next chapter!

Wednesday, July 13, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 015 - OpenStack Designate DNS Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 015 - OpenStack Designate DNS Service

My strategy in setting up Designate before setting up storage or compute services was to get something VERY simple working with the complete system from database to web UI.  Then, if I have any problems with something complicated like volume storage, I can rule out system-wide issues such as mysql not working or whatever "infrastructure-level" problems.  This strategy worked very well.

https://docs.openstack.org/designate/yoga/install/

https://docs.openstack.org/designate/yoga/install/install-ubuntu.html

The installation was almost flawless.  I had to restart designate-api and designate-central one extra time before updating my DNS pools, which is not mentioned in the docs.  Odd, but not a big deal.
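
That extra restart plus pool update looked roughly like this (hedged; the install guide wraps the pool update in a su to the designate user):

systemctl restart designate-central designate-api
su -s /bin/sh -c "designate-manage pool update" designate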

There was a minor problem discovered later on: I manage Docker hosts remotely using Portainer, which has a default port number of 9001.  Which, on an OpenStack cluster, is the default port number of the Designate API.  So I reconfigured Portainer and its agent to connect to the OpenStack hosts on port 19001 instead of 9001, and that worked great.

The good news about Kolla-Ansible is it puts the APIs on the virtual IP address so port 9001 should be open on the bare metal.  The bad news is Kolla-Ansible is not entirely compatible with bare metal Docker especially after installing Zun and friends.  See Plan 3.0 for how I host bare metal Docker hosts requiring hardware access / pass-thru.

The Samba Internal DNS Resolver is not very smart, and unfortunately, being an AD domain, all my hosts point to it for DNS resolution.  All the Internal Resolver can do is respond with any configured A or PTR record if the request is in its domain, OR pass the request to its configured forwarders if it's not in a configured domain.  Which, superficially, sounds like a good start to a DNS server, but as always the devil is in the details.  I set up my Designate to configure its zones on my bind9 resolver "forwarders", which are downstream of my Samba Internal DNS Resolver.  Yes, I know, it's not ideal DNS architecture to merge authoritative and resolution DNS on the same system, but if you're careful it'll work very well.  Anyway, my plan was to delegate NS records in Samba Active Directory such that subdomains of the domain set up in AD would be forwarded to the resolvers, which are also the Designate authoritative servers.  If the Samba Internal DNS resolver were as smart as a BIND DNS server, it would indeed work using that configuration.  But ... it is not, so it does not work; you can put a protocol analyzer on the network and then feel unhappy, or just trust me, it does not work.  You can configure NS records in the AD side of Samba, but the built in Internal DNS resolver in Samba simply ignores those NS records.  Later on, sometime around Plan 2.0, I implemented the very common alternative of simply using multiple domains: cedar.mulhollon.com remained my Samba domain, and openstack.mulhollon.net became my OpenStack Designate domain, but that's a story for a much later blogpost.

Somehow the docs completely skip installation of the CLI client for Designate.  On Ubuntu, installation of the client is a one liner:

# apt-get install python3-designateclient

https://docs.openstack.org/neutron/yoga/admin/config-dns-int.html

Integrating Designate with Neutron.

Although it makes sense and is obvious in retrospect, this link has a table describing which Neutron components support which Designate attribute.  It boils down to: the only component pair that can't be linked is dns_name and a network, which makes a certain amount of sense.  I was kind of hoping that linking a hostname to a network would hypothetically be a sensible way for a sysadmin to define a wildcard A record for a domain, but that's not how it's done.

Note that I discovered later on, in my HEAT orchestration experiments around the Plan 2.0 timeframe, that if you design a HEAT template with DNS attributes and try to apply it to a Neutron resource that does NOT have Designate installed and configured, the HEAT template will fail.  So that's kind of annoying.  I had hoped and expected that Neutron would accept the unused data into the database, and later on, after configuring Designate, it would process the backlog or at least start working on new configurations, but that is not the case, unfortunately.  So if you're EVER going to use Designate in the future, you should probably set it up and integrate it with Neutron very early in your overall cluster installation process.

Another thing that does not work until AFTER Designate is integrated with Neutron is integrating Designate with your external dns service.

It's kind of buried in the fine print, but remember you need to restart neutron-server after integrating with Designate.  You don't have to power cycle and reboot everything, remember this is not Windows; just restart the neutron-server service and it's all good.
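
As a hedged sketch of the integration itself (option names per the config-dns-int doc; the [designate] auth section is omitted here and the domain name is an example):

# /etc/neutron/neutron.conf
[DEFAULT]
external_dns_driver = designate
dns_domain = openstack.example.com.
# ... plus a [designate] section with the API url and Keystone credentials

# /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2]
extension_drivers = port_security,dns_domain_ports

# then the restart mentioned above
systemctl restart neutron-server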

https://docs.openstack.org/neutron/yoga/admin/config-dns-int-ext-serv.html

This link is probably the most important link on the page; you have to read and wrap your mind around all the use-cases in order to successfully use Designate with Neutron.

This link explains some interesting behavior you will probably run into, which might otherwise be confusing.  For example, in a Use-Case-1 situation, if you try to configure a Neutron port with a dns_name that does not match a Nova instance hostname, you will probably become frustrated.  AFAIK if you want IPv4/IPv6 dual stack and only NAT the IPv4, then Use-Case-3a is your only practically viable option.  Of course my OpenStack experience as an end-user and later as an admin covers quite a span of time, so it's possible times have changed and I should probably closely re-read this link.

On the other hand, if you're just doing "typical normal stuff" then Designate reliably does pretty much exactly what you'd expect it to do, which is awesome.  It is nice to have this link to read when you're trying to implement something well off the beaten path.

https://opendev.org/openstack/designate-dashboard

There is no documentation about the Designate Horizon web UI plugin; however, it was pretty easy to figure out what to do on the OpenStack controller:

apt-get install python3-designate-dashboard

systemctl reload apache2.service

My overall experience with Designate is that it's pretty cool, easy to use, and reliable.  I have always been disappointed that virtualization services like VMware do not have a boxed, prepackaged DNS solution similar to Designate.  VMware does have a half dozen different orchestration and automation solutions, and most of them could be manipulated into a custom solution to automatically set up DNS, of course.

My project notes are grouped by service.  Because of that, setting up designate-dashboard and linking with Neutron above is out of order; we install Horizon tomorrow and Neutron four days later.  There are some circular loops when setting up an OpenStack cluster, at least if you set it up by hand like Plan 1.0.

Stay tuned for the next chapter!

Tuesday, July 12, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 014 - OpenStack Placement Inventory Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 014 - OpenStack Placement Inventory Service

https://docs.openstack.org/placement/yoga/install/

This is a short blog post.  Sorry about that.  There is just not much to say about the installation of Placement on OpenStack.  It "just works", the documentation was complete and accurate.

Sometimes no news is good news.

In comparison, I will have a LOT more to say tomorrow about OpenStack Designate.

Stay tuned for the next chapter!

Monday, July 11, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 013 - OpenStack Barbican Key Management Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 013 - OpenStack Barbican Key Management Service

https://docs.openstack.org/barbican/yoga/install/

Barbican is a pretty cool idea.  Take a large number of physical crypto storage hardware solutions, or even just a plain text file, and wrap it with all the Keystone access control stuff WRT users and projects.  My experience, so far, is it works very well.

Note that I configured with "enabled_secretstore_plugins = store_crypto", which just uses the simple software-only crypto plugin.  Still better than nothing.
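
In barbican.conf that amounts to something like this hedged sketch:

[secretstore]
enabled_secretstore_plugins = store_crypto

[crypto]
enabled_crypto_plugins = simple_crypto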

The hand installed Barbican, following the official installation guides, came up with secret HREF URIs listing localhost instead of a real IP address, which is weird.  Later on, the Kolla-Ansible installation of Barbican worked perfectly out of the box, although that's getting way ahead of the story.

To test the hand installed Barbican I wrote some testing scripts.  Later those were modified for a Kolla-Ansible installation, but they may still be useful for a hand installation, and can be found here at this public GitLab repository:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/tree/master/demos/barbican
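
If you just want a quick smoke test without my scripts, the client commands look roughly like this (the secret name and payload are obviously made up):

openstack secret store --name demo-secret --payload 'squeamish ossifrage'
openstack secret list
openstack secret get <the href URI from the previous output>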

As of writing this blogpost I have not written demonstration scripts for Barbican's Container functions or its Consumer functions, although I probably will sooner or later.

Stay tuned for the next chapter!

Sunday, July 10, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 012 - OpenStack Keystone Identity Service Installation

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 012 - OpenStack Keystone Identity Service Installation

https://docs.openstack.org/keystone/yoga/install/

Ansible installs/upgrades the package, everything else is manual.

Installation of this service was uneventful.  Lots of cut and paste from the installation guide into the console.

I used the CLI client to create the sample domain, projects, roles, etc.
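
The flavor of it, hedged because the names here are just the sample ones from the guide and not necessarily what you'd use in production:

openstack domain create --description "An Example Domain" example
openstack project create --domain default --description "Service Project" service
openstack project create --domain default --description "Demo Project" myproject
openstack user create --domain default --password-prompt myuser
openstack role create myrole
openstack role add --project myproject --user myuser myrole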

It takes a while to wrap your head around the right way to use Keystone's features.  At first I just used random designs from the docs.  Eventually I set up one giant project, at least I set it up correctly, and put everything "production" into that giant project.  That makes the web UI very slow and clumsy to operate.

Not to jump too far ahead, but in the end I set up projects in Keystone more or less along role and project lines.  I have the following projects at this time:

admin: test experiments while logged in as admin, which you shouldn't do normally anyway.

enduser: literal enduser instances, hosts that I log into on a day to day basis to do "stuff".  Things you'd have an Apache Guacamole entry to access.

infrastructure: instances and things that are required but rarely directly touched, think of DHCP servers or DNS servers or similar.  If a user would directly touch it, like Booksonic or Apache Guacamole, it belongs in "server" project.

iot: this is an enduser project with multiple images.  More or less the entire Eclipse IoT software suite.

rutherford: another enduser project.  Kind of a demonstration of computational physics simulations mixed with a demonstration of OpenStack.  More detail is beyond the scope of this blog post, or even this series.  Watch for an interesting Youtube video in the future on the Spring City Solutions LLC channel!

server: this is server things endusers directly connect to, like Apache Guacamole.  If it's important and required but end users don't log in directly, like DHCP, it belongs in the "infrastructure" project.  Think of Emby server, or Minecraft server, or Booksonic server.

Looking back, from the time of writing this all the way back to the earliest days, this project organization strategy has worked very well.

The docs emphasize making your own roles in Keystone, but this seems a non-productive activity.  Nice to know it's possible, and nice to know how if you need to, but under normal conditions, why?

For troubleshooting experience, everyone involved with Keystone should replicate the work in the Verification section of the chapter, just to have the experience of having obtained a token.  You never know when this troubleshooting experience will be necessary.

https://docs.openstack.org/keystone/yoga/install/keystone-verify-ubuntu.html
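
The actual verification is tiny; a hedged sketch, assuming you have an admin-openrc credentials file like the one the guide builds:

. admin-openrc
openstack token issue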

Stay tuned for the next chapter!

Saturday, July 9, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 011 - Installing the OpenStack Environment

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 011 - Installing the OpenStack Environment.

The relevant chapter from The OpenStack Install Guide:

https://docs.openstack.org/install-guide/environment.html

Most of this was installed one time, by hand, by me, then I added it to my Ansible and let Ansible take care of the other hosts in the cluster.  Life is too short to configure by hand, let Ansible take care of it.  I created for myself kind of a low-budget homemade Kolla-Ansible, and it worked pretty well.

As a checklist and commentary of each step in the process:

Ansible takes care of the apt repo.

Ansible installs SQL packages, but have to restart and finalize manually.

Ansible installs message queue, but have to add user and permissions manually (see the sketch after this checklist).

Ansible installs memcached, but have to restart manually if config changes.

Ansible installs etcd and config file, but have to enable and restart manually. Note the human readable name is default, not controller as per docs. That seems to work?
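
For example, the manual message queue step mentioned in the checklist above is roughly this, per the environment guide (RABBIT_PASS being whatever password you actually generated):

rabbitmqctl add_user openstack RABBIT_PASS
rabbitmqctl set_permissions openstack ".*" ".*" ".*"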

As a side note, OpenStack docs should emphasize testing MTU on networks, as OpenStack behaves very weirdly during MTU mismatch situations.  MySQL especially goes insane if you set up for a 9000 byte MTU but it's actually only 1500, or even worse, 1450 bytes later on.  Old timers who've run into MTU mismatch problems recognize and fix them pretty easily, but sysadmins who have not run into MTU problems always struggle on their first.  I'd almost suggest new sysadmins intentionally set up MTU settings wrong to gain the experience.

Stay tuned for the next chapter!

Friday, July 8, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 010 - Bare Metal OS install on OpenStack hosts 1, 2, 3

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 010 - Bare Metal OS install on OpenStack hosts 1, 2, 3

The actual OS installation is uneventful.  It's just the Ubuntu 20.04 LTS aka "Subiquity" network install, using Netboot.xyz.  As part of the install process you get to pick a default user and password; I uncreatively entered the username "test".

However, there are tasks to perform between installing the OS and installing OpenStack software.

After installation, I like to log in and do a standard upgrade process. 

sudo apt-get update

sudo apt-get dist-upgrade

sudo apt-get autoremove

sudo apt-get clean 

For some weird reason, even though I configured a swap partition in LVM, Ubuntu adds its own swapfile in the filesystem.  I get it; if you're running RAID you want to have your swap protected against drive failure, and allow hot swapping.  But it's a little slower and I already set up swap in an LVM LV anyway.  So, "swapoff /swap.img", "rm /swap.img", and remove the swap entry from /etc/fstab.  This shrank my root filesystem usage from 34 or so gigs down to about 2 gigs.

Some OpenStack docs imply apparmor is incompatible with OpenStack, some internet posts claim it doesn't matter, and Kolla-Ansible takes care of it automatically as part of the "bootstrap-servers" stage of installation.  For better or worse I shut down apparmor, "systemctl stop apparmor" "systemctl disable apparmor" "systemctl mask apparmor".

There's kind of a standard process to configure networking on new Ubuntu servers.  I set the hostname to the FQDN (important for ActiveDirectory join later on).  I set up the host for Ansible, life is too short to configure NTP and DNS resolvers and standard VIM options and SSHD options and stuff like that by hand.  After Ansible is done with basic server configuration, I usually re-generate my SSH host keys (because I have some specific requirements).  Anyway the point of this paragraph is now I have a generic Ubuntu 20.04 host that's integrated fully into my network environment, but it doesn't actually "do" anything, yet.

Pre-OpenStack, I had some rando hosts acting as a NTP cluster and a nice Raspberry Pi based GPS clock, which worked OK.  My idea was to set up NTP on every host in both OpenStack clusters and have everything connect to that new NTP cluster.  This works really well!  Because almost everything on my network is configured by Ansible, I was able to push out the NTP config changes to all devices with minimal effort.  So I set up, tested, and modified my Ansible configs to make the OpenStack cluster hosts serve NTP at the "bare metal" level (as opposed to running a virtual machine instance on each host or something).

I used to do DNS by having all user devices point to the Samba AD DCs, which have a forwarder to two virtual machines I had on VMware, but I hoped having six hosts doing DNS on bare metal would work even better.  For the use of OpenStack Designate, I set up bind9 DNS on each OpenStack host, fully recursive PLUS authoritative for the OpenStack Designate domains I had not at that time set up yet.  This eventually ended up not working very well at all, because the Active Directory internal DNS resolver in Samba on my domain controllers is not particularly smart and will NOT cooperate with NS records pointing to other subdomains in the existing domain for use with Designate, but I'm getting WAY ahead of the story here.  It seemed a good design at the time.

Previously, on VMware, I ran both my DHCP servers as FreeBSD images with high availability and all that.  I experimented with VMware Fault Tolerance, but it's really kind of overkill for the benefit.  Initially I intended to run DHCP on bare metal on my six hosts; of course ISC-DHCPD only supports dual host clusters, so I intended to set up DHCP on all six hosts, twice, once as a primary and once as a secondary, and then I could manually log in and run a script or something to set a host as primary or secondary DHCP as needed.  This all seems like a lot of work to do by hand, and it is, but via the magic of Ansible scripting it was really almost no work at all!  However, for various reasons, later on I changed this architecture quite a bit and now have four DHCP servers, which is a long story I'll get to later on.

I had the interesting idea to replace USB pass-thru on VMware with running Docker on bare metal on the OpenStack cluster.  No need for openwebrx or homeassistant to have anything to do with the innards of OpenStack just run on the bare metal.  I always stored my volumes in NFS so the docker containers don't care where they're running as long as the host has NFS access, which my entire network does...  This works GREAT and is very fast.  There is a tiny problem in that Kolla-Ansible does not seem to be 100% compatible with this solution.  But we will get to that part of the story later on.  In the short term, this worked really well, very fast and efficient!

There was a minor problem discovered later on: I manage Docker hosts using Portainer, which has a default port number of 9001.  Which, on an OpenStack cluster, is the port number of Designate.  So I reconfigured Portainer on just the OpenStack hosts to use port 19001 instead of 9001, and that worked.  The good news about Kolla-Ansible is it puts the APIs on the virtual IP address, so port 9001 should be open, I think.  The bad news is Kolla-Ansible is not entirely compatible with bare metal Docker, especially after installing Zun and friends.

Stay tuned for the next chapter!