Monday, August 29, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 057 - ELK Stack

Adventures of a Small Time OpenStack Sysadmin relates the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 057 - ELK Stack

Why?

An ELK stack for collecting end-user logs, such as server syslog messages, ethernet switch logs, and similar things, is technically not an OpenStack service.  However, as the overall project intention is to replace the complete functionality of a VMware cluster, and VMware clusters have Log Insight to centralize logging, I describe how I set up an ELK stack to replace my VMware Log Insight installation.

The Plan

After some research, the plan is to add a Docker host to hold an ELK stack all-in-one Docker container.  Unlike Zun containers running natively on OpenStack, a dedicated host can connect to my large, backed-up NFS NAS for log storage by bind mounting the Docker container volumes.  So I will roll out yet another Ubuntu 20.04 instance with Docker installed, and integrate it fully into the LAN, including Active Directory SSO, roaming home directories, Zabbix and Portainer monitoring, etc.

Create a New Virtual Server

I have a nice checklist in the Ansible repo, so all new server rollouts on the OpenStack clusters are consistent and easy, as detailed below.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/ubuntu-server-20.04.txt

In Todoist, which is an online and mobile to-do tracking app, I create a task for the new server with a due date in the future to schedule upgrades. In two months the to-do task will reach the top of the queue and I will upgrade and examine this system.  Documenting and scheduling future upgrades using Todoist takes about two minutes. 

In my dockerized Netbox installation for IP allocation and management, I select an IP address, create the server entry in Netbox, etc.  So, looks like elk.cedar.mulhollon.com will be at IP address 10.10.7.23.  This takes about two minutes.

Logged into the OpenStack controller, I create a new HEAT orchestration template.  The server templates are all similar, aside from obvious differences such as IP address and security groups, so this process is fast and easy with search and replace in the editor.  It takes about five minutes depending on how "unusual" the configuration is; if I'm configuring my fourth identical Active Directory Domain Controller it only takes a bit more than a minute.

This ELK project required some thought and some new ideas.  The (new to me) "filebeat" protocol between the clients and the ELK server's Logstash uses TCP port 5044, so I added that to the security group for this server.  Also, Elasticsearch is legendarily memory hungry, so I boosted this instance to 8 gigs of RAM.  Flavors in OpenStack are so annoying; I wish I could do the VMware thing and simply type in any random amount of RAM I feel like, without having to pre-define it as a flavor beforehand.  Computers eliminate some busywork and create more busywork, kind of a physics law of the conservation of mass, or conservation of mass of busywork...

Once started, the stack create process takes quite a while in the background, while I do other things.  I would estimate I had about ten minutes of things to think about when designing my ELK container.

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/projects/infrastructure/elk/elk.yml
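For reference, a security group rule for the Beats port in a HEAT template looks roughly like this.  This is an illustrative sketch, not copied from my real elk.yml; the resource name and the CIDR are made up:

```yaml
# Hypothetical fragment of a HEAT template: a Neutron security group
# admitting Filebeat/Beats traffic on TCP 5044.  Names and CIDR assumed.
resources:
  elk_security_group:
    type: OS::Neutron::SecurityGroup
    properties:
      name: elk
      rules:
        - protocol: tcp
          port_range_min: 5044
          port_range_max: 5044
          remote_ip_prefix: 10.10.0.0/16   # assumed LAN range
```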

I do Active Directory DNS by having Ansible run samba-tool to add forward and reverse DNS entries for each thing on the network, so I added a file roles/activedirectory/tasks/elk.yml and search-and-replaced in the correct values.  Don't forget to add it to roles/activedirectory/tasks/main.yml and, of course, run ansible-playbook ./playbooks/activedirectory.yml.  This takes about two minutes of actual work; the script takes longer to run, but whatever.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/activedirectory/tasks/elk.yml

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/activedirectory/tasks/main.yml

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/playbooks/activedirectory.yml
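The per-host task file boils down to a couple of samba-tool calls.  A hypothetical sketch follows; the DC address and exact task layout are illustrative, the real file is in the repo linked above:

```yaml
# Hypothetical sketch of roles/activedirectory/tasks/elk.yml.
# "localhost" assumes the task runs on a domain controller; adjust to taste.
- name: elk forward DNS record
  command: samba-tool dns add localhost cedar.mulhollon.com elk A 10.10.7.23
- name: elk reverse DNS record
  command: samba-tool dns add localhost 7.10.10.in-addr.arpa 23 PTR elk.cedar.mulhollon.com
```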

There's some "behind the scenes" configuration in Ansible that's abstracted away by adding the new Ubuntu host to the Ansible inventory file inventory/ubuntu3.  It's a one-line job.  Takes one minute.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/inventory/ubuntu3

For this specific server, Ansible needs a playbook file created, playbooks/elk.yml.  Generally I pick one that's pretty close and edit it.  To start with, this is a generic Docker host, so I copy one and change some names.  Takes one minute.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/playbooks/elk.yml

I use Active Directory and SMB file sharing on my network, so I need to create a roles/samba/files/smb.conf.elk.cedar.mulhollon.com file to configure which directories are exported as shares.  I'd like to pretend I put great effort into configuring custom shares to export the server's logs and such, all under reasonable security precautions and so forth; but most of the time "it's just another docker host", so I copy a similar predecessor and change some hostnames.  Yeah, I know, there are still commented-out config options from back when Samba 4.5 was new; I've been doing this for a while and could modernize the config files sometime, in my infinite spare time...  Anyway, setting up Samba for a new host takes about one minute.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/samba/files/smb.conf.elk.cedar.mulhollon.com

For some years I've been using Ansible to maintain my /etc/ssh/ssh_known_hosts file across my LAN.  It's all scripted up and requires minimal effort: just add another hostname to the script's list of hosts.  So I edit roles/ssh/files/ssh_known_hosts.sh to add the new server.  Takes one minute.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/ssh/files/ssh_known_hosts.sh

By now, OpenStack HEAT Orchestration should have completed the installation of my new server, so this paragraph is about prepping the new server to run the Ansible playbook that later does all the "real work" of configuration.  I configure HEAT to use my Ansible ssh key for initial public-key login, so from the Ansible user's login, ssh ubuntu@elk lets me log in.  I have to do some minor manual sshd config work for Ansible-related purposes, and OpenStack cloud-init always messes up the domain name for the new server for a variety of obscure reasons, so I need to manually run "sudo hostnamectl set-hostname elk.cedar.mulhollon.com", which is how Ubuntu 20.04 does it (seemingly every unix-alike OS, and every version of that OS, has a different procedure).  The version of Ubuntu loaded into Glance on OpenStack is recent, but there are always patches that are even newer, and I may as well start from as clean and recently upgraded a system as possible, so I spend a couple minutes running the usual "apt-get update", "apt-get dist-upgrade", "apt-get clean" routine.  Finally, a quick reboot and the new server is ready for Ansible to configure it.  This generally takes about fifteen minutes, almost all of which is spent waiting for upgrade processes and reboot delays; probably three minutes of actual human labor.
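Collected in order, the manual prep amounts to something like this (run from the initial "ubuntu" cloud-init login; the hostname is this server's, per the text):

```shell
# Fix the hostname cloud-init got wrong, patch up to date, then reboot.
sudo hostnamectl set-hostname elk.cedar.mulhollon.com
sudo apt-get update
sudo apt-get dist-upgrade -y
sudo apt-get clean
sudo reboot
```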

After the new, clean, bare, unconfigured Ubuntu server completes its reboot, I run "ansible-playbook ./playbooks/elk.yml" and Ansible does the vast majority of the work required to integrate and harmonize with my existing network.  It would probably be a couple hours of work to do manually, especially integrating with Active Directory using Samba, and it would be a very long, error-prone checklist for a human to follow, but Ansible scripts never make mistakes.  There are a very small handful of manual tasks to perform after Ansible is done.  I could automatically install the latest Zabbix Agent 2, but I still consider it experimental until I get used to it, so for now I run a script that Ansible placed there ready for me to install it; I will eventually automate Zabbix Agent 2 after I am fully chill with its use and behavior, which seems OK so far.  Also, given that I reconfigured the crypto options for SSH, I create new SSH host keys (again using a script I wrote that Ansible places there ready for my use).  I generally get rid of the default "ubuntu" user because I have full SSO via Active Directory.  I also feel weird about embedding my Active Directory "administrator" password in Ansible, so my entire Active Directory integration is fully automated with the exception of running a quick "net ads join -U administrator" and entering my domain's administrator password; please remember that command line is NOT how to join a new Domain Controller to an existing domain, which is a similar but different one-liner.  Active Directory integration on Linux is sometimes sketchy, and I've never found a way around rebooting to make everything about it work on a new install, so there is one final reboot of the new server.  This task overall is maybe 15 wall-clock minutes, mostly watching automation do its thing, but I'd budget about four minutes of actual human labor.

Now that the new ELK server is integrated with my LAN, I need to work the opposite direction and integrate my LAN with the new ELK server.  This is mostly accomplished by Active Directory, but I do need to run roles/ssh/files/ssh_known_hosts.sh to pull the NEW ssh host keys off the ELK server, and then run the playbooks for the OTHER hosts with the "--tags ssh" option to only update ssh configs on those servers.  This is about one minute's work; it's just running two scripts.

Usually, while Ansible is distributing the new SSH host keys, I fill time by messing around with the Active Directory "ADUC" tool to enter a plain-text description of the new server and enable trust delegation for SSO purposes.  It takes probably five minutes total to log into AD and mess around, after which Ansible is usually done updating the other servers' SSH known-host lists.

My next step is verifying SSO works.  Can I log into my new server and see my roaming NFS home directory without re-entering my password, assuming I am already logged into a different server?  All my docker hosts share an NFS share that holds (and eventually backs up) the docker volumes; can I access it?  This testing takes only two minutes of poking around.

Just a couple final integration tasks remain.  I use Zabbix to monitor operating systems, so I configure Zabbix to connect to the new server and wait to verify good live data arrives in Zabbix.  I also use Portainer for remote control and monitoring at the Docker application level, so I need to install the Portainer agent on the new host (it's a docker container, as you'd expect), then add the new server as a docker host and verify it operates.  This probably takes ten minutes total.

The final task in rolling out a new server is git commit the OpenStack orchestration template and the Ansible playbook and other files.  This probably takes two minutes.

Overall using the power of OpenStack and Ansible, the time required to spin up a new usable server can be broken down into:

Documentation and Design 20 minutes

Operations "manual" labor 7 minutes

Integration and Testing 18 minutes

In the "bad old days" the operations category would have been "half a day" to scare up some hardware, burn-in test it, verify the BIOS settings, slowly watch an OS installation progress bar creep across the screen, and install the hardware in some permanent location.  You still have to do all that, once, for the cluster hardware, but once it's done the additional labor to spin up a new server drops to, as seen above, seven minutes.  Which is quite an improvement over "half a day".

Install ELK stack on the new virtual server

I am using the "sebp" combined stack to spin up an ELK:

https://elk-docker.readthedocs.io/

https://hub.docker.com/r/sebp/elk/

Unusual Server Configuration Requirement

Another advantage of setting up a Docker host for the ELK stack is that I have more control over the Docker environment than I would have with an OpenStack Zun container.  I have to make a custom mmap count limit setting as per:

https://elk-docker.readthedocs.io/#prerequisites

and:

https://www.elastic.co/guide/en/elasticsearch/reference/5.0/vm-max-map-count.html#vm-max-map-count

I ran sysctl vm.max_map_count on the server as configured, and the default seems to be 65530 instead of the desired 262144.

Well, OK, fine, whatever, I can fix that.

In the short term I created a file /etc/sysctl.d/elk.conf containing one line

vm.max_map_count=262144

and ran "service procps restart" (the documentation in /etc/sysctl.d/README.sysctl has a bug: the reload option doesn't exist, LOL, but restart works fine; when I get around to it, I will file a simple documentation-fix bug).

then I ran sysctl vm.max_map_count and now it shows the correct, larger, configuration.  Cool.

I documented that oddity in the Todoist task for this server.  The Todoist tasks act as a "runbook" to document exactly what's required to replicate a server installation, and usually there's not much oddity to document because the Ansible playbooks take care of everything.

In the long term, I created an issue in GitLab to add a "hardware" role for configuration challenges like this.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/issues/9

Who knows, maybe by the time you read this, I will have already implemented this in Ansible?

Open many more TCP and UDP ports

I had to add some more ports to the security groups for the Orchestration Template.  No big deal, just edit and run the update.  It doesn't wipe and rebuild, it reasonably intelligently modifies in place.

Docker Run Script

My docker run script for the new ELK looks like this:

docker run \
  -d \
  --name elk \
  --restart=always \
  --log-driver local \
  --log-opt max-size=1m \
  --log-opt max-file=3 \
  -e TZ="US/Central" \
  -v /net/freenas/mnt/freenas-pool/docker/elk/elasticsearch:/var/lib/elasticsearch \
  -v /net/freenas/mnt/freenas-pool/docker/elk/backups:/var/backups \
  -p 5044:5044/tcp \
  -p 5601:5601/tcp \
  -p 9200:9200/tcp \
  -p 9300:9300/tcp \
  -p 9600:9600/tcp \
  sebp/elk:8.3.3

Filebeat

Back in the "old days" when I was getting started with ELK, we ran Logstash on our servers and that pumped into Elasticsearch.  The modern solution seems to be using various *beat applications to pump data into Logstash, which then pumps into Elasticsearch.  In the end I configured this differently, but in the narrative I set up Filebeat at this point, and someday in the future I might use it.

It looks like the exact version of Filebeat is important for ELK, so I can't run Filebeat locally on every system, because every little system would have a different version, and of course hardware devices like my managed ethernet switches will never run Filebeat, as their firmware only supports syslog.  Therefore I will run a Docker Filebeat on the same ELK server, of the exact matching version, and use multiple syslog inputs (one for each specific syslog RFC format) to feed logs into ELK.  The highest matching version shared by this specific ELK stack and Filebeat at the time of posting is:

https://www.docker.elastic.co/r/beats/filebeat-oss:8.3.3

Here are some Filebeat links for reference:

https://www.elastic.co/guide/en/beats/filebeat/8.3/filebeat-overview.html

https://www.elastic.co/guide/en/beats/filebeat/8.3/filebeat-input-syslog.html

https://www.elastic.co/guide/en/beats/filebeat/8.3/running-on-docker.html

My filebeat.docker.yml file looks like this:

filebeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false
filebeat:
  inputs:
    -
      type: syslog
      format: rfc3164
      protocol.udp:
        host: "0.0.0.0:23164"
    -
      type: syslog
      format: rfc3164
      protocol.tcp:
        host: "0.0.0.0:23164"
    -
      type: syslog
      format: rfc5424
      protocol.udp:
        host: "0.0.0.0:25424"
    -
      type: syslog
      format: rfc5424
      protocol.tcp:
        host: "0.0.0.0:25424"
output.elasticsearch:
  hosts: elk.cedar.mulhollon.com:9200

Note that I output directly into Elasticsearch, which has no security theater on by default.  The typical port for beats connected to Logstash has some security theater on by default, and it would be a bit of work to apply the self-signed SSL cert; it's just not worth the effort.  They are both running on the same server, so a MITM attack seems unlikely, and the entire point of the Filebeat container is to import unsecured raw UDP logs, so implementing security theater, or even real live SSL certs, between Filebeat and the ELK would be a waste of effort.  I might still do that for the LOLs someday in my infinite spare time, just to have the experience of having done it.

My docker run script for Filebeat looks like this:

docker run \
  -d \
  --name filebeat \
  --restart=always \
  --log-driver local \
  --log-opt max-size=1m \
  --log-opt max-file=3 \
  -v /net/freenas/mnt/freenas-pool/docker/filebeat/config/filebeat.docker.yml:/usr/share/filebeat/filebeat.yml:ro \
  -p 23164:23164 \
  -p 23164:23164/udp \
  -p 25424:25424 \
  -p 25424:25424/udp \
  docker.elastic.co/beats/filebeat-oss:8.3.3

Configure Servers to Send Logs to ELK

To set up FreeBSD to send logs to ELK, see the Ansible file roles/syslog/files/syslog.freebsd, which forwards everything with this line:

*.* @elk.cedar.mulhollon.com:25424

Note the RFC5424 option in the RC file:

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/syslog/files/rc.conf.d.syslogd.freebsd

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/syslog/files/syslog.freebsd
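The RC file boils down to something like the following.  Treat this as an illustrative sketch; the exact flags are in the repo file linked above:

```shell
# Hypothetical sketch of /etc/rc.conf.d/syslogd on FreeBSD.
# -O rfc5424 makes syslogd emit RFC 5424 framing, matching the Filebeat
# syslog input listening on port 25424.  -s is the stock "secure" flag.
syslogd_flags="-s -O rfc5424"
```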

To set up Ubuntu, or anything using syslog-ng, see the Ansible file roles/syslog/files/syslog-ng.conf.ubuntu, which has a destination section similar to:

destination d_net { 
  syslog(
    "elk.cedar.mulhollon.com"
    port(25424)
    transport(udp)
  ); 
};

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/syslog/files/syslog-ng.conf.ubuntu
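One gotcha worth noting: a syslog-ng destination by itself sends nothing until a log{} path ties it to a source.  A sketch, assuming the stock Ubuntu source name s_src (verify yours before copying):

```shell
# Append a log path connecting the default source to the ELK destination.
cat >> /etc/syslog-ng/syslog-ng.conf <<'EOF'
log { source(s_src); destination(d_net); };
EOF
syslog-ng-ctl reload   # apply without restarting the daemon
```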

Conclusion

Obviously I had to do some setup in ELK, such as adding filebeat* as my data view source, although note that I'm not trying to write an ELK tutorial.  It's pretty easy to create some searches and dashboards in ELK.  Anyway, in summary, it works.  Cool!

Obviously it's possible to make this MUCH fancier, using SSL-secured TCP transport instead of simple UDP.  I could write entire posts about interesting ELK query and dashboard creation, and it would be fun to follow up by setting up Filebeat on individual servers to pump data into ELK, or converting the existing Filebeat gateway from pumping directly into Elasticsearch to using Logstash instead, but this is an excellent start to an ELK stack.

Stay tuned for the next chapter!

Wednesday, August 24, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 056 - Prometheus

Adventures of a Small Time OpenStack Sysadmin relates the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 056 - Prometheus

During plan 3.0 I decided to either use or lose Prometheus.

Some observations about Prometheus

It works very well in my experimentation.

It replicates my Zabbix infrastructure without providing any additional value.

I need Zabbix to monitor the rest of my infrastructure, which is larger than my OpenStack cluster.  So I can't replace Zabbix with Prometheus (at least at this time; who knows in the future?)

As such I decided to remove Prometheus.

Kolla-Ansible is not an orchestration system; Ansible is merely a very fancy scripting language and set of libraries.  So removing the /etc/kolla/globals.d/prometheus.yml file and running a kolla-ansible deploy will NOT remove the Prometheus installation, although it will configure the entire rest of the system to NOT use Prometheus anymore in the future.

The solution to that problem is to deploy, test that everything is working other than Prometheus, run something like "docker ps | grep prometheus", note a long list of about a dozen large containers providing the now-orphaned Prometheus service, then manually run many "docker stop prometheus-whatever" commands to shut down all the Prometheus containers.  The final step is a quick "kolla-ansible prune-images" with the really-really-sure option to wipe the cached docker images for Prometheus, which will save a couple bytes of storage.
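The cleanup sketched as commands; container names vary by deployment, so the grep filter is the load-bearing part:

```shell
# Stop every container whose name contains "prometheus", then prune the
# now-unused images from the local cache.
for c in $(docker ps --format '{{.Names}}' | grep prometheus); do
  docker stop "$c"
done
kolla-ansible prune-images   # confirm the are-you-really-sure prompt
```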

Tomorrow's post will depend on what I do next to my home lab in my spare time.  I'm caught up to real time after a mere 56 posts.

Stay tuned for the next chapter!

Tuesday, August 23, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 055 - Kolla-Ansible installation on Cluster 1 aka hosts 1, 2, and 3

Adventures of a Small Time OpenStack Sysadmin relates the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 055 - Kolla-Ansible installation on Cluster 1 aka hosts 1, 2, and 3

References

https://docs.openstack.org/kolla-ansible/yoga/user/quickstart.html

https://docs.openstack.org/kolla-ansible/yoga/reference/index.html

Pre-Kolla-Ansible Preparation

Ansible installs the OS packages.  Begin following along with the web instructions for Kolla-Ansible (linked above) at the VENV stage.

The new Cluster 1 controller will be on OS3, working as root.

Python Prep

python3 -m venv /root/kolla-ansible

source /root/kolla-ansible/bin/activate

pip install -U pip

pip install 'ansible>=4,<6'

Kolla-Ansible Installation

This is installing Kolla-Ansible, not using Kolla-Ansible to install an OpenStack cluster (which will be done later)

pip install git+https://opendev.org/openstack/kolla-ansible@stable/yoga

mkdir /etc/kolla

chown root:root /etc/kolla

cp /root/kolla-ansible/share/kolla-ansible/etc_examples/kolla/* /etc/kolla

cp /root/kolla-ansible/share/kolla-ansible/inventory/* .

kolla-ansible install-deps

Run ssh-keygen to create an ssh key for root, and add it to GitLab, so I can access the repos which store my configs and scripts and templates.

git clone the openstack-scripts repo and the glance loader repo.  Feel free to adapt these to your own needs.  I intentionally make these repos public to be seen and used.

mkdir /etc/kolla/globals.d

cp /root/openstack-scripts/backup/whatever/globals.d/* /etc/kolla/globals.d and edit them if necessary

mkdir /etc/kolla/config

cp -R /root/openstack-scripts/backup/whatever/config/* /etc/kolla/config and edit them if necessary

The online docs recommend setting some ansible config options; the instructions are for a non-VENV install, so I put my config in /root/ansible.cfg instead.

Edit /root/multinode, notice this inventory is for Cluster 1 which uses hosts 1, 2, and 3.

Final Pre-Deployment Config

kolla-genpwd and examine /etc/kolla/passwords.yml and note that some will need changing.

Be sure to edit /etc/kolla/passwords.yml line "keystone_admin_password" as you're not going to like the admittedly highly secure autogenerated password.

Also need to edit /etc/kolla/passwords.yml and edit the line kibana_password as you're not going to like the autogenerated kibana password.

Be sure to edit or verify /etc/kolla/config/ml2_conf.ini to set the network_vlan_ranges variable to a reasonable range of VLAN ids, such as bond12:1:1000 (I'm only using 10,20,30 thru 60 on interface bond12)
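For reference, the relevant stanza in ml2_conf.ini looks like this; the section name is per upstream Neutron ML2, and the range is the example from the text:

```ini
# /etc/kolla/config/ml2_conf.ini -- VLAN range for the bond12 physical network
[ml2_type_vlan]
network_vlan_ranges = bond12:1:1000
```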

Run the Swift disk labeler on ALL swift disks on each host

apt install docker because the swift ring generator requires docker

Make sure /etc/kolla/config/swift is empty... for now.

Run the ring maker script in openstack-scripts to make the rings.

Configure Individual Kolla-Ansible Product Globals.d Files

For every service in the product reference, there will be a .yml file in /etc/kolla/globals.d for example neutron.yml to configure the OpenStack Neutron service.  There are no changes to the as-shipped globals.yml file other than one line for various bug reasons (explained later on).  My backups of these files work for me.  You may find my configurations inspirational or at least amusing as a starting point for your own cluster.  Some typical starting points for configuration:

kolla_ansible.yml

distro, vip addrs, and keepalived virt router ID

central_logging.yml

enable_central_logging: "yes"

That will eventually provide Kibana on port 5601.

cinder.yml

Enable LVM backend, use the ssd volume group, and swift as backup driver.

I go back and forth on using swift or a shared NFS mount for backups; probably 51% better off with swift and tools like rclone.

glance.yml

Disable file backend, enable swift backend

heat.yml

Empty initial config

horizon.yml

Empty initial config

keystone.yml

Empty initial config

neutron.yml

network_interface: "whatever the bind interface is for the dual 10G on Prod VLAN"

neutron_external_interface: "whatever the bind interface is for the dual 1G"

kolla_internal_vip_address: "10.10.20.62" aka controller2.cedar.mulhollon.com

keepalived_virtual_router_id: "cluster number, 2 in this case"

Note that Kolla-Ansible uses the openvswitch agent whereas all my experience is with linuxbridge.

nova.yml

Empty initial config

swift.yml

I have to set up swift rings by hand as per:

https://docs.openstack.org/kolla-ansible/yoga/reference/storage/swift-guide.html

Note that the kolla-ansible docs provide an opaque process using docker to generate rings, with a pointer to the swift docs as an explanation, whereas the swift docs use a completely different method.  So that's confusing.

Work around the bootstrapping bug

Because I'm not modifying the globals.yml file and am doing all configuration in individual yml files in globals.d, that triggers a bug:

https://bugs.launchpad.net/kolla-ansible/+bug/1970638

Instead of setting a dummy variable, I make ONE edit to /etc/kolla/globals.yml to set the base distro to ubuntu.  Now bootstrapping works...
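The one-line edit in question, as a sketch (variable name per Kolla-Ansible's stock globals.yml):

```yaml
# The single change to /etc/kolla/globals.yml, working around the
# globals.d bootstrapping bug linked above.
kolla_base_distro: "ubuntu"
```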

Bootstrap

Make sure the venv is activated (if no (kolla-ansible)  in the prompt, source /root/kolla-ansible/bin/activate)

kolla-ansible -i ./multinode bootstrap-servers

Pre-Deployment Checks

Make sure the venv is activated (if no (kolla-ansible)  in the prompt, source /root/kolla-ansible/bin/activate)

kolla-ansible -i ./multinode prechecks

Deployment

Make sure the venv is activated (if no (kolla-ansible)  in the prompt, source /root/kolla-ansible/bin/activate)

kolla-ansible -i ./multinode deploy

Generate the admin-openrc.sh file:

kolla-ansible post-deploy

. /etc/kolla/admin-openrc.sh

Now copy /etc/kolla/admin-openrc.sh where-ever you need it.

Possibly create demonstration data:

/root/kolla-ansible/share/kolla-ansible/init-runonce

Or more likely just use the HEAT templates.

Prepare the CLI

If on OS3, Make sure the venv is activated (if no (kolla-ansible)  in the prompt, source /root/kolla-ansible/bin/activate)

pip install python-openstackclient -c https://releases.openstack.org/constraints/upper/yoga

Or, more likely:

run the install-cli script from openstack-scripts repo.

Local Configuration

Run network scripts to create provider nets and ip pools

Run the keypair script to upload ssh keys

Run the flavor script to upload the flavors

Run the glance-loader repo scripts to upload some usable install images

Run the heat scripts to set up all the projects

Use the web ui to add myself and admin user to all the projects with roles of admin for both

Create the /etc/kolla/admin-project-name openrc scripts for each individual project.  Look at how individual scripts in the projects handle the issue.  There are probably more elegant options to do this without individual files.

Run the individual heat project scripts to set up individual projects (had previously set up all the projects as a group)

Test the backup script in openstack-scripts

Run the heat scripts to set up some test instances.

Conclusion

I completed this step in a long afternoon.  It took hours using Kolla-Ansible to go twice as far as I got when configuring OpenStack by hand over the course of about two weeks.

After this monstrous long post, tomorrow will be a short post about Prometheus.

Stay tuned for the next chapter!

Monday, August 22, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 054 - Bare Metal Install on hosts 1, 2, 3, and the docker USB server.

Adventures of a Small Time OpenStack Sysadmin relates the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 054 - Bare Metal Install on hosts 1, 2, 3, and the docker USB server.

https://docs.openstack.org/kolla-ansible/yoga/user/quickstart.html

https://docs.openstack.org/kolla-ansible/yoga/reference/index.html

It's the usual networking dance where netbooting is done on the non-LAG 1G ethernets, and after a successful OS installation the networking is completely modified, on the host and in the ethernet switch, for LAG'd ethernets and VLANs.

Netboot and Install OS Ubuntu 20.04 LTS "Subiquity".

Leave the M2 alone; it will be configured later, after the installer.

Remember to make the root partition on the controller 200G instead of 100G.

Username test, standard LAN password.

sudo apt-get update

sudo apt-get dist-upgrade

sudo apt-get autoremove

sudo apt-get clean

swapoff /swap.img, rm /swap.img, and remove the /swap.img line from /etc/fstab, because I'm using an LVM swap partition, not a swap file.
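The swap-file removal as commands; the sed pattern is my own illustrative one-liner, so eyeball /etc/fstab rather than trusting it blindly:

```shell
# Disable and delete the installer-created swap file, then drop its
# /etc/fstab entry (an LVM swap partition replaces it).
sudo swapoff /swap.img
sudo rm /swap.img
sudo sed -i '\|/swap.img|d' /etc/fstab
```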

Configure the M2 on each host, using the labelswift script:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/labelswift/label_swift.sh

Copy over the netplan setting up bond interfaces and VLANs and configure the ethernet switch.  After copying the file over, probably safest to log into the console via the IPMI KVM when altering the network.  Don't forget to ping test long packets to verify MTU settings...

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/backups/os3.cedar.mulhollon.com/netplan/netplan.yaml
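For the long-packet ping test, something like this works, assuming a 9000-byte MTU (the target hostname is illustrative).  The IPv4 and ICMP headers take 28 bytes, so the largest unfragmented payload is 9000 - 28 = 8972:

```shell
# -M do forbids fragmentation, so this only succeeds if every hop
# really passes 9000-byte frames.
ping -M do -s 8972 -c 3 os1.cedar.mulhollon.com
```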

Get LAN Ansible up and running to do basic configuration.  Fundamentally it's just another Ubuntu server.  Cloud configs use "ubuntu" as a default username, and for whatever reason when I set up physical servers I use the username "test".  Other than that, pretty uneventful.

Remember, after setting up Zabbix, to let it autodiscover drives and interfaces for a half hour or so after initial setup, but you NEED to shut off the autodiscovery rules in Zabbix for the OpenStack hosts before installing Kolla-Ansible, or Zabbix will be flooded with virtual devices.

The docker server is a physical-hardware Ubuntu install, netbooted on an Intel NUC mini-server.  The weird USB hardware devices plug into the NUC, the NUC hosts Ubuntu and Docker, and the Docker containers run on NFS, so nothing is stored locally and there really isn't anything to back up on the docker server itself.  There's not much else to say about the docker server.  It only exists because VMware USB passthrough is/was quite reliable and there is no such functionality on OpenStack.

Tomorrow will be a long post about Kolla-Ansible installation on Cluster 1.

Stay tuned for the next chapter!

Sunday, August 21, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 053 - Plan 3.0

Adventures of a Small Time OpenStack Sysadmin relates the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Here is a list of required tasks and optimistic goals for the Plan 3.0 era, which involves redeploying the Plan 1.0 Cluster 1 hardware into a new Kolla-Ansible-deployed Cluster 1.  At the conclusion of Plan 3.0 there will be two mostly-identical Kolla-Ansible clusters.  I'm not sure there will be a Plan 4.0.  I will be able to upgrade the clusters one at a time, and via the power of load balancing and redundancy it's no big deal if I have upgrade "issues" on one small cluster at a time.  Kolla-Ansible seems pretty reliable and relatively low stress WRT upgrades-in-place... so far.

Main Task

The main task of Plan 3.0 is to reinstall Cluster 1 using Kolla-Ansible instead of hand-rolling it like the old Plan 1.0 installation, with the following minor differences from how Cluster 2 was set up during the Plan 2.0 era:

Bare Metal

Root partition on the controller should be more like 200G than 100G based on experience.

At the end of the Plan 2.0 era, my 100G root partition utilization looked like this:

Compute node os4: 23.9 gigs

Compute node os5: 13.1 gigs

Controller aka host os6: 82.8 gigs (on a 100G root partition, eeeeeeek!)

So I would feel more confident with a 200G root partition on the controller.  The compute nodes seem stable at low utilization, and may as well save their disk space for Cinder to use.

Bare Metal

Configure Zabbix BEFORE deploying Kolla-Ansible, then disable the automatic creation in Zabbix of drive and interface graphing, because using the cluster will spam Zabbix full of unusable drives and interfaces and polling that slows it down.

Ansible on Bare Metal

During the era of Plan 2.0, I improved my LAN-Ansible configuration for bare metal on OS4, OS5, and OS6.  Using my system-wide LAN Ansible to set my default "vim" editor options and similar tasks has not interfered with whatever Kolla-Ansible is doing to set up Lib-Kuryr and Cinder and whatever else Kolla-Ansible does to spawn an OpenStack cluster.  I have to be "reasonably" careful going forward but this idea of two Ansibles is now a production meme rather than an experimental meme.

Keystone

I can federate the two clusters' Keystones together, so they imply...

Cinder

At some point I would like to get NFS working with Cinder.  I have three NFS servers, one "real" backed up NAS, and two test/experimental smaller NFS servers.

Cinder

Push Cinder traffic off the management VLAN and onto the pre-existing Storage VLAN.

Swift

Push Swift traffic off the management VLAN and onto the pre-existing Storage VLAN.  I would imagine this would require new rings to be deployed etc so do this before trying to do anything important with Swift.

Upgrade Kolla-Ansible

I've upgraded Kolla-Ansible within the Yoga series (not across releases, yet) and I'd like to document the process.  It's mostly painless?

Neutron

Push Overlay traffic off the management VLAN and onto the pre-existing Overlay VLAN previously used for VMware and NSX and so on.

Designate

Modify DNS such that zones on the os1 and os2 clusters have a secondary DNS relationship with each other.  They will back each other up rather than ignore each other.
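
In Heat terms, a secondary zone can be sketched roughly like this.  The zone name and master IP are hypothetical, and I have not confirmed this exact template against my clusters:

```yaml
heat_template_version: 2018-08-31

resources:
  # Secondary copy of a zone whose primary lives on the other cluster.
  # Zone name and master IP address are illustrative.
  backup_zone:
    type: OS::Designate::Zone
    properties:
      name: cedar.mulhollon.com.
      type: SECONDARY
      masters: [10.10.7.1]
```

A SECONDARY zone pulls zone data via AXFR from the listed masters, which is what makes the two clusters back each other up instead of ignoring each other.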

Mistral

Mistral is so cool and works so well, I want to go out of my way to find a use case for it.  I intend to completely automate provisioning down to one push of a button.  In the Plan 1.0 era, provisioning was completely manual, followed by Ansible playbook-based configuration.  In the Plan 2.0 era, provisioning was handled by Heat Orchestration Templates, and after that completed, Ansible playbook configuration was run, so it's kind of one button, pause, second button.  My hope for Plan 3.0 is to use Mistral to automate the complete provisioning process down to one button press, possibly to even include updating status in my Netbox IPAM system.
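
A one-button workflow along those lines might be sketched like this.  All names are hypothetical, the Heat template inputs are omitted, and the Netbox step is hand-waved as an HTTP call; `heat.stacks_create` and `std.http` are standard Mistral actions, but I have not built this yet:

```yaml
---
version: '2.0'

provision_server:
  description: One-button provisioning sketch (hypothetical)
  input:
    - server_name
  tasks:
    create_stack:
      # Spawn the instance from an existing Heat template
      # (template body/parameters omitted from this sketch)
      action: heat.stacks_create
      input:
        stack_name: <% $.server_name %>
      on-success:
        - update_netbox
    update_netbox:
      # Mark the server active in the Netbox IPAM (URL is illustrative)
      action: std.http
      input:
        url: https://netbox.example.com/api/dcim/devices/
        method: POST
```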

Prometheus

Use it or lose it.  By the end of the Plan 2.0 era I had Zabbix monitoring everything at the bare metal OS level and below, including IPMI hardware sensors, and Centralized Logging more or less successfully and efficiently funneling all the Docker container logs into a working Kibana.  I also have Prometheus installed but am not using it in any fashion; so find a use for it or wipe it to save the disk space / CPU cycles.

Backups

Backups in general should be a focus of the Plan 3.0 cycle.

ELK for endusers

At some point in this conversion process from VMware to OpenStack, I have to admit I need to replace VMware's Log Insight, which combined cluster and enduser logs, with something; currently I intend to use a simple dockerized ELK stack for enduser logs only.  This will be just another Heat Orchestration Template in the server project, plus some Ansible changes to point syslog outputs to the ELK stack.  The only reason this was not done in the Plan 2.0 era was simple hardware capacity, which I do have in the Plan 3.0 era.
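
A minimal sketch of such a dockerized all-in-one ELK stack, assuming the commonly used sebp/elk image and an illustrative NFS-backed data path (not my actual deployment):

```yaml
# docker-compose.yml sketch; image and ports are the sebp/elk
# defaults, the host volume path is illustrative
version: "3"
services:
  elk:
    image: sebp/elk
    ports:
      - "5601:5601"   # Kibana web UI
      - "9200:9200"   # Elasticsearch REST API
      - "5044:5044"   # Logstash Beats input
    volumes:
      - /mnt/nas/elk-data:/var/lib/elasticsearch
```

The Ansible side then only needs to point each server's syslog or Beats output at this host; that client configuration is out of scope for this sketch.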

Neutron

Plan 1.0 had each instance dynamically assigned an IP address from the pool.  Plan 2.0 mostly orchestrates the same address regardless of which cluster an instance is running on today.  In Plan 3.0 I still intend to keep pools on each cluster, but only for testing and experimenting, such as trying out new OS images; "production" will continue the Plan 2.0 tradition of using "real" IP address assignments rather than pool assignments.

*aaS

I'm basically getting rid of almost everything *aaS provided by OpenStack as per previous discussion.

Bare metal Docker

I have three Docker containers that need to talk to bare metal USB interfaces connected to obscure hardware; the explanation and details are a long story...  I was running Docker on OS2 and OS3 for these containers in the Plan 1.0 era, but Kolla-Ansible wants to run its own Docker, and something about it does not cooperate with my Portainer monitoring, and something about Kolla's installation of Zun and Kuryr makes manually installed Docker containers not work; it's all so tedious sometimes.  "In the old days" on VMware, I relied completely and successfully on VMware's USB passthrough service, but OpenStack Nova does not do USB passthrough AFAIK.  So as part of Plan 3.0 I set up a tiny Intel NUC bare metal hardware server with all the USB ports stuffed full of interesting hardware devices, and I run my bare metal Docker containers on that physical server.  Sometimes to go cloudy, or cloudier, you have to un-cloud some unusual workloads; weird but true.

Tomorrow, reinstall bare metal OS on hosts 1, 2, 3.

Stay tuned for the next chapter!

Saturday, August 20, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 052 - Conclusion of Plan 2.0 era

There isn't much else to say about the Plan 2.0 era.

The next step was to move everything off Cluster 1, cleaning up orchestration and automation each step of the way.  It's the usual IT project fiasco where the last 10% of the work takes 90% of the time.  Everything was moved off hand-rolled Cluster 1 onto Kolla-Ansible Cluster 2 by early August.

Some lessons learned:

Networking with OpenStack is like mudwrestling a pig: you're not going to win without a struggle, you're probably going to get exhausted and/or hurt, and in the end the pig seems to like it.  But it is possible to bend OpenStack Neutron networking to your will, given enough time, sweat, and sheer stubbornness.  Eventually, 100% of the features and reliability are achievable.

Orchestration is the way of the future and Heat Templates are awesome.

Kolla-Ansible works and requires minimal effort; it takes a bit of learning how to override its defaults, but eventually it mostly makes sense, most of the time.  I was worried about Kolla-Ansible and my LAN Ansible getting into configuration fights, but with firm demarcation between the configuration systems it was not an issue.  This seems to be the ideal system for deploying an OpenStack cluster.

Logging and monitoring was a time consuming struggle that was worth the time and effort.  There are many paths for monitoring an OpenStack cluster, most of which are dead or deprecated legacy projects, but in the end it IS possible to build a robust and feature filled monitoring system for OpenStack running under or with Kolla-Ansible.

Container infrastructure (Zun/Kuryr) in OpenStack is reliable and effortless and is the way forward with the possible exception of the mystery of backing up container volumes.  Which if you think about it, is kind of traditional for Docker containers LOL.

None of the exciting looking *aaS products in OpenStack are usable, either in general or for my specific purposes, other than Designate, which works perfectly.  Luckily all of them can be replaced by very small Heat templates and Docker containers.

Tomorrow, start to discuss Plan 3.0.

Stay tuned for the next chapter!

Friday, August 19, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 051 - Why I'm not using almost all the *aaS OpenStack offerings.

Intro

The only *aaS offering I'm currently using from OpenStack is Designate.  Designate works perfectly, installs easily, and is reliable in operation.  As for the other *aaS offerings, at least as of midsummer 2022 in the Yoga release of Kolla-Ansible... well...

Solum

https://docs.openstack.org/solum/yoga/

There seems to be no Project Deployment Configuration Reference document for Solum.

Solum is equivalent to "CapRover" or "Dokku" as an OpenStack integrated service, instead of a Zun container orchestrated by OpenStack Heat.  If I needed that functionality I'd install the Docker containers, or just go commercial and deploy on Heroku (although Heroku is incredibly expensive to use...)

AFAIK Solum works, although I have not tried it, so ...

Magnum

The idea of a service providing Docker Swarms and K8S clusters is compelling, and I'd be excited to use it, if it were not a dead project.

https://docs.openstack.org/magnum/yoga/

Kolla-Ansible Deployment Configuration Reference for Magnum

https://docs.openstack.org/kolla-ansible/yoga/reference/containers/magnum-guide.html

Some config examples I experimented unsuccessfully with:

https://docs.openstack.org/magnum/yoga/configuration/sample-config.html

The wiki explains the very limited compatibility list for each different Magnum version.

https://wiki.openstack.org/wiki/Magnum

Magnum seems dead for a couple years and a couple Kolla-Ansible project releases:

http://lists.openstack.org/pipermail/openstack-discuss/2021-August/024273.html

General discussion seems to indicate that, despite documentation advertising the contrary, Magnum does not seem capable of running Docker Swarm anymore, which is too bad.

The good news is I can replace Magnum by a medium sized Heat Template and some Ansible Playbooks faster than I can fix the Magnum project.

Murano

A GUI application catalog.  Cool, would enjoy trying that.  Luckily, I don't really need it, because it's a dead project.

https://docs.openstack.org/murano/yoga/

There seems to be no Kolla-Ansible Project Deployment Configuration Reference document for Murano.

The Murano project appears to be dead.  There's a bug reported 17 months ago, back on the Victoria release, that I unfortunately easily reproduced locally on Yoga.  This makes the project uninstallable; no one seems to care, and the project seems dead.  Bye!

https://bugs.launchpad.net/kolla-ansible/+bug/1916370

Manila

Ironically, I could not figure out what Manila did for a while.  I thought it was a helper/gateway for Neutron to tunnel and isolate storage to end user projects.  It is actually a *aaS to automatically provision an NFS or Samba server.

https://docs.openstack.org/manila/yoga/

Kolla-Ansible Project Deployment Configuration Reference for Manila

https://docs.openstack.org/kolla-ansible/yoga/reference/storage/manila-guide.html

The personal problem I have with Manila is that it creates a software NAS using Cinder.  But I have excellent hardware NAS infrastructure already, I have no good Cinder-based backup solution, and if I needed a software NAS in an instance I'd just install TrueNAS into an instance and be done with it.  It solves the opposite of all my storage problems.  I assume it is very useful to some people, and I'm happy for those people that it exists.  I could run Manila on top of my backed-up NFS hardware NAS to provide CIFS, but I'd probably find it easier to install, operate, and troubleshoot a Samba instance.

AFAIK, based on gossip from people using it, it reportedly works well.

Sahara

This project seems to be like Manila but for cloudy big data servers instead.

https://docs.openstack.org/sahara/yoga/

There does not seem to be a Kolla-Ansible Project Deployment Configuration Reference document for Sahara.

I'm not even going to try Sahara.  There are no errors when installing, but the docs have not been updated since Newton (I'm on Yoga), and I don't actually have a use case for anything it can do; if I did, I'd install a fresh, up to date Docker container instead of something from back when the Newton release was new...  I'm not saying it's a dead project, but when the docs are that old...

Senlin

https://docs.openstack.org/senlin/yoga/

There seems to be no Project Deployment Configuration Reference for Senlin.

Senlin's tutorials look very well documented and caught my interest.  I do not currently have a use case for this software, but it's too cool looking not to try anyway; I may try installing it someday as an experiment.

AFAIK this is a live project.  Then again I haven't tried using it.

Trove

This is yet another *aaS project, this one specializing in databases; it amounts to wrapping a Docker container around the OpenStack Keystone system for auth, quotas, and such.

https://docs.openstack.org/trove/yoga/

There seems to be no Kolla-Ansible Project Deployment Configuration Reference for Trove.

As configured by Kolla-Ansible, Trove is ready to use in the sense that all the infrastructure is set up, but there are no guest images uploaded and no datastores configured.

Building a Trove guest image notes:

https://docs.openstack.org/trove/latest/admin/building_guest_images.html

mkdir images
git clone https://opendev.org/openstack/trove
cd trove/integration/scripts
./trovestack build-image
openstack image create trove-guest-ubuntu-bionic \
  --private \
  --disk-format qcow2 \
  --container-format bare \
  --tag trove --tag mysql \
  --file ~/images/trove-guest-ubuntu-bionic-dev.qcow2
openstack datastore version create 5.7.29 mysql mysql "" \
  --image-tags trove,mysql \
  --active --default
trove-manage db_load_datastore_config_parameters mysql 5.7.29 ~trove/trove/templates/mysql/validation-rules.json

Fails with:

2022-07-11 23:15:31.903 | ERROR: Cannot install trove==17.1.0.dev36 because these package versions have conflicting dependencies.
2022-07-11 23:15:31.903 |
2022-07-11 23:15:31.903 | The conflict is caused by:
2022-07-11 23:15:31.903 |     trove 17.1.0.dev36 depends on Jinja2>=2.10
2022-07-11 23:15:31.903 |     The user requested (constraint) jinja2===3.1.2

I filed a bug on the inability to build Trove images:

https://storyboard.openstack.org/#!/story/2010137

I like the idea of integrated backups in Trove.  However it seems like a HUGE amount of work to set up, compared to a couple lines in a Heat Orchestration Template to spawn off a Zun container of any arbitrary DBMS, whichever MySQL version I would like at this moment, etc.
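
For comparison, here is roughly what "a couple lines in a Heat Orchestration Template" looks like.  This is a hedged sketch with an illustrative container name, image tag, and password, not one of my actual templates:

```yaml
heat_template_version: 2018-08-31

resources:
  # Arbitrary-version MySQL as a Zun container; the name, image
  # tag, and password are illustrative
  mysqlserver:
    type: OS::Zun::Container
    properties:
      name: mysqlserver
      image: mysql:5.7.29
      environment:
        MYSQL_ROOT_PASSWORD: changeme
```

Swapping MySQL versions is then a one-line change to the image tag, versus rebuilding and re-registering a Trove guest image.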

Octavia

https://docs.openstack.org/octavia/yoga/

Kolla-Ansible Project Deployment Configuration Reference (PDCR) for Octavia

https://docs.openstack.org/kolla-ansible/yoga/reference/networking/octavia.html

I don't have a use case for a load balancer at this time.  If I did, I would install a software LB of my own choice on an instance, or in a Docker Zun Container.

Watcher

Watcher seems like a good idea in concept.  However, it seems dead.

https://docs.openstack.org/watcher/yoga/

I could not find a Project Deployment Configuration Reference (PDCR) for Watcher.

I tried to install Watcher, but it repeatedly restarts its containers every 30 seconds, so I assume it's crashing.  Then I noticed Watcher's Horizon dashboard also gets weird and crashes.

So... removing Watcher.  Ironically, on my hand-installed Plan 1.0 cluster it "worked", or at least it didn't crash constantly.

Freezer

Integrated backup service in OpenStack sounded very exciting.  If it were not a dead project I'd surely use it on a regular basis.

https://docs.openstack.org/freezer/yoga/

There seems to be no Project Deployment Configuration Reference (PDCR) for Freezer.

The install fails with a domain_name error.  There's a bug open at Storyboard and Launchpad.  The Freezer project seems dead, unfortunately.

https://storyboard.openstack.org/#!/story/2009936

https://bugs.launchpad.net/kolla-ansible/+bug/1961430

Conclusion

The idea of deeply integrating *aaS products into OpenStack sounds very appealing as a way to leverage Keystone and Heat and generally integrate with the other services.  However, none of the *aaS projects integrate, with the exception of Designate, which integrates pretty well with Neutron.  All the *aaS projects other than Designate either don't fit my needs, so I didn't try them, or I tried them and they are dead.  Luckily, in all cases technology has marched on: if I need an instant MySQL, as an example of a *aaS product, I can quickly and easily spawn a Docker container in Zun, which works great, can be orchestrated with Heat templates, and is easier to admin than an "embedded" OpenStack project.  So in summary, these projects either don't meet my needs, or they're dead projects that no longer work, but I don't mind.

Tomorrow, the conclusion post for the Plan 2.0 era.

Stay tuned for the next chapter!

Thursday, August 18, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 050 - Mistral and Masakari Doubleheader

Intro

This will be a short and boring blog post; sorry in advance.  I installed Mistral and created some test and demonstration workbooks; it works great and is well documented, but actually implementing it will take some time to experiment.  As for Masakari, it's essentially undocumented and I don't have substantial enough resources to make use of it, although I like the idea and want to implement it "eventually" when I have the spare time.

Mistral

https://docs.openstack.org/mistral/yoga/

I have this working, and it looks terribly useful, but I don't have any live applications for it at this time.

I have a very simple "hello" demo at this GitLab link:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/tree/master/demos/mistral/hello

Masakari

Masakari Docs Page

https://docs.openstack.org/masakari/yoga/

Kolla-Ansible Deployment Configuration Reference for Masakari

https://docs.openstack.org/kolla-ansible/yoga/reference/compute/masakari-guide.html

It seems very difficult to find documentation for how to use Masakari. It seemed to install successfully; what do I do next? My clusters are very small so failover will probably not work anyway for capacity reasons. I will keep it around in hopes I get to use it someday.

Tomorrow, why I'm not using most of the *aaS OpenStack offerings.

Stay tuned for the next chapter!

Wednesday, August 17, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 049 - Centralized Logging Redux

Intro

Note these blog posts are written as a retrospective long after the work was accomplished.  My initial post about Central Logging was written very soon after I discovered it, early in the Plan 2.0 era.  This is a follow up post about what I learned along the path of the Plan 2.0 era, ending with description of the entire monitoring system at that time, at least as I remember it.

Narrative

As I wrote last time about the central logging service, adding Monasca to a Yoga cluster kills central logging, because Monasca refuses to connect to any but truly ancient versions of Elasticsearch, as seen at this link:

https://bugs.launchpad.net/kolla-ansible/+bug/1980554

I noticed that leaving Monasca installed and running, with the crashing log importer shut down, results in a backlog forming in Kafka and Zookeeper on the order of quite a few gigs per week, slowly and inevitably filling my controller root partition.  Annoying.  40+ gigs is not a big deal in 2022 in the grand scheme of things, but it is a big deal on a 100G root partition, or when it seemingly uncontrollably grows 10+ gigs per week.

My solution to the slow log flood, was to remove Monasca from Cluster 2, and as a result remove Grafana (I use Zabbix mostly for IPMI level monitoring) and remove the related Kafka and Zookeeper and Vitrage.  Quite a cleanup.

Now it is important to realize that Kolla-Ansible only adds and installs following a dependency tree; it is NOT, repeat NOT, an orchestration system.  Remove a service, then run kolla-ansible reconfigure, and it will reconfigure EVERYTHING ELSE still enabled around the changes necessary to no longer use or point to the removed service, but it leaves the legacy service up and running and using disk space.  So I ran a reconfigure, and only afterwards did I manually stop the Docker containers for Monasca, Kafka, Zookeeper, Grafana, Vitrage, etc.  Then it's safe to run a prune and delete the storage volumes and so forth.

There's unfortunately a stalled out task to add a kolla-ansible service destroyer, to make it a sort-of-orchestrator, at the following links:

https://bugs.launchpad.net/kolla-ansible/+bug/1874044

https://review.opendev.org/c/openstack/kolla-ansible/+/504592

Interesting observation: central logging was knocked out by Monasca's logger being dead, because it can only connect to truly ancient versions of Elasticsearch.  By removing Monasca and redeploying, I'm now getting centralized logging messages.  Cool!

To create your default Kibana indexing pattern:

https://docs.openstack.org/kolla-ansible/yoga/reference/logging-and-monitoring/central-logging-guide.html

Of course, now that Elasticsearch is working on centralized logging, I need to configure Curator or the drive will inevitably fill.  The docs seem to imply that /etc/kolla/globals.yml has options for Elasticsearch Curator, which seems not to be the case.  I have a personal TODO item to submit a patch to fix the docs.

Superficially, a sysadmin would predict the config file /etc/kolla/elasticsearch-curator/elasticsearch-curator-actions.yml would be overridden in the directory /etc/kolla/config/elasticsearch-curator; however, per the docs, that file should actually be stored in /etc/kolla/config/elasticsearch.  Well, OK, whatever; Kolla-Ansible sometimes gets unpredictable about peculiar filenames WRT overrides, as I discovered back in my early Neutron networking days.  Anyway, for a probably-working example configuration with somewhat shorter data retention than the rather long defaults, see this URL for inspiration, or if it ends up not working, for the LOLs:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/backups/os6.cedar.mulhollon.com/configs/elasticsearch/elasticsearch-curator-actions.yml
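
The linked file is in the standard Curator action-file format; a minimal sketch along the same lines (the index prefix and the 14-day retention here are illustrative, not my exact settings) looks like:

```yaml
# /etc/kolla/config/elasticsearch/elasticsearch-curator-actions.yml
# Sketch; the index prefix and retention period are illustrative
actions:
  1:
    action: delete_indices
    description: Delete log indices older than 14 days
    options:
      ignore_empty_list: true
    filters:
      - filtertype: pattern
        kind: prefix
        value: flog-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 14
```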

Logging, Monitoring, and Telemetry around the end of the Plan 2.0 era

This is how monitoring looked on the Cluster toward the end of the Plan 2.0 era:

Zabbix monitors the OS underneath OpenStack, including everything the zabbix-agent sees, down to lower level stuff like IPMI fan speeds and temps.  The Skydive component of OpenStack provides a beautiful GUI for real-time-ish live network traffic flows, at some effort to decode interface names, etc.  Centralized logging, aka Kibana, monitors all the Docker logs.  I have Prometheus installed, although I'm not actually doing anything at all with it.  I am not at this time experimenting with the optional connection between Prometheus and Elasticsearch, but I will get around to it eventually.

On VMware I found Log Insight worked really well as a merged monitoring tool holding both cluster logs AND enduser logs from server installs, etc.  However, I don't think that merger works with OpenStack.  I will likely go back to using a dedicated ELK stack to monitor my enduser apps.  I did not set that up during Plan 2.0 for simple hardware capacity reasons.  So that is why monitoring at the end of the Plan 2.0 era seems to be missing a key component, end user logging; it is indeed missing and will be added later.

Tomorrow, a doubleheader of Mistral and Masakari

Stay tuned for the next chapter!

Tuesday, August 16, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 048 - Prometheus and Grafana Doubleheader

Intro

Ironically, I am no longer using Prometheus and Grafana.  Although I got them working, they did not do anything Zabbix did not do, they're certainly not easier to use, and they monitor only the OpenStack cluster, whereas Zabbix can monitor anything.  Technically I probably could have converted literally everything on my entire network to use Prometheus/Grafana... um, no, I don't have that kind of spare time.  Nonetheless, I got it working, and it's pretty cool, so I write it up in this blog post.

Installation

There aren't really any OpenStack docs for Prometheus / Grafana, at least that I could find, so this is a mess of Google searches, experimentation, and software archeology concatenated into something resembling a checklist:

/etc/kolla/globals.d/prometheus.yml needs this line:

enable_prometheus: "yes"

/etc/kolla/globals.d/grafana.yml needs this line:

enable_grafana: "yes"

/etc/kolla/passwords.yml needs the following lines edited:

grafana_admin_password: make this something sane; the default web UI login will be username "admin" with this password.

/root/multinode (or where-ever you keep your kolla-ansible inventory file):

This is where you enable or disable what prometheus will monitor.  The defaults seem fine or at least they worked for me.

Well, let's deploy it and try it out:

kolla-ansible -i ./multinode reconfigure -t prometheus,grafana

The deployment took about ten minutes to fail.

One multi-page error message was something about telegraf username and password failing.

One single-page long error message was about monasca data source having an invalid username or password.

Well, I tried using it anyway to see what happens with IPMI and similar monitoring, and to my happy surprise it kind of works regardless of the partially failed deployment.

First steps walkthrough with Grafana

This really isn't the time for Grafana training, but here's what I did on my hardware the first time I logged in; it will probably not -exactly- match your hardware, but should provide some inspiration.

The web UI for Grafana will be at your controller IP address, port 3000; the username is admin, and the password is whatever was in /etc/kolla/passwords.yml.

Left side click on dashboards.  "+ New Dashboard"

"Add a new panel"

On the right side, change "Title" from "Panel Title" to "CPU" or similar

The data source defaults to the dead telegraf.  Change it to Prometheus.  Everything underneath the data source selection changes...

Metric: Select "node_hwmon_temp_celsius"

Labels: "chip = platform_coretemp_0" and click the "+" to add "sensor = temp1"

(your hardware is probably different...)

Click "Run Queries" and you get what seems to be the CPU temps of the cluster's hosts.

Click "Apply" in the upper right.

You probably want a better dashboard name than "New dashboard", so while looking at the pretty graph, click "Dashboard Settings" in the upper right.

"General" "Name" change from "New dashboard" to "Temperatures"

Click "Save Dashboard" and "Save"

Now you could spend endless hours making fancier and fancier dashboards, but this is a five minute intro to Grafana and it seems to work well enough.
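
The GUI steps above boil down to a single PromQL query against the node exporter's hwmon metrics; the chip and sensor label values are from my hardware and will likely differ on yours:

```
node_hwmon_temp_celsius{chip="platform_coretemp_0", sensor="temp1"}
```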

Conclusion

The problem with me personally using Prometheus and Grafana is that I have a working Zabbix across my entire system (not just this cluster), so I have zero motivation to use them.  I am happy to report the version installed with Kolla-Ansible does mostly work, and I provided a walkthrough above of how to start using it to create a dashboard showing the cluster hosts' CPU temperatures.

Tomorrow, the sequel to the old post "Central Logging", it's the all new long awaited "Central Logging Redux"

Stay tuned for the next chapter!

Monday, August 15, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 047 - Zabbix

Intro

Technically, Zabbix is not an OpenStack project or service.  However, I find it VERY useful for monitoring the hardware and bare metal OS.  Also, for consistency's sake, I use Zabbix across my entire system: if it has a Zabbix software agent OR an SNMP port OR an IPMI port, then I use Zabbix to monitor it.  Very versatile software!

I run Zabbix inside my OpenStack clusters, inside my "infrastructure" project.  A Zun container named zabbixserver holds the backend server.  There are Zun containers for the zabbixproxy (which is technically no longer necessary...) and the zabbixweb frontend.  The only persistent state in the monitoring system is in the MySQL server: I have an Ubuntu instance running that holds Docker containers backed by my NAS NFS server.  The NAS is backed up via a process that is far off topic for this series.  I also have an Adminer Zun container connected to the zabbixmysql database so I can, if I want, use a web UI to poke around, although that's more for fun than productivity.

How to enable IPMI pollers in Zabbix on a Zun OpenStack container

https://www.zabbix.com/documentation/current/en/manual/config/items/itemtypes/ipmi

This is done in a Zun container using a Heat orchestration template by adding the following environment variable to the OS::Zun::Container configuration:

ZBX_IPMIPOLLERS: "1"

See the Gitlab repo:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/projects/infrastructure/zabbixserver/zabbixserver.yml
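
In context, the relevant resource fragment looks something like this sketch; the image and other properties are trimmed and illustrative, so see the repo above for the real template:

```yaml
resources:
  zabbixserver:
    type: OS::Zun::Container
    properties:
      name: zabbixserver
      # Image is illustrative; see the GitLab repo for the real one
      image: zabbix/zabbix-server-mysql
      environment:
        # Spawn one IPMI poller so IPMI item types work
        ZBX_IPMIPOLLERS: "1"
```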

How to configure an individual IPMI host in Zabbix

It's possible in Zabbix to configure multiple interfaces per host.  Normal hosts only have an "Agent" interface, but all the OpenStack hosts have an IPMI interface, so I added it.  Note you have to add the IPMI interface, click Update, then click on the IPMI tab; what works for me is:

Authentication algorithm: Default

Privilege Level: Admin (change)

Username: LOL NOT SAYIN (but everyone in real life uses ADMIN, of course)

Password: LOL NOT SAYIN

It will take Zabbix quite a while to eventually get around to polling.

Add the "Template Server Chassis by IPMI" named template.

Click the Host "Macros" tab and add the following two macros:

{$IPMI.USER}

{$IPMI.PASSWORD}

with the obvious text values and click "Update".

Zabbix Agent2

Agent2 replaces the old agent, adds Docker monitoring, and could optionally do some interesting MySQL monitoring.  Each OS has a somewhat different way to install Agent2, ranging from FreeBSD, where it's simply a new package, to Ubuntu, where you have to download a package manually and then manually install it, all via a little script I wrote and have Ansible place in the /root directory of every Ubuntu install.

https://www.zabbix.com/documentation/6.2/en/manual/config/templates_out_of_the_box/zabbix_agent2
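My real script lives in the Ansible repo, but a minimal sketch of the Ubuntu manual-install dance looks something like this.  The release package name and URL follow Zabbix's published repo layout but should be treated as assumptions, and DRY_RUN defaults to on here so the sketch only prints what it would do:

```shell
#!/bin/sh
# Hypothetical sketch of installing Zabbix Agent2 on Ubuntu 20.04.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
REL_DEB="zabbix-release_6.2-1+ubuntu20.04_all.deb"
REL_URL="https://repo.zabbix.com/zabbix/6.2/ubuntu/pool/main/z/zabbix-release/${REL_DEB}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run wget -q -O "/tmp/${REL_DEB}" "${REL_URL}"   # fetch the repo definition package
run dpkg -i "/tmp/${REL_DEB}"                   # install the repo definition
run apt-get update
run apt-get install -y zabbix-agent2            # the actual agent
run systemctl enable --now zabbix-agent2
```

Run it with DRY_RUN=0 to actually execute the steps.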

Zabbix Community Template Collection

https://github.com/zabbix/community-templates

Active Checks Requirements

For active checks to work, you MUST configure each host as follows:

"Host name": the FQDN (example: os4.cedar.mulhollon.com)

"Visible name": the short name (example: os4)
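The flip side of the same requirement: the Hostname parameter in the agent's config file has to match that frontend "Host name" exactly, or active checks silently go nowhere.  A fragment (the server address is a placeholder):

```
# /etc/zabbix/zabbix_agent2.conf (fragment; server address is a placeholder)
Server=zabbix.example.com
ServerActive=zabbix.example.com
Hostname=os4.cedar.mulhollon.com
```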

My suggested Zabbix host configuration for monitoring an OpenStack host

Template App NTP Service

Template App SSH Service

Template Module ICMP Ping

Template OS Linux by Zabbix agent

Template Server Chassis by IPMI

SMART by Zabbix agent 2

Host groups: Hypervisors; in addition, every host is a member of either OpenStack Cluster 1 or OpenStack Cluster 2

Autodiscovery "Fun" with OpenStack

Autodiscovery is OK before Kolla-Ansible is deployed.  After deployment, Zabbix will autodiscover all the new LVM drives, network interfaces, and ethernet bridges, then perseverate upon them forever as a complete waste of time.  I've seen over 1400 items monitored on one host, very slowly, LOL.  The plan is to disable autodiscovery AFTER bare metal host prep, but BEFORE kolla-ansible deployment.

Under "Configuration", "Hosts", click the Items link for your host; for me it was "os4 Items 1351" or whatever it was.  Click over to "Discovery Rules 3".  In the "Status" column everything should be "Enabled"; click on "Block Devices discovery" and "Network interface discovery" to change them to disabled.  I've not run into problems with "mounted filesystem discovery"... so far.  Then click back to "Items".  One cool feature of a relatively modern Zabbix web UI is that you can select multiple tags by clicking on them.  If you have "junk" OpenStack interfaces from instances that no longer exist or just don't matter, click several of the "TAGS" so they are listed below, click the checkbox in the upper left to select all, double-check your selection, then click the "Delete" button.  Repeat for all hosts and your Zabbix will run dramatically faster with lower system load.

SMART drive monitoring using Zabbix

https://www.zabbix.com/integrations/smart

You need to be using Agent2, which I am doing for Docker support anyway.

You need to "apt-get install smartmontools" for Zabbix SMART monitoring to work.

"chmod u+s /usr/sbin/smartctl" is usually handled via Ansible.

Test by logging in to an OpenStack host as root, then running "sudo -u zabbix smartctl -a /dev/sda" or similar.

Note there is no default template for SMART although one can be downloaded and installed:

Download the SMART template from the integrations link above

In the Zabbix web UI, "Configuration", "Templates", "Import"

Select the file saved in the step above

Mark the required obvious options in import rules

"Import"

Your list of templates now includes "SMART by Zabbix agent 2"

Add the above template to hosts.

Note this adds an autodiscovery rule.

I've also tried setting "Plugins.Smart.Path=/usr/sbin/smartctl" in /etc/zabbix/zabbix_agent2.conf

The only place it's documented ANYWHERE that a sudo entry is required for SMART is:

https://www.zabbix.com/forum/zabbix-suggestions-and-feedback/415662-discussion-thread-for-official-zabbix-smart-disk-monitoring

See:

zabbix ALL=(ALL) NOPASSWD:/usr/sbin/smartctl

In:

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/sudo/files/sudoers

At this point, Zabbix SMART monitoring will work.

Tomorrow, a Prometheus / Grafana doubleheader!  Two for the price of one!

Stay tuned for the next chapter!

Saturday, August 13, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 046 - Skydive

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 046 - Skydive

Around this time, during the Plan 2.0 era, I stopped doing OpenStack directly and wandered off on a monitoring expedition, because I needed a way to troubleshoot broken clusters, and troubleshooting starts with troubleshooting tools: Skydive, Zabbix, and Prometheus / Grafana.  Later on, in the Plan 3.0 era, I provisioned an ELK stack (it's a bit resource hungry, or can be anyway, so I waited until I had more resources in Plan 3.0).

There is not much to this "Skydive" post because it's trivial to set up and easy to use.

First, links to some reference docs I used:

Kolla-Ansible Project Deployment Configuration Reference (PDCR) for Skydive

https://docs.openstack.org/kolla-ansible/yoga/reference/logging-and-monitoring/skydive-guide.html

Skydive Project repo

https://github.com/skydive-project/skydive/

Installation Notes:

Add one line, and deploy as usual:

enable_skydive: "yes"

As seen at this link:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/backups/os6.cedar.mulhollon.com/globals.d/skydive.yml

The default login seems to be the Horizon Web UI password.  I don't see an alternative password for Skydive in my /etc/kolla/passwords.yml file.

Skydive works immediately out of the box and is visually stunning.

Skydive integrates with Kolla-Ansible in the sense that it "just works" out of the box upon installation, but it does not "deep integrate" with OpenStack: instead of displaying OpenStack names for components, it shows UUID-type internal designators.  Very much like the experience of Zabbix monitoring.

Tomorrow, a MUCH longer post on Zabbix.

Stay tuned for the next chapter!

Friday, August 12, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 045 - Blazar Reservation Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 045 - Blazar Reservation Service

First, links to some reference docs I used:

Blazar Docs Page

https://docs.openstack.org/blazar/yoga/

There is no Kolla-Ansible Deployment Configuration Reference for Blazar.

Blazar CLI Client

Run this:

pip install python-blazarclient -c https://releases.openstack.org/constraints/upper/yoga

Experimenting with Blazar

I have a kind of embarrassing, funny story about my early days with Blazar: I should have realized that if I add my hosts to Blazar as a reservable resource, then I can't spawn instances on hosts I don't have leased.  Got burned on that one, LOL.  I know it's literally what Blazar is supposed to do, although I have to admit the error messages that resulted were so bizarre, so weird, that I thought I had accidentally killed Nova via other experimenting.  The UI when trying to use a reserved resource "illegally" could use a bit of work.

I also had an interesting design confusion about Blazar.  I was pretty chill with the idea that a host reservation reserves an entire Nova-Compute host, other than silliness like forgetting I had reserved all my hosts in an experiment one time, LOL.  However, I had assumed that an instance reservation was for a specific existing instance; perhaps I could somehow monopolize a specific mysql server or similar reservation task.  That is NOT the case, nor the design scenario: an instance reservation reserves the future creation of an instance, and creates a new UUID-numbered flavor ID.  Then, when your reservation is chronologically active, you can create an instance using that magical new flavor ID.  That's a cool idea, but not what I expected based upon the name of the reservation type.
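For reference, a host ("physical") reservation at the CLI looks roughly like the following.  The lease name and dates are invented, and the commands need a live cloud, so they are shown as a commented sketch rather than a recipe:

```
# Hypothetical host reservation: one whole compute host for a two hour window.
#   blazar lease-create \
#     --physical-reservation min=1,max=1 \
#     --start-date "2022-08-15 10:00" \
#     --end-date "2022-08-15 12:00" \
#     demo-lease
# Outside the lease window, scheduling onto the reserved host fails,
# which is exactly the lockout I describe above.
```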

I tested and experimented with Blazar, up to and including locking myself out of my own cluster temporarily, and it all seems to work very well, although I have no particular use for it at this time.  It really is good, reliable, cool, and interesting software, but it doesn't match my use case, so after my experimenting during the Plan 2.0 era, it's not getting installed in the Plan 3.0 era.

Tomorrow, Skydive.  No, not me with a parachute, I mean the cool graphical web-accessible network monitoring tool.

Stay tuned for the next chapter!

Thursday, August 11, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 044 - Designate DNS Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 044 - Designate DNS Service aka DaaS DNS as a Service.

First, links to some reference docs I used:

Designate Docs Page

https://docs.openstack.org/designate/yoga/

Kolla-Ansible Projects Deployment Configuration Reference, for Designate, points to:

Kolla-Ansible Networking Guide, Designate Chapter:

https://docs.openstack.org/kolla-ansible/yoga/reference/networking/designate-guide.html

Neutron's notes about integrating with Designate

https://docs.openstack.org/neutron/yoga/admin/config-dns-int.html

DNS integration with an external service AKA the "Use Cases" Document

https://docs.openstack.org/neutron/yoga/admin/config-dns-int-ext-serv.html

My DNS design:

My Samba Active Directory domain is cedar.mulhollon.com, and it's advised for anything attached to the AD to point to the domain controllers (I have four domain controller virtual machines, two on each OpenStack cluster).  So all four DCs are authoritative for cedar.mulhollon.com.

I have four resolver virtual machines, in the usual arrangement of two VMs on each cluster.  I've fooled around with all kinds of round robin and cross-cluster backup schemes, and in the end the simplest thing to do is like-to-like: dc22 lists dns22 as its first Samba "dns forwarder" and dns21 as its second backup, dc11 forwards to dns11 first, and so forth.

Designate, as configured by hand in the Plan 1.0 era, was authoritative on a set of Bind 9 installations, one on each OpenStack host.  Well, it's no longer the Plan 1.0 era, it's the Plan 2.0 era, and Kolla-Ansible, by default and somewhat to my surprise, spins up a "Bind 9 in a container" on the controller, aka os6 for Cluster 2.  My primary OpenStack Designate domain for cluster 2 will be os2.mulhollon.net.  So, os6 is authoritative primary for the domain os2.mulhollon.com (and, eventually, os3 would be authoritative for os1.mulhollon.com).  So that plain old network users can resolve names in the os2.mulhollon.com domain, each virtual resolver (see the paragraph above...) contains:

zone "infrastructure.os2.cedar.mulhollon.net" {
  type forward;
  forward only;
  forwarders { 10.10.20.56; };
};

Note that 10.10.20.56 is cluster 2's controller

My strategy is to configure both domains on each cluster such that Heat Orchestration templates can run on EITHER cluster with no changes, and DNS resolution will work.  Of course, if you install a wiki on cluster 2, the DNS entry for wiki.os2.mulhollon.com will work but wiki.os1.mulhollon.com will not.  For "domain wide" DNS I'd be using samba-tool and the Samba Active Directory hostname wiki.cedar.mulhollon.com regardless of which cluster it's installed upon anyway.

I have not experienced any problems with this design, so far.  But, it's early days...

I have been doing DNS "stuff" on and off for about a quarter century now.  I have to admit, this is the first time I've configured a DNS server via a REST API, or used a Python library to configure one.  Pretty cool!

Installation Notes

Installation was mostly per the Kolla-Ansible PDCR for Designate, aka the "Kolla-Ansible Networking Guide" Designate chapter.  

One minor difference is that my configuration strategy keeps /etc/kolla/globals.yml pristine, so I set enable_designate: "yes" in /etc/kolla/globals.d/designate.yml instead.

I do not alter my dns_interface setting; the default is network_interface which in my neutron config points to my management LAN.  I suspect almost all sysadmins run the default config of putting their Designate traffic over their management LAN, but alternatives are possible.

The default designate_backend is "bind9" and that's what I'm using.

I set my designate_ns_record to my controller, where Kolla-Ansible installed a Docker container of Bind 9.

I don't have multiple Designate-workers so I don't need to worry about how to configure redis to coordinate them.

This rather minimal configuration can be seen at the following link:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/backups/os6.cedar.mulhollon.com/globals.d/designate.yml

Designate CLI Client

You have to install the client before creating zones and doing "Designate Stuff" at the CLI.  Although I mostly use Heat Orchestration Templates, it helps during initial setup, testing, and troubleshooting to have a working CLI.

It boils down to running:

pip install python-designateclient -c https://releases.openstack.org/constraints/upper/yoga

See this link:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/installcli/installcli.sh

I did not do many demo scripts for Designate but this example shows an interesting characteristic of Designate:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/demos/designate/zonelist.sh

And that characteristic would be that zones are very much a project-based resource, so specify "--all-projects" to see all the zones/domains.  Rephrased: if your zone seems to be missing, your best first troubleshooting step is to check whether you're even logged into the correct project...

Nova and Neutron Autoconfiguration

Designate needs a designate-sink.conf file in /etc/kolla/config/designate/designate-sink.conf to tell Neutron and Nova what DNS zones to configure automatically.  The architecture page explains designate-sink.

https://docs.openstack.org/designate/latest/contributor/architecture.html#designate-sink

The meaning of the zone_id is completely undocumented, but looking at the REST API docs, I think it's the ID string you see in the ID column when you run an "openstack zone list".  I'm pretty sure it's not the name of the zone, because then it would be better named "zone_name".
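Under that interpretation, a designate-sink.conf might look roughly like this.  The zone IDs are placeholders for the UUIDs from "openstack zone list", and the handler section layout follows the Kolla-Ansible Designate guide, so treat this as a sketch:

```
# /etc/kolla/config/designate/designate-sink.conf (sketch; zone IDs are placeholders)
[handler:nova_fixed]
zone_id = 01234567-89ab-cdef-0123-456789abcdef

[handler:neutron_floatingip]
zone_id = 01234567-89ab-cdef-0123-456789abcdef
```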

I have this about 50% working.  I'm not entirely sure which 50% is working from day to day.  At some later date I will return to this.  I'm not super motivated about the time savings of automatic configuration when using the web UI and CLI, because I mostly Heat Template Orchestrate everything, and it's trivial to use Designate in an orchestration template.

Heat Template Operations

I did an "openstack zone create" for each separate project; note this needs project IDs, not project names.  A typical Heat template zone-creation resource looks like:

  zone_os2:
    type: OS::Designate::Zone
    properties:
      name: "iot.os2.cedar.mulhollon.net."
      email: "vince@mulhollon.com"
      type: PRIMARY

And here is a link to the entire template containing the above resource in context:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/projects/iot/iot/iot.yml

After a zone is created, it's time to stuff it with records.  Here's a typical Heat template resource record:

  my_forward_dns_os2:
    type: OS::Designate::RecordSet
    properties:
      type: "A"
      zone: "iot.os2.cedar.mulhollon.net."
      name: "hawkbit"
      records:
        - get_attr: [ my_floating_ip, floating_ip_address ]

And here's a link to the entire orchestration template:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/projects/iot/hawkbit/hawkbit.yml

I'm not sure I actually "need" Designate, but it's too cool not to play with.  So, I play.  So far it all works perfectly, easily, and reliably, although I'm still experimenting with the automatic entry creation feature in Neutron and Nova.

Tomorrow, Blazar Reservation Service.

Stay tuned for the next chapter!

Wednesday, August 10, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 043 - Zun and Kuryr Container Services

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 043 - Zun and Kuryr Container Services

First, links to some reference docs I used:

Zun Docs Page

https://docs.openstack.org/zun/yoga/

Zun Launchpad Bugs Page

https://bugs.launchpad.net/zun

Kolla-Ansible Deployment Configuration Reference for Zun

https://docs.openstack.org/kolla-ansible/yoga/reference/compute/zun-guide.html

Kolla-Ansible Deployment Configuration Reference for Kuryr

https://docs.openstack.org/kolla-ansible/yoga/reference/containers/kuryr-guide.html

Main Zun Repository

https://opendev.org/openstack/zun

Installation

In /etc/kolla/globals.d/zun.yml I needed to add:

enable_zun: "yes"
enable_kuryr: "yes"
enable_etcd: "yes"
docker_configure_for_zun: "yes"
containerd_configure_for_zun: "yes"

(The corresponding globals.d backup is in my openstack-scripts GitLab repo.)

The Kolla-Ansible install itself was surprisingly uneventful.  I had to kolla-ansible bootstrap-servers before running a kolla-ansible deploy to make the necessary network changes.

Installing the Python support in the CLI for Zun

Run:

pip install python-zunclient -c https://releases.openstack.org/constraints/upper/yoga

As seen here:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/installcli/installcli.sh

Note that the docs for "openstack appcontainer cp" and "zun cp" are technically correct, although in my opinion they mislead sysadmins into thinking that a Zun cp uses "scp" syntax (which will not work).  See openstack-scripts/demos/zun/pingtest/read-file.sh and save-file.sh.  I would summarize the issue as: when doing a zun cp, the source is always one specific file and the destination is always a directory.  Or just remember that the CLI looks like scp syntax, but it isn't scp.
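Two hypothetical invocations that illustrate the rule.  The container name "pingtest" is from my demo scripts, the file names are invented, and the commands need a live Zun deployment, so they are shown as a commented sketch:

```
# Looks like scp, is not scp: the source is one file, the destination a directory.
#   openstack appcontainer cp pingtest:/etc/hostname /tmp/   # container file -> local directory
#   openstack appcontainer cp ./data.txt pingtest:/root/     # local file -> container directory
```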

I am looking into the detailed instructions at:

https://docs.openstack.org/contributors/code-and-documentation/index.html

such that I can submit a documentation fix as per the above for: 

https://docs.openstack.org/python-openstackclient/yoga/cli/plugin-commands/zun.html

Demo

See openstack-scripts/demos/zun/pingtest at:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/tree/master/demos/zun/pingtest

It's pretty simple to use.  Early on, I was storing Docker images in Glance, which scales well to large numbers of identical images, although I later stopped using Glance.

Another difference between the demo and my production containers is the demo uses the CLI exclusively whereas my production images use HEAT orchestration templates exclusively.

There's a good example of a Zun HEAT template at:

https://opendev.org/openstack/heat-templates/src/branch/master/hot/zun/webapp.yaml

A typical real world Heat template orchestrated Zun container deployment can be seen here:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/tree/master/projects/iot/nodered

This runs Node-RED very well indeed...

Overall I was pleasantly surprised how well Zun works at hosting container workloads.

Tomorrow, Designate DNS service.

Stay tuned for the next chapter!

Tuesday, August 9, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 042 - Swift Object Storage Service

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Adventures of a Small Time OpenStack Sysadmin Chapter 042 - Swift Object Storage Service

First, links to some reference docs I used:

Swift Docs Page

https://docs.openstack.org/swift/yoga/

Kolla-Ansible Deployment Configuration Reference for Swift

https://docs.openstack.org/kolla-ansible/yoga/reference/storage/swift-guide.html

Swift Ops Runbook

https://docs.openstack.org/swift/yoga/ops_runbook/index.html

Configure Swift

Configuring Swift is a lot of work compared to other services.  It's pretty rough.

There's a bug in the docs and/or kolla-ansible where, if you follow the installation instructions, storage policy index 0 is not completely configured, AND there's an auth issue where you have to set a delayed-auth setting if you want to access storage policies via the API.  That means you CAN use the CLI to create containers, but the Horizon web UI can NOT create containers until those two settings are fixed.  I opened a bug report at Kolla-Ansible explaining what the bug is and how to fix it.

I found another problem relating to "swift_delay_auth_decision".  I would strongly suggest that it is a bug for Kolla-Ansible to permit a user to configure "cinder_backup_driver: swift" at the same time as the default configuration of "swift_delay_auth_decision: no", because that'll crash the cinder_backup container as soon as a volume backup is attempted.  At least as of Yoga.

Maybe I should not have opened three separate problems in the same bug report, but since no one was working on those bug reports a month later, it probably doesn't matter.


Next, the rings. I have to set up Swift rings by hand as per:

https://docs.openstack.org/kolla-ansible/yoga/reference/storage/swift-guide.html

Note that the kolla-ansible docs provide an opaque process using docker to generate rings, with a pointer to the swift docs as an explanation, whereas the swift docs use a completely different method. So that's confusing.

See the openstack-scripts/ringmaker scripts at:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/tree/master/ringmaker
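For flavor, the Swift docs' by-hand method drives swift-ring-builder directly.  A sketch of a minimal object ring, where the part power, replica count, IPs, device names, and weights are all invented for illustration (and the tool only exists where Swift is installed, so this is a commented sketch):

```
# Hypothetical minimal object ring: 2^10 partitions, 3 replicas, 1 hour min part hours.
#   swift-ring-builder object.builder create 10 3 1
#   swift-ring-builder object.builder add r1z1-10.10.20.1:6000/sdb 100
#   swift-ring-builder object.builder add r1z1-10.10.20.2:6000/sdb 100
#   swift-ring-builder object.builder add r1z1-10.10.20.3:6000/sdb 100
#   swift-ring-builder object.builder rebalance
```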

As for the plain old Kolla-Ansible configuration, it's pretty simple.  Here is the link to my /etc/kolla/globals.d/swift.yml:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/backups/os6.cedar.mulhollon.com/globals.d/swift.yml

And it looks like this:

# /etc/kolla/globals.d/swift.yml
#
# On OS6 for cluster 2
#
---
enable_swift: "yes"
enable_swift_recon: "yes"
swift_delay_auth_decision: "yes"
# Later on...
# swift_storage_interface: "bond34.30"
# swift_replication_interface: "{{ swift_storage_interface }}"
#

The "Later on" is for eventually moving Swift traffic off the management VLAN onto a dedicated pre-existing storage VLAN.  There would be zero performance gain, and I'm not using Swift much, so the motivation level has been low on that one, but "eventually" I will push Storage and Overlay onto their dedicated networks toward the end of the Plan 3.0 era, or maybe in the Plan 4.0 timeframe.

Install CLI

See the script installcli.sh:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/installcli/installcli.sh

Or just run:

pip install python-swiftclient -c https://releases.openstack.org/constraints/upper/yoga

Volume Backup

https://docs.openstack.org/cinder/yoga/admin/volume-backups.html

https://docs.openstack.org/cinder/yoga/admin/volume-backups-export-import.html

I have an imitation of a backup system working now with Cinder and Swift.  However, note that I can spin up and Ansible-configure a new instance in less than ten minutes, whereas a backup, and presumably a restore, could take over half an hour during my experiments, and I already have a disciplined system of not storing data on servers unless it's on an otherwise backed-up NFS server.  So I'm not thinking I'll be doing very much OpenStack-level backup.  If orchestration and automation are faster than restoring a backup...

rclone

https://rclone.org/swift/

rclone is kind of an "rsync for clouds" that is Swift-compatible, which could be incredibly useful for some applications.  Install it with:

apt-get install rclone

This is a big job, for another day.  Presently I'm not using Swift other than to store Glance images, so I have little to copy right now...
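When I do get to it, a Swift remote in rclone's config might look roughly like this.  The remote name, credentials, and the Keystone URL are all invented; the key names follow the rclone Swift backend docs:

```
# ~/.config/rclone/rclone.conf (hypothetical Swift remote; values are made up)
[os2swift]
type = swift
env_auth = false
user = demo-user
key = demo-password
auth = http://10.10.20.56:5000/v3
domain = default
tenant = infrastructure
auth_version = 3
```

After which something like "rclone ls os2swift:demo" would list the demo container.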

Demonstration Scripts

These can be found at:

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/tree/master/demos/swift

./create.sh

This creates a temp file containing a timestamp, creates a container named demo, uploads the temp file as an object into that container, and deletes the local copy of the temp file.
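Based on that description, a plausible reconstruction of create.sh (the real script is in the GitLab repo; this sketch assumes an already-authenticated OpenStack CLI, so the cloud-dependent lines are shown commented out):

```shell
#!/bin/sh
# Sketch of demos/swift/create.sh as described above.  The openstack calls
# need an authenticated cloud, so they are commented out here.
TMPFILE="$(mktemp)"
date > "$TMPFILE"            # temp file containing a one-line timestamp
# openstack container create demo            # create the "demo" container
# openstack object create demo "$TMPFILE"    # upload the temp file as an object
rm -f "$TMPFILE"             # delete the local copy of the temp file
```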

./show.sh

This script shows the demo container's status, lists the contents of the demo container, and dumps the tempfile object to stdout so you can see the contents of the file (which should be a simple one-line timestamp).

./delete.sh

This merely deletes the demo container and every object inside it.

The good news is that, after some pain installing it, Swift works really well.  The bad news is that I currently am not using it other than as a backend for Glance.  I have some future plans to stash snapshot-style backups into Swift, and there are some cool utilities that gateway Swift into a filesystem.  It would make a truly awful transactional database backend, but it's a pretty cool way to store relatively permanent data like home movies.

Tomorrow, some new (to me) services: Zun and Kuryr.  I'm pretty excited because manual installation of those two seems nearly insurmountable, but Kolla-Ansible supposedly makes it easy.

Stay tuned for the next chapter!