Thursday, December 22, 2022

Ubiquiti Firmware Update Requirements

I solved a Ubiquiti networking equipment problem yesterday.

The Ubiquiti system is very powerful and versatile, yet marketed as "just plug it together", so there are hidden and unclear pitfalls.

The initial problem statement was something like "device firmware updates are reported as available, but update process fails, no other known issues".

Eventually I located and fixed two separate problems:

Each hardware device in a Ubiquiti network participates fully in the network, having IP address, routing, etc.  This... also includes DNS.  Due to some re-addressing work all the Ubiquiti devices had the former IP addresses for DNS servers.  This had no apparent impact on day to day performance, until firmware upgrades were attempted.  It didn't seem to matter if the firmware was cached on the controller or not.  I'd theorize the firmware is going out to the internet to verify firmware keys and refusing to boot if unable to verify authenticity.  The symptom was attempts at firmware updates would take seemingly forever (at least more than five minutes per attempt per device), then fail and roll back to the old firmware.  Anyway, regardless of hypothesis about "how" it works, every individual device on the Ubiquiti network needed to be manually updated to the correct, new, working DNS server address.

Generally speaking, in the Ubiquiti web interface, you select a device, "Settings", "Network", and edit "Preferred DNS" and "Alternate DNS".
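Before touching every device, it can help to confirm which DNS server addresses still answer at all. The following is a minimal sketch using only the Python standard library: it hand-builds one DNS query and sends it over UDP to a given server. The function names and the `example.com` lookup are illustrative assumptions, not anything Ubiquiti provides.

```python
import socket
import struct


def build_dns_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.split(".")
    ) + b"\x00"
    # Question type A (1), class IN (1).
    return header + qname + struct.pack("!HH", 1, 1)


def dns_server_answers(server_ip, hostname="example.com", timeout=2.0):
    """Return True if server_ip replies to a DNS query within the timeout."""
    query = build_dns_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts and unreachable-network errors alike.
            return False
    # A valid reply echoes the query id and has the response (QR) bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)
```

Calling `dns_server_answers()` once with the former DNS address and once with the new one, from a machine on the same network as the devices, should show which one is dead.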

There's always some give and take in a centralized configuration system, or in this case, push and pull.  Take for example Puppet, which has a remote agent periodically pull configs from a central server, vs Ansible, where a central server pushes configs out to remote devices.  Sure, in a simplistic way, Ubiquiti pushes:  You configure stuff in the web interface, such as DNS server info, then "Apply Changes", and in a disturbingly long several-minute process the device goes offline, shows "Getting Ready" status, then comes back online.  However, how does it work under the surface?  There is a system-wide setting, mysteriously named "Override Inform Host", that is pushed to every Ubiquiti device, asking it to pull from a new controller IP address.  I've used this many times in the past when installing new controllers; I've migrated from an old cloud key to a new cloud key to a much faster Docker installation of the controller.  Anyway, for whatever reason this site setting had the old IP address for the inform host.  I'm not entirely sure how or why the system was working, but it was working perfectly... except that no upgrades were possible.

In the Ubiquiti web interface, select "Settings" then "System" then scroll to the very bottom to "Override Inform Host"; if that's enabled, it should be the IP address of the current (or new?) controller.

One of the many advantages of a Docker install of the controller is access to the "unifi/log/remote" directory.  There should be a file for each device's IP address and the logs are interesting to watch during a firmware upgrade attempt.

So, the lesson learned, is a Ubiquiti network can operate perfectly normally with an incorrect IP address for DNS servers in each device AND an incorrect IP address as the system-wide "Override Inform Host", but either mistake will prevent successful firmware upgrades.  Who could have guessed?

Anyway, in the end, it all worked!

Monday, November 28, 2022

I wrote some digital input to USB HID support software using a $6 microcontroller

I made a fun, simple, working, weekend project:

https://gitlab.com/SpringCitySolutionsLLC/usb-hid-circuitpython-rp2040

Given a video game like DCS World or Elite Dangerous, it's fun to make a "real world" control panel with real world switches hooked up to the game, and people have been doing this for many years with "hundred dollar" digital input to USB adapter / appliances.

However, technology has improved over the years, and now for only $6, anyone can buy a Raspberry Pi RP2040 microcontroller running "CircuitPython", which is a limited Python language wedged into a microcontroller under an MIT free software license.  I decided to use the same MIT software license for my software linked above, please use it and have fun!  The RP2040 supports custom USB HIDs so I can easily emulate a multi-button joystick in software, and the buttons of this emulated joystick can then follow the switch inputs of the microcontroller's GPIO ports.

The hardware costs of the entire project are $6 for the microcontroller (plus a USB cable, I guess) and the software costs of the entire project, including the software I wrote, are zero dollars completely open source software.

I think this was a relatively simple software task.  Surprisingly, there do not seem to be many competing examples of this kind of code.  I tried to write the code and docs as obvious and clear and simple as humanly possible, although USB HID is ... non trivial so there is some inherent and unavoidable complexity in even the most minimal system.  I believe most home hobbyist simulation panel constructors are much handier with a miter saw or drill press than with a fancy software engineering IDE, and I also think they have more experience flying real airplanes than writing avionics software, so I'm trying to respect the needs of my end users, by making the code as simple as possible.  Just read the source code top to bottom, there's nothing tricky about it.

The demonstration code I have simply connects three toggle switches to a virtual joystick's first three buttons.  Obviously this could be trivially extended to something with more switches like a 'Huey' UH-1H helicopter's C-1611 Signal Distribution Panel, so you could flip a switch and turn the in-game helicopter's various radios on and off, or any other control panel in the game, such as the engine fuel control panel, or the overhead panel with the cockpit light dimmer dials.  Or this could be applied to any other simulation-type game, perhaps to control your landing gear in Elite Dangerous.
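The core of the switch-to-button mapping can be sketched in plain Python that runs on a desktop. This is not the code from the linked repository: the 16-button report size and the one-bit-per-button layout (button 1 in the least significant bit) are assumptions for illustration. On the RP2040 the switch states would come from the GPIO pins and the resulting bytes would be sent as a HID report.

```python
def pack_buttons(switch_states, button_count=16):
    """Pack a list of switch states into a little-endian button bitmask.

    switch_states[0] drives button 1 (bit 0), switch_states[1] drives
    button 2 (bit 1), and so on. The layout is an assumed example.
    """
    if len(switch_states) > button_count:
        raise ValueError("more switches than joystick buttons")
    mask = 0
    for index, closed in enumerate(switch_states):
        if closed:
            mask |= 1 << index
    # Round the button count up to whole bytes for the report payload.
    return mask.to_bytes((button_count + 7) // 8, "little")
```

With three toggle switches where the first and third are on, `pack_buttons([True, False, True])` gives the two bytes `05 00`. Extending to a bigger panel only means passing a longer list.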

CircuitPython provides a very enjoyable software development experience.  A bit smoother integration with VS Code or Sublime Text would be nice.

Wednesday, October 5, 2022

Solution for upgrading ISM43362-M3G-L44 wifi module firmware on a ST B-L4S5I-IOT01A dev board

Why I'm trying to upgrade the wifi module's firmware.

I have been experiencing problems with network offload on Zephyr RTOS using a ST B-L4S5I-IOT01A dev board which has an onboard Inventek ISM43362-M3G-L44 wifi module.  It's not completely dead: I have it successfully connecting to my wifi, and the offload processing works well enough that I can ping the module and telnet to the wifi coprocessor, but there's something not "connecting" between the processors, in the sense that I can't get telnet to the dev board shell or anything else on Zephyr to connect in a conceptual sense to the wifi.  Offload "works" in that the C language callbacks fire and I can report my DHCP provided wifi IP address, which is how I can ping it, but it's like the application TCP/UDP sockets are refusing to connect to the wifi interface, even though they kind of are doing that; most aggravating.

For a variety of reasons, I decided to upgrade the wifi firmware.  I've read elsewhere about how Azure connectivity with this board will NOT work unless the wifi firmware is upgraded, so that is very interesting although I'm not using Azure in this application.  Another reason is the usual troubleshooting logic: I've already tried everything easier and more obvious.  My final justification is I have no idea what version of the wifi coprocessor firmware was shipped on the board, so at least after an upgrade I'd know for certain what version I'm running, which probably provides at least some troubleshooting value.

I did not decide to upgrade the wifi firmware for fun or because I was bored, LOL.

Description of the wifi module

The wifi module is similar to an ancient Hayes-compatible phone modem, serial attached with "AT" command set, that connects to wifi networks instead of phone lines.  Also instead of connecting to the device using a serial port like a traditional phone modem, this module has the option of SPI connection for higher speed, and on this ST dev board, it connects via that SPI interface.  So in one way of looking at it, it's exactly like a phone modem, except when looking at it in another way, everything is different from a phone modem (LOL?).

Finding references, instructions, tools for the upgrade process

If you google for the wifi module and firmware, which is a pretty good strategy most of the time, one of the top results is this URL:

https://www.inventeksys.com/es-wifi-firmware/

Sadly, the most recent firmware on that page is v3.5.2.3 from Oct 10 2016.  That's not going to help.  I don't think the CPU on this dev board had been released back in 2016.

Further research led me to:

https://www.inventeksys.com/ism4336-m3g-l44-e-embedded-serial-to-wifi-module/

From that device page, you can find a link to another, separate and newer, firmware upgrade page for the device:

https://www.inventeksys.com/iwin/firmware/

The highest version available on THIS download page is 3.5.2.7, that's cool.

Next we have to upload the firmware into the wifi module.  If you have a serial connection, with a switch selectable USB port or similar like the Arduino shields provide, this task is pretty easy.  If you have a PCB with a standard JTAG pinout and a JTAG programmer this task is also pretty easy.  The ST board has none of the above.  What the ST provided upgrade solution does have is a very small binary program, provided by ST, that enables SPI passthrough so STM32CubeProgrammer or anything else that speaks ST/Link can program what it thinks is the STM32 flash, but it's actually programming the flash in the wifi module.  What could possibly go wrong?

The software to upload the firmware is on the ST website, in the "docs" section not "tools":

https://www.st.com/en/evaluation-tools/b-l4s5i-iot01a.html#documentation

The best instructions for the firmware upgrade seem to be in the upload tool's README file, although Inventek provides a link to a .pdf which has some alternative options:

http://www.inventeksys.com/iwin/wp-content/uploads/Firmware-Upgrades.pdf

How the upgrade process was supposed to work

I followed the README upgrade instructions from ST.  First you download the firmware zip from inventeksys.com, using the correct, newer download page, not the Google-provided older download page.  Then unzip it and change the filename from *.bin.rename to *.bin; the firmware is double wrapped like this to get past overzealous Windows anti-virus software.  Then copy the renamed firmware into the bin directory of the ST provided firmware upgrader.  The firmware upgrader batch file uploads a tiny bin file that puts the dev board into SPI passthrough, and then uses the CLI for STM32CubeProgrammer to burn the firmware into what it thinks is the STM32L4 chip but is actually the passed-through wifi module's flash chip.  It's a rather short and simple Windows batch file.
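The rename-and-copy step is easy to get slightly wrong by hand, so here is a minimal sketch of it. The function name and both directory arguments are hypothetical; point them at wherever the Inventek zip was unpacked and at the bin directory of the ST upgrader.

```python
import shutil
from pathlib import Path


def stage_firmware(unzipped_dir, upgrader_bin_dir, suffix=".rename"):
    """Copy each *.bin.rename file into the upgrader's bin directory as *.bin."""
    staged = []
    dest_dir = Path(upgrader_bin_dir)
    for src in sorted(Path(unzipped_dir).glob("*.bin" + suffix)):
        # Strip only the trailing ".rename", so "fw.bin.rename" becomes "fw.bin".
        dest = dest_dir / src.name[: -len(suffix)]
        shutil.copyfile(src, dest)
        staged.append(dest)
    return staged
```

The original download is left untouched, so the step can be repeated safely if the upgrade needs another attempt.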

Something undocumented, but cool, is when the wifi module is in passthrough mode the blue LED turns on.  At least ... I think that's what the LED implies.  The STM32L4 firmware is a binary blob and I didn't run the debugger on it so who really knows?

What actually happened during the upgrade process

There is a slight problem, if you run update_Wifi.bat on a Windows 11 PC with the newest version of STM32CubeProgrammer and newest ST/Link firmware on the dev board, the dev board bootloader pass thru is downloaded to the STM32L4 and it runs successfully, OK so far.  Then the command to bulk erase the Inventek module's flash runs, and eventually times out while in the process of erasing the flash (whoops).  Then the program stage runs, tries to re-issue the erase command before writing the firmware, however the flash chip is still erasing from the previous bulk erase command, so it crashes the write process and never writes the firmware to the wifi.  So now we have a wifi module with erased firmware flash.  That's not good.

It gets somewhat worse.  If you reinstall MCUBoot and a Zephyr image, MCUBoot will jump to the Zephyr image, which silently fails and locks up the STM32L4 during initial bootup of Zephyr before it even jumps to main().  So time can be wasted troubleshooting the reinstall itself ("well, obviously I forgot to ... confirm the Zephyr image or something").  So the board seems bricked and everyone is unhappy.

Unbricking and successful wifi module firmware upgrade process

The board, however, is not bricked.  I would theorize the problem is older versions of STM32CubeProgrammer and/or the ST/Link firmware would not return from the erase command until the erase was fully complete and/or the default delay to wait for bulk erase completion before timing out was longer.  Also I don't really see the point of bulk erasing the wifi flash if the next step in the batch file, the write step, begins by erasing the flash anyway.

So use vim, or whatever editor you prefer, to comment out the bulk erase step in update_Wifi.bat by prefixing this line with "rem":

rem %PROGRAMMERPATH% -c port=%COMPORT% br=115200 p=EVEN db=8 sb=1 fc=OFF -e all

Then re-run the uploader script and it seems to work quite well, instead of timing out and screwing everything up past the failed erase command, it skips the bulk erase that doesn't work anyway, and runs the next command in the uploader script which successfully erases the flash, writes the firmware, and does a verify read pass in about two and a half minutes.
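For anyone who would rather script the edit than open an editor, the same change can be sketched as a small text filter. The matching rule is an assumption based on the single line quoted above (a command that starts with %PROGRAMMERPATH% and ends with "-e all"); check it against your copy of update_Wifi.bat before trusting it.

```python
def skip_bulk_erase(batch_text):
    """Return batch_text with the stand-alone bulk erase command commented out."""
    patched = []
    for line in batch_text.splitlines(keepends=True):
        stripped = line.strip()
        # Assumed signature of the bulk erase command; lines that already
        # start with "rem" do not match, so the function is idempotent.
        is_bulk_erase = (
            stripped.startswith("%PROGRAMMERPATH%")
            and stripped.endswith("-e all")
        )
        patched.append("rem " + line if is_bulk_erase else line)
    return "".join(patched)
```

When reading and writing the real file from Python, opening it with `newline=""` keeps the Windows line endings exactly as they were.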

After this, unplug and plug the board back in, re-upload the MCUBoot firmware, re-upload the Zephyr RTOS image, and it works exactly the same as before.  Unfortunately this firmware upgrade had zero effect on my Zephyr networking offload problems.  But at least the dev board isn't bricked anymore, and I know it has the latest wifi module firmware version installed.

Summary

If you're trying to upgrade the firmware on an Inventek ISM43362-M3G-L44 module which is SPI attached on an ST B-L4S5I-IOT01A dev board, you'll run into two problems.  There are two firmware download pages at Inventek and Google will send you to the older download page with half a decade old firmware that doesn't work with the ST uploader.  The other problem is the ST batch file to automate firmware upgrades will brick your board, although with some extremely minor editing the script will work and un-brick your board and successfully upgrade your wifi module firmware.

This firmware upgrade process, overall, was somewhat more exciting than is necessary.

Monday, September 19, 2022

Zephyr RTOS with MCUboot "USAGE FAULT" after "Jumping to the first image slot", fixed.

Subjectively, what is it like to develop embedded software?

When things go wrong, it's days of feeling like your hands are wrapped in mittens while wearing a blindfold, with your primary source of documentation being a Douglas Adams novel like "The Hitchhiker's Guide to the Galaxy".  Or even worse, a Charles Stross "Laundry Series" novel.  Then, after days of struggle, the problem is fixed!  Development rapidly proceeds at a supersonic pace.  Until impact with the next block, stuck for a couple days of zero progress.  Such is life when doing embedded / IOT software development.


Take for example, this recent three day struggle:

Consider that for more than a year, you've been using older, lower RAM STM32 dev boards like the disco_l475_iot1, a truly fine board, other than only having about 128K of ram and Zephyr needing about 127.9K to run everything other than your application code, which will only require another 128K of ram LOL.  Or a classic F767 board like the nucleo_f767zi, a wonderful board and a joy to use, other than being wired ethernet instead of wifi.  The good news is the b_l4s5i_iot01a is a massive upgrade over the disco_l475_iot1; it has around six times the ram... awesome... it's the board every "starved for memory" disco_l475_iot1 developer dreams of... this should be a simple port job ... Right?  Right?  Just recompile and upload?  It's never that easy in embedded development...

Not to put the cart before the horse, however, the problem I ran into is, in a technical sense only, documented at:

https://docs.zephyrproject.org/latest/services/device_mgmt/dfu.html

Where it explains you need to set the zephyr,code-partition in your DTS overlay file using a line looking something like:

zephyr,code-partition = &slot0_partition;

However, old dev boards automatically or automagically had that set, no developer effort required.  To make it crystal clear, although it disagrees with the docs above, code for old boards (admittedly, using older versions of Zephyr...) works perfectly without setting a code-partition variable.  I just verified using the latest version of Zephyr that a venerable 'F767 compiles and boots perfectly with MCUboot, no need to manually set a code-partition setting.

What happens if you do not configure a code-partition and compile for a nice new l4s5i board?  Certainly, not any error messages upon compilation, not the slightest indication of any future problems incoming.  Run your imgtool to sign the new image with firmware keys and mark as confirmed so MCUboot will be happy, upload to your dev board at address 0x08020000 as you do on that platform, reboot, and I kid you not, this is all you have to troubleshoot the problem:

(the usual MCUboot startup messages, it recognizes my primary image as good, etc)
I: Jumping to the first image slot
E: ***** USAGE FAULT *****
E: Attempt to execute undefined instruction
(boring register dump here)
E: Faulting instruction address (r15/pc): 0x0801bca8
and it crashes out and halts the system.

Notice that the PC address where it crashes out is NOT in the Zephyr firmware slot ranges, it's in the MCUboot range of addresses.  Interesting!  So it's "obviously" a MCUboot bug, amirite?  Not so fast...
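The address check is simple arithmetic, sketched below. STM32 internal flash starts at 0x08000000 and the image upload address of 0x08020000 comes from this post; the assumption that MCUboot owns everything in between, and the slot size, are placeholders to be replaced with the values from the board's flash partition table.

```python
FLASH_BASE = 0x08000000   # start of STM32 internal flash
SLOT0_START = 0x08020000  # image upload address used in this post
SLOT0_SIZE = 0x000E8000   # hypothetical size, read the real one from the board's DTS


def classify_pc(address):
    """Name the flash region that contains a faulting program counter."""
    # Assumes the bootloader region runs from the flash base up to slot 0.
    if FLASH_BASE <= address < SLOT0_START:
        return "mcuboot"
    if SLOT0_START <= address < SLOT0_START + SLOT0_SIZE:
        return "slot0"
    return "other"
```

Here `classify_pc(0x0801bca8)` returns "mcuboot": the fault address sits below the slot 0 start, which is what made this look like a bootloader bug.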

I think I have a good MCUboot compilation and source tree, in fact if I change the "west" command to force the same zephyr source tree to compile MCUboot for my old boards, they work fine.  So, a code bug in MCUboot?  I donno, there's not much space in "jump to the first slot image" to have a bug. 

Clearly (sarcasm) it can't be a problem with my code that is configured as per my old notes and runs fine on my old boards.  MCUboot is crashing at an MCUboot program counter address ONLY on the brand new dev board and fine on the other hardware.

Try one of several dev boards?  Sure, why not.  Turns out none of them work, all fail the same way.  OK then.  So it's not hardware, it's not Zephyr, it's crashing at a MCUboot address, it's not my build system, which works fine when told to compile for other boards...  I've "proven" it has to be a MCUboot problem.

I spent a couple days learning about the innards of MCUboot.  Fun, but not productive and it didn't fix my problem.

After some days of messing around trying everything else under the sun, as typical for embedded development, I finally checked the Zephyr docs for device management (as opposed to using my documentation and source code comments and working source code) and sometime after Zephyr version 2.2.1 and before Zephyr 3.0.0 the online Zephyr docs were changed to reflect the need for a code-partition setting (apparently only for SOME boards, as most of my boards don't require this, and certainly never required it in the past).

Add that code-partition config option and recompile and imgtool sign the binary, upload with STM32CubeProgrammer, and we're off and running, where I should have been three days ago.

Two interesting summary points:  Good luck finding anything explaining this using Google, because I certainly did not; all I had to work with was the crash address 0x0801bca8 and that isn't finding much.  Also, yes, weird as it sounds, I have proven you can crash MCUboot while it's running, by giving it a Zephyr firmware image that compiled perfectly yet was incorrectly configured.

By posting this on my blog maybe Google will index it, then people who experience crashes in MCUboot at address 0x0801bca8 on the newest L4 STM32 boards while trying to run old code that runs fine on older boards, will find this explanation and fix...


I could also tell the typical embedded development story of how my spyware blocking "lie protection" software thought the Ninja build tool, used by Zephyr, is spyware so it would crash the build for fun, not bother to log any errors or other explanation, just kind of randomly crash the suspected spyware program while in mid-build.  That was so much fun.


So... yeah.  That's what people mean when they talk about the Embedded IoT software development experience.  On the other hand, it's more fun than the above blog post implies, really, it is.  It's just that the struggles are real.

Thursday, September 1, 2022

Notes on Node-RED development on a Raspberry Pi using Visual Studio Code

In the past, I developed Node-RED nodes by locally editing using Visual Studio Code and exporting the directory via NFS to a Raspberry Pi with the appropriate hardware installed, and the node was added to Node-RED via symlinks.  It worked, but it was brittle and there can be permissions issues.  It's also annoying that if the Pi reboots and cannot connect to the fileserver, flows will not start because the nodes are not present.  It's just so tedious.

My new, improved development environment involves VSC's "remote-ssh" extension.

As always, detailed notes make it easier to set things up.  There are so many small details from configuring git on the RaspPi to removing the "nano" editor to make it easier to do command line git commits.

So, here are my cut and paste notes that start with a newly installed RaspPi (or rephrased, OS and Node-RED previously installed) and end with a working VSC development environment for Node-RED:

On Desktop:

Install Visual Studio Code extension “Remote - SSH”

https://code.visualstudio.com/docs/remote/ssh

https://code.visualstudio.com/docs/remote/linux

https://code.visualstudio.com/docs/remote/troubleshooting#_installing-a-supported-ssh-client

https://code.visualstudio.com/docs/remote/troubleshooting#_improving-security-on-multi-user-servers

Install Windows OpenSSH client

https://docs.microsoft.com/en-us/windows-server/administration/openssh/openssh_install_firstuse?tabs=gui

Move SSH keys around for key-based SSH authentication

On New Remote:

On windows scp the key from the windows host to the new remote

scp c:\users\vince\.ssh\id_rsa.pub pi@new_remote:

then while logged into the new remote run:

cat id_rsa.pub >> .ssh/authorized_keys

Then test a login from windows to the new remote

ssh pi@new_remote
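The `cat ... >>` step appends the key again every time it is run. A minimal sketch of an idempotent version follows, meant to be run on the new remote; the function name is an assumption, and the two paths are simply the ones used in the notes above.

```python
from pathlib import Path


def add_authorized_key(pubkey_path, authorized_keys_path):
    """Append a public key to authorized_keys unless it is already listed.

    Returns True if the key was added, False if it was already present.
    """
    key = Path(pubkey_path).read_text().strip()
    auth = Path(authorized_keys_path)
    # sshd expects ~/.ssh to be private to the user.
    auth.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    text = auth.read_text() if auth.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line in case the file lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with auth.open("a") as handle:
        handle.write(prefix + key + "\n")
    auth.chmod(0o600)
    return True
```

Usage would look like `add_authorized_key("id_rsa.pub", ".ssh/authorized_keys")` from the pi user's home directory.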

Create a SSH key on the new remote and add the .ssh/id_rsa.pub to GitLab

ssh-keygen, then add the key to GitLab

Prepare git on the new remote

On the new remote:

sudo apt-get install git

sudo dpkg --purge nano

git config --global user.email "vince.mulhollon@springcitysolutions.com"

git config --global user.name "Vince Mulhollon"

Install the git repo for the project on the remote

git clone the applicable Node-RED node project into ~ directory

Add git repo to installed Node-RED

cd ~/.node-red

npm install ~/node_git_repo_from_above

Restart node-red

./node-red-restart

Install standard.js in a project

npm install standard --save-dev

Then run it with “npx standard”.

Install standard.js extension in VS Code

“View” “Extensions”

Search for vscode-standard

Click install

Remember to install the standard engine as a devDependency in the project and the extension works automatically.

Anyway I hope these detailed notes help somebody do remote Node-RED development on a Raspberry Pi using Visual Studio Code.


Monday, August 29, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 057 - ELK Stack

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

Why?

An ELK system for end user systems such as server syslog messages, ethernet switch logs, and similar things, is technically not an OpenStack service.  However, as the overall project intention is to replace the complete functionality of a VMware cluster, and VMware clusters have Log Insight to centralize logging, I describe how I set up an ELK stack to replace my VMware Log Insight installation.

The Plan

After some research, the plan is to add a Docker host to hold an ELK stack all-in-one Docker container.  Unlike running Zun containers natively on OpenStack, a dedicated host can connect to my large and backed up NFS NAS for log storage, by bind mounting the Docker container volumes.  So I will roll out yet another Ubuntu 20.04 instance with Docker installed, and integrate it fully into the LAN including Active Directory SSO, roaming home directories, Zabbix and Portainer monitoring, etc.

Create a New Virtual Server

I have a nice checklist in the Ansible repo, so all new server rollouts on the OpenStack clusters are consistent and easy, as detailed below.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/ubuntu-server-20.04.txt

In Todoist, which is an online and mobile to-do tracking app, I create a task for the new server with a due date in the future to schedule upgrades. In two months the to-do task will reach the top of the queue and I will upgrade and examine this system.  Documenting and scheduling future upgrades using Todoist takes about two minutes. 

In my dockerized Netbox installation for IP allocation and management, I select an IP address, create the server entry in Netbox, etc.  So, looks like elk.cedar.mulhollon.com will be at IP address 10.10.7.23.  This takes about two minutes.

Logged into the OpenStack controller, I create a new HEAT orchestration template.  The server templates are all similar, aside from obvious differences such as IP address and security groups, so this process is fast and easy with search and replace in the editor.  This takes about five minutes depending on how "unusual" the configuration is.  If I'm configuring my fourth identical Active Directory Domain Controller it only takes a bit more than a minute.  This ELK project required some thought and some new ideas: the new (to me) "filebeat" protocol between the clients and the ELK server's Logstash uses TCP port 5044, so I added that to the security group for this server.  Also Elasticsearch is legendarily memory hungry so I boosted this instance to 8 gigs of ram.  Flavors in OpenStack are so annoying, I wish I could do the VMware thing and simply type in any random amount of ram I feel like, without having to pre-define it as a flavor beforehand.  Computers eliminate some busywork, create more busywork, kind of a physics law of the conservation of mass, or conservation of mass of busywork...  Once started, the stack create process takes quite a while in the background, while I do other things.  I would estimate I had about ten minutes of things to think about when designing my ELK container.

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts/-/blob/master/projects/infrastructure/elk/elk.yml
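Once the stack is up and the ELK container is listening, a quick way to confirm the security group really lets the filebeat traffic through is to try a TCP connection to port 5044 from a client machine. This is a minimal sketch with a hypothetical function name; it only proves the port accepts connections, not that Logstash is parsing anything.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered (timed out), or unresolvable host.
        return False
```

For this server the call would be `tcp_port_open("elk.cedar.mulhollon.com", 5044)`. A False result from a client, while the same check succeeds on the ELK host itself, would point at the security group rather than at Logstash.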

I do Active Directory DNS by having Ansible run samba-tool to add forward and reverse DNS entries for each thing on the network, so I added a file roles/activedirectory/tasks/elk.yml and search and replaced the correct values.  Don't forget to add it to roles/activedirectory/tasks/main.yml and of course run ansible-playbook ./playbooks/activedirectory.yml.  This takes about two minutes of actual work; the script takes longer to run, but whatever.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/activedirectory/tasks/elk.yml

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/activedirectory/tasks/main.yml

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/playbooks/activedirectory.yml
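To show what the forward and reverse entries amount to, here is a minimal sketch that builds the two `samba-tool dns add` argument lists for a host. It is not the Ansible task from the repository: the DNS server name is hypothetical, authentication options are omitted, and the reverse zone is assumed to be a /24 (one zone per first three octets).

```python
import ipaddress


def samba_dns_commands(fqdn, ip, dns_server):
    """Build samba-tool argument lists for a forward A and a reverse PTR record."""
    # "elk.cedar.mulhollon.com" -> host "elk", zone "cedar.mulhollon.com".
    host, _, forward_zone = fqdn.partition(".")
    # reverse_pointer for 10.10.7.23 is "23.7.10.10.in-addr.arpa"; with an
    # assumed /24 reverse zone the first label is the record name.
    pointer = ipaddress.ip_address(ip).reverse_pointer
    ptr_name, _, reverse_zone = pointer.partition(".")
    return [
        ["samba-tool", "dns", "add", dns_server, forward_zone, host, "A", ip],
        ["samba-tool", "dns", "add", dns_server, reverse_zone, ptr_name, "PTR", fqdn],
    ]
```

Each list can be handed to `subprocess.run()` on a machine with Samba installed, or simply printed and compared against what the Ansible task file contains.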

There's some "behind the scenes" configuration in Ansible that's abstracted away by adding the new Ubuntu image to the Ansible file inventory/ubuntu3.  It's a one line job.  Takes one minute.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/inventory/ubuntu3

For this specific server, Ansible needs a playbook script file created, playbooks/elk.yml.  Generally I pick one that's pretty close and edit it.  To start with this is a generic Docker host so I copy one and change some names.  Takes one minute.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/playbooks/elk.yml

I use Active Directory and SMB file sharing on my network so I need to create a roles/samba/files/smb.conf.elk.cedar.mulhollon.com file to configure which directories are exported as shares.  I'd like to pretend I put great effort into configuring custom shares to export the server's logs and such all under reasonable security precautions and so forth; but most of the time "it's just another docker host" so I copy a similar predecessor and change some hostnames.  Yeah, I know, there's still commented out config options from back when Samba 4.5 was new, I've been doing this for a while and could modernize the config files, sometime, in my infinite spare time...  Anyway setting up Samba for a new host takes about one minute.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/samba/files/smb.conf.elk.cedar.mulhollon.com

For some years I've been using Ansible to maintain my /etc/ssh/ssh_known_hosts file across my LAN.  It's all scripted up, requires minimal effort, just add another hostname to the script's list of hosts.  So I edit roles/ssh/files/ssh_known_hosts.sh to add the new server.  Takes one minute.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/ssh/files/ssh_known_hosts.sh

By now, OpenStack HEAT Orchestration should have completed the installation of my new server, so this paragraph is about prepping the new server to run the Ansible playbook on it later on, which does all the "real work" of configuring the new server.  I configure HEAT to use my Ansible ssh key for initial public-key login, so from the Ansible user's login, "ssh ubuntu@elk" lets me log in.  I have to do some minor manual sshd config work for Ansible related purposes, and OpenStack cloud init always messes up the domain name for the new server for a variety of obscure reasons, so I need to manually "sudo hostnamectl set-hostname elk.cedar.mulhollon.com" which is how Ubuntu 20.04 does it (seemingly every unix-alike OS and every version of that OS has a different protocol).  The version of Ubuntu loaded into Glance on OpenStack is recent, but there are always patches that are even newer, and I may as well start from as clean and recently upgraded a system as possible, so I spend a couple minutes running the usual "apt-get update" "apt-get dist-upgrade" "apt-get clean" routine.  Finally, a quick reboot and the new server is ready for Ansible to configure it.  This generally takes about fifteen minutes, almost all of which is spent waiting for upgrading processes and rebooting delays, probably three minutes actual human labor.

After the new, clean, bare, unconfigured Ubuntu server completes its reboot, I "ansible-playbook ./playbooks/elk.yml" then Ansible does the vast majority of work required to integrate and harmonize with my existing network.  It would probably be a couple hours work to do manually, especially integrating with Active Directory using Samba, and it would be a very long error-prone checklist for a human to follow, but Ansible scripts never make mistakes.  There are a very small handful of manual tasks to perform after Ansible is done.  I could automatically install the latest Zabbix Agent V2 but I still consider it experimental until I get used to it, and as such I run a script that Ansible placed there ready for me to use to install it; I will eventually automate Zabbix Agent 2, after I am fully chill with its use and behavior, seems OK so far...  Also given that I reconfigured the crypto options for SSH I create new SSH host keys (again using a script I wrote that Ansible places there ready for my use).  I generally get rid of the default "ubuntu" user because I have full SSO via Active Directory.  Also I feel weird about embedding my Active Directory "administrator" password in Ansible, so my entire Active Directory integration is fully automated with the exception of running a quick "net ads join -U administrator" and entering my domain's administrator password, although please remember that command line is NOT how to join a new Domain Controller to an existing domain, that's a similar but different one-liner.  Active Directory integration on Linux is sometimes sketchy, I've never found a way around rebooting to make everything about it work on a new install, so another, final, reboot of the new server.  This task overall is maybe 15 wall clock minutes, mostly watching automation do its thing, but I'd budget about four minutes of actual human labor.

Now that the new ELK server is integrated with my LAN, I need to work the opposite direction and integrate my LAN with the new ELK server.  That is mostly accomplished by Active Directory, but I do need to run roles/ssh/files/ssh_known_hosts.sh to pull the NEW ssh host keys off the ELK server, and then I run the playbooks for OTHER hosts using the "--tags ssh" option to only update ssh configs on the other servers.  This is about one minute's work; it's just running two scripts.

Usually, while Ansible is distributing the new SSH host keys, I fill time by messing around with the Active Directory "ADUC tool" to enter a plain text description of the new server and enable trust delegation for SSO purposes.  It takes probably five minutes total to log into AD and mess around, after which Ansible is usually done updating the other servers' SSH known host key lists.

My next step is verifying SSO works.  Can I log into my new server and see my roaming NFS home directory without re-entering my password, assuming I am already logged into a different server?  All my docker hosts share an NFS share that holds (and eventually backs up) the docker volumes; can I access it?  This testing takes only two minutes just to try it and poke around.

Just a couple final integration tasks remain.  I use Zabbix to monitor operating systems, so I configure Zabbix to connect to the new server, and I wait to verify good live data arrives in Zabbix.  I also use Portainer for remote control and monitoring at the Docker application level, so I need to install the agent for Portainer on the new host (it's a docker container, as you'd expect), then add the new server as a docker host and verify it operates.  This probably takes ten minutes total.

The final task in rolling out a new server is git commit the OpenStack orchestration template and the Ansible playbook and other files.  This probably takes two minutes.

Overall using the power of OpenStack and Ansible, the time required to spin up a new usable server can be broken down into:

Documentation and Design 20 minutes

Operations "manual" labor 7 minutes

Integration and Testing 18 minutes
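
For anyone checking the math, here is one way the per-step estimates from the paragraphs above add up to these totals, sketched as shell arithmetic.  The grouping of steps into categories is my reading of the narrative, so treat the comments as assumptions.

```shell
#!/bin/sh
# Per-step estimates from the narrative above, in minutes.
operations=$((3 + 4))            # server prep + manual steps after the Ansible run
integration=$((1 + 5 + 2 + 10))  # host keys + ADUC + SSO test + Zabbix and Portainer
documentation=20                 # figure quoted in the list above, not re-derived here
total=$((documentation + operations + integration))
echo "operations=$operations integration=$integration total=$total"
```

That comes to 45 minutes of human time from start to finish.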

In the "bad old days" the operations category would have been "half a day" to scare up some hardware, burn-in test it, verify the BIOS settings, slowly watch an OS installation progress bar creep across the screen, and install the hardware in some permanent location.  You still have to do all that, once, for the cluster hardware, but once it's done the additional labor to spin up a new server drops to, as seen above, seven minutes.  Which is quite an improvement from "half a day".

Install ELK stack on the new virtual server

I am using the "sebp" combined stack to spin up an ELK:

https://elk-docker.readthedocs.io/

https://hub.docker.com/r/sebp/elk/

Unusual Server Configuration Requirement

Another advantage of setting up a Docker host for the ELK stack is I have more control over the Docker environment than I would have with an OpenStack Zun container.  I have to make a custom mmap count limit setting as per:

https://elk-docker.readthedocs.io/#prerequisites

and:

https://www.elastic.co/guide/en/elasticsearch/reference/5.0/vm-max-map-count.html#vm-max-map-count

I ran sysctl vm.max_map_count on the server as configured, and the default seems to be 65530 instead of the desired 262144.

Well, OK, fine, whatever, I can fix that.

In the short term I created a file /etc/sysctl.d/elk.conf containing one line

vm.max_map_count=262144

and ran "service procps restart" (The documentation in /etc/sysctl.d/README.sysctl has a bug: the reload option doesn't exist, LOL, but restart works fine.  When I get around to it, I will file a simple documentation-fix bug).

Then I ran sysctl vm.max_map_count and now it shows the correct, larger, configuration.  Cool.
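
A small check like the following can confirm the setting survives a reboot.  It is only a sketch: check_max_map_count is a hypothetical helper name, and 262144 is the minimum from the prerequisites linked above.

```shell
#!/bin/sh
# Minimum vm.max_map_count that Elasticsearch wants, per the links above.
required=262144

# Hypothetical helper: succeed only if the value passed in is large enough.
check_max_map_count() {
  current="$1"
  if [ "$current" -ge "$required" ]; then
    echo "vm.max_map_count=$current is OK"
  else
    echo "vm.max_map_count=$current is too low, need at least $required"
    return 1
  fi
}
```

On the server, call it as check_max_map_count "$(sysctl -n vm.max_map_count)" so it reads the live kernel value.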

I documented that oddity in the Todoist task for this server.  The Todoist tasks act as a "runbook" to document exactly what's required to replicate a server installation, and usually there's not much oddity to document because Ansible Playbooks will take care of everything.

In the long term, I created an issue in GitLab to add a "hardware" role for configuration challenges like this.

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/issues/9

Who knows, maybe by the time you read this, I will have already implemented this in Ansible?

Open many more TCP and UDP ports

I had to add some more ports to the security groups for the Orchestration Template.  No big deal, just edit and run the update.  It doesn't wipe and rebuild, it reasonably intelligently modifies in place.
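
For reference, the edit-and-update cycle can be wrapped up roughly like this.  It is a sketch, not my exact procedure: update_stack is a hypothetical helper, and the template file and stack name are placeholders you would replace with your own.

```shell
#!/bin/sh
# Hypothetical helper: preview, then apply, an edited Heat template to an existing stack.
# Both arguments are placeholders; substitute your own template file and stack name.
update_stack() {
  template="$1"
  stack="$2"
  # --dry-run reports which resources would change without touching anything
  openstack stack update --dry-run -t "$template" "$stack" &&
  # --wait blocks until the in-place update finishes
  openstack stack update --wait -t "$template" "$stack"
}
```

The dry run is a cheap way to confirm that only the security group rules will change before committing to the real update.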

Docker Run Script

My docker run script for the new ELK looks like this:

docker run \
  -d \
  --name elk \
  --restart=always \
  --log-driver local \
  --log-opt max-size=1m \
  --log-opt max-file=3 \
  -e TZ="US/Central" \
  -v /net/freenas/mnt/freenas-pool/docker/elk/elasticsearch:/var/lib/elasticsearch \
  -v /net/freenas/mnt/freenas-pool/docker/elk/backups:/var/backups \
  -p 5044:5044/tcp \
  -p 5601:5601/tcp \
  -p 9200:9200/tcp \
  -p 9300:9300/tcp \
  -p 9600:9600/tcp \
  sebp/elk:8.3.3
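
Elasticsearch does not answer the instant the container starts, so a polling loop along these lines can confirm it is responding on the published port 9200 before going any further.  The helper name, the default retry count, and the ten second delay are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers or the retries run out.
wait_for_elasticsearch() {
  url="$1"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    # -f makes curl fail on HTTP errors, -sS keeps it quiet except for real errors
    if curl -fsS "$url" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 10
  done
  return 1
}
```

For example, wait_for_elasticsearch http://elk.cedar.mulhollon.com:9200 && echo "Elasticsearch is up" gives up after about five minutes with the defaults.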

Filebeat

Back in the "old days" when I was getting started with ELK, we ran logstash on our servers and that pumped into Elasticsearch.  The modern solution seems to be using various *beat applications to pump data into Logstash which then pumps into Elasticsearch.  In the end I configured this differently, but whatever, in the narrative I set up Filebeat at this time, and someday in the future I might use it. 

Looks like the exact version of Filebeat is important for ELK, so I can't run filebeat locally because every little system would have a different version, and of course hardware devices like my managed ethernet switches will never run Filebeat as their firmware only supports syslog.  Therefore I will run a Docker Filebeat, on the same ELK server, of the exact matching version, and use multiple syslog inputs (for each specific syslog RFC format) to feed logs into ELK, and it looks like the highest shared matching version for both this specific ELK stack and Filebeat at the time of posting is:

https://www.docker.elastic.co/r/beats/filebeat-oss:8.3.3

Here are some Filebeat links for reference:

https://www.elastic.co/guide/en/beats/filebeat/8.3/filebeat-overview.html

https://www.elastic.co/guide/en/beats/filebeat/8.3/filebeat-input-syslog.html

https://www.elastic.co/guide/en/beats/filebeat/8.3/running-on-docker.html

My filebeat.docker.yml file looks like this:

filebeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false
filebeat:
  inputs:
    -
      type: syslog
      format: rfc3164
      protocol.udp:
        host: "0.0.0.0:23164"
    -
      type: syslog
      format: rfc3164
      protocol.tcp:
        host: "0.0.0.0:23164"
    -
      type: syslog
      format: rfc5424
      protocol.udp:
        host: "0.0.0.0:25424"
    -
      type: syslog
      format: rfc5424
      protocol.tcp:
        host: "0.0.0.0:25424"
output.elasticsearch:
  hosts: elk.cedar.mulhollon.com:9200

Note that I output directly into Elasticsearch, which has no security theater on by default.  The typical port for beats connected to Logstash has some security theater on by default and it would be a bit of work to apply the self signed SSL cert; it's just not worth the effort.  They are both running on the same server so a MITM attack seems unlikely, and the entire point of the Filebeat container is to import unsecured raw UDP logs, so implementing security theater, or even real live SSL certs, between the Filebeat and the ELK would be a waste of effort.  I might still do that for the LOLs someday in my infinite spare time, just to have the experience of having done it.

My docker run script for Filebeat looks like this:

docker run \
  -d \
  --name filebeat \
  --restart=always \
  --log-driver local \
  --log-opt max-size=1m \
  --log-opt max-file=3 \
  -v /net/freenas/mnt/freenas-pool/docker/filebeat/config/filebeat.docker.yml:/usr/share/filebeat/filebeat.yml:ro \
  -p 23164:23164 \
  -p 23164:23164/udp \
  -p 25424:25424 \
  -p 25424:25424/udp \
  docker.elastic.co/beats/filebeat-oss:8.3.3
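
Once the container is running, a quick way to exercise all four syslog inputs is to fire one test message at each.  This sketch assumes the util-linux version of logger found on Ubuntu (FreeBSD's logger takes different options), and send_test_logs is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: send one test message to each of the four Filebeat syslog inputs.
# Ports match the filebeat.docker.yml above: 23164 for RFC 3164, 25424 for RFC 5424.
send_test_logs() {
  server="$1"
  logger --server "$server" --port 23164 --udp --rfc3164 "test: RFC 3164 over UDP"
  logger --server "$server" --port 23164 --tcp --rfc3164 "test: RFC 3164 over TCP"
  logger --server "$server" --port 25424 --udp --rfc5424 "test: RFC 5424 over UDP"
  logger --server "$server" --port 25424 --tcp --rfc5424 "test: RFC 5424 over TCP"
}
```

Running send_test_logs elk.cedar.mulhollon.com from any Ubuntu box should produce four new documents in Elasticsearch if everything is wired up.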

Configure Servers to Send Logs to ELK

To set up FreeBSD to send logs to ELK, see ansible roles/syslog/files/syslog.freebsd

*.* @elk.cedar.mulhollon.com:25424

Note the RFC5424 option in the RC file:

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/syslog/files/rc.conf.d.syslogd.freebsd

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/syslog/files/syslog.freebsd

To set up Ubuntu or anything using syslog-NG, see ansible roles/syslog/files/syslog-ng.conf.ubuntu which has a destination section similar to:

destination d_net { 
  syslog(
    "elk.cedar.mulhollon.com"
    port(25424)
    transport(udp)
  ); 
};

https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/syslog/files/syslog-ng.conf.ubuntu

Conclusion

Obviously I had to do some set up in ELK such as adding filebeat* as my data view source, although note that I'm not trying to write an ELK tutorial.  It's pretty easy to create some searches and dashboards in ELK.  Anyway, in summary, it works.  Cool!

Obviously it's possible to make this MUCH fancier using SSL secured TCP transport instead of simple UDP.  I could write entire posts about interesting ELK query and dashboard creation, and it would be fun to follow up with setting up Filebeat on individual servers to pump data into ELK, or converting the existing Filebeat gateway from pumping directly into Elasticsearch to using Logstash instead, but this is an excellent start to an ELK stack.

Stay tuned for the next chapter!

Wednesday, August 24, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 056 - Prometheus

Adventures of a Small Time OpenStack Sysadmin relate the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

During plan 3.0 I decided to either use or lose Prometheus.

Some observations about Prometheus

It works very well in my experimentation.

It replicates my Zabbix infrastructure without providing any additional value.

I need Zabbix to monitor the rest of my infrastructure, which is larger than my OpenStack cluster.  So I can't replace Zabbix with Prometheus (at least at this time, who knows in the future?)

As such I decided to remove Prometheus.

Kolla-Ansible is not an orchestration system; Ansible is merely a very fancy scripting language and set of libraries.  So removing the /etc/kolla/globals.d/prometheus.yml file and running a kolla-ansible deploy will NOT remove the Prometheus installation, although it will configure the entire rest of the system to NOT use Prometheus in the future.

The solution to that problem is to deploy, test that everything other than Prometheus is working, run something like "docker ps | grep prometheus", note a long list of about a dozen large containers providing the now-orphaned Prometheus service, then manually run many "docker stop prometheus-whatever" commands to shut down all the Prometheus containers.  The final step is a quick "kolla-ansible prune-images" with the really-really sure option to wipe the cached docker images for Prometheus, which will save a couple bytes of storage.
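
The manual stop commands can be looped rather than typed one at a time.  This is a sketch, assuming every orphaned container has "prometheus" somewhere in its name, as the grep above implies; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: stop every running container whose name contains "prometheus".
stop_prometheus_containers() {
  # --filter name= matches on a substring of the container name
  docker ps --filter "name=prometheus" --format '{{.Names}}' |
  while read -r name; do
    echo "stopping $name"
    docker stop "$name" >/dev/null
  done
}
```

Run the docker ps half of the pipeline on its own first to eyeball the list before stopping anything.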

Tomorrow's post will depend on what I do next to my home lab in my spare time.  I'm caught up to real time after a mere 56 posts.

Stay tuned for the next chapter!