Adventures of a Small Time OpenStack Sysadmin relates the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.
Adventures of a Small Time OpenStack Sysadmin Chapter 057 - ELK Stack
Why?
An ELK system for end user systems such as server syslog messages, ethernet switch logs, and similar things, is technically not an OpenStack service. However, as the overall project intention is to replace the complete functionality of a VMware cluster, and VMware clusters have Log Insight to centralize logging, I describe how I set up an ELK stack to replace my VMware Log Insight installation.
The Plan
After some research, the plan is to add a Docker host to hold an ELK stack all-in-one Docker container. Unlike running Zun containers natively on OpenStack, a dedicated host can connect to my large and backed up NFS NAS for log storage, by bind mounting the Docker container volumes. So I will roll out yet another Ubuntu 20.04 instance with Docker installed, and integrate it fully into the LAN including Active Directory SSO, roaming home directories, Zabbix and Portainer monitoring, etc.
Create a New Virtual Server
I have a nice checklist in the Ansible repo, so all new server rollouts on the OpenStack clusters are consistent and easy, as detailed below.
https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/ubuntu-server-20.04.txt
In Todoist, which is an online and mobile to-do tracking app, I create a task for the new server with a due date in the future to schedule upgrades. In two months the to-do task will reach the top of the queue and I will upgrade and examine this system. Documenting and scheduling future upgrades using Todoist takes about two minutes.
In my dockerized Netbox installation for IP allocation and management, I select an IP address, create the server entry in Netbox, etc. So, looks like elk.cedar.mulhollon.com will be at IP address 10.10.7.23. This takes about two minutes.
Logged into the OpenStack controller, I create a new HEAT orchestration template. The server templates are all similar, aside from obvious differences such as IP address and security groups, so this process is fast and easy with search and replace in the editor. This takes about five minutes depending on how "unusual" the configuration is. If I'm configuring my fourth identical Active Directory Domain Controller it only takes a bit more than a minute. This ELK project required some thought and some new ideas. The new (to me) "filebeat" protocol between the clients and the ELK server's Logstash uses TCP port 5044, so I added that to the security group for this server. Also, Elasticsearch is legendarily memory hungry so I boosted this instance to 8 gigs of RAM. Flavors in OpenStack are so annoying; I wish I could do the VMware thing and simply type in any random amount of RAM I feel like, without having to pre-define it as a flavor beforehand. Computers eliminate some busywork, create more busywork, kind of a physics law of the conservation of mass, or conservation of mass of busywork... Once started, the stack create process takes quite a while in the background, while I do other things. I would estimate I had about ten minutes of things to think about when designing my ELK container.
I do Active Directory DNS by having Ansible run samba-tool to add forward and reverse DNS entries for each thing on the network, so I added a file roles/activedirectory/tasks/elk.yml and search and replaced the correct values. Don't forget to add it to roles/activedirectory/tasks/main.yml and of course run ansible-playbook ./playbooks/activedirectory.yml. This takes about two minutes of actual work; the script takes longer to run, but whatever.
https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/activedirectory/tasks/elk.yml
https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/activedirectory/tasks/main.yml
https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/playbooks/activedirectory.yml
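As a rough sketch of what a task like that ends up running, the forward and reverse records can be added by hand with samba-tool dns add. This is not copied from the role in the repo: the domain controller name dc1, the assumption that the reverse zone is a plain /24, and the helper function names are all made up for illustration.

```shell
#!/bin/sh
# Sketch only: "dc1" and the /24 reverse zone layout are assumptions.

# Print the /24 reverse zone for an IPv4 address,
# for example 10.10.7.23 gives 7.10.10.in-addr.arpa
reverse_zone() {
    echo "$1" | awk -F. '{ print $3 "." $2 "." $1 ".in-addr.arpa" }'
}

# Print the last octet, which is the PTR record name inside that zone.
ptr_name() {
    echo "$1" | awk -F. '{ print $4 }'
}

# Add forward (A) and reverse (PTR) records; samba-tool prompts for the password.
add_dns_records() {
    host=$1 domain=$2 ip=$3 dc=${4:-dc1}
    samba-tool dns add "$dc" "$domain" "$host" A "$ip" -U administrator
    samba-tool dns add "$dc" "$(reverse_zone "$ip")" "$(ptr_name "$ip")" PTR "$host.$domain" -U administrator
}
```

For this server the call would be something like add_dns_records elk cedar.mulhollon.com 10.10.7.23, run on a machine that has samba-tool installed.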
There's some "behind the scenes" configuration in Ansible that's abstracted away by adding the new Ubuntu image to the Ansible file inventory/ubuntu3. It's a one line job. Takes one minute.
https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/inventory/ubuntu3
For this specific server, Ansible needs a playbook script file created, playbooks/elk.yml. Generally I pick one that's pretty close and edit it. To start with this is a generic Docker host so I copy one and change some names. Takes one minute.
https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/playbooks/elk.yml
I use Active Directory and SMB file sharing on my network so I need to create a roles/samba/files/smb.conf.elk.cedar.mulhollon.com file to configure which directories are exported as shares. I'd like to pretend I put great effort into configuring custom shares to export the server's logs and such all under reasonable security precautions and so forth; but most of the time "it's just another docker host" so I copy a similar predecessor and change some hostnames. Yeah, I know, there's still commented out config options from back when Samba 4.5 was new, I've been doing this for a while and could modernize the config files, sometime, in my infinite spare time... Anyway setting up Samba for a new host takes about one minute.
For some years I've been using Ansible to maintain my /etc/ssh/ssh_known_hosts file across my LAN. It's all scripted up and requires minimal effort: just add another hostname to the script's list of hosts. So I edit roles/ssh/files/ssh_known_hosts.sh to add the new server. Takes one minute.
https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/ssh/files/ssh_known_hosts.sh
By now, OpenStack HEAT Orchestration should have completed the installation of my new server, so this paragraph is about prepping the new server to run the Ansible playbook on it later on, which does all the "real work" of configuring the new server. I configure HEAT to use my Ansible ssh key for initial public-key login, so from the Ansible user's login, ssh ubuntu@elk lets me log in. I have to do some minor manual sshd config work for Ansible related purposes, and OpenStack cloud init always messes up the domain name for the new server for a variety of obscure reasons, so I need to manually "sudo hostnamectl set-hostname elk.cedar.mulhollon.com" which is how Ubuntu 20.04 does it (seemingly every unix-alike OS and every version of that OS has a different protocol). The version of Ubuntu loaded into Glance on OpenStack is recent, but there are always patches that are even newer, and I may as well start from as clean and recently upgraded a system as possible, so I spend a couple minutes running the usual "apt-get update" "apt-get dist-upgrade" "apt-get clean" routine. Finally, a quick reboot and the new server is ready for Ansible to configure it. This generally takes about fifteen minutes, almost all of which is spent waiting for upgrading processes and rebooting delays, probably three minutes actual human labor.
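The manual prep steps above could be collected into a small helper along these lines. This is only a sketch, not a script from my repo: the function names and the DRY_RUN switch are my own invention here, and it assumes the login user has sudo rights, as the stock "ubuntu" cloud image user does.

```shell
#!/bin/sh
# Sketch of the manual prep. Set DRY_RUN=1 to print commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Fix the hostname cloud-init got wrong, patch the system, then reboot.
prep_server() {
    fqdn=$1
    run sudo hostnamectl set-hostname "$fqdn"
    run sudo apt-get update
    run sudo apt-get dist-upgrade
    run sudo apt-get clean
    run sudo reboot
}
```

Running DRY_RUN=1 prep_server elk.cedar.mulhollon.com first shows the five commands without touching the system.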
After the new, clean, bare, unconfigured Ubuntu server completes its reboot, I "ansible-playbook ./playbooks/elk.yml" then Ansible does the vast majority of work required to integrate and harmonize with my existing network. It would probably be a couple hours' work to do manually, especially integrating with Active Directory using Samba, and it would be a very long error-prone checklist for a human to follow, but Ansible scripts never make mistakes. There are a very small handful of manual tasks to perform after Ansible is done. I could automatically install the latest Zabbix Agent V2 but I still consider it experimental until I get used to it, and as such I run a script that Ansible placed there ready for me to use to install it; I will eventually automate Zabbix Agent 2, after I am fully chill with its use and behavior, seems OK so far... Also given that I reconfigured the crypto options for SSH I create new SSH host keys (again using a script I wrote that Ansible places there ready for my use). I generally get rid of the default "ubuntu" user because I have full SSO via Active Directory. Also I feel weird about embedding my Active Directory "administrator" password in Ansible, so my entire Active Directory integration is fully automated with the exception of running a quick "net ads join -U administrator" and entering my domain's administrator password, although please remember that command line is NOT how to join a new Domain Controller to an existing domain, that's a similar but different one-liner. Active Directory integration on Linux is sometimes sketchy, I've never found a way around rebooting to make everything about it work on a new install, so another, final, reboot of the new server. This task overall is maybe 15 wall clock minutes, mostly watching automation do its thing, but I'd budget about four minutes of actual human labor.
Now that the new ELK server is integrated with my LAN, I need to work the opposite direction and integrate my LAN with the new ELK server, which is mostly accomplished by Active Directory, but I do need to run roles/ssh/files/ssh_known_hosts.sh to pull the NEW ssh host keys off the ELK server and then I run the playbooks for OTHER hosts using the "--tags ssh" option to only update ssh configs on the other servers. This is about one minute's work; it's just running two scripts.
Usually, while Ansible is distributing the new SSH host keys, I fill time by messing around with the Active Directory "ADUC tool" to enter a plain text description of the new server and enable trust delegation for SSO purposes. Takes probably five minutes total to log into AD and mess around, after which Ansible is usually done with updating the other servers' SSH known host key lists.
My next step is verifying SSO works. Can I log into my new server and see my roaming NFS home directory without re-entering my password assuming I am already logged into a different server? All my docker hosts share a NFS share that holds (and eventually backs up) the docker volumes, can I access it? This testing takes only two minutes just to try and poke around.
Just a couple final integration tasks remain. I use Zabbix to monitor operating systems so I configure Zabbix to connect to the new server, and I wait to verify good live data arrives in Zabbix. I also use Portainer for remote control and monitoring at the Docker application level, so I need to install the agent for Portainer on the new host (it's a docker container, as you'd expect), then add the new server as a docker host and verify it operates. This probably takes ten minutes total.
The final task in rolling out a new server is git commit the OpenStack orchestration template and the Ansible playbook and other files. This probably takes two minutes.
Overall using the power of OpenStack and Ansible, the time required to spin up a new usable server can be broken down into:
Documentation and Design 20 minutes
Operations "manual" labor 7 minutes
Integration and Testing 18 minutes
In the "bad old days" the operations category would have been "half a day" to scare up some hardware, burn-in test it, verify the BIOS settings, slowly watch an OS installation progress bar creep across the screen, and install the hardware in some permanent location. You still have to do all that, once, for the cluster hardware, but once it's done the additional labor to spin up a new server drops to, as seen above, seven minutes. Which is quite an improvement from "half a day".
Install ELK stack on the new virtual server
I am using the "sebp" combined stack to spin up an ELK:
https://elk-docker.readthedocs.io/
https://hub.docker.com/r/sebp/elk/
Unusual Server Configuration Requirement
Another advantage of setting up a Docker host for the ELK stack is I have more control over the Docker environment than I would have with an OpenStack Zun container. I have to make a custom mmap count limit setting as per:
https://elk-docker.readthedocs.io/#prerequisites
and:
https://www.elastic.co/guide/en/elasticsearch/reference/5.0/vm-max-map-count.html#vm-max-map-count
I ran sysctl vm.max_map_count on the server as configured, and the default seems to be 65530 instead of the desired 262144.
Well, OK, fine, whatever, I can fix that.
In the short term I created a file /etc/sysctl.d/elk.conf containing one line
vm.max_map_count=262144
and ran "service procps restart". (The documentation in /etc/sysctl.d/README.sysctl has a bug: the reload option doesn't exist, LOL, but restart works fine. When I get around to it, I will file a simple documentation-fix bug.)
Then I ran sysctl vm.max_map_count again and now it shows the correct, larger, configuration. Cool.
I documented that oddity in the Todoist task for this server. The Todoist tasks act as a "runbook" to document exactly what's required to replicate a server installation, and usually there's not much oddity to document because Ansible Playbooks will take care of everything.
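The check-and-fix above can be sketched as a couple of shell functions. This is an illustration rather than what I actually ran: it loads the file with sysctl -p instead of restarting procps, and the function names are made up. The 262144 figure is the one from the Elasticsearch documentation linked above.

```shell
#!/bin/sh
# Sketch: REQUIRED is the value the Elasticsearch docs ask for.
REQUIRED=262144

# Succeeds if the given vm.max_map_count value meets the requirement.
map_count_ok() {
    [ "$1" -ge "$REQUIRED" ]
}

# Persist the setting, load that one file now, then re-read the live value.
fix_map_count() {
    echo "vm.max_map_count=$REQUIRED" | sudo tee /etc/sysctl.d/elk.conf >/dev/null
    sudo sysctl -p /etc/sysctl.d/elk.conf
    map_count_ok "$(sysctl -n vm.max_map_count)"
}
```

On the server as delivered, map_count_ok "$(sysctl -n vm.max_map_count)" fails because 65530 is too small; after fix_map_count it should succeed.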
In the long term, I created an issue in GitLab to add a "hardware" role for configuration challenges like this.
https://gitlab.com/SpringCitySolutionsLLC/ansible/-/issues/9
Who knows, maybe by the time you read this, I will have already implemented this in Ansible?
Open many more TCP and UDP ports
I had to add some more ports to the security groups for the Orchestration Template. No big deal, just edit and run the update. It doesn't wipe and rebuild, it reasonably intelligently modifies in place.
Docker Run Script
My docker run script for the new ELK looks like this:
docker run \
  -d \
  --name elk \
  --restart=always \
  --log-driver local \
  --log-opt max-size=1m \
  --log-opt max-file=3 \
  -e TZ="US/Central" \
  -v /net/freenas/mnt/freenas-pool/docker/elk/elasticsearch:/var/lib/elasticsearch \
  -v /net/freenas/mnt/freenas-pool/docker/elk/backups:/var/backups \
  -p 5044:5044/tcp \
  -p 5601:5601/tcp \
  -p 9200:9200/tcp \
  -p 9300:9300/tcp \
  -p 9600:9600/tcp \
  sebp/elk:8.3.3
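Once the container has had a minute or two to start, a quick way to see whether Elasticsearch is answering on the published port 9200 is its _cluster/health endpoint. The sketch below assumes plain HTTP with no authentication, which matches how this container behaves for me, and the function names are hypothetical. A single node setup like this one commonly reports yellow rather than green, so the check accepts both.

```shell
#!/bin/sh
# Sketch: pull the "status" field out of a _cluster/health JSON reply.
health_status() {
    echo "$1" | sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p'
}

# Query the cluster over plain HTTP (assumes no authentication is enabled).
check_elk() {
    reply=$(curl -s "http://${1:-elk.cedar.mulhollon.com}:9200/_cluster/health") || return 1
    status=$(health_status "$reply")
    echo "elasticsearch status: ${status:-unknown}"
    [ "$status" = "green" ] || [ "$status" = "yellow" ]
}
```

Calling check_elk with no arguments queries elk.cedar.mulhollon.com; pass a different hostname as the first argument to test another host.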
Back in the "old days" when I was getting started with ELK, we ran logstash on our servers and that pumped into Elasticsearch. The modern solution seems to be using various *beat applications to pump data into Logstash which then pumps into Elasticsearch. In the end I configured this differently, but whatever, in the narrative I set up Filebeat at this time, and someday in the future I might use it.
Looks like the exact version of Filebeat is important for ELK, so I can't run filebeat locally because every little system would have a different version, and of course hardware devices like my managed ethernet switches will never run Filebeat as their firmware only supports syslog. Therefore I will run a Docker Filebeat, on the same ELK server, of the exact matching version, and use multiple syslog inputs (for each specific syslog RFC format) to feed logs into ELK, and it looks like the highest shared matching version for both this specific ELK stack and Filebeat at the time of posting is:
https://www.docker.elastic.co/r/beats/filebeat-oss:8.3.3
Here are some Filebeat links for reference:
https://www.elastic.co/guide/en/beats/filebeat/8.3/filebeat-overview.html
https://www.elastic.co/guide/en/beats/filebeat/8.3/filebeat-input-syslog.html
https://www.elastic.co/guide/en/beats/filebeat/8.3/running-on-docker.html
My filebeat.docker.yml file looks like this:
filebeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false
filebeat:
  inputs:
    -
      type: syslog
      format: rfc3164
      protocol.udp:
        host: "0.0.0.0:23164"
    -
      type: syslog
      format: rfc3164
      protocol.tcp:
        host: "0.0.0.0:23164"
    -
      type: syslog
      format: rfc5424
      protocol.udp:
        host: "0.0.0.0:25424"
    -
      type: syslog
      format: rfc5424
      protocol.tcp:
        host: "0.0.0.0:25424"
output.elasticsearch:
  hosts: elk.cedar.mulhollon.com:9200
Note that I output directly into Elasticsearch, which has no security theater on by default. The typical port for beats connected to Logstash has some security theater on by default and it would be a bit of work to apply the self signed SSL cert; it's just not worth the effort. They are both running on the same server so a MITM attack seems unlikely, and the entire point of the Filebeat container is to import unsecured raw UDP logs, so implementing security theater, or even real live SSL certs, between the Filebeat and the ELK would be a waste of effort. I might still do that for the LOLs someday in my infinite spare time, just to have the experience of having done it.
My docker run script for Filebeat looks like this:
docker run \
  -d \
  --name filebeat \
  --restart=always \
  --log-driver local \
  --log-opt max-size=1m \
  --log-opt max-file=3 \
  -v /net/freenas/mnt/freenas-pool/docker/filebeat/config/filebeat.docker.yml:/usr/share/filebeat/filebeat.yml:ro \
  -p 23164:23164 \
  -p 23164:23164/udp \
  -p 25424:25424 \
  -p 25424:25424/udp \
  docker.elastic.co/beats/filebeat-oss:8.3.3
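Before pointing real servers at the new Filebeat inputs, it can help to push a single hand-built message at the RFC 5424 UDP port and look for it in ELK. The following is a minimal sketch, assuming a netcat that supports -u and -w is installed; the function names and the "elktest" application name are invented for the example.

```shell
#!/bin/sh
# Sketch: PRI = facility * 8 + severity, so user (1) and info (6) give 14.
syslog_pri() {
    echo $(( $1 * 8 + $2 ))
}

# Build a minimal RFC 5424 line: <PRI>1 TIMESTAMP HOST APP PROCID MSGID SD MSG
rfc5424_line() {
    printf '<%s>1 %s %s %s - - - %s\n' \
        "$(syslog_pri 1 6)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -n)" "$1" "$2"
}

# Send one test message over UDP to the Filebeat RFC 5424 input on port 25424.
send_test_log() {
    rfc5424_line elktest "hello from $(uname -n)" | nc -u -w1 "${1:-elk.cedar.mulhollon.com}" 25424
}
```

On hosts with the util-linux logger, something like logger --rfc5424 --udp --server elk.cedar.mulhollon.com --port 25424 "hello" should do much the same job.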
Configure Servers to Send Logs to ELK
To set up FreeBSD to send logs to ELK, see ansible roles/syslog/files/syslog.freebsd
*.* @elk.cedar.mulhollon.com:25424
Note the RFC5424 option in the RC file:
https://gitlab.com/SpringCitySolutionsLLC/ansible/-/blob/master/roles/syslog/files/syslog.freebsd
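For reference, on reasonably recent FreeBSD releases the RFC 5424 option is the -O flag of syslogd, set through syslogd_flags in /etc/rc.conf. The line below is a sketch rather than a copy of my file; the -s is just the stock FreeBSD default carried along.

```shell
# /etc/rc.conf fragment (sketch): -s is the FreeBSD default flag,
# -O rfc5424 switches forwarded messages to the RFC 5424 format.
syslogd_flags="-s -O rfc5424"
```

After changing it, "service syslogd restart" picks up the new flags.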
To set up Ubuntu or anything using syslog-NG, see ansible roles/syslog/files/syslog-ng.conf.ubuntu which has a destination section similar to:
    syslog(
        "elk.cedar.mulhollon.com"
        port(25424)
        transport(udp)
    );
};
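To show how a destination like that gets wired up end to end, here is a sketch that writes a complete drop-in file. It is not the file from my Ansible role: the destination name d_elk is made up, s_src is assumed to be the source name defined in Ubuntu's stock syslog-ng.conf, and the drop-in is assumed to be installed under /etc/syslog-ng/conf.d/ where the stock config includes it.

```shell
#!/bin/sh
# Sketch: d_elk is a made-up name; s_src is assumed from Ubuntu's stock config.
# Writes a syslog-ng drop-in that forwards everything to ELK as RFC 5424 over UDP.
write_elk_destination() {
    cat > "$1" <<'EOF'
destination d_elk {
    syslog(
        "elk.cedar.mulhollon.com"
        port(25424)
        transport("udp")
    );
};
log { source(s_src); destination(d_elk); };
EOF
}
```

One possible use is write_elk_destination /tmp/elk.conf, then copy the result to /etc/syslog-ng/conf.d/elk.conf, run "syslog-ng --syntax-only" to check the combined configuration, and restart syslog-ng.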
Conclusion
Obviously I had to do some setup in ELK such as adding filebeat* as my data view source, although note that I'm not trying to write an ELK tutorial. It's pretty easy to create some searches and dashboards in ELK. Anyway, in summary, it works. Cool!
Obviously it's possible to make this MUCH fancier using SSL secured TCP transport instead of simple UDP. I could write entire posts about interesting ELK query and dashboard creation, and it would be fun to follow up with setting up filebeat on individual servers to pump data into ELK, or converting the existing Filebeat gateway from pumping directly into Elasticsearch to using Logstash instead, but this is an excellent start to an ELK stack.
Stay tuned for the next chapter!