Tuesday, August 2, 2022

Adventures of a Small Time OpenStack Sysadmin Chapter 035 - Neutron Network Service

Adventures of a Small Time OpenStack Sysadmin relates the experience of converting a small VMware cluster into two small OpenStack clusters, and the adventures and friends I made along the way.

First, links to some reference docs I used:

Administration Guide aka OpenStack Networking Guide

https://docs.openstack.org/neutron/yoga/admin/config.html

Configuration and Policy References

https://docs.openstack.org/neutron/yoga/configuration/

Projects Deployment Configuration Reference for Neutron

https://docs.openstack.org/kolla-ansible/yoga/reference/networking/neutron.html

Install the CLI

Install the CLI using the openstack-scripts/installcli/installcli.sh or run something like this:

pip install python-neutronclient -c https://releases.openstack.org/constraints/upper/yoga

What is a "provider" and where are the logs?

I ran into a few Neutron problems with Kolla-Ansible, but nothing insurmountable.  This will likely be a VERY long blog post, as OpenStack Networking is VERY feature-filled and thus complicated.

The starting scenario: I have experience with hand-installed OpenStack using the supplied installation tutorials, which amounts to cutting and pasting a lot and using linuxbridge as the virtual switch because it's easy, and I have a single provider network on a single unbonded, LAG-free ethernet port set up in access mode (no VLANs, no dot1q tagging, none of that, just a plain ole port on my production LAN).  The proposed ending scenario is using Kolla-Ansible, which by default uses OpenVSwitch as its virtual switch, and I intend to bond multiple ethernet ports together and run multiple tagged VLANs over that bond.  Ambitious, yeah, that's me.  Has Big Headache, yeah, that's me too.  Got it working in the end, after only three long days of effort.

One of the big problems with cookbook solutions like the OpenStack installation tutorial series online is that they encourage sysadmins not to learn how the system works.  Then pile on multiple layers of abstraction, with minimal or no debugging facilities or error logging, and significant problems develop later.

At the time of cutting and pasting Cluster #1 into operation, the install instructions directed me to add a single line to linuxbridge_agent.ini with my provider ethernet port: "physical_interface_mappings = provider:eno1".  How nice of OpenStack to let me tag my provider interface as a provider.  I never noticed later on, when I cut and pasted in the network configuration line, that I was attaching my OpenStack network to "provider1".  Have to admit, it works great.  Everything must be handled magically behind the scenes, and as long as the magic is compatible with your goals it's effortless and requires no learning.  Which became a problem later on...

This creates a false mental model in the sysadmin-brain, where one merely adds a single line to ml2_conf.ini telling ml2_type_flat to have a flat_network of bond12, or maybe eno1 for initial unbonded testing.  When I look at the network in the Horizon web interface, it even shows up as Provider Network Physical Network: eno1 or whatever.  Cool, right?

Of course if you try to provision a compute instance, you get a "Port Bind Failed" error after a while and the only option is to delete the instance and try again.

If you "docker logs" the nova-compute container you get the error messages about "Port Bind Failed" and a helpful suggestion to obtain further information in the neutron logs.  The neutron logs when viewed by "docker logs", are, of course, empty.

Moving on, if you "docker logs nova_compute" you get great log messages from nova.  Cool.

If you "docker logs neutron_server" you get approximately nothing other than some startup messages.  

If you "docker exec -it neutron_server /bin/bash" and navigate to /var/log/kolla/neutron there are files full of beautiful neutron log messages.  Ahh, that's where they are hiding...
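You don't even need an interactive shell for that; a one-shot docker exec works too (paths assume Kolla's default /var/log/kolla layout, and the exact log filenames vary by service and release):

```shell
# List and tail the real neutron logs from outside the container.
# Assumes Kolla's default log directory; filenames may differ by release.
docker exec neutron_server ls /var/log/kolla/neutron
docker exec neutron_server tail -n 50 /var/log/kolla/neutron/neutron-server.log
```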

My wishlist request would be for consistency: not just neutron_server but ALL containers should dump beautiful logs when looked at via "docker logs".  Or, if for various technical reasons it's impossible, maybe the very last startup message seen by "docker logs" should be "For further log messages please use docker exec -it neutron_server /bin/bash and look at the files in /var/log/kolla/neutron" or similar.

A partial list of containers that properly log to "docker logs": nova_compute, monasca_log_persister

A partial list of containers that do not log to "docker logs": neutron_server, horizon

I opened up a wishlist bug on this misleading docker logs output in:

https://bugs.launchpad.net/kolla-ansible/+bug/1980603

I hope someday this bug report helps other OpenStack Sysadmins.

A long rant about directly manipulating OpenVSwitch Bridge Ports

Anyway, aside from troubleshooting being difficult, the source of the problem is that Kolla-Ansible is generally a "batteries included" experience, such that I mistakenly thought you set your neutron_external_interface to your linux kernel interface name in /etc/kolla/globals.d/neutron.yml (and enable_neutron_provider_networks and enable_neutron_dvr of course in the same file), then in /etc/kolla/config/neutron/ml2_conf.ini you tell it your flat network name (or, VLAN stuff) has flat_networks = your linux kernel interface name.  Then when you run "openstack network create ..." you tell it --provider-physical-network "your linux kernel interface name".  Basically you tell it your linux ethernet port name a bunch of times and then it all works magically inside.  But down that path lies madness, and the only troubleshooting error you'll get is Nova complaining about Port Binding failing.  That's all the help you're getting.  It's all you, all alone, from here on out.  Good Luck!

The docs have stuff written in them like "The default Neutron physical network is physnet1."  Well, I don't have an ethernet card named physnet1 (although I could rename eno1 to physnet1... should I?).  So I will just be careful to never see the name physnet1 in any configuration or status, because that would imply I forgot to change the config to eno1 or bond12 or whatever.  That did not work out so well, LOL.

The problem with an extremely elaborate, abstracted, and encapsulated project like OpenStack is the docs only operate at two levels.  "Hey bro, cut and paste in three incantations you don't have to understand, and kolla-ansible deploy, and it just works, amirite?" or 372 pages of source code and diagrams, or sometimes clickbait blogspam sites copying and pasting the two above from software releases seven years ago for the ad clicks.  What OpenStack networking needs is a cookbook with recipes: an intermediate level of documentation.  I could write that given enough time, money, etc.

It turns out that, as a sysadmin, you have to get more intimately involved with OpenVSwitch bridges than you would initially expect.

In the process of experimenting I tried various combinations: tagged VLANs on a LAG, tagged VLANs on a single port, one access port on a LAG, one access port on a bare physical port.  The various kolla-ansible operations, such as deploy, reconfigure, the pull-then-upgrade cycle, even a full destroy-bootstrap-deploy cycle, can have mixed results at cleaning up after those changes.  Another hilarious problem I ran into: if you "netplan apply" an ethernet bond LAG interface on Ubuntu 20.04, and the bond works, then you try to disable the bond by removing the config from netplan and doing another "netplan apply", to my immense displeasure the bond remains configured on the system.  I guess netplan fails as an orchestration system once again.  After a reboot the orphaned bond12 interface did disappear.

Anyway, the point of expressing my general displeasure in the previous paragraph is that somehow I ended up with bridge br-ex on a compute node still connected to the bond12 interface that no longer existed, so I had to manually go into the container and add a port connecting the br-ex bridge to the outside world, at which point my provider-attached instance gained full network connectivity.

For a good time, ssh into a compute node, then access the innards of the OpenVSwitch container by running "docker exec -it openvswitch_vswitchd /bin/bash" and run "ovs-vsctl show".  To my immense surprise, I saw bridge br-ex had a Port "bond12" on Interface bond12 with an error: "could not open network device bond12 (No such device)".

Well, no kidding, seeing as bond12 was removed from the OpenStack configuration a couple of reconfigure/deploy/pull cycles ago and was also removed from the host operating system via netplan a reboot ago.  Not really sure what inspired OpenStack to try to connect bridge br-ex to that former interface.

Anyway, it seems the proper syntax to add a port to a bridge inside an OpenVSwitch container is "ovs-vsctl add-port br-ex eno1", assuming your real live ethernet untagged access port is eno1.  At that instant the instance on that host gained network access.  Cool!  At that point it was about one in the morning and I went to sleep, planning to clean everything up the next day, which I did, and everything worked great after that.
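For the record, the repair sequence was roughly this, run from the compute node (bridge and interface names are from my setup; the del-port of the stale entry is an extra cleanup step I'm suggesting, not something described above):

```shell
# Inspect the bridge wiring; the stale bond12 port showed up here.
docker exec openvswitch_vswitchd ovs-vsctl show

# Drop the orphaned port that points at the no-longer-existing bond12...
docker exec openvswitch_vswitchd ovs-vsctl del-port br-ex bond12

# ...then attach the real, live access port instead.
docker exec openvswitch_vswitchd ovs-vsctl add-port br-ex eno1
```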

How to cook up a flat provider network:

If you have an Ubuntu 20.04 OpenStack host, /etc/netplan/os6.yaml (or whatever you name it for your host) looks like this:

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      mtu: 9000
      dhcp4: false

That should be about enough, for eno1 anyway.  I have five other ethernet ports doing other things.

In /etc/kolla/globals.d/neutron.yml:

neutron_external_interface: "eno1" or "bond12" or whatever you're using

enable_neutron_provider_networks: "yes" (technically this is optional)

enable_neutron_dvr: "yes" (technically this is optional)

The key takeaway is that the physical linux network interface name goes in this file.

The /etc/kolla/config/neutron/ml2_conf.ini needs these lines at minimum, plus tenant stuff:

[ml2]

type_drivers = flat,vxlan

[ml2_type_flat]

flat_networks = *

The key takeaway is * will let you use any name, although I believe physical linux network interface names like "eno2" or "bond34" would work.

openstack network create \
  --share \
  --external \
  --provider-network-type flat \
  --provider-physical-network physnet1 \
  --mtu 1500 \
  --default \
  --disable-port-security \
  external-net

(The --default flag is optional; skip it unless this IS your default network.)

The key takeaway is you connect to interface "physnet1" if you use Kolla-Ansible or "provider" if you skip Kolla-Ansible and install by hand using the "Installation Tutorial" docs.

If you "docker exec -it openvswitch_vswitchd /bin/bash" and run "ovs-vsctl show" you will NOT see this "physnet1" port on a compute node.  Where does it come from?  I do not currently know.  It's not on the controller node either.  I could probably figure this out, but I was getting pretty sick of fighting with Neutron by then.  I'm sure there's a diagram online somewhere.

Also in your "openstack subnet create" remember a --no-dhcp, as your provider net probably has its own perfectly good DHCP server.
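A minimal sketch of that subnet create, assuming the external-net created above sits on a hypothetical 10.10.0.0/16 LAN (all addresses here are illustrative):

```shell
# The provider LAN already runs its own DHCP server, so tell Neutron
# not to start one: --no-dhcp.  Subnet range and gateway are examples.
openstack subnet create \
  --no-dhcp \
  --subnet-range 10.10.0.0/16 \
  --gateway 10.10.1.1 \
  --network external-net \
  external-net-v4
```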

How to cook up a VLAN provider network:

If you have an Ubuntu 20.04 OpenStack host, /etc/netplan/os6.yaml (or whatever you name it for your host) looks like this:

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      mtu: 9000
    eno2:
      mtu: 9000
    eno3:
      mtu: 9000
    eno4:
      mtu: 9000
  bonds:
    bond12:
      mtu: 9000
      dhcp4: false
      dhcp6: false
      interfaces: [ eno1, eno2 ]
      parameters:
        mode: balance-xor
        mii-monitor-interval: 100
    bond34:
      mtu: 9000
      dhcp4: false
      dhcp6: false
      interfaces: [ eno3, eno4 ]
      parameters:
        mode: balance-xor
        mii-monitor-interval: 100
  vlans:
    bond34.10:
      id: 10
      link: bond34
      mtu: 9000
      addresses: [ 10.10.20.56/16 ]
      gateway4: 10.10.1.1
      critical: true
      nameservers:
        addresses:
        - 10.10.250.168
        - 10.10.249.196
        search:
        - cedar.mulhollon.com
        - mulhollon.com
    bond34.30:
      id: 30
      link: bond34
      mtu: 9000
      addresses: [ 10.30.20.56/16 ]
    bond34.60:
      id: 60
      link: bond34
      mtu: 9000
      addresses: [ 10.60.20.56/16 ]

The key takeaway is netplan makes everything harder PLUS the joy of having to enter everything in a computer interchange format.  I'd rather use punchcards and ISAM....
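If you do keep fighting netplan, a couple of sanity checks confirm the bonds and VLAN subinterfaces actually came up the way the YAML claims (interface names are from the config above; a sketch, not a guarantee):

```shell
# Apply with an automatic rollback timer in case you lose connectivity.
netplan try

# Verify the bond really enslaved eno3/eno4 (kernel bonding driver view).
cat /proc/net/bonding/bond34

# Verify the VLAN subinterface exists with the intended MTU and VLAN id.
ip -d link show bond34.10
```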

The /etc/kolla/config/neutron/ml2_conf.ini looks like this:

[ml2]

type_drivers = flat,vlan,vxlan

tenant_network_types = vxlan

mechanism_drivers = openvswitch,l2population

extension_drivers = port_security

[ml2_type_flat]

flat_networks =

[ml2_type_vlan]

network_vlan_ranges = physnet1:9:999

[ml2_type_vxlan]

vni_ranges = 1:1000

The key takeaway is that network_vlan_ranges maps the physical network name physnet1 to the allowed range of VLAN IDs, and flat_networks is now empty since everything rides a tagged VLAN.

The /etc/kolla/globals.d/neutron.yml looks like this:

# /etc/kolla/globals.d/neutron.yml

# On OS6 for cluster 2

#

---

network_interface: "bond34.10"

neutron_external_interface: "bond12"

tunnel_interface: "bond34.60"

enable_neutron_provider_networks: "yes"

enable_neutron_dvr: "yes"

The key takeaway is that the external interface is the base interface for the VLAN trunk, and I don't think you need to set neutron_tenant_network_types to vlan or whatever if you want your instances to keep using vxlan between themselves; probably.

The provisioning script for a provider network looks like this:

#!/bin/bash

# openstack-scripts/networks/prod-cluster2.sh
#
source /etc/kolla/admin-openrc.sh
# Note that only prod-net should be set default, all others skip that line
openstack network create \
  --share \
  --external \
  --provider-network-type vlan \
  --provider-physical-network physnet1 \
  --provider-segment 10 \
  --mtu 1500 \
  --default \
  prod-net
# prod-cluster1.sh and prod-cluster2.sh have different allocation pools on same LAN
openstack subnet create \
  --no-dhcp \
  --subnet-range 10.10.0.0/16 \
  --allocation-pool start=10.10.244.2,end=10.10.247.253 \
  --gateway 10.10.1.1 \
  --dns-nameserver 10.10.250.168 \
  --dns-nameserver 10.10.249.196 \
  --network prod-net \
  prod-net-v4
# TODO: prod-net-v6 definition belongs here
exit 0

The key takeaway is the provider-network-type is unsurprisingly vlan, the provider-physical-network is the mysterious physnet1 not some linux host interface name, and the VLAN dot1Q ID is stored in provider-segment.

Another thing to remember about VLANs and LAGs on Netgear switches, both 1G and 10G models: you have to admin up the port on the port page, admin up the LAG on the port page, and admin up the LAG on the LAG page (3x).  Also set the MTU in all three places, LOL.

As a last word about VLANs, if you want another take on the issue, perhaps you don't like my writing style or witty dry humour, try this git repository.  Google search for four words, kolla ansible vlan bacon.  Yes, bacon.  Not Kevin Bacon but the well known "shreddedbacon" on Github.  None of it made any sense to me until I figured it out myself, but different learning approaches work for different folks so take a look and "star" the repo.

https://github.com/shreddedbacon/openstack-lab

Port Security Issues

"Port Security" on OpenStack Neutron is kind of a misnomer.  It's not so much security as enforcement of anti-spoofing rules.  Neutron knows what IP address it assigned to the port, or at least it thinks so...  Thus it blocks all traffic sent from the "wrong" address or destined to the "wrong" address; this includes IPv4 addresses AND MAC addresses.

The problem with enforcing anti-spoofing rules is that if I'm setting up a boring vanilla instance, I'd be putting it on a tenant network behind a neutron NAT router, not on a provider net.  So the only reason for me to put an instance on a provider network is specifically, intentionally, very much willfully on purpose, to do "weird" things like have multiple IP addresses on an interface or send and receive traffic to strange IP and MAC addresses.  I understand the intention, if there's weird traffic, block it, but I wouldn't be using a provider net UNLESS I was intentionally putting weird traffic on it.  So for my use case, I "--disable-port-security" on the provider network as a matter of course and/or I shut off individual port security on individual instances.  WRT security, it's not any less secure than plugging hardware into an ethernet switch; it's nifty that every port CAN have a little firewall on it, but it's not necessary in most situations.  As an example of the hilarity that can result, if your external provider DHCP server decides to change a DHCP-configured instance's IP address, port security can lock that instance out until you either reconfigure neutron to the new address or shut off port security.

Another option for dealing with port security: if you know the alternative IP and MAC addresses in advance, supposedly it's possible to force neutron, and thus port security, to recognize an arbitrary list of addresses as valid.  Probably not worth the effort, in my case.
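For reference, the Neutron mechanism for this is "allowed address pairs"; a sketch (the port UUID, IP, and MAC below are placeholders, not from my deployment):

```shell
# Tell port security to also accept traffic for an extra IP/MAC pair
# on an existing port.  <port-uuid> and both addresses are placeholders.
openstack port set \
  --allowed-address ip-address=10.10.0.99,mac-address=fa:16:3e:00:00:01 \
  <port-uuid>
```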

In the long run, the solution to living with port security was to orchestrate everything properly.  Then it actually works pretty well!  Just because you can change things on the fly in the web UI doesn't mean you should, or that it will work.

DVR Distributed Virtual Router

Figuring out DVR mode (distributed virtual router) amounted to no more effort than adding one line to /etc/kolla/globals.d/neutron.yml:

enable_neutron_dvr: "yes"

Setting MTU to 9000 (well, later I used 8000, but whatever)

Edit /etc/kolla/config/neutron.conf

[DEFAULT]

global_physnet_mtu = 9000

Edit /etc/kolla/config/neutron/ml2_conf.ini

[ml2]

path_mtu = 9000
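For what it's worth, the reason these MTU knobs matter is encapsulation overhead; a quick back-of-envelope, assuming the commonly quoted ~50 bytes of VXLAN-over-IPv4 headers:

```shell
# VXLAN over IPv4 adds roughly 50 bytes of headers, so a 9000-byte
# physical MTU leaves about 8950 for instances on vxlan tenant networks.
phys_mtu=9000
vxlan_overhead=50
tenant_mtu=$((phys_mtu - vxlan_overhead))
echo "max tenant MTU: ${tenant_mtu}"
```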

Spent about one day on the struggle where "everyone knows" custom configuration sections reside in /etc/kolla/config as per globals.yml.  But the magic swift shell scripts drop all the swift config into /etc/kolla/config/swift instead of /etc/kolla/config, so naturally I'd want to drop my custom neutron.conf entry for MTU settings into /etc/kolla/config/neutron/neutron.conf.  That, of course, will not work and will NOT be applied.

Adding to the confusion, I made VLANs work first before trying to fix the MTU, and changes made to /etc/kolla/config/neutron/ml2_conf.ini most certainly DO work.  Note that file is in /etc/kolla/config/neutron, not /etc/kolla/config like you would expect it would "have" to be.

In summary, the single most important lesson in this entire blog series is: some custom configuration sections are very picky about where their files are, while others are best-effort and easier to use, which ends up misleading the unfortunate sysadmin.

Scripting vs Orchestration

I wrote scripts to provision all my provider LANs, and LANs and subnets are provisioned as part of my project scripts, so at least I can just run a script and get 100% repeatable results in seconds instead of "best of luck to you" and a half hour of web browser clicking, after which maybe it'll work if I made no human errors.

Then I scrapped the scripts and moved on to full-on Heat Orchestration, as it's more reproducible than hitting the REST API via the CLI.  I ran into weird problems, something like running a subnet create made multiple subnet entries or something.  Generally speaking, trying to use a REST API as if it's an orchestration system is going to hurt, especially if there's an actual orchestration system sitting right next to it.

https://gitlab.com/SpringCitySolutionsLLC/openstack-scripts

Quotas

See the project scripts in openstack-scripts WRT setting quotas.  Not too much to it.

Security Groups for ports

To my enormous annoyance, the "name" field of security groups is a global namespace.  So you can hide a security group named "ping" in the "iot" project, but nobody else on the system can name their project's security group "ping".  Furthermore, the only way to share sec groups in the docs involves manual manipulation of RBAC policies upon internal project IDs and nonsense like that.  So the workaround is every project gets its own "ssh" secgroup named "ssh-projectname".  Annoying, but it DOES work.
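The workaround scripts out easily enough; a sketch (the project name and the ssh rule are examples, not my exact openstack-scripts content):

```shell
# One "ssh" security group per project, named ssh-<project> to dodge
# the global name collision.  PROJECT here is an example value.
PROJECT=iot
openstack security group create --project "${PROJECT}" "ssh-${PROJECT}"
openstack security group rule create \
  --project "${PROJECT}" \
  --protocol tcp --dst-port 22 \
  "ssh-${PROJECT}"
```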

Tomorrow is Heat Orchestration Day.

Stay tuned for the next chapter!
