Another Lab Power Cycle

Posted by Brad on Tue 23 October 2018

This is going to be a bit of a double post since I didn't publish the first part when I wrote it. When I go on vacation I like to power down the lab, both to make sure nothing happens to it and to save power. Every time I power the lab down completely there is inevitably something new that breaks or doesn't come back up properly. Fortunately, now that I've done this a few times the number of issues is decreasing. This time I lucked out and only had some issues auto-mounting datastores (and a couple of secondary issues caused by this). I should probably figure out what the deal is with the datastores, but I suspect it's related to my odd setup, where vCenter lives on its own cluster, combined with powering everything down completely. Here's a step-by-step of what it should take to restore the lab completely, as well as how it actually happened.

Powering Down the Lab

  1. Backup everything "off-site"
  2. Ensure firewall and dns server are on the Dell R610 (since it has local storage)
  3. Ensure switch running config has been saved
  4. Power off non-critical VMs
  5. Power off vcenter
  6. Power off individual hosts (see the sketch after this list)
  7. Power off storage server
  8. Power off 3560G rack switch
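
For step 6, here's a rough sketch of the graceful way to do it from the ESXi shell in case the host webui isn't handy (the VM ID and shutdown reason are placeholders):

    # List registered VMs and note their IDs (first column)
    vim-cmd vmsvc/getallvms

    # Gracefully shut down a VM by ID (needs VMware Tools in the guest)
    vim-cmd vmsvc/power.shutdown <vmid>

    # Once the VMs are down, enter maintenance mode and power the host off
    esxcli system maintenanceMode set --enable true
    esxcli system shutdown poweroff --reason "lab power down"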

Powering Up the Lab

  1. Power on the 3560G switch
  2. Power on the storage server (warehouse)
  3. Power on the Dell R610 (since it has the firewall and dns servers on local storage)
  4. Power on labfw, dns1, and vcenter (see the sketch after this list)
  5. Connect to vcenter and power on other VMs as needed
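
Since steps 3 and 4 happen before vCenter is back on the network, labfw, dns1, and vcenter have to be powered on from the host itself, either through the host webui or with something like the following from the ESXi shell (the VM IDs are specific to the host):

    # Find the VM IDs for the critical VMs
    vim-cmd vmsvc/getallvms | grep -E 'labfw|dns1|vcenter'

    # Power each one on by ID and confirm it's running
    vim-cmd vmsvc/power.on <vmid>
    vim-cmd vmsvc/power.getstate <vmid>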

Issues This Time

Where'd my Datastores go?

Before I dive into the datastore issues I need to explain why they mattered when the so-called "critical" VMs were all supposed to be on local storage. It turns out my tiny 120 GB SSD didn't have room for the vCenter appliance, so in my infinite wisdom I moved it to nfs... only I forgot that it was on nfs instead of iscsi, so I had to go find it.

There were fewer storage issues this time around, and ultimately I think they're as limited as I can make them. The targetd configuration is good now, since I re-configured it to use the device UUID instead of /dev/sdX. However, neither esxi host mounted the iscsi or nfs datastores at boot, and for whatever reason the esxi webui doesn't have an option to force mount them either. So I enabled ssh and attempted to connect from my workstation, which is when I realized the next mistake. My workstation stores its ssh config on nfs, and (most likely because I was being impatient) the nfs mount wasn't connecting. Also worth noting is that my workstation is behind the lab firewall AND connects to nfs using a dns name instead of an IP. So I quickly reconfigured some bits of nfs to bypass the lab firewall and dns, which made my ssh config happy, though I later realized that since I was able to power on the dns and firewall VMs, I likely just needed to wait for them to come up (and maybe reboot the workstation) rather than completely reconfigure its nfs.
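Part of that nfs workaround was just pointing the workstation's mount at the storage server's IP so the lab dns wasn't in the path; roughly something like this, with placeholder paths and addresses rather than my real config:

    # Remount the nfs export by IP to take lab dns out of the picture
    umount /mnt/nfs
    mount -t nfs 192.168.1.20:/export/home /mnt/nfs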

Once ssh'd into the esxi hosts I had to run esxcfg-volume -l to find the iscsi volumes. Then I ran esxcfg-volume -m <volume_label> to mount one; later I realized I probably should have used a big M in the mount command so the mount would persist in the future. Now, a scary thing happens when you run esxcfg-volume -m: the volume no longer appears when running esxcfg-volume -l, so I thought I had screwed something up, when really the webui just hadn't fully refreshed.
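For future me, the whole dance looks like this (the label comes from whatever esxcfg-volume -l reports):

    # List VMFS volumes the host detected but didn't auto-mount
    esxcfg-volume -l

    # Mount by label for this boot only (what I actually ran)
    esxcfg-volume -m <volume_label>

    # Or mount persistently so it sticks around after the next power cycle
    esxcfg-volume -M <volume_label>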

With the iscsi datastore mounted I realized the vCenter appliance was on the nfs datastore, and found out esxi actually splits nfs 3 and 4.1 into separate commands. So after an initial scare I found it with the esxcli storage nfs41 command. It wouldn't let me mount the datastore without first removing it with esxcli storage nfs41 remove -v <volume_name>. To add it back I used the same settings as before: esxcli storage nfs41 add -H <nfs_server> -s <nfs_mountpoint> -v <volume_name>. And with that I was able to power on vCenter and everything was back to normal.
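For reference, the nfs 4.1 commands roughly as I ran them (server, mountpoint, and volume name left as placeholders):

    # nfs 3 and nfs 4.1 datastores live under different namespaces
    esxcli storage nfs list
    esxcli storage nfs41 list

    # Remove the stale 4.1 entry, then add it back with the same settings
    esxcli storage nfs41 remove -v <volume_name>
    esxcli storage nfs41 add -H <nfs_server> -s <nfs_mountpoint> -v <volume_name>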

Mini-conclusion

This time around I probably did a few extra steps that wouldn't have been necessary had I verified connectivity to the firewall and dns servers rather than just working around them, but in any case the power up went fairly smoothly compared to the previous time, and it should be even better the next time around.

After the Previous Vacation

The previous power cycle event was a bit more involved... When I returned, nothing came back up as expected. This is why it's important to have configuration management and low system uptime. Even with the issues I encountered the lab is in a pretty good place, and several of the issues were simple things like not saving a config. To alleviate some of the issues I encountered I'm going to attempt to fully patch the lab monthly-ish.

Here's an overview of the issues, what I did to troubleshoot them, and what I did to prevent them from happening again.

Network Issues

To start things off I powered up the storage server and then the esxi hosts. Both booted up fine, and I logged into the esxi host with my vcenter and firewall VMs and powered them on. However, I couldn't hit either of those over the network after they booted. ESXi was complaining about the port group not existing, so I thought it had gotten confused by the vcenter server not being powered on (which was probably accurate, but not the underlying issue). The actual issue was that I had messed up some port-channel settings on my switch and didn't save the configuration (or actually finish troubleshooting) when I had seen the problem during the initial distributed portgroup set up. Once the port-channel configuration was corrected, traffic started flowing as expected. I still need to do some testing to see whether the setup can handle individual ports on the switch/host failing.
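I didn't write down the exact troubleshooting commands at the time, but the host-side sanity checks for this sort of thing are along the lines of:

    # Check physical uplink link state and speed
    esxcli network nic list

    # Confirm the host still has its cached copy of the distributed switch
    esxcli network vswitch dvs vmware list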

Storage Issues

With vCenter now working I was ready to power on the rest of the VMs. However, the VMs on the iSCSI storage were all showing as inaccessible, as was the datastore itself. After logging into the storage server I was able to find a number of issues that were easily fixed. The first issue was that the iSCSI port wasn't allowed in FirewallD, since I had previously turned the firewall off. With a simple firewall-cmd --add-service=iscsi-target and firewall-cmd --add-service=iscsi-target --permanent that issue was resolved. Pro tip: to find out what services are available in firewalld you can run firewall-cmd --get-services.
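Put together, the firewalld fix looked roughly like this:

    # See which service definitions firewalld knows about
    firewall-cmd --get-services

    # See what's currently allowed in the active zone
    firewall-cmd --list-services

    # Allow the iSCSI target both now and across reboots
    firewall-cmd --add-service=iscsi-target
    firewall-cmd --add-service=iscsi-target --permanent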

With that done it was time to look at the targetd configuration. I was expecting it to not have been saved, but it turned out the whole config was there except for the block device. Since I had originally hot-added the drive, it picked up a different device name than it does when attached at boot. I probably should have used the UUID when I configured targetd, but I went with the new device name since at that point I just wanted it to work.
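The better fix (which I did get around to before this latest shutdown) is to point the backing store at a persistent identifier instead of /dev/sdX. A sketch of that idea, assuming the block backstore is managed with targetcli; the backstore name and device path below are made up:

    # Find a stable name for the disk instead of /dev/sdX
    ls -l /dev/disk/by-id/
    blkid /dev/sdb

    # Create the block backstore against the stable by-id path
    targetcli /backstores/block create name=lab-iscsi dev=/dev/disk/by-id/wwn-0x5000c500a1b2c3d4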

So now the ESXi hosts were able to see the drive again, but the datastore was still inaccessible. Rescanning the storage adapters and even rebooting a host had no effect. Finally, I tried to add a new datastore and it was able to recognize that there was an existing signature on the disk and import it. However, this only fixed it for the host that I did the import on, and I couldn't simply run the import a second time on the other host. So I storage-vmotioned everything (except my poor EL7 template) to an nfs datastore and re-formatted the iSCSI datastore. After re-formatting and adding it as a new datastore, both hosts were able to see it and the lab infrastructure was restored to full health.
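In hindsight there's a cleaner path for the second host than reformatting: ESXi can force-mount or resignature a volume it considers a snapshot/copy from the CLI. I didn't know about it at the time, so this is just for next time:

    # Show VMFS volumes the host thinks are snapshots/copies
    esxcli storage vmfs snapshot list

    # Force-mount the volume, keeping its existing signature...
    esxcli storage vmfs snapshot mount -l <volume_label>

    # ...or write a new signature (the datastore comes back renamed snap-xxx-<label>)
    esxcli storage vmfs snapshot resignature -l <volume_label>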

Guacamole Issues

I've been trying not to just disable SELinux lately. And for the most part it's fairly transparent when using pre-packaged applications. For my guacamole server I had disabled SELinux during the initial set up and never went back and corrected it. Having dealt with the nginx SELinux stuff before, I thought maybe I just had to add the port to those that nginx is allowed to proxy to with semanage port --add --type http_port_t --proto tcp 8080, but 8080 was already there.
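Confirming that was just a matter of listing what's already tagged http_port_t:

    # 8080 already shows up under http_port_t, so port labeling wasn't the problem
    semanage port --list | grep http_port_t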

So it was time to break out the awesome audit2why to figure out what's actually being denied. First, find an AVC denial in your audit.log and then pipe it over to audit2why with something like grep 1521762214.216:81 /var/log/audit/audit.log | audit2why. This will likely give you the exact command you need to resolve the issue; in my case it was setsebool -P httpd_can_network_relay 1 to allow nginx to be used as a proxy. And with that change my server was fully functional again.
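The whole flow, with the event ID being whatever shows up in your own audit.log:

    # Find recent AVC denials and grab an event ID
    grep AVC /var/log/audit/audit.log | tail

    # Ask audit2why what policy change the denial calls for
    grep 1521762214.216:81 /var/log/audit/audit.log | audit2why

    # Apply the suggested boolean persistently and verify it stuck
    setsebool -P httpd_can_network_relay 1
    getsebool httpd_can_network_relay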

For the Future

So those are all the issues that were caused by a full shut down of the lab. They have been permanently fixed at this point, but there are some general lessons to be learned from them. The first is that I need to get better about confirming changes persist after a reboot; routine patching will be a good way to catch these. Better configuration management will also help. That is, instead of just disabling a firewall on a host, I should either do it via ansible so it's at least tracked, or better yet, finish troubleshooting and resolve issues instead of moving on to other things. I would also like to get more configuration management around the vcenter side of things, to make that easier to rebuild if I need to.

In any case, the lab handled the vacation much better than some of the previous times I had to shut it down and is progressing nicely.