When vSphere 7 introduced vSphere Lifecycle Manager (vLCM) to replace VMware Update Manager (VUM), it was announced that the Baseline update approach would be deprecated in a later version, in favour of Image-based updates.
We make extensive use of automation around Baselines, to show measurable compliance against our quarterly patching cycle. The process went something like this:
On the first day of each calendar quarter (e.g. 1st January, 1st April), a script is run against each vCenter to create a new baseline that includes all the latest VMware patches. This is attached to all datacenters, and any baselines older than a year are removed.
Scripts are then run against each vCenter to apply the new baseline, starting with the Lab environment and working through in order of criticality, finishing with Production. For example, the Lab would be done as soon as the baseline is available, non-production environments a week later, DR a week after that, and finally Production a week after DR.
Reports are run weekly, showing the status of each host against the baselines, enabling us to track the compliance, and the progress of the rollout.
Image-based updates work quite differently, so this approach needed some rethinking. The only feasible option seems to be to measure compliance against the defined image, so that is what I’ve gone with.
The main purpose of this post is to cover the PowerCLI cmdlets and structures that were used to achieve this, both as an aide-mémoire for me, and to share my findings for anyone doing a similar migration.
Checking if a cluster is using Image-based updates
This is necessary to allow a staged approach through the environments, while keeping to the quarterly update cycle across the estate. It’s not something I found possible to do via the normal LCM cmdlets; I had to dig around in the SDK cmdlets.
Anything using the SDK cmdlets has to use the bare MoRef IDs, which means cropping the object type off the front of the ID.
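A minimal sketch of the check, assuming the VMware.Sdk.vSphere modules bundled with recent PowerCLI are loaded; Invoke-GetClusterEnablementSoftware is the auto-generated wrapper for the /api/esx/settings/clusters/{cluster}/enablement/software endpoint, and the exact cmdlet name may vary between PowerCLI releases:

```powershell
# Assumes $cluster holds the cluster name and a Connect-VIServer session exists
# Trim the object type off the MoRef ID, as the SDK cmdlets require
$clusterid = (Get-Cluster $cluster).Id.Replace("ClusterComputeResource-","")
# Query the vLCM enablement state; Enabled = $true means Image-based updates
$enablement = Invoke-GetClusterEnablementSoftware -Cluster $clusterid
if ($enablement.Enabled) {
    Write-Host "$cluster is using Image-based updates"
} else {
    Write-Host "$cluster is still using Baselines"
}
```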
Remediating a cluster this way is significantly more complicated, but our VMware TAM pointed me towards the SDK cmdlets again for this.
# The SDK cmdlets need the MoRef IDs trimming
$clusterObj = Get-Cluster $cluster
$clusterid = $clusterObj.Id.Replace("ClusterComputeResource-","")
$vmhostid = (Get-VMHost $vmhost).Id.Replace("HostSystem-","")
# Get the time just before we start the task, so we can filter the Get-Task output
$start = Get-Date
# Create a specification object - you can supply more than one $vmhostid, comma separated
$SettingsClustersSoftwareApplySpec = Initialize-SettingsClustersSoftwareApplySpec -Hosts $vmhostid -AcceptEula $true
# Apply the specification object to the cluster
Invoke-ApplyClusterSoftwareAsync -Cluster $clusterid -SettingsClustersSoftwareApplySpec $SettingsClustersSoftwareApplySpec
# The apply task runs async, and its output doesn't map to a task ID, so find the task ourselves
$task = Get-Task | Where-Object { $_.ObjectId -eq $clusterObj.Id -and $_.StartTime -gt $start -and $_.Name -eq 'apply$task' }
# Loop until the task finishes
While ($task.State -eq "Running") {
    Start-Sleep -Seconds 60
    $task = Get-Task | Where-Object { $_.ObjectId -eq $clusterObj.Id -and $_.StartTime -gt $start -and $_.Name -eq 'apply$task' }
}
We do things this way so that we can silence alerting for each host while it patches. If that’s not something you bother with, then it’s far simpler to do the one-liner above!
I came across this upgrade failure while upgrading our lab environment.
The error happens during the post-upgrade section, when it runs make against /opt/health/Makefile.
In 8.6.2 the relevant section is:
single-aptr: eth0-ip
$(begin_check)
echo Check the ip address if eth0 resolves only to a single hostname
[ 1 -eq $$( host $$( iface-ip eth0 ) | wc -l ) ]
$(end_check)
This changes in 8.8.2 to:
single-aptr: eth0-ip
$(begin_check)
echo Check the ip address if eth0 resolves only to a single hostname
[ 1 -eq $$(/usr/bin/dig +noall +answer +noedns -x $$( iface-ip eth0 ) | grep "PTR" | wc -l ) ]
$(end_check)
I’m pretty sure that ‘host’ uses the hosts file as the primary resolution, whereas dig goes straight out to DNS.
As it turns out the reverse entry of the appliance FQDN was missing, which was causing the upgrade to bomb out at this point. Simply adding the reverse record was all that was needed to resolve this, and allow the upgrade to be re-run from the pre-upgrade snapshot.
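If you want to confirm the DNS fix before re-running the upgrade, you can replicate the 8.8.2 health check from any machine with the BIND dig utility on the path. This is a sketch only; the IP address is a placeholder for your appliance’s eth0 address:

```powershell
# Placeholder - replace with the appliance's eth0 IP address
$ip = '192.168.10.20'
# The upgraded health check passes only when exactly one PTR record is returned
$ptrCount = (dig +noall +answer +noedns -x $ip | Select-String 'PTR').Count
if ($ptrCount -eq 1) {
    Write-Host "OK - a single PTR record exists for $ip"
} else {
    Write-Warning "Found $ptrCount PTR records for $ip - fix the reverse zone first"
}
```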
We’ve had a lot of memory DIMM issues with our Dell servers over the last couple of years, and some of them have caused VM HA events. Because of this, we decided to take another look at OMIVV (OpenManage Integration for VMware vCenter) and saw that monitoring of memory failure conditions had now been added.
I implemented it across the estate a couple of weeks ago and had a couple of issues that I thought it might be useful to share.
Implementing Proactive HA on one cluster caused an HA fail-over
When you enable Proactive HA on a cluster and select the Dell Proactive HA Provider, its default state is to mark all the hosts in the cluster as “Unknown Health State”.
As it detects the true health state on the hosts, it marks them healthy and the unknown health state clears.
The problem is that DRS sees the unknown health state, and tries to move VMs from hosts that appear to be “unhealthy” to the few that have been marked as healthy.
This resulted in a critically low memory state on one host: after VMs totalling 1 TB of allocated memory were migrated to a 512 GB host, the balloon driver reclaimed 450 GB from VMs and swapping set in. I caught it only after 2 VMs had crashed due to not being able to allocate memory, and were HA’d to another host.
My recommendation would be to disable DRS before enabling Proactive HA, and wait for the hosts to all go to a healthy state, before re-enabling it.
I have raised this with Dell Product Support, as I see it as an issue.
OMIVV sees powered-off hosts as having a fault condition
We use DPM (Distributed Power Management) to turn off some hosts overnight when the cluster load is lower. This reduces power consumption and was introduced to reduce our environmental impact.
Unfortunately, when a Dell server powers off, its PSU redundancy state is reported as “Disabled”. This causes Proactive HA to quarantine the host, and vROps then raises a critical alert: “Proactive HA provider has reported health degradation on the underlying hosts”.
Obviously this isn’t as severe a concern as the other issue, but it is spurious alerting, and we only want to alert on real issues.
For the time being I have disabled the PSU Failure Conditions in the Dell Proactive HA Provider, and it has also been raised with Dell Product Support.
One of our vCenters started having backup failures (the file-based backup built into the VCSA) after it was upgraded to 7.0U2b.
The resolution is pretty simple; however, there doesn’t appear to be a KB article for this, or if there is, it doesn’t show up when you search for the error message in the title.
The extract from /var/log/vmware/applmgmt/backup.log is:
2021-06-23T12:22:40.705 [20210623-122202-17958471] [VCDBBackup:PID-58776] [VCDB::BackupVCDB:VCDB.py:2060] ERROR: Encounter error during backup VCDB. Traceback (most recent call last): File "/usr/lib/applmgmt/backup_restore/py/vmware/appliance/backup_restore/components/VCDB.py", line 1939, in BackupVCDB raise Exception('Full backup not allowed during VM snapshot') Exception: Full backup not allowed during VM snapshot
This is generally caused by an interruption to the backup, or a restore from backup, leaving a flag file in place.
Simply remove or rename /etc/vmware/backupMarker.txt and re-run the backup.
We’re currently running through the 7.0U2 upgrades, and have encountered this on one of them. It failed at 92% during the data conversion step, and as part of the diagnostics, support had us run a postgres consistency check script: https://kb.vmware.com/s/article/53062
The script does many checks, but it also creates a postgres extension called amcheck_next.
This caused subsequent upgrade attempts to fail; the following message appears in /var/log/vmware/applmgmt/PatchRunner.log
2021-05-27T06:56:41.087Z Running: su -s /bin/bash vpostgres -c cd /var/log/vmware/vpostgres/upgrade && /opt/vmware/vpostgres/12/bin/pg_upgrade -U postgres -B /opt/vmware/vpostgres/12/bin -b /opt/vmware/vpostgres/12/../11/bin -D /storage/db/vpostgres.12 -d /storage/db/vpostgres --check
2021-05-27T06:56:41.087Z Running command: ['su', '-s', '/bin/bash', 'vpostgres', '-c', 'cd /var/log/vmware/vpostgres/upgrade && /opt/vmware/vpostgres/12/bin/pg_upgrade -U postgres -B /opt/vmware/vpostgres/12/bin -b /opt/vmware/vpostgres/12/../11/bin -D /storage/db/vpostgres.12 -d /storage/db/vpostgres --check']
2021-05-27T06:56:42.140Z Done running command
, stderr: 2021-05-27T06:56:42.141Z pg_upgrade --check returned error code 1 with error out Performing Consistency Checks
-----------------------------
Checking cluster versions ok
Checking database user is the install user ok
Checking database connection settings ok
Checking for prepared transactions ok
Checking for reg* data types in user tables ok
Checking for contrib/isn with bigint-passing mismatch ok
Checking for tables WITH OIDS ok
Checking for invalid "sql_identifier" user columns ok
Checking for presence of required libraries fatal
Your installation references loadable libraries that are missing from the
new installation. You can add these libraries to the new installation,
or remove the functions using them from the old installation. A list of
problem libraries is in the file:
loadable_libraries.txt
Failure, exiting
.
2021-05-27 06:56:42,169.169Z vpostgres:Patch ERROR vmware_b2b.patching.executor.hook_executor Patch hook 'vpostgres:Patch' failed.
Checking the “loadable_libraries.txt” in ‘/storage/log/vmware/vpostgres/upgrade’ showed the following:
could not load library "$libdir/amcheck_next": ERROR: could not access file "$libdir/amcheck_next": No such file or directory
Database: VCDB
To rectify this, after rolling back to the pre-upgrade snapshot, I just connected to the DB with ‘psql -U postgres -h localhost -d VCDB’ and ran ‘DROP EXTENSION amcheck_next;’
I then cleared out the VUM database (which had caused the original failure) and the update then ran through successfully.
This is as much a reminder for me as for anyone else, having come across this a couple of times and always struggled to find the KB article on how to resolve it.
If you perform a migration upgrade on a vCenter Server Appliance, such as from 6.7 to 7.0, and the upgrade fails on the new appliance, necessitating a backout, it is common to get a ‘data transfer and appliance setup is in progress’ message on the original appliance.
OpenSLP has cropped up again as an ESXi vulnerability, and if you want to disable the service, the KB article only gives details for doing so via the ESXi command line.
Far easier, if you have many hosts, is to use PowerCLI, and while it’s relatively simple I thought I would share this to help anyone else wanting to do so.
Disabling the service
Connect to the environment with ‘Connect-VIServer’ and then run:
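A sketch of the disable step using the standard PowerCLI service cmdlets (slpd is the service key for OpenSLP on ESXi): stop the running service, then set its startup policy to Off so it stays down after a reboot.

```powershell
# Stop and disable the SLP service on every host the session can see
Get-VMHost | Get-VMHostService | Where-Object { $_.Key -eq 'slpd' } | ForEach-Object {
    # Stop the service if it is currently running
    if ($_.Running) { Stop-VMHostService -HostService $_ -Confirm:$false }
    # Prevent it starting again at boot
    Set-VMHostService -HostService $_ -Policy Off
}
# Checking - confirm the resulting state across the estate
Get-VMHost | Get-VMHostService | Where-Object { $_.Key -eq 'slpd' } |
    Select-Object VMHost, Key, Running, Policy
```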
Edit: As per the comment from Zeev, I’d missed disabling the service. I’ve updated the Disabling and Checking scripts above to include the correct information.
The ESXi vulnerability found at the 2020 Tianfu Cup was a Critical one, with a CVSSv3 base score of 9.3.
VMware lists an article with the fixes and workarounds here: https://www.vmware.com/security/advisories/VMSA-2020-0026.html
The fix is to apply the latest patch, and the workaround is to remove the xHCI (USB 3.0) controller from any VMs that have it.
To quickly determine whether you have an exposure you can run the following PowerCLI against your environment and it will list the VMs which have that particular controller type attached.
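A sketch of such a check: in the vSphere API the controller appears as a VirtualUSBXHCIController device, so filtering each VM’s hardware list for that type finds the exposed VMs.

```powershell
# List VMs that have a virtual xHCI (USB 3.0) controller attached
Get-VM | Where-Object {
    $_.ExtensionData.Config.Hardware.Device |
        Where-Object { $_ -is [VMware.Vim.VirtualUSBXHCIController] }
} | Select-Object Name
```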
I’ve started looking at upgrading our standalone vRO instances from 7.x to 8.x, and one thing that has changed significantly is that we can no longer use the appliance’s Linux environment to run dig or nslookup.
The replacements are the scripting calls System.resolveHostName and System.resolveIpAddress. These allow the usual forward and reverse lookups, but have significant limitations:
System.resolveHostName :
Only returns one record at a time, so if there are multiple records you would have to write a loop to collect them all
Only returns the IP address, no ability to return record type, TTL, SOA record
System.resolveIpAddress :
Only returns one record at a time, so if there are multiple records you would have to write a loop to collect them all
Only returns the host name, no ability to return record type, TTL, SOA record
ONLY WORKS IF A MATCHING FORWARD RECORD EXISTS
This final point took some significant figuring out, and in combination with the rest of the points, resulted in me changing some of the workflows to SSH to a Linux server and run normal dig and nslookup commands, rather than using the System calls.