vRealize Orchestrator 8.10 – VcNetwork object change

Just a quick one here: I’ve updated our Lab vRO instance to 8.10.1 and found a change to the VcNetwork object that broke some of our workflow logic.

The object’s name property no longer includes the name of the parent VDS in brackets.

Before:

After:

I can’t see any mention of this in the release notes, but it might be worth watching out for if you see any issues with network selection workflows.

vSphere Lifecycle Manager Image-based Updates – Automation

When vSphere 7 introduced vLCM to replace VUM (VMware Update Manager), it was announced that the Baseline update approach would be deprecated in a later version, in favour of Image-based updates.

We make extensive use of automation around Baselines to show measurable compliance against our quarterly patching cycle. The process went something like this (a rough sketch of the baseline step follows the list):

  • On the first day of each calendar quarter (e.g. 1st Jan, 1st April, etc.) a script is run against each vCenter to create a new baseline that includes all the latest VMware patches. The baseline is attached to all datacenters, and any baselines older than a year are removed
  • Scripts are then run against each vCenter to apply the new baseline, starting with the Lab environment and working through in order of criticality, finishing with Production. For example, the Lab is done as soon as the baseline is available, non-production environments a week later, DR a week after that, and finally Production a week after DR.
  • Reports are run weekly, showing the status of each host against the baselines, enabling us to track compliance and the progress of the rollout.
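
For context, the baseline-creation step looked roughly like this. This is a sketch using the VMware.VumAutomation cmdlets; the naming convention and the patch filter are illustrative rather than our exact script:

# Sketch: create this quarter's baseline and attach it to every datacenter
$quarter  = "{0}-Q{1}" -f (Get-Date).Year, [math]::Ceiling((Get-Date).Month / 3)
$patches  = Get-Patch -After (Get-Date).AddMonths(-3)        # illustrative filter for "the latest patches"
$baseline = New-PatchBaseline -Static -Name "Quarterly-$quarter" -IncludePatch $patches
Attach-Baseline -Baseline $baseline -Entity (Get-Datacenter) -Confirm:$false
# (baselines older than a year are then detached and removed with Remove-Baseline)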

Image-based updates work in quite a different way, so this approach needed some rethinking. The only feasible option seems to be to measure compliance against the defined image, so that is what I’ve gone with.

The main purpose of this post is to cover the PowerCLI cmdlets and structures used to achieve this, both as an aide-mémoire for me and to share my findings with anyone doing a similar migration.

Checking if a cluster is using Image-based updates

This is a simple one:

(Get-Cluster $cluster).CollectiveHostManagementEnabled

returns $true when the cluster is using Image-based updates.
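
The same property makes a handy filter if you want to see which clusters in a vCenter are already image-managed, for example:

# List the clusters that are already using Image-based updates
Get-Cluster | Where-Object { $_.CollectiveHostManagementEnabled } | Select-Object Name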

Updating the Image

I had to use a Try…Catch here to detect whether there was an update to apply:

try {
	$rec = Get-LcmClusterDesiredStateRecommendation -Current -Cluster $cluster -ErrorAction:Stop
} catch {
	Write-Output "Cluster $cluster has no recommended updates"
	# move on to the next vCenter
	continue
}

$update = Set-Cluster -Cluster $cluster -BaseImage $rec.Image -VendorAddOn $rec.VendorAddOn -Confirm:$false

This only updates the Base Image and Vendor AddOn; I’ve not touched Components, or the Firmware and Drivers Addon, at this time.
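
The try/catch and Set-Cluster call sit inside a larger loop (per vCenter in our case, as the comment above suggests); a simplified, self-contained per-cluster version using only the cmdlets already shown would be:

# Sketch: update the image definition on every image-managed cluster in the connected vCenter
foreach ($cluster in (Get-Cluster | Where-Object { $_.CollectiveHostManagementEnabled })) {
    try {
        $rec = Get-LcmClusterDesiredStateRecommendation -Current -Cluster $cluster -ErrorAction:Stop
    } catch {
        Write-Output "Cluster $cluster has no recommended updates"
        continue
    }
    Set-Cluster -Cluster $cluster -BaseImage $rec.Image -VendorAddOn $rec.VendorAddOn -Confirm:$false | Out-Null
}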

Testing compliance

Test-LcmClusterCompliance returns an object containing the overall compliance status, along with arrays of the hosts in each state.

$comp = Get-Cluster -Name $cluster | Test-LcmClusterCompliance
$comp.Status
$comp.CompliantHosts
$comp.NonCompliantHosts

I’ve then used this output to produce a report (and sorry, this is quite verbose!):

$statuses=@()

$statuses+=$comp.NonCompliantHosts | %{ $_ | select VMhost, 
            @{N="CurrentImage";E={[string]$_.BaseImageCompliance.Status + " " + 
            $_.BaseImageCompliance.Current.Name + " - " + $_.BaseImageCompliance.Current.Version}}, 
            @{N="CurrentAddOn";E={[string]$_.AddOnCompliance.Status + " " + 
            ($_.AddOnCompliance.Current.Name).Replace("PowerEdge Servers running ","") + " - " + 
            $_.AddOnCompliance.Current.Version}}, 
            @{N="TargetImage";E={$_.BaseImageCompliance.Target.Name + " - " + 
            $_.BaseImageCompliance.Target.Version}},
            @{N="TargetAddOn";E={($_.AddOnCompliance.Target.Name).Replace("PowerEdge Servers running ","")+ 
            " - " + $_.AddOnCompliance.Target.Version}} } 
                                        
$statuses+=$comp.CompliantHosts | %{ $_ | select VMhost, 
            @{N="CurrentImage";E={[string]$_.BaseImageCompliance.Status + " " + 
            $_.BaseImageCompliance.Current.Name + " - " + $_.BaseImageCompliance.Current.Version}}, 
            @{N="CurrentAddOn";E={[string]$_.AddOnCompliance.Status + " " + 
            ($_.AddOnCompliance.Current.Name).Replace("PowerEdge Servers running ","") + " - " + 
            $_.AddOnCompliance.Current.Version}}, 
            @{N="TargetImage";E={$_.BaseImageCompliance.Target.Name + " - " + 
            $_.BaseImageCompliance.Target.Version}},
            @{N="TargetAddOn";E={($_.AddOnCompliance.Target.Name).Replace("PowerEdge Servers running ","")+ 
            " - " + $_.AddOnCompliance.Target.Version}} } 

This output is then formatted into the weekly report.
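
A minimal sketch of the export step would be something like the following; the path and the choice of CSV here are purely illustrative:

# Sketch: write the collected rows out for the weekly report
$statuses | Sort-Object VMHost |
    Export-Csv -Path "C:\Reports\vLCM-Compliance-$(Get-Date -Format yyyyMMdd).csv" -NoTypeInformation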

Checking when the Image was updated

This is necessary to allow a staged approach through the environments, while keeping the quarterly updates across the estate. It’s not something I found possible to do via the normal LCM cmdlets, so I had to dig around in the SDK cmdlets.

Anything using the SDK cmdlets has to use the bare moref IDs, which means cropping the object type off the front of the ID:

$comp = Invoke-GetClusterSoftwareCompliance -Cluster `
(Get-Cluster $cluster).Id.Replace("ClusterComputeResource-","")

$comp

incompatible_hosts  : {}
hosts               : @{host-7829=; host-7830=}
non_compliant_hosts : {}
impact              : NO_IMPACT
commit              : 10
compliant_hosts     : {host-7829, host-7830}
scan_time           : 13/10/2022 12:18:26
unavailable_hosts   : {}
notifications       :
host_info           : @{host-7829=; host-7830=}
status              : COMPLIANT

$commit = Invoke-GetClusterCommitSoftware -Cluster `
(Get-Cluster $cluster).Id.Replace("ClusterComputeResource-","") -commit $comp.commit

author           apply_status description commit_time
------           ------------ ----------- -----------
xxxxxx@xxx.xxx   APPLIED                  13/10/2022 12:30:25

That commit_time is the time the Image was updated.
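
That timestamp is what drives the staggered rollout. A rough sketch of the scheduling check is below; $environment and the per-wave delays are assumptions standing in for however you drive the schedule, and it assumes commit_time deserialises as a DateTime (as it appears above):

# Sketch: work out whether this cluster's wave is due for remediation yet
$waveDelayDays = @{ Lab = 0; NonProd = 7; DR = 14; Prod = 21 }
$environment   = "NonProd"

$clusterId = (Get-Cluster $cluster).Id.Replace("ClusterComputeResource-","")
$comp      = Invoke-GetClusterSoftwareCompliance -Cluster $clusterId
$commit    = Invoke-GetClusterCommitSoftware -Cluster $clusterId -commit $comp.commit

$dueDate = $commit.commit_time.AddDays($waveDelayDays[$environment])
if ((Get-Date) -ge $dueDate -and $comp.status -ne "COMPLIANT") {
    Write-Output "Cluster $cluster is due for remediation in the $environment wave"
}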

Applying the image – Whole Cluster

This can be done with a one-liner:

Get-Cluster -Name $cluster | Set-Cluster -Remediate -AcceptEULA

Applying the image – Individual Host

This is significantly more complicated, but our VMware TAM pointed me in the direction of the SDK cmdlets again for this.

# The SDK cmdlets need the moref IDs trimming
$clusterid = (Get-Cluster $cluster).Id.Replace("ClusterComputeResource-","")
$vmhostid = (Get-VMHost $vmhost).Id.Replace("HostSystem-","")

# Get the time just before we start the task, so we can filter the Get-Task output
$start = Get-Date

# Create a specification object - you can supply more than one $vmhostid, comma separated
$SettingsClustersSoftwareApplySpec = Initialize-SettingsClustersSoftwareApplySpec -Hosts `
 $vmhostid -AcceptEula $true 

# Apply the specification object to the cluster
Invoke-ApplyClusterSoftwareAsync -Cluster $clusterid -SettingsClustersSoftwareApplySpec `
 $SettingsClustersSoftwareApplySpec

# The apply task runs async, and the output doesn't seem to match to a task id, so now we find the task
$task = Get-Task |?{$_.ObjectId -eq (Get-Cluster $cluster).Id -and $_.StartTime -gt $start -and $_.Name -eq "apply`$task"}

# Loop until the task finishes
While ($task.State -eq "Running") {
    Start-Sleep -Seconds 60
    $task = Get-Task |?{$_.ObjectId -eq (Get-Cluster $cluster).Id -and $_.StartTime -gt $start -and `
      $_.Name -eq "apply`$task"}
}

We do things this way so that we can silence alerting for each host while it patches. If that’s not something you bother with, then it’s far simpler to do the one-liner above!

vRealize Orchestrator upgrade failure 8.6.2 > 8.8.2

I came across this upgrade failure while upgrading our lab environment.

The error happens during the post-upgrade section, when it runs a make against /opt/health/Makefile.

In 8.6.2 the relevant section is:

single-aptr: eth0-ip
	$(begin_check)
	echo Check the ip address if eth0 resolves only to a single hostname
	[ 1 -eq $$( host $$( iface-ip eth0 ) | wc -l ) ]
	$(end_check)

This changes in 8.8.2 to:

single-aptr: eth0-ip
	$(begin_check)
	echo Check the ip address if eth0 resolves only to a single hostname
	[ 1 -eq $$(/usr/bin/dig +noall +answer +noedns -x $$( iface-ip eth0 ) |  grep "PTR" | wc -l ) ]
	$(end_check)

I’m pretty sure the practical difference is that the old check simply counted the lines of ‘host’ output (which is a single line even when the reverse lookup fails), whereas the new check requires dig to return exactly one PTR record from DNS.

As it turns out, the reverse entry for the appliance FQDN was missing, which was causing the upgrade to bomb out at this point. Simply adding the reverse record was all that was needed to resolve this and allow the upgrade to be re-run from the pre-upgrade snapshot.
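
If you want a quick pre-check before kicking the upgrade off again, something like the following sketch (run from a Windows box with the built-in DnsClient module) confirms the appliance IP has exactly one PTR record:

# Sketch: verify the appliance IP resolves to exactly one PTR record
$applianceIp = "192.168.10.50"    # illustrative - use the vRO appliance's eth0 address
$ptr = @(Resolve-DnsName -Name $applianceIp -Type PTR -ErrorAction SilentlyContinue |
        Where-Object { $_.Type -eq "PTR" })
$ptr.Count -eq 1                  # should return True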

Dell OMIVV and VMware Proactive HA

We’ve had a lot of memory DIMM issues with our Dell servers over the last couple of years, and some of them have caused VM HA events. Because of this we decided to take another look at OMIVV (OpenManage Integration for VMware vCenter) and saw that monitoring of memory failure conditions had now been added.

I implemented it across the estate a couple of weeks ago and hit a couple of issues that I thought might be useful to share.

Implementing Proactive HA on one cluster caused an HA fail-over

When you enable Proactive HA on a cluster and select the Dell Proactive HA Provider, its default behaviour is to mark all the hosts in the cluster as “Unknown Health State”.

As it detects the true health state on the hosts, it marks them healthy and the unknown health state clears.

The problem is that DRS sees the unknown health state, and tries to move VMs from hosts that appear to be “unhealthy” to the few that have been marked as healthy.

This resulted in a critically low memory state on one host: after VMs totalling 1TB of allocated memory were migrated to a 512GB host, the balloon driver reclaimed 450GB from the VMs and swapping set in. I only caught it after two VMs had crashed because they could not allocate memory, and had been restarted by HA on another host.

My recommendation would be to disable DRS before enabling Proactive HA, wait for all the hosts to reach a healthy state, and then re-enable it.
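
One way to handle the DRS side of that from PowerCLI is sketched below; it drops DRS to manual rather than disabling it outright, and enabling Proactive HA itself is still done in the vSphere Client here:

# Stop DRS from making automatic vMotions while hosts sit in "Unknown Health State"
Get-Cluster $cluster | Set-Cluster -DrsAutomationLevel Manual -Confirm:$false

# ...enable Proactive HA / the Dell provider, and wait until every host reports healthy...

# Then restore normal DRS behaviour
Get-Cluster $cluster | Set-Cluster -DrsAutomationLevel FullyAutomated -Confirm:$false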

I have raised this with Dell Product Support, as I see it as an issue.

OMIVV sees powered-off hosts as having a fault condition

We use DPM (Distributed Power Management) to turn off some hosts overnight when the cluster load is lower; this reduces power consumption and was introduced to reduce our environmental impact.

Unfortunately, when a Dell server powers off, its PSU redundancy state is reported as “Disabled”. This causes Proactive HA to quarantine the host, and vROps then raises a critical alert: “Proactive HA provider has reported health degradation on the underlying hosts”.

Obviously this isn’t as severe a concern as the other issue, but it is spurious alerting, and we only want to alert on real issues.

For the time being I have disabled the PSU Failure Conditions in the Dell Proactive HA Provider, and it has also been raised with Dell Product Support.

vCenter 7 – Exception: Full backup not allowed during VM snapshot

One of our vCenters started having backup failures (the file-based backup built into the VCSA) after it was upgraded to 7.0U2b.

The resolution is pretty simple; however, there doesn’t appear to be a KB article for this, or if there is, it doesn’t show up when you search for the error message in the title.

The extract from /var/log/vmware/applmgmt/backup.log is:

2021-06-23T12:22:40.705 [20210623-122202-17958471] [VCDBBackup:PID-58776] [VCDB::BackupVCDB:VCDB.py:2060] ERROR: Encounter error during backup VCDB.
Traceback (most recent call last):
  File "/usr/lib/applmgmt/backup_restore/py/vmware/appliance/backup_restore/components/VCDB.py", line 1939, in BackupVCDB
    raise Exception('Full backup not allowed during VM snapshot')
Exception: Full backup not allowed during VM snapshot

This is generally caused by an interruption to the backup, or a restore from backup, leaving a flag file in place.

Simply remove or rename /etc/vmware/backupMarker.txt and re-run the backup.

VCSA 7.0U2 Upgrade failure – amcheck_next

We’re currently running through the 7.0U2 upgrades, and have encountered this on one of them. It failed at 92% during the data conversion step, and as part of the diagnostics, support had us run a postgres consistency check script: https://kb.vmware.com/s/article/53062

The script does many checks, but it also creates a postgres extension called amcheck_next.

This caused subsequent upgrade attempts to fail; the message appears in the log /var/log/vmware/applmgmt/PatchRunner.log:

2021-05-27T06:56:41.087Z  Running: su -s /bin/bash vpostgres -c cd /var/log/vmware/vpostgres/upgrade && /opt/vmware/vpostgres/12/bin/pg_upgrade -U postgres -B /opt/vmware/vpostgres/12/bin -b /opt/vmware/vpostgres/12/../11/bin -D /storage/db/vpostgres.12 -d /storage/db/vpostgres --check
2021-05-27T06:56:41.087Z  Running command: ['su', '-s', '/bin/bash', 'vpostgres', '-c', 'cd /var/log/vmware/vpostgres/upgrade && /opt/vmware/vpostgres/12/bin/pg_upgrade -U postgres -B /opt/vmware/vpostgres/12/bin -b /opt/vmware/vpostgres/12/../11/bin -D /storage/db/vpostgres.12 -d /storage/db/vpostgres --check']
2021-05-27T06:56:42.140Z  Done running command
, stderr: 2021-05-27T06:56:42.141Z  pg_upgrade --check returned error code 1 with error  out Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* data types in user tables                 ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for tables WITH OIDS                               ok
Checking for invalid "sql_identifier" user columns          ok
Checking for presence of required libraries                 fatal
 
Your installation references loadable libraries that are missing from the
new installation.  You can add these libraries to the new installation,
or remove the functions using them from the old installation.  A list of
problem libraries is in the file:
    loadable_libraries.txt
 
Failure, exiting
 
.
2021-05-27 06:56:42,169.169Z vpostgres:Patch ERROR vmware_b2b.patching.executor.hook_executor Patch hook 'vpostgres:Patch' failed.

Checking the “loadable_libraries.txt” in ‘/storage/log/vmware/vpostgres/upgrade’ showed the following:

could not load library "$libdir/amcheck_next": ERROR:  could not access file "$libdir/amcheck_next": No such file or directory
Database: VCDB

To rectify this, after rolling back to the pre-upgrade snapshot, I just connected to the DB with ‘psql -U postgres -h localhost -d VCDB’ and ran ‘DROP EXTENSION amcheck_next;’

I then cleared out the VUM database (which had caused the original failure), and the upgrade then ran through successfully.

‘Data transfer and appliance setup is in progress’ after VCSA upgrade backout

This is as much a reminder for me as for anyone else, having come across this a couple of times and always struggled to find the KB article on how to resolve it.

If you perform a migration upgrade on a vCenter Server Appliance, such as from 6.7 to 7.0, and the upgrade fails on the new appliance, necessitating a backout, it is common to get a ‘data transfer and appliance setup is in progress’ message on the original appliance.

There is a KB article to walk you through resolving this: https://kb.vmware.com/s/article/67179

In particular, the final steps, where you ‘touch’ three files in the /var/log/vmware/upgrade directory, seem to be the main requirement in my experience.

Once this has been performed, a simple refresh of the page will return you to the normal login interface.

PowerCLI – Disabling ESXi OpenSLP service for VMSA-2021-0002

OpenSLP has cropped up again as an ESXi vulnerability, and if you want to disable the service, the referenced KB article only has details for doing so via the ESXi command line.

Far easier, if you have many hosts, is to use PowerCLI, and while it’s relatively simple I thought I would share this to help anyone else wanting to do so.

Disabling the service
Connect to the environment with ‘connect-viserver’ and then run:

Get-VMHost | %{
	$_ | Get-VMHostFirewallException -Name "CIM SLP" | Set-VMHostFirewallException -Enabled:$false
	Stop-VMHostService -HostService ($_ | Get-VMHostService | ?{$_.Key -eq "slpd"}) -Confirm:$false
	$_ | Get-VMHostService | ?{$_.key -match "slpd"} | Set-VMHostService -Policy "off"
}

Checking the status
Connect to the environment with ‘connect-viserver’ and then run:

Get-VMHost | %{
	$rule = $_ | Get-VMHostFirewallException -Name "CIM SLP"
	$serv = $_ | Get-VMHostService | ?{$_.Key -eq "slpd"}
	$_ | select Name,@{N="Rule";E={$rule.enabled}},@{N="ServiceRunning";E={$serv.Running}},@{N="ServiceEnabled";E={$serv.Policy}}
}

Edit: As per the comment from Zeev, I’d missed disabling the service; I’ve updated the Disabling and Checking scripts above to include the correct information.

PowerCLI: Find VMs with xHCI controller

The ESXi vulnerability found at the 2020 Tianfu Cup was a Critical one, with a CVSSv3 base score of 9.3.

VMware lists an article with the fixes and workarounds here:
https://www.vmware.com/security/advisories/VMSA-2020-0026.html
The fix is to apply the latest patch, and the workaround is to remove the xHCI (USB 3.0) controller from any VMs that have it.

To quickly determine whether you have an exposure, you can run the following PowerCLI against your environment; it will list the VMs which have that particular controller type attached.

Get-VM | ?{$_.ExtensionData.Config.Hardware.Device.DeviceInfo.Label -match "xhci"}
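
If you do need to apply the workaround at scale, a rough (untested) sketch of removing the controller via the vSphere API is below. It assumes the affected VMs are powered off and that losing USB 3.0 support in the guests is acceptable:

# Sketch: remove the xHCI controller from every VM that has one
Get-VM | Where-Object {
    $_.ExtensionData.Config.Hardware.Device | Where-Object { $_ -is [VMware.Vim.VirtualUSBXHCIController] }
} | ForEach-Object {
    $controller = $_.ExtensionData.Config.Hardware.Device |
        Where-Object { $_ -is [VMware.Vim.VirtualUSBXHCIController] } | Select-Object -First 1

    $devChange = New-Object VMware.Vim.VirtualDeviceConfigSpec
    $devChange.Operation = [VMware.Vim.VirtualDeviceConfigSpecOperation]::remove
    $devChange.Device = $controller

    $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $spec.DeviceChange = @($devChange)

    $_.ExtensionData.ReconfigVM_Task($spec)
}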

vRealize Orchestrator Name/IP lookups

I’ve started looking at upgrading our standalone vRO instances from 7.x to 8.x, and one thing that has changed significantly is that we can no longer use the appliance Linux environment to run dig or nslookup.

There are a couple of System calls:

System.resolveHostName(hostname);
System.resolveIpAddress(ip);

These allow the usual forward and reverse lookups, but they have significant limitations.

System.resolveHostName:

  • Only returns one record at a time, so if there are multiple records you would have to write a loop to collect them all
  • Only returns the IP address; there is no ability to return the record type, TTL, or SOA record

System.resolveIpAddress:

  • Only returns one record at a time, so if there are multiple records you would have to write a loop to collect them all
  • Only returns the host name; there is no ability to return the record type, TTL, or SOA record
  • ONLY WORKS IF A FORWARD RECORD EXISTS THAT MATCHES

This final point took some significant figuring out, and in combination with the rest of the limitations it resulted in me changing some of the workflows to SSH to a Linux server and run normal dig and nslookup commands, rather than using the System calls.