Clearing old Host Profile answer files

We recently had a problem where the Fault Tolerance logging service seemed to be randomly getting assigned to the VMotion vmknic, instead of its dedicated vmknic. This obviously prevented FT state sync from occurring, a fact that I discovered in a 20-minute change window at 4.30AM 😦

I found the cause of the state sync failure by reading through the vmware.log file for the affected VM and noticing that the sync seemed to be trying to happen between source and destination IPs on different subnets. Looking at the IP services configuration of the hosts in the cluster, I found one that was configured correctly (fortunately the host the FT primary was on was correct too), and used that host for the secondary VM, which enabled the sync to occur.

The problem was affecting roughly 50% of the cluster, and had apparently happened a number of times before and been corrected. I noticed that the affected hosts also had remnants of a host profile answer file (just the hostname and VMotion interface details), whereas the hosts that were still configured correctly didn't have any answer file settings stored in VCenter.

Easy, I thought, a bit of PowerCLI will sort that, so I had a look for cmdlets for viewing/modifying answer file settings. I drew a blank pretty much straight away. There are cmdlets for host profiles, one of which allows you to include an answer file as part of applying a host profile, but nothing for viewing/modifying/removing answer files.

So to the Views we go. A bit of searching turned up this, which was helpful, and after a bit of testing I came up with:

# Get the HostProfileManager view and build an empty answer file spec
$hostProfileManagerView = Get-View "HostProfileManager"
$blank = New-Object VMware.Vim.AnswerFileOptionsCreateSpec

foreach ($vmhost in (Get-Cluster <cluster> | Get-VMHost | Sort-Object Name)) {
  # Retrieve any stored answer file for this host
  $file = $hostProfileManagerView.RetrieveAnswerFile($vmhost.ExtensionData.MoRef)
  if ($file.UserInput.Length -gt 0) {
    # Overwrite the stored answer file with the blank spec, then re-read it to confirm
    $file = $hostProfileManagerView.UpdateAnswerFile($vmhost.ExtensionData.MoRef,$blank)
    $file = $hostProfileManagerView.RetrieveAnswerFile($vmhost.ExtensionData.MoRef)
    Write-Output "$($vmhost.Name) $([string]$file.UserInput)"
  }
}

This iterates through each host in the cluster and, if it has an answer file, replaces it with a blank one.

ESXi TLS/SSL/Cipher configuration

Anyone who’s had to configure the TLS/SSL settings for their VMware infrastructure will probably have come across William Lam’s posting on the subject. This provided a much-needed script for disabling the weaker protocols on ports 443 (rhttpproxy) and 5989 (sfcb), but it leaves out the HA agent on port 8182, and doesn’t alter ciphers – we are having to remove the TLS_RSA ciphers to counter TLS ROBOT warnings.

The vSphere TLS Reconfigurator utility does fix the TLS protocols for port 8182 (HA communications), but it can only be used when the ESXi version is the same minor version as the vCenter, and none of its options will amend the ciphers being used. This was a useful posting I came across for amending the cipher list.
I did attempt to use the (new to ESXi 6.5) Advanced Setting UserVars.ESXiVPsAllowedCiphers, but it appears that this isn’t actually implemented yet. Certainly the rhttpproxy ignores the setting when it starts, and I have raised an SR with VMware to investigate this.

So I thought it might be useful to list the ports that tend to crop up on a vulnerability scan and what is required to fix them, in case you need to configure something beyond what the usual utilities and scripts can handle, such as standalone hosts.

I have only tried these on recent ESXi 6.0U3 and 6.5U1 builds.

TCP/443 – VMware HTTP Reverse Proxy and Host Daemon

Set Advanced Settings:
UserVars.ESXiVPsDisabledProtocols to "sslv3,tlsv1,tlsv1.1"
If it's ESXi 6.0, the following two are also needed:
UserVars.ESXiRhttpproxyDisabledProtocols to "sslv3,tlsv1,tlsv1.1"
UserVars.VMAuthdDisabledProtocols to "sslv3,tlsv1,tlsv1.1"
For the removal of TLS_RSA ciphers, the corresponding setting would be:
UserVars.ESXiVPsAllowedCiphers to
"!aNULL:kECDH+AESGCM:ECDH+AESGCM:!RSA+AESGCM:kECDH+AES:ECDH+AES:!RSA+AES"
As noted above, the ESXiVPsAllowedCiphers setting does not work; instead, manually edit /etc/vmware/rhttpproxy/config.xml and add a cipherList entry:

<config>
...
<vmacore>
...
<ssl>
...
<cipherList>!aNULL:kECDH+AESGCM:ECDH+AESGCM:!RSA+AESGCM:kECDH+AES:ECDH+AES:!RSA+AES</cipherList>
...
</ssl>
...
</vmacore>
...
</config>

Restart the rhttpproxy service or reboot the host.
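
If PowerCLI is connected to the vCenter, a minimal sketch for applying the protocol setting across a cluster could look like this (the cluster name is a placeholder, and it assumes the UserVars setting already exists on the build):

Get-Cluster <cluster> | Get-VMHost | ForEach-Object {
  # Disable the weak protocols on the reverse proxy / hostd endpoint
  Get-AdvancedSetting -Entity $_ -Name "UserVars.ESXiVPsDisabledProtocols" |
    Set-AdvancedSetting -Value "sslv3,tlsv1,tlsv1.1" -Confirm:$false
}

To restart rhttpproxy from an SSH session on the host (rather than rebooting):

/etc/init.d/rhttpproxy restart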

TCP/5989 – VMware Small Footprint CIM Broker

Edit /etc/sfcb/sfcb.cfg and add the following lines:
enableTLSv1: false
enableTLSv1_1: false
enableTLSv1_2: true
sslCipherList: !aNULL:kECDH+AESGCM:ECDH+AESGCM:kECDH+AES:ECDH+AES

Restart the sfcb/CIM service or reboot the host.
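
To restart the CIM broker from an SSH session (a minimal example; a reboot works just as well):

/etc/init.d/sfcbd-watchdog restart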

From what I have seen, the default is to have SSLv3/TLSv1/TLSv1.1 disabled anyway.

TCP/8080 – VMware vSAN VASA Vendor Provider

Should be fixed by the TCP/443 settings

TCP/8182 – VMware Fault Domain Manager

Set Advanced Setting on the *Cluster* :
das.config.vmacore.ssl.protocols to “tls1.2”

Go to each host and initiate “Reconfigure for vSphere HA”
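
If you prefer to script that, a rough PowerCLI sketch (the cluster name is a placeholder; the reconfigure call below is the blocking form, so it can take a while on a large cluster):

$cluster = Get-Cluster <cluster>
# Add the HA advanced option to the cluster configuration
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.config.vmacore.ssl.protocols" -Value "tls1.2" -Force -Confirm:$false
# Reconfigure HA on each host so the FDM agent restarts with the new setting
foreach ($vmhost in ($cluster | Get-VMHost)) {
  $vmhost.ExtensionData.ReconfigureHostForDAS()
}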

TCP/9080 – VMware vSphere API for IO Filters

Should be fixed by the TCP/443 settings


PowerCLI shortcuts

I’ve just set up some shortcuts for connecting to our various VMware environments, as I was sick of typing out the full

connect-viserver vcsa-name.dns.name

every time.

If you want this to apply for just your userid, you can create (or edit if it already exists) %UserProfile%\Documents\WindowsPowerShell\profile.ps1

And if you want it to apply for all users, you can create (or edit)
%windir%\system32\WindowsPowerShell\v1.0\profile.ps1

I created the latter, and added lines such as:

function ENV1 {connect-viserver vcsa-name-1.dns.name}
function ENV2 {connect-viserver vcsa-name-2.dns.name}

Now to connect to a VCenter, all I have to type is ENV1
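
If you have more than a couple of environments, a variation on the same idea (a sketch; the names and addresses in the hash table are just examples) keeps the server names in one place:

$viServers = @{
  "ENV1" = "vcsa-name-1.dns.name"
  "ENV2" = "vcsa-name-2.dns.name"
}
function Connect-Env ($name) { connect-viserver $viServers[$name] }

Then Connect-Env ENV1 does the same job.
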
Do you have any favourite powershell/powerCLI shortcuts like this?

PowerCLI prompting for credentials

One of our VCenters has been prompting for credentials when running connect-viserver ever since it was first set up, rather than passing through the signed-in user’s credentials, and I decided to look into this annoyance.

The particular instance of VCenter has an external PSC, and this web page states that only the PSC needs to be joined to the domain. Indeed, you can’t add the VCSA appliance to the domain through the web interface if it has an external PSC; the option simply isn’t there.

One thing that did stand out from that web page was:

If you want to enable an Active Directory user to log in to a vCenter Server instance by using the vSphere Client with SSPI, you must join the vCenter Server instance to the Active Directory domain. For information about joining a vCenter Server Appliance with an external Platform Services Controller to an Active Directory domain, see the VMware knowledge base article at http://kb.vmware.com/kb/2118543.

I then discovered the following on this web page:

If you run Connect-VIServer or Connect-CIServer without specifying the User, Password, or Credential parameters, the cmdlet searches the credential store for available credentials for the specified server. If only one credential object is found, the cmdlet uses it to authenticate with the server. If none or more than one PSCredential objects are found, the cmdlet tries to perform a SSPI authentication. If the SSPI authentication fails, the cmdlet prompts you to provide credentials.

Putting those two paragraphs together, 1) AD login with SSPI requires the VCSA to be added to the domain, even with an external PSC, and 2) PowerCLI attempts to use SSPI if it has no credential objects.

The KB article in the first paragraph gives details of how to add the VCSA to the domain from the command line, so I did the following:

  • Started PowerCLI
    Ran connect-viserver command to test
    Prompts for credentials
  • Ran the likewise command to add the VCSA to the domain (see the example after this list)
    Ran connect-viserver command to test
    Prompts for credentials
    Oh….
  • Restarted the VCenter services
    Ran connect-viserver command to test
    Prompts for credentials
    Oh &%$&…..
  • Tested from another Windows server – start up PowerCLI
    Ran connect-viserver command to test
    Loads with no prompt for credentials
    WTH???
  • Returned to original Windows server and restarted PowerCLI
    Ran connect-viserver command to test
    Loads with no prompt for credentials
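
For reference, the likewise join command from that KB article is along these lines (run as root on the VCSA; the domain name and join account are placeholders):

/opt/likewise/bin/domainjoin-cli join <domain.name> <domain-join-account>
/opt/likewise/bin/domainjoin-cli query

The second command simply reports the current join state, which is a handy check afterwards.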

So it would seem that you at least need to restart PowerCLI, and maybe restart the VCenter services (I’m not sure now whether that was needed), once you’ve added the VCSA to the domain.

Remediating security issues on VRO 6.6

I’ve recently had to fix a bunch of security vulnerabilities on vRealize Operations 6.6, and thought it may be worth documenting for anyone else trying to fix the same issues.

It was mostly around the use of weaker protocols and self-signed certificates, and I think I’ve managed to isolate the minimum work necessary to fix them. I’m happy to be corrected if there are better ways of doing it, or if I’ve missed anything.

  1. Appliance interface on TCP/5480
    • SSH on to the appliance as root
    • replace /opt/vmware/etc/lighttpd/server.pem with a signed certificate (including the certificate chain if it’s a private CA) and private key.
    • edit /opt/vmware/etc/lighttpd/lighttpd.conf and replace
        ssl.cipher-list = "HIGH:!aNULL:!ADH:!EXP:!MD5:!DH:!3DES:!CAMELLIA:!PSK:!SRP:@STRENGTH"
      with:
        ssl.honor-cipher-order = "enable"
        ssl.cipher-list = "EECDH+AESGCM:EDH+AESGCM"
        ssl.use-compression = "disable"
        setenv.add-response-header  += ( "Strict-Transport-Security" => "max-age=63072000; includeSubDomains; preload",
            "X-Frame-Options" => "SAMEORIGIN",
            "X-Content-Type-Options" => "nosniff")
  2. Appliance SFCB interface on TCP/5489
    • SSH onto the appliance as root
    • vi /opt/vmware/share/sfcb/genSslCert.sh and update the line:
      umask 0277; openssl req -x509 -days 10000 -newkey rsa:2048 \
      to
      umask 0277; openssl req -x509 -days 730 -newkey rsa:2048 \
    • vi /opt/vmware/etc/ssl/openssl.conf and update
      commonName=<appliance FQDN>
      and add lines
      DNS.2 = <appliance FQDN>
      DNS.3 = <appliance hostname>
      at the end
    • cd /opt/vmware/etc/sfcb/
      and issue
      /opt/vmware/share/sfcb/genSslCert.sh
      to update the certificates.
  3. Update the VCO service and configuration console
    • Log in to https://vcoserver:8283/vco-controlcenter/#/control-app/certificates
    • Generate a new SSL certificate with the correct common name and organization details
    • from a root bash shell on the appliance, generate a CSR with:
      keytool -certreq -alias dunes -keypass "password" \
        -keystore "/etc/vco/app-server/security/jssecacerts" \
        -file "/tmp/cert.csr" -storepass "password"

      (the password is found at /var/lib/vco/keystore.password)
    • Sign the CSR with your Certification Authority
    • Copy the cert to the VCO server as /tmp/cert.cer
    • Re-import the signed certificate with:
      keytool -importcert -alias dunes -keypass "password" -file "/tmp/cert.cer" \
        -keystore "/etc/vco/app-server/security/jssecacerts" -storepass "password"
    • Verify the keystore with:
      keytool -list -keystore "/etc/vco/app-server/security/jssecacerts" -storepass "password" 
    • Edit the following files to remove TLS1.0
      /var/lib/vco/app-server/conf/server.xml
      /var/lib/vco/configuration/conf/server.xml
      search for sslEnabledProtocols= and change to read sslEnabledProtocols="TLSv1.1, TLSv1.2"
      also change ciphers= line to remove 3DES ciphers.
  4. Reboot the appliance
  5. Test connections with the following statements:
    openssl s_client -connect <servername>:5480 -tls1
    openssl s_client -connect <servername>:5480 -tls1_2

    openssl s_client -connect <servername>:5489 -tls1
    openssl s_client -connect <servername>:5489 -tls1_2

    openssl s_client -connect <servername>:8281 -tls1
    openssl s_client -connect <servername>:8281 -tls1_2

    openssl s_client -connect <servername>:8283 -tls1
    openssl s_client -connect <servername>:8283 -tls1_2

    The tls1 connections should now fail, and the tls1.2 connections should still work.
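
As a convenience, the individual checks can be wrapped in a small shell loop (a sketch, assuming a bash shell; substitute the real server name):

for port in 5480 5489 8281 8283; do
  for proto in tls1 tls1_2; do
    echo "=== port ${port} ${proto} ==="
    # A successful handshake prints the negotiated protocol and cipher
    echo | openssl s_client -connect <servername>:${port} -${proto} 2>&1 | grep -E "Protocol|Cipher"
  done
done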

If anyone has examples of getting the SFCB to work with a CA-signed certificate I’d be interested, as I’ve tried a number of things without success. It may be down to the properties of the certificate, but the above is sufficient for my requirements at the moment.

PowerCLI – Get-Patch only uses the date, not the time

I’ve been putting together some PowerCLI to set ‘point-in-time’ baselines for VUM patch updates. This is mostly to aid in our interactions with our security colleagues, for example so that when they ask “Is everything patched up to date?” we can say, “The hosts are all compliant with the baseline of dd-mm-yyyy”.

However, when I was using Get-Patch -After ‘<date time>’ to generate delta baselines, I found that it included the patches released on the date supplied (the time portion appears to be ignored), rather than only those released after it.

For example:

PowerCLI C:\> (Get-PatchBaseline "ESXi-standard-baseline-*" | Get-Patch -TargetType Host -Vendor "VMware*" | Measure-Object -Property "ReleaseDate" -Maximum).Maximum

05 October 2017 01:00:00

But then feeding that into Get-Patch didn’t have the desired effect:

PowerCLI C:\> Get-Patch -TargetType Host -Vendor "VMware*" -After "05 October 2017 01:00:00"

Name Product Release Date Severity Vendor Id
---- ------- ------------ -------- ---------
Updates esx-base,… {embeddedEsx… 05/10/2017 0… Critical ESXi650-201710401-BG
Updates esx-base,… {embeddedEsx… 05/10/2017 0… Critical ESXi600-201710301-BG

It was pulling out the last patches from the previous baseline, for inclusion in the new one.

Fortunately the resolution for this is pretty straightforward:

PowerCLI C:\> Get-Patch -TargetType Host -Vendor "VMware*" | Where-Object {$_.ReleaseDate -gt "05 October 2017 01:00:00"}

Which returns no values in this instance, as there’s nothing currently in the patch database after that date.
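
For completeness, the filtered list can then be fed into a new static baseline; a rough sketch, assuming the VUM cmdlets are available and using an arbitrary baseline name:

$patches = Get-Patch -TargetType Host -Vendor "VMware*" | Where-Object {$_.ReleaseDate -gt "05 October 2017 01:00:00"}
if ($patches) {
  # Only create the delta baseline if there is actually something newer
  New-PatchBaseline -Name "ESXi-delta-baseline-05-10-2017" -Static -IncludePatch $patches
}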

Problem adding Active Directory auth to Netbackup Appliance

I’ve spent a fair amount of time recently on two things that I’ve not done much of before: certificates, and integrating with AD authentication. Of the two, the AD auth has generally been straightforward, and installing signed certs has been… well… let’s just call it a learning curve.

Of course, I’ve documented the processes internally, and where possible I’ve automated the configuration for future use.

One of the things that came up recently was adding AD auth for some Netbackup appliances, where shared admin credentials had been in use and we wanted to avoid individual local accounts. As this is a well documented process I was expecting it to be pretty straightforward, and after checking some details with Veritas Support (such as whether the credentials used for configuring the authentication were stored and used for each lookup – they’re not), I wrote up the change plan and began implementation.

On the first appliance, all was well, and it was as simple as the documentation implied. It was when I came to the next one that I hit problems.

I submitted the details and credentials for the local Active Directory, and after a short delay received the following errors:

- [Error] Unable to configure the appliance for Active Directory Authentication.
Check the credentials, authorization of user, and network connectivity issues
- [Error] Unable to join the domain. Please check the credentials used to join the domain, network connectivity, etc. Otherwise contact support
Unable to configure the appliance for Active Directory Authentication. Check
the credentials, authorization of user, and network connectivity issues
Unable to join the domain. Please check the credentials used to join the
domain, network connectivity, etc. Otherwise contact support
Command failed!

Obviously I then tried exactly the same again, unsurprisingly with the same result.

The next step was a tcpdump of the traffic between the appliance and the domain controller. After a little trial and error around the filter to find the crux of the matter, I narrowed it down to the traffic on TCP/445.
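
The capture command was along these lines (a reconstruction; the interface name is an assumption):

tcpdump -v -i eth0 host <domaincontroller-fqdn> and tcp port 445

This captured the following: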

10:08:56.983135 IP (tos 0x0, ttl 64, id 34248, offset 0, flags [DF], proto TCP (6), length 60)
appliance-fqdn.37346 > domaincontroller-fqdn.microsoft-ds: Flags [S], cksum 0x4556 (correct), seq 2083948561, win 14600, options [mss 1460,sackOK,TS val 91164121 ecr 0,nop,wscale 10], length 0
10:08:56.983365 IP (tos 0x0, ttl 127, id 27767, offset 0, flags [DF], proto TCP (6), length 60)
domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37346: Flags [S.], cksum 0xe3c7 (correct), seq 1747220571, ack 2083948562, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 499809341 ecr 91164121], length 0
10:08:56.983410 IP (tos 0x0, ttl 64, id 34249, offset 0, flags [DF], proto TCP (6), length 52)
appliance-fqdn.37346 > domaincontroller-fqdn.microsoft-ds: Flags [.], cksum 0x3286 (correct), seq 1, ack 1, win 15, options [nop,nop,TS val 91164121 ecr 499809341], length 0
10:08:57.001109 IP (tos 0x0, ttl 64, id 34250, offset 0, flags [DF], proto TCP (6), length 246)
appliance-fqdn.37346 > domaincontroller-fqdn.microsoft-ds: Flags [P.], cksum 0x1f58 (incorrect -> 0x02a5), seq 1:195, ack 1, win 15, options [nop,nop,TS val 91164139 ecr 499809341], length 194
SMB PACKET: SMBnegprot (REQUEST)
SMB Command = 0x72
Error class = 0x0
Error code = 0 (0x0)
Flags1 = 0x8
Flags2 = 0x1
Tree ID = 0 (0x0)
Proc ID = 12819 (0x3213)
UID = 0 (0x0)
MID = 1 (0x1)
Word Count = 0 (0x0)
smb_bcc=155
Dialect=PC NETWORK PROGRAM 1.0
Dialect=MICROSOFT NETWORKS 1.03
Dialect=MICROSOFT NETWORKS 3.0
Dialect=LANMAN1.0
Dialect=LM1.2X002
Dialect=DOS LANMAN2.1
Dialect=LANMAN2.1
Dialect=Samba
Dialect=NT LANMAN 1.0
Dialect=NT LM 0.12

10:08:57.001402 IP (tos 0x0, ttl 127, id 27768, offset 0, flags [DF], proto TCP (6), length 40)
domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37346: Flags [R.], cksum 0x1836 (correct), seq 1, ack 195, win 0, length 0

The last two packets in the conversation are an SMB Negotiate request from the appliance, which the domain controller responds to with a TCP Reset packet. Rude!

At this point, the sensible thing was to compare with the working appliance. However, as that one was already successfully using AD auth, I wasn’t sure whether it would be a good comparison; I also found that using smbclient to try and establish a session from the problem appliance produced the same result.
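
The smbclient test was along these lines (a sketch; the exact options may have differed):

smbclient -L //domaincontroller-fqdn -U 'DOMAIN\username'

The capture below is the equivalent attempt from the appliance where AD auth was working: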

10:27:06.868465 IP (tos 0x0, ttl 64, id 38798, offset 0, flags [DF], proto TCP (6), length 60)
 appliance-fqdn.37879 > domaincontroller-fqdn.microsoft-ds: Flags [S], cksum 0x71e9 (correct), seq 3921922782, win 14600, options [mss 1460,sackOK,TS val 3537132880 ecr 0,nop,wscale 10], length 0
10:27:06.868694 IP (tos 0x0, ttl 127, id 29676, offset 0, flags [DF], proto TCP (6), length 60)
 domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37879: Flags [S.], cksum 0xd55f (correct), seq 3976192417, ack 3921922783, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 769833214 ecr 3537132880], length 0
10:27:06.868730 IP (tos 0x0, ttl 64, id 38799, offset 0, flags [DF], proto TCP (6), length 52)
 appliance-fqdn.37879 > domaincontroller-fqdn.microsoft-ds: Flags [.], cksum 0x241e (correct), seq 1, ack 1, win 15, options [nop,nop,TS val 3537132880 ecr 769833214], length 0
10:27:06.960943 IP (tos 0x0, ttl 64, id 38800, offset 0, flags [DF], proto TCP (6), length 246)
 appliance-fqdn.37879 > domaincontroller-fqdn.microsoft-ds: Flags [P.], cksum 0x2378 (incorrect -> 0x856b), seq 1:195, ack 1, win 15, options [nop,nop,TS val 3537132973 ecr 769833214], length 194
SMB PACKET: SMBnegprot (REQUEST)
SMB Command = 0x72
Error class = 0x0
Error code = 0 (0x0)
Flags1 = 0x8
Flags2 = 0x1
Tree ID = 0 (0x0)
Proc ID = 47233 (0xb881)
UID = 0 (0x0)
MID = 1 (0x1)
Word Count = 0 (0x0)
smb_bcc=155
Dialect=PC NETWORK PROGRAM 1.0
Dialect=MICROSOFT NETWORKS 1.03
Dialect=MICROSOFT NETWORKS 3.0
Dialect=LANMAN1.0
Dialect=LM1.2X002
Dialect=DOS LANMAN2.1
Dialect=LANMAN2.1
Dialect=Samba
Dialect=NT LANMAN 1.0
Dialect=NT LM 0.12

10:27:06.961426 IP (tos 0x0, ttl 127, id 29678, offset 0, flags [DF], proto TCP (6), length 261)
 domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37879: Flags [P.], cksum 0x82f9 (correct), seq 1:210, ack 195, win 514, options [nop,nop,TS val 769833223 ecr 3537132973], length 209
SMB PACKET: SMBnegprot (REPLY)
SMB Command = 0x72
Error class = 0x0
Error code = 0 (0x0)
Flags1 = 0x88
Flags2 = 0x1
Tree ID = 0 (0x0)
Proc ID = 47233 (0xb881)
UID = 0 (0x0)
MID = 1 (0x1)
Word Count = 17 (0x11)
NT1 Protocol
DialectIndex=9 (0x9)

This time, the SMB negotiate elicited a reply rather than a reset, selecting dialect index 9 (an NT1/SMB1 dialect), which is the root of the problem. The local DCs for the appliance where AD auth was working had had NTLMv1/SMB1 temporarily enabled, whereas the DCs for the appliance where it wasn’t working did not.

I then started to look into enabling SMB2 on Samba. The appliances are built on RedHat 6.6, and use Samba 3.6.23. While this version of Samba was the first to support SMB2, it *only* supported it as a server, not as a client. The support for SMB2 as a client only came in at version 4.1.0. I also checked the very latest build of the appliance software, which turned out to use exactly the same version as the one we were on. At this point I felt I’d taken it as far as I could, and raised the issue with Veritas support.

Anyone who’s dealt with frontline support from an organisation like Veritas will understand the frustration I went through for the next two weeks, as the frontline engineer assigned tried to follow his call script, completely ignoring all the diagnostics I’d already done. Ultimately he did escalate to a backline engineer (I’d not stamped and shouted as it wasn’t an impacting issue) and it quickly got pushed to the Netbackup engineering team in the states.

In the meantime I’d spoken to the beta team about what Samba version would be in the next release, and they confirmed it would be one that fixed the problem.

The reply back from engineering was that I was the only person to have come across this issue (meaning that either people just don’t enable AD auth, or those who do have SMB1 enabled on their domains), so as it would require significant testing to uplift Samba to a much higher version, they wouldn’t be releasing a patch to fix it. However, they confirmed what the beta team had said about it being fixed in the forthcoming version of Netbackup.

Anyway, I’ll be awaiting that release with interest, and I hope my documenting the issue here helps someone else.