ESXi 6 – weird host HA error

I came across a strange fault with VMware HA today, where a host was reporting an error in its ability to support HA and wouldn’t “Reconfigure for HA”.

Attempts to perform the reconfigure failed, generating a failed task with the status “Cannot install the vCenter Server agent service. Cannot upload agent”.


Taking the host in and out of Maintenance Mode had no effect, and I could find no pertinent errors in the host logs.

I couldn’t find anything particularly relevant in a Google search either, but on digging through the vCenter logs I found the following:

 2016-08-04T15:29:28.567+01:00 info vpxd[16756] [Originator@6876 sub=HostUpgrader opID=909E5426-000012CB-b0-7d] [VpxdHostUpgrader] Fdm on host-6787 has build 3018524. Expected build is 3634793 - will upgrade
2016-08-04T15:29:28.725+01:00 info vpxd[16756] [Originator@6876 sub=HostAccess opID=909E5426-000012CB-b0-7d] Using vpxapi.version.version10 to communicate with vpxa at host guebesx-dell-001.skybet.net
2016-08-04T15:29:28.910+01:00 warning vpxd[16756] [Originator@6876 sub=Libs opID=909E5426-000012CB-b0-7d] SSL: Unknown SSL Error
2016-08-04T15:29:28.911+01:00 info vpxd[16756] [Originator@6876 sub=Libs opID=909E5426-000012CB-b0-7d] SSL Error: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
2016-08-04T15:29:28.911+01:00 warning vpxd[16756] [Originator@6876 sub=Libs opID=909E5426-000012CB-b0-7d] SSL: connect failed
2016-08-04T15:29:28.911+01:00 warning vpxd[16756] [Originator@6876 sub=Default opID=909E5426-000012CB-b0-7d] [NFC ERROR] NfcNewAuthdConnectionEx: Failed to connect to peer. Error: The remote host certificate has these problems:
-->
--> * The host certificate chain is incomplete.
-->
--> * unable to get local issuer certificate
-->
2016-08-04T15:29:28.912+01:00 error vpxd[16756] [Originator@6876 sub=vpxNfcClient opID=909E5426-000012CB-b0-7d] [VpxNfcClient] Unable to connect to NFC server: The remote host certificate has these problems:
-->
--> * The host certificate chain is incomplete.
-->
--> * unable to get local issuer certificate
2016-08-04T15:29:28.913+01:00 error vpxd[16756] [Originator@6876 sub=HostAccess opID=909E5426-000012CB-b0-7d] [VpxdHostAccess] Failed to upload files: vim.fault.SSLVerifyFault
2016-08-04T15:29:28.918+01:00 error vpxd[16756] [Originator@6876 sub=DAS opID=909E5426-000012CB-b0-7d] [VpxdDasConfigLRO] InstallDas failed on host guebesx-dell-001.skybet.net: class Vim::Fault::AgentInstallFailed::Exception(vim.fault.AgentInstallFailed)
2016-08-04T15:29:28.919+01:00 info vpxd[16756] [Originator@6876 sub=MoHost opID=909E5426-000012CB-b0-7d] [HostMo::UpdateDasState] VC state for host host-6787 (uninitialized -> init error), FDM state (UNKNOWN_FDM_HSTATE -> UNKNOWN_FDM_HSTATE), src of state (null -> null)
2016-08-04T15:29:28.950+01:00 info vpxd[16756] [Originator@6876 sub=vpxLro opID=909E5426-000012CB-b0-7d] [VpxLRO] -- FINISH task-internal-15007334
2016-08-04T15:29:28.950+01:00 info vpxd[16756] [Originator@6876 sub=Default opID=909E5426-000012CB-b0-7d] [VpxLRO] -- ERROR task-internal-15007334 -- -- DasConfig.ConfigureHost: vim.fault.AgentInstallFailed:
--> Result:
--> (vim.fault.AgentInstallFailed) {
--> faultCause = (vmodl.MethodFault) null,
--> reason = "AgentUploadFailed",
--> statusCode = <unset>,
--> installerOutput = <unset>,
--> msg = ""
--> }
--> Args:
-->  

I’m not sure what had caused the certificate error, but a simple disconnect and reconnect of the host cleared the fault and allowed the HA agent to configure successfully.
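
If you hit the same thing and want to script the fix, a rough PowerCLI equivalent of the disconnect/reconnect would look something like this (an untested sketch; the vCenter and host names are placeholders):

# Disconnect and reconnect the affected host, then re-run the HA configuration
Connect-VIServer <vcname>
$vmhost = Get-VMHost <esxihostname>
Set-VMHost -VMHost $vmhost -State Disconnected -Confirm:$false
Set-VMHost -VMHost $vmhost -State Connected -Confirm:$false
# Equivalent of "Reconfigure for HA" via the API
(Get-View $vmhost).ReconfigureHostForDAS()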

One chapter closes, another begins…

Today a chapter closes on my career at CSC.

I’ve been working in the same office for 22 years, originally starting straight from Uni in the Unix Support team for the Post Office, looking after NCR and HP Unix servers around the country – much of which was on dial-up modem rather than IP networking.

While part of the Post Office/Consignia/Royal Mail I moved through the NT Infrastructure and Internet Infrastructure teams, picking up Windows Server and Internet technology skills, and then we were outsourced to CSC in June 2003.

I spent a short time in the Firewall management team in CSC, but then moved to the team looking after Windows server infrastructure for the NHS account. This team was almost entirely formed from ex-Royal Mail staff, and had set up a significant amount of automation and standardisation already. It was here that I was first exposed to VMware ESX, and it immediately resonated with me.

Due to the similarities with Unix, and because I could see the future benefits of virtualised infrastructure, I decided to try and become the team expert in VMware ESX. I learned a lot along the way, and I’m grateful to the TAM team at VMware for the learning opportunities they made available – Joshua Lory, Adrian Voss, Jesse Shapiro and Liam Farrell, I thank you all.

I was by no means the only VMware expert though; having colleagues with the same thirst for knowledge really pushed me along, and we have been pretty competitive in our quest for certification and recognition. I wouldn’t even have thought to apply for vExpert if my colleague Darry Cauldwell hadn’t done so, and I believe my achieving double VCAP-DCV and VCIX-NV has pushed others along the certification path.

But 22 years is a long time to spend in one location, and I’ve felt for a while that it was time to find a new challenge, so I will be starting a new role on Monday, with Sky Betting and Gaming. It will be a very different working environment compared with a global outsourcer, but one I’m really looking forward to.

NSX LoadBalancer – character “/” is not permitted in server name


This was an odd error that a colleague brought to me while testing automation around the configuration of an NSX Edge.

He had created the Edge successfully and configured the Load Balancer, but trying to enable it failed. When he attempted to enable it through the Web Client, the error above was displayed and the change was automatically reverted.

After a lot of digging, I discovered that the Load Balancer configuration had a Pool whose “IP Address / VC Container” object was a Service Group, and one of the members of that Service Group was an IPSet containing a CIDR block. It was that CIDR notation, complete with its “/”, that NSX was trying to use as the server name.

I’m not sure whether that is even a supported configuration, but I changed it to point to a Service Group that included the members of the target web farm, and the Load Balancer could then be configured successfully.
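
If you want to see exactly what a Pool’s members resolve to before enabling the Load Balancer, the Edge’s load balancer configuration can be pulled back from the NSX Manager REST API. Here is a rough PowerShell sketch (the NSX Manager address, credentials and edge ID are all placeholders, and you may need to work around a self-signed certificate on the Manager):

# Pull the Edge load balancer config from NSX Manager and inspect the pools
$cred = Get-Credential    # NSX Manager admin credentials
$uri = "https://<nsxmanager>/api/4.0/edges/<edge-id>/loadbalancer/config"
$config = Invoke-RestMethod -Uri $uri -Method Get -Credential $cred
# The response is XML - list each pool and its members (IPSets, Service Groups, etc.)
$config.loadBalancer.pool | Select-Object name, member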

PowerCLI code snippet to get storage driver details

This is just a brief post to share a code snippet that I built to display the storage driver in use.

The driver and its version are critical for VMware VSAN, and I needed a quick and easy way of checking them. I might revise the code at a later date to run across multiple hosts in a cluster and output the results in a table, but for now, here are the basics.

# Connect to vCenter and get an esxcli object for the target host
Connect-VIServer <vcname>
$esxcli = Get-EsxCli -VMHost <esxihostname>
# Find the adapter of interest and note its driver (VIB names use "-" where driver names use "_")
$adapter = $esxcli.storage.core.adapter.list() |
Select-Object Description,Driver,HBAName | Where-Object {$_.HBAName -match "vmhba0"}
$driver = $adapter.Driver -replace "_", "-"
# List the VIB that provides that driver, along with its version
$esxcli.software.vib.list() |
Select-Object Name,Version,Vendor,ID,AcceptanceLevel,InstallDate,ReleaseDate,Status |
Where-Object {$_.Name -match ($driver + "$")}

This displays output such as:

Name            : scsi-megaraid-sas
Version         : 6.603.55.00-1OEM.550.0.0.1331820
Vendor          : LSI
ID              : LSI_bootbank_scsi-megaraid-sas_6.603.55.00-1OEM.550.0.0.1331820
AcceptanceLevel : VMwareCertified
InstallDate     : 2016-05-03
ReleaseDate     :
Status          :

This works for the servers I’ve tried it on (Dell) but as usual YMMV…
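
For what it’s worth, a rough sketch of the multi-host version mentioned above might look like this (untested, and it assumes the adapter of interest is vmhba0 on every host):

# Report the storage driver VIB for every host in a cluster
Connect-VIServer <vcname>
Get-Cluster <clustername> | Get-VMHost | ForEach-Object {
    $esxcli = Get-EsxCli -VMHost $_
    $adapter = $esxcli.storage.core.adapter.list() | Where-Object {$_.HBAName -match "vmhba0"}
    $driver = $adapter.Driver -replace "_", "-"
    $vib = $esxcli.software.vib.list() | Where-Object {$_.Name -match ($driver + "$")}
    [pscustomobject]@{Host=$_.Name; Driver=$adapter.Driver; VIB=$vib.Name; Version=$vib.Version}
} | Format-Table -AutoSize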

Github Desktop from behind a corporate proxy server

Having just helped a colleague through the tortuous path of configuring Github Desktop to work through a proxy, I thought it might be worth blogging it all.

Different parts of Github Desktop require the proxy information to be provided in different ways, and without all 3 pieces of configuration, you will find that some things work, but not others.

  1. Internet Explorer proxy setting
    This *has* to be set to a specific proxy server, not to an autoconfig script.
  2. .gitconfig
    This is found in your user home directory (usually C:\Users\<Username>) and requires the following lines:
    [http]
    proxy = http://<proxy-address>:<port>
    [https]
    proxy = http://<proxy-address>:<port>
  3. HTTPS_PROXY/HTTP_PROXY environment variable
    You can set this in your local environment, or in the system environment settings, as long as it’s visible to the Github Desktop processes.
    eg.
    set HTTPS_PROXY=http://<proxy-address>:<port>

If a userid/password is required, it’s recommended that you run something like CNTLM to do the authentication, rather than adding the plaintext credentials to the proxy string.
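
If you want to apply all three pieces of configuration in one go, something like the following from a PowerShell prompt does the job (a rough sketch; proxy.example.com:8080 is a placeholder, and the first pair of lines is just the per-user registry equivalent of the Internet Options dialog):

$proxy = "proxy.example.com:8080"
# 1. Internet Explorer proxy settings (per-user registry keys behind Internet Options)
Set-ItemProperty "HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings" ProxyEnable 1
Set-ItemProperty "HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings" ProxyServer $proxy
# 2. The [http]/[https] proxy entries in .gitconfig
git config --global http.proxy "http://$proxy"
git config --global https.proxy "http://$proxy"
# 3. HTTP_PROXY/HTTPS_PROXY environment variables, persisted for the current user
[Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://$proxy", "User")
[Environment]::SetEnvironmentVariable("HTTPS_PROXY", "http://$proxy", "User")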

Once you’ve configured all that, if you’re using Enterprise Github, you will probably need to use a Personal Access Token, rather than your password, to authenticate Github Desktop. This can be created by logging in with a browser and going to Settings / Personal Access Tokens.

I hope that helps someone out, but if not, I’m sure I’ll be using it as a reminder when I have to change it all between using it at Home and at Work…

ESXi 6.0 – Switching from persistent scratch to transient scratch

KB article 1033696 is very helpful when you want to configure persistent scratch on your USB/SDCard/PXE booted ESXi host; however, when you want to go the other way, things can be slightly complicated.

Consider the following situation. You have installed ESXi onto a local USB stick, and have temporarily retasked a drive from what will become your VSAN array to be used to run up vCenter and a PSC.
On the next reboot, ESXi will see the persistent local storage, and automatically choose to run scratch on it.
From that point onwards, how do you switch back and release the disk for use by VSAN?

You can’t set the advanced configuration "ScratchConfig.ConfiguredScratchLocation" to blank (e.g. “”); that was the first thing I tried. It accepts the command, but the setting remains pointed at the VMFS location.

You can’t just unmount or delete the VMFS filesystem; it’s in use.

You can’t set the advanced configuration "ScratchConfig.ConfiguredScratchLocation" to /tmp/scratch either; it accepts the value, but on reboot the VMFS filesystem is discovered again.

Other combinations of advanced configuration settings, and editing or removing /etc/vmware/locker.conf, also failed to stop scratch from being loaded onto the VMFS filesystem at boot.

In the end, I was able to get around this by using storcli to offline the disk. The server could then be rebooted without mounting the VMFS filesystem, so scratch ran from /tmp/scratch (on the ramdisk). The disk could then be brought online again and the VMFS filesystem destroyed. I guess an alternative approach would be to point the scratch location at an NFS location, which should take precedence over a “discovered” local persistent VMFS filesystem and allow the VMFS filesystem to be deleted.
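
For anyone trying the same approach, the storcli commands would look roughly like this, run from the ESXi shell (a hedged sketch; the storcli path depends on where the VIB installs it, and the controller/enclosure/slot numbers are placeholders that will differ on your hardware):

# List the drives so you can identify the right enclosure/slot
/opt/lsi/storcli/storcli /c0 show
# Take the disk offline, then reboot the host - scratch falls back to /tmp/scratch on the ramdisk
/opt/lsi/storcli/storcli /c0/e32/s0 set offline
# After the reboot, bring the disk back online so the VMFS filesystem can be destroyed
/opt/lsi/storcli/storcli /c0/e32/s0 set online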

I hope that helps someone else, as I spent far more time than I should have going round in a loop, steadily losing my marbles, because there didn’t seem to be any information around about how to do it.


NSX Ninjas course, week 1

After just over a year of trying, I finally managed to get on the VMware NSX Ninjas course. I was first offered it last April (2015) in Palo Alto, but with a small baby at home, and the fact that it was straight after our company conference in Orlando, I had to decline.
I then missed out on it a number of times, due to only finding out about sessions while they were happening.

Anyway, our UK TAM, Liam Farrell, managed to get places for 3 of us (me, @MrCNeale and @BlobbieH) on a course running from VMware’s UK HQ in Staines, which is slightly more travel friendly than Palo Alto. Our instructor for the week was Red1 Bali (@tredwitter), who is actually a freelance consultant, rather than a VMware employee.

For those who haven’t come across the NSX Ninjas course before, my understanding is that it is provided to VMware Partners (at zero cost other than their own travel and accommodation), and the aim of it is to get people who’ve done the NSX ICM course up through VCIX-NV (week 1) and prepare them for VCDX-NV (week 2).

The course ran from Monday lunchtime, to Friday lunchtime, with days 1-3 billed as NSX 401 Troubleshooting, day 4 NSX Operations, and day 5 NSX Automation.

Maybe because we’d been trying to get on this course for so long, we had insanely high expectations, and the first day or so felt a little disappointing – a little slow going and not very “deep”. Possibly this was because some people on the course had failed to do the *mandatory* prerequisites of taking the NSX ICM course and passing the VCP-NV, so Red1 was having to take things a little slower. I know people have busy working lives, but attending a deeply technical course without having completed the prereqs just isn’t on in my opinion.

Anyway, the pace soon ramped up, and we were working through the labs, including fixing all the problems caused by them starting with expired licenses. As we progressed through the course presentations, Red1 started introducing faults into our lab environments for us to fix. Some of these were straightforward to find, but some were definitely not so easy, and were an excellent way of getting you into the command lines, debug tools, and logs, to find what had gone wrong.

The course ended with content on Operationalizing NSX, based on the use of Log Insight and vROps, the latter being of less interest to us at the moment as it’s not part of the product suite that we use. However, the breaking of the lab environments and subsequent troubleshooting continued, with Red1 delivering a seemingly inexhaustible supply of failure scenarios. These were what I enjoyed most about the week, as digging into a gnarly technical fault is something I relish (maybe less so if there’s a production outage on the back of it though!).

All in all, I definitely recommend the course if you can get on it, and a big thanks to VMware for providing it, our UK TAM Liam Farrell for getting us the places, and Red1 for being an excellent instructor.

Stay tuned for week 2, scheduled for the middle of June.