VMware TAM round table, Manchester 

I ventured over the Pennines yesterday to attend the VMware TAM round table meeting, held at the Manchester Piccadilly Hilton. This provided the opportunity to meet both our company’s outgoing TAM and our new one, and to learn a little more about what our current contract can provide that we’re not really utilising.

I also made notes on the presentations, and thought I would share them in case it’s useful to someone else. Apologies in advance to the presenters for my paraphrasing – if the decks are made available, I may update these notes!

I arrived pretty early, due to the train times. Tea, coffee and pastries were provided, then we converged on the meeting room. After introductions, the first session was Simon Todd, covering VSAN 6.5.

VSAN 6.5 – Simon Todd

Success Stories

  • SKY – one of the UK’s largest VSAN implementations, at 6 Petabytes
    • Using it to maintain competitiveness, and to provide grow-as-you-go capacity – expand when needed.
    • Used for production workloads, eg SQL, Exchange, On Demand video, Sky Q, Video Transcoding, UHD streaming
  • Water Utility Company
    • 66-74% cost saving on VM storage cost
    • Procurement cycle went from 3-6 months for traditional SAN to 7 days for extra capacity
    • Billing run for 7M customers dropped from 16 hours to 3 hours
  • A380 – runs VSAN to collect data from 300k sensors for data analysis for preventative maintenance. Every hour saved on the ground saves $25k
  • Oil Rigs, Nuclear subs, Aircraft carriers – anywhere that server maintenance is tricky

Performance Testing

  • Have to use the right tools, and use them in the right way
  • Iometer is a legacy tool; if you’re testing All-Flash storage you need to use >1 outstanding IO per target – the manufacturer IO figures usually state the queue length and block size.
  • For testing, use the VSAN proactive tests or HCIBench; you can use Iometer, but you have to understand how SSDs work and set the configuration accordingly (see the PowerCLI sketch after this list)
  • Performance stats now available in the vCenter Web Client (since 6.0U2), going back 90 days.
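
A lot of this can be driven from PowerCLI too. A minimal sketch of kicking off the proactive tests, assuming the vSAN cmdlets from PowerCLI 6.5 R1 (cmdlet names as I recall them) and a hypothetical cluster name:

# Run the vSAN health check and the proactive VM creation test
# ("VSAN-Cluster" is a placeholder)
$cluster = Get-Cluster "VSAN-Cluster"
Test-VsanClusterHealth -Cluster $cluster
Test-VsanVMCreation -Cluster $cluster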

Configuration Considerations

  • Have to make sure RAID controller, firmware versions and driver versions match what is on the HCL
  • Can use Ready Nodes to ensure they’re correct out of the box. If they don’t match your requirements, you can increase the spec (more memory, storage) – the default spec is just a minimum.
  • Can mix multiple vendors in a single cluster – try and keep the specifications the same (CPU/Mem/Storage) to avoid wastage.
  • Ideally have 2 Disk Groups per host (this means minimum of 2 cache devices)
  • Use multiple capacity devices per Disk Group
  • VMware are working on having VSAN manage firmware and driver revisions, to help you match the HCL
  • Network
    • Ensure MTU > 1500 (an easy check for this is sketched after this list)
    • Use a different multicast address per cluster
    • 10GigE is a *must* for all flash
    • Use Network IO Control if you have shared interfaces (usually the case if you’re using 10GigE)
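
The MTU point is easy to verify from PowerCLI. A quick sketch, checking the vSAN-tagged VMkernel adapters across a cluster (the cluster name is a placeholder):

# Show the MTU of each host's vSAN-enabled VMkernel adapters
foreach ($vmhost in Get-Cluster "VSAN-Cluster" | Get-VMHost) {
    Get-VMHostNetworkAdapter -VMHost $vmhost -VMKernel |
        Where-Object { $_.VsanTrafficEnabled } |
        Select-Object @{N="Host";E={$vmhost.Name}}, Name, Mtu
}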

VSAN 6.5 – what’s new?

  • iSCSI access
    • Provide block storage to servers not in the VSAN cluster
    • Can use for eg Oracle RAC, Physical workloads
    • Max LUN size is 62TB
    • Still enables Dedupe and Compression, RAID0/1/5
  • 2 node direct connect – connect 2 nodes with crossover cables and have a remote witness – this enables a low cost ROBO entry point for VSAN
  • Supports NVMe, 512e storage, 100Gbps networking
  • New PowerCLI support
    • Health check and remediation
    • Lots of new cmdlets (a few examples are sketched after this list)
  • VSAN now ready for
    • VMware Integrated Containers
    • Photon
    • Docker Volume Driver
  • All Flash is now supported in all license versions, higher license versions add things like dedupe/compression
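
A few examples of the new cmdlets, from memory rather than the official list (the cluster name is a placeholder):

# Some of the vSAN cmdlets added in PowerCLI 6.5 R1
Get-VsanSpaceUsage -Cluster "VSAN-Cluster"    # capacity usage breakdown
Get-VsanDiskGroup -Cluster "VSAN-Cluster"     # disk group layout per host
Update-VsanHclDatabase                        # refresh the HCL data used by the health checks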

A useful website is https://StorageHub.vmware.com

Cloud Foundation – Lee Dilworth

Lee provided an overview of the new VMware Cloud Foundation offering. From a personal viewpoint, it feels like a new ‘unified SDDC platform’ is offered each year, but maybe that’s just my perception…

  • High demand for technology that simplifies infrastructure, but hard to integrate the different technologies.
  • SDDC Manager – provides Automated Lifecycle Management, of Compute, Network and Storage
  • This is an ‘Integrated Platform’, vSphere + NSX + VSAN
  • Provides Cross Cloud Architecture, Private and Public Cloud (AWS)
  • Can be used on a limited range of VSAN ready nodes (3 vendors at present, including Dell), or VxRACK
  • Based on full stack vSphere (vSphere + NSX + VSAN) with SDDC manager on top, plus a range of optional components such as LogInsight, VRO, and VRA via external integration.
  • SDDC Manager
    • Single point management (manages Hardware and Software)
    • One management domain
    • One to multiple workload domains
    • Provides full lifecycle management
    • Integrates into the Web Client
  • Hardware management
    • Uses OOB management agents in Top Of Rack switches
    • Provides Discovery, Bootstrap, Monitoring
    • Uses both In Band and Out of Band connections
  • Management Domain
    • One management domain per cloud instance
    • Uses 3 nodes minimum but 4 recommended
    • Dedicated vCenter plus redundant PSCs
    • Both vDS and NSX vSwitch
    • VSAN
  • Workload Domains
    • Either VDI or standard Virtual Infrastructure
    • Carved out by the SDDC manager
    • Dedicated VC in management domain
    • Shared SSO with management PSCs
    • VSAN
    • NSX – dedicated NSX Manager in management domain, controllers in workload domain.
  • Can automatically deploy and patch vSphere, NSX, VSAN
  • Can deploy but not currently patch LogInsight etc
  • You can upgrade workload domains independently
  • Minimum of 8 nodes (4 mgmt, 4 workload)
  • Maximum of 8 racks
  • VSAN All Flash *or* Hybrid, and can even use network attached storage

Training and Certification Update – Ed Wills (I think!)

There was a short session to give an update on the latest training courses.

vSphere 6.5

  • What’s new 5.5->6.5 – 3 days
  • vSphere ICM 6.5 – 5 days
  • vSphere Optimize and Scale 6.5 – 5 days

Cloud

  • Cloud Automation Design and Deploy 7.1 – 5 days
  • vCD ICM 8.1 – 5 days
  • Cloud Orchestration and Extensibility – 5 days

Fast Track

  • Horizon 7 ICM & App Volumes – 5 days
  • NSX ICM & Troubleshooting and Operations 6.2 – 5 days
  • vSphere ICM & VSAN 6.5 – 5 days

Enterprise Learning Subscription

This was something I’d not heard of before: you register people (75 training credits per person per year) and they get access to:

  • All on-demand courses
  • Learning Zone
  • Exam prep materials
  • VCP exam voucher

There is a minimum of 5 people per company.

Training Needs Analysis

This is a new offering, where VMware will perform an analysis of what training your staff require.

It considers business needs, current staff competencies, training methods, cost, and effectiveness, and produces a benchmark of the current state: what training is required and why, its priority, who should receive it, where and how it will be delivered, and how much it will cost.

vRealise Automation – Kim Ranyard

Kim gave an overview of the history of vRA:

  • It was originally DynamicOps Cloud Automation Center
  • Then bought by VMware
  • Released as vCAC 5.1 -> 5.2 -> 6.0 -> 6.1
  • Then vRA 6.2 -> 7.0 -> 7.0.1 -> 7.1 -> 7.2

vRA 7.0

  • Designed to accelerate time-to-value
  • Simplified Virtual Appliances HA Landscape
    (instead of needing large numbers of VMs to get it up and working, this has been condensed to 1 VM, or 2 for HA)
  • Enhanced Authentication capabilities
  • Per-tenant branding of the portal
  • Unified Service Design
  • Converged Application Authoring
  • Out-of-the-Box blueprints for more apps, such as MS SQL Server, LAMP stack
  • Able to dynamically configure NSX components
  • Blueprints as code – you can export/import blueprints as YAML
  • Event Broker – provides centralised policy management, helps to integrate with vRO workflows

vRA 7.1

  • Now includes a silent install option
  • Can migrate from 6.2 to 7.1
  • Fixes a number of 6.x upgrade blockers
  • Includes a number of provisioning enhancements, eg provision eager-zero disks, change number of vCPU on a VM
  • Data collection improvements
  • Picks up vSphere Infrastructure changes better, in case someone makes a change outside of vRA
  • Has Out-of-the-Box IPAM integrations
  • Includes more Ready-to-Import blueprints

Application-Centric Infrastructure

  • Can now scale out/in a service (blueprints only), eg add additional app servers to a service to cope with increased load, scale back as load decreases
  • AD integration – can create/delete AD objects OOtB
  • New ‘reconfigure states’ to enable triggering other workflows

vRA 7.2

  • Enhanced update API
  • Migration improvements
  • LDAP support
  • Scale in/out for XaaS components
  • Enhanced LoadBalancing capability
  • IPAM framework extended
  • Re-assign managed VMs
  • Azure endpoint support
  • Container management (container host, and containers)
  • ServiceNow integration

vSphere 6.5 – David Burgess

The most interesting session for me, as I’ve not really had a chance to look at it yet, was this one on what’s new in vSphere 6.5.

vCenter 6.5

VCSA

  • The VCSA is now the preferred version of vCenter, and new features will be added to it, not to the Windows version.
  • VCSA exclusive features today:
    • Native HA capability
    • Integrated VMware Update Manager
    • Improved Appliance Management
    • Native Backup/Restore
    • Uses PhotonOS rather than SuSE.
  • VCSA Deployment
    • The installer has support for Windows, Mac and Linux
    • Deploys the OVF, then configures as a second step
    • Options to Install/Upgrade/Migrate/Restore
    • Can migrate from Windows, 5.5 or 6.0 to 6.5
  • VCSA has an HTML5 management interface for the appliance itself
  • VCSA HA – Active/Passive with a witness VM (3 VMs in total)
  • HTML5 Web Client
    • Now fully supported by VMware
    • ~90% feature parity with the Flash web client
  • Performance is much better – less resource intensive (applies to Windows vCenter too)

ESX Lifecycle

  • Host profiles are much improved
  • Auto-Deploy – there is now a graphical image builder (rather than just the PowerCLI cmdlets), and it supports IPv6 and UEFI

vSphere API & CLI

  • New REST API for VM management
  • Choice of SDKs and automation tools – multiple languages, plus PowerCLI and DCLI (a quick REST sketch follows)
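
As a quick illustration, here’s a sketch of using the new REST API from PowerShell to list VMs. The endpoints are the vSphere 6.5 ones; the vCenter name is a placeholder, and I’m assuming PowerShell Core for the -SkipCertificateCheck switch (self-signed lab certs):

# Create an API session, then list VMs via the REST API
$vc = "vcsa.example.com"    # placeholder vCenter
$cred = Get-Credential
$session = Invoke-RestMethod -Method Post -Uri "https://$vc/rest/com/vmware/cis/session" -Authentication Basic -Credential $cred -SkipCertificateCheck
$headers = @{ "vmware-api-session-id" = $session.value }
Invoke-RestMethod -Uri "https://$vc/rest/vcenter/vm" -Headers $headers -SkipCertificateCheck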

Security

  • Enhanced Logging
  • VM Encryption – both disk and vmotion traffic
    • Uses an external Key Management Server
    • Can have a non-crypto admin user that can do most admin tasks but not access the console, read/write data, etc.
  • Encrypted vMotion – can be set to Disabled/Opportunistic/Required (a PowerCLI sketch follows this list)
  • UEFI Secure boot (for the hypervisor) – needs signed drivers
  • VM Secure boot (UEFI secure boot for the VM)
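
As far as I know there’s no dedicated cmdlet for the vMotion encryption setting yet, but it’s exposed in the 6.5 API as migrateEncryption on the VM config spec, so it can be set via Get-View. A sketch, with a hypothetical VM name:

# Set a VM's vMotion encryption policy: disabled, opportunistic or required
$vm = Get-VM "app01"    # placeholder VM name
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.MigrateEncryption = "required"
$vm.ExtensionData.ReconfigVM($spec)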

Application Availability and Resource Management

  • Proactive HA – detect hardware degraded conditions, vMotion guests off host. Hardware OEM participation is required, eg Dell OpenManage, HP Insight Manager
  • HA Orchestrated Restart – VM-to-VM dependency checks (this has validation checks to prevent dependency loops for example)
  • 5 Restart Priorities (up from 3 in previous versions)
  • HA Admission Control – this has been updated to simplify configuration
    • Choose the number of Failures To Tolerate
    • Based on % of resources reserved
    • Automatic calculations, rather than manual reconfiguration whenever you add/remove a host
    • Overrides are possible
  • New DRS options
    • Even distribution (helps to balance out the cluster even if it’s not required for performance reasons)
    • Can base on consumed memory rather than active memory
    • Takes into account CPU overcommitment

Other changes

  • New CPU models and architectures are now supported
  • LUN limit has been increased to 512
  • Supports vRDMA (virtualised Remote Direct Memory Access) via a paravirtual driver.


The day then concluded with a demo of vRA with Code Stream.

I felt it was a worthwhile event, and it was great to meet a few new people. Thanks again to the VMware UK TAM team for running it.

Windows Failover Cluster VM Snapshot Issue

I configured my first WFC servers a few weeks back, having previously been at an all-Veritas Cluster Server shop. There’s nothing particularly special about them; in fact 2 of the clusters are just 2-node clusters with an IP resource acting as a VIP.

We came to configuring backups this week, and the day after the backup had run on one of the cluster nodes, I noticed that the resource had failed over to the second node in the cluster.

Digging into the event log showed a large number of NTFS warnings (event IDs 50, 98, 140), as well as errors for FailoverClustering (event IDs 1069, 1177, 1564) and Service Control Manager (event IDs 7024, 7031, 7036).


A bit of digging into KB articles such as KB1037959 reveals that snapshotting is not supported with WFC.

However, the issue seems to be caused by quiescing the VM and capturing the memory state with the snapshot. Just snapshotting the disk state did not cause any issues with NTFS or clustering in our testing, though obviously that only gives a crash-consistent backup.
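
For what it’s worth, requesting a disk-only snapshot from PowerCLI is straightforward – a sketch, with a hypothetical VM name:

# Take a crash-consistent snapshot: no memory state, no quiescing
New-Snapshot -VM "wfc-node1" -Name "pre-backup" -Memory:$false -Quiesce:$false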

ESXi 6 – weird host HA error

I came across a strange fault with VMware HA today, where a host was reporting an error in its ability to support HA, and wouldn’t “Reconfigure for HA”.

Attempts to perform the reconfigure failed and generated a failed task with the status “Cannot install the vCenter Server agent service. Cannot upload agent”


Taking the host in and out of Maintenance Mode had no effect, and I could find no pertinent errors in the host logs.

I couldn’t find anything particularly relevant in a Google search either, but on digging through the vCenter logs I found the following:

 2016-08-04T15:29:28.567+01:00 info vpxd[16756] [Originator@6876 sub=HostUpgrader opID=909E5426-000012CB-b0-7d] [VpxdHostUpgrader] Fdm on host-6787 has build 3018524. Expected build is 3634793 - will upgrade
2016-08-04T15:29:28.725+01:00 info vpxd[16756] [Originator@6876 sub=HostAccess opID=909E5426-000012CB-b0-7d] Using vpxapi.version.version10 to communicate with vpxa at host guebesx-dell-001.skybet.net
2016-08-04T15:29:28.910+01:00 warning vpxd[16756] [Originator@6876 sub=Libs opID=909E5426-000012CB-b0-7d] SSL: Unknown SSL Error
2016-08-04T15:29:28.911+01:00 info vpxd[16756] [Originator@6876 sub=Libs opID=909E5426-000012CB-b0-7d] SSL Error: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
2016-08-04T15:29:28.911+01:00 warning vpxd[16756] [Originator@6876 sub=Libs opID=909E5426-000012CB-b0-7d] SSL: connect failed
2016-08-04T15:29:28.911+01:00 warning vpxd[16756] [Originator@6876 sub=Default opID=909E5426-000012CB-b0-7d] [NFC ERROR] NfcNewAuthdConnectionEx: Failed to connect to peer. Error: The remote host certificate has these problems:
-->
--> * The host certificate chain is incomplete.
-->
--> * unable to get local issuer certificate
-->
2016-08-04T15:29:28.912+01:00 error vpxd[16756] [Originator@6876 sub=vpxNfcClient opID=909E5426-000012CB-b0-7d] [VpxNfcClient] Unable to connect to NFC server: The remote host certificate has these problems:
-->
--> * The host certificate chain is incomplete.
-->
--> * unable to get local issuer certificate
2016-08-04T15:29:28.913+01:00 error vpxd[16756] [Originator@6876 sub=HostAccess opID=909E5426-000012CB-b0-7d] [VpxdHostAccess] Failed to upload files: vim.fault.SSLVerifyFault
2016-08-04T15:29:28.918+01:00 error vpxd[16756] [Originator@6876 sub=DAS opID=909E5426-000012CB-b0-7d] [VpxdDasConfigLRO] InstallDas failed on host guebesx-dell-001.skybet.net: class Vim::Fault::AgentInstallFailed::Exception(vim.fault.AgentInstallFailed)
2016-08-04T15:29:28.919+01:00 info vpxd[16756] [Originator@6876 sub=MoHost opID=909E5426-000012CB-b0-7d] [HostMo::UpdateDasState] VC state for host host-6787 (uninitialized -> init error), FDM state (UNKNOWN_FDM_HSTATE -> UNKNOWN_FDM_HSTATE), src of state (null -> null)
2016-08-04T15:29:28.950+01:00 info vpxd[16756] [Originator@6876 sub=vpxLro opID=909E5426-000012CB-b0-7d] [VpxLRO] -- FINISH task-internal-15007334
2016-08-04T15:29:28.950+01:00 info vpxd[16756] [Originator@6876 sub=Default opID=909E5426-000012CB-b0-7d] [VpxLRO] -- ERROR task-internal-15007334 -- -- DasConfig.ConfigureHost: vim.fault.AgentInstallFailed:
--> Result:
--> (vim.fault.AgentInstallFailed) {
--> faultCause = (vmodl.MethodFault) null,
--> reason = "AgentUploadFailed",
--> statusCode = <unset>,
--> installerOutput = <unset>,
--> msg = ""
--> }
--> Args:
-->  

I’m not sure what had caused the certificate error, but a simple disconnect and reconnect of the host cleared the fault and allowed the HA agent to configure successfully.
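
For reference, the disconnect/reconnect can be scripted from PowerCLI – a sketch, with a placeholder host name:

# Disconnect and reconnect the host, forcing the HA agent to reinstall
$vmhost = Get-VMHost "esx01.example.com"    # placeholder
Set-VMHost -VMHost $vmhost -State Disconnected -Confirm:$false
Set-VMHost -VMHost $vmhost -State Connected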

PowerCLI code snippet to get storage driver details

This is just a brief post to share a code snippet that I built to display the storage driver in use.

The driver and its version are critical for VMware VSAN, and I needed a quick and easy way of checking them. I might revise the code at a later date to run across multiple hosts in a cluster and output the results in a table (a first sketch of that is at the end of this post), but for now, here are the basics.

Connect-VIServer <vcname>
# Get an EsxCli object for the target host
$esxcli = Get-EsxCli -VMHost <esxihostname>
# Find the driver in use by the adapter of interest
$adapter = $esxcli.storage.core.adapter.list() |
    Select-Object Description, Driver, HBAName |
    Where-Object { $_.HBAName -match "vmhba0" }
# VIB names use dashes where driver names use underscores
$driver = $adapter.Driver -replace "_", "-"
# List the installed VIB that provides the driver
$esxcli.software.vib.list() |
    Select-Object Name, Version, Vendor, ID, AcceptanceLevel, InstallDate, ReleaseDate, Status |
    Where-Object { $_.Name -match ($driver + "$") }

This displays output such as:

Name            : scsi-megaraid-sas
Version         : 6.603.55.00-1OEM.550.0.0.1331820
Vendor          : LSI
ID              : LSI_bootbank_scsi-megaraid-sas_6.603.55.00-1OEM.550.0.0.1331820
AcceptanceLevel : VMwareCertified
InstallDate     : 2016-05-03
ReleaseDate     :
Status          :

This works for the servers I’ve tried it on (Dell) but as usual YMMV…
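
As a first stab at the cluster-wide version mentioned above, something like the following should work – an untested sketch, with a placeholder cluster name:

# Report the storage driver VIB for every host in a cluster
foreach ($vmhost in Get-Cluster "VSAN-Cluster" | Get-VMHost) {
    $esxcli = Get-EsxCli -VMHost $vmhost
    $adapter = $esxcli.storage.core.adapter.list() |
        Where-Object { $_.HBAName -match "vmhba0" }
    $driver = $adapter.Driver -replace "_", "-"
    $esxcli.software.vib.list() |
        Where-Object { $_.Name -match ($driver + "$") } |
        Select-Object @{N="Host";E={$vmhost.Name}}, Name, Version
}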

ESXi 6.0 – Switching from persistent scratch to transient scratch

KB article 1033696 is very helpful when you want to configure persistent scratch on your USB/SDCard/PXE booted ESXi host, however when you want to go the other way, things can be slightly complicated.

Consider the following situation: you have installed ESXi onto a local USB stick, and have temporarily re-tasked a drive from what will become your VSAN array to run up vCenter and a PSC.
On the next reboot, ESXi will see the persistent local storage and automatically choose to run scratch on it.
From that point onwards, how do you switch back and release the disk for use by VSAN?

You can’t set the advanced configuration "ScratchConfig.ConfiguredScratchLocation" to blank (eg “”); that was the first thing I tried. It accepts the command, but the setting remains pointed at the VMFS location.

You can’t just unmount or delete the VMFS filesystem – it’s in use.

You can’t set the advanced configuration "ScratchConfig.ConfiguredScratchLocation" to /tmp/scratch either; it accepts the value, but on reboot the VMFS filesystem is discovered again.
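
For reference, the setting I was poking at can be read and written from PowerCLI as well as through the host’s advanced settings – a sketch, with a placeholder host name:

# Inspect (and attempt to change) the configured scratch location
$setting = Get-AdvancedSetting -Entity (Get-VMHost "esx01.example.com") -Name "ScratchConfig.ConfiguredScratchLocation"
$setting    # show the current value
$setting | Set-AdvancedSetting -Value "/tmp/scratch" -Confirm:$false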

Other combinations of advanced configuration settings, and editing or removing /etc/vmware/locker.conf, also failed to stop it from loading the scratch onto the VMFS filesystem at boot.

In the end, I was able to get around this by using storcli to offline the disk. The server could then be rebooted without mounting the VMFS filesystem, so scratch ran from /tmp/scratch (on the ramdisk). The disk could then be brought online again and the VMFS filesystem destroyed. I guess an alternative approach would be to point the scratch location at an NFS location, which should take precedence over a “discovered” local persistent VMFS filesystem and allow the VMFS filesystem to be deleted.

I hope that helps someone else, as I spent far more time than I should have going round in a loop, steadily losing my marbles, because there didn’t seem to be any information around about how to do it.

NSX Ninjas course, week 1

After just over a year of trying, I finally managed to get on the VMware NSX Ninjas course. I was first offered it last April (2015) in Palo Alto, but with a small baby at home, and the fact that it was straight after our company conference in Orlando, I had to decline.
I then missed out on it a number of times, due to only finding out about sessions while they were happening.

Anyway, our UK TAM, Liam Farrell, managed to get places for 3 of us (me, @MrCNeale and @BlobbieH) on a course running from VMware’s UK HQ in Staines, which is slightly more travel friendly than Palo Alto. Our instructor for the week was Red1 Bali (@tredwitter), who is actually a freelance consultant, rather than a VMware employee.

For those who haven’t come across the NSX Ninjas course before, my understanding is that it is provided to VMware Partners (at zero cost other than their own travel and accommodation), and its aim is to take people who’ve done the NSX ICM course up through VCIX-NV (week 1) and prepare them for VCDX-NV (week 2).

The course ran from Monday lunchtime, to Friday lunchtime, with days 1-3 billed as NSX 401 Troubleshooting, day 4 NSX Operations, and day 5 NSX Automation.

Maybe because we’d been trying to get on this course for so long, I suspect we had insanely high expectations, and the first day or so felt a little disappointing – a little slow going and not very “deep”. Possibly this was because some people on the course had failed to do the *mandatory* prerequisites of taking the NSX ICM course and passing the VCP-NV, so Red1 was having to take things a little slower. I know people have busy working lives, but attending a deeply technical course without having completed the prereqs just isn’t on in my opinion.

Anyway, the pace soon ramped up, and we were working through the labs, including fixing all the problems caused by them starting with expired licenses. As we progressed through the course presentations, Red1 started introducing faults into our lab environments for us to fix. Some of these were straightforward to find, but some were definitely not so easy, and were an excellent way of getting you into the command lines, debug tools, and logs, to find what had gone wrong.

The course ended with content on Operationalizing NSX, based on the use of LogInsight and vROPS – the latter being of less interest to us at the moment, as it’s not part of the product suite we use. However, the breaking of the lab environments, and the subsequent troubleshooting, continued, with Red1 delivering a seemingly inexhaustible supply of failure scenarios. These were what I enjoyed most about the week, as digging into a gnarly technical fault is something I relish (maybe less so if there’s a production outage on the back of it though!).

All in all, I definitely recommend the course if you can get on it, and a big thanks to VMware for providing it, our UK TAM Liam Farrell for getting us the places, and Red1 for being an excellent instructor.

Stay tuned for week 2, scheduled for the middle of June.


vExpert and Ravello Labs

A month into 2016 and we’re rapidly approaching the announcement of the VMware vExperts for 2016. The benefits of being recognised by this programme have varied over the years (and as the numbers have increased, the ‘freebies’ have inevitably gradually dwindled), but one of the things I have found most useful is the free use of Ravello Labs for running nested ESXi. See here for more details about what is on offer.

Being able to spin up a personalised lab environment for testing has meant that I’ve not run a nested home lab on my laptop all year – in fact, I’m not sure I could actually run one on it these days, with the increased memory requirements and the increasing number of VMs that seem to comprise a modern VMware lab. See here for a comparison of Ravello Systems labs against some physical home lab options.

There are some limitations of course. The main one is that you can’t easily run the VCSA (there are ways, but it’s a pain to implement), and some other VMware Linux-based appliances seem to be affected in the same way. There is also a per-VM memory size limitation – I think this is 8GB at the time of writing – but obviously you can add additional ESXi “servers” to your lab to scale out if necessary, and my laptop currently only has 8GB total…

I still currently use VMware Hands-on Labs to get hands-on with specific products (NSX, VROPS, VRA), as they are quick to spin up and generally come pre-canned with the product I want to investigate, but for general VMware ESXi/vCenter work I’m very impressed with Ravello Labs. I’d like to say a big thank you to them for this very generous benefit – please keep it running for 2016!