Problem adding Active Directory auth to Netbackup Appliance

I’ve spent a fair amount of time recently on two things that I’ve not done much of before: certificates, and integrating with AD authentication. Of the two, the AD auth has generally been straightforward, while installing signed certs has been… well… let’s just call it a learning curve.

Of course, I’ve documented the processes internally, and where possible I’ve automated the configuration for future use.

One of the things that came up recently was adding AD auth for some Netbackup appliances, where shared admin credentials had been in use and we wanted to avoid individual local accounts. As this is a well-documented process I was expecting it to be pretty straightforward, and after checking some details with Veritas Support (such as whether the credentials used for configuring the authentication are stored and used for each lookup – they’re not), I wrote up the change plan and began implementation.

On the first appliance, all was well, and it was as simple as the documentation implied. It was when I came to the next one that I hit problems.

I submitted the details and credentials for the local Active Directory, and after a short delay received the following errors:

[Error] Unable to configure the appliance for Active Directory Authentication.
Check the credentials, authorization of user, and network connectivity issues
[Error] Unable to join the domain. Please check the credentials used to join the
domain, network connectivity, etc. Otherwise contact support
Command failed!

Obviously I then tried exactly the same thing again, unsurprisingly with the same result.

The next step was a tcpdump of the traffic between the appliance and the domain controller. After a little trial and error around the filter to find the crux of the matter, the interesting part turned out to be the conversation on TCP/445.
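
The capture was taken with something along these lines (the interface name here is an assumption; the filter is just the domain controller and port 445):

# verbose capture of the SMB conversation between the appliance and the DC
tcpdump -vvv -i eth0 host domaincontroller-fqdn and tcp port 445

This is what it turned up: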

10:08:56.983135 IP (tos 0x0, ttl 64, id 34248, offset 0, flags [DF], proto TCP (6), length 60)
appliance-fqdn.37346 > domaincontroller-fqdn.microsoft-ds: Flags [S], cksum 0x4556 (correct), seq 2083948561, win 14600, options [mss 1460,sackOK,TS val 91164121 ecr 0,nop,wscale 10], length 0
10:08:56.983365 IP (tos 0x0, ttl 127, id 27767, offset 0, flags [DF], proto TCP (6), length 60)
domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37346: Flags [S.], cksum 0xe3c7 (correct), seq 1747220571, ack 2083948562, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 499809341 ecr 91164121], length 0
10:08:56.983410 IP (tos 0x0, ttl 64, id 34249, offset 0, flags [DF], proto TCP (6), length 52)
appliance-fqdn.37346 > domaincontroller-fqdn.microsoft-ds: Flags [.], cksum 0x3286 (correct), seq 1, ack 1, win 15, options [nop,nop,TS val 91164121 ecr 499809341], length 0
10:08:57.001109 IP (tos 0x0, ttl 64, id 34250, offset 0, flags [DF], proto TCP (6), length 246)
appliance-fqdn.37346 > domaincontroller-fqdn.microsoft-ds: Flags [P.], cksum 0x1f58 (incorrect -> 0x02a5), seq 1:195, ack 1, win 15, options [nop,nop,TS val 91164139 ecr 499809341], length 194
SMB PACKET: SMBnegprot (REQUEST)
SMB Command = 0x72
Error class = 0x0
Error code = 0 (0x0)
Flags1 = 0x8
Flags2 = 0x1
Tree ID = 0 (0x0)
Proc ID = 12819 (0x3213)
UID = 0 (0x0)
MID = 1 (0x1)
Word Count = 0 (0x0)
smb_bcc=155
Dialect=PC NETWORK PROGRAM 1.0
Dialect=MICROSOFT NETWORKS 1.03
Dialect=MICROSOFT NETWORKS 3.0
Dialect=LANMAN1.0
Dialect=LM1.2X002
Dialect=DOS LANMAN2.1
Dialect=LANMAN2.1
Dialect=Samba
Dialect=NT LANMAN 1.0
Dialect=NT LM 0.12

10:08:57.001402 IP (tos 0x0, ttl 127, id 27768, offset 0, flags [DF], proto TCP (6), length 40)
domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37346: Flags [R.], cksum 0x1836 (correct), seq 1, ack 195, win 0, length 0

The last two packets in the conversation are an SMB Negotiate request from the appliance, to which the domain controller responds with a TCP Reset packet. Rude!

At this point, the sensible thing was to compare with the working appliance. As that one was already successfully using AD auth I wasn’t sure it would be a fair comparison, but I found that using smbclient to try and establish a session from the problem appliance produced exactly the same reset.
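
The smbclient test was nothing clever, just an attempt to list the shares on the domain controller, something like this (the account is a placeholder):

# negotiate a session with the DC and list its shares
smbclient -L //domaincontroller-fqdn -U 'DOMAIN\someuser'

For comparison, the same test against the domain controllers local to the working appliance told a very different story: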

10:27:06.868465 IP (tos 0x0, ttl 64, id 38798, offset 0, flags [DF], proto TCP (6), length 60)
 appliance-fqdn.37879 > domaincontroller-fqdn.microsoft-ds: Flags [S], cksum 0x71e9 (correct), seq 3921922782, win 14600, options [mss 1460,sackOK,TS val 3537132880 ecr 0,nop,wscale 10], length 0
10:27:06.868694 IP (tos 0x0, ttl 127, id 29676, offset 0, flags [DF], proto TCP (6), length 60)
 domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37879: Flags [S.], cksum 0xd55f (correct), seq 3976192417, ack 3921922783, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 769833214 ecr 3537132880], length 0
10:27:06.868730 IP (tos 0x0, ttl 64, id 38799, offset 0, flags [DF], proto TCP (6), length 52)
 appliance-fqdn.37879 > domaincontroller-fqdn.microsoft-ds: Flags [.], cksum 0x241e (correct), seq 1, ack 1, win 15, options [nop,nop,TS val 3537132880 ecr 769833214], length 0
10:27:06.960943 IP (tos 0x0, ttl 64, id 38800, offset 0, flags [DF], proto TCP (6), length 246)
 appliance-fqdn.37879 > domaincontroller-fqdn.microsoft-ds: Flags [P.], cksum 0x2378 (incorrect -> 0x856b), seq 1:195, ack 1, win 15, options [nop,nop,TS val 3537132973 ecr 769833214], length 194
SMB PACKET: SMBnegprot (REQUEST)
SMB Command = 0x72
Error class = 0x0
Error code = 0 (0x0)
Flags1 = 0x8
Flags2 = 0x1
Tree ID = 0 (0x0)
Proc ID = 47233 (0xb881)
UID = 0 (0x0)
MID = 1 (0x1)
Word Count = 0 (0x0)
smb_bcc=155
Dialect=PC NETWORK PROGRAM 1.0
Dialect=MICROSOFT NETWORKS 1.03
Dialect=MICROSOFT NETWORKS 3.0
Dialect=LANMAN1.0
Dialect=LM1.2X002
Dialect=DOS LANMAN2.1
Dialect=LANMAN2.1
Dialect=Samba
Dialect=NT LANMAN 1.0
Dialect=NT LM 0.12

10:27:06.961426 IP (tos 0x0, ttl 127, id 29678, offset 0, flags [DF], proto TCP (6), length 261)
 domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37879: Flags [P.], cksum 0x82f9 (correct), seq 1:210, ack 195, win 514, options [nop,nop,TS val 769833223 ecr 3537132973], length 209
SMB PACKET: SMBnegprot (REPLY)
SMB Command = 0x72
Error class = 0x0
Error code = 0 (0x0)
Flags1 = 0x88
Flags2 = 0x1
Tree ID = 0 (0x0)
Proc ID = 47233 (0xb881)
UID = 0 (0x0)
MID = 1 (0x1)
Word Count = 17 (0x11)
NT1 Protocol
DialectIndex=9 (0x9)

This time, the SMB negotiate elicited a reply rather than a reset, selecting dialect index 9, “NT LM 0.12” (the classic SMB1/NT1 dialect), and that is the root of the problem. The local DCs for the appliance where AD auth was working had had NTLMv1/SMB1 temporarily enabled, whereas the DCs for the appliance where it wasn’t working didn’t have NTLMv1/SMB1 enabled.

I then started to look into enabling SMB2 on Samba. The appliances are built on Red Hat Enterprise Linux 6.6 and use Samba 3.6.23. While the 3.6 series was the first version of Samba to support SMB2, it *only* supported it as a server, not as a client; client-side SMB2 support only arrived in 4.1.0. I also checked the very latest build of the appliance software, which turned out to ship exactly the same Samba version as the one we were on. At this point I felt I’d taken it as far as I could, and raised the issue with Veritas support.
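
For anyone wanting to check their own appliance, the installed Samba version is easy enough to confirm from the shell; a quick sketch, assuming the standard Red Hat packaging:

# report the Samba packages and the client version on the appliance
rpm -q samba samba-client samba-winbind
smbclient --version

Anything in the 3.6.x range will only ever negotiate SMB1/NT1 when acting as a client.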

Anyone who’s dealt with frontline support from an organisation like Veritas will understand the frustration of the next two weeks, as the frontline engineer assigned tried to follow his call script, completely ignoring all the diagnostics I’d already done. Ultimately he did escalate to a backline engineer (I hadn’t stamped my feet and shouted, as it wasn’t a service-impacting issue) and it quickly got pushed to the Netbackup engineering team in the States.

In the meantime I’d spoken to the beta team about which version would be in the next release, and they confirmed that it would be a version that fixed the problem.

The reply back from engineering was that I was the only person to have come across this issue (meaning that either people just don’t enable AD auth, or those who do have SMB1 enabled on their domains), so, as it would require significant testing to uplift Samba to a much higher version, they wouldn’t be releasing a patch to fix it. However, they confirmed what the beta team had said about it being fixed in the forthcoming version of Netbackup.

Anyway, I’ll be awaiting that release with interest, and I hope my documenting the issue here helps someone else.


Windows Failover Cluster VM Snapshot Issue

I configured my first WFC servers a few weeks back, having previously been at an all-Veritas Cluster Server shop. There’s nothing particularly special about them; in fact, two of the clusters are just two-node clusters with an IP resource acting as a VIP.

We came to configuring backups this week, and the day after the backup had run on one of the cluster nodes, I noticed that the resource had failed over to the second node in the cluster.

Digging into the event log showed a large number of NTFS warnings (event IDs 50, 98, 140), as well as errors for FailoverClustering (event IDs 1069, 1177, 1564) and Service Control Manager (event IDs 7024, 7031, 7036).
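
If you want to pull the same events out quickly, something along these lines works from an elevated prompt (the IDs and count are just the ones I was interested in; adjust to taste):

rem query the System log for the FailoverClustering errors seen after the backup
wevtutil qe System /q:"*[System[(EventID=1069 or EventID=1177 or EventID=1564)]]" /f:text /c:10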

[Screenshot: System event log showing the NTFS, FailoverClustering and Service Control Manager errors]

A bit of digging into VMware KB articles such as KB 1037959 reveals that snapshotting is not supported with WFC.

However, the issue seems to be caused by quiescing the VM and capturing the memory state with the snapshot. Just snapshotting the disk state does not appear to cause any issues with NTFS or Clustering in our testing, but obviously this is just a crash-consistent backup.
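
As an illustration, a disk-only snapshot can be taken from the ESXi shell with vim-cmd, with both the memory and quiesce flags set to 0 (the VM ID, name and description below are placeholders taken from vmsvc/getallvms):

# find the VM ID, then snapshot without memory state and without quiescing
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/snapshot.create 42 "pre-backup" "disk-only snapshot" 0 0

Whether your backup product exposes those options is, of course, down to the product.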

Testing network connectivity

One of the difficulties of working in a tightly controlled datacenter environment is establishing whether something isn’t working because of firewall rules. With most things, you can just test TCP connections using the telnet client, which is a nice simple command line utility that I generally include in Windows installs purely for that purpose.

With UDP it’s a little more difficult, and when trying to confirm that firewalls were blocking UDP/1434 between MS SQL Server installations in two sites, I ended up with the following:

  • Wireshark installed and running on both servers
  • PowerShell with a Test-Port function, used as described below

With Wireshark running and a capture filter for port 1434, I have then been running test-port -comp destserver -port 1434 -udp -udptimeout 10000 and checking both Wireshark captures.

While the test-port reports success (UDP is connectionless after all, so a send is generally going to be accepted), the Wireshark captures tell a different story: packets leaving one side and never arriving at the other. One for the firewall guys to resolve.

I also discovered while looking into this that on Linux there’s a way of testing both TCP and UDP connections from the command line using special files:

/dev/tcp/host/port
If host is a valid hostname or Internet address, and port is an integer port number
or service name, bash attempts to open a TCP connection to the corresponding socket.

/dev/udp/host/port
If host is a valid hostname or Internet address, and port is an integer port number
or service name, bash attempts to open a UDP connection to the corresponding socket.

For example:

rich@www:~$ cat < /dev/tcp/localhost/22
SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.1
^C
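
The UDP equivalent is just a redirect to the matching special file (this needs to be run under bash rather than plain sh). As with Test-Port, the write will normally appear to succeed whether or not anything is listening, so you still need a capture or a listener at the far end to prove delivery. With destserver standing in for the target:

# send a single datagram to UDP/1434 on the far side
echo -n "ping" > /dev/udp/destserver/1434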

See also my post here on testing network connectivity from ESXi.

NSX LoadBalancer – character “/” is not permitted in server name

[Screenshot: NSX Web Client error – character “/” is not permitted in server name]

This was an odd error that a colleague brought to me while testing automation around the configuration of an NSX Edge.

He had created the Edge successfully and configured the Load Balancer, but on trying to enable it, it errored. When he tried enabling it through the Web Client, the above error was displayed, and the change was automatically reverted.

After a lot of digging, I discovered that the configuration for the Load Balancer had a Pool where the “IP Address / VC Container” object was a Service Group, and one of the members of that Service Group was an IPSet for the CIDR block that NSX was trying to include in the server name.

I’m not sure whether that is even a supported configuration, but I changed it to point to a Service Group that included the members of the target web farm, and the Load Balancer could then be configured successfully.

Github Desktop from behind a corporate proxy server

After having just helped a colleague get through the tortuous path of configuring Github Desktop to work through a proxy, I thought it might be worth blogging it all.

Different parts of Github Desktop require the proxy information to be provided in different ways, and without all 3 pieces of configuration, you will find that some things work, but not others.

  1. Internet Explorer proxy setting
    This *has* to be set to a specific proxy server, and not using an autoconfig script.
  2. .gitconfig
    This is found in your user home directory (usually C:\Users\<Username>) and requires the following lines (the same entries can be added with git config; see the commands after this list):
    [http]
    proxy = http://<proxy-address>:<port>
    [https]
    proxy = http://<proxy-address>:<port>
  3. HTTPS_PROXY/HTTP_PROXY environment variable
    You can set this in your local environment, or in the system environment settings, as long as it’s visible to the Github Desktop processes.
    eg.
    set HTTPS_PROXY=http://<proxy-address>:<port>
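
If you’d rather not edit .gitconfig by hand, the entries from step 2 can be written from a command prompt instead, and setx (unlike set) makes the variable from step 3 persist for new sessions. The proxy address and port are the same placeholders as above:

git config --global http.proxy http://<proxy-address>:<port>
git config --global https.proxy http://<proxy-address>:<port>
setx HTTPS_PROXY http://<proxy-address>:<port>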

If a userid/password is required, it’s recommended that you run something like CNTLM to do the authentication, rather than adding the plaintext credentials to the proxy string.

Once you’ve configured all that, if you’re using Enterprise Github, you will probably need to use a Personal Access Token, rather than your password, to authenticate Github Desktop. This can be created by logging in with a browser and going to Settings / Personal Access Tokens.

I hope that helps someone out, but if not, I’m sure I’ll be using it as a reminder when I have to change it all between using it at Home and at Work…

HP Insight Remote Support WBEM problem and resolution

So this post is a little off topic. I recently had to migrate an HP Business Critical “dial home” service from an HP SIM based install, to their new HP IRS 7.x. Mostly, this was because the HP SIM version was going obsolete at the beginning of next year.

The install of IRS itself is very simple, especially compared with the HP SIM based version which took a number of days to get set up correctly. However, getting it working and monitoring the servers wasn’t quite as straightforward as expected.

Issue #1 Windows Servers

The first problem I had was getting it to monitor the Windows servers via SNMP. Tracing through, there were a number of items which needed configuring:

  • IRS Server – SNMP Service not running. This is the default setting on our standard build. OK, set to Automatic and start it.
  • IRS Server – SNMP Service config. Add the chosen community string to be accepted from the target servers.
  • Target Servers – SNMP Service config. Add the chosen community string to be accepted from the IRS Server.
  • Target Servers – SNMP Service config. Add the IRS server (and community string) as a trap destination (both of these target-server items are scriptable via the registry; see the sketch after this list).
  • Target Servers – HP System Management Homepage. Change to use SNMP instead of WBEM as its data source
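
For the two target-server SNMP items, the settings live under the SNMP service’s registry keys, so they can be scripted rather than clicked through on every box. This is only a sketch; the community string, the rights value (4 is READ ONLY) and the IRS server name are all assumptions to adapt:

rem accept the chosen community read-only, add the IRS server as a trap destination, then restart SNMP
reg add "HKLM\SYSTEM\CurrentControlSet\Services\SNMP\Parameters\ValidCommunities" /v MyCommunity /t REG_DWORD /d 4 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\SNMP\Parameters\TrapConfiguration\MyCommunity" /v 1 /t REG_SZ /d irs-server.example.com /f
net stop SNMP && net start SNMP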

Once all that was in place, a test trap populated each server into the IRS console, and all was good. And yes, I know this is pretty obvious; most of these I’d already done, but I’ve listed them all for completeness…

Issue #2 HP-UX Servers

On the HP-UX servers, SNMP wasn’t used; instead, a WBEM user had already been created for the HP SIM based service, and this was going to be reused.

The servers added to the IRS console without any issue; however, they would not register for WBEM alerts. After a lot of investigation between me and the Unix guys (who fortunately now sit very close to me after an office re-org!) we were still at a loss, so I had to raise a support call with HP.

The lady from HP was quickly on the case, and the cause was, as they say, a doozy. When you install IRS, it apparently sets the WBEM registration URL to… localhost. D’oh!

The necessary commands to change this, assuming an IRS server IP of 172.16.0.1, are (for WBEM and WMI respectively):

rsadmin config -set wbem.subscription.url=https://172.16.0.1:7905/wbem

This should respond:
GLOBAL wbem.subscription.url was https://localhost:7905/wbem => https://172.16.0.1:7905/wbem

and..

rsadmin config -set wmi.subscription.url=https://172.16.0.1:7905/wmi

This should respond:
GLOBAL wmi.subscription.url was https://localhost:7905/wmi => https://172.16.0.1:7905/wmi

This cured the problem and I was able to create all the necessary WBEM subscriptions.

Reconfiguring VSAN storage on Dell PERC H710P Mini Array Controller

I recently had to reorganise the storage on one of our VSAN clusters. The hosts have H710P array controllers, which don’t have pass-thru capability, so each disk has to be created as a RAID0 Virtual Disk on the array controller.

In addition, the two SSD drives had been placed into a single RAID0 array, which needed breaking apart to enable the use of two VSAN Disk Groups (giving two separate failure domains instead of one great big one!).

On top of this, with only three hosts in the farm at this point in time, there was no option to fully evacuate the data from each host, so I had to treat each server as a “failure” and allow VSAN to create new mirror copies after the reconfiguration of each host.

Here are the steps I went through:

Original – 2x SSD in RAID0, 4 separate RAID0 HDD drives – in one disk group

New – 2 separate RAID0 SSD, 10 separate RAID0 HDD drives – equally divided between 2 disk groups

Steps

  1. Place host in maintenance mode, with “Ensure accessibility” option.

    (can only choose “Full data migration” if there are more than 3 hosts in the cluster and sufficient storage)

  2. To complete entering maintenance mode, it will be necessary to power down the NSX Controller running on this host.
  3. Attach remote server console (iDRAC) and reboot server
  4. Enter the BIOS (F2 at server boot)
  5. Select “Device Configuration”
  6. Select “Integrated RAID Controller 1: Dell PERC < PERC H710P Mini>”
  7. Delete the old SSD Virtual Disk:
    1. Select “Select Virtual Disk Operations”
    2. Choose the SSD disk from the “Select Virtual Disk” dropdown
    3. Select “Delete Virtual Disk”
    4. Tick the checkbox to Confirm and Select Yes
    5. Click Back
  8. Select “Create Virtual Disk”
  9. For each SSD to add as a VSAN SSD disk perform the following:
    1. Leave RAID Level at RAID0
    2. Select “Select Physical Disks”
    3. Select Media Type “SSD”
    4. Select the appropriate disk from the list
    5. Select “Apply Changes”
    6. Select “OK”
    7. Enter a “Virtual Disk Name” of “VSAN1_SSD1” or “VSAN2_SSD1”
    8. Leave all other settings at default and choose “Create Virtual Disk”
    9. Tick the checkbox to Confirm and Select Yes
    10. Select “OK”
    11. Repeat for the other SSD
  10. For each HDD to add as a VSAN disk perform the following:
    1. Select “Create Virtual Disk”
    2. Leave RAID Level at RAID0
    3. Leave Media Type at “HDD”
    4. Select the appropriate disk from the list
    5. Select “Apply Changes”
    6. Select “OK”
    7. Enter a “Virtual Disk Name” of “VSAN_HDD”
    8. Leave all other settings at default and choose “Create Virtual Disk”
    9. Tick the checkbox to Confirm and Select Yes
    10. Select “OK”
    11. Repeat for the other HDD drives
  11. Select “Back”, “Back”, “Finish”, “Finish”, “Finish” to leave the BIOS
  12. Allow the host to boot back up
  13. Allow the host to reconnect into vCenter
  14. Select the Cluster the host is in, and choose the Manage tab and Virtual SAN “Disk Management” subheading
  15. Select the disk group showing “Unhealthy” and click the “Remove Disk Group” icon.
  16. Select “Yes” to remove the disk group
  17. Launch PowerCLI and use the following script to change the disk type of the SSDs to SSD:
    $server = "hostname.domain.name"
    Connect-VIServer -Server $server -User root -Password *******
    $esxcli = Get-EsxCli -VMHost $server
    # Tag each disk between 180GB and 200GB (the SSDs on these hosts) as SSD, then reclaim it
    Get-ScsiLun -VmHost $server | where { $_.CapacityGB -lt 200 -and $_.CapacityGB -gt 180 } | foreach {
        $canName = $_.CanonicalName
        $satp = ($esxcli.storage.nmp.device.list() | where { $_.Device -eq $canName }).StorageArrayType
        $esxcli.storage.nmp.satp.rule.add($null,$null,$null,$canName,$null,$null,$null,"enable_ssd",$null,$null,$satp,$null,$null,$null)
        $esxcli.storage.core.claiming.reclaim($canName)
    }
    $esxcli.storage.core.device.list() | select Device, Size, IsSSD
    Disconnect-VIServer -Confirm:$false
    

    (I found this on a forum post which I now can’t locate to give the proper attribution, sorry)

  18. Return to the Web Client and navigate to the host, select the “Manage” tab and the “Storage” and “Storage Devices” subsections. Note the “naa id” of the disks marked as SSD.
    These need the partition tables clearing, so they can be reused by VSAN
  19. Clearing the partition table:
    1. SSH to the host, and login as root
    2. cd /vmfs/devices/disks
    3. Use “ls <id>” to ensure the disk is there
    4. Issue the command “partedUtil mklabel /vmfs/devices/disks/<id> msdos” to clear the old and incorrect GPT table
    5. Repeat for the other SSD.
  20. Return to the vCenter Web Client and select the Cluster the host is in, and choose the Manage tab and Virtual SAN “Disk Management” subheading
  21. Select the host and select the “Create a New Disk Group” icon
  22. Select an SSD and 5 HDD drives and click “OK” (if the SSDs aren’t displayed, you may need to do a storage rescan)
  23. Repeat to create a second disk group
  24. Ensure both disk groups are created successfully
  25. Return to the Hosts and Clusters view
  26. Take the host out of “Maintenance Mode”
  27. Select the cluster, and navigate to the “Monitor” tab, and the “Virtual SAN” and “Virtual Disks” subsections.
  28. Monitor until all entries in “Physical Disk Placement” are showing “Active” for all VM disk components. This will not start until the timer (configurable in the Advanced Setting “VSAN.ClomRepairDelay”, default 60 minutes) has expired.