Problem adding Active Directory auth to NetBackup Appliance

I’ve spent a fair amount of time recently on two things that I’ve not done much of before: certificates, and integrating with AD authentication. Of the two, the AD auth has generally been straightforward, and installing signed certs has been….. well… let’s just call it a learning curve.

Of course, I’ve documented the processes internally, and where possible I’ve automated the configuration for future use.

One of the things that came up recently was adding AD auth for some NetBackup appliances, where shared admin credentials had been in use and we wanted to avoid individual local accounts. As this is a well-documented process I was expecting it to be pretty straightforward, and after checking some details with Veritas Support (such as whether the credentials used for configuring the authentication were stored and used for each lookup – they’re not), I wrote up the change plan and began implementation.

On the first appliance, all was well, and it was as simple as the documentation implied. It was when I came to the next one that I hit problems.

I submitted the details and credentials for the local Active Directory, and after a short delay received the following errors:

- [Error] Unable to configure the appliance for Active Directory Authentication.
Check the credentials, authorization of user, and network connectivity issues
- [Error] Unable to join the domain. Please check the credentials used to join the domain, network connectivity, etc. Otherwise contact support
Unable to configure the appliance for Active Directory Authentication. Check
the credentials, authorization of user, and network connectivity issues
Unable to join the domain. Please check the credentials used to join the
domain, network connectivity, etc. Otherwise contact support
Command failed!

Obviously I then tried exactly the same again, unsurprisingly with the same result.

The next step was a tcpdump of traffic between the appliance and the domain controller. After a little trial and error with the filter to find the crux of the matter, I captured the following on port TCP/445:

10:08:56.983135 IP (tos 0x0, ttl 64, id 34248, offset 0, flags [DF], proto TCP (6), length 60)
appliance-fqdn.37346 > domaincontroller-fqdn.microsoft-ds: Flags [S], cksum 0x4556 (correct), seq 2083948561, win 14600, options [mss 1460,sackOK,TS val 91164121 ecr 0,nop,wscale 10], length 0
10:08:56.983365 IP (tos 0x0, ttl 127, id 27767, offset 0, flags [DF], proto TCP (6), length 60)
domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37346: Flags [S.], cksum 0xe3c7 (correct), seq 1747220571, ack 2083948562, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 499809341 ecr 91164121], length 0
10:08:56.983410 IP (tos 0x0, ttl 64, id 34249, offset 0, flags [DF], proto TCP (6), length 52)
appliance-fqdn.37346 > domaincontroller-fqdn.microsoft-ds: Flags [.], cksum 0x3286 (correct), seq 1, ack 1, win 15, options [nop,nop,TS val 91164121 ecr 499809341], length 0
10:08:57.001109 IP (tos 0x0, ttl 64, id 34250, offset 0, flags [DF], proto TCP (6), length 246)
appliance-fqdn.37346 > domaincontroller-fqdn.microsoft-ds: Flags [P.], cksum 0x1f58 (incorrect -> 0x02a5), seq 1:195, ack 1, win 15, options [nop,nop,TS val 91164139 ecr 499809341], length 194
SMB PACKET: SMBnegprot (REQUEST)
SMB Command = 0x72
Error class = 0x0
Error code = 0 (0x0)
Flags1 = 0x8
Flags2 = 0x1
Tree ID = 0 (0x0)
Proc ID = 12819 (0x3213)
UID = 0 (0x0)
MID = 1 (0x1)
Word Count = 0 (0x0)
smb_bcc=155
Dialect=PC NETWORK PROGRAM 1.0
Dialect=MICROSOFT NETWORKS 1.03
Dialect=MICROSOFT NETWORKS 3.0
Dialect=LANMAN1.0
Dialect=LM1.2X002
Dialect=DOS LANMAN2.1
Dialect=LANMAN2.1
Dialect=Samba
Dialect=NT LANMAN 1.0
Dialect=NT LM 0.12

10:08:57.001402 IP (tos 0x0, ttl 127, id 27768, offset 0, flags [DF], proto TCP (6), length 40)
domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37346: Flags [R.], cksum 0x1836 (correct), seq 1, ack 195, win 0, length 0

The last two packets in the conversation are an SMB Negotiate request from the appliance, which the domain controller responds to with a TCP Reset packet. Rude!
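For anyone wanting to reproduce the test without joining a domain, here is a minimal Python sketch that builds the same SMB1 Negotiate request seen in the capture and reports whether the server answers or resets. The function names (`build_negotiate_request`, `probe_smb1`) are my own invention for illustration, and the field layout is my reading of the SMB1 header, so treat it as a sketch rather than a reference implementation.

```python
import socket
import struct

# Dialect strings exactly as offered by the appliance in the capture above.
DIALECTS = (
    "PC NETWORK PROGRAM 1.0",
    "MICROSOFT NETWORKS 1.03",
    "MICROSOFT NETWORKS 3.0",
    "LANMAN1.0",
    "LM1.2X002",
    "DOS LANMAN2.1",
    "LANMAN2.1",
    "Samba",
    "NT LANMAN 1.0",
    "NT LM 0.12",
)


def build_negotiate_request(dialects=DIALECTS, pid=0x3213, mid=1):
    """Return an SMB1 Negotiate request framed for direct TCP (port 445)."""
    # Each dialect is a 0x02 buffer-format byte plus a NUL-terminated string.
    data = b"".join(b"\x02" + d.encode("ascii") + b"\x00" for d in dialects)
    header = struct.pack(
        "<4sBIBHH8sHHHHH",
        b"\xffSMB",   # protocol identifier
        0x72,         # command: negotiate
        0,            # status (error class / error code)
        0x08,         # Flags1, as in the capture
        0x0001,       # Flags2, as in the capture
        0,            # PID high
        b"\x00" * 8,  # security features
        0,            # reserved
        0,            # Tree ID
        pid,          # Proc ID (low)
        0,            # UID
        mid,          # MID
    )
    # Word count of 0, then the byte count (smb_bcc), then the dialect list.
    body = header + b"\x00" + struct.pack("<H", len(data)) + data
    # Session framing: a zero byte followed by a 3-byte big-endian length.
    return b"\x00" + struct.pack(">I", len(body))[1:] + body


def probe_smb1(host, port=445, timeout=5.0):
    """Send the request and report 'reply', 'reset' or 'closed'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_negotiate_request())
            reply = sock.recv(4096)
    except ConnectionResetError:
        return "reset"
    return "reply" if reply else "closed"
```

Calling `probe_smb1("domaincontroller-fqdn")` against a DC with SMB1 disabled should, if it behaves like the one above, come back as `"reset"`. The request built with the defaults is 194 bytes long, matching the `length 194` in the capture.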

At this point, the sensible thing was to compare with the working appliance. As that one was already successfully using AD auth, I wasn’t sure whether it would be a good comparison, but I found that using smbclient to try to establish a session on the problem appliance also produced the same result. The capture from the working appliance told a different story:

10:27:06.868465 IP (tos 0x0, ttl 64, id 38798, offset 0, flags [DF], proto TCP (6), length 60)
 appliance-fqdn.37879 > domaincontroller-fqdn.microsoft-ds: Flags [S], cksum 0x71e9 (correct), seq 3921922782, win 14600, options [mss 1460,sackOK,TS val 3537132880 ecr 0,nop,wscale 10], length 0
10:27:06.868694 IP (tos 0x0, ttl 127, id 29676, offset 0, flags [DF], proto TCP (6), length 60)
 domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37879: Flags [S.], cksum 0xd55f (correct), seq 3976192417, ack 3921922783, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 769833214 ecr 3537132880], length 0
10:27:06.868730 IP (tos 0x0, ttl 64, id 38799, offset 0, flags [DF], proto TCP (6), length 52)
 appliance-fqdn.37879 > domaincontroller-fqdn.microsoft-ds: Flags [.], cksum 0x241e (correct), seq 1, ack 1, win 15, options [nop,nop,TS val 3537132880 ecr 769833214], length 0
10:27:06.960943 IP (tos 0x0, ttl 64, id 38800, offset 0, flags [DF], proto TCP (6), length 246)
 appliance-fqdn.37879 > domaincontroller-fqdn.microsoft-ds: Flags [P.], cksum 0x2378 (incorrect -> 0x856b), seq 1:195, ack 1, win 15, options [nop,nop,TS val 3537132973 ecr 769833214], length 194
SMB PACKET: SMBnegprot (REQUEST)
SMB Command = 0x72
Error class = 0x0
Error code = 0 (0x0)
Flags1 = 0x8
Flags2 = 0x1
Tree ID = 0 (0x0)
Proc ID = 47233 (0xb881)
UID = 0 (0x0)
MID = 1 (0x1)
Word Count = 0 (0x0)
smb_bcc=155
Dialect=PC NETWORK PROGRAM 1.0
Dialect=MICROSOFT NETWORKS 1.03
Dialect=MICROSOFT NETWORKS 3.0
Dialect=LANMAN1.0
Dialect=LM1.2X002
Dialect=DOS LANMAN2.1
Dialect=LANMAN2.1
Dialect=Samba
Dialect=NT LANMAN 1.0
Dialect=NT LM 0.12

10:27:06.961426 IP (tos 0x0, ttl 127, id 29678, offset 0, flags [DF], proto TCP (6), length 261)
 domaincontroller-fqdn.microsoft-ds > appliance-fqdn.37879: Flags [P.], cksum 0x82f9 (correct), seq 1:210, ack 195, win 514, options [nop,nop,TS val 769833223 ecr 3537132973], length 209
SMB PACKET: SMBnegprot (REPLY)
SMB Command = 0x72
Error class = 0x0
Error code = 0 (0x0)
Flags1 = 0x88
Flags2 = 0x1
Tree ID = 0 (0x0)
Proc ID = 47233 (0xb881)
UID = 0 (0x0)
MID = 1 (0x1)
Word Count = 17 (0x11)
NT1 Protocol
DialectIndex=9 (0x9)

This time, the SMB negotiate elicited a reply rather than a reset. The reply selected dialect index 9 which, as the index counts from zero, is “NT LM 0.12” (the “NT1 Protocol” line in the decode). That is an SMB1 dialect, and it is the root of the problem. The local DCs for the appliance where AD auth was working had had NTLMv1/SMB1 temporarily enabled, whereas the ones for the appliance where it wasn’t working didn’t.
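Reading the DialectIndex is easy to get wrong by one, because the SMB1 negotiate reply numbers the offered dialects from zero. A small Python sketch of the lookup, using the dialect list from the request above (the helper name `selected_dialect` is just for illustration):

```python
# Dialects in the order the appliance offered them in the negotiate request.
OFFERED = [
    "PC NETWORK PROGRAM 1.0",
    "MICROSOFT NETWORKS 1.03",
    "MICROSOFT NETWORKS 3.0",
    "LANMAN1.0",
    "LM1.2X002",
    "DOS LANMAN2.1",
    "LANMAN2.1",
    "Samba",
    "NT LANMAN 1.0",
    "NT LM 0.12",
]


def selected_dialect(offered, dialect_index):
    """Map the server's zero-based DialectIndex back to a dialect string."""
    if not 0 <= dialect_index < len(offered):
        raise ValueError("server did not select any offered dialect")
    return offered[dialect_index]


print(selected_dialect(OFFERED, 9))  # prints: NT LM 0.12
```

Every entry in that list is an SMB1-era dialect, so whichever one the server picks, the conversation can only ever be SMB1.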

I then started to look into enabling SMB2 on Samba. The appliances are built on Red Hat 6.6, and use Samba 3.6.23. While the 3.6 series was the first to support SMB2, it *only* supported it as a server, not as a client. Support for SMB2 as a client only came in at version 4.1.0. I also checked the very latest build of the appliance software, which turned out to use exactly the same Samba version as the one we were on. At this point I felt I’d taken it as far as I could, and raised the issue with Veritas support.
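The version check boils down to a simple comparison. A minimal sketch in Python, assuming a plain numeric version string such as the one Samba reports (suffixes like "rc1" are not handled, and the function names are made up for this example):

```python
# First Samba release with SMB2 client support, per the paragraph above.
SMB2_CLIENT_MIN = (4, 1, 0)


def parse_version(text):
    """Turn '3.6.23' into (3, 6, 23). Assumes purely numeric components."""
    return tuple(int(part) for part in text.strip().split(".")[:3])


def client_supports_smb2(samba_version):
    """True if this Samba version can speak SMB2 as a client."""
    return parse_version(samba_version) >= SMB2_CLIENT_MIN


print(client_supports_smb2("3.6.23"))  # prints: False
print(client_supports_smb2("4.1.0"))   # prints: True
```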

Anyone who’s dealt with frontline support from an organisation like Veritas will understand the frustration I went through for the next two weeks, as the frontline engineer assigned tried to follow his call script, completely ignoring all the diagnostics I’d already done. Ultimately he did escalate to a backline engineer (I’d not stamped and shouted as it wasn’t an impacting issue) and it quickly got pushed to the NetBackup engineering team in the States.

In the meantime I’d spoken to the beta team about what version of Samba would be in the next release, and they confirmed that it would be a version that fixed the problem.

The reply from engineering was that I was the only person to have come across this issue (meaning that either people just don’t enable AD auth, or that those who do have SMB1 enabled on their domains), so as it would require significant testing to uplift Samba to a much higher version, they wouldn’t be releasing a patch to fix it. However, they confirmed what the beta team had said about it being fixed in the forthcoming version of NetBackup.

Anyway, I’ll be awaiting that release with interest, and I hope my documenting the issue here helps someone else.


SMBv1 and Hide NAT

I was asked yesterday to investigate an issue where client PCs were getting disconnect messages from a share on a virtualised server.
It was thought that there was some problem with the virtual switch, and/or the VirtualConnect config of the blade enclosure.

A network capture was included, as was a Visio diagram of the network configuration, which was very nice to have! It’s great to deal with experienced techies who know you’re going to need some background material, and think to include it with the request for help 🙂

What struck me as odd when looking at the capture was that the offending RST packets always appeared to coincide with a new TCP session being set up to port 445, i.e. a new TCP session was always followed by a RST of an old one, and two new TCP sessions by two RSTs of old connections.

A bit of digging pulled up this link, something I’d never come across before. Basically, SMBv1 does not support Hide NAT. It regards a connection between 2 IPs as a single client and a single server, and closes down any old connections between those two IPs whenever a new connection is made.
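To make that behaviour concrete, here is a toy Python model of it. This is purely an illustration of the rule as described above, not how a real SMB server is implemented; the class name and the example addresses are invented.

```python
from collections import defaultdict


class Smb1ServerModel:
    """Toy model: the server tracks connections per client IP only."""

    def __init__(self):
        self._by_client_ip = defaultdict(list)

    def connect(self, client_ip, client_port):
        """Register a new connection and return the old ones that get reset."""
        reset = self._by_client_ip[client_ip]
        # The new connection replaces everything previously seen from this IP.
        self._by_client_ip[client_ip] = [(client_ip, client_port)]
        return reset


server = Smb1ServerModel()
# Two PCs behind Hide NAT both appear as 192.0.2.10, differing only by port.
print(server.connect("192.0.2.10", 50001))  # prints: []
print(server.connect("192.0.2.10", 50002))  # prints: [('192.0.2.10', 50001)]
```

The second PC connecting knocks the first one off, which is exactly the "new session, then RST of an old one" pattern in the capture.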

The main workaround is described in the link, namely disabling communications over port 445 for Windows XP. This forces it to use NetBIOS over TCP/IP. Woo-hoo :/

Windows Vista/7/8 talking to Windows 2008 Server and above can use SMBv2 which was written to cope with Hide NAT.

Testing so far has shown this workaround for Windows XP to be functional.

Anyway, chalk another issue down as “nowt to do with VMware” 😉