The case of the disappearing datastore

I was asked this week to investigate an issue where out of multiple hosts and multiple datastores, there was one host which couldn’t access a single datastore.

In the past, I’ve seen issues where a datastore was only visible to a single host in the cluster, or a host had lost access to all datastores, but never a single datastore unavailable on a single host.

Talking someone else through diagnosing something obscure like this is never easy, but getting hold of the tail of the vmkernel.log was enlightening:

2014-06-17T11:03:05.457Z cpu31:64368)WARNING: HBX: 1889: Failed to initialize VMFS3 distributed locking on volume 529dc221-e68ab3c4-3e52-b499baa3e4c6: Not supported
2014-06-17T11:03:05.457Z cpu31:64368)FSS: 890: Failed to get object f530 28 1 529dc221 e68ab3c4 99b43e52 c6e4a3ba 0 0 0 0 0 0 0 :Not supported
2014-06-17T11:03:05.457Z cpu31:64368)WARNING: Fil3: 2034: Failed to reserve volume f530 28 1 529dc221 e68ab3c4 99b43e52 c6e4a3ba 0 0 0 0 0 0 0
2014-06-17T11:03:05.457Z cpu31:64368)FSS: 890: Failed to get object f530 28 2 529dc221 e68ab3c4 99b43e52 c6e4a3ba 4 1 0 0 0 0 0 :Not supported
2014-06-17T11:03:05.583Z cpu36:64370)HBX: 676: Setting pulse [HB state abcdef02 offset 3710976 gen 35 stampUS 85863025463 uuid 539ed180-c09ff563-6044-e4115b10555a jrnl <FB 0> drv 14.54] on vol 'Datastorename' failed: Not supported
2014-06-17T11:03:05.583Z cpu36:64370)WARNING: FSAts: 1263: Denying reservation access on an ATS-only vol 'Datastorename'

A bit of googling found this fix.

Basically, what appears to have happened is that when the datastore was created, Hardware Assisted Locking (VAAI ATS) was available on the storage array. Because it was available, the VMFS filesystem was created with the flag set to use it exclusively (ATS-Only).
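If you want to check whether a particular datastore carries that flag, the VMFS header can be queried from the ESXi shell; as far as I know, vmkfstools reports the locking mode (the datastore name here is just the one from the log above):

vmkfstools -Ph -v1 /vmfs/volumes/Datastorename

On an affected volume the output should include a line along the lines of "Mode: public ATS-only" rather than plain "Mode: public".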

At some point since then, the array stopped supporting ATS. It would seem that this host had then lost access to the datastore and attempted to reconnect (perhaps after a reboot), which failed because the ATS-Only flag was still set but the array no longer supported that locking mechanism.
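A quick way to confirm what the array is currently advertising is to ask the host for the VAAI status of the backing device (the naa ID below is a placeholder, substitute the device behind your datastore):

esxcli storage core device vaai status get -d naa.xxxxxxxxxxxxxxxx

If the ATS Status line comes back as unsupported while the volume is still flagged ATS-Only, you would expect exactly the sort of "Not supported" errors shown in the log above.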

A few days later, before the fix was implemented, a power outage took out half the hosts in the farm. When they came back, unsurprisingly, none of them could access the datastore.

Implementing that fix (vmkfstools --configATSOnly 0 /vmfs/devices/disks/device-ID:Partition) removed the setting and restored access to the datastore for all hosts.
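For anyone who needs to do the same, the rough sequence (with placeholder device names, and worth double-checking against VMware's own documentation for your version) is: find the device and partition backing the datastore, clear the ATS-Only flag from one host, then rescan storage on the other hosts so they pick the volume up again:

esxcli storage vmfs extent list
vmkfstools --configATSOnly 0 /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx:1
esxcli storage core adapter rescan --all

The extent list gives you the Device Name and Partition values to plug into the vmkfstools command.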

One thought on "The case of the disappearing datastore"

  1. We had a more dangerous case. We added new hosts, and at some point two hosts had conflicting MAC addresses (one got its MAC from the hardware and the other from a restored configuration). We fixed that. But later we discovered it had caused a datastore locking conflict: somehow the assignment of slots for new hosts was broken, and yet another new host got a conflicting ID (there is something like this going on, but it isn't well documented). As a result, that host froze and refused to mount the datastore after a restart (and a few VMs were temporarily lost).

    After a few days of heavy troubleshooting, we started moving VMs off and unmounting the datastore from hosts one by one. Once it was unmounted from both hosts that had had the conflicting MACs (a conflict which happened two weeks earlier and had long since been resolved), the new host was able to mount it and all the locks were released.

    So it is not all as easy as it looks. Each host gets an ID when it mounts a datastore. That ID records something about the MAC address and somehow influences locking. When a host locks objects it uses this ID, and that determines which iSCSI object it uses. Maybe there is a master that allocates these, no idea. In any case, there are scenarios where the allocation goes wrong, and that can leave a host locked out of the datastore, among other bad things. The remedy is to move the VMs out and then unmount the datastore from the other hosts (maybe two at a time is enough). Once the conflicting hosts have unmounted the datastore, the locks are released. And if you ever had a MAC conflict, it can have consequences long into the future, even once the conflict itself has been resolved.
