I recently experienced an intermittent VM outage as a result of a VNX OE upgrade (the file side), and I'm still struggling to determine the root cause. Please note that this environment has been in production for months and has survived online failover/failback activities on at least four other occasions. The failover/failback times consistently fall in the 20-40 second range. The environment is ESX 4.1 with VNX (all NFS).
These are the advanced ESX parameters that have come into the spotlight for NFS Locking:
A VMware KB article describing NFS configuration options states (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007909#NFSLock):
…When a lock file is created an update is periodically (every NFS.DiskFileLockUpdateFreq seconds) sent to the lock file to let other ESX hosts know that the lock is still active. The lock file updates generate small (84 byte) WRITE requests to the NFS server. Changing any of the NFS locking parameters changes how long it takes to recover stale locks. This formula can be used to calculate how long it takes to recover a stale NFS lock:
(NFS.DiskFileLockUpdateFreq * NFS.LockRenewMaxFailureNumber) + NFS.LockUpdateTimeout
This example demonstrates the above equation using the default values in VMware ESX 3.5 (same for 4.x):
X is the length of time it takes to recover from a stale NFS lock.
X = (NFS.DiskFileLockUpdateFreq * NFS.LockRenewMaxFailureNumber) + NFS.LockUpdateTimeout
X = (10 * 3) + 5
X = 35 seconds
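The arithmetic above can be reproduced with a short Python sketch. The function and argument names are hypothetical; only the three parameter names and their default values (10, 3, 5) come from the KB excerpt.

```python
def stale_lock_recovery_seconds(update_freq=10, renew_max_failures=3, update_timeout=5):
    """Time to recover a stale NFS lock, per the KB formula.

    update_freq        -> NFS.DiskFileLockUpdateFreq (seconds)
    renew_max_failures -> NFS.LockRenewMaxFailureNumber
    update_timeout     -> NFS.LockUpdateTimeout (seconds)
    """
    return update_freq * renew_max_failures + update_timeout


# Defaults quoted in the KB article: (10 * 3) + 5
print(stale_lock_recovery_seconds())
```

Passing different values shows how changing any one parameter moves the recovery time, which is all the KB formula itself claims.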
When some of the VMs crashed during the upgrade (not all of them, just some), the errors were consistent with a missing lock file as the cause of the crash. Even though Duncan is discussing a different situation, the following post shows the identical error: http://www.yellow-bricks.com/2010/03/29/cool-new-ha-feature-coming-up-to-prevent-a-split-brain-situation/
The cluster has been configured using the Best Practice recommendations for “NFS heartbeats” which result in a 125 second “buffer” before an NFS datastore is marked unavailable (http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html and http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007909).
What I’m seeking to understand is what the “NFS Locking” parameters really do when they are left at their default settings. My suspicion is that a VM has the potential to be powered off after 35 seconds if the NFS lock is lost on the ESX server that’s hosting the guest. If this is the case, I’ve been operating under the false assumption that an NFS environment has ~2 minutes before it's in real danger of VMs crashing.
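To make the concern concrete, here is a rough Python sketch that compares an outage length against the two windows mentioned in this post: the 35 second lock recovery time and the 125 second NFS heartbeat buffer. The constant and function names are hypothetical, and treating "outage longer than the window" as the trigger is an assumption; whether exceeding the lock window actually causes a VM to power off is exactly the open question.

```python
LOCK_RECOVERY_WINDOW = 35   # seconds, KB formula with default values
HEARTBEAT_BUFFER = 125      # seconds, from the NFS heartbeat best practice


def exceeded_windows(outage_seconds):
    """Return the names of the windows that an outage of this length outlasts."""
    windows = {
        "nfs_lock_recovery": LOCK_RECOVERY_WINDOW,
        "nfs_heartbeat_buffer": HEARTBEAT_BUFFER,
    }
    return [name for name, limit in windows.items() if outage_seconds > limit]


# Failover/failback times observed in this environment span roughly 20-40 seconds.
for seconds in (20, 35, 40):
    print(seconds, exceeded_windows(seconds))
```

Under these assumptions, a failover at the upper end of the observed range outlasts the lock window while staying well inside the heartbeat buffer, which would be consistent with only some VMs being affected.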
If you’ve got any additional insight into these NFS Locking parameters, I’d really appreciate the input.
Thanks in advance,