4 Replies Latest reply: Jun 5, 2011 6:30 AM by msodonnell

Adding additional HBA/WWN resets host failover mode to 1

msodonnell

We are in the process of migrating our systems from a CX700 to a VNX 5700.  The VNX is on a completely new set of SAN switches.

 

Our first group is about 15 HP Blades running VMware 4.1 U1.  We set up the blade chassis configuration so that each blade had one HBA attached to the switch with the CX700 and one HBA on the fabric with the VNX attached.  We then used Storage vMotion to move the virtual machines (about 130 VMs, 10 TB of data).

 

When we first connected the blades to the VNX, I set the host initiators in the VNX to ALUA/failover mode 4, and set the VMware end to "round robin" (we're using the native VMware multipathing).  Everything moved OK, and we have been running with all the VMware data on the VNX for a couple of weeks.
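
For reference, the change was made along these lines (the SP address, host name, and device ID here are placeholders rather than our actual values):

# Array side: set the registered host's initiators to failover mode 4 (ALUA)
naviseccli -h <SP_A_address> storagegroup -sethost -host esx-blade-01 -failovermode 4 -arraycommpath 1

# ESX 4.1 side: set round robin on a device claimed by the native multipathing (NMP)
esxcli nmp device setpolicy --device naa.60060160xxxxxxxx --psp VMW_PSP_RR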

 

Today, we wanted to go ahead and switch the HBA connections that had been on the CX700 over to the VNX environment to provide redundancy.  We confirmed that there was no storage traffic going to the CX700 (the LUNs had been removed from the storage group).  We unplugged the HBAs from the CX700; no problems showed up.  We then attached the fiber cables to the new switch with the VNX.

 

At that point, all the VMs stopped responding on the network.  All paths from the blades to the VNX showed "dead".  Thinking it was something with the moved connections, we moved the cables back to the CX700.  Still, all paths were dead.

 

We then looked in Unisphere, under "Connectivity Status".  Each blade host showed the additional WWNs from the second HBA, but the failover mode on the host record had been reset to "failover mode 1".  Since VMware had been expecting mode 4, I figure that's why it lost connectivity on those paths.  Resetting the host in the VNX back to failover mode 4, then re-scanning the storage in VMware, fixed the problem for that host.
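
The recovery on each host was basically the same failover mode change on the array again, followed by a rescan; the adapter names here are just examples:

# Rescan the HBAs on the ESX 4.1 host (or use "Rescan All" from vCenter)
esxcfg-rescan vmhba1
esxcfg-rescan vmhba2

# Confirm the paths are no longer showing as dead
esxcfg-mpath -l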

 

Why would adding additional WWNs to the host entry cause the failover mode to reset to "1"?  Is this a bug in the VNX software, or is there a reason to reset it when adding the extra WWNs?

 

Any suggestions would be appreciated.

Mike O'Donnell

  • 1. Re: Adding additional HBA/WWN resets host failover mode to 1
    Beagless

    Hi Mike

    I have seen that happen on a CLARiiON: I set the failover mode to 4, went back later, and it had changed to 1. Have you run the relevant commands on your VMware hosts to change to ALUA as well?

     

    Here is a link to someone who had a similar issue, with all the commands he had to run to change VMware to use ALUA. Alternatively, you could use PowerPath to look after all the path management for you:

     

    http://www.boche.net/blog/index.php/2010/02/04/configure-vmware-esxi-round-robin-on-emc-storage/
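
    Roughly, changing ESX 4.x over to round robin for ALUA devices looks something like the following (the SATP name is whatever actually claims your VNX LUNs, so treat these as examples):

    # see which SATP claims the VNX/CLARiiON devices and what PSP they are using
    esxcli nmp satp list
    esxcli nmp device list

    # make round robin the default path selection policy for the ALUA CX SATP
    esxcli nmp satp setdefaultpsp --satp VMW_SATP_ALUA_CX --psp VMW_PSP_RR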

     

     

    Anyway, have a read of the above; it may help you out.

     

    paul

  • 2. Re: Adding additional HBA/WWN resets host failover mode to 1
    msodonnell

    I had ALUA working fine on both the VNX and VMware sides.  I had actually seen the site you mentioned when I was doing the research to configure ALUA.

     

    I just can't figure out why adding an additional WWN to the host record on the VNX would reset its mode back to 1.  I would think that once it's set for the host, it should stay, no matter what HBA/WWN changes you make.

     

    I'm just trying to find out if it's a bug (or "undocumented feature"), or if there's an intentional reason to reset it back.

  • 3. Re: Adding additional HBA/WWN resets host failover mode to 1
    glen

    This might be related to what you're experiencing:

     

    glen

     

    The following is a Primus(R) eServer solution:

     

    ID:  emc262738
    Domain: EMC1
    Solution Class: 3.X  Compatibility

     

    Goal       Why is the failover mode on the array changing from 4 (ALUA) to 1 after a Storage Processor reboot or an NDU?

     

    Fact       Product: CLARiiON CX4 Series

    Fact       EMC Firmware: FLARE Release 30

    Fact       Product: VMware ESX Server 4.0

    Fact       Product: VMware ESX Server 4.1

    [NOT] Fact This statement does not apply: Product: VMware ESX Server 5.x

     

     

    Symptom    After a storage processor reboot (either because of a non-disruptive upgrade [NDU] or other reboot event), the failover mode for the ESX 4.x hosts changed from 4 (ALUA) to 1 on all host initiators.

     

    Cause      On this particular array, a Host LUN Zero had not been configured for each Storage Group, which allowed the array to present a "LUNZ" to the host.  All host initiators had been configured to failover mode 4 (ALUA).  When the storage processor rebooted due to a non-disruptive upgrade (NDU) and the connection was reestablished, the ESX host saw the LUNZ as an active/passive device and sent a command to the array to set the failover mode to 1.  This changed the failover mode setting for all the LUNs in the Storage Group, and since the Failover Policy on the host was set to FIXED, the host lost access to the LUNs while one SP was rebooting.

     

    Fix        VMware will fix this issue in an upcoming patch for ESX 4.0 and 4.1.  ESX 5.x does not have this issue.

    To work around this issue, you can bind a small LUN, add it to the Storage Group, and configure it as Host LUN 0 (zero).  You will need to reboot each host after adding the HLU 0.  Each Storage Group needs its own HLU 0.  See solution emc57314 for information on changing the HLU.

    These are the directions from VMware for the workaround:

    1. Present a 1.5 GB or larger LUN 0 to all ESX hosts.  (This volume does not need to be formatted, but must be equal to or larger than 1.5 GB.)
    2. Roll a reboot through all hosts to guarantee that they are seeing the LUN 0 instead of the LUNZ.  A rescan may work, but a reboot guarantees that they will not have any legacy data for the CommPath volume.
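
    For illustration, the array-side part of that workaround would look something like this (SP address, storage group name, RAID group and LUN numbers are examples only; a pool LUN created with the "lun -create" command would do the same job):

    # bind a small classic LUN (example: 2 GB LUN 200 on RAID group 0)
    naviseccli -h <SP_A_address> bind r5 200 -rg 0 -cap 2 -sq gb

    # add it to the storage group as host LUN 0
    naviseccli -h <SP_A_address> storagegroup -addhlu -gname ESX_Blades -hlu 0 -alu 200

    # on each ESX 4.x host, check whether the LUNZ is still visible (model "LUNZ")
    # before and after the reboot/rescan
    esxcfg-scsidevs -l | grep -i model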
  • 4. Re: Adding additional HBA/WWN resets host failover mode to 1
    msodonnell

    Thanks for the info. 

     

    It does sound like a similar effect, but I'm not sure it was the same cause.  A couple of days before the issue with the reset, we had both SPs replaced.  During that time, as each SP went down and then back up, VMware did exactly what it was supposed to do.  The paths to the down SP showed "dead", but the other paths worked fine.

     

    The issue we had later was when we basically added additional host initiators to the existing host record on the VNX.  Still, it sounds a lot like the effect we had, so I'll go ahead and set up a small LUN 0 in the group.

     

    Actually, on the CX700, we had a 5 GB LUN "0" that we had set up years ago to address some of the other "LUNZ" issues we had with VMware.  When we set up the VNX, apparently that LUN wasn't brought over.  Fortunately, the other LUNs were created on the VNX with non-zero host LUN values, so it will be easy to create a small LUN 0.

     

    Thanks again.

    Mike O.