Hello All,

 

This has come up quite a bit recently, so I will share what I know to help out. The issues we are seeing on our side manifest in many different ways, but as a rule of thumb, when you are dealing with a customer running a VMware environment on EMC storage, ALWAYS go through the checklist below.

 

Examples of cases we see on our end (not an exhaustive list):

 

  • ESX host disconnecting from Virtual Centre
  • “Slow” VM performance
  • LUN randomly becoming inaccessible on the ESX host side
  • **Some** performance cases
  • Dead paths detected on ESX hosts

 

PSPs

 

There are various multipathing policies (PSPs) available on vSphere:

 

  • MRU
  • Fixed
  • Fixed_AP
  • Round Robin

 

For more info on these, please read (page 25):

 

http://www.vmware.com/pdf/vsphere4/r40/vsp_40_san_cfg.pdf

 

Now it can be argued that a customer can use mixed pathing policies on the same LUNs across separate hosts. If they want to do this, they can expect to run into performance issues, as this is one of the main causes of path thrashing.

 

For example, say LUN3 is accessed using Round Robin on host1, FIXED on host2 and MRU on host3. When a path failover has to occur, Murphy's law applies and it will usually not work as expected.

 

In the above scenario, the ESX logs will generally fill up with all sorts of SCSI sense codes (communication issues) coming from the SP.

 

For an effective path failover to occur on the ESX side, a very specific SCSI sense code has to be received, or no failover will occur.

 

If the logs are already being filled up with the above errors (due to misconfiguration of PSPs), the failover sense code **MAY** not be received by the ESX host, resulting in a dead path.

 

Add some time into the mix, and you can have multiple dead paths, resulting in the management services of the ESX host crashing. This disconnects the host from the VC and requires a hard reboot of the host to reconnect (VM downtime) = not good.

 

The above is a very common scenario, and in theory there should never be any reason to access a LUN with different PSPs across separate hosts.

 

How to know what PSP the host should be using? (this guide does not cover PowerPath!)

 

  1. Check the FLARE code version
  2. Open the VMware HCL (VMware Compatibility Guide)
  3. Search for the SAN model (e.g. VNX7500)
  4. In the example below, the default PSP is FIXED
  5. Therefore, ALL HOSTS need to access all of their LUNs using VMW_PSP_FIXED
  6. Failover mode on the backend needs to be set to 4 (ALUA)
  7. The customer may also use Round Robin, as stated below. I would recommend sticking to the default policy; once everything has settled down, they can switch to Round Robin if they wish

 

Example:

 

http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=san&productid=19433&deviceCategory=san&partner=30&releases=76,158,148,24&keyword=VNX&arrayTypes=2&isSVA=1&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

 

hcl.png

 

 

Why use the recommended PSP?

 

Well, you “could” use different PSPs. However, under the VMware support contract, the customer's environment must adhere to the recommendations of the Compatibility Guide, and VMware may refuse support if the environment is not fully certified for use. Only certain FLARE versions are certified (by a 3rd party) for use with specific PSPs and SAN models. This is BLACK & WHITE; there are no grey areas. If a customer is using mixed PSPs, they must align them.

 

How do I check what PSPs they are using?

 

From an ESX log bundle:

 

  1. Extract the logs
  2. Browse to: /vm-support-xxxxxxxxxxxx/tmp/
  3. Locate the file called esxcli-nmp-devices.xxxxxxxx.txt
  4. Example:

 

nmp.png

 

From Virtual Centre:

 

configuration tab / storage adapters / select HBA

 

detail window:

 

right click device / manage paths

 

vi.png

 

 

Some fancy grepping:

 

Extract all esx log bundles into a single directory.

 

$ grep "Storage Array Type:" vm-support-*/tmp/esxcli-nmp-devices*|awk '{print $5}'|sort |uniq

 

VMW_SATP_ALUA_CX   (failover mode 4)

VMW_SATP_CX

VMW_SATP_LOCAL (ignore – probably a cd-rom drive…)

 

All hosts should report VMW_SATP_ALUA_CX   (failover mode 4)  only
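To see which host and which LUN is out of line, rather than just the list of SATPs in use, something like the sketch below may help. It assumes each device block in esxcli-nmp-devices starts with an unindented device ID (e.g. naa.xxx) followed by indented attribute lines; the function name non_alua_devices is made up for illustration.

```shell
# Print "file device SATP" for every device whose SATP is neither
# VMW_SATP_ALUA_CX nor VMW_SATP_LOCAL.
# Assumption: a device block begins with an unindented device ID line,
# and its attributes follow on indented lines.
non_alua_devices() {
    awk '
        /^[^ \t]/ { dev = $1; next }          # unindented line: remember device ID
        /Storage Array Type:/ && $NF != "VMW_SATP_ALUA_CX" && $NF != "VMW_SATP_LOCAL" {
            print FILENAME, dev, $NF          # last field is the SATP name
        }
    ' "$@"
}
```

Usage, from the directory holding the extracted bundles: non_alua_devices vm-support-*/tmp/esxcli-nmp-devices*. No output means every non-local device is on VMW_SATP_ALUA_CX.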

 

$ grep "Path Selection Policy:" vm-support-*/tmp/esxcli-nmp-devices*|awk '{print $5}'|sort |uniq

 

VMW_PSP_FIXED  (probably a cd-rom drive… but double check just to be sure…)

VMW_PSP_FIXED_AP

VMW_PSP_MRU

VMW_PSP_RR

 

All hosts should report only one PSP for their shared LUNs (ignoring CD-ROM drives & local block controllers)

 

 

And now the important bit: how to fix this. Align pathing policies on a per-host basis:

 

  1. Ensure all hosts are set to failover mode 4 (ALUA) on the backend (ESX 4.x & 5.x)
  2. Reboot any host that is changed
  3. Once the host is back up, put the ESX host into maintenance mode
  4. Open an SSH session to the ESX host & run the following script:

 

~ # for i in `esxcfg-scsidevs -c | awk '{print $1}' | grep -i naa`

> do

> echo "esxcli nmp device setpolicy -d $i -P VMW_PSP_FIXED"

> done

 

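The above example targets FIXED for every device, so change the policy name as appropriate if MRU / FIXED_AP / RR is required. Note that, as written, the loop only echoes the commands so they can be reviewed first; run the printed commands (or remove the echo and the quotes) to actually apply the change. As far as I know, the esxcli namespace changed on ESXi 5.x, where the equivalent command is esxcli storage nmp device set -d <device> -P <PSP>.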

 

  5. Exit host from maintenance mode

 

At this stage all hosts should now be aligned with the appropriate SATP & PSPs.
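If the policy has to be changed often, the loop from step 4 can be wrapped so that the PSP is a parameter and a typo in the policy name is caught before anything is generated. This is only a sketch: gen_setpolicy_cmds is a made-up name, it reads esxcfg-scsidevs -c style output on stdin (device ID in the first column), and it emits the ESX 4.x form of the command used above.

```shell
# Read "esxcfg-scsidevs -c" style output on stdin and print one setpolicy
# command per naa device for the PSP given as $1. Nothing is executed here.
gen_setpolicy_cmds() {
    case "$1" in
        VMW_PSP_FIXED|VMW_PSP_FIXED_AP|VMW_PSP_MRU|VMW_PSP_RR) ;;
        *) echo "unknown PSP: $1" >&2; return 1 ;;
    esac
    # First column is assumed to be the device ID; non-naa devices are skipped.
    awk -v psp="$1" '$1 ~ /^naa\./ {
        print "esxcli nmp device setpolicy -d " $1 " -P " psp
    }'
}
```

On the host this would be used as esxcfg-scsidevs -c | gen_setpolicy_cmds VMW_PSP_FIXED to review the commands, and the same pipeline with | sh on the end to apply them.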

 

Any questions, please let me know

 

Dave