XtremIO X2: Use of QoS and ODX can lead to host-side data unavailability


When the QoS and ODX features are both in active use on an XtremIO cluster, several volumes on the array may become inaccessible to the host, resulting in host-side Data Unavailability. Attempts to reboot the affected host or re-map the volumes may fail.






The cause is a software issue in the ODX state machine that occurs when the XtremIO cluster is using the Quality of Service (QoS) feature.
    Commands that executed successfully become stuck in the kernel driver queues, which stalls the SCSI sessions.






This issue affects environments in which all of the following conditions are present:
    1. One of the following versions is present on the XtremIO cluster:

  •         XtremIO XIOS 6.2.0-85
  •         XtremIO XIOS 6.2.1-36
  •         XtremIO XIOS 6.3.0-63
    2. Quality of Service (QoS) is enabled on the XtremIO cluster.
    3. ODX is enabled on the XtremIO cluster.
    4. Windows hosts using the XtremIO cluster are set up to use the ODX feature.
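
The four conditions above combine with a logical AND, so an environment is exposed only when every one of them holds. A minimal sketch of that check follows. The function name and parameters are hypothetical and are not part of any XtremIO tool or API. The caller is assumed to have gathered the version string and the feature states by other means.

```python
# Affected XIOS builds, copied from the list above.
AFFECTED_XIOS_VERSIONS = {"6.2.0-85", "6.2.1-36", "6.3.0-63"}


def is_exposed(xios_version: str,
               qos_enabled: bool,
               odx_enabled_on_cluster: bool,
               windows_hosts_use_odx: bool) -> bool:
    """Return True only when all four conditions are present.

    Hypothetical helper for illustration; the inputs must be
    collected by the administrator beforehand.
    """
    return (
        xios_version in AFFECTED_XIOS_VERSIONS
        and qos_enabled
        and odx_enabled_on_cluster
        and windows_hosts_use_odx
    )


if __name__ == "__main__":
    # All conditions present: the environment is exposed.
    print(is_exposed("6.2.1-36", True, True, True))
    # QoS disabled: one condition is missing, so not exposed.
    print(is_exposed("6.2.1-36", False, True, True))
```

Because the check is a conjunction, removing any single condition is enough to take the environment out of scope.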






    If the problem described in the Issue section above is currently active, immediately open an SR with Dell EMC Global Technical Support and reference this KB. A maintenance window must be scheduled during a period of low I/O, because an XtremIO Technical Support Engineer needs to reboot every Storage Controller in the cluster sequentially to release the stalled SCSI sessions in the kernel driver queues.
    (IMPORTANT: Before any action is performed, review the following KB: XtremIO: Preparing for an XtremIO Storage Controller Replacement (SC FRU) or Reboot. Failure to check for the issues mentioned in that KB may lead to additional Data Unavailability on any remaining hosts/LUNs using the XtremIO cluster.)
    To prevent the issue, ensure that at least one of the conditions from the Change section above is not present. The two simplest options are:

  1.         Use the ODX feature without QoS (ODX enabled, QoS disabled)
  2.         Use the QoS feature without ODX (ODX disabled, QoS enabled)
    Note that for X2 clusters, disabling ODX on the XtremIO cluster does not require stopping the cluster.   
    Permanent fix   
    A fix will be available in a future XtremIO XIOS release.                                                           






If remapping a volume was attempted, the volume may become stuck in a "remove_pending" state, which is visible in the output of the show-lun-mappings command.