VPLEX: Severe write latency on VPlex distributed devices due to backend performance issues

           

   Article Number: 530258     Article Version: 3     Article Type: Break Fix
   

 


Product:

 

VPLEX for All Flash,VPLEX GeoSynchrony,VPLEX GeoSynchrony 6.1,VPLEX GeoSynchrony 6.1 Patch 1,VPLEX GeoSynchrony 6.0 Patch 1,VPLEX GeoSynchrony 6.0 Patch 2,VPLEX GeoSynchrony 6.0 Service Pack 1,VPLEX GeoSynchrony 6.0 Service Pack 1 Patch 1

 

Issue:

 

 

Hosts are experiencing severe write latency, which could result in a performance-related data unavailability (DU) event.
   
    On the VPLEX side, there may be a stream of host aborts (stdf/10 events) with a status beginning with 0x2a or 0x8a (the SCSI WRITE(10) and WRITE(16) opcodes), which means the hosts are aborting write I/Os:
   
    128.221.252.67/cpu0/log:5988:W/"00601672b5b475554-2":126043:<6>2019/01/10 19:39:09.68: stdf/10 Scsi Tmf [Abort Task] on fcp ITLQ: [10:00:00:00:C9:C9:AB:DC (0x10000000c9c9abdc) A0-FC00 (0x5000144260756500) 0x8000000000000 0x243] vol dd_vol taskElapsedTime(usec) 5339 dormantQCnt 0 enabledQCnt 0 status 2a00000000000605:400ac80000002     
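
    To gauge how widespread the aborts are, the firmware logs collected from the directors [for example, from a collect-diagnostics bundle] can be searched for these events. The sketch below is illustrative and assumes the log file naming shown in these examples; adjust the paths to the logs actually collected:

        grep -c "stdf/10 Scsi Tmf" firmware.log_*

    A count that keeps growing across successive log files points to an ongoing stream of host aborts rather than an isolated abort.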
   
   
   
    A stream of backend timeouts (scsi/140 events) can be seen against the storage volume that corresponds to the virtual volume presented to the impacted hosts. The opcode 0x2a in these events is again a SCSI WRITE(10) command:
        

      firmware.log_20181019000922:128.221.253.36/cpu0/log:5988:W/"0060166fc49615528-1":304763:<4>2019/01/10 19:39:09.68: scsi/140 Scsi command 0x7ec67ce41278 timeout, opcode 0x2a luid VPD83T3:60000000000000000000000000000000 nexus x fcp i 0xc0014487873b8800 t 0x5006016c47e02548 0x000f000000000000
      firmware.log_20181019000922:128.221.253.36/cpu0/log:5988:W/"0060166fc49615528-1":304764:<4>2019/01/10 19:39:09.68: scsi/140 Scsi command 0x7ec67b9c99f0 timeout, opcode 0x2a luid VPD83T3:60000000000000000000000000000000 nexus x fcp i 0xc0014487873b8800 t 0x5006016c47e02548 0x004f000000000000
      firmware.log_20181019000922:128.221.253.36/cpu0/log:5988:W/"0060166fc49615528-1":304765:<4>2019/01/10 19:39:09.68: scsi/140 Scsi command 0x7ec67f032f90 timeout, opcode 0x2a luid VPD83T3:60000000000000000000000000000000 nexus x fcp i 0xc0014487873b8900 t 0x5006016d47e02548 0x000f0000000000
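
    A similar search can show which backend storage volumes are timing out and how often. The sketch below, again assuming the collected firmware logs, simply counts scsi/140 timeouts per VPD83 identifier (luid), which identifies the backend storage volume:

        grep "scsi/140 Scsi command" firmware.log_* | grep -o "VPD83T3:[0-9a-fA-F]*" | sort | uniq -c | sort -rn

    The identifiers with the highest counts are the storage volumes on which to focus the backend investigation.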
   
   
    Performance degradation events (amf/249) can also be observed against the impacted storage volumes:
   
    128.221.253.67/cpu0/log:5988:W/"0060166fd1a610335-2":2359857:<4>2019/01/10 19:39:09.6: amf/249 Amf sop_xxxx performance has degraded. Average write I/O latency increased from 0.0 milliseconds to 216.197 milliseconds, which is above the acceptable limit of 200 milliseconds.     
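
    The storage volumes named in these events can be tied back to the virtual volume presented to the impacted hosts by walking the storage hierarchy from the VPLEX CLI. A minimal sketch, assuming the drill-down command as documented in the CLI guide (exact options may vary by GeoSynchrony release) and using the dd_vol name from the abort events above purely for illustration:

        VPlexcli:/> drill-down --virtual-volume dd_vol

    The output lists the device tree down to the extents and storage volumes, which can then be matched against the luids seen in the scsi/140 timeouts.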
   
   
   
    A RAID-1 mirror leg built on a poorly performing storage volume can degrade the performance of the entire RAID-1 device and increase I/O latency for the applications using that device. This is because a write I/O is only acknowledged back to the host after it has been written to both legs, so the slower (poorly performing) leg dictates the overall write latency. For example, if the healthy leg completes a write in 2 milliseconds but the degraded leg takes 200 milliseconds, the host sees roughly 200 milliseconds of write latency.
   
                                                                

 

 

Cause:

 

 

This can be caused by any backend array or backend fabric issue that results in a stream of backend timeouts being seen on the VPLEX.

 

 

Resolution:

 

 

Enabling the mirror isolation feature mitigates the high write latency by isolating the poorly performing mirror leg and stopping I/O from being issued to it. This should allow the applications to recover while the backend issues are being resolved.
   
    VPlexcli:/> device mirror-isolation enable     
     
      VPlexcli:/> device mirror-isolation show     
        Cluster    Enabled  Auto unisolation  Isolation Interval  Unisolation Interval     
        ---------  -------  ----------------  ------------------  --------------------     
        cluster-1  true     true              60                  14400     
        cluster-2  true     true              60                  14400
   
   
   
    This feature automatically isolates [stops issuing I/O to] poorly performing RAID-1 legs. It is enabled per cluster.
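
    Once the backend problem has been fixed and latencies have returned to normal, the feature can be turned off again per cluster if it is no longer wanted. A minimal sketch, assuming the disable counterpart of the enable command shown above:

        VPlexcli:/> device mirror-isolation disable
        VPlexcli:/> device mirror-isolation show

    The show output should then report Enabled as false for each cluster.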
   
    Pros:

  1. Automatic functionality that can isolate poorly performing legs quickly after the issue occurs [usually within a few minutes].
  2. When a device leg becomes unisolated, the rebuild happens automatically and resynchronizes only the changes that occurred while the leg was isolated. This usually takes only a matter of minutes [the rebuild progress can be checked with the example shown after the Cons list].
  3. The feature can be easily and quickly enabled and disabled.
    Cons:

  1. While a device leg is isolated, the top-level device no longer has redundancy.
  2. Once a device leg is isolated, VPLEX will not check whether to unisolate it for 4 hours [the 14400-second unisolation interval shown above; this avoids the situation where intermittent performance issues cause intermittent performance impact].
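
    After an isolated leg is automatically unisolated, the incremental resync mentioned above can be monitored from the CLI. A minimal sketch, assuming the standard rebuild status command:

        VPlexcli:/> rebuild status

    This should list any rebuilds in progress and how far along they are; once the affected RAID-1 device no longer appears, its redundancy has been fully restored.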
   
   
    For further details about the mirror isolation feature, refer to the VPLEX Administration Guide.
   
    Mirror isolation only relieves the symptom and stops the performance impact on the hosts; the root cause of the performance degradation still needs to be investigated. Engage the backend array and backend fabric teams to investigate further.
   
    NOTE:   
    Also reference KB 530520, "VPLEX: Single component failures in the fabric or Array controllers can lead to ongoing Performance DU on hosts accessing storage through VPLEX"