VPLEX: Storage-volume is marked degraded then gets cleared afterwards

           

   Article Number: 518037     Article Version: 5     Article Type: Break Fix
   

 


Product:

 

VPLEX Series, VPLEX VS2, VPLEX Metro, VPLEX GeoSynchrony 5.4 Service Pack 1, VPLEX GeoSynchrony 5.4 Service Pack 1 Patch 1, VPLEX GeoSynchrony 5.4 Service Pack 1 Patch 3, VPLEX GeoSynchrony 5.4 Service Pack 1 Patch 4

 

Issue:

 

 

The cluster is reporting degraded storage volumes in both the 'cluster status' output and the 'storage-volume summary' output:
        

      VPlexcli:/> cluster status       
        Cluster cluster-1       
                operational-status:            ok       
                transitioning-indications:       
                transitioning-progress:       
                health-state:                  degraded         
                health-indications:            1 degraded storage-volumes
       
                local-com:                     ok       
       
       
        VPlexcli:/> storage-volume summary       
        SUMMARY (cluster-1)       
        StorageVolume Name   IO Status  Operational Status  Health State       
        -------------------  ---------  ------------------  ------------       
        test-1                 alive      degraded            degraded       
                
        Storage-Volume Summary  H Tier               L Tier              (no tier)             Total       
        ----------------------  -------------------  ------------------  --------------------  --------------------       
                
        Health                  out-of-date       0  out-of-date      0  out-of-date        0  out-of-date        0       
                                storage-volumes  53  storage-volumes  2  storage-volumes  280  storage-volumes  335       
                                unhealthy         0  unhealthy        0  unhealthy          1  unhealthy          1       
                
        Vendor                  DGC              53  DGC              2  DGC              280  DGC              335       
                
        Use                     used             53  used             2  meta-data          4  meta-data          4       
                                                                         unclaimed          3  unclaimed          3       
                                                                         used             273  used             328       
                
        Capacity                total         20.7T  total          51G  total           112T  total           133T
   
   
   
    When listing a degraded volume, the health-indications field shows degraded read or write latency:
        
      VPlexcli:/> ll clusters/cluster-1/storage-elements/storage-volumes/test-1/       
       
        /clusters/cluster-1/storage-elements/storage-volumes/test-1:       
        Name                           Value       
        -----------------------------  ------------------------------------------------       
        application-consistent         false       
        block-count                    2621440       
        block-size                     4K       
        capacity                       10G       
        description                    -       
        free-chunks                    []       
        health-indications             [Director director-1-1-A reports         
                                         degraded-read-latency]          
        health-state                   degraded
       
        io-status                      alive       
        itls                           0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6,       
                                       0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6,       
                                       0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6,       
                                       0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6,       
                                       0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6,       
                                       0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6,       
                                       0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6,       
                                       0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6,       
                                       0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6,       
                                       0x500014xxxxxxxxxx/0x500601xxxxxxxxxx/6, ... (16       
                                       total)       
        largest-free-chunk             0B       
        locality                       -       
        operational-status             ok       
        provision-type                 legacy       
        storage-array-name             EMC-CLARiiON-APM00xxxxxxxxx       
        storage-volumetype             normal       
        system-id                      VPD83T3:60060xxxxxxxxxxxxxxxxxxxxxxxxx       
        thin-capable                   false       
        thin-rebuild                   false       
        total-free-space               0B       
        underlying-storage-block-size  512       
        use                            used       
        used-by                        [extent_test-1_1]       
        vendor-specific-name           DGC
   
   
   
    When the same 'cluster status' and 'storage-volume summary' commands are executed again after some time, they either show no degraded volumes or show different degraded volumes than before.
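    For example, once the latency has returned within the acceptable range, the same 'cluster status' command reports the cluster as healthy again. The output below is illustrative only, based on the example shown earlier in this article:

      VPlexcli:/> cluster status
        Cluster cluster-1
                operational-status:            ok
                transitioning-indications:
                transitioning-progress:
                health-state:                  ok
                health-indications:
                local-com:                     ok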
   
    When checking the firmware logs, events amf/249 and amf/250 are reported against the LUN(s) that were reported as degraded. This means that the degradation was due to exceeding the performance policy limits; once the latency is within the acceptable range again, the degraded status is cleared.
        
      128.221.252.38/cpu0/log:5988:W/"00601660dd49152453-1":104786:<4>2018/02/17 15:30:14.45: amf/249 Amf test-1 performance has degraded. Average write I/O latency increased from 0.0 milliseconds to 208.431 milliseconds, which is above the acceptable limit of 200 milliseconds.         
      128.221.252.38/cpu0/log:5988:W/"00601660dd49152453-1":104787:<6>2018/02/17 15:35:14.59: amf/250 Amf test-1 performance is now acceptable. Average write I/O latency decreased from 208.431 milliseconds to 0.0 milliseconds, which is below the threshold of 20 milliseconds.
   
   
    The read and write latency thresholds beyond which a LUN is declared degraded are defined by the performance policy limits shown below.
        
      VPlexcli:/clusters/cluster-1/performance-policies> ll     
      Name                    Enabled     
      ----------------------  -------     
      storage-volume-latency  true     
            
      VPlexcli:/clusters/cluster-1/performance-policies> ll storage-volume-latency/     
            
      /clusters/cluster-1/performance-policies/storage-volume-latency:     
      Name                   Value     
      ---------------------  -----     
      average-read-latency   20ms     
      average-write-latency  20ms     
      enabled                true     
       maximum-read-latency   200ms
       maximum-write-latency  200ms
       sampling-period        5min
   
                                                                

 

 

Cause:

 

 

Before GeoSynchrony 5.5, storage volumes were declared degraded based on actual I/O timeouts; specifically, a volume was marked degraded if eight (8) or more SCSI timeouts were seen in two consecutive 3-minute periods.
   
    Starting from GeoSynchrony 5.5, storage-volume performance policies were introduced (enabled by default), and storage volumes are marked degraded based on the policy thresholds. Refer to the VPLEX Administration Guide for 5.5.x and later, section "Storage volume degradation and restoration".
   
    The table below, which is not included in the Administration Guide, details these criteria:
    [Image: table of storage-volume degradation and restoration criteria]
                                                           

 

 

Resolution:

 

 

According to the storage-volume performance policies, when a LUN exceeds the latency threshold (200 ms by default, as explained in the Administration Guide), it is declared degraded; when the latency returns within the acceptable limit, the degraded status is cleared.
    In doing so, VPLEX is behaving normally and as expected.
   
    Check with the back-end array team for the possible cause of the latency.
   
    To modify the storage-volume performance policy, use the "set" command from inside the clusters/cluster-x/performance-policies/storage-volume-latency context, as in the example below.
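    For illustration only (the values below are examples, not recommendations; the exact syntax and accepted values may vary by GeoSynchrony release, so confirm the settable attributes with the CLI help for "set" in that context):

      VPlexcli:/> cd /clusters/cluster-1/performance-policies/storage-volume-latency/
      VPlexcli:/clusters/cluster-1/performance-policies/storage-volume-latency> set maximum-write-latency 300ms
      VPlexcli:/clusters/cluster-1/performance-policies/storage-volume-latency> set maximum-read-latency 300ms
      VPlexcli:/clusters/cluster-1/performance-policies/storage-volume-latency> ll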
   
    The following guidelines should be followed:   

         
  •  VPLEX chooses the maximum (read or write) latency by picking the lowest value from the set of maximum latency requirements of all the applications the system supports. This allows VPLEX to report performance based on the needs of the most performance-critical application.
  •  VPLEX chooses the average (read or write) latency by picking the highest value from the set of average latency requirements of all the applications the system supports.
  •  Be cautious when setting the sampling period to a value lower than 3-5 minutes, as frequent sampling of performance data can keep the VPLEX system busy, especially on a scaled setup.
  •  Ensure the settings are the same on both VPLEX clusters in a Metro environment, so that VPLEX as a system behaves properly for distributed volumes (see the example after this list).
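     As an illustration (cluster names are examples), the policy settings on the two clusters can be compared by listing each cluster's storage-volume-latency context; the attribute values in both outputs should match:

      VPlexcli:/> ll /clusters/cluster-1/performance-policies/storage-volume-latency/
      VPlexcli:/> ll /clusters/cluster-2/performance-policies/storage-volume-latency/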
                                                             

 

 

Notes:

 

 

VPLEX GeoSynchrony code levels 5.3.x and earlier are End of Service Life (EOSL).
    VPLEX GeoSynchrony 5.4.x is End of Life (EOL) as of 30 April 2018.
    If you are running any of the EOSL or EOL code versions, you need to upgrade to GeoSynchrony 5.5 or later.