Device has fixed read errors


   Article Number:     488446                                   Article Version: 3     Article Type:    Break Fix 




ScaleIO 1.32.1, ScaleIO 1.32.2, ScaleIO 1.32.3, ScaleIO 1.32.4, ScaleIO 1.32.5, ScaleIO 1.32.6, ScaleIO 2.0, ScaleIO 2.0.0, ScaleIO






      Issue Description   


      SDS device(s) have errors stating "Device has fixed read errors".   




      This can be seen when an SDS device has read errors that have been corrected by the background device scanner.   




      Several symptoms can appear when the background device scanner is enabled:   


  •         The GUI shows the error "Device has fixed read errors" on the affected device.

  •         The "scli --query_sds --sds_id <SDS_ID>" output shows an "Error-fixes" counter for each device with corrected read errors:


              15: Name: /dev/sdr Path: /dev/sdr Original-path: /dev/sdr ID: 2d63f7c80003000e           Storage Pool: SAS_pool1, Capacity: 1116 GB Error-fixes: 6 scanned 0 MB, Compare errors: 0 State: Normal        



              * Note: The device may also currently be in an "Error" state.
              ** Note: No event is written to the MDM events log to indicate that this "fixed read errors" condition occurred.
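Because no MDM event is raised, the counter has to be read from the query output itself. The following is a minimal sketch of how saved "--query_sds" output could be scanned for non-zero "Error-fixes" values. It assumes the device line layout of the sample above; the function name and pattern are illustrative and not part of ScaleIO, and other versions may print the fields differently.

```python
import re

# Assumed layout: "... Path: <path> ... ID: <id> ... Error-fixes: <n> ... State: <state>"
# as in the sample output in this article. Not an official output specification.
DEVICE_RE = re.compile(
    r"Path:\s*(?P<path>\S+).*?ID:\s*(?P<dev_id>[0-9a-f]+).*?"
    r"Error-fixes:\s*(?P<fixes>\d+).*?State:\s*(?P<state>\w+)",
    re.DOTALL,
)

def devices_with_fixed_read_errors(query_sds_output):
    """Return (path, device ID, Error-fixes count, state) for devices with a non-zero count."""
    found = []
    for m in DEVICE_RE.finditer(query_sds_output):
        fixes = int(m.group("fixes"))
        if fixes > 0:
            found.append((m.group("path"), m.group("dev_id"), fixes, m.group("state")))
    return found

sample = (
    "15: Name: /dev/sdr Path: /dev/sdr Original-path: /dev/sdr ID: 2d63f7c80003000e "
    "Storage Pool: SAS_pool1, Capacity: 1116 GB Error-fixes: 6 scanned 0 MB, "
    "Compare errors: 0 State: Normal"
)
print(devices_with_fixed_read_errors(sample))
# [('/dev/sdr', '2d63f7c80003000e', 6, 'Normal')]
```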


      Possible symptoms that may show in other locations (not related to background device scanner, but overall device "health"):   


              Look for device read errors in the messages (/var/log/messages) output:          

               blk_update_request: critical medium error, dev sdr, sector 94390272
               sd 0:2:15:0: [sdr] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
               sd 0:2:15:0: [sdr] tag#1 Sense Key : Medium Error [current]
               sd 0:2:15:0: [sdr] tag#1 Add. Sense: Unrecovered read error

              LongInflightIoViolation messages in the SDS trc logs:
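As a rough illustration, the kernel messages above can be tallied per block device to see whether one disk stands out. This sketch assumes the "critical medium error, dev <name>, sector <n>" wording shown in the sample; the function name is hypothetical, and other kernels may phrase the message differently.

```python
import re
from collections import Counter

# Matches the kernel message format shown in the sample above (an assumption
# about the running kernel, not a guaranteed format).
MEDIUM_ERROR_RE = re.compile(r"critical medium error, dev (\w+), sector (\d+)")

def medium_errors_by_device(log_text):
    """Count 'critical medium error' kernel messages per block device."""
    return Counter(dev for dev, _sector in MEDIUM_ERROR_RE.findall(log_text))

sample = """\
blk_update_request: critical medium error, dev sdr, sector 94390272
sd 0:2:15:0: [sdr] tag#1 Sense Key : Medium Error [current]
"""
print(medium_errors_by_device(sample))
# Counter({'sdr': 1})
```

In practice the text would come from reading /var/log/messages on the SDS node rather than from an inline string.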

              contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: 2d63f7c80003000e (/dev/sdr) took too long, Low threshold exceeded - waited for reaper 12250 millis
              contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: 2d63f7c80003000e (/dev/sdr) took too long, Low threshold exceeded - waited for reaper 13250 millis
              contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: 2d63f7c80003000e (/dev/sdr) took too long, Low threshold exceeded - waited for reaper 14250 millis
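The trc messages above can be summarized per device to show how often long in-flight IO was reported and how long the worst wait was. This is a sketch that assumes the message wording in the sample; the helper name is hypothetical.

```python
import re

# Assumes the trc message wording shown in the sample above.
LONG_IO_RE = re.compile(
    r"IO on devId: (?P<dev_id>[0-9a-f]+) \((?P<path>[^)]+)\) took too long, "
    r".*?waited for reaper (?P<millis>\d+) millis"
)

def longest_waits(trc_text):
    """Map each device path to (message count, longest reported wait in ms)."""
    stats = {}
    for m in LONG_IO_RE.finditer(trc_text):
        count, longest = stats.get(m.group("path"), (0, 0))
        stats[m.group("path")] = (count + 1, max(longest, int(m.group("millis"))))
    return stats

prefix = ("contDevMngr_HandleLongInflightIoViolation:02998: IO on devId: "
          "2d63f7c80003000e (/dev/sdr) took too long, Low threshold exceeded - "
          "waited for reaper ")
sample = "\n".join(prefix + f"{ms} millis" for ms in (12250, 13250, 14250))
print(longest_waits(sample))
# {'/dev/sdr': (3, 14250)}
```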

              Output from sdbg_out.txt of the SDS which contains the device in question:         

              13: Dev path:/dev/sdr Size(lbs):0 Time grn:520577464
                Io Counters:
                  GENERAL
                    Writes: 4852 Lbs: 2160443 MBs: 1054 Errors: 0
                    Reads: 49283 Lbs: 111376 MBs: 54 Errors: 12744
                  BM
                    Writes: 0 Lbs: 0 MBs: 0 Errors: 0
                    Reads: 0 Lbs: 0 MBs: 0 Errors: 0
                  COMB_MAP
                    Writes: 5 Lbs: 1390 MBs: 0 Errors: 2
                    Reads: 0 Lbs: 0 MBs: 0 Errors: 0
                  TOOTH_MAP
                    Writes: 426 Lbs: 688528 MBs: 336 Errors: 424
                    Reads: 0 Lbs: 0 MBs: 0 Errors: 0
                  IO
                    Writes: 4319 Lbs: 603064 MBs: 294 Errors: 16
                    Reads: 2076 Lbs: 16608 MBs: 8 Errors: 22
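To put the error counts in proportion, the per-section counters can be extracted and turned into error ratios. The sketch below assumes the counter layout of the sample above and text covering a single device; the function names are hypothetical, and no threshold is implied.

```python
import re

# Assumed layout per section: "<SECTION> Writes: n Lbs: n MBs: n Errors: n
# Reads: n Lbs: n MBs: n Errors: n", as in the sample above.
SECTION_RE = re.compile(
    r"(?P<section>[A-Z_]+)\s+"
    r"Writes:\s*(?P<writes>\d+)\s+Lbs:\s*\d+\s+MBs:\s*\d+\s+Errors:\s*(?P<werr>\d+)\s+"
    r"Reads:\s*(?P<reads>\d+)\s+Lbs:\s*\d+\s+MBs:\s*\d+\s+Errors:\s*(?P<rerr>\d+)"
)

def parse_io_counters(sdbg_text):
    """Map section name to (writes, write errors, reads, read errors) for one device."""
    return {
        m.group("section"): tuple(
            int(m.group(k)) for k in ("writes", "werr", "reads", "rerr")
        )
        for m in SECTION_RE.finditer(sdbg_text)
    }

def error_ratio(operations, errors):
    """Fraction of operations that reported an error (0.0 when there were none)."""
    return errors / operations if operations else 0.0

sample = """13: Dev path:/dev/sdr Size(lbs):0 Time grn:520577464 Io Counters:
GENERAL Writes: 4852 Lbs: 2160443 MBs: 1054 Errors: 0 Reads: 49283 Lbs: 111376 MBs: 54 Errors: 12744
BM Writes: 0 Lbs: 0 MBs: 0 Errors: 0 Reads: 0 Lbs: 0 MBs: 0 Errors: 0
COMB_MAP Writes: 5 Lbs: 1390 MBs: 0 Errors: 2 Reads: 0 Lbs: 0 MBs: 0 Errors: 0
TOOTH_MAP Writes: 426 Lbs: 688528 MBs: 336 Errors: 424 Reads: 0 Lbs: 0 MBs: 0 Errors: 0
IO Writes: 4319 Lbs: 603064 MBs: 294 Errors: 16 Reads: 2076 Lbs: 16608 MBs: 8 Errors: 22"""

counters = parse_io_counters(sample)
writes, werr, reads, rerr = counters["GENERAL"]
print(f"GENERAL read error ratio: {error_ratio(reads, rerr):.1%}")
```

For the sample device, roughly a quarter of the GENERAL reads reported an error, which supports the picture of a failing disk.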

              Output from counters_dump.txt of AVG_WRITE_LATENCY compared to other devices on the same node/cluster:       

              ID: 2d63f7c60003000c DEVICE_TYPE                DEV_LATENCY                          AVG_WRITE_LATENCY_IN_MICROSEC   0
              ID: 2d63f7c70003000d DEVICE_TYPE                DEV_LATENCY                          AVG_WRITE_LATENCY_IN_MICROSEC   0
              ID: 2d63f7c80003000e DEVICE_TYPE                DEV_LATENCY                          AVG_WRITE_LATENCY_IN_MICROSEC   11424
              ID: 2d63f7c90003000f DEVICE_TYPE                DEV_LATENCY                          AVG_WRITE_LATENCY_IN_MICROSEC   0
              ID: 2d63f7ca00030010 DEVICE_TYPE                DEV_LATENCY                          AVG_WRITE_LATENCY_IN_MICROSEC   0
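Comparing one device against its peers is easier when the latency values are sorted. The following sketch ranks devices by AVG_WRITE_LATENCY_IN_MICROSEC, assuming the counters_dump.txt line layout shown above; the function name is hypothetical.

```python
import re

# Assumes the counters_dump.txt line layout shown in the sample above.
LATENCY_RE = re.compile(
    r"ID:\s*([0-9a-f]+)\s+DEVICE_TYPE\s+DEV_LATENCY\s+"
    r"AVG_WRITE_LATENCY_IN_MICROSEC\s+(\d+)"
)

def rank_by_write_latency(counters_text):
    """Return (device ID, average write latency in microseconds), slowest first."""
    pairs = [(dev_id, int(us)) for dev_id, us in LATENCY_RE.findall(counters_text)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

sample = """\
ID: 2d63f7c60003000c DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0
ID: 2d63f7c70003000d DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0
ID: 2d63f7c80003000e DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 11424
ID: 2d63f7c90003000f DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0
ID: 2d63f7ca00030010 DEVICE_TYPE DEV_LATENCY AVG_WRITE_LATENCY_IN_MICROSEC 0
"""
ranked = rank_by_write_latency(sample)
print(ranked[0])
# ('2d63f7c80003000e', 11424)
```

Here the device with ID 2d63f7c80003000e (/dev/sdr in the earlier output) is the only one reporting a non-zero average write latency.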



      The "Fixed Read Errors" counter itself has no impact on the system; it only increments as seen above.    


      The impact is felt when the device itself is failing. In this case it may cause SDS disconnects, rebuilds and general instability of the ScaleIO cluster.   







The root cause in this case is a device that is going bad or slowing down. The background device scanner hit an error when reading a particular 1 MB block on the device and fixed the problem itself by overwriting the block from the other copy (Primary or Secondary). In this example, the background device scanner found and fixed 6 read errors on device /dev/sdr (the "Error-fixes: 6" value in the query output above).









      The fix is twofold:   

  1.         Determine whether this was a one-off issue or something more systemic with the device in question. These errors usually indicate that the device has hardware issues. If a specific device has many corrected read errors and the count has risen over time, the device is likely going bad and reads from it are failing. Running hardware diagnostics (if available) on the device and/or replacing the disk needs to be considered.
  2.         The command below clears the corrected read counters for an entire storage pool, not just a specific device. Run it only when you and the customer understand what the counters indicate, and after a decision has been made about the disk and its impact on the cluster.
             Clear the fixed read error counters with this command:                
              scli --reset_scanner_error_counters --protection_domain_id <pd id> --storage_pool_id <sp id> --reset_corrected_read_error_counter        

               *Note: You may also need to clear device errors if the device is in an errored state.       


           **Be sure to use all of the device health status indicators to make the decision that a disk is going bad**   
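Step 1 depends on whether the corrected read error count is rising over time. One way to check is to record the "Error-fixes" value per device on two occasions, without resetting the counters in between, and compare them. The sketch below assumes the counts have already been collected into dictionaries; the function name and the earlier snapshot values are hypothetical.

```python
def counter_growth(earlier, later):
    """Return {device: increase} for devices whose Error-fixes count rose between two snapshots."""
    return {
        dev: later[dev] - earlier.get(dev, 0)
        for dev in later
        if later[dev] > earlier.get(dev, 0)
    }

# Hypothetical earlier snapshot; the later value for /dev/sdr matches the
# "Error-fixes: 6" example in this article.
earlier = {"/dev/sdr": 2, "/dev/sdq": 0}
later = {"/dev/sdr": 6, "/dev/sdq": 0}
print(counter_growth(earlier, later))
# {'/dev/sdr': 4}
```

A steadily growing count on one device, combined with the other health indicators above, points toward a failing disk rather than a one-off event.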


      Impacted versions   




      Fixed in version   


      This is working as designed and is not a bug.