ScaleIO: How to deal with a ScaleIO drive / device failure?


ScaleIO Family





Use the following procedure:
    1. Identify the SDS and the failed device.
    A> The SDS and device information is available in the ScaleIO GUI Alert and Backend views.
    B> The SDS and device information is also available in the ScaleIO event logs (/opt/emc/scaleio/mdm/bin/
    2. Connect to the identified SDS host and check the device status at the OS level using the commands below (replace sdX with the actual device name):
    A> fdisk -l   
    B> cat /var/log/messages | grep -i offline   
    C> cat /sys/block/sdX/device/state   
    3. If step 2 shows no device issue at the OS level, log in to the ScaleIO primary MDM and authenticate with scli. Then clear the device error and check the SDS status:
    A> scli --login --username <> --password <>   
    B> scli --clear_sds_device_error --sds_ip <> --device_path <>   
    C> scli --query_sds --sds_ip <>   
    4. If the device is found offline or dead at the OS level, try to bring it back to the online / running state with the commands below, then check the result:
    A> echo running > /sys/block/sdX/device/state   
    B> cat /sys/block/sdX/device/state   
    C> cat /var/log/messages | grep -i online   
    If the device still shows offline / dead, this may indicate a device hardware failure, and the device needs to be replaced.
    Log in to the primary MDM and run the checks below:
    A> scli --login --username <> --password <>   
    B> scli --query_sds --sds_ip <>
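
The OS-level checks in steps 2 and 4 can be wrapped in a few shell functions. This is a minimal sketch, not part of the ScaleIO tooling: the function names (device_state, try_online, offline_events) and the SYSFS_BLOCK override are assumptions made for illustration, and the functions use only the sysfs state file and log file already named in the procedure.

```shell
#!/bin/sh
# Sketch of the OS-level checks from steps 2 and 4.
# SYSFS_BLOCK is a hypothetical override so the functions can be
# tried against a scratch directory instead of the real /sys/block.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# Print the kernel-reported state of a device (for example "running"
# or "offline"). Prints "missing" and returns 1 if the state file
# cannot be read.
device_state() {
    dev="$1"
    state_file="$SYSFS_BLOCK/$dev/device/state"
    if [ ! -r "$state_file" ]; then
        echo "missing"
        return 1
    fi
    cat "$state_file"
}

# Step 4: write "running" to the state file, then re-read it.
# Returns 0 only if the device reports "running" afterwards.
try_online() {
    dev="$1"
    state_file="$SYSFS_BLOCK/$dev/device/state"
    echo running > "$state_file" || return 1
    [ "$(device_state "$dev")" = "running" ]
}

# Step 2B: print log lines mentioning "offline".
# The log path defaults to /var/log/messages.
offline_events() {
    grep -i offline "${1:-/var/log/messages}"
}
```

A possible use on the SDS host, run as root and with sdX replaced by the real device name: call `device_state sdX`, and if it does not print "running", call `try_online sdX`. If `try_online` fails, continue with the hardware-failure checks described above.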






Example screenshots of identifying a device failure from the GUI and event logs:
    From the ScaleIO GUI Dashboard view
    [Screenshot: failed device seen in the Dashboard view]
    From the ScaleIO GUI Alert view
    [Screenshot: failed device seen in the Alert view]
    From the ScaleIO GUI Backend view
    [Screenshot: failed device seen in the Backend view]
    From the event log (/opt/emc/scaleio/mdm/bin/
    [Screenshot: failed device seen in the event log]