SDS stuck in "Join-pending" state with "Command before Add Mdm" in trc

           

   Article Number: 503020    Article Version: 3    Article Type: Break Fix
   

 


Product:

 

ScaleIO Software 2.0

 

Issue:

 

 

   

      "scli --query_sds" shows one SDS in "Join-pending" state.

   

      Scenario   

   

      This usually happens after an SDS has run into problems and has been repeatedly disconnecting and reconnecting. For example, devices with long in-flight IO can lead to this issue. Restarting the SDS does not get it connected again.

   

      Symptoms   

   

      "scli --query_sds" shows one SDS in "Join-pending" state, even though the SDS process is up and the MDM can reach the SDS port without problems.

   

      SDS trc contains messages like the following:   

   
     
       
18/06 20:36:20.083738 0x7f2d90338eb0:contCmd_NewRequest:05616: Command before Add Mdm        
     
   
   

      There may also be a large number of "IO_FAULT_NO_COMB" messages in the SDS trc, because SDCs may still be sending IOs to the SDS before the MDM has fully reconfigured it. The "Command before Add Mdm" message can be easy to miss among them.
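
      Because the key message can be buried under the "IO_FAULT_NO_COMB" noise, a small script can help pull it out. The following is a minimal sketch, not a supported tool: it assumes the SDS trc is a plain-text file with one message per line, as in the sample above, and the function name summarize_sds_trc is hypothetical.

```python
from collections import Counter

# Marker strings taken from the article. The one-message-per-line trc
# layout is an assumption based on the single sample line shown above.
KEY_MARKER = "Command before Add Mdm"
NOISE_MARKER = "IO_FAULT_NO_COMB"


def summarize_sds_trc(lines):
    """Count both markers and collect every 'Command before Add Mdm' line."""
    counts = Counter()
    hits = []
    for line in lines:
        if KEY_MARKER in line:
            counts[KEY_MARKER] += 1
            hits.append(line.rstrip("\n"))
        if NOISE_MARKER in line:
            counts[NOISE_MARKER] += 1
    return counts, hits
```

      To use it, open the SDS trc file and pass the file object (or any iterable of lines) to summarize_sds_trc, then inspect the returned lines for the timestamps at which the SDS rejected commands.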

   

      MDM trc contains messages like the following:   

   
     
       
18/06 21:05:34.713697 0x7f394a024eb0:tgtMgr_CheckReturnedRc:00465: Tgt: <SDS ID> operation=2053 rc=WRONG_RECONF_MODE state: NORMAL processState: UP_INPROGRESS upDownState: DOWN mdmTgtConnectionGenNum: 756        
     
   
   

      This particular message (tgtMgr_CheckReturnedRc, with "operation=2053" and "rc=WRONG_RECONF_MODE" against the same SDS) may appear in the MDM trc for the entire time the SDS is stuck in Join-pending state, always with the same "mdmTgtConnectionGenNum".
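
      The pattern described above (the same SDS, operation 2053, rc WRONG_RECONF_MODE, and an unchanged mdmTgtConnectionGenNum) can be checked mechanically. The sketch below is illustrative only: the regular expression is modeled on the single sample line in this article and may need adjusting for other builds, and the names find_stuck_sds and min_repeats are hypothetical.

```python
import re
from collections import defaultdict

# Pattern modeled on the one MDM trc sample line in the article (assumption:
# real SDS IDs contain no whitespace and the field order does not change).
RC_LINE = re.compile(
    r"tgtMgr_CheckReturnedRc:\d+: Tgt: (?P<sds>\S+) "
    r"operation=(?P<op>\d+) rc=(?P<rc>\w+) .*?"
    r"mdmTgtConnectionGenNum: (?P<gen>\d+)"
)


def find_stuck_sds(lines, min_repeats=2):
    """Return {sds_id: (gen_num, count)} for each SDS that repeatedly answers
    operation 2053 with WRONG_RECONF_MODE under one unchanged generation number."""
    seen = defaultdict(lambda: defaultdict(int))
    for line in lines:
        m = RC_LINE.search(line)
        if m and m["op"] == "2053" and m["rc"] == "WRONG_RECONF_MODE":
            seen[m["sds"]][int(m["gen"])] += 1

    stuck = {}
    for sds, gens in seen.items():
        # Only a single, unchanged generation number matches the symptom.
        if len(gens) == 1:
            gen, count = next(iter(gens.items()))
            if count >= min_repeats:
                stuck[sds] = (gen, count)
    return stuck
```

      An SDS reported by this function is only a candidate; confirm it against the "scli --query_sds" output and the SDS trc before acting on it.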

                                                             

 

 

Cause:

 

 

   

      The MDM sends some control commands to the SDS incorrectly and keeps retrying them whenever the SDS returns "WRONG_RECONF_MODE". This prevents the MDM from initiating a full reconfiguration of the SDS, leaving it stuck in "Join-pending" state.

                                                             

 

 

Resolution:

 

 

   

      This issue will be fixed in ScaleIO version 3.0.

   

      Workaround   

   

      Switch the MDM ownership.   

                                                             

 

 

Notes:

 

 

   

      Impact   

   

      One SDS is unable to join the cluster. IO errors are possible on the SDC side (even in MDM_DATA_DEGRADED state).