SDS stuck in "Join-pending" state with "Command before Add Mdm" in trc


   Article Type: Break Fix 




ScaleIO Software 2.0






      "scli --query_sds" shows one SDS in "join-pending" state.   




      This usually happens after an SDS has experienced problems that caused it to disconnect and reconnect repeatedly. For example, devices with long in-flight IO may lead to this issue. Restarting the SDS does not get it connected.   




      "scli --query_sds" shows one SDS in "Join-pending" state, even though the SDS process is up and the MDM can communicate with the SDS port without problems.   


      SDS trc contains messages like the following:   

18/06 20:36:20.083738 0x7f2d90338eb0:contCmd_NewRequest:05616: Command before Add Mdm        

      There may also be a large number of "IO_FAULT_NO_COMB" messages in the trc, because the SDCs may be sending IOs to the SDS before the MDM has fully reconfigured it. The "Command before Add Mdm" message may not stand out among them.   
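
      Because the "Command before Add Mdm" message can be buried among "IO_FAULT_NO_COMB" entries, a small script may help locate it. The following is a minimal sketch, not a supported tool: it assumes the SDS trc is a plain-text file with one message per line, and the function name summarize_sds_trc is hypothetical.

```python
from collections import Counter


def summarize_sds_trc(lines):
    """Count the two SDS trc message types described in this article.

    `lines` is any iterable of text lines (for example, an open file).
    Returns (counts, first_hit), where first_hit is the first line that
    contains "Command before Add Mdm", or None if there is no such line.
    """
    counts = Counter()
    first_hit = None
    for line in lines:
        if "Command before Add Mdm" in line:
            counts["command_before_add_mdm"] += 1
            if first_hit is None:
                first_hit = line.rstrip("\n")
        elif "IO_FAULT_NO_COMB" in line:
            counts["io_fault_no_comb"] += 1
    return counts, first_hit
```

      To use it, open the SDS trc file (its location depends on the installation) and pass the file object to summarize_sds_trc. A non-zero "command_before_add_mdm" count alongside a much larger "io_fault_no_comb" count is consistent with the symptom described here.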


      MDM trc contains messages like the following:   

18/06 21:05:34.713697 0x7f394a024eb0:tgtMgr_CheckReturnedRc:00465: Tgt: <SDS ID> operation=2053 rc=WRONG_RECONF_MODE state: NORMAL processState: UP_INPROGRESS upDownState: DOWN mdmTgtConnectionGenNum: 756        

      This particular message (tgtMgr_CheckReturnedRc, with "operation=2053" and "rc=WRONG_RECONF_MODE" for the same SDS) may appear in the MDM trc for the entire time the SDS is stuck in Join-pending state, always with the same "mdmTgtConnectionGenNum".   
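
      To check whether the retries persist with an unchanged "mdmTgtConnectionGenNum", the matching MDM trc lines can be grouped by SDS ID and generation number. This is a minimal sketch that assumes the line layout matches the sample message above; the regular expression and the function name group_retries are illustrative assumptions, not part of the product.

```python
import re

# Matches the tgtMgr_CheckReturnedRc line format shown in this article.
PATTERN = re.compile(
    r"^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r".*tgtMgr_CheckReturnedRc:\d+: "
    r"Tgt: (?P<sds>.+?) operation=2053 rc=WRONG_RECONF_MODE "
    r".*mdmTgtConnectionGenNum: (?P<gen>\d+)"
)


def group_retries(lines):
    """Group matching MDM trc lines by (SDS ID, mdmTgtConnectionGenNum).

    Returns a dict that maps each (sds_id, gen_num) pair to the first
    timestamp, last timestamp, and number of matching lines, in the
    order the lines were read.
    """
    groups = {}
    for line in lines:
        match = PATTERN.match(line.strip())
        if match is None:
            continue
        key = (match.group("sds"), int(match.group("gen")))
        timestamp = match.group("ts")
        entry = groups.setdefault(
            key, {"first": timestamp, "last": timestamp, "count": 0}
        )
        entry["last"] = timestamp
        entry["count"] += 1
    return groups
```

      A single (SDS ID, generation number) pair with a high count and a long gap between "first" and "last" matches the pattern described above.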








      The MDM sends some control commands to the SDS incorrectly, and keeps retrying them when the SDS returns "WRONG_RECONF_MODE". This prevents the MDM from initiating a full reconfiguration of the SDS, leaving it stuck in "Join-pending" state.   








      This issue will be fixed in ScaleIO version 3.0.   




      As a workaround, switch the MDM ownership.   










      One SDS is unable to join the cluster. IO errors are possible on the SDC side (even in MDM_DATA_DEGRADED state).