Unexplained MDM switchover in a ScaleIO cluster

Environment:

ScaleIO 2.0.0.1

Description:

The master MDM unexpectedly switched over to a slave MDM.

The MDM event log contains the following events:

534536 2016-07-08 00:00:30.961 MDM_CLUSTER_CONNECTED INFO The MDM, ID 256570b60ecb6bf4, connected

534537 2016-07-08 00:00:30.981 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, ID 0ffec83815721eb2, lost connection

534538 2016-07-08 00:00:30.981 MDM_CLUSTER_CONNECTED INFO The MDM, ID 14e2883f57c8b853, connected

534539 2016-07-08 00:00:31.154 MDM_CLUSTER_CONNECTED INFO The MDM, ID 0ffec83815721eb2, connected

534540 2016-07-08 00:00:37.941 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, ID 0ffec83815721eb2, lost connection
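
When the event log is long, it can help to extract the connection events programmatically. The following is a minimal Python sketch, assuming each event line has the layout seen in the excerpt above (sequence number, timestamp, event name, severity, message). The function names parse_events and lost_connections are hypothetical helpers written for this illustration and are not part of ScaleIO.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: sequence, timestamp, event name, severity, message.
EVENT_RE = re.compile(
    r"^(?P<seq>\d+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<name>[A-Z_]+)\s+"
    r"(?P<severity>[A-Z]+)\s+"
    r"(?P<message>.*)$"
)
# MDM IDs in the excerpt are 16 lowercase hex digits.
MDM_ID_RE = re.compile(r"ID ([0-9a-f]{16})")


def parse_events(lines):
    """Return a list of dicts for every line that matches the assumed layout."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # skip blank lines and anything in another format
        ev = m.groupdict()
        ev["seq"] = int(ev["seq"])
        ev["ts"] = datetime.strptime(ev["ts"], "%Y-%m-%d %H:%M:%S.%f")
        events.append(ev)
    return events


def lost_connections(events):
    """Map each MDM ID to the timestamps at which it lost connection."""
    lost = {}
    for ev in events:
        if ev["name"] != "MDM_CLUSTER_LOST_CONNECTION":
            continue
        m = MDM_ID_RE.search(ev["message"])
        if m:
            lost.setdefault(m.group(1), []).append(ev["ts"])
    return lost
```

Run against the events above, lost_connections(parse_events(lines)) would group both MDM_CLUSTER_LOST_CONNECTION warnings under MDM ID 0ffec83815721eb2.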

 

Just before the switchover, the MDM event log also shows NET_OSCILLATION_COUNTER_PASSED_THRESHOLD warnings:

534512 2016-07-08 00:00:23.753 NET_OSCILLATION_COUNTER_PASSED_THRESHOLD WARNING SDS (Name: namwtp2sioc02, ID: 8926af0900000064) reports frequent disconnections from SDS (Name: namwtp2sioa02, ID: 8926af000000005b IP: 10.125.160.139). Medium window threshold (500 disconnections in 3600 seconds) exceeded.

534513 2016-07-08 00:00:23.754 NET_OSCILLATION_COUNTER_PASSED_THRESHOLD WARNING SDS (Name: namwtp2sioa25, ID: 8926aeff0000005a) reports frequent disconnections from SDS (Name: namwtp2sioa34, ID: 8926aebf0000001a IP: 10.125.160.171). Long window threshold (700 disconnections in 86400 seconds) exceeded.
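
To see which SDS pairs are affected and which window was exceeded, the warning text can be broken into fields. This is a rough Python sketch, assuming the message wording matches the two warnings above exactly. The parse_oscillation helper and the per_hour field are illustrative names introduced here, not ScaleIO terms.

```python
import re

# Pattern assumed from the wording of the two warnings in the excerpt.
OSC_RE = re.compile(
    r"SDS \(Name: (?P<reporter>[^,]+), ID: (?P<reporter_id>[0-9a-f]+)\) "
    r"reports frequent disconnections from "
    r"SDS \(Name: (?P<peer>[^,]+), ID: (?P<peer_id>[0-9a-f]+) IP: (?P<peer_ip>[\d.]+)\)\. "
    r"(?P<window>\w+) window threshold "
    r"\((?P<limit>\d+) disconnections in (?P<seconds>\d+) seconds\) exceeded\."
)


def parse_oscillation(message):
    """Return the fields of an oscillation warning, or None if it does not match."""
    m = OSC_RE.search(message)
    if not m:
        return None
    info = m.groupdict()
    info["limit"] = int(info["limit"])
    info["seconds"] = int(info["seconds"])
    # Average disconnections per hour implied by the threshold (illustrative only).
    info["per_hour"] = info["limit"] * 3600 / info["seconds"]
    return info
```

For the first warning above, this yields reporter namwtp2sioc02, peer namwtp2sioa02 and the Medium window, with a limit of 500 disconnections in 3600 seconds.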

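The exp.0 file in the MDM logs folder shows that the MDM process (PID 11701) panicked in mosOscCnt_IsNetCounter while handling an oscillation counter response. Timestamps in this file appear to be in DD/MM format, so 08/07 is 8 July: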

08/07 00:00:30.151469 Panic in file /data/emc/svc_flashbld/workspace/ScaleIO-RHEL6/src/mos/mos_oscillation_counters.c, line 1063, function mosOscCnt_IsNetCounter, PID 11701.Panic Expression ALWAYS_ASSERT .

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(mosDbg_BackTrace+0x28) [0x67c0ab]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(mosDbg_PanicPrepare+0x124) [0x66e470]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(mosOscCnt_IsNetCounter+0x53) [0x6627d1]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(countMgr_UpdateNetOscillatingCounterList+0x8b1) [0x5ab022]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(tgtMgr_HandleTgtOscCountersResponse+0x114) [0x4df4d2]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(mdmTgtMsg_RecvRpcResponseCB+0x8f7) [0x4a96ef]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(netRecvGroup_CallRecvRpcResponse+0x14) [0x6183e9]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(netRecvGroup_WaitForWork+0x3cb) [0x618f08]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(netRecvGroup_WaitForWorkLoop+0x16) [0x618f80]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(mosUmt_StartFunc+0xec) [0x683c2b]

08/07 00:00:30.153304 ************************** BACKTRACES START *****************************

----------========== Backtrace of UMT 0x7f3818318eb8 ==========----------

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0() [0x67e838]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(mosUmtCond_TimedWait+0x36) [0x67e999]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(mosUmtCond_Wait+0xe) [0x67e9a9]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0() [0x652cc0]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(mosUmt_StartFunc+0xec) [0x683c2b]

/opt/emc/scaleio/mdm/bin/mdm-2.0.6035.0(mosUmt_SignalHandler+0x51) [0x685694]

/lib64/libpthread.so.0() [0x3f6420f7e0]

----------========== Backtrace of UMT 0x7f381832aeb8 ==========----------
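
To check whether another exp.0 file shows the same panic signature, the panic line and the named stack frames can be extracted with a short script. This is a minimal Python sketch, assuming the panic line and frame format shown above. The helper names parse_panic and frame_symbols are hypothetical.

```python
import re

# Formats assumed from the exp.0 excerpt above.
PANIC_RE = re.compile(
    r"Panic in file (?P<file>\S+), line (?P<line>\d+), "
    r"function (?P<function>\w+), PID (?P<pid>\d+)"
)
FRAME_RE = re.compile(r"\((?P<symbol>\w+)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]")


def parse_panic(text):
    """Return source file, line, function and PID of the first panic, or None."""
    m = PANIC_RE.search(text)
    if not m:
        return None
    info = m.groupdict()
    info["line"] = int(info["line"])
    info["pid"] = int(info["pid"])
    return info


def frame_symbols(lines):
    """Return symbol names from backtrace frames, skipping frames without a symbol."""
    symbols = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            symbols.append(m.group("symbol"))
    return symbols
```

A panic in mosOscCnt_IsNetCounter with countMgr_UpdateNetOscillatingCounterList among the frame symbols would match the signature described in this article.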

 

The messages log on the MDM node shows the same process (PID 11701) terminating and being respawned:

Jul  8 00:00:30 namwtp2sioa01 init: mdm main process (11701) terminated with status 255

Jul  8 00:00:30 namwtp2sioa01 init: mdm main process ended, respawning
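
The PID in the messages log can be matched against the PID in the panic line to confirm that both refer to the same process. This is a small Python sketch, assuming the init message wording shown above. The terminated_processes helper is a hypothetical name.

```python
import re

# Wording assumed from the init messages in the excerpt above.
TERMINATED_RE = re.compile(
    r"init: (?P<job>\S+) main process \((?P<pid>\d+)\) "
    r"terminated with status (?P<status>\d+)"
)


def terminated_processes(lines):
    """Return (job, pid, status) for every 'terminated with status' message."""
    found = []
    for line in lines:
        m = TERMINATED_RE.search(line)
        if m:
            found.append((m.group("job"), int(m.group("pid")), int(m.group("status"))))
    return found
```

For the two lines above this returns a single entry for the mdm job with PID 11701 and status 255, the same PID that appears in the exp.0 panic line.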

 

The ScaleIO event log on the MDM shows the lost connection at the same time:

534537 2016-07-08 00:00:30.981 MDM_CLUSTER_LOST_CONNECTION WARNING The MDM, ID 0ffeaaaaaaaa1eb2, lost connection

 

The MDM trc logs show the MDM process starting again with a new PID:

08/07 00:00:30.872372 ---------- Process started. Version private ScaleIO R2_0.6035.0_Release Apr 20 2016. PID 20364 ----------

08/07 00:00:30.892282 e0aa7eb8:mosEventLog_PostInternal:00590: New event added. Message: "MDM started with the role of Manager". Additional info: "" Severity: Info

Resolution:

The issue will be fixed in the ScaleIO 2.0.1 release.

For a detailed resolution, refer to Dell EMC Support Solution 488491: https://support.emc.com/kb/488491

 
