ESXi NIC team connectivity loss upon failback

           

   Article Number: 539367     Article Version: 2     Article Type: Break Fix
   

 


Product:

 

VxFlex Product Family,VxFlex OS,VxFlex Ready Node,VxRack Flex Series,VxRack Flex-PowerEdge 13G,VxRack Flex-PowerEdge 14G

 

Issue:

 

 

   

      Scenario   

   

      ESXi fails back to a recently lost NIC as soon as that NIC reports link up, but the switch does not forward packets for a few more minutes.

   

      Symptoms   

   

      Note that the disruption occurs not when the switch goes down, but when the rebooted switch comes back up.

   

      The ESXi host's vmkernel log shows generic SDC connectivity loss (connections marked down, socket receive failures):

   
      2019-09-09T16:16:36.858Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x4395857ff480 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.858Z cpu51:66657)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x4395857ff7fc socket 0x4395857ffd10
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827f9ac0 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827f40c0 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827f6280 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu8:66787)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827f9e3c socket 0x439d827fa350
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827fbc80 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827f6dc0 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827fe980 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827f8f80 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827fc7c0 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827ff4c0 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 :CON 0x439d827f3580 didn't receive message for 30 iterations.  Marking as down
      2019-09-09T16:16:36.958Z cpu62:66792)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827f443c socket 0x439d827f4950
      2019-09-09T16:16:36.959Z cpu46:66798)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827f65fc socket 0x439d827f6b10
      2019-09-09T16:16:36.959Z cpu43:66803)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827fbffc socket 0x439d827fc510
      2019-09-09T16:16:36.959Z cpu46:66804)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827f713c socket 0x439d827f7650
      2019-09-09T16:16:36.959Z cpu53:66815)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827fecfc socket 0x439d827ff210
      2019-09-09T16:16:36.959Z cpu38:66816)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827f92fc socket 0x439d827f9810
      2019-09-09T16:16:36.960Z cpu36:66822)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827fcb3c socket 0x439d827fd050
      2019-09-09T16:16:36.960Z cpu21:66821)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827ff83c socket 0x439d827ffd50
      2019-09-09T16:16:36.961Z cpu17:66791)WARNING: ScaleIO netSock_RcvIntrn:1920 :Error: Failed  to receive 128 data PTR 0x439d827f38fc socket 0x439d827f3e10
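
When triaging a large vmkernel log, it can help to summarize how many SDC connections were marked down and over what time window. The following is a minimal Python sketch, not a supported tool. The function name `summarize_sdc_warnings` and the regular expressions are assumptions modeled only on the two message types in the excerpt above; other ScaleIO message formats are not covered.

```python
import re

# Patterns are assumptions derived from the log excerpt in this article.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z")
CON_DOWN = re.compile(
    r"netCon_IsKaNeeded:\d+ :CON (0x[0-9a-f]+) didn't receive message"
)
SOCK_FAIL = re.compile(r"netSock_RcvIntrn:\d+ :Error: Failed\s+to receive")


def summarize_sdc_warnings(log_text):
    """Count SDC keepalive and socket-receive warnings in vmkernel log text."""
    stamps = TIMESTAMP.findall(log_text)
    # Distinct connection handles that were marked down.
    connections = set(CON_DOWN.findall(log_text))
    return {
        "connections_marked_down": len(connections),
        "socket_receive_failures": len(SOCK_FAIL.findall(log_text)),
        # Fixed-width UTC timestamps sort correctly as plain strings.
        "first_seen": min(stamps) if stamps else None,
        "last_seen": max(stamps) if stamps else None,
    }


# Two entries copied from the excerpt above, used as sample input.
SAMPLE = (
    "2019-09-09T16:16:36.858Z cpu49:66336)WARNING: ScaleIO netCon_IsKaNeeded:3758 "
    ":CON 0x4395857ff480 didn't receive message for 30 iterations.  Marking as down\n"
    "2019-09-09T16:16:36.858Z cpu51:66657)WARNING: ScaleIO netSock_RcvIntrn:1920 "
    ":Error: Failed  to receive 128 data PTR 0x4395857ff7fc socket 0x4395857ffd10\n"
)

if __name__ == "__main__":
    print(summarize_sdc_warnings(SAMPLE))
```

Because the patterns are matched against the whole text rather than line by line, the sketch also works on log output whose entries have been run together. A burst of many connections marked down within the same second, shortly after a switch reboot, is consistent with the scenario described here.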
   

      Impact   

   

      Volume access is lost, resulting in data unavailability (DU).

                                                             

 

 

Cause:

 

 

   

      Root Cause   

   

      Certain models of Cisco switches power up their network ports/ASICs before they are able to forward traffic.    

   

      ESXi's teaming function treats the link-up state as a sign that the switch is ready to forward traffic and fails back to that NIC, so nothing gets through until the switch actually starts forwarding.

   

          

                                                             

 

 

Resolution:

 

 

   

      Workaround   

   

      Set Net.TeamPolicyUpDelay (specified in milliseconds) to a value greater than the time between the switch powering up its ports and the switch actually forwarding traffic.

   

      The setting is found per host, under Configure → System → Advanced System Settings.   

   

      In this customer's environment, the delay varied depending on the switch in use.     

   

      Most modern Cisco models were ready within three minutes, but the customer elected to use the maximum of 600,000 ms (ten minutes) for compatibility with older switches.
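
To choose a value for a given environment, the reasoning above can be sketched in Python: take the slowest measured time for a switch to begin forwarding after link up, add a safety margin, convert to milliseconds, and cap at the maximum. This is an illustrative sketch only. The function names, the 30-second margin, and the example measurements are hypothetical, and the 600,000 ms ceiling assumes the ten-minute maximum quoted in this article. The `esxcli system settings advanced set` form is the usual command-line route to advanced settings, but verify the option path on your ESXi build before using it.

```python
# Assumption: ceiling of ten minutes, as quoted in this article.
MAX_UP_DELAY_MS = 600_000


def team_policy_up_delay_ms(switch_ready_seconds, margin_seconds=30):
    """Return a Net.TeamPolicyUpDelay value in ms covering the slowest switch.

    switch_ready_seconds: measured times (seconds) from link up until each
    switch forwards traffic. margin_seconds is a hypothetical safety margin.
    """
    if not switch_ready_seconds:
        raise ValueError("at least one measurement is required")
    wanted_ms = int((max(switch_ready_seconds) + margin_seconds) * 1000)
    # Never exceed the assumed maximum for the setting.
    return min(wanted_ms, MAX_UP_DELAY_MS)


def esxcli_command(delay_ms):
    """Build the per-host command string; it is not executed here."""
    return (
        "esxcli system settings advanced set "
        f"-o /Net/TeamPolicyUpDelay -i {delay_ms}"
    )


if __name__ == "__main__":
    # Hypothetical measurements: one switch ready in 95 s, another in 180 s.
    delay = team_policy_up_delay_ms([95, 180])
    print(delay)
    print(esxcli_command(delay))
```

With the hypothetical measurements shown, the slowest switch (180 s) plus the margin gives 210,000 ms. Simply using the maximum, as this customer did, trades a longer failback wait for not having to measure each switch model.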

                                                             

 

 

Notes:

 

 

   

      Impacted Versions   

   

      N/A - Not a ScaleIO issue.    

   

      Fixed In Version   

   

      N/A - Not a ScaleIO issue.