ECS: Node reboots consistently and more frequently without an indication of a hardware issue in the logs

           

Article Number: 539540    Article Version: 3    Article Type: Break Fix
   

 


Product:

 

ECS Appliance, ECS Appliance Gen 1, ECS Appliance Gen 2, ECS Appliance Hardware Gen1 C-Series, ECS Appliance Hardware Gen1 U-Series, ECS Appliance Hardware Gen2 C-Series, ECS Appliance Hardware Gen2 D-Series, ECS Appliance Hardware Gen2 U-Series

 

Issue:

 

 

The node reboots consistently and more frequently, with no indication of a hardware issue in the hardware or OS logs.

 

 

Cause:

 

 

The CPU on the node is failing, causing the node to reboot when under load.

         
  1. Check the CPU frequency of the nodes in the rack to confirm whether the issue is present. In the example below, node 169.254.3.5 reports 1.93 GHz while the other nodes report about 2.6 GHz.
     Command:
         # domulti 'cpupower frequency-info | grep "CPU frequency"'
     Example:
         admin@node1:~> domulti 'cpupower frequency-info | grep "CPU frequency"'
         ...
         169.254.3.1
         ========================================
           current CPU frequency: 2.63 GHz (asserted by call to hardware)
         169.254.3.2
         ========================================
           current CPU frequency: 2.60 GHz (asserted by call to hardware)
         169.254.3.3
         ========================================
           current CPU frequency: 2.61 GHz (asserted by call to hardware)
         169.254.3.4
         ========================================
           current CPU frequency: 2.61 GHz (asserted by call to hardware)
         169.254.3.5
         ========================================
           current CPU frequency: 1.93 GHz (asserted by call to hardware)
         ...
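With many nodes in a rack, the slow node can be easy to miss by eye. The following is a minimal sketch, not part of the original procedure, that reads the `domulti` output on standard input and prints any node whose frequency is well below the fastest node reported. The function name `flag_slow_nodes` and the 0.85 ratio are assumptions chosen for illustration; this KB does not define a threshold.

```shell
#!/bin/sh
# Hypothetical helper (not an ECS tool): flag nodes whose current CPU
# frequency is below RATIO times the fastest node in the same output.
# RATIO defaults to 0.85, which is an assumed value, not a documented limit.
flag_slow_nodes() {
    awk -v ratio="${1:-0.85}" '
        # A line holding only an IPv4 address names the node for the lines after it.
        /^[ \t]*[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+[ \t]*$/ {
            gsub(/[ \t]/, "")
            node = $0
            next
        }
        /current CPU frequency:/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "frequency:") {
                    value = $(i + 1)
                    unit = $(i + 2)
                }
            }
            # Normalise to GHz in case a node reports MHz.
            if (unit == "MHz") value = value / 1000
            freq[node] = value + 0
            order[++n] = node
            if (freq[node] > max) max = freq[node]
        }
        END {
            for (i = 1; i <= n; i++)
                if (freq[order[i]] < max * ratio)
                    printf "%s %.2f GHz (rack max %.2f GHz)\n", order[i], freq[order[i]], max
        }
    '
}
```

A possible invocation is `domulti 'cpupower frequency-info | grep "CPU frequency"' | flag_slow_nodes`. For the five nodes shown in the example, only 169.254.3.5 would be printed. A flagged node is a prompt for closer inspection, not proof of a failing CPU.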
   
         
  2. Confirm that the power consumption on the affected node is lower than on the rest of the cluster. Run the command on the affected node and on a healthy node, then compare the Freq column and the RAPL pack and dram columns. In the examples below, the affected node (node5) reports pack 15618308 and dram 2210070, against 34914217 and 4474791 on a healthy node (node1).
     Command:
         # sudo cpupower monitor
     Example (bad CPU, with lower frequency):
         admin@node5:~> sudo cpupower monitor
                       |Nehalem                    || SandyBridge        || Mperf              || RAPL        || Idle_Stats
         PKG |CORE|CPU | C3   | C6   | PC3  | PC6  || C7   | PC2  | PC7  || C0   | Cx   | Freq || pack | dram || POLL | C1-H | C1E- | C3-H | C6-H
            0|   0|   0|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00|| 99.97|  0.03|  1993||15618308|2210070||  0.00|  0.00|  0.00|  0.00|  0.00
            0|   0|  12|  0.00|  0.00|  0.00|  0.00||  0.00|  0.00|  0.00||  0.02| 99.98|  1995||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.98
            0|   1|   1|  0.02| 98.93|  0.00|  0.00||  0.00|  0.00|  0.00||  0.03| 99.97|  2059||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.98
            0|   1|  13|  0.02| 98.93|  0.00|  0.00||  0.00|  0.00|  0.00||  0.05| 99.95|  2353||15618308|2210070||  0.00|  0.00|  0.00|  0.05| 99.91
            0|   2|   2|  0.00| 99.38|  0.00|  0.00||  0.00|  0.00|  0.00||  0.03| 99.97|  1879||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.98
            0|   2|  14|  0.00| 99.39|  0.00|  0.00||  0.00|  0.00|  0.00||  0.02| 99.98|  2220||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.98
            0|   3|   3|  0.03| 97.99|  0.00|  0.00||  0.00|  0.00|  0.00||  0.12| 99.88|  1607||15618308|2210070||  0.00|  0.00|  0.00|  0.08| 99.82
            0|   3|  15|  0.03| 98.00|  0.00|  0.00||  0.00|  0.00|  0.00||  0.06| 99.94|  1644||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.95
            0|   4|   4|  0.10| 98.47|  0.00|  0.00||  0.00|  0.00|  0.00||  0.08| 99.92|  2579||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.93
            0|   4|  16|  0.10| 98.47|  0.00|  0.00||  0.00|  0.00|  0.00||  0.03| 99.97|  2420||15618308|2210070||  0.00|  0.00|  0.00|  0.12| 99.86
            0|   5|   5|  0.00| 98.25|  0.00|  0.00||  0.00|  0.00|  0.00||  0.06| 99.94|  2246||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.96
            0|   5|  17|  0.00| 98.25|  0.00|  0.00||  0.00|  0.00|  0.00||  0.10| 99.90|  2191||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.92
            1|   0|   6|  0.28| 85.44|  0.00|  0.00||  0.00|  0.00|  0.00||  8.24| 91.76|  2959||15618308|2210070||  0.01|  0.39|  2.28|  0.64| 88.42
            1|   0|  18|  0.28| 85.45|  0.00|  0.00||  0.00|  0.00|  0.00||  0.05| 99.95|  1891||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.98
            1|   1|   7|  0.22| 92.95|  0.00|  0.00||  0.00|  0.00|  0.00||  2.32| 97.68|  2359||15618308|2210070||  0.00|  0.31|  0.68|  0.55| 96.04
            1|   1|  19|  0.22| 92.95|  0.00|  0.00||  0.00|  0.00|  0.00||  0.04| 99.96|  1811||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.99
            1|   2|   8|  0.29| 91.05|  0.00|  0.00||  0.00|  0.00|  0.00||  2.75| 97.25|  2414||15618308|2210070||  0.00|  0.69|  1.80|  0.57| 94.20
            1|   2|  20|  0.29| 91.05|  0.00|  0.00||  0.00|  0.00|  0.00||  0.07| 99.93|  2010||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.97
            1|   3|   9|  0.09| 90.03|  0.00|  0.00||  0.00|  0.00|  0.00||  3.32| 96.68|  2748||15618308|2210070||  0.00|  0.32|  1.99|  0.16| 94.22
            1|   3|  21|  0.09| 90.04|  0.00|  0.00||  0.00|  0.00|  0.00||  0.05| 99.95|  1997||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.99
            1|   4|  10|  0.31| 90.66|  0.00|  0.00||  0.00|  0.00|  0.00||  4.24| 95.76|  2824||15618308|2210070||  0.00|  0.41|  0.68|  0.57| 94.11
            1|   4|  22|  0.31| 90.66|  0.00|  0.00||  0.00|  0.00|  0.00||  0.05| 99.95|  1983||15618308|2210070||  0.00|  0.00|  0.01|  0.00| 99.99
            1|   5|  11|  0.29| 86.34|  0.00|  0.00||  0.00|  0.00|  0.00||  7.45| 92.55|  2952||15618308|2210070||  0.00|  0.28|  2.02|  0.65| 89.54
            1|   5|  23|  0.29| 86.35|  0.00|  0.00||  0.00|  0.00|  0.00||  0.05| 99.95|  2041||15618308|2210070||  0.00|  0.00|  0.00|  0.00| 99.87
     Example (good CPU power consumption):
         admin@node1:~> sudo cpupower monitor
                       |Nehalem                    || SandyBridge        || Mperf              || RAPL        || Idle_Stats
         PKG |CORE|CPU | C3   | C6   | PC3  | PC6  || C7   | PC2  | PC7  || C0   | Cx   | Freq || pack | dram || POLL | C1-H | C1E- | C3-H | C6-H
            0|   0|   0| 14.84|  5.47|  0.00|  0.00||  0.00|  0.00|  0.00|| 22.72| 77.28|  2620||34914217|4474791||  0.00|  8.36| 20.16| 32.83| 14.83
            0|   0|  12| 14.84|  5.47|  0.00|  0.00||  0.00|  0.00|  0.00||  6.28| 93.72|  2595||34914217|4474791||  0.00|  1.67|  5.41| 19.27| 67.12
            0|   1|   1| 14.29|  5.38|  0.00|  0.00||  0.00|  0.00|  0.00|| 20.81| 79.19|  2609||34914217|4474791||  0.01|  6.63| 16.31| 41.85| 13.56
            0|   1|  13| 14.29|  5.38|  0.00|  0.00||  0.00|  0.00|  0.00||  8.48| 91.52|  2639||34914217|4474791||  0.00|  3.27|  2.64| 19.56| 65.67
            0|   2|   2| 12.18|  3.68|  0.00|  0.00||  0.00|  0.00|  0.00|| 26.69| 73.31|  2632||34914217|4474791||  0.25|  6.32| 18.16| 34.00| 13.52
            0|   2|  14| 12.18|  3.68|  0.00|  0.00||  0.00|  0.00|  0.00||  7.15| 92.85|  2594||34914217|4474791||  0.03|  2.47|  9.15| 41.71| 39.23
            0|   3|   3| 12.48|  2.23|  0.00|  0.00||  0.00|  0.00|  0.00|| 26.68| 73.32|  2635||34914217|4474791||  0.00|  7.56| 14.52| 37.15| 12.72
            0|   3|  15| 12.48|  2.23|  0.00|  0.00||  0.00|  0.00|  0.00||  9.62| 90.38|  2622||34914217|4474791||  0.09|  3.33| 12.04| 43.99| 30.61
            0|   4|   4| 13.41|  3.92|  0.00|  0.00||  0.00|  0.00|  0.00|| 18.74| 81.26|  2616||34914217|4474791||  0.51|  9.07| 20.14| 33.97| 17.16
            0|   4|  16| 13.41|  3.92|  0.00|  0.00||  0.00|  0.00|  0.00|| 15.24| 84.76|  2638||34914217|4474791||  0.00|  2.12|  5.97| 36.28| 39.66
            0|   5|   5| 13.97|  5.20|  0.00|  0.00||  0.00|  0.00|  0.00|| 21.36| 78.64|  2623||34914217|4474791||  0.01|  4.82| 17.25| 34.66| 20.94
            0|   5|  17| 13.97|  5.20|  0.00|  0.00||  0.00|  0.00|  0.00|| 10.13| 89.87|  2611||34914217|4474791||  0.69|  3.85|  6.30| 41.95| 37.35
            1|   0|   6| 21.11| 12.75|  0.00|  0.00||  0.00|  0.00|  0.00|| 16.39| 83.61|  2610||34914217|4474791||  0.03|  6.84| 20.79| 36.19| 18.96
            1|   0|  18| 21.11| 12.75|  0.00|  0.00||  0.00|  0.00|  0.00||  4.39| 95.61|  2679||34914217|4474791||  0.00|  0.87|  0.11|  0.58| 93.86
            1|   1|   7| 12.81| 17.80|  0.00|  0.00||  0.00|  0.00|  0.00|| 24.56| 75.44|  2691||34914217|4474791||  0.14|  7.69| 17.64| 23.69| 24.95
            1|   1|  19| 12.81| 17.80|  0.00|  0.00||  0.00|  0.00|  0.00||  2.28| 97.72|  2540||34914217|4474791||  0.18|  1.19|  3.29|  1.78| 91.35
            1|   2|   8| 16.77| 13.97|  0.00|  0.00||  0.00|  0.00|  0.00|| 18.46| 81.54|  2653||34914217|4474791||  0.00|  6.44| 27.25| 27.49| 19.31
            1|   2|  20| 16.77| 13.97|  0.00|  0.00||  0.00|  0.00|  0.00||  3.21| 96.79|  2575||34914217|4474791||  0.22|  0.48|  0.12|  1.57| 94.50
            1|   3|   9| 18.42| 20.59|  0.00|  0.00||  0.00|  0.00|  0.00|| 17.12| 82.88|  2617||34914217|4474791||  0.10|  5.07| 17.24| 29.94| 29.80
            1|   3|  21| 18.42| 20.59|  0.00|  0.00||  0.00|  0.00|  0.00||  2.30| 97.70|  2547||34914217|4474791||  0.00|  0.01|  0.19|  0.88| 96.64
            1|   4|  10| 19.87| 11.49|  0.00|  0.00||  0.00|  0.00|  0.00|| 13.87| 86.13|  2573||34914217|4474791||  0.24|  7.75| 25.57| 33.11| 18.91
            1|   4|  22| 19.87| 11.49|  0.00|  0.00||  0.00|  0.00|  0.00||  3.78| 96.22|  2547||34914217|4474791||  0.50|  1.30|  0.54|  5.62| 88.53
            1|   5|  11| 22.04| 16.17|  0.00|  0.00||  0.00|  0.00|  0.00|| 20.26| 79.74|  2616||34914217|4474791||  0.08|  2.43| 12.37| 37.26| 26.76
            1|   5|  23| 22.04| 16.17|  0.00|  0.00||  0.00|  0.00|  0.00||  6.63| 93.37|  2670||34914217|4474791||  0.00|  0.00|  0.38|  0.96| 91.83
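Comparing two 24-row tables by eye is error-prone. The sketch below, which is not part of the original procedure, condenses one `cpupower monitor` table into a single line so that nodes can be compared side by side. The function name `summarize_cpupower_monitor` is hypothetical. It assumes the header row starts with `PKG` and contains `Freq` and `pack` columns, as in the examples above; other CPU models may expose different columns.

```shell
#!/bin/sh
# Hypothetical helper (not an ECS tool): summarise `cpupower monitor` output
# read from standard input. Column positions are taken from the header row,
# so only the column names Freq and pack are assumed.
summarize_cpupower_monitor() {
    awk -F '[|]+' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Header row: remember which fields hold Freq and the RAPL pack value.
        trim($1) == "PKG" {
            for (i = 1; i <= NF; i++) {
                name = trim($i)
                if (name == "Freq") fcol = i
                if (name == "pack") pcol = i
            }
            next
        }
        # Data rows start with a numeric package id.
        fcol && trim($1) ~ /^[0-9]+$/ {
            f = $fcol + 0
            sum += f
            n++
            if (n == 1 || f < min) min = f
            if (pcol) pack = trim($pcol)
        }
        END {
            if (n)
                printf "cpus=%d avg_freq_mhz=%.0f min_freq_mhz=%d rapl_pack=%s\n", n, sum / n, min, pack
        }
    '
}
```

A possible invocation on each node is `sudo cpupower monitor | summarize_cpupower_monitor`. A node whose average frequency and RAPL pack value are both clearly lower than those of its peers matches the pattern shown in the bad-CPU example.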
   
         
  3. Validate that both power supplies are delivering power and are in good health. A power supply problem can limit the power available to the CPUs and affect CPU frequency across the whole cluster. In the example below, PS1 reports 10 Watts input and 0 percent current output while PS2 reports 470 Watts and 27 percent, so only PS2 is carrying the load.
     Command:
         # domulti 'ipmitool sdr elist | grep PS'
     Example:
         admin@node1:~> domulti 'ipmitool sdr elist | grep PS'
         ...
         192.168.219.1
         ========================================
         HSBP PSOC Temp   | 29h | ok  | 15.1 | 20 degrees C
         PS1 Status       | 50h | ok  | 10.1 | Presence detected
         PS2 Status       | 51h | ok  | 10.2 | Presence detected
         PS1 Power In     | 54h | ok  | 10.1 | 10 Watts
         PS2 Power In     | 55h | ok  | 10.2 | 470 Watts
         PS1 Curr Out %   | 58h | ok  | 10.1 | 0 percent
         PS2 Curr Out %   | 59h | ok  | 10.2 | 27 percent
         PS1 Temperature  | 5Ch | ok  | 10.1 | 23 degrees C
         PS2 Temperature  | 5Dh | ok  | 10.2 | 24 degrees C
         PS1 Fan1 Fail    | A0h | ok  | 10.1 |
         PS1 Fan2 Fail    | A1h | ok  | 10.2 |
         PS2 Fan1 Fail    | A4h | ok  | 10.1 |
         PS2 Fan2 Fail    | A5h | ok  | 10.2 |
         ...
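The power supply check can be filtered in the same way. The following sketch, which is not part of the original procedure, reads the `domulti` output on standard input and prints any `PS` sensor whose status is not `ok`, plus any `Power In` reading below a minimum wattage. The function name `check_psu` and the 50 Watt default are assumptions chosen for illustration; this KB does not state an expected input power.

```shell
#!/bin/sh
# Hypothetical helper (not an ECS tool): report power supply sensors that look
# suspicious in `ipmitool sdr elist | grep PS` output collected via domulti.
# MIN_WATTS defaults to 50, which is an assumed value, not a documented limit.
check_psu() {
    awk -F '|' -v min_watts="${1:-50}" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # A line holding only an IPv4 address names the node for the lines after it.
        NF == 1 && trim($0) ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
            node = trim($0)
            next
        }
        # Sensor rows: name | id | status | entity | reading
        NF >= 5 && trim($1) ~ /^PS[0-9]+ / {
            name = trim($1)
            status = trim($3)
            reading = trim($5)
            if (status != "ok")
                printf "%s: %s status is %s\n", node, name, status
            else if (name ~ /Power In$/ && reading + 0 < min_watts)
                printf "%s: %s reads %s\n", node, name, reading
        }
    '
}
```

A possible invocation is `domulti 'ipmitool sdr elist | grep PS' | check_psu`. For the example output, this would print `192.168.219.1: PS1 Power In reads 10 Watts`. A low reading on one supply is only a prompt to look closer, since whether it indicates a fault depends on how the supplies are expected to share load.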
                                                             

 

 

Change:

 

 

   

Node CPU frequency is degraded, or power supply issues are causing the nodes to receive degraded power.

                                                             

 

 

Resolution:

 

 

If you encounter this issue, open a Service Request (SR) and reference KB 539540.