DCA: How to check for faulty memory and identify DIMMs with EDAC counters

Environment:

Data Computing Appliance V2

Data Computing Appliance V3

Description:

On DCA V2 servers, a service called EDAC (Error Detection and Correction) is running. Its logs can be very useful for determining whether there are any faulty DIMMs in the cluster and for identifying their locations.

Resolution:

On the specific server, run the commands below to find out whether there are any potential DIMM issues. The best practice is to check the DIMMs for each CPU separately:

For CPU Socket 0:

 

[root@mdw ~]# cat /sys/devices/system/edac/mc/mc0/ce_count
541

For CPU Socket 1:

 

[root@mdw ~]# cat /sys/devices/system/edac/mc/mc1/ce_count
1024

The two outputs show the number of corrected memory errors for each CPU, totaled across all DIMMs attached to that CPU.
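
The per-CPU checks above can be combined into one pass. The following is only a minimal sketch, not part of the referenced support solution. The function name edac_ce_report and its optional directory argument are hypothetical, introduced here for illustration. It also assumes that the kernel exposes per-row counters as csrow*/ce_count under each memory controller, which depends on the kernel version; where those files are absent, only the per-controller totals are printed.

```shell
# Sketch: print the corrected-error (CE) counter of every EDAC memory
# controller, followed by any per-csrow counters that exist.
# edac_ce_report and its optional argument are hypothetical names.
edac_ce_report() (
    # Runs in a subshell so the cd does not affect the calling shell.
    # Default to the sysfs path used in this article.
    cd "${1:-/sys/devices/system/edac/mc}" || exit 1
    for f in mc*/ce_count mc*/csrow*/ce_count; do
        # Skip patterns that matched nothing and unreadable files.
        [ -r "$f" ] || continue
        printf '%s %s\n' "$f" "$(cat "$f")"
    done
)
```

Running edac_ce_report with no argument on the server reads the live counters. A csrow whose counter accounts for most of a controller's total narrows down where the affected DIMM sits, but mapping a csrow to a physical slot is hardware specific and is not covered by this sketch.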

For a detailed step-by-step resolution, please refer to EMC Support Solution 489984: https://support.emc.com/kb/489984