|Article Number: 540277||Article Version: 7||Article Type: Break Fix|
VxRail Appliance Series,VxRail G560 14G Node,VxRail G560F 14G Node
What is DDR4 "self-healing" on Dell EMC VxRail 14G servers running 4.5.400+/4.7.210+ (which include BIOS 2.1.x)?
How do these DDR4 "self-healing" capabilities (BIOS enhancements) change recommended customer and Technical Support actions when encountering memory errors on a server?
There are two main memory-related "self-healing" BIOS enhancements that were implemented for PowerEdge Servers with DDR4 running BIOS version 2.1.x and newer (available at 4.5.400+/4.7.210+, which include BIOS 2.1.x). These enhancements change the recommended steps/actions to take if memory errors occur and are logged to the VC, dial home, or Lifecycle log.
Note: If you are getting memory errors with DDR4 and you are running a release earlier than 4.5.400/4.7.210, please update your code to the latest revision to include the memory self-healing enhancements.
Note: Current memory troubleshooting steps incorporate moving failing DIMMs to a different slot to confirm whether or not the errors follow the DIMM or remain with the DIMM slot.
With 4.5.400+/4.7.210+, the first recommended step is a reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need for any DIMM replacements.
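As a rough illustration of the version threshold above, the following Python sketch checks whether a VxRail release string meets the 4.5.400/4.7.210 minimums named in this article. The function names and the handling of release trains other than 4.5 and 4.7 are assumptions made for this example; the article does not say how other trains should be treated, so the sketch reports them as unknown.

```python
import re

# Minimum patch level per release train, taken from this article.
MIN_SELF_HEALING = {(4, 5): 400, (4, 7): 210}


def parse_version(text):
    """Return (major, minor, patch) from a string such as '4.7.211'."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_self_healing(version_text):
    """True/False for the 4.5 and 4.7 trains, None for any other train.

    Assumption: trains not named in the article are reported as unknown
    (None) rather than guessed.
    """
    major, minor, patch = parse_version(version_text)
    minimum = MIN_SELF_HEALING.get((major, minor))
    if minimum is None:
        return None
    return patch >= minimum
```

For example, `has_self_healing("4.7.200")` indicates that an update is needed before the reboot-first approach applies.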
1. Memory retraining enhancements - Memory retraining, which happens during boot, optimizes the signal timing/margining for each DIMM/slot for best access. Timing characteristics of a DIMM may change for several different reasons:
Any of the above errors being logged in the VC events/Dial home/SEL/Lifecycle log will result in PPR being scheduled for the next reboot (warm or cold).
With either of these correctable or uncorrectable (multibit) memory errors, the resulting memory retraining on reboot/restart may "self-heal" the failing DIMM by optimizing the signal timing/margining for each DIMM/slot. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or these same errors continue to occur.
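The replacement rule in the paragraph above can be sketched in Python as follows. This is only an illustration: the function name and the idea of passing in lists of message IDs collected before and after the reboot are assumptions for this example, and the set of qualifying memory error IDs must be supplied by the caller.

```python
def dimm_action_after_reboot(pre_reboot_ids, post_reboot_ids):
    """Suggest an action for a DIMM once the reboot/retraining has run.

    pre_reboot_ids:  memory error message IDs logged before the reboot.
    post_reboot_ids: message IDs logged during and after the reboot.
    Both are hypothetical inputs gathered by the caller from the logs.
    """
    post = set(post_reboot_ids)
    if "UEFI0106" in post:
        # Memory retraining failed during boot.
        return "replace DIMM"
    if post & set(pre_reboot_ids):
        # The same errors continue to occur after retraining.
        return "replace DIMM"
    return "no replacement needed"
```

The sketch compares message IDs only; in practice the comparison would also need to be limited to the same DIMM location.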
2. Post Package Repair (PPR) - The second "self-healing" memory enhancement repairs a failing memory location on a DIMM by disabling the location/address at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size.
Previously, this functionality was limited to the manufacturing process. Just like with the memory retraining enhancements mentioned earlier, there are certain correctable memory errors that will result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). BIOS will automatically force a cold reboot regardless of what is initiated. Since the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation has been run. Examples of the errors are:
Note: If message ID VXR500MEM8000/MEM8000 (Correctable memory error logging disabled for a memory device at location DIMM_XX) appears in isolation (i.e., not in a similar time frame to any corresponding VXR500MEM0005/VXR500MEM0701/VXR500MEM0702 messages), it will not result in a PPR being scheduled for the next reboot.
Message ID VXR500MEM8000/MEM8000 in isolation OR with a corresponding MCE (machine check exception) is an indication of a general failure of the DIMM and is not a situation where the correctable or uncorrectable buckets will initially overflow. This type of memory event should be treated as a DIMM failure and the listed DIMM should be replaced at the customer's earliest convenience.
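The MEM8000 guidance in the two notes above can be sketched in Python as follows. This is an illustration only: the function name, the event tuple layout, and especially the 10-minute `window` default are assumptions, since the article does not define how close together messages must be to count as a "similar time frame".

```python
from datetime import datetime, timedelta

MEM8000_IDS = {"VXR500MEM8000", "MEM8000"}
COMPANION_IDS = {"VXR500MEM0005", "VXR500MEM0701", "VXR500MEM0702"}


def triage_mem8000(events, mce_seen=False, window=timedelta(minutes=10)):
    """Suggest an action for one DIMM location.

    events:   list of (timestamp, message_id) tuples for that location.
    mce_seen: True if a corresponding machine check exception was logged.
    window:   hypothetical definition of "similar time frame".
    """
    mem8000_times = [t for t, msg_id in events if msg_id in MEM8000_IDS]
    if not mem8000_times:
        return "not applicable"
    if mce_seen:
        return "replace DIMM"
    companion_times = [t for t, msg_id in events if msg_id in COMPANION_IDS]
    # Isolated: no MEM8000 event has a companion message within the window.
    isolated = all(
        abs(t8 - tc) > window
        for t8 in mem8000_times
        for tc in companion_times
    )
    if isolated:
        return "replace DIMM"
    return "reboot to let PPR run; do not move DIMMs"
```

A call such as `triage_mem8000([(datetime(2020, 1, 1, 12, 0), "MEM8000")])` returns the replacement recommendation, because no companion message was logged at all.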
After the reboot, verify that the PPR operation was successfully performed. An example of a successful PPR operation will be similar to:
None of these features will force a node reboot. Nodes will reboot only when the customer initiates the reboot.
Parent PowerEdge KB