Dell EMC VxRail: KB explaining VXRMEM500 events and impacts of DDR4 self healing

           

Article Number: 540277     Article Version: 7     Article Type: Break Fix
   

 


Product:

 

VxRail Appliance Series, VxRail G560 14G Node, VxRail G560F 14G Node

 

Issue:

 

 

What is DDR4 "self-healing" on Dell EMC VxRail 14G servers running 4.5.400+/4.7.210+ code, which includes BIOS 2.1.x?
How do these DDR4 "self-healing" capabilities (BIOS enhancements) change the recommended customer and Technical Support actions when memory errors occur on a server?

Two main memory-related "self-healing" BIOS enhancements were implemented for PowerEdge servers with DDR4 running BIOS version 2.1.x and newer (available with 4.5.400+/4.7.210+, which includes BIOS 2.1.x). These enhancements change the recommended actions to take when memory errors occur and are logged to the VC, dial home, or LifeCycle log.

Note: If you are getting memory errors with DDR4 and are running code earlier than 4.5.400/4.7.210, update your code to the latest revision to pick up the memory self-healing enhancements.

Note: Existing memory troubleshooting steps include moving the failing DIMM to a different slot to confirm whether the errors follow the DIMM or stay with the DIMM slot.

With 4.5.400+/4.7.210+, the first recommended step is instead a reboot/restart (without moving DIMMs to a different slot). This lets the new BIOS enhancements run and potentially resolve ("self-heal") the DIMM errors without any DIMM replacement.
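The version thresholds above can be checked with a small helper. This is only a sketch: the function name is hypothetical, it assumes the version string has a plain "major.minor.patch" form (optionally followed by a "-build" suffix), and it assumes that any branch later than 4.7 also carries the BIOS 2.1.x enhancements, which this article does not state explicitly.

```python
# Minimum patch level per code branch, taken from the thresholds in this article.
MIN_PATCH = {(4, 5): 400, (4, 7): 210}


def includes_self_healing(version: str) -> bool:
    """Return True if the given code version is at or above 4.5.400 / 4.7.210."""
    # Drop an optional "-build" suffix, then read major.minor.patch.
    core = version.strip().split("-")[0]
    major, minor, patch = (int(part) for part in core.split(".")[:3])
    branch = (major, minor)
    if branch in MIN_PATCH:
        return patch >= MIN_PATCH[branch]
    # ASSUMPTION: branches newer than 4.7 include the enhancements; older ones do not.
    return branch > (4, 7)
```

If the helper returns False, the note above applies: update the code first, then follow the reboot-first guidance.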
                                                           

 

 

Resolution:

 

 

   

1. Memory retraining enhancements - Memory retraining, which happens during boot, optimizes the signal timing/margining for each DIMM/slot for best access. The timing characteristics of a DIMM may change for several reasons:

  • Changes in server memory configuration
  • BIOS changes
  • Different operating temperatures of the server or DIMM
  • The general age of the DIMM

Previously, memory retraining occurred during the subsequent boot when a BIOS update or a memory configuration change was detected. Starting with BIOS 2.1.x, additional correctable and uncorrectable memory error "triggers" were added for scheduled retraining:

  • Warning - VXR500MEM0701/MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
  • Critical - VXR500MEM0702/MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
  • Critical - VXR500MEM0005/MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."

If any of the above errors is logged in the VC events, dial home, SEL, or LifeCycle logs, memory retraining is scheduled for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of which type is initiated.

  • Critical - VXR500MEM0001/MEM0001 - "Multi-bit memory errors detected on memory device at location(s) DIMM_XX."

This multi-bit error is fatal and causes the server to reboot. Memory retraining occurs automatically during that boot.

With either the correctable or the uncorrectable (multi-bit) memory errors, the memory retraining on reboot/restart may "self-heal" the failing DIMM by optimizing the signal timing/margining for each DIMM/slot. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or the same errors continue to occur.
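
The retraining behaviour described above can be summarized as a lookup from message ID to outcome. The sketch below is illustrative only: the function names and returned strings are hypothetical, and the only facts it encodes are the message IDs and outcomes listed in this section.

```python
# Message IDs and outcomes as listed in this article (names are illustrative).
RETRAIN_TRIGGERS = {"MEM0701", "MEM0702", "MEM0005"}  # correctable-error triggers
FATAL_MULTIBIT = "MEM0001"                            # uncorrectable, server reboots
RETRAIN_FAILED = "UEFI0106"                           # retraining failed during boot


def normalize(message_id: str) -> str:
    """Drop the 'VXR500' prefix so VXR500MEM0701 and MEM0701 compare equal."""
    mid = message_id.strip().upper()
    return mid[len("VXR500"):] if mid.startswith("VXR500") else mid


def retraining_outcome(message_id: str) -> str:
    """Return the retraining behaviour this article describes for one event ID."""
    mid = normalize(message_id)
    if mid in RETRAIN_TRIGGERS:
        return "retraining scheduled for next reboot"
    if mid == FATAL_MULTIBIT:
        return "fatal error reboot; retraining runs during that boot"
    if mid == RETRAIN_FAILED:
        return "retraining failed; DIMM replacement may be needed"
    return "not a retraining trigger listed in this article"
```

For example, `retraining_outcome("VXR500MEM0702")` and `retraining_outcome("MEM0702")` return the same string, because the prefix is stripped before the lookup.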
   
2. Post Package Repair (PPR) - The second "self-healing" memory enhancement repairs a failing memory location on a DIMM by disabling the location/address at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size.
Previously, this functionality was limited to the manufacturing process. As with the memory retraining enhancements described earlier, certain correctable memory errors result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of which type is initiated. Because the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation has run. Examples of the errors are:

  • Warning - VXR500MEM0701/MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
  • Critical - VXR500MEM0702/MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
  • Critical - VXR500MEM0005/MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."

If any of the above errors is logged in the VC events, dial home, SEL, or LifeCycle log, PPR is scheduled for the next reboot (warm or cold).
     
Note: If message ID VXR500MEM8000/MEM8000 ("Correctable memory error logging disabled for a memory device at location DIMM_XX") appears in isolation (i.e., not in a similar time frame to any corresponding VXR500MEM0005/VXR500MEM0701/VXR500MEM0702 messages), it will not result in PPR being scheduled for the next reboot.
Message ID VXR500MEM8000/MEM8000 in isolation OR with a corresponding MCE (machine check exception) indicates a general failure of the DIMM module; it is not a situation where the correctable or uncorrectable buckets will initially overflow. Treat this type of memory event as a DIMM failure and replace the listed DIMM module at the customer's earliest convenience.
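
One way to spot an isolated MEM8000 event in an exported event list is sketched below. Several things here are assumptions rather than part of this article: the one-hour window (the article does not define "similar time frame"), the reading of "corresponding" as "same DIMM location", the `(timestamp, message_id, dimm)` tuple layout, and the sample events, which are made up for illustration.

```python
from datetime import datetime, timedelta

PPR_TRIGGERS = {"MEM0005", "MEM0701", "MEM0702"}
WINDOW = timedelta(hours=1)  # ASSUMPTION: "similar time frame" is not defined in the article


def strip_prefix(message_id):
    """Drop the 'VXR500' prefix so both ID forms compare equal."""
    mid = message_id.strip().upper()
    return mid[len("VXR500"):] if mid.startswith("VXR500") else mid


def isolated_mem8000(events, window=WINDOW):
    """Return DIMM locations that logged MEM8000 with no trigger event nearby in time.

    `events` is a list of (timestamp, message_id, dimm) tuples.
    """
    normalized = [(ts, strip_prefix(mid), dimm) for ts, mid, dimm in events]
    isolated = []
    for ts, mid, dimm in normalized:
        if mid != "MEM8000":
            continue
        # ASSUMPTION: a "corresponding" message is one for the same DIMM location.
        has_neighbor = any(
            other_mid in PPR_TRIGGERS
            and other_dimm == dimm
            and abs(other_ts - ts) <= window
            for other_ts, other_mid, other_dimm in normalized
        )
        if not has_neighbor and dimm not in isolated:
            isolated.append(dimm)
    return isolated


# Made-up sample data: DIMM_A1 has a MEM0701 five minutes later, DIMM_B2 has nothing nearby.
sample_events = [
    (datetime(2020, 1, 1, 10, 0), "VXR500MEM8000", "DIMM_A1"),
    (datetime(2020, 1, 1, 10, 5), "VXR500MEM0701", "DIMM_A1"),
    (datetime(2020, 1, 2, 9, 0), "MEM8000", "DIMM_B2"),
]
print(isolated_mem8000(sample_events))  # ['DIMM_B2']
```

The helper does not look for a corresponding MCE, which the note above also treats as a DIMM failure, so that case would still need to be checked separately.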
     
     
After the reboot, verify that the PPR operation completed successfully. A successful PPR operation logs a message similar to:

  • Message ID VXR500MEM9060 - "The PostPackage Repair operation is successfully completed on the Dual In-line Memory Module (DIMM) device that was failing earlier."

A DIMM replacement for these correctable memory errors is not necessary unless the PPR operation fails after the reboot. An example of a PPR failure message is:

  • Critical - Message ID UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation because of an issue in the DIMM memory slot X."
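
The post-reboot check above can be sketched as a scan of the logged message IDs. The function name and return values are hypothetical, and letting a failure message take precedence when both IDs are present is a conservative assumption, not something this article specifies.

```python
PPR_SUCCESS = "MEM9060"   # logged as VXR500MEM9060 in the example above
PPR_FAILURE = "UEFI0278"


def ppr_result(message_ids):
    """Classify the post-reboot log as 'failed', 'succeeded', or 'not found'."""
    seen = set()
    for message_id in message_ids:
        mid = message_id.strip().upper()
        if mid.startswith("VXR500"):
            mid = mid[len("VXR500"):]
        seen.add(mid)
    # ASSUMPTION: if both IDs appear, report the failure so it is not overlooked.
    if PPR_FAILURE in seen:
        return "failed"       # per the article, the case where DIMM replacement applies
    if PPR_SUCCESS in seen:
        return "succeeded"    # no replacement needed for these correctable errors
    return "not found"        # neither PPR message was logged
```

A "not found" result only means neither message ID was in the list that was passed in; it says nothing about whether PPR was scheduled.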
                                                             

 

 

Notes:

 

 

None of these features will force a node reboot. Nodes will reboot only when the customer initiates the reboot.   
   
Additional reading:
Parent PowerEdge KB
https://www.dell.com/support/article/us/en/19/qna44643/what-is-ddr4-self-healing-on-dell-poweredge-servers-with-intel-xeon-scalable-processors?lang=en
Whitepaper
https://downloads.dell.com/manuals/common/dellemc_poweredge_yx4x_memoryras.pdf