Data Domain: netmon process repeatedly crashing in DDOS 6.0, or causing DD reboots due to OOM (Out Of Memory)

           

Article Number: 492781    Article Version: 11    Article Type: Break Fix
   

 


Product:

Data Domain

Issue:

 

 

This article describes an issue in which the Data Domain network monitoring daemon (netmon) may crash repeatedly in DDOS 6.0 with the following PANIC strings:
        

Nov 29 13:08:00 occ01ss014 netmon: ERROR: MSG-INTRNL-00001: PANIC: lib/dd_memstats.c: dd_malloc_verify_fence: 502: !(footer->header == header)
Nov 29 13:08:04 occ01ss014 netmon: ERROR: MSG-INTRNL-00001: PANIC: lib/dd_memstats.c: dd_malloc_verify_fence: 501: !(footer->magic == MALLOC_TAIL_MAGIC)
   
However, some customers have instead seen the DD restart due to an Out Of Memory (OOM) condition, particularly on the DD models with the least installed RAM (such as the DD2200 with 8 GiB):

Id      Post Time                  Clear Time                 Severity   Class         Object        Message
-----   ------------------------   ------------------------   --------   -----------   -----------   -----------------------------------------------------------------------------
p0-17   Wed Apr  5 06:11:07 2017   Wed Apr  5 06:13:35 2017   INFO       Filesystem                  EVT-FILESYS-00012: System rebooted
-----   ------------------------   ------------------------   --------   -----------   -----------   -----------------------------------------------------------------------------
   
When the DD reboots, an alert ASUP is generated. It may contain the following or similar messages, which indicate that the DD rebooted due to an OOM condition:

Apr  5 06:00:21 SDIDDNP00001 kernel: (E4)[   2484499.243191] ddsh invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0    
   
The kernel log file (debug/platform/kern.info) would also show an entry similar to the one below (note that this entry is from a different event than the output above):

Apr  3 21:53:53 localhost kernel: (E5)[REPLAY](U0)(MSG-KERN-00018):[   6576803.370014] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled    
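
When reviewing a large messages or kern.info file, it can help to search for the signatures quoted above. The following is a minimal sketch in Python, not a Data Domain tool: the names `classify_line` and `scan` and the label strings are hypothetical, and the patterns are taken only from the sample log lines in this article, so other systems or releases may word these messages differently.

```python
import re
import sys

# Patterns derived from the sample log lines in this article (assumption:
# the wording is the same on the system being examined).
SIGNATURES = {
    "netmon_panic": re.compile(r"netmon: ERROR: .*PANIC: .*dd_malloc_verify_fence"),
    "oom_killer": re.compile(r"invoked oom-killer"),
    "oom_panic": re.compile(r"Kernel panic - not syncing: Out of memory"),
}


def classify_line(line):
    """Return the labels of all signatures that match one log line."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(line)]


def scan(lines):
    """Count how many lines match each signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name in classify_line(line):
            counts[name] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scan_logs.py <path to a saved log file>
    with open(sys.argv[1], errors="replace") as handle:
        for name, count in scan(handle).items():
            print(f"{name}: {count}")
```

A non-zero count for `netmon_panic` points to the crash symptom, while `oom_killer` or `oom_panic` points to the reboot symptom. Either one only suggests this issue and does not confirm it.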
   
                                                                

 

 

Cause:

 

 

Research has found there is a problem with the LLDP library as used in the "netmon" binary. The Link Layer Discovery Protocol (LLDP) is a vendor-neutral link layer protocol in the Internet Protocol Suite used by network devices for advertising their identity, capabilities, and neighbours on an IEEE 802 local area network, principally wired Ethernet.   
   
Support for this protocol was first added to "netmon" in DDOS 6.x to gather and record extra information about the network and neighbouring devices. It does not serve any critical purpose: disabling LLDP support in netmon does not cause any loss of features or functionality, and it does not affect backups.
   
A memory leak was recently found in the LLDP component of the netmon daemon: it repeatedly requests memory that it never frees. For some customers, this results in the netmon process crashing repeatedly. For others, netmon does not crash but continues to leak and consume memory until the system runs out, applications start to swap, and the DD eventually restarts due to OOM.
        

     
                                                             

 

 

Resolution:

 

 

EMC Data Domain Engineering is aware of this issue and is committed to delivering a fix in the following releases:

         
  • DDOS 6.0.1.10
  • DDOS 6.0.2.0 and later (pending confirmation)
  • DDOS 6.1.0.0 and later

In the meantime, customers can apply a workaround that disables LLDP support in the "netmon" process. Because LLDP support is only an add-on for monitoring the network, disabling it does not affect backups or related workloads (restores, replication, etc.) at all.
   
To apply the workaround, log in to the DD CLI as "sysadmin" or an equivalent user, and switch to SE mode by running:

#### Get the DD serial number to use later as the SE password
# system show serialno
#### Turn to SE privilege mode by running the command below and using the DD serial number obtained above as the password
# priv set se
   
Now that you are in SE mode (the prompt will have changed), check the current setting:

# reg show dynamic.netmon.lldp.enabled
dynamic.netmon.lldp.enabled = 1
   
    Now you can disable LLDP as below:   
# reg set dynamic.netmon.lldp.enabled 0
Set dynamic.netmon.lldp.enabled = 0
   
Confirm that the registry key is set to zero. The workaround takes effect immediately:

# reg show dynamic.netmon.lldp.enabled
dynamic.netmon.lldp.enabled = 0
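
If the CLI session is captured to a text file, for example when checking several systems, the final confirmation step can be verified offline. The sketch below is a hypothetical Python helper, not part of DDOS: it assumes the `key = value` output format shown in the examples above, and the function names `parse_registry_output` and `lldp_disabled` are invented for illustration.

```python
def parse_registry_output(text):
    """Parse 'key = value' lines, as printed by 'reg show', into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, echoed commands ("# ...") and lines without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def lldp_disabled(text, key="dynamic.netmon.lldp.enabled"):
    """Return True only if the key is present and set to 0."""
    return parse_registry_output(text).get(key) == "0"


captured = """# reg show dynamic.netmon.lldp.enabled
dynamic.netmon.lldp.enabled = 0
"""
print(lldp_disabled(captured))
```

The helper treats a missing key as "not disabled", so an empty or truncated capture is never mistaken for a successfully applied workaround.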
   

     
Note that this workaround prevents netmon from crashing again, but any memory the process has already leaked is not freed automatically. To make sure the leaked memory, which is unavailable to other processes, does not cause performance issues later, plan a scheduled DD reboot, or contact your contracted support provider to have the "netmon" process restarted without a reboot.