Data Domain FS process (ddfs) may crash with memory allocation failures even though free memory is still available

           

Article Number: 489444    Article Version: 7    Article Type: Break Fix
   

 


Product:

 

Data Domain

 

Issue:

 

 

A customer or Data Domain Support may notice that the FS process has crashed and, on reviewing the ddfs.info log file, find messages such as the following:

      Jul 30 22:29:54 dd.example.com ddfs[27331]: ERROR: MSG-INTRNL-00001: PANIC: lib/dd_alloc.c: _dd_malloc_pc: 1147: Malloc returned null: file ddr/repl/lrepl_client.c, line 1345, size 1048636
      Jul 30 22:29:54 dd.example.com ddfs[27331]: ERROR: MSG-INTRNL-00001: PANIC: include/dd_alloc.h: _dd_malloc_aligned_pc: 629: Memalign returned null: file ddr/gc/gc_cm_process.c, line 680, size 106508
      Jul 30 22:29:54 dd.example.com ddfs[27331]: ERROR: MSG-INTRNL-00001: PANIC: include/dd_alloc.h: _dd_malloc_aligned_pc: 629: Memalign returned null: file ddr/gc/gc_cm_process.c, line 680, size 106508
      Jul 30 22:29:54 dd.example.com ddfs[27331]: ERROR: MSG-INTRNL-00001: PANIC: include/dd_alloc.h: _dd_malloc_aligned_pc: 629: Memalign returned null: file ddr/gc/gc_cm_process.c, line 680, size 106508
      Jul 30 22:29:54 dd.example.com ddfs[27331]: ERROR: MSG-INTRNL-00001: PANIC: lib/dd_alloc.c: _dd_malloc_pc: 1147: Malloc returned null: file ddr/cs/cs_v0.c, line 12624, size 471388
   

The internal memory allocation function of the FS process fails repeatedly (returns null), and eventually the FS PANICs and restarts to recover.
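When a ddfs.info file is large, it can help to pull out only the allocation-failure messages. The following is a minimal Python sketch, not a Data Domain tool: the function name `find_alloc_failures` and the regular expression are illustrative assumptions, written only against the message format quoted above.

```python
import re

# Assumption: the pattern is modelled only on the PANIC lines quoted in this
# article; other DDOS versions may word these messages differently.
ALLOC_FAILURE = re.compile(
    r"PANIC: \S+: (?P<allocator>_dd_malloc\w*): \d+: "
    r"(?P<call>Malloc|Memalign) returned null: "
    r"file (?P<source>\S+), line (?P<line>\d+), size (?P<size>\d+)"
)


def find_alloc_failures(log_text):
    """Return one dict per allocation-failure message found in log_text."""
    failures = []
    for match in ALLOC_FAILURE.finditer(log_text):
        failures.append({
            "call": match.group("call"),        # Malloc or Memalign
            "source": match.group("source"),    # source file that requested memory
            "line": int(match.group("line")),   # line number in that source file
            "size": int(match.group("size")),   # requested size, as logged
        })
    return failures


# Two of the messages from the excerpt above, used as sample input.
SAMPLE = (
    "Jul 30 22:29:54 dd.example.com ddfs[27331]: ERROR: MSG-INTRNL-00001: "
    "PANIC: lib/dd_alloc.c: _dd_malloc_pc: 1147: Malloc returned null: "
    "file ddr/repl/lrepl_client.c, line 1345, size 1048636\n"
    "Jul 30 22:29:54 dd.example.com ddfs[27331]: ERROR: MSG-INTRNL-00001: "
    "PANIC: include/dd_alloc.h: _dd_malloc_aligned_pc: 629: Memalign returned null: "
    "file ddr/gc/gc_cm_process.c, line 680, size 106508\n"
)

for failure in find_alloc_failures(SAMPLE):
    print(failure["call"], failure["source"], failure["line"], failure["size"])
```

Several failures from unrelated source files within the same second, as in the excerpt, suggest a process-wide allocation problem rather than a fault in one component.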
     
However, the kernel logs (kern.info) from the time of the FS crash show that plenty of system memory is available:

   
      Jul 30 22:32:22 dd.example.com kernel: (E6)[   7021750.574502] Signal 3 posted to ddfs(pid=27331)  by ddr_stated(pid=15005)
      Jul 30 22:32:22 dd.example.com kernel: (E6)[   7021750.574545] VM stats: MemFree: 133988964 kB SwapCached:        0 kB Active: 64559228 kB Inactive:    34196 kB SwapTotal:  5242876 kB SwapFree:  5242876 kB Dirty:     4824 kB Writeback:        0 kB
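To make the "VM stats" line easier to read, the name/value pairs can be split out and converted to larger units. This is a hedged Python sketch under the assumption that the line always uses the `Name: <number> kB` layout shown above; `parse_vm_stats` and `kb_to_gib` are hypothetical helper names, not part of DDOS.

```python
import re


def parse_vm_stats(line):
    """Parse the 'Name: <n> kB' pairs that follow 'VM stats:' into a dict (values in kB)."""
    # Everything before 'VM stats:' (timestamp, host, kernel prefix) is ignored.
    _, _, stats = line.partition("VM stats:")
    return {name: int(kb) for name, kb in re.findall(r"(\w+):\s+(\d+) kB", stats)}


def kb_to_gib(kb):
    """Convert kB (as logged by the kernel) to GiB."""
    return kb / (1024 * 1024)


# The VM stats line from the kern.info excerpt above.
KERN_LINE = (
    "Jul 30 22:32:22 dd.example.com kernel: (E6)[   7021750.574545] VM stats: "
    "MemFree: 133988964 kB SwapCached:        0 kB Active: 64559228 kB "
    "Inactive:    34196 kB SwapTotal:  5242876 kB SwapFree:  5242876 kB "
    "Dirty:     4824 kB Writeback:        0 kB"
)

stats = parse_vm_stats(KERN_LINE)
print(f"MemFree:  {kb_to_gib(stats['MemFree']):.1f} GiB")
print(f"SwapFree: {kb_to_gib(stats['SwapFree']):.1f} of {kb_to_gib(stats['SwapTotal']):.1f} GiB")
```

For the line above, MemFree works out to roughly 128 GiB and swap is entirely unused, which is why the allocation failures cannot be explained by the system running out of memory.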
   

          

                                                             

 

 

Cause:

 

 

The underlying Linux kernel is configured with a limit on how many memory allocations any single process may have active at a given time.
   
This is a safeguard on ordinary Linux installations, where multiple processes may contend for memory. It may not be appropriate for DDOS, where a single process (ddfs) is configured to use nearly all of the system memory because the FS process needs it to work properly.
   
The safeguard limits the memory allocations of any individual process so that a leaking process cannot cause other processes' memory allocations to fail. In DDOS, where a single FS process is permitted to use over 95% of the system memory, and especially on larger systems with NUMA architectures, this safeguard may be too strict and may fail FS memory allocations even though memory is still available.
     
                                                           

 

 

Resolution:

 

 

DD Engineering is considering changes to the default DDOS configuration on some or all DD models in upcoming versions, so that the kernel tuning matches the particular needs of DDOS and of a single very large process such as "ddfs" that uses nearly all memory.
   
In the meantime, a workaround is available to partners and DD Support for customers who have hit this problem, once Data Domain Support has confirmed the root cause.
   
If you believe you have encountered this problem, please contact Data Domain Support, reference this KB article, and proactively upload a Support Upload Bundle (SUB) from the affected DD for analysis.