Data Domain: Reported BOOST write streams gradually increase over time with DDOS 6.1 / 6.2

           

Article Number: 530747    Article Version: 3    Article Type: Break Fix
   

 


Product:

 

Data Domain, DD OS 6.1, DD OS 6.2

 

Issue:

 

 

With some DD OS 6.1 or 6.2 versions, ASUP performance data or the output of the "system show performance" command may show a number of (BOOST) write streams that is much higher than expected, increases over time, and bears no relation to the real number of active (BOOST) backup streams from backup applications. For example, output like the following may be seen while no backups are running:

# system show performance view legacy duration 1 hour interval 10 min
                     -----------Throughput (MB/s)----------- ---------------Protocol-----------------  Compression  ------Cache Miss--------  -----------Streams----------- -MTree Active-  -State-  -----Utilization-----  --Latency--  -----------------
                                                                                                                                                                     Repl
Date       Time      Read  Write Repl Network  Repl Pre-comp ops/s  load    data(MB/s)    wait(ms/MB)  gcomp lcomp  thra unus ovhd data meta    rd/  wr/  r+/  w+/  in/ out      rd/wr      'CDPVMSFIRL'     CPU        disk       in ms          stream
---------- --------  ----- ----- ----in/out--- ----in/out--- -----  --%--   --in/out---   --in/out---  ----- -----  ---- ---- ---- ---- ----  ----------------------------- --------------  ---------  -avg/max---- --max---  --avg/sdev-  ---------------
2019/02/06 16:43:00    0.0  10.8   0.00/  0.28   0.00/  0.62   166 19669.41%   0.00/  0.00  18.82/ 68.92    1.6   1.4    0%  12%  32%   3%   1%  1/ 278/   0/   0/   0/   0       0/ 1       ---V---I--L    3%/  7%[9]  26%[ 3]    0.3/  4.5  428.0/  9.0
2019/02/06 16:53:01    0.0  27.1   0.23/  0.36   0.78/  0.78   126 19670.33%   0.04/  0.00   6.64/ 84.34    1.4   2.0    0%   9%  26%   1%   1%  0/ 276/   0/   0/   0/   0       0/ 1       ---V---I--L    2%/  6%[9]  24%[ 0]    0.3/  4.2  424.0/ 12.0
2019/02/06 17:03:00    0.1   1.0   0.16/  0.00   0.56/  0.00    45 19671.07%   0.00/  0.00  32.77/ 53.55    1.4   2.5    0%  27%  44%   6%   1%  0/ 275/   0/   0/   0/   0       0/ 1       ---V---I--L    2%/  8%[9]  25%[ 4]    0.3/  5.5  423.0/ 12.0
2019/02/06 17:13:00    0.2   0.5   0.00/  0.20   0.00/  0.61   129 19671.61%   0.00/  0.00  36.56/ 58.61    1.8   1.5    0%  23%  37%   2%   0%  0/ 275/   0/   0/   0/   0       0/ 1       ---V---I--L    2%/  7%[9]  24%[10]    0.2/  3.9  422.5/ 12.5
2019/02/06 17:23:00    0.1   0.5   0.10/  0.00   0.37/  0.00   127 19672.12%   0.00/  0.00  34.14/ 56.13    1.6   1.9    0%  31%  49%   7%   0%  0/ 275/   0/   0/   0/   0       0/ 1       ---V---I--L    3%/  6%[2]  25%[ 0]    0.2/  5.0  422.0/ 12.0
2019/02/06 17:33:00    0.3   0.5   0.19/  0.19   0.73/  0.47   125 19672.63%   0.00/  0.00  34.19/ 56.20    1.3   2.4    0%   0%   9%   1%   1%  0/ 275/   0/   0/   0/   0       0/ 1       ---V---I--L    2%/  5%[2]  25%[ 3]    0.2/  4.8  423.5/ 12.5
2019/02/06 17:43:00    0.1   0.5   0.00/  0.41   0.00/  0.92   129 19673.14%   0.00/  0.00  34.53/ 56.30    1.9   1.3    0%  22%  43%   5%   2%  0/ 275/   0/   0/   0/   0       0/ 1       ---V---I--L    2%/  6%[9]  26%[ 3]    0.2/  4.2  423.0/ 12.0
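For a longer capture, the write-stream counter can be pulled out of each row with a short script. The following is only a sketch: it assumes the legacy view layout shown in this article (six slash-separated stream counters in the order rd/ wr/ r+/ w+/ in/ out), and the function names and the two thresholds are hypothetical values chosen for illustration, not platform limits.

```python
import re

# Start of a data row: date, time, then read and write throughput in MB/s.
ROW_RE = re.compile(r"^(\d{4}/\d{2}/\d{2})\s+(\d{2}:\d{2}:\d{2})\s+([\d.]+)\s+([\d.]+)\s")
# The six slash-separated stream counters: rd/ wr/ r+/ w+/ in/ out.
STREAMS_RE = re.compile(r"\s(\d+)/\s*(\d+)/\s*(\d+)/\s*(\d+)/\s*(\d+)/\s*(\d+)\s")


def write_stream_samples(output):
    """Return (timestamp, write MB/s, write streams) for each data row found."""
    samples = []
    for line in output.splitlines():
        line = line.strip()
        row = ROW_RE.match(line)
        streams = STREAMS_RE.search(line)
        if row and streams:
            timestamp = f"{row.group(1)} {row.group(2)}"
            samples.append((timestamp, float(row.group(4)), int(streams.group(2))))
    return samples


def suspicious(samples, max_write_mbps=50.0, min_streams=100):
    """Rows with many write streams but little ingest (both thresholds are hypothetical)."""
    return [s for s in samples if s[1] <= max_write_mbps and s[2] >= min_streams]


# Two rows copied from the output in this article.
SAMPLE_PERF = """\
2019/02/06 16:43:00    0.0  10.8   0.00/  0.28   0.00/  0.62   166 19669.41%   0.00/  0.00  18.82/ 68.92    1.6   1.4    0%  12%  32%   3%   1%  1/ 278/   0/   0/   0/   0       0/ 1       ---V---I--L    3%/  7%[9]  26%[ 3]    0.3/  4.5  428.0/  9.0
2019/02/06 17:43:00    0.1   0.5   0.00/  0.41   0.00/  0.92   129 19673.14%   0.00/  0.00  34.53/ 56.30    1.9   1.3    0%  22%  43%   5%   2%  0/ 275/   0/   0/   0/   0       0/ 1       ---V---I--L    2%/  6%[9]  26%[ 3]    0.2/  4.2  423.0/ 12.0
"""

for timestamp, write_mbps, streams in suspicious(write_stream_samples(SAMPLE_PERF)):
    print(f"{timestamp}: {streams} write streams at {write_mbps} MB/s written")
```

Header and separator lines are skipped because they do not begin with a date, so the whole command output can be passed in unchanged.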
   
   
Note that, despite the lack of any significant ingest, the number of write streams is very high (275 to 278 in the wr/ column under Streams). When the number of streams reaches the platform hard limit, replication errors such as the following may occur:
01/13 11:16:38.508 (tid 0x7f8fadf5cb60): mrepl ctx 7: failed to allocate a RDWR stream for the target btree.
01/13 11:16:38.667 (tid 0x7f8fadf5cb60): repl ctx 7: mrepl_start_transfer_3_svc error Failed to allocate a RDWR stream for the target btree.
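To see which replication contexts are hitting this error, the messages can be counted per context. This is a minimal sketch that assumes log lines in exactly the format of the two examples above; the function name is hypothetical, and the location of the log file on the system is not covered here.

```python
import re
from collections import Counter

# Timestamp, thread id, "repl"/"mrepl" context number, then the allocation failure text.
FAILURE_RE = re.compile(
    r"^\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} \(tid 0x[0-9a-f]+\): "
    r"m?repl ctx (\d+): .*failed to allocate a RDWR stream",
    re.IGNORECASE,
)


def rdwr_failures_by_context(log_text):
    """Count RDWR stream allocation failures per replication context number."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE_RE.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts


# The two log lines quoted in this article.
SAMPLE_LOG = """\
01/13 11:16:38.508 (tid 0x7f8fadf5cb60): mrepl ctx 7: failed to allocate a RDWR stream for the target btree.
01/13 11:16:38.667 (tid 0x7f8fadf5cb60): repl ctx 7: mrepl_start_transfer_3_svc error Failed to allocate a RDWR stream for the target btree.
"""

for ctx, count in sorted(rdwr_failures_by_context(SAMPLE_LOG).items()):
    print(f"ctx {ctx}: {count} RDWR stream allocation failures")
```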
   
   
Another symptom that may be seen when the number of streams exceeds the platform limits is replication lag, because replication cannot obtain any new streams. Backups also eventually start to fail because no further write streams can be reserved.
                                                                

 

 

Cause:

 

 

Some DDOS 6.1 and 6.2 versions have a defect in the code path that handles aborted or abandoned BOOST backups: the time elapsed since the backup failed or was abandoned is calculated incorrectly. As a result, under some circumstances an abandoned stream is not closed after the OST_ABANDON_TIMEOUT period has passed, so the stream is leaked and remains in the DD FS list of open BOOST write streams.

 

 

Resolution:

 

 

This defect, which does not affect DDOS 6.0 or earlier, is fixed in the following DDOS versions:

         
  • DDOS 6.1.2.30
  • DDOS 6.2.0.10
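When checking several systems, the comparison against the fixed releases can be scripted. The sketch below rests on assumptions not stated in this article: that the version is given as a plain dotted number, and that later releases within the 6.1 and 6.2 families carry the fix forward. A True result only means the release predates the fix in its family; since only some 6.1 and 6.2 versions have the defect, it does not prove the system is affected. The function names are hypothetical.

```python
# First fixed release in each affected family, as listed in this article.
FIXED_RELEASES = {
    (6, 1): (6, 1, 2, 30),
    (6, 2): (6, 2, 0, 10),
}


def parse_version(text):
    """Turn a dotted version string such as '6.1.2.30' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def predates_fix(version_text):
    """True if the release is in the 6.1 or 6.2 family and older than its fixed release."""
    version = parse_version(version_text)
    fixed = FIXED_RELEASES.get(version[:2])
    # Families not listed above (for example 6.0) are reported as False.
    return fixed is not None and version < fixed


for candidate in ("6.0.2.30", "6.1.1.5", "6.1.2.30", "6.2.0.5", "6.2.0.10"):
    print(candidate, "predates fix" if predates_fix(candidate) else "no upgrade indicated")
```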
For customers unable to upgrade to a fixed release, the workaround is to restart BOOST so that the list of streams is cleared and any leaked connections are dropped. This incurs some downtime: disabling BOOST ("ddboost disable" from the DD command line) interrupts ongoing BOOST backups and prevents new ones from starting until BOOST has been re-enabled ("ddboost enable" from the CLI).