This is a story about an issue I hit a couple of weeks ago. It was resolved quickly, and partially it was me failing to see what had been going on, but since then I have seen others hitting the same wall. Plus, to make it extra nice, we hit another bug which got us off the right track initially... but first things first - the intro.

 

The intro is simple. We have a dozen DD systems, ranging from DD160 to DD890. The small ones serve special roles, such as an isolated zone or DMZ, while the big ones protect most of the DC machinery. The big stuff usually runs on not-so-latest releases, so my big servers are based on the 8.0.x NetWorker release and DDOS 5.2.5.0. The small ones, however, run 8.1.x, and I started updating their DDOS to 5.4.2.1. First I did a small server running some simple backups (isolated location) on a DD160 - no issue that I could see. This server does not use replication at all, nor DD Boost - it is very simplistic.

 

Our next ones were DD620s - two of them - and, as you could guess, here we use replication. At the time this update was made we were running the 8.1.1.8 NetWorker release. Going from 5.2.5.0 to 5.4.2.1 was smooth, but soon enough our SQL TLOG group was failing with the following:

 

80797:nsrclone: Failed to perform DDCL filecopy: Starting a file copy failed ([5005] no room left).

80797:nsrclone: Failed to perform DDCL filecopy: Starting a file copy failed ([5005] no room left).

80797:nsrclone: Failed to perform DDCL filecopy: Starting a file copy failed ([5005] no room left).

 

Well, in my view this had to be DDOS, as it started right after the DDOS update. That was the wrong assumption, but we will get to that.

 

Where do you see this log? Well, as it fails during cloning, and I use CCR, the cloning log can be found in /nsr/tmp/sg/<GROUP NAME>. Once inside, you can do ls -ltra and you will get a listing sorted by timestamp. The last file in the time interval is the one containing the cloning details.
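For example, to pull up the newest log in that directory, something along these lines does the trick (the group name SQL_TLOG is just a placeholder for your own group name):

cd /nsr/tmp/sg/SQL_TLOG
ls -ltra
# the newest file by mtime is the one with the nsrclone output
tail -n 50 "$(ls -tr | tail -n 1)"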

 

When opening a ticket, I decided to go to DD support first, as I had no initial reason to suspect anything else. I went to generate a support bundle, but something weird happened. When using the GUI, it would reset itself at the point where it needs to grab the bundle. OK, that's strange, so I went to the CLI. I simply exported /ddvar and wanted to mount it, but see what happens:

[root@hcrvelin ~]# cd /H

[root@hcrvelin H]# ll

ls: lost+found: Stale NFS file handle

ls: vers: Stale NFS file handle

ls: cacerts: Stale NFS file handle

ls: core: Stale NFS file handle

ls: home: Stale NFS file handle

ls: log: Stale NFS file handle

ls: releases: Stale NFS file handle

ls: reports: Stale NFS file handle

ls: snmp: Stale NFS file handle

ls: support: Stale NFS file handle

ls: tools: Stale NFS file handle

ls: traces: Stale NFS file handle

ls: lock: Stale NFS file handle

ls: certificates: Stale NFS file handle

ls: etc: Stale NFS file handle

ls: firmware: Stale NFS file handle

total 608

drwxrwxrwx 2 root ftp  4096 Mar 20  2014 cacerts

drwxrws--- 2 root ftp  4096 Mar 20  2014 certificates

drwxrwsr-x 2 root ftp  4096 Mar 20  2014 core

drwxrwsr-x 2 root ftp  4096 Jul 17  2013 etc

drwxrwxrwx 2 root ftp  4096 Mar 20  2014 firmware

drwxrwsr-x 7 root ftp  4096 Mar 20  2014 home

drwxrwsr-x 2 root ftp  4096 Sep 19 12:21 lock

drwxrwsr-x 3 root ftp  4096 Oct  5 01:45 log

drwxrwsr-x 2 root ftp 16384 Nov 14  2012 lost+found

drwxrwxrwx 2 root ftp  4096 Mar 20  2014 releases

drwxrwsr-x 2 root ftp  4096 Mar 20  2014 reports

drwxrwsr-x 3 root ftp  4096 Mar 20  2014 snmp

drwxrwsr-x 2 root ftp  4096 Oct  6 08:00 support

drwxrwsr-x 3 root ftp  4096 Sep 15 11:46 tools

drwxrwsr-x 2 root ftp  4096 Oct  6 21:21 traces

drwxrwsr-x 5 root ftp  4096 Sep 15 11:47 vers

[root@hcrvelin H]# cd support

-bash: cd: support: Stale NFS file handle

[root@hcrvelin H]#

 

So, what I did above was mount the /ddvar export under /H (with the DD host name, DDBOX, previously defined via /etc/hosts). As you can see, you get stale NFS handle messages. And you can forget about cd-ing into any of those directories or copying the bundle out. Disabling and re-enabling NFS does not help here. Rebooting the DD box does not help either. Something is broken, and it has clearly been broken since 5.4.2.1. I did a quick check and found the following:

  • all systems that we updated to 5.4.2.1 from 5.2.5.0 have this issue with NFS
  • one system which we purchased recently came with 5.4.1.1; it was also updated to 5.4.2.1, but it did not exhibit this issue

 

At this point I suspected that the original issue with "no room left" was caused by NFS going crazy and the DD thinking the appliance was full, but that theory was shaky, as it could not explain why I would only see it during replication. Backups on both DDs went just fine. And even when the clone failed, I could restart the group and the failed ssids would replicate just fine. I even went so far as to test a restore - all OK. By this time I also had EMC on the line, and they suggested creating a CIFS share of /ddvar. This worked without any issues and we were able to get the support bundles. At this point things started to develop rather quickly.
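For reference, creating and mounting such a share looks roughly like the following - I am quoting the DDOS CLI syntax from memory and the client name is only an example, so double-check it against your DDOS documentation:

# on the DD
cifs enable
cifs share create ddvar path /ddvar clients "nsrserver.example.com"

# on a Linux host with cifs-utils installed (prompts for the DD sysadmin password)
mkdir -p /mnt/ddvar
mount -t cifs //DDBOX/ddvar /mnt/ddvar -o username=sysadmin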

 

First, let me finish the DDOS NFS story from above. It's a bug, and the last time I spoke with a tech there was no fix. The bug id is 115371 and it is caused by a rather large change made in the NFS code. It affects only /ddvar, so no big issues there. If you are not flexible and insist on having your NFS mount of /ddvar work, the simple workaround is to add the noac parameter to the NFS mount options. Here is the /etc/fstab entry I used to mount it:

 

DDBOX:/ddvar /H nfs hard,intr,noac,vers=3,tcp,bg 1 1
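If you would rather test this as a one-off mount instead of an /etc/fstab entry, the equivalent command line would be something like this (same mount point /H as above):

mount -t nfs -o hard,intr,noac,vers=3,tcp,bg DDBOX:/ddvar /H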

 

This now looks much better, as you can see in the following output:

 

[root@hcrvelin ~]# mount -a

[root@hcrvelin ~]# cd /H

[root@hcrvelin H]# ll

total 76

drwxrwxrwx 2 root ftp  4096 Mar 20  2014 cacerts

drwxrws--- 2 root ftp  4096 Mar 20  2014 certificates

drwxrwsr-x 2 root ftp  4096 Mar 20  2014 core

drwxrwsr-x 2 root ftp  4096 Jul 17  2013 etc

drwxrwxrwx 2 root ftp  4096 Mar 20  2014 firmware

drwxrwsr-x 7 root ftp  4096 Mar 20  2014 home

drwxrwsr-x 2 root ftp  4096 Sep 19 12:21 lock

drwxrwsr-x 3 root ftp  4096 Oct  5 01:45 log

drwxrwsr-x 2 root ftp 16384 Nov 14  2012 lost+found

drwxrwxrwx 2 root ftp  4096 Mar 20  2014 releases

drwxrwsr-x 2 root ftp  4096 Mar 20  2014 reports

drwxrwsr-x 3 root ftp  4096 Mar 20  2014 snmp

drwxrwsr-x 2 root ftp  4096 Oct  6 08:00 support

drwxrwsr-x 3 root ftp  4096 Sep 15 11:46 tools

drwxrwsr-x 2 root ftp  4096 Oct  6 21:21 traces

drwxrwsr-x 5 root ftp  4096 Sep 15 11:47 vers

[root@hcrvelin H]# ll support/

total 1001252

-rw-r--r-- 1 root ftp       634 Oct  6 08:00 am_email

-rw-r--r-- 1 root ftp    892733 Oct  6 07:02 autosupport

-rw-r--r-- 1 root ftp    990251 Oct  5 07:02 autosupport.1

-rw-r--r-- 1 root ftp    858480 Sep 26 07:02 autosupport.10

-rw-r--r-- 1 root ftp    856243 Sep 25 07:02 autosupport.11

-rw-r--r-- 1 root ftp    866806 Sep 24 07:02 autosupport.12

-rw-r--r-- 1 root ftp    856128 Sep 23 07:02 autosupport.13

-rw-r--r-- 1 root ftp    868425 Oct  4 07:02 autosupport.2

-rw-r--r-- 1 root ftp    870761 Oct  3 07:02 autosupport.3

-rw-r--r-- 1 root ftp    861549 Oct  2 07:02 autosupport.4

-rw-r--r-- 1 root ftp    863479 Oct  1 07:02 autosupport.5

-rw-r--r-- 1 root ftp    864060 Sep 30 07:02 autosupport.6

-rw-r--r-- 1 root ftp    885014 Sep 29 07:02 autosupport.7

-rw-r--r-- 1 root ftp    996959 Sep 28 07:03 autosupport.8

-rw-r--r-- 1 root ftp    857614 Sep 27 07:02 autosupport.9

-rw-r--r-- 1 root ftp      3711 Oct  6 01:00 sfsshowcomp_output

-rw-r--r-- 1 root ftp 506083508 Sep 19 12:40 support-bundle.tar.gz

-rw-r--r-- 1 root ftp 506799750 Sep 19 12:04 support-bundle.tar.gz.1

 

Having access to the support bundle, which contains all the logs necessary to figure out what might be going on, also gave us a clue about something I had failed to realize myself - it seems we were using more replication sessions than the DD allows. Initially this didn't make any sense to me, as all I had changed was DDOS, but the story runs a bit deeper than that. The reason this becomes so obvious with DDOS 5.4 is that these limits are enforced in 5.4, while before they were not. So, once DDOS hits the limit, it blocks any new session, which to the source looks as if there is no more room left, and so the session fails. A bit misleading, but that's what happened. What about the limits? What I just described happened on a DD620, and if you read the DD620 specs carefully you will find the following:

 

[image: DD620 specification table showing the replication stream limits]

 

So, while the source can push 30 sessions, the target can only accept 20, and in this case both source and target are DD620s, so my effective limit is 20 sessions. But wait, isn't nsrclone supposed to be sequential?

 

Now, unlike NetWorker, on the DD you can see logs with millisecond timestamps, so it was much easier to see what was going on. The group which always failed for me was the SQL transaction log backup of 16 databases, which produced 16 save sets. Each save set usually causes two save sets to be saved. Still, nsrclone always used to be sequential, but this changed recently. It changed first in 8.2, where concurrent cloning was brought to the masses. For some strange reason, engineering also brought it into a patch release known as 8.1.1.6. Remember, I was already running 8.1.1.8, so in theory I was affected. To test this I tried 8.1.1.9, 8.1.1.7 and 8.1.1.5. As expected, 8.1.1.5 was not affected. EMC was surely aware of this, as they published guidelines in the release notes and a KB article (see https://support.emc.com/kb/188548), but I made the mistake of assuming this applied only if you used something actually called immediate cloning. Immediate cloning is the feature where you instruct savegrp to kick off CCR right after each ssid completes, rather than when all ssids in the group have completed. The nsradmin view of immediate cloning looks like the following:

 

[image: nsradmin view of a group resource with immediate cloning enabled]

 

In the old days, prior to 8.1.1.6, nsrclone was indeed sequential. This is totally unrelated to the immediate cloning feature. As of 8.1.1.6 this is no longer the case. So how many sessions does NetWorker open? This is easy to see - again, check the cloning log under /nsr/logs/sg/<NSR GROUP>. Here is the hint:

 

77426:nsrclone: Cloning save sets from the volume hcrvelin.203

104861:nsrclone: Total client count in a volume hcrvelin.203 is 1

77426:nsrclone: Cloning save sets from the volume hcrvelin.203

104861:nsrclone: Total client count in a volume hcrvelin.203 is 1

104864:nsrclone: Number of clone threads per client set to 30. Max number of savesets per thread is set to 0. Max client threads is set to 1 for volume hcrvelin.203

 

Yup, that's 30 sessions. What is my limit? 20 sessions. 10 too many. It looks like NetWorker now always tries to ramp up to 30 sessions. Here is an example of the log when using a group with 9 clients:

 

104861:nsrclone: Total client count in a volume hcrvelin.202 is 9

104864:nsrclone: Number of clone threads per client set to 3. Max number of savesets per thread is set to 0. Max client threads is set to 9 for volume hcrvelin.202
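If you want to check quickly what your own groups are doing, grepping the cloning logs for that line is enough (the group name is, again, only an example):

grep "Number of clone threads" /nsr/logs/sg/SQL_TLOG/*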

 

So max(threads per client) x max(client threads) <= 30. How do you get this under control, i.e. make it sequential as it was prior to 8.1.1.6? Well, you must create a file called /nsr/debug/nsrcloneconfig and place the following inside:

 

max_threads_per_client=0

max_client_threads=0
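For completeness, here is a minimal sketch of creating that file from the shell on the NetWorker server (the mkdir is harmless if /nsr/debug already exists):

mkdir -p /nsr/debug
cat > /nsr/debug/nsrcloneconfig <<'EOF'
max_threads_per_client=0
max_client_threads=0
EOF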

 

The article I mentioned earlier (and the release notes) also cover some other examples, so be sure to read them to get the full picture. Be aware that you can set these as general values or per NSR group resource. But also be aware that more than 30 sessions in total is not supported at the moment. And if you have a DD620, 30 might already be too much. What about a DD890? It should be enough, but if you have multiple datazones all using the same DD890, I can easily see how you could hit the wall there too. So make sure to plan for this.

 

Also be aware that the /nsr/debug/nsrcloneconfig file only works as of 8.1.1.7. Since the change was introduced in 8.1.1.6, I suspect it was quickly observed that something needed to be done to tame these CCR sessions. Looking ahead, to avoid all the potential issues that might arise from a DD being used by multiple datazones or applications which may not know about each other, I expect EMC NetWorker to query the DD to learn how many CCR sessions it can use. I also expect (or hope) DDOS to be enhanced with a queuing mechanism - so instead of dropping sessions, just queue them. Right now, if you are not sure what to use, revert to the sequential behavior as shown above and start planning/evaluating changes to take advantage of the new possibilities. I can imagine that concurrent cloning in combination with immediate cloning might be a good thing, but if you have a large number of small save sets which finish around the same time (e.g. SQL TLOG) and their number might go beyond the limits imposed by DDOS - then be careful.