To get us started, I thought I would talk a little bit about "Avamar Zen".
Avamar operates best in "steady state" -- that is, the amount of data being removed from the system by garbage collection each day is as much or more than the amount of data being ingested during backups each day.
One of the most common issues the Avamar support team sees generally starts with a panicked call -- "My Avamar is full! I need to get backups tonight!"
There are a variety of reasons this can happen. Support will help you, the customer, to get the system back into a state where backups can run. From there the case will go down one of two paths.
The Sad Path
The first path is a path of terrible suffering and pain.
If the current capacity issue is addressed (usually by removing data and running garbage collection) without making changes to the data ingest rate or the data expiry rate, the system will run for some time before it inevitably becomes full again. You will find yourself back on the phone with support in a few days. Or a few weeks. Or a few months.
There may be checkpoint overhead issues. There may be garbage collection failures. There will almost certainly be backup failures.
This will happen over... and over... and over again. You will be frustrated. Support will be frustrated. After a while, management on both sides will be frustrated.
The Happy Path
The second path leads to Avamar nirvana.
I'll copy the Wikipedia definition of Nirvana so you'll have it in front of you:
[Nirvana] ... refers to release from a state of suffering after [a] ... period of committed.
When the data ingest rate and the data removal rate on a system are in balance and the capacity is monitored regularly, maintenance will run as scheduled, backups will run as scheduled, and the capacity will gradually stabilize.
There are a number of tools that can help you understand and manage your Avamar server's capacity. Here is a brief overview of the most customer-friendly ones:
Inside the Enterprise Manager, there are graphs showing capacity history and forecast. This is a good way to review the capacity history at a glance. The graphs themselves are fairly self-explanatory so I won't spend a lot of time on them.
There is a report built into the Avamar software called "Activities - DPN Summary". This report will tell you on a backup-by-backup basis how much new data is being sent to the server by a client.
To generate the report on your own Avamar system:
- Open the Avamar Administrator GUI and log into the grid
- Select Tools => Manage Reports...
- Scroll down until you find "Activities - DPN Summary" and select it
- Click the Run button
- Select the appropriate date range (be aware that very large ranges could cause the GUI to stop responding for some time) and then click Retrieve
- Click the Export button and you can save the report as a file in comma separated values (CSV) format so it can be imported into spreadsheet software for easier analysis
The columns that are likely to be of most interest will be columns I (as in India) through M (as in Mike). These columns are, respectively:
I - ModReduced - Bytes saved by using compression
J - ModNotSent - Bytes present on the Avamar server but not in the client caches
K - ModSent - New bytes added to the server by the backup
L - TotalBytes - The total size of the data being protected (whether or not we had to send it)
M - PcntCommon - The percentage of data for the backup that is already on the grid (higher is better)
Column K in particular is useful for measuring capacity and capacity growth. Using the ModSent information for each backup still present on the grid and the size of the initial backup for the client, you can do a rough "back of the envelope" calculation of how much space that client is consuming on the grid.
One quick caveat: the DPN Summary is a report, not a status, which means it includes information for backups that may have already expired from the grid.
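Here is a minimal sketch of that back-of-the-envelope calculation, working from an exported CSV. ModSent and TotalBytes are the report columns described above; the "Host" and "StartDate" column names, the sample figures and the retention period are placeholders I made up for illustration, so adjust them to match your own export. I am also assuming that the ModSent of a client's earliest backup is a reasonable stand-in for the size of its initial backup.

```python
import csv
import io
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical excerpt of an exported DPN Summary report.
# "Host" and "StartDate" are placeholder column names.
SAMPLE_CSV = """Host,StartDate,ModSent,TotalBytes
client1,2012-01-05,40000000000,50000000000
client1,2012-03-18,150000000,50200000000
client1,2012-03-19,120000000,50300000000
client2,2012-03-18,30000000,8000000000
client2,2012-03-19,25000000,8000000000
"""


def estimate_usage(csv_text, today, retention_days):
    """Rough bytes consumed per client: initial backup plus retained ModSent."""
    cutoff = today - timedelta(days=retention_days)
    rows_by_host = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        started = date.fromisoformat(row["StartDate"])
        rows_by_host[row["Host"]].append((started, int(row["ModSent"])))

    usage = {}
    for host, rows in rows_by_host.items():
        rows.sort()
        # The first backup is always counted, even if it has expired,
        # because later backups still reference most of its data.
        initial_sent = rows[0][1]
        # Later backups only count if they should still be on the grid.
        later_sent = sum(sent for day, sent in rows[1:] if day >= cutoff)
        usage[host] = initial_sent + later_sent
    return usage


for host, used in estimate_usage(SAMPLE_CSV, date(2012, 3, 20), 30).items():
    print(host, used // 1_000_000, "MB (rough estimate)")
```

The date filter is how this sketch deals with the caveat about expired backups; it is only as accurate as the retention period you feed it.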
Using capacity.sh requires you to log into the utility node of the grid using SSH. If you don't know how to do this already, this option is probably not for you.
The capacity.sh script is shipped as part of the Avamar base install. It's a shell script that analyzes the ingest data and garbage collect data for a system and produces an ASCII report showing the daily ingest for the last 14 days (by default - use --days=n to specify a number of days), the daily garbage collect performance and the net change.
The report will also show the highest change rate clients on the system (in other words, the clients using the most storage after de-dupe).
To run it, log into the utility node as the admin user and type "capacity.sh" at the prompt.
You'll get back output that looks something like this:
Date          New Data    #BU     Removed    #GC  Net Change
----------  ----------  -----  ----------  -----  ----------
2012-03-06     4888 mb      6       -1 mb      4     4887 mb
2012-03-07     1232 mb      9        0 mb             1232 mb
2012-03-08    63902 mb      9       -2 mb      4    63900 mb
2012-03-12     1158 mb      4        0 mb             1158 mb
2012-03-13      497 mb      7       -1 mb      1      496 mb
2012-03-14     1661 mb      8       -1 mb      1     1660 mb
2012-03-15     4772 mb     10       -1 mb      1     4771 mb
2012-03-16      781 mb      8     -268 mb      1      513 mb
2012-03-17      701 mb      9        0 mb      1      701 mb
2012-03-18      369 mb      7        0 mb      1      369 mb
2012-03-19      503 mb      9        0 mb      1      503 mb
2012-03-20     1630 mb      7        0 mb             1630 mb
----------  ----------  -----  ----------  -----  ----------
Average        6841 mb                -22 mb             6818 mb

Top 5 Capacity Clients         Added  % of Total    ChgRate
----------------------  ------------  ----------  ---------
client1                     68405 mb       83.3%     3.022%
client2                      4571 mb        5.6%     1.851%
client3                      3844 mb        4.7%     2.592%
client4                      3062 mb        3.7%     1.914%
client5                      1738 mb        2.1%     0.128%
Total for all clients       82100 mb      100.0%     0.016%
From the output, it's very easy to see which direction capacity is moving. On this particular grid, we are adding much more data than we are removing. I would be worried if I didn't know that the system is a testing grid that is less than 5% full.
If capacity utilization is increasing day over day even after data has started expiring from the grid, no matter how many storage nodes are added, sooner or later the system will fill up. The capacity.sh script is a very good way to show this trend.
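To put a rough number on "sooner or later", here is a small sketch that takes the daily figures from the sample report above and projects how long the free space would last at the average net change. The free space figure is a made-up placeholder, and a straight-line projection like this is only an approximation; it is not something capacity.sh does for you.

```python
# (new_data_mb, removed_mb) per day, copied from the sample report above.
DAILY = [
    (4888, -1), (1232, 0), (63902, -2), (1158, 0), (497, -1), (1661, -1),
    (4772, -1), (781, -268), (701, 0), (369, 0), (503, 0), (1630, 0),
]


def average_net_change_mb(daily):
    """Average of new data plus removed data per day (removed is negative)."""
    return sum(new + removed for new, removed in daily) / len(daily)


def days_until_full(free_mb, net_change_mb_per_day):
    """Straight-line projection; None means the grid is flat or shrinking."""
    if net_change_mb_per_day <= 0:
        return None
    return free_mb / net_change_mb_per_day


net = average_net_change_mb(DAILY)
free_mb = 1_900_000  # hypothetical free space, not taken from the report
print("average net change:", int(net), "mb/day")
print("projected days until full:", days_until_full(free_mb, net))
```

A result of None is what you are aiming for: it means garbage collection is keeping up with ingest.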
Long-Term Capacity Management
If your Avamar system is not in steady state even after all your clients begin expiring backups, there are really only two long term options:
- Back up less
- Expire more
For the first option, there are different approaches you could take. If there is spare capacity available on another grid, clients can be moved. If there are high change rate clients consuming large amounts of your capacity, it might be better to move those clients off to Data Domain. If there are non-critical clients, they could be backed up less frequently (or not at all). There may be items such as temp files that should be excluded from the datasets to avoid backing up high change, low value data from each client.
For the second option, it's a good idea to periodically review retention practices. Do you need all of the data that has been backed up? Is it still valuable? One other important consideration when deleting items from the Avamar is that de-duplication is a double-edged sword. Be sure to take a look at the DPN Summary report when deleting backups. No matter what the GUI says about the size of the backup being deleted, deleting individual backups will only reclaim roughly the amount of space listed in the "ModSent" column of the DPN Summary. The overall size of the backup might be 500GB but if there are only 2MB of unique data, you will only regain 2MB of space.
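The arithmetic behind that last example can be sketched as follows. The Backup class and its field names are hypothetical; they simply mirror the TotalBytes and ModSent columns of the DPN Summary, and the result is a rough estimate, not an exact figure.

```python
from dataclasses import dataclass


@dataclass
class Backup:
    label: str
    total_bytes: int  # TotalBytes: size of the data being protected
    mod_sent: int     # ModSent: new bytes this backup added to the server


def reclaimable_bytes(backups_to_delete):
    """Roughly what deleting these backups gives back: the sum of ModSent."""
    return sum(b.mod_sent for b in backups_to_delete)


GB = 10**9
MB = 10**6
candidates = [Backup("big-but-mostly-common", 500 * GB, 2 * MB)]

apparent = sum(b.total_bytes for b in candidates)
print("size shown for the backup:", apparent // GB, "GB")
print("space you can expect back:", reclaimable_bytes(candidates) // MB, "MB")
```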
Those are the basics of capacity management on Avamar. I look forward to your questions!
Unfortunately there's no quick and easy way to get the dedup ratio for a whole grid. I've filed a Request For Enhancement (RFE) to request that a future version of the Enterprise Manager (EM) report the global dedup ratio for each grid.
In the meantime, it's possible to set up ODBC on a Windows system so that it can query the SQL Views in the Avamar Administrator Server database. The DPN Summary report is based on information found in the v_dpnsummary view, so with an ODBC connection configured you could use database software, a spreadsheet, a reporting package, etc. to run an automatic calculation or report based on the information in v_dpnsummary. It's not the most straightforward option but it would work.
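As a rough illustration of the kind of calculation I mean, the sketch below runs an aggregate query against a stand-in table. To keep it runnable anywhere, it uses an in-memory SQLite table named v_dpnsummary instead of a real ODBC connection, and the column names (host, totalbytes, modsent) are my assumption based on the report columns, so check them against the actual view before relying on this. The "percent common" formula is just one possible way to define a grid-wide figure.

```python
import sqlite3

# Stand-in for the real view: an in-memory table with assumed column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v_dpnsummary (host TEXT, totalbytes INTEGER, modsent INTEGER)"
)
conn.executemany(
    "INSERT INTO v_dpnsummary VALUES (?, ?, ?)",
    [
        ("client1", 50_000_000_000, 150_000_000),  # hypothetical rows
        ("client2", 8_000_000_000, 30_000_000),
    ],
)

# Share of protected bytes that did not have to be sent, across all backups.
QUERY = """
SELECT 100.0 * (1.0 - CAST(SUM(modsent) AS REAL) / SUM(totalbytes))
FROM v_dpnsummary
"""
(pct_common,) = conn.execute(QUERY).fetchone()
print(f"grid-wide percent common: {pct_common:.2f}%")
```

In principle the same SELECT could be issued over the ODBC connection from a spreadsheet or reporting package, which is the approach described above.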
Ian, a two-part question for you:
When we activate a client by right-clicking on the Avamar client icon in the toolbar and saying “Manage, Activate”, the Avamar server adds a suffix like ".our-org.com".
So the client name will be something like "websrv01.our-org.com".
We have many Avamar servers, and they are not set up identically, so today, we have many different suffixes: "our-org.com", "ent.our-org.com", no-suffix, etc.
On the other hand, when we activate a node using the MCCLI command line, we can give it any name we want, including no suffix at all. No-suffix seems to work fine, backups work fine, Avamar has the full DNS name in its database somewhere - all is good.
But we'd like to get our client names consistent, if we can.
Is there any reason NOT to just use (for example) "websrv01" instead of "websrv01.our-org.com"?
I would guess that if there WERE a problem, the client would fail to activate.
We can rename a client using, for example, "mccli client edit --name=websrv01.our-org.com --new-name=websrv01".
And if we do, do its old backups get orphaned? So would we have to rename them, too?
Thanks for your advice here.
For the first question, we recommend using the fully qualified name instead of the short name because by default, Avamar will not allow two clients with the same name to be activated to the same grid. This could be an issue if you have, for example, Windows Domain Controllers called "dc1.west.our-org.com" and "dc1.east.our-org.com". One of these clients would fail to activate because of the name conflict.
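If you do want to move to short names, it may be worth checking for collisions first. This is a minimal sketch that takes a list of fully qualified client names (the list here is made up, apart from the examples already mentioned) and reports any short names that would be claimed by more than one client.

```python
from collections import defaultdict


def short_name_collisions(fqdns):
    """Map each short name to the FQDNs sharing it, keeping only conflicts."""
    by_short = defaultdict(list)
    for name in fqdns:
        short = name.split(".", 1)[0]  # everything before the first dot
        by_short[short].append(name)
    return {short: names for short, names in by_short.items() if len(names) > 1}


clients = [
    "dc1.west.our-org.com",
    "dc1.east.our-org.com",
    "websrv01.our-org.com",
]
print(short_name_collisions(clients))
```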
The answer to the second question requires a bit of background first.
When explaining how client accounts work, I always like to use a bucket as a metaphor for a client account. Inside the bucket, you will find all the backups for that client. Engraved into the handle of the bucket is a unique identifier called the Client ID or CID. There is also a label on the bucket (the hostname) but that's only used for two things:
- We use the hostname as a human-friendly identifier because humans aren't very good at remembering 20-byte hexadecimal strings (funny, that).
- We compare the hostname and CID whenever an activated client checks in. If the hostname and CID do not match what we have recorded, we will not issue backup or restore workorders to that client. This is to prevent a rogue client from stealing the hostname of an activated client and impersonating it.
So using the bucket analogy, changing the label on the bucket (the hostname) will not affect the contents of that bucket (the backups).
I ran some quick testing of this and as long as the client name you use as "new-name" is the short name or any of the fully qualified names associated with the client, you shouldn't have any issues. You do have to be careful with this, however. If you make a typo in the new-name, the Avamar Backup Agent will not be able to process workorders because the mapping between the hostname and CID on the client will not match the mapping between the hostname and CID on the Avamar server.
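For those who prefer code to buckets, here is a toy model of the analogy. It is purely illustrative and not how the Avamar server is actually implemented; the class names, the CID value and the backup labels are all made up.

```python
from dataclasses import dataclass, field


@dataclass
class ClientAccount:
    cid: str       # engraved on the handle; never changes
    hostname: str  # the label on the bucket; can be changed
    backups: list = field(default_factory=list)


class Server:
    def __init__(self):
        self.accounts = {}  # keyed by CID, not by hostname

    def register(self, cid, hostname):
        self.accounts[cid] = ClientAccount(cid, hostname)

    def rename(self, cid, new_name):
        # Only the label changes; the contents of the bucket are untouched.
        self.accounts[cid].hostname = new_name

    def check_in(self, cid, hostname):
        """Issue workorders only if the hostname/CID pair matches our record."""
        account = self.accounts.get(cid)
        return account is not None and account.hostname == hostname


server = Server()
cid = "0123abcd"  # hypothetical, shortened CID
server.register(cid, "websrv01.our-org.com")
server.accounts[cid].backups.extend(["backup-1", "backup-2"])

server.rename(cid, "websrv01")
print(server.accounts[cid].backups)      # still both backups
print(server.check_in(cid, "websrv01"))  # names agree, workorders flow

server.rename(cid, "webssrv01")          # typo in the new name
print(server.check_in(cid, "websrv01"))  # mismatch, no workorders
```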
I hope this helps!
How do we ensure that we have a good backup with Avamar?
We can try using "mccli backup validate..."; but that runs a long time, and I don't see where the results go.
We can look at the "Last Successful Backup Date" from "mccli client show...", but that apparently reports success even if the backup had exceptions.
We can look at activities, and see if they Completed, Completed With Exceptions, etc. But do we have to look for a success on each plugin, to be sure? For example, Windows filesystem and Windows VSS have to both succeed, right?
Just looking for a straightforward way to say "Yes, we're OK on server xxx".
Validating a backup is essentially doing a restore and discarding the results. It will prove that the backup is consistent (in other words, the backup is on the server and all the bits that were backed up can be restored) but it doesn't tell us anything about the content of that backup. For example, on Windows clients, if VSS is not functioning correctly, any open files on the client will not be backed up. This backup will be valid (since it is consistent and restorable) but if you try to restore one of those open files, you will not be able to do so.
When reviewing the activities, keep in mind that each plugin has a specific purpose. If the filesystem backup succeeds but the VSS backup fails, it will not be possible to perform a bare metal recovery of this system but any data backed up as part of the file system backup will be available for restore. Similarly, if your DB2 backups succeed but your filesystem backups fail, you will be able to recover the databases but not the file system. Naturally, this applies to any plug-in backups.
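The per-plug-in logic can be sketched like this. The plug-in names and the shape of the input are hypothetical (you would have to build the status dictionary yourself from the activity data), and I have chosen to treat anything other than a clean "Completed" as a gap worth reviewing, including "Completed w/exceptions".

```python
# Hypothetical set of plug-ins that must all succeed for this client
# to be considered fully protected.
REQUIRED = {"Windows File System", "Windows VSS"}


def protection_gaps(latest_status_by_plugin, required_plugins):
    """Plug-ins whose latest backup is missing or did not complete cleanly."""
    return sorted(
        plugin
        for plugin in required_plugins
        if latest_status_by_plugin.get(plugin) != "Completed"
    )


latest = {
    "Windows File System": "Completed",
    "Windows VSS": "Failed",
}
gaps = protection_gaps(latest, REQUIRED)
if gaps:
    print("Not fully protected, review:", ", ".join(gaps))
else:
    print("Yes, we're OK on this server")
```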
There are some reports built into the Avamar Administrator that you may find useful for determining if a client is fully protected or not. In particular, take a look at the "Activities - Exceptions", "Activities - Failed" and "Client - No Activities" reports. The two "Activities" reports are pretty self-explanatory (they report on any clients that complete with exceptions or fail). The "Client - No Activities" report will give you information on any client that is not running backups.
It is also possible to create your own reports using the GUI, though the options for customization are somewhat limited.
You can configure the system to send built-in or custom reports to you by e-mail on a daily basis. If you find that these reports are not sufficient for your needs, you could use the ODBC connectivity I mentioned above to "roll your own" or you could use a reporting package such as Data Protection Advisor to generate more in-depth reports "out of the box".
Ian, thanks for your knowledgeable and well-stated responses.
Here's a new question:
I’m trying to back up my laptop. I get thousands of errors like this:
2012-03-30 08:37:20 avtar Error <5137>: Unable to open "C:\temp.txt" (code 5: Access is denied).
I can open these files myself.
The Backup Agent service is running as “Local System Account”.
Why can’t it open these files (thousands of them - possibly all files on the machine)? Any ideas?
Also, the GUI reports Success even though thousands, possibly all, files failed. That doesn't seem right. Is there a flag I can set to prevent bogus "Success" results?
I would recommend reviewing the NTFS permissions for these files. It's possible somebody has deleted the SYSTEM account from the ACL or removed the account's read permissions for these files. By default, the SYSTEM account is granted Full Control for all files on a system and Microsoft recommends that this not be changed.
I've also seen this message if the VSS snapshot fails for some reason (for example if the snapshot is removed out from under the running backup). See if there are any SnapVol errors in the Windows event logs.
Are the backups being marked "Completed" or "Completed w/exceptions"?
For C:/Temp.txt, SYSTEM has Full control.
I see no Snapvol events in Application, Security, or System event logs.
Avamar Client, "Backup...", History says "Completed Successfully", no indication of exceptions.
Avamar Client, Manage, View console says "Completed (45055 errors)";
Avamar administrator, Log on to Server, Activities, says "Completed w exceptions".
So I still don't see why these files are not readable.
Is a Windows service involved? Can I bypass it? I tried cutting and pasting the "avtar.exe ..." command line from the ..avs/var/clientlogs/... into a command shell to try and run the backup as myself, not through the service. But I got:
C:\Program Files\avs\bin>avtar --sysdir="C:\Program Files\avs\etc" --bindir="C:\Program Files\avs\bin" --vardir="C:\Program Files\avs\var" --ctlcallport=1706 --ctlinterface="3001-Windows-Windows-Test -1333110692426" --logfile="C:\Program Files\avs\var\clientlogs\lm-cmdlinetest.log" --sessionattr=dtlt=true
avtar Info <5241>: Logging to C:\Program Files\avs\var\clientlogs\lm-cmdlinetest.log
avtar Info <5551>: Command Line: avtar --sysdir="C:\Program Files\avs\etc" --bindir="C:\Program Files\avs\bin" --vardir="C:\Program Files\avs\var" --ctlcallport=1706 --ctlinterface="3001-Windows-Windows-Test -1333110692426" --logfile="C:\Program Files\avs\var\clientlogs\lm-cmdlinetest.log" --sessionattr=dtlt=true
avtar FATAL <10790>: Unable to connect to 127.0.0.1:1706 with proprietary encryption
avtar Info <9901>: Cancel Request being processed (setting code from 0 to 536870920)
Appreciate any advice. Guess I could open a ticket with EMC...
It is possible to run avtar manually but you can't copy and paste the command line from the log because avtar commands that are run as part of scheduled backups use the "CTL" interface to communicate with their caller to retrieve the information required to run the backup (targets, options, etc.) and return status messages. For file system backups the caller will be the "Avamar Backup Agent" or "Backup Agent" service (internally we call it "avagent" which is the name of the binary). Since avagent didn't start this avtar process, it won't be listening for replies and you will receive the FATAL you've pasted above.
I think it would be best to open a service request for this issue. Support can do an in-depth analysis of the logs or work with you live via WebEx and you're likely to get a faster resolution this way.
If you speak with L2 support, they can show you how to run a "degenerate" test that will process the filesystem but discard the results instead of sending them to the server. Such a test is normally used to isolate performance bottlenecks (it measures how fast avtar can read the filesystem since it doesn't have to wait for replies to come back from the server) but it would also be useful for this type of troubleshooting since it would allow you to keep your tests local to the client.
We plan to migrate a few thousand servers to several Avamar grids. But we don't want to throw too many first-time on-demand backups at an Avamar grid all at once.
How many is too many? Or is there any reason not to queue up on-demand backups for 1,000 different servers and wait for them to complete?
(Obviously we don't want to throw more data at the grid than it can hold after being de-duped. We think we have that part figured out.)
Another way of saying it:
We can monitor various things:
total client GB that will be ingested,
number of sessions in use on the Avamar server,
etc. Is there a metric on one of these that we should be careful not to exceed?