Best Practices for using DistCp to Back Up Hadoop
In this post I will describe the recommended best practices for the backup of non-Isilon Hadoop environments to an EMC Isilon cluster. With its very robust erasure-coding data protection that provides greater than 80% storage efficiency, EMC Isilon is in my opinion an ideal backup target for data located on a Hadoop cluster. DistCp (distributed copy) is a standard tool that comes with all Hadoop distributions and versions that can be used to copy entire Hadoop directories. DistCp runs as a MapReduce job to perform file copies in parallel, fully utilizing your systems if desired. There is also an option to limit the bandwidth to control the impact on other tasks.
The test environment used for this post consists of the following:
- Pivotal HD (PHD) 2.0.1 - This was installed using Pivotal Control Center 2.0. Default values were used for all settings. In particular, HDFS was installed on the PHD nodes for a traditional DAS configuration.
- Isilon OneFS 7.2.0
Since DistCp is a standard Hadoop tool, the approach outlined in this document will be applicable to most, if not all other Hadoop distributions and versions.
For the remainder of the document, we will assume that the data we want to back up is located on the PHD Hadoop HDFS cluster in the directory /mydata. We will back up this data to the Isilon cluster in the directory /ifs/hadoop/backup/mydata.
THE SIMPLEST BACKUP METHOD
The simplest backup command is shown below:
[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update /mydata hdfs://all-nc-s-hdfs/backup/mydata
This can be executed on any host with the Hadoop client (hadoop) installed. The user executing the command must have permissions to read the source files and write the target files.
The options -skipcrccheck and -update must be specified to avoid the CRC check on the target files that will be placed on the Isilon cluster. Isilon does not store the Hadoop CRC and calculating it would be too expensive so these parameters are required to prevent errors related to the CRC check.
The next parameter "/mydata" is the source path on the source Hadoop cluster. This could also be "/" to back up your entire HDFS namespace. Note that since the path is not fully-qualified, it uses the HDFS NameNode specified in the fs.defaultFS parameter of core-site.xml.
The final parameter "hdfs://all-nc-s-hdfs/backup/mydata" is the target path on your Isilon cluster. The host portion "all-nc-s-hdfs" can be a relative or fully-qualified DNS name such as all-nc-s-hdfs.example.com. It should be the SmartConnect Zone DNS name for your Isilon cluster. The directory portion "/backup/mydata" is relative to the HDFS root path defined in your Isilon cluster access zone. If your HDFS root path is /ifs/hadoop, then this will refer to /ifs/hadoop/backup/mydata.
Files whose sizes are identical on the source on target directories are assumed to be unchanged and will not be copied. In particular, file timestamps are not used to determine changed files. For more details on DistCp, refer to http://hadoop.apache.org/docs/r1.2.1/distcp2.html.
By default, the owner, group, and permissions of the target files will be reset to the default for new files created by the user initiating DistCp. Any owner, group, and permissions defined for the source file will be lost. To retain this information from the source files, use the -p option. Because the -p option needs to be able to perform chown/chgrp, the user initiating DistCp must be a super-user on the target system. The root user on the Isilon cluster can be used for this. For example:
[root@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update -pugp /mydata hdfs://all-nc-s-hdfs/backup/mydata
USING SNAPSHOTS FOR YOUR BACKUP SOURCE
The backup of large datasets may take a very long time. Files that exist at the beginning of the DistCp process, when the directory structure is scanned, may no longer exist when that file is actually copied, producing errors. Further, an application may require a consistent single point-in-time backup for it to be usable. To deal with these issues, it is recommended that you create an HDFS snapshot of your source to ensure that the dataset does not change during the backup process. Note that this is unrelated to the SnapshotIQ feature of your target Isilon cluster.
To use HDFS snapshots, you must first allow snapshots for a particular directory.[gpadmin@phddas2-master-0 ~]$ hdfs dfsadmin -allowSnapshot /mydata
Allowing snapshot on /mydata succeeded
Immediately before a backup with DistCp, create the HDFS snapshot. [gpadmin@phddas2-master-0 ~]$ hdfs dfs -createSnapshot /mydata backupsnap
Created snapshot /mydata/.snapshot/backupsnap
The name of this snapshot is "backupsnap" and it can be accessed at the HDFS path /mydata/.snapshot/backupsnap. Any changes to your HDFS files after this snapshot will not be reflected in the subsequent backup.
Now we can back up the snapshot to Isilon using the following command:
[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update /mydata/.snapshot/backupsnap hdfs://all-nc-s-hdfs/backup/mydata
When the backup command completes, you can delete the snapshot. Doing so will free up any space used to hold older versions of files that were modified since the snapshot.
[gpadmin@phddas2-master-0 ~]$ hdfs dfs -deleteSnapshot /mydata backupsnap
USING ISILON SNAPSHOTS FOR YOUR BACKUP TARGET
Independent from using snapshots for your backup source, you may want to keep multiple snapshots of your backup target directory to restore older versions of files.
To create snapshots on Isilon, you must have a SnapshotIQ license. Snapshots can be created via the web admin interface or CLI. To create a single Isilon snapshot manually via the CLI, which can be added to your backup process discussed in the Scheduling Backups section below, SSH into any Isilon node and execute the following:
all-nc-s-1# isi snapshot snapshots create /ifs/hadoop/backup/mydata --name backup-2014-07-01 --expires 1D --verbose
Created snapshot backup-2014-07-01 with ID 6
For more details regarding Isilon OneFS snapshots, refer to the Isilon OneFS CLI Administration Guide.
SYNCIQ REPLICATION FOR MULTIPLE ISILON CLUSTERS
Once the DistCp backup to the Isilon cluster is complete, you can use OneFS SyncIQ to replicate snapshots across a WAN to other Isilon clusters. This can provide a very versatile and efficient component of your disaster recovery strategy.
HANDLING DELETED FILES
By default, files deleted from the source Hadoop cluster will not be deleted from the target Hadoop cluster. If you require this behavior, simply add the -delete argument to the DistCp command. When using this command, it is recommended to use snapshots on the backup target to allow for the recovery of deleted files.
The steps to back up a Hadoop cluster can be automated and scheduled using a variety of methods. Apache Oozie is often used to automate Hadoop tasks and it directly supports DistCp. Cron can also be used to execute a suitable Shell script.
To automate running commands in an SSH session, you should enable password-less SSH to allow your cron user to connect to your Hadoop client and your Isilon cluster (if using SnapshotIQ).
The standard method to restore a DistCp backup, from Isilon to a traditional Hadoop infrastructure, is to run DistCp in the reverse direction. This is done simply by swapping the source and target paths.
[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update hdfs://all-nc-s-hdfs/backup/mydata /mydata
You may want to create a snapshot of your target directory so that you can undo any mistakes made during the recovery process. However, you should be aware of the additional disk usage needed to maintain snapshots.
DIRECT ACCESS TO BACKUP DATA USING HDFS
The backup target files on Isilon are accessible from Hadoop applications in the same way as the source files due to Isilon’s support for HDFS. This provides a method to use your backup data directly, without having to first restore it to your original source Hadoop environment, which can save you analysis time overall. For example, if you normally run a MapReduce command such as:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep /mydata/mydataset1 output1 ABC
You can execute this MapReduce job against the backup dataset on Isilon using the following command:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep hdfs://all-nc-s-hdfs/backup/ /mydata/mydataset1 output1 ABC
If you would like to specify a fully-qualified Hadoop path, instead of using the fs.defaultFS parameter, check with your application provider for details. Further, an Isilon cluster designed for backup and archive, instead of high performance, will likely not provide the same performance as your primary Hadoop environment so testing is recommended and/or you should consult with EMC Isilon for proper sizing.
RECOVERY FROM ISILON SNAPSHOTS
If you need to recover a file from a previous Isilon snapshot, they are available in the /ifs/.snapshot directory. For details and other options, refer to the Isilon OneFS CLI Administration Guide.
HDFS VERSION COMPATIBILITY
Isilon is compatible with multiple versions of HDFS, which can be used simultaneously to access the same dataset, and can automatically detect the appropriate HDFS version per connection without any configuration; refer to the Isilon OneFS CLI Administration Guide for the list of supported Hadoop distributions and versions or visit Hadoop Distributions and Products Supported by OneFS. This means that multiple Hadoop environments running different versions of Hadoop can easily backup to a single Isilon cluster using HDFS.
In the case where Isilon does not support your Hadoop version, you can still use DistCp to backup and restore your Hadoop data with Isilon using HFTP. For instance, PHD 2.0 and later is not supported on Isilon OneFS 7.1.1 and earlier. In this configuration, you will need to build a small Hadoop cluster using a version of Hadoop that is directly supported by Isilon, then run DistCp on this new cluster using the HFTP protocol to access your source data on your original Hadoop cluster. The HFTP protocol is a read-only file system that is compatible across different versions of Hadoop. For example:
[gpadmin@phddas2-master-0 ~]$ hadoop distcp -skipcrccheck -update hftp://phddas2-namenode-0/mydata hdfs://all-nc-s-hdfs/backup/mydata
The size of the new small cluster that will run the DistCp MapReduce job primarily depends on how much throughput is required. If you only need to back up at the rate of 10 Gbps, then you will only need a single Hadoop node. None of your data will be stored on this small Hadoop cluster so disk requirements are minimal.
EMC Isilon is a great platform for Hadoop and other Big Data applications. It uses erasure coding to protect data with greater than 80% storage efficiency, in contrast to traditional HDFS with 33% storage efficiency. EMC Isilon has several classes of node types from the very dense NL400 to the high performance S210, and the X410 in between. This allows different Isilon tiers to be optimized for particular workloads. The backup of traditional Hadoop environments to Isilon is easy to do and will allow for the densest usable HDFS backup target.