In "Isilon and Cloudera Backup and Disaster Recovery Integration" we reviewed Cloudera BDR integration for HDFS replication between a DAS cluster and an Isilon cluster. In this post we close the loop on BDR replication and review how to set up and integrate Hive replication.

 

Assumptions:

CDH 5.8 or later

UID/GID parity - whether accounts are local or come from LDAP, matching UIDs and GIDs on both sides is important to maintain consistent access across storage

DAS Cloudera cluster setup complete

Isilon Cloudera cluster setup complete

DNS name resolution fully functional - forward and reverse, for all hosts

Both the source and destination clusters must have a Cloudera Enterprise license
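
As an illustration of the UID/GID parity assumption above, here is a minimal sketch that compares accounts from the two sides. It assumes you have account listings in passwd format from each cluster (for example, `getent passwd` output from a DAS node, and an equivalent export for the Isilon access zone). The function names and the sample UID/GID values are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_passwd(text: str) -> Dict[str, Tuple[int, int]]:
    """Map user name -> (uid, gid) from passwd-format lines (name:pw:uid:gid:...)."""
    accounts: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed entries
        accounts[fields[0]] = (int(fields[2]), int(fields[3]))
    return accounts


def parity_mismatches(
    source: Dict[str, Tuple[int, int]],
    target: Dict[str, Tuple[int, int]],
    users: Iterable[str],
) -> List[str]:
    """Return one message per user whose uid/gid is missing or differs."""
    problems: List[str] = []
    for user in users:
        src, dst = source.get(user), target.get(user)
        if src is None:
            problems.append(f"{user}: missing on source")
        elif dst is None:
            problems.append(f"{user}: missing on target")
        elif src != dst:
            problems.append(f"{user}: source uid/gid {src} != target uid/gid {dst}")
    return problems


# Hypothetical sample listings; real values come from your own clusters.
das = "hdfs:x:492:492::/var/lib/hadoop-hdfs:/bin/bash\nhive:x:487:487::/var/lib/hive:/bin/false"
isilon = "hdfs:x:492:492::/ifs/home/hdfs:/bin/bash\nhive:x:500:487::/ifs/home/hive:/bin/false"

for message in parity_mismatches(parse_passwd(das), parse_passwd(isilon), ["hdfs", "hive"]):
    print(message)
```

An empty result means the listed service accounts line up on both sides; any message is worth resolving before scheduling replication.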

 

Note the following when using replication jobs for clusters with Isilon:

 

• The hdfs user is mapped to root on Isilon. If you specify alternate users with the Run As option when creating replication schedules, those users must also be superusers.

 

• Always select the 'Skip Checksum Checks' property when creating replication schedules.

 

• Kerberos authentication is fully supported from CDH 5.8 onward. The account used to replicate data needs a principal and keytab to authenticate against the target; see the Cloudera documentation for additional information on configuring this.

 

• Data replication can fail if the source data is modified during replication, so it is recommended to use snapshots as the source of data replication. If snapshots are enabled, replication makes use of them automatically to prevent this issue. For more details, see the Cloudera documentation "Using Snapshots with Replication".

 

• Source clusters that use Isilon storage do not support HDFS snapshots, which are what ensures data consistency when source files are modified during a replication. Therefore, when replicating from an Isilon source cluster, it is recommended that you do not replicate Hive tables or HDFS files that could be modified before the replication completes, unless you take additional steps to ensure the replication succeeds. Additional options are to use SyncIQ to replicate data between Isilon clusters, or to use Isilon native snapshots in conjunction with metastore replication.
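
The notes above can be turned into a small pre-flight checklist. The sketch below is only an illustration of those rules: `ReplicationPlan` and `preflight` are hypothetical names, not part of Cloudera Manager or its API, and the fields simply mirror the options discussed in this post.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ReplicationPlan:
    """Hypothetical description of a planned BDR schedule (not a Cloudera API object)."""
    source_is_isilon: bool
    target_is_isilon: bool
    skip_checksum_checks: bool
    run_as_user: str = "hdfs"
    superusers: Set[str] = field(default_factory=lambda: {"hdfs"})
    kerberos_enabled: bool = False
    has_keytab: bool = False


def preflight(plan: ReplicationPlan) -> List[str]:
    """Return the notes from this post that the plan does not yet satisfy."""
    issues: List[str] = []
    isilon_involved = plan.source_is_isilon or plan.target_is_isilon
    if isilon_involved and not plan.skip_checksum_checks:
        issues.append("select 'Skip Checksum Checks'")
    if isilon_involved and plan.run_as_user not in plan.superusers:
        issues.append(f"Run As user '{plan.run_as_user}' must be a superuser")
    if plan.kerberos_enabled and not plan.has_keytab:
        issues.append("replication account needs a principal and keytab")
    if plan.source_is_isilon:
        issues.append("Isilon source: no HDFS snapshots, avoid data that changes mid-run")
    return issues


# Example: DAS source, Isilon target, checksum option forgotten.
plan = ReplicationPlan(source_is_isilon=False, target_is_isilon=True, skip_checksum_checks=False)
for issue in preflight(plan):
    print("-", issue)
```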

 

 

In our example we have loaded a sample data set for use by Impala on our DAS Cloudera cluster. Since Impala shares the Hive metastore database, we can use BDR Hive replication to replicate this Impala database and its HDFS data to our Isilon Cloudera cluster. This illustrates that both Hive and Impala databases, along with their HDFS-based tables, can be replicated with BDR.

 

 

 

1. In Hue, we see the tpcds_parquet database in the Impala/Hive metastore

2.png

 

2. The tpcds_parquet table definition and information can be seen here in Hue

3.png

 

 

 

3. The data for the tables is seen here in /user/hive/warehouse

4a.png

 

 

4. Run a sample Impala query to validate the data on the DAS cluster

8.png

 

 

 

5. On the Isilon cluster, the tpcds_parquet database, tables and HDFS data do not exist

11.png

12.png

 

 

6. Since we already created a replication peer in the first post, we can move straight on to setting up Hive/Impala replication using the Cloudera Backup tools

a.png

 

 

 

7. Select the DAS cluster as the source. The replication schedule and the databases to replicate can be defined here, along with the Run As Username. Any user specified needs superuser permissions, and must be Kerberos-enabled if the clusters use Kerberos.

b.png

 

 

Again, make sure to always check "Skip Checksum Checks" as the target is Isilon.

 

You also have the option to override the location of the exported metadata and the location the HDFS data is replicated to. For more details, see: Hive/Impala Replication

c.png

If the source HDFS data is not enabled for snapshots, you'll see the following information. It is highly recommended to use snapshots with Hive/Impala replication. To configure this, make the default location of the source HDFS data, /user/hive/warehouse, snapshottable. BDR will then make use of this feature automatically when replicating data.

 

d.png

 

We have enabled snapshots on the default location for data: /user/hive/warehouse
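
The post does not show how the directory was made snapshottable. One way to do it outside the Cloudera Manager UI is the stock HDFS admin command `hdfs dfsadmin -allowSnapshot <path>`, which must be run as an HDFS superuser on the source (DAS) cluster. The sketch below only builds and prints the commands; the helper names are hypothetical.

```python
import shlex
from typing import List

WAREHOUSE = "/user/hive/warehouse"  # default Hive warehouse location used in this post


def allow_snapshot_command(path: str) -> List[str]:
    """argv for marking an HDFS directory as snapshottable."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    return ["hdfs", "dfsadmin", "-allowSnapshot", path]


def list_snapshottable_command() -> List[str]:
    """argv for listing the snapshottable directories visible to the current user."""
    return ["hdfs", "lsSnapshottableDir"]


def render(argv: List[str]) -> str:
    """Shell-quoted form of an argv list, for copy and paste."""
    return " ".join(shlex.quote(part) for part in argv)


# Print the commands rather than running them, so this is safe to execute anywhere.
print(render(allow_snapshot_command(WAREHOUSE)))
print(render(list_snapshottable_command()))
```

To actually run a command, pass the argv list to `subprocess.run(argv, check=True)` on a host with the HDFS client configured.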

 

 

 

8. Having defined the schedule, execute it

e.png

 

9. The replication then executes copying the metadata & data: we see it copy the database, tables and HDFS data

f.png

 

 

 

10. We can now see the tpcds_parquet database in the metastore. The BDR job takes care of the location-specific URIs and paths that change because the metadata and data now live on a different Hadoop cluster. This is the critical piece of Hive/Impala replication, and why using BDR is so useful.


g.png
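
To make the location rewriting in step 10 concrete, here is a minimal sketch of the idea. It is not BDR's implementation; it only shows what it means for a table location to point at a different cluster. The host names, the port, and the table path are assumptions made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def relocate(location: str, target_authority: str) -> str:
    """Return the same HDFS path, addressed through a different cluster's host:port."""
    parts = urlsplit(location)
    if parts.scheme != "hdfs":
        raise ValueError(f"not an hdfs:// location: {location}")
    # Keep scheme and path, swap only the authority (host:port).
    return urlunsplit((parts.scheme, target_authority, parts.path, parts.query, parts.fragment))


# Hypothetical source location on the DAS cluster and hypothetical Isilon endpoint.
source_location = "hdfs://das-nn.example.com:8020/user/hive/warehouse/tpcds_parquet.db/customer"
print(relocate(source_location, "isilon-zone.example.com:8020"))
```

Without this kind of rewrite, a table definition copied verbatim would still reference the source cluster's NameNode, and queries on the target would read from the wrong place or fail.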

 

 

 

11. Running a simple SQL query against the customer table on both clusters validates the database, table and HDFS data replication was successful.

 

On DAS cluster

h.png

 

On Isilon Cluster

i.png
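
The same check can be scripted instead of eyeballed. The sketch below assumes you run a row-count query per table on each cluster by whatever means you prefer (impala-shell, Hue, a client library) and collect the results into dictionaries. The helper names are hypothetical and the counts are placeholders, not results from the clusters in this post.

```python
from typing import Dict, Optional, Tuple


def count_query(database: str, table: str) -> str:
    """SQL to run on both clusters for a given table."""
    return f"SELECT COUNT(*) FROM {database}.{table}"


def count_differences(
    source: Dict[str, int], target: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Tables whose counts differ; None means the table was not found on that side."""
    diffs: Dict[str, Tuple[Optional[int], Optional[int]]] = {}
    for table in sorted(set(source) | set(target)):
        src, dst = source.get(table), target.get(table)
        if src != dst:
            diffs[table] = (src, dst)
    return diffs


print(count_query("tpcds_parquet", "customer"))

# Placeholder counts for illustration only.
das_counts = {"customer": 1000, "item": 200}
isilon_counts = {"customer": 1000}
print(count_differences(das_counts, isilon_counts))
```

An empty dictionary means every table returned the same count on both clusters.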

 

 

Hive Replication and Incremental Replication

1. Drop a Hive/Impala Table on the Isilon cluster

a1.png

 

 

a2.png

 

 

 

2. Execute the replication; incremental replication only copies the data for the missing table

a3.png
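
Conceptually, the incremental run behaves like a set difference over tables. The post does not describe how BDR decides what to copy, so the sketch below is only a simplified model of the observed behaviour, with hypothetical table lists.

```python
from typing import Iterable, List


def tables_to_copy(source_tables: Iterable[str], target_tables: Iterable[str]) -> List[str]:
    """Tables present on the source but absent on the target, in sorted order."""
    return sorted(set(source_tables) - set(target_tables))


# Hypothetical state after dropping one table on the Isilon cluster.
das_tables = ["customer", "item", "store_sales"]
isilon_tables = ["item", "store_sales"]
print(tables_to_copy(das_tables, isilon_tables))
```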

 

 

3. Update Hue's view of the metastore data

a4.png

The table is now present and can be queried

a6.png

 

Having now replicated the Hive/Impala metastore data and the underlying table data on HDFS to the Isilon cluster, we can again use native Isilon features such as snapshots, SyncIQ and NDMP backup to protect this data further. This short demonstration illustrates how Cloudera BDR can be used to back up and replicate HDFS data between Hadoop DAS clusters and Isilon-integrated Hadoop clusters easily.

 

 

Using Hadoop with Isilon - Isilon Info Hub

russ_stevenson
