Introduction

This blog post describes in detail how to benchmark Hadoop. In particular, it covers how to use the Terasort suite to benchmark YARN MapReduce. Although most of the material applies to any Terasort benchmarking, it also includes specific recommendations for clusters that use EMC Isilon for HDFS storage.

 

Like any benchmark, Terasort and the related Teragen and Teravalidate benchmarks may have limited or even no relevance to your particular workload. If you have a specific workload you are trying to measure or optimize, you should use that exact workload. In the absence of a specific workload, a generic benchmark may be useful.

 

There are many benchmarks related to Hadoop, and Terasort is just one of them, although it appears to be the most widely used. It is a MapReduce benchmark and likely has little relevance to non-MapReduce workloads such as HBase, Impala, Hive on Tez, HAWQ, and Solr.

 

Prerequisites

This post assumes you already have a fully operational Hadoop cluster and that you have already run some basic MapReduce jobs such as pi and wordcount. You may want to refer to the EMC Isilon Hadoop Starter Kit for complete details.

 

This post applies to MapReduce on YARN which means Hadoop 2.0 or higher.

 

Server Operating System Settings

For a thorough explanation of the optimal OS settings, you may refer to the following blog:

http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/

 

These settings should be applied to all Linux servers in your Hadoop cluster. These settings do not apply to the Isilon nodes themselves.

 

Sysctl Settings

Add the following to the end of /etc/sysctl.conf.

vm.swappiness = 1

vm.overcommit_ratio = 100
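
These settings take effect at boot. To apply them immediately without a reboot, you can reload the file with sysctl (run as root):

sysctl -p /etc/sysctl.conf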

Transparent Huge Pages

Add the following to the end of /etc/rc.local.

echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
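
The path above is specific to Red Hat and CentOS 6 kernels. On other distributions the equivalent setting is typically under /sys/kernel/mm/transparent_hugepage; a hedged sketch for checking the current value and applying the change (adjust the path to match your kernel):

cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/defrag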

Storage of Intermediate Data

Many MapReduce jobs (including Terasort) will produce a significant amount of intermediate temporary files during the execution of the job. The output of the map tasks is often written to local disks outside of HDFS, transferred over the network to other nodes, and written to and read from more local disks. The performance of the disks used for the intermediate files is critical for Terasort and similar MapReduce jobs. Traditionally, each local disk is used for both HDFS files and intermediate files.

 

When using EMC Isilon for HDFS storage, you will still need a storage system for the intermediate files. Options for storing the intermediate files include the following: local disks, local SSDs, PCI-based flash, RAM disk (persistence is not needed), EMC VNX, EMC XtremIO, and EMC Isilon (via NFS).

 

Virtual Servers

If you are using VMware for your virtualized servers, be aware that you may be using lazy-zeroed VMDKs. This means that the VMDKs are created very quickly but the drawback is that the first write to each sector of the virtual disk is significantly slower than subsequent writes to the same sector. This means that optimal VMDK performance may not be achieved until after several days of normal usage. To accelerate this, refer to the Fill Disk section in the EMC Isilon Hadoop Starter Kit.

Global Hadoop Parameters

Some Hadoop parameters are "global" meaning that they apply to all jobs. Hadoop also has per-job parameters which will be addressed in a subsequent section.

 

All parameters below will ultimately be in yarn-site.xml. If you are using Cloudera Manager or Ambari, these parameters can be set using that platform's GUI, which will automatically create the necessary yarn-site.xml file for you (a hand-edited fragment is sketched after the table below). All YARN services should be restarted after yarn-site.xml has been updated.

 

The parameters below are recommended for MapReduce benchmarks.

 

Property

Value

Notes

yarn.log-aggregation.retain-seconds

-1

By default, Hadoop will purge log files of old jobs. Setting this value to -1 will prevent this.

yarn.nodemanager.local-dirs

See Notes

Path to the directories where temporary intermediate files will be stored. Multiple paths can be separated with commas.

yarn.nodemanager.pmem-check-enabled

false

If you find that your jobs get killed due to exceeding their memory allocation, set this value to false.

yarn.nodemanager.resource.cpu-vcores

Number of CPU cores

Set this to the number of CPU cores on each of your Node Manager servers. By default, this is the maximum number of map or reduce tasks that will run on each Node Manager. This is a per-NodeManager setting.

yarn.nodemanager.resource.memory-mb

Physical or Virtual Machine RAM minus 12 GiB

In general, leave 12 GiB of RAM for use by the OS and Hadoop services. The remainder can be used by the map and reduce tasks. This is a per-NodeManager setting.

yarn.nodemanager.vmem-check-enabled

false

If you find that your jobs get killed due to exceeding their memory allocation, set this value to false.

yarn.resourcemanager.scheduler.class

org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler

There are several schedulers that YARN can use to assign map and reduce tasks to the Node Managers. The Fair Scheduler does the best job of evenly distributing the tasks when the Node Manager is not co-located with the Data Node.

yarn.scheduler.increment-allocation-mb

1

We may need to fine-tune the amount of memory allocated to the map and reduce tasks. Setting this to a low value gives you maximum flexibility. This should be the same value as yarn.scheduler.minimum-allocation-mb. Required for the Fair Scheduler.

yarn.scheduler.maximum-allocation-mb

65536

In case the default is too small, allow a large memory allocation by the map and reduce tasks.

yarn.scheduler.minimum-allocation-mb

1

This should be the same value as yarn.scheduler.increment-allocation-mb. Required for the Fair Scheduler.

yarn.scheduler.minimum-allocation-vcores

0

We may want to benchmark with oversubscribed CPUs - meaning more map tasks than CPUs. Setting this value to 0 allows this.
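
If you are maintaining yarn-site.xml by hand rather than through a management GUI, each row in the table above becomes a standard Hadoop configuration property. A minimal illustrative fragment (the local-dirs paths are examples only and should point at your own intermediate-data disks):

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data1/yarn/local,/data2/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>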

 

Isilon Configuration

For optimal HDFS performance from an Isilon cluster, you should ensure the following:

  • All Isilon and compute nodes in the cluster have at least one, preferably two, 10 Gigabit Ethernet ports connected to a 10 Gigabit switch.
  • HDFS NameNode traffic to Isilon is distributed using a SmartConnect zone and you are using the SmartConnect zone DNS name for your NameNode URL.
  • The SmartConnect zone should be dynamic and the number of IPs should be an exact multiple of the number of nodes. This will give you excellent High Availability and a very even traffic distribution.
  • This same SmartConnect zone should also be specified with the "isi hdfs racks" command. This will allow HDFS DataNode traffic to use these Isilon nodes.
  • Note that separate SmartConnect zones for NameNode and DataNode traffic are no longer recommended.

 

The following Isilon configuration settings should be applied. Note that the HDFS block size defined on Isilon is used for reading. It does not affect writing files using HDFS nor does it affect the layout of the file on the disks. HDFS clients that read from Isilon will read in chunks up to the HDFS block size.

isiloncluster1-1# isi hdfs settings modify --server-threads 256

isiloncluster1-1# isi hdfs settings modify --default-block-size 512M

Files on an Isilon cluster can be laid out and cached differently depending on the expected access pattern. MapReduce tasks read and write files on HDFS sequentially so we will mark the directory containing test files for streaming access. For convenience, we will name the directory "streaming-21" to indicate a streaming access pattern and 2:1 data protection. 2:1 data protection is the default for small Isilon clusters and allows for the loss of any two drives or one node.

isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1/hadoop/benchmarks/streaming-21
isiloncluster1-1# isi set -R -p +2:1 -a streaming -l streaming /ifs/isiloncluster1/zone1/hadoop/benchmarks/streaming-21
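
Depending on your OneFS version, you may be able to verify the protection level and access pattern that were applied using isi get (flag behavior varies by release; check isi get --help, and note that the -d flag is assumed here to show the directory's own attributes rather than its contents):

isiloncluster1-1# isi get -d /ifs/isiloncluster1/zone1/hadoop/benchmarks/streaming-21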

 

MapReduce Basics

All benchmarks in this post run as MapReduce jobs in Hadoop. Therefore we must have a good understanding of MapReduce on Hadoop. A MapReduce job can be broken down into two phases - map and reduce.

 

Map Phase

In the map phase, multiple map tasks execute in parallel, each reading a part of the input data stored on HDFS sequentially. The amount of HDFS data read by each map task is equal to the HDFS block size. Each map task will perform some computation on the data and generate an intermediate result which will then be written to disks local to the compute node (NodeManager). In the case of Terasort, the input data is sorted and partitioned for a specific reducer in the next phase. In the case of Teravalidate, the input data is checked to ensure that it has been sorted. In the case of TestDFSIO read, the input data is read and discarded.

 

Reduce Phase

In the reduce phase, multiple reduce tasks execute in parallel. Each begins by downloading the intermediate data produced by the map tasks. Each reduce task has a specific partition from each map task that it downloads. This network transfer is often referred to as the shuffle and the intermediate data is sometimes referred to as shuffle data. For larger jobs, the total amount of intermediate data downloaded by a single reducer may exceed the amount of memory allocated to the reducer so the local disk will be used as a type of swap file. As each partition is downloaded from the nodes that ran the map tasks, the intermediate results are merge-sorted using memory and disk files. Once all partitions have been downloaded and merged, the actual reduce function can be applied to produce the final result. In the case of terasort, the reduce function simply writes the merge-sorted results to a single file on HDFS. You will end up with one output file per reducer. The concatenation of these files in order will produce a completely sorted file.
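
A single concatenated file is not needed for benchmarking, but if you ever want to inspect or assemble the fully sorted result, the per-reducer output files can be read in name order. A small sketch using the output path from the Terasort script later in this post (file names such as part-r-00000 may vary by Hadoop version); this peeks at the first few hundred bytes of the sorted data:

hadoop fs -cat /benchmarks/streaming-21/hduser1/terasort/terasort-output/part-r-* | head -c 400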

 

MapReduce Task Scheduling

Understanding MapReduce task scheduling is critical to optimizing the performance. We previously defined two critical YARN parameters. yarn.nodemanager.resource.memory-mb defines how much memory can be allocated to all map or reduce tasks (generically called YARN containers) running on a NodeManager. yarn.nodemanager.resource.cpu-vcores defines how many CPU cores can be allocated. MapReduce jobs tell the YARN scheduler (ResourceManager) the amount of memory and CPU cores required for their map and reduce tasks. This is done by explicitly setting the following job values or using defaults.

 

mapreduce.map.memory.mb

mapreduce.map.cpu.vcores (default is 1)

mapreduce.reduce.memory.mb

mapreduce.reduce.cpu.vcores (default is 1)

 

When a MapReduce job is submitted, the YARN scheduler distributes the map tasks among all NodeManagers in the cluster until the NodeManagers have no remaining memory or no remaining CPU cores. Tasks will be queued until a NodeManager has both memory and CPU cores to execute it. Once most of the map tasks complete, YARN will begin to schedule the reduce tasks.

 

You can estimate the maximum number of map tasks that can execute simultaneously on your cluster using the following formulas. The first shows the maximum as limited by memory and the second shows the maximum as limited by CPU cores. The smaller of the two will be the effective limit.

 

max_map_tasks_1 = node_manager_count * round_down(yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb)

max_map_tasks_2 = node_manager_count * round_down(yarn.nodemanager.resource.cpu-vcores / mapreduce.map.cpu.vcores)
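
For example, with purely illustrative numbers: 10 NodeManagers, yarn.nodemanager.resource.memory-mb = 98304, yarn.nodemanager.resource.cpu-vcores = 16, mapreduce.map.memory.mb = 2048, and mapreduce.map.cpu.vcores = 1 give:

max_map_tasks_1 = 10 * round_down(98304 / 2048) = 10 * 48 = 480

max_map_tasks_2 = 10 * round_down(16 / 1) = 160

so at most 160 map tasks will run at once and this cluster is CPU-bound rather than memory-bound.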

 

All benchmarks, with the possible exception of Terasort, will perform best when all map tasks execute simultaneously. If you find that some map tasks are queued, you can try to decrease the number of map tasks or decrease mapreduce.map.memory.mb.

 

MapReduce Memory Usage

The parameters mapreduce.*.memory.mb are only used to reserve a specific amount of memory from the YARN NodeManagers. We also need to tell the Java runtime how much memory the Java application (our map or reduce task) can allocate. This is done by setting the parameters below. The values should be about 75% of the container size. So if mapreduce.map.memory.mb is 2048, set mapreduce.map.java.opts to -Xmx1536m.

 

mapreduce.map.java.opts

mapreduce.reduce.java.opts

 

If the entire Java process exceeds the amount requested from YARN, the NodeManager will terminate the task by default. This behavior can be disabled by setting both yarn.nodemanager.pmem-check-enabled and yarn.nodemanager.vmem-check-enabled to false.

 

Additionally, you will want to let the map task know how much memory it can use for sorting its intermediate output. To avoid excessive disk I/O, this should be larger than the total amount of intermediate data produced by the map task. For Terasort with a 512 MiB HDFS block size, use 768 MiB.

 

mapreduce.task.io.sort.mb

 

For large clusters, you may also need to increase the memory allocated to the Application Master that is launched with the job.

 

yarn.app.mapreduce.am.resource.mb

yarn.app.mapreduce.am.command-opts (see notes for mapreduce.map.java.opts)

 

Monitoring your MapReduce Cluster and Jobs

The YARN ResourceManager GUI has a wealth of information to help you optimize your cluster. The URL is http://your-resource-manager-fqdn:8088/. Click on Nodes and ensure that the Active Nodes and the Memory Total match your expectations.

 

During the execution of a job, you can monitor its activity in real time. You'll be able to determine how many of the tasks are currently executing, pending, complete, or failed.
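
If you prefer the command line, much of the same information is available from the YARN and MapReduce CLIs; for example:

yarn node -list
yarn application -list
mapred job -list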

 

Compression

Compression can often increase the performance of network- and disk-I/O-bound systems. A MapReduce job can read compressed input data, use compressed intermediate data, and/or write compressed output data. You will need to determine whether compression is applicable to your particular benchmark goals. Usually, compression is only used for the intermediate data produced by the Terasort map tasks. To enable this compression, use the following parameters.

 

mapreduce.map.output.compress=true

mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec
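
The LZ4 codec depends on the Hadoop native libraries. Before relying on it in a benchmark, you can confirm that native lz4 support is loaded on your nodes:

hadoop checknative -a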

 

Running the Terasort Suite

The Terasort suite has three components:

  1. Teragen - This generates a random, unsorted dataset on HDFS. Each 100-byte record consists of a randomly generated key and an arbitrary value. You will specify the quantity and size of the files. This is a write performance test.
  2. Terasort - This will read the unsorted data produced by Teragen and sort it. The output will be written to HDFS. This fully exercises the entire MapReduce framework including HDFS I/O, local disk I/O, network I/O, CPUs, memory, and more.
  3. Teravalidate - This will read the output from Terasort and validate that it is sorted. For our purposes, this is effectively a read performance test as we do not expect the actual validation to fail.

These tests must be run in the order shown. Additionally, be aware that the parameters chosen for prior tests may affect subsequent tests. For instance, the number of reducers chosen during Terasort will define the number of mappers used during Teravalidate.

 

Teragen

In the examples shown below, the user who executes the jobs is called hduser1. Log in to any Hadoop node as hduser1.

Create a script to execute the job.

# teragen.sh

# Kill any running MapReduce jobs
mapred job -list | grep job_ | awk ' { system("mapred job -kill " $1) } '
# Delete the output directory

hadoop fs -rm -r -f -skipTrash /benchmarks/streaming-21/hduser1/terasort/terasort-input

# Run teragen
time hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
teragen \
-Ddfs.blocksize=512M \
-Dio.file.buffer.size=131072 \
-Dmapreduce.map.java.opts=-Xmx1536m \
-Dmapreduce.map.memory.mb=2048 \
-Dmapreduce.task.io.sort.mb=256 \
-Dyarn.app.mapreduce.am.resource.mb=1024 \
-Dmapred.map.tasks=64 \
10000000000  \
/benchmarks/streaming-21/hduser1/terasort/terasort-input

You may need to edit the path to the jar file:

Apache or Hortonworks HDP: /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

Cloudera CDH (deployed with Cloudera Manager): /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
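
If your distribution places the examples jar somewhere else, one way to locate it (the search path is just an example; narrow it if the scan is too slow):

find / -name 'hadoop-mapreduce-examples*.jar' 2>/dev/null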

 

The parameter mapred.map.tasks defines the number of map tasks that will run. This is equal to the number of files that will be generated. Start with a value equal to the total number of CPU cores in your cluster, minus 1. For example, if you have 10 nodes with 16 cores each, use 159 map tasks. Teragen does not have any reduce tasks.

 

The numeric parameter 10000000000 is the total number of 100-byte records that will be generated by the Teragen job. These records will be evenly split over the map tasks. Each map task will write one file whose size is the total dataset size divided by mapred.map.tasks. Your first test should be with 1 GB to ensure that your process works properly. Then increase to 1 TB or beyond.

 

1 TB = 10000000000 records (10 zeros)

1 GB = 10000000 records (7 zeros)

 

The parameter dfs.blocksize defines the HDFS block size used when creating and writing new files. See the MapReduce Basics section for a description of the remaining parameters.

 

The final parameter is the directory to create the files in. It must not exist before executing the job.

 

To execute the script:

[hduser1@hadoop-master-0 ~]$ sh ./teragen.sh

While the job executes, you will want to monitor the job closely in the YARN ResourceManager GUI (see Monitoring your MapReduce Cluster and Jobs above). In particular for Teragen, all map tasks should execute simultaneously to get the best performance.

 

When the job completes, note the elapsed time from the time command in your script. This is your Teragen job time. Your dataset size (e.g. 1 TB) divided by your job time will give you a rough value for your aggregate cluster write throughput.
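
For example, with purely illustrative numbers: a 1 TB Teragen that completes in 500 seconds gives roughly 1,000,000,000,000 bytes / 500 s = 2 GB/s of aggregate write throughput.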

 

Terasort

This will be very similar to the Teragen job. Create a new script to execute the job.

# terasort.sh

# Kill any running MapReduce jobs
mapred job -list | grep job_ | awk ' { system("mapred job -kill " $1) } '

# Delete the output directory
hadoop fs -rm -r -f -skipTrash /benchmarks/streaming-21/hduser1/terasort/terasort-output

# Run terasort
time hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
terasort \
-Ddfs.blocksize=512M \
-Dio.file.buffer.size=131072 \
-Dmapreduce.map.java.opts=-Xmx1536m \
-Dmapreduce.map.memory.mb=2048 \
-Dmapreduce.map.output.compress=true \
-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec \
-Dmapreduce.reduce.java.opts=-Xmx1536m \
-Dmapreduce.reduce.memory.mb=2048 \
-Dmapreduce.task.io.sort.factor=100 \
-Dmapreduce.task.io.sort.mb=768 \
-Dyarn.app.mapreduce.am.resource.mb=1024 \
-Dmapred.reduce.tasks=100 \
-Dmapreduce.terasort.output.replication=1 \
/benchmarks/streaming-21/hduser1/terasort/terasort-input \
/benchmarks/streaming-21/hduser1/terasort/terasort-output

Terasort has both map tasks and reduce tasks. The number of reduce tasks can be specified with mapred.reduce.tasks but the number of map tasks is automatically determined by dividing the total size of the input files by the HDFS block size. 1 TB (base 10) divided by 512 MiB (base 2) will give you 1863 map tasks. However, since the files created by Teragen are usually not exact multiples of the block size, you will usually get slightly more map tasks.

 

A good starting point for the number of reduce tasks is the number of CPU cores in the cluster divided by 2. This allows all reduce tasks to begin shuffling data across the network when the last wave of map tasks begins processing.

 

Refer to the MapReduce Basics section regarding the memory-related parameters.

 

The parameter mapreduce.task.io.sort.factor is used by the reduce tasks and specifies the number of on-disk files that will be merge-sorted in a single pass. A high number will cause a high amount of interleaved I/O on the local disks containing intermediate data. You can increase this significantly when using SSDs for local disks.

 

By default, Terasort sets the HDFS replication of the output files to 1. On a traditional HDFS system (not Isilon), this means the data only lives on a single disk in the cluster and can be lost due to a single disk failure. This may be acceptable for job output data since the job can often be run again to reproduce it. If you need to protect your output data at the HDFS default of 3 replicas and wish to estimate the performance of it, set mapreduce.terasort.output.replication to 3.

 

The final two parameters are the input and output directories. The input directory must have been created by Teragen. The output directory must not exist before executing the job.

 

To execute the script:

[hduser1@hadoop-master-0 ~]$ sh ./terasort.sh

While the job executes, you will want to monitor the job closely in the YARN ResourceManager GUI (see Monitoring your MapReduce Cluster and Jobs above). In particular for Terasort, the map tasks will usually not all execute simultaneously. You will either see that all of the cluster memory is used or all of the CPU cores are used.

 

When the job completes, note the elapsed time from the time command in your script. This is your Terasort job time.

 

Teravalidate

This will be very similar to the Terasort job. Create a new script to execute the job.

# teravalidate.sh

# Kill any running MapReduce jobs
mapred job -list | grep job_ | awk ' { system("mapred job -kill " $1) } '

# Delete the output directory
hadoop fs -rm -r -f -skipTrash /benchmarks/streaming-21/hduser1/terasort/terasort-report

# Run teravalidate
time hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
teravalidate \
-Ddfs.blocksize=512M \
-Dio.file.buffer.size=131072 \
-Dmapreduce.map.java.opts=-Xmx1536m \
-Dmapreduce.map.memory.mb=2048 \
-Dmapreduce.reduce.java.opts=-Xmx1536m \
-Dmapreduce.reduce.memory.mb=2048 \
-Dmapreduce.task.io.sort.mb=256 \
-Dyarn.app.mapreduce.am.resource.mb=1024 \
-Dmapred.reduce.tasks=1 \
/benchmarks/streaming-21/hduser1/terasort/terasort-output \
/benchmarks/streaming-21/hduser1/terasort/terasort-report

Teravalidate has both map tasks and reduce tasks. The map tasks do all the work of actually reading the sorted data and validating it. The number of map tasks can't be set. It is automatically set to be equal to the number of files produced by the previous Terasort job, which matches the number of reduce tasks in the Terasort job. So if the Terasort job specified 100 reduce tasks, Teravalidate will run with 100 map tasks. The reduce task is trivial and simply summarizes the very small results from the map tasks. Run Teravalidate with a single reduce task.

 

Usually, there are no parameters that need to be tuned for Teravalidate.

 

The final two parameters are the input and output directories. The input directory must have been created by Terasort. The output directory must not exist before executing the job.

 

To execute the script:

[hduser1@hadoop-master-0 ~]$ sh ./teravalidate.sh

While the job executes, you will want to monitor the job closely in the YARN ResourceManager GUI (see Monitoring your MapReduce Cluster and Jobs above). In particular for Teravalidate, all map tasks should execute simultaneously to get the best performance.

 

When the job completes, note the elapsed time from the time command in your script. This is your Teravalidate job time. Your dataset size (e.g. 1 TB) divided by your job time will give you a rough value for your aggregate cluster read throughput.

 

Performance Optimization

After your first run of a 1 TB Teragen, Terasort, and Teravalidate, you will most likely not be impressed with the results. You will now need to experiment with different values of various parameters to find the optimal combination. However, be aware that some of this optimization will only apply to these benchmarks and the ideal parameters for your specific MapReduce jobs may be completely different. In any case, this section provides some techniques to help you optimize any MapReduce job.

 

General Optimization

  1. Ensure that there are no failed tasks. By default, failed tasks are attempted 3 times before the entire job is aborted. MapReduce does an excellent job of recovering from these failures, but for benchmarking we want to ensure that there are no task failures. You may want to set mapreduce.map.maxattempts and mapreduce.reduce.maxattempts to 1 to force the job to abort immediately if any task fails.
  2. Run the job at least three times with the same parameters to get an idea of the variation in performance. The variation should be small (less than 10% of the mean).
  3. Be aware of disk caching. If you expect that MapReduce jobs will be executed against large datasets that often do not reside in cache, then you should flush your disk caches prior to each test (see the sketch after this list). The Hadoop Test Driver (see below) can do this automatically.
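
A common way to flush the Linux page cache on each compute node before a run (run as root; this only affects the local server, not the Isilon nodes):

sync
echo 3 > /proc/sys/vm/drop_caches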

Optimizing Teragen

  1. Ensure that all map tasks execute simultaneously. Decrease mapred.map.tasks or decrease mapreduce.map.memory.mb until this occurs.
  2. Try various values of mapred.map.tasks. Increase and decrease the number by multiples of 2 (e.g. 16, 32, 64, 128) until a peak is found. Then finely adjust the number until the exact peak has been found (a simple parameter sweep is sketched after this list).
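
A minimal sketch of such a sweep, assuming teragen.sh has been modified to take the map task count as its first argument (replacing the hard-coded 64 with $1):

for maps in 32 64 128 256; do
    sh ./teragen.sh $maps 2>&1 | tee teragen_${maps}maps.log
done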

 

Optimizing Terasort

  • Ensure that your map tasks are not spilling intermediate output more than once. Simply make sure that mapreduce.task.io.sort.mb is slightly larger than your HDFS block size. You may need to increase mapreduce.map.memory.mb and mapreduce.map.java.opts to avoid out of memory errors.
  • Try various values of mapred.reduce.tasks. Increase and decrease the number by multiples of 2 (e.g. 16, 32, 64, 128) until a peak is found. Then finely adjust the number until the exact peak has been found.
  • Try various values of mapreduce.reduce.memory.mb. Higher values may reduce or eliminate the amount of intermediate data that is spilled to disk in the reduce tasks. Lower values may allow more reduce tasks to run concurrently. Remember that you will also need to adjust mapreduce.reduce.java.opts whenever this value is changed.
  • Try various values of mapreduce.task.io.sort.factor. Increase and decrease the number by multiples of 3 and 10 (e.g. 10, 30, 100, 300, 1000).
  • For large clusters, increase the memory allocated to the Application Master by setting yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.command-opts.
  • Try different map output compression codecs, or disable compression completely.

Optimizing Teravalidate

  • Ensure that all map tasks execute simultaneously. Decrease mapreduce.map.memory.mb until this occurs.
  • Optimal performance of Teravalidate will occur when the number of map tasks is ideal. Therefore, run Terasort with different values of mapred.reduce.tasks until the subsequent Teravalidate performance is optimal. Note that the optimal performance of Terasort and Teravalidate may occur with different values of the Terasort reduce task count.

 

Hadoop Test Driver

Once you start trying to optimize MapReduce jobs, you quickly see that there are many parameters that can be adjusted across a wide variety of values. Even the fairly simple framework of MapReduce is complex enough to make predicting the results of different parameters impossible, requiring testing at each combination of parameters. The Hadoop Test Driver has been written specifically for testing all of these parameters and recording the results. The Hadoop Test Driver deserves its own blog post, but for now you can view the documentation and obtain the software from https://github.com/claudiofahey/hadoop-test-driver.


References