Are you in the midst of a Hadoop POC or attempting to optimize Hadoop performance? This document describes in detail how to benchmark Hadoop. In particular, it covers how to use the Terasort suite to benchmark YARN MapReduce. Although the guidance applies to any benchmarking with Terasort, it includes specific recommendations for using an EMC Isilon cluster for HDFS storage.
Like any benchmark, Terasort, or the related Teragen and Teravalidate benchmarks, may have limited or even no relevance to your particular workload. If you have a specific workload you are trying to measure or optimize, then you should use that exact workload. In the absence of a specific workload, a generic benchmark may be useful.
Hadoop has many related benchmarks, and Terasort is just one of them, although it appears to be the most widely used. It is a MapReduce benchmark and likely has little relevance to non-MapReduce workloads such as HBase, Impala, Hive on Tez, HAWQ, and Solr.
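As a rough illustration of how the three benchmarks fit together, the sketch below sizes a Teragen run and assembles the command lines for Teragen, Terasort, and Teravalidate. It relies on the fact that Teragen writes 100-byte rows; the jar path and the HDFS directory names are hypothetical placeholders, not values taken from the guide, and the script only prints the commands rather than running them.

```python
ROW_BYTES = 100  # Teragen generates fixed-size 100-byte rows


def teragen_rows(target_bytes: int) -> int:
    """Return the number of 100-byte rows needed to reach target_bytes."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-target_bytes // ROW_BYTES)


def terasort_commands(target_bytes: int, base_dir: str, examples_jar: str):
    """Build argument lists for the Teragen, Terasort, and Teravalidate steps.

    base_dir and examples_jar are assumptions supplied by the caller;
    adjust them to match your own cluster layout.
    """
    rows = teragen_rows(target_bytes)
    gen_dir = f"{base_dir}/terasort-input"       # hypothetical directory name
    sort_dir = f"{base_dir}/terasort-output"     # hypothetical directory name
    report_dir = f"{base_dir}/terasort-report"   # hypothetical directory name
    return [
        ["hadoop", "jar", examples_jar, "teragen", str(rows), gen_dir],
        ["hadoop", "jar", examples_jar, "terasort", gen_dir, sort_dir],
        ["hadoop", "jar", examples_jar, "teravalidate", sort_dir, report_dir],
    ]


if __name__ == "__main__":
    # Example: a 1 TB (10**12 bytes) data set with a placeholder jar path.
    commands = terasort_commands(
        target_bytes=10**12,
        base_dir="/benchmarks",
        examples_jar="/path/to/hadoop-mapreduce-examples.jar",
    )
    for cmd in commands:
        print(" ".join(cmd))
```

Each step reads the directory written by the previous one, so the three commands are meant to be run in order, with each job timed separately.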
Access the step-by-step guide
EMC is sending its top Federation big data experts and senior executives to engage with you at the big data event of the year. Strata + Hadoop World is where big data's most influential business decision makers, strategists, architects, developers, and analysts gather – you won’t want to miss it.
Do you want to:
• Interact with top EMC execs like Sam Grocott and Aidan O’Brien
• Engage with elite EMC Federation big data architects, data scientists and visionaries
• Meet face-to-face with innovators and thought leaders
If the answer is yes, contact your EMC account manager to schedule a meeting with an EMC Federation expert now. You will also have the opportunity to meet EMC Federation experts by visiting the EMC Booth #631.
Access additional Strata + Hadoop World event details
The 3rd platform spans mobile devices, cloud, big data, analytics, and social technologies. It is the new frontier, stretching the reach and impact of data centers across geographies. Learn how to transform your traditional, process-driven IT model into a digitized, market-driven environment so your business can thrive.
In this 60-minute webcast with IDC, VCE, and EMC, you will learn about:
• Market environments and transitions to maximize your IT investments
• Accelerating time from idea to results on integrated, modular solutions and extensions from the 2nd to 3rd platform
• Designing for future needs with scale-out, modular technologies – buy as you go, private, public, hybrid
• Customer case studies and benefits of a modern, connected approach
If you missed the live webcast on Jan 20, it is now available on demand.
After eight weeks of fine-tuning the virtual HDaaS infrastructure, Adobe succeeded in running a 65-terabyte Hadoop workload - significantly larger than the largest known virtual Hadoop workloads. In addition, this was the largest workload ever tested by EMC in a virtual Hadoop environment on Isilon.
Fundamentally, these results proved that Isilon as the HDFS layer worked. In fact, the POC refutes claims by some in the industry that suggest shared storage will cause problems with Hadoop. To the contrary, Isilon had no adverse effects and even contributed superior results in a virtualized HDaaS environment compared to traditional Hadoop clusters. These advantages apply to many aspects of Hadoop, including performance, storage efficiency, data protection, and flexibility.
Download the white paper
Data Science and Big Data Analytics is about harnessing the power of data to gain new insights. Covering the breadth of activities, methods, and tools that Data Scientists use, the book focuses on concepts and principles that can be practically applied to any industry and technology environment. The learning is supported and explained with examples that you can replicate using open-source software.
This book will help you:
• Become a contributor on a data science team
• Deploy a structured lifecycle approach to data analytics problems
• Apply appropriate analytic techniques and tools to analyze big data
• Learn how to tell a compelling story with data to drive business action
• Prepare for EMC Proven Professional Data Scientist certification