ECN Big Data Roundup, January 2015

Follow ECN Everything Big Data technical community


Step-By-Step Guide:  Hadoop Benchmark With Terasort

Are you in the midst of a Hadoop POC or attempting to optimize Hadoop performance? This document describes in detail how to benchmark Hadoop, and in particular how to use the Terasort suite to benchmark YARN MapReduce. Although the guidance applies to any benchmarking with Terasort, it includes specific recommendations for using an EMC Isilon cluster for HDFS storage.

Like any benchmark, Terasort, or the related Teragen and Teravalidate benchmarks, may have limited or even no relevance to your particular workload. If you have a specific workload you are trying to measure or optimize, then you should use that exact workload. In the absence of a specific workload, a generic benchmark may be useful.

There are many benchmarks related to Hadoop, and Terasort is just one of them, although it appears to be the most widely used. It is a MapReduce benchmark and likely has little relevance to non-MapReduce workloads such as HBase, Impala, Hive on Tez, HAWQ, and Solr.
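As a rough illustration of how the three programs fit together, the sketch below builds the command lines for a Teragen, Terasort, and Teravalidate run. It is a minimal sketch, not the procedure from the linked guide: the function name, the jar file name, and the HDFS directory names are hypothetical and will differ by distribution. It relies only on the fact that Teragen writes 100-byte rows, so a 1 TB data set corresponds to 10 billion rows.

```python
import shlex

ROW_BYTES = 100  # Teragen writes fixed-size 100-byte rows


def terasort_commands(total_bytes, base_dir, examples_jar):
    """Return the command lines (as argument lists) for one benchmark run.

    base_dir and examples_jar are placeholders supplied by the caller;
    the sub-directory names below are illustrative assumptions.
    """
    rows = total_bytes // ROW_BYTES
    gen_dir = f"{base_dir}/teragen"          # input data written by Teragen
    sort_dir = f"{base_dir}/terasort"        # sorted output from Terasort
    report_dir = f"{base_dir}/teravalidate"  # validation report
    prefix = ["hadoop", "jar", examples_jar]
    return [
        prefix + ["teragen", str(rows), gen_dir],
        prefix + ["terasort", gen_dir, sort_dir],
        prefix + ["teravalidate", sort_dir, report_dir],
    ]


if __name__ == "__main__":
    # Hypothetical jar name and HDFS path; adjust for your environment.
    commands = terasort_commands(
        10**12, "/benchmarks/terasort", "hadoop-mapreduce-examples.jar"
    )
    for command in commands:
        print(shlex.join(command))
```

The printed lines can be reviewed and then run one at a time, timing the Terasort step separately from data generation and validation.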


Access step by step guide



Strata + Hadoop World Feb 17-20, San Jose:  Meet With EMC Big Data Experts and Executives

EMC is sending its top Federation big data experts and senior executives to engage with you at the big data event of the year. Strata + Hadoop World is where big data's most influential business decision makers, strategists, architects, developers, and analysts gather – you won’t want to miss it.

Do you want to:

• Interact with top EMC execs like Sam Grocott and Aidan O’Brien

• Engage with elite EMC Federation big data architects, data scientists and visionaries

• Meet face-to-face with innovators and thought leaders


If the answer is yes, contact your EMC account manager to schedule a meeting with an EMC Federation expert now. You will also have the opportunity to meet EMC Federation experts by visiting EMC Booth #631.


Access additional Strata + Hadoop World event details



On-Demand Webcast:  The 3rd Platform, A New Frontier to Modernize Your Infrastructure


The 3rd platform spans mobile devices, cloud, big data, analytics, and social technologies. It is the new frontier, stretching the reach and impact of data centers across geographies. Learn how to transform your traditional, process-driven IT model into a digitized, market-driven environment so your business can thrive.

In this 60-minute webcast with IDC, VCE and EMC you will learn about:

• Market environments and transitions to maximize your IT investments

• Accelerating time from idea to results on integrated, modular solutions and extensions from the 2nd to 3rd platform

• Designing for future needs with scale-out, modular technologies – buy as you go, private, public, hybrid

• Customer case studies and benefits of a modern, connected approach

If you missed the live webcast on Jan 20, it is now available on demand.



Customer Success:  Adobe Virtualizes Its Large Scale Hadoop Deployment


After eight weeks of fine-tuning the virtual HDaaS infrastructure, Adobe succeeded in running a 65-terabyte Hadoop workload - significantly larger than the largest known virtual Hadoop workloads. In addition, this was the largest workload ever tested by EMC in a virtual Hadoop environment on Isilon.


Fundamentally, these results proved that Isilon works as the HDFS layer. In fact, the POC refutes claims by some in the industry that shared storage will cause problems with Hadoop. On the contrary, Isilon had no adverse effects and even contributed to superior results in a virtualized HDaaS environment compared to traditional Hadoop clusters. These advantages apply to many aspects of Hadoop, including performance, storage efficiency, data protection, and flexibility.


Download the white paper


Education:  Want to become a Data Scientist?

Data Science and Big Data Analytics is about harnessing the power of data to gain new insights. Covering the breadth of activities, methods, and tools that Data Scientists use, the book focuses on concepts and principles that can be practically applied to any industry and technology environment. The learning is supported and explained with examples that you can replicate using open-source software.

This book will help you:

• Become a contributor on a data science team

• Deploy a structured lifecycle approach to data analytics problems

• Apply appropriate analytic techniques and tools to analyze big data

• Learn how to tell a compelling story with data to drive business action

• Prepare for EMC Proven Professional Data Scientist certification


Learn more




Follow @EMCBigData




Subscribe to EMC Big Data Blog



Watch EMC Big Data YouTube Playlist