As we close out the last few weeks of the year, one tradition that I am sure will be alive and well is the annual flood of “Best of the Year” and “Top Predictions for Next Year” lists.  My prediction is that Hadoop will largely be ignored in the festivities despite the recent release of Hadoop 3.  Why?  Hadoop has largely had its day in the social media pundit spotlight.  The amount of oxygen being consumed by the collective hyperventilation over machine learning and AI alone is enough to ensure that Hadoop has little chance of trending right now.  So between the data analytics hype machine being arguably hotter than it has ever been and the amazing innovation coming from the open source community, it is unlikely that anything that has been around for 8-10 years is going to make it onto any “Best of” lists.


But that doesn’t mean you shouldn’t care about Hadoop anymore, and especially, please don’t buy into the “Hadoop is dead, Spark killed it” narrative.  The same conditions that motivated Google to publish the designs that inspired Hadoop, and Yahoo to build it around a “write once, read many, schema on read” approach, still exist.  The number of Hadoop contributors and the amount of new code being added to both the core and related projects indicate to me that this technology is far from being fully exploited.  There are now millions of organizations trying to manage volumes of data that only the mega web companies were dealing with when Hadoop premiered.  One of the first documented Hadoop clusters in use at Yahoo managed a whopping 1.2TB of data.  Despite the technological advances made in traditional relational database management systems (whether SMP or MPP designs), they are still cost prohibitive at the scale of data most organizations need for rich analytics based on a combination of structured and unstructured data.  I’m going to go out on a limb and suggest that although both technologies will continue to get more capable, the Hadoop family of technologies will always scale larger, handle more data formats and cost less per petabyte than an RDBMS.  So, what do I expect to happen with Hadoop in 2018 and beyond?  Here is a short list of ideas:


  1. Defining What Hadoop Is.  Just kidding, why should this year be any different?  Despite being one of the most successful open source initiatives of all time, people still struggle to define what Hadoop is.  The Apache Software Foundation defines the scope of the Hadoop project to include four modules: the common utilities, HDFS, YARN and MapReduce.  It also lists a set of 11 related Apache projects including Cassandra, Hive, HBase, Pig, Spark, Tez and ZooKeeper.  The extensive scope of the Hadoop “ecosystem” is what causes so much confusion in the business community.  If you list everything in the ecosystem, it causes business decision makers to worry about complexity.  If you limit the discussion to just the base components (see the first sketch after this list for a minimal job that uses only those modules), then the technical community misses the breadth of potential applications and the flexibility available to build something very targeted and efficient.  We are just going to have to continue living with the fact that it is difficult to put Hadoop in a single platform category, while also recognizing that this is one of its most interesting and useful characteristics.
  2. Hadoop 3.0 has many important improvements to get excited about.  For example: erasure coding for durably storing data with significant space savings compared to replication, shell script rewrites that fix many long-standing bugs and add some new features, and improved NameNode availability through support for multiple standby NameNodes, to name just three (a sketch of applying an erasure coding policy through the HDFS API appears after this list).  The bigger news for 2018 is that many well-known contributors to Hadoop are saying that 3.0 is primarily getting ready for the bigger improvements coming in 3.1 and 3.2.  For example, the Apache Hadoop community plans to use the new resource type support in Hadoop 3.0 to deliver GPU support in version 3.1, potentially leading to support for FPGAs in 3.2.
  3. Speeding Up Time to Value.  As the scope and power of the Hadoop platform have expanded over the years, so has the complexity of architecting and building solutions.  Cloudera and Hortonworks have stepped in to provide the Hadoop user community with valuable integration services, additional software, advice and support.  Dell EMC has partnered with both Hortonworks and Cloudera to add further integration support with our Ready Bundles for Hadoop.  Dell EMC Ready Bundles for Hadoop are integrated Hadoop solutions designed to address data analytics requirements, reduce costs and deliver outstanding performance.  Dell EMC has been working with the leading innovators in big data since 2008, and started designing and building custom Hadoop solutions in 2009.  Our most recent bundle offer is based on Cloudera Enterprise CDH 5.12 for Apache Hadoop, Dell EMC PowerEdge 14th Generation servers, and 25GbE Z9100-ON networking, together with our Dell EMC Syncsort ETL Offload Solution.  With our deep roots in data analytics solutions and Hadoop, Dell EMC together with our partners has the expertise, tools and solutions needed to drive successful, flexible and scalable Hadoop deployments.
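
To make the “base components” point from item 1 concrete, here is a minimal sketch of the classic word-count job: the input and output live in HDFS, YARN schedules the containers, and the MapReduce framework does the computation.  The input and output paths are illustrative assumptions, not taken from any particular deployment.

```java
// Minimal sketch: a job that touches only the base Hadoop modules
// (HDFS for storage, YARN for scheduling, MapReduce for computation).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Hypothetical HDFS locations used only for illustration.
    FileInputFormat.addInputPath(job, new Path("/data/books"));
    FileOutputFormat.setOutputPath(job, new Path("/data/wordcounts"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```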
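
And to illustrate one of the Hadoop 3 features from item 2, here is a hedged sketch of applying an erasure coding policy to a directory through the HDFS DistributedFileSystem API.  The directory and the choice of the built-in RS-6-3-1024k (Reed-Solomon 6+3) policy are illustrative assumptions; on a real cluster an administrator also has to enable a non-default policy (for example with the hdfs ec command-line tool) before it can be set on a path.

```java
// Hedged sketch: mark a directory for erasure coding in Hadoop 3 so new
// files stored under it use Reed-Solomon encoding (~1.5x storage overhead)
// instead of 3x triple replication.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ErasureCodingExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Erasure coding is an HDFS feature, so we need the HDFS client.
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("Erasure coding requires HDFS");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Illustrative cold-data directory; policy must already be enabled
    // by an administrator on the cluster.
    Path coldData = new Path("/data/archive");
    dfs.setErasureCodingPolicy(coldData, "RS-6-3-1024k");

    System.out.println("Policy now in effect: "
        + dfs.getErasureCodingPolicy(coldData));
  }
}
```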


It is an incredibly exciting time to be working in data analytics.  The promise of developing intelligent applications using machine learning and the huge volumes of digital data that are now available is already being realized across virtually every aspect of society and business, and we have only begun to explore the future potential.  I hope you found some of the links in this article useful.  Please enjoy any opportunities you have to celebrate this season of change and holidays, and watch this community for more articles in the new year.


Thanks for reading,

Phil Hummel

@GotDisk