For most organizations, the value proposition of Hadoop is that it provides a low-cost way to keep large volumes of data online, and it creates a framework for analyzing that data to uncover new value drivers in their business. However, Hadoop is not the silver bullet for Big Data; Hadoop is a storage substrate and a set of really handy tools that allow us to ingest, process, disseminate, and archive data very effectively. Its value, whether deployed on Isilon or on physical servers, lies more in its adoption as a standard framework for analyzing unstructured data and in how many other interesting and useful data technologies integrate with that framework. That's the power of Hadoop: the ecosystem of knowledge and integrated tools. Hadoop has a ton of promise because of the scale of data we might collect and analyze in a cost-effective manner, but for many companies, Hadoop environments lack many of the critical capabilities that enterprise IT has come to expect from every other application stack in the data center, and that is where Isilon can help.
Leveraging Isilon with Hadoop to meet IT standards
For the last twenty years, IT professionals have operated with enterprise IT standards for designing infrastructure to support critical business applications:
- Disaster Recovery and High Availability
- Applications and data must be protected from accidental deletion, application corruption, and site disasters so that business can continue, even when infrastructure fails. This can include backups, replication, and off-site retention.
- Security and Compliance
- Applications and data must be secured from unwarranted access, both internally and externally. Data must be retained to meet regulatory compliance that is verifiable to regulating bodies.
- Virtualization
- Applications and data must be virtualized for efficiency, mobility, and scalability.
- Consolidate. Consolidate. Consolidate.
- Applications and data must be consolidated to run on as small an infrastructure as is responsible, to avoid unnecessary expenses for energy, real estate, maintenance, and licensing.
- Scalability and Efficiency
- Applications and data must reside on infrastructure that is easily scalable to meet the increasing demands on IT without increasing complexity.
However, these requirements are often overlooked when people start talking about Hadoop. While I understand that the business transformation that analytics powered by Hadoop might deliver is incredibly exciting, I don’t believe the standards of enterprise IT should be mutually exclusive with delivering on the promises of Big Data. Let’s look at those same key tenets IT professionals have been using for many years and understand the differences between how Hadoop is traditionally deployed with direct-attached storage (DAS) in servers and how Isilon can be leveraged to better align with enterprise IT standards:
IT Standards and Requirements: Hadoop with DAS versus Isilon

| Requirement | Hadoop with DAS | Hadoop with Isilon |
| --- | --- | --- |
| Disaster Recovery and High Availability | Hadoop does a 3X mirror for data protection and has no replication natively | Isilon supports snapshots, clones, and replication natively |
| Security and Compliance | Out of the box, Hadoop does not enforce Kerberos authentication, as it assumes all members of the domain it is in are trusted | Isilon supports integrating with AD or LDAP and gives you the ability to safely segment access based on policy |
| Security and Compliance | Hadoop has no native encryption or enforced retention | Isilon supports Self Encrypting Drives across our entire range of nodes and has compliant retention leveraging WORM |
| Scalability and Efficiency | Hadoop natively 3X mirrors files in a cluster, meaning 33% storage efficiency | Isilon is typically 80% efficient and we have sub-file level deduplication for further efficiency gains |
| Scale compute and storage independently | Hadoop marries the storage with the compute, so if you need more space, you have to pay for more CPU that may go unused, or if you need more compute, you end up with lots of overhead storage capacity | Isilon allows you to scale compute as needed and Isilon for storage as needed, aligning your costs with your requirements |

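To make the coupling problem concrete, here is a rough back-of-the-envelope sketch in Python. The workload size (600 TB usable, 160 cores) and the node spec (48 TB raw, 16 cores per node) are made-up numbers for illustration, not figures from any real deployment; only the 3X replication factor comes from the comparison above.

```python
import math

REPLICATION = 3  # standard HDFS 3X mirror on DAS


def das_nodes_needed(usable_tb, cores_needed, node_raw_tb, node_cores):
    """Nodes to buy when each node bundles disk and CPU together.

    Returns (nodes purchased, nodes that compute alone would require).
    """
    for_storage = math.ceil(usable_tb * REPLICATION / node_raw_tb)
    for_compute = math.ceil(cores_needed / node_cores)
    # The larger requirement dictates the purchase; the other resource is overbought.
    return max(for_storage, for_compute), for_compute


# Hypothetical workload and node spec (assumptions, not vendor data).
total_nodes, compute_only_nodes = das_nodes_needed(
    usable_tb=600, cores_needed=160, node_raw_tb=48, node_cores=16
)
idle_cores = total_nodes * 16 - 160

print(f"DAS nodes purchased: {total_nodes}")
print(f"Nodes needed for compute alone: {compute_only_nodes}")
print(f"Cores bought but not needed: {idle_cores}")
```

In this made-up case, storage drives the purchase, so most of the CPU just comes along for the ride. With compute and storage decoupled, each side would be sized on its own requirement.
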
Say that three times fast...Hadoop-Dedupe-Dedupe-Dedupe
This melodic incantation is the distillation of Isilon’s value proposition for Hadoop, coined by some witty Isilon employee…I cannot imagine why it has not caught on yet. This catchy tune tells us that Hadoop is a critical technology framework in the new world of Big Data and that you should absolutely be leveraging Isilon for your Enterprise Data Lake. Here is what each word stands for:
- Hadoop - with HDFS on Isilon, we dedupe storage requirements by removing the 3X mirror on standard HDFS deployments because Isilon is 80% efficient at protecting and storing data.
- Dedupe - applying Isilon's SmartDedupe can further dedupe data on Isilon, making HDFS storage even more efficient.
- Dedupe - by using HDFS on Isilon, we remove the need to have a separate landing zone for data before ingesting to HDFS...we bring analytics to the data, not the other way around.
- Dedupe - by leveraging vHadoop, we dedupe the number of servers required to run Hadoop.
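The first "dedupe" is simple arithmetic, and a quick sketch shows the size of the gap. The 100 TB data set is a hypothetical figure chosen for illustration; the 3X mirror and the 80% efficiency are the numbers quoted above, and the sketch deliberately leaves out any further gains from SmartDedupe.

```python
def raw_tb_with_mirroring(usable_tb, copies=3):
    """Raw capacity when every block is stored `copies` times (HDFS on DAS)."""
    return usable_tb * copies


def raw_tb_at_efficiency(usable_tb, efficiency):
    """Raw capacity when `efficiency` is the usable fraction of raw storage."""
    return usable_tb / efficiency


usable_tb = 100  # hypothetical data set size, in TB
das_raw = raw_tb_with_mirroring(usable_tb)
isilon_raw = raw_tb_at_efficiency(usable_tb, 0.80)
reduction = 1 - isilon_raw / das_raw

print(f"DAS raw capacity:    {das_raw:.0f} TB")
print(f"Isilon raw capacity: {isilon_raw:.0f} TB")
print(f"Raw capacity saved:  {reduction:.0%}")
```

Under these assumptions, the same usable data needs well under half the raw capacity before deduplication even enters the picture.
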
For the less whimsical, more detail-oriented types, this side-by-side comparison of Hadoop on DAS versus Hadoop on Isilon should be more to your liking:
What many of my customers are trying to do is build the foundation for an Enterprise Data Lake: a place where internally sourced data can co-reside with interesting data sets sourced externally and be co-processed to give our analytical queries greater scope and context. Having a Data Lake that is capable of granting access to data via multiple standard protocols and access methods is key, but so is being able to derive value from the data by leveraging the powerful Hadoop ecosystem. A Data Lake has to meet enterprise IT standards and it has to be Hadoop friendly; with Isilon, you can have it all!
Why Isilon is awesome for Hadoop
Here at Isilon, we get involved in a lot of conversations around Data Lakes, and customers generally like the idea that we can provide HDFS protocol access on Isilon. The fact that Isilon can play nicely with physical, virtual, and multi-distribution environments for Hadoop, without the need for the traditional, dedicated server stack being set up in a Hadoop silo, aligns nicely with their goals of efficiency and consolidation. We have also had quite a bit of success and interest from customers in highly regulated industries because of the way we bring enterprise features like data protection, performance tiering, encryption, compliant retention, dedupe, secure authentication, and highly efficient storage to Hadoop environments in ways no other solution has figured out.
So, we may be a little biased, but we believe leveraging Isilon as the storage technology to support and deliver Hadoop functionality is pretty slick, especially given that we can be so much more than just an HDFS storage environment by delivering next generation access to data via protocols like NFS, CIFS, FTP, HTTP and REST. Surprising to many customers, we bring all this value to Hadoop while offering a significantly lower TCO than traditional approaches.
Did I mention that we are just as performance oriented as the DAS approach to Hadoop? By bringing Hadoop to the data and not bringing data to Hadoop, we can typically not only start processing queries sooner but also process them faster. Check out the example below that outlines the entire Hadoop job cycle from data ingest to viewing results.
From time to time, I encounter a Hadoop expert who will argue mightily that disk locality matters in Hadoop and that our approach negates it. The argument here is that part of Hadoop's performance is derived from having blocks of data stored on hard drives in the same servers that run the tasks of a MapReduce job. While we do implement HDFS differently, I generally bring this commentary back to a couple of facts:
- MapReduce is a distributed batch job typically running on servers connected via 1G star networks. Batch and 1G networks are the keys here; these jobs are not high speed, low latency because the infrastructure is not networked that way. It’s moot as there are better ways to process data at speed. Hadoop is Big Data, not fast data.
- A study at UC Berkeley that analyzed Facebook logs found that only 34% of tasks run on the same node that has the input data. If you think disk locality is a big deal, read the article here.
- Today, a single non-blocking 10 Gbps switch port (up to 2500 MB/sec full duplex) can provide more bandwidth than a typical disk subsystem with 12 disks (360 – 1200 MB/sec).
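The bandwidth comparison in the last point is easy to check with a few lines of Python. The per-disk throughput range of 30 to 100 MB/sec is not stated in the text; it is inferred here by dividing the quoted 360 to 1200 MB/sec across 12 disks, so treat it as an assumption.

```python
def gbps_to_mb_per_sec(gbps):
    """Convert a line rate in Gbps to MB/sec (1 Gbps = 1000 Mbit/s, 8 bits per byte)."""
    return gbps * 1000 / 8


port_one_way = gbps_to_mb_per_sec(10)   # one direction of a 10 Gbps port
port_full_duplex = 2 * port_one_way     # send and receive combined

DISKS = 12
# Assumed per-disk throughput in MB/sec, inferred from the quoted 360-1200 range.
PER_DISK_LOW, PER_DISK_HIGH = 30, 100
disk_low = DISKS * PER_DISK_LOW
disk_high = DISKS * PER_DISK_HIGH

print(f"10 Gbps port, one way:     {port_one_way:.0f} MB/sec")
print(f"10 Gbps port, full duplex: {port_full_duplex:.0f} MB/sec")
print(f"12-disk subsystem:         {disk_low}-{disk_high} MB/sec")
```

Note that even a single direction of the port lands above the top of the disk range, so the point does not depend on counting both directions.
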
While I fully admit that I am a celebrity endorser of Isilon as a Data Lake platform who is paid to talk about how great Isilon is for Big Data environments, let’s just recap a few things that I firmly believe we have proven here:
- Isilon provides an incredibly flexible storage substrate for an Enterprise Data Lake by providing multi-protocol access to data, including Hadoop.
- Isilon makes deploying Hadoop in the enterprise cheaper, faster, and easier.
- Isilon delivers enterprise-class IT standards in our deployment of Hadoop beyond the traditional approaches in the Hadoop ecosystem.
Sounds pretty great, right? Say it again with me…Hadoop, Dedupe, Dedupe, Dedupe!