Project Gaucho began, as many things often do, with a conversation over a good dinner. The dinner was good enough to be used as the name of the project, but the real meat was the conversation between a large Web Analytics company's storage team and the analytics specialist from EMC who was with them at dinner.
Initially the storage team had dismissed the discussion, one of them told me later that they had the "yeah, yeah, what do you want to sell me now?' reaction to it initially, but as the conversation went on the storage team and EMC specialist realized that they had stumbled on something big; something that they both saw huge potential and huge value in undertaking in a joint fashion that could produce major results for both organizations. Thus Project Gaucho was launched and today has become one of the most successful examples of customer partnering and technology validation available within EMC.
What had stuck in the storage team's mind that night was something I said about another customer who was spending many millions of dollars a month with a cloud service provider for their analytics processing. The storage admin remembered that he had had a conversation with someone recently about the same problem; a huge amount of money was flying out the door to support Hadoop projects and what he immediately realized was they had the same issue; a shadow IT environment in the cloud that they could not control. It also did not hurt that this company is a huge Isilon customer with many Petabytes of Isilon storage on their floor. The storage team already knew that they could provide native HDFS services from the Isilon (which is where the data they wanted to analyze already lived), but what they were missing was the front-side compute to make it happen. Thus the Federation of EMC products came into play and the overall project, audacious at the time, came into being.
Me: "You give me one month of your AWS bill, and I will make sure you never go back to the cloud again."
Customer: "Prove it."
As we talked further what emerged was both profound in its thought-process and scope, but breathtakingly simple in its execution. I already knew that splitting compute and storage was the right way to go, and the only way IT can wrestle back some level of standards from the wild-west of analytics environments. What I did not have was anything to back that up. Sure we had tried it, with 8 Isilon nodes and a handful of VM's. When I asked what the sizing guidelines were, how to spec the environment, testing and performance expectations, and other tech guidance from the business units in EMC, what I got was a lot of crickets, a few completely conflicting accounts of what was needed, and a few mumbled, "it'll probably work, but it might not, or it might be really slow."
What I had proposed was what you see on that original "back of the napkin" sketch: a vBlock implementation, running Pivotal Data Services, deployed via BDE, with Isilon for the data set and VNX for the VM-ware boot images. Pretty simple in actuality. Unfortunately also 100% unique at the time it was proposed, but that was also the exciting part for both EMC and the customer. For us, a chance to vet the architecture with real data, in a live-fire simulation, under real-world workload conditions. For this organization, the chance to bring what were traditionally completely siloed groups together to build a first for their organization; a comprehensive solution platform, and to validate the fundamental architecture as a vehicle for many other use cases. In spite of some fits and starts, we were able to get all the requisite approvals, gear on the floor and tests run.
It cannot be overstated how instrumental the cooperation of all the BU's and account team members were in this effort. With the federation of many organizations the phrase "it takes a village" was never more apt than it was with this project. We were able to get amazing cooperation from the various teams and the level of effort paid off in the customer's continued support of the project even as it seemed it might not get off the ground.
To say it worked would be a staggering understatement. Why? There are three primary factors that can be considered the keys to the success that we measured:
- Operational capability
- Performance on real-world data-sets
- Time to results
Operational capability was an easy win - they already had Isilon and so knew about its capabilities and simplicity of management even at scale. The marriage of in-place analytics support with native HDFS meant that the deployment of large Hadoop clusters was a question of minutes instead of weeks - we spun up multiple 128 node clusters in less than 15 minutes and began running Map-Reduce jobs via Pig.
Is there anything easier to manage than a fully virtual environment? Server goes down? No problem - move the work over to a different one. Storage device issues? No impact due to the redundancy and HA built into an enterprise-grade platform like Isilon. Cloud-like features like on-demand scaling and self service? Yep - portal for deploying Hadoop clusters of any size and flavor was a simple exercise in using VCAC and V-Cloud Director to produce the right tiles and configuration menus.
With multi-tenancy and simple deployment the operational equation was a no-brainer and both the storage and compute teams agreed that it was as functional as they could have hoped. Not to say we did not encounter challenges; we did a fair amount of "engineered on the fly" work to make things functional, but the Isilon and the compute environments both performed as expected without major issues.
Performance was interesting to map out since we were largely in uncharted waters. As we all know, Hadoop either has a math or a space problem but rarely does it have both. To fix one you have to buy both parts if you are using the antiquated and truly non-operational DAS model. The split was exactly what we hoped - we could add data seamlessly and in place as needed and add more space through growing out the Isilon cluster as normal, and we could throw more compute nodes at specific jobs with the expected diminishing returns curve of performance as the cluster time-sliced resources more and more. The biggest challenge was in memory usage and heap sizing with shuffle space issues being a close second. We worked through these with a fair degree of trail and error, but ultimately found combinations that allowed us to properly size and scale the clusters to meet the test-plan we had set out. What was eye-opening was the point at which the curve falls off - the VM environment performed on par with a comparable physical system with small numbers of nodes, but when scaled out it continued to improve well past the point we had expected it to fall over. We attributed this to two things:
First was the optimization of the jobs for the environment - solid technology will never fix sub-optimal code. You can throw more hardware at it to make it go incrementally faster, but simple code tweaks and using built-in functions gave us important performance gains. The second was the fact that with the combination of the Isilon reducing the overall need to wait for multiple write chains to finish, and the ability for all nodes in the cluster to participate because of the lack of the need for data locality computations (all nodes see all the HDFS volume of data as "local") the jobs were able to get down to work a lot fast than in a legacy model of DAS where they had to spend a fair amount of time mapping out jobs and waiting for data. In our testing, regardless of cluster size, every node worked on a task against a given data set. This can create sub-optimal scenarios where a very small data set tries to get broken up on too many nodes and the process of doing that work outweighs the time it would take to just run the job on one or two nodes - there are many ways to handle this and we are investigating some programatic controls internally to see if we can find a managed appraoch to this that runs inline with the jobs themselves. The end result though is that this organization was able to analyze much larger volumes of data than they could previously because they are no longer constrained by the node count of the cluster and the storage space on each node.
Time to results is key to analytics, and just eliminating the process of loading the Hadoop environment reduced the time to run a job from multiple days into hours. Do not let developers or data scientists minimize this piece of a job! The time it takes to run a job and get results is the time it takes from the minute the data is created to the time the analytics result set is created, and includes the time it takes to load data into the engine doing the math. Not only did the jobs run at the same rate or better, but without any lead-time to load the data we eliminated many hours of manual work required to populate the HDFS environment before jobs could run at all. This means that our time to results was exponentially shorter, our time to provision a functional and working environment was significantly shortened, and so the overall value of our platform means that this organization can reach conclusions and derive insight with significantly greater agility. That is money against the bottom line and that is the value of the federated solution.
Like so many great things, it started over casual dinner conversation and a diagram on a napkin. What emerged was a resounding affirmation that with the right tools, people, and collaboration it is possible to affect real change into the legacy mind-set and create value from things we often take for granted.