



I was asked to speak at Hadoop Summit 2015 by our good friends and partners at Hortonworks. After getting over the initial excitement, I realized that this called for something different from the typical presentation I might deliver at a trade show. Hadoop Summit attracts the cream of the crop, and the technical level of the audience meant that a typical marketing-heavy presentation would not suffice. As I thought more about the activity I am seeing around analytics, and the adoption of Hadoop in particular, I hit on the notion that what I was really seeing was the true emergence of the DevOps that I had always imagined in my years in operations.


I spent 20 years in data center and IT operations roles before getting into the manufacturer world, and it was during that time that I developed my view of the world of IT and technology: all the glitz, bells, whistles, and flash in the world mean nothing if you cannot make it run in production, day in and day out, without requiring an army and advanced degrees and skills in very fringe technologies.


Unfortunately, the early days of Hadoop and advanced analytics tools have been exactly that, and the barrier to entry I find most commonly in the IT world is that it is “just too complex” to get Hadoop up and running to make it worthwhile. This exposes the second key barrier to entry for analytics, which is not knowing WHAT value analytics delivers.


The Old World

When I started in IT, and indeed up through today, there were two distinct types of IT teams: Dev and Ops. This delicate and often out-of-balance relationship of builders and runners led to many interesting conflicts that weren’t always restricted to the data center or conference rooms (I remember a particularly aggressive flag football game at a company picnic…). Generally this relationship took the form of Developers building things in a total vacuum and tossing them to Operations to run in production, with little or no communication and no validation that they could actually run in the first place.


This led to a lot of difficult and often unwieldy systems being built to support increasingly complex and distributed systems. A lot of tribal knowledge and unsupportable environments were created for a single purpose. Fortunately technologies emerged to lighten this load on the Ops side of the equation in the form of shared compute and storage resources through innovations like SAN and virtualization. Developers could finally take advantage of technologies in order to create more accurate mirrors of the production environment for the purposes of testing and developing their applications.


The New World

The world of Hadoop and advanced analytic tools looks a lot like the world before the emergence of these advanced operational technologies. Massive environments of dedicated, single-function servers and hardware look a lot like 1993 all over again, and that is something that operations has been trying for a long time to get away from.


This accepted model of Hadoop deployments has its roots in a simple fact of IT skunk-works projects: it was built to solve a specific challenge, using what was available at the time. Given the need to move and analyze a lot of data quickly, and the time at which this was being built (pre-2010), the biggest challenge was getting data to a CPU as fast as possible. Since SANs and networking were not nearly as evolved, and things like 10-gigabit networking had not yet been widely adopted, it made sense to put the data local to the CPU for the best speed. This has resulted in the expected outcome of all Hadoop being built the same way, regardless of its operational capabilities or lack thereof…



The Killer App for DevOps

So what is happening now is very similar to what happened some 15 years ago in IT when Operations decided they had had enough of single-purpose environments.

Operations fixes broken stuff, and that model is very inefficient, so they are aggressively trying to innovate. As Hadoop stops being the realm of science fair projects and becomes the de facto platform for the creation and manipulation of analytics models, it becomes increasingly important that it mature in terms of its ability to perform in the enterprise. If you can’t run it inside the expected platforms that IT has already put in place, then the majority of organizations will simply take a pass. This need for operationalization of Hadoop and its ecosystem is the killer app that DevOps has been waiting for in order to emerge and mature fully inside any organization, not just a massive enterprise or tech startup.


DevOps is the marriage of Developers and Operations into a single hive-mind organization that strives for stability and security while providing an agile and flexible platform for the delivery of new technologies to the business and to consumers.



In the analytics world this takes the form of delivering on the concept of Data as a Service: the ability to provide data and analytics capabilities to the data consumer directly. In a broader sense, it is the use case that, if companies get it right, gives them a real chance to make a market-changing move into new areas of their own business.




Where do we go from here?


As custodians of Hadoop and HDFS we are entrusted to build a platform that we can embrace and extend into all the areas that we need to leverage this type of capability for large data-set processing and storage. The call to action for all of us is to NOT lose sight of the need to make sure this technology isn’t relegated to the scrap heap that so much of the new, hot tech ends up on.


Hadoop has staying power and will continue to be a platform that companies use to deliver actionable insights into data for a long time to come, but if it is incapable of living in the DevOps frame of mind and set of tools, then it will be a pyrrhic victory at best. In the act of being successful in the accepted model, it will eliminate itself from competition in the future as organizations move past it to something with the ability to run in the DevOps model.


If you missed this presentation watch the on-demand version of it here.


Today, it is hard to dispute the value of data-driven approaches. As a result, businesses are seeking, now more than ever, ways to harvest the value of data through analytics. But data and analytic methods have also been victims of the greatest villain of IT since its inception: the siloed approach.


It’s extremely important to understand that IT has been burned from the beginning of time by silos.  In order to quickly solve a business problem, IT has been dealing with silos created by mainframes, applications, storage, networking and more.  As a result, data has always been a victim of this approach since there was no strategy to pool and aggregate data across these silos.


There is now an opportunity for IT to show its greatest value to the business by architecting a data lake, transcending silos across data types, data stores, analytical methods, speed of analytics, and more.  IT can deliver an information landscape that expands the universe of data and the universe of analytical approaches to improve the accuracy of the analytical method, accelerating the speed at which value can be obtained, and enabling faster innovation in the data-driven world.



What is really exciting about a data lake is the ability to analyze a problem from a myriad of perspectives, increasing the quality of the outcome. Consider hiring, for example: employers have been constrained to considering only a well-defined set of parameters before deciding which candidate is the best match. Recruiting companies such as Gild, on the other hand, tap big data sources such as Twitter, Facebook, blog feeds, and question-and-answer sites such as Quora to analyze how a candidate interacts with the outside world and where his or her expertise lies before deciding whether the candidate is right for the environment. Retention rates and employee satisfaction on the job can be far greater, yielding a much better outcome than before.


A data lake sounds like the holy grail, but how can an organization actually build one when most IT environments are heterogeneous: a mix of data managed by traditional second platform technologies (file, block, etc.) and new emerging third platform technologies (Hadoop, mobile, cloud)? In other words, how can IT create this modern architecture that integrates and brings together diverse data sources, with ease of management for IT and ease of access and analysis for data scientists, business analysts, and application developers?


Additionally, a data lake is not your traditional BI/DW environment - it requires new skills and processes, which organizations may struggle with.  Last, a data lake may not meet enterprise class requirements, such as mission-critical availability, performance, security, and data governance.



Being at Hadoop Strata 2015 in San Jose is like being transported to the future and seeing what software technology will look like less than a decade from now.


I visited Hadoop Strata all the way from Melbourne, Australia, to try to catch up with all the new vendors in the market and bring this expertise to the rest of the world, as many of these products are focused on their own region.


Now let's get down to business.


Apache Spark


Apache Spark still looks like the way to go as MapReduce fades and Apache Mahout transitions to Spark MLlib. Until now Spark lacked persistent table storage for databases; we were able to get away with registering temporary tables. That has all changed with the announced BlinkDB, which will hold the metastore and table storage.


R on Spark is a very aggressive move, as it eliminates the need to program R outside of the Spark ecosystem and allows analytics within the ecosystem using the same RDDs (Resilient Distributed Datasets) that live in memory.

The main benefit of the Spark platform is the ability to stream data using Spark Streaming, query data using Spark SQL (previously Shark), and apply analytics using MLlib and R, all within the same platform and against the same RDD in memory. This boosts performance because the data is moved into memory at the beginning, eliminating the need to read from spinning disks as long as there is sufficient memory to handle the data. The data can also be held only partially in memory, since Spark by nature spills over to disk when there is insufficient memory to keep all the datasets in.


The question I get asked all the time is: can Spark be used independently of Hadoop? The answer is yes. Spark can run in standalone mode or on a cluster manager such as Apache Mesos. The only problem is that Spark then becomes another silo rather than being integrated with Hadoop and using the different data stored by the mixed workloads in the Data Lake.


Unified Data Storage on Hadoop


As part of setting standards for working with Hadoop files, there is a big push for data standardization on Avro and Parquet in order to avoid the chaos of differing file formats and extensions.


Avro is a row-based storage format for Hadoop and Parquet is a columnar storage format. Depending on how the data will be used and what type of file it is, we can easily decide how to store it. Most of the time Avro is the way to go, unless we are working with a dataset that consists of many columns and we want to avoid scanning columns that are rarely used. In that case we can choose Parquet and select only the columns we need in order to optimize our queries in Hadoop.
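The row-based versus columnar trade-off described above can be illustrated with a small toy model. This is only a sketch in plain Python: it does not use the real Avro or Parquet libraries, and the record fields and helper names (`to_columnar`, `values_scanned_row_layout`, `values_scanned_columnar_layout`) are hypothetical, chosen purely for illustration.

```python
# Toy comparison of a row-oriented layout and a column-oriented layout.
# It counts how many stored values a query has to touch in each case.
# Field names and helpers are hypothetical; no Avro/Parquet code is used.

records = [
    {"user_id": 1, "country": "AU", "clicks": 12, "revenue": 3.5},
    {"user_id": 2, "country": "US", "clicks": 7, "revenue": 0.0},
    {"user_id": 3, "country": "AU", "clicks": 30, "revenue": 9.25},
]


def to_columnar(rows):
    """Pivot a list of row dicts into a dict mapping column name to values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_scanned_row_layout(rows):
    """A row layout reads every field of every record, wanted or not."""
    return sum(len(row) for row in rows)


def values_scanned_columnar_layout(columns, wanted):
    """A columnar layout reads only the columns the query asks for."""
    return sum(len(columns[name]) for name in wanted)


columns = to_columnar(records)
wanted = ["clicks"]

total_clicks = sum(columns["clicks"])
row_cost = values_scanned_row_layout(records)
col_cost = values_scanned_columnar_layout(columns, wanted)

print(f"total clicks: {total_clicks}")
print(f"values scanned, row layout: {row_cost}")
print(f"values scanned, columnar layout: {col_cost}")
```

With four columns and a query that needs only one of them, the columnar layout touches a quarter of the values. The gap widens as the number of unused columns grows, which is the case where Parquet tends to pay off.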


Cloudera was showing the Kite SDK, which can easily convert data into Avro or Parquet format; Kite can also be integrated to do the conversion at the point of ingestion if needed. Most of the current databases in Hadoop can read Avro and Parquet formats (Pivotal HAWQ, Cloudera Impala, Hive, Spark, etc.).


Hive on Tez vs Hive on Spark


The future of relational databases is here now. From my point of view it is only a matter of time until companies start replacing their existing traditional databases with Hadoop-based ones such as Pivotal HAWQ or Apache Hive. I attended a smart talk by Mostafa Mokhtar, a smart performance engineer from Hortonworks, whose team was benchmarking every aspect of standalone Hive against Hive on Tez and Hive on Spark. We had been hearing a lot of noise about how disruptive Hive on Spark would be. Surprisingly, Hortonworks was able to show that Hive on Tez is more than 70% faster than Hive on Spark, although Spark SQL is more than 60% faster than Hive on Spark and probably faster than Hive on Tez as well.


This is great news for current Hive users, as it is obvious that Hive on Tez is the way to go on this one.


Oracle Heretic


Posted by Oracle Heretic Mar 4, 2015


I have been reading Jeff Needham's book Disruptive Possibilities: How Big Data Changes Everything, which I have been finding to be excellent. In the process, I think I finally get it as to what the big data fuss is all about.


The problem is that the process of being in business now produces an enormous quantity of data, which, if mined appropriately, can lead to interesting insights. And the real rub is that if you don't mine that data, your competitor will. Thus, the big data opportunity becomes irresistible.



Chevy Volt


I like to think in terms of things that I know. Like my car. I drive a 2014 Chevy Volt, an electric car in other words. This car gets around 127 mpg, and it reports that to me every time I start it up or shut it down.


But that's just the beginning of the data that this car churns out. It's a computing platform after all. The thing is completely digital. It is true drive-by-wire. All of the controls are simply digital controllers for the software that runs the car. And it has sensors galore. 360 degree cameras, GPS, OnStar, you name it.


Now, the rub for GM is that there are several hundred thousand Chevy Volts running around in the world at this moment, and all of them are spawning data at a frightening rate. Try grabbing all of the data from just one of them!


But if GM can grab all of that data, what an opportunity. It is impossible to even predict the insights that GM might obtain from the data collected by several hundred thousand smart cars running around on the roads.



Tesla Model X


And the real rub for GM: There is an emerging car company called Tesla, which manufactures electric cars too. And Teslas are just as smart as Volts, if not smarter. You can be sure that Elon Musk is capturing the data from all of those Teslas, and doing something interesting with it.


Again, being in business (in this case the business of making electric cars) creates the opportunity to capture data from your customers' use of your product. If you don't exploit that opportunity, your competitor will, and the result will eventually be a marked competitive advantage.


This same analysis can be applied to any enterprise. All firms of any significant size now face the challenge of either building or acquiring access to a big data reservoir (so-called data lake), and beginning to explore the use of that data. Eventually, this will simply become required in order to be in business.

I have the honor of representing EMC in MIT's BigData@CSAIL initiative. At a recent annual meeting, I had the opportunity to speak about the challenges enterprises are having in adopting Big Data. The talk resonated with the other industry representatives in the room so I thought it worthwhile to share these observations in this forum.


During the past decade, a handful of companies have pioneered the Big Data movement. They shared the common challenge of extracting value from unprecedented amounts and kinds of data. To achieve this, they developed new software, fueled by Moore’s law advancements, capable of processing data with substantial economies of scale. This allowed the Big Data pioneers to discover and profitably monetize previously inaccessible predictive relationships - essentially the Long Tail concept applied to analytics.


The pioneers’ extraordinary success has inspired many companies to explore using Big Data to improve or grow their business. The plethora of open-source Big Data software appears to make this an easy task. However, three fundamental challenges are making crossing the Big Data chasm difficult.


In the popular press, Big Data appears to be magic: just put your data in a hat, hire magicians called “Data Scientists”, and profitable insights materialize in a puff of smoke. This, however, is very far from reality. Finding obscure profitable predictive relationships may require a substantial amount of exploratory analytics. The time required to find these insights can be highly variable. So too their value. This uncertainty makes it exceedingly difficult to estimate the return on investment of implementing a Big Data platform. The traditional business school approaches to evaluating opportunities and managing investments don’t apply. Business leaders, therefore, must take a Big Data “leap of faith”, which is the first of the three challenges.



The brave leaders who make the leap often land in an unfamiliar sea of technology options: Hadoop, Spark, Storm, Kafka, Solr, HBase, Redis, Cassandra, MongoDB, etc. Disruptive “born digital” companies talk of new architectural approaches like the Lambda Architecture, while incumbent data warehouse vendors promote more traditional-looking hybrid architectures. No two Big Data environments look alike, which makes them difficult to compare. Put simply, the newly converted have a lot of choices but no way to choose. This can make implementing Big Data an arduous task of evaluating various technologies and assembling them into a coherent analytics environment. Many IT organizations may not be prepared for such an involved procurement, evaluation, and integration project, making this “some assembly required” aspect of Big Data the second of the three challenges.


Those that manage to build a Big Data platform need data to put on it. In established companies, that data is often in silos within the different departments and business units. Compelling the data’s migration out of the silos requires convincing their owners that the Big Data platform will provide unequaled unique value, particularly if everyone participates. To a silo owner, however, migrating all data to a single Big Data platform may seem like an unacceptable “concentration of risk”. Mitigating the impact of bad internal actors, external security breaches, and rare faults is a difficult task made even more so when the data is collected together in a single system. Therefore, big data platforms desperately need the same robust, proven data protection, management, and security capabilities that exist in contemporary enterprise storage and database systems. Unfortunately, many Big Data technologies lack these capabilities, which constitutes the third of the three challenges.


These three fundamental challenges -  “leap of faith”, “some assembly required”, and “concentration of risk” – are standing in the way of many companies implementing Big Data. At EMC, the Big Data Solutions and Federation Data Lake teams are aggressively developing offerings based on our proven enterprise products to address these challenges and enable our customers to focus on using Big Data to grow their business. With EMC, you can take a step toward the future without leaving the valuable part of the past behind.

Project Gaucho began, as many things often do, with a conversation over a good dinner. The dinner was good enough to be used as the name of the project, but the real meat was the conversation between a large Web Analytics company's storage team and the analytics specialist from EMC who was with them at dinner.

Initially the storage team dismissed the discussion; one of them told me later that their first reaction was "yeah, yeah, what do you want to sell me now?" But as the conversation went on, the storage team and the EMC specialist realized that they had stumbled on something big: something in which they both saw huge potential and huge value, and which, undertaken jointly, could produce major results for both organizations. Thus Project Gaucho was launched, and today it has become one of the most successful examples of customer partnering and technology validation available within EMC.

The Why

What had stuck in the storage team's mind that night was something I said about another customer who was spending many millions of dollars a month with a cloud service provider for their analytics processing. The storage admin remembered that he had recently had a conversation with someone about the same problem: a huge amount of money was flying out the door to support Hadoop projects. He immediately realized that they had the same issue, a shadow IT environment in the cloud that they could not control. It also did not hurt that this company is a huge Isilon customer with many petabytes of Isilon storage on their floor. The storage team already knew that they could provide native HDFS services from the Isilon (which is where the data they wanted to analyze already lived), but what they were missing was the front-side compute to make it happen. Thus the Federation of EMC products came into play and the overall project, audacious at the time, came into being.

Me: "You give me one month of your AWS bill, and I will make sure you never go back to the cloud again."

Customer: "Prove it."

The What

As we talked further, what emerged was both profound in its thought process and scope and breathtakingly simple in its execution. I already knew that splitting compute and storage was the right way to go, and the only way IT can wrestle back some level of standards from the wild west of analytics environments. What I did not have was anything to back that up. Sure, we had tried it, with 8 Isilon nodes and a handful of VMs. But when I asked the business units in EMC for sizing guidelines, how to spec the environment, testing and performance expectations, and other technical guidance, what I got was a lot of crickets, a few completely conflicting accounts of what was needed, and a few mumbled, "it'll probably work, but it might not, or it might be really slow."

(Figure: the original vBlock, Pivotal, and Isilon architecture sketch.)

What I had proposed was what you see on that original "back of the napkin" sketch: a vBlock implementation, running Pivotal Data Services, deployed via BDE, with Isilon for the data set and VNX for the VMware boot images. Pretty simple in actuality. Unfortunately, it was also 100% unique at the time it was proposed, but that was also the exciting part for both EMC and the customer. For us, it was a chance to vet the architecture with real data, in a live-fire simulation, under real-world workload conditions. For this organization, it was the chance to bring what were traditionally completely siloed groups together to build a first for their organization, a comprehensive solution platform, and to validate the fundamental architecture as a vehicle for many other use cases. In spite of some fits and starts, we were able to get all the requisite approvals, get gear on the floor, and get tests run.

It cannot be overstated how instrumental the cooperation of all the BUs and account team members was in this effort. With a federation of many organizations, the phrase "it takes a village" was never more apt than it was with this project. We were able to get amazing cooperation from the various teams, and the level of effort paid off in the customer's continued support of the project even when it seemed it might not get off the ground.

The Results

To say it worked would be a staggering understatement. Why? There are three primary factors that can be considered the keys to the success that we measured:

- Operational capability
- Performance on real-world data-sets
- Time to results

Operational capability was an easy win - they already had Isilon and so knew about its capabilities and simplicity of management even at scale. The marriage of in-place analytics support with native HDFS meant that the deployment of large Hadoop clusters was a question of minutes instead of weeks - we spun up multiple 128-node clusters in less than 15 minutes and began running MapReduce jobs via Pig.

Is there anything easier to manage than a fully virtual environment? Server goes down? No problem - move the work over to a different one. Storage device issues? No impact, due to the redundancy and HA built into an enterprise-grade platform like Isilon. Cloud-like features like on-demand scaling and self service? Yep - building a portal for deploying Hadoop clusters of any size and flavor was a simple exercise in using vCAC and vCloud Director to produce the right tiles and configuration menus.

With multi-tenancy and simple deployment the operational equation was a no-brainer and both the storage and compute teams agreed that it was as functional as they could have hoped. Not to say we did not encounter challenges; we did a fair amount of "engineered on the fly" work to make things functional, but the Isilon and the compute environments both performed as expected without major issues.

Performance was interesting to map out since we were largely in uncharted waters. As we all know, Hadoop has either a math problem or a space problem, but rarely both. To fix one you have to buy both parts if you are using the antiquated and truly non-operational DAS model. The split was exactly what we hoped - we could add data seamlessly and in place as needed, add more space by growing out the Isilon cluster as normal, and throw more compute nodes at specific jobs, with the expected diminishing-returns curve of performance as the cluster time-sliced resources more and more. The biggest challenge was in memory usage and heap sizing, with shuffle space issues a close second. We worked through these with a fair degree of trial and error, but ultimately found combinations that allowed us to properly size and scale the clusters to meet the test plan we had set out. What was eye-opening was the point at which the curve falls off - the VM environment performed on par with a comparable physical system with small numbers of nodes, but when scaled out it continued to improve well past the point we had expected it to fall over. We attributed this to two things:

First was the optimization of the jobs for the environment - solid technology will never fix sub-optimal code. You can throw more hardware at it to make it go incrementally faster, but simple code tweaks and using built-in functions gave us important performance gains. The second was that the Isilon reduced the overall need to wait for multiple write chains to finish, and all nodes in the cluster could participate because there was no need for data locality computations (all nodes see the entire HDFS volume of data as "local"). Together these let the jobs get down to work a lot faster than in a legacy DAS model, where they had to spend a fair amount of time mapping out jobs and waiting for data. In our testing, regardless of cluster size, every node worked on a task against a given data set. This can create sub-optimal scenarios where a very small data set gets broken up across too many nodes and the process of doing that work outweighs the time it would take to just run the job on one or two nodes - there are many ways to handle this, and we are investigating some programmatic controls internally to see if we can find a managed approach that runs inline with the jobs themselves. The end result, though, is that this organization was able to analyze much larger volumes of data than they could previously, because they are no longer constrained by the node count of the cluster and the storage space on each node.
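The small-data-set problem mentioned above can be sketched with a toy cost model. Everything here is an assumption made for illustration, not something measured in the project: the model charges a fixed coordination cost for every participating node and divides the scan work evenly, and the function names and default rates (`per_node_gb_per_min`, `split_overhead_min`) are hypothetical.

```python
# Toy cost model for choosing how many nodes should work on a job.
# Assumptions (hypothetical, not measured): each participating node adds a
# fixed coordination cost, and the scan work divides evenly across nodes.


def estimated_runtime(data_gb, nodes, per_node_gb_per_min=1.0,
                      split_overhead_min=0.5):
    """Estimated minutes to finish: divided scan work plus per-node overhead."""
    work = data_gb / (nodes * per_node_gb_per_min)
    overhead = split_overhead_min * nodes
    return work + overhead


def best_node_count(data_gb, max_nodes, **model_params):
    """Return the node count (1..max_nodes) with the lowest estimated runtime."""
    return min(
        range(1, max_nodes + 1),
        key=lambda n: estimated_runtime(data_gb, n, **model_params),
    )


small_job_nodes = best_node_count(data_gb=2, max_nodes=128)
large_job_nodes = best_node_count(data_gb=512, max_nodes=128)

print(f"2 GB job: use {small_job_nodes} node(s)")
print(f"512 GB job: use {large_job_nodes} node(s)")
```

Under these assumed rates, a 2 GB job is fastest on two nodes even though 128 are available, while a 512 GB job benefits from 32. A real control would need measured overheads in place of the made-up constants, but the shape of the decision is the same.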

Time to results is key to analytics, and just eliminating the process of loading the Hadoop environment reduced the time to run a job from multiple days to hours. Do not let developers or data scientists minimize this piece of a job! The time it takes to run a job and get results runs from the minute the data is created to the time the analytics result set is created, and it includes the time it takes to load data into the engine doing the math. Not only did the jobs run at the same rate or better, but without any lead time to load the data we eliminated many hours of manual work required to populate the HDFS environment before jobs could run at all. This means that our time to results was dramatically shorter and our time to provision a functional, working environment was significantly shortened, so this organization can reach conclusions and derive insight with significantly greater agility. That is money against the bottom line, and that is the value of the federated solution.
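The accounting argument above, that time to results includes load time as well as run time, can be written down in a few lines. The hours used here are hypothetical placeholders picked only to match the "multiple days to hours" description; they are not measurements from the project, and the `JobTiming` class is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class JobTiming:
    """End-to-end timing for one analytics job, in hours."""
    load_hours: float  # copying data into the cluster before the job can start
    run_hours: float   # the job itself

    @property
    def time_to_results(self) -> float:
        # Time to results counts everything between data creation and the
        # result set, so the load step is included.
        return self.load_hours + self.run_hours


# Hypothetical figures: a two-day load for a DAS cluster versus analyzing
# the data in place, with the same job run time in both cases.
das_cluster = JobTiming(load_hours=48.0, run_hours=6.0)
in_place = JobTiming(load_hours=0.0, run_hours=6.0)

speedup = das_cluster.time_to_results / in_place.time_to_results

print(f"DAS cluster: {das_cluster.time_to_results} hours")
print(f"In place:    {in_place.time_to_results} hours")
print(f"Speedup:     {speedup}x")
```

Even with identical job run times, removing the load step dominates the end-to-end number, which is why it should not be left out when comparing platforms.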


Like so many great things, it started over casual dinner conversation and a diagram on a napkin. What emerged was a resounding affirmation that with the right tools, people, and collaboration it is possible to effect real change in the legacy mindset and create value from things we often take for granted.