Being at Hadoop Strata 2015 at San Jose is like being transported to the future and see how the software technology will look like less than a decade from now.
I have visited Hadoop Strata all the way from Melbourne - Australia to try to catch up with all the new vendors in the market and bring this expertise to the rest of world as many of these products are focusing in their region.
Now lets get into business,.....
Apache Spark is still looking like the way to go when it comes to the fading map reduce and Apache Mahout being transitioned to spark MLLib, we had a lack of presistant tables storage for databases in spark, we were able to get away with registering temporary tables but now all changed for Spark with the announced BlinkDB that will be holding the metastore and tables storage.
R on spark is a very aggressive move as it eliminates the need to program R outside of Spark eco-system, allowing analytics within the eco system using the same RDDs (Resilient Distributed Data sets) that are living in memory.
the main benefits of the spark platform is the ability to stream data using Spark Streaming, query data using SparkSQL (previously Shark) and apply analytics using MLLib and and R within the same platform utilizing the same RDD that is occupying the memory, therefore boosting the performance since the data is moved to memory at the beginning eliminating any need to read from spinning disks as long as their is a sufficient memroy to hadnle the data or even partially in memory as Spark by nature spills over to Disk when there is insufficient memory to keep all the datasets in.
the question that I get asked all the time, can spark used independently from Hadoop? and the answer is Yes, Spark can run in a standalone mode or using a cluster like Apache Mesos, the only problem is Spark will become another silo rather than being integrated in Hadoop and utilizing the different data stored from the mixed workloads and the Data Lake.
Unified Data Storage on Hadoop
As part of setting standards in working with Hadoop files, there is a big move and efforts to push for data standardization using Avro and Parquet in order to avoid the chaos of the different files format and extensions.
Avro is used for row based storage format for Hadoop and Parquet is used for columnar based storage, depending on how the data will be dealt with and what type of a file its we can easily decide how to store it, for most of the time Avro is the way to go unless we are working with a dataset that consist of plenty of columns and I want to avoid scanning un-needed columns as they wont be used much, the I can easily select a Parquet format and pre-select my columns in order to optimize my queries for examples in Hadoop.
Cloudera was showing the Kite SDK that can easily convert the data into Avro or Parquet format, we can also integrate Kite to do the conversion at the point of ingestion if needed. Most of the current databases in Hadoop can read Avro and Parquet formats (Pivotal Hawq, Cloudera Impala, Hive, Spark...etc)
Hive on Tez vs Hive on Spark
the future of relational databases is here now, from my point of view its only a matter of time until companies starts replacing their existing traditional ones to the Hadoop ones like Pivotal Hawq or Apache Hive...etc. I have attended a smart talk by a smart performance engineer Mostafa Mokhtar from Hortonworks, they were trying to benchmark every aspect of Hive stand alone compared to Hive on Tez and Hive on Spark, we were hearing a lot of noise on how disruptive Hive on Spark would be, surprisingly Hortonworks was able to show that Hive on Tez is more than 70% faster than Hive on Spark, although Spark-SQL is more than %60 faster than hive on Spark and probably faster than Hive on Tez as well.
this is a great news for current Hive users as its obvious that Hive on Tez is the way to go on this one.