Part 1


See Part 1 of this blog series.

Part 2


See Part 2 of this blog series.

Part 3


Installation Procedure


In this section, I'll provide a high-level overview of the installation procedure that I used to build this system, as well as some important details.

EMC Elastic Cloud Storage (ECS)

The system has two ECS U300 clusters (one for each site) running ECS version 2.2. Complete documentation can be found here.


Hortonworks Data Platform (HDP)

The system has two independent installations of Hortonworks Data Platform (HDP) 2.4 (one for each site). At a minimum, the following components should be installed:

  • YARN + MapReduce2
  • Hive
  • ZooKeeper
  • Flume
  • Kafka
  • Spark


Anaconda Python


To install Anaconda Python, download the Anaconda Python 2.7 installer for Linux 64-bit. On each host in your Hadoop clusters, install it with the following commands.

# bash ./Anaconda2-*.sh -b -p /opt/anaconda

# /opt/anaconda/bin/pip install pykafka
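Assuming the install prefix used above, a quick sanity check on each host confirms that the interpreter and pykafka are usable (the first command prints the version; the second exits silently on success):

```
/opt/anaconda/bin/python --version
/opt/anaconda/bin/python -c "import pykafka"
```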


Zeppelin Notebook


Zeppelin needs to be installed at just one site. It can easily be deployed with Ambari using the procedure described here.


To configure Zeppelin to use Anaconda Python:

  1. In the Zeppelin UI, click Interpreter.
  2. Find the parameter zeppelin.pyspark.python and set it to "/opt/anaconda/bin/python".


To make PySpark the default for paragraphs typed into Zeppelin notebooks:

  1. In Ambari, click Zeppelin Notebook, then Configs.
  2. Under Advanced zeppelin-config, find zeppelin.interpreters and move "org.apache.zeppelin.spark.PySparkInterpreter" to the beginning of the list.
  3. Save and restart Zeppelin.
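As an illustration of step 2, the reordered value would begin with the PySpark interpreter class (the exact set of interpreter class names depends on your Zeppelin build; the list below is an assumption, abbreviated for clarity):

```
zeppelin.interpreters = org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.markdown.Markdown
```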




Kafka Topics

See Part 2 for how to create the Kafka topics with the appropriate number of partitions and replicas. This must be done at each site.
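For reference, topic creation on an HDP cluster follows this shape (the topic name, partition and replica counts, and ZooKeeper host below are placeholders; use the values given in Part 2):

```
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create \
    --zookeeper <zookeeper-host>:2181 \
    --topic <topic-name> --partitions 6 --replication-factor 2
```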




Flume

See Part 2 for the flume.conf configuration to use. This must be done at each site.


KDD Cup 99 Data


Download the dataset from KDD Cup 1999 Data. Place the file on the server that you will use to run the data generator script at each site.
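For orientation, each record in the dataset is a single comma-separated line: 41 features (some categorical, most numeric) followed by a label that ends in a period. A minimal parser sketch, using a record of this shape as a sample:

```python
# A representative KDD Cup 99 record: 41 features plus a trailing label.
sample = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,"
          "8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,"
          "0.00,0.00,0.00,0.00,0.00,normal.")

def parse_record(line):
    """Split one record into its feature list and its label."""
    fields = line.strip().split(",")
    label = fields[-1].rstrip(".")   # e.g. "normal", "smurf"
    features = fields[:-1]           # 41 raw feature strings
    return features, label

features, label = parse_record(sample)
print("%s %d" % (label, len(features)))  # -> normal 41
```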


Anomaly Detection Demo Application


It will be convenient to use a shared NFS drive that is accessible from hosts at each site. If one is not available, repeat this procedure at each site.


First, clone the Git repository.

$ git clone


Edit the file as appropriate for your environment.


Import the Zeppelin*.json files into Zeppelin.


You may want to edit and use the script to automatically start the data generator and Spark Streaming jobs at site 1 and site 2.


On site 1, run the script that launches the Spark job to build the model.
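The actual model-building code lives in the demo repository. As a rough, hypothetical sketch of what a job like this can look like (the use of MLlib KMeans, the HDFS paths, the value of k, and the choice to drop the three categorical columns are all assumptions for illustration, not the repository's code):

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="BuildAnomalyModel")

# Columns 1-3 of a KDD record (protocol, service, flag) are categorical;
# keep only the numeric features for clustering.
def numeric_features(line):
    fields = line.strip().split(",")
    return [float(x) for i, x in enumerate(fields[:-1]) if i not in (1, 2, 3)]

data = sc.textFile("hdfs:///tmp/kddcup.data").map(numeric_features).cache()
model = KMeans.train(data, k=10, maxIterations=20)
model.save(sc, "hdfs:///tmp/kddcup-kmeans-model")
```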


Questions? Problems?

If you have any questions or problems, leave a comment below.