Find Communities by: Category | Product

Ready Solutions for AI

2 Posts authored by: MikeB@Dell

MNIST with Intel BigDL

Posted by MikeB@Dell Aug 10, 2018



Here at Dell EMC we just announced the general availability of the Dell EMC Ready Solutions for AI - Machine Learning with Hadoop design so I decided to write a somewhat technical article highlighting the power of doing deep learning with the Intel BigDL framework executing on Apache Spark. After a short introduction covering the components of our Hadoop design, I will walk through an example of training an image classification neural network using BigDL.  Finally I show how that example can be extended by training the model to classify a new category of images – emojis.


The combination of Apache Spark and Intel BigDL creates a powerful data science environment by enabling the compute and data intensive operations associated with creating AI to be performed on the same cluster hosting your Hadoop Data Lake and existing analytics workloads. YARN will make sure everything gets their share of the resources.  The Dell EMC Ready Solution also brings Cloudera Data Science Workbench (CDSW) to your Hadoop environment. CDSW enables secure, collaborative access to your Hadoop environment. Data Scientists can choose from a library of published engines that are preconfigured, saving time and frustration dealing with library and framework version control, interoperability etc. Dell EMC includes an engine configured with Intel BigDL and other common Python libraries.

Getting Started

To follow along with the steps in this blog post you can use the docker container included in the BigDL github repository. Installing docker is beyond the scope of this post, however you can find directions to install docker for various operating systems here (link to On my Dell XPS 9370 running Ubuntu 16.04 I use the following commands to get started:

cd ~
git clone
cd BigDL/docker/BigDL
sudo docker build --build-arg BIGDL_VERSION=0.5.0 -t bigdl/bigdl:default .
sudo docker run -it --rm -p 8080:8080 -e NotebookPort=8080 \
-e NotebookToken=”thisistoken” bigdl/bigdl:default

Tip: Notice the space + period that is at the end of our build command. That tells docker to look for our Dockerfile in the current directory.

You should now be able to point a web browser at localhost:8080 and enter the token (“thisistoken” without “” in example). You will be presented with a Jupyter notebook showing several examples of Apache Spark and Intel BigDL operations. For this post we are going to look at the neural_networks/introduction_to_mnist.ipynb notebook. You can launch the notebook by clicking the neural_networks folder and then the introduction_to_mnist.ipynb file.

This example covers the get_mnist function that is used in many of the example notebooks. By going through this we can get a view in to how easy it is to deal with the huge datasets that are necessary for training a neural network. In the get_mnist function we read our dataset and create an RDD of all of the images, sc.parallelize(train_images) loads 60,000 images in to memory ready for distributed, parallel processing to be performed on them. With Hadoop's native Yarn resource scheduler we can decide to do these operations with 1 executor/2 cpu cores or 32 executors with 16 cores each.

Tip: The resource sizing of your Spark executors should be such that executors * cores per executor * n is equal to your batch size.

This example notebook reads the images using a helper function written by the BigDL team for the mnist dataset, mnist.read_data_sets. BigDL provides us the ImageFrame API to read image files. We can read individual files or folders of images. In a clustered environment we create distributed image frames which is an RDD of ImageFeature. ImageFeature represents one individual image, storing the image as well as label and other properties using key/value.

At the end of the introduction_to_mnist notebook we see that samples are created by zipping up the images RDD with the labels RDD. We perform a map of Sample.from_ndarray to end up with an RDD of samples. A sample consists of one or more tensors representing the image and one or more tensors representing the label. This sample of tensors are the structure we need to feed our data in to a neural network.

We can close this introduct_to_mnist notebook now and we will take a look at the cnn.ipynb notebook. The notebook does a good job of explaining what is going on at each step. We use our previously covered mnist helper functions to read the data, define a LeNet model, define our optimizer and train the model. I would encourage you to go through this example once and learn what is going on. After that, go ahead and click Kernel > Restart & Clear Output.

Now that we are back to where we started, let’s look at how we can add a new class to our model. Currently the model is being trained to predict 10 classes of numbers. What if, for example, I wanted the model to be able to recognize emojis also? It is actually really simple, and if you follow along we will go through a quick example of how to train this model to also recognize emoji for a total of 11 classes to categorize.

The Example

The first step is going to be to acquire a dataset of emojis and prepare them. Add a cell after cell [2]. In this cell we will put

git clone
pip install opencv-python


Then we can insert another cell and put:

import glob
from os import listdir
from os.path import isfile, join
from import *
import cv2

path = 'emoji-dataset/emoji_imgs_V5/'
images = [cv2.imread(file) for file in glob.glob("emoji-dataset/emoji_imgs_V5/*.png")]
emoji_train = images[1:1500]
emoji_test = images[1501:]
emoji_train_rdd = sc.parallelize(emoji_train)
emoji_test_rdd = sc.parallelize(emoji_test)

Here we are taking all of our emoji images and dividing them up, then turning those lists of numpy arrays in to RDDs. This isn’t the most efficient way to divide up our dataset but it will work for this example.

Currently these are color emoji images, and our model is defined for mnist which is grayscale. This means our model is set to accept images of shape 28,28,1.  RGB or BGR images will have a shape of 28,28,3. Since we have these images in an RDD though we can use cv2 to resize and grayscale the images. Go ahead and insert another cell to run:

emoji_train_rdd = x: (cv2.resize(x, (28, 28))))
emoji_train_rdd = x: (cv2.cvtColor(x, cv2.COLOR_BGR2GRAY)))
emoji_test_rdd = x: (cv2.resize(x, (28, 28))))
emoji_test_rdd = x: (cv2.cvtColor(x, cv2.COLOR_BGR2GRAY)))
emoji_train_rdd = x: Sample.from_ndarray(x, np.array(11)))
emoji_test_rdd = x: Sample.from_ndarray(x, np.array(11)))
train_data = train_data.union(emoji_train_rdd)
test_data = test_data.union(emoji_test_rdd)

This could easily be reduced to a function, however if you want to see the transformations step by step you can just insert the below code between each line. For example we can see after the resize our image is shape (28, 28, 3) after the resize but before grayscale, and (28, 28, 1) after.

single = emoji_train_rdd.take(1)

Output is 28,28,1.

Then, in our def build_model cell we just want to change the last line so it is
lenet_model = build_model(11) . After this the layers are built to accept 11 classes.


We can see the rest of the results by running through the example. Since our emoji dataset is so small compared to the handwritten number images we will go ahead and create a smaller rdd to predict against to ensure we see our emoji samples after the optimizer is setup and the model is trained. Insert a cell after our first set of predictions and put the following in it:

sm_test_rdd = test_data.take(8)
sm_test_rdd = sc.parallelize(sm_test_rdd)
predictions = trained_model.predict(sm_test_rdd)
imshow(np.column_stack([np.array(s.features[0].to_ndarray()).reshape(28,28) for s in sm_test_rdd.take(11)]),cmap='gray'); plt.axis('off')
print 'Ground Truth labels:'
print ', '.join(str(map_groundtruth_label(s.label.to_ndarray())) for s in sm_test_rdd.take(11))
print 'Predicted labels:'
print ', '.join(str(map_predict_label(s)) for s in predictions.take(11))

We should see that our model has correctly associated the label of “10” to our emoji images and something in the range of 0-9 for the remaining images. Congratulations! We have now trained our lenet model to do more than just recognize examples of 0-9. If we had multiple individual emojis we wanted to recognize instead of just “emoji” we would simply need to provide more specific labels such as “smiling” or “laughing” to the emojis and then define our model with our new number of classes.

The original cnn.ipynb notebook trained a lenet model with 10 classes. If we compare the visualization of the weights in the convolution layers we can see that even adding just a few more images and a new class has changed these a lot.

Original Model

11 Class Model

Thanks everyone, I hope you enjoyed this blog post. You can find more information about the Dell EMC Ready Solutions for AI at this url:

Deep learning has exploded over the landscape of both the popular and business media landscapes.  Current and upcoming technology capable of powering the calculations required by deep learning algorithms has enabled a rapid transition from new theories to new applications.  One of current supporting technologies that is expanding at an increasing rate is in the area of faster and more use case specific hardware accelerators for deep learning such as GPUs with tensor cores and FPGAs hosted inside of servers. Another foundational deep learning technology that has advanced very rapidly is the software that enables implementations of complex deep learning networks. New frameworks, tools and applications are entering the landscape quickly to accomplish this, some compatible with existing infrastructure and others that require workflow overhauls.

As organizations begin to develop more complex strategies for incorporating deep learning they are likely to start to leverage multiple frameworks and application stacks for specific use cases and to compare performance and accuracy. But training models is time consuming and ties up expensive compute resources. In addition, adjustments and tuning can vary between frameworks, creating a large number of framework knobs and levers to remember how to operate. What if there was a framework that could just consume these models right out the box?


BigDL is a distributed deep learning framework with native Spark integration, allowing it to leverage Spark during model training, prediction, and tuning. One of the things that I really like about Intel BigDL is how easy it is to work with models built and/or trained in Tensorflow, Caffe and Torch. This rich interop support for deep learning models allows BigDL applications to leverage the plethora of models that currently exist with little or no additional effort. Here are just a few ways this might be used in your applications:

  • Efficient Scale Out - Using BigDL you can scale out a model that was trained on a single node or workstation and leverage it at scale with Apache Spark. This can be useful for training on a large distributed dataset that already exists in your HDFS environment or for performing inferencing such as prediction and classification on a very large and often changing dataset.

  • Transfer Learning - Load a pretrained model with weights and then freeze some layers, append new layers and train / retrain layers. Transfer learning can improve accuracy or reduce training time by allowing you to start with a model that is used to do one thing, such as classify a different objects, and use it to accelerate development of a model to classify something else, such as specific car models.

  • High Performance on CPU - GPUs get all of the hype when it comes to deep learning. By leveraging Intel MKL and multi threading Spark tasks you can achieve better CPU driven performance leveraging BigDL than you would see with Tensorflow, Caffe or Torch when using Xeon processors.

  • Dataset Access - Designed to run in Hadoop, BigDL can compute where your data already exists. This can save time and effort since data does not need to be transferred to a seperate GPU environment to be used with the deep learning model. This means that your entire pipeline from ingest to model training and inference can all happen in one environment, Hadoop.

Real Data + Real Problem

Recently I had a chance to take advantage of the model portability feature of BigDL. After learning of an internal project here at Dell EMC, leveraging deep learning and telemetry data to predict component failures, my team decided we wanted to take our Ready Solution for AI - Machine Learning with Hadoop and see how it did with the problem.

The team conducting the project for our support organization was using Tensorflow with GPU accelerators to train an LSTM model. The dataset was sensor readings from internal components collected at 15 minute intervals showing all kinds of metrics like temperature, fan speeds, runtimes, faults etc.

Initially my team wanted to focus on testing out two use cases for BigDL:

  • Using BigDL model portability to perform inference using the existing tensorflow model
  • Implement an LSTM model in BigDL and train it with this dataset

As always, there were some preprocessing and data cleaning steps that had happened before we could get to modeling and inference. Luckily for us though we received the clean output of those steps from our support team to get started quickly. We received the data in the form of multiple csv files, already balanced with records of devices that did fail and those that did not. We got over 200,000 rows of data that looked something like this:



Converting the data to a tfrecord format used by Tensorflow was being done with Python and pandas dataframes. Moving this process to be done in Spark is another area we knew we wanted to dig in to, but to start we wanted to focus on our above mentioned goals. When we started the pipeline looked like this:

From Tensorflow to BigDL


For BigDL, instead of creating tfrecords we needed to end up with an RDD of Sample(s). Each Sample is one record of your dataset in the form of feature, label. Feature and label are in the form of one or more tensors and we create the sample from ndarray. Looking at the current pipeline we were able to simple take the objects created before writing to tfrecord and instead wrote a function that took these arrays and formed our RDD of Sample for BigDL.

def convert_to(x, y):
  sequences = x
  labels = y

  record = zip(x,y)
  record_rdd = sc.parallelize(record)

  sample_rdd = x:Sample.from_ndarray(x[0], x[1]))
  return sample_rdd

train = convert_to(x_train,y_train)
val = convert_to(x_val,y_val)
test = convert_to(x_test,y_test)

After that we took the pb and bin files representing the pretrained models definition and weights and loaded it using the BigDL Model.load_tensorflow function. It requires knowing the input and output names for the model, but the tensorflow graph summary tool can help out with that. It also requires a pb and bin file specifically, but if what you have is a ckpt file from tensorflow that can be converted with tools provided by BigDL.

model_def = "tf_modell/model.pb"
model_variable = "tf_model/model.bin"
inputs = ["Placeholder"]
outputs = ["prediction/Softmax"]
trained_tf_model = Model.load_tensorflow(model_def, inputs, outputs, byte_order = "little_endian", bigdl_type="float", bin_file=model_variable)

Now with our data already in the correct format we can go ahead and inference against our test dataset. BigDL provides Model.evaluate and we can pass it our RDD as well as the validation method to use, in this case Top1Accuracy.

results = trained_tf_model.evaluate(test,128,[Top1Accuracy()])


Defining a Model with BigDL


After testing out loading the pretrained tensorflow model the next experiment we wanted to conduct was to train an LSTM model defined with BigDL. BigDL provides a Sequential API and a Functional API for defining models. The Sequential API is for simpler models, with the Functional API being better for complex models. The Functional API describes the model as a graph. Since our model is LSTM we will use the Sequential API.

Defining an LSTM model is as simple as:

def build_model(input_size, hidden_size, output_size):
    model = Sequential()
    recurrent = Recurrent()
    recurrent.add(LSTM(input_size, hidden_size))
    model.add(InferReshape([-1, input_size], True))
    model.add(Select(2, -1))
    model.add(Linear(hidden_size, output_size))
    return model

lstm_model = build_model(n_input, n_hidden, n_classes)


After creating our model the next step is the optimizer and validation logic that our model will use to train and learn.

Create the optimizer:

optimizer = Optimizer(


Set the validation logic:




Now we can do trained_model = optimizer.optimize() to train our model, in this case for 50 epochs. We also set our TrainSummary folder so that the data was logged. This allowed us to also get visualizations in Tensorboard, something that BigDL supports.


At this point we had completed the two initial tasks we had set out to do, load a pretrained Tensorflow model using BigDL and train a new model with BigDL. Hopefully you found some of this process interesting, and also got an idea for how easy BigDL is for this use case. The ability to leverage deep learning models inside Hadoop with no specialized hardware like Infiniband, GPU accelerators etc provides a great tool that is sure to change up the way you currently view your existing analytics.