


The process of training a deep neural network is akin to finding the minimum of a function in a very high-dimensional space. Deep neural networks are usually trained using stochastic gradient descent (or one of its variants): a small batch (usually 16-512 examples), randomly sampled from the training set, is used to approximate the gradients of the loss function (the optimization objective) with respect to the weights. The computed gradient is essentially an average of the per-example gradients in the batch. The natural way to parallelize training across multiple nodes/workers is to increase the batch size and have each node compute the gradients on a different chunk of the batch. This makes distributed deep learning different from traditional HPC workloads, where scaling out affects only how the computation is distributed, not the outcome.
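To make the averaging concrete, here is a toy sketch (not from the original post) showing that splitting a batch across two workers and averaging the workers' average gradients gives exactly the same result as averaging over the whole batch:

```python
import numpy as np

# Toy model: scalar weight w, squared-error loss L = (w*x - y)^2, so the
# per-example gradient with respect to w is 2*(w*x - y)*x.
def per_example_grad(w, x, y):
    return 2 * (w * x - y) * x

w = 0.5
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.0, 4.0, 6.0, 8.0])

# Single worker: average the gradients over the whole batch of 4.
full_grad = np.mean([per_example_grad(w, x, y) for x, y in zip(xs, ys)])

# Two workers: each averages its half of the batch; the worker averages are
# then averaged, which is what an allreduce does in data-parallel training.
g0 = np.mean([per_example_grad(w, x, y) for x, y in zip(xs[:2], ys[:2])])
g1 = np.mean([per_example_grad(w, x, y) for x, y in zip(xs[2:], ys[2:])])
dist_grad = (g0 + g1) / 2

print(np.isclose(full_grad, dist_grad))  # True
```

The arithmetic is identical; what changes with large batches is the optimization behavior, as discussed below.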


Challenges of large-batch training

It has been consistently observed that the use of large batches leads to poor generalization: models trained with large batches perform poorly on test data. One of the primary reasons is that large batches tend to converge to sharp minima of the training function, which generalize less well, while small batches tend to favor flat minima that result in better generalization [1]. The stochasticity afforded by small batches encourages the weights to escape the basins of attraction of sharp minima. Models trained with small batches have also been shown to converge farther away from the starting point; large batches tend to be attracted to the minimum closest to the starting point and lack the explorative properties of small batches.

The number of gradient updates per pass over the data is reduced when using large batches. This is sometimes compensated for by scaling the learning rate with the batch size, but simply using a higher learning rate can destabilize training. Another approach is to train the model longer, but this can lead to overfitting. Thus, there's much more to distributed training than just scaling out to multiple nodes.


An illustration showing how sharp minima lead to poor generalization. The sharp minimum of the training function corresponds to a maximum of the testing function which hurts the model's performance on test data. [1]

How can we make large batches work?

There has been a lot of interesting research recently in making large-batch training more feasible. The training time for ImageNet has now been reduced from weeks to minutes by using batches as large as 32K without sacrificing accuracy. The following methods are known to alleviate some of the problems described above:


  1. Scaling the learning rate [2]
    The learning rate is multiplied by k when the batch size is multiplied by k. However, this rule does not hold in the first few epochs of training, since the weights are changing rapidly. This can be alleviated by using a warm-up phase: start with a small learning rate and gradually ramp up to the linearly scaled value.

  2. Layer-wise adaptive rate scaling [3]
    A different learning rate is used for each layer. A global learning rate is chosen, and it is scaled for each layer by the ratio of the Euclidean norm of the weights to the Euclidean norm of the gradients for that layer.

  3. Using regular SGD with momentum rather than Adam
    Adam is known to make convergence faster and more stable, and it is usually the default optimizer choice when training deep models. However, Adam seems to settle into less optimal minima, especially when using large batches. Using regular SGD with momentum, although noisier than Adam, has been shown to improve generalization.

  4. Topologies also make a difference
    In a previous blog post, my colleague Luke showed how using VGG16 instead of DenseNet121 considerably sped up the training for a model that identified thoracic pathologies from chest x-rays while improving area under ROC in multiple categories. Shallow models are usually easier to train, especially when using large batches.

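As a concrete sketch of the warm-up idea in method 1 above, the following hypothetical schedule ramps the learning rate linearly from the small-batch base value to the scaled value over the first few epochs. The base rate, scaling factor k, and warm-up length here are illustrative, not values from [2]:

```python
# Hypothetical schedule: the batch size grew by a factor k, so the target rate
# is base_lr * k (the linear scaling rule); we ramp up to it over warmup_epochs.
def warmup_lr(epoch, base_lr=0.1, k=8, warmup_epochs=5):
    target = base_lr * k
    if epoch < warmup_epochs:
        # Fraction of the ramp completed after this epoch.
        return base_lr + (target - base_lr) * (epoch + 1) / warmup_epochs
    return target

print([round(warmup_lr(e), 2) for e in range(7)])
# [0.24, 0.38, 0.52, 0.66, 0.8, 0.8, 0.8]
```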

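The layer-wise scaling of method 2 can be sketched in a few lines. This simplified version omits the trust coefficient and weight-decay terms of the full LARS algorithm [3], and the weight and gradient values are made up for illustration:

```python
import numpy as np

# Simplified layer-wise rate: local_lr = global_lr * ||w|| / ||grad w||.
def lars_lr(global_lr, weights, grads, eps=1e-8):
    return global_lr * np.linalg.norm(weights) / (np.linalg.norm(grads) + eps)

w = np.array([0.5, -0.5, 1.0])  # one layer's weights (illustrative)
g = np.array([0.1, 0.2, -0.2])  # that layer's gradients (illustrative)

# Small gradients relative to the weights yield a proportionally larger step.
print(round(float(lars_lr(0.01, w, g)), 4))  # 0.0408
```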

Large-batch distributed training can significantly reduce training time but it comes with its own challenges. Improving generalization when using large batches is an active area of research, and as new methods are developed, the time to train a model will keep going down.


  1. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. arXiv preprint arXiv:1609.04836.
  2. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. arXiv preprint arXiv:1706.02677.
  3. Large Batch Training of Convolutional Networks. Yang You, Igor Gitman, and Boris Ginsburg. 2017. arXiv preprint arXiv:1708.03888.



The potential of neural networks to transform healthcare is evident. From image classification to dictation and translation, neural networks are achieving or exceeding human capabilities. And they are only getting better at these tasks as the quantity of data increases.


But there’s another way in which neural networks can potentially transform the healthcare industry: Knowledge can be replicated at virtually no cost. Take radiology as an example: To train 100 radiologists, you must teach each individual person the skills necessary to identify diseases in x-ray images of patients’ bodies. To make 100 AI-enabled radiologist assistants, you take the neural network model you trained to read x-ray images and load it into 100 different devices.


The hurdle is training the model. It takes a large amount of cleaned, curated, labeled data to train an image classification model. Once you’ve prepared the training data, it can take days, weeks, or even months to train a neural network. Even once you’ve trained a neural network model, it might not be smart enough to perform the desired task. So, you try again. And again. Eventually, you will train a model that passes the test and can be used out in the world.

Workflow for Developing Neural Network Models


In this post, I'm going to talk about how to reduce the time spent in the Train/Test/Tune cycle by speeding up the training portion with distributed deep learning, using a test case we developed in Dell EMC's HPC and AI Innovation Lab to classify pathologies in chest x-ray images. Through a combination of distributed deep learning, optimizer selection, and neural network topology selection, we were able not only to speed up model training from days to minutes, but also to significantly improve classification accuracy.


Starting Point: Stanford University’s CheXNet

We began by surveying the landscape of AI projects in healthcare, and Andrew Ng’s group at Stanford University provided our starting point. CheXNet was a project to demonstrate a neural network’s ability to accurately classify cases of pneumonia in chest x-ray images.

The dataset that Stanford used was ChestXray14, developed and made available by the United States' National Institutes of Health (NIH). It contains over 120,000 frontal chest x-ray images, each potentially labeled with one or more of fourteen different thoracic pathologies. The dataset is very unbalanced, with more than half of the images having no listed pathologies.


Stanford decided to solve the problem with DenseNet, a neural network topology which had just been announced as the Best Paper at the 2017 Conference on Computer Vision and Pattern Recognition (CVPR). The DenseNet topology is a deep network of repeating blocks of convolutions, with dense connections linking the layers within each block. Blocks end with batch normalization, followed by some additional convolution and pooling to link one block to the next. At the end of the network, a fully connected layer performs the classification.


An Illustration of the DenseNet Topology (source: Kaggle)


Stanford’s team used a DenseNet topology with the layer weights pretrained on ImageNet and replaced the original ImageNet classification layer with a new fully connected layer of 14 neurons, one for each pathology in the ChestXray14 dataset.


Building CheXNet in Keras

It sounds like it would be difficult to set up. Thankfully, Keras (provided with TensorFlow) provides a simple, straightforward way of taking standard neural network topologies and bolting on new classification layers.


from keras.applications import DenseNet121

orig_net = DenseNet121(include_top=False, weights='imagenet', input_shape=(256, 256, 3))

Importing the base DenseNet Topology using Keras


In this code snippet, we import the original DenseNet neural network (DenseNet121) and remove its classification layer with the include_top=False argument. We also automatically load the pretrained ImageNet weights and set the input image size to 256x256, with 3 channels (red, green, blue).


With the original network imported, we can begin to construct the classification layer. If you look at the illustration of DenseNet above, you will notice that the classification layer is preceded by a pooling layer. We can add this pooling layer back to the new network with a single Keras function call. We will call the resulting topology the network's feature filters: the part of the neural network that extracts all the key features used for classification.


from keras.layers import GlobalAveragePooling2D

filters = GlobalAveragePooling2D()(orig_net.output)

Finalizing the Network Feature Filters with a Pooling Layer


The next task is to define the classification layer. The ChestXray14 dataset has 14 labeled pathologies, so we have one neuron for each label. We also activate each neuron with the sigmoid activation function, and use the output of the feature filter portion of our network as the input to the classifiers.


from keras.layers import Dense

classifiers = Dense(14, activation='sigmoid', bias_initializer='ones')(filters)

Defining the Classification Layer


The choice of sigmoid as an activation function is due to the multi-label nature of the data set. For problems where only one label ever applies to a given image (e.g., dog, cat, sandwich), a softmax activation would be preferable. In the case of ChestXray14, images can show signs of multiple pathologies, and the model should rightfully identify high probabilities for multiple classifications when appropriate.
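A quick numeric sketch (the logit values are made up, not model outputs) shows the difference: softmax probabilities compete and always sum to 1, while sigmoid scores each label independently, so several pathologies can be flagged at once:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.5, -3.0])  # made-up scores for three pathologies

# Softmax: probabilities compete and sum to 1 -> effectively a single winner.
print(round(float(softmax(logits).sum()), 6))  # 1.0

# Sigmoid: each label is scored independently -> two labels can both be likely.
print(sigmoid(logits))  # first two both > 0.8, third near 0
```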


Finally, we can put the feature filters and the classifiers together to create a single, trainable model.


from keras.models import Model

chexnet = Model(inputs=orig_net.inputs, outputs=classifiers)

The Final CheXNet Model Configuration


With the final model configuration in place, the model can then be compiled and trained.


Accelerating the Train/Test/Tune Cycle with Distributed Deep Learning

To produce better models sooner, we need to accelerate the Train/Test/Tune cycle. Because testing and tuning are mostly sequential, training is the best place to look for potential optimization.


How exactly do we speed up the training process? In Accelerating Insights with Distributed Deep Learning, Michael Bennett and I discuss the three ways in which deep learning can be accelerated by distributing work and parallelizing the process:

  • Parameter server models such as in Caffe or distributed TensorFlow,
  • Ring-AllReduce approaches such as Uber’s Horovod, and
  • Hybrid approaches for Hadoop/Spark environments such as Intel BigDL.

Which approach you pick depends on your deep learning framework of choice and the compute environment you will be using. For the tests described here, we performed the training in-house on the Zenith supercomputer in the Dell EMC HPC & AI Innovation Lab. The ring-allreduce approach enabled by Uber's Horovod framework made the most sense for taking advantage of a system that is tuned for HPC workloads and uses Intel Omni-Path (OPA) networking for fast inter-node communication. The ring-allreduce approach would also be appropriate for solutions such as the Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA.


The MPI-RingAllreduce Approach to Distributed Deep Learning

Horovod is an MPI-based framework for performing reduction operations between identical copies of an otherwise sequential training script. Because it is MPI-based, you will need to make sure that an MPI compiler (mpicc) is available in the working environment before installing Horovod.


Adding Horovod to a Keras-defined Model

Adding Horovod to any Keras-defined neural network model only requires a few code modifications:

  1. Initializing the MPI environment,
  2. Broadcasting initial random weights or checkpoint weights to all workers,
  3. Wrapping the optimizer function to enable multi-node gradient summation,
  4. Averaging metrics among workers, and
  5. Limiting checkpoint writing to a single worker.

Horovod also provides helper functions and callbacks for optional capabilities that are useful when performing distributed deep learning, such as learning-rate warmup/decay and metric averaging.


Initializing the MPI Environment

Initializing the MPI environment in Horovod only requires calling the init method:


import horovod.keras as hvd

hvd.init()



This will ensure that the MPI_Init function is called, setting up the communications structure and assigning ranks to all workers.


Broadcasting Weights

Broadcasting the initial neuron weights is done using a Keras callback. In fact, many of Horovod's features are implemented as Keras callbacks, so it's worthwhile to define a callback list object to hold all of them.


callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0) ]


You’ll notice that the BroadcastGlobalVariablesCallback takes a single argument that’s been set to 0. This is the root worker, which will be responsible for reading checkpoint files or generating new initial weights, broadcasting weights at the beginning of the training run, and writing checkpoint files periodically so that work is not lost if a training job fails or terminates.


Wrapping the Optimizer Function

The optimizer function must be wrapped so that it can aggregate error information from all workers before executing. Horovod’s DistributedOptimizer function can wrap any optimizer which inherits Keras’ base Optimizer class, including SGD, Adam, Adadelta, Adagrad, and others.


import keras.optimizers

opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(lr=1.0))


The distributed optimizer will now use an MPI allreduce collective to aggregate gradient information from training batches across all workers, rather than collecting it only on the root worker. This allows the workers to update their models independently rather than waiting for the root to re-broadcast updated weights before beginning the next training batch.
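Conceptually, the collective behaves like the following toy simulation (plain NumPy, no MPI): every worker contributes a local gradient and every worker receives the same global average back:

```python
import numpy as np

# Four workers each hold a local gradient for a 2-parameter model.
local_grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
               np.array([5.0, 6.0]), np.array([7.0, 8.0])]

# The collective computes the element-wise average...
global_grad = np.mean(local_grads, axis=0)

# ...and every worker receives the same result, so all copies of the model
# apply an identical update and stay in sync without a parameter server.
received = [global_grad.copy() for _ in local_grads]

print(global_grad)  # [4. 5.]
```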


Averaging Metrics

Between steps, error metrics need to be averaged to calculate the global loss. Horovod provides another callback to do this, called MetricAverageCallback.


callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0),
              hvd.callbacks.MetricAverageCallback() ]

This will ensure that optimizations are performed on the global metrics, not the metrics local to each worker.


Writing Checkpoints from a Single Worker

When using distributed deep learning, it's important that only one worker writes checkpoint files; multiple workers writing to the same file can produce a race condition, which could lead to checkpoint corruption.


Checkpoint writing in Keras is enabled by another callback, ModelCheckpoint. However, we only want to add this callback on one worker instead of all workers. By convention, we use worker 0 for this task, but technically we could use any worker. The one good thing about worker 0 is that even if you decide to run your distributed deep learning job with only 1 worker, that worker will be worker 0.

callbacks = [ ... ]

if hvd.rank() == 0:
    # Only the root worker writes checkpoints (the filename here is illustrative)
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))


Result: A Smarter Model, Faster!

Once a neural network can be trained in a distributed fashion across multiple workers, the Train/Test/Tune cycle can be sped up dramatically.


The figure below shows exactly how dramatically. The three tests shown are the training speed of the Keras DenseNet model on a single Zenith node without distributed deep learning (far left), the Keras DenseNet model with distributed deep learning on 32 Zenith nodes (64 MPI processes, 2 MPI processes per node, center), and a Keras VGG16 version using distributed deep learning on 64 Zenith nodes (128 MPI processes, 2 MPI processes per node, far right). By using 32 nodes instead of a single node, distributed deep learning was able to provide a 47x improvement in training speed, taking the training time for 10 epochs on the ChestXray14 data set from 2 days (50 hours) to less than 2 hours!


Performance comparisons of Keras models with distributed deep learning using Horovod

The VGG variant, trained on 64 Zenith nodes, completed its training in less than an hour, even though it required more epochs than the single-node DenseNet version. It also converged to a higher-quality solution: the VGG-based model outperformed the baseline single-node model in 4 of 14 conditions, and achieved nearly 90% accuracy in classifying emphysema.


Accuracy comparison of baseline single-node DenseNet model vs VGG variant with distributed deep learning


In this post we’ve shown you how to accelerate the Train/Test/Tune cycle when developing neural network-based models by speeding up the training phase with distributed deep learning. We walked through the process of transforming a Keras-based model to take advantage of multiple nodes using the Horovod framework, and how these few simple code changes, coupled with some additional compute infrastructure, can reduce the time needed to train a model from days to minutes, allowing more time for the testing and tuning pieces of the cycle. More time for tuning means higher-quality models, which means better outcomes for patients, customers, or whomever will benefit from the deployment of your model.


Lucas A. Wilson, Ph.D. is the Lead Data Scientist in Dell EMC's HPC & AI Engineering group. (Twitter: @lucasawilson)


MNIST with Intel BigDL

Posted by MikeB@Dell Aug 10, 2018



Here at Dell EMC we just announced the general availability of the Dell EMC Ready Solutions for AI - Machine Learning with Hadoop design, so I decided to write a somewhat technical article highlighting the power of doing deep learning with the Intel BigDL framework executing on Apache Spark. After a short introduction covering the components of our Hadoop design, I will walk through an example of training an image classification neural network using BigDL. Finally, I will show how that example can be extended by training the model to classify a new category of images: emojis.


The combination of Apache Spark and Intel BigDL creates a powerful data science environment by enabling the compute- and data-intensive operations associated with creating AI to be performed on the same cluster that hosts your Hadoop data lake and existing analytics workloads. YARN makes sure everything gets its share of the resources. The Dell EMC Ready Solution also brings Cloudera Data Science Workbench (CDSW) to your Hadoop environment. CDSW enables secure, collaborative access to your Hadoop environment: data scientists can choose from a library of preconfigured, published engines, saving the time and frustration of dealing with library and framework version control, interoperability, and so on. Dell EMC includes an engine configured with Intel BigDL and other common Python libraries.

Getting Started

To follow along with the steps in this blog post you can use the Docker container included in the BigDL GitHub repository. Installing Docker is beyond the scope of this post, but directions for various operating systems are available in the Docker documentation. On my Dell XPS 9370 running Ubuntu 16.04, I use the following commands to get started:

cd ~
git clone
cd BigDL/docker/BigDL
sudo docker build --build-arg BIGDL_VERSION=0.5.0 -t bigdl/bigdl:default .
sudo docker run -it --rm -p 8080:8080 -e NotebookPort=8080 \
-e NotebookToken="thisistoken" bigdl/bigdl:default

Tip: Notice the space + period that is at the end of our build command. That tells docker to look for our Dockerfile in the current directory.

You should now be able to point a web browser at localhost:8080 and enter the token ("thisistoken" without quotes in this example). You will be presented with a Jupyter notebook showing several examples of Apache Spark and Intel BigDL operations. For this post we are going to look at the neural_networks/introduction_to_mnist.ipynb notebook. You can launch the notebook by clicking the neural_networks folder and then the introduction_to_mnist.ipynb file.

This example covers the get_mnist function that is used in many of the example notebooks. Going through it shows how easy it is to deal with the huge datasets that are necessary for training a neural network. In the get_mnist function we read our dataset and create an RDD of all of the images; sc.parallelize(train_images) loads 60,000 images into memory, ready for distributed, parallel processing to be performed on them. With Hadoop's native YARN resource scheduler, we can decide to do these operations with 1 executor and 2 CPU cores, or with 32 executors of 16 cores each.

Tip: Size your Spark executors such that executors × cores per executor × n equals your batch size, for some integer n.
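A quick sanity check of that sizing rule might look like this (the function name and numbers are illustrative, not part of the notebook):

```python
# The tip above: batch size should be a multiple of the total number of
# compute slots, i.e. executors * cores per executor.
def valid_batch_size(executors, cores_per_executor, batch_size):
    slots = executors * cores_per_executor
    return batch_size % slots == 0

print(valid_batch_size(4, 16, 512))   # True:  512 = 4 * 16 * 8
print(valid_batch_size(4, 16, 100))   # False: 100 is not a multiple of 64
```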

This example notebook reads the images using a helper function written by the BigDL team for the MNIST dataset, mnist.read_data_sets. BigDL also provides the ImageFrame API for reading image files, either individually or as folders of images. In a clustered environment we create distributed ImageFrames, which are RDDs of ImageFeature. An ImageFeature represents one individual image, storing the image itself along with its label and other properties as key/value pairs.

At the end of the introduction_to_mnist notebook we see that samples are created by zipping the images RDD together with the labels RDD. We then map Sample.from_ndarray over the result to end up with an RDD of samples. A sample consists of one or more tensors representing the image and one or more tensors representing the label. These tensors are the structure we need to feed our data into a neural network.

We can close the introduction_to_mnist notebook now and take a look at the cnn.ipynb notebook. The notebook does a good job of explaining what is going on at each step: we use our previously covered mnist helper functions to read the data, define a LeNet model, define our optimizer, and train the model. I would encourage you to go through this example once and learn what is going on. After that, go ahead and click Kernel > Restart & Clear Output.

Now that we are back to where we started, let's look at how we can add a new class to our model. Currently the model is being trained to predict 10 classes of digits. What if I wanted the model to be able to recognize emojis as well? It is actually really simple: if you follow along, we will go through a quick example of training this model to also recognize emoji, for a total of 11 classes to categorize.

The Example

The first step is to acquire a dataset of emojis and prepare them. Add a cell after cell [2]. In this cell we will put:

git clone
pip install opencv-python


Then we can insert another cell and put:

import glob
from os import listdir
from os.path import isfile, join
from bigdl.util.common import *
import cv2

path = 'emoji-dataset/emoji_imgs_V5/'
images = [cv2.imread(file) for file in glob.glob("emoji-dataset/emoji_imgs_V5/*.png")]
emoji_train = images[1:1500]
emoji_test = images[1501:]
emoji_train_rdd = sc.parallelize(emoji_train)
emoji_test_rdd = sc.parallelize(emoji_test)

Here we take all of our emoji images and divide them up, then turn those lists of numpy arrays into RDDs. This isn't the most efficient way to divide up the dataset, but it works for this example.

Currently these are color emoji images, and our model is defined for MNIST, which is grayscale. This means our model is set to accept images of shape 28,28,1, while RGB or BGR images have a shape of 28,28,3. Since we have these images in an RDD, we can use cv2 to resize and grayscale them. Go ahead and insert another cell to run:

emoji_train_rdd = x: (cv2.resize(x, (28, 28))))
emoji_train_rdd = x: (cv2.cvtColor(x, cv2.COLOR_BGR2GRAY)))
emoji_test_rdd = x: (cv2.resize(x, (28, 28))))
emoji_test_rdd = x: (cv2.cvtColor(x, cv2.COLOR_BGR2GRAY)))
emoji_train_rdd = x: Sample.from_ndarray(x, np.array(11)))
emoji_test_rdd = x: Sample.from_ndarray(x, np.array(11)))
train_data = train_data.union(emoji_train_rdd)
test_data = test_data.union(emoji_test_rdd)

This could easily be reduced to a function, but if you want to see the transformations step by step you can insert the code below after each line. For example, we can see that our image has shape (28, 28, 3) after the resize but before grayscale, and (28, 28, 1) after.

single = emoji_train_rdd.take(1)
print single[0].shape

Output is 28,28,1.

Then, in our def build_model cell, we just change the last line to lenet_model = build_model(11). After this, the layers are built to accept 11 classes.


We can see the rest of the results by running through the example. Since our emoji dataset is so small compared to the handwritten digit images, we will create a smaller RDD to predict against, ensuring that we see some emoji samples once the optimizer is set up and the model is trained. Insert a cell after our first set of predictions and put the following in it:

sm_test_rdd = test_data.take(8)
sm_test_rdd = sc.parallelize(sm_test_rdd)
predictions = trained_model.predict(sm_test_rdd)
imshow(np.column_stack([np.array(s.features[0].to_ndarray()).reshape(28,28) for s in sm_test_rdd.take(11)]),cmap='gray'); plt.axis('off')
print 'Ground Truth labels:'
print ', '.join(str(map_groundtruth_label(s.label.to_ndarray())) for s in sm_test_rdd.take(11))
print 'Predicted labels:'
print ', '.join(str(map_predict_label(s)) for s in predictions.take(11))

We should see that our model has correctly associated the label 10 with our emoji images, and labels in the range 0-9 with the remaining images. Congratulations! We have now trained our LeNet model to do more than just recognize the digits 0-9. If there were multiple individual emojis we wanted to recognize, instead of a single generic "emoji" class, we would simply provide more specific labels such as "smiling" or "laughing" and then define our model with the new number of classes.

The original cnn.ipynb notebook trained a LeNet model with 10 classes. If we compare the visualization of the weights in the convolution layers, we can see that adding even a few more images and a new class has changed the weights considerably.

Original Model

11 Class Model

Thanks everyone, I hope you enjoyed this blog post. You can find more information about the Dell EMC Ready Solutions for AI at this url:



The Hadoop open source initiative is, inarguably, one of the most dramatic software development success stories of the last 15 years. In every measure of open source software activity, including the number of contributors, the number of code check-ins, the number of issues raised and fixed, and the number of new related projects, the commitment of the open source community to the ongoing improvement and expansion of Hadoop is impressive. Dell EMC has been offering solutions based on our infrastructure with both Cloudera and Hortonworks Hadoop distributions since 2011. This year, in collaboration with Intel, we are adding a new product to our analytics portfolio called the Ready Solution for AI: Machine Learning with Hadoop. As the name implies, we are testing and validating new features for machine learning, including a full set of deep learning tools and libraries, with our partners Cloudera and Intel. Traditional machine learning with Apache MLlib on Apache Spark has been widely deployed and discussed for several years; the concept of achieving deep learning with neural network-based approaches using Apache Spark running on CPU-only clusters is relatively new. We are excited to offer articles such as this to share our findings.



The performance testing that we describe in this article was done in preparation for our launch of Ready Solutions for AI. The details of scaling BigDL with Apache Spark on Intel® Xeon® Scalable processor clusters may be new to many data scientists and IT professionals. In this article, we discuss:


  • Key Spark properties and where to find more information
  • Comparing the capabilities of three different Intel Xeon Scalable processors for scaling impact
  • Comparing the performance of the current Dell EMC PowerEdge 14G server platform with the previous generation
  • Sub-NUMA clustering background and performance comparison


We describe our testing environment and show the baseline performance results against which we compared software and hardware configuration changes. We also provide a summary of findings for Ready Solution for AI: Machine Learning with Hadoop.


Our performance testing environment


The following table provides configuration information for worker nodes used in our Cloudera Hadoop CDH cluster for this analysis:


| Component | CDH cluster with Intel Xeon Platinum processor | CDH cluster with Intel Xeon Gold 6000 series processor | CDH cluster with Intel Xeon Gold 5000 series processor |
|---|---|---|---|
| Server model | PowerEdge R740xd | PowerEdge R740xd | PowerEdge R740xd |
| Processor | Intel Xeon Platinum 8160 CPU @ 2.10GHz | Intel Xeon Gold 6148 CPU @ 2.40GHz | Intel Xeon Gold 5120 CPU @ 2.20GHz |
| Memory | 768 GB | 768 GB | 768 GB |
| Hard drives | (6) 240 GB SSD and (12) 4 TB SATA HDD | (6) 240 GB SSD and (12) 4 TB SATA HDD | (2) 240 GB SSD and (4) 1.2 TB SAS HDD |
| Storage controller | PERC H740P Mini (Embedded) | PERC H740P Mini (Embedded) | Dell HBA330 |
| Network adapters | Mellanox 25GbE 2P ConnectX4LX | Mellanox 25GbE 2P ConnectX4LX | QLogic 25GE 2P QL41262 Adapter |
| Operating system | RHEL 7.3 | RHEL 7.3 | RHEL 7.3 |
| Apache Spark version | 2.1 | 2.1 | 2.1 |
| BigDL version | 0.x | 0.x | 0.x |


Configuring Apache Spark properties


To get the maximum performance from Spark for running deep learning workloads, there are several properties that must be understood. The following list describes key Apache Spark parameters:


  • Number of executors—Corresponds to the number of nodes on which you want the application to run. Consider the number of nodes running the Spark service, the number of other Spark jobs that must run concurrently, and the size of the training problem that you are submitting.
  • Executor cores—Number of cores to use on each executor. This number can closely match the number of physical cores on the host, assuming that the application does not share the nodes with other applications.
  • Executor memory—Amount of memory to use per executor process. For deep learning applications, the amount of memory allocated for executor memory is dependent on the dataset and the model.  There are several checks and best practices that can help you avoid Spark "out of memory" issues while loading or processing data, however, we do not cover these details in this article.
  • Deployment mode—Deployment mode for Spark driver programs. The deployment mode can be either "client" or "cluster" mode.  To launch the driver program locally use client mode and to deploy the application remotely use cluster mode from one of the nodes inside the cluster. If the Spark application is launched from the Cloudera Data Science Workbench, deployment mode defaults to client mode.
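As an illustration, the properties above can be expressed as spark-submit --conf pairs; the values here are placeholders, not the settings used in our tests:

```python
# Illustrative only: the Spark properties discussed above as --conf pairs.
spark_props = {
    "spark.executor.instances": "16",    # number of executors
    "spark.executor.cores": "20",        # cores per executor
    "spark.executor.memory": "300g",     # memory per executor process
    "spark.submit.deployMode": "client", # or "cluster"
}

args = " ".join("--conf {}={}".format(k, v) for k, v in spark_props.items())
print(args.split()[0])  # --conf
```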


For detailed information about all Spark properties, see Apache Spark Configuration.


Scaling with BigDL


To study scaling of BigDL on Spark, we ran two popular image classification models:

  • Inception V1—Inception is an image classification deep neural network introduced in 2014. Inception V1 was one of the first CNN classifiers that introduced the concept of a wider network (multiple convolution operations in the same layer). It remains a popular model for performance characterization and other studies.
  • ResNet-50—ResNet is an image classification deep neural network introduced in 2015. ResNet introduced a novel architecture of shortcut connections that skip one or more layers.


ImageNet, the publicly available hierarchical image database, is used as the dataset. In our experiments, we found that ResNet-50 is computationally more intensive than Inception V1. We used a learning rate = 0.0898 and a weight decay = 0.001 for both models. Batch size was selected by using best practices for deep learning on Spark.  Details about sizing batches are described later in the article.  Because we were interested in scaling only and not in accuracy, we ensured that our calculated average throughput (images per second) was stable and stopped training after two epochs.  The following table shows the Spark parameters that were used as well as the throughput observed:



[Table: Spark parameters and observed throughput for 4-, 8-, and 16-node clusters. Rows cover the number of executors, batch size, dataset size (350 GB for Inception, 600 GB for ResNet in each configuration), and the observed Inception V1 and ResNet-50 throughput in images/second.]
The following figure shows that deep learning models scale almost linearly when we increase the number of nodes.



Calculating batch size for deep learning on Spark


When training deep learning models using neural networks, the number of training samples is typically large compared to the amount of data that can be processed in one pass. Batch size refers to the number of training examples that are used to train the network during one iteration. The model parameters are updated at the end of each iteration. Lower batch size means that the model parameters are updated more frequently.


For BigDL to process efficiently on Spark, batch size must be a multiple of the number of executors multiplied by the number of executor cores. Increasing the batch size might have an impact on the accuracy of the model. The impact depends on various factors that include the model that is used, the dataset, and the range of the batch size. A model with a large number of batch normalization layers might experience a greater impact on accuracy when changing the batch size. Various hyper-parameter tuning strategies can be applied to improve the accuracy when using large batch sizes in training.
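The multiplicity rule above can be expressed directly in code. The helper below is a sketch (not part of BigDL) that rounds a desired batch size up to the nearest multiple of executors × executor cores:

```python
def valid_batch_size(num_executors, executor_cores, desired):
    """Round `desired` up to the nearest multiple of
    num_executors * executor_cores, as BigDL on Spark requires."""
    per_step = num_executors * executor_cores
    return ((desired + per_step - 1) // per_step) * per_step

# e.g., 16 executors with 24 cores each and a desired batch of 1000
print(valid_batch_size(16, 24, 1000))  # -> 1152 (3 * 384)
```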


Intel Xeon Scalable processor model comparison

We compared the performance of the image classification training workload on three Intel processor models. The following table provides key details of the processor models.


Processor: Intel Xeon Platinum 8160 CPU | Intel Xeon Gold 6148 CPU | Intel Xeon Gold 5120 CPU
Base frequency: 2.10 GHz | 2.40 GHz | 2.20 GHz
Number of cores: 24 | 20 | 14
Number of AVX-512 FMA units: 2 | 2 | 1


The following figure shows the throughput in images per second of the image classification models for training on the three different Intel processors. The experiment was performed on four-node clusters.





The Intel Xeon Gold 6148 processor slightly outperforms the Intel Xeon Platinum 8160 processor for the Inception V1 workload because it has a higher clock frequency. However, for the computationally more intensive ResNet-50 workload, the Intel Xeon Platinum 8160 processor outperforms the Xeon Gold 6148 processor.


Both the Intel Xeon Gold 6148 and Intel Xeon Platinum 8160 processors significantly outperform the Intel Xeon Gold 5120 processor. The primary reason for this difference is the number of Intel Advanced Vector Extensions (AVX-512) FMA units in these processors.

Intel AVX-512 is a set of instructions that can accelerate performance for workloads such as deep learning applications. Intel AVX-512 provides ultra-wide 512-bit vector operation capabilities to increase the performance of applications that apply identical operations to multiple data elements.


The Intel Xeon Gold 6000 or Intel Xeon Platinum series are recommended for deep learning applications that are implemented in BigDL and Spark.

Comparing deep learning performance across Dell server generations


We compared the performance of the image classification training workload on three Intel processor models, across different server generations. The following table provides key details of the processor models.



Server: Dell EMC PowerEdge R730xd (Intel Xeon E5-2683 v3) | Dell EMC PowerEdge R730xd (Intel Xeon E5-2698 v4) | Dell EMC PowerEdge R740xd (Intel Xeon Platinum 8160)
Processor: (2) Xeon E5-2683 v3 | (2) Xeon E5-2698 v4 | (2) Xeon Platinum 8160
Memory: 512 GB | 512 GB | 768 GB
Hard drive: (2) 480 GB SSD and (12) 1.2TB SATA HDD | (2) 480 GB SSD and (12) 1.2TB SATA HDD | (6) 240 GB SSD and (12) 4TB SATA HDD
RAID controller: PERC R730 Mini | PERC R730 Mini | PERC H740P Mini (Embedded)
Network adapters: Intel Ethernet 10G 4P X520/I350 rNDC | Intel Ethernet 10G 4P X520/I350 rNDC | Mellanox 25GbE 2P ConnectX4LX
Operating system: RHEL 7.3 | RHEL 7.3 | RHEL 7.3

The following figure shows the throughput in images per second of the image classification models for training on the three different Intel processors across Dell EMC PowerEdge server generations. The experiment was performed on four-node clusters.

We observed that the PowerEdge R740xd server with the Intel Xeon Platinum 8160 processor performs 44 percent better than the PowerEdge R730xd server with the Xeon E5-2683 v3 processor and 28 percent better than the PowerEdge R730xd server with the Xeon E5-2698 v4 processor with an Inception V1 workload under the settings noted earlier.




Sub NUMA Clustering Comparison


Sub NUMA clustering (SNC) is a feature available in Intel Xeon Scalable processors. When SNC is enabled, each socket is divided into two NUMA domains, each with half the physical cores and half the memory of the socket. SNC is similar to the Cluster-on-Die option that was available in the Xeon E5-2600 v3 and v4 processors with improvements to remote socket access. At the operating system level, a dual socket server that has SNC enabled displays four NUMA domains.  The following table shows the measured performance impact when enabling SNC.




We found only a 5 percent improvement in performance with SNC enabled, which is within the run-to-run variation in images per second across all the tests we performed.  Sub NUMA clustering improves performance for applications that have high memory locality, but not all deep learning applications and models have high memory locality. We recommend leaving Sub NUMA clustering at its default BIOS setting, which is disabled.

Ready Solutions for AI


For this Ready Solution, we leverage Cloudera’s new Data Science Workbench, which delivers a self-service experience for data scientists.  Data Science Workbench provides multi-language support that includes Python, R, and Scala directly in a web browser. Data scientists can manage their own analytics pipelines, including built-in scheduling, monitoring, and email alerts in an environment that is tightly coupled with a traditional CDH cluster.

We also incorporate Intel BigDL, a full-featured deep learning framework that leverages Spark running on the CDH cluster.  The combination of Spark and BigDL provides a high performance approach to deep learning training and inference on clusters of Intel Xeon Scalable processors.




It takes a wide range of skills to be successful with high performance data analytics. Practitioners must understand the details of both hardware and software components as well as their interactions. In this article, we document findings from performance tests that cover both the hardware and software components of BigDL with Spark.  Our findings show that:

  • BigDL scales almost linearly for the Inception and ResNet image classification models up to 16 nodes, the largest configuration we tested.
  • Intel Xeon Scalable processors with two AVX-512 (Gold 6000 series and Platinum series) units perform significantly better than processors with one AVX-512 unit (Gold 5000 series and earlier).
  • The PowerEdge server default BIOS setting for Sub NUMA clustering (disabled) is recommended for Deep Learning applications using BigDL.

Authors: Bala Chandrasekaran, Phil Hummel, and Leela Uppuluri

Deep learning has exploded across both the popular and business media landscapes.  Current and upcoming technology capable of powering the calculations required by deep learning algorithms has enabled a rapid transition from new theories to new applications.  One supporting technology that is expanding at an increasing rate is faster and more use-case-specific hardware accelerators for deep learning, such as GPUs with tensor cores and FPGAs hosted inside servers. Another foundational deep learning technology that has advanced very rapidly is the software that enables implementations of complex deep learning networks. New frameworks, tools, and applications are entering the landscape quickly to accomplish this, some compatible with existing infrastructure and others that require workflow overhauls.

As organizations begin to develop more complex strategies for incorporating deep learning, they are likely to leverage multiple frameworks and application stacks for specific use cases and to compare performance and accuracy. But training models is time consuming and ties up expensive compute resources. In addition, adjustments and tuning can vary between frameworks, creating a large number of framework knobs and levers to remember how to operate. What if there was a framework that could just consume these models right out of the box?


BigDL is a distributed deep learning framework with native Spark integration, allowing it to leverage Spark during model training, prediction, and tuning. One of the things that I really like about Intel BigDL is how easy it is to work with models built and/or trained in TensorFlow, Caffe, and Torch. This rich interop support for deep learning models allows BigDL applications to leverage the plethora of models that currently exist with little or no additional effort. Here are just a few ways this might be used in your applications:

  • Efficient Scale Out - Using BigDL you can scale out a model that was trained on a single node or workstation and leverage it at scale with Apache Spark. This can be useful for training on a large distributed dataset that already exists in your HDFS environment or for performing inferencing such as prediction and classification on a very large and often changing dataset.

  • Transfer Learning - Load a pretrained model with its weights, then freeze some layers, append new layers, and train or retrain layers. Transfer learning can improve accuracy or reduce training time by allowing you to start with a model that does one thing, such as classifying one set of objects, and use it to accelerate development of a model that classifies something else, such as specific car models.

  • High Performance on CPU - GPUs get all of the hype when it comes to deep learning. By leveraging Intel MKL and multi-threaded Spark tasks, you can achieve better CPU-driven performance with BigDL than you would see with TensorFlow, Caffe, or Torch on Xeon processors.

  • Dataset Access - Designed to run in Hadoop, BigDL can compute where your data already exists. This can save time and effort since data does not need to be transferred to a separate GPU environment to be used with the deep learning model. This means that your entire pipeline from ingest to model training and inference can all happen in one environment, Hadoop.

Real Data + Real Problem

Recently I had a chance to take advantage of the model portability feature of BigDL. After learning of an internal project here at Dell EMC, leveraging deep learning and telemetry data to predict component failures, my team decided we wanted to take our Ready Solution for AI - Machine Learning with Hadoop and see how it did with the problem.

The team conducting the project for our support organization was using Tensorflow with GPU accelerators to train an LSTM model. The dataset was sensor readings from internal components collected at 15 minute intervals showing all kinds of metrics like temperature, fan speeds, runtimes, faults etc.

Initially my team wanted to focus on testing out two use cases for BigDL:

  • Use BigDL model portability to perform inference using the existing TensorFlow model
  • Implement an LSTM model in BigDL and train it with this dataset

As always, there were some preprocessing and data cleaning steps that had to happen before we could get to modeling and inference. Luckily for us, though, we received the clean output of those steps from our support team, so we could get started quickly. We received the data in the form of multiple csv files, already balanced with records of devices that did fail and those that did not. We got over 200,000 rows of data that looked something like this:



Converting the data to the tfrecord format used by TensorFlow was being done with Python and pandas dataframes. Moving this process to Spark is another area we knew we wanted to dig into, but to start we wanted to focus on our above-mentioned goals. When we started, the pipeline looked like this:

From Tensorflow to BigDL


For BigDL, instead of creating tfrecords, we needed to end up with an RDD of Samples. Each Sample is one record of your dataset in the form (feature, label). Feature and label are one or more tensors, and we create the Sample from ndarrays. Looking at the current pipeline, we were able to simply take the objects created before writing to tfrecord and instead write a function that turns these arrays into our RDD of Samples for BigDL.

def convert_to(x, y):
  # pair each feature array with its label and distribute across the cluster
  record = list(zip(x, y))
  record_rdd = sc.parallelize(record)

  # build one BigDL Sample per (feature, label) pair
  sample_rdd = record_rdd.map(lambda r: Sample.from_ndarray(r[0], r[1]))
  return sample_rdd

train = convert_to(x_train,y_train)
val = convert_to(x_val,y_val)
test = convert_to(x_test,y_test)

After that we took the pb and bin files representing the pretrained model's definition and weights and loaded them using the BigDL Model.load_tensorflow function. It requires knowing the input and output names for the model, but the TensorFlow graph summary tool can help out with that. It also requires a pb and bin file specifically, but if what you have is a TensorFlow ckpt file, it can be converted with tools provided by BigDL.

model_def = "tf_model/model.pb"
model_variable = "tf_model/model.bin"
inputs = ["Placeholder"]
outputs = ["prediction/Softmax"]
trained_tf_model = Model.load_tensorflow(model_def, inputs, outputs, byte_order = "little_endian", bigdl_type="float", bin_file=model_variable)

Now with our data already in the correct format, we can run inference against our test dataset. BigDL provides Model.evaluate, and we can pass it our RDD as well as the validation method to use, in this case Top1Accuracy.

results = trained_tf_model.evaluate(test,128,[Top1Accuracy()])


Defining a Model with BigDL


After testing out loading the pretrained TensorFlow model, the next experiment we wanted to conduct was to train an LSTM model defined with BigDL. BigDL provides a Sequential API and a Functional API for defining models. The Sequential API is for simpler models, while the Functional API, which describes the model as a graph, is better suited for complex models. Since our model is an LSTM, we will use the Sequential API.

Defining an LSTM model is as simple as:

def build_model(input_size, hidden_size, output_size):
    model = Sequential()
    recurrent = Recurrent()
    recurrent.add(LSTM(input_size, hidden_size))
    model.add(InferReshape([-1, input_size], True))
    model.add(recurrent)  # the recurrent container must be added to the model
    model.add(Select(2, -1))
    model.add(Linear(hidden_size, output_size))
    return model

lstm_model = build_model(n_input, n_hidden, n_classes)


After creating our model, the next step is to define the optimizer and validation logic that our model will use to train and learn.

Create the optimizer (the criterion, optimization method, and batch size shown here are illustrative; substitute the choices appropriate for your model):

optimizer = Optimizer(
    model=lstm_model,
    training_rdd=train,
    criterion=CrossEntropyCriterion(),
    optim_method=Adam(),
    end_trigger=MaxEpoch(50),
    batch_size=batch_size)

Set the validation logic:

optimizer.set_validation(
    batch_size=batch_size,
    val_rdd=val,
    trigger=EveryEpoch(),
    val_method=[Top1Accuracy()])
Now we can run trained_model = optimizer.optimize() to train our model, in this case for 50 epochs. We also set our TrainSummary folder so that the training data was logged. This allowed us to also get visualizations in TensorBoard, something that BigDL supports.


At this point we had completed the two initial tasks we had set out to do: load a pretrained TensorFlow model using BigDL, and train a new model with BigDL. Hopefully you found some of this process interesting, and also got an idea of how easy BigDL is for this use case. The ability to leverage deep learning models inside Hadoop, with no specialized hardware like InfiniBand or GPU accelerators, provides a great tool that is sure to change the way you currently view your existing analytics.

By: Lucas A. Wilson, Ph.D. and Michael Bennett


Artificial intelligence (AI) is transforming the way businesses compete in today’s marketplace. Whether it’s improving business intelligence, streamlining supply chain or operational efficiencies, or creating new products, services, or capabilities for customers, AI should be a strategic component of any company’s digital transformation.


Deep neural networks have demonstrated astonishing abilities to identify objects, detect fraudulent behaviors, predict trends, recommend products, enable enhanced customer support through chatbots, convert voice to text and translate one language to another, and produce a whole host of other benefits for companies and researchers. They can categorize and summarize images, text, and audio recordings with human-level capability, but to do so they first need to be trained.


Deep learning, the process of training a neural network, can sometimes take days, weeks, or months, and effort and expertise is required to produce a neural network of sufficient quality to trust your business or research decisions on its recommendations. Most successful production systems go through many iterations of training, tuning and testing during development. Distributed deep learning can speed up this process, reducing the total time to tune and test so that your data science team can develop the right model faster, but requires a method to allow aggregation of knowledge between systems.


There are several evolving methods for efficiently implementing distributed deep learning, and the way in which you distribute the training of neural networks depends on your technology environment. Whether your compute environment is container native, high performance computing (HPC), or Hadoop/Spark clusters for Big Data analytics, your time to insight can be accelerated by using distributed deep learning. In this article we are going to explain and compare systems that use a centralized or replicated parameter server approach, a peer-to-peer approach, and finally a hybrid of these two developed specifically for Hadoop distributed big data environments.


Distributed Deep Learning in Container Native Environments

Container-native environments (e.g., Kubernetes, Docker Swarm, OpenShift, etc.) have become the standard for many DevOps environments, where rapid, in-production software updates are the norm and bursts of computation may be shifted to public clouds. Most deep learning frameworks support distributed deep learning for these types of environments using a parameter server-based model that allows multiple processes to look at training data simultaneously, while aggregating knowledge into a single, central model.


The process of performing parameter server-based training starts with specifying the number of workers (processes that will look at training data) and parameter servers (processes that will handle the aggregation of error reduction information, backpropagate those adjustments, and update the workers). Additional parameter servers can act as replicas for improved load balancing.



Parameter server model for distributed deep learning


Worker processes are given a mini-batch of training data to test and evaluate, and upon completion of that mini-batch, report the differences (gradients) between produced and expected output back to the parameter server(s). The parameter server(s) will then handle the training of the network and transmitting copies of the updated model back to the workers to use in the next round.
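The cycle described above can be sketched in a few lines of plain Python. This is a toy simulation, not a real framework; the one-parameter least-squares "model" and all of its values are purely illustrative:

```python
# Toy parameter-server cycle: workers compute gradients on their own
# mini-batches; the server averages the gradients, updates the model,
# and sends the new weights back to every worker.
def worker_gradient(w, batch):
    # gradient of mean squared error for a 1-D model y = w * x (illustrative)
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def parameter_server_round(w, worker_batches, lr=0.05):
    grads = [worker_gradient(w, batch) for batch in worker_batches]
    avg_grad = sum(grads) / len(grads)  # aggregation on the parameter server
    return w - lr * avg_grad            # updated model, broadcast to workers

# two workers, each with a mini-batch drawn from the line y = 2x
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(100):
    w = parameter_server_round(w, batches)
print(round(w, 3))  # converges toward the true slope of 2.0
```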


This model is ideal for container native environments, where parameter server processes and worker processes can be naturally separated. Orchestration systems, such as Kubernetes, allow neural network models to be trained in container native environments using multiple hardware resources to improve training time. Additionally, many deep learning frameworks support parameter server-based distributed training, such as TensorFlow, PyTorch, Caffe2, and Cognitive Toolkit.


Distributed Deep Learning in HPC Environments

High performance computing (HPC) environments are generally built to support the execution of multi-node applications that are developed and executed using the single program, multiple data (SPMD) methodology, where data exchange is performed over high-bandwidth, low-latency networks, such as Mellanox InfiniBand and Intel OPA. These multi-node codes take advantage of these networks through the Message Passing Interface (MPI), which abstracts communications into send/receive and collective constructs.


Deep learning can be distributed with MPI using a communication pattern called Ring-AllReduce. In Ring-AllReduce each process is identical, unlike in the parameter-server model where processes are either workers or servers. The Horovod package by Uber (available for TensorFlow, Keras, and PyTorch) and the mpi_collectives contributions from Baidu (available in TensorFlow) use MPI Ring-AllReduce to exchange loss and gradient information between replicas of the neural network being trained. This peer-based approach means that all nodes in the solution are working to train the network, rather than some nodes acting solely as aggregators/distributors (as in the parameter server model). This can potentially lead to faster model convergence.
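To make the communication pattern concrete, here is a minimal pure-Python simulation of Ring-AllReduce, using one gradient chunk per worker for simplicity. Real implementations such as Horovod's are heavily optimized, but the two phases (reduce-scatter, then all-gather) are the same:

```python
def ring_allreduce(grads):
    """Simulate Ring-AllReduce for n workers, each holding a gradient
    vector of length n (one chunk per worker, for simplicity).
    Returns each worker's copy of the element-wise sum."""
    n = len(grads)
    buf = [list(g) for g in grads]
    # reduce-scatter: in n-1 steps each chunk travels around the ring,
    # accumulating partial sums as it goes
    for step in range(n - 1):
        sends = [((w + 1) % n, (w - step) % n, buf[w][(w - step) % n])
                 for w in range(n)]
        for dst, c, val in sends:
            buf[dst][c] += val
    # all-gather: circulate the finished chunks so every worker
    # ends up with the complete reduced vector
    for step in range(n - 1):
        sends = [((w + 1) % n, (w + 1 - step) % n, buf[w][(w + 1 - step) % n])
                 for w in range(n)]
        for dst, c, val in sends:
            buf[dst][c] = val
    return buf

print(ring_allreduce([[1, 2], [3, 4]]))  # every worker holds [4, 6]
```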



Ring-AllReduce model for distributed deep learning


The Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA allows users to take advantage of high-bandwidth Mellanox InfiniBand EDR networking, fast Dell EMC Isilon storage, accelerated compute with NVIDIA V100 GPUs, and optimized TensorFlow, Keras, or Pytorch with Horovod (or TensorFlow with tensorflow.contrib.mpi_collectives) frameworks to help produce insights faster.


Distributed Deep Learning in Hadoop/Spark Environments

Hadoop and other Big Data platforms achieve extremely high performance for distributed processing but are not designed to support long-running, stateful applications. Several approaches exist for executing distributed training under Apache Spark. Yahoo developed TensorFlowOnSpark, accomplishing this goal with an architecture that leverages Spark for scheduling TensorFlow operations and RDMA for direct tensor communication between servers.


BigDL is a distributed deep learning library for Apache Spark. Unlike Yahoo's TensorFlowOnSpark, BigDL not only enables distributed training - it is designed from the ground up to work on Big Data systems. To enable efficient distributed training, BigDL takes a data-parallel approach with synchronous mini-batch SGD (stochastic gradient descent). Training data is partitioned into RDD samples and distributed to each worker. Model training is an iterative process: first, gradients are computed locally on each worker, taking advantage of locally stored partitions of the training data and model to perform in-memory transformations; then an AllReduce function schedules workers with tasks to calculate and update weights; finally, a broadcast syncs the distributed copies of the model with the updated weights.



BigDL implementation of AllReduce functionality


The Dell EMC Ready Solutions for AI, Machine Learning with Hadoop is configured to allow users to take advantage of the power of distributed deep learning with Intel BigDL and Apache Spark. It supports loading models and weights from other frameworks such as Tensorflow, Caffe and Torch to then be leveraged for training or inferencing. BigDL is a great way for users to quickly begin training neural networks using Apache Spark, widely recognized for how simple it makes data processing.


One more note on Hadoop and Spark environments: The Intel team working on BigDL has built and compiled high-level pipeline APIs, built-in deep learning models, and reference use cases into the Intel Analytics Zoo library. Analytics Zoo is based on BigDL but helps make it even easier to use through these high-level pipeline APIs designed to work with Spark Dataframes and built in models for things like object detection and image classification.



Regardless of whether your preferred server infrastructure is container native, HPC clusters, or Hadoop/Spark-enabled data lakes, distributed deep learning can help your data science team develop neural network models faster. Our Dell EMC Ready Solutions for Artificial Intelligence can work in any of these environments to help jumpstart your business's AI journey. For more information on the Dell EMC Ready Solutions for Artificial Intelligence, go to

Lucas A. Wilson, Ph.D. is the Lead Data Scientist in Dell EMC's HPC & AI Engineering group. (Twitter: @lucasawilson)

Michael Bennett is a Senior Principal Engineer at Dell EMC working on Ready Solutions.