Find Communities by: Category | Product

1 2 Previous Next

Ready Solutions for AI

18 Posts

A data science project lifecycle includes data acquisition, preparation and validation, model development, model validation, productization, and monitoring. For many projects, model development and productization could be iterative processes. Models in production need to be continuously monitored to ensure they are performing as expected.  Each of these steps brings a unique set of challenges and requirements that are not met by traditional software development platforms. When building an AI based application, a data science platform that manages the life cycle of data, models and applications proves immensely valuable. The concept of a “data science workbench” that enables self-service access to datasets, compute instances and version control is now seen as a crucial enablement tool to the productivity and success of data science teams.

To meet this need for our customers, Dell EMC has partnered with Domino Data Lab to offer the Domino Data Science Platform powered by Dell EMC PowerEdge servers. One of the things I first liked about Domino was its straightforward integration with Apache Spark. In the Dell EMC HPC and AI Innovation Lab this allows us to tie our projects with existing Spark clusters such as the Dell EMC Ready Solutions for Data Analytics with Intel Analytics Zoo to enable AI on Spark. Here are some highlights of the features:

Reproducibility Engine: One of the key features of Domino is the Domino Reproducibility Engine. A project within Domino constitutes three components: data, code and the model. When developing and tuning a model, data scientists would run multiple experiments by varying model architecture and hyper parameters. Domino’s Reproducibility Engine captures and stores all these components. This allows for accurate reproduction of the experiments and allows data scientists to revisit the states or conditions in which the model produced optimal results. This also helps future collaborators to iterate through the model history and understand how it was developed, even when the original researcher is no longer available.

Flexibility of compute infrastructure: Domino enables IT administrators to provision compute environments and provide them to data scientists so they can run experiments. The compute infrastructure could be a cluster of CPU nodes, accelerators or an integration into big data frameworks like Cloudera’s distribution including Apache Hadoop (CDH) or Apache Spark. Data scientists can map these compute infrastructures to their projects based on their needs.

Software environments packaged as a container: Domino also enables IT administrators to capture required software packages, libraries and dependencies as a Docker container. Domino comes pre-configured with containers for popular languages, tools, and packages such as R, Python, SAS, Jupyter, RStudio, Tensorflow, and H2O. Custom environments can be built by appending to the Dockerfile and building the container. These version-controlled environments can then be mapped to projects.

Productizing models: Using Domino, a model can be published in several ways. Published models are automatically versioned, secured via access control, and highly available.

      • Models can be published as REST APIs, which can be accessed directly by applications for inference.
      • Data scientists can create self-service web forms using Launchers that lets the end users to easily view and consume analytics and reports.
      • Models can also be published as interactive Flask or Shiny apps and deployed through Domino.

Model usage tracking: Once the model has been published into production, Domino automatically captures model usage and other details. This helps in calculating the ROI of the model and helps to streamline future model development.


Refer to this brief for other features of Domino.

Integration into Cloudera

Domino supports running training jobs using Hadoop and Spark worker nodes hosted on existing Cloudera managed clusters.


Hadoop and Spark integration touches three aspects of the Domino Data Science Platform:


  • Projects - Projects are repositories that contain data, code, model and environment settings. Domino tracks the entire project and revisions it. An AI application or model can typically map to a project
  • Environments – Environments consists of software and hardware context in which the projects run. Software context includes a Docker container that incorporates all the libraries and settings for the project. Hardware context includes the nodes where the training job runs.
  • Workspaces – Workspaces are interactive development environments (IDEs) like Jupyter Notebook, JupyterLab and R Studio.

Projects in Domino can be configured to access a Hadoop/Spark cluster environment. An environment is configured with libraries, settings and hardware to access the Hadoop cluster. Once configured, the data scientist can launch a Jupyter notebook session or other workspace and run Spark training jobs. The Spark driver program will run on the Domino executor (as shown in the figure) and spawn workers in the Hadoop cluster for the machine learning training task.

By configuring the Docker container with deep learning libraries from the Analytics Zoo like BigDL, data scientist can request resources on demand from the Spark cluster to build data pipelines and train models.


Running on Kubernetes

Kubernetes has become the de-facto tool container orchestration. Domino Data is built to run on an existing Kubernetes cluster. This streamlines deployment, maintenance, scaling, and updates. Kubernetes leverages a farm of nodes - PowerEdge servers that provide the compute resources that power jobs and runs. This enables flexibility, improved resource utilization and version control while developing cloud native AI applications.

Starting with Spark version 2.3, Apache Spark supports Kubernetes natively with spark-submit. Using this method, Spark utilizes Kubernetes scheduler rather than YARN or the built-in. Spark supports both client mode and cluster mode with Kubernetes. The Spark driver runs inside or outside of the Kubernetes cluster to submit jobs to the Kubernetes scheduler, which then creates the workers. When the application completes the executor pods terminate and are cleaned up.


We have a demo of AI-assisted Radiology using distributed deep learning on Apache Spark and Analytics Zoo using Domino Data Science Platform at the Intel boot in Strata Data Conference (September 2019) in New York. Visit us at the Strata conference to learn more.


In a future post we can explore running Spark on Kubernetes in more detail. Follow this space to get best practices around running Apache Spark on Kubernetes.

sharing gears.pngIn Part 1 of “Share the GPU Love” we covered the need for improving the utilization of GPU accelerators and how a relatively simple technology like VMware DirectPath I/O together with some sharing processes could be a starting point.  As with most things in technology, some additional technology, and knowledge you can achieve high goals beyond just the basics.  In this article, we are going to introduce another technology for managing GPU-as-a-service – NVIDIA GRID 9.0.


Before we jump to this next technology, let’s review some of the limitations of using DirectPath I/O for virtual machine access to physical PCI functions. The online documentation for VMware DirectPath I/O has a complete list of features that are unavailable for virtual machines configured with DirectPath I/O.  Some of the most important ones are:


vmware vgu.png

  • Fault tolerance
  • High availability
  • Snapshots
  • Hot adding and removing of virtual devices

The technique of “passing through” host hardware to a virtual machine (VM) is simple but doesn’t leverage many of the virtues of true hardware virtualization.  NVIDIA delivers software to virtualize GPUs in the data center for years.  The primary use case has been Virtual Desktop Infrastructure (VDI)  using vGPUs.  The current release - NVIDIA vGPU Software 9 adds the vComputeServer vGPU capability for supporting artificial intelligence, deep learning, and high-performance computing workloads.  The rest of this article will cover using vGPU for machine learning in a VMware ESXi environment.

We want to compare the setup and features of this latest NVIIDA software version, so we worked on adding the vComputeServer to our PowerEdge ESXi that we used for the DirectPath I/O research in our first blog [add blog here].  Our NVIDIA Turing architecture T4 GPUs are on the list of supported devices, so we can check that box and our ESXi version is compatible.  The NVIDIA vGPU software documentation for VMware vSphere has an exhaustive list of requirements and compatibility notes.

You’ll have to put your host into maintenance mode during installation and then reboot after the install of the VIB completes.  When the ESXi host is back online you can use the now familiar nvidia-smi command with no parameters and see a list of all available GPUs that indicates you are ready to proceed.

We configured two of our T4 GPUs for vGPU use and setup the required licenses.  Then we followed the same approach that we used for DirectPath I/O to build out VM templates with everything that is common to all developments and use those to create the developer specific VMs – one with all Python tools and another with R tools.  NVIDIA vGPU software supports only 64-bit guest operating systems. No 32-bit guest operating systems are supported.  You should only use a guest OS release that is supported by both for NVIDIA vGPU software and by VMware.  NVIDIA will not be able to support guest OS releases that are not supported by your virtualization software.

vmware vgpu.JPG.jpg

Now that we have both a DirectPath I/O enabled setup and the NVIDIA vGPU environment let’s compare the user experience.  First, starting with vSphere 6.7 U1 release, vMotion with vGPU and suspend and resume with vGPU are supported on suitable GPUs. Always check the NVIDIA Virutual GPU Software Documentation for all the latest details.  vSphere 6.7 only supports suspend and resume with vGPU. vMotion with vGPU is not supported on release 6.7. [double check this because vMotion is supported I just cant remember what version and update number it is] 

vMotion can be extremely valuable for data scientists doing long running training jobs that you don’t get with DirectPath I/O and suspend/resume of vGPU enabled VMs creates opportunities to increase the return from your GPU investments by enabling scenarios with data science model training running at night and interactive graphics intensive applications running during the day utilizing the same pool of GPUs.  Organizations with workers spread across time zones may also find that suspend/resume of vGPU enabled VMs to be useful.

There is still a lot of work that we want to do in our lab including capturing some informational videos that will highlight some of the concepts we have been talking about in this last two articles.  We are also starting to build out some VMs configured with Docker so we can look at using our vGPUs with NVIDIA GPU Cloud (GCP) deep learning training and inferencing containers.  Our goal is to get more folks setting up a sandbox environment using these articles along with the NVIDIA and VMware links we have provided.  We want to hear about your experience working with vGPUs and VMware.  If you have any questions or comments post them in the feedback section below.


Thanks for reading,

Phil Hummel - @GotDisk

sharing gears.pngAnyone that works with machine learning models trained by optimization methods like stochastic gradient descent (SGD) knows about the power of specialized hardware accelerators for performing a large number of matrix operations that are needed.  Wouldn’t it be great if we all had our own accelerator dense supercomputers?  Unfortunately, the people that manage budgets aren’t approving that plan, so we need to find a workable mix of technology and, yes, the dreaded concept, process to improve our ability to work with hardware accelerators in shared environments.


We have gotten a lot of questions from a customer trying to increase the utilization rates of machines with specialized accelerators.  Good news, there are a lot of big technology companies working on solutions. The rest of the article is going to focus on technology from Dell EMC, NVIDIA, and VMware that is both available today and some that are coming soon.  We also sprinkle in some comments about the process that you can consider.  Please add your thoughts and questions in the comments section below.


We started this latest round of GPU-as-a-service research with a small amount of kit in the Dell EMC Customer Solutions Center in Austin.  We have one Dell EMC PowerEdge R740 with 4 NVIDIA T4 GPUs connected to the system on the PCIe bus.  Our research question is “how can a group of data scientists working on different models with different development tools share these four GPUs?”  We are going to compare two different technology options:

  1. VMware Direct Path I/O

Our server has ESXi installed and is configured as a 1 node cluster in vCenter.  I’m going to skip the configuration of the host BIOS and ESXi and jump straight to creating VMs.  We started off with the Direct Path I/O option.  You should review the article “Using GPUs with Virtual Machines on vSphere – Part 2: VMDirectPath I/O” from VMware before trying this at home.  It has a lot of details that we won’t repeat here.



There are many approaches available for virtual machine image management that can be set up by the VMware administrators but for this project, we are assuming that our data scientists are building and maintaining the images they use.  Our scenario is to show how a group of Python users can have one image and the R users can have another image that both use GPUs when needed.  Both groups are using primarily TensorFlow and Keras.


Before installing an OS we changed the firmware setting to EFI in the VM Boot Options menu per the article above.  We also used the VM options to assign one physical GPU to the VM using Direct Path I/O before proceeding with any software installs.  It is important for there to be a device present during configuration even though the VM may get used later with or without an assigned GPU to facilitate sharing among users and/or teams.


Once the OS was installed and configured with user accounts and updates, we installed the NVIDIA GPU related software and made two clones of that image since both the R and Python environment setups need the same supporting libraries and drivers to use the GPUs when added to the VM through Direct Path I/O.  Having the base image with an OS plus NVIDIA libraries saves a lot of time if you want a new type of developer environment.


With this much of the setup done, we can start testing assigning and removing GPU devices among our two VMs.  We use VM options to add and remove the devices but only while the VM is powered off. For example, we can assign 2 GPUs to each VM, 4 GPUs to one VM and none to the other or any other combination that doesn’t exceed our 4 available devices.  Devices currently assigned to other VMs are not available in the UI for assignment, so it is not physically possible to create conflicts between VMs. We can NVIDIA’s System Management Interface (nvidia-smi) to list the devices available on each VM.


Remember above when we talked about process, here is where we need to revisit that.  The only way a setup like this works is if people release GPUs from VMs when they don’t need them.  Going a level deeper there will probably be a time when one user or group could take advantage of a GPU but would choose to not take one so other potentially more critical work can have it.  This type of resource sharing is not new to research and development.  All useful resources are scarce, and a lot of efficiencies can be gained with the right technology, process, and attitude


Before we talk about installing the developer frameworks and libraries, let’s review the outcome we desire. We have 2 or more groups of developers that could benefit from the use of GPUs at different times in their workflow but not always.  They would like to minimize the number of VM images they need and have and would also like fewer versions of code to maintain even when switching between tasks that may or may not have access to GPUs when running.  We talked above about switching GPUs between machines but what happens on the software side?  Next, we’ll talk about some TensorFlow properties that make this easier.


TensorFlow comes in two main flavors for installation tensorflow and tensorflow-gpu.  The first one should probably be called “tensorflow-cpu” for clarity.  For this work, we are only installing the GPU enabled version since we are going to want our VMs to be able to use GPU for any operations that TF supports for GPU devices. The reason that I don’t also need the CPU version when my VM has not been assigned any GPUs is that many operations available in the GPU enabled version of TF have both a CPU and a GPU implantation. When an operation is run without a specific device assignment, any available GPU device will be given priority in the placement.   When the VM does not have a GPU device available the operation will use the CPU implementation.


There are many examples online for testing if you have a properly configured system with a functioning GPU device.  This simple matrix multiplication sample is a good starting point.  Once that is working you can move on a full-blown model training with a sample data set like the MNIST character recognition model.  Try setting up a sandbox environment using this article and the VMware blog series above. Then get some experience with allocating and deallocating GPUs to VMs and prove that things are working with a small app.  If you have any questions or comments post them in the feedback section below.


Thanks for reading.

Phil Hummel



We know that using really large batch sizes during training can cause models to poorly generalize. But how do large batches actually affect the generalization and optimization of neural network models? 2018 was a great year for research on Neural Machine Translation (NMT).  We’ve seen an explosion in the number of research papers published in this field, ranging from descriptions of new and interesting architectures to efficient training techniques. Research papers have shown how larger batch sizes and reduced precision can help to improve both the training time and quality.


Figure 1: Numbers of papers published in Arxiv with ‘neural machine translation’ in the title or abstract in the ‘cs’ category.

In our previous blogs, we showed how to effectively scale an NMT system, as well as some of the challenges associated with scaling. In this blog, we will explore the effectiveness of large batch training using Intel® Xeon® Scalable processors. The work discussed in the blog is based on neural network training performed using Zenith super computer at Dell EMC’s HPC and AI Innovation lab.

System Information

CPU Model

Intel®  Xeon® Gold 6148 CPU @ 2.40GHz

Operating System

Red Hat Enterprise Linux Server release 7.4 (Maipo)

Tensorflow Version

Anaconda TensorFlow 1.12.0 with Intel® MKL

Horovod Version




Incredible strong scaling efficiency helps to dramatically reduce the time to solution of the model. To best visualize this, consider figure 2. The time to solution drops from around 1 month on a single node to just over 6 hours using 200 nodes. This 121x faster solution would significantly help the productivity of NMT researchers using CPU-based HPC infrastructures. The results observed were based on the models achieving a baseline BLEU score (case-sensitive) of 27.5.


Figure 2: Time to train the model to solution

For the single node case, we have used the largest batch size that could fit in a nodes memory, 25,600 tokens per worker. For all other cases we use a global batch size of 819,200, leading to per-worker batch sizes of 25,600 in the 16-node case, down to only 2,048 in the 200-node case. The number of training iterations is similar for all experiments in the 16-200 node range, and is increased by a factor of 16 for the single-node case (to compensate for the larger batch).


Figure 3: Translation quality (BLEU) when trained with different batch sizes on Zenith.

Scaling out the “transformer” model training using MPI and Horovod improves throughput performance, while producing models of similar translation quality as shown in Figure 3.  The results were obtained by using newstest2014 as the test set. Models of comparable quality can be trained in a reduced amount of time by scaling computation over many more nodes, and with larger global batch sizes (GBZ). Our experiments on Zenith demonstrate ability to train models of comparable or higher translation quality (as measured by BLEU score) than the reported best for TensorFlow's official model, even when training with batches of a million or more tokens.

Note: The results shown in figure 3 were obtained by using the settings mentioned in our previous blog and by using Open MPI.



Here in this blog, we showed the generalization of large batch training of NMT model. We also showed how efficiently Intel® Xeon® Scalable processors is able to scale and reduce the time to solution. We hope this would benefit the productivity of the NMT research community using CPU-based HPC infrastructures.

The field of machine language translation is rapidly shifting from statistical machine learning models to efficient neural network architecture designs which can dramatically improve translation quality. However, training a better performing Neural Machine Translation (NMT) model still takes days to weeks depending on the hardware, size of the training corpus and the model architecture. Improving the time-to-solution for NMT training will be crucial if these approaches are to achieve mainstream adoption.

Intel® Xeon® Scalable processors are the workhorse of the modern datacenter, and over 90% of the Top500 super computers run on Intel. We can apply the supercomputing approach of scaling out to multiple servers to training NMT models in any datacenter. In this article we show some the effectiveness of and highlight important considerations when scaling a NMT model using Intel® Xeon® Scalable processors.

Encoder – decoder architecture

An NMT model reads a sentence in a source language and passes it to an encoder, which builds an intermediate representation. A decoder then processes the intermediate representation to produce a translated sentence in a target language.

enc-dec-architecture.pngFigure 1: Encoder-decoder architecture

The figure above illustrates the encoder-decoder architecture. The English source sentence, “Hello! How are you?”  is read and processed by the architecture to produce a translated German sentence “Hallo! Wie geht sind Sie?”. Traditionally, Recurrent Neural Network (RNN) were used in encoders and decoders, but other neural network architectures such as Convolutional Neural Network (CNN) and attention mechanism based architectures are also used.

Architecture and environment

The Transformer model is one of the current architectures of interest in the field of NMT, and is built with variants of the attention mechanism which replace the traditional RNN components in the architecture. This architecture was able to produce a model which achieved state of the art results in English-German and English-French translation tasks.



Figure 2: Multi-head attention block

The above figure shows the multi-head attention block used in the transformer architecture. At a high-level, the scaled dot-product attention can be thought as finding the relevant information, in the form of values (V) based on Query (Q) and Keys (K). Multi-head attention can be thought of as several attention layers in parallel, which together can identify distinct aspects of the input.

We use the Tensorflow official model implementation of the transformer architecture, which has been augmented with Uber’s Horovod distributed training framework. The training dataset used is the WMT English-German parallel corpus, which contains 4.5M English-German sentence pairs.

Our tests were performed in house on Zenith super computer in the Dell EMC HPC and AI Innovation lab. Zenith is a Dell EMC PowerEdge C6420-based cluster, consisting of 388 dual socket nodes powered by Intel® Xeon® Scalable Gold 6148 processors and interconnected with an Intel® Omni-path fabric.

System Information

CPU Model

Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz

Operating System

Red Hat Enterprise Linux Server release 7.4 (Maipo)

Tensorflow Version

1.10.1 with Intel® MKL

Horovod Version



Open MPI 3.1.2


Note: We used a specific Horovod branch to handle sparse gradients. Which is now part of the main branch in their GitHub repository.

Weak scaling, environment variables and TF configurations

When training using CPUs, environment variable settings and TensorFlow runtime configuration values play a vital role in improving the throughput and reducing the time to solution.

Below are the suggested settings based on our empirical tests when running 4 processes per node for the transformer (big) model on 50 zenith nodes.

Environment Variables:




export KMP_AFFINITY=granularity=fine,verbose,compact,1,0

TF Configurations:




Experimenting with weak scaling options allows to find the optimal number of processes run per node such that the model is fit in the memory and performance doesn’t deteriorate. For some reason TensorFlow creates an extra thread. Hence, to avoid oversubscription it’s better to set the OMP_NUM_THREADS to 9, 19 or 39 when training with 4,2,1 process per node respectively. Although we didn’t see it affecting the throughput performance in our experiments but may affect performance in a very large-scale setup.

Taking advantage of multi-threading can dramatically improve performance. This can be done by setting OMP_NUM_THREADS such that the product of its value and number of MPI ranks per node equals the number of available CPU cores per node. In the case of Zenith, this is 40 cores, as each PowerEdge C6420 node contains 2 20-core Intel® Xeon® Gold 6148 processors.

The KMP_AFFINITY environment variable provides a way to control the interface which binds OpenMP threads to physical processing units, while KMP_BLOCKTIME, sets the time in milliseconds that a thread should wait after completing a parallel execution before sleeping. TF configuration settings, intra_op_parallelism_threads and inter_op_parallelism_threads, are used to adjust the thread pools there by optimizing the CPU performance.



Figure 3: Effect of environment variables

The above results show that there’s a 1.67x improvement when environment variables are set correctly.

Faster distributed training

Training a large neural network architecture can be time consuming, making it difficult to perform rapid prototyping or hyper parameter tuning. Thanks to distributed training and open source frameworks like Horovod, which allows to train a model using multiple workers, the time to train can be substantially reduced. In our previous blog we showed the effectiveness of training an AI radiologist with distributed deep learning and using Intel® Xeon® Scalable processors. Here, we show how distributed deep learning improves the time to train for machine translation models.


Figure 4: Scaling Performance

The above chart shows the throughput of the transformer (big) model when trained using up to 100 Zenith nodes. Our experiments show linear performance when scaling up the number of nodes. Based on our tests, which include setting the correct environment variables and the optimal number of MPI processes per node, we see a 79x improvement on 100 Zenith nodes with 2 processes per node compared to the throughput on single node with 4 processes.

Translation Quality

NMT models’ translation quality is measured in terms of BLEU (Bi-Lingual Evaluation Understudy) score. It’s a measure to compute the difference between the human and machine translated output.

In a previous blog post we explained some of the challenges of large-batch training of deep learning models. Here, we experimented using a large global batch size of 402k tokens to determine the models’ accuracy on the English to German translation task. Hyper parameters were set to match those used for the transformer (big) model, and the model was trained using 50 Zenith nodes with 4 processes per node. The learning rate grows linearly for 4000 steps to 0.001 and then follows inverse square root decay.


Case-Insensitive BLEUCase-Sensitive BLEU
TensorFlow Official Benchmark Results28.9-
Our results29.1528.56


Note: Case-Sensitive score not reported in the Tensorflow Official Benchmark.

The above table shows our results on the test set (newstest2014) after training the model for around 2.7 days (26000 steps). We can see a clear improvement in the translation quality compared to the results posted on the Tensorflow Official Benchmarks page. This shows that training with large batches does not adversely affect the quality of the resulting translation models, which is an encouraging result for future studies with even larger batch sizes.


In this post we showed how to effectively train a Neural Machine Translation(NMT) system using Intel® Xeon® Scalable processors using distributed deep learning. We highlighted some of the best practices for setting environment variables and the corresponding scaling performance. Based on our experiments, and following other research work on NMT to understand some of the important aspects of scaling an NMT system, we were able to demonstrate better translation quality and accelerate the training process. With research interest in the field of neural machine translation continuing to grow, we expect to see more interesting and innovative NMT architectures in the future.

As I mentioned in our previous blog post, the translation quality of neural machine translation (NMT) systems has improved immensely in recent years. However, these models still take considerable time to train, and little work has been focused on improving their time to solution. Distributed training across multiple compute nodes can potentially improve the time to train, but there are various challenges associated with scale-out training of NMT systems.

In this blog we highlight solutions developed at Dell EMC which address a few common issues encountered when scaling an NMT architecture like the Transformer model in TensorFlow, highlight the performance benefits associated with these solutions. All of the experiments and results obtained used Zenith, Dell EMC’s  very own Intel® Xeon® Scalable processor-based supercomputer, which is housed in the Dell EMC HPC & AI Innovation Lab in Austin, Texas.

Performance degradation and OOM errors

One of the main roadblocks to scaling  NMT models is the memory required to accumulate gradients. When training neural networks, the gradients are vectors – or directional arrays – of numbers which roughly correspond to the difference between the current network weights and a set of weights which provide a better solution. Essentially, the gradients point each weight value in a different, and hopefully, better direction which leads to better solutions. While convolutional neural networks for image classification use dense gradient vectors which can be easily worked with, the design of the transformer model uses an embedding layer which does not necessarily scale well to multiple servers.

This design causes severe performance degradation and out of memory (OOM) errors because TensorFlow does not accumulate the embedding layer gradients correctly. Gradients from the embedding layer are sparse, whereas the gradients from the projection matrix are dense. TensorFlow then accumulates both of these tensors as sparse objects. This has a dramatic effect on TensorFlow’s gradient accumulation strategy, and subsequently on the total size of the accumulated gradient tensor. This results in large message buffers which scale linearly with the number of processes, thereby causing segmentation faults or out-of-memory errors.

The assumed-sparse tensors make Horovod (the distributed training framework used with TensorFlow) to perform gradient accumulation by MPI_Gather rather than MPI_Reduce. To fix this issue, we can convert all assumed sparse tensors to dense tensors. This is done by adding the flag “sparse_as_dense=True” in Horovod’s DistributedOptimizer method.


opt = hvd.DistributedOptimizer(opt, sparse_as_dense=True)


Figure 1: Accumulate size

Figure 1 shows the accumulation size when using 64 nodes (1ppn, batch_size=5000 tokens). There’s an 82x reduction in accumulation size when the assumed sparse tensors are converted to dense. This solution allows to scale and train the model using 100’s of nodes.


Figure 2: Scaled speedup (strong) performance.


Apart from the weak scaling performance benefit shown in our previous blog, the reduced gradient size also provides a way to perform efficient strong scaling. Figure 2 shows the strong scaling speedup performed on zenith and stampede2 super computers using up to 200 nodes on Zenith (Dell EMC) and 256 nodes on Stampede2 (TACC). Efficient strong scaling greatly helps to reduce the time to train the model



Diverged Training

While building a model quickly is important, it is critical the make sure that the resulting model is also accurate. Diverged training, where the produced model becomes less accurate (rather than more accurate) with continued training is a common problem not just for large batch training but in general for any NMT system. Monitoring the loss graph would help to understand the convergence of the deep learning model. Setting the learning rate to an optimal value is crucial for the model’s convergence.

Measures can be taken to prevent diverged training. Experiments suggest that having a very high learning rate at the beginning of the training would cause diverged training. But on the other hand, setting the learning rate too low also would make the model to converge slowly. Finding the ideal learning rate for the model is therefore critical.

One solution is to reduce the learning rate (cool down or decay) or increase the learning rate (warm up), or more often a combination of both By allowing the learning rate to increase linearly to the set value for certain number of steps after which it decays based on a chosen function, the resulting model can be more accurate and produced faster. For transformer model, the decay is proportional to the inverse square root of the number of steps.


Figure 3: Learning rate decay used in Transformer model


Based on our experiments we found that for large batch sizes (130k, 402k, 630k, 1M tokens), setting the learning rate to 0.001 – 0.005 would prevent diverged training of the big model.


Figure 4: An example loss profile showing diverged training (gbz=130k, lr=0.01)


Figure 5: An example loss profile showing correct training behavior (gbz=130k, lr=0.001)

Figures 4 and 5 show the loss profiles when trained with a global batch size of 130k. Setting the learning rate to a “high” value (0.01) results in diverged training, but when set to 1e-3 (0.001), the model converges better. This results in good translation quality on the final model. Similar results were observed for all other large batch sizes.



In this blog we highlighted a few common challenges when performing distributed training of the transformer model for neural machine translation (NMT). The solutions developed by Dell EMC in collaboration with Uber, Amazon, Intel, and SURFsara resulted in dramatically improved scaling capabilities and model accuracy. The results are now added part of our research paper accepted at ISC High Performance 2019 conference. The paper has further details about the modifications to Horovod and improvements in terms of memory usage, scaling efficiency, reduced time to train and translation quality. The work has been incorporated into Horovod so that the research community can explore further scaling potential and produce more efficient NMT models.



Due to the almost “unpredictable” nature of stock market, predicting stock price is one of the most challenging problems in financial service industry. In financial literature, a stock price is considered as a stochastic process and modeled with geometric Brownian motion (GBM), which is expressed as a stochastic differential equation (SDE) as follows


Here S(t) is price of the stock at time t ; a is percentage drift; b is percentage volatility and W(t) is the Brownian motion term. The equation has a nice closed form solution


However, for real problems, the percentage volatility b is hardly obtainable, making it extremely difficult to model the dynamics of a stock or predict its trends.

As a compromise, we use neural network models to predict stock price trends. As a data-driven approach, neural networks take stock prices as time series and use historical data to learn the parameters involved in the model and make predictions for the future. Also, neural networks possess a good diversity in model architectures, so they are good candidates for ensemble learning, which usually produces more accurate predictions than any single models.

In this blog, we discuss about building an ensemble prediction system consisting of neural network models to predict the trends for 542 technology companies’ stocks. Also, we will show how to train the prediction system in parallel utilizing the HPC (high performance computing) power of Intel® Xeon cluster.


Ensemble Model for Predicting a Bundle of Stocks

Evolutions of different stocks’ prices are not independent of each other. Instead, they are influencing each other in some way. It is very hard to know exactly how they interact with each other. We can approximately model the dynamics of stock prices with the following system


Here k is the maximum lag that is used as predictors. In this blog, we explore on using neural network models to approximate F_1, F_2, ......, F_n. As is shown in Figure 1, we firstly train three different neural network models (multi-layer perceptron (MLP), convolutional neural network (CNN) and long short-term memory (LSTM)) and then do an ensemble forecasting via majority voting, i.e., we believe that the price of a stock will increase/decrease if at least 2/3 of the prediction models believe so.


Figure 1: Training ensemble model for stock price trends prediction on multiple nodes with Intel® Xeon processors

Low Latency Prediction with Intel® Xeon Scalable Processors

Latency is very critical for stock trading. An investment portfolio on the stock market needs to be frequently adjusted to hedge the risk (to minimize the probability of loss). The hedging strategy needs to be executed at high frequency with very low latency. So, training and predictions must be finished within very short time periods.

We leverage the HPC power of Intel® Xeon processors to build a low latency neural network system for real-time predictions for stock price trends. The system in Figure 1 was tested on the Zenith supercomputer in Dell EMC HPC and AI Innovation Lab, which consists of 422 Dell EMC PowerEdge C6420 servers, each with 2 Intel® Xeon Scalable Gold 6148 processors. Each of these processors has 20 physical cores.

In Figure 1, MLP, CNN and LSTM models can be trained simultaneously with multiple processes (e.g., 40 processes) on 3 Zenith nodes. Training with multiple processes was performed with Uber’s Horovod framework. We trained the models with historical data in the past 200 consecutive trading days. Batch size for each process was 8. We did the test for 20 trading days, so there were 10840 (20x542) trends. The ensemble system can predict about 56% of them correctly.


Figure 2: Training time comparison for 1 process and 40 processes.

Figure 2 compares training time costs for 1 process and 40 processes. As is shown, distributed training over multi-processes can reduce training time by about 10 times. The speed-up is especially significant for LSTM model, because it has more parameters and usually takes longer time to train than other models. With this prediction system, the latency is less than 20 seconds. Stock trading system can take advantage of this low latency to respond and react to changes/fluctuations in the market more quickly, which in the end will help better optimize the investment portfolio so that the risk of loss is lower.


While deep neural networks have had great success in areas like computer vision, natural language processing, heath care, weather forecasting, etc., it is worthwhile to do more exploring research on their applications in financial industry. Since most operations (e.g., stock trading and options trading) in financial industry are time-sensitive and requires low latency, training prediction models in parallel is a necessity when financial companies integrate artificial intelligence (AI) applications into their business. HPC clusters with Intel® Xeon scalable processors will be good infrastructure choices for building such low latency AI systems.

Inference is the process of running a trained neural network to process new inputs and make predictions. Training is usually performed offline in a data center or a server farm. Inference can be performed in a variety of environments depending on the use case. Intel® FPGAs provide a low power, high throughput solution for running inference. In this blog, we look at using the Intel® Programmable Acceleration Card (PAC) with Intel® Arria® 10GX FPGA for running inference on a Convolutional Neural Network (CNN) model trained for identifying thoracic pathologies.


Advantages of using Intel® FPGAs

  1. System Acceleration

Intel® FPGAs accelerate and aid the compute and connectivity required to collect and process the massive quantities of information around us by controlling the data path. In addition to FPGAs being used as compute offload, they can also directly receive data and process it inline without going through the host system. This frees the processor to manage other system events and enables higher real time system performance.

    2. Power Efficiency

Intel® FPGAs have over 8 TB/s of on-die memory bandwidth. Therefore, solutions tend to keep the data on the device tightly coupled with the next computation. This minimizes the need to access external memory and results in a more efficient circuit implementation in the FPGA where data can be paralleled, pipelined, and processed on every clock cycle. These circuits can be run at significantly lower clock frequencies than traditional general-purpose processors and results in very powerful and efficient solutions.

    3. Future Proofing

In addition to system acceleration and power efficiency, Intel® FPGAs help future proof systems. With such a dynamic technology as machine learning, which is evolving and changing constantly, Intel® FPGAs provide flexibility unavailable in fixed devices. As precisions drop from 32-bit to 8-bit and even binary/ternary networks, an FPGA has the flexibility to support those changes instantly. As next generation architectures and methodologies are developed, FPGAs will be there to implement them.


Model and software

The model is a Resnet-50 CNN trained on the NIH chest x-ray dataset. The dataset contains over 100,000 chest x-rays, each labelled with one or more pathologies. The model was trained on 512 Intel® Xeon® Scalable Gold 6148 processors in 11.25 minutes on the Zenith cluster at DellEMC.

The model is trained using Tensorflow 1.6. We use the Intel® OpenVINO™ R3 toolkit to deploy the model on the FPGA. The Intel® OpenVINO™ toolkit is a collection of software tools to facilitate the deployment of deep learning models. This OpenVINO blog post details the procedure to convert a Tensorflow model to a format that can be run on the FPGA.



In this section, we look at the power consumption and throughput numbers on the Dell EMC PowerEdge R740 and R640 servers.

   1. Using the Dell EMC PowerEdge R740 with 2x Intel® Xeon® Scalable Gold 6136 (300W) and 4x Intel® PACs.

Figure 1 and 2 show the power consumption and throughput numbers for running the model on Intel® PACs, and in combination with Intel® Xeon® Scalable Gold 6136. We observe that the addition of a single Intel® PAC adds only 43W to the system power while providing the ability to inference over 100 chest X-rays per second. The additional power and inference performance scales linearly with the addition of more Intel® PACs. At a system level, wee see a 2.3x improvement in throughput and 116% improvement in efficiency (images per sec per Watt) when using 4x Intel® PACs with 2x Intel® Xeon® Scalable Gold 6136.


Figure 1: Inference performance tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm.



Figure 2 Performance per watt tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm.


    2. Using the Dell EMC PowerEdge R640 with 2x Intel® Xeon® Scalable Gold 5118 (210W) and 2x Intel® PACs

We also used a server with lower idle power. We see a 2.6x improvement in system performance in this case. As before, each Intel® PAC linearly adds performance to the system, adding more than 100 inferences per second for 43W (2.44 images/sec/W).



Figure 3 Inference performance tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm.



Figure 4 Performance per watt tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm.



Intel® FPGAs coupled with Intel® OpenVINO™ provide a complete solution for deploying deep learning models in production. FPGAs offer low power and flexibility that make them very suitable as an accelerator device for deep learning workloads.

Deploying trained neural network models for inference on different platforms is a challenging task. The inference environment is usually different than the training environment which is typically a data center or a server farm. The inference platform may be power constrained and limited from a software perspective. The model might be trained using one of the many available deep learning frameworks such as Tensorflow, PyTorch, Keras, Caffe, MXNet, etc. Intel® OpenVINO™ provides tools to convert trained models into a framework agnostic representation, including tools to reduce the memory footprint of the model using quantization and graph optimization. It also provides dedicated inference APIs that are optimized for specific hardware platforms, such as Intel® Programmable Acceleration Cards, and Intel® Movidius™ Vision Processing Units.



   The Intel® OpenVINO™ toolkit




  1. Model Optimizer

The Model Optimizer is a cross-platform command-line tool that facilitates the transition between the training and deployment environment, performs static model analysis, and adjusts deep learning models for optimal execution on end-point target devices. It is a Python script which takes as input a trained Tensorflow/Caffe model and produces an Intermediate Representation (IR) which consists of a .xml file containing the model definition and a .bin file containing the model weights.


     2. Inference Engine

The Inference Engine is a C++ library with a set of C++ classes to infer input data (images) and get a result. The C++ library provides an API to read the Intermediate Representation, set the input and output formats, and execute the model on devices. Each supported target device has a plugin which is a DLL/shared library. It also has support for heterogenous execution to distribute workload across devices. It supports implementing custom layers on a CPU while executing the rest of the model on a accelerator device.




  1. Using the Model Optimizer, convert a trained model to produce an optimized Intermediate Representation (IR) of the model based on the trained network topology, weights, and bias values.
  2. Test the model in the Intermediate Representation format using the Inference Engine in the target environment with the validation application or the sample applications.
  3. Integrate the Inference Engine into your application to deploy the model in the target environment.


Using the Model Optimizer to convert a Keras model to IR


The model optimizer doesn’t natively support Keras model files. However, because Keras uses Tensorflow as its backend, a Keras model can be saved as a Tensorflow checkpoint which can be loaded into the model optimizer. A Keras model can be converted to an IR using the following steps

  1. Save the Keras model as a Tensorflow checkpoint. Make sure the learning phase is set to 0. Get the name of the output node.

import tensorflow as tf 
from keras.applications import Resnet50 
from keras import backend as K 
from keras.models import Sequential, Model

K.set_learning_phase(0)   # Set the learning phase to 0

model = ResNet50(weights='imagenet', input_shape=(256, 256, 3))  
config = model.get_config()                                            
weights = model.get_weights()
model = Sequential.from_config(config)

output_node =':')[0]  # We need this in the next step
graph_file =
ckpt_file =
saver = tf.train.Saver(sharded=True)
'', graph_file), ckpt_file)


     2. Run the Tensorflow freeze_graph program to generate a frozen graph from the saved checkpoint.

tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph --input_graph=./resnet50_graph.pb --input_checkpoint=./resnet50.ckpt --output_node_names=Softmax --output_graph=resnet_frozen.pb     


     3. Use the script and the frozen graph to generate the IR. The model weights can be quantized to FP16.

     python --input_model=resnet50_frozen.pb --output_dir=./ --input_shape=[1,224,224,3] --           data_type=FP16            




The C++ library provides utilities to read an IR, select a plugin depending on the target device, and run the model.

  1. Read the Intermediate Representation - Using the InferenceEngine::CNNNetReader class, read an Intermediate Representation file into a CNNNetwork class. This class represents the network in host memory.
  2. Prepare inputs and outputs format - After loading the network, specify input and output precision, and the layout on the network. For these specification, use the CNNNetwork::getInputInfo() and CNNNetwork::getOutputInfo()
  3. Select Plugin - Select the plugin on which to load your network. Create the plugin with the InferenceEngine::PluginDispatcher load helper class. Pass per device loading configurations specific to this device and register extensions to this device.
  4. Compile and Load - Use the plugin interface wrapper class InferenceEngine::InferencePlugin to call the LoadNetwork() API to compile and load the network on the device. Pass in the per-target load configuration for this compilation and load operation.
  5. Set input data - With the network loaded, you have an ExecutableNetwork object. Use this object to create an InferRequest in which you signal the input buffers to use for input and output. Specify a device-allocated memory and copy it into the device memory directly, or tell the device to use your application memory to save a copy.
  6. Execute - With the input and output memory now defined, choose your execution mode:
    • Synchronously - Infer() method. Blocks until inference finishes.
    • Asynchronously - StartAsync() method. Check status with the wait() method (0 timeout), wait, or specify a completion callback.
  7. Get the output - After inference is completed, get the output memory or read the memory you provided earlier. Do this with the InferRequest GetBlob API.


The classification_sample and classification_sample_async programs perform inference using the steps mentioned above. We use these samples in the next section to perform inference on an Intel® FPGA.



Using the Intel® Programmable Acceleration Card with Intel® Arria® 10GX FPGA for inference


The OpenVINO toolkit supports using the PAC as a target device for running low power inference. The steps for setting up the card are detailed here. The pre-processing and post-processing is performed on the host while the execution of the model is performed on the card. The toolkit contains bitstreams for different topologies.

  1. Programming the bitstream

     aocl program <device_id> <open_vino_install_directory>/a10_dcp_bitstreams/2-0-1_RC_FP16_ResNet50-101.aocx                                                                                     

    2. The Hetero plugin can be used with CPU as the fallback device for layers that are not supported by the FPGA. The -pc flag prints            performance details for each layer

     ./classification_sample_async -d HETERO:FPGA,CPU -i <path/to/input/image.png> -m <path/to/ir>/resnet50_frozen.xml                                                                                         




Intel® OpenVINO™ toolkit is a great way to quickly integrate trained models into applications and deploy them in different production environments. The complete documentation for the toolkit can be found at



Time series is a very important type of data in the financial services industry. Interest rates, stock prices, exchange rates, and option prices are good examples for this type of data. Time series forecasting plays a critical role when financial institutions design investment strategies and make decisions. Traditionally, statistical models such as SMA (simple moving average), SES (simple exponential smoothing), and ARIMA (autoregressive integrated moving average) are widely used to perform time series forecasting tasks.


Neural networks are promising alternatives, as they are more robust for such regression problems due to flexibility in model architectures (e.g., there are many hyperparameters that we can tune, such as number of layers, number of neurons, learning rate, etc.). Recently applications of neural network models in the time series forecasting area have been gaining more and more attention from statistical and data science communities.


In this blog, we will firstly discuss about some basic properties that a machine learning model must have to perform financial service tasks. Then we will design our model based on these requirements and show how to train the model in parallel on HPC cluster with Intel® Xeon processors.



Requirements from Financial Institutions


High-accuracy and low-latency are two import properties that financial service institutions expect from a quality time series forecasting model.


High Accuracy  A high level of accuracy in the forecasting model helps companies lower the risk of losing money in investments. Neural networks are believed to be good at capturing the dynamics in time series and hence yield more accurate predictions. There are many hyperparameters in the model so that data scientists and quantitative researchers can tune them to obtain the optimal model. Moreover, data science community believes that ensemble learning tends to improve prediction accuracy significantly. The flexibility of model architecture provides us a good variety of model members for ensemble learning.


Low Latency  Operations in financial services are time-sensitive.  For example, high frequency trading usually requires models to finish training and prediction within very short time periods. For deep neural network models, low latency can be guaranteed by distributed training with Horovod or distributed TensorFlow. Intel® Xeon multi-core processors, coupled with Intel’s MKL optimized TensorFlow, prove to be a good infrastructure option for such distributed training.


With these requirements in mind, we propose an ensemble learning model as in Figure 1, which is a combination of MLP (Multi-Layer Perceptron), CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) models. Because architecture topologies for MLP, CNN and LSTM are quite different, the ensemble model has a good variety in members, which helps reduce risk of overfitting and produces more reliable predictions. The member models are trained at the same time over multiple nodes with Intel® Xeon processors. If more models need to be integrated, we just add more nodes into the system so that the overall training time stays short. With neural network models and HPC power of the Intel® Xeon processors, this system meets the requirements from financial service institutions.


Figure 1: Training high accuracy ensemble model on HPC cluster with Intel® Xeon processors



Fast Training with Intel® Xeon Scalable Processors


Our tests used Dell EMC’s Zenith supercomputer which consists of 422 Dell EMC PowerEdge C6420 nodes, each with 2 Intel® Xeon Scalable Gold 6148 processors. Figure 2 shows an example of time-to-train for training MLP, CNN and LSTM models with different numbers of processes. The data set used is the 10-Year Treasury Inflation-Indexed Security data. For this example, running distributed training with 40 processes is the most efficient, primarily due to the data size in this time series is small and the neural network models we used did not have many layers. With this setting, model training can finish within 10 seconds, much faster than training the models with one processor that has only a few cores, which typically takes more than one minute. Regarding accuracy, the ensemble model can predict this interest rate with MAE (mean absolute error) less than 0.0005. Typical values for this interest rate is around 0.01, so the relative error is less than 5%.


Figure 2: Training time comparison (Each of the models is trained on Intel® Xeon processors within one node)





With both high-accuracy and low-latency being very critical for time series forecasting in financial services, neural network models trained in parallel using Intel® Xeon Scalable processors stand out as very promising options for financial institutions. And as financial institutions need to train more complicated models to forecast many time series with high accuracy at the same time, the need for parallel processing will only grow.

Picture1.jpgExplaining the relationship between machine learning and artificial intelligence is one of the most challenging concepts that I encounter when talking to people new to these topics.  I don’t pretend to have the definitive answer, but, I have developed a story that seems to get enough affirmative head shakes that I want to share it here.


The diagram above has appeared in many introductory books and articles that I’ve seen.  I have reproduced it here to highlight the challenge of talking about “subsets” of abstract concepts – none of which have widely accepted definitions. So, what does this graphic mean or imply?  How is deep learning a subset of artificial intelligence? These are the questions I’m going to try to answer by telling you a story I use for briefings on artificial intelligence during the rest of this article.

Since so many people have read about and studied examples of using deep learning for image classification, that is my starting point.  I am not however going to talk about cats and dogs, so please hang with me for a bit longer.  I’m going to use an example of facial recognition.  My scenario is that there is a secure area in a building that only 4 people (Angela, Chris, Lucy and Marie) are permitted to enter.  We want to use facial recognition to determine if someone attempting to gain access should be allowed in.  You and I can easily look at a picture and say whether it is someone we know.  But how does a deep learning model do that and how could we use the result of the model to create an artificial intelligence application?

I frequently use the picture below to discuss the use of deep neural networks for doing model training for supervised classification.  Now when looking at the network consider that the goal of all machine learning and deep learning is to transform input data into some meaningful output.  For facial recognition, the input data is a representation of the pixel intensity and color or grey scale value from a picture and the output is probability that the picture is either Angela, Chris, Lucy or Marie.  That means we are going to have to train the network using recent photos of these four people.

A highly stylized neural network representation

network.jpgThis picture above is a crude simplification of how a modern convolutional neural network (ConvNet) used for image recognition would be constructed, however, it is useful to highlight many of the important elements of what we mean by transforming raw data into meaningful outputs.  For example, each line or edge drawn between the neurons of each layer represent a weight (parameter) that must be calculated during training.  These weights are the primary mechanism used to transform the input data into something useful.  Because this picture only includes 5 layers with less than 10 nodes per layer it is easy to visualize how fully connected layers can quickly increase the number of weights that must be computed.  The ConvNets in wide spread use today typically have from 16 to 200 or more layers, although not all fully connected for the deeper designs, and can have 10's of millions to 100’s of millions of weights or more.

We need that many weights to “meaningfully” transform the input data since the image is broken down into many small regions of pixels (typically 3x3 or 5x5) before getting ingested by the input layer.  The numerical representation of the pixel values is then transformed by the weights so that the output of the transformation indicates if this region of cells adds to the evidence that this is a picture of Angela or negates the likelihood that this is Angela.  If Angela has black hair and the network does not detect many regions of solid black color, then there not be much evidence that this picture is Angela.

Finally, I want to tie everything discussed so far to an explanation of the output layer.  In the picture above, there are 4 neurons in the output layer and that is why I setup my facial recognition story to have 4 people that I am trying to recognize.  During training I have a set of pictures that have been labeled with the correct name.  One way to look at how I might do that is like this:

Table 1 - Representation of labeled training data


























The goal during training is to come up with a single set of weights that will transform the data from every picture in the training data set into a set of four values (vector) for each picture where the values match as close as possible to the labels assigned as above.  For Picture1 the first value is 1 and the other three are zeros and for Picture2 the set of 4 training values are set to zero for the first 3 elements and the fourth value is 1.  We are telling the model that we are 100% sure (probability = 1) that this is a picture of Angela and certain that it is not Chris, Lucy, or Marie (probability = 0).    The training process tries to find a set of weights that will transform the pixel data for Picture1 in to the vector (1,0,0,0) and Picture2 into the vector (0,0,0,1) and so on for the entire data set.


Of course, no deep learning model training algorithm can do that because of variations in the data so we try to get as close as possible for each input image.  The process of testing a model with known data or processing new unlabeled images is called inferencing.  When we pass in unlabeled data we get back a list of four probabilities that reflect the evidence in the data that the image is one of the four know people, for example we might get something back like (.5, .25, .15, .1).  For most classification algorithms the set of probabilities will add to 1.  What does this result tell us?


Our model says we are most confident that the unlabeled picture is Angela since that is the outcome with the highest probability, but, it also tells us that we can only be 50% sure that it is not one of the other three people.  What does it mean if we get an inference result back that is (..25, .25, .25, .25)?  This result tells us the model can’t do better than a random process like picking a number between 1 and 4.  This picture could be anyone of our known people or it could be a picture of a truck.  The model provides us no information.  How intelligent is that?  This is where the connection with artificial intelligence gets interesting.


What we like to achieve is getting back inference predictions where one value is very close to 1 and all the others are very close to zero.  Then we have high confidence that person requesting access to a restricted area is one of our authorized employees.  That is rarely the case, so we must deal with uncertainty in our applications that use our trained machine learning models.  If the area that we are securing is the executive dining room then perhaps we want to open the door even if we are only 50% sure that the person requesting access is one of our known people.  If the application is securing access to sensitive computer and communication equipment, then perhaps we want to set a threshold of 90% certainty before we unlock the door.  The important point is that machine learning is usually not sufficient alone to build an intelligent application.  Therefore, fear that the machines are going to get smarter than people and therefore be able to make “better” decisions is still a long way off, maybe a very long way…


Phil Hummel




The process of training a deep neural network is akin to finding the minimum of a function in a very high-dimensional space. Deep neural networks are usually trained using stochastic gradient descent (or one of its variants). A small batch (usually 16-512), randomly sampled from the training set, is used to approximate the gradients of the loss function (the optimization objective) with respect to the weights. The computed gradient is essentially an average of the gradients for each data-point in the batch. The natural way to parallelize the training across multiple nodes/workers is to increase the batch size and have each node compute the gradients on a different chunk of the batch. Distributed deep learning differs from traditional HPC workloads where scaling out only affects how the computation is distributed but not the outcome.


Challenges of large-batch training

It has been consistently observed that the use of large batches leads to poor generalization performance, meaning that models trained with large batches perform poorly on test data. One of the primary reason for this is that large batches tend to converge to sharp minima of the training function, which tend to generalize less well. Small batches tend to favor flat minima that result in better generalization [1]. The stochasticity afforded by small batches encourages the weights to escape the basins of attraction of sharp minima. Also, models trained with small batches are shown to converge farther away from the starting point. Large batches tend to be attracted to the minimum closest to the starting point and lack the explorative properties of small batches.

The number of gradient updates per pass of the data is reduced when using large batches. This is sometimes compensated by scaling the learning rate with the batch size. But simply using a higher learning rate can destabilize the training. Another approach is to just train the model longer, but this can lead to overfitting. Thus, there’s much more to distributed training than just scaling out to multiple nodes.


An illustration showing how sharp minima lead to poor generalization. The sharp minimum of the training function corresponds to a maximum of the testing function which hurts the model's performance on test data. [1]

How can we make large batches work?

There has been a lot of interesting research recently in making large-batch training more feasible. The training time for ImageNet has now been reduced from weeks to minutes by using batches as large as 32K without sacrificing accuracy. The following methods are known to alleviate some of the problems described above:


  1. Scaling the learning rate [2]
    The learning rate is multiplied by k, when the batch size is multiplied by k. However, this rule does not hold in the first few epochs of the training since the weights are changing rapidly. This can be alleviated by using a warm-up phase. The idea is to start with a small value of the learning rate and gradually ramp up to the linearly scaled value.

  2. Layer-wise adaptive rate scaling [3]
    A different learning rate is used for each layer. A global learning rate is chosen and it is scaled for each layer by the ratio of the Euclidean norm of the weights to Euclidean norm of the gradients for that layer.

  3. Using regular SGD with momentum rather than Adam
    Adam is known to make convergence faster and more stable. It is usually the default optimizer choice when training deep models. However, Adam seems to settle to less optimal minima, especially when using large batches. Using regular SGD with momentum, although more noisy than Adam, has shown improved generalization.

  4. Topologies also make a difference
    In a previous blog post, my colleague Luke showed how using VGG16 instead of DenseNet121 considerably sped up the training for a model that identified thoracic pathologies from chest x-rays while improving area under ROC in multiple categories. Shallow models are usually easier to train, especially when using large batches.



Large-batch distributed training can significantly reduce training time but it comes with its own challenges. Improving generalization when using large batches is an active area of research, and as new methods are developed, the time to train a model will keep going down.


  1. On large-batch training for deep learning: Generalization gap and sharp minima. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter. 2016. arXiv preprint arXiv:1609.04836.
  2. Accurate, large minibatch SGD: Training imagenet. Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. arXiv preprint arXiv:1706.02677.
  3. Large Batch Training of Convolutional Networks . Yang You, Igor Gitman, Boris Ginsburg. 2017. arXiv preprint arXiv:1708.03888.



The potential of neural networks to transform healthcare is evident. From image classification to dictation and translation, neural networks are achieving or exceeding human capabilities. And they are only getting better at these tasks as the quantity of data increases.


But there’s another way in which neural networks can potentially transform the healthcare industry: Knowledge can be replicated at virtually no cost. Take radiology as an example: To train 100 radiologists, you must teach each individual person the skills necessary to identify diseases in x-ray images of patients’ bodies. To make 100 AI-enabled radiologist assistants, you take the neural network model you trained to read x-ray images and load it into 100 different devices.


The hurdle is training the model. It takes a large amount of cleaned, curated, labeled data to train an image classification model. Once you’ve prepared the training data, it can take days, weeks, or even months to train a neural network. Even once you’ve trained a neural network model, it might not be smart enough to perform the desired task. So, you try again. And again. Eventually, you will train a model that passes the test and can be used out in the world.

neural-network-workflow.pngWorkflow for Developing Neural Network Models


In this post, I’m going to talk about how to reduce the time spent in the Train/Test/Tune cycle by speeding up the training portion with distributed deep learning, using a test case we developed in Dell EMC’s HPC and AI Innovation Lab to classify pathologies in chest x-ray images. Through a combination of distributed deep learning, optimizer selection, and neural network topology selection, we were able to not only speed the process of training models from days to minutes, we were also able to improve the classification accuracy significantly.


Starting Point: Stanford University’s CheXNet

We began by surveying the landscape of AI projects in healthcare, and Andrew Ng’s group at Stanford University provided our starting point. CheXNet was a project to demonstrate a neural network’s ability to accurately classify cases of pneumonia in chest x-ray images.

The dataset that Stanford used was ChestXray14, which was developed and made available by the United States’ National Institutes of Health (NIH). The dataset contains over 120,000 images of frontal chest x-rays, each potentially labeled with one or more of fourteen different thoracic pathologies. The data set is very unbalanced, with more than half of the data set images having no listed pathologies.


Stanford decided to use DenseNet, a neural network topology which had just been announced as the Best Paper at the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), to solve the problem. The DenseNet topology is a deep network of repeating blocks over convolutions linked with residual connections. Blocks end with a batch normalization, followed by some additional convolution and pooling to link the blocks. At the end of the network, a fully connected layer is used to perform the classification.


An Illustration of the DenseNet Topology (source: Kaggle)


Stanford’s team used a DenseNet topology with the layer weights pretrained on ImageNet and replaced the original ImageNet classification layer with a new fully connected layer of 14 neurons, one for each pathology in the ChestXray14 dataset.


Building CheXNet in Keras

It’s sounds like it would be difficult to setup. Thankfully, Keras (provided with TensorFlow) provides a simple, straightforward way of taking standard neural network topologies and bolting-on new classification layers.


from tensorflow import keras
from keras.applications import DenseNet121

orig_net = DenseNet121(include_top=False, weights=’imagenet’, input_shape=(256,256,3))

Importing the base DenseNet Topology using Keras


In this code snippet, we are importing the original DenseNet neural network (DenseNet121) and removing the classification layer with the include_top=False argument. We also automatically import the pretrained ImageNet weights and set the image size to              256x256, with 3 channels (red, green, blue).


With the original network imported, we can begin to construct the classification layer. If you look at the illustration of DenseNet above, you will notice that the classification layer is preceded by a pooling layer. We can add this pooling layer back to the new network with a single Keras function call, and we can call the resulting topology the neural network's filters, or the part of the neural network which extracts all the key features used for classification.


from keras.layers import GlobalAveragePooling2D

filters = GlobalAveragePooling2D()(orig_net.output)

Finalizing the Network Feature Filters with a Pooling Layer


The next task is to define the classification layer. The ChestXray14 dataset has 14 labeled pathologies, so we have one neuron for each label. We also activate each neuron with the sigmoid activation function, and use the output of the feature filter portion of our network as the input to the classifiers.


from keras.layers import Dense

classifiers = Dense(14, activation=’sigmoid’, bias_initializer=’ones’)(filters)

Defining the Classification Layer


The choice of sigmoid as an activation function is due to the multi-label nature of the data set. For problems where only one label ever applies to a given image (e.g., dog, cat, sandwich), a softmax activation would be preferable. In the case of ChestXray14, images can show signs of multiple pathologies, and the model should rightfully identify high probabilities for multiple classifications when appropriate.


Finally, we can put the feature filters and the classifiers together to create a single, trainable model.


from keras.models import Model

chexnet = Model(inputs=orig_net.inputs, outputs=classifiers)

The Final CheXNet Model Configuration


With the final model configuration in place, the model can then be compiled and trained.


Accelerating the Train/Test/Tune Cycle with Distributed Deep Learning

To produce better models sooner, we need to accelerate the Train/Test/Tune cycle. Because testing and tuning are mostly sequential, training is the best place to look for potential optimization.


How exactly do we speed up the training process? In Accelerating Insights with Distributed Deep Learning, Michael Bennett and I discuss the three ways in which deep learning can be accelerated by distributing work and parallelizing the process:

  • Parameter server models such as in Caffe or distributed TensorFlow,
  • Ring-AllReduce approaches such as Uber’s Horovod, and
  • Hybrid approaches for Hadoop/Spark environments such as Intel BigDL.

Which approach you pick depends on your deep learning framework of choice and the compute environment that you will be using. For the tests described here we performed the training in house on the Zenith supercomputer in the Dell EMC HPC & AI Innovation Lab. The ring-allreduce approach enabled by Uber’s Horovod framework made the most sense for taking advantage of a system tuned for HPC workloads, and which takes advantage of Intel Omni-Path (OPA) networking for fast inter-node communication. The ring-allreduce approach would also be appropriate for solutions such as the Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA.


The MPI-RingAllreduce Approach to Distributed Deep Learning

Horovod is an MPI-based framework for performing reduction operations between identical copies of the otherwise sequential training script. Because it is MPI-based, you will need to be sure that an MPI compiler (mpicc) is available in the working environment before installing horovod.


Adding Horovod to a Keras-defined Model

Adding Horovod to any Keras-defined neural network model only requires a few code modifications:

  1. Initializing the MPI environment,
  2. Broadcasting initial random weights or checkpoint weights to all workers,
  3. Wrapping the optimizer function to enable multi-node gradient summation,
  4. Average metrics among workers, and
  5. Limiting checkpoint writing to a single worker.

Horovod also provides helper functions and callbacks for optional capabilities that are useful when performing distributed deep learning, such as learning-rate warmup/decay and metric averaging.


Initializing the MPI Environment

Initializing the MPI environment in Horovod only requires calling the init method:


import horovod.keras as hvd



This will ensure that the MPI_Init function is called, setting up the communications structure and assigning ranks to all workers.


Broadcasting Weights

Broadcasting the neuron weights is done using a callback to the Keras method. In fact, many of Horovod’s features are implemented as callbacks to, so it’s worthwhile to define a callback list object for holding all the callbacks.


callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0) ]


You’ll notice that the BroadcastGlobalVariablesCallback takes a single argument that’s been set to 0. This is the root worker, which will be responsible for reading checkpoint files or generating new initial weights, broadcasting weights at the beginning of the training run, and writing checkpoint files periodically so that work is not lost if a training job fails or terminates.


Wrapping the Optimizer Function

The optimizer function must be wrapped so that it can aggregate error information from all workers before executing. Horovod’s DistributedOptimizer function can wrap any optimizer which inherits Keras’ base Optimizer class, including SGD, Adam, Adadelta, Adagrad, and others.


import keras.optimizers

opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(lr=1.0))


The distributed optimizer will now use the MPI_Allgather collective to aggregate error information from training batches onto all workers, rather than collecting them only to the root worker. This allows the workers to independently update their models rather than waiting for the root to re-broadcast updated weights before beginning the next training batch.


Averaging Metrics

Between steps error metrics need to be averaged to calculate global loss. Horovod provides another callback function to do this called MetricAverageCallback.


callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0),

This will ensure that optimizations are performed on the global metrics, not the metrics local to each worker.


Writing Checkpoints from a Single Worker

When using distributed deep learning, it’s important that only one worker write checkpoint files to ensure that multiple workers writing to the same file does not produce a race condition, which could lead to checkpoint corruption.


Checkpoint writing in Keras is enabled by another callback to However, we only want to call this callback from one worker instead of all workers. By convention, we use worker 0 for this task, but technically we could use any worker for this task. The one good thing about worker 0 is that even if you decide to run your distributed deep learning job with only 1 worker, that worker will be worker 0.

callbacks = [ ... ]

if hvd.rank() == 0:


Result: A Smarter Model, Faster!

Once a neural network can be trained in a distributed fashion across multiple workers, the Train/Test/Tune cycle can be sped up dramatically.


The figure below shows exactly how dramatically. The three tests shown are the training speed of the Keras DenseNet model on a single Zenith node without distributed deep learning (far left), the Keras DenseNet model with distributed deep learning on 32 Zenith nodes (64 MPI processes, 2 MPI processes per node, center), and a Keras VGG16 version using distributed deep learning on 64 Zenith nodes (128 MPI processes, 2 MPI processes per node, far right). By using 32 nodes instead of a single node, distributed deep learning was able to provide a 47x improvement in training speed, taking the training time for 10 epochs on the ChestXray14 data set from 2 days (50 hours) to less than 2 hours!


Performance comparisons of Keras models with distributed deep learning using Horovod

The VGG variant, trained on 128 Zenith nodes, was able to complete the same number of epochs as was required for the single-node DenseNet version to train in less than an hour, although it required more epochs to train. It also, however, was able to converge to a higher-quality solution. This VGG-based model outperformed the baseline, single-node model in 4 of 14 conditions, and was able to achieve nearly 90% accuracy in classifying emphysema.


Accuracy comparison of baseline single-node DenseNet model vs VGG variant with distributed deep learning


In this post we’ve shown you how to accelerate the Train/Test/Tune cycle when developing neural network-based models by speeding up the training phase with distributed deep learning. We walked through the process of transforming a Keras-based model to take advantage of multiple nodes using the Horovod framework, and how these few simple code changes, coupled with some additional compute infrastructure, can reduce the time needed to train a model from days to minutes, allowing more time for the testing and tuning pieces of the cycle. More time for tuning means higher-quality models, which means better outcomes for patients, customers, or whomever will benefit from the deployment of your model.


Lucas A. Wilson, Ph.D. is the Lead Data Scientist in Dell EMC's HPC & AI Engineering group. (Twitter: @lucasawilson)


MNIST with Intel BigDL

Posted by MikeB@Dell Aug 10, 2018



Here at Dell EMC we just announced the general availability of the Dell EMC Ready Solutions for AI - Machine Learning with Hadoop design so I decided to write a somewhat technical article highlighting the power of doing deep learning with the Intel BigDL framework executing on Apache Spark. After a short introduction covering the components of our Hadoop design, I will walk through an example of training an image classification neural network using BigDL.  Finally I show how that example can be extended by training the model to classify a new category of images – emojis.


The combination of Apache Spark and Intel BigDL creates a powerful data science environment by enabling the compute and data intensive operations associated with creating AI to be performed on the same cluster hosting your Hadoop Data Lake and existing analytics workloads. YARN will make sure everything gets their share of the resources.  The Dell EMC Ready Solution also brings Cloudera Data Science Workbench (CDSW) to your Hadoop environment. CDSW enables secure, collaborative access to your Hadoop environment. Data Scientists can choose from a library of published engines that are preconfigured, saving time and frustration dealing with library and framework version control, interoperability etc. Dell EMC includes an engine configured with Intel BigDL and other common Python libraries.

Getting Started

To follow along with the steps in this blog post you can use the docker container included in the BigDL github repository. Installing docker is beyond the scope of this post, however you can find directions to install docker for various operating systems here (link to On my Dell XPS 9370 running Ubuntu 16.04 I use the following commands to get started:

cd ~
git clone
cd BigDL/docker/BigDL
sudo docker build --build-arg BIGDL_VERSION=0.5.0 -t bigdl/bigdl:default .
sudo docker run -it --rm -p 8080:8080 -e NotebookPort=8080 \
-e NotebookToken=”thisistoken” bigdl/bigdl:default

Tip: Notice the space + period that is at the end of our build command. That tells docker to look for our Dockerfile in the current directory.

You should now be able to point a web browser at localhost:8080 and enter the token (“thisistoken” without “” in example). You will be presented with a Jupyter notebook showing several examples of Apache Spark and Intel BigDL operations. For this post we are going to look at the neural_networks/introduction_to_mnist.ipynb notebook. You can launch the notebook by clicking the neural_networks folder and then the introduction_to_mnist.ipynb file.

This example covers the get_mnist function that is used in many of the example notebooks. By going through this we can get a view in to how easy it is to deal with the huge datasets that are necessary for training a neural network. In the get_mnist function we read our dataset and create an RDD of all of the images, sc.parallelize(train_images) loads 60,000 images in to memory ready for distributed, parallel processing to be performed on them. With Hadoop's native Yarn resource scheduler we can decide to do these operations with 1 executor/2 cpu cores or 32 executors with 16 cores each.

Tip: The resource sizing of your Spark executors should be such that executors * cores per executor * n is equal to your batch size.

This example notebook reads the images using a helper function written by the BigDL team for the mnist dataset, mnist.read_data_sets. BigDL provides us the ImageFrame API to read image files. We can read individual files or folders of images. In a clustered environment we create distributed image frames which is an RDD of ImageFeature. ImageFeature represents one individual image, storing the image as well as label and other properties using key/value.

At the end of the introduction_to_mnist notebook we see that samples are created by zipping up the images RDD with the labels RDD. We perform a map of Sample.from_ndarray to end up with an RDD of samples. A sample consists of one or more tensors representing the image and one or more tensors representing the label. This sample of tensors are the structure we need to feed our data in to a neural network.

We can close this introduct_to_mnist notebook now and we will take a look at the cnn.ipynb notebook. The notebook does a good job of explaining what is going on at each step. We use our previously covered mnist helper functions to read the data, define a LeNet model, define our optimizer and train the model. I would encourage you to go through this example once and learn what is going on. After that, go ahead and click Kernel > Restart & Clear Output.

Now that we are back to where we started, let’s look at how we can add a new class to our model. Currently the model is being trained to predict 10 classes of numbers. What if, for example, I wanted the model to be able to recognize emojis also? It is actually really simple, and if you follow along we will go through a quick example of how to train this model to also recognize emoji for a total of 11 classes to categorize.

The Example

The first step is going to be to acquire a dataset of emojis and prepare them. Add a cell after cell [2]. In this cell we will put

git clone
pip install opencv-python


Then we can insert another cell and put:

import glob
from os import listdir
from os.path import isfile, join
from import *
import cv2

path = 'emoji-dataset/emoji_imgs_V5/'
images = [cv2.imread(file) for file in glob.glob("emoji-dataset/emoji_imgs_V5/*.png")]
emoji_train = images[1:1500]
emoji_test = images[1501:]
emoji_train_rdd = sc.parallelize(emoji_train)
emoji_test_rdd = sc.parallelize(emoji_test)

Here we are taking all of our emoji images and dividing them up, then turning those lists of numpy arrays in to RDDs. This isn’t the most efficient way to divide up our dataset but it will work for this example.

Currently these are color emoji images, and our model is defined for mnist which is grayscale. This means our model is set to accept images of shape 28,28,1.  RGB or BGR images will have a shape of 28,28,3. Since we have these images in an RDD though we can use cv2 to resize and grayscale the images. Go ahead and insert another cell to run:

emoji_train_rdd = x: (cv2.resize(x, (28, 28))))
emoji_train_rdd = x: (cv2.cvtColor(x, cv2.COLOR_BGR2GRAY)))
emoji_test_rdd = x: (cv2.resize(x, (28, 28))))
emoji_test_rdd = x: (cv2.cvtColor(x, cv2.COLOR_BGR2GRAY)))
emoji_train_rdd = x: Sample.from_ndarray(x, np.array(11)))
emoji_test_rdd = x: Sample.from_ndarray(x, np.array(11)))
train_data = train_data.union(emoji_train_rdd)
test_data = test_data.union(emoji_test_rdd)

This could easily be reduced to a function, however if you want to see the transformations step by step you can just insert the below code between each line. For example we can see after the resize our image is shape (28, 28, 3) after the resize but before grayscale, and (28, 28, 1) after.

single = emoji_train_rdd.take(1)

Output is 28,28,1.

Then, in our def build_model cell we just want to change the last line so it is
lenet_model = build_model(11) . After this the layers are built to accept 11 classes.


We can see the rest of the results by running through the example. Since our emoji dataset is so small compared to the handwritten number images we will go ahead and create a smaller rdd to predict against to ensure we see our emoji samples after the optimizer is setup and the model is trained. Insert a cell after our first set of predictions and put the following in it:

sm_test_rdd = test_data.take(8)
sm_test_rdd = sc.parallelize(sm_test_rdd)
predictions = trained_model.predict(sm_test_rdd)
imshow(np.column_stack([np.array(s.features[0].to_ndarray()).reshape(28,28) for s in sm_test_rdd.take(11)]),cmap='gray'); plt.axis('off')
print 'Ground Truth labels:'
print ', '.join(str(map_groundtruth_label(s.label.to_ndarray())) for s in sm_test_rdd.take(11))
print 'Predicted labels:'
print ', '.join(str(map_predict_label(s)) for s in predictions.take(11))

We should see that our model has correctly associated the label of “10” to our emoji images and something in the range of 0-9 for the remaining images. Congratulations! We have now trained our lenet model to do more than just recognize examples of 0-9. If we had multiple individual emojis we wanted to recognize instead of just “emoji” we would simply need to provide more specific labels such as “smiling” or “laughing” to the emojis and then define our model with our new number of classes.

The original cnn.ipynb notebook trained a lenet model with 10 classes. If we compare the visualization of the weights in the convolution layers we can see that even adding just a few more images and a new class has changed these a lot.

Original Model

11 Class Model

Thanks everyone, I hope you enjoyed this blog post. You can find more information about the Dell EMC Ready Solutions for AI at this url:



The Hadoop open source initiative is, inarguably, one to the most dramatic software development success stories of the last 15 years.  In every measure of open source software activity that includes the number of contributors, number of code check-ins, number of issues raised and fixed, and the number of new related projects, the commitment of the open source community to the ongoing improvement and expansion of Hadoop is impressive. Dell EMC has been offering solutions that are based on our infrastructure with both Cloudera and Hortonworks Hadoop distributions since 2011.  This year, in collaboration with Intel, we are adding a new product to our analytics portfolio called the Ready Solution for AI: Machine Learning with Hadoop.  As the name implies, we are testing and validating new features for machine learning including a full set of deep learning tools and libraries with our partners Cloudera and Intel.  Traditional machine learning with Apache MLlib on Apache Spark has been widely deployed and discussed for several years.  The concept of achieving deep learning with neural network-based approaches using Apache Spark running on CPU-only clusters is relatively new. We are excited to offer articles such as this to share our findings.



The performance testing that we describe in this article was done in preparation for our launch of Ready Solutions for AI.  The details of scaling BigDL with Apache Spark on on Intel® Xeon® Scalable Clusters is an area that might be new to many data scientists and IT professionals.  In this article, we discuss:


  • Key Spark properties and where to find more information
  • Comparing the capabilities of three different Intel Xeon Scalable processors for scaling impact
  • Comparing the performance of the current Dell EMC PowerEdge 14G server platform with the previous generation
  • Sub NUMA clustering background and performance comparison


We describe our testing environment and show the baseline performance results against which we compared software and hardware configuration changes. We also provide a summary of findings for Ready Solution for AI: Machine Learning with Hadoop.


Our performance testing environment


The following table provides configuration information for worker nodes used in our Cloudera Hadoop CDH cluster for this analysis:


ComponentCDH cluster with Intel Xeon Platinum processorCDH cluster with Intel Xeon Gold 6000 series processorCDH cluster with Intel Xeon Gold 5000 series processor
Server ModelPowerEdge R740xdPowerEdge R740xdPowerEdge R740xd
ProcessorIntel Xeon Platinum 8160 CPU @ 2.10GHzIntel Xeon Gold 6148 CPU @ 2.40GHzIntel Xeon Gold 5120 CPU @ 2.20GHz
Memory768 GB768 GB768 GB
Hard drive(6) 240 GB SSD and (12) 4TB SATA HDD(6) 240 GB SSD and (12) 4TB SATA HDD(2) 240 GB SSD and (4) 1.2TB SAS HDD
Storage controllerPERC H740P Mini (Embedded)PERC H740P Mini (Embedded)Dell HBA330
Network adaptersMellanox 25GbE 2P ConnectX4LXMellanox 25GbE 2P ConnectX4LXQLogic 25GE 2P QL41262 Adapter
Operating SystemRHEL 7.3RHEL 7.3RHEL 7.3
Apache Spark version2.12.12.1
BigDL version0.


Configuring Apache Spark properties


To the get the maximum performance from Spark for running deep learning workloads, there are several properties that must be understood. The following list describes key Apache Spark parameters:


  • Number of executors—Corresponds to the number of nodes on which you want the application to run.  Consider the number of nodes running the Spark service, the number of other Spark jobs that must run concurrently, and the size of the training problem that you are submit.
  • Executor cores—Number of cores to use on each executor. This number can closely match the number of physical cores on the host, assuming that the application does not share the nodes with other applications.
  • Executor memory—Amount of memory to use per executor process. For deep learning applications, the amount of memory allocated for executor memory is dependent on the dataset and the model.  There are several checks and best practices that can help you avoid Spark "out of memory" issues while loading or processing data, however, we do not cover these details in this article.
  • Deployment mode—Deployment mode for Spark driver programs. The deployment mode can be either "client" or "cluster" mode.  To launch the driver program locally use client mode and to deploy the application remotely use cluster mode from one of the nodes inside the cluster. If the Spark application is launched from the Cloudera Data Science Workbench, deployment mode defaults to client mode.


For detailed information about all Spark properties, see Apache Spark Configuration.


Scaling with BigDL


To study scaling of BigDL on Spark, we ran two popular image classification models:

  • Inception V1Inception is an image classification deep neural network introduced in 2014. Inception V1 was one of the first CNN classifiers that introduced the concept of a wider network (multiple convolution operations in the same layer). It remains a popular model for performance characterization and other studies.
  • ResNet-50ResNet is an image classification deep neural network introduced in 2015. ResNet introduced a novel architecture of shortcut connections that skip one or more layers.


ImageNet, the publicly available hierarchical image database, is used as the dataset. In our experiments, we found that ResNet-50 is computationally more intensive compared to Inception V1. We used a learning rate = 0.0898 and a weight decay = 0.001 for both models. Batch size was selected by using best practices for deep learning on Spark.  Details about sizing batches are described later in the article.  Because we were interested in scaling only and not in accuracy, we ensured that our calculated average throughput (images per second) were stable and stopped training after two epochs.  The following table shows the Spark parameters that were used as well as the throughput observed:



4 Nodes

8 Nodes

16 Nodes

Number of executors




Batch size                  









350 GB for Inception

600 GB for ResNet

350 GB for Inception  600 GB for ResNet

350 GB for Inception  600 GB for ResNet

Inception – observed throughput (images/second)




ResNet – observed throughput (images/second)





The following figure shows that deep learning models scale almost linearly when we increase the number of nodes.



Calculating batch size for deep learning on Spark


When training deep learning models using neural networks, the number of training samples is typically large compared to the amount of data that can be processed in one pass. Batch size refers to the number of training examples that are used to train the network during one iteration. The model parameters are updated at the end of each iteration. Lower batch size means that the model parameters are updated more frequently.


For BigDL to process efficiently on Spark, batch size must be a multiple of the number of executors multiplied by the number of executor cores. Increasing the batch size might have an impact on the accuracy of the model. The impact depends on various factors that include the model that is used, the dataset, and the range of the batch size. A model with a large number of batch normalization layers might experience a greater impact on accuracy when changing the batch size. Various hyper-parameter tuning strategies can be applied to improve the accuracy when using large batch sizes in training.


Intel Xeon Scalable processor model comparison

We compared the performance of the image classification training workload on three Intel processor models. The following table provides key details of the processor models.


Intel Xeon Platinum 8160 CPUIntel Xeon Gold 6148 CPUIntel Xeon Gold 5120 CPU
Base Frequency2.10 GHz2.40 GHz2.20 GHz
Number of cores242014
Number of AVX-512 FMA Units221


The following figure shows the throughput in images per second of the image classification models for training on the three different Intel processors. The experiment was performed on four-node clusters.


The following figure shows the throughput in images per second of the image classification models for training on the three different Intel processors. The experiment was performed on four-node clusters.



The Intel Xeon Gold 6148 processor slightly outperforms the Intel Xeon Platinum 8160 processor for the Inception V1 workload because it has higher clock frequency. However, for a computationally more intensive workload, the Intel Xeon Platinum 8160 processor outperforms the Xeon Gold 6148 processor.


Both the Intel Xeon Gold 6148 and Intel Xeon Platinum 8160 processors significantly outperform the Intel Xeon Gold 5120 processor. The primary reason for this performance is the number of Intel Advance Vector Extension units (AVX-512 FMA) in these processors.

Intel AVX-512 is a set of new instructions that can accelerate performance for workloads and uses such as deep learning applications. Intel AVX-512 provides ultra-wide 512-bit vector operations capabilities to increase the performance of applications that use identical operations on multiple datasets.


The Intel Xeon Gold 6000 or Intel Xeon Platinum series are recommended for deep learning applications that are implemented in BigDL and Spark.

Comparing deep learning performance across Dell server generations


We compared the performance of the image classification training workload on three Intel processor models, across different server generations. The following table provides key details of the processor models.



Dell EMC PowerEdge R730xd Server with Intel Xeon E5-2683 v3

Dell EMC PowerEdge R730xd with Intel Xeon E5-2698 v4

Dell EMC PowerEdge R740xd with Intel Xeon Platinum 8160 


(2) Xeon E5-2683 v3

(2) Xeon E5-2698 v4

(2) Xeon Platinum 8160


512 GB

512 GB

768 GB

Hard drive

(2) 480 GB SSD and (12) 1.2TB SATA HDD

(2) 480 GB SSD and (12) 1.2TB SATA HDD

(6) 240 GB SSD and (12) 4TB SATA HDD

RAID controller

PERC R730 Mini

PERC R730 Mini

PERC H740P Mini (Embedded)

Network adapters

Intel Ethernet 10G 4P X520/I350 rNDC -

Intel Ethernet 10G 4P X520/I350 rNDC -

Mellanox 25GbE 2P ConnectX4LX

Operating system

RHEL 7.3

RHEL 7.3

RHEL 7.3

The following figure shows the throughput in images per second of the image classification models for training on the three different Intel processors across Dell EMC PowerEdge server generations. The experiment was performed on four-node clusters.

We observed that the PowerEdge R740xd server with the Intel Xeon Platinum 8160 processor performs 44 percent better than the PowerEdge R730xd server with the Xeon E5-2683 v3 processor and 28 percent better than the PowerEdge R730xd server with the Xeon E5-2698 v4 processor with an Inception V1 workload under the settings noted earlier.




Sub NUMA Clustering Comparison


Sub NUMA clustering (SNC) is a feature available in Intel Xeon Scalable processors. When SNC is enabled, each socket is divided into two NUMA domains, each with half the physical cores and half the memory of the socket. SNC is similar to the Cluster-on-Die option that was available in the Xeon E5-2600 v3 and v4 processors with improvements to remote socket access. At the operating system level, a dual socket server that has SNC enabled displays four NUMA domains.  The following table shows the measured performance impact when enabling SNC.




We found only a 5 percent improvement in performance with SNC enabled, which is within the run-to-run variation in images per second across all the tests we performed.  Sub NUMA clustering improves performance for applications that have high memory locality. Not all deep learning applications and models have high memory locality. We recommend that you either leave Sub NUMA clustering disabled (the default BIOS setting) or disable it.

Ready Solutions for AI


For this Ready Solution, we leverage Cloudera’s new Data Science Workbench, which delivers a self-service experience for data scientists.  Data Science Workbench provides multi-language support that includes Python, R, and Scala directly in a web browser. Data scientists can manage their own analytics pipelines, including built-in scheduling, monitoring, and email alerts in an environment that is tightly coupled with a traditional CDH cluster.

We also incorporate Intel BigDL, a full-featured deep learning framework that leverages Spark running on the CDH cluster.  The combination of Spark and BigDL provides a high performance approach to deep learning training and inference on clusters of Intel Xeon Scalable processors.




It takes a wide range of skills to be successful with high performance data analytics. Practitioners must understand the details of both hardware and software components as well as their interactions. In this article, we document findings from performance tests that cover both the hardware and software components of BigDL with Spark.  Our findings show that:

  • BigDL scales almost linearly for the Inception and ResNet image classification models for up to a 16-node limit, which we tested.
  • Intel Xeon Scalable processors with two AVX-512 (Gold 6000 series and Platinum series) units perform significantly better than processors with one AVX-512 unit (Gold 5000 series and earlier).
  • The PowerEdge server default BIOS setting for Sub NUMA clustering (disabled) is recommended for Deep Learning applications using BigDL.

Authors: Bala Chandrasekaran, Phil Hummel, and Leela Uppuluri