The Hadoop open source initiative is one of the most dramatic software development success stories of the last 15 years. By every measure of open source software activity, including the number of contributors, code check-ins, issues raised and fixed, and new related projects, the commitment of the open source community to the ongoing improvement and expansion of Hadoop is impressive.

Since 2011, Dell EMC has offered solutions based on our infrastructure with both the Cloudera and Hortonworks Hadoop distributions. This year, in collaboration with Intel, we are adding a new product to our analytics portfolio called the Ready Solution for AI: Machine Learning with Hadoop. As the name implies, we are testing and validating new features for machine learning, including a full set of deep learning tools and libraries, with our partners Cloudera and Intel. Traditional machine learning with MLlib on Apache Spark has been widely deployed and discussed for several years. The concept of achieving deep learning with neural network-based approaches using Apache Spark running on CPU-only clusters is relatively new. We are excited to offer articles such as this one to share our findings.



The performance testing that we describe in this article was done in preparation for our launch of Ready Solutions for AI. The details of scaling BigDL with Apache Spark on Intel® Xeon® Scalable clusters might be new to many data scientists and IT professionals. In this article, we discuss:


  • Key Spark properties and where to find more information
  • Comparing the capabilities of three different Intel Xeon Scalable processors for scaling impact
  • Comparing the performance of the current Dell EMC PowerEdge 14G server platform with the previous generation
  • Sub NUMA clustering background and performance comparison


We describe our testing environment and show the baseline performance results against which we compared software and hardware configuration changes. We also provide a summary of findings for Ready Solution for AI: Machine Learning with Hadoop.


Our performance testing environment


The following table provides configuration information for worker nodes used in our Cloudera Hadoop CDH cluster for this analysis:


| Component | CDH cluster with Intel Xeon Platinum processor | CDH cluster with Intel Xeon Gold 6000 series processor | CDH cluster with Intel Xeon Gold 5000 series processor |
| --- | --- | --- | --- |
| Server model | PowerEdge R740xd | PowerEdge R740xd | PowerEdge R740xd |
| Processor | Intel Xeon Platinum 8160 CPU @ 2.10 GHz | Intel Xeon Gold 6148 CPU @ 2.40 GHz | Intel Xeon Gold 5120 CPU @ 2.20 GHz |
| Memory | 768 GB | 768 GB | 768 GB |
| Hard drive | (6) 240 GB SSD and (12) 4 TB SATA HDD | (6) 240 GB SSD and (12) 4 TB SATA HDD | (2) 240 GB SSD and (4) 1.2 TB SAS HDD |
| Storage controller | PERC H740P Mini (Embedded) | PERC H740P Mini (Embedded) | Dell HBA330 |
| Network adapters | Mellanox 25GbE 2P ConnectX4LX | Mellanox 25GbE 2P ConnectX4LX | QLogic 25GE 2P QL41262 Adapter |
| Operating system | RHEL 7.3 | RHEL 7.3 | RHEL 7.3 |
| Apache Spark version | 2.1 | 2.1 | 2.1 |
| BigDL version | 0. | | |


Configuring Apache Spark properties


To get the maximum performance from Spark when running deep learning workloads, you must understand several properties. The following list describes key Apache Spark parameters:


  • Number of executors: Corresponds to the number of nodes on which you want the application to run. Consider the number of nodes running the Spark service, the number of other Spark jobs that must run concurrently, and the size of the training problem that you are submitting.
  • Executor cores: Number of cores to use on each executor. This number can closely match the number of physical cores on the host, assuming that the application does not share the nodes with other applications.
  • Executor memory: Amount of memory to use per executor process. For deep learning applications, the amount of memory allocated for executor memory depends on the dataset and the model. Several checks and best practices can help you avoid Spark "out of memory" issues while loading or processing data, but we do not cover those details in this article.
  • Deployment mode: Deployment mode for Spark driver programs, either "client" or "cluster". Use client mode to launch the driver program locally, and use cluster mode to run the driver on one of the nodes inside the cluster. If the Spark application is launched from the Cloudera Data Science Workbench, deployment mode defaults to client mode.


For detailed information about all Spark properties, see Apache Spark Configuration.
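
As a rough illustration of how the four properties described above map onto a job submission, the following Python sketch assembles a spark-submit command line for a YARN-managed cluster. The helper name build_spark_submit_args and the script name train.py are hypothetical, and the example values are placeholders rather than recommended settings; only the spark-submit options themselves are standard.

```python
from typing import List


def build_spark_submit_args(
    num_executors: int,
    executor_cores: int,
    executor_memory: str,
    deploy_mode: str = "client",
    app: str = "train.py",  # hypothetical application script
) -> List[str]:
    """Assemble a spark-submit command line for a YARN cluster."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,                # where the driver runs
        "--num-executors", str(num_executors),       # typically one per worker node
        "--executor-cores", str(executor_cores),     # cores used by each executor
        "--executor-memory", executor_memory,        # for example "350g"
        app,
    ]


# Placeholder values: 4 executors with 24 cores and 350 GB of memory each.
print(" ".join(build_spark_submit_args(4, 24, "350g")))
```

Keeping the values in one place like this makes it easier to change the executor count and the batch size together when the cluster size changes.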


Scaling with BigDL


To study scaling of BigDL on Spark, we ran two popular image classification models:

  • Inception V1: Inception is an image classification deep neural network introduced in 2014. Inception V1 was one of the first CNN classifiers that introduced the concept of a wider network (multiple convolution operations in the same layer). It remains a popular model for performance characterization and other studies.
  • ResNet-50: ResNet is an image classification deep neural network introduced in 2015. ResNet introduced a novel architecture of shortcut connections that skip one or more layers.


ImageNet, the publicly available hierarchical image database, is used as the dataset. In our experiments, we found that ResNet-50 is more computationally intensive than Inception V1. We used a learning rate of 0.0898 and a weight decay of 0.001 for both models. Batch size was selected by using best practices for deep learning on Spark. Details about sizing batches are described later in the article. Because we were interested only in scaling and not in accuracy, we confirmed that our calculated average throughput (images per second) was stable and stopped training after two epochs. The following table shows the Spark parameters that were used as well as the throughput observed:



| | 4 Nodes | 8 Nodes | 16 Nodes |
| --- | --- | --- | --- |
| Number of executors | | | |
| Batch size | | | |
| Executor memory | 350 GB for Inception, 600 GB for ResNet | 350 GB for Inception, 600 GB for ResNet | 350 GB for Inception, 600 GB for ResNet |
| Inception: observed throughput (images/second) | | | |
| ResNet: observed throughput (images/second) | | | |





The following figure shows that deep learning models scale almost linearly when we increase the number of nodes.
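
One way to quantify "almost linearly" is scaling efficiency: the measured speedup divided by the ideal speedup, relative to the smallest cluster. The following sketch shows the arithmetic. The function name is our own, and the throughput values are placeholders chosen for illustration, not measurements from these tests.

```python
from typing import Dict


def scaling_efficiency(throughput_by_nodes: Dict[int, float]) -> Dict[int, float]:
    """Return measured speedup / ideal speedup, relative to the smallest cluster."""
    base_nodes = min(throughput_by_nodes)
    base_throughput = throughput_by_nodes[base_nodes]
    return {
        nodes: (throughput / base_throughput) / (nodes / base_nodes)
        for nodes, throughput in sorted(throughput_by_nodes.items())
    }


# Placeholder throughputs in images/second, NOT results from this article.
example = {4: 1000.0, 8: 1900.0, 16: 3600.0}
for nodes, efficiency in scaling_efficiency(example).items():
    print(f"{nodes} nodes: {efficiency:.0%} of linear scaling")
```

An efficiency of 1.0 means perfectly linear scaling; values slightly below 1.0 indicate the communication and synchronization overhead added by each doubling of the cluster.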



Calculating batch size for deep learning on Spark


When training deep learning models using neural networks, the number of training samples is typically large compared to the amount of data that can be processed in one pass. Batch size refers to the number of training examples that are used to train the network during one iteration. The model parameters are updated at the end of each iteration. Lower batch size means that the model parameters are updated more frequently.


For BigDL to process efficiently on Spark, batch size must be a multiple of the number of executors multiplied by the number of executor cores. Increasing the batch size might have an impact on the accuracy of the model. The impact depends on various factors that include the model that is used, the dataset, and the range of the batch size. A model with a large number of batch normalization layers might experience a greater impact on accuracy when changing the batch size. Various hyper-parameter tuning strategies can be applied to improve the accuracy when using large batch sizes in training.
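
The multiple-of-cores rule above is simple arithmetic, sketched below in Python. The helper names are hypothetical and are not part of the BigDL API, and the executor and core counts in the example are placeholders.

```python
def is_valid_batch_size(batch_size: int, num_executors: int, executor_cores: int) -> bool:
    """Check that batch_size is a positive multiple of executors x cores."""
    unit = num_executors * executor_cores
    return batch_size > 0 and batch_size % unit == 0


def round_up_batch_size(target: int, num_executors: int, executor_cores: int) -> int:
    """Round a target batch size up to the nearest multiple of executors x cores."""
    unit = num_executors * executor_cores
    if target <= 0 or unit <= 0:
        raise ValueError("target, num_executors, and executor_cores must be positive")
    return -(-target // unit) * unit  # ceiling division, then scale back up


# Placeholder cluster: 4 executors x 24 cores = 96 cores in total.
batch = round_up_batch_size(1000, num_executors=4, executor_cores=24)
print(batch, is_valid_batch_size(batch, 4, 24))
```

Because the unit grows with the cluster, the same target batch size can round to different values at 4, 8, and 16 nodes, which is worth keeping in mind when comparing accuracy across cluster sizes.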


Intel Xeon Scalable processor model comparison

We compared the performance of the image classification training workload on three Intel processor models. The following table provides key details of the processor models.


| | Intel Xeon Platinum 8160 CPU | Intel Xeon Gold 6148 CPU | Intel Xeon Gold 5120 CPU |
| --- | --- | --- | --- |
| Base frequency | 2.10 GHz | 2.40 GHz | 2.20 GHz |
| Number of cores | 24 | 20 | 14 |
| Number of AVX-512 FMA units | 2 | 2 | 1 |


The following figure shows the throughput in images per second of the image classification models for training on the three different Intel processors. The experiment was performed on four-node clusters.





The Intel Xeon Gold 6148 processor slightly outperforms the Intel Xeon Platinum 8160 processor for the Inception V1 workload because it has a higher clock frequency. However, for the more computationally intensive ResNet-50 workload, the Intel Xeon Platinum 8160 processor outperforms the Intel Xeon Gold 6148 processor.


Both the Intel Xeon Gold 6148 and Intel Xeon Platinum 8160 processors significantly outperform the Intel Xeon Gold 5120 processor. The primary reason for this difference is the number of Intel Advanced Vector Extensions 512 (AVX-512) FMA units in these processors.

Intel AVX-512 is a set of new instructions that can accelerate workloads such as deep learning applications. It provides ultra-wide 512-bit vector operations that increase the performance of applications that apply identical operations to multiple data elements.
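
To see why the number of FMA units matters, consider a back-of-the-envelope estimate of theoretical peak single-precision throughput per socket, using the core counts and base frequencies from the table above. This sketch assumes 32-bit floats (16 per 512-bit vector) and counts a fused multiply-add as two floating-point operations. It is an upper bound only: processors typically run below base frequency under heavy AVX-512 load, and the function name is our own.

```python
def peak_sp_gflops(cores: int, base_ghz: float, fma_units: int) -> float:
    """Theoretical peak single-precision GFLOPS per socket at base frequency."""
    lanes = 512 // 32                         # 16 single-precision values per vector
    flops_per_cycle = lanes * 2 * fma_units   # one FMA = multiply + add = 2 FLOPs
    return cores * base_ghz * flops_per_cycle


# Core counts, base frequencies, and FMA unit counts from the table above.
processors = {
    "Platinum 8160": (24, 2.10, 2),
    "Gold 6148": (20, 2.40, 2),
    "Gold 5120": (14, 2.20, 1),
}
for name, (cores, ghz, fma) in processors.items():
    print(f"{name}: about {peak_sp_gflops(cores, ghz, fma):.0f} GFLOPS peak")
```

Under these assumptions, the Gold 5120 has roughly one-third of the theoretical peak of the other two processors, because it has both fewer cores and half the FMA units per core.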


The Intel Xeon Gold 6000 or Intel Xeon Platinum series are recommended for deep learning applications that are implemented in BigDL and Spark.

Comparing deep learning performance across Dell server generations


We compared the performance of the image classification training workload on three Intel processor models, across different server generations. The following table provides key details of the processor models.



| | Dell EMC PowerEdge R730xd with Intel Xeon E5-2683 v3 | Dell EMC PowerEdge R730xd with Intel Xeon E5-2698 v4 | Dell EMC PowerEdge R740xd with Intel Xeon Platinum 8160 |
| --- | --- | --- | --- |
| Processor | (2) Xeon E5-2683 v3 | (2) Xeon E5-2698 v4 | (2) Xeon Platinum 8160 |
| Memory | 512 GB | 512 GB | 768 GB |
| Hard drive | (2) 480 GB SSD and (12) 1.2 TB SATA HDD | (2) 480 GB SSD and (12) 1.2 TB SATA HDD | (6) 240 GB SSD and (12) 4 TB SATA HDD |
| RAID controller | PERC H730 Mini | PERC H730 Mini | PERC H740P Mini (Embedded) |
| Network adapters | Intel Ethernet 10G 4P X520/I350 rNDC | Intel Ethernet 10G 4P X520/I350 rNDC | Mellanox 25GbE 2P ConnectX4LX |
| Operating system | RHEL 7.3 | RHEL 7.3 | RHEL 7.3 |

The following figure shows the throughput in images per second of the image classification models for training on the three different Intel processors across Dell EMC PowerEdge server generations. The experiment was performed on four-node clusters.

We observed that the PowerEdge R740xd server with the Intel Xeon Platinum 8160 processor performs 44 percent better than the PowerEdge R730xd server with the Xeon E5-2683 v3 processor and 28 percent better than the PowerEdge R730xd server with the Xeon E5-2698 v4 processor with an Inception V1 workload under the settings noted earlier.




Sub NUMA clustering comparison


Sub NUMA clustering (SNC) is a feature available in Intel Xeon Scalable processors. When SNC is enabled, each socket is divided into two NUMA domains, each with half the physical cores and half the memory of the socket. SNC is similar to the Cluster-on-Die option that was available in the Xeon E5-2600 v3 and v4 processors with improvements to remote socket access. At the operating system level, a dual socket server that has SNC enabled displays four NUMA domains.  The following table shows the measured performance impact when enabling SNC.
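
The effect of SNC on the NUMA topology can be worked out with simple arithmetic. The sketch below assumes that memory is populated evenly across sockets; the type and function names are our own and serve only to illustrate the split described above.

```python
from typing import NamedTuple


class NumaLayout(NamedTuple):
    domains: int
    cores_per_domain: int
    memory_gb_per_domain: float


def numa_layout(sockets: int, cores_per_socket: int,
                total_memory_gb: float, snc_enabled: bool) -> NumaLayout:
    """Estimate the NUMA domains that the operating system would display."""
    domains_per_socket = 2 if snc_enabled else 1  # SNC splits each socket in two
    domains = sockets * domains_per_socket
    return NumaLayout(
        domains=domains,
        cores_per_domain=cores_per_socket // domains_per_socket,
        memory_gb_per_domain=total_memory_gb / domains,  # assumes even population
    )


# Dual-socket server with 24-core processors and 768 GB of memory.
print(numa_layout(2, 24, 768, snc_enabled=False))
print(numa_layout(2, 24, 768, snc_enabled=True))
```

With SNC enabled, each domain is smaller, so an application benefits only if its threads mostly touch memory in their own domain.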




We found only a 5 percent improvement in performance with SNC enabled, which is within the run-to-run variation in images per second across all the tests we performed. Sub NUMA clustering improves performance for applications that have high memory locality, but not all deep learning applications and models have high memory locality. We recommend leaving Sub NUMA clustering disabled (the default BIOS setting), or disabling it if it has been enabled.

Ready Solutions for AI


For this Ready Solution, we leverage Cloudera’s new Data Science Workbench, which delivers a self-service experience for data scientists.  Data Science Workbench provides multi-language support that includes Python, R, and Scala directly in a web browser. Data scientists can manage their own analytics pipelines, including built-in scheduling, monitoring, and email alerts in an environment that is tightly coupled with a traditional CDH cluster.

We also incorporate Intel BigDL, a full-featured deep learning framework that leverages Spark running on the CDH cluster.  The combination of Spark and BigDL provides a high performance approach to deep learning training and inference on clusters of Intel Xeon Scalable processors.




It takes a wide range of skills to be successful with high performance data analytics. Practitioners must understand the details of both hardware and software components as well as their interactions. In this article, we document findings from performance tests that cover both the hardware and software components of BigDL with Spark.  Our findings show that:

  • BigDL scales almost linearly for the Inception and ResNet image classification models up to 16 nodes, the largest cluster that we tested.
  • Intel Xeon Scalable processors with two AVX-512 FMA units (Gold 6000 series and Platinum series) perform significantly better than processors with one AVX-512 FMA unit (Gold 5000 series and earlier).
  • The PowerEdge server default BIOS setting for Sub NUMA clustering (disabled) is recommended for deep learning applications using BigDL.

Authors: Bala Chandrasekaran, Phil Hummel, and Leela Uppuluri