
A data science project lifecycle includes data acquisition, preparation and validation, model development, model validation, productization, and monitoring. For many projects, model development and productization are iterative processes. Models in production need to be continuously monitored to ensure they are performing as expected. Each of these steps brings a unique set of challenges and requirements that are not met by traditional software development platforms. When building an AI-based application, a data science platform that manages the life cycle of data, models, and applications proves immensely valuable. The concept of a “data science workbench” that enables self-service access to datasets, compute instances, and version control is now seen as a crucial enabler of the productivity and success of data science teams.


To meet this need for our customers, Dell EMC has partnered with Domino Data Lab to offer the Domino Data Science Platform powered by Dell EMC PowerEdge servers. One of the first things I liked about Domino was its straightforward integration with Apache Spark. In the Dell EMC HPC and AI Innovation Lab, this allows us to tie our projects to existing Spark clusters, such as the Dell EMC Ready Solutions for Data Analytics with Intel Analytics Zoo, to enable AI on Spark. Here are some highlights of the features:


Reproducibility Engine: One of the key features of Domino is the Domino Reproducibility Engine. A project within Domino comprises three components: data, code, and the model. When developing and tuning a model, data scientists run multiple experiments, varying the model architecture and hyperparameters. Domino’s Reproducibility Engine captures and stores all these components. This enables accurate reproduction of the experiments and lets data scientists revisit the states or conditions in which the model produced optimal results. It also helps future collaborators iterate through the model history and understand how it was developed, even when the original researcher is no longer available.


Flexibility of compute infrastructure: Domino enables IT administrators to provision compute environments and provide them to data scientists so they can run experiments. The compute infrastructure could be a cluster of CPU nodes, accelerators, or an integration with big data frameworks such as Cloudera’s Distribution including Apache Hadoop (CDH) or Apache Spark. Data scientists can map these compute infrastructures to their projects based on their needs.


Software environments packaged as containers: Domino also enables IT administrators to capture required software packages, libraries, and dependencies as a Docker container. Domino comes pre-configured with containers for popular languages, tools, and packages such as R, Python, SAS, Jupyter, RStudio, TensorFlow, and H2O. Custom environments can be built by appending to the Dockerfile and building the container. These version-controlled environments can then be mapped to projects.
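For example, a custom environment might extend a base image with the libraries a project needs. The Dockerfile sketch below is illustrative only; the base image tag and package versions are assumptions, not values from a specific Domino release:

    # Hypothetical base image; use the one your Domino deployment provides.
    FROM dominodatalab/base:latest

    # Layer project-specific libraries on top of the base environment.
    RUN pip install --no-cache-dir BigDL==0.6.0 analytics-zoo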


Productizing models: Using Domino, a model can be published in several ways. Published models are automatically versioned, secured via access control, and highly available.

      • Models can be published as REST APIs, which applications can call directly for inference (see the sketch after this list).
      • Data scientists can create self-service web forms using Launchers, which let end users easily view and consume analytics and reports.
      • Models can also be published as interactive Flask or Shiny apps and deployed through Domino.
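As an illustration of the REST option, a published model can be invoked with a plain HTTP call. The sketch below is hypothetical: the URL shape, model identifier, token, and input fields will vary with your Domino deployment and model:

    # Hypothetical endpoint and payload; substitute your deployment's values.
    curl -X POST https://<your-domino-host>/models/<model-id>/latest/model \
      -H "Content-Type: application/json" \
      -H "Authorization: Basic <access-token>" \
      -d '{"data": {"feature1": 3.2, "feature2": "blue"}}'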

Model usage tracking: Once a model has been published into production, Domino automatically captures model usage and other details. This helps in calculating the model’s ROI and streamlines future model development.

 

Refer to this brief for other features of Domino.


Integration into Cloudera


Domino supports running training jobs using Hadoop and Spark worker nodes hosted on existing Cloudera-managed clusters.

[Figure: Domino projects running Spark jobs on a Cloudera-managed Hadoop cluster]

Hadoop and Spark integration touches three aspects of the Domino Data Science Platform:

 

  • Projects – Projects are repositories that contain data, code, models, and environment settings. Domino tracks and versions the entire project. An AI application or model typically maps to a project.
  • Environments – Environments consist of the software and hardware context in which projects run. The software context includes a Docker container that incorporates all the libraries and settings for the project; the hardware context includes the nodes where the training job runs.
  • Workspaces – Workspaces are interactive development environments (IDEs) such as Jupyter Notebook, JupyterLab, and RStudio.


Projects in Domino can be configured to access a Hadoop/Spark cluster environment. An environment is configured with the libraries, settings, and hardware needed to access the Hadoop cluster. Once it is configured, the data scientist can launch a Jupyter notebook session or other workspace and run Spark training jobs. The Spark driver program runs on the Domino executor (as shown in the figure) and spawns workers in the Hadoop cluster for the machine learning training task.
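For example, from a Jupyter session in such an environment, a data scientist might start a Spark application against the cluster along these lines. This is a minimal sketch: the resource settings are placeholders, and the YARN client configuration is assumed to be supplied by the Domino environment:

    from pyspark.sql import SparkSession

    # The driver runs on the Domino executor; YARN schedules the workers
    # on the Hadoop cluster. Resource sizes here are illustrative only.
    spark = (SparkSession.builder
             .appName("domino-spark-example")
             .master("yarn")
             .config("spark.executor.instances", "4")
             .config("spark.executor.memory", "16g")
             .getOrCreate())

    print(spark.range(1000).count())  # trivial sanity check on the cluster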

By configuring the Docker container with deep learning libraries from Analytics Zoo, such as BigDL, data scientists can request resources on demand from the Spark cluster to build data pipelines and train models.
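With BigDL in the container, initializing the BigDL engine on the cluster follows the pattern below. This is a minimal sketch based on the BigDL 0.x Python API; exact imports can differ between releases:

    from pyspark import SparkContext
    from bigdl.util.common import create_spark_conf, init_engine

    # create_spark_conf() adds the Spark settings BigDL needs, and
    # init_engine() must run before models are defined or trained.
    sc = SparkContext(conf=create_spark_conf().setAppName("bigdl-example"))
    init_engine()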

 

Running on Kubernetes


Kubernetes has become the de facto tool for container orchestration. The Domino Data Science Platform is built to run on an existing Kubernetes cluster, which streamlines deployment, maintenance, scaling, and updates. Kubernetes manages a farm of nodes, the PowerEdge servers that provide the compute resources that power jobs and runs. This enables flexibility, improved resource utilization, and version control while developing cloud-native AI applications.


Starting with version 2.3, Apache Spark supports Kubernetes natively through spark-submit. Using this method, Spark relies on the Kubernetes scheduler rather than YARN or Spark's built-in standalone scheduler. Spark supports both client mode and cluster mode with Kubernetes: the Spark driver runs outside the Kubernetes cluster in client mode or inside it in cluster mode, and it submits work to the Kubernetes scheduler, which creates the executor pods. When the application completes, the executor pods terminate and are cleaned up.
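For reference, a cluster-mode submission to Kubernetes follows the pattern introduced in Spark 2.3; the API server address, container image, and application path below are placeholders:

    spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<port> \
      --deploy-mode cluster \
      --name spark-example \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=4 \
      --conf spark.kubernetes.container.image=<spark-image> \
      local:///path/to/examples.jar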


[Figure: Spark running natively on Kubernetes]


We are demonstrating AI-assisted Radiology using distributed deep learning on Apache Spark and Analytics Zoo with the Domino Data Science Platform at the Intel booth at the Strata Data Conference (September 2019) in New York. Visit us at the Strata conference to learn more.

 

In a future post we will explore running Spark on Kubernetes in more detail. Follow this space for best practices around running Apache Spark on Kubernetes.

Introduction

 

The Hadoop open source initiative is, inarguably, one of the most dramatic software development success stories of the last 15 years. By every measure of open source software activity, including the number of contributors, the number of code check-ins, the number of issues raised and fixed, and the number of new related projects, the commitment of the open source community to the ongoing improvement and expansion of Hadoop is impressive. Dell EMC has been offering solutions based on our infrastructure with both Cloudera and Hortonworks Hadoop distributions since 2011. This year, in collaboration with Intel, we are adding a new product to our analytics portfolio called the Ready Solution for AI: Machine Learning with Hadoop. As the name implies, we are testing and validating new features for machine learning, including a full set of deep learning tools and libraries, with our partners Cloudera and Intel. Traditional machine learning with Apache MLlib on Apache Spark has been widely deployed and discussed for several years. The concept of achieving deep learning with neural network-based approaches using Apache Spark running on CPU-only clusters is relatively new, and we are excited to share our findings in articles such as this.

Overview

 

The performance testing that we describe in this article was done in preparation for the launch of our Ready Solutions for AI. The details of scaling BigDL with Apache Spark on Intel® Xeon® Scalable processor-based clusters may be new to many data scientists and IT professionals. In this article, we discuss:

 

  • Key Spark properties and where to find more information
  • Comparing the capabilities of three different Intel Xeon Scalable processors for scaling impact
  • Comparing the performance of the current Dell EMC PowerEdge 14G server platform with the previous generation
  • Sub NUMA clustering background and performance comparison

 

We describe our testing environment and show the baseline performance results against which we compared software and hardware configuration changes. We also provide a summary of findings for Ready Solution for AI: Machine Learning with Hadoop.

 

Our performance testing environment

 

The following table provides configuration information for worker nodes used in our Cloudera Hadoop CDH cluster for this analysis:

 

| Component | CDH cluster with Intel Xeon Platinum processor | CDH cluster with Intel Xeon Gold 6000 series processor | CDH cluster with Intel Xeon Gold 5000 series processor |
|---|---|---|---|
| Server model | PowerEdge R740xd | PowerEdge R740xd | PowerEdge R740xd |
| Processor | Intel Xeon Platinum 8160 CPU @ 2.10 GHz | Intel Xeon Gold 6148 CPU @ 2.40 GHz | Intel Xeon Gold 5120 CPU @ 2.20 GHz |
| Memory | 768 GB | 768 GB | 768 GB |
| Hard drive | (6) 240 GB SSD and (12) 4 TB SATA HDD | (6) 240 GB SSD and (12) 4 TB SATA HDD | (2) 240 GB SSD and (4) 1.2 TB SAS HDD |
| Storage controller | PERC H740P Mini (Embedded) | PERC H740P Mini (Embedded) | Dell HBA330 |
| Network adapters | Mellanox 25GbE 2P ConnectX4LX | Mellanox 25GbE 2P ConnectX4LX | QLogic 25GbE 2P QL41262 Adapter |
| Operating system | RHEL 7.3 | RHEL 7.3 | RHEL 7.3 |
| Apache Spark version | 2.1 | 2.1 | 2.1 |
| BigDL version | 0.6.0 | 0.6.0 | 0.6.0 |

 

Configuring Apache Spark properties

 

To get the maximum performance from Spark for running deep learning workloads, there are several properties that must be understood. The following list describes key Apache Spark parameters; a submission sketch that combines them follows the list:

 

  • Number of executors—Corresponds to the number of nodes on which you want the application to run. Consider the number of nodes running the Spark service, the number of other Spark jobs that must run concurrently, and the size of the training problem that you submit.
  • Executor cores—Number of cores to use on each executor. This number can closely match the number of physical cores on the host, assuming that the application does not share the nodes with other applications.
  • Executor memory—Amount of memory to use per executor process. For deep learning applications, the amount of memory allocated for executor memory depends on the dataset and the model. There are several checks and best practices that can help you avoid Spark "out of memory" issues while loading or processing data; however, we do not cover those details in this article.
  • Deployment mode—Deployment mode for Spark driver programs, either "client" or "cluster". In client mode, the driver program launches locally; in cluster mode, it is deployed to one of the nodes inside the cluster. If the Spark application is launched from the Cloudera Data Science Workbench, the deployment mode defaults to client mode.
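As a concrete illustration, a submission that sets all four properties might look like the following sketch. The application class and JAR name are hypothetical placeholders, while the executor count, cores, and memory echo the values used in our scaling runs later in this article:

    spark-submit \
      --master yarn \
      --deploy-mode client \
      --num-executors 4 \
      --executor-cores 46 \
      --executor-memory 350g \
      --class com.example.TrainingApp \
      training-app.jar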

 

For detailed information about all Spark properties, see Apache Spark Configuration.

 

Scaling with BigDL

 

To study scaling of BigDL on Spark, we ran two popular image classification models:

  • Inception V1—Inception is an image classification deep neural network introduced in 2014. Inception V1 was one of the first CNN classifiers that introduced the concept of a wider network (multiple convolution operations in the same layer). It remains a popular model for performance characterization and other studies.
  • ResNet-50—ResNet is an image classification deep neural network introduced in 2015. ResNet introduced a novel architecture of shortcut connections that skip one or more layers.

 

ImageNet, the publicly available hierarchical image database, was used as the dataset. In our experiments, we found that ResNet-50 is computationally more intensive than Inception V1. We used a learning rate of 0.0898 and a weight decay of 0.001 for both models. Batch size was selected by using best practices for deep learning on Spark; details about sizing batches are described later in the article. Because we were interested only in scaling and not in accuracy, we ensured that our calculated average throughput (images per second) was stable and stopped training after two epochs. The following table shows the Spark parameters that were used as well as the throughput observed:

 

 

| Parameter | 4 nodes | 8 nodes | 16 nodes |
|---|---|---|---|
| Number of executors | 4 | 8 | 16 |
| Batch size | 736 | 1472 | 2944 |
| Executor cores | 46 | 46 | 46 |
| Executor memory | 350 GB for Inception, 600 GB for ResNet | 350 GB for Inception, 600 GB for ResNet | 350 GB for Inception, 600 GB for ResNet |
| Inception V1 observed throughput (images/second) | 331 | 625 | 1244 |
| ResNet-50 observed throughput (images/second) | 95 | 186 | 348 |

 

The following figure shows that deep learning models scale almost linearly when we increase the number of nodes.

[Figure: Near-linear throughput scaling of Inception V1 and ResNet-50 with node count]

 

Calculating batch size for deep learning on Spark

 

When training deep learning models using neural networks, the number of training samples is typically large compared to the amount of data that can be processed in one pass. Batch size refers to the number of training examples that are used to train the network during one iteration. The model parameters are updated at the end of each iteration. Lower batch size means that the model parameters are updated more frequently.

 

For BigDL to process efficiently on Spark, batch size must be a multiple of the number of executors multiplied by the number of executor cores. Increasing the batch size might have an impact on the accuracy of the model. The impact depends on various factors that include the model that is used, the dataset, and the range of the batch size. A model with a large number of batch normalization layers might experience a greater impact on accuracy when changing the batch size. Various hyper-parameter tuning strategies can be applied to improve the accuracy when using large batch sizes in training.
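The batch sizes in the scaling table above follow this rule; a small sketch of the arithmetic is shown below (the multiplier of 4 matches the batch sizes used in our runs and is not a BigDL requirement):

    # BigDL requires the global batch size to be a multiple of
    # (number of executors) x (executor cores).
    def batch_size(num_executors, executor_cores, multiplier):
        return num_executors * executor_cores * multiplier

    # With 46 cores per executor and a multiplier of 4, this reproduces
    # the 736 / 1472 / 2944 batch sizes of our 4-, 8-, and 16-node runs.
    for nodes in (4, 8, 16):
        print(nodes, batch_size(nodes, 46, 4))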

 

Intel Xeon Scalable processor model comparison

We compared the performance of the image classification training workload on three Intel processor models. The following table provides key details of the processor models.

 

| | Intel Xeon Platinum 8160 CPU | Intel Xeon Gold 6148 CPU | Intel Xeon Gold 5120 CPU |
|---|---|---|---|
| Base frequency | 2.10 GHz | 2.40 GHz | 2.20 GHz |
| Number of cores | 24 | 20 | 14 |
| Number of AVX-512 FMA units | 2 | 2 | 1 |

 

The following figure shows the throughput in images per second of the image classification models for training on the three different Intel processors. The experiment was performed on four-node clusters.

 


 

[Figure: Training throughput (images/second) on the three Intel Xeon processor models]

The Intel Xeon Gold 6148 processor slightly outperforms the Intel Xeon Platinum 8160 processor on the Inception V1 workload because it has a higher clock frequency. However, on the computationally more intensive ResNet-50 workload, the Intel Xeon Platinum 8160 processor outperforms the Xeon Gold 6148 processor.

 

Both the Intel Xeon Gold 6148 and Intel Xeon Platinum 8160 processors significantly outperform the Intel Xeon Gold 5120 processor. The primary reason for this difference is the number of Intel Advanced Vector Extensions 512 fused multiply-add (AVX-512 FMA) units in these processors.

Intel AVX-512 is a set of instructions that can accelerate performance for workloads and uses such as deep learning applications. Intel AVX-512 provides ultra-wide 512-bit vector operation capabilities to increase the performance of applications that apply identical operations to multiple datasets.
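As a quick check, on Linux you can confirm that a node's processors support these instructions by inspecting the CPU flags; note that the number of FMA units per core is not exposed here and must be taken from the processor specifications:

    # Print the AVX-512 feature flags reported by the CPU (e.g. avx512f,
    # avx512cd); absence of any avx512* flag means no AVX-512 support.
    grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u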

 

Intel Xeon Gold 6000 series or Intel Xeon Platinum series processors are recommended for deep learning applications that are implemented in BigDL and Spark.

Comparing deep learning performance across Dell server generations

 

We compared the performance of the image classification training workload on three Intel processor models across different server generations. The following table provides key details of the server configurations.

 

 

| Component | PowerEdge R730xd with Intel Xeon E5-2683 v3 | PowerEdge R730xd with Intel Xeon E5-2698 v4 | PowerEdge R740xd with Intel Xeon Platinum 8160 |
|---|---|---|---|
| Processor | (2) Xeon E5-2683 v3 | (2) Xeon E5-2698 v4 | (2) Xeon Platinum 8160 |
| Memory | 512 GB | 512 GB | 768 GB |
| Hard drive | (2) 480 GB SSD and (12) 1.2 TB SATA HDD | (2) 480 GB SSD and (12) 1.2 TB SATA HDD | (6) 240 GB SSD and (12) 4 TB SATA HDD |
| RAID controller | PERC H730 Mini | PERC H730 Mini | PERC H740P Mini (Embedded) |
| Network adapters | Intel Ethernet 10G 4P X520/I350 rNDC | Intel Ethernet 10G 4P X520/I350 rNDC | Mellanox 25GbE 2P ConnectX4LX |
| Operating system | RHEL 7.3 | RHEL 7.3 | RHEL 7.3 |

The following figure shows the throughput in images per second of the image classification models for training on the three different Intel processors across Dell EMC PowerEdge server generations. The experiment was performed on four-node clusters.

We observed that, for the Inception V1 workload under the settings noted earlier, the PowerEdge R740xd server with the Intel Xeon Platinum 8160 processor performs 44 percent better than the PowerEdge R730xd server with the Xeon E5-2683 v3 processor and 28 percent better than the PowerEdge R730xd server with the Xeon E5-2698 v4 processor.

 

[Figure: Training throughput (images/second) across PowerEdge server generations]

 

Sub NUMA Clustering Comparison

 

Sub NUMA clustering (SNC) is a feature available in Intel Xeon Scalable processors. When SNC is enabled, each socket is divided into two NUMA domains, each with half the physical cores and half the memory of the socket. SNC is similar to the Cluster-on-Die option that was available in the Xeon E5-2600 v3 and v4 processors, with improvements to remote socket access. At the operating system level, a dual-socket server that has SNC enabled displays four NUMA domains.
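You can confirm this from the operating system with a standard Linux utility such as numactl (assumed to be installed):

    # Lists NUMA nodes with their CPUs and memory; expect four nodes on a
    # dual-socket server with SNC enabled, and two with SNC disabled.
    numactl --hardware

The following table shows the measured performance impact of enabling SNC.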

 

[Table: Measured performance impact of enabling Sub NUMA clustering]

 

We found only a 5 percent improvement in performance with SNC enabled, which is within the run-to-run variation in images per second across all the tests we performed. Sub NUMA clustering improves performance for applications that have high memory locality, and not all deep learning applications and models have high memory locality. We recommend leaving Sub NUMA clustering at its default BIOS setting, which is disabled.

Ready Solutions for AI

 

For this Ready Solution, we leverage Cloudera’s new Data Science Workbench, which delivers a self-service experience for data scientists.  Data Science Workbench provides multi-language support that includes Python, R, and Scala directly in a web browser. Data scientists can manage their own analytics pipelines, including built-in scheduling, monitoring, and email alerts in an environment that is tightly coupled with a traditional CDH cluster.

We also incorporate Intel's BigDL, a full-featured deep learning framework that leverages Spark running on the CDH cluster. The combination of Spark and BigDL provides a high-performance approach to deep learning training and inference on clusters of Intel Xeon Scalable processors.

 

Conclusion

 

It takes a wide range of skills to be successful with high performance data analytics. Practitioners must understand the details of both hardware and software components as well as their interactions. In this article, we document findings from performance tests that cover both the hardware and software components of BigDL with Spark.  Our findings show that:

  • BigDL scales almost linearly for the Inception and ResNet image classification models, up to the 16 nodes that we tested.
  • Intel Xeon Scalable processors with two AVX-512 FMA units (Gold 6000 series and Platinum series) perform significantly better than processors with one AVX-512 FMA unit (Gold 5000 series and earlier).
  • The PowerEdge server default BIOS setting for Sub NUMA clustering (disabled) is recommended for deep learning applications using BigDL.


Authors: Bala Chandrasekaran, Phil Hummel, and Leela Uppuluri