A data science project lifecycle includes data acquisition, preparation and validation, model development, model validation, productization, and monitoring. For many projects, model development and productization can be iterative processes. Models in production need to be continuously monitored to ensure they are performing as expected. Each of these steps brings a unique set of challenges and requirements that are not met by traditional software development platforms. When building an AI-based application, a data science platform that manages the lifecycle of data, models and applications proves immensely valuable. The concept of a “data science workbench” that enables self-service access to datasets, compute instances and version control is now seen as crucial to the productivity and success of data science teams.
To meet this need for our customers, Dell EMC has partnered with Domino Data Lab to offer the Domino Data Science Platform powered by Dell EMC PowerEdge servers. One of the things I first liked about Domino was its straightforward integration with Apache Spark. In the Dell EMC HPC and AI Innovation Lab, this allows us to tie our projects to existing Spark clusters, such as the Dell EMC Ready Solutions for Data Analytics with Intel Analytics Zoo, to enable AI on Spark. Here are some highlights of the features:
Reproducibility Engine: One of the key features of Domino is the Domino Reproducibility Engine. A project within Domino comprises three components: data, code and the model. When developing and tuning a model, data scientists typically run multiple experiments by varying the model architecture and hyperparameters. Domino’s Reproducibility Engine captures and stores all these components. This allows experiments to be reproduced accurately and lets data scientists revisit the states or conditions in which the model produced optimal results. It also helps future collaborators iterate through the model history and understand how it was developed, even when the original researcher is no longer available.
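To make the idea concrete, here is a minimal sketch of what capturing an experiment's state could look like. It is not Domino's implementation: the `ExperimentRecord` class, its fields and the `hash_bytes` helper are hypothetical names chosen for illustration, and the only assumption is that an experiment can be identified by its code revision, a hash of its data, its environment and its hyperparameters.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def hash_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of a data snapshot."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical record of everything needed to rerun one experiment."""
    code_revision: str     # e.g. a version-control commit id
    data_sha256: str       # hash of the training data snapshot
    environment: str       # name or tag of the container image used
    hyperparameters: dict  # model architecture and tuning settings

    def fingerprint(self) -> str:
        # Serialize with sorted keys so identical inputs always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Illustrative values only: two runs with the same inputs, one with a new learning rate.
snapshot = hash_bytes(b"image_id,label\n1,0\n2,1\n")
run_a = ExperimentRecord("abc123", snapshot, "py-dl:1", {"lr": 0.01, "layers": 4})
run_b = ExperimentRecord("abc123", snapshot, "py-dl:1", {"layers": 4, "lr": 0.01})
run_c = ExperimentRecord("abc123", snapshot, "py-dl:1", {"lr": 0.001, "layers": 4})

print(run_a.fingerprint() == run_b.fingerprint())  # same experiment state
print(run_a.fingerprint() == run_c.fingerprint())  # a hyperparameter changed
```

Two runs share a fingerprint only when code, data, environment and hyperparameters all match, which is the property that lets a collaborator return to the exact conditions of an earlier result.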
Flexibility of compute infrastructure: Domino enables IT administrators to provision compute environments and provide them to data scientists so they can run experiments. The compute infrastructure could be a cluster of CPU nodes, accelerators or an integration into big data frameworks like Cloudera’s distribution including Apache Hadoop (CDH) or Apache Spark. Data scientists can map these compute infrastructures to their projects based on their needs.
Software environments packaged as a container: Domino also enables IT administrators to capture required software packages, libraries and dependencies as a Docker container. Domino comes pre-configured with containers for popular languages, tools, and packages such as R, Python, SAS, RStudio, and H2O. Custom environments can be built by appending to one of these base environments and rebuilding the container. These version-controlled environments can then be mapped to projects.
Productizing models: Using Domino, a model can be published in several ways. Published models are automatically versioned, secured via access control, and highly available.
- Models can be published as REST APIs, which can be accessed directly by applications for inference.
- Data scientists can create self-service web forms using Launchers, which let end users easily view and consume analytics and reports.
- Models can also be published as interactive Flask or Shiny apps and deployed through Domino.
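As an illustration of the first option, the sketch below shows how an application might prepare an inference call to a model published as a REST API. The endpoint URL, the bearer-token header and the `{"data": ...}` payload shape are all assumptions made for this example, not Domino's documented API; check the published model's own instructions for the real values.

```python
import json
import urllib.request


def build_inference_request(url: str, token: str, features: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a model endpoint.

    The payload layout and auth scheme are hypothetical placeholders.
    """
    body = json.dumps({"data": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder endpoint and token for illustration only.
request = build_inference_request(
    "https://domino.example.com/models/my-model/latest/model",
    "MY_API_TOKEN",
    {"age": 54, "view": "PA"},
)
print(request.get_method(), request.full_url)

# To actually call the endpoint, an application would run:
# with urllib.request.urlopen(request) as response:
#     prediction = json.loads(response.read())
```

Keeping request construction separate from sending makes the call easy to inspect and unit test before pointing it at a live model.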
Model usage tracking: Once the model has been published into production, Domino automatically captures model usage and other details. This helps in calculating the ROI of the model and helps to streamline future model development.
Refer to this brief for other features of Domino.
Integration into Cloudera
Domino supports running training jobs using Hadoop and Spark worker nodes hosted on existing Cloudera-managed clusters.
Hadoop and Spark integration touches three aspects of the Domino Data Science Platform:
- Projects - Projects are repositories that contain data, code, model and environment settings. Domino tracks the entire project and versions it. An AI application or model typically maps to a project.
- Environments - Environments consist of the software and hardware context in which projects run. The software context includes a Docker container that incorporates all the libraries and settings for the project. The hardware context includes the nodes where the training job runs.
- Workspaces - Workspaces are interactive development environments (IDEs) such as notebooks and RStudio.
Projects in Domino can be configured to access a Hadoop/Spark cluster environment. An environment is configured with libraries, settings and hardware to access the Hadoop cluster. Once configured, the data scientist can launch a notebook session or other workspace and run Spark training jobs. The Spark driver program will run on the Domino executor (as shown in the figure) and spawn workers in the Hadoop cluster for the machine learning training task.
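The driver-and-workers arrangement described above can be sketched from a notebook session as follows. This assumes the Hadoop cluster uses YARN as its resource manager and that the environment already has PySpark and the cluster's client configuration installed; the helper names `yarn_client_conf` and `create_session` and the resource sizes are illustrative, not part of Domino.

```python
def yarn_client_conf(app_name: str, executors: int,
                     executor_memory: str = "4g", executor_cores: int = 2) -> dict:
    """Return Spark properties for a driver that runs locally (client mode)
    and requests its executors from a YARN-managed cluster."""
    return {
        "spark.app.name": app_name,
        "spark.master": "yarn",               # assumption: cluster is managed by YARN
        "spark.submit.deployMode": "client",  # driver stays in the workspace
        "spark.executor.instances": str(executors),
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
    }


def create_session(conf: dict):
    """Create a SparkSession from the given properties.

    PySpark is imported lazily so the configuration can be built and
    inspected even where PySpark is not installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = yarn_client_conf("radiology-training", executors=4)
for key, value in conf.items():
    print(f"{key}={value}")

# Inside a configured workspace, the session would then be created with:
# spark = create_session(conf)
```

In client mode the driver stays in the workspace while the executors run on the cluster, matching the split between the Domino executor and the Hadoop workers.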
By configuring the Docker container with deep learning libraries from Analytics Zoo, data scientists can request resources on demand from the Spark cluster to build data pipelines and train models.
Running on Kubernetes
Kubernetes has become the de facto tool for container orchestration. The Domino Data Science Platform is built to run on an existing Kubernetes cluster, which streamlines deployment, maintenance, scaling, and updates. Kubernetes leverages a farm of nodes, here PowerEdge servers, that provide the compute resources powering jobs and runs. This enables flexibility, improved resource utilization and version control while developing cloud native AI applications.
Starting with version 2.3, Apache Spark supports Kubernetes natively through spark-submit. With this method, Spark uses the Kubernetes scheduler rather than YARN or its built-in standalone cluster manager. Spark supports both cluster mode and client mode on Kubernetes (client mode was added in Spark 2.4). The Spark driver, running inside or outside the Kubernetes cluster, submits jobs to the Kubernetes scheduler, which then creates the worker pods. When the application completes, the executor pods terminate and are cleaned up.
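For readers who want to see the shape of such a submission, the sketch below assembles a spark-submit command line for a Kubernetes master. The flags and the `spark.kubernetes.container.image` property come from Spark's Kubernetes support; the API server address, image name, jar path and the `k8s_submit_command` helper itself are placeholders assumed for illustration.

```python
import shlex


def k8s_submit_command(api_server: str, image: str, app_resource: str, name: str,
                       executors: int = 2, deploy_mode: str = "cluster",
                       main_class: str = None) -> list:
    """Assemble a spark-submit argument list targeting a Kubernetes cluster."""
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # the k8s:// prefix selects Kubernetes
        "--deploy-mode", deploy_mode,
        "--name", name,
    ]
    if main_class:
        cmd += ["--class", main_class]      # needed for JVM applications
    cmd += [
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_resource,                       # local:// means a path inside the image
    ]
    return cmd


# Placeholder cluster address, image and jar path for illustration only.
command = k8s_submit_command(
    api_server="https://k8s.example.com:6443",
    image="registry.example.com/spark:latest",
    app_resource="local:///opt/spark/examples/jars/spark-examples.jar",
    name="spark-pi",
    executors=3,
    main_class="org.apache.spark.examples.SparkPi",
)
print(" ".join(shlex.quote(arg) for arg in command))
```

Building the arguments as a list means they can be passed to `subprocess.run` without shell quoting problems once real cluster details are filled in.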
We have a demo of AI-assisted radiology using distributed deep learning on Apache Spark and Analytics Zoo with the Domino Data Science Platform at the Intel booth at the Strata Data Conference (September 2019) in New York. Visit us at the Strata conference to learn more.
In a future post we can explore running Spark on Kubernetes in more detail. Follow this space to get best practices around running Apache Spark on Kubernetes.