Ready Solutions for AI

4 Posts authored by: Phil Hummel

In Part 1 of “Share the GPU Love” we covered the need for improving the utilization of GPU accelerators and how a relatively simple technology like VMware DirectPath I/O together with some sharing processes could be a starting point. As with most things in technology, a little additional technology and knowledge lets you reach goals well beyond the basics. In this article, we are going to introduce another technology for managing GPU-as-a-service: NVIDIA GRID 9.0.


Before we jump to this next technology, let’s review some of the limitations of using DirectPath I/O for virtual machine access to physical PCI functions. The online documentation for VMware DirectPath I/O has a complete list of features that are unavailable for virtual machines configured with DirectPath I/O.  Some of the most important ones are:


  • Fault tolerance
  • High availability
  • Snapshots
  • Hot adding and removing of virtual devices

The technique of “passing through” host hardware to a virtual machine (VM) is simple but doesn’t leverage many of the virtues of true hardware virtualization. NVIDIA has delivered software to virtualize GPUs in the data center for years. The primary use case has been Virtual Desktop Infrastructure (VDI) using vGPUs. The current release, NVIDIA vGPU Software 9, adds the vComputeServer vGPU capability for supporting artificial intelligence, deep learning, and high-performance computing workloads. The rest of this article will cover using vGPU for machine learning in a VMware ESXi environment.

We wanted to compare the setup and features of this latest NVIDIA software version, so we added vComputeServer to the PowerEdge ESXi host that we used for the DirectPath I/O research in our first blog post. Our NVIDIA Turing architecture T4 GPUs are on the list of supported devices, so we can check that box, and our ESXi version is compatible. The NVIDIA vGPU software documentation for VMware vSphere has an exhaustive list of requirements and compatibility notes.

You’ll have to put your host into maintenance mode during installation and then reboot after the install of the VIB completes. When the ESXi host is back online you can run the now familiar nvidia-smi command with no parameters; a list of all available GPUs indicates you are ready to proceed.

We configured two of our T4 GPUs for vGPU use and set up the required licenses. Then we followed the same approach that we used for DirectPath I/O: build out VM templates with everything that is common to all development work, and use those to create the developer-specific VMs, one with all the Python tools and another with the R tools. NVIDIA vGPU software supports only 64-bit guest operating systems; no 32-bit guest operating systems are supported. You should only use a guest OS release that is supported by both NVIDIA vGPU software and VMware. NVIDIA will not be able to support guest OS releases that are not supported by your virtualization software.

Now that we have both a DirectPath I/O enabled setup and the NVIDIA vGPU environment, let’s compare the user experience. First, starting with the vSphere 6.7 U1 release, vMotion with vGPU and suspend and resume with vGPU are supported on suitable GPUs. The original vSphere 6.7 release, without Update 1, supports only suspend and resume with vGPU; vMotion with vGPU is not supported there. Always check the NVIDIA Virtual GPU Software Documentation for the latest details.

vMotion, which you don’t get with DirectPath I/O, can be extremely valuable for data scientists doing long-running training jobs. Suspend/resume of vGPU enabled VMs creates opportunities to increase the return on your GPU investments by enabling scenarios such as data science model training running at night and interactive, graphics-intensive applications running during the day, all using the same pool of GPUs. Organizations with workers spread across time zones may also find suspend/resume of vGPU enabled VMs useful.

There is still a lot of work that we want to do in our lab, including capturing some informational videos that will highlight some of the concepts we have been talking about in these last two articles. We are also starting to build out some VMs configured with Docker so we can look at using our vGPUs with NVIDIA GPU Cloud (NGC) deep learning training and inferencing containers. Our goal is to get more folks setting up a sandbox environment using these articles along with the NVIDIA and VMware links we have provided. We want to hear about your experience working with vGPUs and VMware. If you have any questions or comments, post them in the feedback section below.


Thanks for reading,

Phil Hummel - @GotDisk

Anyone who works with machine learning models trained by optimization methods like stochastic gradient descent (SGD) knows about the power of specialized hardware accelerators for performing the large number of matrix operations that are needed. Wouldn’t it be great if we all had our own accelerator-dense supercomputers? Unfortunately, the people who manage budgets aren’t approving that plan, so we need to find a workable mix of technology and, yes, the dreaded concept, process to improve our ability to work with hardware accelerators in shared environments.


We have gotten a lot of questions from a customer trying to increase the utilization rates of machines with specialized accelerators. The good news is that a lot of big technology companies are working on solutions. The rest of the article focuses on technology from Dell EMC, NVIDIA, and VMware, some of which is available today and some of which is coming soon. We also sprinkle in some comments about process that you can consider. Please add your thoughts and questions in the comments section below.


We started this latest round of GPU-as-a-service research with a small amount of kit in the Dell EMC Customer Solutions Center in Austin.  We have one Dell EMC PowerEdge R740 with 4 NVIDIA T4 GPUs connected to the system on the PCIe bus.  Our research question is “how can a group of data scientists working on different models with different development tools share these four GPUs?”  We are going to compare two different technology options:

  1. VMware Direct Path I/O

Our server has ESXi installed and is configured as a 1 node cluster in vCenter.  I’m going to skip the configuration of the host BIOS and ESXi and jump straight to creating VMs.  We started off with the Direct Path I/O option.  You should review the article “Using GPUs with Virtual Machines on vSphere – Part 2: VMDirectPath I/O” from VMware before trying this at home.  It has a lot of details that we won’t repeat here.



There are many approaches available for virtual machine image management that can be set up by the VMware administrators but for this project, we are assuming that our data scientists are building and maintaining the images they use.  Our scenario is to show how a group of Python users can have one image and the R users can have another image that both use GPUs when needed.  Both groups are using primarily TensorFlow and Keras.


Before installing an OS we changed the firmware setting to EFI in the VM Boot Options menu per the article above.  We also used the VM options to assign one physical GPU to the VM using Direct Path I/O before proceeding with any software installs.  It is important for there to be a device present during configuration even though the VM may get used later with or without an assigned GPU to facilitate sharing among users and/or teams.


Once the OS was installed and configured with user accounts and updates, we installed the NVIDIA GPU related software and made two clones of that image since both the R and Python environment setups need the same supporting libraries and drivers to use the GPUs when added to the VM through Direct Path I/O.  Having the base image with an OS plus NVIDIA libraries saves a lot of time if you want a new type of developer environment.


With this much of the setup done, we can start testing assigning and removing GPU devices among our two VMs. We use VM options to add and remove the devices, but only while the VM is powered off. For example, we can assign 2 GPUs to each VM, 4 GPUs to one VM and none to the other, or any other combination that doesn’t exceed our 4 available devices. Devices currently assigned to other VMs are not available in the UI for assignment, so it is not possible to create conflicts between VMs. We can use NVIDIA’s System Management Interface (nvidia-smi) to list the devices available on each VM.
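
As a small illustration of that check, here is a minimal Python sketch that asks nvidia-smi for the devices visible inside a VM. It assumes the NVIDIA driver tools are installed in the guest and relies on the `nvidia-smi -L` option, which prints one line per GPU. The helper names `parse_gpu_list` and `list_visible_gpus` are ours for illustration and are not part of any NVIDIA tooling.

```python
import shutil
import subprocess


def parse_gpu_list(text):
    """Return the lines of `nvidia-smi -L` output that describe a GPU."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip().startswith("GPU ")
    ]


def list_visible_gpus():
    """Return one description string per GPU visible to this VM (empty if none)."""
    if shutil.which("nvidia-smi") is None:
        # The driver tools are not installed or not on the PATH.
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        # nvidia-smi could not talk to a driver or device.
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = list_visible_gpus()
    print(f"{len(gpus)} GPU(s) visible to this VM")
    for gpu in gpus:
        print(" ", gpu)
```

Running this on each VM after a reassignment is a quick way to confirm that the device counts match what was configured in the VM options.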


Remember when we talked about process above? Here is where we need to revisit that. The only way a setup like this works is if people release GPUs from VMs when they don’t need them. Going a level deeper, there will probably be a time when one user or group could take advantage of a GPU but would choose to not take one so other potentially more critical work can have it. This type of resource sharing is not new to research and development. All useful resources are scarce, and a lot of efficiencies can be gained with the right technology, process, and attitude.


Before we talk about installing the developer frameworks and libraries, let’s review the outcome we desire. We have 2 or more groups of developers that could benefit from the use of GPUs at different times in their workflow but not always. They would like to minimize the number of VM images they need to have and would also like fewer versions of code to maintain, even when switching between tasks that may or may not have access to GPUs when running. We talked above about switching GPUs between machines, but what happens on the software side? Next, we’ll talk about some TensorFlow properties that make this easier.


TensorFlow comes in two main flavors for installation: tensorflow and tensorflow-gpu. The first one should probably be called “tensorflow-cpu” for clarity. For this work, we are only installing the GPU enabled version since we want our VMs to be able to use a GPU for any operations that TF supports on GPU devices. The reason that I don’t also need the CPU version when my VM has not been assigned any GPUs is that many operations available in the GPU enabled version of TF have both a CPU and a GPU implementation. When an operation is run without a specific device assignment, any available GPU device will be given priority in the placement. When the VM does not have a GPU device available, the operation will use the CPU implementation.
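
The placement behavior described above can be observed with a small matrix multiplication. The sketch below is a hedged illustration, not the sample used in our lab: it assumes a TensorFlow 2.x install (the `tf.config.list_physical_devices` call and the `.device` attribute of eager tensors are TF 2.x APIs, while the article was written around the separate tensorflow-gpu package), and `device_kind` and `run_unpinned_matmul` are names we made up for this example.

```python
def device_kind(device_string):
    """Classify a TensorFlow device string such as '.../device:GPU:0'."""
    return "GPU" if "GPU" in device_string.upper() else "CPU"


def run_unpinned_matmul(size=1000):
    """Multiply two random matrices without naming a device; return where it ran."""
    # Imported here so the helper above still works without TensorFlow installed.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    print(f"GPUs visible to TensorFlow: {len(gpus)}")

    a = tf.random.uniform((size, size))
    b = tf.random.uniform((size, size))
    # No tf.device(...) block here: TensorFlow chooses the placement itself.
    c = tf.matmul(a, b)
    return c.device


if __name__ == "__main__":
    try:
        placement = run_unpinned_matmul()
        print(f"matmul ran on: {placement} ({device_kind(placement)})")
    except ImportError:
        print("TensorFlow is not installed in this environment")
```

Running the same script on a VM before and after a GPU is assigned should show the reported device switch between CPU and GPU with no code changes.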


There are many examples online for testing if you have a properly configured system with a functioning GPU device. This simple matrix multiplication sample is a good starting point. Once that is working you can move on to a full-blown model training with a sample data set like the MNIST character recognition model. Try setting up a sandbox environment using this article and the VMware blog series above. Then get some experience with allocating and deallocating GPUs to VMs and prove that things are working with a small app. If you have any questions or comments, post them in the feedback section below.


Thanks for reading.

Phil Hummel



Explaining the relationship between machine learning and artificial intelligence is one of the most challenging tasks that I encounter when talking to people new to these topics. I don’t pretend to have the definitive answer, but I have developed a story that seems to get enough affirmative head nods that I want to share it here.


The diagram above has appeared in many introductory books and articles that I’ve seen. I have reproduced it here to highlight the challenge of talking about “subsets” of abstract concepts, none of which have widely accepted definitions. So, what does this graphic mean or imply? How is deep learning a subset of artificial intelligence? During the rest of this article, I’m going to try to answer these questions by telling you a story I use for briefings on artificial intelligence.

Since so many people have read about and studied examples of using deep learning for image classification, that is my starting point.  I am not however going to talk about cats and dogs, so please hang with me for a bit longer.  I’m going to use an example of facial recognition.  My scenario is that there is a secure area in a building that only 4 people (Angela, Chris, Lucy and Marie) are permitted to enter.  We want to use facial recognition to determine if someone attempting to gain access should be allowed in.  You and I can easily look at a picture and say whether it is someone we know.  But how does a deep learning model do that and how could we use the result of the model to create an artificial intelligence application?

I frequently use the picture below to discuss the use of deep neural networks for supervised classification model training. When looking at the network, consider that the goal of all machine learning and deep learning is to transform input data into some meaningful output. For facial recognition, the input data is a representation of the pixel intensity and color or grey scale value from a picture, and the output is the probability that the picture is Angela, Chris, Lucy or Marie. That means we are going to have to train the network using recent photos of these four people.

A highly stylized neural network representation

The picture above is a crude simplification of how a modern convolutional neural network (ConvNet) used for image recognition would be constructed. However, it is useful for highlighting many of the important elements of what we mean by transforming raw data into meaningful outputs. For example, each line or edge drawn between the neurons of each layer represents a weight (parameter) that must be calculated during training. These weights are the primary mechanism used to transform the input data into something useful. Because this picture only includes 5 layers with fewer than 10 nodes per layer, it is easy to visualize how fully connected layers can quickly increase the number of weights that must be computed. The ConvNets in widespread use today typically have from 16 to 200 or more layers, although not all fully connected for the deeper designs, and can have tens of millions to hundreds of millions of weights or more.
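
To make the growth in weights concrete, here is a short sketch that counts the parameters in a stack of fully connected layers. The layer sizes are hypothetical: the only detail taken from the picture is that there are 5 layers, fewer than 10 nodes per layer, and 4 output neurons. The function name `count_dense_parameters` is ours.

```python
def count_dense_parameters(layer_sizes):
    """Return (weights, biases) for fully connected layers of the given sizes.

    Every neuron in one layer connects to every neuron in the next, so each
    pair of adjacent layers contributes n_in * n_out weights, and every
    neuron after the input layer has one bias.
    """
    weights = sum(
        n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )
    biases = sum(layer_sizes[1:])
    return weights, biases


# Hypothetical small network in the spirit of the picture: 5 layers, 4 outputs.
small = [8, 9, 9, 9, 4]
print("small network:", count_dense_parameters(small))

# Same depth, but with 1024 nodes in the input and hidden layers.
wider = [1024, 1024, 1024, 1024, 4]
print("wider network:", count_dense_parameters(wider))
```

The small network needs only a few hundred weights, while widening the same 5 layers pushes the count into the millions, which is why real ConvNets avoid making every layer fully connected.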

We need that many weights to “meaningfully” transform the input data since the image is broken down into many small regions of pixels (typically 3x3 or 5x5) before getting ingested by the input layer. The numerical representation of the pixel values is then transformed by the weights so that the output of the transformation indicates if this region of pixels adds to the evidence that this is a picture of Angela or negates the likelihood that this is Angela. If Angela has black hair and the network does not detect many regions of solid black color, then there will not be much evidence that this picture is Angela.

Finally, I want to tie everything discussed so far to an explanation of the output layer. In the picture above, there are 4 neurons in the output layer, and that is why I set up my facial recognition story to have 4 people that I am trying to recognize. During training I have a set of pictures that have been labeled with the correct name. One way to look at how I might do that is like this:

Table 1 - Representation of labeled training data

Picture     Angela   Chris   Lucy   Marie
Picture1    1        0       0      0
Picture2    0        0       0      1

The goal during training is to come up with a single set of weights that will transform the data from every picture in the training data set into a set of four values (a vector) for each picture, where the values match as closely as possible the labels assigned as above. For Picture1 the first value is 1 and the other three are zeros, and for Picture2 the first 3 values are zero and the fourth value is 1. For Picture1, we are telling the model that we are 100% sure (probability = 1) that this is a picture of Angela and certain that it is not Chris, Lucy, or Marie (probability = 0). The training process tries to find a set of weights that will transform the pixel data for Picture1 into the vector (1,0,0,0) and Picture2 into the vector (0,0,0,1) and so on for the entire data set.
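
The label vectors described here are commonly called one-hot encodings. The sketch below shows one way to build them in plain Python. The class order (Angela, Chris, Lucy, Marie) follows the story, the assignment of Picture2 to Marie is inferred from its vector (0,0,0,1), and the `one_hot` helper is a name we chose for illustration.

```python
PEOPLE = ["Angela", "Chris", "Lucy", "Marie"]


def one_hot(name, classes=PEOPLE):
    """Return a vector with a 1 in the position of `name` and 0 elsewhere."""
    if name not in classes:
        raise ValueError(f"unknown label: {name}")
    return [1 if name == candidate else 0 for candidate in classes]


# Labels for the two example pictures discussed in the text.
training_labels = {
    "Picture1": one_hot("Angela"),
    "Picture2": one_hot("Marie"),
}

for picture, vector in training_labels.items():
    print(picture, vector)
```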


Of course, no deep learning model training algorithm can do that exactly because of variations in the data, so we try to get as close as possible for each input image. The process of testing a model with known data or processing new unlabeled images is called inferencing. When we pass in unlabeled data we get back a list of four probabilities that reflect the evidence in the data that the image is one of the four known people. For example, we might get something back like (.5, .25, .15, .1). For most classification algorithms the set of probabilities will add to 1. What does this result tell us?
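
One common way that classifiers produce values that add to 1 is the softmax function, which rescales the raw scores from the output layer. The article does not say which function the model uses, so treat softmax here as an assumption. The raw scores below are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up raw scores for Angela, Chris, Lucy and Marie.
raw_scores = [2.0, 1.3, 0.8, 0.4]
probabilities = softmax(raw_scores)

print([round(p, 3) for p in probabilities])
print("sum:", round(sum(probabilities), 6))
```

Equal raw scores produce equal probabilities, which is the uninformative (.25, .25, .25, .25) case.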


Our model says we are most confident that the unlabeled picture is Angela since that is the outcome with the highest probability, but it also tells us that we can only be 50% sure that it is not one of the other three people. What does it mean if we get an inference result back that is (.25, .25, .25, .25)? This result tells us the model can’t do better than a random process like picking a number between 1 and 4. This picture could be any one of our known people or it could be a picture of a truck. The model provides us no information. How intelligent is that? This is where the connection with artificial intelligence gets interesting.


What we would like to achieve is getting back inference predictions where one value is very close to 1 and all the others are very close to zero. Then we have high confidence that the person requesting access to a restricted area is one of our authorized employees. That is rarely the case, so we must deal with uncertainty in the applications that use our trained machine learning models. If the area that we are securing is the executive dining room, then perhaps we want to open the door even if we are only 50% sure that the person requesting access is one of our known people. If the application is securing access to sensitive computer and communication equipment, then perhaps we want to set a threshold of 90% certainty before we unlock the door. The important point is that machine learning alone is usually not sufficient to build an intelligent application. Therefore, the feared scenario in which machines get smarter than people and make “better” decisions is still a long way off, maybe a very long way…
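
The application logic wrapped around the model can be as simple as a threshold test. The sketch below is illustrative only: the 50% and 90% thresholds come from the two scenarios above, while the function name `access_decision` and its return format are assumptions of ours.

```python
PEOPLE = ["Angela", "Chris", "Lucy", "Marie"]


def access_decision(probabilities, threshold, names=PEOPLE):
    """Return (allowed, name) based on the most likely person and a threshold."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] >= threshold:
        return True, names[best]
    # Not confident enough in any single person: keep the door locked.
    return False, None


prediction = [0.5, 0.25, 0.15, 0.1]

# Executive dining room: 50% certainty is enough.
print("dining room:", access_decision(prediction, threshold=0.5))

# Sensitive equipment room: require 90% certainty.
print("equipment room:", access_decision(prediction, threshold=0.9))
```

The same model output leads to different outcomes depending on the policy, which is the part of the application that the trained model does not provide.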


Phil Hummel


Data science requires a mix of computer programming, mathematics/statistics and domain knowledge. This article focuses on the intersection of the first two requirements. Comprehensive software packages for classical machine learning, primarily supported by statistical algorithms, have been widely available for decades. There are many mature offerings available from both the open source software community and commercial software makers. Modern deep learning is less mature, still experiencing rapid innovation, and so the software landscape is more dynamic. Data scientists engaged in deep learning must get more involved in programming than typically required for classical machine learning. The remainder of this article explains the why and how of that effort.

First, let me start with a high-level summary of what deep learning is. Computer scientists have been studying ways to perform speech recognition, image recognition, natural language processing (including translation), relationship identification, recommendation systems and other forms of data relationship discovery since computers were first invented. After decades of parallel research in mathematics and software development, researchers discovered a methodology called artificial neural networks (ANNs) that could be used to solve these types of problems and many more using a common set of tools. The building blocks of ANNs are layers. Each layer typically accepts structured data (tensors) as inputs, then performs a type of transformation on that data, and finally sends the transformed data to the next layer until the output layer is processed. The layers typically used in ANNs can be grouped into categories, for example:

  • Input Layers
  • Learnable Layers
  • Activation Layers
  • Pooling Layers
  • Combination Layers
  • Output Layers

The number and definition of the layers, how they are connected, and the data structures used between layers are called the model structure.  In addition to defining the structure, a data scientist must specify how the model is to be executed including the function to be optimized and optimization method.  Given the complexity of the mathematics and the need to efficiently process large data sets, the effort to create a deep learning software program is a significant development effort, even for professional computer scientists.
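
As a rough illustration of what a model structure looks like in code, the sketch below chains a few layers in plain Python, with each layer accepting a list of numbers and passing its transformed output to the next. It covers only the structure and a single forward pass, not the function to be optimized or the optimization method. All weights, layer sizes and function names are made up for this example, and real frameworks implement the same ideas with tensors and far more efficient math.

```python
import math


def dense(weights, biases):
    """Learnable layer: each output is a weighted sum of the inputs plus a bias."""
    def layer(inputs):
        return [
            sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)
        ]
    return layer


def relu(inputs):
    """Activation layer: negative values are replaced with zero."""
    return [max(0.0, x) for x in inputs]


def softmax(inputs):
    """Output layer: rescale scores into probabilities that sum to 1."""
    shift = max(inputs)
    exps = [math.exp(x - shift) for x in inputs]
    total = sum(exps)
    return [e / total for e in exps]


def run_model(layers, inputs):
    """Send the data through each layer in order and return the final output."""
    data = inputs
    for layer in layers:
        data = layer(data)
    return data


# The model structure: which layers exist and how they are connected.
model = [
    dense([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # 3 inputs -> 2 hidden
    relu,
    dense([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # 2 hidden -> 2 outputs
    softmax,
]

output = run_model(model, [1.0, 2.0, 3.0])
print([round(p, 3) for p in output])
```

Even this toy version hints at why frameworks exist: training would additionally require a loss function, gradients for every weight, and an optimizer.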

Deep learning frameworks were developed to make software for deep learning available to the wider community of programmers and data scientists. Most of today’s popular frameworks are developed through open source software initiatives, each of which attracts dozens of active developers. The rate of innovation in the deep learning framework space is both impressive and somewhat overwhelming.

To further complicate the world of deep learning (yes, that is possible), despite the many similar capabilities of the most popular deep learning frameworks, there are also significant differences that lead to a need for careful evaluation for compatibility once a project is defined. Based on a sample of the many comparisons of deep learning frameworks that can be found in just the last couple of years, I estimate that there are between 15 and 20 viable alternatives today.

The Intel® AI Academy recently published a comparison summary focused on frameworks that have versions optimized by Intel and that can effectively run on CPUs optimized for matrix multiplication.  The table below is a sample of the type of analysis and data that was collected.


[Table: sample of the framework comparison, with columns including # of GitHub Stars and # of GitHub Forks and rows including Microsoft Cognitive Toolkit* and neon™ framework]

Source: Intel AI Academy 2017


The NVIDIA Deep Learning AI website has a summary of deep learning frameworks such as Caffe2, Cognitive Toolkit, MXNet, PyTorch, TensorFlow and others that support GPU-accelerated libraries such as cuDNN and NCCL to deliver high-performance multi-GPU accelerated training. The page also includes links to learning and getting started resources.



  1. Don’t be surprised if the data science team proposes projects using different frameworks.
  2. Get curious if every project requires a different framework.
  3. Plan to age out some frameworks over time and bring in new ones.
  4. Allocate time for new framework investigation.
  5. Look for platforms that support multiple frameworks to reduce silos.
  6. Check online reviews from reputable sources to see how others rate a framework before adopting it for a project.


Thanks for reading,

Phil Hummel