Deploying trained neural network models for inference on different platforms is a challenging task. The inference environment is usually different than the training environment which is typically a data center or a server farm. The inference platform may be power constrained and limited from a software perspective. The model might be trained using one of the many available deep learning frameworks such as Tensorflow, PyTorch, Keras, Caffe, MXNet, etc. Intel® OpenVINO™ provides tools to convert trained models into a framework agnostic representation, including tools to reduce the memory footprint of the model using quantization and graph optimization. It also provides dedicated inference APIs that are optimized for specific hardware platforms, such as Intel® Programmable Acceleration Cards, and Intel® Movidius™ Vision Processing Units.

 

openvino.png

   The Intel® OpenVINO™ toolkit

 

Components

 

  1. Model Optimizer

The Model Optimizer is a cross-platform command-line tool that facilitates the transition between the training and deployment environment, performs static model analysis, and adjusts deep learning models for optimal execution on end-point target devices. It is a Python script which takes as input a trained Tensorflow/Caffe model and produces an Intermediate Representation (IR) which consists of a .xml file containing the model definition and a .bin file containing the model weights.

 

     2. Inference Engine

The Inference Engine is a C++ library with a set of C++ classes to infer input data (images) and get a result. The C++ library provides an API to read the Intermediate Representation, set the input and output formats, and execute the model on devices. Each supported target device has a plugin which is a DLL/shared library. It also has support for heterogenous execution to distribute workload across devices. It supports implementing custom layers on a CPU while executing the rest of the model on a accelerator device.

 

Workflow

 

  1. Using the Model Optimizer, convert a trained model to produce an optimized Intermediate Representation (IR) of the model based on the trained network topology, weights, and bias values.
  2. Test the model in the Intermediate Representation format using the Inference Engine in the target environment with the validation application or the sample applications.
  3. Integrate the Inference Engine into your application to deploy the model in the target environment.

 

Using the Model Optimizer to convert a Keras model to IR

 

The model optimizer doesn’t natively support Keras model files. However, because Keras uses Tensorflow as its backend, a Keras model can be saved as a Tensorflow checkpoint which can be loaded into the model optimizer. A Keras model can be converted to an IR using the following steps

  1. Save the Keras model as a Tensorflow checkpoint. Make sure the learning phase is set to 0. Get the name of the output node.

import tensorflow as tf 
from keras.applications import Resnet50 
from keras import backend as K 
from keras.models import Sequential, Model

K.set_learning_phase(0)   # Set the learning phase to 0

model = ResNet50(weights='imagenet', input_shape=(256, 256, 3))  
config = model.get_config()                                            
weights = model.get_weights()
model = Sequential.from_config(config)

output_node = model.output.name.split(':')[0]  # We need this in the next step
graph_file =
"resnet50_graph.pb"  
ckpt_file =
"resnet50.ckpt"
saver = tf.train.Saver(sharded=True)
tf.train.write_graph(sess.graph_def,
'', graph_file)
saver.save(sess, ckpt_file)

                                                    

     2. Run the Tensorflow freeze_graph program to generate a frozen graph from the saved checkpoint.

tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph --input_graph=./resnet50_graph.pb --input_checkpoint=./resnet50.ckpt --output_node_names=Softmax --output_graph=resnet_frozen.pb     

 

     3. Use the mo.py script and the frozen graph to generate the IR. The model weights can be quantized to FP16.

     python mo.py --input_model=resnet50_frozen.pb --output_dir=./ --input_shape=[1,224,224,3] --           data_type=FP16            

                                                     

Inference

 

The C++ library provides utilities to read an IR, select a plugin depending on the target device, and run the model.

  1. Read the Intermediate Representation - Using the InferenceEngine::CNNNetReader class, read an Intermediate Representation file into a CNNNetwork class. This class represents the network in host memory.
  2. Prepare inputs and outputs format - After loading the network, specify input and output precision, and the layout on the network. For these specification, use the CNNNetwork::getInputInfo() and CNNNetwork::getOutputInfo()
  3. Select Plugin - Select the plugin on which to load your network. Create the plugin with the InferenceEngine::PluginDispatcher load helper class. Pass per device loading configurations specific to this device and register extensions to this device.
  4. Compile and Load - Use the plugin interface wrapper class InferenceEngine::InferencePlugin to call the LoadNetwork() API to compile and load the network on the device. Pass in the per-target load configuration for this compilation and load operation.
  5. Set input data - With the network loaded, you have an ExecutableNetwork object. Use this object to create an InferRequest in which you signal the input buffers to use for input and output. Specify a device-allocated memory and copy it into the device memory directly, or tell the device to use your application memory to save a copy.
  6. Execute - With the input and output memory now defined, choose your execution mode:
    • Synchronously - Infer() method. Blocks until inference finishes.
    • Asynchronously - StartAsync() method. Check status with the wait() method (0 timeout), wait, or specify a completion callback.
  7. Get the output - After inference is completed, get the output memory or read the memory you provided earlier. Do this with the InferRequest GetBlob API.

 

The classification_sample and classification_sample_async programs perform inference using the steps mentioned above. We use these samples in the next section to perform inference on an Intel® FPGA.

 

 

Using the Intel® Programmable Acceleration Card with Intel® Arria® 10GX FPGA for inference

 

The OpenVINO toolkit supports using the PAC as a target device for running low power inference. The steps for setting up the card are detailed here. The pre-processing and post-processing is performed on the host while the execution of the model is performed on the card. The toolkit contains bitstreams for different topologies.

  1. Programming the bitstream

     aocl program <device_id> <open_vino_install_directory>/a10_dcp_bitstreams/2-0-1_RC_FP16_ResNet50-101.aocx                                                                                     

    2. The Hetero plugin can be used with CPU as the fallback device for layers that are not supported by the FPGA. The -pc flag prints            performance details for each layer

     ./classification_sample_async -d HETERO:FPGA,CPU -i <path/to/input/image.png> -m <path/to/ir>/resnet50_frozen.xml                                                                                         

 

Conclusion

 

Intel® OpenVINO™ toolkit is a great way to quickly integrate trained models into applications and deploy them in different production environments. The complete documentation for the toolkit can be found at https://software.intel.com/en-us/openvino-toolkit/documentation/featured.