
Inference is the process of running a trained neural network to process new inputs and make predictions. Training is usually performed offline in a data center or a server farm. Inference can be performed in a variety of environments depending on the use case. Intel® FPGAs provide a low power, high throughput solution for running inference. In this blog, we look at using the Intel® Programmable Acceleration Card (PAC) with Intel® Arria® 10GX FPGA for running inference on a Convolutional Neural Network (CNN) model trained for identifying thoracic pathologies.


Advantages of using Intel® FPGAs

  1. System Acceleration

Intel® FPGAs accelerate and aid the compute and connectivity required to collect and process the massive quantities of information around us by controlling the data path. In addition to FPGAs being used as compute offload, they can also directly receive data and process it inline without going through the host system. This frees the processor to manage other system events and enables higher real time system performance.

  2. Power Efficiency

Intel® FPGAs have over 8 TB/s of on-die memory bandwidth, so solutions tend to keep data on the device, tightly coupled with the next computation. This minimizes the need to access external memory and results in a more efficient circuit implementation in the FPGA, where data can be parallelized, pipelined, and processed on every clock cycle. These circuits can run at significantly lower clock frequencies than traditional general-purpose processors, which results in very powerful and efficient solutions.

  3. Future Proofing

In addition to system acceleration and power efficiency, Intel® FPGAs help future proof systems. With such a dynamic technology as machine learning, which is evolving and changing constantly, Intel® FPGAs provide flexibility unavailable in fixed devices. As precisions drop from 32-bit to 8-bit and even binary/ternary networks, an FPGA has the flexibility to support those changes instantly. As next generation architectures and methodologies are developed, FPGAs will be there to implement them.


Model and software

The model is a ResNet-50 CNN trained on the NIH chest x-ray dataset. The dataset contains over 100,000 chest x-rays, each labelled with one or more pathologies. The model was trained on 512 Intel® Xeon® Scalable Gold 6148 processors in 11.25 minutes on the Zenith cluster at Dell EMC.

The model is trained using TensorFlow 1.6. We use the Intel® OpenVINO™ R3 toolkit to deploy the model on the FPGA. The Intel® OpenVINO™ toolkit is a collection of software tools to facilitate the deployment of deep learning models. This OpenVINO blog post details the procedure to convert a TensorFlow model to a format that can be run on the FPGA.



In this section, we look at the power consumption and throughput numbers on the Dell EMC PowerEdge R740 and R640 servers.

   1. Using the Dell EMC PowerEdge R740 with 2x Intel® Xeon® Scalable Gold 6136 (300W) and 4x Intel® PACs.

Figures 1 and 2 show the power consumption and throughput numbers for running the model on Intel® PACs, and in combination with Intel® Xeon® Scalable Gold 6136. We observe that the addition of a single Intel® PAC adds only 43W to the system power while providing the ability to run inference on over 100 chest X-rays per second. The additional power and inference performance scale linearly with the addition of more Intel® PACs. At a system level, we see a 2.3x improvement in throughput and a 116% improvement in efficiency (images per sec per Watt) when using 4x Intel® PACs with 2x Intel® Xeon® Scalable Gold 6136.


Figure 1: Inference performance tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm.



Figure 2: Performance per watt tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm.


   2. Using the Dell EMC PowerEdge R640 with 2x Intel® Xeon® Scalable Gold 5118 (210W) and 2x Intel® PACs.

We also used a server with lower idle power. We see a 2.6x improvement in system performance in this case. As before, each Intel® PAC linearly adds performance to the system, adding more than 100 inferences per second for 43W (2.44 images/sec/W).
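The per-card efficiency figure quoted above is throughput divided by added power. A minimal sketch of that arithmetic follows. The 43 W value comes from the text; the 105 images/sec throughput is an assumed value, chosen only because it is consistent with "more than 100 inferences per second" and the 2.44 images/sec/W figure. It is not a measured number.

```python
def images_per_sec_per_watt(images_per_sec, watts):
    """Efficiency metric used in the figures: throughput divided by power."""
    return images_per_sec / watts


# 43 W is the per-card power quoted in the text.
# 105 images/sec is an assumed throughput, not a measured one.
pac_efficiency = images_per_sec_per_watt(105.0, 43.0)
print(round(pac_efficiency, 2))
```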



Figure 3: Inference performance tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm.



Figure 4: Performance per watt tests using ResNet-50 topology. FP11 precision. Image size is 224x224x3. Power measured via racadm.



Intel® FPGAs coupled with Intel® OpenVINO™ provide a complete solution for deploying deep learning models in production. FPGAs offer low power and flexibility that make them very suitable as an accelerator device for deep learning workloads.

Deploying trained neural network models for inference on different platforms is a challenging task. The inference environment is usually different from the training environment, which is typically a data center or a server farm. The inference platform may be power constrained and limited from a software perspective. The model might be trained using one of the many available deep learning frameworks such as TensorFlow, PyTorch, Keras, Caffe, MXNet, etc. Intel® OpenVINO™ provides tools to convert trained models into a framework-agnostic representation, including tools to reduce the memory footprint of the model using quantization and graph optimization. It also provides dedicated inference APIs that are optimized for specific hardware platforms, such as Intel® Programmable Acceleration Cards and Intel® Movidius™ Vision Processing Units.



The Intel® OpenVINO™ toolkit




  1. Model Optimizer

The Model Optimizer is a cross-platform command-line tool that facilitates the transition between the training and deployment environments, performs static model analysis, and adjusts deep learning models for optimal execution on end-point target devices. It is a Python script that takes a trained TensorFlow/Caffe model as input and produces an Intermediate Representation (IR), which consists of a .xml file containing the model definition and a .bin file containing the model weights.


  2. Inference Engine

The Inference Engine is a C++ library with a set of C++ classes to infer input data (images) and get a result. The C++ library provides an API to read the Intermediate Representation, set the input and output formats, and execute the model on devices. Each supported target device has a plugin, which is a DLL/shared library. The library also supports heterogeneous execution to distribute the workload across devices, and it supports implementing custom layers on a CPU while executing the rest of the model on an accelerator device.




  1. Using the Model Optimizer, convert a trained model to produce an optimized Intermediate Representation (IR) of the model based on the trained network topology, weights, and bias values.
  2. Test the model in the Intermediate Representation format using the Inference Engine in the target environment with the validation application or the sample applications.
  3. Integrate the Inference Engine into your application to deploy the model in the target environment.


Using the Model Optimizer to convert a Keras model to IR


The Model Optimizer doesn't natively support Keras model files. However, because Keras uses TensorFlow as its backend, a Keras model can be saved as a TensorFlow checkpoint, which can be loaded into the Model Optimizer. A Keras model can be converted to an IR using the following steps:

  1. Save the Keras model as a TensorFlow checkpoint. Make sure the learning phase is set to 0. Get the name of the output node.

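import tensorflow as tf
from keras.applications import ResNet50
from keras import backend as K
from keras.models import Model

K.set_learning_phase(0)   # Set the learning phase to 0 (inference mode)

# The ImageNet weights with the classification head expect 224x224x3 inputs,
# which also matches the --input_shape passed to the Model Optimizer in step 3.
model = ResNet50(weights='imagenet', input_shape=(224, 224, 3))
config = model.get_config()
weights = model.get_weights()
# ResNet50 is a functional model, so rebuild it with Model (not Sequential)
# and restore the trained weights.
model = Model.from_config(config)
model.set_weights(weights)

output_node = model.output.name.split(':')[0]  # We need this in the next step
# File names taken from the freeze_graph command in step 2.
graph_file = 'resnet50_graph.pb'
ckpt_file = 'resnet50.ckpt'
sess = K.get_session()
saver = tf.train.Saver(sharded=True)
tf.train.write_graph(sess.graph_def, '', graph_file)  # Save the graph definition
saver.save(sess, ckpt_file)                           # Save the checkpoint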


  2. Run the TensorFlow freeze_graph program to generate a frozen graph from the saved checkpoint.

tensorflow/bazel-bin/tensorflow/python/tools/freeze_graph --input_graph=./resnet50_graph.pb --input_checkpoint=./resnet50.ckpt --output_node_names=Softmax --output_graph=resnet50_frozen.pb


  3. Use the Model Optimizer script and the frozen graph to generate the IR. The model weights can be quantized to FP16.

     python <model_optimizer_script> --input_model=resnet50_frozen.pb --output_dir=./ --input_shape=[1,224,224,3] --data_type=FP16
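To give a feel for what storing weights in FP16 means, here is a minimal sketch that rounds a single value to IEEE 754 half precision using only the Python standard library. The weight value is hypothetical, and this does not reproduce the Model Optimizer's own conversion; it only illustrates the precision and storage trade-off.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', value))[0]


weight = 0.1  # hypothetical weight value, for illustration only

print(to_fp16(weight))                 # close to, but not exactly, 0.1
print(len(struct.pack('<f', weight)))  # bytes used by one FP32 value
print(len(struct.pack('<e', weight)))  # bytes used by one FP16 value
```

Halving the bytes per weight halves the size of the weights file, at the cost of a small rounding error in each value.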




The C++ library provides utilities to read an IR, select a plugin depending on the target device, and run the model.

  1. Read the Intermediate Representation - Using the InferenceEngine::CNNNetReader class, read an Intermediate Representation file into a CNNNetwork class. This class represents the network in host memory.
  2. Prepare inputs and outputs format - After loading the network, specify the input and output precision and the layout on the network. For these specifications, use the CNNNetwork::getInputInfo() and CNNNetwork::getOutputInfo() methods.
  3. Select Plugin - Select the plugin on which to load your network. Create the plugin with the InferenceEngine::PluginDispatcher load helper class. Pass per device loading configurations specific to this device and register extensions to this device.
  4. Compile and Load - Use the plugin interface wrapper class InferenceEngine::InferencePlugin to call the LoadNetwork() API to compile and load the network on the device. Pass in the per-target load configuration for this compilation and load operation.
  5. Set input data - With the network loaded, you have an ExecutableNetwork object. Use this object to create an InferRequest in which you signal the input buffers to use for input and output. Specify a device-allocated memory and copy it into the device memory directly, or tell the device to use your application memory to save a copy.
  6. Execute - With the input and output memory now defined, choose your execution mode:
    • Synchronously - Infer() method. Blocks until inference finishes.
    • Asynchronously - StartAsync() method. Check status with the Wait() method (0 timeout), block on Wait() until the result is ready, or specify a completion callback.
  7. Get the output - After inference is completed, get the output memory or read the memory you provided earlier. Do this with the InferRequest GetBlob API.
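The difference between the two execution modes in step 6 can be illustrated with a stand-in written in plain Python. This sketch does not use the Inference Engine API: run_inference is a hypothetical placeholder that sleeps instead of executing a model, and concurrent.futures plays the role of the asynchronous request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_inference(image_id):
    """Hypothetical stand-in for a model execution; sleeps instead of computing."""
    time.sleep(0.05)
    return "result for image %d" % image_id


# Synchronous: the call blocks until the result is ready.
sync_result = run_inference(0)

# Asynchronous: submit the request, keep working, then collect the result.
with ThreadPoolExecutor(max_workers=1) as pool:
    request = pool.submit(run_inference, 1)  # returns immediately
    still_running = not request.done()       # non-blocking status check
    async_result = request.result()          # blocks until completion

print(sync_result)
print(async_result)
```

The asynchronous pattern lets the host prepare the next input while the current one is still being processed, which is why the asynchronous sample is the usual choice for throughput-oriented runs.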


The classification_sample and classification_sample_async programs perform inference using the steps mentioned above. We use these samples in the next section to perform inference on an Intel® FPGA.



Using the Intel® Programmable Acceleration Card with Intel® Arria® 10GX FPGA for inference


The OpenVINO toolkit supports using the PAC as a target device for running low power inference. The steps for setting up the card are detailed here. Pre-processing and post-processing are performed on the host, while the model itself is executed on the card. The toolkit contains bitstreams for different topologies.

  1. Programming the bitstream

     aocl program <device_id> <open_vino_install_directory>/a10_dcp_bitstreams/2-0-1_RC_FP16_ResNet50-101.aocx                                                                                     

  2. The Hetero plugin can be used with CPU as the fallback device for layers that are not supported by the FPGA. The -pc flag prints performance details for each layer.

     ./classification_sample_async -d HETERO:FPGA,CPU -i <path/to/input/image.png> -m <path/to/ir>/resnet50_frozen.xml -pc
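The fallback behaviour behind HETERO:FPGA,CPU can be pictured as a per-layer device assignment. The sketch below is a simplified illustration, not the plugin's actual logic: the layer list and the set of FPGA-supported layer types are hypothetical.

```python
def assign_devices(layer_types, supported_on_primary,
                   primary="FPGA", fallback="CPU"):
    """Map each layer to the primary device if supported, else to the fallback."""
    return [
        (layer, primary if layer in supported_on_primary else fallback)
        for layer in layer_types
    ]


# Hypothetical layer list and support set, for illustration only.
layers = ["Convolution", "ReLU", "Pooling", "CustomLayer", "SoftMax"]
fpga_supported = {"Convolution", "ReLU", "Pooling"}

for layer, device in assign_devices(layers, fpga_supported):
    print(layer, "->", device)
```

The device order in the -d argument expresses the same priority: the first device is preferred, and later ones pick up whatever it cannot run.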




Intel® OpenVINO™ toolkit is a great way to quickly integrate trained models into applications and deploy them in different production environments. The complete documentation for the toolkit can be found at



Time series are a very important type of data in the financial services industry. Interest rates, stock prices, exchange rates, and option prices are good examples of this type of data. Time series forecasting plays a critical role when financial institutions design investment strategies and make decisions. Traditionally, statistical models such as SMA (simple moving average), SES (simple exponential smoothing), and ARIMA (autoregressive integrated moving average) have been widely used to perform time series forecasting tasks.
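For readers less familiar with the traditional baselines, here is a minimal sketch of the two simplest ones, SMA and SES, in plain Python. The series values, the window size, and the smoothing factor alpha are made-up illustration values.

```python
def sma_forecast(series, window):
    """Simple moving average: forecast the next value as the mean of the last `window` points."""
    recent = series[-window:]
    return sum(recent) / len(recent)


def ses_forecast(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


rates = [1.0, 2.0, 3.0, 4.0]  # made-up numbers, not real market data

print(sma_forecast(rates, window=2))
print(ses_forecast(rates, alpha=0.5))
```

Both methods have one tunable parameter each, which is part of why neural networks, with many more degrees of freedom, are attractive for this task.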


Neural networks are promising alternatives, as they are more robust for such regression problems due to the flexibility of their model architectures (e.g., there are many hyperparameters that we can tune, such as the number of layers, number of neurons, learning rate, etc.). Recently, applications of neural network models in time series forecasting have been gaining more and more attention from the statistical and data science communities.


In this blog, we first discuss some basic properties that a machine learning model must have to perform financial service tasks. Then we design our model based on these requirements and show how to train the model in parallel on an HPC cluster with Intel® Xeon processors.



Requirements from Financial Institutions


High accuracy and low latency are two important properties that financial service institutions expect from a quality time series forecasting model.


High Accuracy: A high level of accuracy in the forecasting model helps companies lower the risk of losing money in investments. Neural networks are believed to be good at capturing the dynamics in time series and hence yield more accurate predictions. The model has many hyperparameters that data scientists and quantitative researchers can tune to obtain the optimal model. Moreover, the data science community believes that ensemble learning tends to improve prediction accuracy significantly. The flexibility of model architecture provides a good variety of member models for ensemble learning.


Low Latency: Operations in financial services are time-sensitive. For example, high frequency trading usually requires models to finish training and prediction within very short time periods. For deep neural network models, low latency can be achieved by distributed training with Horovod or distributed TensorFlow. Intel® Xeon multi-core processors, coupled with Intel's MKL-optimized TensorFlow, prove to be a good infrastructure option for such distributed training.


With these requirements in mind, we propose an ensemble learning model as in Figure 1, which is a combination of MLP (Multi-Layer Perceptron), CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) models. Because architecture topologies for MLP, CNN and LSTM are quite different, the ensemble model has a good variety in members, which helps reduce risk of overfitting and produces more reliable predictions. The member models are trained at the same time over multiple nodes with Intel® Xeon processors. If more models need to be integrated, we just add more nodes into the system so that the overall training time stays short. With neural network models and HPC power of the Intel® Xeon processors, this system meets the requirements from financial service institutions.
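As a rough illustration of how member forecasts can be combined, the sketch below averages the predictions of the three members at each time step. Simple averaging is an assumption made for this example; the text does not state how the ensemble combines its members. The prediction values are made up.

```python
def ensemble_predict(member_predictions):
    """Average the forecasts of the member models, one time step at a time."""
    n_members = len(member_predictions)
    return [sum(step) / n_members for step in zip(*member_predictions)]


# Made-up forecasts from three member models over two time steps.
mlp_pred = [0.010, 0.012]
cnn_pred = [0.011, 0.013]
lstm_pred = [0.012, 0.011]

combined = ensemble_predict([mlp_pred, cnn_pred, lstm_pred])
print(combined)
```

Because the members are independent until this final step, each can be trained on its own node, and adding a member only adds one more list to the average.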


Figure 1: Training high accuracy ensemble model on HPC cluster with Intel® Xeon processors



Fast Training with Intel® Xeon Scalable Processors


Our tests used Dell EMC's Zenith supercomputer, which consists of 422 Dell EMC PowerEdge C6420 nodes, each with 2 Intel® Xeon Scalable Gold 6148 processors. Figure 2 shows an example of time-to-train for training MLP, CNN and LSTM models with different numbers of processes. The data set used is the 10-Year Treasury Inflation-Indexed Security data. For this example, running distributed training with 40 processes is the most efficient, primarily because the data size in this time series is small and the neural network models we used did not have many layers. With this setting, model training can finish within 10 seconds, much faster than training the models with one processor that has only a few cores, which typically takes more than one minute. Regarding accuracy, the ensemble model can predict this interest rate with an MAE (mean absolute error) of less than 0.0005. Typical values for this interest rate are around 0.01, so the relative error is less than 5%.
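The relative error quoted above follows directly from the two numbers in the text: the MAE bound of 0.0005 divided by the typical rate of 0.01. A minimal sketch of the MAE definition and that division, with made-up actual and predicted values used only to exercise the function:

```python
def mean_absolute_error(actual, predicted):
    """MAE: average of |actual - predicted| over all points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Relative error bound implied by the figures in the text.
mae_bound = 0.0005
typical_rate = 0.01
relative_error = mae_bound / typical_rate
print(relative_error)

# Made-up values, for illustration only.
example_mae = mean_absolute_error([0.010, 0.012], [0.0105, 0.0115])
print(example_mae)
```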


Figure 2: Training time comparison (Each of the models is trained on Intel® Xeon processors within one node)





With both high accuracy and low latency being critical for time series forecasting in financial services, neural network models trained in parallel using Intel® Xeon Scalable processors stand out as very promising options for financial institutions. As financial institutions need to train more complicated models to forecast many time series with high accuracy at the same time, the need for parallel processing will only grow.