Real-Time Image Registration on Embedded GPU System with OpenCL™

Date: May 6, 2012


This article presents a novel solution for registering a live input video stream (720p, 1280×720 pixels) on a small, low-power system.
The application is GPU based and written with OpenCL™. It was developed recently for a local defense customer in the field of image processing (computer vision).

With the right algorithmic adjustments and an appropriate architecture, the application runs in real time and targets embedded systems where competing solutions (DSP, FPGA) failed to deliver.
The main purpose of this solution is to serve as a new-generation digital signal processor (DSP) for sensor platforms, based on a general-purpose GPU architecture with heterogeneous computing (CPU + GPU) capabilities.


The Problem

Image registration is the process of transforming a sequence of images (a video stream acquired from a sensor) into a common coordinate system, which produces a smoother visual flow.
In real life, physical conditions or normal movement cause vibrations that affect the images a sensor gathers. When the sensor is not stabilized, a continuous sequence of frames looks shaky or unbalanced.

The purpose of image registration is simply to fix that, so that the output video stream is smoother than the input.
Direct applications can vary from defense to medical imaging and more.

A registration process usually consists of the following stages: estimating motion vectors between two related images, aligning the images, and applying further correction/enhancement filters to improve image and stream quality.
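To make the alignment stage more concrete, the sketch below applies a 2D affine transform to pixel coordinates and inverts it, which is what is typically needed to map each output pixel back to its source location. The `Point` and `Affine2D` types and the `apply`/`invert` functions are hypothetical names chosen for illustration; this is plain C++ on the CPU, not the OpenCL™ code of the application.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical types, for illustration only.
struct Point { double x; double y; };

// Affine map: x' = a*x + b*y + tx,  y' = c*x + d*y + ty.
struct Affine2D { double a, b, tx, c, d, ty; };

// Map a point through the affine transform.
inline Point apply(const Affine2D& m, Point p) {
    return { m.a * p.x + m.b * p.y + m.tx,
             m.c * p.x + m.d * p.y + m.ty };
}

// Invert the transform; the linear part must be non-singular.
inline Affine2D invert(const Affine2D& m) {
    const double det = m.a * m.d - m.b * m.c;
    assert(std::fabs(det) > 1e-12 && "singular affine transform");
    const double ia =  m.d / det, ib = -m.b / det;
    const double ic = -m.c / det, id =  m.a / det;
    // The inverse translation is -(L^-1 * t), where L is the linear part.
    return { ia, ib, -(ia * m.tx + ib * m.ty),
             ic, id, -(ic * m.tx + id * m.ty) };
}
```

A warp loop would usually call `apply(invert(m), p)` for every output pixel `p` and interpolate the source image at the resulting position, so that no output pixel is left unfilled.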

In defense, registration is widely used in all sensor-based components, from ground to aerial systems, with different applications.
To add complexity, computational demands are very high (higher resolutions and frame rates), the hardware must be physically small while still dissipating heat well, and TDP must stay low to limit power consumption.

In the case discussed below, the requirement was to process an input video stream frame by frame.
The image format was set to 720p@120, a common standard in the defense industry:
High-Definition (HD), 1280×720 progressive, single-channel images at a frame rate of 120Hz (FPS). Each pixel is a 16-bit integer grayscale level, a widely used channel configuration.
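The raw numbers behind this format are worth spelling out. The short sketch below derives the frame size, the sustained input data rate and the per-frame time budget from the figures stated above; the constant and function names are hypothetical and used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Stream parameters as stated in the article.
constexpr std::uint64_t kWidth = 1280;
constexpr std::uint64_t kHeight = 720;
constexpr std::uint64_t kBytesPerPixel = 2;    // 16-bit grayscale
constexpr std::uint64_t kFramesPerSecond = 120;

// Size of one uncompressed frame in bytes.
constexpr std::uint64_t bytes_per_frame() {
    return kWidth * kHeight * kBytesPerPixel;
}

// Sustained input data rate in bytes per second.
constexpr std::uint64_t bytes_per_second() {
    return bytes_per_frame() * kFramesPerSecond;
}

// Time available to process one frame, in milliseconds.
constexpr double frame_budget_ms() {
    return 1000.0 / static_cast<double>(kFramesPerSecond);
}
```

Each frame is 1,843,200 bytes, the stream carries about 221 MB/s of input, and every frame has to be transferred, registered and output in roughly 8.3 ms.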


To solve the algorithmic problem, an iterative gradient-descent method, run until convergence, was chosen, with an affine/perspective transformation of the coordinate system.
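As an illustration of the general idea, the following is a minimal sketch of gradient-descent registration under strong simplifying assumptions: it estimates a pure translation (not the affine/perspective model used in the application), minimizes a sum-of-squared-differences cost, uses a fixed step size, and runs in plain C++ on the CPU rather than in OpenCL™. The `Image` and `Shift` types and the `make_blob`, `sample` and `register_translation` functions are hypothetical names, and the Gaussian blob is a synthetic test frame.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal image type: row-major, single-channel float pixels.
struct Image {
    int width = 0;
    int height = 0;
    std::vector<float> pixels;
    float at(int x, int y) const {
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

struct Shift { float tx; float ty; };

// Synthetic test frame: a Gaussian blob centred at (cx, cy).
inline Image make_blob(int width, int height, float cx, float cy, float sigma) {
    Image img;
    img.width = width;
    img.height = height;
    img.pixels.resize(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const float dx = x - cx, dy = y - cy;
            img.pixels[static_cast<std::size_t>(y) * width + x] =
                std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
        }
    return img;
}

// Bilinear interpolation; the position must lie inside the image.
inline float sample(const Image& img, float x, float y) {
    const int x0 = std::min(std::max(static_cast<int>(std::floor(x)), 0), img.width - 2);
    const int y0 = std::min(std::max(static_cast<int>(std::floor(y)), 0), img.height - 2);
    const float fx = x - x0, fy = y - y0;
    const float top = (1.0f - fx) * img.at(x0, y0) + fx * img.at(x0 + 1, y0);
    const float bottom = (1.0f - fx) * img.at(x0, y0 + 1) + fx * img.at(x0 + 1, y0 + 1);
    return (1.0f - fy) * top + fy * bottom;
}

// Estimates the shift t that minimises sum((mov(p + t) - ref(p))^2)
// using fixed-step gradient descent.
inline Shift register_translation(const Image& ref, const Image& mov,
                                  int iterations, float step) {
    Shift t{0.0f, 0.0f};
    for (int it = 0; it < iterations; ++it) {
        double gx = 0.0, gy = 0.0;
        for (int y = 0; y < ref.height; ++y)
            for (int x = 0; x < ref.width; ++x) {
                const float sx = x + t.tx, sy = y + t.ty;
                // Skip pixels whose neighbourhood falls outside the moving image.
                if (sx < 1.0f || sy < 1.0f ||
                    sx > mov.width - 2.0f || sy > mov.height - 2.0f) continue;
                const float r = sample(mov, sx, sy) - ref.at(x, y);
                // Central differences approximate the image gradient.
                const float dx = 0.5f * (sample(mov, sx + 1.0f, sy) - sample(mov, sx - 1.0f, sy));
                const float dy = 0.5f * (sample(mov, sx, sy + 1.0f) - sample(mov, sx, sy - 1.0f));
                gx += 2.0 * r * dx;
                gy += 2.0 * r * dy;
            }
        t.tx -= step * static_cast<float>(gx);
        t.ty -= step * static_cast<float>(gy);
    }
    return t;
}
```

For example, with `ref = make_blob(64, 64, 32, 32, 6)` and `mov = make_blob(64, 64, 34, 31, 6)`, calling `register_translation(ref, mov, 200, 0.2f)` should recover a shift close to (2, -1). The step size of 0.2 suits this synthetic frame only; real imagery would need a step scaled to the image content, a convergence test instead of a fixed iteration count, and usually a coarse-to-fine pyramid.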

The software was implemented in C/C++ with OpenCL™, for reasons discussed below, and runs on Linux (the original operating environment), Microsoft Windows 7 and Android. Real-time variants are supported as well.

The chosen hardware is an AMD APU (full model: AMD Embedded G-Series T56N) in a COM Express form factor.

Featured image: AMD G-Series ("eOntario") T56N development module

Main APU architecture specifications of T56N:

  • CPU Clock Speed: 1.60GHz/1.65GHz
  • CPU Cores: 2
  • CPU L1 Cache: 128KB (64KB per core)
  • CPU L2 Cache: 1MB (512KB per core)
  • RAM: DDR3-1333MHz
  • GPU: AMD Radeon™ HD 6320, 80 Cores
  • Max TDP: 18W
  • Computational Capability: ~100 GFLOPS

This APU proved able to deliver the required performance and graphics capabilities within the project constraints.
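A rough sanity check of that claim divides the quoted peak throughput by the pixel rate of the stream. The sketch below does this arithmetic; `flops_per_pixel` is a hypothetical helper, and the ~100 GFLOPS value is the peak figure from the list above, so sustained throughput in practice will be lower.

```cpp
#include <cassert>

// Peak floating-point operations available per pixel per frame:
// peak throughput divided by the pixel rate of the stream.
constexpr double flops_per_pixel(double peak_gflops, int width, int height, int fps) {
    return peak_gflops * 1e9 /
           (static_cast<double>(width) * height * fps);
}
```

For the 720p@120 stream, `flops_per_pixel(100.0, 1280, 720, 120)` comes to roughly 900 operations per pixel per frame at peak. This is the upper bound that every iteration of the registration, plus any filtering, has to share.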

This APU family is called ‘eOntario’ and offers multiple models, ranging in TDP from 5.5W (ULV, ultra-low voltage) to 18W, with varying computational performance. As the prefix ‘e’ implies, this line of heterogeneous “Fusion” processors is offered for embedded use, with very low TDP and 7 years of longevity for industrial production needs.

For a full list of specifications, please visit the AMD website.

Development Process

Development took several months to arrive at a complete, tested application. Common matrix operations had to be optimized by hand to meet the overall performance goals.

These hand-tuned implementations outperformed existing CPU libraries, such as AMD BLAS, Intel IPP/MKL and others, by a factor of 20.
The entire algorithm flow was implemented in OpenCL™ on the GPU, leaving only the general CPU↔GPU I/O transfers outside it.

Final Notes

OpenCL™ was selected as the GPU/CPU compute framework of choice, since it is the only such framework available on both Intel and AMD platforms. Its cross-vendor, cross-platform support and ease of use made the development process agile and robust.

To achieve the performance goals, the full set of software features for parallel asynchronous processing, provided by OpenCL™ and multi-core technologies, was used.

The implementation did not consume all available compute resources in the system, leaving capacity for other applications and for tasks such as post-processing and management. About 1.5 CPU cores remained free (while GPU processing was active at full speed) for other tasks, video acquisition in particular.

The advanced graphics display capabilities of the system were used together with OpenGL interoperability to provide a live display of the streaming video results on an attached screen.

The AMD Embedded G-Series T56N APU was found to perform quite well as a digital signal processor for a wide variety of image processing applications. Indeed, this architecture can compete well against DSP variants, which are commonly used in the defense and medical markets.

The algorithm was found to execute properly under real-time constraints. Numerical results were accurate (within machine precision limits) when compared to a standalone CPU implementation using OpenCV and Intel IPP.
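A comparison of this kind usually comes down to checking two output buffers element by element against a tolerance. The following is a minimal sketch of such a check, assuming both implementations produce float buffers of equal length; `buffers_match` and its tolerance parameters are hypothetical and not taken from the application's test code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when both buffers have the same length and every pair of
// elements differs by at most abs_tol + rel_tol * |reference value|.
inline bool buffers_match(const std::vector<float>& result,
                          const std::vector<float>& reference,
                          float abs_tol, float rel_tol) {
    if (result.size() != reference.size()) return false;
    for (std::size_t i = 0; i < result.size(); ++i) {
        const float limit = abs_tol + rel_tol * std::fabs(reference[i]);
        // Negated comparison so that NaN values are reported as mismatches.
        if (!(std::fabs(result[i] - reference[i]) <= limit)) return false;
    }
    return true;
}
```

Combining an absolute and a relative term keeps the check meaningful both for values near zero and for large values, where GPU and CPU floating-point results can legitimately differ in the last bits.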

It should be noted that, at the time of writing, no other vendor except Intel (with the recent Ivy Bridge generation) has a competitive hardware offering that combines low TDP with matching computational performance for general-purpose embedded environments.

AMD eOntario and eTrinity APU Comparison

Furthermore, the implementation was tested and verified on the upcoming AMD APU architecture codenamed “eTrinity”, which has a similar TDP of 17W.

The table below provides a brief comparison between AMD eOntario G-Series (T56N) and eTrinity R-Series (R-260H) APUs.
[table id=5 /]

Results: “eTrinity” ran most algorithm building blocks 2-3 times faster than the “eOntario” T56N.
