Robotics Vision Processing: Object Detection and Tracking

August 14, 2019 Dev Singh, Qualcomm Technologies, Inc.

Robots see, analyze, and make decisions more like humans every day, thanks to advances in converging technologies like artificial intelligence (AI), machine learning (ML), and computer vision (CV). Developing such visual analysis logic involves implementing solutions that can determine the orientation of objects, deal with moving objects, and perform navigation. The foundation for this starts with two important tasks[1]:

  • Preprocessing data collected from the real world via sensors to get it into a more usable state by various subsystems
  • Performing feature detection to extract visual features from the data such as corners, edges, etc.

Once these systems are in place, you can move on to higher-level robotic vision functionality, namely: object detection and classification, and object tracking and navigation. Let’s take a closer look at each.

Detecting Objects and Orientations

Object detection and classification has traditionally been challenging due to variations in viewpoints, different sized images, and dynamic illumination conditions. One solution that can help is to employ a neural network that has been trained to detect and classify objects.

One popular approach is to employ a convolutional neural network (CNN), where small regions of the image are fed into the network in a process known as “sliding windows.” And while this may sound a bit intimidating from a development perspective, thankfully there are a number of AI frameworks that can help you such as Caffe2 and TensorFlow, and the ONNX format. Furthermore, some device manufacturers provide SDKs that can work with output from those frameworks. For example, the Qualcomm® Neural Processing Engine SDK provides a pipeline for converting and optimizing output from these frameworks for use on Qualcomm® Snapdragon™ Mobile Platform, which powers the Qualcomm® Robotics RB3 Platform.

Another task is to determine the orientation of objects, which is important for both object interaction and navigation. The main challenge here is determining the orientation of an object and/or the robot itself in 3D world-space. A popular approach is to apply homography algorithms such as linear least square solver, random sampling and consensus (RANSAC), and least median of squares, to compute points between frames of 2D imagery. And again, some device manufacturers provide SDK support for this level of logic as well. For example, the Qualcomm® Computer Vision SDK provides developers with hardware-accelerated homography and pose evaluation APIs for this purpose.

Once objects have been detected, they can then be assigned metadata such as an ID, bounding box, etc. which can be used during object detection and navigation.

Objects and people can be detected and identified. Image courtesy of Qualcomm Technologies, Inc.

Object Tracking and Navigation

With objects and aspects of the surrounding environment identified, a robot then needs to track them. Since objects can move around, and the robot’s viewport will change as it navigates, developers will need a mechanism to track these elements over time and across frames captured by the camera(s) and other sensors. Since this mechanism must be fast enough to run every frame, numerous algorithms have been devised over the years, which approach the problem in different ways.

For example, Centroid Tracking computes the center point of a bounding box around an identified object across frames, and then computes the distance between the point as it changes under the assumption that the object will only move a certain distance each frame. Another approach is to use a Kalman filter that uses statistics over time to predict the location of an object.

Alternatively, the mean shift algorithm is an approach that basically finds the mean of some aspect of an image (e.g., color histogram) within a sub region of a frame. It then looks for the same description within the next frame by seeking to maximize similarities in features. This allows it to account for changes such as scale, orientation, etc., and to ultimately track where the object is.

Since these techniques only need to track a subset of the original features, they can generally deal with changes such as orientation or occlusion, efficiently and with good success, which makes them effective for robotics vision processing.

But objects aren’t the only things that need to be tracked. The robot itself should be able to successfully navigate its environment, and this is where Simultaneous Localization and Mapping (SLAM) comes in. SLAM seeks to estimate a robot’s location and derive a map of the environment. It can be implemented using a number of algorithms such as Kalman filters. SLAM is often implemented by fusing data from multiple sensors, and when it involves visual data, the process is often referred to as Visual-Inertial Simultaneous Localization and Mapping (VISLAM).


Applying multiple filters from multiple sensors to gather tracking information. Image courtesy of Qualcomm Technologies, Inc.

Of course, SLAM is only as good as what the robot can sense, so developers should choose high-quality cameras and sensors, and find ways to ensure that they’re not blocked from capturing data. From a safety aspect, developers should also devise fail safes in case data cannot be acquired (e.g., the cameras become covered).

To help with this, try to look for SDK support from the device manufacturer here as well. For example, the Qualcomm® Machine Vision SDK, provides algorithms to determine position and orientation including VISLAM using an extended Kalman.

Our next generation of robots with the use of computer vision and machine learning are more sophisticated with their abilities to ‘see’ their surroundings, ‘analyze’ dynamic scenarios or changing conditions, and ‘make decisions.’ This will require developers to become well versed in higher-level robotic vision functionality and tools for object detection and classification, and object tracking and navigation.

Author’s Bio

Dev Singh serves as a director of business development at Qualcomm Technologies, Inc. (QTI), where he is currently the global head of Robotics and Intelligent Machines business segment in QTI’s Industrial IoT business unit.

Qualcomm Snapdragon, Qualcomm Neural Processing SDK, Qualcomm Robotics RB3, Qualcomm Computer Vision SDK and Qualcomm Machine Vision SDK are products of Qualcomm Technologies, Inc. and/or its subsidiaries.


[1] For additional information see this blog.

Previous Article
Advancements in Medical Electronics

The impact of IoT and AI on health care involves medical electronics such as robot-assisted surgeries, diag...

Next Article
Compact Industrial PCs Bolster Compute Power, Graphics and IoT Connectivity

A new series of ultra-compact rugged embedded computers from Portwell is targeting industrial applications ...