These days we seem to take it for granted how powerful and sophisticated computers have become. We can talk to our phones and our Bluetooth speakers and they will respond with context-aware information; in certain cars we can take our hands off the wheel and let ourselves be carried down the road by electronics; and we can share messages and pictures with anyone anywhere in the world at the touch of a button.
But one area where our devices are still very much in their infancy is computer ‘vision’. We have ever-better cameras in our pockets, but in terms of understanding the world these devices are relatively dumb: they can see with ever greater clarity, yet they can’t understand what they are seeing.
For example, show a three-year-old child an image of a person standing next to an elephant and they will have no trouble telling you what they see, but for a computer to do the same is extremely challenging.
However, things are changing. In recent years a field of computing called ‘deep learning’ has greatly enhanced the ability of computers to understand as well as they can see. Rather than relying on traditional image processing techniques, deep learning – and specifically the use of convolutional neural networks (CNNs) – is beginning to make significant inroads into giving computers the ability to make sense of the world.
Convolutional neural networks were first pioneered in the late 1980s, building on earlier work from the 1960s on artificial neural networks (ANNs) and multilayer perceptrons (MLPs). They were originally designed to work in a similar way to the human brain and, much like a human brain, they need lots of data on which to be trained in order to do their job well.
CNNs became more widely known and used around 2005 with the rise of modern GPUs, whose ability to process repetitive tasks at speed made it practical to use CNNs at scale.
Work on giving computers visual intelligence made a significant leap in 2012, when Alex Krizhevsky used a convolutional neural network to win the ImageNet challenge. ImageNet is a huge database of millions of labelled images, created in 2007 by Professor Fei-Fei Li at Princeton University to give computers enough training data to help them learn in the same way a child would. The ImageNet challenge is commonly described as the annual Olympics of computer vision: it tests how well a computer can learn to understand what it’s seeing across a large selection of images; the fewer the errors, the better the score.
At the time, Krizhevsky was able to reduce the error rate from 26% to 15% – a major improvement, and it was all made possible by the use of a convolutional neural network. Each year this is improved further as teams create ever better systems to speed up and improve the ability of devices to understand images.
But how are CNNs being used in the real world today and what impact are they having?
In a famous scene from 2001: A Space Odyssey, astronauts David Bowman and Frank Poole hide in a pod where HAL, the ship’s computer, cannot hear them discussing his odd behaviour. However, HAL is able to read their lips and works out that they are going to deactivate him – with infamous results. Were he built today, he might well use a CNN to decipher what they were saying. There are more down-to-earth uses for a lip-reading computer, such as producing transcriptions from video content where the audio is not available – for example, enabling journalists to obtain off-mic comments from politicians or celebrities.
A group from the University of Oxford has proposed using a CNN exactly for this, while another paper submitted to the IEEE proposes how a CNN could be used to “reduce the negative influence caused by shaking of the subject and face alignment blurring at the feature-extraction level.” It produced a word recognition rate of up to 71.76%, far superior to conventional methods.
However, you can also see the power of CNNs running in your hand today. An app called AIPoly, designed to assist blind and partially sighted people, leverages an Imagination PowerVR GPU to identify objects through the smartphone camera and say out loud what they are.
CNNs are closely associated with automotive applications, but actively using them to power self-driving cars is still a work in progress. This paper from Cornell University discusses how CNNs can be used to recognise car license plates, delivering better results than conventional approaches. Of course, license plates aren’t as unpredictable as moving objects such as pedestrians, but another paper discusses using CNNs to detect pedestrians with improved efficiency over previous methods.
When it comes to those pesky moving objects known as people, CNNs are also expected to play a key role as the foremost algorithm type used in ADAS and autonomous vision systems in cars. CNNs are extremely efficient at analysing a scene and breaking it up into its recognisable components until objects – people, cars, trucks, road kerbs and road signs – are recognised through a camera-based system. By training on vast amounts of data, the convolutional network can ‘learn’ what to look for and extract it from a scene while driving in real time. As an example, the early layers of a CNN might detect corners and curves, later layers circles, then road signs and, finally, what the road sign means. This output is then passed to a sensor fusion element that combines inputs from other sensors, e.g. LiDAR or radar, to make sense of the bigger picture, and then acts upon them, either by flashing a warning via the Multi Media Interface or by taking control of braking and/or steering.
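The low-level end of that layered hierarchy – detecting corners and edges – comes down to sliding small filters over the image. A minimal NumPy sketch (purely illustrative: a real CNN learns its kernels from training data rather than using a hand-written Sobel filter):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A Sobel kernel responds strongly to vertical edges -- the kind of
# low-level feature a trained CNN's first layer typically converges on.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy "image": dark on the left, bright on the right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

edges = conv2d(image, sobel_x)
print(edges)  # strongest response along the dark-to-bright boundary
```

Deeper layers apply the same operation to the outputs of earlier layers, which is how simple edge responses compose into circles, signs and, eventually, meanings.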
CNNs can be implemented on the CPU, using GPU compute – which is much more efficient, often by at least a factor of ten – or via hardware acceleration, which ultimately yields the highest performance at the lowest power and silicon footprint.
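Why do GPUs and dedicated accelerators win so decisively? Convolution can be reorganised into one big dense matrix multiply – the classic ‘im2col’ transformation – which is exactly the kind of massively parallel arithmetic those chips are built for. A minimal NumPy sketch of the idea:

```python
import numpy as np

def im2col(image, kh, kw):
    """Unfold every kh-by-kw patch of the image into one row of a matrix."""
    ih, iw = image.shape
    rows = []
    for y in range(ih - kh + 1):
        for x in range(iw - kw + 1):
            rows.append(image[y:y + kh, x:x + kw].ravel())
    return np.array(rows)

kernel = np.arange(9, dtype=float).reshape(3, 3)
image = np.random.rand(32, 32)

# The whole convolution collapses into a single matrix-vector product --
# the shape of work GPUs and CNN accelerators parallelise efficiently.
patches = im2col(image, 3, 3)                     # (900, 9)
result = (patches @ kernel.ravel()).reshape(30, 30)
```

On real hardware the unfolding and the multiply are fused and tiled rather than materialised like this, but the mapping from sliding windows to dense linear algebra is the same.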
By their very nature, CNNs are very good at detecting patterns, making them well suited to assisting in medical situations. As this article on Nature.com discusses, they can be very effective in increasing the accuracy of recognising cancers, and have been used for “primary breast cancer detection, glioma grading and epithelium and stroma segmentation”. Their efficiency means they can reduce the workload for pathologists, and the paper concludes that “‘deep learning’ holds great promise to improve the efficacy of prostate cancer diagnosis and breast cancer staging.”
Equally, a paper from Cornell University on using CNNs to aid in breast cancer screening considers the issues that arise when training images are down-sampled, and suggests that image resolution must be maintained to ensure the best performance.
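The risk the paper highlights can be illustrated with a toy example: down-sampling by average pooling dilutes a single-pixel detail in proportion to the pooling area (the bright pixel here is a purely hypothetical stand-in for a fine diagnostic feature, not real medical data):

```python
import numpy as np

def downsample(image, factor):
    """Average-pool the image by the given factor along both axes."""
    h, w = image.shape
    return image.reshape(h // factor, factor,
                         w // factor, factor).mean(axis=(1, 3))

# One bright pixel stands in for a small, clinically important detail.
image = np.zeros((8, 8))
image[3, 4] = 1.0

small = downsample(image, 4)
print(image.max(), small.max())  # 1.0 vs 0.0625: diluted 16x by 4x4 pooling
```

After down-sampling, the detail’s contrast has dropped by the pooling area (4 × 4 = 16), which is why a network trained on reduced-resolution images can simply stop seeing it.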
If you’re afraid of the thought of computers building themselves, then you might have cause to worry: the semiconductor industry is looking at using deep learning to aid the design and manufacture of advanced integrated circuits. CNNs are seen as very well suited to solving certain manufacturing problems. In a similar vein to identifying cancers, their ability to spot patterns will be put to good use in the lithography process, greatly reducing manufacturing defects and helping to increase yields.
CNNs are also being widely used to recognise foods. This paper discusses using a CNN for automatic diet recognition to enable specialists to discover unhealthy food patterns. There are several papers that describe using CNNs in this way; this one refers to it as ‘DeepFood’ for computer aided dietary assessment, improving health and longevity.
Making digital images look great is a skill that many people spend a lot of time perfecting through careful use of image retouching tools. An experimental process from Adobe and Cornell University called “Deep Photo Style Transfer” is looking to make those people redundant by applying artificial intelligence: it can take the style of one photo and automatically apply it to another, with dramatic results.
CNNs are also widely used by sites such as Facebook. Here the company describes how they use one in DeepText, which they describe as “a deep-learning-based text understanding engine that can understand with near-human accuracy the textual content of several thousand posts per second, spanning more than 20 languages.”
Imagination is naturally looking closely at ways of accelerating inference engines – that is, running CNNs on devices once they have been fully trained on datasets. As we demonstrated last year, our PowerVR Rogue GPUs already offer 3x greater efficiency and up to 12x faster performance than running on a CPU, and our new PowerVR Furian architecture will offer even greater performance and power efficiency.
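Inference is just the forward pass: the trained weights are fixed, and the device only has to push camera frames through the layers. A minimal NumPy sketch of a tiny conv → ReLU → max-pool → dense pipeline (the random weights and layer sizes here are illustrative stand-ins; in a real deployment the weights come from offline training):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling, trimming any ragged edge."""
    h, w = x.shape
    return x[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size).max(axis=(1, 3))

# Stand-in weights; on-device these are loaded, never trained.
kernel = rng.standard_normal((3, 3))
dense = rng.standard_normal((13 * 13, 10))   # 10 hypothetical classes

image = rng.standard_normal((28, 28))        # one camera frame
features = np.maximum(conv2d(image, kernel), 0.0)  # conv + ReLU: 26x26
pooled = max_pool(features)                        # 13x13
logits = pooled.ravel() @ dense                    # per-class scores
print(logits.argmax())                             # predicted class index
```

Every operation here is fixed-function arithmetic on known-size tensors, which is why inference maps so well onto GPUs and dedicated hardware.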
One of our recent blog posts highlights our work in this area and how we are the first to make use of the CNN extension in OpenVX, the open standard API for computer vision.
We are continuing to work in this sphere and Imagination’s Paul Brasnett recently spoke at the Embedded Vision Summit on the subject of ‘Training CNNs for Efficient Inference’. In his presentation, he explained Imagination’s approach to improving the efficiency of running CNNs on hardware where power and area constraints are of primary concern, such as on mobile devices and in automotive.
It’s an exciting time for computer vision, and Imagination will be at the heart of it. We look forward to bringing you news regarding upcoming products that will make even greater strides in this area in the coming months.