Abstract — The widespread use of voice to interact with smartphones, tablets and personal assistance systems has jump-started the use of voice as the HMI (Human-Machine Interface) of choice across other technologies. In today’s smart homes, for example, users can ask Alexa to turn on or off lights, lock doors and adjust thermostats. As this technology becomes more commonplace, demand is building for technical solutions that increase the interaction between humans and machines using highly accurate, compact and power-efficient Neural Network-based key phrase detection solutions. Unlike cloud-connected Neural Network (NN) key phrase detection implementations that require network connectivity, edge-based solutions perform all computing at the edge and do not record or transmit data to the cloud.
This paper describes an NN-based key phrase detection solution designed for the network edge. The solution uses Binarized NN models that run on low-power UltraPlus™ FPGAs. The paper also discusses how key phrase detection can be used in noisy environments when the NN is trained with a dataset that includes a noisy background such as music or chatter. The Neural Network in this case is trained using a public dataset to detect the word “seven.” Key phrase detection can be used in a wide range of applications without the need for a personal assistant device. Possible applications include smart light switches, smart TVs, and AVRs that manage devices with commands such as volume up and down.
Using voice commands to control the human-machine interface (HMI) has been a goal of systems designers for a long time. Popular science fiction TV shows and movies dating back to the mid-20th century, such as “Star Trek” and “Star Wars,” gave us a hint of what a voice-enabled world might look like. But developing low cost, power-efficient solutions for real life consumer applications has proved elusive.
Over the last few years, however, the emergence of popular AI applications like Amazon’s Alexa and Apple’s Siri and their ability to convert voice commands into system actions have accelerated the migration to voice-based HMI. These rapid advances have opened the door to a growing array of solutions for the smart home that rely on key phrase detection. Today, users can ask Alexa to order products over the internet, turn on lights, lock doors, set the home thermostat, and even water the grass.
Typically, these voice-enabled HMIs perform the calculations required to recognize a key phrase in the cloud. In many cases designers plug their application into a pre-existing infrastructure like Amazon’s Alexa. However, this development strategy faces several limitations. First and foremost is cost. Solutions that run a key phrase detection algorithm on a server in the cloud must pay for usage by the minute every time they access cloud resources. In addition, developers building cloud-based edge solutions must pay a non-recurring engineering (NRE) charge to train their solution for a particular device and then pay a royalty on every unit they ship. And designers who plug their design into a pre-existing infrastructure will see their costs rise as they move to a Wi-Fi model that requires a more powerful processor to acquire the data, analyze it, send it over to the edge device, and listen to the command over Wi-Fi.
Moreover, relying on an internet connection brings additional risks. Using an internet connection to transmit data to the cloud can lead to interruptions in service if the connection goes down. Transmitting data over the internet also poses a potential hacking risk. And from the user’s perspective, internet connectivity opens the door to privacy violations and security concerns. Edge solutions that rely on computational resources located directly on the device avoid these potential problems.
II. NEW APPROACH
This article explores a different approach to bringing lower-cost key phrase detection to devices located on the network edge. By combining advances in highly accurate, compact and low-cost Binarized Neural Network (BNN) models with improvements in a new generation of very low power Field Programmable Gate Arrays (FPGAs), designers can now build key phrase detection solutions that perform all computing at the edge and thereby eliminate the connectivity, security and privacy concerns associated with cloud-connected NN key phrase detection implementations.
By performing key phrase detection locally, this design strategy offers significant cost savings compared to cloud-based solutions. It also does not rely on other ecosystems to operate. If the internet connection fails in a cloud-based solution, the system fails. A local, edge-based solution doesn’t run this risk. Security and privacy issues are not a threat. And a local solution is easier for the user to set up and run. Finally, using Lattice’s ultra-low-power iCE40 UltraPlus FPGAs, this approach offers designers significant power savings, an important consideration in battery-powered devices. As an example, the solution described in this presentation consumes only about 7 mW.
A key step to bringing affordable smart home applications to the edge has been the development of Binarized NN models capable of running on low density, low power FPGAs. The deep learning techniques that use floating-point computation in the cloud are impractical for consumer applications at the edge. Instead, designers must develop computationally-efficient solutions that meet accuracy targets, but also comply with the cost, size and power constraints found in consumer markets. Accordingly, designers operating on the edge must use math that employs as few bits as possible.
One way designers can simplify that computation is to switch from floating-point to fixed-point or even basic integers. By compensating for the quantization of floating-point values to fixed-point integers, designers using Binarized NNs can develop solutions that train faster with higher accuracy and raise the performance of fixed-point, low-precision integer NNs close to the level of floating-point versions. To build simple edge devices, training must create NN models with 1-bit weights. These models are called Binarized Neural Networks (BNNs).
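The step from floating-point to fixed-point and then to 1-bit values can be illustrated with a minimal sketch. The 8-bit format with 7 fractional bits, the helper names and the sample weights are all assumptions made for illustration; they are not taken from the design described in this paper.

```python
def to_fixed_point(x, frac_bits=7):
    """Quantize a float to a signed 8-bit fixed-point integer (assumed format)."""
    scaled = round(x * (1 << frac_bits))
    return max(-128, min(127, scaled))  # saturate to the int8 range


def to_one_bit(x):
    """Binarize a float weight: 1 encodes +1, 0 encodes -1."""
    return 1 if x >= 0 else 0


# Hypothetical floating-point weights, used only to show the two conversions.
weights = [0.73, -0.12, 0.05, -0.98]
fixed = [to_fixed_point(w) for w in weights]  # one byte per weight
bits = [to_one_bit(w) for w in weights]       # one bit per weight
```

Each reduction in precision shrinks the storage per weight, from 32 bits to 8 bits to a single bit in this sketch, which is what makes small on-chip memories sufficient.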
By using 1-bit values instead of larger numbers, BNNs can eliminate the use of multiplication and division. This allows convolutions to be computed using XOR and population count (popcount) operations, resulting in significant cost savings and up to 16x power savings. And with today’s FPGAs, designers have a highly flexible platform that supplies all the memory, logic and DSP resources they need.
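To show why 1-bit values remove the need for multipliers, the sketch below computes a dot product of two +1/-1 vectors using only XOR and a bit count, and compares it against the ordinary multiply-and-add result. The bit encoding (1 for +1, 0 for -1), the function names and the sample vectors are illustrative assumptions, not the FPGA implementation.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int, one bit per value (1 = +1, 0 = -1)."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word


def binary_dot(a_word, w_word, n):
    """Dot product of two n-element +1/-1 vectors stored as packed bits."""
    mismatches = bin(a_word ^ w_word).count("1")  # XOR marks positions where signs differ
    return n - 2 * mismatches  # each match contributes +1, each mismatch -1


# Hypothetical activations and weights, used only to check the identity.
activations = [+1, -1, -1, +1, +1, -1, +1, +1]
weights = [+1, +1, -1, -1, +1, -1, -1, +1]

reference = sum(a * w for a, w in zip(activations, weights))
fast = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
```

Both paths give the same result, but the second needs no multiplier, only an XOR gate per bit and an adder tree to count the set bits.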
III. NN IMPLEMENTATION
The discussion below describes an example of a key phrase detection solution designed for edge applications and implemented in an iCE40 UltraPlus FPGA with a BNN soft core. During normal operation, the key phrase detection implementation listens for a sound while consuming less than 1 mW. Once the system detects a sound, it buffers 1 second of audio and invokes the BNN. The BNN operates directly on the raw input rather than on the output of conventional spectrogram and MFCC pre-processing. The 16K raw samples representing the 1 second of audio pass through an overlapped 1D convolution layer and become 30 images of 32x32x3, each representing a 10 ms audio sample. The output is then passed into the main BNN for processing.
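The listen-then-buffer behavior described above can be sketched as a small state machine. The paper does not say how the presence of a sound is detected, so the amplitude threshold used here is purely an assumption, as are the function name and the test signal.

```python
SAMPLE_RATE = 16_000      # samples per second, matching the 16K samples in the text
BUFFER_LEN = SAMPLE_RATE  # one second of audio per BNN invocation


def capture_windows(samples, threshold):
    """Yield 1-second buffers, each starting at a sample that reaches the threshold."""
    buffer = None  # None means "idle, listening for a sound"
    for s in samples:
        if buffer is None:
            if abs(s) >= threshold:  # assumed trigger: simple amplitude check
                buffer = [s]         # sound detected, start buffering
        else:
            buffer.append(s)
        if buffer is not None and len(buffer) == BUFFER_LEN:
            yield buffer             # a full second is ready for the BNN
            buffer = None            # return to the low-power listening state


# Hypothetical input: silence, one loud sample, then a quiet tail.
stream = [0] * 100 + [500] + [1] * 20_000
windows = list(capture_windows(stream, threshold=100))
```

In this example a single window is produced, because the quiet tail that follows the first buffered second never reaches the threshold again.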
The BNN is four layers deep with each layer performing the functions shown below:
Binary Convolution is a 1-bit multiplication of the input data by the 1-bit weights. In this case, multiplication is replaced by an XOR function.
Batch Normalization and Scale normalizes the activations and helps during the BNN training phase.
Rectified Linear Unit (ReLU) sets data below a specific threshold to 0 and data above that threshold to 1.
Pool is performed on each group of four adjacent pixels of the image and selects the pixel with the highest probability of being meaningful. This function reduces the amount of computation needed in subsequent steps.
Fully Connected is usually the last layer. It takes every neuron in the previous layer as input and applies a weight to each connection to the next layer’s neurons. This function is generally computationally expensive, so it is performed as the last operation, where there are significantly fewer neurons.
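The functions listed above can be chained in a minimal sketch of one layer. It uses a single-channel square bit map, a 3x3 kernel and made-up normalization constants, all of which are assumptions for illustration; the fully connected stage is omitted, and the real BNN soft core is not implemented this way.

```python
def binary_conv2d(image, kernel):
    """Valid 2-D convolution of square 0/1 bit maps; XOR counts mismatching bits."""
    k = len(kernel)
    out_size = len(image) - k + 1
    out = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            mismatches = sum(
                image[r + i][c + j] ^ kernel[i][j]
                for i in range(k) for j in range(k)
            )
            row.append(k * k - 2 * mismatches)  # equivalent +1/-1 dot product
        out.append(row)
    return out


def normalize_and_binarize(fmap, mean, scale, threshold=0.0):
    """Shift and scale each value, then step to 1 above the threshold, else 0."""
    return [[1 if (v - mean) * scale > threshold else 0 for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


# Hypothetical 6x6 input whose left half is 1 and right half is 0.
image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]
kernel = [[1, 1, 1] for _ in range(3)]

conv = binary_conv2d(image, kernel)                      # 4x4 integer map
active = normalize_and_binarize(conv, mean=0.0, scale=1.0)
pooled = max_pool_2x2(active)                            # 2x2 bit map
```

The pooled output keeps a 1 on the left, where the kernel matched the input, and a 0 on the right, so the edge in the input survives while the map shrinks from 4x4 to 2x2.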
The BNN is trained on a GPU using standard training tools such as Caffe and TensorFlow. The training dataset is a public set of 65,000 one-second-long utterances of 30 short words by more than 1,000 people. The output of the training tools, the trained weights, is then passed through the NN compiler tool from Lattice Semiconductor, which formats it for use by the FPGA design. You can think of the weights as a template of the key phrase to be used during inference on the edge hardware. The key phrase selected is “seven.”
IV. SYSTEM IMPLEMENTATION
To demonstrate the functionality of the system, engineers used the HiMax HM01B0 UPduino shield with an iCE40 UltraPlus FPGA. This is a low-cost board in the Arduino form factor designed to demonstrate the capability of the FPGA. The board has two I2S microphones connected directly to the FPGA, as well as external flash memory for storing the FPGA design, weights and activations. It also features LEDs to indicate detection of the key phrase. Users can speak directly into the microphones. Once the key phrase is detected, an LED turns on.
In this application, the FPGA design frequency and processing time can be traded for power consumption. At 27 MHz, the 16K raw samples that make up 1 second of audio can be processed in 25 ms while consuming 7.7 mW. When the frequency is reduced to 13.5 MHz, power consumption drops to 4.2 mW and the same 1-second audio sample is processed in 50 ms.
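The two operating points quoted above can be cross-checked with simple arithmetic. The sketch below derives the clock cycles used per inference and the fraction of each 1-second window spent computing; the function names are illustrative, and only the frequency, latency and power figures come from the text.

```python
def cycles_per_inference(clock_hz, latency_ms):
    """Clock cycles spent on one inference (integer arithmetic)."""
    return clock_hz * latency_ms // 1000


def duty_cycle(latency_ms, audio_ms=1000):
    """Fraction of each audio window the design spends computing."""
    return latency_ms / audio_ms


operating_points = [
    # (clock in Hz, latency in ms, power in mW), figures quoted in the text
    (27_000_000, 25, 7.7),
    (13_500_000, 50, 4.2),
]

for clock_hz, latency_ms, power_mw in operating_points:
    print(clock_hz, cycles_per_inference(clock_hz, latency_ms),
          duty_cycle(latency_ms), power_mw)
```

Both settings work out to the same 675,000 cycles per inference, which is consistent with a fixed amount of work per 1-second window: halving the clock doubles the processing time while lowering power.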
Key phrase detection often must operate in noisy environments without additional hardware for noise and echo cancellation. This implementation achieved this goal, without the need for localization and beamforming, by training the NN with datasets that included noisy backgrounds. The trained NN detects the key word much as a human does, with similar limitations. Crowd noise (café, conference, etc.) at various random levels was added to the key phrase recordings. NNs trained with higher noise levels become more robust to noise, but require a louder key phrase.
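Adding background noise at random levels to the training utterances might look like the sketch below. The uniform gain range, the function names and the tiny sample clips are assumptions for illustration; the paper does not describe the actual mixing procedure or noise levels used.

```python
import random


def mix_with_noise(speech, noise, noise_gain):
    """Add scaled background noise to a speech clip, sample by sample."""
    return [s + noise_gain * n for s, n in zip(speech, noise)]


def augment(speech, noise_clips, rng, max_gain=1.0):
    """Pick a random noise clip and a random level (assumed uniform), then mix."""
    noise = rng.choice(noise_clips)
    gain = rng.uniform(0.0, max_gain)
    return mix_with_noise(speech, noise, gain), gain


# Hypothetical four-sample clips standing in for 1-second recordings.
rng = random.Random(0)
speech = [0.5, -0.5, 0.25, 0.0]
noise_clips = [[0.1, 0.1, -0.1, -0.1], [0.2, -0.2, 0.2, -0.2]]

noisy, gain = augment(speech, noise_clips, rng)
```

Raising `max_gain` in this sketch corresponds to training with higher noise levels, which the text notes improves robustness at the cost of requiring a louder key phrase.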
The BNN can detect up to ten 1-second key phrases, making it ideal for HMI via voice. To improve detection accuracy, a time-domain filter is employed to report a key phrase detection only when detections occur consecutively. The design delivers accuracy as high as 99 percent for a single key phrase and up to 90 percent for up to five key phrases.
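A time-domain filter of this kind can be sketched as a streak counter that suppresses isolated detections. The requirement of two consecutive detections is an assumed value chosen for illustration; the paper does not state how many consecutive detections the design requires.

```python
def consecutive_filter(raw_detections, required=2):
    """Report a detection only after `required` consecutive positive results."""
    streak = 0
    reported = []
    for hit in raw_detections:
        streak = streak + 1 if hit else 0  # any miss resets the streak
        reported.append(streak >= required)
    return reported


# Hypothetical per-inference results: one isolated hit, then a sustained run.
raw = [False, True, False, True, True, True, False]
filtered = consecutive_filter(raw)
```

The isolated hit at the second position is dropped, while the sustained run is reported from its second detection onward, trading a short delay for fewer false triggers.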
Bringing AI to the edge presents several significant challenges. However, it also offers tremendous opportunity. As this project has demonstrated, building AI into a device using an FPGA implementing a BNN instead of cloud-based resources can dramatically cut HW cost, while accelerating response time. At the same time, keeping processing local improves security and saves valuable bandwidth and server usage costs.
This paper was delivered at Embedded World 2019.
Hussein Osman, Product Marketing Manager
Hussein Osman is a product marketing manager at Lattice Semiconductor, where he is responsible for the iCE40 FPGA product line and for developing solutions for consumer market segments such as mobile devices, wearables and consumer IoT devices. Mr. Osman has over 15 years of experience in the semiconductor industry. He has worked as a system engineer in Touch, Cap sense, Fingerprint and USB technologies. Mr. Osman received his bachelor’s degree in Electrical Engineering from California Polytechnic State University in San Luis Obispo.