I’ve written a series of blogs about consumer devices with speech recognition, like Amazon Echo. I mentioned that everyone is getting into the “always listening” game (Alexa, OK Google, Hey Siri, Hi Galaxy, Assistant, Hey Cortana, OK Hound, etc.), and I’ve explained that privacy concerns attempt to be addressed by putting the “always listening” mode on the device, rather than in the cloud.
Let’s now look deeper into the “always listening” approaches and compare some of the different methods and platforms available for embedded triggers.
There are a few basic approaches for running embedded voice wakeup triggers:
First, is running on an embedded DSP, microprocessor, and/or smart microphones. I like to think of this as a “deeply embedded: approach as opposed to running embedded on the operating system (OS). Knowles recently announced a design with a smart mike that provides low-power wake up assistance.
Many leading chip companies have small DSPs that are enabled for “wake up word” detection. These vendors include Audience, Avnera, Cirrus Logic, Conexant, DSPG, Fortemedia, Intel, InvenSense, NXP, Qualcomm, QuickLogic, Realtek, STMicroelectronics, TI, and Yamaha. Many of these companies combine noise suppression or acoustic echo cancellation to make these chips add value beyond speech recognition. Quicklogic recently announced availability of an “always listening” sensor fusion hub, the EOS S3, which lets the sensor listen while consuming very little power.
Next is DSP IP availability. The concept of low-power voice wakeup has gotten so popular amongst processor vendors that the leading DSP/MCU IP cores from ARM, Cadence, CEVA, NXP CoolFlux, Synopsys, and Verisilicon all offer this capability, and some even offer special versions targeting this function.
Running on an embedded OS is another option. Bigger systems like Android, Windows, or Linux can also run voice wake-up triggers. The bigger systems might not be so applicable for battery-operated devices, but they offer the advantage of being able to implement larger and more powerful voice models that can improve accuracy. The DSPs and MCUs might run a 50-kbyte trigger at 1 mA, while bigger systems can cut error rates in half by increasing models to hundreds of megabytes and power consumption to hundreds of milliamps. Apple used this approach in its initial implementation of Siri, thus explaining why the iPhone needed to be plugged in to be “always listening.”
Finally, one can try combinations and multi-level approaches. Some companies are implementing low-power wake-up engines that look to a more powerful system when woken up to confirm its accuracy. This can be done on the device itself or in the cloud. This approach works well for more complex uses of speech technology like speaker verification or identification, where the DSPs are often crippled in performance and a larger system can implement a more state of the art approach. It’s basically getting the accuracy of bigger models and systems, while lowering power consumption by running a less accurate and smaller wakeup system first.
A variant of this approach is accomplished with a low-power speech detection block acting as an always listening front-end, that then wakes up the deeply embedded recognition. Some companies have erred by using traditional speech-detection blocks that work fine for starting a recording of a sentence (like an answering machine), but fail when the job is to recognize a single word, where losing 100 ms can have a huge effect on accuracy. Sensory has developed a very low power hardware sound-detection block that runs on systems like the Knowles mike and Quicklogic sensor hub.
Todd Mozer is the CEO of Sensory. He holds over a dozen patents in speech technology and has been involved in previous startups that reached IPO or were acquired by public companies. Todd holds an MBA from Stanford University, and has technical experience in machine learning, semiconductors, speech recognition, computer vision, and embedded software.