Deconstructing Alexa - Software and sensors of the Amazon Echo and beyond

June 30, 2016


The Amazon Echo is the epitome of an Internet of Things (IoT) device. It combines an embedded applications processor from Texas Instruments, MEMS microphones from Knowles, Wi-Fi and Bluetooth wireless connectivity, an AWS cloud backend, and support for diverse applications. It’s also multi-function, which increases the platform’s value for consumers (bundled services) as well as for Amazon (multi-dimensional insights into customer behavior and trends). The glue that ties all of this together is, of course, software.

The Echo’s signature feature, automatic speech recognition (ASR), is enabled by software algorithms that not only provide the language modeling and natural language understanding capabilities that make the platform unique, but also help offset the effects of reverberant speech. Reverberant speech is a phenomenon that occurs in indoor environments when an audible signal reflects or bounces off various surfaces, creating noise in the form of echoes that diminish the direct-path signal from speaker to microphone. Reverberation wreaks havoc on speech recognition, yet in the Echo’s real-world use case, reverberant speech is often the only signal available from a speaker communicating with the device.
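To make the effect concrete, here is a minimal Python sketch (not tied to any Amazon implementation) that models reverberation as the convolution of a clean signal with a room impulse response. The signal and the impulse response are synthetic stand-ins chosen purely for illustration.

```python
# Illustrative sketch: reverberation modeled as convolution of a clean signal
# with a synthetic room impulse response (RIR). All values are assumptions.
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                                    # 16 kHz sample rate, typical for ASR

# Stand-in for one second of clean speech (white noise used as a placeholder).
clean = rng.standard_normal(fs)

# Toy RIR: a direct-path impulse followed by decaying reflections.
rir = np.zeros(int(0.3 * fs))                 # 300 ms of reverberation
rir[0] = 1.0                                  # direct path
taps = rng.integers(200, len(rir), size=50)   # 50 random reflection arrival times
rir[taps] = 0.5 * np.exp(-taps / (0.1 * fs))  # later reflections are weaker

reverberant = np.convolve(clean, rir)         # what a far-field microphone "hears"

# The reflections smear energy across time, lowering the direct-to-reverberant
# ratio that speech recognizers rely on.
direct_energy = np.sum(clean ** 2)
tail_energy = np.sum(reverberant ** 2) - direct_energy
print(f"direct-to-reverberant ratio: {10 * np.log10(direct_energy / tail_energy):.1f} dB")
```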

Jeff Adams, CEO of Cobalt Speech & Language, Inc. and former Senior Manager of the speech and language groups at Amazon, worked on the Echo. He attributes the platform’s success in situations where his wife yells, “‘Alexa, what time is it?’ and hears the answer even though she’s three rooms away, down the hall, and around the corner” to cloud-based deep neural networks (DNNs) capable of performing roughly 1 billion arithmetic operations per second in support of ASR algorithms, beamforming, and noise cancellation techniques. But while Adams suggests that kind of computing power became possible only after successive cycles of Moore’s law, and could at some point be available on processors beyond the data center, those performance requirements leave little hope for accurate ASR today in embedded devices that aren’t backed by the power of the cloud.
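Beamforming is one of the techniques Adams mentions, and the basic idea can be shown without a DNN at all. The sketch below is a generic delay-and-sum beamformer over an assumed four-microphone linear array with synthetic signals; it is not the Echo’s actual array geometry or algorithm.

```python
# Minimal delay-and-sum beamforming sketch. Array geometry, arrival angle, and
# signals are all assumptions made for illustration.
import numpy as np

fs = 16000
c = 343.0                      # speed of sound, m/s
n_mics = 4
spacing = 0.05                 # 5 cm uniform linear array (assumed)
theta = np.deg2rad(30)         # assumed direction of the talker

# Per-microphone delays for a plane wave arriving from angle theta.
mic_positions = np.arange(n_mics) * spacing
delay_samples = np.round(mic_positions * np.sin(theta) / c * fs).astype(int)

rng = np.random.default_rng(1)
source = rng.standard_normal(fs)               # stand-in for speech

# Simulate the array: the source arrives with per-mic delays plus independent noise.
mic_signals = np.zeros((n_mics, fs + delay_samples.max()))
for m in range(n_mics):
    mic_signals[m, delay_samples[m]:delay_samples[m] + fs] = source
mic_signals += 0.5 * rng.standard_normal(mic_signals.shape)

# Delay-and-sum: re-align each channel toward the assumed direction and average,
# which reinforces the talker and averages down uncorrelated noise.
aligned = np.array([np.roll(mic_signals[m], -delay_samples[m]) for m in range(n_mics)])
beamformed = aligned.mean(axis=0)

def snr_db(est, ref):
    noise = est[:len(ref)] - ref
    return 10 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

print("single mic SNR:", round(snr_db(mic_signals[0], source), 1), "dB")
print("beamformed SNR:", round(snr_db(beamformed, source), 1), "dB")
```

Averaging the four re-aligned channels roughly doubles the signal-to-noise ratio in this toy setup, which is the same principle a production beamformer exploits on a much richer scale.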

Sensors, software, and embedded speech rec

Even though acoustic and language models such as those used for the Echo can be compressed, the reality is that compression comes with tradeoffs. The more ASR models are compressed, the less accurate they become, and the supported vocabulary typically shrinks dramatically, from the linguistic openness of platforms like the Echo to perhaps a few hundred or a few thousand words. Furthermore, even after compression you’re probably still talking about hundreds of megabytes for such models, which is a huge burden on even high-end smartphones.
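The size/accuracy tradeoff can be pictured with a simple example. The following sketch quantizes a random 32-bit weight matrix to 8-bit integers, a common compression technique; the matrix, its dimensions, and the quantization scheme are assumptions for illustration, not details of any shipping ASR model.

```python
# Hedged illustration of the compression tradeoff: quantizing weights from
# 32-bit floats to 8-bit integers shrinks storage ~4x but introduces error.
import numpy as np

rng = np.random.default_rng(2)
weights = rng.standard_normal((1024, 1024)).astype(np.float32)  # toy weight matrix

# Symmetric linear quantization to int8.
scale = np.abs(weights).max() / 127.0
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

orig_mb = weights.nbytes / 1e6
quant_mb = quantized.nbytes / 1e6
err = np.abs(weights - dequantized).mean()

print(f"float32: {orig_mb:.1f} MB -> int8: {quant_mb:.1f} MB")
print(f"mean absolute quantization error: {err:.4f}")
```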

However, innovations in sensor technology are emerging that could help remove some of the overhead associated with massive DNNs, namely the use of multiple, heterogeneous inputs. For instance, Cobalt is partnering with human-to-machine communications (HMC) company VocalZoom, a manufacturer of optical sensors that pair with acoustic microphones to eliminate background noise and improve directional acquisition for speaker isolation.

The optical sensor technology works by converting vibrations from a speaker’s cheek, larynx, and other facial areas into an audio signal, though one devoid of background noise due to the low frequencies at which skin vibrates. This information is then fused with inputs from traditional acoustic microphones to generate noise-free audio signals that can be leveraged, in the absence of cloud-based DNNs, to reduce the effects of reverberant speech, and can even enable applications such as access control and voice authentication. For example, such an implementation could prevent systems like the Echo from waking up when a TV commercial mentions “Alexa” (more on optical sensors can be found in “Delivering more natural, personalized, and secure voice control for today’s connected world”).
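VocalZoom has not published its fusion algorithms, but one crude way to picture the idea is to use the optical channel (assumed here to be noise-free) as a voice-activity gate on the acoustic microphone, as in the sketch below. The signals, threshold, and frame size are all invented for illustration; this is not the companies’ actual method.

```python
# Speculative acoustic/optical fusion sketch: pass acoustic frames only when the
# optical channel indicates the nearby talker is actually speaking.
import numpy as np

fs = 16000
rng = np.random.default_rng(3)

t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220 * t) * (t > 0.5)          # talker active in 2nd half
acoustic = speech + 0.8 * rng.standard_normal(fs)          # mic: speech + room noise
optical = speech * (1 + 0.05 * rng.standard_normal(fs))    # skin vibration, little noise

frame = 400                                                # 25 ms frames
gated = np.zeros_like(acoustic)
for i in range(fs // frame):
    sl = slice(i * frame, (i + 1) * frame)
    if np.sqrt(np.mean(optical[sl] ** 2)) > 0.1:           # optical says "voice present"
        gated[sl] = acoustic[sl]

# Frames with no local talker (e.g., only a TV in the background) are suppressed.
print("noise energy before:", round(np.sum(acoustic[t <= 0.5] ** 2), 1))
print("noise energy after :", round(np.sum(gated[t <= 0.5] ** 2), 1))
```

The same gating logic hints at the false-wake scenario: a TV saying “Alexa” produces acoustic energy but no facial vibration, so the gate stays closed.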

Additionally, Adams says that other sensors are starting to be considered in the ASR equation, particularly as his company works toward speech classification engines designed to infer background information about a speaker, such as age, gender, and physical and emotional state, and possibly even to aid in the early diagnosis of medical conditions like Parkinson’s and Alzheimer’s. Cameras and inputs from medical devices would be obvious complements in these types of applications, which could lead to the next level of sensor data fusion for the Internet of Things.
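As a purely illustrative example of what such a classification engine might look like at its simplest, the sketch below trains a standard logistic-regression classifier on synthetic per-utterance features. The features, labels, and data are fabricated and bear no relation to Cobalt’s actual work.

```python
# Toy speaker-attribute classifier: synthetic per-utterance features feeding a
# standard classifier. Everything here is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

def toy_features(n, mean_pitch):
    """Stand-in features per utterance: [mean pitch (Hz), energy, speaking rate]."""
    return np.column_stack([
        rng.normal(mean_pitch, 20, n),   # pitch differs across the two toy classes
        rng.normal(1.0, 0.2, n),         # energy (uninformative here)
        rng.normal(4.0, 0.5, n),         # syllables/sec (uninformative here)
    ])

# Two synthetic speaker groups distinguished mainly by average pitch.
X = np.vstack([toy_features(200, 120.0), toy_features(200, 210.0)])
y = np.array([0] * 200 + [1] * 200)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```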

 

Brandon Lewis, Technology Editor