I used to get frustrated when I went on Netflix and got recommendations meant for 12-year-old girls, because my daughters’ preferences had been mixed into mine. Then Netflix had the brilliant idea that each user should “log in” so that the device would know who they are.
But why do I need to do this? Why can’t my TV just look at me and create a profile for me without me having to log in? Computer vision is here, and it’s not that difficult. Over the next couple of years, we’ll see a proliferation of new uses for embedded vision, such as authentication.
In the television example, power consumption isn’t a big issue because the TV doesn’t need to be “always looking,” just momentarily looking when you turn it on. For products that are always on and always listening for speech, it’s a little trickier. Ultra-low-power voice triggers can “wake up” devices while drawing less than 1 mW on average. However, can this “wake-up” command really be used to identify the speaker?
You might be surprised that it can. Speaker verification can be added to a trigger phrase with only a small effect on memory and power consumption, and you can see this with audio wake-up phrases like “OK Google Now” and “Hey Siri.” However, to keep size and power low enough for the algorithms to run on tiny DSPs, sacrifices must be made in the quality of speaker verification.
Sensory is deploying a new approach that runs a “light” speaker verification trigger on the DSP, paired with state-of-the-art embedded speaker verification on the main processor. With AC-powered products, high-quality speaker verification can be deployed seamlessly on the main processor. This approach can yield better than 90% acceptance of the right speaker while rejecting 99.999% of the wrong speakers. In a home-use application for a shared product, the performance can be nearly perfect.
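To make the two-stage idea concrete, here is a minimal sketch in Python. It is not Sensory’s implementation: the feature vectors, the cosine-similarity scoring, the thresholds, and every name in it (`Enrollment`, `identify`, `LIGHT_THRESHOLD`, `FULL_THRESHOLD`) are assumptions made purely for illustration. The only point is the shape of the cascade: a cheap, permissive check runs first, and the expensive, strict check runs only on what survives it.

```python
import math
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for this illustration.
LIGHT_THRESHOLD = 0.5   # permissive: the cheap stage should rarely reject the right speaker
FULL_THRESHOLD = 0.9    # strict: the expensive stage does the real rejecting


@dataclass
class Enrollment:
    """Hypothetical enrolled user with one template per stage."""
    name: str
    light_template: list   # small template, as might fit on a tiny DSP
    full_template: list    # richer template for the main processor


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(light_features, full_features, enrolled):
    """Return the name of the verified speaker, or None."""
    # Stage 1 (stand-in for the always-on DSP): cheap, loose screening.
    candidates = [
        e for e in enrolled
        if cosine(light_features, e.light_template) >= LIGHT_THRESHOLD
    ]
    if not candidates:
        return None  # the main processor never has to wake up

    # Stage 2 (stand-in for the main processor): strict check on survivors.
    best = max(candidates, key=lambda e: cosine(full_features, e.full_template))
    if cosine(full_features, best.full_template) >= FULL_THRESHOLD:
        return best.name
    return None


household = [
    Enrollment("alice", [1.0, 0.0], [1.0, 0.0, 0.0]),
    Enrollment("bob", [0.0, 1.0], [0.0, 1.0, 0.0]),
]

# An utterance close to alice's templates passes both stages.
print(identify([0.9, 0.1], [0.95, 0.05, 0.0], household))
# A stranger who slips past the loose first stage is rejected by the second.
print(identify([0.7, 0.7], [0.6, 0.6, 0.5], household))
```

The design choice worth noticing is the asymmetry between the thresholds: the first stage can afford to be sloppy because the second stage cleans up after it, and the second stage can afford to be expensive because it runs so rarely.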
People will start saying “Hey Jibo” this year and will be very impressed by Jibo’s ability to know who they are. I wouldn’t be surprised if high-accuracy speaker verification and vision authentication get deployed in consumer products over the coming years to automatically collect data on usage and preferences, so that a variety of home products come pre-configured to the way each user wants them.
We’ll also see more applications deployed using vision and voice to protect our transactions. Check out Applock by Sensory for Android phones. It’s free and will give you a sense of today’s state of the art in face and/or voice authentication.
Todd Mozer is the CEO of Sensory. He holds over a dozen patents in speech technology and has been involved in previous startups that reached IPO or were acquired by public companies. Todd holds an MBA from Stanford University, and has technical experience in machine learning, semiconductors, speech recognition, computer vision, and embedded software.