Smart Speakers? Let’s call them Voice Assistants
The launch of Amazon’s Echo device in 2014 popularized the idea of devices where the only means of interaction is through voice and conversation. Now it seems that every month brings a new development in dedicated devices that process voice commands and perform actions. However, what exactly are these devices?
The popular media calls them “smart speakers” or “voice assistants” or “intelligent personal assistants”, but these terms aren’t exactly synonymous. The term “smart speaker” conjures up a primarily output-oriented device that aims to replace keyboard or button interaction with voice commands. Yet that’s a particularly trivial application for the billions of dollars invested by Amazon, Google, Microsoft, Apple, Alibaba, Tencent, Samsung, Baidu, and others that see this as a critical market to dominate. After all, why are all these vendors so aggressively marketing and promoting these devices if all they do is let you play Taylor Swift on vocal demand or ask about the weather?
Clearly there’s a bigger play here than simply voice-activated speakers. The smart speaker is a way to interact more intelligently with the customer base, get into a larger number of households and businesses, and get people comfortable with using these devices. The power is not in the speaker, but in the cloud-based technology behind the device.
Not Smart Speakers. Intelligent Conversational Assistants.
If you ask Amazon and the others, you’ll learn that playing music and games and simply responding to queries is not the end state of their vision for what these conversational gateway devices will be. These devices are low-cost input and output hardware that serve as a gateway to the much more powerful infrastructure that sits in the major tech companies’ data centers. The device itself gives this away. You can even build your own full-featured conversational device for just a few dollars. So let’s dispense with the clearly ill-fitting term “smart speaker”. It belies the real power of these devices. Rather than just being passive devices, intelligent conversational assistants can proactively act on your behalf, performing tasks that require interaction with other humans and, perhaps soon, with other conversational assistants on the other end. The speaker part, ironically, is the least relevant part in making that happen. It just provides the output. All the power happens prior to that output.
Indeed, where exactly is the conversational device? For example, in the Google Duplex demo where Google Assistant interacts with a restaurant to make a reservation, the speaker itself is not even there. It’s all happening behind the scenes through a cloud-based interaction. We don’t see a device because the device is not necessary here. These devices are just gateways to the real activity that’s happening in the cloud-based data centers. In the Google Duplex demo, the conversational agent is acting completely behind the scenes from Google’s data center, interacting over voice over IP (VoIP) telephone lines with a human on the other end.
So why are devices needed at all if they’re just gateways? They’re needed because they provide the user interface to the cloud-based intelligence services. Without a device, the only way to access these services is through a web, desktop, or mobile interface, which is inefficient. Amazon wasn’t truly the first to bring voice-based assistants to market. Apple had them beat by over three years with Siri, and Google introduced its voice-based assistant in Android just a short while after. What made Amazon stand out with its Echo devices, though, is that the mobile phone was eliminated entirely. Rather than activating the assistant through a phone, you can simply speak in the middle of whatever activity you’re doing and trigger intelligent capabilities. Basically, the value of the device is in its hands-free mode of interaction, but the intelligence of the device is in the back-end infrastructure.
How Intelligent Are These Devices?
In 2018, Cognilytica announced the creation of a Voice Assistant Benchmark to test the intelligence of the devices, which it followed up with another benchmark in 2019. (Disclosure: I’m an analyst with Cognilytica.) The purpose of the benchmark isn’t to test the natural language processing (NLP) or natural language generation (NLG) capabilities of the devices, which are now fairly standard and available to anyone who wants access to high-quality natural language capabilities. Nor is the intent of the benchmark to see what sort of skills these devices can perform. We know that better NLP/NLG means the ability to handle a wider range of voices, accents, languages, and speaker characteristics, and more skills mean more single-task capabilities. Those are all “table stakes” as far as we’re concerned. The purpose of the benchmark is to see how truly intelligent these devices are, beyond just being voice-activated search-and-retrieval tools.
If the power of the devices is not in the device itself, but in the back-end intelligence that gives these devices real capabilities, then we need to test how intelligent that back end really is. Can the conversational agents understand when you’re comparing two things? Do they understand implicit, unspoken things that require common sense or cultural knowledge? For example, a conversational agent scheduling a hair appointment should know that you shouldn’t schedule a haircut a few days after your last haircut, or schedule a root canal appointment right before a dinner party. These are things that humans can do because we have knowledge, intelligence, and common sense. Yet as it stands, and as we demonstrated in our initial benchmark, neither the Google Home nor Amazon Echo nor Apple Siri devices can answer the question “what’s larger: the sun or the earth?” Would you trust these devices to run your life? Not yet. However, we aim to help move things in that direction.
The Implications of an Intelligent Conversational Assistant
In the not-so-distant future, intelligent assistants will be everywhere. We’ll be interacting with them daily in both our personal and business lives. We’ll be chatting with assistants in our homes, and also interacting with other people’s and businesses’ conversational agents. In a future where everyone has a personal electronic virtual assistant, we’ll have them do everything from messaging friends when we’re putting together a birthday party, to scheduling all the logistics for that party, to dealing with inbound calls from late attendees who can’t make it. Soon enough, as dependent as we are now on our GPS systems for keeping us from getting lost and our mobile phones for keeping us always connected, we’ll be dependent on these intelligent assistants for keeping our lives in order. This is simply the direction in which things are heading.
However, there’s a downside to the use of intelligent assistants. In an article in The Verge, experts point out that humans will want to know whether or not they’re talking to a robot. Clearly people will be frustrated by the mistakes that early generations of intelligent assistants make. Yet there’s an even darker potential outcome. Criminals and mischief makers can use voice assistants to tie up phone lines, cause retail “denial of service” attacks by scheduling fake appointments, and cause harm by feeding people false information to get them to leave their houses or otherwise tie up resources. In the future, we’ll need a sure-fire way to know who the speaker on the phone is, what their intentions are, and how real the requests are. The future (which is really here now) is one in which we can’t believe anything we see or hear. This makes verifying reality incredibly important in an AI-enabled future where intelligent assistants are part of our everyday lives.
We still have a long way to go before our assistants are like the type we see in science fiction movies and television shows. If we want our intelligent conversational assistants to be like the computer in Star Trek: The Next Generation, we need them to become more useful, more intelligent, and more trustworthy. This is why we need intelligent assistants and not just so-called “smart speakers”.