Conversational AI (artificial intelligence) has become a hot topic in tech in recent years, in part because of its ubiquity: if you have a smartphone, you’ve got access to it. Merely saying “Hey Siri” into your iPhone or “Hey Google” into your Android devices gives you access to the automatic speech recognition (ASR) technology embedded in those devices.
There are home-based devices, too, that many people use every day: Amazon’s Echo, Google’s Nest and Home speakers, Apple’s HomePod Mini and others have enabled people to control various aspects of their home, including entertainment, lighting, thermostats and other smart devices, using simple voice commands.
This technology is useful in part because it’s been around for a couple of decades now, and a lot of research has gone into making it work as well as it does. Companies like Nuance made this technology popular in the early 2000s.
What Is Conversational AI?
The basic idea behind the technology is fairly straightforward. Let’s say I’m leaving on a trip to Chicago on Wednesday and I want to know what the weather will be like. I can say to my iPhone, “Hey Siri, what’s the weather forecast in Chicago for Wednesday?” Siri will take that question, convert it to text, then feed that text into a search engine and knowledge graph, retrieve the results, convert those results from text back to speech, and tell me Chicago’s expected weather for Wednesday.
Siri uses a combination of ASR (that is, speech-to-text), search, and text-to-speech technologies to answer my question. It feels somewhat conversational to the user, but simple requests like weather forecasts, stock quotes and sports scores (or home-based controls like “Alexa, turn on my bedroom lights.”) don’t require sophisticated technology in order to feel conversational.
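The chain of steps described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the three functions are hypothetical stubs (the "audio" is just UTF-8 text and the forecast table is made-up data), not the way Siri or any real assistant is implemented.

```python
# Made-up data for illustration only; a real system would query live services.
FORECASTS = {("chicago", "wednesday"): "sunny with a high of 72"}


def speech_to_text(audio: bytes) -> str:
    """Hypothetical ASR stub: pretend the audio is already UTF-8 text."""
    return audio.decode("utf-8")


def search(query: str) -> str:
    """Hypothetical search stub: match a city and a day in the query text."""
    text = query.lower()
    for (city, day), forecast in FORECASTS.items():
        if city in text and day in text:
            return f"{city.title()} on {day.title()}: {forecast}."
    return "Sorry, I could not find that."


def text_to_speech(text: str) -> bytes:
    """Hypothetical TTS stub: pretend UTF-8 bytes are synthesized audio."""
    return text.encode("utf-8")


def answer(audio: bytes) -> bytes:
    """Run the three stages in order: speech -> text -> results -> speech."""
    return text_to_speech(search(speech_to_text(audio)))


question = "Hey Siri, what's the weather forecast in Chicago for Wednesday?"
print(answer(question.encode("utf-8")).decode("utf-8"))
```

Each stage only hands text to the next one, which is why a simple request like this can feel conversational without any deeper understanding.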
The early conversational AI systems were rules-based, which is to say they were built to respond with reasonably straightforward answers to a relatively narrow (and often predictable) set of questions. Technically it’s a form of conversation, but it often doesn’t feel conversational once we get beyond the most basic questions.
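A minimal sketch of what "rules-based" can mean in practice, assuming a hypothetical rule table built from regular expressions (the patterns and replies here are invented for illustration and are not taken from any real product):

```python
import re

# Hypothetical rules: each pairs a pattern with a function that builds a reply.
RULES = [
    (
        re.compile(r"\bturn on (?:my |the )?(?P<room>\w+) lights?\b", re.IGNORECASE),
        lambda m: f"Turning on the {m.group('room')} lights.",
    ),
    (
        re.compile(r"\bconfirm (?:my )?reservation\b", re.IGNORECASE),
        lambda m: "Looking up your reservation.",
    ),
]

FALLBACK = "Sorry, I didn't understand that."


def respond(utterance: str) -> str:
    """Return the reply of the first rule that matches, else a fallback."""
    for pattern, build_reply in RULES:
        match = pattern.search(utterance)
        if match:
            return build_reply(match)
    return FALLBACK


print(respond("Alexa, turn on my bedroom lights."))
print(respond("Could you switch the bedroom lights on?"))
```

The second request means the same thing as the first, but because it is phrased differently it matches no rule and falls through to the fallback reply.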
The Challenges of Feeling Conversational Today
Most of us have had the experience of talking to a phone-based customer service system, trying to do something relatively simple like confirming an airline reservation or paying our electricity bill, and becoming frustrated because it felt like we had to phrase our question in a very specific way to get an answer of any kind from the system. We want to talk to a human if we can’t talk to an AI system that feels human to us.
That’s the challenge for conversational AI: when we ask it to do much more than answer basic prescribed questions, and when we expect it to become more conversational, it falls short.
There are several layers to unpack regarding a truly conversational AI system. First, the system should understand the words you are using by employing natural language processing (NLP). Given the many languages people speak, we need NLP systems that can understand the various languages, dialects, accents and idioms present in the world.
The next layer is one of user context. We need systems that are capable of understanding emotional state: is the user happy today? Are they frustrated? Are they angry? The system may need to interact differently with the user based on their emotional state.
Over time, the system will need to interpret a variety of other clues about its users: their personality, their tone of voice, their topic preferences and their points of view. Even certain socio-cultural aspects like level of education, gender, ethnicity or religious background might come into play and help make a conversational AI more relevant and useful. After all, conversation is about a lot more than just language.
As the systems get more complex, they will need to be able to carry on lengthier conversations. If I’m having an hour-long conversation with a virtual assistant, the system needs to understand the full history of that conversation: if in the forty-fifth minute I reference something I said in the third minute, the system will need to be capable of interpreting that.
Eventually, I will be able to expect the system to interpret and respond to today’s conversation accurately when I reference something we discussed last week or last month or last year. These are things that humans do naturally but are very challenging to do with computers.
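To make the history requirement concrete, here is a deliberately naive sketch, assuming a hypothetical ConversationMemory class that stores every turn and recalls the earlier turn sharing the most words with a new remark. Real systems would need far more capable matching; word overlap is used only to keep the example small.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    minute: int   # when the turn happened, in minutes from the start
    speaker: str
    text: str


@dataclass
class ConversationMemory:
    """Hypothetical store of everything said so far in a conversation."""
    turns: List[Turn] = field(default_factory=list)

    def add(self, minute: int, speaker: str, text: str) -> None:
        self.turns.append(Turn(minute, speaker, text))

    def recall(self, remark: str) -> Optional[Turn]:
        """Return the earlier turn with the largest word overlap, if any."""
        remark_words = set(remark.lower().split())
        best, best_score = None, 0
        for turn in self.turns:
            score = len(remark_words & set(turn.text.lower().split()))
            if score > best_score:
                best, best_score = turn, score
        return best


memory = ConversationMemory()
memory.add(3, "user", "my sister lives in Evanston")
memory.add(20, "user", "remind me to buy wine")
earlier = memory.recall("where does my sister live")
print(earlier.minute if earlier else None)
```

Extending the same idea to last week or last year mainly means keeping the stored turns across sessions, which is where storage and retrieval become the hard part.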
New Challenges as Conversational AI Gets More Advanced
As these systems become more advanced, they will become capable of responding appropriately not just with information, but with nonverbal cues such as matching tone of voice for the appropriate situation. For example, maybe the user is in an agitated state and is yelling at the AI.
The AI, with its more advanced persona, may find it best to respond in a very calm, polite way. In another situation it may find it appropriate to yell back, or to match the user’s tone in other ways: matching sadness with a sympathetic tone, or matching excitement with enthusiasm. These personas will enhance the user experience and make it feel much more conversational.
Another challenge is in cascading levels of information. For instance, Siri responds well if I ask, “Where is the nearest grocery store?” But what if I have a lot of cascading information in a request, like, “Hey Siri, where’s the best place on the way to my sister’s house to pick up a bottle of wine that pairs well with an Italian pasta dish?”
To interpret this question, the AI first must know who your sister is and where she lives. Then it needs to know the route between where you are and where she lives, and if there are any wine merchants along that route or near it.
Finally, it will need to know what types of wines to recommend. Even then, there are unanswered questions: when is this trip? What is the starting point? What kind of pasta is she serving? Do you want a red wine or a white? From what region? What’s your budget? What’s the occasion?
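One way to picture these unanswered questions is as empty slots the system still has to fill before it can answer. The sketch below assumes a hypothetical slot table; the slot names are invented for illustration, while the questions come from the example above.

```python
# Hypothetical slots for the wine request, each with its follow-up question.
REQUIRED_SLOTS = {
    "departure_time": "When is this trip?",
    "start": "What is the starting point?",
    "dish": "What kind of pasta is she serving?",
    "wine_color": "Do you want a red wine or a white?",
    "region": "From what region?",
    "budget": "What's your budget?",
    "occasion": "What's the occasion?",
}


def missing_questions(known: dict) -> list:
    """Return the follow-up questions for every slot not yet filled."""
    return [
        question
        for slot, question in REQUIRED_SLOTS.items()
        if slot not in known
    ]


# Example: the assistant already knows the starting point and the wine color.
known = {"start": "current location", "wine_color": "red"}
for question in missing_questions(known):
    print(question)
```

Each filled slot narrows the search, and each unfilled one either prompts a follow-up question or forces the system to guess.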
All of these questions and others will figure into your decision-making process. As you can see, a cascading information model becomes very compute-intensive: evaluating this kind of cascaded probability tree takes substantial compute resources, and those resources must be located somewhere. Individual devices have limited computing power, so much of that compute will need to be offloaded to the cloud.
For this kind of interaction to feel conversational, the compute will need to happen close to the user at the edge of the network, so latency can be kept to the lowest possible levels. If the gap in conversation is more than about 200 milliseconds, it won’t feel natural. You’ll feel like you need to repeat your question.
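The 200-millisecond figure above works as a budget that every stage of a response has to share. A rough sketch of the arithmetic follows; the per-stage timings are invented for illustration and are not measurements of any real system.

```python
BUDGET_MS = 200  # the conversational gap cited in the text


def total_latency_ms(stages: dict) -> float:
    """Add up the time spent in every stage of one response."""
    return sum(stages.values())


def within_budget(stages: dict, budget_ms: float = BUDGET_MS) -> bool:
    """True if the whole response fits inside the conversational gap."""
    return total_latency_ms(stages) <= budget_ms


# Illustrative numbers only: same processing, different network distance.
edge = {"network_round_trip": 20, "speech_to_text": 60,
        "query": 50, "text_to_speech": 50}
distant = dict(edge, network_round_trip=120)

print(total_latency_ms(edge), within_budget(edge))
print(total_latency_ms(distant), within_budget(distant))
```

With these assumed numbers the processing stages already consume 160 of the 200 milliseconds, so the network round trip is the only place left to save time, which is the argument for computing at the edge.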
Keeping latency at manageable, natural-feeling levels requires us to build AI systems with advanced cloud transport models.
As we mentioned, many systems are still rules-based. This needs to change, and it will, but it will require different approaches than we’ve seen in the past. A rules-based system can be designed almost entirely by an engineer. Conversational systems, in contrast, need sophisticated engineering. But it won’t end there. These systems need to bring sophisticated UI/UX into the design, to make it feel good for the user.
Beyond user experience and interface considerations, there are innovations happening in the processing of natural language queries, which are being referred to as transcoding-free. Transcoding, in simple terms, is the process of converting speech to text, processing the query, then converting the text of the response back to speech.
With transcoding-free conversational AI, much of that process happens more quickly and fluidly so the conversation can feel more natural. Effectively, instead of speech-to-text-to-speech, this technology will make it more like speech-to-speech. These technologies are in early prototype stages now, but we will see them roll out in the coming years.
But conversations between humans aren’t just about speech and hearing. Future systems will probably also include cameras and visual processing as a way of gathering additional contextual data that today’s audio-only systems don’t have. These systems will be able to interpret facial expressions, body language and other nonverbal cues to come up with better (and more human) responses.
The initial uses for this technology have been mostly customer support and sales. In the future, we’ll see other applications in areas like education; language learning seems to be a particularly interesting application.
Let’s say you want to learn to speak Italian. A virtual Italian tutor could help you, and actually simulate conversations as your skills develop, giving you coaching along the way on pronunciation, grammar, usage and so forth. This would go far beyond simple translation and begin to enter into the realm of using truly conversational AI to learn a language.
We’ll know this technology has reached its pinnacle when, during a conversation, you won’t be able to tell whether you’re talking to a human or to a machine.