VUI

2 Apr 2000

Steve Bogart asked me what I think about an article in which Michael Schrage asserts that computers will never understand what we say with enough accuracy to make spoken natural language interfaces useful.

I agree with the basic premise: people generally don't communicate clearly enough for immediate action to be taken based on their utterances. Humans are brilliant at understanding other humans, and we still get it wrong. I think he's right to sneer at natural language interfaces for certain kinds of tasks, but there are two important points he fails to make. One is that there are other tasks where misunderstanding isn't a catastrophic failure; the other is that computers could conceivably someday understand our speech better than people do.

I haven't seen the hype he's referring to, so I don't know whether people are hyping natural speech or just spoken words. Unnatural speech can have great applications though, even with current technology. It's enough to say "Yo car! air conditioner! colder!" I don't think it's any better to say "Gosh, I think it's a bit warm in here. It sure would be nice if the air conditioner were on." A dumb speech UI is better than reaching over with a hand that might otherwise be steering the car, and it's probably better in some ways (though not all) than natural language.

What I like about unnatural language is that it's not only clear, it's explicit. Not only is it clear to the machine when it's being addressed, it's clear to me when I'm addressing it. As Schrage points out in the article, language isn't designed (or well suited) for command and control. The downside is that we have to learn the command set. With better programming, a system could be more flexible, but there will always be limitations unless it really understands everything we say, which is incredibly hard.
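
To make "a fixed command set" concrete, here's a minimal sketch of what a dumb, unnatural voice UI amounts to once a recognizer has turned speech into words. The Python, the Car class, and the command names are all my own hypothetical stand-ins, not anything from Schrage's article: just a wake phrase plus a lookup table, with no attempt to parse natural language.

    # A dumb, "unnatural" voice UI: wake phrase + fixed command table.
    # Assumes a speech recognizer has already produced a word string.
    # Everything here (Car, the command names) is a hypothetical example.

    class Car:
        def ac_colder(self): print("AC set colder")
        def ac_warmer(self): print("AC set warmer")

    COMMANDS = {
        "air conditioner colder": Car.ac_colder,
        "air conditioner warmer": Car.ac_warmer,
    }

    WAKE_WORDS = ("yo", "car")

    def handle(utterance, car):
        words = [w.strip("!,.").lower() for w in utterance.split()]
        if tuple(words[:2]) != WAKE_WORDS:
            return                      # not addressed to the car; ignore it
        action = COMMANDS.get(" ".join(words[2:]))
        if action:                      # only exact commands are understood
            action(car)

    handle("Yo car! air conditioner! colder!", Car())   # -> AC set colder

The point of the sketch is the limitation, not the code: anything outside the table is silently ignored, which is exactly the tradeoff of learning a command set instead of speaking naturally.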

For this reason, our hands will continue to be the best way to manipulate our environment for most tasks. That's what they're for, and we're good at using them that way. Twisting a knob is always clear, assuming the UI has been designed well enough that we twist the right knob in the right direction. (Sadly this often isn't true.)

There are plenty of applications where a voice UI dumber than a cockroach is better than a knob. The ones that spring to mind fall into three categories: we can't be bothered with a knob, there's no room for a knob, and there would be too many knobs. There are cases in all these classes where we'd be better off with an unnatural voice UI than a physical one.

But the article isn't about voice interfaces in general, it's about natural language interfaces. As with any UI widget, it's more important to know when to use it than how to make it work properly. A finely crafted knob is frustrating if a button is more appropriate. Schrage claims that natural language UI is doomed because it's not appropriate for some tasks, but I think there are other tasks where it will do well.

I agree with him that command and control, especially of physical devices, especially in realtime, is not a good application of a natural language interface, but talking about information is different in a few important ways. The cost of errors may be much less than crashing your car, the complexity of the system may make other interface options less appealing, and the computer could ask for immediate clarification when there's too much ambiguity.

Information systems are just a vague example of an application I expect to be complex. Interface design is hard, and a powerful system can easily give you too many options for knobs, menus, or keyword speech interfaces. Natural language could well be the easiest thing for the user. The tradeoff is misunderstanding, but there are cases where some amount of misunderstanding is worth the trouble. If I have to try two or three times to get the right graph in my report, that still might be easier than trying to explain with buttons and drop-down boxes exactly what relationships I'd like to visualize. As our tools allow more flexibility, the interfaces become more complex, the cost of learning them increases, and the cost of cleaning up after ambiguity shrinks in proportion. Natural language isn't a simple interface, but it's one we already know.

I don't know if we'll ever want to talk to our computers the way we talk to people. We use computers for lots of different tasks, and natural language isn't going to be the optimal UI for all those tasks, though it may be for some. The article does mention multi-modal UI, combining natural language with something else. It's not much of a risk to agree with that approach. Using different methods for different parts of a complex task is good design, though the more you mingle the modes, the harder it is to manage the integration.

Schrage also says "there are still managers and technologists who think that life would be so much better if only their machines understood what they were talking about. That's sad to the point of pathetic." I agree that it's an unreasonable expectation, but I think it's important to understand why, and I don't think the reason is obvious. The problem is that life is too short to say exactly what you mean. You have to depend on the listener's ability to select the most likely interpretation based on a huge array of shared context and expectations. It's your job as a speaker to know your audience, understand their interpretive biases, and say things that will be interpreted properly. There's no way around that. There's no magic fix. The best we can do is make artificial constructs that are clever listeners and that know your background well enough to make good guesses, so you don't have to work so hard to be understood.

On that note, I think computers do have the potential to understand us better than we understand each other. I'm going out on a limb here, because this is still far in the future and we don't understand ourselves well enough for anyone to make solid claims about it. This is all speculation on my part, but I have reasons behind my beliefs.

There are tremendous obstacles to making machines that can understand what we say, but if we do manage that, they'll have advantages we don't. It's easier to understand what someone says if you are familiar with the topic, are familiar with their background, are familiar with how they talk, have watched the same TV shows, have read the same books, and have the time and patience to consider all those things. It also helps if you're really listening, and not thinking about what you're going to say, or wondering whether your laundry is done. People can do all those things, and sometimes they do them well, but if we were to construct a machine that could do all those things all the time, it might do a better job than we're used to. All this is still far in the future, but I think we'll get there eventually.