Automatic Speech Recognition

Dateline: 04/06/97

SOME people think AI is all talk and no action.

Well, it ain’t so; but it is true that talk—or, to be more accurate, speech—is a key area of AI. Skim your eyes (or your text-to-speech reader, if you are blind) over the AI Resource List, and the words speech, language, grammar, and linguistics keep popping out at you, whether in the domain of commercial R&D, computer science, or cognitive science.

It’s hardly surprising. After all, speech (together with its complement, hearing) is the ultimate in advanced organic communication technology, unless you believe in mind reading, thought transference, telekinesis and other unproven modes of organic communication. To the individual human being (but not necessarily to our species as a whole in evolutionary survival terms), speech/hearing is more important for advanced communication than writing, reading, or even sight itself.

Dr. Raymond Kurzweil (scholar, philosopher, scientist, inventor, business mogul; in short, your everyday genius) blows the whistle on the old saw that "a picture is worth a thousand words" in a chapter he contributed to HAL’s Legacy (a book I discuss in an earlier feature). He writes: "Studies … have shown that groups of people can solve problems with dramatically greater speed if they can communicate verbally rather than being restricted to other methods" [emphasis added]. Kurzweil, in this instance, sounds a bit like a detergent salesman, but he knows whereof he speaks. It’s his business to know.

And why did we build computers in the first place? Why, to solve problems, of course! So, if we can teach machines to communicate with us through everyday human speech, we can at least expect to solve problems better and faster than before. (There'll be a great deal more to it than that, but that's a story for another day.)

Giving machines the capability to receive and transmit sounds is easy. Just about every PC sold today has a microphone for receiving sound and loudspeakers for transmitting it. The hard part is getting a machine to recognize and (at some level) understand the meaning in the cacophony of speech. This is where ASR—automatic speech recognition—technology comes in.

How Does It Work?

I'm not going to explain in detail how ASR systems do what they do. (Read HAL's Legacy if you really want to know. It has highly readable chapters by several experts in the field.) The more interesting question (to me, anyway) is: Does it work? We'll get to that in a moment, but first let's look at two key problems ASR developers are having to solve in order to make it work: Context and Cacophony.

Context is more of a cognitive than a technical issue. For an ASR system to work with us, it will first need to "know" as many words as we do. That seems trivial. Just enter a list of words into its memory. But words alone are just data. To understand the information conveyed by words, the ASR needs knowledge of the context within which they are used.

Kurzweil's VoiceMED ASR product has a built-in expert system that has knowledge about how to prepare medical reports. Provided the doctor dictating a report does not go off on tangents, VoiceMED has little difficulty understanding the doctor's dictation and preparing a good report.
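To make the context idea concrete, here is a minimal sketch of how knowledge about which words tend to follow which can settle a choice between sound-alike candidates. It is an illustration only, not a description of how VoiceMED or any other product works; the word-pair counts, the candidate lists, and the function names pick_word and decode are all invented for the example.

```python
from collections import Counter

# Hypothetical "domain knowledge": how often one word follows another
# in an imaginary pile of medical reports. The numbers are made up.
BIGRAM_COUNTS = Counter({
    ("<start>", "the"): 60,
    ("the", "patient"): 50,
    ("patient", "has"): 30,
    ("has", "a"): 40,
    ("a", "fever"): 20,
    ("a", "favor"): 1,
})


def pick_word(previous_word, candidates):
    """Return the candidate seen most often after previous_word."""
    return max(candidates, key=lambda word: BIGRAM_COUNTS[(previous_word, word)])


def decode(candidate_lists):
    """Choose one word per position, using the previous choice as context."""
    words = []
    previous = "<start>"
    for candidates in candidate_lists:
        choice = pick_word(previous, candidates)
        words.append(choice)
        previous = choice
    return words


# Each inner list holds words the recognizer could not tell apart by sound alone.
heard = [["the"], ["patience", "patient"], ["has"], ["a"], ["favor", "fever"]]
print(" ".join(decode(heard)))
```

With these invented counts the sketch settles on "patient" and "fever" because those are the words its toy medical context has seen before. Take the doctor off on a tangent about doing someone a favor, and the same context would steer it wrong.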

Cacophony is my own term for the problem ASR systems face in picking out the relevant sounds from an enormous variety of frequencies, tonalities, pitches, timbres and other physical components of speech (not to mention any extraneous background noise). While this has proved far from trivial, and has taken decades and many bright minds to resolve, it is essentially a technical issue (as compared with the more slippery cognitive issue of context), and good ASR systems today have the problem pretty well licked.

Holy Grail

According to Kurzweil, the Holy Grail of ASR researchers is a system that combines three attributes: (1) A large vocabulary; (2) Speaker independence; and (3) Continuous speech.

The large vocabulary is an obvious necessity. We use thousands of words in everyday speech, and combine those words to make tens of thousands of phrases (like "automatic speech recognition"). The ASR system—the machine—will need to have the same vocabulary we use. The current constraint on vocabulary size is the context issue discussed above.

Speaker independence (meaning that the machine can understand any individual, not just one individual) is vital in some, but not all, situations. A machine designed to serve more than one individual will need speaker independence, but a machine designed exclusively to serve you or me will not. The current constraint on achieving near-perfect speaker independence is the cacophony issue, but as noted, the boffins have a good grip on this one.

The ability to understand speech delivered in a continuous stream is also an obvious necessity, because we—don’t—talk—like—this, do—we? (It would make Kurzweil’s job so much easier if we did!) Both the context and cacophony issues are constraints on recognizing continuous speech.
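A rough feel for the continuous-speech problem can be had with plain text standing in for sound. The sketch below takes a run of letters with no gaps and tries to carve it into words from a known vocabulary. It is a toy under stated assumptions: real ASR systems work on audio, not letters, and the vocabulary and the function name segment are invented for the example.

```python
# Hypothetical vocabulary for the illustration.
VOCABULARY = {"we", "do", "not", "talk", "like", "this"}


def segment(stream, vocabulary=VOCABULARY):
    """Split an unbroken string into vocabulary words, or return None if impossible."""
    # best[i] holds one way to split stream[:i] into words, or None if there is none.
    best = [None] * (len(stream) + 1)
    best[0] = []
    for end in range(1, len(stream) + 1):
        for start in range(end):
            piece = stream[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[len(stream)]


print(segment("wedonottalklikethis"))
print(segment("wetalkfast"))
```

The first call recovers the six words; the second returns None because "fast" is not in the toy vocabulary, which is the large-vocabulary requirement showing up again in another guise.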

The fact is, no ASR system today has all three attributes, but that doesn’t mean we don’t know how to make one that does.

The main holdup is the (relatively paltry) processing power of our machines, which Moore’s Law tells us won’t be a holdup for very long. As of 1991, says Kurzweil, you could choose an ASR system that had any one of the three attributes. Advances in programming and processing power since then mean that today we can buy ASR systems that have any two of the three. Kurzweil reckons we’ll have systems with all three attributes on our desktops by early next year (1998).

However, that system will be restricted to understanding "business English." I imagine this as being like the computer on board the USS Enterprise (the Star Trek spaceship). You hear Captain Kirk saying such things as "Computer! What are the probabilities of finding a Romulan starship in this sector?" but not: "Hey, HAL; Mr. Spock just left me for another lifeform. Life really sucks sometimes, don’t it?"

For that level of conversational capability in an ASR system, my bet is you’ll have to wait another ten years or so.

Yes, Please! Beam Me One Up!

Sold? OK, here’s a table to point you in the direction of some specific ASR products you can download today. BE SAVVY about this. The table was prepared by Dragon Systems, the company whose ASR product appears (from the table) to be the leader, so it may or may not represent "the whole truth and nothing but the truth." I don’t know. What I do know is that:

(1) ASR technology is not standing still: Any one of the products could leapfrog its competitors any day. All the companies mentioned are highly respected and likely to be around to support and upgrade your system well into the future.

(2) Dragon Systems, Kurzweil Applied Intelligence, and IBM are not the only players in the ASR game. Check out the Resource Area for others, and please, please write to me if you know of companies that should be included.

(3) ASR is profoundly important. It is the next major development in the interface between Human and Machine; one that takes "user-friendly" to a new level of meaning. It will open Pandora's Box for the masses in a way that windows interfaces do not. I think Bill Gates knows this; Microsoft Research is no slouch in this area.

(4) ASR per se is not about deep understanding. It's about the much more limited task of recognizing words in context. But as the context issue and Kurzweil's incorporation of an expert system into an ASR system suggest, another major area of AI research—machine learning—will for certain be teamed with ASR to provide deeper understanding. We'll discuss machine learning in a future article; meantime you can read about three specific machine learning projects described in an earlier article.

The achievements of Dr. Kurzweil and his colleagues in the ASR field are astounding in and of themselves, but they are not finished yet.

Until next week,

NEXT WEEK: It's time to wax philosophical again. Recent publication of a book by MIT's Michael Dertouzos has set me to pondering once more on the implications of AI for our future.
