Automatic speech recognition comes of age

Dateline: July 27, 1997

GREETINGS again, Lifeform! My apologies for my extended absence -- it took a week longer than I anticipated to organize my thoughts for this and the next few feature articles, but I hope you will like the results, starting with this article.

Believe it or not, you are not now reading something I have written. You are reading something I have spoken. I am using the latest in automatic speech recognition (ASR) software to dictate my thoughts; the latest being Dragon NaturallySpeaking from Dragon Systems. You may recall that in an earlier article on automatic speech recognition, I quoted Dr. Raymond Kurzweil, founder of another major player in the ASR industry, as saying that the Holy Grail of ASR was a system that would combine the three attributes of large vocabulary, speaker independence, and continuous speech. NaturallySpeaking gives us the large vocabulary -- 30,000 words in memory, plus 200,000 on disk -- and continuous speech, but it will only work well with a single user.

I ordered the program at 11:30 p.m. Eastern standard time on Thursday, from PC Zone in Seattle. It was delivered to my door at 10 a.m. Friday by Airborne Express. It cost $289. Other dealers had quoted as high as $349.

My computer is a Compaq Presario model 4160 with a Pentium 150 MHz processor and (until Friday at noon) 24 MB of RAM. The program needs at least 32 MB of RAM, so off I toddled to Best Buy for two 16 MB chips costing 89 dollars apiece, making me the inordinately proud owner of 56 MB of RAM.

Installation from compact disc was easy, but setup did report a problem with my sound card, which it said was of low quality. The card is an ESS (Ensoniq Sound Systems, I think it stands for) installed by Compaq, and while it works fine at playing sounds and for Internet telephony (I use TeleVox from Voxware -- works great) the ESS card apparently has low-quality microphone recording circuits. Well, having just spent the better part of 500 dollars I was not about to rush back to Best Buy to spend even more on a Sound Blaster.

Having installed everything, including a headset microphone that comes with the software, I was required to spend about half an hour training the program to recognize my speech patterns by reading from text on screen. Dragon gives you two choices of passage to read from: One is from Arthur C. Clarke's latest book and the other is from a Dave Barry book. Choosing Dave Barry turned out to be a mistake, because it was hard to suppress giggles and chortles at his wit and humor, and giggling tends to mess up the training. But I got through it in the end, and off we went into the real world.

I have now spent about 10 hours using the program, and there is no question that it is getting better all the time. But better is relative, and unless there is significant improvement in the next week this program is headed back to the good people at PC Zone (who upheld their part of the bargain, with fast and courteous service plus the best price available). The sad fact is, it is taking me ten to fifty times longer to get these words down using NaturallySpeaking than it would if I were to flex two fingers at the keyboard, because the system keeps making mistakes, and it takes time -- way too much time -- to correct them.

I am going to dictate the following two paragraphs without stopping to correct them, so you can see for yourself what I mean. The first paragraph is dictated from my earlier ASR article, and the second I will simply improvise. I will go through them afterwards and highlight the errors:

Cacophony is my own some the, [term to cover] the problem automatic speech recognition systems must resolve of sorting out the relevant sounds from an enormous variety of frequencies, tonalities, pitches, Tom Bruce [timbres] and other physical components of speech (not to mention extraneous background noises that might be present). While this has proved to be far from trivial and has taken decades and many bright minds [to] resolve, it is essentially a technical is you [issue] (as compared with the more slippery cognitive is few [issue] of context), and code [good] automatic speech recognition systems today have the problem pretty well licked.

As you can see, NaturallySpeaking is indeed impressive in what it can achieve, but not [there are] just too many mistakes -- and [the] mistakes themselves are more confusing and hard to find and fix then [than] the simple type owes [typos] one makes from the keyboard.
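For readers who like a number to go with "too many mistakes," here is a rough sketch of how the improvised paragraph above could be scored using word error rate, a common yardstick for speech recognizers. It is not anything Dragon reports. The DICTATED and INTENDED strings are my reading of the paragraph and its bracketed corrections, and the helper names (words, word_edit_distance, word_error_rate) are made up for this illustration.

```python
import re

# What the program wrote, taken from the improvised paragraph above
# (bracketed corrections removed). This transcription is an assumption.
DICTATED = (
    "As you can see, NaturallySpeaking is indeed impressive in what it can "
    "achieve, but not just too many mistakes -- and mistakes themselves are "
    "more confusing and hard to find and fix then the simple type owes one "
    "makes from the keyboard."
)

# What was meant, reconstructed from the bracketed corrections (also an assumption).
INTENDED = (
    "As you can see, NaturallySpeaking is indeed impressive in what it can "
    "achieve, but there are just too many mistakes -- and the mistakes "
    "themselves are more confusing and hard to find and fix than the simple "
    "typos one makes from the keyboard."
)


def words(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance: substitutions, deletions, insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hyp
                            curr[j - 1] + 1,      # extra word in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word errors divided by the number of words in the reference."""
    ref = words(reference)
    return word_edit_distance(ref, words(hypothesis)) / len(ref)


print(f"Word error rate: {word_error_rate(INTENDED, DICTATED):.1%}")
```

On this reading the paragraph has 6 word errors against 42 intended words, or roughly one word in seven. A word that splits in two ("type owes" for "typos") counts as two errors, which matches how tedious such mistakes are to fix.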

Dragon Systems was the first to come out with a system this sophisticated, but others are breathing down its neck. IBM will introduce its automatic speech recognition program, ViaVoice, in August. It, too, will require reams of RAM and a processor even more powerful than my 150 MHz Pentium, and it will probably also require a better quality sound card than I apparently got from Compaq.

Despite my disappointment at the usability of NaturallySpeaking today, I am confident that within one year enhancements to the software and the much more powerful computers we can expect to be buying then will finally put automatic speech recognition on virtually everybody's desktop. As of right now, NaturallySpeaking would be a boon to anyone who cannot type at all, not even with two fingers.

Until next week,

NEXT WEEK: More AI history. In a previous article, we looked at some of the pre-history of AI. For the next series of articles, we will examine the more recent history of AI from its inception as a discipline in the early 1950s, courtesy of Daniel Crevier's excellent review and analysis in his book..
