Speech Recognition Status Report
Editor: I’ve been singing the praises of Speech Recognition for a couple of years, ever since I used it in place of real-time captioning to teach a computer class for people with hearing loss. That particular application was trained to my voice, and I occasionally had to repeat myself to get the software to “understand” what I was saying. But it was very workable. The newer version that I got a few months ago is noticeably better. And I expect the next version will be significantly better still.
At the other end of the Speech Recognition spectrum from the simple program I use are the ones that interact with thousands of customers without training on their specific voices. Those programs currently only work in situations where the number of possible responses is very limited, but it may not be too long before they are able to work in more general circumstances.
Michael Phillips is the Chief Technology Officer of SpeechWorks, a company that produces these commercial systems. He was recently interviewed by MIT’s “Technology Review” (TR). Here are excerpts from that interview. The complete text is available at http://www.technologyreview.com/articles/focuson0603_speech.asp?p=0
Michael Phillips is Chief Technology Officer for SpeechWorks. He spoke with Technology Review Senior Editor Wade Roush about his company’s interactive voice-response technology, which automates the handling of customer calls at companies like United Airlines and Federal Express. With a father’s pride, Phillips introduced Tom, a jaunty voice with an American accent who is one of SpeechWorks’ synthesized-speech “personas.” (Tom’s colleagues Helen and Karen sound like real women from Britain and Australia, respectively, and personas speaking in many other languages and accents are available.)
Phillips, who co-founded SpeechWorks in 1994 to commercialize language-processing software he had helped to build at MIT’s Laboratory for Computer Science, talked about the company’s plans for making such speech-driven interfaces the dominant way we interact with computers. [snip]
TR: Is the technology really getting that good that fast?
PHILLIPS: The technology is improving rapidly. We’re sort of in the Moore’s Law of speech recognition. We cut error rates in the speech recognizer by 20 or 30 percent every year.
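Editor: To get a feel for what a 20 or 30 percent cut every year adds up to, here is a minimal sketch of the compounding arithmetic. It is not from the interview. The function name, the assumption that the same percentage cut applies every year, and the sample year counts are all mine, chosen only for illustration. The result is a fraction of whatever the starting error rate was, since Phillips gives no absolute figure.

```python
def remaining_error_fraction(annual_reduction: float, years: int) -> float:
    """Return the fraction of the starting error rate left after `years`
    of cutting the error rate by `annual_reduction` each year.

    Assumes (hypothetically) that the same relative cut applies every year.
    """
    if not 0.0 <= annual_reduction < 1.0:
        raise ValueError("annual_reduction must be in the range [0, 1)")
    if years < 0:
        raise ValueError("years must be non-negative")
    # Each year keeps (1 - annual_reduction) of the previous year's errors.
    return (1.0 - annual_reduction) ** years


if __name__ == "__main__":
    # Illustrative year counts only; they are not taken from the interview.
    for reduction in (0.20, 0.30):
        for years in (1, 5, 9):
            left = remaining_error_fraction(reduction, years)
            print(f"{reduction:.0%} cut per year, {years} years: "
                  f"{left:.1%} of the original error rate remains")
```

Because the cuts compound, the error rate falls geometrically rather than in a straight line: a 20 percent yearly cut does not reach zero after five years, it leaves 0.8 multiplied by itself five times.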
TR: What would you say the rough error rate was when you started in 1994, and what is it now?
PHILLIPS: It depends on the task. And as the speech recognizer gets better and better, we do more and larger and more complex tasks. So a better measure than just what is the accuracy on a fixed task is what kinds of tasks you can get acceptable accuracy on. When we first started, it was basically a few-hundred-word vocabulary. Things like getting a phone number from a user, or even getting a city name, were possible, but stretching it. Since then we’ve deployed stock trading and stock quote systems that have 50,000- to 100,000-word vocabularies. Most of the applications we have exploited are not constrained by the quality of the speech recognition so much as by the user interface. We are doing very sophisticated things like entering any street address in the country, entering any name you have, even something like getting an e-mail address from somebody over the telephone.