Speech Recognition Status Report
May 2003
Editor: I've been singing the praises of Speech Recognition for a
couple of years, ever since I used it in place of real-time captioning
to teach a computer class for people with hearing loss. That particular
application was trained to my voice, and I occasionally had to repeat
myself to get the software to "understand" what I was saying. But it was
very workable. The newer version that I got a few months ago is
noticeably better. And I expect the next version will be significantly
better still.
At the other end of the Speech Recognition spectrum from the simple
program I use are the ones that interact with thousands of customers
without training on their specific voices. Those programs currently work
only in situations where the number of possible responses is very
limited, but it may not be too long before they are able to work in more
general circumstances.
Michael Phillips is the Chief Technology Officer of SpeechWorks, a
company that produces these commercial systems. He was recently
interviewed by MIT's "Technology Review" (TR). Here are
excerpts from that interview. The complete text is available at http://www.technologyreview.com/articles/focuson0603_speech.asp?p=0
~~~~~~~~~~~~~~~~~~~
Michael Phillips is Chief Technology Officer for SpeechWorks. He
spoke with Technology Review Senior Editor Wade Roush about his
company's interactive voice-response technology, which automates the
handling of customer calls at companies like United Airlines and Federal
Express. With a father's pride, Phillips introduced Tom, a jaunty voice
with an American accent who is one of SpeechWorks' synthesized-speech
"personas." (Tom's colleagues Helen and Karen sound like real
women from Britain and Australia, respectively, and personas speaking in
many other languages and accents are available.)
Phillips, who co-founded SpeechWorks in 1994 to commercialize
language-processing software he had helped to build at MIT's Laboratory
for Computer Science, talked about the company's plans for making such
speech-driven interfaces the dominant way we interact with computers.
[snip]
TR: Is the technology really getting that good that fast?
PHILLIPS: The technology is improving rapidly. We're sort of in the
Moore's Law of speech recognition. We cut error rates in the speech
recognizer by 20 or 30 percent every year.
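[Editor's note: To get a feel for what a 20 or 30 percent cut every year adds up to, here is a rough sketch in Python. It assumes the same relative cut is applied every year for the nine years from 1994 to 2003, which is my simplification and not something Phillips claims. The starting error rate of 20 percent is a made-up number for illustration only. Under those assumptions, roughly 13 percent of the original errors would remain at a 20 percent yearly cut, and roughly 4 percent at a 30 percent yearly cut.]

```python
def error_rate_after(initial_rate, yearly_cut, years):
    """Return the error rate after applying the same relative cut each year.

    initial_rate: starting error rate as a fraction (0.20 means 20 percent)
    yearly_cut:   relative reduction per year (0.20 means errors drop by 20 percent)
    years:        number of years the cut is applied
    """
    return initial_rate * (1.0 - yearly_cut) ** years


# Hypothetical starting point: the interview gives no actual 1994 error rate.
START_RATE = 0.20
# 1994 (company founded) to 2003 (date of this report).
YEARS = 9

for cut in (0.20, 0.30):
    remaining = error_rate_after(1.0, cut, YEARS)   # fraction of original errors left
    rate = error_rate_after(START_RATE, cut, YEARS)  # error rate from the assumed start
    print(f"{cut:.0%} cut per year: {remaining:.1%} of the original errors remain, "
          f"error rate {rate:.2%}")
```

The fraction of errors remaining does not depend on the assumed starting rate, so only the last figure on each printed line would change with a different starting point.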
TR: What would you say the rough error rate was when you started in
1994, and what is it now?
PHILLIPS: It depends on the task. And as the speech recognizer gets
better and better, we do more and larger and more complex tasks. So a
better measure than just what is the accuracy on a fixed task, is what
kinds of tasks you can get acceptable accuracy on. When we first
started, it was basically a few hundred-word vocabulary. Things like
getting a phone number from a user, or even getting a city name were
possible, but stretching it. Since then we've deployed stock trading and
stock quote systems that have 50,000- to 100,000-word vocabularies. Most
of the applications we have exploited are not constrained by the quality
of the speech recognition so much as by the user interface. We are doing
very sophisticated things like entering any street address in the
country, entering any name you have, even something like getting an
e-mail address from somebody over the telephone.
[snip]