ALDA Speech Recognition Panel - Part 2
Not all speech recognition tasks are equal; some situations are much
easier than others. Early speech recognition systems, from the 1970s
on, depended on favorable conditions: a small vocabulary, discrete
speech, known subject material, one speaker who spent weeks training
the system to match voice patterns, and no background noise. Under
those conditions they could attain 90% accuracy. The ideal system would
function under difficult conditions: a large vocabulary, continuous
natural speech, random subject material, many speakers who have little
or no training on the system, and much background noise, and yet still
attain 99% accuracy.
The technology is still years away from being able to handle a
"hard" speech recognition task. For example, a person cannot
just walk into a noisy party, point a microphone at someone and then
read his or her speech on a screen. That kind of technology is not here
yet, although speech recognition has advanced to the point where it can
be useful in many situations. After a speech recognition system is
properly trained for a specific user, it will perform about as well as a
good typist. This means it could be used by deaf people in some
interpreting situations, such as in a business meeting or in a
classroom. The general state of automatic speech recognition (ASR)
technology today is a large vocabulary, continuous speech, random
subject material, one speaker after a few hours of training on the
system, limited background noise and 95% accuracy.
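To make figures such as 90%, 95% and 99% concrete, here is a minimal sketch of one common way to score a transcript. The panel did not say how its accuracy figures were measured, so this is an assumption: it counts word-level substitutions, insertions and deletions against a correct reference and reports one minus the error rate. The function names and the sample sentences are made up for illustration.

```python
def word_errors(reference, hypothesis):
    """Fewest word substitutions, insertions and deletions needed to
    turn the recognized text into the reference (word edit distance)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j]: distance between the reference words seen so far
    # (one fewer than the current row) and the first j recognized words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # word dropped
                               current[j - 1] + 1,       # extra word
                               previous[j - 1] + cost))  # match or wrong word
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Assumed definition: 1 minus (word errors / reference length)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - word_errors(reference, hypothesis) / ref_len


# Hypothetical 20-word utterance with one misrecognized word.
spoken = ("go to main street and turn right then walk two blocks "
          "past the old mill until you reach the lake")
recognized = spoken.replace("mill", "mail")
print(word_errors(spoken, recognized), word_accuracy(spoken, recognized))
```

By this measure, one wrong word in twenty is 95% accuracy, so a reader can still expect an error every sentence or two at that level.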
Accuracy depends on the speaker and the speech recognition system
used. Some people have clearer speech, or at least speech that the
computer finds easier to analyze. A speech recognition system may have
difficulty understanding a person who is under stress or who has a cold.
Speech recognition systems have a very hard time understanding deaf
speech because deaf speech usually does not conform to the usual speech
patterns the computer is expecting. The presenters favor Dragon
NaturallySpeaking, as it seems to have slightly better accuracy than
other products currently on the market.
Dr. Ross Stuckless points out that IBM has worked closely with the
National Technical Institute for the Deaf at Rochester Institute of
Technology (NTID/RIT) using ViaVoice and is more familiar with the needs
of users who are deaf and hard of hearing. He also demonstrated his
DragonDictate speech recognition software. Despite logging 30 hours of
training on his ASR system, he does not believe its claimed 98% accuracy
rate. He feels that regular practical use produces more errors than
that, although ASR technology is improving. Newer ASR systems perform
better with less training time and a larger vocabulary.
In ASR systems, the speaker needs to voice the punctuation marks, and
the system does not distinguish one speaker from another. An example of
a phone conversation on ASR, with no voiced punctuation:
would you pick up some chinese on your way home from work tonight I dont
have time to make dinner sure if you don't mind eating late I have a lot
of work on my desk and probably wont be home until after eight thats ok
what do you want me to pick up you know the usual remember as for no fat
no salt yuk
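As a rough illustration of what "voicing" punctuation involves, the sketch below turns spoken command words into marks. The command list and the function name are hypothetical; each dictation product defines its own commands, and this is not how any particular product works.

```python
# Hypothetical spoken commands; real products define their own lists.
SPOKEN_PUNCTUATION = {
    "question mark": "?",
    "exclamation point": "!",
    "period": ".",
    "comma": ",",
}


def apply_spoken_punctuation(text):
    """Replace spoken punctuation commands with the marks they name."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # Try two-word commands before one-word commands.
        for size in (2, 1):
            chunk = words[i:i + size]
            phrase = " ".join(chunk).lower()
            if len(chunk) == size and phrase in SPOKEN_PUNCTUATION:
                mark = SPOKEN_PUNCTUATION[phrase]
                if out:
                    out[-1] += mark   # attach the mark to the previous word
                else:
                    out.append(mark)
                i += size
                break
        else:
            # No command matched here, so keep the word as spoken.
            out.append(words[i])
            i += 1
    return " ".join(out)


print(apply_spoken_punctuation(
    "would you pick up some chinese on your way home question mark "
    "sure comma if you don't mind eating late period"))
```

A sketch like this also converts a command word used in its ordinary sense, such as "period" in "a long period of time", and it does nothing to mark where one speaker stops and the next begins.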
At Gallaudet University, Dr. Judy Harkins, Director of the Technology
Assessment Program (TAP), researched the viability of ASR systems as a
communication aid for people who are deaf and hard of hearing. The
research questions were: How successful and efficient is ASR as a
communication aid compared to CART or lipreading? What can be done to
improve the effectiveness for conversation? What happens over time, with
practice?
There were two single-subject studies. Deaf participants had severe
to profound hearing loss resulting from progressive or sudden adult
onset and had good oral and English literacy skills. Hearing
participants had previous experience communicating with deaf and hard
of hearing people.
In order to have some predictable results, TAP used the Map Task,
in which two people have maps of the same place that differ in detail.
One map has a route drawn on it; the other has no route. One person
gives the other instructions on how to draw the route on the map. In the
Map Task, the words used are pretty much the same and vary little
between different speakers. One may say, "Turn right on Main Street"
and the other may say, "Go to Main Street, turn right."
In the first phase, the experimental condition was face to face only.
In the second phase, participants used face to face and CART or
face to face and ASR. In the final phase, they used CART only or ASR
only.
From the pilot test, TAP researchers came up with some preliminary
findings. Practice is crucial to success. It took the same amount of
time to complete the Map Task in the face-to-face and ASR-alone
conditions. The ASR-alone condition used fewer words and produced fewer
requests for clarification. It sometimes helps if the speaker hits the
key two or three times periodically to help the reader stay in place and
get rid of sentences that have lots of errors. "Saying" the punctuation
in ASR can be either helpful or distracting, depending on the situation.
The speaker may need to maintain eye contact with the screen to help
performance, although this is less natural than looking at the
conversational partner.
In comparison, real-time steno-captioning or computer assisted
real-time transcription (CART) comes out ahead of speech recognition as
a conversational aid. Both real-time captioning and CART use a skilled
stenographer who uses a special keyboard to type in phonetic symbols
that translate into a readable transcription of the dialogue. Having
multiple speakers is not a problem with CART. With ASR, by contrast, the
program must unload the first speaker from memory and then load the next
speaker. Mistranslated words can be corrected more quickly on a CART
system than on an ASR system.
Today's ASR systems don't seem to work well over the telephone.
However, Ultratec and Sprint are now conducting joint trials using ASR
in an effort to boost the speed of relay conversations. Instead of using
ASR to replace the relay, the communications assistant (CA) repeats the
hearing party's message into a computer that has been trained to
recognize the CA's voice. If the trial is successful, this will cut down
on keystrokes and requests for clarification, thereby reducing the error
rate on relay calls.
Despite its drawbacks, ASR has proved popular.
Computer users with physical disabilities use it all the time. Some
computers are linked to the house's heating and cooling system so a
quadriplegic can adjust the thermostat by voice. Students at NTID/RIT
have requested ASR training so they can voice their homework. Industry
insiders predict that most computer users will soon select ASR over
typing on a keyboard to write letters. In his latest book "The Age
of Spiritual Machines: When Computers Exceed Human Intelligence,"
Ray Kurzweil predicts that palm-sized computers with ASR technology will
be in wide use within ten years.
For further information on ASR, here are some other links:
http://www.speechxp.com/commercial/speech.htm
(This site has many technical links but not all of them.)
http://www.hearingresearch.org
(Lexington's RERC on Hearing Enhancement)
http://tap.gallaudet.edu
(Gallaudet Technology Access Project)
http://www.rit.edu/~Klweie/asr.htm
(First deaf female engineer in study on human dynamics at NTID/RIT)
http://www.wired.com/news/email/member/technology/story/22048.html
(University of Southern California)
http://www.scientificamerican.com/1999/0899issue/0899quicksummary.html
(MIT's Oxygen Project)