ALDA Speech Recognition Panel – Part 2

Here’s Part One!

Not all speech recognition tasks are equal; some situations are much easier than others. Early speech recognition systems, dating from the 1970s, depended on favorable conditions: a small vocabulary, discrete speech, known subject material, a single speaker who spent weeks training the system to match voice patterns, and no background noise. Even then, they attained only about 90% accuracy. The ideal system would function under difficult conditions, namely a large vocabulary, continuous natural speech, random subject material, many speakers with little or no training on the system, and heavy background noise, and still attain 99% accuracy.

The technology is still years away from handling a “hard” speech recognition task. For example, a person cannot just walk into a noisy party, point a microphone at someone, and read his or her speech on a screen. That kind of technology is not here yet, although speech recognition has advanced to the point where it is useful in many situations. After a speech recognition system is properly trained for a specific user, it performs about as well as a good typist. This means it could be used by deaf people in some interpreting situations, such as a business meeting or a classroom. The general state of ASR technology today: large vocabulary, continuous speech, random subject material, one speaker after a few hours of training on the system, limited background noise, and 95% accuracy.
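To put the accuracy figures above in perspective, here is a small arithmetic sketch. The 150-words-per-minute conversational rate is an assumption for illustration, not a figure from the panel; the point is simply how quickly small accuracy differences add up for a reader.

```python
# Illustrative arithmetic: how word accuracy translates into expected
# misrecognized words per minute of conversation.
# The 150 wpm speaking rate is an assumed, typical conversational pace.

def errors_per_minute(accuracy, words_per_minute=150):
    """Expected number of misrecognized words per minute at a given accuracy."""
    return (1 - accuracy) * words_per_minute

for acc in (0.90, 0.95, 0.99):
    print(f"{acc:.0%} accuracy -> {errors_per_minute(acc):.1f} errors/minute")
```

Even at 95% accuracy, a reader following the transcript contends with several errors every minute, which is why the gap between 95% and the ideal 99% matters so much in practice.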

Accuracy depends on the speaker and on the speech recognition system used. Some people have clearer speech, or at least speech that the computer finds easier to analyze. A speech recognition system may have difficulty understanding a person who is under stress or who has a cold. Such systems have a very hard time with deaf speech, which usually does not conform to the speech patterns the computer expects. The presenters favor Dragon NaturallySpeaking, which seems to have slightly better accuracy than other products currently on the market.

Dr. Ross Stuckless points out that IBM has worked closely with the National Technical Institute for the Deaf at Rochester Institute of Technology (NTID/RIT) on ViaVoice and is more familiar with the needs of users who are deaf and hard of hearing. He also demonstrated his DragonDictate speech recognition software. Despite logging 30 hours of training on his ASR system, he does not believe its claimed 98% accuracy rate; in his experience, regular practical use produces more errors, though the technology is improving. Newer ASR systems deliver better performance with less training time and larger vocabularies.

In ASR systems, the speaker must speak the punctuation marks aloud, and the system does not distinguish one speaker from another. Here is how a phone conversation looks through ASR when no punctuation is voiced: would you pick up some chinese on your way home from work tonight I dont have time to make dinner sure if you don’t mind eating late I have a lot of work on my desk and probably wont be home until after eight thats ok what do you want me to pick up you know the usual remember as for no fat no salt yuk

At Gallaudet University, Dr. Judy Harkins, Director of the Technology Assessment Program (TAP), researched the viability of ASR systems as a communication aid for people who are deaf and hard of hearing. The research questions were: How successful and efficient is ASR as a communication aid compared to CART or lipreading? What can be done to improve its effectiveness for conversation? What happens over time, with practice?

There were two single-subject studies. Deaf participants had severe to profound hearing loss of progressive or sudden adult onset and had good oral and English literacy skills. Hearing participants had previous experience communicating with deaf and hard of hearing people.

To obtain predictable results, TAP used the Map Task, in which two people have maps of the same place that differ in detail. One map has a route drawn on it; the other is blank. One person instructs the other on how to draw the route on the map. In the Map Task, the words used are much the same and vary little between speakers: one may say, “Turn right on Main Street,” while another says, “Go to Main Street, turn right.”

In the first phase, the experimental condition was face-to-face communication only. In the second phase, participants used face-to-face plus CART or face-to-face plus ASR. In the final phase, it was CART only or ASR only.

From the pilot test, TAP researchers came up with some preliminary findings. Practice is crucial to success. The Map Task took the same amount of time to complete in the face-to-face and ASR-alone conditions. ASR alone used fewer words, and there were fewer requests for clarification. It sometimes helps if the speaker periodically hits the key two or three times to help the reader stay in place and to clear away sentences that have lots of errors. “Saying” the punctuation in ASR can be either helpful or distracting, depending on the situation. The speaker may need to maintain eye contact with the screen to help performance, although this is less natural than looking at the conversational partner.

In comparison, real-time steno-captioning, or computer-assisted real-time transcription (CART), comes out ahead of speech recognition as a conversational aid. Both real-time captioning and CART use a skilled stenographer with a special keyboard to type phonetic symbols that are translated into a readable transcription of the dialogue. Multiple speakers pose no problem for CART, unlike ASR, where the program must unload the first speaker from memory and then load the next. Mistranslated words can also be corrected more quickly on a CART system than on an ASR system.

Today’s ASR systems don’t seem to work well over the telephone. However, Ultratec and Sprint are now conducting joint trials using ASR in an effort to boost the speed of relay conversations. Instead of using ASR to replace the relay, the communication assistant (CA) repeats the hearing party’s message into a computer that has been trained to recognize the CA’s voice. If the trial is successful, this will cut down on typing keystrokes and requests for clarification, thereby reducing the error rate on relay calls.

Despite its drawbacks, ASR has proved popular. Computer users with physical disabilities use it all the time; some computers are linked to a house’s heating and cooling system so that a quadriplegic can adjust the thermostat by voice. Students at NTID/RIT have requested ASR training so they can voice their homework. Industry insiders predict that most computer users will soon choose ASR over typing on a keyboard to write letters. In his latest book, “The Age of Spiritual Machines: When Computers Exceed Human Intelligence,” Ray Kurzweil predicts that palm-sized computers with ASR technology will be in wide use within ten years.

For further information on ASR, here are some other links:

(This site has many technical links but not all of them.)

(Lexington’s RERC on Hearing Enhancement)

(Gallaudet Technology Access Project)

(First deaf female engineer in study on human dynamics at NTID/RIT)

(University of Southern California)

(MIT’s Oxygen Project)