Voice Recognition Captioning
February 2003
Last year I wrote about my experience using voice recognition
technology to teach a computer class to people with hearing loss. It was
a surprisingly positive experience for all concerned, and I mentioned at
the time that I foresaw numerous applications for the technology.
I have just finished captioning an ALDA meeting using the same
technology. What's different from the computer class is that in this
case I was not the speaker. Rather than saying what I wanted to, my job
was to repeat what the speakers said into the voice recognition
software. This, in itself, is quite different from deciding what to say
and just saying it.
An additional difference from speaking my own words is that I had to
ensure that my repetition of the speaker's words wasn't distracting to
the speaker or to others in the audience.
The basic tools in either situation are a reasonably fast laptop
computer equipped with voice recognition software and a microphone, an
LCD projector, and a screen. I talk into the microphone, which feeds to
the voice recognition software via the sound card. The voice recognition
software converts the audio to text, which is output to the LCD
projector. The projector puts the text up on the screen.
For the computer class I used a headset with microphone. This was
appropriate equipment for that situation, because those students with
enough hearing to benefit from hearing my voice were able to do so.
For the ALDA meeting, however, I wanted to muffle my voice as much as
possible. To do so I used a stenomask from Talk Technologies (http://www.talk-tech.net/pages/sylencer.html).
The stenomask fits over the mouth and seals against the face to muffle
the voice. It's not 100% effective, so the audience can still hear
something, but it's far less distracting than speaking at normal volume
into a conventional microphone.
To train ViaVoice (the voice recognition software I'm using) for the
computer class, I spent about an hour reading the stories that ViaVoice
provides for software training. After that training, ViaVoice performed
with about 95% accuracy, provided I was careful to speak clearly and
distinctly. If I got lazy, the accuracy declined fast.
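One way to put a number on accuracy figures like these is to compare the text the software produced against what was actually said, word by word. The following is a minimal Python sketch of that idea, not how the percentages in this article were obtained (they read as informal estimates). The function name word_accuracy and the sample sentences are made up for illustration, and the score assumes the common "one minus word error rate" definition, where errors are the substituted, dropped, and inserted words.

```python
def word_accuracy(reference: str, recognized: str) -> float:
    """Return word-level accuracy (1 - word error rate), floored at 0.0.

    reference  -- what was actually said (hypothetical sample text)
    recognized -- what the recognition software produced
    """
    ref = reference.lower().split()
    hyp = recognized.lower().split()
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # Word-level edit distance, computed one row at a time.
    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j recognized words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word was dropped
            insertion = curr[j - 1] + 1  # extra word was recognized
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    errors = prev[-1]
    return max(0.0, 1.0 - errors / len(ref))


said = "the quick brown fox jumps over the lazy dog"
heard = "the quick brown fox jumped over the lazy dog"
print(f"{word_accuracy(said, heard):.0%}")  # one wrong word out of nine
```

Scoring a saved transcript against a corrected copy of the same passage this way would give a repeatable figure to track from one training session to the next.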
To use ViaVoice with the stenomask, I had to train a whole new model.
Just as the software must be trained for each person who uses it, it
must also be trained for each new hardware configuration. Changing a
sound card or microphone, or even the background noise, can necessitate
a new voice model.
So I trained ViaVoice with the stenomask for an hour using the
provided stories, after which the accuracy was probably only 75%. I was
disappointed at the performance, but not surprised, because the
stenomask really requires an acclimation period. Ensuring a tight seal
(to reduce escaping sound) requires that the mask be held firmly against
the face. This makes it hard to move the lips in a natural and
consistent manner, which certainly degrades the software performance.
The other problem is the restricted air movement that the mask
causes. One result is that breathing is different, and that takes some
getting used to. More closely related to the accuracy issue is the fact
that the sealed stenomask prevents normal exhalation as a person
speaks. The pressure builds up and makes vocalization difficult, which
affects how sounds are produced.
I continued training the stenomask voice model for another few hours,
but was unable to significantly improve the accuracy.
Hmmmmm... what to do?
After a while I realized what the problem was. When I first started
using the stenomask, I was not at all used to it, and my speech was not
at all natural. With additional training, I became more comfortable with
the equipment, and I was able to speak more naturally. But that speech
was very different from the speech with which the model had originally
been trained! The problem was that the original training was not
representative of my later speech, and no reasonable amount of
additional training could overcome the original corrupted training.
So I started over with the provided training stories, and was able to
get about 90% accuracy after the first hour. I attribute the slightly
degraded performance (compared to using the microphone) to the fact that
I'm still not entirely comfortable with the stenomask, so I don't speak
consistently.
So how did the meeting go?
Very well, actually! The system exceeded my expectations. I was
pretty much able to keep up with the speakers, and the accuracy was
high, as long as I was careful to speak clearly. But as before, the
first hint of lazy speech was brutally punished.
Oh, by the way, the reason I'm doing this is that our ALDA group
just lost the funding that paid for CART services. We're looking for new
funding, of course, but these are difficult times. It may be that I'll
be providing voice recognition captioning for quite some time.
And why am I telling you all this? It's not just because I like to
whine ;-} It's because voice recognition is a very real option for
organizations that can't afford traditional captioning. If your
organization can find a willing volunteer and can borrow an LCD
projector, it's very doable at a very reasonable cost.
And the quality? I'd say it was as good as some traditional CART
reporters I've seen. It's nowhere near the quality of the best CART
reporters - yet. But I saved the text and audio files from today's
meeting, and I'll use them to continue training the system. Between that
and more practice time for me, I wouldn't be surprised to be rivaling
the best CART reporters in a matter of months.
I'll be happy to do what I can to help anyone who wants to pursue
this. Just email me - larry@hearinglossweb.com