One of the dreams for solving the captioning backlog is to rely on speech recognition. I do have to say that speech recognition is far more effective at times than I would have dreamed, but my intuition has still told me it's not entirely working. A fascinating article from Robert Fortner, "The Unrecognized Death of Speech Recognition", essentially backs up that intuition with some hard numbers. He notes that accuracy has not improved much since the early 2000s and that in most cases the error rate is not within human tolerance (humans apparently have about a 2% error rate, and even that can lead to some pretty ridiculous arguments).
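To make "error rate" concrete: accuracy for transcripts is commonly reported as word error rate, the number of word substitutions, insertions, and deletions needed to turn the recognizer's output into the correct transcript, divided by the length of the correct transcript. The sketch below is only an illustration of that idea, not the method behind the figures in Fortner's article; the function name and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word was dropped (deletion)
                curr[j - 1] + 1,     # extra word was added (insertion)
                prev[j - 1] + cost,  # words match, or one replaced the other
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: what was said versus what the recognizer produced.
spoken = "it is hard to recognize speech"
recognized = "it is hard to wreck a nice beach"
wer = word_error_rate(spoken, recognized)
print(f"word error rate: {wer:.0%}")
```

By this measure, a 2% error rate means roughly one wrong word in every fifty.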
When Speech Recognition Works
Speech recognition can be effective in two situations:
- Specific context (airport kiosk, limited menu commands): even here, though, it should be noted that it's pretty darn easy to frustrate the average health insurance voice recognition system to the point that it gives up.
- Specific speaker: speech recognition is effective when trained on a single voice, and the training time is shorter than it used to be. For captioning purposes, this means that if a single speaker makes the original audio (e.g. a faculty lecture) or someone else (the captioner) repeats what's on the audio, speech recognition is pretty effective.
By the way, in the recent Second Accessibility Summit, Glenda Sims noted that correcting an inaccurate transcript is more difficult than starting from scratch.
What Speech Recognition Is
To understand why speech recognition isn't improving, you should consider the task it's trying to perform. When a human ear listens to language, it seems to hear a stream of separate sounds and groups those into words and sentences. The reality is that speech is a continuous sound wave with very subtle acoustic transitions between different sounds (see images below; the bottom ones are the spectrograms that phoneticians use). Your ears and brain are doing a lot of processing to help you understand what that person just said.
Your brain not only breaks up sound waves, it also accounts for the acoustics of different genders and different regional accents, filters out different types of background noise, and probably does some "smart guessing" about what a word is as well (which doesn't always work). It's no wonder that replicating the functionality of this mechanism is taking time.
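For readers curious about what a spectrogram actually is, here is a rough sketch of how one can be computed: cut the wave into short overlapping frames and measure how much energy each frame has at each frequency. Everything in it is an assumption chosen for illustration (the 8000 Hz sample rate, the frame size, and a synthetic signal made of two pure tones standing in for two speech sounds); real speech tools use far more refined versions of the same idea.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this example)


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per overlapping frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Taper the frame edges with a Hann window to reduce smearing.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        # Naive discrete Fourier transform, keeping bins up to half the sample rate.
        for k in range(frame_size // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


def dominant_bin(magnitudes):
    """Index of the frequency bin with the most energy in one frame."""
    return max(range(len(magnitudes)), key=magnitudes.__getitem__)


# Synthetic stand-in for two adjacent speech sounds: a 500 Hz tone
# followed immediately by a 1500 Hz tone, with no silence in between.
samples = [math.sin(2 * math.pi * 500 * n / SAMPLE_RATE) for n in range(256)]
samples += [math.sin(2 * math.pi * 1500 * n / SAMPLE_RATE) for n in range(256, 512)]

frames = spectrogram(samples)
bin_width_hz = SAMPLE_RATE / 64  # width of each frequency bin for frame_size=64
for index, magnitudes in enumerate(frames):
    print(index, dominant_bin(magnitudes) * bin_width_hz, "Hz")
```

The frame that straddles the switch between the two tones contains energy from both, so even in this artificially clean signal there is no gap marking where one "sound" ends and the next begins. Real speech is much messier than that.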
Ignoring the Linguists
There's one factor that Robert Fortner points to: speech specialists are not always involved. As one IBM researcher claimed, "Every time I fire a linguist my system improves"…but apparently there is an upper limit to how far this approach can go without more linguistic information. Maybe it's time to start rethinking the problem and asking whether the programming team might need some outside experts.