One of the challenges of video captioning is that it relies on human intervention to achieve the most accurate results. That’s because speech recognition is only reliable in certain circumstances, usually when the speaker has set up a profile on a speech recognition engine such as Dragon (this could include instructors, BTW).
To achieve the best transcription in other circumstances, though (and human listeners require 96-98% accuracy), you usually need a person to do one of the following:
- Watch and transcribe a video
- Watch a video and correct speech recognition errors (e.g. “Rest in Peas” for “Rest in Peace”)
- Have a videographer watch the video and repeat its words into her or his own trained speech recognition system
Note that all of the above assume that someone is spending time re-watching the video. Ugh!
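To make the 96-98% figure a bit more concrete, here is a minimal Python sketch of one common way to score a transcript: count the word-level substitutions, insertions and deletions against a human-checked reference, then divide by the length of the reference. Treating "accuracy" as one minus this word error rate is my assumption, since the figure above does not say how it is measured, and the function names `word_errors` and `word_accuracy` are made up for illustration.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Count word-level edits (substitutions, insertions, deletions)
    needed to turn the hypothesis transcript into the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 minus (word errors / reference length)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(reference, hypothesis) / ref_len


if __name__ == "__main__":
    score = word_accuracy("Rest in Peace", "Rest in Peas")
    print(f"{score:.0%} of reference words correct")
```

On a three-word phrase a single slip like “Peas” is a big deal, but the same arithmetic shows why long videos are hard: at 96% accuracy, a 1,000-word lecture can still contain about 40 wrong words.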
Could an Easy Button Be Coming?
What we are all waiting for is the captioning “Easy Button” that will allow us to upload any video file and presto – get back a reasonably accurate transcription regardless of the speaker.
The good news is that the Norwegian University of Science and Technology (NTNU) has been working on new speech recognition algorithms. Unlike previous systems, the new approach appears to include a little more old-fashioned phonetic and phonological information and won’t be quite as reliant on statistical models.
It still might not be perfect. As with current systems, you will need high-quality recordings so that enough phonetic information can be retrieved. I suspect that any speaker outside known linguistic parameters (e.g. a speaker with an undocumented accent) will still be able to throw off the system.
But I am glad that linguistics is being included in the solution.