The pitfalls of automated speech-to-text recognition

By Lindsay Stoker

The attacks on our industry are two-pronged: Electronic Recording (ER) and automated speech-to-text recognition (AST) services. This is not just a captioner issue. Often the vendors of ER are also selling AST services and touting a variety of claims.

The ER industry loves to make ambitious claims. Without naming names, they love to use buzzwords such as “near perfect levels” and “best-in-class platforms,” and all vendors claim to have an accuracy rate in the 95 to 99 percent range. They use standardized audio sources comparable to a game of wiffleball and then purport to compete against the major league baseball players we all are in our profession.

Let’s dive into intellectually honest metrics. ER vendors all love to claim word error rates of around 5 percent. How could they possibly do that, what is the truth behind these numbers, and what are the mitigating circumstances preventing them from actually reaching those goals? The biggest problems with their statistics are the audio sample used, the bias in their data, privacy and security, realtime response behavior, and readability.

What are the actual metrics? Descript, an audio/visual company, recently published the results of its analysis using inputs such as scripted and unscripted broadcasts, telephone conversations, and meetings. Each category of audio had a variety of speakers, including some with accents. Using a professionally transcribed product as their control, researchers added up the number of words the automatic speech recognition engine got wrong and divided by the total number of words. That ratio is the word error rate. The best product was only able to achieve an accuracy score of 84 percent.
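The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the scoring code Descript used: the function name word_error_rate and the sample sentences are made up for the example, and it assumes the common convention of counting wrong, missing, and extra words through a word-level edit distance against the human transcript.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word errors in the machine transcript divided by words in the human one."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word inserted in the hypothesis
                prev[j - 1] + cost,  # matching word or substituted word
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up sample: the human transcript is the control, the other is machine output.
human = "the witness said she did not see the car"
machine = "the witness said he did not see a car"
wer = word_error_rate(human, machine)
print(f"{wer:.1%}")  # 22.2%
```

In this sample, two wrong words out of nine give a word error rate of about 22 percent, or roughly 78 percent accuracy.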

Let’s dig back into the confounding variables.

Audio samples used: The biggest problem with their statistics is the audio sample used. Each company (except for Google) uses audio recordings from a source called Switchboard. This consists of a large number of recorded phone conversations; it is ubiquitous in the field and in current literature. By using the same source, vendors can make apples-to-apples comparisons with other vendors.

One vendor employee review states that “Its site boasts of powerful transcription software which reduces operating costs by up to 50 percent, but in reality a bog-standard automatically generated text (the 99 percent accuracy that they quote for this text is a complete joke most of the time unless it is perfect audio with one native American English speaker) is provided which humans then have to edit and review for accuracy for pitifully small amounts of money.”

Bias in their data: Another issue is bias in data. Racial disparity bias shows up when the data used to build algorithms (sets of commands that can be simple or complex) comes from privileged groups.

One national study published by the National Academy of Sciences showed that speech recognition programs are biased against Black speakers. On average, all five programs from leading technology companies showed significant race disparities and were twice as likely to incorrectly transcribe audio from Black speakers as opposed to white speakers.

Another study conducted by Stanford University came to the same conclusion and determined that “for every hundred words, the systems made 19 errors for white speakers compared to 35 errors for Black speakers— nearly twice as many.”

Privacy and security: The flipside of the ability to translate speech into text is the ability to generate natural-sounding speech and deepfake technology.

A Canadian startup, Lyrebird, is developing artificial intelligence capable of mimicking the sound of a person’s voice complete with accents and intonations. After learning how to generate speech, the artificial intelligence can adapt to any voice after hearing 60 seconds of a person’s speech pattern.

Lyrebird is not the only vendor to engage in deepfake technology. A man used Elon Musk’s OpenAI project to create a hyper-realistic chatbot that mimicked his late fiancée. This is from Business Insider: “All he had to do was plug in old messages and give some background information, and suddenly the model could emulate his partner with stunning accuracy.”

Similar technology was also used to produce a documentary about the late television personality Anthony Bourdain. Experts warn that this technology could be used to “impersonate others online” and could be used to manipulate audio files. In situations like depositions or court proceedings, where a complete and accurate record is required by law, having a stenographer who can attest to the accuracy of the record is imperative.

Realtime response behavior: Another concern with AST technology is latency, or realtime response behavior: the time between when speech is received as input and when the resulting text is available for viewing. It depends on factors such as network capability, the network connection, and the device’s microphone. For example, when the digital meeting software Zoom receives spoken input, it must send that sound to the platform’s server, which converts it into text and sends the text back to the user.

This can result in delayed captions or captions that drop out altogether. Numerous users, myself included, have observed and reported issues where the recognition arbitrarily stops for extended periods of time and may or may not later pick up again.

Readability: The final issue with AST captions is readability. In measuring the total error rate, researchers did not count speaker labels or incorrect punctuation as errors. Furthermore, all words are weighted equally, so an error that reverses the meaning of a sentence counts no more than a trivial one.
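To see how that scoring choice hides readability problems, here is a rough Python sketch. It is an assumption about how such scoring is typically set up, not the researchers’ actual code: the normalize function, the speaker-label format, and the sample lines are all hypothetical. Both transcripts are stripped of speaker labels, capitalization, and punctuation before they are compared.

```python
import re


def normalize(line: str) -> list:
    """Reduce a transcript line to bare lowercase words before scoring."""
    # Drop a leading all-caps speaker label such as "THE WITNESS:"
    # (hypothetical label format, used here only for illustration).
    line = re.sub(r"^[A-Z .]+:\s*", "", line)
    # Lowercase the text and remove punctuation, keeping apostrophes.
    line = re.sub(r"[^\w\s']", "", line.lower())
    return line.split()


# A reporter's line versus unlabeled, unpunctuated machine output (made-up sample).
reporter = "THE WITNESS: No. I did not see him leave."
machine = "no i did not see him leave"

# After normalization the two are identical, so the machine line scores as error-free.
print(normalize(reporter) == normalize(machine))  # True
```

The machine line here would be counted as perfect even though a reader cannot tell who is speaking or where one sentence ends and the next begins.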

The claims by various AST platforms are often biased and unsubstantiated. A human captioner or court reporter is and remains the gold standard.

Lindsay Stoker, RPR, CRC, is a 14-year captioner and court reporter from Huntington Beach, Calif. She also serves on the NCRA STRONG Committee. She can be reached at