
The latest on speech recognition

By David Ward

Earlier this year, Watson, the IBM artificial intelligence computer that first gained fame in 2011 by beating past champions on the TV game show Jeopardy!, publicly returned to the stage. The category: speech recognition. At a San Francisco tech conference this past spring, IBM officials announced that Watson is now able to hold conversations in English with a word error rate of 6.9 percent.

Watson’s latest feat, which works out to 93.1 percent word accuracy, is fairly impressive — though it should be noted, it’s still well below the 98.5 percent accuracy rate required by many captioning companies.
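
For context on those figures: word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn a system’s output into the reference transcript, divided by the reference length, which is why a 6.9 percent error rate corresponds to roughly 93.1 percent word accuracy. The sketch below is a generic, illustrative version of that calculation in Python, not IBM’s evaluation code:

```python
# Illustrative only: a standard word error rate (WER) calculation.
# WER = (substitutions + deletions + insertions) / reference word count,
# computed here as word-level edit distance via dynamic programming.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the witness answered the question"
hyp = "the witness answer the question"
print(word_error_rate(ref, hyp))  # 0.2: one substitution out of five words
```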

But Watson and IBM are not alone. During the past five years, speech recognition has emerged as a hot industry, with progress being made on issues such as accuracy and noise filtering, and not just by established players like Nuance, but also by major tech giants like Apple, Google, Amazon, Microsoft, and the Chinese search/e-commerce firm Baidu.

A recent study, Global and China Speech Recognition Industry Report 2015-2020, projected the global intelligent voice market will grow from $6.21 billion in 2015 to $19.2 billion in 2020. In China alone, voice recognition is expected to be a nearly $3.8 billion market within four years.

The good news for court reporters is that, thus far, few of these new speech recognition breakthroughs seem to be aimed at their livelihood of transcribing spoken testimony into 100 percent accurate and properly formatted legal documents. Instead, most of the focus is on speech recognition as a consumer tool, getting smartphones or car entertainment systems to respond to verbal commands. The highest-profile example of this trend is Siri, the Apple app (based on Nuance software) now built into iPhones, iPads, and iPods that lets owners use their voices to send messages, make calls, set reminders, and more.

“In the past, the growth of speech to text was fairly slow, and the main reason for that was that the improvement in accuracy hadn’t happened,” explains Walt Tetschner, president of Voice Information Associates, one of the leading analyst firms covering the automated speech recognition industry. The company is based in Acton, Mass. “It’s only been the last couple of years that you’ve seen dramatic improvements in accuracy — and I would attribute that not to speech to text, but to the demands of mobile.”

Tetschner, who’s also editor and publisher of the trade magazine ASRNews, adds that voice recognition is now attracting interest from more than just Silicon Valley, saying, “The auto industry is big on speech recognition as a way to control the infotainment system within a car, so they’re putting in a lot of resources.”

Because so much of the current focus is on voice recognition — getting a device to understand what a person is saying — little of this is likely to have an immediate impact on the court reporting industry. But Tetschner does note: “Whether it’s used in an automobile or call center or wherever, the basic speech recognition principles are the same, so when you make gains in one area, it will help the others.”

Outside of the consumer space, the segment where speech recognition has been gaining the most traction is as a personal productivity tool for busy professionals, especially those in the medical/healthcare industry.

“The push in the medical field for electronic medical records makes it a great market to be in,” says Henry Majoue, founder and CEO of Voice Automated based in Lake Forest, Calif. Majoue adds that doctors and other medical professionals use software customized by his company to dictate notes and other directives that are automatically turned into electronic text and included in patient files and other medical records.

Peter Mahoney, Nuance senior vice president as well as general manager of the company’s Dragon Desktop division, adds that the improved accuracy of speech recognition is helping to drive adoption in a host of other industries as well.

“There are plenty of areas where we’re seeing a lot of growth, including public safety and financial services professionals,” Mahoney says. Those groups now outnumber the people with hearing issues or other disabilities who were previously among the first adopters of Nuance speech-to-text solutions.

Mahoney says Dragon Desktop is also making inroads in the legal industry. “In litigation preparation, attorneys will often read multiple documents and dictate their notes so they can prepare their arguments,” he says. “We’re also seeing a lot more mobile use, where lawyers dictate notes while outside the office or do case documentation on the go. Lawyers create a lot of text, and it’s far more productive to be able to use their voices as opposed to having to type.”

Majoue agrees with this assessment: “Our legal business has been aimed at helping lawyers get their documents done, and that can include any type of legal business that needs a lot of documentation, such as workers’ compensation firms.”

Virtually all of this speech recognition is occurring in fairly controlled environments — and it involves just a single voice. “It’s extremely accurate for the individual,” points out Majoue. “You could get a copy of Dragon software now and be pretty well dictating at greater than 90 percent accuracy within five minutes.”

With accuracy that’s good but not close to perfect, voice recognition tools could soon help universities and other public and private institutions go through vast archives of video and audio records, including lectures, speeches, and seminars, and make them searchable by keyword.
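
As a rough sketch of how that keyword search could work once recordings are transcribed, the Python below builds a simple inverted index over transcripts. The file names and transcript text are invented placeholders; a real pipeline would feed in output from a speech recognition engine:

```python
# A minimal sketch of keyword search over transcribed recordings.
# The transcripts below are made-up placeholders; in practice they
# would come from an ASR engine run over the audio archive.
from collections import defaultdict

transcripts = {
    "lecture_001.wav": "today we cover the history of the supreme court",
    "seminar_014.wav": "noise filtering improves recognition accuracy",
}

# Build an inverted index mapping each word to the recordings containing it.
index = defaultdict(set)
for recording, text in transcripts.items():
    for word in text.lower().split():
        index[word].add(recording)

def search(keyword: str) -> set:
    """Return the recordings whose transcript contains the keyword."""
    return index.get(keyword.lower(), set())

print(search("accuracy"))  # {'seminar_014.wav'}
```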

But the next step, moving from voice recognition using a single user in a controlled setting to one where there are multiple voices in a noisy room, continues to be out of reach.

“It’s still extremely difficult currently for speech recognition to move beyond a single user,” Majoue says. “In order for it to work with more than one voice, each individual user would need his or her own computer and microphone focused on just that one voice.”

Those limitations in multi-voice environments are one reason why speech recognition hasn’t been even more widely adopted by the deaf and hard of hearing, who were expected to be among the technology’s earliest beneficiaries.

“Although, in recent years, deaf and hard-of-hearing people have begun to use speech recognition tools in some limited situations, the technology has not been where it needs to be for practical purposes in most settings,” explains Howard Rosenblum, CEO for the National Association of the Deaf (NAD). “While improved accuracy is, by far, the most needed element in speech recognition software, the NAD is in favor of current developments where the speech recognition software is able to identify and indicate who is speaking, particularly in group settings. The NAD also looks forward to improved speed in converting speech to text so that it is as close to real time as possible. There also needs to be software that can be adjusted to give more weight to specific verbiage when the discussion is in a particular area, such as science, computer technology, medicine, law, psychology, and the like.”

With plenty of pent-up demand for speech recognition tools that can work with multiple voices and in rooms with less-than-ideal acoustics, Mahoney suggests the solution may not be that far off.

“When you listen to a recording of a meeting, it sounds in retrospect like a bunch of crazy people because people communicate half-thoughts and they redirect their comments because people are speaking over them,” Mahoney says. “Because of that, the algorithms that are being used for transcripts even in the very sophisticated machine learning tools used these days struggle to come up with context.”

But Mahoney goes on to predict, “Within five years, you’ll see very, very high accuracy, and that’s being driven by a combination of more advanced machine learning techniques to develop sophisticated speech models combined with the access to lots and lots of data that can be used to train these systems to get smarter and smarter.”

Mahoney adds that the development of voice biometric systems that can identify different voices, separate the speech, and label the speaker could also soon be a key component of speech recognition software from Nuance.
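
Mahoney doesn’t detail how Nuance’s biometrics work, but a common textbook framing of speaker labeling is to match a voice “print” (an embedding vector) for each audio segment against enrolled speakers by similarity. The Python below illustrates only that matching step, with invented embedding numbers standing in for what a trained model would extract from the audio:

```python
# Illustrative speaker labeling: match each segment's voice embedding to
# the closest enrolled speaker by cosine similarity. The embedding values
# are made up; real systems derive them from audio with a trained model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

enrolled = {                        # hypothetical voiceprints per speaker
    "Attorney": [0.9, 0.1, 0.2],
    "Witness":  [0.1, 0.8, 0.3],
}

segments = [[0.85, 0.15, 0.25],     # hypothetical per-segment embeddings
            [0.05, 0.90, 0.20]]

for i, seg in enumerate(segments):
    speaker = max(enrolled, key=lambda name: cosine(seg, enrolled[name]))
    print(f"segment {i}: {speaker}")  # segment 0: Attorney, segment 1: Witness
```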

As for the acoustic challenges that may come with speech recognition in settings such as a courtroom, Mahoney says: “We’ll certainly use software to clean up the dirty signals in a noisy environment. But on top of that, there are audio processing capabilities that will allow you to deal with far-field microphones.”

Even with these improvements, Mahoney says there will still be many things that court reporters do today that voice recognition software simply won’t be able to match.

“One thing that current voice recognition systems aren’t good enough at yet is really doing the kind of formatting and labeling of data in the way a court reporter would do to come up with a finished document,” Mahoney says. “To produce the raw text with high accuracy that is readable — that can be done. But to produce a transcript that is formatted and 100 percent accurate, you are going to need a court reporter.”

Speech recognition has been on the radar screen of the court reporting community for two decades or more. Given all the recent improvements in speech recognition technology, is now the time for court reporters to start to worry?

Right now, the answer to that seems to be yes — and no.

There are already companies out there in other parts of the world promising speech recognition as part of their courtroom technology, but given the current technology limitations, most of those claims can easily be dismissed as unproven hype.

Still, Mason Farmani, CEO and managing partner of Barkley Court Reporters in Los Angeles, says: “I do think that we should take speech recognition very seriously. With the exponential growth of the technology, the usefulness of this technology in our profession is not that far away. Between IBM’s Watson project, Google’s Jarvis, and Apple’s Siri, there are many speech databases being built, and some, like IBM, are encouraging and facilitating Watson-based systems. I do think we are much closer than we think. The only issue would be adopting legislation to accommodate these changes, which we all know happens very slowly.”

Farmani stops short of suggesting reporting firms begin investing in speech recognition tools in the same way many have invested in other technology trends like videography and cloud-based storage.

“I don’t think the time has come yet — but when it’s here, we should [invest in it], especially with the continuing shortage of court reporters,” Farmani says. “I think there will come a time where these technologies will be used in legal proceedings, but highly educated, highly professional court reporters will continue to be in demand for more complicated cases.”

Todd Olivas of Todd Olivas & Associates of Temecula, Calif., says he has no plans to invest in speech recognition tools any time soon. “But the reason is not because the technology isn’t there yet,” he says. “In fact, I’ll go further and amend Benjamin Franklin’s famous quote about certitude – ‘in this world nothing can be said to be certain, except death and taxes’ – to include technology advancements. So for our purposes, we’d be foolish to think that court reporting is immune to technological advancements. Just ask yourself how many pen writers are still working. Steno writers took their place, right? Ask yourself how many steno writers are transcribing from paper notes. Laptops installed with CAT software took that role. Still, the human court reporter is the constant in all of those scenarios.”

Because of that, Olivas recommends that court reporters not fret over changes in technology like speech recognition, but simply be ready to adapt when they do come.

“We perform our duties using a certain set of tools today in 2016,” Olivas explains. “Yes, the various technologies will change, but the core role of what we do will not. Because our real value is not tied to that. Our real value is being the eyes and ears of the judge. There will always be the need to have a neutral, third-party observer at these proceedings who can administer the oath, facilitate the capture of the spoken word, produce a written document, and certify to its accuracy.”

David Ward is a journalist in Carrboro, N.C. Comments on this article can be sent to jcrfeedback@ncra.org.