Automated Speech Recognition Approaches And Challenges

“The ability to recognize speech as well as humans do is a continuing challenge, since the human speech, especially during the spontaneous conversation, is extremely complex. It’s also difficult to define human performance since humans also vary in their ability to understand the speech of others. When we compare automatic recognition to human performance it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated.” Julia Hirschberg, Professor and Chair of the Department of Computer Science at Columbia University

Technology is evolving very differently from how it did a few decades ago, and Artificial Intelligence is trying to bring efficient interaction between machines and human beings through intelligent systems.

In many areas, we are moving toward this future at a surprisingly fast pace, thanks to the continuing development of what is known as automated speech recognition technology.

What Is Automated Speech Recognition?

Automated Speech Recognition, commonly known as ASR, refers to the technology by which a machine or program is engineered to identify spoken words or phrases and convert them to text or another machine-readable format. Well-known examples include YouTube closed captions, the Siri front end, the Google Assistant front end, Cortana, voicemail transcription, and IBM Watson.

Voice AI is growing into various aspects of our lives, and many are looking for ways to improve sharing experiences. YouTube is a good example: even if you are not a native speaker, captions make the content easier to follow, especially since people speak the same language in various dialects.

ASR is the technology behind real-time closed captioning. Using advanced ASR, the audio portion of any video can be transcribed even as the content creator is speaking, making closed captions easy to add anytime and anywhere without interrupting the experience for the content creator.

Why Is ASR Desirable For All Languages?

ASR contributes to the preservation of endangered languages. Many languages are close to extinction or have been given endangered status, so technologies built for them can contribute toward their preservation.
ASR enables natural interfaces for both literate and illiterate people, making it easier for everyone to interact with ASR-enabled systems.
Speech is a primary means of human communication. By comparison, struggling with a keypad to type is cumbersome, and people generally feel freer speaking than typing. People also like to keep their hands free while talking, although I don’t think that is a good idea in some cases, such as driving.

What Are The Approaches Used For ASR Development?

There are two common approaches to developing ASR solutions:

The traditional approach
The modern or Deep Learning approach

The traditional approach:

This approach starts with feature creation from the sound file: the audio is split into short windows, which may require some filtering and aggregation, and transformations such as Fourier transforms are applied to each window. Connectionist temporal classification (CTC), by contrast, is a training objective for neural sequence models rather than a windowed transform. Feature creation is followed by an acoustic model that matches the features to phonemes. The final part applies a language model that uses probability distributions to predict words from the phonemes and then sequences of words.

The advantage of breaking the problem into a pipeline like this is that you can work on each part independently to improve the system. In practice, however, this is also a disadvantage: the overall process can be brittle, and it requires specialized researchers.
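The feature-creation step can be illustrated with a minimal sketch that uses only the Python standard library. It assumes a synthetic 1000 Hz tone in place of real speech, a Hamming window, and a naive discrete Fourier transform; the function names, frame size, and hop size are illustrative choices, not part of any particular ASR toolkit.

```python
import cmath
import math


def frame_signal(samples, frame_size, hop_size):
    """Split a list of samples into overlapping frames of equal length."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frames.append(samples[start:start + frame_size])
    return frames


def hamming(frame_size):
    """Hamming window coefficients, which taper the edges of each frame."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_size - 1))
            for n in range(frame_size)]


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(total))
    return spectrum


def spectrogram(samples, frame_size=64, hop_size=32):
    """One magnitude spectrum per frame: a basic spectrogram."""
    return [magnitude_spectrum(frame)
            for frame in frame_signal(samples, frame_size, hop_size)]


# Toy input (an assumption for illustration): a pure tone, not real speech
SAMPLE_RATE = 8000
TONE_HZ = 1000
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(signal)
# The strongest frequency bin of the first frame, converted back to Hz
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 64
```

In this toy case the strongest bin of each frame corresponds to the tone's frequency. Production systems typically use a fast Fourier transform and further processing on top of these spectra, which this sketch leaves out.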

The modern or Deep Learning approach:

The goal of this approach is to replace the intermediate steps with one algorithm. The deep learning approach has achieved state-of-the-art results in speech transcription tasks and is replacing the traditional methods used in ASR. It is also simpler, because it involves fewer steps and does not require as much specialized expertise. Implementing this approach requires a working knowledge of deep learning tools such as PyTorch, TensorFlow, and DeepSpeech.
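One common way for an end-to-end model to turn audio frames directly into text is connectionist temporal classification (CTC): the network emits a probability distribution over characters (plus a special "blank") for every frame, and the decoder merges repeats and drops blanks. The sketch below shows only that greedy decoding step. The per-frame probabilities are made up for illustration, and the blank symbol and function name are assumptions rather than the API of any framework.

```python
BLANK = "_"  # assumed symbol for the CTC "blank" label


def ctc_greedy_decode(frame_probs, alphabet):
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    best_path = []
    for probs in frame_probs:
        best_index = max(range(len(probs)), key=lambda i: probs[i])
        best_path.append(alphabet[best_index])

    decoded = []
    previous = None
    for symbol in best_path:
        # Keep a symbol only when it differs from the previous frame
        # and is not the blank label
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "a", "c", "t"]
# Made-up network outputs: one probability row per audio frame
frame_probs = [
    [0.1, 0.1, 0.7, 0.1],  # c
    [0.1, 0.1, 0.7, 0.1],  # c (repeat, merged)
    [0.6, 0.2, 0.1, 0.1],  # blank
    [0.1, 0.7, 0.1, 0.1],  # a
    [0.2, 0.6, 0.1, 0.1],  # a (repeat, merged)
    [0.7, 0.1, 0.1, 0.1],  # blank
    [0.1, 0.1, 0.1, 0.7],  # t
]
text = ctc_greedy_decode(frame_probs, alphabet)
```

The blank label is what lets the decoder tell a genuinely doubled letter (two identical symbols separated by a blank) from one letter held across several frames.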

How Does It Work?

The computer program converts the speech into a spectrogram, a machine-readable representation of the audio file of the words.
The audio is cleaned up by removing background noise and normalizing the volume. An acoustic model then breaks the cleaned-up audio (wave file) representation down into phonemes (the basic building blocks of the sounds of languages and words).
The ASR software uses statistical probability to analyze the phonemes in sequence and deduce whole words. An NLP model can then be applied to the resulting sentences to understand the meaning of the audio, develop a suitable reply, and deliver that reply through text-to-speech (TTS) conversion.
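The statistical step, choosing the most probable word sequence among candidates that sound alike, can be sketched with a toy bigram language model. Everything here is an assumption for illustration: the three-sentence corpus is made up, add-one smoothing is just one simple choice, and a real recognizer would combine this score with the acoustic model's score rather than use it alone.

```python
import math
from collections import Counter

# Tiny made-up training corpus; a real language model uses far more text
CORPUS = [
    "it is hard to recognize speech",
    "machines recognize speech with models",
    "we walked on a nice beach",
]


def train_bigram_counts(sentences):
    """Count single words and adjacent word pairs, with a start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_probability(sentence, unigrams, bigrams):
    """Bigram log-probability of a sentence with add-one smoothing."""
    vocab_size = len(unigrams)
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


unigrams, bigrams = train_bigram_counts(CORPUS)

# Two candidate transcriptions that sound similar when spoken
candidates = ["recognize speech", "wreck a nice beach"]
best = max(candidates, key=lambda c: log_probability(c, unigrams, bigrams))
```

With this corpus the model prefers "recognize speech", partly because that word pair was seen in training and partly because shorter sequences accumulate fewer penalties, which is one reason practical systems normalize or combine scores more carefully.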

What Are The Challenges Of Developing ASR?

Regardless of which approach is used, ASR solutions are in most cases limited by the nature of real human interaction, the way we communicate, and the actual data used to train the ASR:

Style: The way humans communicate varies with the situation. Conversational speech differs from read speech, and continuous speech differs from isolated words. Some conversations also mix languages, such as the blend of Swahili and English commonly known as Swanglish.
Speaker characteristics: In the real world, people may speak faster than the speech the machine was trained on, and accents vary across speakers of the same language.
Environment: Examples include background noise and channel conditions.
Task specifics: The size of the vocabulary and the language constraints.

How Do We Measure The Success Of An ASR Solution?

The most commonly used measure of ASR performance is the Word Error Rate, usually known as WER: the number of word-level errors in the ASR output (substitutions, deletions, and insertions) divided by the number of words in the reference transcript, usually expressed as a percentage. A lower WER means a better model.

For academic tasks on specific datasets, a WER of five percent is possible, but for real-life applications, a WER of 10-20 percent is considered acceptable. This is because the ASR models are trained on historical datasets, which may not reflect modern voice data. Another issue with some models is the inability to handle regional accents, as they may only be trained on voices from the same region. This is a particular problem with some of the cloud ASR services. (Source: altviz ASR paper)
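WER is normally computed as WER = (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions needed to turn the reference transcript into the ASR output, and N is the number of words in the reference. A minimal sketch follows, assuming whitespace tokenization and no text normalization; the function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
# One substitution (sat -> sit) and one deletion (the) over six words
wer = word_error_rate(reference, hypothesis)
```

Because insertions count as errors, WER can exceed 100 percent when the output is much longer than the reference.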

Final Thoughts

The usefulness of ASR is not limited to speech-to-text. The transcription can feed other NLP techniques, including classification of text into categories that depend on the use case, named entity extraction (people, places, and organizations), and natural language understanding to analyze the meaning of the text.

A point to keep in mind: Artificial Intelligence learns from training datasets, so when some speakers’ data are missing from the training set, the ASR can’t accurately parse their speech. Considering diversity when developing ASR solutions for a group of users is an important part of building effective solutions.

Privacy is another major sticking point for ASR’s widespread adoption. ASR systems may be used in offices, homes, vehicles, stores, or any other setting where they offer convenience, and that adoption is contingent on consumers trusting the privacy of their data.

Making machines listen to us is a big deal. For all its complexities, challenges, and technicalities, ASR technology is really about one simple goal: helping machines or computers listen to us. It may look simple or fun, but when we stop to think about it, we realize just how important this capability truly is.