An Overview of Speaker Normalization Techniques for Speech Recognition

Proceedings of Technological Advances in Science, Medicine and Engineering Conference 2021

Anosha Ignatius, Uthayasanker Thayasivam

Abstract

Speech embedding techniques based on deep neural networks have shown strong performance in speech processing applications such as automatic speech recognition and spoken language understanding. However, their performance can degrade considerably when training and testing conditions are mismatched. This mismatch is caused by variability in the paralinguistic information present in the speech signal, such as speaker characteristics and emotional states. Many techniques have been explored to address this problem by disentangling the paralinguistic content from the speech signal when the target application requires only the linguistic content. The most common approach is to provide speaker-specific information at the input of the acoustic model to normalize speaker effects. Speaker-specific information is characterized by speaker embeddings, which map speech utterances to fixed-dimensional vectors. This approach is challenging because it requires large amounts of labeled training data. Thus, in low-resource scenarios, where only a limited amount of transcribed speech data is available, unsupervised speech representation learning methods are adopted instead. This study presents a detailed analysis of research on speaker normalization in speech recognition systems using disentangled speech representations.
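
As a rough illustration of the input-level approach described above, the sketch below maps utterances of different lengths to a fixed-dimensional vector and appends that vector to every acoustic frame before it would reach an acoustic model. It is a minimal sketch under stated assumptions, not a method taken from the surveyed work: the function names `utterance_embedding` and `append_speaker_embedding` are hypothetical, the 13-dimensional random features stand in for real acoustic features, and a simple mean over time stands in for a learned speaker embedding.

```python
import numpy as np


def utterance_embedding(frames: np.ndarray) -> np.ndarray:
    """Map a (num_frames, feat_dim) utterance to one fixed-size vector.

    Assumption: a mean over time is used as a stand-in for a learned
    speaker embedding extractor.
    """
    return frames.mean(axis=0)


def append_speaker_embedding(frames: np.ndarray, embedding: np.ndarray) -> np.ndarray:
    """Concatenate the same speaker vector onto every frame of the utterance."""
    tiled = np.tile(embedding, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)


# Synthetic features (assumption): two utterances of different lengths.
rng = np.random.default_rng(0)
short_utt = rng.normal(size=(50, 13))
long_utt = rng.normal(size=(120, 13))

# Both utterances map to vectors of the same dimensionality.
emb_short = utterance_embedding(short_utt)
emb_long = utterance_embedding(long_utt)

# Speaker-aware input for a hypothetical acoustic model.
augmented = append_speaker_embedding(long_utt, emb_long)
```

In this sketch the acoustic model would receive frames of width 26 (13 acoustic dimensions plus 13 speaker dimensions), with the speaker part held constant across the utterance.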

Keywords: speech recognition, paralinguistic information, speaker normalization

Last modified: 2021-07-03

Building: TASME Center
Room: Science Hall
Date: July 4, 2021 - 11:05 AM – 11:20 AM
