Proceedings of Technological Advances in Science, Medicine and Engineering Conference 2021

Explainable Embedding Based on Speaker Feature Using AAVD
Jenarthanan Rajenthiran, Lakshikka Sithamaparanathan, Saranya Uthayakumar, Uthayasanker Thayasivam
Abstract

Deep neural networks have made significant advances in computer vision and machine learning, enabling disruptive applications in robotics, healthcare, and security. Despite this increased performance, understanding the inner workings of deep neural networks and explaining their predictions remains difficult. The interpretability of the embeddings produced by an Artificial Intelligence (AI) model is closely tied to a number of important issues, including the safety of AI models in critical applications such as autonomous driving and medical image diagnosis, as well as the fairness and bias of AI models, which can have social and moral consequences. As a result, the lack of interpretability severely limits the use of complex models in broader AI applications.

In this work, we further investigate the interpretability and explainability of audio embeddings and propose the Average Activation Value Difference (AAVD) method for explaining an embedding in terms of the information carried by a complex speech signal. We extract features that are mapped to the dimensions of the embedding. Each piece of information (feature) of a speech signal, for example the speaker's gender, age, accent, and language, as well as paralinguistic features of the voice (pitch, quality, and loudness), is mapped to a particular dimensional range in the embedding. The importance of a feature is calculated using the AAVD approach, which is adapted from the saliency-map approach [1, 2] in the image domain. To the best of our knowledge, there are no existing approaches for explaining voice embeddings. Moreover, the existing approaches belong to the image domain [3]: saliency maps [2] compute the gradient of the output category with respect to the input image, and occlusion mapping [4] estimates the relevance of pixels by blocking off portions of the image.
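As an illustration of the image-domain baseline referred to above, the following minimal sketch computes a gradient-based saliency map in PyTorch. The classifier (model), the input tensor (image), and the target class are hypothetical placeholders; this snippet only illustrates the existing image-domain technique and is not part of the proposed method.

    import torch

    def saliency_map(model, image, target_class):
        # Gradient of the target-class score with respect to the input image.
        model.eval()
        image = image.detach().clone().requires_grad_(True)  # track gradients w.r.t. input pixels
        scores = model(image.unsqueeze(0))                   # forward pass on a batch of one
        scores[0, target_class].backward()                   # backpropagate from the target logit
        return image.grad.abs()                              # per-pixel importance scores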

The AAVD approach indicates the difference in activation between one level and another. Here a level denotes a unique subcategory of a feature; for example, male and female are two levels of the gender feature. We have defined levels for each feature according to our own definitions. The AAVD approach consists of the following steps. First, data belonging to two levels of a speaker feature is selected. This data is fed to a deep learning model and embeddings are generated. The activation values for each level are obtained from the embeddings, and the average activation value difference between the two selected levels is calculated. This difference serves as an importance score for the dimensions: it reveals which dimensions matter most for the specific speaker feature. From the dimensional range identified for each speaker feature, we map that feature to a specific range of dimensions in the embedding, which explains that the speaker feature is represented through that dimensional range. In the same way, the embedding can be split into several ranges, each specific to a speaker feature.
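The sketch below illustrates the AAVD computation described above. It assumes two hypothetical NumPy arrays of embeddings, one per level, each of shape (number of utterances, embedding dimension), produced by a speaker model; it is a minimal illustration under those assumptions rather than the exact implementation used in this work.

    import numpy as np

    def aavd(emb_level_a, emb_level_b, top_k=32):
        # Average activation per dimension for each level (e.g. male vs. female utterances).
        mean_a = emb_level_a.mean(axis=0)
        mean_b = emb_level_b.mean(axis=0)
        # Average activation value difference: per-dimension importance for this feature.
        diff = np.abs(mean_a - mean_b)
        # Dimensions with the largest difference form the range mapped to the feature.
        top_dims = np.argsort(diff)[::-1][:top_k]
        return diff, top_dims

For instance, calling aavd on the embeddings of male and female utterances would return the dimensions most affected by the gender feature, which together form the dimensional range attributed to gender.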


Building: TASME Center
Room: Science Hall
Date: July 3, 2021, 09:50 AM – 10:05 AM
