Overview

Presentations of accepted Interspeech papers and other research work.

Date: Friday 6th September
Time: 14:15 - 16:05
Place: TB180 Joensuu Science Park, F213 Kuopio (Video)

Estimated schedule

14:15 - 14:35: Unleashing the Unused Potential of I-Vectors Enabled by GPU Acceleration (Ville Vestman, poster)
14:35 - 14:55: Towards Debugging Deep Neural Networks by Generating Speech Utterances (Ville Hautamäki, poster)
14:55 - 15:15: Recognition of Creaky Voice from Emergency Calls (Lauri Tavi, poster)
15:15 - 15:35: Ensemble Models for Spoofing Detection in Automatic Speaker Verification (Bhusan Chettri, poster)
15:35 - 16:05: On security of voice biometrics for large speaker populations (Tomi Kinnunen, lecture)

Abstracts

Unleashing the Unused Potential of I-Vectors Enabled by GPU Acceleration
Speaker embeddings are continuous-valued vector representations that allow easy comparison between speakers' voices with simple geometric operations. Among others, the i-vector and the x-vector have emerged as the mainstream methods for speaker embedding. In this paper, we illustrate the use of a modern computation platform to harness the benefits of GPU acceleration for i-vector extraction. In particular, we achieve a 3000-fold speed-up over real time in frame posterior computation and a 25-fold speed-up in training the i-vector extractor compared to the CPU baseline from the Kaldi toolkit. This significant speed-up allows the exploration of ideas that were hitherto impractical. In particular, we show that it is beneficial to update the universal background model (UBM) and to re-compute frame alignments while training the i-vector extractor. Additionally, we are able to study different variations of i-vector extractors more rigorously than before. In this process, we reveal some undocumented details of Kaldi's i-vector extractor and show that it outperforms the standard formulation by a margin of 1 to 2% when tested with the VoxCeleb speaker verification protocol. All of our findings are validated by ensemble averaging the results of multiple runs with random starts.
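
The frame posterior computation that benefits most from the GPU reduces to a few large matrix operations over all frame/component pairs. Below is a minimal PyTorch sketch of posterior computation for a diagonal-covariance UBM; the function name and argument layout are hypothetical and do not reflect Kaldi's or the authors' implementation.

    import math
    import torch

    def frame_posteriors(feats, means, vars_, log_weights):
        """Posterior of each UBM component for each frame (zeroth-order stats).

        feats:       (T, D) tensor of frame features
        means:       (K, D) component means
        vars_:       (K, D) diagonal covariances
        log_weights: (K,)   log mixture weights

        Moving all tensors to a CUDA device makes this run on the GPU.
        """
        D = feats.shape[1]
        # Per-component constant of the diagonal-Gaussian log-density: (K,)
        const = -0.5 * (D * math.log(2 * math.pi) + torch.log(vars_).sum(dim=1))
        # Squared Mahalanobis distance for every frame/component pair: (T, K)
        diff = feats.unsqueeze(1) - means.unsqueeze(0)        # (T, K, D)
        mahal = (diff ** 2 / vars_.unsqueeze(0)).sum(dim=2)   # (T, K)
        log_lik = const + log_weights - 0.5 * mahal           # broadcasts to (T, K)
        # Normalizing over components yields the posteriors
        return torch.softmax(log_lik, dim=1)

Because every frame/component pair is handled in one batched operation rather than in a loop, the arithmetic maps naturally onto GPU hardware.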

Towards Debugging Deep Neural Networks by Generating Speech Utterances
Deep neural networks (DNNs) can successfully process and classify speech utterances. However, understanding the reason behind a DNN's classification is difficult. One debugging method used with image classification DNNs is activation maximization, which generates example images that the network classifies as a given class. In this work, we evaluate the applicability of this method to speech utterance classifiers as a means of understanding what the DNN "listens to". We train a classifier on the speech command corpus and then use activation maximization to pull samples from the trained model. We then synthesize audio from the generated features with a WaveNet vocoder for subjective analysis. We measure the quality of the generated samples with objective measurements and crowd-sourced human evaluations. The results show that, when combined with a prior of natural speech, activation maximization can be used to generate examples of the different classes. Based on these results, activation maximization is a promising first step towards opening up the DNN black box in speech tasks.
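
Activation maximization is gradient ascent on the input rather than on the network weights. A minimal PyTorch sketch under simplifying assumptions (the names are hypothetical, and a small L2 penalty stands in for the natural-speech prior used in the paper):

    import torch

    def activation_maximization(model, target_class, feat_shape,
                                steps=500, lr=0.05):
        """Optimize an input feature map to maximize one class logit."""
        model.eval()
        x = torch.randn(1, *feat_shape, requires_grad=True)  # random start
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            logits = model(x)
            # Maximize the target logit; the small L2 term is a crude
            # stand-in for a proper prior over natural speech features.
            loss = -logits[0, target_class] + 1e-3 * x.pow(2).mean()
            loss.backward()
            opt.step()
        return x.detach()

The returned feature map is what the classifier "hears" as the target class; in the paper's pipeline such features would then be passed to a vocoder for listening tests.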

Recognition of Creaky Voice from Emergency Calls
Although creaky voice, or vocal fry, is a widely studied phonation mode, open questions remain in creak's acoustic characterization and automatic recognition, largely because creak varies significantly with conversational context. In this study, we introduce an exploratory creak recognizer based on a convolutional neural network (CNN), developed specifically for emergency calls. The study focuses on recognizing creaky voice in authentic emergency calls because creak detection could provide information about the caller's emotional state or an attempt at voice disguise. We trained the CNN system on emergency call recordings together with out-of-domain speech recordings and compared the results with an existing, widely used creaky voice detection system: on poor-quality emergency call recordings used as test data, that system achieved an F1 score of 0.41, whereas our CNN system achieved 0.64. The results show that the CNN system can perform moderately well on challenging test data with a limited amount of training data, and that it has the potential to reach higher F1 scores when more emergency calls are used for model training.
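
The F1 scores quoted above are the harmonic mean of precision and recall on the creak class. A minimal reference computation (the counting unit, e.g. frames or segments, is an assumption here, not taken from the paper):

    def f1(tp, fp, fn):
        """F1 from counts of true positives, false positives, false negatives."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

Because F1 ignores true negatives, it is a natural choice when the creaky class is rare relative to modal voicing.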

Ensemble Models for Spoofing Detection in Automatic Speaker Verification
Detecting spoofing attempts against automatic speaker verification (ASV) systems is challenging, especially when only one modeling approach is used. For robustness, we use both deep neural networks and traditional machine learning models and combine them into ensemble models through logistic regression. The models are trained to detect logical access (LA) and physical access (PA) attacks on the dataset released as part of the ASV Spoofing and Countermeasures Challenge 2019 (ASVspoof 2019). We propose dataset partitions that ensure different attack types are present during training and validation, which improves system robustness. Our ensemble model outperforms all of our single models and the challenge baselines for both attack types. We also investigate why some models perform much better than others on the PA dataset and find that spoofed recordings tend to have longer trailing silences than genuine ones. When these silences are removed, the PA task becomes much more challenging: the tandem detection cost function (t-DCF) of our best single model rises from 0.1672 to 0.5018 and its equal error rate (EER) increases from 5.98% to 19.8% on the development set.
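
Score-level fusion through logistic regression, as used for the ensembles above, amounts to fitting a weighted combination of the base models' scores. A minimal scikit-learn sketch on placeholder data (array names, shapes, and the random stand-in scores are hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each column holds one base model's scores for N utterances;
    # labels are 1 = spoof, 0 = bona fide. Placeholder data only.
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(1000, 3))
    labels = rng.integers(0, 2, size=1000)

    fuser = LogisticRegression()
    fuser.fit(scores, labels)

    # The fused detection score is the log-odds of the spoof class
    fused_scores = fuser.decision_function(scores)

In practice the fusion weights would be fit on held-out validation scores rather than on the same data used to train the base models, to avoid overly optimistic weighting.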