Bibliography | Reich, Kevin: Optimizing neural fake speech detection using post hoc analysis. University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 38 (2021). 76 pages, English.
Abstract | With the technological advances in speech synthesis, it has become apparent that attackers can abuse this technology to launch fake speech attacks in a number of ways: faking the voice of a supervisor to order an employee to make money transfers, spreading fake news and propaganda, or spoofing automatic speaker verification (ASV) systems. It has therefore become important to detect whether speech is genuine or artificially created. A small-scale study contained in this thesis indicates that humans do not find this problem trivial and will therefore need the help of automatic countermeasure (CM) systems. The most successful automatic approaches are based on neural networks. In our work, we analyzed the decision-making process of neural CM systems and used that insight to improve the performance of the best network we tested. Our work was done on the ASVspoof 2019 dataset, as it was the only popular fake speech dataset in use when we started. First, we showed that using spectrogram images as input is a viable way to solve the task of fake speech detection. This allowed us to use image classification models and the post hoc analysis method Score-CAM. Among the image classification models we tested, EfficientNet-B3 achieved the best scores. Our post hoc analysis of EfficientNet-B3 revealed that it uses background noise and features in the lower frequencies to distinguish between real and fake speech samples. We used that insight in two follow-up experiments to improve the model's performance by 28.7% and 30.25%, respectively. To date, the model from the second follow-up experiment is the fifth-best non-ensemble model on the ASVspoof 2019 LA (logical access) dataset. This highlights the importance of understanding what neural networks are actually doing, since that understanding can be used to optimize their performance significantly.
Department(s) | University of Stuttgart, Institute for Natural Language Processing
Supervisor(s) | Vu, Prof. Ngoc Thang; Tilli, Pascal
Entry date | November 4, 2021