Master Thesis MSTR-2023-84

Bibliography: Zhou, Zhenliang: Evaluation of transfer learning methods in text-to-speech.
University of Stuttgart, Faculty of Computer Science, Electrical Engineering, and Information Technology, Master Thesis No. 84 (2023).
79 pages, English.
Abstract

Transfer learning is an important and widely used machine learning method in natural language processing. Training a model for a specific task from scratch requires large task-specific datasets and considerable computing resources. Transfer learning addresses this: a developer first pre-trains a multi-task model on large datasets and then needs only a small amount of task-specific data to adapt the model to the target task. In natural language processing, four major transfer learning methods have been proposed: adapters, BitFit, diff pruning, and full finetuning. They require less finetuning data and less training time while achieving results comparable to a single-task model. We apply these four transfer learning methods in the text-to-speech domain. We pre-train FastSpeech 2 on a multi-speaker dataset so that the model learns the speech characteristics of these speakers. A single-speaker dataset is then used to finetune the pre-trained model to imitate that speaker's voice. After generating speech with the four transfer models, we compare the generated audio with the speaker's original recordings and score the signals with objective and subjective evaluation to measure each method's performance. We find that BitFit performs best in the transfer learning experiment trained on a low-resource dataset (VCTK), while full finetuning suffers from overfitting, which heavily degrades the audio duration information. Moreover, the audio generated by the diff pruning model is pure noise, indicating that diff pruning is unsuitable for transfer on low-resource datasets. In a comparative experiment, we use the high-resource LJSpeech dataset for finetuning; there, the adapter and full finetuning models restore the speech best.
Although the voice quality of BitFit and diff pruning is inferior to that of the adapter and full finetuning models, the audio quality is not significantly reduced.
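To make the comparison concrete, the following is a minimal sketch (not taken from the thesis) of the parameter-selection rule behind BitFit: all weights are frozen and only bias terms remain trainable, which is why the method needs so little finetuning data. The parameter names below are hypothetical, loosely modeled on a Transformer block such as those in FastSpeech 2.

```python
def bitfit_trainable(param_names):
    """Return the parameter names BitFit would leave trainable.

    BitFit finetunes only bias terms; every other parameter
    (weight matrices, embeddings, etc.) stays frozen.
    """
    return [name for name in param_names if name.endswith("bias")]


# Hypothetical parameter names for illustration only:
params = [
    "encoder.attn.q_proj.weight", "encoder.attn.q_proj.bias",
    "encoder.ffn.fc1.weight", "encoder.ffn.fc1.bias",
    "decoder.ffn.fc2.weight", "decoder.ffn.fc2.bias",
]

trainable = bitfit_trainable(params)
# Only the three bias entries are updated during finetuning;
# the weight matrices are left at their pre-trained values.
```

In a real framework the same rule is applied by disabling gradient updates for every non-bias parameter before finetuning starts; the other three methods differ only in which subset of parameters (adapter layers, a sparse diff vector, or everything) is left trainable.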

Department(s): University of Stuttgart, Institute for Natural Language Processing
Supervisor(s): Vu, Prof. Ngoc Thang; Schweitzer, Dr. Antje; Koch, Julia
Entry date: February 20, 2024