Master's Thesis MSTR-2023-84

Bibliographic data
Zhou, Zhenliang: Evaluation of transfer learning methods in text-to-speech.
Universität Stuttgart, Fakultät Informatik, Elektrotechnik und Informationstechnik, Master's Thesis No. 84 (2023).
79 pages, English.
Abstract

Transfer learning is a widely used machine learning method in natural language processing. Training a dedicated model from scratch for each task consumes large amounts of data and computing resources; transfer learning avoids this by first training a multi-task model on large datasets and then adapting it to the target task with only a small amount of task-specific data. In natural language processing, four major transfer learning methods have been proposed: adapters, BitFit, diff pruning, and full finetuning, which achieve results comparable to single-task models with less finetuning data and less training time. We apply these four transfer learning methods in the text-to-speech domain. We pretrain FastSpeech2 on a multi-speaker dataset so that it learns the speech characteristics of these speakers, and then finetune the pretrained model on a single-speaker dataset so that it imitates that speaker's voice. After generating speech audio with the four transfer models, we compare it with the speaker's original recordings and score the speech signals through objective and subjective evaluation to measure each method's performance. We find that BitFit performs best in the transfer learning experiment trained on the low-resource dataset (VCTK), while full finetuning suffers from overfitting, which heavily distorts the duration information of the audio. Moreover, the audio generated by the diff pruning model is pure noise, indicating that diff pruning is entirely unsuitable for transfer with low-resource datasets. In a comparative experiment we use the high-resource LJSpeech dataset for finetuning; there, the adapter and full finetuning models reproduce the target speech best, and although the voice quality of BitFit and diff pruning is inferior to adapter and full finetuning, it is not significantly reduced.
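To make the finetuning regimes concrete, here is a minimal PyTorch sketch of BitFit, the method that performed best in the low-resource experiment: every pretrained weight is frozen and only the bias terms stay trainable, so the optimizer touches only a tiny fraction of the parameters. The toy model standing in for FastSpeech2, the function name apply_bitfit, and the optimizer settings are illustrative assumptions, not code from the thesis.

    # BitFit sketch: freeze all pretrained weights, leave only biases trainable.
    # The Sequential model below is a toy stand-in for FastSpeech2 (assumption).
    import torch
    import torch.nn as nn

    def apply_bitfit(model: nn.Module):
        trainable = []
        for name, param in model.named_parameters():
            if name.endswith("bias"):
                param.requires_grad = True    # bias terms are updated
                trainable.append(param)
            else:
                param.requires_grad = False   # all weight matrices stay frozen
        return trainable

    pretrained_tts = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
    bias_params = apply_bitfit(pretrained_tts)

    # Only the bias vectors reach the optimizer; full finetuning would instead
    # pass model.parameters() here, which is what makes it prone to overfitting
    # on a small single-speaker dataset.
    optimizer = torch.optim.Adam(bias_params, lr=1e-4)  # lr is an assumed value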

Department(s): Universität Stuttgart, Institut für Maschinelle Sprachverarbeitung
Supervisors: Vu, Prof. Ngoc Thang; Schweitzer, Dr. Antje; Koch, Julia
Entry date: February 20, 2024