| Abstract | Recent advancements in text-to-speech (TTS) systems have significantly improved the naturalness and expressivity of synthetic speech. However, a persistent trade-off exists between expressivity and controllability. While deep learning-based models can generate highly expressive speech, they often lack mechanisms for fine-grained control over prosodic features, limiting their applicability in domains requiring precise prosody control, such as voice anonymization and high-quality dubbing. This thesis explores probabilistic prosody modeling as a means to enhance expressivity while maintaining user control over pitch, energy, and duration. We evaluate three generative approaches: Normalizing Flows (NF), Conditional Flow Matching (CFM), and Rectified Flows (RF), comparing their effectiveness in capturing natural prosodic variation. Unlike previous research, which primarily focused on stochastic duration modeling, this study systematically examines probabilistic methods across all three major prosodic features.
Through a structured multi-stage evaluation, including objective variance analysis and large-scale subjective studies, we demonstrate that probabilistic prosody modeling significantly improves prosodic diversity over deterministic approaches. Among the tested models, RF at moderate sampling temperatures achieves the best balance between naturalness and expressivity. The results confirm that controlled variance is crucial for balancing expressive speech synthesis with listener expectations of naturalness.
This work contributes to the development of more flexible and human-like TTS systems by providing systematic insights into probabilistic prosody modeling.
|