HierTTS: Expressive End-to-End Text-to-Waveform (TTS) using Multi-Scale Hierarchical Variational Auto-encoder



End-to-end text-to-speech (TTS) models that generate waveforms directly from text are gaining popularity. However, the prosody of existing end-to-end models is still not expressive enough to sound natural, and previous work on improving TTS expressiveness has focused mainly on acoustic models, leaving expressiveness within an end-to-end framework largely unexplored. We therefore propose HierTTS, a highly expressive end-to-end text-to-waveform generation model. It couples the hierarchical structure of speech with a hierarchical variational autoencoder and models latent variables at multiple scales: frame, phone, subword, word, and sentence. The hierarchical encoder encodes the speech signal from fine-grained features into coarse-grained latent variables; conversely, the hierarchical decoder generates fine-grained features conditioned on the coarse-grained latent variables. To prevent posterior collapse across the hierarchy, we propose a staged KL-weighted annealing strategy. Furthermore, a hierarchical text encoder extracts linguistic information at different levels and provides it to both the encoder and the decoder. Experimental results show that our model is closer to natural speech in prosodic expressiveness and has better generative diversity.
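The abstract names a staged KL-weighted annealing strategy but does not spell out the schedule. The following is only a minimal sketch of one way such a schedule could look, not the authors' implementation. The level order (coarse to fine), the linear ramp, and the names `Stage`, `staged_schedule`, `kl_weight`, and `total_kl` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Latent levels named in the abstract. The coarse-to-fine ordering of the
# stages is an assumption; the source does not state which level starts first.
LEVELS = ["sentence", "word", "subword", "phone", "frame"]


@dataclass(frozen=True)
class Stage:
    """Hypothetical per-level schedule: when the ramp starts and how long it lasts."""
    start_step: int
    warmup_steps: int


def staged_schedule(stage_length: int, warmup_steps: int) -> dict:
    """Give each level its own start step, staggered by stage_length steps."""
    if warmup_steps <= 0:
        raise ValueError("warmup_steps must be positive")
    return {
        level: Stage(start_step=i * stage_length, warmup_steps=warmup_steps)
        for i, level in enumerate(LEVELS)
    }


def kl_weight(step: int, stage: Stage, max_weight: float = 1.0) -> float:
    """Linear ramp from 0 to max_weight, beginning at the stage's start step."""
    if step <= stage.start_step:
        return 0.0
    progress = (step - stage.start_step) / stage.warmup_steps
    return max_weight * min(1.0, progress)


def total_kl(kl_per_level: dict, step: int, schedule: dict) -> float:
    """Sum the per-level KL terms, each scaled by its own annealed weight."""
    return sum(
        kl_weight(step, schedule[level]) * kl
        for level, kl in kl_per_level.items()
    )


if __name__ == "__main__":
    schedule = staged_schedule(stage_length=1000, warmup_steps=500)
    for step in (0, 250, 1250, 5000):
        weights = {level: kl_weight(step, schedule[level]) for level in LEVELS}
        print(step, weights)
```

In this sketch each level's KL term stays switched off until its stage begins and then ramps up, so a level is not pushed toward the prior before it has had a chance to carry information. Whether HierTTS staggers the levels in this order, or uses a linear ramp at all, is not stated in the abstract.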


Experimental results

I. Comparison to previous systems

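Text (Mandarin input, with English gloss)
Sample 1: 雇了美丽性感的女子克洛伊去试探丈夫。 (Hired the beautiful and sexy woman Chloe to test her husband.)
Sample 2: 对于梅耶尔而言,巴雷特确实“物有所值”。 (For Meyer, Barrett was indeed "worth the money".)
Sample 3: 但是目前泽尻究竟在何处静养还不清楚。 (But it is still unclear where Sawajiri is currently recuperating.)
Sample 4: 柳荫的失误让丹麦连偷两分,将比赛带入加时局。 (Liu Yin's mistake let Denmark steal two points in a row, sending the game into an extra end.)
Sample 5: 魏屯乡果断卖掉仅有的一辆桑塔纳轿车。 (Weitun Township decisively sold its only Santana sedan.)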
Ground truth
FastSpeech2 + HiFi-GAN
MultiGST + HiFi-GAN
PortaSpeech + HiFi-GAN

II. Ablation studies

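Text (Mandarin input, with English gloss)
Sample 1: 雇了美丽性感的女子克洛伊去试探丈夫。 (Hired the beautiful and sexy woman Chloe to test her husband.)
Sample 2: 对于梅耶尔而言,巴雷特确实“物有所值”。 (For Meyer, Barrett was indeed "worth the money".)
Sample 3: 但是目前泽尻究竟在何处静养还不清楚。 (But it is still unclear where Sawajiri is currently recuperating.)
Sample 4: 柳荫的失误让丹麦连偷两分,将比赛带入加时局。 (Liu Yin's mistake let Denmark steal two points in a row, sending the game into an extra end.)
Sample 5: 魏屯乡果断卖掉仅有的一辆桑塔纳轿车。 (Weitun Township decisively sold its only Santana sedan.)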
-Sentence level
-Word level
-Subword level

III. Sample diversity

Text 淋过雨的空气,疲倦了的伤心,我记忆里的童话已经慢慢的融化。 (The rain-soaked air, the weary heartache; the fairy tale in my memory has slowly melted away.)
Sample 1 Sample 2 Sample 3 Sample 4 Sample 5