Multilingual TTS

Audio Samples from "Incorporating fine-grained style transfer for multi-speaker multi-style multi-language text-to-speech"

Paper:

Paper is accepted.

Abstract:

Recently, multilingual TTS systems trained only on monolingual datasets have improved significantly. However, the quality of cross-language speech synthesis is not comparable to synthesis in the speaker's own language, and the output often carries a heavy foreign accent. This paper proposes a multi-speaker multi-style multi-language speech synthesis system (M3), which improves speech quality by introducing a fine-grained style encoder and overcomes the non-authentic accent problem through cross-speaker style transfer. To avoid leaking timbre information into the style encoder, we utilize a speaker-conditional variational encoder and conduct adversarial speaker training using a gradient reversal layer. Then, we build a Mixture Density Network (MDN) that maps text to the extracted style vectors for each speaker. At the inference stage, cross-language style transfer is achieved by assigning the style type of any speaker in the target language. Our system uses existing speaker styles and genuinely avoids foreign accents. In the MOS speech-naturalness evaluation, the proposed method generally achieves 4.0 and significantly outperforms the baseline system.
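As a rough illustration of the MDN idea mentioned in the abstract, the sketch below computes the negative log-likelihood of a single style vector under a mixture of diagonal Gaussians, which is the usual training objective for an MDN. It is not the authors' implementation: the function names (`log_gaussian_diag`, `mdn_nll`) and the parameter layout (`logits`, `means`, `log_stds`) are hypothetical, and in a real system these parameters would be predicted from text by a neural network rather than passed in by hand.

```python
import math


def log_gaussian_diag(x, mean, log_std):
    """Log density of vector x under a diagonal Gaussian (hypothetical helper)."""
    total = 0.0
    for xi, mi, lsi in zip(x, mean, log_std):
        z = (xi - mi) / math.exp(lsi)
        total += -0.5 * z * z - lsi - 0.5 * math.log(2.0 * math.pi)
    return total


def mdn_nll(style, logits, means, log_stds):
    """Negative log-likelihood of one style vector under a K-component mixture.

    Assumed layout (illustrative only): logits has K unnormalized mixture
    weights; means and log_stds each hold K vectors of the style dimension.
    """
    # Log-softmax over the mixture weights, computed stably.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    # Per-component joint log-probability: log weight + log density.
    comp = [
        (l - log_norm) + log_gaussian_diag(style, mu, ls)
        for l, mu, ls in zip(logits, means, log_stds)
    ]
    # Log-sum-exp over components gives the mixture log-likelihood.
    c = max(comp)
    return -(c + math.log(sum(math.exp(v - c) for v in comp)))


# Toy example: a 2-dimensional style vector and two mixture components.
style_vec = [0.1, -0.2]
nll = mdn_nll(
    style_vec,
    logits=[0.0, 0.0],
    means=[[0.0, 0.0], [3.0, 3.0]],
    log_stds=[[0.0, 0.0], [0.0, 0.0]],
)
print(nll)
```

Minimizing this quantity over a training set pushes the predicted mixture toward the style vectors extracted by the style encoder. At inference, a style vector could then be taken from the predicted mixture, for example the mean of the most probable component, though the exact choice here is an assumption.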

Our training data includes six speakers covering three languages: three Mandarin speakers (spk1, spk2, spk3), two English speakers (spk4, spk5), and one German speaker (spk6). Among them, spk4 speaks with an Indian accent, and spk3 and spk5 are male speakers. We extract style from each speaker