Audio Samples from "Incorporating fine-grained style transfer for multi-speaker multi-style multi-language text-to-speech"
Paper:
Paper is accepted.
Abstract:
Recently, multilingual TTS systems trained only on monolingual datasets have improved significantly. However, the quality of cross-language speech synthesis is not comparable to that of synthesis in the speaker's own language, and the output often carries a heavy foreign accent. This paper proposes a multi-speaker multi-style multi-language speech synthesis system (M3), which improves speech quality by introducing a fine-grained style encoder and overcomes the non-authentic accent problem through cross-speaker style transfer. To avoid leaking timbre information into the style encoder, we use a speaker-conditional variational encoder and conduct adversarial speaker training with a gradient reversal layer. We then build a Mixture Density Network (MDN) that maps text to the extracted style vectors for each speaker. At the inference stage, cross-language style transfer is achieved by assigning any speaker's style type in the target language. Our system uses existing speaker styles and genuinely avoids foreign accents. In the MOS speech-naturalness evaluation, the proposed method generally achieves 4.0 and significantly outperforms the baseline system.
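The abstract mentions an MDN that maps text to extracted style vectors. As a rough illustration of what training such a predictor typically optimizes, the sketch below computes the negative log-likelihood of one target vector under a Gaussian mixture. The function name `mdn_nll`, the argument layout, and the diagonal-covariance assumption are hypothetical choices for this example; the abstract does not specify the paper's actual parameterization.

```python
import math


def mdn_nll(logits, means, log_sigmas, target):
    """Negative log-likelihood of `target` under a diagonal Gaussian mixture.

    logits:     K unnormalized mixture weights
    means:      K lists of D component means
    log_sigmas: K lists of D log standard deviations
    target:     list of D values (e.g. one extracted style vector)

    All names and shapes here are illustrative assumptions.
    """
    # Normalizer for a log-softmax over the mixture logits (stable form).
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))

    component_log_probs = []
    for logit, mu, log_sigma in zip(logits, means, log_sigmas):
        # Start from the log mixture weight of this component.
        log_prob = logit - log_norm
        # Add the diagonal Gaussian log-density, one dimension at a time.
        for x, m, ls in zip(target, mu, log_sigma):
            z = (x - m) / math.exp(ls)
            log_prob += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
        component_log_probs.append(log_prob)

    # Log-sum-exp over components gives the mixture log-likelihood.
    top = max(component_log_probs)
    log_likelihood = top + math.log(
        sum(math.exp(lp - top) for lp in component_log_probs)
    )
    return -log_likelihood
```

In a setup like the one the abstract describes, `logits`, `means`, and `log_sigmas` would be produced by a text-conditioned network, and `target` would be a style vector from the style encoder. That wiring is assumed here and not shown.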
Our training data includes six speakers covering three languages: three Mandarin speakers (spk1, spk2, spk3),
two English speakers (spk4, spk5), and one German speaker (spk6).
Among them, spk4 speaks with an Indian accent, and spk3 and spk5 are male speakers. We extract style from each speaker.
Multilingual speech synthesis
Baseline:
Source language (rows): CN, EN, DE
Target language (columns): CN, EN, DE
M3 w/o FSE:
Source language (rows): CN, EN, DE
Target language (columns): CN, EN, DE
M3 (Proposed):
Source language (rows): CN, EN, DE
Target language (columns): CN, EN, DE
Style transfer
Covering both within-language and cross-language style transfer:
Target style
style1
style2
style3
style4
style5 (Indian accent)
style6
Source speaker\Reference audio
Spk1
Spk2
Spk3
Spk4
Spk5
Spk6
Ablation study for style encoder
Target style
style1
style2
style3 (male)
style4
style5 (Indian accent)
style6
Reference audio
M3
-adversarial speaker training
-speaker condition
Performance of style predictor against the number of training samples