Audio Samples from "Incorporating fine-grained style transfer for multi-speaker multi-style multi-language text-to-speech"

Paper:

The paper has been accepted.

Abstract:

Recently, multilingual TTS systems trained only on monolingual datasets have improved significantly. However, the quality of cross-language speech synthesis is not comparable to synthesis in the speaker's own language and often comes with a heavy foreign accent. This paper proposes a multi-speaker multi-style multi-language speech synthesis system (M3), which improves speech quality by introducing a fine-grained style encoder and overcomes the non-authentic accent problem through cross-speaker style transfer. To avoid leaking timbre information into the style encoder, we use a speaker-conditional variational encoder and conduct adversarial speaker training with a gradient reversal layer. We then build a Mixture Density Network (MDN) that maps text to the extracted style vectors for each speaker. At the inference stage, cross-language style transfer is achieved by assigning any speaker's style type in the target language. Our system reuses existing speaker styles and avoids foreign accents. In the speech-naturalness MOS evaluation, the proposed method generally achieves 4.0 and significantly outperforms the baseline system.
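As a rough illustration of the MDN step mentioned in the abstract, the sketch below computes the negative log-likelihood of one style vector under a mixture of Gaussians and picks a style vector at inference. It is not the paper's implementation: the diagonal-covariance assumption, the choice of the highest-weight component at inference, and the names `mdn_nll` and `mdn_mode_mean` are all assumptions made for this example.

```python
import math


def mdn_nll(logits, means, log_stds, target):
    """Negative log-likelihood of `target` under a diagonal-Gaussian mixture.

    logits:   K unnormalized mixture weights (hypothetical network output)
    means:    K mean vectors, each with D entries
    log_stds: K vectors of per-dimension log standard deviations
    target:   the extracted style vector with D entries
    """
    # Normalizer of the mixture weights, computed stably (log-softmax).
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))

    log_probs = []
    for logit, mu, log_sigma in zip(logits, means, log_stds):
        log_weight = logit - log_norm
        # Sum of per-dimension Gaussian log densities (diagonal covariance).
        log_density = 0.0
        for x, m, ls in zip(target, mu, log_sigma):
            z = (x - m) / math.exp(ls)
            log_density += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
        log_probs.append(log_weight + log_density)

    # log-sum-exp over components gives the mixture log-likelihood.
    top = max(log_probs)
    return -(top + math.log(sum(math.exp(lp - top) for lp in log_probs)))


def mdn_mode_mean(logits, means):
    """Assumed inference rule: return the mean of the highest-weight component."""
    best = max(range(len(logits)), key=lambda k: logits[k])
    return means[best]
```

In training, a loss of this form would be minimized over the style vectors extracted by the style encoder; at inference, the predicted mean (or a sample from the mixture) would stand in for the reference-derived style vector.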

 

Our training data includes six speakers covering three languages: three Mandarin speakers (spk1, spk2, spk3), two English speakers (spk4, spk5), and one German speaker (spk6). Among them, spk4 speaks with an Indian accent, and spk3 and spk5 are male speakers. We extract style from each speaker.
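The speaker setup above can be written down as a small lookup table, which also shows how a style might be chosen for cross-language synthesis: pick the style of any speaker whose language is the target language. This is only a sketch; the `SPEAKERS` dictionary layout, the language codes (taken from the table headers on this page), and the helper `style_candidates` are hypothetical and not part of the released system.

```python
# Speaker inventory as described on this page; "notes" holds only what is stated.
SPEAKERS = {
    "spk1": {"language": "CN", "notes": ""},
    "spk2": {"language": "CN", "notes": ""},
    "spk3": {"language": "CN", "notes": "male"},
    "spk4": {"language": "EN", "notes": "Indian accent"},
    "spk5": {"language": "EN", "notes": "male"},
    "spk6": {"language": "DE", "notes": ""},
}


def style_candidates(target_language):
    """Return the speakers whose style could be assigned for `target_language`."""
    return sorted(
        name for name, info in SPEAKERS.items()
        if info["language"] == target_language
    )


if __name__ == "__main__":
    # Example: styles that could be used when any speaker synthesizes English.
    print(style_candidates("EN"))
```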

 

Multilingual speech synthesis

Baseline:

Columns (target language): CN, EN, DE
Rows (source language): CN, EN, DE

M3 w/o FSE:

Columns (target language): CN, EN, DE
Rows (source language): CN, EN, DE

M3 (Proposed):

Columns (target language): CN, EN, DE
Rows (source language): CN, EN, DE

 

Style transfer

The samples below cover both within-language and cross-language style transfer:

Columns (target style / reference audio): style1, style2, style3, style4, style5 (Indian accent), style6
Rows (source speaker): Spk1, Spk2, Spk3, Spk4, Spk5, Spk6

 

Ablation study for style encoder

Columns (target style): style1, style2, style3 (male), style4, style5 (Indian accent), style6
Rows: Reference audio, M3, M3 without adversarial speaker training, M3 without speaker condition

 

Performance of style predictor against the number of training samples

Columns (number of training samples): 5, 10, 50, 100, 200, 2000
Row: Generated audio