Abstract
In recent years, with the application of deep learning to speech synthesis, waveform generation models based on generative adversarial networks have achieved quality comparable to natural speech. In most waveform generators, the neural upsampling unit plays an essential role, as it upsamples acoustic features to the sample level. However, we observe significant aliasing artifacts in the speech generated by waveform generators built on non-ideal upsampling. According to the Shannon-Nyquist sampling theorem, non-ideal upsampling filters such as nearest-neighbor, bilinear, subpixel, or zero-padding upsampling result in aliasing. In this paper, we systematically analyze how aliasing artifacts are produced in waveform generators based on non-ideal upsampling. We examine commonly used neural upsampling units, including transposed convolution, nearest-neighbor interpolation, and subpixel convolution, in HiFi-GAN and VITS, and find that high-frequency spectral details are generated from the low-frequency structure through nonlinear transformations. However, the nonlinear transformations cannot perfectly eliminate the low-frequency spectral imprint, which finally manifests as spectral artifacts in the generated waveforms. To suppress the aliasing artifacts, we first apply a low-pass filter after the upsampling layer but observe significant performance drops. Experimental results indicate that aliasing makes training converge faster by filling the otherwise empty high-frequency band. In this regard, we propose to mix high-frequency noise into the low-pass filtered features so that training still converges quickly while the artifacts are naturally avoided. In addition, we design an artifact-detection algorithm based on structural similarity to evaluate the effectiveness of our method.
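As a rough illustration of the mechanism described above (not the paper's implementation), the sketch below upsamples a toy signal by zero insertion, applies a pointwise nonlinearity, and then approximates the proposed remedy by low-pass filtering and mixing in high-frequency noise. The sample rate, upsampling factor, filter order, and noise level are arbitrary illustrative choices.

```python
# Minimal numpy/scipy sketch (not the paper's code) of the effect described
# above: zero-insertion upsampling (the signal a stride-k transposed
# convolution effectively operates on) replicates the baseband spectrum as
# images around multiples of the original sample rate, and a subsequent
# pointwise nonlinearity cannot remove that low-frequency imprint. The
# low-pass + high-frequency-noise step only approximates the idea in the
# abstract; all settings here are illustrative, not the paper's.
import numpy as np
from scipy import signal

sr = 22050                                  # "feature-rate" of the toy signal
upsample_factor = 4
t = np.arange(0, 1.0, 1.0 / sr)
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz tone standing in for a feature

# 1) Zero-insertion upsampling: the tone's spectrum is replicated as images
#    around multiples of the original sample rate.
x_up = np.zeros(len(x) * upsample_factor)
x_up[::upsample_factor] = x

# 2) A pointwise nonlinearity (as inside GAN vocoder blocks) does not remove
#    those images; they remain as inharmonic high-frequency content.
y_aliased = np.tanh(x_up)

# 3) Anti-aliasing idea from the abstract, approximated: low-pass the
#    upsampled signal below the original Nyquist, then mix in high-frequency
#    noise so the removed band is not simply left empty.
cutoff = 0.9 / upsample_factor               # normalized to the new Nyquist
b, a = signal.butter(8, cutoff)              # 8th-order Butterworth low-pass
x_lp = signal.filtfilt(b, a, x_up)
hf_noise = 0.01 * np.random.randn(len(x_lp))
hf_noise -= signal.filtfilt(b, a, hf_noise)  # keep only the high-frequency part
y_antialiased = np.tanh(x_lp + hf_noise)
```

Comparing the spectra of `y_aliased` and `y_antialiased` (e.g., with `scipy.signal.welch`) shows the image components above the original Nyquist replaced by broadband noise, which is the intuition behind the anti-aliasing samples presented below.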
Experiment results (Readers are encouraged to focus on the differences in the high-frequency band and to inspect the linear-scale STFTs, for example in Adobe Audition or with the plotting sketch below.)
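For readers who prefer scripting to Audition, the following hedged sketch plots linear-frequency STFT magnitudes of two files side by side so the high-frequency band can be compared; the file names, FFT size, and hop length are placeholders, not the paper's settings. The structural-similarity-based artifact detector mentioned in the abstract is not reproduced here; spectrograms like these could be compared with, e.g., `skimage.metrics.structural_similarity` as a rough proxy.

```python
# Hedged alternative to inspecting files in Adobe Audition: plot
# linear-frequency STFT magnitudes of two audio files side by side.
# File names and STFT parameters are placeholders.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_linear_stft(path, ax, title, n_fft=1024, hop=256):
    y, sr = librosa.load(path, sr=None)          # keep the native sample rate
    S_db = librosa.amplitude_to_db(
        np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)), ref=np.max)
    librosa.display.specshow(S_db, sr=sr, hop_length=hop,
                             x_axis="time", y_axis="linear", ax=ax)
    ax.set_title(title)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
plot_linear_stft("ground_truth.wav", axes[0], "Ground truth")
plot_linear_stft("generated.wav", axes[1], "Generated")
plt.tight_layout()
plt.show()
```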
I. Synthesis from ground-truth mel-spectrogram (HiFi-GAN)
- Dataset: LJSpeech
- Input: Ground truth mel-spectrogram
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Ground truth | (audio) | (audio) | (audio) | (audio) | (audio) |
| Transposed CNN | (audio) | (audio) | (audio) | (audio) | (audio) |
| Subpixel CNN | (audio) | (audio) | (audio) | (audio) | (audio) |
| Nearest neighbor | (audio) | (audio) | (audio) | (audio) | (audio) |
| Lowpass | (audio) | (audio) | (audio) | (audio) | (audio) |
| Anti-aliasing (ours) | (audio) | (audio) | (audio) | (audio) | (audio) |
II. Synthesis from text (VITS)
- Dataset: LJSpeech
- Input: Text
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Text | during the morning of November twenty-two prior to the motorcade. | Oswald was, however, willing to discuss his contacts with Soviet authorities. He denied having any involvement with Soviet intelligence agencies | During his Presidency, Franklin D. Roosevelt made almost four hundred journeys and traveled more than three hundred fifty thousand miles. | No night clubs or bowling alleys, no places of recreation except the trade union dances. I have had enough. | eleven. If I am alive and taken prisoner, |
| Ground truth | (audio) | (audio) | (audio) | (audio) | (audio) |
| Transposed CNN | (audio) | (audio) | (audio) | (audio) | (audio) |
| Subpixel CNN | (audio) | (audio) | (audio) | (audio) | (audio) |
| Nearest neighbor | (audio) | (audio) | (audio) | (audio) | (audio) |
| Lowpass | (audio) | (audio) | (audio) | (audio) | (audio) |
| Anti-aliasing (ours) | (audio) | (audio) | (audio) | (audio) | (audio) |