
Abstract

Single-speaker singing voice synthesis (SVS) usually underperforms at pitch values that lie outside the singer's vocal range or are covered by few training samples. Building on our previous work, this work proposes a melody-unsupervised multi-speaker pre-training method, conducted on a multi-singer dataset, that extends the vocal range of the single speaker without degrading timbre similarity. The pre-training method can be applied to a large-scale multi-singer dataset that contains only audio-and-lyrics pairs, with no phonemic timing information or pitch annotation. Specifically, in the pre-training step we design a phoneme predictor that produces frame-level phoneme probability vectors as the phonemic timing information, and a speaker encoder that models the timbre variations of different singers; the pitch information is provided by frame-level f0 values estimated directly from the audio. The pre-trained model parameters are passed to the fine-tuning step as prior knowledge to extend the single speaker's vocal range. This work also improves the sound quality and rhythm naturalness of the synthesized singing voices: it is the first to introduce a differentiable duration regulator to improve rhythm naturalness and a bi-directional flow model to improve sound quality. Experimental results verify that the proposed SVS system outperforms the baseline in both sound quality and naturalness.
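
To make the idea of estimating frame-level f0 directly from audio concrete, here is a minimal sketch based on normalized autocorrelation. It is an illustration only: the page does not say which f0 extractor the paper uses, and the function names (`estimate_f0`, `f0_contour`), the search range, the voicing threshold, and the frame and hop sizes are all assumptions made for this example.

```python
import math


def estimate_f0(frame, sample_rate, fmin=80.0, fmax=800.0, voicing_threshold=0.5):
    """Estimate F0 (Hz) of one frame by normalized autocorrelation.

    Returns 0.0 for silent or unvoiced frames. All defaults are
    illustrative assumptions, not values taken from the paper.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]  # remove any DC offset
    if sum(s * s for s in x) < 1e-8:
        return 0.0  # silent frame
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(n - 2, int(sample_rate / fmin))
    best_lag, best_score = 0, voicing_threshold
    for lag in range(min_lag, max_lag + 1):
        a, b = x[: n - lag], x[lag:]
        num = sum(p * q for p, q in zip(a, b))
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        score = num / den if den > 0.0 else 0.0
        # Require a clear improvement, so that multiples of the true period
        # (which score almost as well) do not cause octave errors.
        if score > best_score + 1e-3:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def f0_contour(signal, sample_rate, frame_len=1024, hop=256):
    """Slide a window over the signal and return one F0 value per frame."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [estimate_f0(signal[s : s + frame_len], sample_rate) for s in starts]
```

Calling `f0_contour` on a mono waveform (a list of floats) yields one value per frame, with 0.0 marking frames treated as unvoiced. A production system would more likely rely on a dedicated pitch tracker; the sketch only shows that no pitch annotation is needed to obtain the contour.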

Fig.1: The structure of the proposed model.

Subjective Evaluation

To demonstrate that our proposed model can significantly improve the naturalness and quality of the synthesized singing voice, we provide samples for comparison. GT denotes the ground truth, Baseline denotes the baseline model we compare against, and Proposed denotes the proposed model with the pre-training strategy, learnable upsampling layer, and bi-directional flow model, which are described in detail in the paper.

Ablation Study

We further conduct an ablation study to validate the individual contributions of our proposed method, removing the pre-training strategy, the bi-directional flow model, and the learnable upsampling layer in turn. The audio samples are presented below.

Case Study

To demonstrate the impact of the aforementioned contributions, we conduct a case study that synthesizes a test sample containing pitch values with limited training data. We compare the ground truth, the proposed method, and the baseline. The pitch contour is marked with blue lines, and the pitch value at the red line is shown on the right. This sample ends on a slightly low pitch for which little training data is available. The proposed method synthesizes this pitch accurately, whereas the baseline tends to substitute a higher pitch, which indicates that the proposed pre-training strategy is effective in extending the vocal range.

Model	Target Chinese Text	Audio
Proposed	我还在寻找一个依靠	(audio sample)
Baseline	我还在寻找一个依靠	(audio sample)
Ground-truth	我还在寻找一个依靠	(audio sample)
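
A pitch comparison like the one in this case study can also be quantified as a deviation in cents (hundredths of a semitone) between a synthesized f0 value and the ground-truth value. The sketch below uses the standard definitions of MIDI note number and cents; the helper names are hypothetical and the page does not state that the paper reports this metric.

```python
import math


def hz_to_midi(f0_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number, A4 = 440 Hz = 69."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def cents_error(f0_synth, f0_ref):
    """Signed pitch deviation in cents; positive means the synthesized pitch is higher."""
    return 1200.0 * math.log2(f0_synth / f0_ref)


def mean_abs_cents(synth_contour, ref_contour):
    """Mean absolute deviation in cents over frames where both contours are voiced (> 0).

    Returns None when no frame is voiced in both contours.
    """
    errors = [
        abs(cents_error(s, r))
        for s, r in zip(synth_contour, ref_contour)
        if s > 0 and r > 0
    ]
    return sum(errors) / len(errors) if errors else None
```

For example, `cents_error(f0_baseline, f0_gt)` would return a positive value when a system replaces a low target pitch with a higher one, as described for the baseline above; the variable names here are placeholders, not measurements from the paper.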
