Single-speaker singing voice synthesis (SVS) usually underperforms at pitch values that lie outside the singer's vocal range or that have few training samples. Building on our previous work, we propose a melody-unsupervised multi-speaker pre-training method, conducted on a multi-singer dataset, that extends the vocal range of the single speaker without degrading timbre similarity. The pre-training method can be applied to a large-scale multi-singer dataset that contains only audio-and-lyrics pairs, with no phonemic timing information or pitch annotation. Specifically, in the pre-training step we design a phoneme predictor that produces frame-level phoneme probability vectors as the phonemic timing information, and a speaker encoder that models the timbre variations of different singers; the frame-level f0 values are estimated directly from the audio to provide the pitch information. The pre-trained model parameters are passed to the fine-tuning step as prior knowledge to extend the single speaker's vocal range. This work also improves the sound quality and rhythm naturalness of the synthesized singing voices: it is the first to introduce a differentiable duration regulator to improve rhythm naturalness, and a bi-directional flow model to improve sound quality. Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
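The abstract states that frame-level f0 values are estimated directly from the audio, but this page does not say which estimator is used. As a rough illustration only, the sketch below uses a plain autocorrelation peak picker, which is a stand-in technique and not necessarily the authors' method. The function names, the frame length, the hop size, and the 80 to 1000 Hz search range are all assumptions made for this example.

```python
import math

def estimate_f0_autocorr(frame, sample_rate, f_min=80.0, f_max=1000.0):
    """Estimate f0 (Hz) for one frame by picking the autocorrelation peak.

    Hypothetical helper: returns 0.0 for an all-zero frame and makes no
    voiced/unvoiced decision otherwise.
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    energy = sum(x * x for x in frame)
    if energy == 0.0 or lag_max <= lag_min:
        return 0.0
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def f0_track(signal, sample_rate, frame_len=1024, hop=256):
    """Slide a window over the signal and return one f0 value per frame."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [estimate_f0_autocorr(signal[s:s + frame_len], sample_rate) for s in starts]

# Usage on a synthetic 220 Hz tone (assumed 16 kHz sample rate).
sr = 16000
tone = [math.sin(2.0 * math.pi * 220.0 * n / sr) for n in range(2048)]
track = f0_track(tone, sr)
```

Because the lag is an integer number of samples, the estimate is quantised (at 16 kHz, neighbouring lags near 220 Hz are about 3 Hz apart). A production pipeline would more likely rely on a dedicated pitch tracker with interpolation and voicing detection.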
Fig.1: The structure of the proposed model.
Subjective Evaluation
To demonstrate that the proposed model improves the naturalness and quality of the synthesized singing voice, we provide samples for comparison. GT denotes the ground truth, Baseline denotes the baseline model we compare against, and Proposed denotes the proposed model with the pre-training strategy, the learnable upsampling layer, and the bi-directional flow model, all of which are described in detail in the paper.
| Target Chinese Text | GT | Baseline | Proposed |
| :----:| :----:| :----:| :----:|
| 你说我不该不该不该在这时候 |
Ablation Study
We further conduct an ablation study to validate the individual contributions of the proposed method. We remove the pre-training strategy, the bi-directional flow model, and the learnable upsampling layer in turn. The audio samples are presented below.
| Target Chinese Text | GT | Proposed | without pretrain | without bi-flow |
| :----:| :----:| :----:| :----:| :----:|
| 你说我不该不该不该在这时候 | (audio) | (audio) | (audio) | (audio) |
| 昂首到达每一个地方这世界的太阳 | (audio) | (audio) | (audio) | (audio) |
| 青春嫩绿得 | (audio) | (audio) | (audio) | (audio) |
| 很鲜明 | (audio) | (audio) | (audio) | (audio) |
| 想知道关于我的事情 | (audio) | (audio) | (audio) | (audio) |
| 我还在寻找一个依靠 | (audio) | (audio) | (audio) | (audio) |
| 不要再沉默徘徊 | (audio) | (audio) | (audio) | (audio) |
| 冲破这层层阻碍 | (audio) | (audio) | (audio) | (audio) |
| 我才明白外面世界如此精彩 | (audio) | (audio) | (audio) | (audio) |
| 时间飞这生命似钟摆 | (audio) | (audio) | (audio) | (audio) |
| Target Chinese Text | GT | Proposed | without learnable upsampling layer |
| :----:| :----:| :----:| :----:|
| 是不是说没有做完的梦最痛 | (audio) | (audio) | (audio) |
| 把故事听到最后才说再见 | (audio) | (audio) | (audio) |
| 右手左手慢动作重播 | (audio) | (audio) | (audio) |
| 成长的烦恼算什么 | (audio) | (audio) | (audio) |
| 昂首到达每一个地方这世界的太阳 | (audio) | (audio) | (audio) |
| 青苔入镜檐下 | (audio) | (audio) | (audio) |
| 小酒窝长睫毛迷人的无可救药 | (audio) | (audio) | (audio) |
| 我放慢了步调感觉像是喝醉了 | (audio) | (audio) | (audio) |
| 我永远爱你到老 | (audio) | (audio) | (audio) |
| 就算曾经我们都轻如尘埃 | (audio) | (audio) | (audio) |
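The learnable upsampling layer ablated above expands phoneme-level features to frame level in a way that stays differentiable with respect to the durations. Its exact design is described in the paper, not on this page, so the sketch below swaps in a well-known alternative, Gaussian upsampling, purely to illustrate the idea of soft (rather than hard, repeat-based) expansion. The function name, the `sigma` parameter, and the toy inputs are assumptions for this example.

```python
import math

def gaussian_upsample(phoneme_feats, durations, sigma=1.0):
    """Expand phoneme-level vectors to frame level with soft Gaussian weights.

    Illustrative stand-in, not the paper's layer. Each output frame is a
    weighted average of all phoneme vectors; the weights depend smoothly on
    the (possibly fractional) durations.
    """
    # Centre of each phoneme on the frame axis: sum of earlier durations + d/2.
    centers, elapsed = [], 0.0
    for d in durations:
        centers.append(elapsed + d / 2.0)
        elapsed += d
    n_frames = int(round(elapsed))  # the rounding itself is not differentiable
    dim = len(phoneme_feats[0])
    frames = []
    for t in range(n_frames):
        # Log-weights of every phoneme for the midpoint of frame t.
        logits = [-((t + 0.5 - c) ** 2) / (2.0 * sigma ** 2) for c in centers]
        peak = max(logits)
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        weights = [w / total for w in weights]  # softmax over phonemes
        frames.append([
            sum(w * feat[k] for w, feat in zip(weights, phoneme_feats))
            for k in range(dim)
        ])
    return frames

# Usage: two one-hot "phoneme" vectors lasting 2 and 3 frames.
frames = gaussian_upsample([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0])
```

With hard repetition, a small change in a predicted duration either changes nothing or shifts a whole frame. With soft weights, the frame-level output varies continuously, which is what lets gradients reach the duration predictor when the same arithmetic is written in an autodiff framework.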
Case Study
To demonstrate the impact of the aforementioned contributions, we conduct a case study that synthesizes a test sample containing pitch values with limited training data. We compare the ground truth, the proposed method, and the baseline.
The pitch is marked with blue lines, and the pitch value at the red line is shown on the right. This sample ends with a slightly low pitch for which little training data is available. The proposed method synthesizes this pitch accurately, whereas the baseline tends to substitute a higher pitch, which indicates that the proposed pre-training strategy is effective in extending the vocal range.
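A pitch substitution like the one in the case study above can be put into musical terms by converting f0 values to semitones. The sketch below uses the standard MIDI mapping (A4 = 440 Hz = note 69). The function names and the example frequencies are hypothetical and are not read from the figure; the page does not state the unit of the displayed pitch value, so Hz is assumed.

```python
import math

def hz_to_midi(f0_hz, ref_hz=440.0):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(f0_hz / ref_hz)

def semitone_error(pred_hz, target_hz):
    """Signed pitch error in semitones; positive means the prediction is sharp."""
    return 12.0 * math.log2(pred_hz / target_hz)

# Hypothetical values: a 196 Hz target rendered at 220 Hz.
error = semitone_error(220.0, 196.0)  # roughly two semitones sharp
```

Semitones are easier to compare across vocal ranges than raw Hz, because a fixed Hz offset corresponds to a larger musical interval at low pitches than at high ones.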