Source code of the paper Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution.
- Install Python requirements.
pip install -r requirements.txt
- Please convert all the data files into `.wav` format and put them under the same directory. The following command will train a 48 kHz UDM.
python train.py model.res_channels=64 epochs=50 sr=48000 train_T=0 dataset.size=120000 dataset.segment=32768 dataset.data_dir=/your/vctk/train/set/ loader.batch_size=12 scheduler.patience=1000000
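
The conversion step above could be scripted roughly as follows. This is only a sketch under stated assumptions: it supposes the raw files are FLAC, that `ffmpeg` is available on the `PATH`, and that file stems are unique across sub-folders (otherwise outputs would overwrite each other). The helper names `plan_conversion` and `convert` are hypothetical and not part of this repository.

```python
import subprocess
from pathlib import Path


def plan_conversion(src_dir, dst_dir, suffix=".flac"):
    """Pair every `suffix` file found under src_dir with a .wav path directly under dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [
        (src, dst_dir / (src.stem + ".wav"))
        for src in sorted(src_dir.rglob(f"*{suffix}"))
    ]


def convert(src_dir, dst_dir, suffix=".flac"):
    """Convert each planned file with ffmpeg (assumed to be installed)."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_conversion(src_dir, dst_dir, suffix):
        # -y overwrites an existing output; check=True raises if ffmpeg fails.
        subprocess.run(
            ["ffmpeg", "-loglevel", "error", "-y", "-i", str(src), str(dst)],
            check=True,
        )
```

Calling `convert("/your/vctk/raw/", "/your/vctk/train/set/")` would then fill the directory passed to `dataset.data_dir`; the first path is a placeholder.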
The numbers in the paper can be reproduced with the following commands.
- `rate`: the upscaling ratio.
- `downsample-type`: the downsampling filter.
- `infer-type`: the upscaling method.
- `lr`: the $\eta$ value in the paper.
python vctk_dsp_baseline.py /your/vctk/test/set/ --downsample-type sinc --infer-type spline --rate 2
python -W ignore vctk_infer.py outputs/XXXX/saved/training_checkpoint_500000.pt outputs/XXXX/.hydra/config.yaml /your/vctk/test/set --rate 2 -T 50 --infer-type manifold --downsample-type stft --lr 0.67
python -W ignore vctk_infer.py outputs/XXXX/saved/training_checkpoint_500000.pt outputs/XXXX/.hydra/config.yaml /your/vctk/test/set --rate 3 -T 50 --infer-type inpainting --downsample-type sinc
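
When several settings need to be evaluated, the command lines above can be assembled programmatically. The sketch below only recombines flags that already appear in this README. The helper `udm_command` is hypothetical, `outputs/XXXX` remains a placeholder for your own run directory, and whether every combination of `rate`, `infer-type` and `downsample-type` is supported is not stated here.

```python
import shlex


def udm_command(run_dir, test_dir, rate, infer_type, downsample_type, steps=50, lr=None):
    """Build one vctk_infer.py command line from the flags shown above."""
    argv = [
        "python", "-W", "ignore", "vctk_infer.py",
        f"{run_dir}/saved/training_checkpoint_500000.pt",
        f"{run_dir}/.hydra/config.yaml",
        test_dir,
        "--rate", str(rate),
        "-T", str(steps),
        "--infer-type", infer_type,
        "--downsample-type", downsample_type,
    ]
    if lr is not None:
        # --lr is the eta value; it is only appended when one is given.
        argv += ["--lr", str(lr)]
    return shlex.join(argv)


# Print the inpainting command for both upscaling ratios.
for rate in (2, 3):
    print(udm_command("outputs/XXXX", "/your/vctk/test/set", rate, "inpainting", "sinc"))
```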
The checkpoint of UDM is used for noise scheduling.
For training NU-Wave, please refer to here. For evaluating NU-Wave+, change `infer-type` to `nuwave-manifold` and specify the value of `lr`.
python -W ignore vctk_infer.py outputs/XXXX/saved/training_checkpoint_500000.pt outputs/XXXX/.hydra/config.yaml /your/vctk/test/set --nuwave-ckpt /XXXX/checkpoints_nuwave_x2/nuwave_x2_01_07_22_epoch\=645_EMA --rate 2 -T 50 --infer-type nuwave --downsample-type stft
The checkpoint of UDM is used for noise scheduling.
For training NU-Wave 2, please refer to here. For evaluating NU-Wave 2+, change `infer-type` to `nuwave2-manifold` and specify the value of `lr`.
python -W ignore vctk_infer.py outputs/XXXX/saved/training_checkpoint_500000.pt outputs/XXXX/.hydra/config.yaml /your/vctk/test/set --nuwave-ckpt /XXXX/nuwave2_08_14_09_epoch\=72_EMA --rate 3 -T 50 --infer-type nuwave2 --downsample-type sinc
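
For both NU-Wave and NU-Wave 2, the "+" variant differs from the base command only in the `--infer-type` value and the extra `--lr` flag. As an illustration, the change can be expressed as a small transformation on the argument list. The helper `to_plus_variant` is hypothetical, and `"ETA"` is a placeholder to be replaced with the $\eta$ value you want to evaluate.

```python
def to_plus_variant(argv, lr):
    """Return a copy of argv with '-manifold' appended to the --infer-type value and --lr added."""
    out = list(argv)
    i = out.index("--infer-type") + 1
    if out[i] not in ("nuwave", "nuwave2"):
        raise ValueError(f"unexpected --infer-type value: {out[i]}")
    out[i] += "-manifold"
    return out + ["--lr", str(lr)]


# Tail of the NU-Wave 2 command above; the leading arguments are unchanged.
base_tail = ["--rate", "3", "-T", "50", "--infer-type", "nuwave2", "--downsample-type", "sinc"]
print(" ".join(to_plus_variant(base_tail, lr="ETA")))  # "ETA" is a placeholder
```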
We'll release the script for evaluating WSRGlow and NVSR in the future.
Downsampling audio with an IIR lowpass filter introduces non-linear phase delays, which breaks the assumption that the frequency mask is real-valued. An easy way to compensate for the delays is to apply the same filter again during upsampling, but backward in time. We repeated the 48 kHz experiment from the paper with an 8th-order Chebyshev Type I lowpass filter.
Model | 2x | 3x |
---|---|---|
NU-Wave | 0.87 | 1.00 |
NU-Wave 2 | 0.73 | 0.87 |
NU-Wave+ | 1.03 | 1.32 |
NU-Wave 2+ | 0.86 | 1.00 |
UDM+ | 0.64 | 0.79 |
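
The phase compensation described before the table can be illustrated with a toy example. This is a minimal sketch, not the code used for the experiment: a one-pole lowpass stands in for the 8th-order Chebyshev Type I filter to keep the example dependency-free, and the names `one_pole_lowpass`, `backward` and `dtft` are hypothetical. Filtering forward in time gives a response $H(\omega)$ with non-zero phase; filtering the result again backward in time gives $H(\omega)\,H^*(\omega) = |H(\omega)|^2$, which is real-valued.

```python
import cmath
import math


def one_pole_lowpass(x, a=0.6):
    """Causal IIR lowpass: y[n] = (1 - a) * x[n] + a * y[n - 1]."""
    y, prev = [], 0.0
    for sample in x:
        prev = (1.0 - a) * sample + a * prev
        y.append(prev)
    return y


def backward(x, a=0.6):
    """Apply the same filter in reversed time."""
    return one_pole_lowpass(x[::-1], a)[::-1]


def dtft(h, omega, origin):
    """Frequency response of h at omega, treating sample `origin` as time zero."""
    return sum(v * cmath.exp(-1j * omega * (n - origin)) for n, v in enumerate(h))


# Unit impulse in the middle of the signal, so both filter tails fit inside it.
n, center = 401, 200
impulse = [0.0] * n
impulse[center] = 1.0

forward_only = one_pole_lowpass(impulse)   # what an IIR downsampling filter does
forward_backward = backward(forward_only)  # compensation applied during upsampling

omega = 0.3 * math.pi
h_fwd = dtft(forward_only, omega, center)
h_fb = dtft(forward_backward, omega, center)
print(f"forward only:       phase = {cmath.phase(h_fwd):+.4f} rad")
print(f"forward + backward: phase = {cmath.phase(h_fb):+.4f} rad, gain = {abs(h_fb):.4f}")
```

With SciPy, the same idea should carry over to a filter designed with `scipy.signal.cheby1`; note that the combined magnitude response is the square of the single-pass response.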