This repo is for personal research into existing Wave-U-Net architectures for audio denoising. The model itself has not been modified; however, provisions were made to make data loading easier (without XML files).
The Wave-U-Net applied to speech enhancement [1], an adaptation of the original implementation for music source separation by Stoller et al. [2].
The Wave-U-Net is a convolutional neural network applicable to audio source separation tasks, recently introduced by Stoller et al. for the separation of music vocals and accompaniment [2]. A 1D convolutional, time-domain variant of the 2D convolutions within the U-Net [3], this end-to-end learning method for audio source separation operates directly in the time domain, permitting the integrated modelling of phase information and the use of large temporal contexts.
Experiments on audio source separation for speech enhancement in [1] show that the proposed method rivals state-of-the-art architectures, improving upon various metrics, namely PESQ, CSIG, CBAK, COVL and SSNR, with respect to the single-channel speech enhancement task on the Voice Bank corpus (VCTK) dataset. Future experimentation will focus on increasing effectiveness and efficiency by further adapting the model size and other parameters, e.g. filter sizes, to the task and expanding to multi-channel audio and multi-source-separation.
The architecture is the same as that employed in [2], with the exception of the number of hidden layers and the validation set size. The number of hidden layers was experimented with, and the results suggest the optimum to be 9 layers.
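For intuition, the sketch below shows one downsampling and one upsampling level in the style of the Wave-U-Net, using TensorFlow 1.x layers. It is a simplified illustration (e.g. it uses "same" padding), not the model code in this repo; the filter sizes 15 and 5 follow [2].

```python
import tensorflow as tf  # TensorFlow 1.x, as used by this repo


def downsampling_block(x, num_filters, filter_size=15):
    # 1D convolution + LeakyReLU, then decimate the time axis by 2.
    # The pre-decimation features are kept as the skip connection.
    features = tf.layers.conv1d(x, num_filters, filter_size,
                                padding="same", activation=tf.nn.leaky_relu)
    return features, features[:, ::2, :]


def upsampling_block(x, skip, num_filters, filter_size=5):
    # Linearly upsample the time axis by 2, concatenate the skip connection,
    # then apply another 1D convolution.
    new_size = tf.stack([1, 2 * tf.shape(x)[1]])
    upsampled = tf.image.resize_bilinear(tf.expand_dims(x, 1), new_size)[:, 0]
    joined = tf.concat([upsampled, skip], axis=-1)
    return tf.layers.conv1d(joined, num_filters, filter_size,
                            padding="same", activation=tf.nn.leaky_relu)
```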
See diagram below for a summary visual representation of the architecture:
- ffmpeg
- libsndfile
- Python 2.7
- Python 2.7 packages can be installed using pip install -r requirements.txt
Under our implementation, training took c. 36 hours on a GeForce GTX 1080 Ti GPU (11178 MiB), on Linux Ubuntu 16.04, with Python 2.7. In a new virtual environment, the required Python 2.7 packages can be installed using pip install -r requirements.txt. N.B. this presumes that ffmpeg and libsndfile are installed.
Train and test datasets provided by the 28-speaker Voice Bank Corpus (VCTK) [4] (30 speakers in total - 28 intended for training and 2 reserved for testing). The noisy training data were generated by mixing the clean data with various noise datasets, as per the instructions provided in [4, 5, 6].
Download trained weights here: https://www.dropbox.com/s/2ytshnr5iavax2q/728467-2001.data-00000-of-00001?dl=0
Put the downloaded trained weights into ./checkpoints.
Also, as [3] explains, the V2 checkpoint format saves each checkpoint as 3 files (*.data-00000-of-00001, *.index, *.meta). So when restoring the checkpoint, remove the *.data-00000-of-00001 extension and use just the common filename prefix (*.ckpt in the example from [3]; model-10000 for the checkpoints in this repo).
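For example, a minimal TensorFlow 1.x restore of such a checkpoint, assuming the model-10000 files shown in the project tree further below, looks like this (a sketch only, independent of the repo's own scripts):

```python
import tensorflow as tf

# Prefix only: do not append .data-00000-of-00001, .index or .meta.
checkpoint_prefix = "checkpoints/trained_model_from_github/model-10000"

with tf.Session() as sess:
    # Rebuild the graph from the .meta file, then load the variable values.
    saver = tf.train.import_meta_graph(checkpoint_prefix + ".meta")
    saver.restore(sess, checkpoint_prefix)
```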
In order to build the training set we can use build_train_set.py. Given the source directories containing the clean signals and the corresponding noisy mixtures, it builds a separate dataset for each speaker for ease of access.
Arguments of build_train_set.py:
- --clean_source: Required. Source directory containing the clean audio files from the Voice Bank Corpus (VCTK) dataset.
- --noisy_source: Required. Source directory containing the contaminated audio files from the Voice Bank Corpus (VCTK) dataset.
- --out_directory: Not required, defaults to train_set_built. Destination directory; if it does not exist, it will be created by this script.
- --sampling_rate: Not required, defaults to 16000 Hz. Sampling rate for the audio files (Hz).
Example of usage:
# Usage: python build_train_set.py --clean_source <dir> --noisy_source <dir> --out_directory <dir> --sampling_rate <Hz>
python build_train_set.py \
--clean_source datasets/VCTK/clean_trainset_28spk_wav \
--noisy_source datasets/VCTK/noisy_trainset_28spk_wav \
--out_directory datasets/VCTK/progress_testing \
--sampling_rate 32000 # I think (betegon) the best value for training is 16000 Hz.
For help explaining what each of these arguments does, run:
python build_train_set.py -h
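As a rough illustration of the kind of per-speaker reorganisation and resampling described above, the sketch below assumes VCTK-style file names (e.g. p226_001.wav, with the speaker ID before the underscore) and the librosa and soundfile packages; it is not the actual implementation of build_train_set.py.

```python
import os

import librosa        # assumed available for loading/resampling
import soundfile as sf


def organise_by_speaker(clean_source, noisy_source,
                        out_directory="train_set_built", sampling_rate=16000):
    """Copy every .wav into out_directory/<speaker>/<clean|noisy>/, resampled."""
    for kind, source in (("clean", clean_source), ("noisy", noisy_source)):
        for name in sorted(os.listdir(source)):
            if not name.endswith(".wav"):
                continue
            speaker = name.split("_")[0]            # e.g. "p226"
            dest = os.path.join(out_directory, speaker, kind)
            if not os.path.isdir(dest):
                os.makedirs(dest)
            audio, _ = librosa.load(os.path.join(source, name), sr=sampling_rate)
            sf.write(os.path.join(dest, name), audio, sampling_rate)
```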
The training dataset should then be described in an XML file (not provided), which is parsed using the ElementTree XML API in Datasets.getAudioData.
Training can be executed by running the command python Training.py, modifying the parameters in Config.py as desired.
NOTE: I think you need to specify the .xml file in Training.py, in the line:
dataset_train = Datasets.getAudioData("")
So, if the XML file is called data.xml and is placed in the repo root directory, it could look like this:
dataset_train = Datasets.getAudioData("data.xml")
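For reference, parsing such a description file with ElementTree could look roughly like the snippet below. The track, clean and noisy element names are invented for the illustration; inspect Datasets.getAudioData for the schema this repo actually expects.

```python
import xml.etree.ElementTree as ET


def load_pairs(xml_path):
    # Returns (clean_path, noisy_path) tuples from a hypothetical XML layout:
    # <tracks><track><clean>...</clean><noisy>...</noisy></track>...</tracks>
    pairs = []
    for track in ET.parse(xml_path).getroot().iter("track"):
        pairs.append((track.find("clean").text, track.find("noisy").text))
    return pairs
```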
Given the following project tree:
Wavenet-U-Net-For-Speech-Enhancement-1 # root directory
├── checkpoints
│ └── trained_model_from_github
│ ├── model-10000.data-00000-of-00001
│ ├── model-10000.index
│ └── model-10000.meta
.
.
.
You should specify the model path as follows in inference.py:
produce_source_estimates(model_config, 'checkpoints/trained_model_from_github/model-10000', '512.wav', output_path='.')
Then run inference on a noisy file with:
python inference.py --noisy_file file_to_denoise.wav
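A minimal argparse wiring for this command (also the first item in the TODO list further below) might look as follows. It assumes the snippet lives inside inference.py, where produce_source_estimates and model_config are already defined; the --checkpoint and --output_path flags are suggestions, not existing options.

```python
import argparse

# Sketch only: produce_source_estimates and model_config are assumed to be
# defined earlier in inference.py, as in the call shown above.
parser = argparse.ArgumentParser(description="Denoise a single audio file")
parser.add_argument("--noisy_file", required=True,
                    help="Path to the noisy .wav file to denoise")
parser.add_argument("--checkpoint",
                    default="checkpoints/trained_model_from_github/model-10000",
                    help="Checkpoint prefix (no .data/.index/.meta extension)")
parser.add_argument("--output_path", default=".",
                    help="Directory where the speech estimate is written")
args = parser.parse_args()

produce_source_estimates(model_config, args.checkpoint, args.noisy_file,
                         output_path=args.output_path)
```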
Testing experiments can be performed by running the command python Test_Predictions_VCTK.py.
Speech source estimates should then be evaluated against the clean speech files they are estimating. This can be done using Evaluate.m, which selects multiple files and runs the composite.m script [7] on each one, calculating PESQ, SSNR, CSIG, CBAK and COVL.
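For intuition, a simplified NumPy version of one of these measures, segmental SNR, is sketched below, assuming the common definition from [7]: per-frame SNR clipped to [-10, 35] dB and averaged (here with non-overlapping frames). Use Evaluate.m and composite.m for the reported numbers.

```python
import numpy as np


def segmental_snr(clean, estimate, sample_rate=16000, frame_ms=30,
                  snr_min=-10.0, snr_max=35.0, eps=1e-10):
    """Simplified SSNR over non-overlapping frames of clean vs. estimate."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = min(len(clean), len(estimate)) // frame_len
    snrs = []
    for i in range(n_frames):
        c = clean[i * frame_len:(i + 1) * frame_len]
        e = estimate[i * frame_len:(i + 1) * frame_len]
        snr = 10.0 * np.log10(np.sum(c ** 2) / (np.sum((c - e) ** 2) + eps) + eps)
        snrs.append(np.clip(snr, snr_min, snr_max))
    return float(np.mean(snrs))
```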
Audio examples of both speech and background noise estimates of the VCTK test set, alongside the noisy test files and clean speech for reference, are available for download in the audio_examples directory.
- Argparse for inference.py.
- After using inference.py once or a couple of times, it stops working, throwing the error:
2020-03-01 22:09:42.646762: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 2097479680
2020-03-01 22:09:42.646886: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
Aborted (core dumped)
I think this is resolved by upgrading TensorFlow to 1.8. Check this before marking it as done (just run inference a few times).
- Argparse for Training.py to specify the dataset path (by default it should be in the directory train_data/).
- Upgrade to TensorFlow 2.x (a possible first step is sketched after this list).
- Train on 28 speakers. https://datashare.is.ed.ac.uk/handle/10283/2791
- Train on 56 speakers. https://datashare.is.ed.ac.uk/handle/10283/2791
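For the TensorFlow 2.x item above, a commonly used stop-gap (not a full migration) is to run the existing graph-mode code through the compat.v1 shim:

```python
# Keeps 1.x-style code (tf.Session, tf.placeholder, tf.layers, ...) working
# under TensorFlow 2.x; a proper port would still be needed eventually.
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()
```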
[1] https://github.com/MathewSam/Wave-U-Net-For-Speech-Enhancement-1
[2] https://github.com/jonasyang/Wave-U-Net-For-Speech-Enhancement
[3] https://orvillemcdonald.com/2018/02/13/restoring-tensorflow-models/
[4] https://github.com/haoxiangsnr/A-Convolutional-Recurrent-Neural-Network-for-Real-Time-Speech-Enhancement/issues/10 (comment)
[1] Craig Macartney and Tillman Weyde. Improved Speech Enhancement with the Wave-U-Net. 2018. URL http://arxiv.org/abs/1811.11307
[2] Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. June 2018. URL https://arxiv.org/abs/1806.03185.
[3] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep u-net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pages 745–751, 2017. URL https://ismir2017.smcnus.org/wp-content/uploads/2017/10/171_Paper.pdf.
[4] Cassia Valentini-Botinhao. Noisy speech database for training speech enhancement algorithms and TTS models, 2016 [sound]. University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR), 2017. URL http://dx.doi.org/10.7488/ds/2117.
[5] Santiago Pascual, Antonio Bonafonte, and Joan Serrà. SEGAN: Speech Enhancement Generative Adversarial Network. doi: 10.7488/ds/1356. URL http://dx.doi.org/10.7488/ds/1356.
[6] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. Technical report. URL https://www.research.ed.ac.uk/portal/files/26581510/SSW9_Cassia_1.pdf.
[7] Philipos C Loizou. Speech Enhancement: Theory and Practice. CRC Press, Inc., Boca Raton, FL, USA, 2nd edition, 2013. ISBN 1466504218, 9781466504219.