
Speech Enhancement with the Wave-U-Net

This repo is for personal research into existing Wave-U-Net architectures for audio denoising. The model itself has not been modified; however, provisions were made to make data loading easier (without XML files).

This is the Wave-U-Net applied to speech enhancement [1], an adaptation of the original implementation for music source separation by Stoller et al. [2].

The Wave-U-Net is a convolutional neural network applicable to audio source separation tasks, introduced by Stoller et al. for the separation of music vocals and accompaniment [2]. A 1D, time-domain variant of the 2D convolutions in the U-Net [3], this end-to-end learning method operates directly on the waveform, permitting integrated modelling of phase information and allowing large temporal contexts to be taken into account.

Experiments on audio source separation for speech enhancement in [1] show that the proposed method rivals state-of-the-art architectures, improving upon various metrics, namely PESQ, CSIG, CBAK, COVL and SSNR, with respect to the single-channel speech enhancement task on the Voice Bank corpus (VCTK) dataset. Future experimentation will focus on increasing effectiveness and efficiency by further adapting the model size and other parameters, e.g. filter sizes, to the task and expanding to multi-channel audio and multi-source-separation.

Architecture

The architecture is the same as that employed in [2], with the exception of the number of hidden layers and the validation set size. Experiments with different numbers of hidden layers suggest that 9 layers is the optimum.

See the diagram below for a visual summary of the architecture:

(Wave-U-Net architecture diagram)
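
The repo itself is written for TensorFlow 1.x and Python 2.7. Purely as an illustration of the topology described above (and not the repo's actual code), here is a minimal, simplified sketch in modern Keras: the depth follows the 9-layer result, but the filter counts, kernel sizes, "same" padding, plain decimation/upsampling and tanh output are assumptions.

import tensorflow as tf

def wave_u_net(num_layers=9, base_filters=24, kernel_down=15, kernel_up=5):
    # Mono waveform snippet in, clean-speech estimate out.
    inputs = tf.keras.Input(shape=(16384, 1))
    x, skips = inputs, []
    for i in range(num_layers):                              # downsampling path
        x = tf.keras.layers.Conv1D(base_filters * (i + 1), kernel_down,
                                   padding="same", activation=tf.nn.leaky_relu)(x)
        skips.append(x)                                      # keep for the skip connection
        x = x[:, ::2, :]                                     # decimate by 2
    x = tf.keras.layers.Conv1D(base_filters * (num_layers + 1), kernel_down,
                               padding="same", activation=tf.nn.leaky_relu)(x)
    for i in reversed(range(num_layers)):                    # upsampling path
        x = tf.keras.layers.UpSampling1D(size=2)(x)          # the paper uses learned interpolation
        x = tf.keras.layers.Concatenate()([x, skips[i]])     # skip connection
        x = tf.keras.layers.Conv1D(base_filters * (i + 1), kernel_up,
                                   padding="same", activation=tf.nn.leaky_relu)(x)
    x = tf.keras.layers.Concatenate()([x, inputs])           # reattach the input waveform
    outputs = tf.keras.layers.Conv1D(1, 1, activation="tanh")(x)
    return tf.keras.Model(inputs, outputs)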

Initialisation

Requirements

  • ffmpeg
  • libsndfile
  • Python 2.7
  • Python 2.7 packages can be installed using pip install -r requirements.txt

Under our implementation, training took around 36 hours on a GeForce GTX 1080 Ti GPU (11,178 MiB) running Ubuntu 16.04 with Python 2.7. In a new virtual environment, the required Python 2.7 packages can be installed using pip install -r requirements.txt. N.B. this presumes that ffmpeg and libsndfile are already installed.

Data Preparation

Train and test datasets are provided by the 28-speaker Voice Bank Corpus (VCTK) [4] (30 speakers in total: 28 intended for training and 2 reserved for testing). The noisy training data were generated by mixing the clean data with various noise datasets, as per the instructions provided in [4, 5, 6].
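
Purely for illustration (this is not the official procedure, which follows [4, 5, 6]), mixing a clean utterance with a noise recording at a chosen SNR could look like the sketch below; the file names are hypothetical.

import numpy as np
import soundfile as sf  # uses libsndfile under the hood

def mix_at_snr(clean, noise, snr_db):
    # Loop/trim the noise to the clean signal's length, then scale it so the
    # clean-to-noise power ratio matches the requested SNR in dB.
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

clean, sr = sf.read("p232_001.wav")        # hypothetical clean VCTK file
noise, _ = sf.read("cafe_noise.wav")       # hypothetical noise recording
sf.write("p232_001_noisy.wav", mix_at_snr(clean, noise, snr_db=5.0), sr)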

TRAINED WEIGHTS

Download trained weights here: https://www.dropbox.com/s/2ytshnr5iavax2q/728467-2001.data-00000-of-00001?dl=0

Put the downloaded trained weights into ./checkpoints.

Also, as [3] in the REPOS list below explains, Checkpoint V2 saves the checkpoint as three files (*.data-00000-of-00001, *.index, *.meta). So when restoring the checkpoint, drop the *.data-00000-of-00001 extension and pass just the common filename prefix (e.g. *.ckpt).
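
As a minimal sketch (assuming the repo's TensorFlow 1.x graph has already been built, and that the matching *.index and *.meta files sit next to the downloaded *.data file), the restore call looks like:

import tensorflow as tf

# ... build the model graph first, as Training.py / inference.py do ...
saver = tf.train.Saver()
with tf.Session() as sess:
    # Pass the checkpoint *prefix*, not the .data-00000-of-00001 file itself.
    saver.restore(sess, "checkpoints/728467-2001")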

build_train_set.py

To build the training set, use build_train_set.py. Given the source directories containing the clean signals and the corresponding noisy mixtures, it builds a cleaner, per-speaker dataset layout for ease of access.

Arguments of build_train_set.py:

  • --clean_source: Required. Source directory containing the clean audio files from the Voice Bank Corpus (VCTK) dataset.
  • --noisy_source: Required. Source directory containing the noisy (contaminated) audio files from the Voice Bank Corpus (VCTK) dataset.
  • --out_directory: Optional, defaults to train_set_built. Destination directory; if it does not exist, it will be created by this script.
  • --sampling_rate: Optional, defaults to 16000 Hz. Sampling rate for the audio files (Hz).

Example of usage:

# Usage: python build_train_set.py --clean_source <dir> --noisy_source <dir> [--out_directory <dir>] [--sampling_rate <Hz>]
python build_train_set.py \
--clean_source datasets/VCTK/clean_trainset_28spk_wav \
--noisy_source datasets/VCTK/noisy_trainset_28spk_wav \
--out_directory datasets/VCTK/progress_testing \
--sampling_rate 32000 # 16000 Hz is probably the better choice for training (betegon)

For help on what each of these arguments does, run

python build_train_set.py -h

The training dataset should then be prepared so that it can be parsed as an XML file (not provided) by the ElementTree XML API in Datasets.getAudioData.
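
If you do need to produce such an XML file, a hypothetical sketch using xml.etree.ElementTree is shown below; the element names (dataset/track/clean/noisy) are assumptions, so check Datasets.getAudioData in Datasets.py for the schema it actually expects and adapt accordingly.

import os
import xml.etree.ElementTree as ET

def build_xml_index(clean_dir, noisy_dir, out_path="data.xml"):
    # Pair each clean WAV with the noisy WAV of the same name.
    root = ET.Element("dataset")
    for name in sorted(os.listdir(clean_dir)):
        if not name.endswith(".wav"):
            continue
        track = ET.SubElement(root, "track")
        ET.SubElement(track, "clean").text = os.path.join(clean_dir, name)
        ET.SubElement(track, "noisy").text = os.path.join(noisy_dir, name)
    ET.ElementTree(root).write(out_path)

build_xml_index("datasets/VCTK/clean_trainset_28spk_wav",
                "datasets/VCTK/noisy_trainset_28spk_wav")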


TRAINING

Training can be executed by running the command python Training.py, modifying the parameters in Config.py as desired.

NOTE: I think you need to specify the .xml file in Training.py, on the line:

dataset_train = Datasets.getAudioData("")

So it could look like this, if the XML file is called data.xml and is placed in the repo root directory:

dataset_train = Datasets.getAudioData("data.xml")

Training and validation loss can then be visualised from the logs produced during training (e.g. with TensorBoard).

INFERENCE

Given the following project tree:

Wave-U-Net-For-Speech-Enhancement-1 # root directory
├── checkpoints
│   └── trained_model_from_github
│       ├── model-10000.data-00000-of-00001
│       ├── model-10000.index
│       └── model-10000.meta
.
.
.

You should specify the model path as follows in inference.py:

produce_source_estimates(model_config, 'checkpoints/trained_model_from_github/model-10000', '512.wav', output_path='.')

Inference example

python inference.py --noisy_file file_to_denoise.wav

Testing

Testing experiments can be performed by running the command python Test_Predictions_VCTK.py.

Speech source estimates should then be evaluated against the clean speech files they are estimating. This can be done using Evaluate.m, which selects multiple files and runs the composite.m script [7] on each one, calculating PESQ, SSNR, CSIG, CBAK and COVL.
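
As a rough Python alternative (not the repo's evaluation path, which is Evaluate.m plus composite.m [7]), the sketch below computes PESQ with the third-party pesq package and a simple segmental SNR; CSIG, CBAK and COVL would still require composite.m. The file paths are hypothetical and 16 kHz audio is assumed.

import numpy as np
import soundfile as sf
from pesq import pesq  # pip install pesq

def segmental_snr(clean, enhanced, frame=512, eps=1e-10):
    # Frame-wise SNR, clipped to [-10, 35] dB as is conventional for SSNR.
    snrs = []
    for i in range(0, len(clean) - frame, frame):
        c, e = clean[i:i + frame], enhanced[i:i + frame]
        snr = 10.0 * np.log10(np.sum(c ** 2) / (np.sum((c - e) ** 2) + eps) + eps)
        snrs.append(np.clip(snr, -10.0, 35.0))
    return float(np.mean(snrs))

clean, sr = sf.read("clean_testset_wav/p232_001.wav")        # hypothetical paths
enhanced, _ = sf.read("source_estimates/p232_001.wav")
enhanced = enhanced[:len(clean)]
print("PESQ:", pesq(sr, clean, enhanced, "wb"))              # wideband PESQ at 16 kHz
print("SSNR:", segmental_snr(clean, enhanced))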

Audio examples of both speech and background noise estimates of the VCTK test set, alongside the noisy test files and clean speech for reference, are available for download in the audio_examples directory.

TO-DO

  • Argparse for inference.py (see the sketch after this list).

  • After using inference.py once or a few times, it stops working, throwing the error:

    2020-03-01 22:09:42.646762: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 2097479680

    2020-03-01 22:09:42.646886: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA

    Aborted (core dumped)

I think the TO-DO above is resolved by upgrading TensorFlow to 1.8. Check this before marking it as done (just run inference a few times).
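
A possible shape for the first TO-DO item is sketched below; the flag names and the default checkpoint path are assumptions, and produce_source_estimates and model_config would stay as they already are inside inference.py.

import argparse

def parse_args():
    p = argparse.ArgumentParser(description="Denoise a WAV file with the Wave-U-Net")
    p.add_argument("--noisy_file", required=True,
                   help="Path to the noisy WAV file to enhance")
    p.add_argument("--model_path", default="checkpoints/trained_model_from_github/model-10000",
                   help="Checkpoint prefix (without the .data-00000-of-00001 extension)")
    p.add_argument("--output_path", default=".",
                   help="Directory in which to write the enhanced audio")
    return p.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # Then call the existing routine, e.g.:
    # produce_source_estimates(model_config, args.model_path, args.noisy_file,
    #                          output_path=args.output_path)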

REPOS used / issues and commits referenced

[0] https://github.com/jeewenjie/Wave-U-Net-For-Speech-Enhancement/commit/e7e34f6782294e970f375159ded3487c5763e17f#commitcomment-37570464

[1] https://github.com/MathewSam/Wave-U-Net-For-Speech-Enhancement-1

[2] https://github.com/jonasyang/Wave-U-Net-For-Speech-Enhancement

[3] https://orvillemcdonald.com/2018/02/13/restoring-tensorflow-models/

[4] haoxiangsnr/A-Convolutional-Recurrent-Neural-Network-for-Real-Time-Speech-Enhancement#10 (comment)

References

[1] Craig Macartney and Tillman Weyde. Improved Speech Enhancement with the Wave-U-Net. 2018. URL http://arxiv.org/abs/1811.11307

[2] Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. June 2018. URL https://arxiv.org/abs/1806.03185.

[3] Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep u-net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pages 745–751, 2017. URL https://ismir2017.smcnus.org/wp-content/uploads/2017/10/171_Paper.pdf.

[4] Cassia Valentini-Botinhao. Noisy speech database for training speech enhancement algorithms and TTS models, 2016 [sound]. University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR), 2017. URL http://dx.doi.org/10.7488/ds/2117.

[5] Santiago Pascual, Antonio Bonafonte, and Joan Serrà. SEGAN: Speech Enhancement Generative Adversarial Network. doi: 10.7488/ds/1356. URL http://dx.doi.org/10.7488/ds/1356.

[6] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. Technical report. URL https://www.research.ed.ac.uk/portal/files/26581510/SSW9_Cassia_1.pdf.

[7] Philipos C Loizou. Speech Enhancement: Theory and Practice. CRC Press, Inc., Boca Raton, FL, USA, 2nd edition, 2013. ISBN 1466504218, 9781466504219.
