This project is a simplified version of SOFA (Singing-Oriented Forced Aligner) and only provides phoneme boundary segmentation. It can be used when SOFA's results are not accurate enough. After segmenting the boundaries, you can manually input the phonemes in vlabeler.
- Use `git clone` to download the repository code.
- Install conda or use venv.
- Go to the PyTorch website to install torch.
- (Optional, for faster wav file reading) Install torchaudio from the PyTorch website.
- Install the other Python libraries: `pip install -r requirements.txt`
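Taken together, a typical setup might look like the following sketch. The repository URL, the environment name `sofa-seg`, and the Python version are placeholders, and the exact torch/torchaudio install command should be copied from the PyTorch website:

```
git clone <repository_url>
cd <repository_directory>
# Create and activate an isolated environment (conda shown; venv works too)
conda create -n sofa-seg python=3.10
conda activate sofa-seg
# Install torch (and optionally torchaudio) with the command from pytorch.org
pip install -r requirements.txt
```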
- Follow the steps above to set up the environment. It is recommended to install torchaudio for faster binarization.
- Run `python convert_ds.py --data_zip_path xxx.zip --lang xx` to convert an nnsvs dataset into a diffsinger dataset. Conversion must be done separately for each language; the supported languages are listed in `convert_ds.py`. An example invocation is shown below.
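  For example, to convert a hypothetical Japanese nnsvs package (the zip name is made up here, and `jp` is only assumed to be one of the language codes listed in `convert_ds.py` — check that file for the actual codes):

  ```
  python convert_ds.py --data_zip_path nnsvs_singer_jp.zip --lang jp
  ```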
- Place the training data in the `data` folder in the following format:

  ```
  - data
    - full_label
      - singer1
        - wavs
          - audio1.wav
          - audio2.wav
          - ...
        - transcriptions.csv
      - singer2
        - wavs
          - ...
        - transcriptions.csv
  ```
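  The exact column layout of `transcriptions.csv` is whatever `convert_ds.py` emits. As a rough sketch only, a DiffSinger-style file (assumed here, not confirmed by this repository) might look like:

  ```
  name,ph_seq,ph_dur
  audio1,SP a i SP,0.10 0.32 0.28 0.12
  audio2,SP k a SP,0.08 0.22 0.41 0.15
  ```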
- Modify `binarize_config.yaml` as needed, then run `python binarize.py`.
- Modify `train_config.yaml` as needed, then run `python train.py`. If you want to resume training, use `python train.py -r`.
- For training visualization, run `tensorboard --logdir=ckpt/`.
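  By default, TensorBoard then serves the dashboard at `http://localhost:6006`.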
- Prepare the audio files to be segmented and place them in a folder (default is the `segments` folder) in the following format:

  ```
  - segments
    - singer1
      - segment1.wav
      - segment2.wav
      - ...
    - singer2
      - segment1.wav
      - ...
  ```
- Inference via command line: run `python infer.py` for inference. Parameters:
  - `--ckpt` (required): path to the model weights.
  - `--folder`: folder containing the data to be aligned (default: `segments`).

  ```
  python infer.py -c checkpoint_path -f segments_path
  ```
- Obtain the final annotations: a `.lab` file with the same name as each audio file will be generated in the folder containing the audio files.
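  For example, with the folder layout above, `segments/singer1` would look something like this after inference (a sketch based on the example file names, not actual program output):

  ```
  segments/singer1/
  ├── segment1.wav
  ├── segment1.lab
  ├── segment2.wav
  └── segment2.lab
  ```

  Each `.lab` file can then be opened alongside its wav in vlabeler so you can type in the phonemes for the segmented intervals.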