SARVAM_ASR_DATASETS
10 LANGUAGES: Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Tamil, Telugu, Punjabi
UNUSED:
- imasc_malayalam https://huggingface.co/datasets/thennal/IMaSC
- openslr_bengali https://www.openslr.org/53/
- iisc_vaani https://vaani.iisc.ac.in/ The dataset is very large, but only 1% of it is transcribed. It is also not split by language
- mucs_subtask_2 https://www.openslr.org/104/ Code-switched with English
- nptel https://asr.iitm.ac.in/dataset (scroll down) Except for English, the dataset is not open-source
PARTIALLY USED / NEWER VERSION AVAILABLE:
- commonvoice Version 17.0 has more languages and more data overall. Only train.tsv is currently used, although validated.tsv contains more data
- spring_inx The manifests from kaushal have been used, and they contain less data
- indictts A newer version is available
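For the commonvoice item above, the gap between train.tsv and validated.tsv can be measured before deciding whether to use the larger file. The following is a minimal sketch, not the project's actual tooling. It assumes the usual Common Voice layout, in which one locale folder holds train.tsv, dev.tsv, test.tsv and validated.tsv, each with a `path` column naming the clip. The function names `read_paths`, `load_paths` and `extra_training_clips` are hypothetical and chosen only for illustration.

```python
import csv
from pathlib import Path


def read_paths(lines):
    """Return the set of clip file names found in the `path` column of a TSV."""
    # QUOTE_NONE: treat quote characters inside sentences as ordinary text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["path"] for row in reader}


def load_paths(tsv_file):
    """Read the `path` column from a TSV file on disk."""
    with open(tsv_file, newline="", encoding="utf-8") as f:
        return read_paths(f)


def extra_training_clips(locale_dir):
    """Clips listed in validated.tsv but absent from train, dev and test."""
    locale_dir = Path(locale_dir)
    validated = load_paths(locale_dir / "validated.tsv")
    already_split = set()
    for name in ("train.tsv", "dev.tsv", "test.tsv"):
        already_split |= load_paths(locale_dir / name)
    # Excluding dev and test keeps evaluation clips out of the training pool.
    return validated - already_split
```

Calling `len(extra_training_clips("path/to/locale"))` (a placeholder path) would give the number of additional validated clips available for training. This sketch compares clip names only; it does not check whether a speaker in the extra clips also appears in dev.tsv or test.tsv, which may matter for evaluation.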
PREPROCESSING:
- Currently, the script must be placed in the folder where the dataset is downloaded
- There are no scripts for indictts or shrutilipi (they may not be needed)
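The requirement that the script sit inside the dataset folder could be relaxed with an optional override. The sketch below is one possible approach under stated assumptions, not a description of the existing scripts: the helper `resolve_dataset_dir` and the `--data-dir` flag are hypothetical. By default it keeps the current behaviour (the dataset folder is the folder containing the script) and only looks elsewhere when a folder is passed explicitly.

```python
import argparse
from pathlib import Path


def resolve_dataset_dir(script_path, argv=None):
    """Return the dataset folder: --data-dir if given, else the script's folder."""
    parser = argparse.ArgumentParser(description="Locate the dataset folder.")
    parser.add_argument(
        "--data-dir",
        type=Path,
        default=None,
        help="Dataset folder (defaults to the folder containing the script).",
    )
    args = parser.parse_args(argv)

    if args.data_dir is not None:
        data_dir = args.data_dir
    else:
        # Current convention: the script lives next to the downloaded data.
        data_dir = Path(script_path).resolve().parent

    if not data_dir.is_dir():
        raise FileNotFoundError(f"Dataset folder not found: {data_dir}")
    return data_dir
```

Inside a preprocessing script this would be called as `resolve_dataset_dir(__file__)`, so running the script with no arguments behaves as it does today.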