
SARVAM_ASR_DATASETS

10 LANGUAGES: Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Tamil, Telugu, Punjabi

UNUSED:

imasc_malayalam https://huggingface.co/datasets/thennal/IMaSC
openslr_bengali https://www.openslr.org/53/
iisc_vaani https://vaani.iisc.ac.in/ The dataset is huge, but only 1% of it has transcriptions. It is also not split by language
mucs_subtask_2 https://www.openslr.org/104/ Code-switched with English
nptel https://asr.iitm.ac.in/dataset (scroll down) Only the English data is open-source

PARTIALLY USED / NEWER VERSION AVAILABLE:

commonvoice: version 17.0 has more languages and more data overall. Only train.tsv is currently used, although validated.tsv contains more data
spring_inx: the manifests from kaushal were used, and they contain less data
indictts: a newer version is available
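
To gauge how much commonvoice data is left unused, one option is to list the clips that appear in validated.tsv but not in train.tsv. The sketch below is not part of this repository. It assumes the usual Common Voice layout, where each language folder holds tab-separated files with a `path` column, and it assumes that dev.tsv and test.tsv (when present) should stay held out. The function names are hypothetical.

```python
import csv
from pathlib import Path


def read_clip_paths(tsv_file):
    """Return the set of values in the 'path' column of a Common Voice TSV."""
    with open(tsv_file, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: sentences may contain stray quote characters.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["path"] for row in reader}


def extra_validated_clips(lang_dir):
    """Clips listed in validated.tsv that are in neither train, dev nor test."""
    lang_dir = Path(lang_dir)
    validated = read_clip_paths(lang_dir / "validated.tsv")
    train = read_clip_paths(lang_dir / "train.tsv")

    # Assumption: dev/test clips should not leak into training data.
    held_out = set()
    for name in ("dev.tsv", "test.tsv"):
        split_file = lang_dir / name
        if split_file.exists():
            held_out |= read_clip_paths(split_file)

    return validated - train - held_out
```

Calling `len(extra_validated_clips("path/to/language/folder"))` gives the number of additional clips that could be added to training for that language.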

PREPROCESSING:

Currently, each preprocessing script must be placed in the folder where its dataset is downloaded
There are no scripts for indictts and shrutilipi (they may not be needed)
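
One way to lift the requirement that a script sit inside the dataset folder would be to accept the folder as an argument and fall back to the script's own location otherwise. This is only a sketch of that idea, not how the existing scripts work. The `--dataset-dir` flag and both function names are hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse a hypothetical --dataset-dir option."""
    parser = argparse.ArgumentParser(description="Preprocess one dataset.")
    parser.add_argument(
        "--dataset-dir",
        default=None,
        help="Folder where the dataset was downloaded "
             "(default: the folder containing this script).",
    )
    return parser.parse_args(argv)


def resolve_dataset_dir(cli_value, script_file):
    """Use the explicit folder if given, else the script's own folder."""
    if cli_value:
        return Path(cli_value).expanduser().resolve()
    # Current behaviour: the script lives next to the data.
    return Path(script_file).resolve().parent
```

Inside a script this would be used as `resolve_dataset_dir(parse_args().dataset_dir, __file__)`, so running the script from the dataset folder with no arguments keeps working as it does now.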

REPOSITORY CONTENTS:

preprocessing
README.md
all_links.csv
download_data.sh
