
# Dataset Collection (Dataset List)

This dataset list includes all the raw datasets we have found so far. You will also find each dataset's Data Type* and Status*.

Most datasets are public and therefore downloadable through the URL in the list. Download scripts for some of them are available in `audio-dataset/utils/`. The datasets without a link were purchased by LAION, so we cannot make them public due to license issues; please do contact us if you want to process them.

To use the exact processed datasets for training your models, please contact LAION.
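For the public datasets, downloading usually amounts to fetching an archive from the listed URL. A minimal sketch of such a helper is below; the URL-to-filename logic and checksum printing are illustrative assumptions, not the actual scripts from `audio-dataset/utils/`:

```python
import hashlib
import os
import urllib.request

def target_name(url: str) -> str:
    """Derive a local filename from a download URL (illustrative heuristic)."""
    return url.rstrip("/").rsplit("/", 1)[-1] or "download.bin"

def download(url: str, out_dir: str = ".") -> str:
    """Download `url` into `out_dir` and print a SHA-256 checksum
    so the archive can be verified before processing."""
    path = os.path.join(out_dir, target_name(url))
    urllib.request.urlretrieve(url, path)
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    print(f"{path}  sha256={digest}")
    return path
```

The real per-dataset scripts differ (some scrape metadata first, some need authentication); this only shows the common shape.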

*Data Type Terminology Explanation

- **Caption**: A natural-language sentence describing the content of the audio.

  Example: *A wooden door creaks open and closed multiple times*

- **Class label**: Labels that are often manually annotated for classification in curated datasets. Each audio clip can be assigned one or several class labels.

  Example: *Cat, Dog, Water*

- **Tag**: Tags commonly associated with the audio on its source website. An audio clip may be associated with several tags.

  Example: *phone recording, city, sound effect*

- **Relative text**: Any text about the audio, such as comments or other metadata. Can be very long.

  Example: *An impact sound that I would hear over an action scene, with some cinematic drums for more tension and a high pitched preexplosion sound followed by the impact of the explosion. Please rate only if you like it, haha. Thanks!*

- **Transcription**: Transcription of human speech. Only used for speech datasets.

- **Translation**: Transcription of the speech translated into another language.
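To make the distinction between these data types concrete, a single clip's metadata might combine several of them, as in this illustrative example (the field names are hypothetical, not the actual schema of any dataset in the list):

```python
import json

# Hypothetical metadata for one audio clip, combining the data types
# described above; field names and values are illustrative only.
sample = {
    "caption": "A wooden door creaks open and closed multiple times",
    "class_labels": ["Door", "Creak"],
    "tags": ["sound effect", "household"],
    "relative_text": "Recorded in an old farmhouse; comments welcome.",
}

print(json.dumps(sample, indent=2))
```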

*Status Terminology Explanation

- **processed**: Dataset already converted to webdataset format.
- **processing**: Dataset already downloaded; processing is ongoing.
- **metadata downloaded**: We have already scraped the dataset website, whereas the dataset itself is not yet downloaded.
- **assigned**: Someone has begun work on the dataset.
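For reference, the webdataset format that "processed" refers to stores samples in tar shards, where files sharing a basename (the sample key) and differing by extension form one sample. The sketch below illustrates just that layout with the standard library; the filenames and metadata fields are made up, and real pipelines would use the `webdataset` package:

```python
import io
import json
import tarfile

def add_member(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Add one file to the tar from an in-memory payload."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Write a tiny webdataset-style shard: "0000.flac" and "0000.json"
# share the key "0000" and therefore form a single sample.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_member(tar, "0000.flac", b"\x00fake-audio-bytes")
    add_member(tar, "0000.json", json.dumps({"caption": "a door creaks"}).encode())

# Read the shard back, grouping members by key to recover samples.
buf.seek(0)
samples: dict[str, dict[str, bytes]] = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar.getmembers():
        key, ext = member.name.rsplit(".", 1)
        samples.setdefault(key, {})[ext] = tar.extractfile(member).read()

print(sorted(samples["0000"]))  # → ['flac', 'json']
```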

## General Sound Dataset

| Name | Description | URL | Data Type | Total Duration | Total Audio Number | Status |
|---|---|---|---|---|---|---|
| AudioSet | The AudioSet dataset is a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. To collect all our data we worked with human annotators who verified the presence of sounds they heard within YouTube segments. To nominate segments for annotation, we relied on YouTube metadata and content-based search. The sound events in the dataset consist of a subset of the AudioSet ontology; see the ICASSP 2017 paper for details on the dataset construction. There are 2,084,320 YouTube videos containing 527 labels. | Click here | class labels, video, audio | 5420hrs | 1951460 | processed |
| AudioSet Strong | Audio events from AudioSet clips with single-class-label annotation. | Click here | 1 class label, video, audio | 625.93hrs | 1074359 | processed(@marianna13#7139) |
| BBC Sound Effects | 33066 sound effects, each with a natural text description. Type: mostly environmental sound. (Need to check the license.) | Click here | 1 caption, audio | 463.48hrs | 15973 | processed |
| AudioCaps | 40,000 audio clips of 10 seconds, organized in three splits: a training split, a validation split, and a testing split. Type: environmental sound. | Click here | 1 caption, audio | 144.94hrs | 52904 | processed |
| Audio Caption Hospital & Car Dataset | 3700 audio clips from the "Hospital" scene and around 3600 audio clips from the "Car" scene. Every audio clip is 10 seconds long and is annotated with five captions. Type: environmental sound. | Click here | 5 captions, audio | 10.64 + 20.91hrs | 3709 + 7336 | we don't need that |
| Clotho dataset | Clotho consists of 6974 audio samples, and each audio sample has five captions (a total of 34,870 captions). Audio samples are 15 to 30 s long and captions are eight to 20 words long. Type: environmental sound. | Click here | 5 captions, audio | 37.0hrs | 5929 | processed |
| Audiostock | Royalty-free music library. 436864 audio effects (of which 10k available), each with a text description. | Click here | 1 caption & tags, audio | 46.30hrs | 10000 | 10k sound effects processed(@marianna13#7139) |
| ESC-50 | 2000 environmental audio recordings with 50 classes. | Click here | 1 class label, audio | 2.78hrs | 2000 | processed(@marianna13#7139) |
| VGG-Sound | VGG-Sound is an audio-visual correspondence dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube. | Click here | 1 class label, video, audio | 560hrs | 200,000+ | processed(@marianna13#7139) |
| FUSS | The Free Universal Sound Separation (FUSS) dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation. FUSS is based on the FSD50K corpus. | Click here | no class label, audio | 61.11hrs | 22000 | |
| UrbanSound8K | 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes. | Click here | 1 class label, audio | 8.75hrs | 8732 | processed(@Yuchen Hui#8574) |
| FSD50K | 51,197 audio clips of 200 classes. | Click here | class labels, audio | 108.3hrs | 51197 | processed(@Yuchen Hui#8574) |
| YFCC100M | YFCC100M is a dataset that contains a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all carrying a Creative Commons license, including 8081 hours of audio. | Click here | title, tags, audio, video, Flickr identifier, owner name, camera, geo, media source | 8081hrs | | requested access(@marianna13#7139) |
| ACAV100M | 100M video clips with audio, each 10 sec, with automatic AudioSet, Kinetics400 and ImageNet labels. Noisy, but LARGE. | Click here | class labels/tags, audio | 31 years | 100 million | |
| Free To Use Sounds | 10000+ for 23$ :) | Click here | 1 caption & tags, audio | 175.73hrs | 6370 | |
| MACS - Multi-Annotator Captioned Soundscapes | A dataset containing audio captions and corresponding audio tags for 3930 audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park). The files were annotated using a web-based tool. Each file is annotated by multiple annotators who provided tags and a one-sentence description of the audio content. The data also includes annotator competence estimated using MACE (Multi-Annotator Competence Estimation). | Click here | multiple captions & tags, audio | 10.92hrs | 3930 | processed(@marianna13#7139 & @krishna#1648 & @Yuchen Hui#8574) |
| Sonniss Game Effects | Sound effects. | no link | tags & filenames, audio | 84.6hrs | 5049 | processed |
| WeSoundEffects | Sound effects. | no link | tags & filenames, audio | 12.00hrs | 488 | processed |
| Paramount Motion - Odeon Cinematic Sound Effects | Sound effects. | no link | 1 tag, audio | 19.49hrs | 4420 | processed |
| Free Sound | Audio with text descriptions (noisy). | Click here | pertinent text, audio | 3003.38hrs | 515581 | processed(@Chr0my#0173 & @Yuchen Hui#8574) |
| Sound Ideas | Sound effects library. | Click here | 1 caption, audio | | | |
| Boom Library | Sound effects library. | Click here | 1 caption, audio | | | assigned(@marianna13#7139) |
| Epidemic Sound (sound effect part) | Royalty-free music and sound effects. | Click here | class labels, audio | 220.41hrs | 75645 | metadata downloaded(@Chr0my#0173), processed(@Yuchen Hui#8547) |
| Audio Grounding dataset | An augmented audio captioning dataset; hard to describe briefly, please refer to the URL for details. | Click here | 1 caption, many tags, audio | 12.57hrs | 4590 | |
| Fine-grained Vocal Imitation Set | This dataset includes 763 crowd-sourced vocal imitations of 108 sound events. | Click here | 1 class label, audio | 1.55hrs | 1468 | processed(@marianna13#7139) |
| Vocal Imitation | The VocalImitationSet is a collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound (https://freesound.org/), curated based on Google's AudioSet ontology (https://research.google.com/audioset/). | Click here | 1 class label, audio | 24.06hrs | 9100 files | processed(@marianna13#7139) |
| VocalSketch | Contains thousands of vocal imitations of a large set of diverse sounds. The dataset also contains data on hundreds of people's ability to correctly label these vocal imitations, collected via Amazon's Mechanical Turk. | Click here | 1 class label, audio | 18.86hrs | 16645 | processed(@marianna13#7139) |
| VimSketch Dataset | VimSketch combines two publicly available datasets (VocalSketch + Vocal Imitation), deleting some parts of each. | Click here | class labels, audio | Not important | Not important | |
| OtoMobile Dataset | A collection of recordings of failing car components, created by the Interactive Audio Lab at Northwestern University. OtoMobile consists of 65 recordings of vehicles with failing components, along with annotations. | Click here (restricted access) | class labels & tags, audio | Unknown | 59 | |
| DCASE17 Task 4 | DCASE 2017 Task 4: large-scale weakly supervised sound event detection for smart cars. | Click here | | | | |
| Knocking Sound Effects With Emotional Intentions | A dataset of knocking sound effects with emotional intention, recorded at a professional foley studio. Five emotions are portrayed: anger, fear, happiness, neutral and sadness. | Click here | 1 class label, audio | | 500 | processed(@marianna13#7139) |
| WavText5K | WavText5K collection consisting of 4525 audios, 4348 descriptions, 4525 audio titles and 2058 tags. | Click here | 1 label, tags & audio | | 4525 audio files | processed(@marianna13#7139) |

## Speech Dataset

| Name | Description | URL | Data Type | Status |
|---|---|---|---|---|
| People's Speech | 30k+ hours, English text. | Click here | transcription, audio | assigned(@PiEquals4#1909) |
| Multilingual Spoken Words | 6k+ hours of 1-sec audio clips with words in 50+ languages. | Click here | transcription, audio | processing(@PiEquals4#1909) |
| AISHELL-2 | Contains 1000 hours of clean read-speech data recorded on iOS devices; free for academic usage. | Click here | transcription, audio | |
| Surfing AI Speech Dataset | 30k+ hours; proprietary. | Click here | transcription, audio | |
| LibriSpeech | A collection of approximately 1,000 hours of audiobooks that are part of the LibriVox project. | Click here | transcription, audio | processed(@marianna13#7139) |
| Libri-light | 60K hours of unlabelled speech from audiobooks in English and a small labelled dataset (10h, 1h, and 10 min), plus metrics, trainable baseline models, and pretrained models that use these datasets. | Click here | transcription, audio | |
| Europarl-ST | A multilingual speech translation corpus containing paired audio-text samples for speech translation, constructed from the debates carried out in the European Parliament between 2008 and 2012. | Click here | translation, audio | processed(@Antoniooooo#4758) |
| CoVoST | A large-scale multilingual ST corpus based on Common Voice, built to foster ST research with the largest-ever open dataset. Its latest version covers translations from English into 15 languages (Arabic, Catalan, Welsh, German, Estonian, Persian, Indonesian, Japanese, Latvian, Mongolian, Slovenian, Swedish, Tamil, Turkish, Chinese) and from 21 languages into English, including the 15 target languages as well as Spanish, French, Italian, Dutch, Portuguese and Russian. It has a total of 2,880 hours of speech and is diversified with 78K speakers. | Click here | translation & transcription, audio | assigned(@PiEquals4#1909) |
| GigaSpeech | An evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. | Click here | transcription, audio | processing(@PiEquals4#1909) |
| LJSpeech Dataset | A public-domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. | Click here or download | transcription, audio | processed(@PiEquals4#1909) |
| Spotify English-Language Podcast Dataset | This dataset consists of 100,000 episodes from different podcast shows on Spotify, containing about 50,000 hours of audio and over 600 million transcribed words. The episodes span a variety of lengths, topics, styles, and qualities. The dataset is released to facilitate research on podcasts through the lens of speech and audio technology, natural language processing, information retrieval, and linguistics. Only non-commercial research is permitted on this dataset. | Click here | transcription, audio | requested access(@marianna13#7139) |
| RAVDESS | The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. | Click here | transcription, audio | processed(@PiEquals4#1909) |
| CREMA-D | CREMA-D is a data set of 7,442 original clips from 91 actors: 48 male and 43 female, aged 20 to 74, from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences, presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral and Sad) and four different emotion levels (Low, Medium, High and Unspecified). | Click here | transcription, audio | processed(@PiEquals4#1909) |
| EmoV-DB | The Emotional Voices Database, built for the purpose of emotional speech synthesis. It includes recordings for four speakers: two male and two female. The emotional styles are neutral, sleepiness, anger, disgust and amused. | Click here | transcription, class labels, audio | assigned(@PiEquals4#1909) |
| CMU_Arctic | The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers. | Click here | transcription, tags, audio, ...TBD | processed(@marianna13#7139) |
| IEMOCAP database | The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal and multispeaker database. It contains approximately 12 hours of audiovisual data, including video, speech, motion capture of faces, and text transcriptions. | Click here | transcription, video, audio, ...TBD | assigned(@marianna13#7139) |
| YouTube dataset | YouTube video/audio + automatically generated subtitles. For details, please ask @marianna13#7139. | no link (please contact @marianna13#7139) | transcription, audio, video | processed(@marianna13#7139) |
| The Hume Vocal Burst Competition Dataset (H-VB) | | Click here | labels, audio | assigned(@Yuchen Hui#8574) |

## Music Dataset

| Name | Description | URL | Text Type | Status |
|---|---|---|---|---|
| Free Music Archive | The Free Music Archive (FMA) is an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The FMA provides 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks by 16,341 artists across 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length, high-quality audio and pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies, plus a proposed train/validation/test split and three subsets. Code, data, and usage examples are available at https://github.com/mdeff/fma. | Click here | tags/class labels, audio | processed(@marianna13#7139) |
| MusicNet | MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping and verified by trained musicians; the estimated labeling error rate is 4%. URL: https://homes.cs.washington.edu/~thickstn/musicnet.html | Click here | class labels, audio | processed(@IYWO#9072) |
| MetaMIDI Dataset | The MetaMIDI Dataset (MMD) is a large-scale collection of 436,631 MIDI files and metadata (artist, title and genre collected during scraping, when available). MIDIs in MMD were matched against a collection of 32,000,000 30-second audio clips retrieved from Spotify, resulting in over 10,796,557 audio-MIDI matches. In addition, 600,142 Spotify tracks were linked with 1,094,901 MusicBrainz recordings to produce a set of 168,032 MIDI files matched to the MusicBrainz database. These links augment many files in the dataset with the extensive metadata available via the Spotify API and the MusicBrainz database. | Click here | tags, audio | |
| MUSDB18-HQ | MUSDB18 consists of a total of 150 full-track songs of different styles and includes both the stereo mixtures and the original sources, divided between a training subset and a test subset. | Click here | 1 class label, audio | processed(@marianna13#7139) |
| Cambridge-mt Multitrack Dataset | A list of multitrack projects which can be freely downloaded for mixing practice purposes. All projects are presented as ZIP archives containing uncompressed WAV files (24-bit or 16-bit resolution and 44.1kHz sample rate). | Click here | 1 class label, audio | processed(@marianna13#7139) |
| Slakh | The Synthesized Lakh (Slakh) Dataset contains 2100 automatically mixed tracks and accompanying MIDI files synthesized using a professional-grade sampling engine. | Click here | 1 class label, audio | processed(@krishna#1648) |
| Tunebot | The Tunebot project is an online query-by-humming system. Users sing a song to Tunebot and it returns a ranked list of song candidates available on Apple's iTunes website. The database that Tunebot compares sung queries against is crowdsourced from users as well: users contribute new songs by singing them on the Tunebot website. Tunebot is no longer online, but the dataset lives on. | Click here | song name (so transcription), audio | processed(@marianna13#7139) |
| Juno | A music review website. | Click here | pertinent text/class labels, audio | metadata downloaded(@dicknascarsixtynine#3885) & processed(@marianna13#7139) |
| Pitchfork | Music review website. | Click here | pertinent text (long paragraphs), audio | |
| Genius | Music lyrics website. | | pertinent text (long paragraphs), audio | assigned(@marianna13#7139) |
| IDMT-SMT-Audio-Effects | A large database for automatic detection of audio effects in recordings of electric guitar and bass, and related signal processing. | Click here | class label, audio | |
| MIDI50K | Music generated from MIDI files using the synthesizer available at https://pypi.org/project/midi2audio/. | Temporarily not available, will be added soon | MIDI files, audio | processing(@marianna13#7139) |
| MIDI130K | Music generated from MIDI files using the synthesizer available at https://pypi.org/project/midi2audio/. | Temporarily not available, will be added soon | MIDI files, audio | processing(@marianna13#7139) |
| MillionSongDataset | 72222 hours of general music as 30-second clips, one million different songs. | Temporarily not available | tags, artist names, song titles, audio | |
| synth1B1 | One million hours of audio: one billion 4-second synthesized sounds. The corpus is multi-modal: each sound includes its corresponding synthesis parameters. Since it is faster to render synth1B1 in situ than to download it, torchsynth includes a replicable script for generating synth1B1 on the GPU. | Click here | synthesis parameters, audio | |
| Epidemic Sound (music part) | Royalty-free music and sound effects. | Click here | class label, tags, audio | assigned(@chr0my#0173) |