# Curated Datasets for the Slingshot Competition

Slingshot uses curated datasets to ensure that meaningful data is stored on and retrieved from the Filecoin Network. Use cases don't need to be complex, and applications built on them can be proprietary.

A wide variety of public datasets can be leveraged for this challenge; a sampling is shown in the table below. If you would like to use a dataset that isn't listed here, please submit a PR to add it to this table (i.e. through the registration process).

| Name | Description | Size | Format | URL |
| --- | --- | --- | --- | --- |
| COVID-19 Open Research Dataset | An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House | 19 GB | JSON | https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge |
| Chest X-Ray Images (Pneumonia) | 5,863 images, 2 categories | 2.29 GB | JPEG | https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia |
| Huge Stock Market Dataset | Historical daily prices and volumes of all U.S. stocks and ETFs | 772 MB | CSV | https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs |
| Condensed Movies | A large-scale video dataset, featuring clips from movies with detailed captions. | 250 GB | Video | https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/ |
| USENET (2005-2011) | Compressed USENET posts | 36 GB | Text | http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html |
| Sloan Digital Sky Survey | Three dimensional view of the universe | 273 TB | Various | https://www.sdss.org/ |
| GHTorrent Project | a scalable, queriable, offline mirror of data offered through the Github REST API. | 18 TB | MySQL | https://ghtorrent.org/ |
| Free Music Archive | 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres | 879 GB | MP3 | https://github.com/mdeff/fma |
| Open Images Dataset | 9 million URLs to images that have been annotated with labels spanning over 6000 categories | 18 TB | PNG | https://storage.googleapis.com/openimages/web/index.html |
| Internet Archive | a digital library of Internet sites and other cultural artifacts in digital form | 45 PB | Various | https://archive.org/ |
| Common Crawl | An open repository of web crawl data | 235 TB | WARC | https://commoncrawl.org/ |
| Noisy speech database | Used for training speech enhancement algorithms and TTS models | 14 GB | WAV | https://datashare.is.ed.ac.uk/handle/10283/2791 |
| NFL play-by-play | The data has three tables: teams, players, and plays. | 2.54 GB | Text | https://www.dolthub.com/repositories/Liquidata/nfl-play-by-play |
| NYC Trip Record | Data include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. | 267 GB | CSV | https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page |
| National Cancer Institute | Cancer data for analysis | 18.46 TB | JSON | https://portal.gdc.cancer.gov/repository |
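
The sizes in the table span several orders of magnitude (MB to PB), so it can be handy to compare them in a single unit when choosing a dataset. The following is a minimal sketch, not part of the Slingshot tooling: `size_to_bytes` and `UNIT_BYTES` are hypothetical names, and the code assumes decimal units (1 GB = 10^9 bytes), which the table does not specify.

```python
# Hypothetical helper: convert size strings from the table into byte counts.
# Assumption: decimal units (1 GB = 10**9 bytes); the table does not state
# whether its sizes are decimal or binary.
UNIT_BYTES = {"MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}


def size_to_bytes(size: str) -> int:
    """Parse a string such as '2.29 GB' or '18TB' into a number of bytes."""
    text = size.strip().upper()
    number, unit = text[:-2].strip(), text[-2:]
    return round(float(number) * UNIT_BYTES[unit])


# A few sizes copied from the table, sorted from smallest to largest.
sizes = {
    "Huge Stock Market Dataset": "772 MB",
    "Chest X-Ray Images (Pneumonia)": "2.29 GB",
    "Common Crawl": "235 TB",
}
for name, size in sorted(sizes.items(), key=lambda item: size_to_bytes(item[1])):
    print(f"{name}: {size_to_bytes(size):,} bytes")
```

An unknown unit raises a `KeyError`, which keeps malformed entries from being silently misread.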