Curated Datasets for the Slingshot Competition

Slingshot uses curated datasets to ensure that meaningful data is stored on and retrieved from the Filecoin network. Use cases don’t need to be complex, and they can be proprietary applications.

A wide variety of public datasets can be used for this challenge; a sample is shown in the table below. If you would like to use a dataset that is not listed here, please submit a PR to add it to this table (i.e., through the registration process).

| Name | Description | Size | Format | URL |
| --- | --- | --- | --- | --- |
| COVID-19 Open Research Dataset | An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House | 19 GB | JSON | https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge |
| Chest X-Ray Images (Pneumonia) | 5,863 images, 2 categories | 2.29 GB | JPEG | https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia |
| Huge Stock Market Dataset | Historical daily prices and volumes of all U.S. stocks and ETFs | 772 MB | CSV | https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs |
| Condensed Movies | A large-scale video dataset featuring clips from movies with detailed captions | 250 GB | Video | https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/ |
| USENET (2005-2011) | Compressed USENET posts | 36 GB | Text | http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html |
| Sloan Digital Sky Survey | Three-dimensional view of the universe | 273 TB | Various | https://www.sdss.org/ |
| GHTorrent Project | A scalable, queryable, offline mirror of data offered through the GitHub REST API | 18 TB | MySQL | https://ghtorrent.org/ |
| Free Music Archive | 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres | 879 GB | MP3 | https://github.com/mdeff/fma |
| Open Images Dataset | 9 million URLs to images that have been annotated with labels spanning over 6,000 categories | 18 TB | PNG | https://storage.googleapis.com/openimages/web/index.html |
| Internet Archive | A digital library of Internet sites and other cultural artifacts in digital form | 45 PB | Various | https://archive.org/ |
| Common Crawl | An open repository of web crawl data | 235 TB | WARC | https://commoncrawl.org/ |
| Noisy speech database | Used for training speech enhancement algorithms and TTS models | 14 GB | WAV | https://datashare.is.ed.ac.uk/handle/10283/2791 |
| NFL play-by-play | Three tables: teams, players, and plays | 2.54 GB | Text | https://www.dolthub.com/repositories/Liquidata/nfl-play-by-play |
| NYC Trip Record Data | Includes fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts | 267 GB | CSV | https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page |
| National Cancer Institute | Cancer data for analysis | 18.46 TB | JSON | https://portal.gdc.cancer.gov/repository |
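
The sizes in the table above use mixed units (MB, GB, TB, PB), which makes them awkward to compare directly. The following is a minimal sketch of how they could be normalized to bytes when choosing a dataset. It assumes decimal units (1 GB = 10^9 bytes), which the table does not specify, and the `size_to_bytes` helper is a hypothetical name used only for illustration.

```python
import re

# Assumption: decimal (SI) units, i.e. 1 GB = 10**9 bytes. The table does not
# state whether its sizes are decimal or binary.
UNIT_BYTES = {"MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}


def size_to_bytes(size: str) -> int:
    """Convert a size string such as '2.29 GB' or '18TB' to a byte count."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(MB|GB|TB|PB)\s*", size)
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    value, unit = match.groups()
    return round(float(value) * UNIT_BYTES[unit])


# A few rows copied from the table above (name -> size as listed).
sizes = {
    "Huge Stock Market Dataset": "772 MB",
    "Chest X-Ray Images (Pneumonia)": "2.29 GB",
    "GHTorrent Project": "18TB",
    "Internet Archive": "45 PB",
}

# Sort dataset names from smallest to largest.
by_size = sorted(sizes, key=lambda name: size_to_bytes(sizes[name]))
for name in by_size:
    print(f"{name}: {size_to_bytes(sizes[name]):,} bytes")
```

Sorting by the normalized value gives a quick way to pick a dataset that fits the storage available for a given deal.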
