Slingshot uses curated datasets to ensure that meaningful data is stored on, and retrieved from, the Filecoin Network. Use cases don't need to be complex, and they can be proprietary to an application.
A wide variety of public datasets can be leveraged for this challenge; a sampling is shown in the table below. If you would like to use a dataset that isn't listed here, please submit a PR to add it to this table (i.e., through the registration process).

Name | Description | Size | Format | URL |
---|---|---|---|---|
COVID-19 Open Research Dataset | An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House | 19 GB | JSON | https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge |
Chest X-Ray Images (Pneumonia) | 5,863 images, 2 categories | 2.29 GB | JPEG | https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia |
Huge Stock Market Dataset | Historical daily prices and volumes of all U.S. stocks and ETFs | 772 MB | CSV | https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs |
Condensed Movies | A large-scale video dataset, featuring clips from movies with detailed captions. | 250 GB | Video | https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/ |
USENET (2005-2011) | Compressed USENET posts | 36 GB | Text | http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html |
Sloan Digital Sky Survey | Three-dimensional view of the universe | 273 TB | Various | https://www.sdss.org/ |
GHTorrent Project | A scalable, queryable, offline mirror of data offered through the GitHub REST API | 18 TB | MySQL | https://ghtorrent.org/ |
Free Music Archive | 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres | 879 GB | MP3 | https://github.com/mdeff/fma |
Open Images Dataset | 9 million URLs to images that have been annotated with labels spanning over 6000 categories | 18 TB | PNG | https://storage.googleapis.com/openimages/web/index.html |
Internet Archive | A digital library of Internet sites and other cultural artifacts in digital form | 45 PB | Various | https://archive.org/ |
Common Crawl | An open repository of web crawl data | 235 TB | WARC | https://commoncrawl.org/ |
Noisy speech database | Used for training speech enhancement algorithms and TTS models | 14 GB | WAV | https://datashare.is.ed.ac.uk/handle/10283/2791 |
NFL play-by-play | The data has three tables: teams, players, and plays. | 2.54 GB | Text | https://www.dolthub.com/repositories/Liquidata/nfl-play-by-play |
NYC Trip Record Data | Trip records with fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts | 267 GB | CSV | https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page |
National Cancer Institute | Cancer data for analysis | 18.46 TB | JSON | https://portal.gdc.cancer.gov/repository |
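
When proposing a new dataset through a PR, the added row needs to match the five columns of the table above (name, description, size, format, URL). The following is a minimal sketch of how such a row could be generated, not part of any official Slingshot tooling. The helper name `format_dataset_row` and the example values are hypothetical placeholders.

```python
def format_dataset_row(name, description, size, fmt, url):
    """Return one Markdown table row in the column order used above.

    Hypothetical helper for illustration only.
    """
    cells = [name, description, size, fmt, url]
    # Trim whitespace and escape literal pipes so a cell cannot
    # accidentally split into extra columns.
    cleaned = [str(cell).strip().replace("|", "\\|") for cell in cells]
    return " | ".join(cleaned) + " |"


if __name__ == "__main__":
    # Placeholder values; replace them with the real dataset details.
    print(format_dataset_row(
        "Example Dataset",
        "Placeholder description",
        "1 GB",
        "CSV",
        "https://example.org/dataset",
    ))
```

The printed line can be pasted as a new row at the end of the table.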