Files used in notebooks are not all in the Dataverse yet #36
Comments
Regarding data uploads. |
juliageo.ipynb downloads all of its data automatically, about 100MB in total. This is almost entirely the geometries from GADM, which only need to be downloaded once. A Manifest.toml is also uploaded for reproducibility. |
Btw, I anticipate a nice set of file types will end up in the JuliaEO dataverse repo. It would likely be useful to have a few example files outside of the zip, for example to demo previewing and maybe lazy-access functionality.
Looping in @pdurbin @atrisovic @felixcremer @rafaqz @Alexander-Barth just in case |
Quoting @pdurbin from a different platform, about the auto-unzip functionality for uploads: you have to double zip. This is how I do it for my dataset:
|
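The snippet from the quote is not reproduced here. As a rough illustration of what "double zip" means, here is a minimal Python sketch, not @pdurbin's actual commands. The idea, as I understand it, is that the auto-unzip on upload only unpacks the outer layer, so the inner zip ends up stored as a single file. The function name and file names are hypothetical.

```python
import zipfile
from pathlib import Path


def double_zip(inner_zip: Path, outer_zip: Path) -> Path:
    """Wrap an existing zip file inside a second zip archive."""
    # ZIP_STORED: the inner archive is already compressed, so store it as-is.
    with zipfile.ZipFile(outer_zip, "w", compression=zipfile.ZIP_STORED) as zf:
        # Keep only the file name, not the full path, inside the outer archive.
        zf.write(inner_zip, arcname=inner_zip.name)
    return outer_zip
```

Usage would look like `double_zip(Path("session_data.zip"), Path("session_data_upload.zip"))`, after which the outer file is the one to upload.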
For supporting notebooks that require data from e.g. , I think this makes life easy on the self-guided user. Are there strong preferences between Supposedly we can add software on the Dataverse or Julia side to extract meta data from Would be good also to grab the Dockerfile as preview of a Docker image like DockerImage.tar.gz I think. |
@gaelforget @visr last week we enabled a Binder button on Harvard Dataverse. It looks like this on the "Global Workshop on Earth Observation with Julia 2023" dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OYBLGK
I'm scared to click it, though, because the Docker image is 1.7 GB! 😅 Binder will try to download all the files in the dataset, including that giant one.
Binder supports defining your own Dockerfile (an alternative to uploading the image itself): https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html Is this something you'd like to try?
For more on Binder from the Dataverse perspective: https://guides.dataverse.org/en/5.12.1/admin/integrations.html#binder |
For
|
For
A couple questions I have :
|
Yes. Please include this citation to the original dataset, if possible: Kikaki K, Kakogeorgiou I, Mikeli P, Raitsos DE, Karantzalos K (2022) MARIDA: A benchmark for Marine Debris detection from Sentinel-2 remote sensing data. PLoS ONE 17(1): e0262247. https://doi.org/10.1371/journal.pone.0262247
|
Done. Thanks! Dataverse : https://doi.org/10.7910/DVN/OYBLGK |
To enable future reproducibility of the notebooks by as many users as possible, we are envisioning the following.
We are using Dataverse for this. It has great metadata support, which helps reference all relevant data sources that should be credited.
Sessions that used an external data set shared at the workshop
We'd like to collect one zipped file per session and push it to the dataverse repo as soon as possible.
This will create a DOI and permanent archive. It will also allow for automatic download and browsing via Dataverse.jl
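The thread relies on Dataverse.jl for this. Purely as an illustration of what automatic browsing could look like, here is a minimal Python sketch against the Dataverse native API. The DOI is the one from this thread; the endpoint path and the JSON layout (`data.latestVersion.files[].dataFile`) are my assumptions about the API and should be checked against the Dataverse API guide. The function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

SERVER = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/OYBLGK"  # the JuliaEO dataset from this thread


def dataset_url(server: str, doi: str) -> str:
    """Build the (assumed) native API URL that returns dataset metadata."""
    return f"{server}/api/datasets/:persistentId/?persistentId={quote(doi, safe=':/')}"


def list_files(metadata: dict) -> list[tuple[str, int]]:
    """Return (filename, size in bytes) for each file in the latest version."""
    # Assumed response layout; verify against a real response before relying on it.
    files = metadata["data"]["latestVersion"]["files"]
    return [(f["dataFile"]["filename"], f["dataFile"]["filesize"]) for f in files]


def fetch_metadata(server: str = SERVER, doi: str = DOI) -> dict:
    """Download and decode the dataset metadata (needs network access)."""
    with urlopen(dataset_url(server, doi)) as response:
        return json.load(response)
```

With network access, `list_files(fetch_metadata())` would give the file names and sizes, which is also a quick way to see how much a full download would be.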
Sessions that only downloaded data automatically
We'd like to document how much gets downloaded, since users with limited internet access may want to know.
It could also be a good idea to provide a zipped file to archive at Dataverse as a backup, if possible.
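One simple way to get the download size for a session is to run the notebook once and measure the folder the data landed in. A minimal Python sketch, assuming all downloads go into a single directory (the function names and the `data` folder name are hypothetical):

```python
from pathlib import Path


def total_size_bytes(folder: Path) -> int:
    """Sum the sizes of all regular files under `folder`, recursively."""
    return sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())


def format_mb(n_bytes: int) -> str:
    """Format a byte count in megabytes (1 MB = 10**6 bytes)."""
    return f"{n_bytes / 1e6:.1f} MB"
```

For example, `format_mb(total_size_bytes(Path("data")))` gives a figure that can go straight into the session README.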
Sessions that used custom Docker images
We'd like to collect these too and upload them to the Dataverse as well if possible. The central one that's meant to support running all notebooks (ideally) has already been posted there.
If you prefer to maintain and archive them your own way, that's totally fine too. We'd like a DOI though so we can refer to it and rely on it in the future.
Non-text or large files in the repo
Ideally, one may want to avoid putting non-text files in the repo -- these cannot be diff'ed with git in a practical way, and increase the time to download / clone / etc the repo.
For ipynb files, for example, one can instead provide jupytext and point to a rendered version elsewhere (e.g. the GitHub page for the repo in a separate branch). pdf, png, etc. can all be put elsewhere too in order to keep the repo small. For the whole repo, that would likely be needed to stay under, say, 100MB.
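To see which files would need to move, one could scan a local clone for the file types mentioned above. A minimal Python sketch; the extension list and function name are illustrative assumptions, not a project convention:

```python
from pathlib import Path

# Extensions flagged here; an illustrative list, not exhaustive.
FLAGGED = frozenset({".ipynb", ".pdf", ".png", ".jpg", ".zip", ".gz"})


def flag_files(repo: Path, extensions: frozenset[str] = FLAGGED) -> list[tuple[Path, int]]:
    """Return (relative path, size in bytes) for flagged files, largest first."""
    found = []
    for path in repo.rglob("*"):
        relative = path.relative_to(repo)
        if ".git" in relative.parts:
            continue  # skip git's own internal files
        if path.is_file() and path.suffix.lower() in extensions:
            found.append((relative, path.stat().st_size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Summing the sizes returned by `flag_files(Path("."))` shows how much of the repo these files account for, relative to the 100MB target.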