Files used in notebooks are not all in the Dataverse yet #36

Open
gaelforget opened this issue Jan 18, 2023 · 11 comments

Comments

@gaelforget
Collaborator

gaelforget commented Jan 18, 2023

To enable future reproducibility of the notebooks by as many users as possible, we are envisioning the following.

We are using Dataverse for this. It has great support for metadata, which helps refer to all relevant data sources that should be credited.

sessions that used an external data set shared at workshop

We'd like to collect one zipped file per session and push it to the dataverse repo as soon as possible.

This will create a DOI and permanent archive. It will also allow for automatic download and browsing via Dataverse.jl

sessions that only downloaded data automatically

We'd like to document how much gets downloaded, since users with limited internet access may want to know.

Could also be a good idea to provide a zipped file to archive at Dataverse as backup if possible.
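One way to document the download size is to measure the data folder after a notebook has run once. The following is a minimal sketch, not part of the workshop tooling: the function names are hypothetical, and it assumes the downloaded files all end up under a single local folder.

```python
from pathlib import Path


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def human_readable(n_bytes):
    """Format a byte count using decimal units (1 MB = 1000**2 bytes)."""
    size = float(n_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        # Stop at the first unit that fits, or at GB as the largest unit.
        if size < 1000 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1000
```

Calling `human_readable(directory_size_bytes("data"))` on a session's data folder (the folder name here is only an example) gives a figure that can be noted next to the notebook.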

sessions that used custom Docker images

We'd like to collect these too and upload them to the Dataverse as well if possible. The central one that's meant to support running all notebooks (ideally) has already been posted there.

If you prefer to maintain and archive them your own way, that's totally fine too. We'd like a DOI though so we can refer to it and rely on it in the future.

Non-text or large files in the repo

Ideally, one may want to avoid putting non-text files in the repo -- these cannot be diff'ed with git in a practical way, and increase the time to download / clone / etc the repo.

For ipynb files, for example, one can instead provide jupytext and point to a rendered version elsewhere (e.g. the GitHub page for the repo in a separate branch).

pdf, png, etc. can all be put elsewhere too in order to keep the repo small. That would likely be needed for the whole repo to stay under, say, 100MB.
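To find candidates for moving out of the repo, one could scan a local clone for non-text or large files. This is a rough sketch under stated assumptions: the function names and the 1 MB default threshold are hypothetical, and "non-text" is approximated by looking for a NUL byte at the start of the file, which is a heuristic and not a definitive test.

```python
from pathlib import Path


def looks_binary(path, probe_bytes=8192):
    """Heuristic: treat a file as non-text if its first bytes contain a NUL."""
    with open(path, "rb") as handle:
        return b"\0" in handle.read(probe_bytes)


def flag_files(repo_root, max_bytes=1_000_000):
    """Return (relative path, size, reason) for large or non-text files."""
    repo_root = Path(repo_root)
    flagged = []
    for path in sorted(repo_root.rglob("*")):
        relative = path.relative_to(repo_root)
        # Skip directories and git's own object store.
        if not path.is_file() or ".git" in relative.parts:
            continue
        size = path.stat().st_size
        if looks_binary(path):
            flagged.append((relative.as_posix(), size, "non-text"))
        elif size > max_bytes:
            flagged.append((relative.as_posix(), size, "large"))
    return flagged
```

Note that ipynb files are JSON text, so this heuristic would only flag them when they exceed the size threshold.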

@gaelforget
Collaborator Author

Regarding data uploads.

See gdcc/Dataverse.jl#13, IQSS/dataverse#9298

@visr
Contributor

visr commented Jan 18, 2023

We'd like to document how much gets downloaded, since users with limited internet access may want to know.

In juliageo.ipynb all data gets downloaded, about 100MB in total. This is almost entirely the geometries from GADM, which it only needs to download once. A Manifest.toml is also uploaded for reproducibility.

@gaelforget
Collaborator Author

Btw, I anticipate a nice set of file types will end up in the JuliaEO dataverse repo.

It would likely be useful to have a few examples outside of zip files, to demo previewing and maybe lazy access functionalities, for example.

  • already there : netcdf (arrays), geotiff (rasters), and geojson (polygons)
  • possible / likely : csv, JLD2 (Julia / HDF5), zarr (arrays, cloud optimized), parquet (geospatial, cloud optimized), geotiff & netcdf (more, cloud optimized), Dockerfile (text but automated image build would be nice), data cubes, ...

Looping in @pdurbin @atrisovic @felixcremer @rafaqz @Alexander-Barth just in case

@gaelforget
Collaborator Author

gaelforget commented Jan 18, 2023

Quoting @pdurbin from a different platform, about the auto-unzip functionality for uploads:

you have to double zip. This is how I do it for my dataset:

zip -r primary-data.zip primary-data -x '**/.*' -x '**/__MACOSX'
zip -r outer.zip primary-data.zip -x '**/.*' -x '**/__MACOSX'
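For anyone without the `zip` command at hand, the same double-zip idea can be sketched with Python's standard `zipfile` module. This is only an approximation of the commands above: the function names are hypothetical, and the exclusion rule (skip anything whose path contains a hidden entry or a `__MACOSX` folder) is my reading of the `-x` patterns and not an exact equivalent.

```python
import zipfile
from pathlib import Path


def is_excluded(relative_path):
    """True for hidden files/folders and macOS __MACOSX resource folders."""
    return any(part.startswith(".") or part == "__MACOSX"
               for part in Path(relative_path).parts)


def zip_folder(folder, archive_path):
    """Zip `folder` recursively, keeping the folder name as top-level entry."""
    folder = Path(folder)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            relative = path.relative_to(folder)
            if path.is_file() and not is_excluded(relative):
                arcname = (Path(folder.name) / relative).as_posix()
                archive.write(path, arcname)


def double_zip(folder, inner_zip, outer_zip):
    """Zip the folder, then wrap that zip file in a second, outer zip."""
    zip_folder(folder, inner_zip)
    # The inner zip is already compressed, so it is stored as-is.
    with zipfile.ZipFile(outer_zip, "w", zipfile.ZIP_STORED) as archive:
        archive.write(inner_zip, Path(inner_zip).name)
```

With this sketch, `double_zip("primary-data", "primary-data.zip", "outer.zip")` would produce an `outer.zip` whose only entry is `primary-data.zip`. Both zip files should be written outside the folder being archived so they are not picked up by the recursive walk.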

@gaelforget
Collaborator Author

For supporting notebooks that require data from Dataverse, the simplest thing I could think of would be to have one tar.gz or zip file associated with the notebook folder in GitHub.

e.g., Data_Visualizations_with_Makie.tar.gz or some zip version

I think this makes life easy on the self-guided user.

Are there strong preferences between tar.gz and zip?

Supposedly we can add software on the Dataverse or Julia side to extract metadata from tar.gz and preview it in the UI, without downloading the file.

It would also be good to grab the Dockerfile as a preview of a Docker image such as DockerImage.tar.gz, I think.
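As an illustration of what such a preview could extract, here is a small sketch using Python's standard `tarfile` module. It is not the Dataverse or Julia implementation; the function names are hypothetical and it assumes the archive is readable locally. One point relevant to the tar.gz versus zip question: a tar.gz has no central index, so listing its members means reading through the compressed stream, whereas a zip file keeps a directory of its entries at the end of the file.

```python
import tarfile


def list_members(archive_path):
    """Return (name, size in bytes) for every regular file in a tar.gz."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


def read_text_member(archive_path, suffix="Dockerfile"):
    """Return the text of the first member ending with `suffix`, or None."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(suffix):
                # Read the member in memory, without unpacking to disk.
                return archive.extractfile(member).read().decode("utf-8")
    return None
```

`list_members` gives the kind of file listing a UI could show, and `read_text_member` pulls out a Dockerfile for display, provided one was included in the archive.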

@pdurbin

pdurbin commented Feb 1, 2023

sessions that used custom Docker images

We'd like to collect these too and upload them to the Dataverse as well if possible. The central one that's meant to support running all notebooks (ideally) has already been posted there.

@gaelforget @visr last week we enabled a Binder button on Harvard Dataverse:

It looks like this on the "Global Workshop on Earth Observation with Julia 2023" dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OYBLGK

[Screenshot: Binder button on the dataset page]

I'm scared to click it, though, because the Docker image is 1.7 GB! 😅 Binder will try to download all the files in the dataset, including that giant one.
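Before triggering a full download, the file sizes in a dataset can be checked through the Dataverse native API. The sketch below uses plain HTTP from Python rather than Dataverse.jl. The function names are hypothetical, and the JSON layout it relies on (`data.latestVersion.files[].dataFile` with `filename` and `filesize` fields) is my understanding of the API response and should be verified against the Dataverse API guide.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/OYBLGK"  # the workshop dataset referenced above


def dataset_url(server, persistent_id):
    """Build the native-API URL that returns a dataset's JSON description."""
    query = urllib.parse.urlencode({"persistentId": persistent_id})
    return f"{server}/api/datasets/:persistentId/?{query}"


def file_sizes(dataset_json):
    """Return (filename, size in bytes) pairs from a dataset JSON response.

    Assumes the layout data -> latestVersion -> files -> dataFile.
    """
    files = dataset_json["data"]["latestVersion"]["files"]
    return [(f["dataFile"]["filename"], f["dataFile"]["filesize"])
            for f in files]


def fetch_file_sizes(server=SERVER, persistent_id=DOI):
    """Query the server (needs network access) and list file sizes."""
    with urllib.request.urlopen(dataset_url(server, persistent_id)) as response:
        return file_sizes(json.load(response))
```

Sorting the result of `fetch_file_sizes()` by size would make a large file such as the Docker image easy to spot.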

Binder supports defining your own Dockerfile (an alternative to uploading the image itself): https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html

Is this something you'd like to try?

For more on Binder from the Dataverse perspective: https://guides.dataverse.org/en/5.12.1/admin/integrations.html#binder

@gaelforget
Collaborator Author

For Land_Cover_Classification_of_Earth_Observation_images the files used, depending on user choice, are :

  • Small : RGB (133M, 3 bands)
  • Large : MS (2.8G, 13 bands)

@gaelforget
Collaborator Author

gaelforget commented Mar 9, 2023

For RF_classification_using_marida the files used are :

  • 551M Bands_Indices-S2
  • 57M Train_Test-Datasets

A couple questions I have :

@EmanuelCastanho

@EmanuelCastanho
Member

Yes, Train_Test-Datasets is a subset created by me from the original MARIDA dataset.
According to their MIT Licence and Creative Commons Attribution 4.0 International I think it is fine to include this subset in the Dataverse.

Please include this citation to the original dataset, if possible: Kikaki K, Kakogeorgiou I, Mikeli P, Raitsos DE, Karantzalos K (2022) MARIDA: A benchmark for Marine Debris detection from Sentinel-2 remote sensing data. PLoS ONE 17(1): e0262247. https://doi.org/10.1371/journal.pone.0262247

Bands_Indices-S2 can be included in the Dataverse.

@gaelforget

@gaelforget
Collaborator Author

Yes, Train_Test-Datasets is a subset created by me from the original MARIDA dataset. According to their MIT Licence and Creative Commons Attribution 4.0 International I think it is fine to include this subset in the Dataverse.

Please include this citation to the original dataset, if possible: Kikaki K, Kakogeorgiou I, Mikeli P, Raitsos DE, Karantzalos K (2022) MARIDA: A benchmark for Marine Debris detection from Sentinel-2 remote sensing data. PLoS ONE 17(1): e0262247. https://doi.org/10.1371/journal.pone.0262247

Bands_Indices-S2 can be included in the Dataverse.

@gaelforget

Done. Thanks!

Dataverse : https://doi.org/10.7910/DVN/OYBLGK
Zenodo : https://doi.org/10.5281/zenodo.8113073

@gaelforget
Collaborator Author

Automated data downloads have now been implemented for some notebooks:

#43 #44 #45
