Provenance of data accessed through an API which may require authentication #377

thomaszwagerman · 2024-12-11T13:12:37Z


Taking note of guidance on downloadable datasets.

I'm wondering if there are any best practice examples on how to represent data provenance when data is accessed through an API endpoint (which may require authentication through some sort of key like a Personal Access Token).

My use case is that I would like to represent a pipeline which programmatically accesses data from the Copernicus Climate Data Store (for example ERA5 single levels), performs some calculation and creates a new data output. This "new" data output could be ingested into another environmental forecasting model (or Digital Twin system), so an RO-Crate is potentially a good way to pass on provenance in such an interaction. Ideally the description captures the endpoint used, the query used (e.g. geographical area, timespan), the license, the date accessed, the DOI, and the authentication method, without actually attaching credentials.
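To make the "without actually attaching credentials" part concrete, here is a minimal sketch of stripping secrets from a download configuration before it is recorded in the crate. The key names in `SECRET_KEYS`, the `redact_credentials` helper and the example configuration are all hypothetical, not anything defined by RO-Crate or the CDS API.

```python
# Hypothetical names of configuration keys that would hold secrets.
SECRET_KEYS = {"key", "token", "password"}


def redact_credentials(config, placeholder="<REDACTED>"):
    """Return a copy of config with secret values replaced by a placeholder."""
    return {
        name: (placeholder if name.lower() in SECRET_KEYS else value)
        for name, value in config.items()
    }


# Hypothetical configuration; only the endpoint URL comes from my pipeline.
config = {
    "url": "https://cds.climate.copernicus.eu/api",
    "key": "my-personal-access-token",  # must never be written to the crate
    "variable": "mean_sea_level_pressure",
}

# The redacted copy is what would be stored alongside the provenance metadata.
safe_config = redact_credentials(config)
print(safe_config)
```

The original dictionary is left untouched, so the script can still authenticate while only the redacted copy is written out.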

I'm new to RO-Crate, so my interpretation of how it should work might be wrong, but I'm considering the following approach:

I interpret a script that accesses an API as software used to create files. I describe the script and a configuration file (which includes information on the query) as the `instrument`s of a `CreateAction`, where the `object` is a detached Dataset that contains information on the endpoint and the `result` is the downloaded data.

        {
            "@id": "Download ERA5 data",
            "@type": "CreateAction",
            "agent": {
                "@id": "https://orcid.org/0000-0000-0000-0000"
            },
            "name": "Downloading required ERA5 data",
            "description": "Script downloading the era5 land sea mask (era5_lsm.nc) and mean sea level pressure data (data/ERA5/monthly/era5_mean_sea_level_pressure_*.nc), referencing ENVS to obtain query parameters.",
            "instrument": [
                {
                    "@id": "src/00_download_era5.sh"
                },
                {
                    "@id": "ENVS"
                }
            ],
            "object": {
                "@id":"https://cds.climate.copernicus.eu/api"
            },
            "result": {
                "@id": "data/"
            }
        }

where `"data/"` is described as the downloaded data:

        {
            "@id": "data/",
            "@type": "Dataset",
            "description": "Folder containing ERA5 land sea mask (era5_lsm.nc) and ERA5/monthly/era5_mean_sea_level_pressure_monthly_*.nc files.",
            "encodingFormat": "application/netcdf",
            "name": "ERA5 Data",
            "type": "FormalParameter",
            "valueRequired": true
        },

and `"https://cds.climate.copernicus.eu/api"` is described as a detached dataset (I have yet to work this one out, but it should contain the license etc.). After this, further scripts process `data/`.
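A rough sketch of how such an endpoint entity could be generated, in case it helps the discussion. It assumes `sdDatePublished` is an acceptable place for the access date and that the query can live as free text in `description`; the `describe_api_source` helper, the license URL and the query values are hypothetical placeholders, not the real CDS license or my actual query.

```python
import json
from datetime import date


def describe_api_source(endpoint, name, license_url, accessed, query):
    """Build a JSON-LD entity describing a remote API source.

    Only public metadata is recorded; no credentials are passed in or stored.
    """
    return {
        "@id": endpoint,
        "@type": "Dataset",
        "name": name,
        "license": {"@id": license_url},
        # Date the endpoint was accessed, in ISO 8601 form.
        "sdDatePublished": accessed.isoformat(),
        # Assumption: the query is kept as free text in the description.
        "description": "Query: " + json.dumps(query, sort_keys=True),
    }


entity = describe_api_source(
    endpoint="https://cds.climate.copernicus.eu/api",
    name="Copernicus Climate Data Store API",
    license_url="https://example.org/licence",  # hypothetical placeholder
    accessed=date(2024, 12, 11),
    query={"variable": "mean_sea_level_pressure", "year": "2020"},  # hypothetical
)
print(json.dumps(entity, indent=4))
```

The printed object could then be added to the crate's `@graph` next to the `CreateAction` that references it as its `object`.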

I guess my main questions are:

- Is it good practice to represent API access explicitly like this?
- If this step is part of a larger computational workflow, does it need to be represented separately in this way?
- Should guidance on API authentication be linked, or should a placeholder file be attached? The crate would not work "out of the box" without some manual user configuration.
