Support csw dcat harvest #2800

Merged
merged 38 commits into from
Oct 19, 2023

Conversation

maudetes
Contributor

@maudetes maudetes commented Jan 13, 2023

A proposed solution for datagouv/data.gouv.fr#913.

Use the DCAT export served by the CSW endpoint, available in GeoNetwork, for harvesting.

This CSW backend is actually based on the DCAT backend with a different parse_graph function.

@maudetes maudetes marked this pull request as draft January 13, 2023 10:12
@maudetes maudetes requested a review from quaxsze March 6, 2023 13:16
</csw:GetRecords>'''
headers = {"Content-Type": "application/xml"}

content = requests.post(url, data=body.format(start=1), headers=headers).text
Contributor Author

Which would be better: making a POST with our crafted body, or expecting a complete URL from the user and making a GET query on it?

I would say a POST with our body would make it easier on the user side, but it may prevent the user from adding custom params to the targeted endpoint.
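
To make the trade-off concrete, the two options could look roughly like the sketch below. It only builds the requests and sends nothing. The endpoint URL, the function names, the `typeNames` value and the request body are illustrative assumptions, not the code of this PR; the GET parameter names follow the CSW 2.0.2 key-value encoding.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical endpoint, used for illustration only.
ENDPOINT = "https://example.org/geonetwork/srv/eng/csw"

# Assumed minimal GetRecords body; the body used in this PR may differ.
BODY = """<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    service="CSW" version="2.0.2" resultType="results"
    startPosition="{start}" maxRecords="{page_size}">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
  </csw:Query>
</csw:GetRecords>"""


def build_post(endpoint, start=1, page_size=10):
    """Option 1: the user gives a bare endpoint, the harvester owns the body."""
    headers = {"Content-Type": "application/xml"}
    return endpoint, BODY.format(start=start, page_size=page_size), headers


def build_get(user_url, start=1):
    """Option 2: the user gives a complete GET URL; only the paging offset is overridden."""
    parts = urlsplit(user_url)
    params = dict(parse_qsl(parts.query))
    params["startPosition"] = str(start)  # custom user parameters are preserved
    return urlunsplit(parts._replace(query=urlencode(params)))
```

With option 1 the user only has to know the endpoint; with option 2 any extra query parameter the user adds survives, at the cost of asking them to write a valid GetRecords URL.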

The CSW spec allows GET and/or POST, and this is advertised in the GetCapabilities response:

<ows:Operation name="GetRecords">
<ows:DCP>
<ows:HTTP>
<ows:Get xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>
<ows:Post xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>

but most implementations probably support both.
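
If the harvester ever needs to check this rather than assume it, the advertised methods can be read from the GetCapabilities document. The following is a minimal sketch, assuming the OWS namespace used by CSW 2.0.2; the sample document, its URLs and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed for a CSW 2.0.2 capabilities document.
OWS_NS = "http://www.opengis.net/ows"
XLINK_NS = "http://www.w3.org/1999/xlink"
NS = {"ows": OWS_NS, "xlink": XLINK_NS}

# Made-up capabilities excerpt, only for demonstration.
SAMPLE = """<csw:Capabilities xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ows="http://www.opengis.net/ows"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <ows:OperationsMetadata>
    <ows:Operation name="GetRecords">
      <ows:DCP>
        <ows:HTTP>
          <ows:Get xlink:href="https://example.org/csw"/>
          <ows:Post xlink:href="https://example.org/csw"/>
        </ows:HTTP>
      </ows:DCP>
    </ows:Operation>
  </ows:OperationsMetadata>
</csw:Capabilities>"""


def advertised_methods(capabilities_xml, operation="GetRecords"):
    """Map each advertised HTTP method ("GET"/"POST") to its endpoint URL."""
    root = ET.fromstring(capabilities_xml)
    methods = {}
    for op in root.iter(f"{{{OWS_NS}}}Operation"):
        if op.get("name") != operation:
            continue
        for verb in ("Get", "Post"):
            # If several endpoints are listed for a verb, the last one wins.
            for node in op.findall(f"ows:DCP/ows:HTTP/ows:{verb}", NS):
                methods[verb.upper()] = node.get(f"{{{XLINK_NS}}}href")
    return methods
```

For the sample above, `advertised_methods(SAMPLE)` returns both `GET` and `POST` pointing at the same URL.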

documentStandard is quite specific to GeoNetwork. Here it is used because we only want to request metadata records in a standard that provides a mapping to DCAT (currently ISO19139 and ISO19115-3), and to avoid failures on catalogues containing records in other standards (e.g. Dublin Core or ISO19110). No criterion is needed if the target catalogue contains only ISO19139 records.

You can probably increase page size - we have been using 200 records per page for INSPIRE member states harvesting with no issue.

It can be good to add a SortBy clause, to avoid records appearing on several pages if index changes occur during harvesting:

    <ogc:SortBy xmlns:ogc="http://www.opengis.net/ogc">
      <ogc:SortProperty>
        <ogc:PropertyName>identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
  </csw:Query>

About the documentStandard filter: you should probably remove it on your side and ask GeoNetwork users to set up a dedicated portal on their catalogue to filter, on their side, the records they want harvested. That way they can filter on documentStandard themselves if they have metadata records that do not support mapping to DCAT, which should not happen often.

Contributor Author

Thank you for the inputs!

  • We're keeping the POST request for a better user experience. We'll mention it in our harvesting doc (along with an example of the kind of endpoint to provide).
  • I've removed documentStandard. We should probably also mention in our documentation that the exposed catalog must contain records that support mapping to DCAT!
  • I've increased the page size to 200 and added the sort clause.
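
Put together, the paging described in this thread could be sketched as follows. This is not the merged code: the `outputSchema` and `typeNames` values, the function names and the `fetch` callable are assumptions made for illustration. The `nextRecord` and `numberOfRecordsMatched` attributes of `csw:SearchResults` come from CSW 2.0.2.

```python
import xml.etree.ElementTree as ET

CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"
PAGE_SIZE = 200  # page size suggested in the review

# Assumed request body: POST, no documentStandard filter, sorted by identifier.
BODY = """<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    service="CSW" version="2.0.2" resultType="results"
    startPosition="{start}" maxRecords="{page_size}"
    outputSchema="http://www.w3.org/ns/dcat#">
  <csw:Query typeNames="gmd:MD_Metadata">
    <csw:ElementSetName>full</csw:ElementSetName>
    <ogc:SortBy xmlns:ogc="http://www.opengis.net/ogc">
      <ogc:SortProperty>
        <ogc:PropertyName>identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
  </csw:Query>
</csw:GetRecords>"""


def next_start(response_xml):
    """Return the next startPosition, or None once the last page was reached."""
    root = ET.fromstring(response_xml)
    results = root.find(f"{{{CSW_NS}}}SearchResults")
    if results is None:
        return None
    next_record = int(results.get("nextRecord", "0"))
    matched = int(results.get("numberOfRecordsMatched", "0"))
    # Servers signal the end with 0 or with a position past the last record.
    if next_record == 0 or next_record > matched:
        return None
    return next_record


def harvest_pages(fetch):
    """Yield raw response pages; `fetch` is a caller-supplied callable (body -> response text)."""
    start = 1
    while start is not None:
        page = fetch(BODY.format(start=start, page_size=PAGE_SIZE))
        yield page
        start = next_start(page)
```

In a real harvester, `fetch` would wrap something like `requests.post(url, data=body, headers=headers).text`, as in the snippet under review; keeping it injectable makes the paging logic testable without a live catalogue.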

@maudetes maudetes marked this pull request as ready for review March 17, 2023 16:35
It only adds a toggle between default and GeoDCAT-AP.
We may want to make it entirely configurable later if other options make sense.
However, only HarvestFeature and HarvestFilter exist for now.
GeoDCAT-AP schema mapping is not working yet with our harvester.
@maudetes maudetes changed the title WIP Support csw dcat harvest Support csw dcat harvest Oct 19, 2023
@maudetes maudetes merged commit 23ea32b into opendatateam:master Oct 19, 2023
1 check passed
@maudetes maudetes deleted the feat/support-csw-dcat-harvest branch October 19, 2023 13:04