Support csw dcat harvest #2800

maudetes · 2023-01-13T10:12:15Z

A proposed solution for datagouv/data.gouv.fr#913.

Harvest using the DCAT export served by the CSW endpoint, which is available in GeoNetwork.

This CSW backend is actually based on the DCAT backend with a different parse_graph function.
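For readers unfamiliar with CSW paging, here is a minimal sketch of how a harvester might walk a paginated GetRecords response. It is not the code from this PR: `iter_csw_pages` and the `fetch_page` callable are hypothetical names, and the sketch only assumes the standard CSW 2.0.2 `SearchResults` attributes (`numberOfRecordsMatched`, `numberOfRecordsReturned`, `nextRecord`).

```python
import xml.etree.ElementTree as ET

# CSW 2.0.2 namespace, given explicitly rather than guessed from the response.
CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"


def iter_csw_pages(fetch_page, max_items=None):
    """Yield (start, SearchResults element) for each page of results.

    fetch_page(start) is a hypothetical callable returning the raw XML of a
    GetRecords response that begins at record number `start`.
    """
    start = 1
    seen = 0
    while True:
        root = ET.fromstring(fetch_page(start))
        results = root.find(f".//{{{CSW_NS}}}SearchResults")
        if results is None:
            break  # no SearchResults node, e.g. an exception report
        returned = int(results.get("numberOfRecordsReturned", "0"))
        if returned == 0:
            break  # empty page: stop rather than loop forever
        yield start, results
        seen += returned
        next_record = int(results.get("nextRecord", "0"))
        matched = int(results.get("numberOfRecordsMatched", "0"))
        # nextRecord is 0 once the last page has been served; the other two
        # checks guard against servers that repeat or overshoot the position.
        if next_record == 0 or next_record <= start or next_record > matched:
            break
        if max_items is not None and seen >= max_items:
            break
        start = next_record
```

Each yielded `SearchResults` element would then be handed to whatever graph-parsing step the backend uses. Passing the fetcher in as a callable keeps the paging logic testable without network access.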

maudetes · 2023-03-03T15:40:36Z

udata/harvest/backends/dcat.py

+                </csw:GetRecords>'''
+        headers = {"Content-Type": "application/xml"}
+
+        content = requests.post(url, data=body.format(start=1), headers=headers).text


Which would be better: making a POST with our crafted body, or expecting a complete URL and making a GET query on it?

I would say a POST with our body would make it easier on the user side, but it may prevent the user from adding custom params to the targeted endpoint.

fxprunayre · 2023-03-08

The CSW spec allows GET and/or POST, and the supported methods are advertised in the GetCapabilities response:

    <ows:Operation name="GetRecords">
      <ows:DCP>
        <ows:HTTP>
          <ows:Get xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>
          <ows:Post xlink:href="https://apps.titellus.net/geonetwork/srv/eng/csw"/>

but most implementations probably support both.
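As an illustration of reading that advertisement, the following sketch extracts the GetRecords GET and POST endpoints from a GetCapabilities document. The function name `get_records_endpoints` and the sample document are hypothetical, and the sketch assumes the OWS namespace used by CSW 2.0.2 (`http://www.opengis.net/ows`); other service versions may use a different one.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces for a CSW 2.0.2 capabilities document.
OWS_NS = "http://www.opengis.net/ows"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical, trimmed-down capabilities document for illustration only.
SAMPLE = """<csw:Capabilities xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ows="http://www.opengis.net/ows"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <ows:OperationsMetadata>
    <ows:Operation name="GetRecords">
      <ows:DCP><ows:HTTP>
        <ows:Get xlink:href="https://example.org/csw"/>
        <ows:Post xlink:href="https://example.org/csw"/>
      </ows:HTTP></ows:DCP>
    </ows:Operation>
  </ows:OperationsMetadata>
</csw:Capabilities>"""


def get_records_endpoints(capabilities_xml):
    """Return a dict mapping "Get"/"Post" to the advertised GetRecords URL."""
    root = ET.fromstring(capabilities_xml)
    endpoints = {}
    for operation in root.iter(f"{{{OWS_NS}}}Operation"):
        if operation.get("name") != "GetRecords":
            continue
        for method in ("Get", "Post"):
            node = operation.find(f".//{{{OWS_NS}}}{method}")
            if node is not None:
                endpoints[method] = node.get(f"{{{XLINK_NS}}}href")
    return endpoints


print(get_records_endpoints(SAMPLE))
```

A harvester could use such a check to fall back to GET when no POST endpoint is advertised, although this PR does not do so.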

documentStandard is quite specific to GeoNetwork. Here it is used because we want to request only metadata records in a standard that provides a mapping to DCAT (currently ISO19139 and ISO19115-3), and to avoid failures on catalogues containing records in other standards (e.g. Dublin Core or ISO19110). No criterion is needed if the target catalogue contains only ISO19139.

You can probably increase the page size: we have been using 200 records per page for harvesting INSPIRE member states with no issue.

It can be good to add a sort-by clause, to avoid records appearing on several pages if index changes occur during harvesting:

    <ogc:SortBy xmlns:ogc="http://www.opengis.net/ogc">
      <ogc:SortProperty>
        <ogc:PropertyName>identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
    </csw:Query>

About the documentStandard filter: you should probably remove it on your side and ask GeoNetwork users to set up a dedicated portal on their catalogue, so that they filter on their side the records they want harvested (they can filter on documentStandard if they have metadata records that do not support mapping to DCAT, which should not happen often).

maudetes · 2023-03-17

Thank you for the inputs!

We're sticking with the POST request for a better user experience. We'll mention it in our harvesting doc (along with an example of the expected endpoint to provide).

I've removed documentStandard. We should probably also mention in our documentation the need to expose a catalog whose records support mapping to DCAT!

I've increased the page size to 200 and added the sort clause.
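To make the resulting request concrete, here is a rough sketch of a GetRecords body with a page size of 200 and the sort clause. It is not the body from this PR: the template, the `build_get_records_body` helper, the `csw:Record` type name, the `full` element set and the `http://www.w3.org/ns/dcat#` output schema are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical request template; attribute values other than the page size
# and the sort clause are assumptions, not taken from the PR.
GET_RECORDS_TEMPLATE = """<csw:GetRecords
    xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    service="CSW" version="2.0.2" resultType="results"
    startPosition="{start}" maxRecords="{page_size}"
    outputSchema="{output_schema}">
  <csw:Query typeNames="csw:Record">
    <csw:ElementSetName>full</csw:ElementSetName>
    <ogc:SortBy xmlns:ogc="http://www.opengis.net/ogc">
      <ogc:SortProperty>
        <ogc:PropertyName>identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
  </csw:Query>
</csw:GetRecords>"""


def build_get_records_body(start=1, page_size=200,
                           output_schema="http://www.w3.org/ns/dcat#"):
    """Fill in the paging parameters of the GetRecords request body."""
    return GET_RECORDS_TEMPLATE.format(
        start=start, page_size=page_size, output_schema=output_schema
    )


# Parse the generated body to check that it is well-formed XML.
body = build_get_records_body(start=201)
root = ET.fromstring(body)
print(root.get("startPosition"), root.get("maxRecords"))
```

Such a body would be sent with `requests.post(url, data=body, headers={"Content-Type": "application/xml"})`, as in the diff excerpt above. Sorting on a stable property like `identifier` keeps page boundaries consistent across requests.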

maudetes added 7 commits October 24, 2022 18:24

Test parsing CSW graph

8e28bd7

Merge branch 'master' into feat/support-csw-dcat-harvest

c1d6f50

More work in progress

89079e1

Fix CSW pagination

3633007

Add dedicated CSW backend

9cc93ad

Remove temporary verify=False param in requests

7156205

Add TODO on the node to parse in CSW response

be50028

maudetes marked this pull request as draft January 13, 2023 10:12

maudetes added 4 commits January 13, 2023 11:16

Merge branch 'master' into feat/support-csw-dcat-harvest

0dbfc2d

Find SearchResults in tree

02b5624

Merge branch 'master' into feat/support-csw-dcat-harvest

0f2bf14

Comment out on identifier missing ValueError

f0e736c

maudetes mentioned this pull request Jan 25, 2023

Support DCAT consumption from a GeoNetwork CSW call datagouv/data.gouv.fr#913

Closed

Merge branch 'master' into feat/support-csw-dcat-harvest

7617e66

maudetes mentioned this pull request Feb 6, 2023

Handle harvesting of large catalogs via DCAT (CSW-DCAT or XSLT) datagouv/data.gouv.fr#1046

Closed

maudetes added 4 commits February 23, 2023 19:38

Add max items limit in CSW harvest

dcac080

Merge branch 'master' into feat/support-csw-dcat-harvest

7b60d7f

Clean up a bit

ef27113

Add tests on paginated CSW GeoNetwork v4

23ab5c3

maudetes requested a review from quaxsze March 6, 2023 13:16

Merge branch 'master' into feat/support-csw-dcat-harvest

382772a

maudetes commented Mar 7, 2023

udata/harvest/backends/dcat.py

maudetes added 7 commits March 8, 2023 19:07

Improve query based on review comments

496ee8c

Improve break conditions to deal with edge cases

e573378

Merge branch 'master' into feat/support-csw-dcat-harvest

e324779

Specify CSW namespace instead of guessing

70dd0b3

Merge branch 'master' into feat/support-csw-dcat-harvest

966c2e9

Mock better SearchResults pagination value

74ff3e3

Update changelog

3a7992f

maudetes marked this pull request as ready for review March 17, 2023 16:35

maudetes added 7 commits March 29, 2023 09:40

Merge branch 'master' into feat/support-csw-dcat-harvest

f3c8793

Merge branch 'master' into feat/support-csw-dcat-harvest

b1f2d86

Merge branch 'master' into feat/support-csw-dcat-harvest

7b48b8e

Add support for CswDcatBackend in parse_url command

b409af8

Merge branch 'master' into feat/support-csw-dcat-harvest

26cbbfc

Merge branch 'master' into feat/support-csw-dcat-harvest

8d9539f

Merge branch 'master' into feat/support-csw-dcat-harvest

f2fa89a

quaxsze approved these changes Jul 11, 2023

maudetes added 6 commits July 19, 2023 14:31

Merge branch 'master' into feat/support-csw-dcat-harvest

d1117ee

Merge branch 'master' into feat/support-csw-dcat-harvest

dd98a8b

Add a schema feature on CswDcatBackend

f44c56e

It only adds a toggle between the default schema and GeoDCAT-AP. We may want to make it entirely configurable later if other schemas make sense. However, only HarvestFeature and HarvestFilter exist for now.

Merge branch 'master' into feat/support-csw-dcat-harvest

29e3021

Remove configurable schema feature for now

e9fd92b

GeoDCAT-AP schema mapping is not working yet with our harvester

Update changelog

7be2386

maudetes changed the title ~~WIP Support csw dcat harvest~~ Support csw dcat harvest Oct 19, 2023

maudetes mentioned this pull request Oct 19, 2023

Add mention of csw-dcat etalab/doc.data.gouv.fr#132

Merged

Merge branch 'master' into feat/support-csw-dcat-harvest

8ebe100

maudetes merged commit 23ea32b into opendatateam:master Oct 19, 2023
1 check passed

maudetes deleted the feat/support-csw-dcat-harvest branch October 19, 2023 13:04
