Integration of different data formats #24

fretchen · 2024-03-22T08:18:01Z

A number of different data formats exist, and ideally we should find a way to streamline them. One issue with the xlsx format was raised by @goergen95 in #21:

The provisions of the ToR are a burden to technically adept partners. For partners with elaborate GIS systems there are better formats/procedures for exchanging geospatial information. I would be against KfW making it mandatory for them to deliver their data in sub-optimal and proprietary data formats.

We have started to look into this in #17 and have first conversion tools in #18. However, it is not yet clear how we can bring all of these ideas together...

fretchen · 2024-03-22T09:35:39Z

Also has to go into #10

fretchen · 2024-06-24T17:46:46Z

We now have the specifications for the template fixed within the technical notes. To make progress on this, we could use these notes and translate them into a JSON Schema. The advantages of JSON Schema over the markdown format:

It is directly machine readable.
It can be used as an input for the generation of markdown files.
It has very broad support across different technological tools.
It can be used to directly create web forms.
It plays nicely with python / pydantic, which is under the hood of geonode.
It is designed for complex data formats. Hence it could allow us to provide a technical bridge to more advanced systems as suggested in Redundante und aufwändige Datenstrukturen bei GeoDaten-Erhebungen #10, Open Data Kit Template files for Project Location Data #18 and Maja4 dev patch 1 Sample Geodata ToR #21.
Any technical implementation could then be automatically tested for compatibility with the JSON Schema, leading to much more flexibility and robustness.
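
As a rough illustration of the first two points, a schema held as plain data can be rendered into a markdown table with a few lines of Python. This is only a sketch: the field names (`project_id`, `latitude`, `longitude`, `comment`) and the helper `schema_to_markdown` are hypothetical and not taken from the technical notes.

```python
import json

# Hypothetical excerpt of a project-location schema; the real field
# names would come from the technical notes.
SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "ProjectLocation",
    "type": "object",
    "required": ["project_id", "latitude", "longitude"],
    "properties": {
        "project_id": {"type": "string", "description": "Project identifier"},
        "latitude": {"type": "number", "minimum": -90, "maximum": 90,
                     "description": "Latitude in decimal degrees"},
        "longitude": {"type": "number", "minimum": -180, "maximum": 180,
                      "description": "Longitude in decimal degrees"},
        "comment": {"type": "string", "description": "Free-text remark"},
    },
}


def schema_to_markdown(schema: dict) -> str:
    """Render the top-level properties of a schema as a markdown table."""
    required = set(schema.get("required", []))
    lines = [
        f"# {schema.get('title', 'Schema')}",
        "",
        "| Field | Type | Required | Description |",
        "| --- | --- | --- | --- |",
    ]
    for name, spec in schema.get("properties", {}).items():
        flag = "yes" if name in required else "no"
        lines.append(
            f"| {name} | {spec.get('type', '')} | {flag} | {spec.get('description', '')} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(schema_to_markdown(SCHEMA))
    # The same dictionary serializes to a JSON Schema file without changes.
    print(json.dumps(SCHEMA, indent=2))
```

The markdown documentation would then be a generated artifact, with the schema as the single source of truth.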

Alternatives

I currently do not see any good alternatives. Possible options would be:

xlsx

Hard to read for machines.
Hard to implement complex data structures.
Not really a very open standard.
Not very broadly used as reference.
Better suited as a technical implementation than as a reference.

markdown

Not machine readable and hence not able to serve as a technical basis for validation.

uml

This is a fairly abstract format which does not allow for validation.
So it belongs more to the documentation level and allows for less precise implementations.

direct technical implementations

Keeping them compatible requires a common reference / language. This should be digestible across technical implementations. Hence, JSON Schema.

I would propose this as a first step to see where we can go with this. Comments @Jo-Schie, @Maja4Dev or @goergen95 before I start simple first attempts in this direction?

Jo-Schie · 2024-06-25T12:29:48Z

Hi Fred. I think this is an awesome idea. From what you listed above I also think that json or even geojson could be a good approach. I will try to ask the people from the Geonode project also for their opinion. Will report back as soon as I have an answer.

goergen95 · 2024-06-26T06:22:04Z

Hi @fretchen, totally agree with your summary here. Please note that translating the specification to JSON Schema can only be a first step. To make it useful, I think we would also need to provide some tooling for conversion and validation. As a first step, conversion could go e.g. from Excel/CSV -> JSON/GeoJSON, which could then be validated. There is also the recent fiboa project, which provides prior art for designing an extensible and modular data specification, including geospatial information, based on JSON Schema.
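
A minimal sketch of the CSV -> GeoJSON step, using only the Python standard library. The column names (`project_id`, `latitude`, `longitude`, `comment`), the sample rows, and the helper `csv_to_geojson` are assumptions made for illustration; the real columns would follow the template in the technical notes.

```python
import csv
import io
import json


def csv_to_geojson(csv_text: str) -> dict:
    """Convert CSV rows with latitude/longitude columns to a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Pop the coordinate columns; everything else becomes a property.
        lon = float(row.pop("longitude"))
        lat = float(row.pop("latitude"))
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample rows, not real project data.
SAMPLE_CSV = """project_id,latitude,longitude,comment
P-001,-1.2921,36.8219,office
P-002,9.0054,38.7636,field site
"""

if __name__ == "__main__":
    print(json.dumps(csv_to_geojson(SAMPLE_CSV), indent=2))
```

The resulting dictionary could then be passed to a validator, so the same check applies regardless of the format the partner originally delivered.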

goergen95 · 2024-06-26T06:30:34Z

Also, I don't think the title of this issue applies if we are now targeting JSON Schema. JSON Schema specifies what data should look like; it is not data itself. The specification is not the same as an implementation. And, as I said in other comments, I do not think we want to force third parties into concrete implementations. Instead, we should aim to offer a specification, tooling for some conversions, and, most importantly, validation.

fretchen · 2024-06-26T16:01:52Z

Sounds good to me and I have nothing to add. When I find time to create a JSON Schema, I will open a separate issue / PR so that we can work our way through the todo list. The way I see the todo list right now is:

get started with a machine readable reference (i.e. the json schema for the moment)
use the reference for validation of certain file formats (xls or csv)
make this validation part of automatic testing!
use the reference for validation of converters that are always tested as part of PRs...

fretchen · 2024-08-12T10:03:59Z

@goergen95 made the following argument for YAML in #76:

as a human, I find JavaScript hard to write and most of the time I make a lot of mistakes with the brackets etc. when I am trying. Since we should aim for everyone to easily participate in the maintenance of the specification, we could think about moving it to YAML and create JSON Schema automatically from there?

I personally support the idea in general. However, there are a few technical points to be respected, which might actually create quite some work:

Automatic conversion from YAML to JSON Schema: I have not found a tool that does this directly.

Introduction of yet another file format: From what I have seen, YAML cannot make JSON Schemas completely obsolete, so we will typically have them in our tool-chain whether or not YAML is used. So we really have to see whether the ease of working with YAML outweighs the need for extra conversion tools.

If my understanding is right, we might consider these points after we have decided to merge #76.

goergen95 · 2024-08-12T10:54:43Z

In Python, both file formats are internally represented as dictionaries. So you can easily use the JSON Schema vocabulary in YAML and validate with jsonschema. That means you could maintain the schema within this repo in YAML, convert incoming data to a dictionary, and validate on the fly with jsonschema without ever writing the schema to JSON.
However, one might want to serialize to JSON via GitHub Actions to distribute the schema as JSON, which can be achieved via json.dump().

Here is a small example (incoming data does not have to be in YAML, but needs to be converted to a dictionary):

import json
import urllib.request as req

import yaml
from jsonschema.validators import Draft202012Validator

schema_yaml = "schema.yaml"
data_yaml = "data.yaml"

# Download the example schema and a matching example config.
req.urlretrieve("https://raw.githubusercontent.com/mapme-initiative/mapme.pipelines/main/inst/schema.yaml", schema_yaml)
req.urlretrieve("https://raw.githubusercontent.com/mapme-initiative/mapme.pipelines/main/inst/config-example.yaml", data_yaml)

# Both files are loaded into plain Python dictionaries.
with open(schema_yaml, "r") as yaml_file:
    schema = yaml.safe_load(yaml_file)

with open(data_yaml, "r") as yaml_file:
    data = yaml.safe_load(yaml_file)

# Raises jsonschema.exceptions.ValidationError if the data does not conform.
validator = Draft202012Validator(schema)
validator.validate(data)

# Optionally serialize the schema to JSON for distribution.
with open("schema.json", "w") as json_file:
    json.dump(schema, json_file, indent=4)

fretchen mentioned this issue Mar 22, 2024

Maja4 dev patch 1 Sample Geodata ToR #21

Merged

fretchen mentioned this issue Jul 18, 2024

Create a json schema for the model #76

Closed
