Skip to content

Latest commit

 

History

History
143 lines (85 loc) · 6.47 KB

README.md

File metadata and controls

143 lines (85 loc) · 6.47 KB

LOD-aggregator (LODA)

Summary

An increasing number of heritage institutes are taking steps to publish their collection information as Linked Open Data (LOD), especially in Schema.org to increase the visibility in Google and other major search engines. To lower the barriers for the contributing data to Europeana we designed a basic pipeline, the LOD-aggregator, that harvests the published Linked Data and converts the Schema.org information to the Europeana Data Model (EDM) to make the ingest in the Europeana harvesting platform possible. This pipeline was build to demonstrate the feasibility for this approach and to prove that production ready data can be provided this way.

See the following specifications for more background information:

This software was developped as part of the Europeana Common Culture project. Main development and testing was done by Europeana R&D and the Dutch Digital Heritage Network (NDE).

Installation

This tool requires Docker-compose to be configured on your system. Please visit Docker-compose documentation for details about the installation of Docker and Docker-compose.

The tools are run using a command shell. Please make sure your system is able to run commands from a shell. The software is testend on Linux Ubuntu but should run on any Linux system or compatible environment.

Use the following command or your favorite Git tool to clone the repo to your local environment

git clone https://github.com/netwerk-digitaal-erfgoed/lod-aggregator.git

Enter the newly created directory lod-aggregator and run all commands from here.

cd lod-aggregator

Set the defaults values for your local installation using a .env file:

cp env.dist .env
# use your favorite editor to set VAR_PROVIDER to the appropriate value in `.env`

The tools expect input files to be in ./data, shape files to be in ./shapes and query files to be in ./queries. Use the environment variables in .env to change these defaults.

After cloning the repository only a build of the crawl service is required. Use the following command:

docker-compose build --no-cache crawl

Important notes:

  • Europeana requires an identification label of the institution that runs the LOD aggregation service. This can be specified in the .env file with the VAR_PROVIDER variable or set with the --provider parameter during runtime.

  • Run the following command each time you start a new session:

    source bin/setpath

    This will add ./bin path to your $PATH so you can run the commands without prefixing them.

No further configuration is needed. See docker-compose.yml and the starter.sh script for more details on how the crawler and JENA tools (sparql and shacl) are being called in more detail.

Please report problems or other feedback through the Github Issues function or send an email to enno.meijers at kb.nl.

General workflow

For a generic harvesting process the following tasks should be performed:

  1. run the crawl service to harvest the data described by a dataset description

  2. run the map service to convert the harvested data from Schema.org to EDM

  3. run the validate service to validate the generated EDM data

  4. run the convert service to prepare the data for ingesting into Europeana

Optional steps:

  • run the crawl and validate service to download and check only the dataset description
  • run the serialize service to transform the RDF from one serialization (N-Triples, RDF/XML, Turtle) into another

Documented test runs

In order to demonstrate the use of these tools with real world LOD data a number of test cases have been documented. See the tests directory for more information.

Running the crawler

Run the crawler using the following command:

crawl.sh --dataset-uri {dataset URI} --output {output filename} [--description_only]

To download only the dataset description found at the URI of the dataset use the option --description-only.

Check the crawler.log logfile in the data dir for the results of the crawl proces. Both progress and error information can be found here.

Running the mapper

Run the mapper tool to convert the downloaded Linked Data into an output format using a SPARQL CONSTRUCT query. The default configuration is based on converting data in schema.org format to EDM. The generic sparql query in schema2edm.rq takes care of this, see the queries dir for more information.

map.sh --data {data file} --output {output file} \
  [ --query {query file} ] \
  [ --format {serialization format} ] \
  [ --provider { provider name } ]

The default query file used is schema2edm.rq. For fixing input data or mapping from other formats to EDM you can provide your own sparql construct query.

The default serialization is RDF/XML as this is the preferred format for delivery to Europeana. This can be overruled with --format <format> option. See the starter.sh script for the available serialization formats.

The provider option sets the edm:provider property, if not specified it is derived from the VAR_PROVIDER variable set in .env.

Running the validator

Run the validator using the following command:

validate.sh --data {data file} [ --shape {shape file} ]

The default shape file is shacl_edm.ttl and checks the data according to the EDM specifications.

The result of the validation is written to errors.txt.

Prepare the data for delivery to Europeana

The Europeana import proces is XML based so the data also needs to comply with certain XML constraints. The following command takes care of this. The result is a zipfile containing seperate records for each resource.

convert.sh --data {input file} --output {output file`.zip`}

Because the output file is a zipfile the extension should be set to .zip.

Converting RDF data into different serializations

For debugging and testing it can be helpful to convert RDF into other serialization formats using this command:

serialize.sh --data {input file} --format {RDF format} --output {output file}

See the starter.sh script for a full list of format options.