Advanced resource entity implementation
Elaborated on #184:
Summary
This describes resource provenance using two attributes that relate the ORI resource to the original resource, so that the relation can be used as metadata.
- Internally we use canonical_id and canonical_iri, which can be serialized and made public in a different form.
- entity has been replaced by canonical_id and canonical_iri, see history.
- Both canonical_id and canonical_iri may be specified in the same resource.
- canonical_id can appear in the used_file to designate a subsection if it contains multiple nested resources.
- canonical_iri should designate as closely as possible which resource was used. In the simplest case this is the URL of the resource that was retrieved. However, if that URL contains multiple resources, we can 'guess' what the direct URL to the resource would be.
Description
Different resources can be derived from one entity, e.g. a meeting has multiple nested documents. These documents may be resolvable by their own URL, but the original source of the resource is still the same as that of its parent, since the parent is our actual source. (Resources that can be resolved, i.e. that have their own identifier, should use that URL instead; see the comment below.)
If possible, it should be identified with a URL, including scheme and query parameters, like this. It should represent the supplier's resource as they specify it; it should include a scheme (https:// by default) but no additional parameters. If the supplier does not specify it but we can assume the resource exists, we can construct the more specific URL ourselves. This makes them IRIs, which are most often URLs. This implies that we cannot assume they always resolve.
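The default-scheme rule above can be sketched as a small helper. This is only an illustration: the function name `with_default_scheme` is hypothetical, and the example value is the supplier URL quoted further down in this issue.

```python
import re

# Matches a leading URI scheme such as "https://" or "ftp://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def with_default_scheme(iri: str, default_scheme: str = "https") -> str:
    """Return `iri` unchanged if it already has a scheme, else prefix the default."""
    if _SCHEME_RE.match(iri):
        return iri
    # Drop any leading slashes (e.g. "//host/path") before adding the scheme.
    return f"{default_scheme}://{iri.lstrip('/')}"


canonical_iri = with_default_scheme("api.notubiz.nl/document/780972")
```

Nothing here checks that the resulting IRI resolves, in line with the note that resolvability cannot be assumed.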
The canonical creates the bridge between the mapping IRI and the supplier's resource. In SOAP it is not possible to use URLs to identify a specific resource; in that case we have no more information than the identifier itself, so we use canonical_id, which would be something like '8984124'. The used_file would be the URL to our cached version of the SOAP response. In a later iteration we can use URL fragments to designate the identifier within the context of the cached version (this proves to be a problem with Google Storage document revision). We use separate canonical_id and canonical_iri fields since we need to serialize them as different attributes.
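As a rough sketch of the SOAP case described above, the three fields could be carried together in one record. The field names come from this issue; the `Provenance` class, the `to_attributes` helper and the cache URL are hypothetical and only serve the illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Provenance:
    used_file: str                       # URL of our cached copy of the source
    canonical_id: Optional[str] = None   # identifier within the cached file
    canonical_iri: Optional[str] = None  # supplier's own IRI, if one exists


def to_attributes(provenance: Provenance) -> dict:
    """Keep only the attributes that are set, so each serializes separately."""
    return {key: value for key, value in asdict(provenance).items() if value is not None}


# SOAP item: no URL of its own, so only canonical_id is set.
soap_item = Provenance(
    used_file="https://cache.example.org/soap/response.xml",  # hypothetical cache URL
    canonical_id="8984124",
)
attributes = to_attributes(soap_item)
```

Keeping the two canonical fields separate, as optional values, also allows both to be set on the same resource.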
Some considerations:
- When a subresource has its own URL, canonical_iri is used to specify it. There is no direct relation between canonical_iri and used_file: the canonical refers to the specific resource, while the used_file should be the cached version of the resource's parent.
- When a subresource doesn't have its own URL, canonical_id is used to designate the subresource within the resource. There is a direct relation between canonical_id and used_file, since the id will always be in the scope of the cached file.
- A downloadable document has a schema:contentUrl to the resolver, so used_file shouldn't refer to the same cache URL. Instead it should refer to the file where the URL to the document was originally specified. Also, schema:isBasedOn set by the enricher refers to the document's original download URL. Canonical should refer to the same URL, except when the following applies:
  - Some suppliers distinguish between a document resource URL and a document download URL. If this is the case, canonical_iri should be the resource URL and schema:isBasedOn should be the download URL.
  - Note that for canonical_iri the document resource URL is specified here as "self": "api.notubiz.nl/document/780972", with the ?format=json&version=1.10.8. If possible, we want to give the user as much information as we can about how to find the actual resource we used, so it will include at least the version query parameter, and it would be wise to include format as well. Sensitive query parameters like authentication should be left out.
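The query-parameter rule from the last consideration could look roughly like the sketch below. The helper name `strip_sensitive_params`, the set of parameter names treated as sensitive, and the `token` parameter in the example are assumptions for illustration; the issue does not say which parameters suppliers actually use for authentication.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names; adjust to whatever the supplier uses for authentication.
SENSITIVE_PARAMS = {"token", "api_key", "password"}


def strip_sensitive_params(url: str) -> str:
    """Drop sensitive query parameters but keep others such as version and format."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


retrieved_url = "https://api.notubiz.nl/document/780972?format=json&version=1.10.8&token=secret"
canonical_iri = strip_sensitive_params(retrieved_url)
```

A deny-list is used here so that unknown but harmless parameters survive; an allow-list limited to version and format would be the stricter alternative.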