Madoc Search Service API

The Madoc search service is an IIIF-aware wrapper around PostgreSQL, written as a Django REST Framework application. By default, the application indexes incoming data as fulltext, using PostgreSQL's fulltext search capabilities, and uses these in query parsing. Additional behaviour is available via the APIs below.

POSTing data

The service accepts a JSON POST request at iiif/ with headers:

{"Content-Type": "application/json", "Accept": "application/json"}

The payload format is:

{
  "contexts": 
    [
       {
          "id": "urn:madoc:site:2", 
          "type": "Site"
        },
       {
          "id": "https://example.org/collections/foo",
          "type": "Collection"
         }
     ],
  "resource": { },
  "id": "urn:madoc:manifest:foo",
  "thumbnail": "http://madoc.foo/thumbnail/foo/fake.jpg"
}
  • contexts: List of contexts with identifier and type
  • resource: the JSON for the IIIF resource
  • id: the identifier (in Madoc) for the resource
  • thumbnail: URL for the thumbnail (in Madoc) for the resource
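
As a minimal sketch, the payload above could be assembled and wrapped in a request using only the Python standard library. The BASE_URL value and the build_ingest_request helper are assumptions made for illustration; they are not part of the service.

```python
import json
from urllib.request import Request

# Hypothetical deployment URL; substitute the real location of the service.
BASE_URL = "http://localhost:8000/"


def build_ingest_request(madoc_id, resource, contexts, thumbnail):
    """Build (without sending) the POST request for the iiif/ endpoint."""
    payload = {
        "contexts": contexts,    # list of {"id": ..., "type": ...} objects
        "resource": resource,    # the JSON for the IIIF resource
        "id": madoc_id,          # the Madoc identifier for the resource
        "thumbnail": thumbnail,  # the Madoc thumbnail URL
    }
    return Request(
        BASE_URL + "iiif/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


request = build_ingest_request(
    madoc_id="urn:madoc:manifest:foo",
    resource={},  # placeholder, as in the payload above
    contexts=[{"id": "urn:madoc:site:2", "type": "Site"}],
    thumbnail="http://madoc.foo/thumbnail/foo/fake.jpg",
)
# urllib.request.urlopen(request) would actually send it.
```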

Simple GET Query API

The search API is at search/.

The basic query parameters are as follows:

  • ?fulltext=foo : search against the preindexed fulltext (term vectors) for foo
  • ?contexts__id=urn:madoc:site:2: filter to just objects with that context (N.B. that's a double underscore)
  • ?madoc_id=urn:madoc:manifest:saa-AA0428a: filter to just hits within that one object (N.B. that's a single underscore)

In principle, filters can easily be added for any other fields in the IIIF resource that make sense to filter on. They need to be added to a list of filterset_fields, but other than that, the code is already in place.

Querying for a facet:

  • ?facet_type=metadata&facet_subtype=material&facet_value=paper&fulltext=Abbey: filter the existing search results (for "Abbey") on metadata, where the metadata field "material" equals "paper"
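
The GET parameters above can be combined and URL-encoded programmatically. This is only a sketch: SEARCH_URL and build_search_url are hypothetical names, and the host is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical deployment URL; substitute the real location of search/.
SEARCH_URL = "http://localhost:8000/search/"


def build_search_url(**params):
    """Return a GET search URL, skipping parameters that are None."""
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{SEARCH_URL}?{query}"


# Fulltext search for "Abbey", filtered on the metadata field "material" = "paper".
facet_url = build_search_url(
    fulltext="Abbey",
    facet_type="metadata",
    facet_subtype="material",
    facet_value="paper",
)

# Note the double underscore in contexts__id.
context_url = build_search_url(fulltext="foo", contexts__id="urn:madoc:site:2")
```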

Get a list of facets

POST to /api/search/facets

{}

Will return a list of facets grouped by type for all contexts.

POST to /api/search/facets

{"contexts":["https://iiif.ub.uni-leipzig.de/static/collections/Drucke17/collection.json"]}

Will return just the facet fields for objects within that list of contexts.

The API accepts an optional list of facet types, e.g.

{
"contexts":["https://iiif.ub.uni-leipzig.de/static/collections/Drucke17/collection.json"],
"facet_types":["metadata","descriptive"]}

The facets will be returned grouped by type, e.g.

{
	"descriptive": ["attribution", "label"],
	"metadata": ["about", "alternate title", "attribution", "author", "author(s)", "call number", "century", "collection", "collection name", "comment", "date", "date added", "date of origin (english)", "date of publication", "dated", "description", "digitization project", "digitization sponsor", "digitized by", "dimensions", "disclaimers", "document type", "doi", "format", "full conditions of use", "full title", "holding institution", "katalogeintrag", "kitodo", "language", "liturgica christiana", "location", "manifest type", "material", "materials", "number of pages", "online since", "owner", "part of", "part reference", "persons", "physical description", "place of origin", "place of origin (english)", "place of publication", "provenance", "publication date", "publisher", "record created", "related", "repository", "rights/license", "series", "shelfmark", "source", "source ppn (swb)", "sponsored by", "summary", "summary (english)", "text language", "title", "title (english)", "topic", "urn", "vd16", "vd17"]
}

If no type is provided, the API will default to just ["metadata"].

Autocomplete against a facet

This provides a list of values for a specific facet field that can be used to populate an autocomplete, and can be constrained to a particular context.

Example 1:

All of the publishers starting with "g", for objects within a given context.

POST to /api/search/autocomplete

{
  "contexts": [
    "https://iiif.ub.uni-leipzig.de/static/collections/Drucke17/collection.json"
  ],
  "autocomplete_type": "metadata",
  "autocomplete_subtype": "publisher",
  "autocomplete_query": "g"
}
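
A small helper along these lines could build the autocomplete body. The function name and its default of "metadata" for autocomplete_type are assumptions for illustration; the field names are those shown in the example above.

```python
import json


def autocomplete_body(subtype, query, contexts=None, autocomplete_type="metadata"):
    """Build the JSON body for an autocomplete request against one facet field."""
    body = {
        "autocomplete_type": autocomplete_type,
        "autocomplete_subtype": subtype,
        "autocomplete_query": query,
    }
    if contexts:
        # Only constrain to contexts when some are given.
        body["contexts"] = list(contexts)
    return body


body = autocomplete_body(
    "publisher",
    "g",
    contexts=["https://iiif.ub.uni-leipzig.de/static/collections/Drucke17/collection.json"],
)
print(json.dumps(body, indent=2))
```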

JSON Query API

The accepted fields are as follows:

  • fulltext: Optional This is the text you want to search for in the indexed fulltext.
  • search_language: Optional The language to use when constructing the query. This determines how the query parser stems and parses the text provided in the 'fulltext' parameter.
  • search_type: Optional This is the type of search to use in constructing the query, e.g. 'websearch', 'plaintext', 'raw' - see https://docs.djangoproject.com/en/3.1/ref/contrib/postgres/search/#searchquery
  • non_latin_fulltext: Optional If True, this signals to the API that non-Latin text (such as Chinese, Korean, Thai, etc.) can be queried using fulltext (for example when the Postgres instance has support for zhparser or similar). This should largely be left unset, as this support will typically NOT be present.
  • search_multiple_fields: Optional If True, this signals to the API that rather than parsing fulltext queries (using PostgreSQL ts_query) and matching against individual metadata fields, we should search across multiple fields for partial matches. Some support for stemming is lost, but more flexibility in matching is gained.
  • type: Optional Search against just textual data with this type, e.g. metadata (to search only metadata)
  • subtype: Optional Search against just textual data with this subtype, e.g. a specific metadata field.
  • date_exact: Optional Search for just objects with a start and end date that exactly match this value
  • date_start: Optional Search for just objects with a start date that is greater than or equal to this value
  • date_end: Optional Search for just objects with an end date that is less than or equal to this value
  • integer: Optional Search for an integer (this is an object with value and operator)
  • float: Optional Search for a float (this is an object with value and operator)
  • raw: Optional Provide an object with explicit "raw" query parameters.
  • language_display: Optional only search fields where the display language is, e.g. "english".
  • language_iso639_1: Optional only search fields where the iso639_1 language code is, e.g. "en"
  • language_iso639_2: Optional only search fields where the iso639_2 language code is, e.g. "eng"
  • iiif_identifiers: Optional an array of identifiers (these should be the @ids for the objects), the query will filter to just these objects before it runs any fulltext or facet filters.
  • madoc_identifiers: Optional an array of identifiers (these should be the madoc ids for the objects), the query will filter to just these objects before it runs any fulltext or facet filters.
  • contexts: Optional an array of identifiers (these should be the ids for the relevant site, project, collection, etc), the query will filter to just those objects associated with any of those contexts before it runs any fulltext or facet filters.
  • facet_fields: Optional an array of strings which represent the fields which will have facet counts provided in the result output. These are assumed to be labels in the resource metadata (and otherwise have no semantics).
  • facet_types: Optional an array of strings which represent the types of the indexables the facets will be generated from. Defaults to ["metadata"] but, for example, if you also wanted to facet on fields in the IIIF descriptive properties, you could use ["metadata", "descriptive"]
  • facets: Optional an array of facet queries (see below) which are applied as filters to the query output.
  • number_of_facets: Optional an integer which sets how many facets to return, defaults to 10. If effectively unlimited facets are required, provide an arbitrarily high integer.
  • ordering: Optional an object which specifies how the objects are ordered in the results. If not provided, results are ordered by rank.

Ordering has the following format:

  • type: The indexed text type, e.g. "metadata" or "descriptive", etc
  • subtype: The subtype, e.g. "place of publication", "label", etc.
  • value_for_sort: Optional the field in the Indexables model which provides the value for sorting
  • direction: Optional defaults to descending (choices are ascending or descending)

Example:

{
    "date_start": "1100",
    "date_end": "1202",
    "ordering": {"type": "descriptive", "subtype": "navDate", "direction": "ascending",
                 "value_for_sort": "indexable_date_range_start"}
}

Example:

{
    "date_start": "1100",
    "date_end": "1202",
    "ordering": {"type": "metadata", "subtype": "title", "direction": "descending"}
}
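
An ordering object can be built and checked client-side before it is sent. The build_ordering helper below is a hypothetical illustration; it only enforces the two documented choices for direction.

```python
def build_ordering(type_, subtype, direction="descending", value_for_sort=None):
    """Return an ordering object in the format described above."""
    if direction not in ("ascending", "descending"):
        raise ValueError("direction must be 'ascending' or 'descending'")
    ordering = {"type": type_, "subtype": subtype, "direction": direction}
    if value_for_sort is not None:
        # Name of the Indexables field that supplies the sort value.
        ordering["value_for_sort"] = value_for_sort
    return ordering


query = {
    "date_start": "1100",
    "date_end": "1202",
    "ordering": build_ordering(
        "descriptive", "navDate", "ascending", "indexable_date_range_start"
    ),
}
```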

Facet queries have the following format:

  • type: The indexed text type, e.g. "metadata" or "descriptive", etc
  • subtype: The subtype, e.g. "place of publication", "label", etc.
  • value: The value to match, e.g. "Berlin" (N.B. this matches only against the indexables field)
  • field_lookup: Optional The method to use when matching. This defaults to iexact (case insensitive exact match) but the query parser will accept any of the standard field_lookup types. N.B. this applies to all of type, subtype and value. See: https://docs.djangoproject.com/en/3.1/ref/models/querysets/#field-lookups
  • indexable_int: Optional Match against the int field
  • indexable_float: Optional Match against the float field
  • indexable_date_range_start: Optional Date
  • indexable_date_range_end: Optional Date

You can also facet against the indexable_int and indexable_float fields (a recent addition). Do not use value in these facets.

e.g.

{
...
  "facets": [
    {
      "type": "metadata",
      "subtype": "place of publication",
      "value": "Hamburg"
    },
      {
      "type": "metadata",
      "subtype": "weight_in_kg",
      "indexable_int": 50,
      "field_lookup": "gte"
    }
  ]
}
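
As a sketch, a facet query object can be built with a helper that keeps value and the numeric fields apart. The build_facet name is hypothetical, and the rule that exactly one of value, indexable_int or indexable_float is supplied is an assumption of this sketch (date-range facets are left out for brevity).

```python
def build_facet(type_, subtype, value=None, indexable_int=None,
                indexable_float=None, field_lookup=None):
    """Return a facet query matching on a text value OR a numeric field."""
    targets = {
        "value": value,
        "indexable_int": indexable_int,
        "indexable_float": indexable_float,
    }
    provided = {key: val for key, val in targets.items() if val is not None}
    if len(provided) != 1:
        # Assumption of this sketch: one match target per facet.
        raise ValueError("provide exactly one of value, indexable_int, indexable_float")
    facet = {"type": type_, "subtype": subtype, **provided}
    if field_lookup is not None:
        facet["field_lookup"] = field_lookup  # the service defaults to iexact
    return facet


facets = [
    build_facet("metadata", "place of publication", value="Hamburg"),
    build_facet("metadata", "weight_in_kg", indexable_int=50, field_lookup="gte"),
]
```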


N.B. types and subtypes have no semantics; they are just organising labels applied to incoming content. The ingest code will default to storing all data from the IIIF metadata block with: type = "metadata" and subtype = the label for the field.

Numeric queries (integers or floats) have the following format:

  • value: the value to match
  • operator: One of "exact", "lt", "gt", "lte", "gte"

e.g.

{
  "integer": {
    "value": 100,
    "operator": "gte"
  }
}

Raw queries allow you to pass in standard Django filters as an object/dict. These must target the indexables model.

The general form is:

{
	"raw": {
		"indexables__$FIELD__$FIELD_LOOKUP": "value"
	}
}

Where $FIELD is the field name in the Indexables model, and $FIELD_LOOKUP corresponds to one of the standard field lookup options in: https://docs.djangoproject.com/en/3.1/ref/models/querysets/#field-lookups

For example:

{
	"raw": {
		"indexables__subtype__iexact": "title",
        "indexables__original_content__icontains": "bible"
	}
}
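
The key pattern for raw queries can be generated rather than typed by hand. The helper names below are hypothetical; the sketch only formats keys and does not check that the field or lookup exists.

```python
def raw_filter(field, lookup, value):
    """Return one raw filter keyed as indexables__$FIELD__$FIELD_LOOKUP."""
    return {f"indexables__{field}__{lookup}": value}


def raw_query(*filters):
    """Merge several raw filters into a single {"raw": {...}} query body."""
    raw = {}
    for single_filter in filters:
        raw.update(single_filter)
    return {"raw": raw}


query = raw_query(
    raw_filter("subtype", "iexact", "title"),
    raw_filter("original_content", "icontains", "bible"),
)
```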

The query is constructed in the following order:

  1. prefilter: filter the list of objects being queried against to just those objects:
    • associated with one or more of the contexts if provided
    • whose madoc identifier is within the list of madoc_identifiers if provided
    • whose IIIF @id is within the list of iiif_identifiers if provided
  2. fulltext query: search (using a PostgreSQL textsearch tsquery) the indexed content associated with all of the IIIF objects (this is currently IIIF descriptive properties and all metadata 'fields' but will include fulltext and capture models).
    • type: only search text marked with this type. Current types are descriptive (content in the IIIF manifest), metadata (content in the IIIF metadata block). Additional types will be added.
    • subtype: field type for the search text, e.g. label to search just the IIIF labels; author to search just metadata where the label is author etc.
    • language_display: only search fields where the display language is, e.g. "english".
    • language_iso639_1: only search fields where the iso639_1 language code is, e.g. "en"
    • language_iso639_2: only search fields where the iso639_2 language code is, e.g. "eng"
  3. facet filter(s): apply the facet filters (which may be a mixture of ANDs and ORs, see examples below) to the result of the fulltext query.

Facet counts are calculated on the output of this query.
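
To make step 1 concrete, here is a toy in-memory version of the prefilter. It is not the service's code (which filters in the database): the dictionary keys madoc_id, iiif_id and contexts are invented for the example, and combining the three conditions with AND is an assumption.

```python
def prefilter(objects, contexts=None, madoc_identifiers=None, iiif_identifiers=None):
    """Keep only objects that pass every prefilter that was provided."""
    selected = []
    for obj in objects:
        # Must share at least one context with the requested list.
        if contexts and not set(obj["contexts"]) & set(contexts):
            continue
        if madoc_identifiers and obj["madoc_id"] not in madoc_identifiers:
            continue
        if iiif_identifiers and obj["iiif_id"] not in iiif_identifiers:
            continue
        selected.append(obj)
    return selected


objects = [
    {"madoc_id": "urn:madoc:manifest:a", "iiif_id": "https://example.org/a",
     "contexts": ["urn:madoc:site:2"]},
    {"madoc_id": "urn:madoc:manifest:b", "iiif_id": "https://example.org/b",
     "contexts": ["urn:madoc:site:3"]},
    {"madoc_id": "urn:madoc:manifest:c", "iiif_id": "https://example.org/c",
     "contexts": ["urn:madoc:site:2", "https://example.org/collections/foo"]},
]

in_site_2 = prefilter(objects, contexts=["urn:madoc:site:2"])
```

The fulltext query and facet filters would then run only against the objects that survive this step.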

Example 1

Fulltext search and a single facet.

GET

/api/search/search?fulltext=G%C3%B6tter&facet_type=metadata&facet_subtype=place%20of%20publication&facet_value=Hamburg&search_language=german

In this example, we are searching for the German plural of God, and faceting on objects where the metadata Place of Publication is Hamburg.

N.B. if we remove the search_language=german property, this will find zero hits, because although the text is indexed as German, the search query is not being stemmed/parsed using German fulltext rules.

POST

POST to /api/search/search

{
  "fulltext": "Götter",
  "search_language": "german",
  "contexts": [
    "urn:madoc:site:2"
  ],
  "facets": [
    {
      "type": "metadata",
      "subtype": "place of publication",
      "value": "Hamburg"
    }
  ]
}

In this example, we are also filtering to objects that appear in/are part of urn:madoc:site:2.

Example 2

Fulltext search and multiple facets with specified facet fields.

{
  "fulltext": "Abbey",
  "facet_fields": [
    "text language",
    "place of origin (english)",
    "collection name",
    "publisher",
    "persons",
    "material",
    "location"
  ],
  "contexts": [
    "urn:madoc:site:2"
  ],
  "facets": [
    {
      "type": "metadata",
      "subtype": "collection name",
      "value": "Staatsarchiv Aargau"
    },
    {
      "type": "metadata",
      "subtype": "material",
      "value": "paper"
    },
    {
      "type": "metadata",
      "subtype": "material",
      "value": "parchment"
    },
    {
      "type": "metadata",
      "subtype": "text language",
      "value": "German"
    }
  ]
}

In this instance we are searching for:

  1. Fulltext = "Abbey", AND
  2. context = urn:madoc:site:2, AND
  3. Collection Name = Staatsarchiv Aargau, AND
  4. Text Language = German, AND
  5. Material = (Paper OR Parchment)

The query parser automatically combines any two queries against the same metadata field as an OR; all other facets are applied as an AND.
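
The OR-within-a-field, AND-across-fields rule can be illustrated with a toy matcher over in-memory data. This is a sketch of the logic only, not the query parser itself: the matches function and the shape of the object dictionaries are invented, and lower-casing stands in for the default iexact lookup.

```python
from collections import defaultdict


def matches(obj_fields, facets):
    """obj_fields maps (type, subtype) to a list of values for one object."""
    wanted_by_field = defaultdict(set)
    for facet in facets:
        key = (facet["type"], facet["subtype"])
        wanted_by_field[key].add(facet["value"].lower())
    # AND across different fields, OR between values for the same field.
    return all(
        any(value.lower() in wanted for value in obj_fields.get(key, []))
        for key, wanted in wanted_by_field.items()
    )


facets = [
    {"type": "metadata", "subtype": "collection name", "value": "Staatsarchiv Aargau"},
    {"type": "metadata", "subtype": "material", "value": "paper"},
    {"type": "metadata", "subtype": "material", "value": "parchment"},
    {"type": "metadata", "subtype": "text language", "value": "German"},
]


def make_object(material):
    return {
        ("metadata", "collection name"): ["Staatsarchiv Aargau"],
        ("metadata", "material"): [material],
        ("metadata", "text language"): ["German"],
    }


print(matches(make_object("Paper"), facets))      # paper satisfies the material group
print(matches(make_object("Parchment"), facets))  # so does parchment
print(matches(make_object("Vellum"), facets))     # vellum matches neither value
```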

Further, we are requesting that the results returned include the facet counts for:

    "text language",
    "place of origin (english)",
    "collection name",
    "publisher",
    "persons",
    "material",
    "location"

If any of those fields don't exist on the query results, the system will return an empty object {} for that "field".

The results returned in this case include a facet key as follows

{
...
  "facets": {
   "metadata": {
     "text language": {
       "German": 10,
       "Latin": 4
     },
     "place of origin (english)": {
       "Königsfelden": 8,
       "Wettingen": 2
     },
     "collection name": {
       "Staatsarchiv Aargau": 10
     },
     "publisher": {},
     "persons": {
       "Author: Innocentius IV, Papa": 1,
       "Author: Rümlang, Eberhard von": 1,
       "Restorer: Gall, Ernst": 1,
       "Scribe: Peter, von Neumagen; Annotator: Tschudi, Aegidius": 1
     },
     "material": {
       "Paper": 9,
       "Parchment": 1
     },
     "location": {
       "Aarau": 10
     }
   }
 }
}
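
A facet block of this shape, including the empty object for a field with no values, could be computed from a result set roughly as follows. This is a sketch over in-memory dictionaries; the facet_counts function and the input shape are assumptions, not the service's implementation.

```python
from collections import Counter


def facet_counts(results, facet_fields, facet_type="metadata"):
    """Count values per requested field; fields with no values yield {}."""
    counts = {field: Counter() for field in facet_fields}
    for fields in results:
        for field in facet_fields:
            counts[field].update(fields.get(field, []))
    return {facet_type: {field: dict(counter) for field, counter in counts.items()}}


# Each result maps a metadata label to its list of values (invented toy data).
results = [
    {"material": ["Paper"], "location": ["Aarau"]},
    {"material": ["Paper"]},
    {"material": ["Parchment"], "location": ["Aarau"]},
]

counts = facet_counts(results, ["material", "publisher", "location"])
```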

Example 3

Filter by facet, without a fulltext search.

{
  "facet_fields": [
    "text language",
    "place of origin (english)",
    "collection name",
    "publisher",
    "persons",
    "material",
    "location"
  ],
  "contexts": [
    "urn:madoc:site:2"
  ],
  "facets": [
    {
      "type": "metadata",
      "subtype": "collection name",
      "value": "Staatsarchiv Aargau"
    },
    {
      "type": "metadata",
      "subtype": "material",
      "value": "paper"
    },
    {
      "type": "metadata",
      "subtype": "material",
      "value": "parchment"
    },
    {
      "type": "metadata",
      "subtype": "text language",
      "value": "German"
    }
  ]
}

This is the same query as Example 2 but without the fulltext parameter: only the facets are set and the facet counts requested. The facet counts returned this time are:

{
...
  "facets": {
    "metadata": {
      "text language": {
        "German": 13,
        "Latin": 6
      },
      "place of origin (english)": {
        "Königsfelden": 9,
        "Wettingen": 2,
        "Aarau": 1,
        "Southwestern Germany": 1
      },
      "collection name": {
        "Staatsarchiv Aargau": 13
      },
      "publisher": {},
      "persons": {
        "Author: Innocentius IV, Papa": 1,
        "Author: Rümlang, Eberhard von": 1,
        "Former possessor: Stettler, Anton": 1,
        "Restorer: Gall, Ernst": 1,
        "Scribe: Peter, von Neumagen; Annotator: Tschudi, Aegidius": 1
      },
      "material": {
        "Paper": 10,
        "Parchment": 3
      },
      "location": {
        "Aarau": 13
      }
    }
  }
}

Example 4

{
  "facet_fields": [
    "text language",
    "place of origin (english)",
    "collection name",
    "publisher",
    "persons",
    "material",
    "location"
  ],
  "contexts": [
    "urn:madoc:site:2"
  ],
  "facets": [
    {
      "type": "metadata",
      "subtype": "collection name",
      "value": "Staat",
      "field_lookup": "istartswith"
    },
    {
      "type": "metadata",
      "subtype": "material",
      "value": "paper"
    },
    {
      "type": "metadata",
      "subtype": "material",
      "value": "parchment"
    },
    {
      "type": "metadata",
      "subtype": "text language",
      "value": "German"
    }
  ]
}

As per Example 3, only this time we are using istartswith (case-insensitive startswith) for the type = "metadata", subtype = "collection name", value = "Staat" facet. N.B. because field_lookup applies to type and subtype as well as value, this will also match a subtype such as "collection name (English)".

POSTing Capture models and OCR

POST to /api/search/model

{
    "resource_id": "urn:foo:bar",
    "content_id": "123-abcdhd-24-j8-0000-foo",
    "resource": { ... the capture model or ocr data ...}
}

N.B. The following fields are compulsory:

  • resource_id : the identifier in madoc for the object this relates to
  • resource: there must be something to index

N.B. the resource_id must exist and must be already in the search service.

  • content_id:
    • for capture models (not the intermediate OCR format) the content_id used for each indexable object created will be derived from the model, and any content_id passed in the POST will be ignored.
    • for OCR data, you MAY pass in a content_id and this will be used to identify the indexed content in the system. If no content_id is passed in, the system will generate one based on the resource_id and the content_type.

Pagination

API endpoints support pagination. This can be controlled at query time, using the following query parameters (supported in both GET and POST operations)

  • ?page=2: return the second page of results.
  • ?page_size=10: limit the number of results to 10 items.

The default page_size is controlled by the PAGE_SIZE environment variable, which defaults to 25.

The MAX_PAGE_SIZE environment variable can be used to limit how large a page_size may be requested.

By default this is not set, so any page_size is allowed.
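
The interaction between these settings can be sketched as below. This is not the service's code: effective_page_size is a hypothetical function, and clamping an oversized page_size down to the limit (rather than rejecting it) is an assumption, modelled on how Django REST Framework's PageNumberPagination treats max_page_size.

```python
import os


def effective_page_size(requested=None, environ=os.environ):
    """Return the page size that would apply for a requested ?page_size value."""
    default = int(environ.get("PAGE_SIZE", 25))
    limit = environ.get("MAX_PAGE_SIZE")  # unset means no limit
    size = default if requested is None else int(requested)
    if limit is not None:
        size = min(size, int(limit))  # assumed clamping behaviour
    return size


# Passing a plain dict stands in for the process environment.
print(effective_page_size(None, {}))
print(effective_page_size(100, {"MAX_PAGE_SIZE": "50"}))
```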

POSTing "raw" Indexable content

POST to /api/search/indexables

{
    "resource_id": "urn:foo:bar",
    "content_id": "https://example.org/foo/bar",
    "original_content": "<html>Whooo!</html>",
    "indexable": "Whooo!",
    "indexable_date_range_start": null,
    "indexable_date_range_end": null,
    "indexable_int": null,
    "indexable_float": null,
    "indexable_json": null,
    "selector": null,
    "type": "custom",
    "subtype": "custom1",
    "language_iso639_2": "eng",
    "language_iso639_1": "en",
    "language_display": "english",
    "language_pg": "english"
}

N.B. The following fields are compulsory:

  • resource_id : the identifier in madoc for the object this relates to
  • indexable: this is the text to be indexed by the fulltext search
  • original_content: this is whatever will get highlighted by the snippet API
  • type: e.g. custom
  • subtype: e.g. custom_subfield

N.B. the resource_id must exist and must be already in the search service.
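
Before POSTing, a client can check that the compulsory fields are present. The sketch below is a hypothetical client-side check; treating an empty string as missing is an assumption, and it cannot verify that the resource_id already exists in the search service.

```python
REQUIRED_FIELDS = ("resource_id", "indexable", "original_content", "type", "subtype")


def missing_fields(payload):
    """Return the compulsory fields that are absent, None or empty."""
    return [name for name in REQUIRED_FIELDS if payload.get(name) in (None, "")]


payload = {
    "resource_id": "urn:foo:bar",
    "content_id": "https://example.org/foo/bar",
    "original_content": "<html>Whooo!</html>",
    "indexable": "Whooo!",
    "type": "custom",
    "subtype": "custom1",
}

print(missing_fields(payload))
print(missing_fields({"resource_id": "urn:foo:bar", "type": "custom"}))
```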