Add details on manual mapping curation in docs (#308)
In response to a request on Slack, adding more details to the docs on
how to curate mapping justifications for manually curated mappings.
matentzn authored Jul 29, 2023
1 parent bf7ea35 commit f6025c3
Showing 1 changed file with 32 additions and 3 deletions.
35 changes: 32 additions & 3 deletions src/docs/mapping-justifications.md
@@ -6,14 +6,14 @@ The goal of this document is to provide the user with a few pointers into the ar

1. [lexical matching](#lexical-matching)
1. [semantic similarity threshold-based matching](#semantic-matching)
1. [manual mapping curation](#manual-mapping-curation)
1. [mapping review](#mapping-review)
1. Other justifications
    1. background knowledge-based matching
    1. composite matching
    1. instance-based matching
    1. lexical similarity threshold-based matching
    1. logical reasoning
    1. manual mapping curation
    1. mapping chaining-based matching
    1. mapping inversion-based matching
    1. semantic similarity threshold-based matching
@@ -68,14 +68,43 @@ The basic idea behind "Semantic similarity threshold-based matching" is that a p

**Semantic vs lexical similarity?**: Semantic similarity is different from lexical similarity intuitively because the context (the graph structure, the background information) is taken into account and provides an (often crude) model of the actual entity, rather than of the word describing it. However, the distinctions can become a bit hazy. Imagine learning a graph embedding on a graph without edges, or a word embedding purely on a single label - there is definitely a grey zone where lexical similarity finishes and semantic similarity begins. In practice though, it should be mostly clear.

## Level 1: Documenting semantic similarity matches
#### Level 1: Documenting semantic similarity matches

The suggested metadata for semantic similarity threshold based matching approach is:
The suggested metadata for the semantic similarity threshold-based matching approach is:

- [semantic_similarity_measure](https://mapping-commons.github.io/sssom/semantic_similarity_measure/)
- [semantic_similarity_score](https://mapping-commons.github.io/sssom/semantic_similarity_score/)
- ((author's note: Maybe we need a [value for similarity threshold](https://github.com/mapping-commons/sssom/issues/296)?))
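
To make this concrete, a single mapping row carrying the metadata above could look like the following SSSOM/TSV sketch (the entities, the score and the free-text measure value are purely illustrative, and the justification value assumes the corresponding semapv term):

```tsv
subject_id	subject_label	predicate_id	object_id	object_label	mapping_justification	semantic_similarity_measure	semantic_similarity_score
MONDO:0004979	asthma	skos:exactMatch	HP:0002099	Asthma	semapv:SemanticSimilarityThresholdMatching	cosine similarity over graph embeddings	0.87
```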

<a id="manual-mapping-curation"></a>

## Manual mapping curation

[semapv:ManualMappingCuration](https://w3id.org/semapv/vocab/ManualMappingCuration) is a process conducted by a (usually human) agent to determine a mapping by virtue of domain expertise. The task usually involves the agent determining, for a given `subject_id`, a suitable `object_id` in the `object_source`.

#### Level 1: Documenting manual mapping curation

The suggested minimal metadata for manual mapping curation is:

- [author_id](https://mapping-commons.github.io/sssom/author_id/): Documenting, using a unique identifier such as an ORCID, the identity of the author performing the expert curation.
- [comment](https://mapping-commons.github.io/sssom/comment/): When no formal [curation_rule](https://mapping-commons.github.io/sssom/curation_rule/) is provided (see below), it is recommended to provide a short comment with the mapping justification, especially if there is some uncertainty or ambiguity about the mapping decision.
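
For example, a manually curated mapping documented with this minimal metadata could look like the following SSSOM/TSV sketch (the entities, the ORCID and the comment text are placeholders):

```tsv
subject_id	predicate_id	object_id	mapping_justification	author_id	comment
MONDO:0004979	skos:exactMatch	HP:0002099	semapv:ManualMappingCuration	orcid:0000-0000-0000-0001	Labels and definitions agree; no closer HP term available.
```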

#### Level 2: Documenting the confidence of expert curation

[confidence](https://mapping-commons.github.io/sssom/confidence/) is an incredibly useful metric for downstream users, including ETL engineers and data analysts. In an ideal world, all mappings have some kind of confidence associated with them. `confidence` scores should be read as "the strength of evidence provided in this record/table row (i.e. the mapping justification) leads us to believe the mapping (e.g. OMOP:44499396 --[skos:broadMatch]--> OMOP:4028717) is correct with 90% confidence".

In manual curation, confidence expresses the domain expert's degree of conviction that the asserted mapping holds true. While manual mapping curation is still considered a gold standard, in practice human agents have (a) varying levels of expertise in the subject domain, (b) varying levels of understanding of the intuitions behind "semantic spaces" and associated concepts and (c) varying amounts of metadata associated with a concept (definitions, labels, papers, synonyms, etc.) with which to determine a match. Documenting confidence can be very useful both to increase the transparency of data science pipelines that involve entity mappings, and as a means to increase curation speed: rather than trying to achieve 100% confidence for a mapping, which can be extremely time-consuming, it is often better to first "wave through" a mapping with lower confidence to reach coverage, and later revisit low-confidence mappings iteratively.
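
As a sketch of how such confidence scores can be used downstream, the following illustrative snippet (using pandas; the file name and the 0.8 threshold are arbitrary assumptions) pulls out low-confidence manually curated mappings for a later re-review pass:

```python
import pandas as pd

# Load an SSSOM/TSV mapping set; the leading YAML metadata block is
# commented with '#', so comment="#" makes pandas skip it.
mappings = pd.read_csv("mappings.sssom.tsv", sep="\t", comment="#")

# Select manually curated mappings that were "waved through" with low
# confidence, so they can be revisited in a later curation iteration.
# The 0.8 cut-off is an arbitrary, project-specific choice.
needs_review = mappings[
    (mappings["mapping_justification"] == "semapv:ManualMappingCuration")
    & (mappings["confidence"] < 0.8)
]

print(needs_review[["subject_id", "predicate_id", "object_id", "confidence"]])
```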

#### Level 3: Documenting curation rules

For manual matches, it is often unclear by what criteria a match was established. Documenting the `curation_rule`s used can help increase consistency for manual curation, and transparency for downstream users.

For example, `OHDSI_CURATION_RULE:19` could correspond to the following rule:

OHDSI_CURATION_RULE:19 = If the subject concept does not have an exact match in the object source vocabulary, we select the nearest broad ("up-hill") concept applicable. Conceptually, if both terms existed in the same terminology, the subject concept could be defined as a subconcept of the object concept. The determination for both criteria (nearest broad, conceptually a subconcept) is performed through medical expert judgement.

Curation rules are often very use case-specific and difficult to standardise. As of August 2023, SSSOM does not provide any standardised curation rules, but encourages the community to define them locally.
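
Putting this together with the metadata above, a rule-annotated mapping could look like the following SSSOM/TSV sketch (reusing the OMOP example from the confidence section; the ORCID and confidence value are placeholders, and the `OHDSI_CURATION_RULE` prefix would need to be declared in the mapping set's CURIE map):

```tsv
subject_id	predicate_id	object_id	mapping_justification	author_id	curation_rule	confidence
OMOP:44499396	skos:broadMatch	OMOP:4028717	semapv:ManualMappingCuration	orcid:0000-0000-0000-0001	OHDSI_CURATION_RULE:19	0.9
```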

<a id="mapping-review"></a>

## Mapping review
