Annotations in NexSON
The original NexSON documentation is at https://github.com/OpenTreeOfLife/phylesystem-api/wiki/NexSON.
This page is a short-term scratchpad/sandbox for fleshing out the details of how annotations that are added to NexSON files during a process of commenting or validation should be structured.
NexSON is a JSON format derived from NexML using BadgerFish conventions. NexML is intended to be used to encode phylogenetic trees, alignments, and associated information. NexSON implements several of the NexML first-class object types including a top-level study object (the `nexml` object), which may contain a number of other first-class objects of the types: `otus`, `otu`, `trees`, `tree`, `node`, and `edge`. For more information, see the other NexSON pages and the NexML schema.
As a means to store additional metadata, the NexML schema defines a `meta` tag, which may be a child of any first-class NexML object. These `meta` tags may contain arbitrarily complex, discretionary data structures, and are intended to contain metadata annotating the first-class element to which they are attached. In NexSON, sets of `meta` tags are represented as a JSON array, stored under the `meta` key in the first-class object to which the meta tags apply. For example:
{
  "tree": {
    "meta": [
      {
        "@property": "the meta element property type",
        "childElement": { "$": "inner text of this child element" },
        "aDifferentChild": { "$": "foo" }
      },
      {
        "@property": "a different property type",
        "arrayValue": [
          {"$": "inner text of an 'arrayValue' child of this meta element" },
          {"$": "another arrayValue" },
          {"$": "and a final one" }
        ]
      }
    ]
  }
}
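A minimal Python sketch of reading such a structure, assuming the NexSON has already been parsed into dictionaries and lists (for example with `json.load`). The helper name `find_meta` is hypothetical and is not part of any Open Tree library.

```python
def find_meta(obj, prop):
    """Return every meta element of `obj` whose @property equals `prop`."""
    meta = obj.get("meta", [])
    # Assumption: a lone meta child may appear as a single object rather
    # than a list, as BadgerFish does for single child elements.
    if isinstance(meta, dict):
        meta = [meta]
    return [m for m in meta if m.get("@property") == prop]


tree = {
    "meta": [
        {
            "@property": "the meta element property type",
            "childElement": {"$": "inner text of this child element"},
        },
        {
            "@property": "a different property type",
            "arrayValue": [{"$": "another arrayValue"}],
        },
    ]
}

matches = find_meta(tree, "a different property type")
print(matches[0]["arrayValue"][0]["$"])
```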
The content of these meta tags is unconstrained by the NexML schema. For example, phenoscape embeds data structured according to other XML schemas inside meta tags. See some example phenoscape files here.
This document proposes a standardized model to facilitate efficient and straightforward storage and retrieval of human- and machine-generated annotation metadata regarding a NexSON study and its contained objects. The goals of this proposed model are limited to the scope of the Open Tree of Life project. Thus, no attempt is made to generalize a model suitable for all conceivable annotation purposes under the sun. Rather, the concepts are tailored to suit the activities expected to occur as part of the OToL workflow, including (but not necessarily limited to):
- Study curation
- NexSON structural validation
- Data quality assessment
- Metadata persistence
- Cross-purpose communication of OToL tools, including external tools intended to complement and extend OToL tools.
Extensions to this model, or other models, may be required for other purposes outside the defined scope.
Much of the content documented here was originally discussed in the Annotations thread on the [email protected]. Where appropriate, attempts have been made to incorporate concepts from related projects addressing the formalization of annotation data, including Open Annotation, Annotation Ontology, and the W3C PROV Ontology.
Three primary annotation object classes are proposed: `annotationEvent`, `agent`, and `message`. This three-part breakdown corresponds to the PROV data model, with its `Activity`, `Agent`, and `Entity` types corresponding to NexSON `annotationEvent`, `agent`, and `message` objects respectively. The roles of these object types are defined as follows:
An `annotationEvent` is a one-time event, during which an agent generates one or more messages related to a study or element(s) within it. Each `annotationEvent` should generate one or more `message` objects. `annotationEvent` objects relate `message` objects to associated `agent` objects and contain information about the event itself, including the date [more info here].
After some discussion (Jan 13, 2014 software G+ hangout), we decided to put the `message` objects inside their containing `annotationEvent` object.
An agent is a person or program that creates annotations, possibly acting on behalf of another agent. `agent` objects contain information identifying and describing real-world annotating agents, including names, URLs, information about the execution environment (for automated agents), version (for automated agents), etc. For more information, refer to the `agent` syntax below.
A `message` is a simple data structure that provides information about a particular target object or set of objects. Messages are generalized, and contain features to accommodate diverse annotation data. For more information, refer to the `message` object syntax below.
Annotation information may conceivably be stored anywhere (e.g. within single-file NexSON documents, or externally accessible via URL). For convenience and simplicity, at this time we propose storing annotations within NexSON documents themselves.
Two top-level NexSON `meta` element containers are proposed to store collections of primary annotation objects. These container elements are in fact NexSON `meta` elements, whose `@property` value may be equal to "ot:annotationEvents" or "ot:agents". Corresponding annotation elements of each respective type should be stored in the appropriate `meta` container. (The "ot:messages" container is now deprecated, in favor of storing messages inside annotation events.)
Exactly one `meta` element with the property "ot:annotationEvents" and one with "ot:agents" should exist for a given study, as children of the `nexml` object itself. These containers should contain all of the `annotationEvent` and `agent` objects associated with `message` objects applied to elements within the study.
`message` objects are stored inside the `annotationEvent` that created them.
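The container rule can be sketched in Python as follows. This is only an illustration under the assumption that the study has been parsed into plain dictionaries; the function names are hypothetical, and the `annotation` and `agent` keys follow the container tables in this document.

```python
REQUIRED_CONTAINERS = ("ot:annotationEvents", "ot:agents")


def container_counts(nexml):
    """Count the top-level meta containers of each required type."""
    meta = nexml.get("meta", [])
    if isinstance(meta, dict):  # tolerate a lone meta object
        meta = [meta]
    counts = {prop: 0 for prop in REQUIRED_CONTAINERS}
    for element in meta:
        prop = element.get("@property")
        if prop in counts:
            counts[prop] += 1
    return counts


def has_exactly_one_of_each(nexml):
    """True when each required container occurs exactly once."""
    return all(n == 1 for n in container_counts(nexml).values())


# A toy study: the message lives inside the annotationEvent that created it.
nexml = {
    "meta": [
        {
            "@property": "ot:annotationEvents",
            "@xsi:type": "nex:ResourceMeta",
            "annotation": [
                {
                    "@id": "anno1",
                    "@wasAssociatedWithAgentId": "agentX",
                    "message": [{"@id": "msg1", "@code": "NO_ROOT_NODE"}],
                }
            ],
        },
        {
            "@property": "ot:agents",
            "@xsi:type": "nex:ResourceMeta",
            "agent": [{"@id": "agentX", "@name": "normalize_ot_nexson.py"}],
        },
    ]
}
print(has_exactly_one_of_each(nexml))
```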
N.B. This entire section is deprecated, in favor of storing messages inside annotation events.
Deprecated recommendation: The `message` objects themselves should be stored in "ot:messages" containers that are attached to the least inclusive NexSON element to which the information in the `message` applies. Thus, one "ot:messages" container may exist as a child of each annotated object (see 2A for a list of annotatable object types) in the study. `meta` containers of the "ot:messages" type should only be assigned to the following first-class NexSON objects:
- `nexml` (the study itself)
- `tree`
- `node`
- `edge`
- `otu`
Determining the best location to attach these "ot:messages" containers may be a rather arbitrary choice in many cases, but placements facilitating ease of interpretation and semantic consistency are encouraged. By convention, the "ot:messages" element attached to the top-level `nexml` element should contain `message` objects that describe information about the study itself, or about one of its associated `annotationEvent` or `agent` objects; "ot:messages" containers attached to a `tree` element should contain `message` objects specific to that tree, but `message` objects specific to a single node within that tree should be stored in an "ot:messages" container attached to the node itself; etc.
It may be instructive to consider a negative example: it is possible to store every `message` object in the "ot:messages" `meta` container attached to the `nexml` element itself, and simply use the "refersTo" field (see below) to associate `message` objects with the NexSON objects to which they pertain. This usage pattern is discouraged since it complicates the association of the message objects to their relevant NexSON elements. With that in mind, it is worth recognizing that there may be rare cases where it is appropriate to store all of the `message` objects associated with a given `annotationEvent` in the "ot:messages" container attached to the `nexml` object. For instance, when every `message` object refers to both `tree` and `otu` objects, or to other `annotationEvent` objects, then the most inclusive placement, the `nexml` object, is the appropriate one for each `message`.
JA: If we put warnings/queries/errors in the `meta` that is inside the top-level `nexml` object, then the curator application could grab the annotations, and quickly ascertain which parts of the study have problems. This will require that app to hold onto tree, node, edge and otu IDs until it has the data to instantiate objects of those types. But that does not seem too onerous.
CEH: I think we want to avoid the need for the curator app (or any other app for that matter) to download and parse the entire NexSON study and/or entire set of messages in order to find the relevant ones. I would suggest that implementing services (such as OTI) capable of returning the information based on queries (e.g. "return all warnings/queries/errors for anything in study X") would be more scalable than searching the NexSON for them on every load. In this case, the placement of the messages within the file is arbitrary. I would argue that storing them as children of the objects to which they most closely pertain makes more intuitive sense than not doing so, and that it will be easier to parse in many cases (e.g. no need to hold onto node, edge, otu, etc. ids as mentioned above). So this is my recommendation.
In accordance with BadgerFish conventions (for XML compatibility), each container object in the JSON representation will contain an array of objects of the corresponding type under the defined key. Each element of these arrays corresponds to a single tag in the XML with the same type name as the array key in the BadgerFished JSON (e.g. `annotationEvent`, `agent`, or `message`).
tag | legal value(s) | explanation |
---|---|---|
@property | "ot:annotationEvents" | |
@xsi:type | "nex:ResourceMeta" | |
annotation | list of annotationEvent elements | See details below |
tag | legal value(s) | explanation |
---|---|---|
@property | "ot:agents" | |
@xsi:type | "nex:ResourceMeta" | |
agent | list of agent elements | See details below |
tag | legal value(s) | explanation |
---|---|---|
@property | "ot:messages" | |
@xsi:type | "nex:ResourceMeta" | |
message | list of message elements | See details below. |
tag | legal value(s) | explanation |
---|---|---|
@id | string | unique among the set of IDs used in this file (not necessarily globally unique) |
@description | string | human-readable description of the type of annotation performed (e.g. "NexSON validation" or "treemachine import check") |
@wasAssociatedWithAgentId | string | id of the agent (person or tool; see below) that created the annotationEvent |
@dateCreated | String in ISO 8601 | date that the annotationEvent occurred |
@passedChecks | boolean | default True. False indicates that the author is a validating service (rather than just a commenting tool), and some aspect of the validation procedure failed in some serious way. The details should be in the messages. |
@preserve | boolean | False by default. True serves as a flag to future invocations of the same tool (software agent), indicating that the message should be retained |
otherProperty | array of otherProperty elements | Optional. See below for additional information |
message | list of message elements | See details below. |
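As an illustration of the fields in the table above, here is a small Python sketch that assembles an `annotationEvent` as a plain dictionary. The function name and its parameters are hypothetical; only the keys come from the table, and `@dateCreated` is written with the standard library's ISO 8601 formatter.

```python
from datetime import datetime, timezone


def make_annotation_event(event_id, description, agent_id, messages,
                          passed_checks=True, preserve=False, when=None):
    """Build an annotationEvent dict; defaults mirror the table above."""
    when = when or datetime.now(timezone.utc)
    return {
        "@id": event_id,
        "@description": description,
        "@wasAssociatedWithAgentId": agent_id,
        "@dateCreated": when.isoformat(),  # ISO 8601 string
        "@passedChecks": passed_checks,    # default True
        "@preserve": preserve,             # default False
        "message": list(messages),         # messages live inside the event
    }


event = make_annotation_event(
    "anno1",
    "Open Tree NexSON validation",
    "agentX",
    messages=[{"@id": "msg1", "@severity": "WARNING", "@code": "NO_TREES"}],
    passed_checks=False,
    when=datetime(2014, 1, 13, 12, 0, tzinfo=timezone.utc),
)
print(event["@dateCreated"])
```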
An Agent can be a human author or a program. (Is there a standard way of describing a software tool that we should be using here? <-- Yes, we are adapting this from the PROV model.) Here is the basic info we want:
tag | legal value(s) | explanation |
---|---|---|
@id | string | unique among the set of IDs used in this file (not necessarily globally unique) |
@name | string | Name of software that produced the annotation, or authorized user (GitHub username or email) |
@url | string | URL of service or page that describes the tool (blank for a human) |
@description | string | human-readable description of the tool, or full name for a human |
@version | string | version number string of the authoring tool (blank for a human) |
invocation | object | Only applicable to automated (i.e. software) agents. An invocation object that contains relevant info about the execution environment and operating parameters |
otherProperty | array of otherProperty objects | Optional. See below for more information |
tag | legal value(s) | explanation |
---|---|---|
commandLine | list of strings | (optional) args |
method | string | GET, PUT... for web services |
data | string | data parameters passed to the web-services call |
checksPerformed | list of strings | list of Message Codes (see below) that the service claims to have checked for |
otherProperty | array of otherProperty objects | Optional. Holds additional information; see below for more information |
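The agent and invocation fields above could be assembled along these lines. This is a sketch with hypothetical helper names; the keys are taken from the two tables, and human agents simply leave `@url` and `@version` blank and carry no `invocation`.

```python
def make_software_agent(agent_id, name, url, description, version,
                        command_line=(), checks_performed=()):
    """Build an agent dict for an automated tool, with its invocation."""
    return {
        "@id": agent_id,
        "@name": name,
        "@url": url,
        "@description": description,
        "@version": version,
        "invocation": {
            "commandLine": list(command_line),
            "checksPerformed": list(checks_performed),
        },
    }


def make_human_agent(agent_id, username, full_name):
    """Build an agent dict for a person (no invocation object)."""
    return {
        "@id": agent_id,
        "@name": username,          # GitHub username or email
        "@url": "",                 # blank for a human
        "@description": full_name,  # full name for a human
        "@version": "",             # blank for a human
    }


validator = make_software_agent(
    "agentX",
    "normalize_ot_nexson.py",
    "https://github.com/OpenTreeOfLife/api.opentreeoflife.org",
    "validator of NexSON constraints",
    "0.0.1a",
    command_line=["--validate"],
    checks_performed=["NO_ROOT_NODE", "NO_TREES"],
)
curator = make_human_agent("agentY", "example-user", "Example Curator")
print(validator["invocation"]["checksPerformed"])
```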
tag | legal value(s) | explanation |
---|---|---|
@id | string | unique among the set of IDs used in this file (not necessarily globally unique) |
@wasGeneratedById | string | Deprecated: no longer used because message objects now occur inside the annotation event that generates them. The id of the annotationEvent object with which this message is associated |
wasAttributedToId | string | Optional. The id of an agent object that this message is attributed to, which may be different from the agent associated with the generating annotationEvent. For example, "wasAttributedToId" could identify a human agent operating a software agent with which the annotationEvent itself may be associated. |
@severity | string | one of the defined Severity values (like logger message levels; see below) |
@code | string | one of the Message Codes (see below) |
@humanMessageType | string | one of the Message Types (see below). Optional if the Message Code indicates that a front end should be able to generate a message from the code (see below). |
@humanMessage | string | human-interpretable message (i.e. no NexSON IDs). Optional if the Message Code indicates that a front end should be able to generate a message from the code (see below). |
dataAnnotation | string | Optional. More precise message for machine consumption |
data | object | fields depend on the Message Code (see below) |
refersTo | path object (see below) | path to the object that the message refers to (see path syntax below) |
other | object | object (key to string, number, or boolean) for additional information |
These objects are used to designate optional properties. They are intended to be used as a catch-all for necessary round-trip information that does not belong in any pre-defined property for a given object. This feature is intentionally restrictive to reduce complexity and increase consistency/adherence to the annotation spec.
tag | explanation |
---|---|
name | the name of this property |
value | a value of one of the predefined value types below |
Acceptable values are defined by JSON.
tag | explanation |
---|---|
STRING | a string wrapped in quotes. |
NUMBER | a floating point or integer value. |
BOOLEAN | a boolean value. Acceptable values are either of the strings "true" or "false", without the quotes. |
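A short Python sketch of checking an `otherProperty` object against the value types above. The function name is hypothetical, and requiring exactly the `name` and `value` keys is an assumption drawn from the table rather than a stated rule.

```python
def is_valid_other_property(prop):
    """Check that prop has a string name and a STRING/NUMBER/BOOLEAN value."""
    if not isinstance(prop, dict) or set(prop) != {"name", "value"}:
        return False
    if not isinstance(prop["name"], str):
        return False
    # After json.loads, JSON true/false become Python bool, numbers become
    # int or float, and strings become str. Lists, objects and null are
    # rejected because they are not among the predefined value types.
    return isinstance(prop["value"], (str, int, float, bool))


print(is_valid_other_property({"name": "pythonVersion", "value": "2.7.3"}))
print(is_valid_other_property({"name": "nested", "value": {"a": 1}}))
```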
tag | explanation |
---|---|
ERROR | an error. generally designates a fail condition |
WARNING | a warning. designates a condition that is not encouraged but is not generally a fail condition |
INFO | neither a warning nor an error |
The following message types (borrowed from the Open Annotation and Annotation Ontology projects) are used to define different cases for the human-readable message (if there is one):
tag | explanation |
---|---|
NONE | there is no human-readable message |
NOTE | a general, human-readable note |
COMMENT | this suggests editorial intent |
REPLY | points to another annotation (Note, Comment, or Reply) |
EXPLANATORY_NOTE | by the curator? |
QUESTION | specifically asks for reply or clarification |
ERRATUM | identifies an error (added by curator to historical stuff? or by a reviewer?) |
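The severity and message-type vocabularies above lend themselves to a simple check. The following Python sketch is illustrative only: the function name is hypothetical, and the rule that any type other than NONE needs an accompanying `@humanMessage` is an assumption inferred from the NONE row, not something this page states.

```python
SEVERITIES = {"ERROR", "WARNING", "INFO"}
MESSAGE_TYPES = {
    "NONE", "NOTE", "COMMENT", "REPLY",
    "EXPLANATORY_NOTE", "QUESTION", "ERRATUM",
}


def message_problems(message):
    """Return a list of human-readable problems found in a message dict."""
    problems = []
    if message.get("@severity") not in SEVERITIES:
        problems.append("unknown @severity")
    mtype = message.get("@humanMessageType")
    if mtype is not None and mtype not in MESSAGE_TYPES:
        problems.append("unknown @humanMessageType")
    # Assumption: a type other than NONE implies a human-readable message.
    if mtype in MESSAGE_TYPES - {"NONE"} and not message.get("@humanMessage"):
        problems.append("@humanMessageType set but @humanMessage is missing")
    return problems


good = {"@severity": "WARNING", "@humanMessageType": "NONE"}
bad = {"@severity": "FATAL", "@humanMessageType": "NOTE"}
print(message_problems(good))
print(message_problems(bad))
```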
Because message objects may relate to more than a single NexSON element, here we define a lightweight, NexSON-specific method of describing the paths to those elements. Avoiding strict use of a JSON version of XPath means we do not have to parse a path string or deal with funky Ids (which are legal but could make naive parsing hard to implement).
A path object is used in the `refersTo` field to indicate the target of the comment. It seems like we can just expand the parts of an absolute path expression (taking advantage of Ids and the fact that NexSONs are not that "deep").
tag | legal value(s) | explanation |
---|---|---|
@idref | string | ID of the object referred to. This ID will also be found in one of the subsequent fields, but duplicating it here makes it easy for an id->object map to quickly interpret this path blob |
@top | "meta" "otus" or "trees" | child of the nexml element |
@otusID | string | only if otus is top |
@otuID | string | only if otus is top. Optional |
@treesID | string | only if trees is top |
@treeID | string | only if trees is top. Optional |
@edgeID | string | only if trees is top and treeID is specified. Optional |
@nodeID | string | only if trees is top and treeID is specified. Optional |
@metaID | string | only if meta is top (useful for replies) |
@annotationID | string | only if meta is top |
@messageID | string | only if meta is top; NOTE that message may be "localized" anywhere in the study! |
@property | string | optional. property of the element referenced by the preceding parts of the path |
@inMeta | bool | The property is in the meta list of the element referenced by the preceding parts of the path |
The pseudocode for processing one of these paths would be something like this (assuming that [] looks up a property or contained Id in an object, and writing attribute names without their leading @):
function find_prop_in_meta(meta_list, prop) {
  // return the meta element whose property matches the requested one
  for (element in meta_list) {
    if (element.property == prop) {
      return element
    }
  }
  throw InvalidPathException()
}
function resolve(nexml, path) {
  // step 1: walk from the nexml element down to the referenced element
  if (path.top == "meta") {
    el = nexml.meta
  } else if (path.top == "otus") {
    otus = nexml.otus[path.otusID]
    if (defined(path.otuID)) {
      el = otus[path.otuID]
    } else {
      el = otus
    }
  } else if (path.top == "trees") {
    trees = nexml.trees[path.treesID]
    if (defined(path.treeID)) {
      tree = trees[path.treeID]
      if (defined(path.nodeID)) {
        el = tree[path.nodeID]
      } else if (defined(path.edgeID)) {
        el = tree[path.edgeID]
      } else {
        el = tree
      }
    } else {
      el = trees
    }
  } else {
    throw InvalidPathException()
  }
  // step 2: optionally select a property, either from the element's meta
  // list (inMeta is true) or from the element itself
  if (defined(path.inMeta) && path.inMeta) {
    return find_prop_in_meta(el.meta, path.property)
  }
  if (defined(path.property)) {
    return el[path.property]
  } else {
    return el
  }
}
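A runnable Python counterpart of the pseudocode, offered as a sketch under stated assumptions rather than a reference implementation. It assumes that `otus`, `trees`, `otu`, `tree`, `node`, `edge`, and `meta` children are lists of objects (or a lone object) carrying an `@id`, that path keys use the `@`-prefixed names from the table, and that the `meta` branch descends through `annotation` and `message` lists. All function and class names are hypothetical.

```python
class InvalidPathError(LookupError):
    """Raised when a refersTo path cannot be resolved."""


def _as_list(value):
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _by_id(items, wanted):
    """Return the child whose @id equals `wanted`."""
    for item in _as_list(items):
        if item.get("@id") == wanted:
            return item
    raise InvalidPathError("no element with @id %r" % (wanted,))


def resolve(nexml, path):
    """Follow a path object; a missing mandatory key raises KeyError."""
    top = path.get("@top")
    if top == "otus":
        el = _by_id(nexml.get("otus"), path["@otusID"])
        if "@otuID" in path:
            el = _by_id(el.get("otu"), path["@otuID"])
    elif top == "trees":
        el = _by_id(nexml.get("trees"), path["@treesID"])
        if "@treeID" in path:
            el = _by_id(el.get("tree"), path["@treeID"])
            if "@nodeID" in path:
                el = _by_id(el.get("node"), path["@nodeID"])
            elif "@edgeID" in path:
                el = _by_id(el.get("edge"), path["@edgeID"])
    elif top == "meta":
        el = _by_id(nexml.get("meta"), path["@metaID"])
        if "@annotationID" in path:
            el = _by_id(el.get("annotation"), path["@annotationID"])
        if "@messageID" in path:
            el = _by_id(el.get("message"), path["@messageID"])
    else:
        raise InvalidPathError("unknown @top %r" % (top,))

    prop = path.get("@property")
    if prop is None:
        return el
    if path.get("@inMeta"):
        # The property names an entry in the element's meta list.
        for meta in _as_list(el.get("meta")):
            if meta.get("@property") == prop:
                return meta
        raise InvalidPathError("no meta with @property %r" % (prop,))
    if prop not in el:
        raise InvalidPathError("element has no property %r" % (prop,))
    return el[prop]


study = {
    "trees": {
        "@id": "trees1003",
        "tree": [{
            "@id": "tree1945",
            "meta": [{"@property": "ot:branchLengthMode", "$": "ot:time"}],
            "node": [{"@id": "node503822", "@otu": "otu1"}],
            "edge": [],
        }],
    }
}
node_path = {"@top": "trees", "@treesID": "trees1003", "@treeID": "tree1945",
             "@nodeID": "node503822", "@property": "@otu", "@inMeta": False}
print(resolve(study, node_path))
```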
This is intended to be an extensible, controlled vocabulary of the types of messages that we anticipate seeing/generating. Preferably, many of the codes, along with the `data` blob in the message, will be rich enough to create a meaningful user interface for the message (without forcing the UI to simply display the `messageForUser` and hope that the user will know how to react to the message).
code name | data contents | explanation |
---|---|---|
REFERENCED_ID_NOT_FOUND | {key: string, value: string} | The NexSON attribute with the name key refers to an ID, but the ID is not in the NexSON. We have about 3000 cases of this with @otu in nodes or @source in edge objects not matching. |
TIP_WITHOUT_OTU | {} | refersTo object is a node that is a tip on the tree, but is not mapped to any OTU object. This is a NexSON error, not a failure to map to OTT. We have about 3000 cases |
UNRECOGNIZED_PROPERTY_VALUE | {key: string, value: string} | the associated meta array contains a key-value pair in which the key is recognized, but the value is not valid. We have about 51 cases of this in which key is "ot:branchLengthMode" and value is "ot:years" (which is deprecated, I think). |
MISSING_OPTIONAL_KEY | string | the attribute is not found. Used to report lack of "ot:dataDeposit", "ot:focalClade", "ot:inGroupClade", "ot:ottolid", and "ot:studyPublication" fields. So we have about 32 thousand of these |
NO_ROOT_NODE | {} | the tree that refersTo points to has no node flagged as the root. We have 12 cases |
TIP_WITHOUT_OTT_ID | {} | refersTo is a node with an otu, but the otu has no OTT ID (about 31 thousand cases) |
MULTIPLE_TIPS_MAPPED_TO_OTT_ID | {nodes:[list of IDs]} | refersTo is a tree; the nodes listed are tips in the tree that map to the same OTT ID (about 31 thousand nodes) |
MULTIPLE_TREES | {} | refersTo is a trees element that has multiple trees with no indication of which one treemachine should prefer to use |
UNRECOGNIZED_TAG | string | value of an ot:tag meta is not understood. This is not unexpected at all (and this sort of message will probably be suppressed), but the validator does emit it currently so we can see what tags are being used. |
UNVALIDATED_ANNOTATION | {key: string, value: string} | an object in the meta list had an unrecognized key. Not surprising (will be suppressed). |
CONFLICTING_PROPERTY_VALUES | list of key-value pairs that conflict | Flags with conflicting meanings, for example the "delete me" and the "choose me" tags |
NO_TREES | {} | file contains no trees that are not flagged for deletion |
NON_MONOPHYLETIC_TIPS_MAPPED_TO_OTT_ID | list of lists of IDs. each sublist is a set of nodes that are monophyletic on the tree and for which all the tips have the same OTT ID | This code is more serious than MULTIPLE_TIPS_MAPPED_TO_OTT_ID because it indicates cases in which different arbitrary prunings could lead to different phylogenetic statements |
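To make the shape of such a message concrete, here is a hedged Python sketch of a REFERENCED_ID_NOT_FOUND check for the two cases mentioned in the table (`@otu` in nodes and `@source` in edges). The function name is hypothetical, the tree is assumed to hold `node` and `edge` lists, and the severity is a caller-supplied assumption because this page does not assign one to the code.

```python
def find_dangling_references(tree, otu_ids, trees_id, severity="WARNING"):
    """Emit one message per @otu or @source value that matches no known ID."""
    node_ids = {node["@id"] for node in tree.get("node", [])}
    messages = []

    def report(id_field, element_id, key, value):
        messages.append({
            "@severity": severity,  # assumption: not fixed by the spec
            "@code": "REFERENCED_ID_NOT_FOUND",
            "data": {"key": key, "value": value},
            "refersTo": {
                "@idref": element_id,
                "@top": "trees",
                "@treesID": trees_id,
                "@treeID": tree["@id"],
                id_field: element_id,
            },
        })

    for node in tree.get("node", []):
        otu = node.get("@otu")
        if otu is not None and otu not in otu_ids:
            report("@nodeID", node["@id"], "@otu", otu)
    for edge in tree.get("edge", []):
        source = edge.get("@source")
        if source is not None and source not in node_ids:
            report("@edgeID", edge["@id"], "@source", source)
    return messages


toy_tree = {
    "@id": "tree1",
    "node": [{"@id": "n1"}, {"@id": "n2", "@otu": "otu9"}],
    "edge": [{"@id": "e1", "@source": "n1", "@target": "n2"},
             {"@id": "e2", "@source": "n7", "@target": "n1"}],
}
found = find_dangling_references(toy_tree, otu_ids={"otu1"}, trees_id="trees1")
for message in found:
    print(message["refersTo"]["@idref"], message["data"])
```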
These indicate serious problems with the NexSON (and we can probably be unfriendly about them in terms of UI, because they'll probably be encountered by developers):
code name | data contents | explanation |
---|---|---|
MISSING_MANDATORY_KEY | string - key name | refersTo object lacks a mandatory attribute. |
UNRECOGNIZED_KEY | string - key name | refersTo object has an attribute that is not allowed by the NexML schema |
MISSING_LIST_EXPECTED | ? | element (e.g. edge) that should be a list, was not |
DUPLICATING_SINGLETON_KEY | string | the attribute specified was encountered more than one time, though it should have been found only once (e.g. a doi) |
REPEATED_ID | string | ID found more than once |
MULTIPLE_ROOT_NODES | {} | tree has more than one node marked as root |
MULTIPLE_EDGES_FOR_NODES | {} | node has more than one edge to parent |
CYCLE_DETECTED | {node : id string} | tree has a cycle (including the referenced node) |
DISCONNECTED_GRAPH_DETECTED | {} | tree is not connected graph |
INCORRECT_ROOT_NODE_LABEL | {} | the node labelled as the root has a parent |
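A few of these structural codes can be detected with a single pass over a tree's edges. The Python sketch below is illustrative only: it assumes nodes flag the root with a truthy `@root` attribute and edges carry `@source` and `@target` node IDs, and the function name is hypothetical. Cycle and connectivity checks are left out to keep it short.

```python
from collections import Counter


def structural_codes(tree):
    """Return the structural message codes that apply to one tree."""
    nodes = tree.get("node", [])
    edges = tree.get("edge", [])
    roots = [node["@id"] for node in nodes if node.get("@root")]
    # Number of incoming (parent) edges per node.
    parent_edges = Counter(edge["@target"] for edge in edges)

    codes = []
    if not roots:
        codes.append("NO_ROOT_NODE")
    elif len(roots) > 1:
        codes.append("MULTIPLE_ROOT_NODES")
    if any(count > 1 for count in parent_edges.values()):
        codes.append("MULTIPLE_EDGES_FOR_NODES")
    if any(parent_edges[root] > 0 for root in roots):
        codes.append("INCORRECT_ROOT_NODE_LABEL")
    return codes


two_parents = {
    "node": [{"@id": "a", "@root": True}, {"@id": "b"}, {"@id": "c"}],
    "edge": [{"@source": "a", "@target": "b"},
             {"@source": "a", "@target": "c"},
             {"@source": "b", "@target": "c"}],
}
rooted_child = {
    "node": [{"@id": "a", "@root": True}, {"@id": "b"}],
    "edge": [{"@source": "b", "@target": "a"}],
}
print(structural_codes(two_parents))
print(structural_codes(rooted_child))
```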
code name | data contents | explanation |
---|---|---|
OTU_MAPPING_HINTS | object | Object describing 'searchContext' (string) and required 'substitutions' (sub-objects) |
SUPPORTING_FILE_INFO | object | Object describing 'files' (sub-objects) |
Presumably the curator app (see the "curator" subdir of the opentree repo ) will try to render a subset of this information to curators. Specifically the annotations could be warnings, error messages, and queries to the curators.
Some annotations could also be "extra" contributions to the study data, that need not be shown to curators. These could still be useful for users of the git repo of the studies (currently this is the treenexus repo, but that name will probably change soon).
Taken from study 1003:
{
  "id": "anno1",
  "description": "Open Tree NexSON validation",
  "agent": "agentX",
  "checksPassed": false
}
{
  "id": "agentX",
  "description": "validator of NexSON constraints as well as constraints that would allow a study to be imported into the Open Tree of Life's phylogenetic synthesis tools",
  "invocation": {
    "checksPerformed": [
      "MISSING_MANDATORY_KEY",
      "MISSING_OPTIONAL_KEY",
      "UNRECOGNIZED_KEY",
      "MISSING_LIST_EXPECTED",
      "DUPLICATING_SINGLETON_KEY",
      "REFERENCED_ID_NOT_FOUND",
      "REPEATED_ID",
      "MULTIPLE_ROOT_NODES",
      "NO_ROOT_NODE",
      "MULTIPLE_EDGES_FOR_NODES",
      "CYCLE_DETECTED",
      "DISCONNECTED_GRAPH_DETECTED",
      "INCORRECT_ROOT_NODE_LABEL",
      "TIP_WITHOUT_OTU",
      "TIP_WITHOUT_OTT_ID",
      "MULTIPLE_TIPS_MAPPED_TO_OTT_ID",
      "NON_MONOPHYLETIC_TIPS_MAPPED_TO_OTT_ID",
      "INVALID_PROPERTY_VALUE",
      "PROPERTY_VALUE_NOT_USEFUL",
      "UNRECOGNIZED_PROPERTY_VALUE",
      "MULTIPLE_TREES",
      "UNRECOGNIZED_TAG",
      "CONFLICTING_PROPERTY_VALUES",
      "NO_TREES"
    ],
    "commandLine": [
      "--validate"
    ]
  },
  "name": "normalize_ot_nexson.py",
  "url": "https://github.com/OpenTreeOfLife/api.opentreeoflife.org",
  "version": "0.0.1a"
}
{
  "parentAnnotationId": "anno1",
  "code": "NON_MONOPHYLETIC_TIPS_MAPPED_TO_OTT_ID",
  "comment": "Multiple nodes that do not form the tips of a clade are mapped to the OTT ID \"210453\". The clades are \"node503822\" +++ \"node503824\" +++ \"node503827\" +++ \"node503832\" in \"tree(id=tree1945)\"\n",
  "data": {
    "nodes": [
      [
        "node503822"
      ],
      [
        "node503824"
      ],
      [
        "node503827"
      ],
      [
        "node503832"
      ]
    ]
  },
  "preserve": false,
  "refersTo": {
    "inMeta": false,
    "top": "trees",
    "treeID": "tree1945",
    "treesID": "trees1003"
  },
  "severity": "WARNING"
}