veitveit edited this page Mar 15, 2016 · 6 revisions

###Summary

The EDAM ontology can be applied not only to bioinformatics tools and services, but to anything related to the use of these resources. Teaching and training material, for example, can be annotated with EDAM terms so that it can be associated with relevant software tools and services. Other examples are protocols for data storage and data exchange. Experience from curating https://bio.tools, from adding single software tools up to entire collections, shows that the bottleneck is the mapping of software descriptors (tags or free text) to EDAM concepts (terms and synonyms). This mapping is time-consuming and error-prone, especially when large data sets that lack meaningful tags (annotations) are annotated by people who are not already familiar with EDAM, and it results in wrong or missing annotations.
Given this bottleneck, we need fast and precise methods to assist a curator in identifying the right EDAM terms when annotating software and teaching materials. As most of these come with additional information in textual form (e.g. a publication abstract or a short textual description), mapping such text to EDAM terms and synonyms provides a shortcut with the potential to become an automated annotation procedure.

###Objectives

  • Mapping to EDAM concepts (via terms and synonyms) with the highest possible confidence.
  • A software tool that takes arbitrary text input and provides well-defined output (a mapping to EDAM concepts).
  • Improved annotation of tools in https://bio.tools by mapping paper abstracts.
  • Integration of the mapping tool into https://bio.tools and other portals to simplify the upload of new material.

###Background

EDAM includes 4 branches of different concepts (OWL classes): topic, operation, data & format. A concept is assigned to a branch by its URI (id), thus:
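
As a minimal sketch, assuming EDAM URIs take the form `http://edamontology.org/<branch>_<digits>` (the pattern and the helper name `branch_of` are assumptions made for illustration, not part of this page), the branch can be read off the URI like this:

```python
import re

# Assumed URI pattern: http://edamontology.org/<branch>_<digits>
EDAM_URI = re.compile(
    r"^http://edamontology\.org/(topic|operation|data|format)_(\d+)$"
)


def branch_of(uri):
    """Return the capitalised branch name for a URI, or None if the
    URI does not fit the assumed pattern."""
    match = EDAM_URI.match(uri)
    return match.group(1).capitalize() if match else None


if __name__ == "__main__":
    print(branch_of("http://edamontology.org/operation_0004"))
```

URIs that do not fit the pattern return `None` rather than raising, so a caller can decide how to report them.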

In the following, text in parentheses refers to statements in the OWL file. Each concept has a:

  • preferred term (rdfs:label)
  • definition (oboInOwl:hasDefinition)
  • parent concept (rdfs:subClassOf)
  • subset (oboInOwl:inSubset), always one of "topics", "operations", "data", "formats", or "obsolete" (for obsolete classes). NB: more than one other type of subset (not listed here) may be defined for a concept!
  • exact synonym (oboInOwl:hasExactSynonym) - standard synonym
  • narrow synonym (oboInOwl:hasNarrowSynonym) - specialism of concept
  • broad synonym (oboInOwl:hasBroadSynonym) - generalisation of concept
  • comments (rdfs:comment) - some comment
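
The properties above could be read from an RDF/XML serialisation roughly as follows. This is a sketch using only the Python standard library; the `SAMPLE` class is hand-written for illustration (its ID, label and synonym are invented, not copied from EDAM), and the function name `read_concepts` is hypothetical. Subsets are left out to keep the example short.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
}

# Invented class for illustration only; not taken from the EDAM OWL file.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="http://edamontology.org/topic_9999">
    <rdfs:label>Example topic</rdfs:label>
    <oboInOwl:hasDefinition>An invented concept.</oboInOwl:hasDefinition>
    <oboInOwl:hasExactSynonym>Sample topic</oboInOwl:hasExactSynonym>
    <rdfs:subClassOf rdf:resource="http://edamontology.org/topic_0003"/>
  </owl:Class>
</rdf:RDF>"""


def read_concepts(owl_xml):
    """Map each class URI to its label, definition, synonyms,
    comments and parent URIs."""
    about = "{%s}about" % NS["rdf"]
    resource = "{%s}resource" % NS["rdf"]
    root = ET.fromstring(owl_xml)

    def texts(cls, tag):
        return [e.text for e in cls.findall(tag, NS) if e.text]

    concepts = {}
    for cls in root.findall("owl:Class", NS):
        uri = cls.get(about)
        if uri is None:  # skip anonymous classes
            continue
        concepts[uri] = {
            "label": cls.findtext("rdfs:label", default="", namespaces=NS),
            "definition": cls.findtext(
                "oboInOwl:hasDefinition", default="", namespaces=NS),
            "exact": texts(cls, "oboInOwl:hasExactSynonym"),
            "narrow": texts(cls, "oboInOwl:hasNarrowSynonym"),
            "broad": texts(cls, "oboInOwl:hasBroadSynonym"),
            "comments": texts(cls, "rdfs:comment"),
            "parents": [e.get(resource)
                        for e in cls.findall("rdfs:subClassOf", NS)
                        if e.get(resource)],
        }
    return concepts
```

A dedicated RDF library would be more robust against other serialisations; plain XML parsing is used here only to keep the sketch dependency-free.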

###Usage

Text input may be

  • key words
  • short phrases (typically in a text file, one word or phrase / line)
  • free text such as paper abstracts, full texts and tutorials

The input is mapped to EDAM concepts (including deprecated concepts), specifically to their:

  • preferred labels
  • exact synonyms
  • narrow and broad synonyms
  • concept definition (maybe)
  • concept comments (maybe)

Output includes (provisionally) at least the following information:

  • text input : supplier-provided keyword, short phrase or text
  • label_or_synonym : EDAM label or synonym that keyword / phrase was matched to
  • URI : EDAM URI of the matched class
  • obsolete : one of "yes" or "no"
  • match_type : one of "Label", "Exact_synonym", "Narrow_synonym" or "Broad_synonym"
  • match_conf : one of "Exact" (for non-case-sensitive exact text matches) or "Inexact" (otherwise)
  • branch : one of "Topic", "Operation", "Data" or "Format"
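
To illustrate the record described above, here is a minimal sketch of a case-insensitive exact lookup that returns one record per hit. The vocabulary in `CONCEPTS` is invented for the example, and the function name `match_keyword` and the key `text_input` are assumptions; only "Exact" matches are produced, since no fuzzy matching is attempted.

```python
# Hypothetical mini-vocabulary: (URI, obsolete, {match_type: [terms]}).
CONCEPTS = [
    ("http://edamontology.org/topic_9999", "no",
     {"Label": ["Example topic"], "Exact_synonym": ["Sample topic"]}),
    ("http://edamontology.org/operation_9998", "no",
     {"Label": ["Example operation"], "Narrow_synonym": ["Sample step"]}),
]


def match_keyword(keyword):
    """Look one keyword up against labels and synonyms, ignoring case."""
    needle = keyword.strip().lower()
    hits = []
    for uri, obsolete, terms in CONCEPTS:
        # Branch is taken from the URI fragment, e.g. "topic_9999" -> "Topic".
        branch = uri.rsplit("/", 1)[-1].split("_")[0].capitalize()
        for match_type, values in terms.items():
            for term in values:
                if term.lower() == needle:
                    hits.append({
                        "text_input": keyword,
                        "label_or_synonym": term,
                        "URI": uri,
                        "obsolete": obsolete,
                        "match_type": match_type,
                        "match_conf": "Exact",
                        "branch": branch,
                    })
    return hits
```

For example, `match_keyword("sample TOPIC")` yields a single record with `match_type` set to "Exact_synonym" and `branch` set to "Topic".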

###Technical and Methodological Roadmap

Definition of input data

  • Define a document: title, abstract, text, outlines, figures, ...
  • Define what can be mapped against EDAM: abstracts of papers defining a tool and/or database in the registry; tool collections with describing terms/keywords
  • Continuously adapt the tool to handle larger texts

Implement specific implementations

Flexible output options will be needed on a case-by-case basis. Assuming a single line of output, e.g.:

keyword_or_phrase | label_or_synonym | URI | obsolete | match_type | match_conf | branch

At least one output line is required for every input keyword/phrase, even in cases where no match was found.
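
A sketch of how such lines might be written, including the required line for inputs without a match. Leaving the remaining fields empty in that case is an assumption, as is the function name `format_rows`; the page does not specify either.

```python
FIELDS = ["keyword_or_phrase", "label_or_synonym", "URI", "obsolete",
          "match_type", "match_conf", "branch"]


def format_rows(keyword, matches):
    """Return one pipe-delimited line per match, or a single line with
    empty fields when nothing matched (assumed convention)."""
    if not matches:
        return [" | ".join([keyword] + [""] * (len(FIELDS) - 1))]
    return [
        " | ".join([keyword] + [str(m.get(f, "")) for f in FIELDS[1:]])
        for m in matches
    ]
```

Each element of `matches` is expected to be a dict keyed by the field names above; missing keys are written as empty fields.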

In practice, flexible output options will be needed to suit various applications, e.g.:

  • Best n matches for all EDAM branches (Topic, Operation, Data, Format)
  • Best n matches for specific branches (Topic and Operation only, say)
  • Score for fuzziness of mapping
  • Score for more general mapping quality (e.g. number of hits in large texts)

In terms of "quality" of matches: labels > exact synonym > narrow or broad synonym > definition > comment

and: exact full-text matches > "fuzzy" matches, i.e. an exact full-text match to the concept label is best of all.
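
The two orderings can be combined into a sort key. How they should be combined is not stated here, so the sketch below assumes that exactness takes precedence over match type; the "Definition" and "Comment" match types are likewise assumed extensions of the match_type values listed under Usage.

```python
# Lower rank = better match. Narrow and broad synonyms share a rank.
TYPE_RANK = {
    "Label": 0,
    "Exact_synonym": 1,
    "Narrow_synonym": 2,
    "Broad_synonym": 2,
    "Definition": 3,   # assumed extension
    "Comment": 4,      # assumed extension
}
CONF_RANK = {"Exact": 0, "Inexact": 1}


def rank_matches(matches):
    """Sort matches best-first: exact before inexact (assumption),
    then by match type quality."""
    return sorted(
        matches,
        key=lambda m: (CONF_RANK[m["match_conf"]], TYPE_RANK[m["match_type"]]),
    )


def best_n(matches, n):
    """Return the n best matches."""
    return rank_matches(matches)[:n]
```

With this key, an exact match to a concept label always sorts first, which agrees with the statement that it is the best match of all.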

Thus output definition is needed on a case-by-case basis.

Different mapping techniques

  • Start from the EDAM concept, derive a regular expression (REGEX) from it, and use that expression as a lookup key
  • Start from the text and parse it into a vector of words, compare each word to the concept, then aggregate the results and compare again with compound words
  • Sequence alignment (open question: how to set the threshold for a positive hit)
  • Prediction based on machine learning
  • Apply search engines to find EDAM terms in text
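
The first technique in the list could be sketched as follows: each concept's terms are turned into one case-insensitive regular expression, which is then run over the text. The vocabulary passed in the usage example is invented, and the function names are hypothetical.

```python
import re


def concept_regex(terms):
    """Build one case-insensitive pattern matching any of the terms
    as whole words. Longer terms are tried first."""
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_concepts(text, vocabulary):
    """Return {URI: [matched strings]} for every concept found in text.

    vocabulary maps a concept URI to its list of labels and synonyms.
    """
    hits = {}
    for uri, terms in vocabulary.items():
        found = [m.group(0) for m in concept_regex(terms).finditer(text)]
        if found:
            hits[uri] = found
    return hits


if __name__ == "__main__":
    vocab = {"http://edamontology.org/topic_9999":
             ["Example topic", "Sample topic"]}
    print(find_concepts("This tutorial covers a sample topic.", vocab))
```

Compiling one pattern per concept keeps the lookup simple; for a large vocabulary the patterns would more sensibly be compiled once and cached rather than rebuilt on every call.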

Methodology

  • Pre-processing:
    • Try without
    • Decide how to treat special characters such as hyphens, full stops, ...
    • Eliminate promiscuous words from the documents (i.e. conjunctions, prepositions, ...) and apply information retrieval techniques to weight words by their significance (look for available software and apply tf-idf)
  • Processing: Running the selected technique(s) on the input data to obtain the matches
  • Post-processing: Elimination of the matches with low significance from the output (based on a measure of match confidence and/or a defined score)
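
A minimal sketch of the pre- and post-processing steps, using only the standard library. The stop-word list is a tiny illustrative stand-in, the tokeniser's choice to keep hyphenated words together is one possible treatment among several, and the tf-idf variant (relative term frequency times the log of inverse document frequency) is a common textbook form rather than a method chosen on this page.

```python
import math
import re
from collections import Counter

# Illustrative only; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "with", "or"}


def tokenize(text):
    """Lower-case, keep hyphenated words whole, drop stop words."""
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [w for w in words if w not in STOPWORDS]


def tf_idf(documents):
    """Return one {word: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights


def drop_low_confidence(scored_matches, threshold):
    """Post-processing: keep (term, score) pairs at or above threshold."""
    return [(t, s) for t, s in scored_matches if s >= threshold]
```

Words that occur in every document receive a weight of zero under this variant, which is the intended effect for promiscuous words that survive the stop-word filter.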

Validation and benchmarking

  • Benchmarking on a reference dataset: manually curated mappings between registry sources and EDAM can be used to validate the output quality. Start simple by counting validated matches
  • Comparison of several mining techniques to select the best performing one.
  • Parameter optimization
  • Find optimal score to distinguish good matches
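
Counting validated matches against a curated reference could look like the sketch below. Representing mappings as sets of `(tool_id, edam_uri)` pairs is an assumption, as is reporting precision, recall and F1 alongside the raw count.

```python
def evaluate(predicted, reference):
    """Compare predicted mappings with a manually curated reference.

    Both arguments are sets of (tool_id, edam_uri) pairs.
    """
    validated = len(predicted & reference)
    precision = validated / len(predicted) if predicted else 0.0
    recall = validated / len(reference) if reference else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "validated": validated,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Running `evaluate` for each candidate technique, or for a range of score thresholds, gives comparable numbers for selecting the best-performing technique and tuning its parameters.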

More notes

See also:

https://docs.google.com/document/d/1Ms04Fm-n9RJzhNLdeYQCKGHK03fgsSrztglVR-VvQUw/edit

which should move to:

https://github.com/bio-tools/biotoolsConnect/wiki
