The Unified Biomedical Knowledge Graph (UBKG) is a knowledge graph database that represents a set of interrelated concepts from biomedical ontologies and vocabularies. The UBKG combines information from the National Library of Medicine's Unified Medical Language System (UMLS) with assertions from “non-UMLS” ontologies or vocabularies, including:
- Ontologies published in references such as the NCBO BioPortal and the OBO Foundry.
- Custom ontologies derived from data sources such as UniProtKB.
- Other custom ontologies, such as those for the HuBMAP platform.
An important goal of the UBKG is to establish connections between ontologies. For example, if information on the relationships between proteins and genes described in one ontology can be connected to information on the relationships between genes and diseases described in another ontology, it may be possible to identify previously unknown relationships between proteins and diseases.
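The protein-gene-disease example can be sketched in a few lines of Python. This is only an illustration of the idea of joining assertions from two sources on a shared gene; the identifiers, predicate names, and the `infer_protein_disease` helper are hypothetical and are not taken from the UBKG.

```python
from collections import defaultdict

# Hypothetical assertions (subject, predicate, object); placeholder identifiers only.
protein_gene = [
    ("PROTEIN:P1", "gene_product_of", "GENE:G1"),
    ("PROTEIN:P2", "gene_product_of", "GENE:G2"),
]
gene_disease = [
    ("GENE:G1", "associated_with", "DISEASE:D1"),
    ("GENE:G1", "associated_with", "DISEASE:D2"),
]


def infer_protein_disease(protein_gene, gene_disease):
    """Return sorted (protein, gene, disease) paths that share a gene."""
    # Index the second source by gene so each protein lookup is cheap.
    diseases_by_gene = defaultdict(set)
    for gene, _predicate, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    paths = set()
    for protein, _predicate, gene in protein_gene:
        for disease in diseases_by_gene.get(gene, ()):
            paths.add((protein, gene, disease))
    return sorted(paths)


candidates = infer_protein_disease(protein_gene, gene_disease)
for protein, gene, disease in candidates:
    print(f"{protein} -> {gene} -> {disease}")
```

Each returned path is a candidate link to investigate, not an established relationship between the protein and the disease.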
The primary components of the UBKG are:
- a graph database, deployed in neo4j
- a REST API that provides access to the information in the graph database
The UBKG database is populated by loading a set of CSV files with [neo4j-admin import](https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/). The set of CSV import files is the product of two generation frameworks.
The UBKG prohibits direct Cypher access to the neo4j knowledge graph database. The UBKG API is a REST API with endpoints that can be used to return information from the UBKG.
The UBKG API is described in this SmartAPI page.
The source framework is a combination of manual and automated processes that obtain the base set of nodes (entities) and edges (relationships) of the UBKG graph.
The source framework is also known as the UMLS-Graph.
- Information on the concepts in the ontologies and vocabularies that are integrated into the UMLS Metathesaurus can be downloaded using the MetamorphoSys application. MetamorphoSys can be configured to download subsets of the entire UMLS.
- Additional semantic information related to the UMLS can be downloaded manually from the Semantic Network.
The result of the Metathesaurus and Semantic Network downloads is a set of files in Rich Release Format (RRF). The RRF files contain information on source vocabularies or ontologies, codes, terms, and relationships both with other codes in the same vocabularies and with UMLS concepts.
The RRF files are loaded into a data mart. A Python script then executes SQL scripts that perform extraction, transformation, and loading (ETL) of the RRF data into a set of twelve temporary tables. These tables are exported to CSV files that become the UMLS CSVs.
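As a rough illustration of the RRF-to-CSV idea, the sketch below parses pipe-delimited RRF rows and writes a subset of columns as CSV. It is a pure-Python stand-in, not the project's SQL-based ETL: the column list is assumed from the documented layout of MRCONSO.RRF and should be checked against the UMLS release in use, the sample row is invented, and the helper names `parse_rrf_line` and `rrf_to_csv` are hypothetical.

```python
import csv
import io

# Assumed MRCONSO.RRF column order; verify against the UMLS release in use.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def parse_rrf_line(line, columns):
    """Split one pipe-delimited RRF row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("|")
    # RRF rows end with a trailing pipe, which produces one extra empty field.
    if fields and fields[-1] == "":
        fields = fields[:-1]
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


def rrf_to_csv(lines, columns, keep):
    """Return CSV text containing only the `keep` columns of each RRF row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in lines:
        writer.writerow(parse_rrf_line(line, columns))
    return buffer.getvalue()


# Invented sample row, not real UMLS content.
sample = ["C9999999|ENG|P|L0000001|PF|S0000001|Y|A0000001||||EXAMPLE|PT|X001|Example term|0|N||"]
csv_text = rrf_to_csv(sample, MRCONSO_COLUMNS, ["CUI", "SAB", "CODE", "STR"])
print(csv_text)
```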
The UMLS CSVs can be loaded into neo4j to build a graph version of the UMLS, including concepts and relationships from over 150 vocabularies and ontologies that are integrated into the UMLS, such as SNOMED CT, ICD10, and NCI.
The UBKG extends the UMLS graph by integrating additional concepts and relationships from sources outside of the UMLS. Many of these sources are standard biomedical ontologies published in NCBO BioPortal. The sources include:
Ontology or Source | Description |
---|---|
PATO | Phenotypic Quality Ontology |
UBERON | Uber Anatomy Ontology |
CL | Cell Ontology |
DOID | Human Disease Ontology |
OBI | Ontology for Biomedical Investigations |
EDAM | EDAM |
HSAPDV | Human Developmental Stages Ontology |
SBO | Systems Biology Ontology |
MI | Molecular Interactions |
CHEBI | Chemical Entities of Biological Interest Ontology |
MP | Mammalian Phenotype Ontology |
ORDO | Orphanet Rare Disease Ontology |
UO | Units of Measurement Ontology |
UNIPROTKB | Protein-gene relationships from UniProtKB |
HUSAT | HuBMAP Samples Added Terms |
HUBMAP | the application ontology supporting the infrastructure of the HuBMAP Consortium |
CCF | Human Reference Atlas Common Coordinate Framework Ontology |
MONDO | MONDO Disease Ontology |
EFO | Experimental Factor Ontology |
SENNET | the application ontology supporting the infrastructure of the SenNet Consortium |
The generation framework is a suite of scripts that:
- extract information on assertions (also known as triples, or subject-predicate-object relationships) found in ontologies or derived from other sources
- iteratively add assertion information to the base set of UMLS CSVs to create a set of ontology CSVs.
Once a set of ontology CSVs is ready, they can be imported into a neo4j database to form a new instance of the UBKG.
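The iterative step described above can be pictured as repeatedly merging each ontology's triples into a growing set of nodes and edges. The following is a minimal sketch under stated assumptions: nodes and edges are held as in-memory lists rather than the actual CSV files, and the `add_assertions` helper, identifiers, and predicates are hypothetical and do not reflect the real generation scripts.

```python
def add_assertions(nodes, edges, triples):
    """Return new (nodes, edges) lists with the triples merged in, without duplicates."""
    merged_nodes = list(nodes)
    merged_edges = list(edges)
    seen_nodes = set(merged_nodes)
    seen_edges = set(merged_edges)

    for subject, predicate, obj in triples:
        # Both ends of an assertion must exist as nodes.
        for node in (subject, obj):
            if node not in seen_nodes:
                seen_nodes.add(node)
                merged_nodes.append(node)
        edge = (subject, predicate, obj)
        if edge not in seen_edges:
            seen_edges.add(edge)
            merged_edges.append(edge)
    return merged_nodes, merged_edges


# Hypothetical base content standing in for the UMLS CSVs.
base_nodes = ["UMLS:C1"]
base_edges = []

# Hypothetical assertions extracted from two ontologies, processed in order.
ontology_triples = {
    "ONTO_A": [("A:1", "part_of", "UMLS:C1")],
    "ONTO_B": [("B:1", "related_to", "A:1"), ("A:1", "part_of", "UMLS:C1")],
}

nodes, edges = base_nodes, base_edges
for name, triples in ontology_triples.items():
    nodes, edges = add_assertions(nodes, edges, triples)
    print(f"after {name}: {len(nodes)} nodes, {len(edges)} edges")
```

Because each pass starts from the output of the previous one, a later ontology can refer to nodes introduced by an earlier ontology, which is one reason the processing order may matter.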
The generation framework can work with:
- data from ontologies published in Web Ontology Language (OWL) files that conform to the principles of the OBO Foundry
- data from private or custom ontologies that are in the SimpleKnowledge format. (SimpleKnowledge is a lightweight, spreadsheet-based ontology editor developed by Pitt UBMI.)
- assertion data that conforms to the UBKG Edge/Node format.
The generation framework obtains assertion data from OWL files with scripts that are based on the Phenotype Knowledge Translator (PheKnowLator) application. PheKnowLator converts information from an OWL file into the OWL-NETS (OWL NEtwork Transformation for Statistical learning) format.