Term Occurrence
The BioPortal software generates a dictionary of terms from the preferred labels and synonyms of all ontologies in the BioPortal application. The software then creates a set of data files containing the number of times dictionary terms occur, both as singletons and in pairs, for each resource in the NCBO Resource Index. This page describes the content of these files.
For more information about how this data could be used in a research setting, refer to "Building the graph of medicine from millions of clinical narratives" by Samuel G. Finlayson, Paea LePendu, & Nigam H. Shah.
There are three data files associated with each resource in the Resource Index.
The sampling rate file is a simple text file that contains:
- resource acronym
- total number of documents in the resource
- sampling rate used to create term singleton and co-frequency counts
The following is an example of the sampling rate file for the ArrayExpress resource:
Acronym: AE
Total documents: 48565
Sampling rate: 1
A sampling rate of 1 means that every document in the resource is visited during calculation of term singleton and co-frequency counts. Sampling rates larger than 1 indicate that counts are calculated from a subset of documents. For example, a sampling rate of 100 means that the system only visits one document out of every 100.
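As a minimal sketch of how the sampling rate file might be read, the Python below parses the three "Key: value" lines shown above and estimates how many documents were visited. The function names are hypothetical, and the sketch assumes the sampling rate is always a whole number and that the file uses exactly the labels in the example.

```python
import math


def parse_sampling_rate_file(text):
    """Parse 'Key: value' lines into a dict with typed values.

    Assumes the labels 'Acronym', 'Total documents' and 'Sampling rate'
    appear exactly as in the ArrayExpress example.
    """
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unexpected lines
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return {
        "acronym": fields["Acronym"],
        "total_documents": int(fields["Total documents"]),
        "sampling_rate": int(fields["Sampling rate"]),
    }


def estimated_documents_visited(info):
    """Estimate the number of documents visited.

    With a sampling rate of N, roughly one document out of every N is
    visited; the rounding direction is an assumption.
    """
    return math.ceil(info["total_documents"] / info["sampling_rate"])


example = """Acronym: AE
Total documents: 48565
Sampling rate: 1"""

info = parse_sampling_rate_file(example)
print(info["acronym"], estimated_documents_visited(info))  # prints: AE 48565
```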
The singleton frequency data file contains the total number of times dictionary terms occur in the documents for a particular resource. In the example below, the term "above knee amputation" occurs once across the set of documents, "abscess" appears 6 times across the set of documents, etc.
1 above knee amputation
1 abrasion
6 abscess
4 absence
3 absence of
4 absent
1 absolute
11 absorb
5 absorbing
File format: tab-delimited; column 1 = frequency count, column 2 = term. Files are compressed with gzip. To decompress a file, use the gunzip command: gunzip filename.tsv.gz.
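The singleton file can also be read without decompressing it first. The following is a minimal sketch, assuming the two-column layout described above and UTF-8 text; read_singleton_counts is a hypothetical helper name, and filename.tsv.gz is the placeholder used on this page.

```python
import gzip


def read_singleton_counts(path):
    """Return a dict mapping each term to its frequency count.

    Expects a gzip-compressed, tab-delimited file with the count in
    column 1 and the term in column 2. Lines that do not have exactly
    two columns are skipped.
    """
    counts = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue
            count, term = parts
            counts[term] = int(count)
    return counts


# Usage (placeholder file name):
# counts = read_singleton_counts("filename.tsv.gz")
# counts.get("abscess", 0)
```

A plain dictionary keeps lookups simple; for very large resources, streaming over the lines instead of holding every term in memory may be preferable.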
The co-frequency data file contains the total number of pair-wise occurrences of dictionary terms in the documents for a particular resource. In the example below, the pair of terms "anaphylaxis" and "activity" occurs in 3 documents of the resource, the pair "anaphylaxis" and "allergic" occurs in 18 documents, etc.
1 anaphylaxis activities
3 anaphylaxis activity
1 anaphylaxis actual
3 anaphylaxis additional
1 anaphylaxis affect
8 anaphylaxis after
13 anaphylaxis against
18 anaphylaxis allergic
Format: tab-delimited; column 1 = frequency count, columns 2 & 3 = terms. Files are compressed with gzip. To decompress a file, use the gunzip command: gunzip filename.tsv.gz.
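A co-frequency file could be loaded in a similar way. This sketch assumes the three-column layout described above; the helper names are hypothetical. Whether each pair is stored only once, in a single order, is not stated on this page, so the lookup below checks both orders as a precaution.

```python
import gzip


def read_cofrequency_counts(path):
    """Return a dict mapping (term_a, term_b) to the pair's count.

    Expects a gzip-compressed, tab-delimited file with the count in
    column 1 and the two terms in columns 2 and 3.
    """
    pairs = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip lines that do not match the layout
            count, term_a, term_b = parts
            pairs[(term_a, term_b)] = int(count)
    return pairs


def pair_count(pairs, term_a, term_b):
    """Look up a pair in either order; return 0 if it is absent.

    Checking both orders is an assumption, not documented behavior.
    """
    if (term_a, term_b) in pairs:
        return pairs[(term_a, term_b)]
    return pairs.get((term_b, term_a), 0)


# Usage (placeholder file name):
# pairs = read_cofrequency_counts("filename.tsv.gz")
# pair_count(pairs, "allergic", "anaphylaxis")
```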
Adverse Event Reporting System Data
AgingGenesDB (via NIF)
Antibody Registry (via NIF)
ArrayExpress
ARRS GoldMiner
AutDB (via NIF)
BioGRID (via NIF)
BioModels
Biositemaps
caArray
caNanoLab
Cell Centered Database (via NIF)
CellImageLibrary (via NIF)
ClinicalTrials.gov
Conserved Domain Database
Coriell Cell Repository (via NIF)
CTD ChemDisease (via NIF)
CTD ChemGene (via NIF)
CTD DiseasePathway (via NIF)
Database of Genotypes and Phenotypes
Drug Related Gene Database (via NIF)
DrugBank
Entrez Gene (via NIF)
GEMMA (via NIF)
Gene Expression Omnibus DataSets
GeneNetwork (via NIF)
Integrated Disease View (via NIF)
Integrated Videos (via NIF)
InterNano Process Database
MICAD
ModelDB (via NIF)
NIH RePORTER (via NIF)
Online Mendelian Inheritance in Man
Pathway Commons
PDSP Ki database (via NIF)
PharmGKB [Disease]
PharmGKB [Drug]
PharmGKB [Gene]
PubChem
PubMed
PubMedHealth Drugs (via NIF)
PubMedHealth Tests (via NIF)
Reactome
ResearchCrossroads
Stanford Microarray Database
ToxinDB (via NIF)
UniProt KB
WikiPathways
In addition to the singleton and co-frequency counts of dictionary terms provided in the data files above, researchers may wish to calculate counts for occurrences of ontology classes. To assist in this calculation, the BioPortal software generates a file that maps dictionary terms to their corresponding classes.
The mapping file contains an alphabetically sorted list of all terms (preferred labels and synonyms) in the BioPortal application. Each row contains a term, the ontology in which the term appears, and the corresponding class ID for the term.
In the following example excerpt from the mapping file, the term CELL appears in the BIRNLEX ontology with class ID http://bioontology.org/projects/ontologies/birnlex#birnlex_12. CELL also appears in the RXNORM ontology with class ID http://purl.bioontology.org/ontology/STY/T025, etc.
CELL BIRNLEX http://bioontology.org/projects/ontologies/birnlex#birnlex_12
CELL RXNORM http://purl.bioontology.org/ontology/STY/T025
CELL SAO http://ccdb.ucsd.edu/SAO/1.2#sao1813327414
CELL SCTSPA http://purl.bioontology.org/ontology/STY/T025
CELL SIO http://semanticscience.org/resource/SIO_010001
CELL SNMI http://purl.bioontology.org/ontology/STY/T025
CELL SNOMEDCT http://purl.bioontology.org/ontology/SNOMEDCT/362837007
Format: tab-delimited; column 1 = term, column 2 = ontology acronym, column 3 = class ID. The file is compressed with gzip. To decompress it, use the gunzip command: gunzip labels_to_classes.tsv.gz.
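One way the mapping file might be combined with the singleton counts to obtain per-class counts is sketched below. Two points are assumptions rather than documented behavior: terms are matched case-insensitively (the mapping excerpt shows CELL in upper case, while the frequency files show lower-case terms), and a class count is taken as the sum of the counts of every term mapped to that class. The function names are hypothetical.

```python
import gzip
from collections import defaultdict


def read_term_to_classes(path):
    """Return a dict mapping a lower-cased term to a set of
    (ontology acronym, class ID) tuples.

    Expects a gzip-compressed, tab-delimited file with the term in
    column 1, the ontology acronym in column 2 and the class ID in
    column 3.
    """
    mapping = defaultdict(set)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue
            term, ontology, class_id = parts
            mapping[term.lower()].add((ontology, class_id))
    return mapping


def class_counts(term_counts, mapping):
    """Sum term counts per (ontology, class ID).

    Assumption: every term contributes its full count to each class
    it maps to. Terms with no mapping are ignored.
    """
    totals = defaultdict(int)
    for term, count in term_counts.items():
        for key in mapping.get(term.lower(), ()):
            totals[key] += count
    return dict(totals)


# Usage (term_counts is a dict of term -> count from a singleton file):
# mapping = read_term_to_classes("labels_to_classes.tsv.gz")
# totals = class_counts(term_counts, mapping)
```

Because several ontologies can share a class ID (RXNORM, SCTSPA and SNMI in the excerpt above), the sketch keys the totals by ontology and class ID together; keying by class ID alone would merge those entries.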