Update data_dict.md
benstear authored Nov 13, 2023
1 parent a77ec8b commit 2cc5ac3
Showing 1 changed file with 15 additions and 14 deletions.
29 changes: 15 additions & 14 deletions petagraph/data_dict.md
@@ -60,7 +60,7 @@ return * limit 1
## Genotype-Tissue Expression Portal, eQTL data (GTEXEQTL)
**Source**: The GTEx eQTL data we ingested comes from the file `GTEx_Analysis_v8_eQTL.tar` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**.

**Preprocessing**: For this first ingestion of GTEx's eQTL data, we only included eQTLs that were present in every tissue. This reduced the number of eQTLs in the dataset from 71 million to 2.1 million. Furthermore, we did not include any eQTLs that were not mapped to genes with a valid `HGNC` Code. This criterion dropped about 14% of the eQTLs. We then created eQTL nodes and attached them to their respective gene (`HGNC`), tissue (`UBERON`), genomic location `([HSCLO](https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#homo-sapiens-chromosomal-location-ontology-hsclo)`, see section below) and p-value (`PVALUEBINS`) nodes. The following list of numbers was used to create the p-value bins: `[0,1e-12,1e-11,1e-10,1e-9,1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,.005,.01,.02,.03,.04,.05,.06]`
**Preprocessing**: For this first ingestion of GTEx's eQTL data, we only included eQTLs that were present in every tissue. This reduced the number of eQTLs in the dataset from 71 million to 2.1 million. Furthermore, we did not include any eQTLs that were not mapped to genes with a valid `HGNC` Code. This criterion dropped about 14% of the eQTLs. We then created eQTL nodes and attached them to their respective gene (`HGNC`), tissue (`UBERON`), genomic location ([HSCLO](https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#homo-sapiens-chromosomal-location-ontology-hsclo), see section below) and p-value (`PVALUEBINS`) nodes. The following list of numbers was used to create the p-value bins: `[0,1e-12,1e-11,1e-10,1e-9,1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,.005,.01,.02,.03,.04,.05,.06]`
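
As a rough illustration of how a p-value could be assigned to one of these bins, the standalone Cypher snippet below looks up the pair of adjacent bin edges that bracket a value. It is only a sketch: the half-open interval convention (lower edge inclusive, upper edge exclusive) and the example p-value are assumptions, not taken from the ingestion code.

```cypher
// Sketch only: locate the bin for an example p-value.
// Assumption: bins are half-open intervals [lower, upper).
with [0, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3,
      0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06] as edges,
     3.2e-9 as p  // hypothetical p-value
with edges, p,
     // indices of every bin whose edges bracket p (at most one)
     [i in range(0, size(edges) - 2) where edges[i] <= p and p < edges[i + 1] | i] as hits
return p, edges[hits[0]] as lower_bound, edges[hits[0] + 1] as upper_bound
```

A value at or above the last edge (0.06) matches no bin under this convention, so both bounds come back as `null`.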

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/gtex_eqtl.png" alt="drawing" width="800"/>

@@ -84,7 +84,7 @@ return * limit 1

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/gtexcoexp.png" alt="drawing" width="800"/>

**Schema Description**: Two HGNC Concepts are shown along with their Codes and preferred Terms. They're connected by a `coexpressed_with` relationship. There is an `evidence_class` property on the relationship that specifies how many tissues the two genes are highly co-expressed in. The SAB for this dataset, `GTEXCOEXP`, is located on the `coexpressed_with` and `inverse_coexpressed_with` relationships.
**Schema Description**: Two `HGNC` Concepts are shown along with their Codes and preferred Terms. They're connected by a `coexpressed_with` relationship. There is an `evidence_class` property on the relationship that specifies how many tissues the two genes are highly co-expressed in. The SAB for this dataset, `GTEXCOEXP`, is located on the `coexpressed_with` and `inverse_coexpressed_with` relationships.

```cypher
// Cypher query to reproduce the schema figure
@@ -101,7 +101,7 @@ The human to mouse orthology mapping data were also obtained in April 2023 from

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/HGNCHCOP.png" alt="drawing" width="800"/>

**Schema Description**: HGNC Concept (blue), Code (yellow) and Term (brown) from HGNC on the left and its corresponding Mouse gene Concept and Code (SAB = `HCOP`) on the right. The SAB for this mapping dataset is `HGNCHCOP` and is located on the SAB property of the `in_1_to_1_relationship_with` and `inverse_in_1_to_1_relationship_with` relationships.
**Schema Description**: An `HGNC` Concept (blue), Code (yellow) and Term (brown) on the left and its corresponding Mouse gene Concept and Code (SAB = `HCOP`) on the right. The SAB for this mapping dataset is `HGNCHCOP` and is located on the SAB property of the `in_1_to_1_relationship_with` and `inverse_in_1_to_1_relationship_with` relationships.


```cypher
@@ -121,7 +121,7 @@ These data are generated by the HPO group to use OMIM disease-gene associations

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/HGNCHPO.png" alt="drawing" width="800"/>

**Schema Description**: On the left-hand side, HGNC Concept (blue), Code (yellow) and Term (brown) nodes are connected to an HPO Concept node through an `associated_with` relationship. The SAB for this mapping dataset is HGNCHPO and it is located on the SAB property of the `associated_with` and `inverse_associated_with` relationships. In this example we can see that the ODAD2 gene is associated with Atelectasis.
**Schema Description**: On the left-hand side, `HGNC` Concept (blue), Code (yellow) and Term (brown) nodes are connected to an `HPO` Concept node through an `associated_with` relationship. The SAB for this mapping dataset is `HGNCHPO` and it is located on the SAB property of the `associated_with` and `inverse_associated_with` relationships. In this example we can see that the ODAD2 gene is associated with Atelectasis.

```cypher
// Cypher query to reproduce the schema figure
@@ -137,7 +137,7 @@ return * limit 1

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/HCOPMP.png" alt="drawing" width="800"/>

**Schema Description**: On the left-hand side, MP Concept (blue), Code (yellow) and Term (brown) nodes are connected to an HCOP Concept node through an `involved_in` relationship. The HCOP Code nodes represent mouse genes. The SAB for this mapping dataset is HCOPMP and it is located on the SAB property of the `involved_in` and `inverse_involved_in` relationships.
**Schema Description**: On the left-hand side, `MP` Concept (blue), Code (yellow) and Term (brown) nodes are connected to an `HCOP` Concept node through an `involved_in` relationship. The `HCOP` Code nodes represent mouse genes. The SAB for this mapping dataset is HCOPMP and it is located on the SAB property of the `involved_in` and `inverse_involved_in` relationships.

```cypher
// Cypher query to reproduce the schema figure
@@ -153,7 +153,7 @@ return * limit 1

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/HPOMP.png" alt="drawing" width="800"/>

**Schema Description**: A Concept (blue), Code (yellow) and Term (brown) from MP on the left and its corresponding HPO Concept, Code and Term on the right. They are connected through an `is_approximately_equivalent_to` relationship. The SAB for these mappings, HPOMP, can be found on the SAB property of the bidirectional relationships.
**Schema Description**: A Concept (blue), Code (yellow) and Term (brown) from `MP` on the left and its corresponding `HPO` Concept, Code and Term on the right. They are connected through an `is_approximately_equivalent_to` relationship. The SAB for these mappings, `HPOMP`, can be found on the SAB property of the bidirectional relationships.

```cypher
// Cypher query to reproduce the schema figure
@@ -165,11 +165,11 @@ return * limit 1
## Human-Rat ENSEMBL orthologs (RATHCOP)
**Source**: The source of the human ENSEMBL to rat ENSEMBL orthologs is the HGNC Comparisons of Orthology Predictions tool. Go to https://www.genenames.org/tools/hcop/, scroll to the Bulk Downloads section at the bottom of the page, select `Rat` in the first drop-down menu and `15 columns`, and download the data.

**Preprocessing**: No preprocessing was needed on these mappings; we simply selected the `human_ensembl_gene` and `rat_ensembl_gene` columns.
**Preprocessing**: No preprocessing was needed on these mappings; we simply selected the `human_ensembl_gene` and `rat_ensembl_gene` columns from the dataset.

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/RATHCOP.png" alt="drawing" width="800"/>

**Schema Description**: ...
**Schema Description**: A human ENSEMBL Concept and Code are shown on the left, and the orthologous rat ENSEMBL Concept and Code are shown on the right. The Concepts are connected by `has_human_ortholog` and `inverse_has_human_ortholog` relationships. The `RATHCOP` SAB is located on the SAB property of both Concept-Concept relationships.

```cypher
// Cypher query to reproduce the schema figure
@@ -181,9 +181,9 @@ return * limit 1

---
## Homo Sapiens Chromosomal Location Ontology (HSCLO)
**Source**: The Homo Sapiens Chromosomal Location Ontology (HSCLO) was created by Taha Ahooyi Mohseni of the Petagraph team. HSCLO was primarily created to connect 4DN loop coordinates to the rest of the graph through the mapping between HSCLO and GENCODE. HSCLO was later utilized to connect GTEXEQTL locations in the graph as searchable nodes at 1kbp resolution.
**Source**: The Homo Sapiens Chromosomal Location Ontology (HSCLO) was created by Taha Ahooyi Mohseni of the Petagraph team. HSCLO was primarily created to connect 4DN loop coordinates to the rest of the graph through the mapping between HSCLO and GENCODE. HSCLO was later utilized to connect `GTEXEQTL` locations in the graph as searchable nodes at 1kbp resolution.

**Preprocessing**: The dataset relationships as well as nodes use HSCLO as their SAB. HSCLO nodes are defined at 5 resolution levels: chromosomes, 1 Mbp, 100 kbp, 10 kbp and 1 kbp. Each level connects to lower levels with `above_(resolution level)_band` (e.g. "above_1Mbp_band", "above_1kbp_band"), and nodes at the same resolution level are connected through `precedes_(resolution level)_band` (e.g. "precedes_10kbp_band"). The dataset contains 3,431,155 nodes and 6,862,195 relationships.
**Preprocessing**: The dataset relationships as well as Code nodes use `HSCLO` as their SAB. `HSCLO` nodes are defined at 5 resolution levels: chromosomes, 1 Mbp, 100 kbp, 10 kbp and 1 kbp. Each level connects to lower levels with `above_(resolution level)_band` (e.g. `above_1Mbp_band`, `above_1kbp_band`), and nodes at the same resolution level are connected through `precedes_(resolution level)_band` (e.g. `precedes_10kbp_band`). The dataset contains 3,431,155 nodes and 6,862,195 relationships.
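
To make the band structure concrete, a query along the following lines could pull one pair of neighbouring 10 kbp bands. This is a sketch that mirrors the Code-Concept pattern used by the other queries in this file; it assumes the relationship type is spelled exactly `precedes_10kbp_band` and leaves the relationship undirected because the stored direction is not stated here.

```cypher
// Sketch only: fetch one pair of adjacent 10 kbp HSCLO bands.
// Assumption: relationship type is spelled exactly precedes_10kbp_band.
match (a:Code {SAB:'HSCLO'})-[:CODE]-(b:Concept)-[r:precedes_10kbp_band {SAB:'HSCLO'}]-(c:Concept)-[:CODE]-(d:Code {SAB:'HSCLO'})
return * limit 1
```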


<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/HSCLO_2.png" alt="drawing" width="800"/>
@@ -226,6 +226,7 @@ return * limit 1
<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/LINCS.png" alt="drawing" width="800"/>

**Schema Description**: ...

```cypher
// Cypher query to reproduce the schema figure
match (a:Code {SAB:'CHEBI'})-[r0:CODE]-(b:Concept)-[r1 {SAB:'LINCS'}]-(c:Concept)-[r2:CODE]-(d:Code {SAB:'HGNC'})
@@ -236,7 +237,7 @@ return * limit 1
## Connectivity Map (CMAP)
**Source**: Signature perturbations of gene expression profiles induced by chemicals (small molecules) were obtained from the Ma’ayan Lab Harmonizome portal at [https://maayanlab.cloud/Harmonizome/dataset/CMAP+Signatures+of+Differentially+Expressed+Genes+for+Small+Molecules](https://maayanlab.cloud/Harmonizome/dataset/CMAP+Signatures+of+Differentially+Expressed+Genes+for+Small+Molecules).

**Preprocessing**: In a similar manner to the L1000 data integration discussed above, we obtained the edge lists of the CMAP Signatures of Differentially Expressed Genes for Small Molecules dataset from the Harmonizome database: https://maayanlab.cloud (Lamb et al. 2006; Rouillard et al. 2016). The data was computed based on an earlier study (Lamb et al. 2006; Rouillard et al. 2016). The dataset added 2,625,336 new relationships (including reverse relationships) connecting the Petagraph `CHEBI` and `HGNC` nodes, with types “negatively_correlated_with_gene”, “positively_correlated_with_gene”, “inverse_negatively_correlated_with_gene” and “inverse_positively_correlated_with_gene” and an SAB of CMAP.
**Preprocessing**: In a similar manner to the L1000 data integration discussed above, we obtained the edge lists of the CMAP Signatures of Differentially Expressed Genes for Small Molecules dataset from the Harmonizome database: https://maayanlab.cloud (Lamb et al. 2006; Rouillard et al. 2016). The data was computed based on an earlier study (Lamb et al. 2006; Rouillard et al. 2016). The dataset added 2,625,336 new relationships (including reverse relationships) connecting the Petagraph `CHEBI` and `HGNC` nodes, with types `negatively_correlated_with_gene`, `positively_correlated_with_gene`, `inverse_negatively_correlated_with_gene` and `inverse_positively_correlated_with_gene` and an SAB of `CMAP`.

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/CMAP.png" alt="drawing" width="800"/>

@@ -252,11 +253,11 @@ return * limit 1
## Molecular Signatures Database (MSIGDB) 
**Source**: MSigDB v7.4 datasets C1, C2, C3, C8 and H were obtained from the MSigDB molecular signature database. MSigDB is a collection of gene set resources, curated or collected from several different sources and can be accessed at [https://www.gsea-msigdb.org/gsea/msigdb/](https://www.gsea-msigdb.org/gsea/msigdb/). Five subsets of MSigDB v7.4 datasets were introduced as entity-gene relationships to the knowledge graph: C1 (positional gene sets), C2 (curated gene sets), C3 (regulatory target gene sets), C8 (cell type signature gene sets) and H (hallmark gene sets).

**Preprocessing**: With these five datasets, we created MSIGDB Concept nodes for 31,516 MSigDB systematic names. The relationships between these Concept nodes and HGNC nodes were established considering the information available in each of the mentioned 5 subsets where the subset information was included in the relationship SABs as MSIGDB. The Term names were also compiled according to the MSigDB generic entity names. Collectively, the five MSIGDB datasets added 2,598,060 Concept-Concept relationships to Petagraph. The relationship types used to map MSigDB Concept nodes to HGNC Concept nodes are: `MSigDB C1`, `MSigDB C2`, `MSigDB C3`, `MSigDB C8`, `MSigDB H`. The MSIGDB Codes SAB property is: `MSigDB_Systematic_Name`.
**Preprocessing**: With these five datasets, we created `MSIGDB` Concept nodes for 31,516 MSigDB systematic names. The relationships between these Concept nodes and HGNC nodes were established considering the information available in each of the mentioned 5 subsets where the subset information was included in the relationship SABs as `MSIGDB`. The Term names were also compiled according to the MSigDB generic entity names. Collectively, the five `MSIGDB` datasets added 2,598,060 Concept-Concept relationships to Petagraph. The relationship types used to map MSigDB Concept nodes to `HGNC` Concept nodes are: `MSigDB C1`, `MSigDB C2`, `MSigDB C3`, `MSigDB C8`, `MSigDB H`.

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/MSIGDB.png" alt="drawing" width="800"/>

**Schema Description**: An `HGNC` Concept, Code and Term node are connected to an `HGNC` Concept, Code and Term node on the right. The `MSIGDB` SAB is located on the SAB property for both sets of relationships
**Schema Description**: An `HGNC` Concept, Code and Term node are connected to an `HGNC` Concept, Code and Term node on the right. The `MSIGDB` SAB is located on the SAB property for both sets of relationships.

```cypher
// Cypher query to reproduce the schema figure
@@ -268,7 +269,7 @@ return * limit 1
## ClinVar (CLINVAR)
**Source**: Human genetic variant-disease associations were obtained from: [https://www.ncbi.nlm.nih.gov/clinvar/](https://www.ncbi.nlm.nih.gov/clinvar/). Only associations with Pathogenic or Likely Pathogenic consequence scores were included in the graph. We also did not include variants that affect a subset of genes (where there was no one-to-one relationship between a gene and phenotype/disease).

**Preprocessing**: The ClinVar human genetic variants-phenotype submission summary dataset (2023-01-05) was utilized to define relationships between human genes and phenotypes (Landrum et al. 2018). To retrieve the target phenotype/disease we used MEDGEN IDs listed in the ClinVar dataset (also already present in Petagraph). The `CLINVAR` variant-disease mappings gave rise to 214,040 new relationships (with the following characteristics: [type: gene_associated_with_disease_or_phenotype, SAB: CLINVAR] and [type: inverse_gene_associated_with_disease_or_phenotype, SAB: CLINVAR]) connecting HGNC and MEDGEN, MONDO, HPO, EFO and MESH Concept nodes.
**Preprocessing**: The ClinVar human genetic variants-phenotype submission summary dataset (2023-01-05) was utilized to define relationships between human genes and phenotypes (Landrum et al. 2018). To retrieve the target phenotype/disease we used MEDGEN IDs listed in the ClinVar dataset (also already present in Petagraph). The `CLINVAR` variant-disease mappings gave rise to 214,040 new relationships (with the following characteristics: [type: `gene_associated_with_disease_or_phenotype`, SAB: `CLINVAR`] and [type: `inverse_gene_associated_with_disease_or_phenotype`, SAB: `CLINVAR`]) connecting `HGNC` and `MEDGEN`, `MONDO`, `HPO`, `EFO` and `MESH` Concept nodes.
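
One way to sanity-check these relationships after loading the graph is to tally them by type. The query below is a sketch under the assumption that every ClinVar-derived relationship carries `SAB: 'CLINVAR'` and connects two Concept nodes, as described above; no particular counts are implied.

```cypher
// Sketch only: count CLINVAR Concept-Concept relationships by type.
match (:Concept)-[r {SAB:'CLINVAR'}]->(:Concept)
return type(r) as relationship_type, count(r) as n
order by n desc
```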

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/CLINVAR_2.png" alt="drawing" width="800"/>
