From 2cc5ac31750111cf7f953389f656bcccadb82b75 Mon Sep 17 00:00:00 2001 From: ben stear Date: Mon, 13 Nov 2023 10:11:51 -0500 Subject: [PATCH] Update data_dict.md --- petagraph/data_dict.md | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/petagraph/data_dict.md b/petagraph/data_dict.md index 5ca98d0..c0b843b 100644 --- a/petagraph/data_dict.md +++ b/petagraph/data_dict.md @@ -60,7 +60,7 @@ return * limit 1 ## Genotype-Tissue Expression Portal, eQTL data (GTEXEQTL) **Source**: The GTEx eQTL data we ingested comes from the file `GTEx_Analysis_v8_eQTL.tar` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**. -**Preproccessing**: For this first ingestion of GTEx's eQTL data, we only included eQTLs that were present in every tissue. This reduced the number of eQTLs in the dataset from 71 million to 2.1 million. Furthermore, we did not include any eQTLs that were not mapped to genes with a valid `HGNC` Code. This criteria dropped about 14% of the eQTLs. We then created eQTL nodes and attached them to their respective gene (`HGNC`), tissue (`UBERON`), genomic location `([HSCLO]([Homo Sapiens Chromosomal Ontology (HSCLO)]([https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#homo-sapiens-chromosomal-ontology-hsclo](https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#homo-sapiens-chromosomal-location-ontology-hsclo))` ),see section below) and p-value (`PVALUEBINS`) nodes. The following list of numbers was used to create the p-value bins: `[0,1e-12,1e-11,1e-10,1e-9,1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,.005,.01,.02,.03,.04,.05,.06]` +**Preproccessing**: For this first ingestion of GTEx's eQTL data, we only included eQTLs that were present in every tissue. This reduced the number of eQTLs in the dataset from 71 million to 2.1 million. 
Furthermore, we did not include any eQTLs that were not mapped to genes with a valid `HGNC` Code. This criterion dropped about 14% of the eQTLs. We then created eQTL nodes and attached them to their respective gene (`HGNC`), tissue (`UBERON`), genomic location ([HSCLO](https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#homo-sapiens-chromosomal-location-ontology-hsclo), see section below) and p-value (`PVALUEBINS`) nodes. The following list of numbers was used to create the p-value bins: `[0,1e-12,1e-11,1e-10,1e-9,1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,.005,.01,.02,.03,.04,.05,.06]` drawing @@ -84,7 +84,7 @@ return * limit 1 drawing -**Schema Description**: Two HGNC Concepts are shown along with their Codes and preferred Terms. They're connected by a `coexpressed_with` relationship. There is an `evidence_class` property on the relationship that specifies how many tissues the two genes are highly co-expressed in. The SAB for this dataset `GTEXCOEXP` is located on the `coexpressed_with` and the `inverse_coexpressed_with` +**Schema Description**: Two `HGNC` Concepts are shown along with their Codes and preferred Terms. They're connected by a `coexpressed_with` relationship. There is an `evidence_class` property on the relationship that specifies how many tissues the two genes are highly co-expressed in. The SAB for this dataset, `GTEXCOEXP`, is located on the `coexpressed_with` and `inverse_coexpressed_with` relationships. ```cypher // Cypher query to reproduce the schema figure @@ -101,7 +101,7 @@ The human to mouse orthology mapping data were also obtained in April 2023 from drawing -**Schema Description**: HGNC Concept (blue), Code (yellow) and Term (brown) from HGNC on the left and its corresponding Mouse gene Concept and Code (SAB = `HCOP`) on the right. 
The SAB for this mapping dataset is `HGNCHCOP` and is located on the SAB property of the `in_1_to_1_relationship_with` and `inverse_in_1_to_1_relationship_with` relationships. +**Schema Description**: An `HGNC` Concept (blue), Code (yellow) and Term (brown) on the left and its corresponding mouse gene Concept and Code (SAB = `HCOP`) on the right. The SAB for this mapping dataset is `HGNCHCOP` and is located on the SAB property of the `in_1_to_1_relationship_with` and `inverse_in_1_to_1_relationship_with` relationships. ```cypher @@ -121,7 +121,7 @@ These data are generated by the HPO group to use OMIM disease-gene associations drawing -**Schema Description**: On the left hand side, an HGNC Concept (blue), Code (yellow) and Term (brown) nodes are connected to an HPO Concept node through an `associated_with` relationship. The SAB for this mapping dataset is HGNCHPO and it is located on the SAB property of the `associated_with` and `inverse_associated_with` relationships. In this example we can see that the ODAD2 gene is associated with Atelectasis. +**Schema Description**: On the left-hand side, the `HGNC` Concept (blue), Code (yellow) and Term (brown) nodes are connected to an `HPO` Concept node through an `associated_with` relationship. The SAB for this mapping dataset is `HGNCHPO` and it is located on the SAB property of the `associated_with` and `inverse_associated_with` relationships. In this example we can see that the ODAD2 gene is associated with Atelectasis. ```cypher // Cypher query to reproduce the schema figure @@ -137,7 +137,7 @@ return * limit 1 drawing -**Schema Description**: On the left hand side, an MP Concept (blue), Code (yellow) and Term (brown) nodes are connected to an HCOP Concept node through an `involved_in` relationship. The HCOP Code nodes represent mouse genes. The SAB for this mapping dataset is HCOPMP and it is located on the SAB property of the `involved_in` and `inverse_involved_in` relationships. 
+**Schema Description**: On the left-hand side, the `MP` Concept (blue), Code (yellow) and Term (brown) nodes are connected to an `HCOP` Concept node through an `involved_in` relationship. The `HCOP` Code nodes represent mouse genes. The SAB for this mapping dataset is `HCOPMP` and it is located on the SAB property of the `involved_in` and `inverse_involved_in` relationships. ```cypher // Cypher query to reproduce the schema figure @@ -153,7 +153,7 @@ return * limit 1 drawing -**Schema Description**: A Concept (blue), Code (yellow) and Term (brown) from MP on the left and its corresponding HPO Concept, Code and Term on the right. They are connected through an `is_approximately_equivalent_to` relationship. The SAB for this mappings, HPOMP, can be found on the SAB property on the bidirectional relationships. +**Schema Description**: A Concept (blue), Code (yellow) and Term (brown) from `MP` on the left and its corresponding `HPO` Concept, Code and Term on the right. They are connected through an `is_approximately_equivalent_to` relationship. The SAB for these mappings, `HPOMP`, can be found on the SAB property of the bidirectional relationships. ```cypher // Cypher query to reproduce the schema figure @@ -165,11 +165,11 @@ return * limit 1 ## Human-Rat ENSEMBL orthologs (RATHCOP) **Source**: The source of the human ENSEMBL to rat ENSEMBL orthologs is the HGNC Comparisons of Orthology Predictions tool. Go to https://www.genenames.org/tools/hcop/, scroll to the Bulk Downloads section at bottom of the page, select `Rat` in the first drop down menu and `15 columns` and download the data. -**Preproccessing**: No preprocessing was needed on these mappings, we simply selected the `human_ensembl_gene` and `rat_ensembl_gene` columns. +**Preproccessing**: No preprocessing was needed on these mappings; we simply selected the `human_ensembl_gene` and `rat_ensembl_gene` columns from the dataset. drawing -**Schema Description**: ... 
+**Schema Description**: A human ENSEMBL Concept and Code are shown on the left, and its orthologous rat ENSEMBL Concept and Code are shown on the right. The Concepts are connected by `has_human_ortholog` and `inverse_has_human_ortholog` relationships. The `RATHCOP` SAB is located on the SAB property of both Concept-Concept relationships. ```cypher // Cypher query to reproduce the schema figure @@ -181,9 +181,9 @@ return * limit 1 --- ## Homo Sapiens Chromosomal Location Ontology (HSCLO) -**Source**: The Homo Sapiens Chromosomal Location Ontology (HSCLO) was created by Taha Ahooyi Mohseni of the Petagraph team. HSCLO was primarily created to connect 4DN loop coordinates to the rest of the graph through the mapping between HSCLO and GENCODE. HSCLO was later utilized to connect GTEXEQTL locations in the graph as searchable nodes at 1kbp resolution. +**Source**: The Homo Sapiens Chromosomal Location Ontology (HSCLO) was created by Taha Ahooyi Mohseni of the Petagraph team. HSCLO was primarily created to connect 4DN loop coordinates to the rest of the graph through the mapping between HSCLO and GENCODE. HSCLO was later utilized to connect `GTEXEQTL` locations in the graph as searchable nodes at 1 kbp resolution. -**Preproccessing**: The dataset relationships as well as nodes use HSCLO as their SAB. HSCLO nodes are defined at 5 resolution levels; chromosomes, 1 Mbp, 100 kbp, 10 kbp and 1kbp with each level connecting to lower levels with `above_(resolution level)_band` (e.g. "above_1Mbp_band", "above 1_kbp_band") and nodes at the same resolution level are connected through `precedes_(resolution level)_band` (e.g. "precedes_10kbp_band"). The dataset contains 3,431,155 nodes and 6,862,195 relationships. +**Preproccessing**: The dataset relationships as well as Code nodes use `HSCLO` as their SAB. `HSCLO` nodes are defined at 5 resolution levels: chromosomes, 1 Mbp, 100 kbp, 10 kbp and 1 kbp, with each level connecting to lower levels with `above_(resolution level)_band` (e.g. 
`above_1Mbp_band`, `above_1kbp_band`) and nodes at the same resolution level are connected through `precedes_(resolution level)_band` (e.g. `precedes_10kbp_band`). The dataset contains 3,431,155 nodes and 6,862,195 relationships. drawing @@ -226,6 +226,7 @@ return * limit 1 drawing **Schema Description**: ... + ```cypher // Cypher query to reproduce the schema figure match (a:Code {SAB:'CHEBI'})-[r0:CODE]-(b:Concept)-[r1 {SAB:'LINCS'}]-(c:Concept)-[r2:CODE]-(d:Code {SAB:'HGNC'}) @@ -236,7 +237,7 @@ return * limit 1 ## Connectivity Map (CMAP) **Source**: Signature perturbations of gene expression profiles as induced by chemical (small molecule) were obtained from the Ma’ayan Lab Harmonizome portal at [https://maayanlab.cloud/Harmonizome/dataset/CMAP+Signatures+of+Differentially+Expressed+Genes+for+Small+Molecules](https://maayanlab.cloud/Harmonizome/dataset/CMAP+Signatures+of+Differentially+Expressed+Genes+for+Small+Molecules) -**Preproccessing**: In a similar manner to L1000 data integration discussed above, we obtained the edge lists of the CMAP Signatures of Differentially Expressed Genes for Small Molecules dataset from the Harmonizome database :https://maayanlab.cloud, (Lamb et al. 2006; Rouillard et al. 2016). The data was computed based on an earlier study (Lamb et al. 2006; Rouillard et al. 2016). The dataset added 2,625,336 new relationships (including reverse relationships) connecting the Petagraph `CHEBI` and `HGNC` nodes with types types of “negatively_correlated_with_gene”, “positively_correlated_with_gene”, “inverse_negatively_correlated_with_gene” and “inverse_positively_correlated_with_gene” and SAB of “CMAP”. +**Preproccessing**: In a similar manner to the L1000 data integration discussed above, we obtained the edge lists of the CMAP Signatures of Differentially Expressed Genes for Small Molecules dataset from the Harmonizome database, https://maayanlab.cloud (Lamb et al. 2006; Rouillard et al. 2016). 
The data was computed based on an earlier study (Lamb et al. 2006; Rouillard et al. 2016). The dataset added 2,625,336 new relationships (including reverse relationships) connecting the Petagraph `CHEBI` and `HGNC` nodes with types of `negatively_correlated_with_gene`, `positively_correlated_with_gene`, `inverse_negatively_correlated_with_gene` and `inverse_positively_correlated_with_gene` and SAB of `CMAP`. drawing @@ -252,11 +253,11 @@ return * limit 1 ## Molecular Signatures Database (MSIGDB)  **Sourcee**: MSigDB v7.4 datasets C1, C2, C3, C8 and H were obtained from the MSigDB molecular signature database. MSigDB is a collection of gene set resources, curated or collected from several different sources and can be accessed at [https://www.gsea-msigdb.org/gsea/msigdb/](https://www.gsea-msigdb.org/gsea/msigdb/). Five subsets of MSigDB v7.4 datasets were introduced as entity-gene relationships to the knowledge graph: C1 (positional gene sets), C2 (curated gene sets), C3 (regulatory target gene sets), C8 (cell type signature gene sets) and H (hallmark gene sets). -**Preproccessing**: With these five datasets, we created MSIGDB Concept nodes for 31,516 MSigDB systematic names. The relationships between these Concept nodes and HGNC nodes were established considering the information available in each of the mentioned 5 subsets where the subset information was included in the relationship SABs as “MSIGDB”. The Term names were also compiled according to the MSigDB generic entity names. Collectively, the five MSIGDB datasets added 2,598,060 Concept-Concept relationships to Petagraph. The relationship types used to map MSigDB Concept nodes to HGNC Concept nodes are: `MSigDB C1`, `MSigDB C2`, `MSigDB C3`, `MSigDB C8`, `MSigDB H`. The MSIGDB Codes SAB property is: `MSigDB_Systematic_Name`. +**Preproccessing**: With these five datasets, we created `MSIGDB` Concept nodes for 31,516 MSigDB systematic names. 
The relationships between these Concept nodes and HGNC nodes were established considering the information available in each of the mentioned 5 subsets where the subset information was included in the relationship SABs as `MSIGDB`. The Term names were also compiled according to the MSigDB generic entity names. Collectively, the five `MSIGDB` datasets added 2,598,060 Concept-Concept relationships to Petagraph. The relationship types used to map MSigDB Concept nodes to `HGNC` Concept nodes are: `MSigDB C1`, `MSigDB C2`, `MSigDB C3`, `MSigDB C8`, `MSigDB H`. drawing -**Schema Description**: An `HGNC` Concept, Code and Term node are connected to an `HGNC` Concept, Code and Term node on the right. The `MSIGDB` SAB is located on the SAB property for both sets of relationships +**Schema Description**: An `HGNC` Concept, Code and Term node are connected to an `HGNC` Concept, Code and Term node on the right. The `MSIGDB` SAB is located on the SAB property for both sets of relationships. ```cypher // Cypher query to reproduce the schema figure @@ -268,7 +269,7 @@ return * limit 1 ## ClinVar (CLINVAR) **Source**: Human genetic variant-disease associations were obtained from: [https://www.ncbi.nlm.nih.gov/clinvar/](https://www.ncbi.nlm.nih.gov/clinvar/). Only associations with Pathogenic or Likely Pathogenic consequence scores were included in the graph. We also did not include variants that affect a subset of genes (where there was no one-to-one relationship between a gene and phenotype/disease). -**Preproccessing**: The ClinVar human genetic variants-phenotype submission summary dataset (2023-01-05) was utilized to define relationships between human genes and phenotypes (Landrum et al. 2018). To retrieve the target phenotype/disease we used MEDGEN IDs listed in the ClinVar dataset (also already present in Petagraph). 
The `CLINVAR` variant-disease mappings gave rise to 214,040 new relationships (with the following characteristics [Type: “gene_associated_with_disease_or_phenotype”, SAB: “CLINVAR”] and [type: inverse_gene_associated_with_disease_or_phenotype, SAB: “CLINVAR”] connecting HGNC and MEDGEN, MONDO, HPO, EFO and MESH Concept nodes. +**Preproccessing**: The ClinVar human genetic variants-phenotype submission summary dataset (2023-01-05) was utilized to define relationships between human genes and phenotypes (Landrum et al. 2018). To retrieve the target phenotype/disease we used MEDGEN IDs listed in the ClinVar dataset (also already present in Petagraph). The `CLINVAR` variant-disease mappings gave rise to 214,040 new relationships (with the following characteristics: [Type: `gene_associated_with_disease_or_phenotype`, SAB: `CLINVAR`] and [Type: `inverse_gene_associated_with_disease_or_phenotype`, SAB: `CLINVAR`]) connecting `HGNC` and `MEDGEN`, `MONDO`, `HPO`, `EFO` and `MESH` Concept nodes. drawing