diff --git a/petagraph/data_dict.md b/petagraph/data_dict.md
index 961e2f2..fb99754 100644
--- a/petagraph/data_dict.md
+++ b/petagraph/data_dict.md
@@ -37,9 +37,9 @@ For clarity, all schema figures in this document follow this node color format:
 [4D Nucleome Program (4DN)](https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#4d-nucleome-program-4dn)
 ## Genotype-Tissue Expression Portal, Expression data (GTEXEXP)
-**Source**: Median transcript per million (TPM) expression levels were ingested from the file `GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**.
+**Source**: Median transcripts per million (TPM) expression levels were ingested from the file `GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**. This gene expression dataset contains expression profiles for 54 tissues and 56,200 transcripts.
-**Preproccessing**: No preprocessing was done on the median TPM dataset. We only filtered for median TPM expression levels that corresponded to an ENSEMBL gene id that could be mapped back to an `HGNC` Code.
+**Preprocessing**: The median TPM dataset was not preprocessed beyond one filter: we kept only median TPM expression levels whose ENSEMBL gene ID could be mapped back to an `HGNC` Code. The gene expression nodes are connected to their corresponding tissue node, gene node, and expression bin node.
 drawing
@@ -56,7 +56,7 @@ return * limit 1
 ---
 ## Genotype-Tissue Expression Portal, eQTL data (GTEXEQTL)
-**Source**: The GTEx eQTL data we ingested comes from the file `GTEx_Analysis_v8_eQTL.tar` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**.
+**Source**: The GTEx eQTL data we ingested comes from the file `GTEx_Analysis_v8_eQTL.tar` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**. The eQTL dataset contains over 71 million eQTLs from 49 tissues.
 **Preproccessing**: For this first ingestion of GTEx's eQTL data, we only included eQTLs that were present in every tissue. This reduced the number of eQTLs in the dataset from 71 million to 2.1 million. Furthermore, we did not include any eQTLs that were not mapped to genes with a valid `HGNC` Code. This criteria dropped about 14% of the eQTLs. We then created eQTL nodes and attached them to their respective gene (`HGNC`), tissue (`UBERON`), genomic location ([HSCLO]([Homo Sapiens Chromosomal Ontology (HSCLO)]([https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#homo-sapiens-chromosomal-ontology-hsclo](https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#homo-sapiens-chromosomal-location-ontology-hsclo)) ),see section below) and p-value (`PVALUEBINS`) nodes. The following list of numbers was used to create the p-value bins: `[0,1e-12,1e-11,1e-10,1e-9,1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,.005,.01,.02,.03,.04,.05,.06]`
@@ -91,7 +91,7 @@ return * limit 1
 ```
 ---
-## Human-Mouse Orthologs (HGNCHCOP)
+## Human-Mouse Ortholog mappings (HGNCHCOP)
 **Source**: Mouse genes were downloaded from HGNC Comparisons of Orthology Predictions (HCOP) [https://www.genenames.org/tools/hcop/](https://www.genenames.org/tools/hcop/) (scroll to the bottom, under Bulk Downloads. Select Human - Mouse ortholog data)
 The human to mouse orthology mapping data were also obtained in April 2023 from the HGNC HCOP tool.
@@ -109,10 +109,9 @@ return * limit 1
 ```
 ---
-## Human gene-phenotype (HGNCHPO)
+## Human gene-phenotype mappings (HGNCHPO)
 **Source**:
-We use the Human Phenotype (HPO) Ontology mappings for `genes_to_phenotype.txt` and `phenotype_to_genes.txt`. The HPO annotations can be found here: [https://hpo.jax.org/app/data/annotations](https://hpo.jax.org/app/data/annotations).
-These data are generated by the HPO group to use OMIM disease-gene associations to map all HPO phenotypes to genes with those phenotypes associated with diseases. Therefore, a gene can be associated with several phenotypes, and a phenotype can be associated with several genes.
+We use the Human Phenotype Ontology (HPO) mappings for `genes_to_phenotype.txt` and `phenotype_to_genes.txt`. The HPO annotations can be found here: [https://hpo.jax.org/app/data/annotations](https://hpo.jax.org/app/data/annotations). These data are generated by the HPO group using OMIM disease-gene associations to map HPO phenotypes to genes. These data contain 4,545 genes mapped to at least one phenotype and 10,896 phenotypes mapped to at least one gene.
 **Preproccessing**: This data did not need any preprocessing.