Commit

Update data_dict.md
benstear authored Nov 24, 2023
1 parent 309fb7a commit 1cb3bea
Showing 1 changed file with 6 additions and 7 deletions.
13 changes: 6 additions & 7 deletions petagraph/data_dict.md
@@ -37,9 +37,9 @@ For clarity, all schema figures in this document follow this node color format:
[4D Nucleome Program (4DN)](https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#4d-nucleome-program-4dn)

## Genotype-Tissue Expression Portal, Expression data (GTEXEXP)
-**Source**: Median transcript per million (TPM) expression levels were ingested from the file `GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**.
+**Source**: Median transcript per million (TPM) expression levels were ingested from the file `GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**. This gene expression dataset contains expression profiles for 54 tissues and 56,200 transcripts.

-**Preprocessing**: No preprocessing was done on the median TPM dataset. We only filtered for median TPM expression levels that corresponded to an ENSEMBL gene id that could be mapped back to an `HGNC` Code.
+**Preprocessing**: No preprocessing was done on the median TPM dataset. We only filtered for median TPM expression levels that corresponded to an ENSEMBL gene id that could be mapped back to an `HGNC` Code. The gene expression nodes are connected to their corresponding tissue node, gene node and expression bin node.

<img src="https://github.com/TaylorResearchLab/Petagraph/blob/main/figures/publication_figures/schema_figures/gtex_exp.png" alt="drawing" width="800"/>

@@ -56,7 +56,7 @@ return * limit 1

---
## Genotype-Tissue Expression Portal, eQTL data (GTEXEQTL)
-**Source**: The GTEx eQTL data we ingested comes from the file `GTEx_Analysis_v8_eQTL.tar` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**.
+**Source**: The GTEx eQTL data we ingested comes from the file `GTEx_Analysis_v8_eQTL.tar` located on the GTEx Portal website at **[https://gtexportal.org/home/datasets](https://gtexportal.org/home/datasets)**. The eQTL dataset contains over 71 million eQTLs from 49 tissues.

**Preprocessing**: For this first ingestion of GTEx's eQTL data, we only included eQTLs that were present in every tissue. This reduced the number of eQTLs in the dataset from 71 million to 2.1 million. Furthermore, we did not include any eQTLs that were not mapped to genes with a valid `HGNC` Code. This criterion dropped about 14% of the eQTLs. We then created eQTL nodes and attached them to their respective gene (`HGNC`), tissue (`UBERON`), genomic location ([HSCLO](https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md#homo-sapiens-chromosomal-location-ontology-hsclo), see section below) and p-value (`PVALUEBINS`) nodes. The following list of numbers was used to create the p-value bins: `[0,1e-12,1e-11,1e-10,1e-9,1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,.005,.01,.02,.03,.04,.05,.06]`
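
As an illustration only, here is a minimal Python sketch of how a p-value could be assigned to one of the bins defined by the edges listed above. The function name `pvalue_bin` and the half-open `[lower, upper)` convention are assumptions made for this example; the data dictionary does not state which side of each bin is closed, and this is not necessarily how the ingestion code works.

```python
from bisect import bisect_right

# Bin edges copied from the list in the paragraph above.
PVALUE_BIN_EDGES = [0, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5,
                    1e-4, 1e-3, .005, .01, .02, .03, .04, .05, .06]


def pvalue_bin(p):
    """Return the (lower, upper) edges of the bin that contains p.

    Assumption: bins are half-open, [lower, upper). Values outside
    [0, 0.06) are rejected rather than silently clamped.
    """
    if not 0 <= p < PVALUE_BIN_EDGES[-1]:
        raise ValueError(f"p-value {p} is outside the binned range [0, 0.06)")
    # bisect_right returns the index of the first edge strictly greater than p,
    # so the edge just before it is the lower bound of the bin.
    i = bisect_right(PVALUE_BIN_EDGES, p)
    return PVALUE_BIN_EDGES[i - 1], PVALUE_BIN_EDGES[i]
```

Under these assumptions, `pvalue_bin(3e-9)` falls in the bin bounded by `1e-9` and `1e-8`, and a value such as `0.5` raises a `ValueError` because it lies beyond the last edge.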

@@ -91,7 +91,7 @@ return * limit 1
```

---
-## Human-Mouse Orthologs (HGNCHCOP)
+## Human-Mouse Ortholog mappings (HGNCHCOP)
**Source**: Mouse genes were downloaded from HGNC Comparisons of Orthology Predictions (HCOP) [https://www.genenames.org/tools/hcop/](https://www.genenames.org/tools/hcop/) (scroll to the bottom, under Bulk Downloads. Select Human - Mouse ortholog data)
The human to mouse orthology mapping data were also obtained in April 2023 from the HGNC HCOP tool.

@@ -109,10 +109,9 @@ return * limit 1
```

---
-## Human gene-phenotype (HGNCHPO)
+## Human gene-phenotype mappings (HGNCHPO)
**Source**:
-We use the Human Phenotype (HPO) Ontology mappings for `genes_to_phenotype.txt` and `phenotype_to_genes.txt`. The HPO annotations can be found here: [https://hpo.jax.org/app/data/annotations](https://hpo.jax.org/app/data/annotations).
-These data are generated by the HPO group to use OMIM disease-gene associations to map all HPO phenotypes to genes with those phenotypes associated with diseases. Therefore, a gene can be associated with several phenotypes, and a phenotype can be associated with several genes.
+We use the Human Phenotype (HPO) Ontology mappings for `genes_to_phenotype.txt` and `phenotype_to_genes.txt`. The HPO annotations can be found here: [https://hpo.jax.org/app/data/annotations](https://hpo.jax.org/app/data/annotations). These data are generated by the HPO group using OMIM disease-gene associations to map HPO phenotypes to genes. These data contain 4,545 genes mapped to at least one phenotype and 10,896 phenotypes mapped to at least one gene.

**Preprocessing**: These data did not need any preprocessing.
