This repository contains the code required to reproduce the analysis presented in the manuscript "Tissue-specific apparent mtDNA heteroplasmy and its relationship with ageing and mtDNA gene expression" by Wengert et al., which is currently under revision. You can access our preprint here.
Below, you can find a conceptual overview of the project and the analyses conducted for this study:
The scripts in this repository work as part of a larger analysis pipeline, but may require minor adjustments to run on different systems. We also do not provide individual level data (either raw data or processed data as part of our project) in any part of our repository, these would have to be obtained through gaining approved access with GTEx. As such, scripts provided here are not intended to run as is but for demonstration of our analyses.
- We obtain reads mapping to the rCRS mtDNA reference genome NC_012920 (which is part of the human hg38 reference genome) using samtools, using flag -F 3852 to select only the reads that do not fall under the following categories: read unmapped (0x4), mate unmapped (0x8), not primary alignment (0x100), read fails platform/vendor quality checks (0x200), read is PCR or optical duplicate (0x400), supplementary alignment (0x800).
samtools view -b -F 3852 $file.bam -o $file.mtdna.bam chrM
samtools index $file.mtdna.bam
- We perform variant calling using mtDNA-server for each sample individually, then combine the variant calls to form our variant call set, using the rCRS.fasta file as reference.
reffile="rCRS.fasta"
java -jar mutserve-1.3.0.jar analyse-local --input $file.mtdna.bam --reference $reffile --level 0.01 --output $file.mtdna.var.txt
We perform the following steps for apparent heteroplasmy variant filtering and processing, all scripts for these steps are in the scripts/preprocessing
directory
- concatenating and preprocessing of raw variant call files from mtDNA-server, as shown in
1_variant_files_preprocessing.R
- identifying the likely inherited allele using the homoplasmic alleles in each individual at each apparent heteroplasmic position, using variant calls from WGS in Whole Blood, as shown in
2_wgs_alleles.R
- applying quality control filters on apparent heteroplasmy calls, as shown in
3_variant_filters.R
- apply cohort filters to identify common apparent heteroplasmies with adequate variance between individuals for association testing, as shown in
4_cohort_filters.R
- add in cohort annotations, as shown in
5_cohort_annotations.R
- prepare apparent heteorplasmy genotype files for association testing, as shown in
6_prep_het_genotype_file.R
- identify cell types with high median xCell scores in 7 cell types in 35 GTEx tissues for use in celltype interaction analyses, as shown in
7_celltype_proportions.R
We perform the following explorations for the landscape of mtDNA heteroplasmy and mtRNA modifications in 49 tissues in GTEx v8, as shown in the scripts/landscape
directory
- identification of numbers and median variant allele frequencies of mtDNA heteroplasmy and mtRNA modifications across donors in each tissue
- identification of transitions and transversions in mtDNA heteroplasmy
- identification of putative inherited/somatic nature mtDNA heteroplasmy through looking into the number of tissues they occur in as described in An et al Nat Genet 2024
- obtaining associations between putative inherited/somatic nature mtDNA heteroplasmy and median mtDNA copy number (mtDNA-CN) as previously established in Rath et al PNAS 2024
- checking for tissue occurrence of pathogenic mtDNA heteroplasmy at mt.3243A>G as reviewed in Gomes et al Hum Mol Genet 2021, previously associated with various clinical phenotypes ranging from severe MELAS to mild deafness and glucose intolerance
- enrichment testing of occurrence of mtRNA modifications in CNS tissues
1-5 are detailed in landscape_explorations.R
and 6 is detailed in cns_enrichment.R
We obtain realistic relationships between donor age, RNAseq coverage on mtDNA genes, and apparent heteroplasmy using data from GTEx v8, and use these parameters to simulate a realistic null situation where apparent heteroplasmy is related to RNAseq coverage but not donor age. We then use this simulated data to test if a linear model (LM) or a beta-binomial model (BB) has well-calibrated false discovery rates (FDR). Scripts for this are in the scripts/modeltesting
directory
- obtain realistic parameters in
get_realistic_params.R
- simulations in
simulations.ipynb
- testing LM and BB models on the simulated data in
test_simulations.R
From the evaluation of association testing models we determined that the BB model is better calibrated than the LM especially when RNAseq coverage is low and heteroskedasticity of heteorplasmy is high. However BB takes a much longer time than LM, and even more so with permutations that can allow us to derive empirical p values. As such in our paper we detail our recommendation for a two-step approach where we perform LM first then BB only on the significant associations. We use this two-step approach in the paper to examine the effect of donor age on apparent heteroplasmy VAF. Scripts for this are in the scripts/associations
directory
- LM test for donor age associations with VAF in
donorage_lmtest.R
- BB test functions for donor age associations with VAF in
functions_bb_empirical.R
- BB test for donor age associtions with VAF in
donorage_bbtest.R
- BB permutations and derivation of empirical p values are in
donorage_bbpermute.R
anddonorage_bbempirical.R
We also investigate the interaction effect of donor age and cell type xCell scores in 7 cell types in 35 GTEx tissues. This is shown in donorage_celltype_interactions.R
We further use the same two-step approach in the paper to examine the effect of apparent heteroplasmy VAF on mtDNA gene expression, in polyA mtDNA genes (all protein coding genes except for MT-ND6). For all mtDNA polyA gene expressions we use residuals of their log(TPM+1) gene expression levels after correcting for tissue-specific PEER factors, as shown in get_geneexp_residuals.R
. Scripts for this are in the scripts/eqtl
directory
- LM test for VAF eqtls in
eqtl_lmtest.R
- BB test for VAF eqtls in
eqtl_bbtest.R
- BB permutations and derivation of empirical p values are in
eqtl_bbpermute.R
andeqtl_bbempirical.R
We also investigate the interaction effect of cell type xCell scores on VAF eqtls in 7 cell types in 35 GTEx tissues. This is shown in eqtl_celltype_interactions.R
We then perform two analyses to test the relationships between apparent heteroplasmy, mtDNA gene expression and donor age for 10 instances where the apparent heteroplasmy eqtl also has a donor age association. Scripts for both analyses are in scripts/mediation
- Mediation analysis using partial correlations in
mediation_analysis.R
- interaction effects between donor age and mtDNA gene expression on apparent heteroplasmy VAF in
donorage_eqtl_interaction.R
Finally we perform a few analyses to test whether p9 mt-tRNA modifications affect the gene expression levels of the mtDNA genes on their 5' ends, as previously proposed in many studies including Meynier et al Nat Comms 2024, and tested implicitly in Ali et al Comms Bio 2020. Scripts for these analyses are in scripts/transcript
- Replication of results shown in Ali et al Comms Bio 2020 in Whole Blood, then performed in all other tissues in GTEx v8, in
p9_5prime_replication.R
- Same analysis done only using modified mt-tRNA reads per sample, asking if a 5' cut on the same reads are associated with 5' gene expression, in
p9_5prime_cut.R
- Read level data per sample across all tissues at p9 mt-tRNA modifications are obtained using pysam as shown in
p9_count_cutreads.py
and annotated summarised inp9_read_level_data.R
We are grateful for any feedback or questions about the analysis code!
-
Code-Related Questions:
If you have questions or encounter issues with the code, please submit an issue viagithub
. -
Scientific Correspondence:
For scientific correspondence please reach out directly to:- Simon Wengert: [email protected]
- Dr. Na Cai: [email protected]
This project is licensed under the MIT License. However, the authors respectfully request that it be used only for non-commercial purposes, unless prior written consent is obtained.
- Non-commercial use includes academic research, educational projects, personal experimentation, and open collaboration.
- Commercial use includes, but is not limited to:
- Use within a for-profit entity (e.g., private companies, corporate labs).
- Redistribution of the software for monetary gain.
- Embedding the software in a product or service that is sold or licensed.
If you wish to use this software for commercial purposes, please contact the authors to discuss licensing terms.