SNP Genotype Imputation Using 1000G hg38 Reference
If you have any questions or issues while running the pipeline, feel free to contact Drew Neavin (d.neavin @ garvan.org.au). Similarly, if there are portions of the pipeline that you would like to update, please feel free to fork the repository and make the changes, and we can implement them.
We have put together instructions for processing and imputing SNP genotype data using eagle for phasing and minimac4 for imputation with the 1000G hg38 reference.
This table illustrates the input that the pipeline requires to run and whether it is provided or needs to be prepared and provided by the user.
Input | Category | User-provided | Developer-provided |
---|---|---|---|
Genotyped SNPs plink2 pfiles (on hg19) | Study Data | ✔️ | ✖️ (Example dataset provided) |
1000G hg38 imputation reference | Reference | ✖️ | ✔️ |
Singularity image | Software | ✖️ | ✔️ |
Snakemake | Software | ✖️ | ✔️ |
scipy | Software | ✖️ | ✔️ |
We will be using the 1000G hg38 reference to impute the data and have prepared the reference.
The reference data is already downloaded on the Garvan HPC for general use. You can find it here:
/directflow/SCCGGroupShare/projects/data/reference_data/1000G_hg38_imputation_ref/hg38
If, for some reason, you are working on another cluster or that data has been moved, you can download the reference here:
wget https://www.dropbox.com/s/l60a2r3e4vo78mn/eQTLGenImpRef.tar.gz
wget https://www.dropbox.com/s/eci808v0uepqgcz/eQTLGenImpRef.tar.gz.md5
After downloading the reference, it is best to make sure the md5sum of the eQTLGenImpRef.tar.gz file matches the md5sum in eQTLGenImpRef.tar.gz.md5:
md5sum eQTLGenImpRef.tar.gz > downloaded_eQTLGenImpRef.tar.gz.md5
diff -s eQTLGenImpRef.tar.gz.md5 downloaded_eQTLGenImpRef.tar.gz.md5
which should return:
Files eQTLGenImpRef.tar.gz.md5 and downloaded_eQTLGenImpRef.tar.gz.md5 are identical
If you get anything else, the download was probably incomplete and you should try to download the file again. Then, unpack the contents of the file:
tar xvzf eQTLGenImpRef.tar.gz
Some HPCs limit the amount of time that a command can run on a head node, causing it to stop or fail partway through, so it is best to untar the reference using a submission script.
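For example, a minimal SGE submission script for the untar step could look like the sketch below (the queue name, memory request and job name are placeholders for this example; adjust them for your scheduler):
#!/bin/bash
#$ -S /bin/bash
#$ -q short.q
#$ -N untar_reference
#$ -l mem_requested=4G
#$ -cwd
# Unpack the 1000G hg38 imputation reference into the submission directory
tar xvzf eQTLGenImpRef.tar.gz
Submit it with qsub untar_reference.sh (or the equivalent command for your job scheduler).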
Now you should have the references that are needed to impute the SNP genotype data. You will have the following directory structure:
hg38
├── imputation
├── phasing
│ ├── genetic_map
│ └── phasing_reference
├── ref_genome_QC
└── ref_panel_QC
We have provided a test dataset that can be used to test the pipeline; it is built into the singularity image (below) and will be used for the example that follows.
For your own dataset, you will need to make sure you have all the following files in the correct formats. You can check the test dataset for an example.
Your SNP genotype data will need to be supplied in plink2 format, which includes three files: data.pgen, data.psam and data.pvar.
The chromosome encoding in the data.pvar file must not use 'chr'. For example, chromosome 1 would be encoded as '1', not 'chr1'. The pipeline will check for this before running and will not run if it finds 'chr' chromosome encoding.
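If you want to check (and, if needed, fix) the encoding yourself, something along these lines works; data.pvar stands in for your own pvar file and the second command is just one illustrative way to strip the prefix:
# Count variant lines whose chromosome column starts with 'chr' (header lines in a .pvar start with '#')
grep -v '^#' data.pvar | awk '$1 ~ /^chr/' | wc -l
# If the count is non-zero, remove the 'chr' prefix from the chromosome column and write a new file
awk 'BEGIN{OFS="\t"} /^#/ {print; next} {sub(/^chr/, "", $1); print}' data.pvar > data_nochr.pvar
If you rename the fixed file back to data.pvar, keep a backup of the original; the matching .pgen and .psam files do not need to change.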
The data.psam file needs to have some specific columns since it will be important for:
- Comparing reported sexes with SNP-genotype predicted sexes
- Comparing reported ancestries with 1000 Genomes-projected ancestry predictions
The psam should look like this (and requires these headings):
#FID | IID | PAT | MAT | SEX | Provided_Ancestry |
---|---|---|---|---|---|
113 | 113 | 0 | 0 | 1 | EUR |
349 | 350 | 0 | 0 | 1 | EUR |
352 | 353 | 0 | 0 | 2 | EUR |
39 | 39 | 0 | 0 | 2 | EUR |
40 | 40 | 0 | 0 | 2 | EUR |
41 | 41 | 0 | 0 | 1 | EUR |
42 | 42 | 0 | 0 | 2 | EUR |
... | ... | ... | ... | ... | ... |
Key for column contents:
- #FID: Family ID
- IID: Within-family ID
- PAT: Within-family ID of father ('0' if father isn't in dataset)
- MAT: Within-family ID of mother ('0' if mother isn't in dataset)
- SEX: Sex code ('1' = male, '2' = female, '0' = unknown)
- Provided_Ancestry: reported ancestry ('AFR' = African, 'AMR' = Ad Mixed American, 'EAS' = East Asian, 'EUR' = European, 'SAS' = South Asian). If you don't know, you can fill this with any character string.
You can include as many additional metadata columns as you like in the psam file but the first 6 columns must be those listed above.
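As an optional sanity check (a sketch only, assuming a tab-separated data.psam), you can confirm that the first six header columns match the ones listed above:
# Print the first six column names of the psam header; they should be:
# #FID  IID  PAT  MAT  SEX  Provided_Ancestry
head -n 1 data.psam | cut -f 1-6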
You will also need the Singularity image (which has all the software needed for the pipeline installed, for consistency across different systems). Most HPCs have Singularity installed so you should be able to run Singularity images without any issue. If you are unsure, try running singularity --help. If Singularity is not installed on your HPC, reach out to your system administrators and see if they can install it. Otherwise, feel free to reach out to us and we can find another solution.
- Download the singularity image (~6Gb) with the following command:
wget https://www.dropbox.com/s/mwjpndpclp6njcg/SNP_imputation_1000g_hg38.sif
wget https://www.dropbox.com/s/oiw1y6fi7ouboqp/SNP_imputation_1000g_hg38.sif.md5
Again, please make sure that the image is not corrupted by comparing the md5sums:
md5sum SNP_imputation_1000g_hg38.sif > downloaded_SNP_imputation_1000g_hg38.sif.md5
diff -s SNP_imputation_1000g_hg38.sif.md5 downloaded_SNP_imputation_1000g_hg38.sif.md5
- Next, please set up the pipeline with the following command; you will need to provide an absolute directory path to allow singularity to find the correct directories - this must be a directory at or above your current working directory, or the current working directory itself:
singularity run --bind <absolute_directory_path> --app setup SNP_imputation_1000g_hg38.sif .
The pipeline expects certain files pulled from the image to be in the same directory as the singularity image, so you will have to rerun the setup steps if you move the image.
If you want to test the pipeline on your system, you can use the test dataset we have provided in the image. Run the following to get the test dataset from the image into your local directory:
singularity run --bind <absolute_directory_path> --app test_dataset SNP_imputation_1000g_hg38.sif .
tar -xzf ImputationTestDataset_plink.tar.gz
You will also need Snakemake and scipy to run the pipeline. We highly recommend you use the conda environment that we have prepared with all the requirements. The conda environment yaml (snakemake.yaml) should have been copied into your local directory when you set up the singularity image above.
To make a conda environment from this yaml, you can run:
conda env create -f snakemake.yaml -n wg1_snakemake
Then, to activate this environment, run:
conda activate wg1_snakemake
If you would prefer to install snakemake and scipy yourself, you can follow the instructions for installing Snakemake and then install scipy with pip install scipy
Here's an outline of the general steps that will be taken by this pipeline:
- The first step is to copy the PreImputation.yaml from the pipeline directory to a working directory and edit it based on your system, directories and setup. The contents of this file are used to pass arguments to the pipeline.
- There are only five parameters that need user input:
- ref_dir: the directory to the 1000G hg38 reference as indicated in the Reference section
- singularity_image: path to the singularity image
- plink_dir: directory containing your SNP genotype plink2 files
- bind_paths: the paths to bind for the singularity image. Singularity needs these to mount all the files and directories under each path so it can find and call required files. You can use more than one path; just separate them with a comma.
- output_dir: the directory where you want to write all the files to. Make sure this directory exists before running snakemake.
The other parameters are used for specific rules to indicate memory or threads for each command. Hopefully, you shouldn't need to edit any of these but if you are running out of memory for a specific rule, you can change them in this file.
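For orientation, the user-edited portion of PreImputation.yaml might end up looking something like this sketch (all paths are placeholders; start from the copy shipped with the pipeline and leave the remaining, rule-specific keys at their defaults):
ref_dir: /path/to/1000G_hg38_imputation_ref/hg38
singularity_image: /path/to/SNP_imputation_1000g_hg38.sif
plink_dir: /path/to/your/plink2/pfiles
bind_paths: /path/to/bind1,/path/to/bind2
output_dir: /path/to/imputation/output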
- Before running Snakemake, let's define some variables that will make running it easier and cleaner.
- Let's define a variable, $IMPUTATION_SNAKEFILE, which is the location of the Snakefile in the cloned github repository. Change the path based on where the file is on your system:
IMPUTATION_SNAKEFILE=/path/to/WG1-pipeline-QC/Snakefile
- Let's also define a variable for the location of the edited configuration yaml. Change the path based on where the file is on your system:
IMPUTATION_CONFIG=/path/to/edited/PreImputation.yaml
- Finally, let's define a location where we would like to keep the cluster log outputs, and make that directory. Change the path based on where you want log files to be written:
LOG=/path/to/cluster/log_dir
mkdir -p $LOG
If you log out of the cluster and log back in to run more jobs, you will have to redefine each of those variables. We recommend keeping the commands in a file that can easily be sourced to define each variable, as in the sketch below.
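For example (a sketch; the file name and paths are placeholders), you could keep the definitions in a small file such as imputation_env.sh:
# imputation_env.sh - variables needed to run the imputation pipeline
IMPUTATION_SNAKEFILE=/path/to/WG1-pipeline-QC/Snakefile
IMPUTATION_CONFIG=/path/to/edited/PreImputation.yaml
LOG=/path/to/cluster/log_dir
mkdir -p $LOG
Then, after logging back in, redefine everything with source imputation_env.sh.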
Now we have all the pieces that we need to run the pipeline. This Snakemake pipeline is built to run all the SNP genotype imputation pre-processing steps that are necessary. However, there is one step that needs user input on potential discrepancies between recorded sexes and ancestries compared to the SNP-predicted sexes and ancestries. Therefore, we will have to submit the pipeline twice. So let's get started with the first part.
This example will use the test dataset that was provided with the Singularity image which you can get using the instructions above.
- First, let's do a "dry run" to identify what jobs will be run (remember to activate your snakemake environment before running: conda activate wg1_snakemake):
snakemake \
--snakefile $IMPUTATION_SNAKEFILE \
--configfile $IMPUTATION_CONFIG \
--dryrun \
--cores 1 \
--reason
- The result should show you all the jobs that snakemake will run.
- Next, we can check how each of these jobs relates to one another:
snakemake \
--snakefile $IMPUTATION_SNAKEFILE \
--configfile $IMPUTATION_CONFIG \
--dag | \
dot -Tsvg \
> dag1.svg
- The resulting image (dag1.svg) will be saved to your current directory.
- Next, let's run those jobs:
You will likely need to change the cluster command dependent on your job submission platform. This example is the job submission command for an SGE cluster. Some other submission examples for SLURM clusters are available here.
One of the rules (final_pruning) requires OpenBLAS variables to be set when running from a conda environment. You may want to set these before running, or if you receive segmentation faults for that rule. Instructions to do so can be found in the Common Errors and How to Fix Them section.
nohup \
snakemake \
--snakefile $IMPUTATION_SNAKEFILE \
--configfile $IMPUTATION_CONFIG \
--rerun-incomplete \
--jobs 20 \
--use-singularity \
--restart-times 2 \
--keep-going \
--cluster \
"qsub -S /bin/bash \
-q short.q \
-r yes \
-pe smp {threads} \
-l tmp_requested={resources.disk_per_thread_gb}G \
-l mem_requested={resources.mem_per_thread_gb}G \
-e $LOG \
-o $LOG \
-j y \
-V" \
> $LOG/nohup_`date +%Y-%m-%d.%H:%M:%S`.log &
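While the jobs are running, you can keep an eye on progress with something like the following (qstat is SGE-specific and the exact log file name will match the timestamp nohup created; adjust for your scheduler):
# Check the status of your submitted cluster jobs (SGE)
qstat
# Follow the snakemake log written by nohup
tail -f $LOG/nohup_*.log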
- These first 10 steps shouldn't take too long to run. For our example dataset with 15 individuals, it should take about 8 minutes.
So far the pipeline ran a few QC steps and checked to see if the user-reported information matches the SNP-based predictions for sex and ancestry. You should have the following results folder structure:
results/
├── check_sex
├── common_snps
├── indiv_missingness
├── pca_projection
└── pca_sex_checks
Now comes the part where we need to get your input. The pca_sex_checks directory will have files that contain any discrepancies between the reported and SNP-predicted sex and ancestry.
Let's first look at the sex prediction discrepancy:
- Our check_sex_update_remove.tsv file has one individual with non-matching information - we provided that the sex of this individual was female (2) but the SNP-based prediction was male (1):
#FID  IID  PEDSEX  SNPSEX  STATUS   F       UPDATE/REMOVE/KEEP
113   113  2       1       PROBLEM  0.9736
- You have three options for how to deal with each of the samples that have mismatching sex information:
- UPDATE will update the assignment to the SNP-predicted decision
- REMOVE will remove the individual from downstream analysis
- KEEP will keep the original manually-reported assignment
- Upon checking our records, we can see that this was a manual entry mistake and we will update to the SNP-predicted sex. So we type UPDATE in the UPDATE/REMOVE/KEEP column of the check_sex_update_remove.tsv file:
#FID  IID  PEDSEX  SNPSEX  STATUS   F       UPDATE/REMOVE/KEEP
113   113  2       1       PROBLEM  0.9736  UPDATE
Let's next move on to the discrepancies between the user-provided and SNP-predicted ancestries.
- The ancestry_update_remove.tsv file in the pca_sex_checks directory has one individual with non-matching information - we provided that the ancestry of the individual was South Asian (SAS) but the SNP-predicted ancestry was European (EUR). There is also an empty field for our decision to be entered:
#FID  IID  PAT  MAT  SEX  Provided_Ancestry  PCA_Assignment  UPDATE/REMOVE/KEEP
349   350  0    0    1    SAS                EUR
- There is also a figure that demonstrates where each of the individuals from the vcf file sits within the 1000G reference PCA space. The left panel has all the individuals from the 1000G reference colored by ancestry, the middle panel has the individuals from our dataset colored by the predicted ancestry and the right panel has the individuals from our dataset colored by whether the provided and predicted ancestries match.
- The individual who was recorded as SAS but was predicted to be of EUR ancestry is colored in maroon in the right panel. Since the individual is similar to the EUR ancestry cluster from 1000G, we want to update this individual to the SNP-predicted ancestry, so we will type UPDATE in the UPDATE/REMOVE/KEEP column:
#FID  IID  PAT  MAT  SEX  Provided_Ancestry  PCA_Assignment  UPDATE/REMOVE/KEEP
349   350  0    0    1    SAS                EUR             UPDATE
Now that we have provided the manual information that is required, we can run the rest of the pipeline which will 1) update the variant locations to the hg38 reference, 2) filter for high quality variants and individuals, 3) phase the data, 4) impute the data and 5) do some post-imputation processing of the results.
- Let's first do another dry run, which will start an interactive prompt asking which ancestral populations you want to impute and what minor allele frequency you would like to use for each population:
snakemake \
--snakefile $IMPUTATION_SNAKEFILE \
--configfile $IMPUTATION_CONFIG \
--dryrun \
--cores 1 \
--reason
You should be entered into an interactive prompt. With the test dataset, it looks like this:
You have 14 individuals from EUR ancestry. Would you like to impute for this ancestral population? (yes/no)
Then you need to answer whether you want that ancestral population to undergo imputation. In our case, we answer yes and it provides another prompt:
What minor allele frequency filtering would you like to use for the pre-imputation processing for the EUR ancestry group. A value of 0.05 removes SNPs with < 5% minor alleles from the analysis. For no filtering use 0. (0-1)
In our case, we enter 0 since we have a small sample size and don't want to over-filter the data before imputation. Most groups will have much larger sample sizes, so we anticipate 0.03 (3%) to 0.05 (5%) to be the most frequently used thresholds.
If you didn't fill out your check_sex_update_remove.tsv and ancestry_update_remove.tsv files correctly, you will get an error message and the pipeline won't run until you provide the correct input files. You will receive the following message:
The column names of your psam file are not correct. They should be: '#FID', 'IID', 'PAT', 'MAT', 'SEX', 'Provided_Ancestry', 'genotyping_platform', 'array_available', 'wgs_available', 'wes_available', 'age', 'age_range', 'Study'. If the names look the same, check that the file is tab separated, without any spaces or other weird characters.
Exiting.
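If you hit that message, one quick way to reveal hidden whitespace problems in the psam header (an optional suggestion; cat -A is the GNU coreutils flag that shows tabs as ^I and line ends as $) is:
# Inspect the first two lines of the psam file; columns should be separated by single tabs (^I)
head -n 2 data.psam | cat -A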
After the manual entries, you should see the following jobs will be run by the pipeline:
Your responses to the manual entry steps are saved to a file to be used downstream. If you made an error in the manual entry or want to change your responses, you can either edit the file directly or delete it and run the dryrun command again to enter new answers. The file is located at results/pca_sex_checks/ancestry_mafs.tsv.
- Let's also take a look at how those new jobs fit in with the steps that we already ran:
snakemake \
--snakefile $IMPUTATION_SNAKEFILE \
--configfile $IMPUTATION_CONFIG \
--dag | \
dot -Tsvg \
> dag2.svg
- The resulting image will show jobs that are completed in dashed lines and those that still need to be run in solid lines. This will be saved to your current directory. It's quite a busy image so you can open it here if you want to take a look at it.
- Next, let's run those new jobs:
- *** Remember that you may need to change the cluster command dependent on your job submission platform. This example is the job submission command for an SGE cluster. ***
nohup \
snakemake \
--snakefile $IMPUTATION_SNAKEFILE \
--configfile $IMPUTATION_CONFIG \
--rerun-incomplete \
--jobs 20 \
--use-singularity \
--restart-times 2 \
--keep-going \
--cluster \
"qsub -S /bin/bash \
-q short.q \
-r yes \
-pe smp {threads} \
-l tmp_requested={resources.disk_per_thread_gb}G \
-l mem_requested={resources.mem_per_thread_gb}G \
-e $LOG \
-o $LOG \
-j y \
-V" \
> $LOG/nohup_`date +%Y-%m-%d.%H:%M:%S`.log &
Some HPCs have java memory options preset which may interfere with some jobs and cause them to fail (specifically the harmonize_hg38 rule). See the Common Errors and How to Fix Them section for a way to fix this.
After running those jobs, you should be done!
You should have the following results directories:
results/
├── check_sex
├── common_snps
├── crossmapped
├── crossmapped_sorted
├── eagle_prephasing
├── filter_preimpute_vcf
├── harmonize_hg38
├── het
├── het_filter
├── indiv_missingness
├── minimac_imputed
├── pca_projection
├── pca_sex_checks
├── separate_indivs
├── split_by_chr
├── subset_ancestry
├── update_sex_ancestry
├── vcf_all_merged
├── vcf_fixref_hg38
└── vcf_merged_by_ancestries
The most important directories and results are:
results/
├── vcf_all_merged
│ ├── imputed_hg38_R2_0.3_MAF0.05_exons_complete_cases.recode.vcf
│ ├── imputed_hg38_R2_0.3_MAF0.05_exons.recode.vcf
│ ├── imputed_hg38_R2_0.3_MAF0.05.vcf.gz
│ ├── imputed_hg38.vcf.gz
│ └── imputed_hg38.vcf.gz.csi
└── vcf_merged_by_ancestries
├── EUR_imputed_hg38.vcf.gz
└── EUR_imputed_hg38.vcf.gz.csi
- The files in vcf_merged_by_ancestries and vcf_all_merged are the files that will be used (after further processing) for analyses like demultiplexing and QTL detection.
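If you want a quick sanity check of these imputed VCFs before moving on (optional; this assumes bcftools is available on your system or inside the singularity image):
# List the samples present in the merged, imputed VCF
bcftools query -l results/vcf_all_merged/imputed_hg38.vcf.gz
# Count the imputed records using the existing .csi index
bcftools index --nrecords results/vcf_all_merged/imputed_hg38.vcf.gz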
- My final_pruning or my sex4imputation and subset_ancestry rules have returned core dumps and segmentation faults. How do I fix this? These rules depend on OpenBLAS and need some variables to be set when running from a conda environment. Try running these commands before resubmitting:
export OMP_NUM_THREADS=1
export USE_SIMPLE_THREADED_LEVEL3=1
- The harmonize_hg38 step is failing due to lack of memory. This happens due to system presets for java which over-ride the memory that is needed for the job. Check what your $JAVA_OPTS are and set them to a larger Xmx before restarting the pipeline. For example:
JAVA_OPTS="-Xms256m -Xmx25g"
If you have any questions or issues while running the pipeline, feel free to open an issue or directly email Drew Neavin (d.neavin @ garvan.org.au).