
SNP Genotype Imputation Using 1000G hg38 Reference


If you have any questions or issues while running the pipeline, feel free to contact Drew Neavin (d.neavin @ garvan.org.au). Similarly, if there are portions of the pipeline that you would like to update, please feel free to fork the repository, make the changes and we can implement them.

Background and Introduction

We have put together instructions for processing and imputing SNP genotype data using Eagle for phasing and Minimac4 for imputation with the 1000G hg38 reference.

Required Input

This table lists the inputs the pipeline requires and indicates whether each is provided by us or needs to be prepared and provided by the user.

Input                                     Category     User-provided                   Developer-provided
Genotyped SNPs plink2 pfiles (on hg19)    Study Data   ✔️ (example dataset provided)   ✖️
1000G hg38 imputation reference           Reference    ✖️                              ✔️
Singularity image                         Software     ✖️                              ✔️
Snakemake                                 Software     ✖️                              ✔️
scipy                                     Software     ✖️                              ✔️

Reference

We will be using the 1000G hg38 reference to impute the data and have prepared the reference.

The reference data is already downloaded on the Garvan HPC for general use. You can find it here:

/directflow/SCCGGroupShare/projects/data/reference_data/1000G_hg38_imputation_ref/hg38

If, for some reason, you are working on another cluster or that data has been moved, you can access the data here:

wget https://www.dropbox.com/s/l60a2r3e4vo78mn/eQTLGenImpRef.tar.gz
wget https://www.dropbox.com/s/eci808v0uepqgcz/eQTLGenImpRef.tar.gz.md5

After downloading the reference, it is best to make sure the md5sum of the eQTLGenImpRef.tar.gz file matches the md5sum in the eQTLGenImpRef.tar.gz.md5:

md5sum eQTLGenImpRef.tar.gz > downloaded_eQTLGenImpRef.tar.gz.md5
diff -s eQTLGenImpRef.tar.gz.md5 downloaded_eQTLGenImpRef.tar.gz.md5

which should return:

Files eQTLGenImpRef.tar.gz.md5 and downloaded_eQTLGenImpRef.tar.gz.md5 are identical

If you get anything else, the download was probably incomplete and you should try to download the file again. Then, unpack the contents of the file:

tar xvzf eQTLGenImpRef.tar.gz

IMPORTANT

Some HPCs limit the amount of time that a command can run on the head node, which can cause the untar to stop or fail part way through, so it is best to untar using a submission script (see the example sketch below).
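
For example, a minimal SGE submission script might look like the sketch below (a hedged example; the queue and memory request are placeholders borrowed from the cluster example later in this document, and untar_ref.sh is a hypothetical filename):

cat > untar_ref.sh << 'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -q short.q
#$ -l mem_requested=4G
#$ -cwd
tar xvzf eQTLGenImpRef.tar.gz
EOF
qsub untar_ref.sh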

Now you should have the references that are needed to impute the SNP genotype data. You will have the following directory structure:

hg38
├── imputation
├── phasing
│   ├── genetic_map
│   └── phasing_reference
├── ref_genome_QC
└── ref_panel_QC

Data

We have provided a test dataset that can be used to test the pipeline; it is built into the singularity image (below) and will be used for the example that follows.

For your own dataset, you will need to make sure you have all the following files in the correct formats. You can check the test dataset for an example.

Plink2 reference SNP genotype pfiles

Your reference SNP genotype data will need to be supplied in the plink2 format which includes 3 files: data.pgen, data.psam, data.pvar

IMPORTANT

Your chromosome encoding in the data.pvar file must not use 'chr'. For example, chromosome 1 should be encoded as '1', not 'chr1'. The pipeline will check for this before running and will not run if it finds 'chr' chromosome encoding.
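
A quick way to check (and, if needed, fix) the chromosome encoding is sketched below; data.pvar is a placeholder for your own .pvar file:

# List the chromosome codes present in the .pvar (skipping header lines)
grep -v '^#' data.pvar | cut -f1 | sort -u
# If they appear as chr1, chr2, ..., one way to strip the prefix is:
sed -i '/^#/! s/^chr//' data.pvar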

IMPORTANT

The data.psam file needs to have some specific columns since it will be important for:

  • Comparing reported sexes with SNP-genotype predicted sexes
  • Comparing reported ancestries with 1000 Genomes-projected ancestry predictions

The psam should look like this (and requires these headings):

#FID  IID  PAT  MAT  SEX  Provided_Ancestry
113   113  0    0    1    EUR
349   350  0    0    1    EUR
352   353  0    0    2    EUR
39    39   0    0    2    EUR
40    40   0    0    2    EUR
41    41   0    0    1    EUR
42    42   0    0    2    EUR
...   ...  ...  ...  ...  ...

Key for column contents:

  • #FID: Family ID
  • IID: Within-family ID
  • PAT: Within-family ID of father ('0' if father isn't in dataset)
  • MAT: Within-family ID of mother ('0' if mother isn't in dataset)
  • SEX: Sex code ('1' = male, '2' = female, '0' = unknown)
  • Provided_Ancestry: reported ancestry ('AFR' = African, 'AMR' = Ad Mixed American, 'EAS' = East Asian, 'EUR' = European, 'SAS' = South Asian). If you don't know, you can fill this with any character string.

IMPORTANT

You can include as many additional metadata columns as you like in the psam file but the first 6 columns must be those listed above.
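
A quick sanity check of the column headings is sketched below (data.psam is a placeholder for your own tab-separated .psam file):

# Print the first six column names; they should read:
# #FID  IID  PAT  MAT  SEX  Provided_Ancestry
head -n 1 data.psam | cut -f1-6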

Required Software

Singularity Image

You will also need the Singularity image, which has all the software needed for the pipeline installed for consistency across different systems. Most HPCs have Singularity installed so you should be able to run Singularity images without any issue. If you are unsure, try running singularity --help. If Singularity is not installed on your HPC, reach out to your system administrators and see if they can install it. Otherwise, feel free to reach out to us and we can find another solution.

  1. Download the singularity image (~6 GB) with the following commands:
wget https://www.dropbox.com/s/mwjpndpclp6njcg/SNP_imputation_1000g_hg38.sif
wget https://www.dropbox.com/s/oiw1y6fi7ouboqp/SNP_imputation_1000g_hg38.sif.md5

Again, please make sure that the image is not corrupted by comparing the md5sums:

md5sum SNP_imputation_1000g_hg38.sif > downloaded_SNP_imputation_1000g_hg38.sif.md5
diff -s SNP_imputation_1000g_hg38.sif.md5 downloaded_SNP_imputation_1000g_hg38.sif.md5
  2. Next, please set up the pipeline with the following command. You will need to provide an absolute directory path so that singularity can find the correct directories; this must be your current working directory or a directory above it:
singularity run --bind <absolute_directory_path> --app setup SNP_imputation_1000g_hg38.sif .
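
For instance (a hedged example; the path is a placeholder), if your working directory is /path/to/projects/imputation, then /path/to/projects is a valid bind path:

singularity run --bind /path/to/projects --app setup SNP_imputation_1000g_hg38.sif .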

IMPORTANT

The pipeline expects certain files pulled from the image to be in the same directory as the singularity image, so you will have to rerun the setup steps if you move the image.

Optional

If you want to test the pipeline on your system, you can use the test dataset we have provided in the image. Run the following to copy the test dataset from the image to your local directory:

singularity run --bind <absolute_directory_path> --app test_dataset SNP_imputation_1000g_hg38.sif .
tar -xzf ImputationTestDataset_plink.tar.gz

Install Snakemake

You will also need Snakemake and scipy to run the pipeline. We highly recommend you use the conda environment that we have prepared with all the requirements. The conda environment yaml (snakemake.yaml) should have been copied into your local directory when you set up the singularity image above.

To make a conda environment from this yaml, you can run:

conda env create -f snakemake.yaml -n wg1_snakemake

Then, to activate this environment, run:

conda activate wg1_snakemake

If you would prefer to install snakemake and scipy yourself, you can follow the instructions for installing Snakemake and then install scipy with pip install scipy (a minimal sketch is shown below).
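
For reference, a minimal manual setup might look like this (a hedged sketch assuming conda is available; the prepared snakemake.yaml environment remains the recommended route):

conda create -n wg1_snakemake -c conda-forge -c bioconda snakemake
conda activate wg1_snakemake
pip install scipy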

SNP Genotype Preprocessing

Here's an outline of the general steps that will be taken by this pipeline:

[Figure: overview of the pipeline steps]

Preparing to Run the Pipeline

  1. The first step is to copy the PreImputation.yaml from the pipeline directory to a working directory and edit it based on your system, directories and setup. The contents of this file are used to pass arguments to the pipeline.

  2. There are only five parameters that need user input:

    • ref_dir: the directory to the 1000G hg38 reference as indicated in the Reference section
    • singularity_image: path to the singularity image
    • plink_dir: directory to your reference SNP genotype plink2 files
    • bind_paths: the paths to bind for the singularity image. Singularity mounts all files and directories under these paths so it can find and call the required files. You can provide more than one path by separating them with commas.
    • output_dir: the directory where you want to write all the files to. Make sure this directory exists before running snakemake.

    The other parameters are used by specific rules to set memory or threads for each command. You shouldn't need to edit any of these, but if a specific rule runs out of memory, you can change them in this file.

  3. Before running Snakemake, let's define some variables that will make the commands easier and cleaner to run.

    • Let's define a variable, $IMPUTATION_SNAKEFILE, which is the location of the Snakefile in the cloned github repository. Change the path based on where the file is on your system:
    IMPUTATION_SNAKEFILE=/path/to/WG1-pipeline-QC/Snakefile
    • Let's also define a variable for the location of the edited configuration yaml. Change the path based on where the file is on your system:
    IMPUTATION_CONFIG=/path/to/edited/PreImputation.yaml
    • Finally, let's define a location where we would like to keep the cluster log outputs and make that directory. Change the path based on where you want log files to be written.
    LOG=/path/to/cluster/log_dir
    mkdir -p $LOG

IMPORTANT

If you log out of the cluster and log back in to run more jobs, you will have to redefine each of those variables. We recommend keeping the commands in a file that can easily be sourced to redefine each variable (see the sketch below).
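
For example (a hedged sketch; imputation_variables.sh is a hypothetical filename and the paths are placeholders):

# Save the variable definitions once...
cat > imputation_variables.sh << 'EOF'
IMPUTATION_SNAKEFILE=/path/to/WG1-pipeline-QC/Snakefile
IMPUTATION_CONFIG=/path/to/edited/PreImputation.yaml
LOG=/path/to/cluster/log_dir
EOF

# ...and re-load them after logging back in:
source imputation_variables.sh
mkdir -p $LOG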

Running the Pipeline - Phase 1

Now we have all the pieces that we need to run the pipeline. This Snakemake pipeline is built to run all the SNP genotype imputation pre-processing steps that are necessary. However, there is one step that needs user input on potential discrepancies between the recorded sexes and ancestries and the SNP-predicted sexes and ancestries. Therefore, we will have to submit the pipeline twice. So let's get started with the first part.

This example will use the test dataset that was provided with the Singularity image which you can get using the instructions above.

  1. First, let's do a "dry run" to identify what jobs will be run (remember to activate your snakemake environment before running: conda activate wg1_snakemake):

    snakemake \
        --snakefile $IMPUTATION_SNAKEFILE \
        --configfile $IMPUTATION_CONFIG \
        --dryrun \
        --cores 1 \
        --reason
    • The result should show you all the jobs that snakemake will run:

      [Screenshot: dry-run job list]
  2. Next we can check how each of these jobs relates to one another:

    snakemake \
        --snakefile $IMPUTATION_SNAKEFILE \
        --configfile $IMPUTATION_CONFIG \
        --dag | \
        dot -Tsvg \
            > dag1.svg
    • The resulting image will be saved to your current directory.

  3. Next, let's run those jobs:

IMPORTANT

You will likely need to change the cluster command depending on your job submission platform. This example is the job submission command for an SGE cluster. Some other submission examples for SLURM clusters are available here (a rough SLURM sketch is also shown after the command below).

One of the rules (final_pruning) requires OpenBLAS variables to be set when running from a conda environment. You may want to set these before running, or if you receive segmentation faults for that rule. Instructions to do so can be found in the Common Errors and How to Fix Them section.

  nohup \
    snakemake \
      --snakefile $IMPUTATION_SNAKEFILE \
      --configfile $IMPUTATION_CONFIG \
      --rerun-incomplete \
      --jobs 20 \
      --use-singularity \
      --restart-times 2 \
      --keep-going \
      --cluster \
          "qsub -S /bin/bash \
          -q short.q \
          -r yes \
          -pe smp {threads} \
          -l tmp_requested={resources.disk_per_thread_gb}G \
          -l mem_requested={resources.mem_per_thread_gb}G \
          -e $LOG \
          -o $LOG \
          -j y \
          -V" \
    > $LOG/nohup_`date +%Y-%m-%d.%H:%M:%S`.log &
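
  If you are on a SLURM cluster instead, the --cluster portion of the command above might look roughly like the following (a hedged sketch; the partition name is a placeholder and your site may require different resource flags):

      --cluster \
          "sbatch \
          --partition=short \
          --cpus-per-task={threads} \
          --mem-per-cpu={resources.mem_per_thread_gb}G \
          --tmp={resources.disk_per_thread_gb}G \
          --output=$LOG/%x_%j.log \
          --export=ALL"
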
  • These first 10 steps shouldn't take too long to run. For our example dataset with 15 individuals, it should take about 8 minutes.
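
  While the jobs are running, you can follow progress by tailing the most recent nohup log (a hedged convenience one-liner; adjust it if your log naming differs):

      tail -f "$(ls -t $LOG/nohup_*.log | head -n 1)"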

Running the Pipeline - Manual Selections

So far the pipeline has run a few QC steps and checked whether the user-reported information matches the SNP-based predictions for sex and ancestry. You should have the following results folder structure:

results/
├── check_sex
├── common_snps
├── indiv_missingness
├── pca_projection
└── pca_sex_checks

Now comes the part where we need your input. The pca_sex_checks directory contains files that list any discrepancies between the reported and SNP-predicted sex and ancestry.

Let's first look at the sex prediction discrepancy:

  • Our check_sex_update_remove.tsv file has one individual with non-matching information - we reported that the sex of this individual was female (2) but the SNP-based prediction was male (1):

    #FID IID PEDSEX SNPSEX STATUS F UPDATE/REMOVE/KEEP
    113 113 2 1 PROBLEM 0.9736
  • You have three options for how to deal with each of the samples that have mismatching sex information:

    • UPDATE will update the assignment to the SNP-predicted decision
    • REMOVE will remove the individual from downstream analysis
    • KEEP will keep the original manually-reported assignment
  • Upon checking our records, we can see that this was a manual entry mistake and we will update to the SNP-predicted sex. So we type UPDATE in the UPDATE/REMOVE/KEEP column of the check_sex_update_remove.tsv file:

    #FID IID PEDSEX SNPSEX STATUS F UPDATE/REMOVE/KEEP
    113 113 2 1 PROBLEM 0.9736 UPDATE

Let's next move on to the discrepancies between the user-provided and SNP-predicted ancestries.

  • The ancestry_update_remove.tsv file in the pca_sex_checks directory has one individual with non-matching information - we reported that the ancestry of this individual was South Asian (SAS) but the SNP-predicted ancestry was European (EUR). There is also an empty field for our decision to be entered:

    #FID IID PAT MAT SEX Provided_Ancestry PCA_Assignment UPDATE/REMOVE/KEEP
    349 350 0 0 1 SAS EUR
  • There is also a figure that shows where each individual from the vcf file sits within the 1000G reference PCA space. The left panel has all the individuals from the 1000G reference colored by ancestry, the middle panel has the individuals from our dataset colored by the predicted ancestry and the right panel has the individuals from our dataset colored by whether the provided and predicted ancestries match:

    [Figure: Ancestry PCA projections]

  • The individual who was recorded as SAS but was predicted to be of EUR ancestry is colored in maroon in the right panel. Since this individual falls within the EUR ancestry cluster from 1000G, we want to update this individual to the SNP-predicted ancestry, so we will type UPDATE in the UPDATE/REMOVE/KEEP column:

    #FID IID PAT MAT SEX Provided_Ancestry PCA_Assignment UPDATE/REMOVE/KEEP
    349 350 0 0 1 SAS EUR UPDATE

Running the Pipeline - Final Run

Now that we have provided the manual information that is required, we can run the rest of the pipeline which will 1) update the variant locations to the hg38 reference, 2) filter for high quality variants and individuals, 3) phase the data, 4) impute the data and 5) do some post-imputation processing of the results.

  1. Let's first do another dry run, which will start an interactive prompt asking which ancestral populations you want to impute and what minor allele frequency you would like to use for each population.

    snakemake \
      --snakefile $IMPUTATION_SNAKEFILE \
      --configfile $IMPUTATION_CONFIG \
      --dryrun \
      --cores 1 \
      --reason

    You will be taken to an interactive prompt. With the test dataset, it looks like this:

    You have 14 individuals from EUR ancestry.
    Would you like to impute for this ancestral population? (yes/no)
    

    Then you need to answer if you want that ancestral population to undergo imputation. In our case, we answer yes and it provides another prompt:

    What minor allele frequency filtering would you like to use for the pre-imputation processing for the EUR ancestry group?
    A value of 0.05 removes SNPs with < 5% minor alleles from the analysis.
    For no filtering use 0.
    (0-1)
    

    In our case, we enter 0 since we have a small sample size and don't want to over-filter the data before imputation. Most groups will have much larger sample sizes, so we anticipate 0.03 (3%) to 0.05 (5%) to be the most frequently used thresholds.

    IMPORTANT

    If you didn't fill out your check_sex_update_remove.tsv and ancestry_update_remove.tsv files correctly, the pipeline won't run until you provide the correct input files, and you will receive the following error message:

    The column names of your psam file are not correct. They should be: '#FID', 'IID', 'PAT', 'MAT', 'SEX', 'Provided_Ancestry','genotyping_platform', 'array_available', 'wgs_available','wes_available', 'age', 'age_range', 'Study'. If the names look the same, check that the file is tab separated, without any spaces or other weird characters.

    Exiting.

    After the manual entries, you should see the following jobs will be run by the pipeline:

    [Screenshot: dry-run job list after the manual entries]

    NOTE

    Your responses to the manual entry steps are saved to a file to be used downstream. If you made an error in the manual entry or want to change your responses, you can either edit the file directly or delete it and run the dryrun command again to enter new answers.

    The file is located at results/pca_sex_checks/ancestry_mafs.tsv

  2. Let's also take a look at how those new jobs fit in with the steps that we already ran:

    snakemake \
        --snakefile $IMPUTATION_SNAKEFILE \
        --configfile $IMPUTATION_CONFIG \
        --dag | \
        dot -Tsvg \
            > dag2.svg
    • The resulting image will show jobs that are completed in dashed lines and those that still need to be run in solid lines. This will be saved to your current directory. It's quite a busy image so you can open it here if you want to take a look at it.
  3. Next, let's run those new jobs:

    • Remember that you may need to change the cluster command depending on your job submission platform. This example is the job submission command for an SGE cluster.
    nohup \
      snakemake \
        --snakefile $IMPUTATION_SNAKEFILE \
        --configfile $IMPUTATION_CONFIG \
        --rerun-incomplete \
        --jobs 20 \
        --use-singularity \
        --restart-times 2 \
        --keep-going \
        --cluster \
            "qsub -S /bin/bash \
            -q short.q \
            -r yes \
            -pe smp {threads} \
            -l tmp_requested={resources.disk_per_thread_gb}G \
            -l mem_requested={resources.mem_per_thread_gb}G \
            -e $LOG \
            -o $LOG \
            -j y \
            -V" \
      > $LOG/nohup_`date +%Y-%m-%d.%H:%M:%S`.log &

IMPORTANT

Some HPCs have java memory options preset which may interfere with some jobs and cause them to fail (specifically the harmonize_hg38 rule). See the Common Errors and How to Fix Them section for a way to fix this.

Results!

After running those jobs, you should be done!

You should have the following results directories:

results/
├── check_sex
├── common_snps
├── crossmapped
├── crossmapped_sorted
├── eagle_prephasing
├── filter_preimpute_vcf
├── harmonize_hg38
├── het
├── het_filter
├── indiv_missingness
├── minimac_imputed
├── pca_projection
├── pca_sex_checks
├── separate_indivs
├── split_by_chr
├── subset_ancestry
├── update_sex_ancestry
├── vcf_all_merged
├── vcf_fixref_hg38
└── vcf_merged_by_ancestries

The most important directories and results are:

results/
├── vcf_all_merged
│   ├── imputed_hg38_R2_0.3_MAF0.05_exons_complete_cases.recode.vcf
│   ├── imputed_hg38_R2_0.3_MAF0.05_exons.recode.vcf
│   ├── imputed_hg38_R2_0.3_MAF0.05.vcf.gz
│   ├── imputed_hg38.vcf.gz
│   └── imputed_hg38.vcf.gz.csi
└── vcf_merged_by_ancestries
    ├── EUR_imputed_hg38.vcf.gz
    └── EUR_imputed_hg38.vcf.gz.csi
  • The files in vcf_merged_by_ancestries and vcf_all_merged are the files that will be used (after further processing) for analyses like demultiplexing and QTL detection.
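
  • As an optional sanity check (a hedged sketch assuming bcftools is available on your system), you can summarise one of the imputed VCFs:

    # Print the summary-number (SN) lines: number of samples, records, SNPs, etc.
    bcftools stats results/vcf_all_merged/imputed_hg38.vcf.gz | grep "^SN"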

🎉 Congrats! 🎉 You have successfully imputed your SNP genotype data!

Common Errors and How to Fix Them

  1. My final_pruning, sex4imputation or subset_ancestry rules returned core dumps and segmentation faults. How do I fix this? These rules depend on OpenBLAS and need some environment variables to be set when running from a conda environment. Try setting these before resubmitting the command:
    export OMP_NUM_THREADS=1
    export USE_SIMPLE_THREADED_LEVEL3=1
  2. The harmonize_hg38 step is failing due to lack of memory. This happens when system presets for java override the memory that is needed for the job. Check what your $JAVA_OPTS is set to and increase the Xmx value before restarting the pipeline. For example:
    export JAVA_OPTS="-Xms256m -Xmx25g"

If you have any questions or issues while running the pipeline, feel free to open an issue or directly email Drew Neavin (d.neavin @ garvan.org.au).