A population-specific pangenome for Saudi Arabia

Saudi Pangenome Analysis Pipeline

Overview

This repository contains workflows and data for constructing pangenomes from Saudi population genomic data using the Minigraph-Cactus pipeline. The workflows are written in the Common Workflow Language (CWL), which makes them reproducible and portable across computing environments.

Background

Minigraph-Cactus Pipeline

Minigraph-Cactus is a pangenome construction pipeline that combines the efficiency of minigraph with the accuracy of Progressive Cactus. This hybrid approach offers several advantages:

Scalable construction of genome graphs from multiple assemblies
Preservation of structural variants and complex genomic regions
Efficient handling of large-scale genomic data
Integration of both reference-based and reference-free approaches

The pipeline first uses minigraph to build an initial graph from the input assemblies. Progressive Cactus then computes a multiple genome alignment, and the result is a high-quality pangenome representation.

Common Workflow Language (CWL)

CWL is a specification for describing analysis workflows and tools. Its key features include:

Platform independence
Explicit declaration of dependencies
Built-in support for Docker/container technologies
Scalability from single workstations to HPC environments
Clear separation of tools from the workflow logic

Workflows

Installation

Prerequisites

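Install Docker or Singularity, which are needed to run the containerized tools.

# Install the CWL reference runner
# (the cwlref-runner package provides the cwl-runner alias used under Usage)
pip install cwltool cwlref-runner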

Repository Setup

git clone https://github.com/bio-ontology-research-group/pangenome
cd pangenome

Usage

DeepVariant workflow

cwl-runner variant-calling/main-vg.cwl inputs.yml
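
When the workflow has to be launched for many samples, or with Singularity instead of Docker, the command above can be assembled in a small script. The following is a minimal sketch that assumes cwltool is the runner. `--outdir` and `--singularity` are cwltool options; the function names and the `results` output directory are illustrative and not part of this repository.

```python
import shlex
import subprocess


def build_command(workflow, job_file, outdir, use_singularity=False):
    """Return the cwltool argument list for one workflow run."""
    cmd = ["cwltool", "--outdir", str(outdir)]
    if use_singularity:
        # cwltool uses Docker by default; this flag switches to Singularity.
        cmd.append("--singularity")
    cmd += [str(workflow), str(job_file)]
    return cmd


def run(cmd, dry_run=True):
    """Print the command and, unless dry_run is set, execute it."""
    print(shlex.join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError if the workflow fails.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    command = build_command(
        "variant-calling/main-vg.cwl",
        "inputs.yml",
        "results",  # hypothetical output directory
        use_singularity=True,
    )
    run(command, dry_run=True)
```

With `dry_run=True` the script only prints the command, which is a convenient way to check the arguments before starting a long run.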

Configuration

Create a YAML file with your inputs (inputs.yml in the command above):

graph:
  class: File
  location: reference/ksa_hg38.gbz
reads1:
  class: File
  location: input/sample_R1.fastq.gz
reads2:
  class: File
  location: input/sample_R2.fastq.gz
hapl:
  class: File
  location: reference/ksa_hg38.hapl
ref_prefix: GRCh38
model_type: WGS
ref:
  class: File
  location: reference/hg38.fa
  secondaryFiles:
    - class: File
      location: reference/hg38.fa.fai
output_cram: aligned_reads.cram
output_vcf: sample.vcf
output_gvcf: sample.gvcf
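
CWL input objects can also be written as JSON, which is convenient when one job file per sample has to be generated. The sketch below uses only the Python standard library and the input names from the configuration above. The helper names, the directory defaults, and the per-sample output naming are assumptions made for illustration.

```python
import json
from pathlib import Path


def cwl_file(path, secondary=()):
    """Build a CWL File object, optionally with secondary files."""
    obj = {"class": "File", "location": Path(path).as_posix()}
    if secondary:
        obj["secondaryFiles"] = [cwl_file(p) for p in secondary]
    return obj


def make_job(sample, input_dir="input", reference_dir="reference"):
    """Return the workflow inputs for one paired-end sample."""
    inp, ref = Path(input_dir), Path(reference_dir)
    return {
        "graph": cwl_file(ref / "ksa_hg38.gbz"),
        "reads1": cwl_file(inp / f"{sample}_R1.fastq.gz"),
        "reads2": cwl_file(inp / f"{sample}_R2.fastq.gz"),
        "hapl": cwl_file(ref / "ksa_hg38.hapl"),
        "ref_prefix": "GRCh38",
        "model_type": "WGS",
        "ref": cwl_file(ref / "hg38.fa", secondary=[ref / "hg38.fa.fai"]),
        # Output names derived from the sample name (an assumed convention).
        "output_cram": f"{sample}.cram",
        "output_vcf": f"{sample}.vcf",
        "output_gvcf": f"{sample}.gvcf",
    }


if __name__ == "__main__":
    # Print the job for a sample called "sample"; redirect to a .json file
    # and pass that file to the runner in place of inputs.yml.
    print(json.dumps(make_job("sample"), indent=2))
```

Generating the job from one function keeps the reference files identical across samples, so only the read paths and output names change.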

Output Files

aligned_reads.cram: Alignment against GRCh38
sample.gvcf: Genomic VCF
sample.vcf: Variant calls
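
A quick sanity check after a run is to count the records in the VCF. The sketch below reads an uncompressed VCF with the Python standard library only and groups records by their FILTER column. `count_records` is an illustrative helper, not part of the workflow.

```python
from collections import Counter


def count_records(vcf_path):
    """Count VCF data lines per FILTER value, skipping header lines."""
    counts = Counter()
    with open(vcf_path) as handle:
        for line in handle:
            # Header and metadata lines start with '#'.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            counts[fields[6]] += 1  # column 7 of a VCF data line is FILTER
    return counts
```

Calling `count_records("sample.vcf")` returns a Counter keyed by FILTER value, for example the number of records marked PASS.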

Citation

If you use this pipeline in your research, please cite:

License

The data is available under the CC0 license.

Contact

Please create GitHub issues if you encounter any problems.
