Add new ChIP-Seq WF that handles replicates and controls #581

Open
wants to merge 4 commits into
base: main
Changes from 3 commits
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
version: 1.2
workflows:
  - name: main
    subclass: Galaxy
    publish: true
    primaryDescriptorPath: /chipseq-pe-with-replicates-controls.ga
    testParameterFiles:
      - /chipseq-pe-with-replicates-controls-tests.yml
    authors:
      - name: Wolfgang Maier
        orcid: 0000-0002-9464-6640
@@ -0,0 +1,5 @@
version: '0.1'
registries:
  - url: https://workflowhub.eu
    project: iwc
    workflow: chipseq-pe-with-replicates-controls/main
@@ -0,0 +1,5 @@
# Changelog

## [0.1] 2024-10-22

Initial release
@@ -0,0 +1,55 @@
# Quality control, mapping and peak identification for ChIP-Seq replicates with controls

This workflow analyzes batches of ChIP-Seq samples with controls and replicates, from paired-end sequencing reads to called peaks.

It uses:
- fastp for read pre-processing
- bowtie2 for mapping
- MACS2 for peak calling
- deeptools for cross-sample correlation and averaging

The workflow provides quality control at the level of sequenced reads, mapping results and called peaks, and visualizes the correlation between samples.

## Input datasets

- Sequencing data: this must be provided as a single list:paired collection, i.e. a list of paired fastq datasets with one element per sample.
- Sample sheet: this is expected to be a 4-column tabular dataset that describes samples, their association with each other and with conditions and replicates.

The first column of the file must list all samples with their names matching the element names in the Sequencing data collection. Samples can be listed in any order.
The second column specifies the experimental condition that each sample represents. There is no formal restriction on this column, but values should be kept short for readable reports.
The third column specifies the replicate that each sample belongs to. There is no formal restriction on replicate identifiers, but they should be kept as short as possible. At least two replicates are required per condition, but different conditions can have different numbers of replicates.
The fourth column must provide the name of the sample that serves as the control for the sample described on each line. Different samples can be associated with the same control sample.
Control samples must also be listed on their own lines just like regular samples, but must use . or - as the value of the fourth column. The value of the third column (replicate ID) is ignored for control sample lines so may also be set to . or -.

Here's an example sample sheet:

SRR5680995 input - -
SRR5680996 H3K4me3 rep1 SRR5680995
SRR5680997 H3K27me3 rep1 SRR5680995
SRR5681007 H3K27me3 rep2 SRR5681005
SRR5681006 H3K4me3 rep2 SRR5681005
SRR5680998 CTCF rep1 SRR5680995
SRR5681008 CTCF rep2 SRR5681005
SRR5681005 input - -

This declares an experimental design with three conditions - H3K4me3, H3K27me3 and CTCF - with two replicates per condition and one input control per replicate. The control sample SRR5680995 is declared as the shared control for all samples from replicate rep1, SRR5681005 as the control for all samples from replicate rep2.
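
The rules above can be checked before launching the workflow. The following is a minimal sketch, not part of the workflow itself: the function names `parse_sample_sheet` and `validate_sample_sheet` are hypothetical, the sheet is assumed to be tab-separated without a header line, and control lines are assumed not to count towards the two-replicates-per-condition rule.

```python
import csv
import io
from collections import Counter, defaultdict

# Values the README allows in column 4 to mark a control sample
CONTROL_MARKERS = {".", "-"}


def parse_sample_sheet(text):
    """Parse a 4-column, tab-separated sample sheet into a list of dicts."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for lineno, fields in enumerate(reader, start=1):
        if not fields:
            continue  # skip empty lines
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 columns, got {len(fields)}")
        sample, condition, replicate, control = (f.strip() for f in fields)
        rows.append({"sample": sample, "condition": condition,
                     "replicate": replicate, "control": control})
    return rows


def validate_sample_sheet(rows, collection_names=None):
    """Return a list of problems; an empty list means no rule was violated."""
    problems = []
    counts = Counter(r["sample"] for r in rows)
    for name, n in counts.items():
        if n > 1:
            problems.append(f"sample {name} is listed {n} times")

    # Control samples are the lines with '.' or '-' in the fourth column
    controls = {r["sample"] for r in rows if r["control"] in CONTROL_MARKERS}
    replicates = defaultdict(set)
    for r in rows:
        if r["control"] in CONTROL_MARKERS:
            continue  # assumption: controls are exempt from the replicate rule
        if r["control"] not in controls:
            problems.append(f"{r['sample']}: control {r['control']} has no control line")
        replicates[r["condition"]].add(r["replicate"])

    for condition, reps in replicates.items():
        if len(reps) < 2:
            problems.append(f"condition {condition} has fewer than two replicates")

    # Optionally compare against the element names of the Sequencing data collection
    if collection_names is not None:
        unknown = sorted(set(counts) - set(collection_names))
        if unknown:
            problems.append(f"samples not found in the collection: {', '.join(unknown)}")
    return problems


# The example sheet from this README, rebuilt with tab separators
EXAMPLE = "\n".join("\t".join(row) for row in [
    ("SRR5680995", "input", "-", "-"),
    ("SRR5680996", "H3K4me3", "rep1", "SRR5680995"),
    ("SRR5680997", "H3K27me3", "rep1", "SRR5680995"),
    ("SRR5681007", "H3K27me3", "rep2", "SRR5681005"),
    ("SRR5681006", "H3K4me3", "rep2", "SRR5681005"),
    ("SRR5680998", "CTCF", "rep1", "SRR5680995"),
    ("SRR5681008", "CTCF", "rep2", "SRR5681005"),
    ("SRR5681005", "input", "-", "-"),
])

problems = validate_sample_sheet(parse_sample_sheet(EXAMPLE))
print(problems)  # [] for the example sheet
```

Passing the element names of the Sequencing data collection as `collection_names` additionally flags sample names that have no matching dataset.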

## Input parameters

- Reference genome: set this to the reference genome of your organism of interest; used at the read mapping step
- Sequencing adapter - forward (optional)
- Sequencing adapter - reverse (optional)
- Effective genome size: this is used by MACS2 and may be entered manually (suggested values are provided for commonly used genomes).
- Average size of sequenced fragments: used for deeptools-based QC

## Outputs

- MultiQC analysis reports
- Sample fingerprints
- Between-samples correlation plot
- Clustered heatmap of peaks across samples
- Peak regions called by MACS2
- Positions of summits of MACS2-called peaks
- Peaks per replicate
- Peaks averaged across replicates

@@ -0,0 +1,90 @@
- doc: Test outline for ChIPseq-PE-with-replicates-controls workflow
  job:
    Sample sheet:
      class: File
      path: test-data/test_sample_sheet.tsv
      filetype: tabular
    Sequencing data:
      class: Collection
      collection_type: list:paired
      elements:
        - class: Collection
          type: paired
          identifier: SRR5204807
          elements:
            - identifier: forward
              class: File
              location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204807_Spt5-ChIP_IP1_SacCer_ChIP-Seq_ss100k_R1.fastq.gz
              filetype: fastqsanger.gz
            - identifier: reverse
              class: File
              location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204807_Spt5-ChIP_IP1_SacCer_ChIP-Seq_ss100k_R2.fastq.gz
              filetype: fastqsanger.gz
        - class: Collection
          type: paired
          identifier: SRR5204808
          elements:
            - identifier: forward
              class: File
              location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204808_Spt5-ChIP_IP2_SacCer_ChIP-Seq_ss100k_R1.fastq.gz
              filetype: fastqsanger.gz
            - identifier: reverse
              class: File
              location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204808_Spt5-ChIP_IP2_SacCer_ChIP-Seq_ss100k_R2.fastq.gz
              filetype: fastqsanger.gz
        - class: Collection
          type: paired
          identifier: SRR5204809
          elements:
            - identifier: forward
              class: File
              location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R1.fastq.gz
              filetype: fastqsanger.gz
            - identifier: reverse
              class: File
              location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R2.fastq.gz
              filetype: fastqsanger.gz
        - class: Collection
          type: paired
          identifier: SRR5204810
          elements:
            - identifier: forward
              class: File
              location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204810_Spt5-ChIP_Input2_SacCer_ChIP-Seq_ss100k_R1.fastq.gz
              filetype: fastqsanger.gz
            - identifier: reverse
              class: File
              location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204810_Spt5-ChIP_Input2_SacCer_ChIP-Seq_ss100k_R2.fastq.gz
              filetype: fastqsanger.gz
    Reference genome: "sacCer3"
    Sequencing adapter - forward: null
    Sequencing adapter - reverse: null
    Effective genome size: 12000000
    Average size of sequenced fragments: 200
  outputs:
    multiqc_stats:
      asserts:
        has_n_lines:
          n: 5
        has_text_matching:
          expression: "SRR5204807_Spt5_rep1\t163.0\t0.0\t0.0\t844.+"
    macs2_report:
      element_tests:
        rep1:
          elements:
            SRR5204807_Spt5_rep1:
              asserts:
                - that: "has_text"
                  text: "# name = SRR5204807_Spt5_rep1"
                - that: "has_text"
                  text: "# fragment size is determined as 163 bps"
                - that: "has_text"
                  text: "# fragments after filtering in treatment: 86394"
    mapping_stats:
      element_tests:
        SRR5204807_Spt5_rep1:
          asserts:
            - that: "has_text"
              text: "3067 (3.14%) aligned concordantly 0 times"
            - that: "has_text"
              text: "80795 (82.60%) aligned concordantly exactly 1 time"
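
A note on the `has_text_matching` assertion used for `multiqc_stats`: the expression is a regular expression, so the unescaped dots in `163.0` and `0.0` match any character, not only a literal dot. The sketch below illustrates this in Python, assuming the assertion behaves like an unanchored `re.search`. The helper name `has_text_matching`, the sample row and the stricter pattern are made up for illustration and are not taken from the workflow's real output.

```python
import re

# Pattern copied from the multiqc_stats assertion. Inside a YAML
# double-quoted string, \t is a tab character, as it is in Python.
EXPRESSION = "SRR5204807_Spt5_rep1\t163.0\t0.0\t0.0\t844.+"

# Hypothetical rows, invented for illustration only
good_row = "SRR5204807_Spt5_rep1\t163.0\t0.0\t0.0\t844.12\n"
odd_row = "SRR5204807_Spt5_rep1\t163x0\t0.0\t0.0\t844.12\n"


def has_text_matching(output, expression):
    """Assumed behaviour: true if the regex matches anywhere in the output."""
    return re.search(expression, output) is not None


# A stricter variant: escape the fixed prefix so dots only match dots,
# and keep the open-ended '.+' for the part that may vary.
STRICT = re.escape("SRR5204807_Spt5_rep1\t163.0\t0.0\t0.0\t844") + ".+"

print(has_text_matching(good_row, EXPRESSION))  # True
print(has_text_matching(odd_row, EXPRESSION))   # True: '.' matched the 'x'
print(has_text_matching(odd_row, STRICT))       # False
```

In practice the looser pattern is unlikely to cause a false pass here, since the sample name and surrounding tabs already pin the row down, but escaping the dots makes the intent explicit.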