git clone https://github.com/tobiasrausch/ATACseq.git
cd ATACseq
make all
If one of the above commands fail your operating system probably lacks some build essentials. These are usually pre-installed but if you lack them you need to install these. For instance, for Ubuntu this would require:
apt-get install build-essential g++ git wget unzip
To annotate motifs and estimate TSS enrichments some simple scripts are included in this repository to download these databases.
cd bed/ && Rscript promoter.R && cd ..
cd motif/ && ./downloadMotifs.sh && cd ..
./src/atac.sh <hg38|hg19|mm10> <read1.fq.gz> <read2.fq.gz> <genome.fa> <output prefix>
The pipeline produces at various steps JSON QC files (*.json.gz
). You can upload and interactively browse these files at https://gear-genomics.embl.de/alfred/. In addition, the pipeline produces a succinct QC file for each sample. If you have multiple output folders (one for each ATAC-Seq sample) you can simply concatenate the QC metrics of each sample.
head -n 1 ./*/*.key.metrics | grep "TssEnrichment" | uniq > summary.tsv
cat ./*/*.key.metrics | grep -v "TssEnrichment" >> summary.tsv
To plot the distribution for all QC parameters.
Rscript R/metrics.R summary.tsv
The ATAC-Seq pipeline produces various output files.
- Bowtie BAM alignment files filtered for duplicates and mitochondrial reads.
- Quality control output files from alfred, samtools, FastQC and cutadapt adapter filter metrics.
- Macs peak calling files and IDR filtered peak lists.
- Succinct browser tracks in bedGraph format and IGV's tdf format.
- Footprint track of nucleosome positions and/or transcription factor bound DNA.
- Homer motif finding results.
Merge peaks across samples and create a raw count matrix.
ls ./Sample1/Sample1.peaks ./Sample2/Sample2.peaks ./SampleN/SampleN.peaks > peaks.lst
ls ./Sample1/Sample1.bam ./Sample2/Sample2.bam ./SampleN/SampleN.bam > bams.lst
./src/count.sh hg19 peaks.lst bams.lst <output prefix>
To call differential peaks on a count matrix for TSS peaks, called counts.tss.gz, using DESeq2 we first need to create a file with sample level information (sample.info). For instance, if you have 2 replicates per condition:
echo -e "name\tcondition" > sample.info
zcat counts.tss.gz | head -n 1 | cut -f 5- | tr '\t' '\n' | sed 's/.final$//' | awk '{print $0"\t"int((NR-1)/2);}' >> sample.info
Rscript R/dpeaks.R counts.tss.gz sample.info
Peaks can of course be intersected with enhancer or conserved element tracks, i.e.:
cd tracks/ && downloadTracks.sh
bedtools intersect -a ./Sample2/Sample2.peaks -b tracks/conserved.bed
There is a basic Rscript available for plotting peak densities.
Rscript R/karyoplot.R input.peaks
Tobias Rausch, Markus Hsi-Yang Fritz, Jan O Korbel, Vladimir Benes.
Alfred: Interactive multi-sample BAM alignment statistics, feature counting and feature annotation for long- and short-read sequencing.
Bioinformatics. 2018 Dec 6.
B Erarslan, JB Kunz, T Rausch, P Richter-Pechanska et al.
Chromatin accessibility landscape of pediatric T‐lymphoblastic leukemia and human T‐cell precursors
EMBO Mol Med (2020)
This ATAC-Seq pipeline is distributed under the BSD 3-Clause license.