scATAC-pro Manual

Contents

[scATAC-pro Manual](#scatac-pro-manual)
- How to set up user configuration file
  - global setting
  - trimming
  - mapping
  - call_peak
  - call_cell
  - clustering
  - split_bam
  - runDA
  - runGO
  - footprint
  - runCicero
  - integrate
- More details about inputs and outputs for all analysis modules

How to set up user configuration file

The user configuration file configure_user.txt is a required input (specified with the -c flag) for running any module; it assigns the module's parameters and options, if there are any. The sections below show, module by module, how to specify the relevant parameters and options in the configuration file. Modules that take no parameters or options are not listed. The default settings are fine for the vast majority of modules, but you need to change the genome name, the mapping index path and the genome annotation files, which vary between data sets.
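As a rough illustration of how the parameters in this file map to name/value pairs, the Python sketch below reads lines of the form `NAME = VALUE ## comment`. That line syntax, the `##` comment marker and the helper name `read_config` are assumptions made for illustration; check them against the configure_user.txt template that ships with your installation.

```python
def read_config(text):
    """Parse 'NAME = VALUE ## comment' lines into a dict (assumed syntax)."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("##", 1)[0].strip()  # drop a trailing comment
        if not line or "=" not in line:
            continue  # skip blank lines and lines without an assignment
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


# Parameter names and values are taken from the tables in this manual.
example = """
OUTPUT_PREFIX = pbmc10k   ## prefix for output files
GENOME_NAME = hg38
MAPPING_METHOD = bwa
BWA_OPTS = -t 16
"""

config = read_config(example)
print(config["GENOME_NAME"])
```

Values are kept as strings here, so options such as BWA_OPTS can be passed on unchanged.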

global setting

Parameter	Value	Note
OUTPUT_PREFIX	pbmc10k	The name used as prefix for outputs, usually sample/dataset name
IsSingleEnd	FALSE	Set it to TRUE if the reads are single-ended
BLACKLIST	annotation/hg38_blacklist.bed	Genomic regions as black list used to remove artificial peaks/bins
PROMOTERS	annotation/hg38_promoter.bed	File for promoters to calculate the QC
ENHANCERS	annotation/hg38_enhancer.bed	File for enhancers to calculate the QC
TSS	annotation/hg38_tss.bed	File for transcript start sites to calculate the QC and annotate peaks/bins
GENOME_NAME	hg38	Used for TF motif enrichment and footprinting analysis
plotEPS	TRUE	Plot figures in .eps format or not when generating summary report

trimming

Parameter	Value	Note
TRIM_METHOD	trim_galore	Adapter trimming method, three options: trim_galore/Trimmomatic/none
ADAPTER_SEQ	NA	Set it to the path of the adapter .fa file if TRIM_METHOD is set to Trimmomatic, otherwise ignore it
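
The dependency between TRIM_METHOD and ADAPTER_SEQ can be checked before a run. The sketch below is one way to do that, assuming the configuration has already been loaded into a dict of strings; the function name `check_trimming` and the fallback to trim_galore when TRIM_METHOD is absent are hypothetical choices, not part of scATAC-pro.

```python
VALID_TRIM_METHODS = {"trim_galore", "Trimmomatic", "none"}


def check_trimming(config):
    """Return (method, adapter_path); adapter_path is None unless Trimmomatic is used."""
    method = config.get("TRIM_METHOD", "trim_galore")  # assumed fallback
    if method not in VALID_TRIM_METHODS:
        raise ValueError(f"unknown TRIM_METHOD: {method!r}")
    adapter = config.get("ADAPTER_SEQ", "NA")
    if method != "Trimmomatic":
        return method, None  # ADAPTER_SEQ is ignored for other methods
    if adapter in ("", "NA"):
        raise ValueError("ADAPTER_SEQ must be the path of an adapter .fa file for Trimmomatic")
    return method, adapter


print(check_trimming({"TRIM_METHOD": "trim_galore", "ADAPTER_SEQ": "NA"}))
```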

mapping

Parameter	Value	Note
MAPPING_METHOD	bwa	Read alignment method, three options: bwa/bowtie/bowtie2
BWA_OPTS	-t 16	Additional options for bwa, ignore it if MAPPING_METHOD is not set to bwa
BWA_INDEX	PATH_TO_INDEX	Index file for bwa of the used genome (the path of the .fa file of the genome), ignore it if MAPPING_METHOD is not set to bwa
BOWTIE_OPTS	--quiet -p 16	Additional options for bowtie, ignore it if MAPPING_METHOD is not set to bowtie
BOWTIE_INDEX	PATH_TO_INDEX/GENOME_PREFIX	Index file for bowtie of the used genome (the directory of the .ebwt file of the genome), ignore it if MAPPING_METHOD is not set to bowtie
BOWTIE2_OPTS	--quiet -p 16	Additional options for bowtie2, ignore it if MAPPING_METHOD is not set to bowtie2
BOWTIE2_INDEX	PATH_TO_INDEX	Index file for bowtie2 of the used genome (the directory of the .bt2 file of the genome), ignore it if MAPPING_METHOD is not set to bowtie2
MAPQ	30	Filter out reads with MAPQ less than 30 for downstream modules
CELL_MAPQ_QC	TRUE	Report mapping QC for cell barcodes (requires running module get_bam4Cells)
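
Only the options and index of the aligner named by MAPPING_METHOD are used; the others are ignored. The sketch below makes that selection explicit. The parameter names come from the table above, while the function `active_mapping_settings` and the dict-based configuration are illustrative assumptions rather than scATAC-pro's own code.

```python
# Maps each MAPPING_METHOD value to its (options, index) parameter names.
ALIGNER_KEYS = {
    "bwa": ("BWA_OPTS", "BWA_INDEX"),
    "bowtie": ("BOWTIE_OPTS", "BOWTIE_INDEX"),
    "bowtie2": ("BOWTIE2_OPTS", "BOWTIE2_INDEX"),
}


def active_mapping_settings(config):
    """Return the options and index for the aligner named by MAPPING_METHOD."""
    method = config["MAPPING_METHOD"]
    if method not in ALIGNER_KEYS:
        raise ValueError(f"unsupported MAPPING_METHOD: {method!r}")
    opts_key, index_key = ALIGNER_KEYS[method]
    return {
        "method": method,
        "opts": config.get(opts_key, ""),
        "index": config[index_key],
        "min_mapq": int(config.get("MAPQ", 30)),
    }


config = {
    "MAPPING_METHOD": "bowtie2",
    "BWA_OPTS": "-t 16",
    "BWA_INDEX": "PATH_TO_INDEX",
    "BOWTIE2_OPTS": "--quiet -p 16",
    "BOWTIE2_INDEX": "PATH_TO_INDEX",
    "MAPQ": "30",
}
settings = active_mapping_settings(config)
print(settings)
```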

call_peak

Parameter	Value	Note
PEAK_CALLER	MACS2	Peak calling method, four options: MACS2/BIN/COMBINED/GEM
MACS_OPTS	-q 0.01 -g hs --nomodel --extsize 200 --shift -100	Additional options to call macs2; no need to specify -t -n -f
BIN_RESL	5000	Bin resolution in base pairs, used if PEAK_CALLER is set to BIN or COMBINED
CHROM_SIZE_FILE	annotations/chrom_hg38.sizes	The file of the chromosome size
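
To get a feel for how many features a given BIN_RESL produces, the sketch below counts fixed-width bins per chromosome. It assumes that CHROM_SIZE_FILE uses the common two-column layout (chromosome name, length) and that bins tile each chromosome without overlap; both are assumptions about the file and the binning scheme, and `count_bins` is a hypothetical helper.

```python
def count_bins(chrom_sizes_text, bin_resl):
    """Number of non-overlapping bins per chromosome (the last bin may be shorter)."""
    bins = {}
    for line in chrom_sizes_text.splitlines():
        if not line.strip():
            continue
        chrom, size = line.split()[:2]
        # Integer ceiling division avoids floating-point rounding.
        bins[chrom] = (int(size) + bin_resl - 1) // bin_resl
    return bins


# Two illustrative entries in chrom.sizes layout.
sizes = "chr1\t248956422\nchrM\t16569\n"
print(count_bins(sizes, 5000))
```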

call_cell

Parameter	Value	Note
CELL_CALLER	FILTER	Cell calling method, three options: FILTER/EmptyDrop/cellranger
EmptyDrop_FDR	0.001	FDR cutoff for the EmptyDrop algorithm, ignore it if CELL_CALLER is not set to EmptyDrop
FILTER_BC_CUTOFF	--min_uniq_frags 5000 --max_uniq_frags 50000 --min_frac_peak 0.5 --min_frac_tss 0.0 --min_frac_promoter --min_frac_enhancer --max_frac_mito 0.1	The per-barcode QC cutoffs used to define cells if CELL_CALLER is set to FILTER: the minimum and maximum number of unique fragments, the minimum fraction of fragments in peaks, in TSSs, in promoters and in enhancers, and the maximum fraction of fragments in the mitochondrial genome; ignore it otherwise
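
The FILTER_BC_CUTOFF string is a list of `--min_...`/`--max_...` thresholds. The sketch below shows one way such a string could be parsed and applied to the QC metrics of a single barcode. It is not scATAC-pro's implementation: the function names, the metric keys (derived from the flag names), the inclusive comparisons, and the choice to treat a flag without a value as "no threshold" are all assumptions.

```python
import shlex


def parse_cutoffs(cutoff_string):
    """Turn '--flag value' pairs into a dict; a flag without a value maps to None."""
    tokens = shlex.split(cutoff_string)
    cutoffs = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i].lstrip("-")
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
        cutoffs[flag] = float(tokens[i + 1]) if has_value else None
        i += 2 if has_value else 1
    return cutoffs


def is_cell(qc, cutoffs):
    """Check one barcode's QC metrics against every cutoff that has a value."""
    for name, limit in cutoffs.items():
        if limit is None:
            continue  # no threshold given for this metric
        bound, metric = name.split("_", 1)  # e.g. 'min', 'uniq_frags'
        if bound == "min" and qc[metric] < limit:
            return False
        if bound == "max" and qc[metric] > limit:
            return False
    return True


cutoffs = parse_cutoffs(
    "--min_uniq_frags 5000 --max_uniq_frags 50000 --min_frac_peak 0.5 "
    "--min_frac_tss 0.0 --min_frac_promoter --min_frac_enhancer --max_frac_mito 0.1"
)
# Hypothetical QC values for one barcode.
barcode_qc = {"uniq_frags": 12000, "frac_peak": 0.62, "frac_tss": 0.3, "frac_mito": 0.02}
print(is_cell(barcode_qc, cutoffs))
```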

clustering

Parameter	Value	Note
norm_by	tf-idf	Normalization method, three options: tf-idf/log/NA
Top_Variable_Features	10000	Number or fraction of variable features used for seurat. A value between 0 and 1 is interpreted as a fraction of the total number of features
REDUCTION	pca	Dimension reduction method: pca/lda; UMAP and TSNE are then calculated automatically
nREDUCTION	30	The number of reduced dimensions, an integer
CLUSTERING_METHOD	seurat	Clustering method, one of these options: seurat/cisTopic/kmeans/LSI/SCRAT/scABC/chromVAR
K_CLUSTERS	An integer or NULL	The expected number of cell clusters. If set to NULL, the resolution parameter of the Louvain algorithm is set to 0.2
prepCello	TRUE	Generate object for VisCello (for visualization)
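
Top_Variable_Features accepts either a count or a fraction. The sketch below shows how such a value could be resolved to a number of features. Treating every value up to and including 1 as a fraction, rounding to the nearest integer, and capping a count at the total are assumptions made here; `n_variable_features` is a hypothetical helper.

```python
def n_variable_features(top_variable_features, n_total):
    """Resolve Top_Variable_Features to an absolute number of features."""
    value = float(top_variable_features)
    if value <= 0:
        raise ValueError("Top_Variable_Features must be positive")
    if value <= 1:
        # Interpreted as a fraction of all features (assumption: 1 means all).
        return max(1, round(value * n_total))
    return min(int(value), n_total)  # an absolute count, capped at the total


print(n_variable_features(10000, 150000))
print(n_variable_features(0.25, 150000))
```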

split_bam

Parameter	Value	Note
SPLIT_BAM2CLUSTER	TRUE	Extract bam files for each cell cluster or not; this module is necessary if you want to do footprinting analysis

runDA

Parameter	Value	Note
RUN_DA	TRUE	Run differential accessibility analysis or not
group1	0:1	Either the name(s) of one or multiple cell clusters, separated by colon, or 'one'. If specified as 'one', will perform all one-vs-rest comparisons
group2	rest	Either the name(s) of one or multiple cell clusters, separated by colon, or 'rest'
test_use	wilcox	Statistical test used for differential accessibility analysis, one of negbinom/LR/wilcox/t/DESeq2
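
The group1/group2 settings can be read as a list of comparisons. The sketch below expands them under that reading: colon-separated cluster names, 'one' for all one-vs-rest comparisons, and 'rest' for every cluster not in group1. The function `da_comparisons` and the explicit list of cluster names are illustrative assumptions.

```python
def da_comparisons(group1, group2, all_clusters):
    """Expand group1/group2 into a list of (group1_clusters, group2_clusters) pairs."""
    clusters = [str(c) for c in all_clusters]
    if group1 == "one":
        # One-vs-rest comparison for every cluster.
        return [([c], [o for o in clusters if o != c]) for c in clusters]
    first = group1.split(":")
    if group2 == "rest":
        second = [c for c in clusters if c not in first]
    else:
        second = group2.split(":")
    return [(first, second)]


print(da_comparisons("0:1", "rest", range(4)))
```

With group1 set to 'one', the same call returns one pair per cluster.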

runGO

Parameter	Value	Note
RUN_GO	TRUE	Run GO analysis or not after running DA
GO_TYPE	BP	Type of GO terms, one of three options: BP/CC/kegg

footprint

Parameter	Value	Note
DO_FOOTPRINT	FALSE	Perform TF footprinting analysis or not
group1_fp	0	Either the name of a cell cluster or 'one'. If specified as 'one', will conduct all one-vs-rest comparisons
group2_fp	rest	Either the name of a cell cluster or 'rest'

runCicero

Parameter	Value	Note
RUN_Cicero	TRUE	Predict cis-chromatin interactions or not
Cicero_Plot_Region	chr5:140610000-140640000	Plot cis-chromatin interactions within Cicero_Plot_Region in the summary report
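
Cicero_Plot_Region follows the usual `chrom:start-end` notation. A small check like the one below can catch a malformed region before a run. The helper `parse_region` is hypothetical, and it assumes plain digits without thousands separators.

```python
import re

REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end) and validate the order."""
    match = REGION_RE.match(region.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start >= end:
        raise ValueError("region start must be smaller than region end")
    return match["chrom"], start, end


chrom, start, end = parse_region("chr5:140610000-140640000")
print(chrom, end - start)
```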

integrate

Parameter	Value	Note
Integrate_By	seurat	Integration method, one of VFACS/seurat/pool/harmony
prepCello4Integration	TRUE	Prepare VisCello object for integrated object or not

More details about inputs and outputs for all analysis modules

Note that this is a long table; scroll right to read it.

Module	Input	Output
demplx_fastq	Fastq files for both reads and index, separated by comma like: PE1_fastq,PE2_fastq,index1_fastq,index2_fastq,index3_fastq.... Multiple index files are supported and fastq files can be in compressed format (e.g. .gz file)	Demultiplexed fastq1 and fastq2 files with index information embedded in the read name as: @index3_index2_index1:original_read_name, saved in output/demplxed_fastq/
trimming	Demultiplexed fastq1 and fastq2 files.	Trimmed demultiplexed fastq1 and fastq2 files, saved in output/trimmed_fastq/. This module can be skipped if TRIM_METHOD is set to 'none' when running module process
mapping	The demultiplexed and trimmed paired-end fastq files, separated by comma: pe1.fastq,pe2.fastq	Position sorted bam file and position sorted MAPQ30 bam file, saved in output/mapping_result/; plain text files of mapping QC metrics and the fragments.txt file, saved in output/summary/
call_peak	The position sorted MAPQ30 bam file outputted from the mapping module. Note that the annotation of blacklist regions and CHROM_SIZE_FILE are used to filter out potential artificial peaks. It's not necessary to use a bam file in scATAC-pro format to call peaks, because the peaks are called on the aggregated bam file.	The peaks/features file in plain text format, saved as output/peaks/PEAK_CALLER/OUTPUT_PREFIX_features_BlacklistRemoved.bed.
get_mtx	The fragments.txt file outputted from the mapping module, and the peaks/features file outputted from the call_peak module, separated by a comma	The raw peak-by-cell sparse matrix along with the corresponding barcodes and features files, saved in output/raw_matrix/PEAK_CALLER/ as matrix.mtx, barcodes.txt and features.txt.
qc_per_barcode	The fragments.txt file and the peaks/features file, separated by comma. This module can only be performed after running module mapping and module call_peak	QC metrics for each barcode, saved as output/summary/qc_per_barcode.PEAK_CALLER.summary in plain text format
aggr_signal	Position sorted MAPQ30 bam file outputted from module mapping	Aggregated data in .bw and .bedgraph files, which can be uploaded to a genome browser for visualization, saved in output/signal/. A TSS-by-window count matrix (in .mtx.gz format, +/- 1000 bp of each TSS) is also created in output/signal, which can be used to plot the TSS enrichment profile when generating the summary report
call_cell	The raw peak-by-cell sparse matrix file outputted from the get_mtx module. This module can only be performed after running module get_mtx and module qc_per_barcode	The filtered peak-by-cell sparse matrix and the corresponding barcodes and features files, saved in output/filtered_matrix/PEAK_CALLER/CELL_CALLER/ as matrix.mtx, barcodes.txt and features.txt, respectively
get_bam4Cells	Bam file for aggregated data outputted from module mapping, and a barcodes.txt file outputted from module call_cell, separated by comma	Bam file and mapping QC (optional) for cell barcodes saved in output/mapping_result/cell_barcodes.MAPQ30 and output/summary/cell_barcodes.MappingStats, respectively
clustering	The filtered peak-by-cell sparse matrix outputted from the call_cell module	A seurat object with metadata 'active_clusters' for the cell clustering labels, saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/seurat_obj.rds; the cell barcodes by cluster table, saved in output/downstream_analysis/cell_cluster_table.txt; the UMAP plot colored by clustering label; and, if parameter prepCello is set to TRUE, a VisCello input object saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/VisCello_Obj. If CLUSTERING_METHOD is set to 'chromVAR', a chromVAR object is saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/chromVAR_obj.rds as well
split_bam	The cell barcodes by cluster table file (output/downstream_analysis/PEAK_CALLER/CELL_CALLER/cell_cluster_table.txt), outputted from clustering module	Bam (saved in output/downstream/PEAK_CALLER/CELL_CALLER/data_by_cluster), .bw, and .bedgraph files (saved in output/signal/) for aggregated signal for cells in each cluster
runDA	Two groups, given either as '0:1,2', in which group1 consists of clusters 0 and 1 and group2 consists of cluster 2, or as '0,rest' or 'one,rest'.	The differentially accessible features with statistical significance information, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/differential_accessible_features_group1_vs_group2.txt.
motif_analysis	The filtered peak-by-cell sparse matrix file outputted from the call_cell module.	A chromVAR object, a table and a heatmap of differentially enriched TFs in each cluster, saved in output/filtered_matrix/PEAK_CALLER/CELL_CALLER/.
footprint	Two groups, given as '1,2', '1,rest' or 'one,rest'	A table and a heatmap of differentially bound TFs for each group, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/
runCicero	seurat_obj.rds file outputted from clustering module, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/seurat_obj.rds	Gene activity object in .rds format, saved as output/downstream_analysis/PEAK_CALLER/CELL_CALLER/cicero_gene_activity.rds, and predicted cis-chromatin interactions in plain text format, saved in output/downstream_analysis/PEAK_CALLER/CELL_CALLER/cicero_inteactions.txt
integrate	Peak files called from different data sets, separated by comma	A seurat object for the integrated data and a UMAP plot colored by clustering labels. If Integrate_By is set to 'VFACS', variable features across cell clusters are selected, and dimension reduction and clustering are then redone on the pooled data. If Integrate_By is set to 'seurat', an 'integrated' assay is created in the seurat object and cell clustering by the Louvain algorithm is performed on the PCs of the 'integrated' assay. If Integrate_By is set to 'harmony', the clustering is performed on the reduced dimension 'harmony'. If Integrate_By is set to 'pool', the data are simply pooled and the confounding factors of sequencing depth per cell and dataset ID are regressed out
report	Path to the directory of summary QC files: output/summary as default	A summary report in html format, saved in output/summary/, along with .eps figures for each panel saved in output/summary/Figures/
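
Many output locations in the table are built from OUTPUT_PREFIX, PEAK_CALLER and CELL_CALLER. The sketch below assembles a few of them exactly as they are written in the table, which can help when locating results from a script. The helper `output_paths`, its dictionary keys and the `root` argument are hypothetical names introduced here.

```python
from pathlib import PurePosixPath


def output_paths(output_prefix, peak_caller, cell_caller, root="output"):
    """Build some of the output paths listed in the module table."""
    root = PurePosixPath(root)
    downstream = root / "downstream_analysis" / peak_caller / cell_caller
    return {
        "features": root / "peaks" / peak_caller
        / f"{output_prefix}_features_BlacklistRemoved.bed",
        "raw_matrix": root / "raw_matrix" / peak_caller / "matrix.mtx",
        "filtered_matrix": root / "filtered_matrix" / peak_caller / cell_caller / "matrix.mtx",
        "qc_per_barcode": root / "summary" / f"qc_per_barcode.{peak_caller}.summary",
        "seurat_obj": downstream / "seurat_obj.rds",
    }


paths = output_paths("pbmc10k", "MACS2", "FILTER")
for name, path in paths.items():
    print(f"{name}: {path}")
```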
