---
title: "Quality control and trimming"
---
## Introduction
The pipeline starts with a quality control step. This step contains four main rules:
+ Quality control of the raw `FastQ` files using the tool `FastQC`
+ Removing potential remaining adapters using `Cutadapt`
+ Quality control of the newly trimmed `FastQ` files using `FastQC` again
+ Combining several `FastQC` outputs into one file using `MultiQC`
## Quality control before trimming
In the snakemake pipeline, the names and locations of the FastQ files are retrieved from a file called `samples.tsv`, which the user prepares before running the pipeline on real data. For more information on the `samples.tsv` and `config.yaml` files that need to be created before running the pipeline on any real data, check the [official aMeta GitHub](https://github.com/NBISweden/aMeta).
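For reference, judging from the `samples.loc[wildcards.sample].fastq` lookup used in the rules below, a `samples.tsv` might look roughly like this hypothetical sketch; see the aMeta GitHub for the authoritative format:
```
sample	fastq
sample1	data/sample1.fastq.gz
sample2	data/sample2.fastq.gz
```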
In our case, we are going to run bash scripts submitted directly to the SLURM system, and each script implements a `for` loop over the FastQ files in the `data` folder, so there is no need to create a `samples.tsv` file for this workshop.
Here is the simplified snakemake rule (without the parameters explained in the "Understand snakemake" section):
```
rule FastQC_BeforeTrimming:
    output:
        html="results/FASTQC_BEFORE_TRIMMING/{sample}_fastqc.html",
        zip="results/FASTQC_BEFORE_TRIMMING/{sample}_fastqc.zip",
    input:
        fastq=lambda wildcards: samples.loc[wildcards.sample].fastq,
    shell:
        "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_BEFORE_TRIMMING &> {log}"
```
Here is an equivalent shell command for this rule:
```bash
for sample in $(ls data/*.fastq.gz); do
    sample_name=$(basename ${sample} .fastq.gz)
    fastqc ${sample} --threads 4 --nogroup --outdir results/FASTQC_BEFORE_TRIMMING &> logs/FASTQC_BEFORE_TRIMMING/${sample_name}.log
done
```
In summary, this rule loops through the sample files and generates statistics for each of them using FastQC, in the form of an `html` and a `zip` file. The `html` file contains the final report, which you can open to inspect the sequencing quality and check for the presence of adapters. Alternatively, you can visit the [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) website to see what an html output file looks like and how to interpret it.
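If you are working on the cluster without a graphical browser, one alternative is to print the PASS/WARN/FAIL summary that FastQC stores inside the `zip` archive. This assumes the default FastQC zip layout; `sample1` is a placeholder for one of your sample names:
```bash
# Print the per-module PASS/WARN/FAIL summary straight from the zip
# archive, without extracting it first.
unzip -p results/FASTQC_BEFORE_TRIMMING/sample1_fastqc.zip '*/summary.txt'
```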
WARNING: If you were to run the shell command above directly in the terminal, you would first need to export the PATH to the necessary modules and create the output and log folders. You can see what the full command looks like by opening the bash script with `less FastQC_BeforeTrimming.sh`.
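As a rough illustration, such a preamble might look like the sketch below; the actual tool location on TRUBA will differ, so check the script itself for the real values:
```bash
# Hypothetical preamble for running FastQC interactively.
export PATH=/path/to/FastQC:$PATH                                    # assumed install location
mkdir -p results/FASTQC_BEFORE_TRIMMING logs/FASTQC_BEFORE_TRIMMING  # output and log folders
```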
To run the different rules on TRUBA, we have prepared bash scripts. Before running them, verify that you have completed the steps from the `course setup` section of this website and that you are located in your own newly created `workshop` folder. If so, type this command:
```bash
sbatch --account=your_username FastQC_BeforeTrimming.sh
```
It is important to specify your account name whenever you submit a bash script: your jobs will get better priority in the queue than if every user submitted under the same account, and you will be able to check their progress using `squeue`.
Please remember to check the status of your job using the command below. Once your job disappears from its output, it has terminated.
```bash
squeue -u your_username
```
Once your job has finished, please check that the output files were created like this:
```bash
ls -ltrh results/FASTQC_BEFORE_TRIMMING/
```
## Adapter trimming step
After the first quality control run, the pipeline runs `Cutadapt` to remove the adapters and filter out processed reads that have become too short. This step was added because adapters that are not completely removed before running the pipeline increase the number of false positives. If you have already removed the adapters, there is no need to worry about this step: it will barely change your FastQ files.
You can specify which adapters were used in the `config/config.yaml` file. If no adapter is specified, the Illumina adapters are used by default.
Here is the snakemake rule:
```
rule Cutadapt_Adapter_Trimming:
    output:
        fastq="results/CUTADAPT_ADAPTER_TRIMMING/{sample}.trimmed.fastq.gz",
    input:
        fastq=lambda wildcards: samples.loc[wildcards.sample].fastq,
    params:
        adapters=" ".join([f"-a {x}" for x in ADAPTERS]),
    shell:
        "cutadapt {params.adapters} --minimum-length 30 -o {output.fastq} {input.fastq} &> {log}"
```
Here is a shell command for this rule:
```bash
for sample in $(ls data/*.fastq.gz); do
    sample_name=$(basename ${sample} .fastq.gz)
    cutadapt -a AGATCGGAAGAG --minimum-length 30 -o results/CUTADAPT_ADAPTER_TRIMMING/${sample_name}.trimmed.fastq.gz ${sample} &> logs/CUTADAPT_ADAPTER_TRIMMING/${sample_name}.log
done
```
In summary, it loops through the samples and uses Cutadapt to remove the adapters. From now on, we will work with the trimmed FastQ files it has created.
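Note that the `adapters` parameter in the rule above joins every adapter listed in the config into repeated `-a` flags. If your libraries contain more than one adapter, the equivalent manual command simply repeats `-a`; both sequences below are placeholders for your actual adapters:
```bash
# Cutadapt accepts repeated -a flags, one per adapter sequence.
cutadapt -a AGATCGGAAGAG -a CTGTCTCTTATACACATCT \
    --minimum-length 30 \
    -o results/CUTADAPT_ADAPTER_TRIMMING/sample1.trimmed.fastq.gz \
    data/sample1.fastq.gz
```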
Please run the script provided for this part like this:
```bash
sbatch --account=your_username Cutadapt_Adapter_Trimming.sh
```
After your job has finished, please check the output files:
```bash
ls -ltrh results/CUTADAPT_ADAPTER_TRIMMING/
```
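As an optional sanity check, you can look at how much Cutadapt trimmed. The first command below assumes the standard wording of the Cutadapt report; the second compares read counts for a single sample (`sample1` is a placeholder), which may drop because reads shorter than 30 bp after trimming are discarded:
```bash
# Print the adapter-match line from every Cutadapt log
# (exact phrasing may vary between Cutadapt versions).
grep -H "Reads with adapters" logs/CUTADAPT_ADAPTER_TRIMMING/*.log

# A FastQ record spans four lines, so reads = lines / 4.
echo "raw:     $(($(zcat data/sample1.fastq.gz | wc -l) / 4)) reads"
echo "trimmed: $(($(zcat results/CUTADAPT_ADAPTER_TRIMMING/sample1.trimmed.fastq.gz | wc -l) / 4)) reads"
```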
## Quality control after trimming
Now that we have removed any remaining adapters, we will run a new quality control analysis. Normally, no more adapters should be visible in the resulting html file:
```
rule FastQC_AfterTrimming:
    output:
        html="results/FASTQC_AFTER_TRIMMING/{sample}.trimmed_fastqc.html",
        zip="results/FASTQC_AFTER_TRIMMING/{sample}.trimmed_fastqc.zip",
    input:
        fastq="results/CUTADAPT_ADAPTER_TRIMMING/{sample}.trimmed.fastq.gz",
    shell:
        "fastqc {input.fastq} --threads {threads} --nogroup --outdir results/FASTQC_AFTER_TRIMMING &> {log}"
```
Here is a shell command for this rule:
```bash
for sample in $(ls results/CUTADAPT_ADAPTER_TRIMMING/*.fastq.gz); do
    sample_name=$(basename ${sample} .fastq.gz)
    fastqc ${sample} --threads 4 --nogroup --outdir results/FASTQC_AFTER_TRIMMING &> logs/FASTQC_AFTER_TRIMMING/${sample_name}.log
done
```
Please run the `sbatch` command for this part like this:
```bash
sbatch --account=your_username FastQC_AfterTrimming.sh
```
Please check the outputs using this command:
```bash
ls -ltrh results/FASTQC_AFTER_TRIMMING/
```
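If you want to confirm that the adapters are gone without opening each html file, you can check the status of FastQC's `Adapter Content` module in the summary stored inside the zip archive (`sample1` is a placeholder for one of your sample names):
```bash
# PASS on this line indicates FastQC no longer flags adapter contamination.
unzip -p results/FASTQC_AFTER_TRIMMING/sample1.trimmed_fastqc.zip '*/summary.txt' | grep "Adapter Content"
```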
## Combine quality control outputs
Finally, the pipeline combines all the FastQC reports into one file using the tool MultiQC. **This part will not be executed in this tutorial.**
```
rule MultiQC:
    output:
        html="results/MULTIQC/multiqc_report.html",
    input:
        unpack(multiqc_input),
    params:
        config=os.path.join(WORKFLOW_DIR, "envs", "multiqc_config.yaml"),
    shell:
        'echo {input} | tr " " "\n" > {output.html}.fof;'
        "multiqc -c {params.config} -l {output.html}.fof --verbose --force --outdir results/MULTIQC &> {log}"
```
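For reference, if you wanted to combine the reports yourself outside the pipeline, a minimal manual MultiQC invocation over the two FastQC folders might look like this (a sketch, not one of the tutorial's scripts):
```bash
# MultiQC scans the given directories for recognizable reports and
# writes a combined multiqc_report.html into --outdir.
multiqc results/FASTQC_BEFORE_TRIMMING results/FASTQC_AFTER_TRIMMING \
    --force --outdir results/MULTIQC
```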