Updated FastQC text

hbctraining · May 23, 2024 · 40c08b9
1 parent 8de262a
commit 40c08b9
Showing 1 changed file with 95 additions and 4 deletions.
diff --git a/lessons/02_fastqc.md b/lessons/02_fastqc.md
@@ -215,18 +215,109 @@ You will see messages printed in the message window in the top window pane, givi
 
 You will see two panels in the interface. On the left hand side you will see the files on your laptop and on the right hand side you have your home directory on O2. Both panels have a directory tree at the top and a detailed listing of the selected directory's contents underneath. In the right hand panel, navigate to where the HTML files are located on O2 `~/variant_calling/results/fastqc/`. Then decide where you would like to copy those files to on your computer and move to that directory on the left hand panel.
 
-Once you have found the html output for `**GET FILE NAME**` **copy it over** by double clicking it or drag it over to right hand side panel. Once you have the HTML file copied over to your laptop, you can leave the Filezilla interface. You can then locate the HTML file on your computer and open it up in a browser. 
+Once you have found the HTML output file `syn3_normal_1_fastqc.html`, **copy it over** by double clicking it or dragging it over to the left hand side panel. Once you have the HTML file copied over to your laptop, you can leave the Filezilla interface. You can then locate the HTML file on your computer and open it up in a browser.
 
 ## Interpreting the HTML report
 
-Now we can take a look at the metrics and assess the quality of our sequencing data!
+Now we can take a look at the metrics and assess the quality of our sequencing data! `FastQC` provides a green checkmark if it thinks a plot looks good, a yellow exclamation mark if it thinks a plot raises some concerns, and a red X if it believes that the data has failed a test. It is exceedingly uncommon to have green checkmarks for everything, and even data with a few red X's can still be good data. You should not weigh `FastQC`'s scoring too heavily, but rather interpret the data yourself and make your own judgement. This is for two reasons:
 
-[NEEDS WORK] 
-* go through QC metrics and display a few of the important plots and what to expect here
+1) `FastQC` and the associated metrics are used as a first QC step for virtually all NGS analysis, but how RNA-seq, ChIP-seq and WGS data look in these plots is going to vary widely. A "failure" in one or a handful of metrics could simply be the result of the type of experiment you are running.
+
+2) Similarly to the previous point, your experiment could have some peculiarities to it. While this doesn't apply as much to WGS and WES, if the subset of reads you sequenced was somehow biased, that bias could show up in the QC of the reads. This is more often applicable to other types of NGS data analysis, but it can be true for WGS and WES as well. For example, the GC content of protein-coding sequences is generally higher than the GC content of the genome at large, so WES introduces a GC bias that you might not see in WGS data.
+
+In general, when looking at your data within `FastQC`, always keep your experimental design and dataset in mind and don't read too much into the assessments that `FastQC` provides.
+
+### Sequence Quality
+
+As we continue down the report, we can skip a few figures until we get to the sequence quality figure. A few things we should know about this figure:
+
+  1) The x-axis is the position in the read and the y-axis is the PHRED score.
+
+  2) Typically, the shape of these figures has a steep incline in the first few bases before plateauing and finally tapering off a bit. The shape should be mostly smooth. If we saw large, abrupt drops in quality, this could be a reason to contact your sequencing facility.
+
+  3) The right read (or R2) often has lower quality than the left read (or R1), and this difference in quality is just an artifact of paired-end Illumina sequencing.
+
+<p align="center">
+<img src="../img/Base_quality_scores.png" width="800">
+</p>
+
+The shape that we see is very typical of a good sequencing run. Importantly, there aren't any sudden drops in read quality in these samples.
+
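As a reminder of what the y-axis means, here is a minimal Python sketch of how a PHRED score relates to the probability of an incorrect base call. It assumes the Phred+33 ASCII encoding used by current Illumina FASTQ files; the function names are illustrative and are not part of `FastQC`.

```python
def phred_to_error_prob(q):
    """Probability that a base call with PHRED score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def fastq_char_to_phred(ch, offset=33):
    """Decode one FASTQ quality character (assumes Phred+33 encoding)."""
    return ord(ch) - offset


# A quality character of "I" decodes to Q40, i.e. a 1 in 10,000 chance of error.
q = fastq_char_to_phred("I")
print(q, phred_to_error_prob(q))
```

A score of 20 corresponds to a 1% chance of error and a score of 30 to a 0.1% chance, which is why small-looking differences on this axis matter.
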
+### Average Sequence Quality
+
+The next plot is a distribution of the average sequence quality per read. As opposed to the previous plot, the PHRED score is now on the x-axis and the number of reads is on the y-axis.
+
+<p align="center">
+<img src="../img/Per_sequence_quality_scores.png" width="800">
+</p>
+
+We can see that our average quality scores peak well above 28 and the distribution appears to be mostly unimodal. If the peak of the average PHRED scores were lower, or if we saw a bimodal distribution of PHRED scores, then we might have some concerns.
+
+### Per Base Sequence Content
+
+The next plot shows the sequence content across the reads. The x-axis is the position in the read and the y-axis is the percent of each base. The red line is percent thymine, the blue line is percent cytosine, green is percent adenine and yellow is percent guanine. Ideally, you should see fairly flat lines free from spikes, but the beginning (~10 bases) can often be a bit bumpy due to primer bias. We can see this primer bias in our samples, and the effect appears quite small. If you know the expected GC content of your sample, this is also a place where you can check that your sample is in the range you would expect.
+
+<p align="center">
+<img src="../img/Overall_sequence_content.png" width="800">
+</p>
+
+As you look across our sample, the lines show a bit of primer bias at the front and flatten out fairly quickly.
+
+### GC Content Distribution
+
+Similar to the previous plots on sequence content, we are mostly looking to make sure that there is a roughly normal distribution centered on the expected GC content for the reference genome/exome. Strong skews, multi-modal shapes or abrupt spikes could indicate errors in sequencing or contamination.
+
+<p align="center">
+<img src="../img/GC_content.png" width="800">
+</p>
+
+In the above figure, we see the shape that we would expect to see. It is mostly smooth and roughly normal, centered around a GC percentage reasonable for the human exome. We don't see any abrupt peaks and the curve looks mostly unimodal.
+
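To make the quantity behind this plot concrete, the sketch below computes the GC percentage of individual reads in Python. The reads are toy data and `gc_percent` is an illustrative helper, not a `FastQC` function; it also makes the simplifying assumption that any `N` bases count toward the read length.

```python
def gc_percent(seq):
    """Percent of bases in a read that are G or C (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


# Toy reads for illustration only.
reads = ["ATGC", "GGCC", "ATAT"]
for read in reads:
    print(read, gc_percent(read))
```

Plotting a histogram of these per-read values over a whole file gives a distribution of the same kind as the one in the report.
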
+### Per base N content
+
+When a sequencer is unable to make a base call at a position, it assigns the base call of N. As a result, we would hope our sample would have very few N calls. The x-axis is the position in the read and the y-axis is the percent of reads with an N in that position. We are hoping to see a mostly flat line as close to 0 as we can get. Many N calls or abrupt spikes with an abundance of N calls would be concerning.
+
+<p align="center">
+<img src="../img/N_content.png" width="800">
+</p>
+
+In our data, we see that it is mostly a flat line close to 0, so we don't have any concerns.
+
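The calculation described above can be sketched in a few lines of Python. This is a simplified illustration on toy reads, with a hypothetical helper name; positions beyond the end of a shorter read are not counted for that read.

```python
def n_percent_by_position(reads):
    """Percent of reads with an N at each position (0-based)."""
    if not reads:
        return []
    max_len = max(len(r) for r in reads)
    n_counts = [0] * max_len   # reads with an N at this position
    coverage = [0] * max_len   # reads long enough to reach this position
    for read in reads:
        for i, base in enumerate(read.upper()):
            coverage[i] += 1
            if base == "N":
                n_counts[i] += 1
    return [100.0 * n / c for n, c in zip(n_counts, coverage)]


# Toy reads for illustration only: one of three reads has an N at position 1.
print(n_percent_by_position(["ACGT", "ANGT", "ACG"]))
```
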
+### Duplication Levels
+
+This next plot helps us visualize the amount of duplicate sequence we see in the reads. The x-axis is the number of times a sequence is duplicated and the y-axis is the percentage of reads duplicated that many times. Ideally, this figure is strongly shifted to the left with a tail that quickly tapers down. This would indicate that much of the sequence in the reads is not duplicated and is present in a single copy.
+
+<p align="center">
+<img src="../img/Duplication_levels.png" width="800">
+</p>
+
+This figure appears to be about what one would hope to see as most of the reads don't show high levels of duplication.
+
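For intuition about what the axes represent, here is a simplified Python sketch that tallies duplication levels for a small set of toy reads. It is only an approximation of the idea: `FastQC`'s own calculation differs in its details (for example, it works from a sample of the file and groups higher duplication levels into bins), and `duplication_levels` is a hypothetical helper name.

```python
from collections import Counter


def duplication_levels(reads):
    """Map duplication level -> percent of all reads seen at that level."""
    seq_counts = Counter(reads)      # how many times each exact sequence occurs
    level_reads = Counter()
    for count in seq_counts.values():
        level_reads[count] += count  # all copies of the sequence count as reads
    total = len(reads)
    return {level: 100.0 * n / total for level, n in sorted(level_reads.items())}


# Toy reads for illustration only.
reads = ["AAA", "AAA", "CCC", "GGG", "GGG", "GGG", "TTT"]
print(duplication_levels(reads))
```

In a library with little duplication, nearly all of the percentage sits at level 1, which is the left-shifted shape described above.
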
+### Overrepresented Sequences
+
+This table displays any overrepresented sequences and their potential sources. It is not uncommon to get adapter sequences in this table. In general, as long as there are only a handful or fewer overrepresented sequences, with each of them making up less than ~1%, your sample should be fine.
+
+<p align="center">
+<img src="../img/Overrepresented_sequences.png" width="800">
+</p>
+
+These samples don't show any overrepresented sequences, which is great.
 
 FastQC has a really well documented [manual page](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) with [detailed explanations](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/) about every plot in the report. 
 
+### Adapter Content
+
+One large source of overrepresented sequences can be the adapters used in library construction. The x-axis is the position in the read and the y-axis is the percent of adapter contamination at that position, with each colored line representing a different potential adapter set. Since no adapters came up in our previous overrepresented sequences evaluation, we would not expect to see any sign of them in this plot.
+
+<p align="center">
+<img src="../img/Adapter_content.png" width="800">
+</p>
+
+We don't see any signs of adapters in our data.
+
+### Overall conclusions
 
+Our data looks good, and there weren't any concerning issues that we need to address with the sequencing facility.
 
 ***