diff --git a/docs/usage/DEanalysis/de_rstudio.md b/docs/usage/DEanalysis/de_rstudio.md index 78be7c10b..89a258c3a 100644 --- a/docs/usage/DEanalysis/de_rstudio.md +++ b/docs/usage/DEanalysis/de_rstudio.md @@ -2,7 +2,6 @@ In this section of the tutorial, we will guide you through the practical steps necessary to set up the RStudio environment, load the required libraries and data, and execute the DESeq2 analysis. By the end of this section, you will have a fully functional DESeq2 analysis pipeline set up in RStudio, ready to uncover the differentially expressed genes in your dataset. - ## Launching the RStudio environment Once the nf-core/rnaseq pipeline is terminated, the resulting data are stored in the folder `results_star_salmon`. Now, we can analyse the results by running DESeq2 on RStudio. First of all, we need to launch it: @@ -28,18 +27,17 @@ This command will keep the gitpod session active for exactly 2 hours, providing Now come back to our **RStudio session**. - ## Differential Expression Analysis As in all analysis, firstly we need to create a new project: -1) Go to the **File** menu and select **New Project**; +1. Go to the **File** menu and select **New Project**; -2) Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**; +2. Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**; ![R_project](./img/project_R.png) -3) The new project will be automatically opened in RStudio +3. The new project will be automatically opened in RStudio We can check whether we are in the correct working directory with `getwd()` The path `/workspace/gitpod/training/DE_analysis/` should appear on your console. To store our results in an organized way, we will create a folder named **de_results** using the **New Folder** button in the bottom right panel. We will save all our resulting tables and plots in this folder. Next, go to the **File menu**, select **New File** and then **R Script** to create a script editor in which we will save all commands required for the analysis. In the editor type: @@ -90,7 +88,8 @@ load("/workspace/gitpod/training/results_star_salmon/star_salmon/deseq2_qc/deseq In this tutorial we will analyse the `dds` object generated by running the alignment with **STAR** and the quantification with **Salmon**. Alternatively, a user could choose to analyse the the `dds` object generated by running only **Salmon** for both lightweight alignment and quantification. The file is stored in `/workspace/gitpod/training/results_star_salmon/salmon/deseq2_qc/deseq2.dds.RData`. -In DESEq2, the `dds` object is a central data structure that contains the following components: +In DESEq2, the `dds` object is a central data structure that contains the following components: + - `countData`: a matrix of raw count data, where each row represents a gene and each column represents a sample; - `colData`: a data frame containing information about the samples, such as the experimental design, treatment and other relevant metadata; - `design`: a formula specifying the experimental design utilised to estimate the dispersion and the log2foldchange. @@ -107,7 +106,7 @@ colData(dds) # to check the sample info design(dds) # to check the design formula ``` -The `colData` and the `design` are the ones created within the nfcore/rnaseq pipeline and must be reorganised prior to the analysis. With the following commands we will create our metadata starting from the info stored in the `dds`. We will rename the column of the `colData`, we will ensure that the rownames of the metadata are present in the same order as the column names and finally we will update the `colData` of the `dds` object with our newly created metadata. +The `colData` and the `design` are the ones created within the nfcore/rnaseq pipeline and must be reorganised prior to the analysis. With the following commands we will create our metadata starting from the info stored in the `dds`. We will rename the column of the `colData`, we will ensure that the rownames of the metadata are present in the same order as the column names and finally we will update the `colData` of the `dds` object with our newly created metadata. ```r #### Creation of metadata starting from the dds colData #### @@ -148,7 +147,7 @@ Now that everything is setted up, we can proceed to generate a new DESeq2 object dds_new <- DESeqDataSet(dds, design = ~ condition) -# dds inspection +# dds inspection head(counts(dds_new)) # to check the raw counts @@ -168,15 +167,15 @@ Before running the different steps of the analysis, a good practice consists in # Select a minimal number of samples = 3 -smallestGroupSize <- 3 +smallestGroupSize <- 3 # Select genes with a sum counts of at least 10 in 3 samples -keep <- rowSums(counts(dds_new) >= 10) >= smallestGroupSize +keep <- rowSums(counts(dds_new) >= 10) >= smallestGroupSize # Keep only the genes that pass the threshold -dds_filtered <- dds_new[keep,] +dds_filtered <- dds_new[keep,] ``` Now, it is time to run the differential expression analysis with the `DESeq()` function: @@ -189,7 +188,8 @@ dds_final <- DESeq(dds_filtered) The `DESeq()` function is a high-level wrapper that simplifies the process of differential expression analysis by combining multiple steps into a single function call. -This makes the workflow more user-friendly and ensures that all necessary preprocessing and statistical steps are executed in the correct order. The key functions that **DESeq2** calls include: +This makes the workflow more user-friendly and ensures that all necessary preprocessing and statistical steps are executed in the correct order. The key functions that **DESeq2** calls include: + - **estimateSizeFactors**: to normalise the count data; - **estimateDispersion**: to estimate the dispersion; - **nbinomWaldTest**: to perform differential expression test. @@ -217,11 +217,12 @@ rld <- rlog(dds_final, blind = TRUE) ``` The `rlog` and the `vst` transformations have an argument, **blind** that can be set to: + - **TRUE** (default): useful for QC analysis because it re-estimates the dispersion, allowing for comparison of samples in an unbiased manner with respect to experimental conditions; - **FALSE**: the function utilizes the already estimated dispersion, generally applied when differences in counts are expected to be due to the experimental design. Next, we perform Principal Component Analysis (PCA) to explore the data. DESeq2 provides a built-in function, `plotPCA()`, which uses [ggplot2](https://ggplot2.tidyverse.org) for visualisation, taking the `rld` (or the `vst`) object as input. -Since the **treatment** is the principal condition of interest in our metadata, we will use the `condition` information from our metadata to plot the PCA: +Since the **treatment** is the principal condition of interest in our metadata, we will use the `condition` information from our metadata to plot the PCA: ```r #### Plot PCA #### @@ -241,7 +242,7 @@ sampleDists <- dist(t(assay(rld))) # Calculate pairwise distances between sampl # Convert distances to a matrix -sampleDistMatrix <- as.matrix(sampleDists) +sampleDistMatrix <- as.matrix(sampleDists) # Set the row and column names of the distance matrix @@ -255,11 +256,11 @@ colors <- colorRampPalette(rev(brewer.pal(9, "Greens")))(255) # function from RC # Create the heatmap -pheatmap(sampleDistMatrix, - clustering_distance_rows = sampleDists, - clustering_distance_cols = sampleDists, - col = colors, - fontsize_col = 8, +pheatmap(sampleDistMatrix, + clustering_distance_rows = sampleDists, + clustering_distance_cols = sampleDists, + col = colors, + fontsize_col = 8, fontsize_row = 8) ``` @@ -286,10 +287,10 @@ normalised_counts$gene <- rownames(counts(dds_final)) # Relocate the gene column to the first position -normalised_counts <- normalised_counts %>% +normalised_counts <- normalised_counts %>% relocate(gene, .before = control_rep1) -# Save normalised counts +# Save normalised counts write.csv(normalised_counts, file = "de_results/normalised_counts.csv") ``` @@ -303,7 +304,7 @@ The `results()` function in DESeq2 is used to extract the results of the DE anal - **pvalue**: the p-value from the Wald test indicates the probability of observing the measured difference in gene expression (log2 fold change) by chance, assuming no true difference exists (null hypothesis). A low p-value suggests that the observed expression change between samples is unlikely due to random chance, so we can reject the null hypothesis; - **padj**: the adjusted p-value, which takes into account multiple testing corrections, using the Benjamini-Hochberg method to control the false discovery rate; -By default, the `results()` function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter `alpha`. The `results()` function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast. A contrast is a specific comparison between two or more levels of a factor, such as the comparison between the treatment and control groups. The order of the contrast names determines the direction of the fold change that is reported in the results. Specifically, the first level of the contrast is the condition of interest and the second level is the reference level. +By default, the `results()` function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter `alpha`. The `results()` function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast. A contrast is a specific comparison between two or more levels of a factor, such as the comparison between the treatment and control groups. The order of the contrast names determines the direction of the fold change that is reported in the results. Specifically, the first level of the contrast is the condition of interest and the second level is the reference level. Notice that in this tutorial the contrast is already correctly specified. ```r @@ -313,15 +314,15 @@ res <- results(dds_final) # Visualise the results -head(res) +head(res) -# Summarise the results showing the number of tested genes (genes with non-zero total read count), the genes up- and down-regulated at the selected threshold (alpha) and the number of genes excluded by the multiple testing due to a low mean count +# Summarise the results showing the number of tested genes (genes with non-zero total read count), the genes up- and down-regulated at the selected threshold (alpha) and the number of genes excluded by the multiple testing due to a low mean count summary(res) # DESeq2 function to extract the name of the contrast -resultsNames(dds_final) +resultsNames(dds_final) # res <- results(dds, contrast = c("design_formula", "condition_of_interest", "reference_condition")) # Command to set the contrast, if necessary @@ -336,37 +337,37 @@ res_viz$gene <- rownames(res) # Convert the results to a tibble for easier manipulation and relocate the gene column to the first position -res_viz <- as_tibble(res_viz) %>% +res_viz <- as_tibble(res_viz) %>% relocate(gene, .before = baseMean) -# Save the results table +# Save the results table write.csv(res_viz, file = "de_results/de_result_table.csv") ``` -In the *Experimental Design* section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change greater than 1 or lower than -1 as differentially expressed. +In the _Experimental Design_ section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change greater than 1 or lower than -1 as differentially expressed. ```r #### Extract significant DE genes from the results #### # Filter the results to include only significantly differentially expressed genes with an adjusted p-value (padj) less than 0.05 and a log2foldchange greater than 1 or less than -1 -resSig <- subset(res_viz, padj < 0.05 & abs(log2FoldChange) > 1) +resSig <- subset(res_viz, padj < 0.05 & abs(log2FoldChange) > 1) # Convert the results to a tibble for easier manipulation and relocate the 'Gene' column to be the first column -resSig <- as_tibble(resSig) %>% - relocate(gene, .before = baseMean) +resSig <- as_tibble(resSig) %>% + relocate(gene, .before = baseMean) # Order the significant genes by their adjusted p-value (padj) in ascending order -resSig <- resSig[order(resSig$padj),] +resSig <- resSig[order(resSig$padj),] # Display the final results table of significant genes -resSig +resSig -# Save the significant DE genes +# Save the significant DE genes write.csv(resSig, file = "de_results/sig_de_genes.csv") ``` @@ -380,7 +381,7 @@ Now that we have obtained the results of the differential expression analysis, i plotMA(res, ylim = c(-2,2)) ``` - + - **counts plot**: it plots the normalised counts for a single gene across the different conditions in your experiment. It’s particularly useful for visualising the expression levels of specific genes of interest and comparing them across sample groups. ```r @@ -399,14 +400,14 @@ Notiche that this plot is based on the normalised counts. significant_genes <- resSig[, 1] -# Extract normalised counts for significant genes from the normalised counts matrix and convert the "gene" column to row names +# Extract normalised counts for significant genes from the normalised counts matrix and convert the "gene" column to row names -significant_counts <- inner_join(normalised_counts, significant_genes, by = "gene") %>% +significant_counts <- inner_join(normalised_counts, significant_genes, by = "gene") %>% column_to_rownames("gene") # Create the heatmap using pheatmap -pheatmap(significant_counts, +pheatmap(significant_counts, cluster_rows = TRUE, fontsize = 8, scale = "row", @@ -422,49 +423,48 @@ pheatmap(significant_counts, # Convert the results to a tibble and add a column indicating differential expression status -res_tb <- as_tibble(res) %>% +res_tb <- as_tibble(res) %>% mutate(diffexpressed = case_when( - log2FoldChange > 1 & padj < 0.05 ~ 'upregulated', + log2FoldChange > 1 & padj < 0.05 ~ 'upregulated', log2FoldChange < -1 & padj < 0.05 ~ 'downregulated', TRUE ~ 'not_de')) # Add a new column with gene names -res_tb$gene <- rownames(res) +res_tb$gene <- rownames(res) # Relocate the 'gene' column to be the first column -res_tb <- res_tb %>% +res_tb <- res_tb %>% relocate(gene, .before = baseMean) # Order the table by adjusted p-value (padj) and add a new column for gene labels -res_tb <- res_tb %>% arrange(padj) %>% +res_tb <- res_tb %>% arrange(padj) %>% mutate(genelabels = "") - + # Label the top 5 most significant genes res_tb$genelabels[1:5] <- res_tb$gene[1:5] # Create a volcano plot using ggplot2 -ggplot(data = res_tb, aes(x = log2FoldChange, y = -log10(padj), col = diffexpressed)) + - geom_point(size = 0.6) + +ggplot(data = res_tb, aes(x = log2FoldChange, y = -log10(padj), col = diffexpressed)) + + geom_point(size = 0.6) + geom_text_repel(aes(label = genelabels), size = 2.5, max.overlaps = Inf) + - ggtitle("DE genes treatment versus control") + + ggtitle("DE genes treatment versus control") + geom_vline(xintercept = c(-1, 1), col = "black", linetype = 'dashed', linewidth = 0.2) + geom_hline(yintercept = -log10(0.05), col = "black", linetype = 'dashed', linewidth = 0.2) + theme(plot.title = element_text(size = rel(1.25), hjust = 0.5), axis.title = element_text(size = rel(1))) + - scale_color_manual(values = c("upregulated" = "red", - "downregulated" = "blue", + scale_color_manual(values = c("upregulated" = "red", + "downregulated" = "blue", "not_de" = "grey")) + labs(color = 'DE genes') + xlim(-3,5) ``` - ## Functional analysis The output of the differential expression analysis is a list of significant DE genes. To uncover the underlying biological mechanisms, various downstream analyses can be performed, such as functional enrichment analysis (identify overrepresented biological processes, molecular functions, cellular components or pathways) and network analysis (group genes based on similar expression patterns to identify novel interactions). To facilitate the interpretation of the resulting list of DE genes, a range of freely available web- and R-based tools can be employed. @@ -530,7 +530,7 @@ go_enrich <- enrichGO( # Create a bar plot of the top enriched GO terms barplot <- barplot( - go_enrich, + go_enrich, title = "Enrichment analysis barplot", font.size = 8 ) @@ -546,4 +546,4 @@ dotplot <- dotplot( # Combine the bar plot and dot plot into a single plot grid plot_grid(barplot, dotplot, nrow = 2) -``` \ No newline at end of file +``` diff --git a/docs/usage/DEanalysis/index.md b/docs/usage/DEanalysis/index.md index 8bab2769b..6c68014c0 100644 --- a/docs/usage/DEanalysis/index.md +++ b/docs/usage/DEanalysis/index.md @@ -20,7 +20,6 @@ By the end of this workshop, you will be able to: Let's get started! - ## Running with Gitpod In order to run this using GitPod, please make sure: @@ -30,21 +29,17 @@ In order to run this using GitPod, please make sure: Now you're all set and can use the following button to launch the service: - [![Open in GitPod](https://img.shields.io/badge/Gitpod-%20Open%20in%20Gitpod-908a85?logo=gitpod)](https://gitpod.io/#https://github.com/lescai-teaching/rnaseq-tutorial) - - ## Additional documentation - You can find detailed documentation on **Nextflow** [here](https://www.nextflow.io/docs/latest/) - You can find additional training on [these pages](https://training.nextflow.io) - ## Credits & Copyright This training material has been written and completed during the [nf-core](https://nf-co.re) Hackathon in Barcellona, 2024, by Lorenzo Sola, Francesco Lescai and Mariangela Santorsola, with special thanks to Victoria Cepeda for her contributions to the tutorial's revision. The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities. The Docker image and Gitpod environment used in this repository have been created by [Seqera](https://seqera.io) but have been made open-source ([CC BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)) for the community. -All examples and descriptions are licensed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/). \ No newline at end of file +All examples and descriptions are licensed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/). diff --git a/docs/usage/DEanalysis/interpretation.md b/docs/usage/DEanalysis/interpretation.md index 7ed680ec2..fb0f38a25 100644 --- a/docs/usage/DEanalysis/interpretation.md +++ b/docs/usage/DEanalysis/interpretation.md @@ -6,10 +6,9 @@ Once DE genes have been identified, the next crucial step is to interpret the re The results illustrated in this section might show slight variations compared to your runs due to randomness in the STAR algorithm. This randomness arises from using variable seed values and parallel processing, leading to minor differences in results between runs on the same data. These small discrepancies are not biologically significant and may affect counts and subsequent plots (such as PCA and count plots). However, the overall patterns and main findings should remain consistent. While exact reproducibility is ideal, minor variations are acceptable in practice, as long as they do not impact the main conclusions of the study. - ## Quality control plots -The first plot we will examine is the Principal Component Analysis (PCA) plot. Since we're working with simulated data, our metadata is relatively simple, consisting of just three variables: sample, condition and replica. In a typical RNA-seq experiment, however, metadata can be complex and encompass a wide range of variables that could contribute to sample variation, such as sex, age and developmental stage. +The first plot we will examine is the Principal Component Analysis (PCA) plot. Since we're working with simulated data, our metadata is relatively simple, consisting of just three variables: sample, condition and replica. In a typical RNA-seq experiment, however, metadata can be complex and encompass a wide range of variables that could contribute to sample variation, such as sex, age and developmental stage. ![PCA](./img/PCA.png) @@ -19,15 +18,14 @@ Next, we will examine the hierarchical clustering plot to explore the relationsh ![Cluster](./img/hierarchical_clustering.png) -Remember that to create this plot, we utilized the `dist()` function, so in the legend on the right, a value of 0 corresponds to high correlation, while a value of 5 corresponds to very low correlation. Similar to PCA, we can see that samples tend to cluster together according to `condition`, indeed we can observe a high degree of correlation between the three control samples and between the three treated samples. +Remember that to create this plot, we utilized the `dist()` function, so in the legend on the right, a value of 0 corresponds to high correlation, while a value of 5 corresponds to very low correlation. Similar to PCA, we can see that samples tend to cluster together according to `condition`, indeed we can observe a high degree of correlation between the three control samples and between the three treated samples. Overall, the integration of these plots suggests that we are working with high-quality data and we can confidently proceed to the differential expression analysis. - ## Differential expression results -From this point, we will examine plots that are generated after the differential expression analysis. These plots are not quality control (QC) plots, but rather plots that help us to interpret the results. -After running the `results()` function, a good way to start to have an idea about the results is to look at the MA plot. +From this point, we will examine plots that are generated after the differential expression analysis. These plots are not quality control (QC) plots, but rather plots that help us to interpret the results. +After running the `results()` function, a good way to start to have an idea about the results is to look at the MA plot. ![RNAseq](./img/MA_plot.png) @@ -54,7 +52,7 @@ After the identification of DE genes, it's informative to visualise the expressi ![Counts](./img/plotCounts.png) -In our example, post-treatment, we observe a significant increase in the expression of the *ENSG00000142192* gene, highlighting its responsiveness to the experimental conditions. +In our example, post-treatment, we observe a significant increase in the expression of the _ENSG00000142192_ gene, highlighting its responsiveness to the experimental conditions. Finally, we can create a heatmap using the normalised expression counts of DE genes. The resulting heatmap visualises how the expression of significant genes varies across samples. Each row represents a gene, and each column represents a sample. The color intensity in the heatmap reflects the normalised expression levels: red colors indicate higher expression, while blue colors indicate lower expression. @@ -62,16 +60,14 @@ Finally, we can create a heatmap using the normalised expression counts of DE ge By examining the heatmap, we can visually identify the expression patterns of our five significant differentially expressed genes. This visualization allows us to identify how these genes respond to the treatment. The heatmap provides a clear and intuitive way to explore gene expression dynamics. - ## Over Representation Analysis (ORA) -Finally, we can attempt to assign biological significance to our differentially expressed genes through **Enrichment Analysis (ORA)**. The ORA analysis identifies specific biological pathways, molecular functions and cellular processes, according to the **Gene Ontology (GO)** database, that are enriched within our differentially expressed genes. +Finally, we can attempt to assign biological significance to our differentially expressed genes through **Enrichment Analysis (ORA)**. The ORA analysis identifies specific biological pathways, molecular functions and cellular processes, according to the **Gene Ontology (GO)** database, that are enriched within our differentially expressed genes. ![Enrichment](./img/enrichment_plot.png) The enrichment analysis reveals a possible involvement of cellular structures and processes, including "clathrin-coated pit", "dendritic spine", "neuron spine" and "endoplasmic reticulum lumen". These terms suggest a focus on cellular transport, structural integrity and protein processing, especially in neural contexts. This pattern points to pathways related to cellular organization and maintenance, possibly playing an important role in the biological condition under study. - ## Conclusions -In this tutorial, we have walked through the steps of performing differential expression analysis using DESeq2, from preparing the data to interpreting the results. We have seen how to identify differentially expressed genes, visualise the results and perform enrichment analysis to gain insights into the biological significance of the results. By following this tutorial, you should now be able to perform differential expression analysis using DESeq2 and interpret the results in the context of your own research question. \ No newline at end of file +In this tutorial, we have walked through the steps of performing differential expression analysis using DESeq2, from preparing the data to interpreting the results. We have seen how to identify differentially expressed genes, visualise the results and perform enrichment analysis to gain insights into the biological significance of the results. By following this tutorial, you should now be able to perform differential expression analysis using DESeq2 and interpret the results in the context of your own research question. diff --git a/docs/usage/DEanalysis/rnaseq.md b/docs/usage/DEanalysis/rnaseq.md index 05ffcd439..a801520a1 100644 --- a/docs/usage/DEanalysis/rnaseq.md +++ b/docs/usage/DEanalysis/rnaseq.md @@ -2,7 +2,6 @@ In order to carry out a RNA-Seq analysis we will use the nf-core pipeline [rnaseq](https://nf-co.re/rnaseq/3.14.0). - ## Overview The pipeline is organised following the diffent blocks: pre-processing, alignment (or lightweight alignment) and quantification, post-processing and final QC. @@ -10,63 +9,58 @@ The pipeline is organised following the diffent blocks: pre-processing, alignmen ![Metromap](./img/nf-core-rnaseq_metro_map_grey.png) In each process, the users can choose among a range of different options. Importantly, the users can decide to follow one of the two different routes in the alignment and quantification step: -- alignment and quantification (stage 2); -- pseudoalignment and quantification (stage 3). +- alignment and quantification (stage 2); +- pseudoalignment and quantification (stage 3). ## Experimental Design -The number of reads and the number of biological replicates are two critical factors that researchers need to carefully consider during the design of a RNA-seq experiment. While it may seem intuitive that having a large number of reads is always desirable, an excessive number can lead to unnecessary costs and computational burdens, without providing significant improvements in quality of the data. Instead, it is often more beneficial to prioritise the number of biological replicates, as it allows to capture the natural biological variation of the data. Biological replicates involve collecting and sequencing RNA from distinct biological samples (e.g., different individuals, tissues, or time points), helping to detect genuine changes in gene expression. +The number of reads and the number of biological replicates are two critical factors that researchers need to carefully consider during the design of a RNA-seq experiment. While it may seem intuitive that having a large number of reads is always desirable, an excessive number can lead to unnecessary costs and computational burdens, without providing significant improvements in quality of the data. Instead, it is often more beneficial to prioritise the number of biological replicates, as it allows to capture the natural biological variation of the data. Biological replicates involve collecting and sequencing RNA from distinct biological samples (e.g., different individuals, tissues, or time points), helping to detect genuine changes in gene expression. !!! note This concept must not be confused with technical replicates that asses the technical variability of the sequencing platform by sequencing the same RNA sample multiple time. -To obtain optimal results, it is crucial to balance the number of biological replicates and sequencing depth. While deeper sequencing improves the detection of lowly expressed genes, it reaches a plateau, beyond which no additional benefits are gained. Statistical power calculations can inform experimental design by estimating the optimal number of reads and replicates required. For instance, this approach helps establish a suitable log2 fold change (log2FC) threshold for the DE analysis. By incorporating multiple biological replicates into the design and optimizing sequencing depth, researchers can enhance the statistical power of the analysis, reducing the number of false positive results and increasing the reliability of the findings. - +To obtain optimal results, it is crucial to balance the number of biological replicates and sequencing depth. While deeper sequencing improves the detection of lowly expressed genes, it reaches a plateau, beyond which no additional benefits are gained. Statistical power calculations can inform experimental design by estimating the optimal number of reads and replicates required. For instance, this approach helps establish a suitable log2 fold change (log2FC) threshold for the DE analysis. By incorporating multiple biological replicates into the design and optimizing sequencing depth, researchers can enhance the statistical power of the analysis, reducing the number of false positive results and increasing the reliability of the findings. ## Library design RNA-seq library design involves critical decisions, including the choice between paired-end and single-end sequencing. Paired-end sequencing provides valuable information on structural variations and transcript isoforms, improving mapping accuracy, especially for longer transcripts and repetitive regions. In contrast, single-end sequencing, where only one end of the fragment is sequenced, can be more cost-effective while still providing high-quality data for gene expression analysis. The decision between paired-end and single-end sequencing ultimately depends on the research question and experimental goals. Paired-end sequencing is preferred for novel transcript identification or isoform characterization, while single-end sequencing is sufficient for gene expression quantification. The type of RNA (e.g., mRNA or total RNA), read length, budget and computational resources can impact the choice. - ## Reference genome nf-core pipelines make use of the Illumina iGenomes collection as [reference genomes](https://nf-co.re/docs/usage/reference_genomes). Before starting the analysis, the users might want to check whether the genome they need is part of this collection. -They also might want to consider downloading the reference locally, when running on premises: this would be useful for multiple runs and to speed up the analysis. In this case the parameter `--igenomes_base` might be used to pass the root directory of the downloaded references. +They also might want to consider downloading the reference locally, when running on premises: this would be useful for multiple runs and to speed up the analysis. In this case the parameter `--igenomes_base` might be used to pass the root directory of the downloaded references. One might also need to use custom files: in this case the users might either provide specific parameters at command line (`--fasta` option followed by the genome of choiche), or create a config file adding a new section to the `genome` object. See [here](https://nf-co.re/docs/usage/reference_genomes#custom-genomes) for more details. We will follow this specific approach in this tutorial, since the data we will be using have been simulated on chromosome 21 of the Human GRCh38 reference, and we have prepared genome fasta and genome index containing only this chromosome locally. The two files are `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa` and `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa.fai`, respectively. - ## Reference annoation The reference annotation plays a crucial role in the RNA-seq analysis. Without a high-quality reference annotation, RNA-seq analysis would result in inaccurate or incomplete results. The reference annotation provides a precise guide for aligning sequencing reads to specific genomic regions, allowing to identify genes, transcripts and regulatory elements. This is particularly important for identifying novel transcripts and alternative splicing events. nf-core pipelines make use of the Illumina iGenomes collection also as [reference annotation](https://nf-co.re/docs/usage/reference_genomes). -The reference annotations are vastly out of date with respect to current annotations and miss certain features. So, the general recommendation is to download a newest annotation version compatible with the genome. A user can utilize the `--gtf` or the `--gff` options to specify the annottation files of choiche, or create a config file adding a new section to the `genome` object. +The reference annotations are vastly out of date with respect to current annotations and miss certain features. So, the general recommendation is to download a newest annotation version compatible with the genome. A user can utilize the `--gtf` or the `--gff` options to specify the annottation files of choiche, or create a config file adding a new section to the `genome` object. Similarly to the approach utilised for the genome, in this tutorial we will follow this approach. The annotation files include only the annotated transcripts on chromosome 21 of the Human GRCh38 reference genome and we have already prepared these files locally. The two files are `/workspace/gitpod/training/data/refs/gencode_v29_chr21.gff` and `/workspace/gitpod/training/data/refs/gencode_v29_transcripts_chr21.fa`, respectively. - ## Input files The input data should be provided in a CSV file, according to a format that is largely common for nf-core pipelines. The format is described in the [rnaseq usage page](https://nf-co.re/rnaseq/3.14.0/docs/usage). In the tutorial, the input file is `/workspace/gitpod/training/data/reads/rnaseq_samplesheet.csv`. - ## Running nf-core/rnaseq In the following sections we will: + - prepare our references; - set our computational resources in order to be able to run the pipeline on a gitpod VM; - edit the optional settings; - run the pipeline. - ## Reference and annotation files Following the considerations above, we will first of all edit the `nextflow.config` file in our working directory to add a new genome. @@ -88,7 +82,6 @@ genomes { To speed up the analysis we will include the `star_index` and the `salmon_index` in the config. These files have already been created locally. - ## Computing resources Based on the choices we made when starting up the gitpod environment, we recommend to use the following additional parameters. @@ -102,7 +95,6 @@ params { } ``` - ## Launching the pipeline Now we are ready to launch the pipeline, and we can use the following command line: @@ -127,4 +119,4 @@ nextflow run nf-core/rnaseq -r 3.12.0 \ ``` Notice that we will run the pipeline with STAR as aligner and Salmon in alignment-based mode to quantify gene expression. We will also run the pipeline with Salmon in mapping-based mode to perform a lightweight alignment and quantification. -The `skip` parameters were inserted to reduce the running time. \ No newline at end of file +The `skip` parameters were inserted to reduce the running time. diff --git a/docs/usage/DEanalysis/theory.md b/docs/usage/DEanalysis/theory.md index 2a00d18b2..b829ec3c4 100644 --- a/docs/usage/DEanalysis/theory.md +++ b/docs/usage/DEanalysis/theory.md @@ -14,7 +14,7 @@ In the scheme, we can identify three key phases in the workflow: data pre-proces Depending on the user's needs, the workflow can include additional downstream analyses such as functional enrichment analysis, co-expression analysis, and integration with other omics data. -## Pre-processing +## Pre-processing The pre-processing of sequencing reads from RNA-seq data is a critical step to ensure the quality and accuracy of downstream analysis. The raw reads obtained from the sequencer are stored as [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) files, which contain both the sequence data and quality scores. The initial processing step involves evaluating the quality of the reads and identifying potential issues such as adapter contamination, sequencing errors, or low-quality bases. The presence of adapters (short DNA sequences ligated to the ends of DNA fragments during library preparation) is detected through comparison with known adapter sequences or by using algorithms that identify adapter-like sequences, and removed in a process known as **read trimming**. Next, reads containing contaminants (genomic DNA and/or rRNA) and those with low-quality bases are filtered out. Finally, the quality of the filtered reads is checked again to ensure their suitability for downstream processing. @@ -25,6 +25,7 @@ In the RNA-seq workflow, the alignment step involves mapping sequencing reads to Errors, gaps, regions of poor sequence quality, insertions/deletions (INDELs), as well as duplicated and repetitive regions in the reference sequence, make this step challenging. Addressing these issues by choosing a high-quality reference and an appropriate aligner is essential for accurate results. A crucial component in the alignment step is the [annotation](https://www.ncbi.nlm.nih.gov/genbank/genomes_gff) file, which can be in General Feature Format (GFF) or Gene Transfer Format (GTF). These files contain key information about the location and structure of genes and transcripts, playing a crucial role in accurate mapping and gene expression quantification. Additionally, RNA-seq data often includes reads that span exon-exon junctions, and the annotation files provide information about splice junctions, allowing for the detection of different isoforms. The alignment and quantification steps can follow two different approaches depending on user preferences: + - traditional alignment and quantification; - lightweight alignment and quantification. @@ -48,8 +49,7 @@ Finally, it is often beneficial to filter out genes that are unlikely to show di !!! note - The `vst` or `rlog` transformations are applied to the normalised counts stored in the `dds` object, which is generated by running either the `DESeq()` or `estimateSizeFactors()` function. Since the estimation of size factors is an early step in the `DESeq()` function, the transformations are generally applied immediately afterwards. - + The `vst` or `rlog` transformations are applied to the normalised counts stored in the `dds` object, which is generated by running either the `DESeq()` or `estimateSizeFactors()` function. Since the estimation of size factors is an early step in the `DESeq()` function, the transformations are generally applied immediately afterwards. ### Design Formula @@ -71,20 +71,16 @@ design = ~ sex + developmental_stage + condition In R, the tilde (`~`) is used in formula syntax to specify relationships between variables in statistical models. Here, it indicates that gene counts (dependent variable) will be modelled as a function of the specified variables (predictors). - The results will not be affected by the order of variables but the common practice is to specify the main source of variation in the last position of the design formula. - ### Differential Expression Analysis RNA-seq data typically contain a large number of genes with low expression counts, indicating that many genes are expressed at very low levels across samples. At the same time, RNA-seq data exhibit a skewed distribution with a long right tail due to the absence of an upper limit for gene expression levels. This means that while most genes have low to moderate expression levels, a small number are expressed at high levels. Accurate statistical modelling must therefore account for this distribution to avoid misleading conclusions. - ![RNAseq](./img/count_distribution.png) The core of the differential expression analysis is the `DESeq()` function, a wrapper that streamlines several key steps into a single command. The different functions are listed below: - ![RNAseq](./img/DESeq_function.png) !!! note @@ -93,7 +89,7 @@ The core of the differential expression analysis is the `DESeq()` function, a wr The different steps are explained in detail below: -1) **Normalisation**: since DESeq2 compares counts between sample groups for the same gene, it does not need to adjust for gene length. However, it is essential to account for variations in sequencing depth and RNA composition among samples. To normalise the data, DESeq2 utilises size factors, which correct for these variations in library sizes and RNA composition. +1. **Normalisation**: since DESeq2 compares counts between sample groups for the same gene, it does not need to adjust for gene length. However, it is essential to account for variations in sequencing depth and RNA composition among samples. To normalise the data, DESeq2 utilises size factors, which correct for these variations in library sizes and RNA composition. The size factors are calculated using the **median ratio** method: @@ -106,34 +102,32 @@ The size factors are calculated using the **median ratio** method: While normalised counts are useful for downstream visualisation of results, they should not be used as input for DESeq2. Instead, DESeq2 requires count data in the form of a matrix of integer values. -2) **Estimate dispersion and gene-wise dispersion**: the dispersion is a measure of how much the variance deviates from the mean. The estimation of dispersion is essential to model the variance of the count data. Importantly, RNA-seq data are characterised by **overdispersion**, where the variance in gene expression levels often exceeds the mean (variance > mean). - +2. **Estimate dispersion and gene-wise dispersion**: the dispersion is a measure of how much the variance deviates from the mean. The estimation of dispersion is essential to model the variance of the count data. Importantly, RNA-seq data are characterised by **overdispersion**, where the variance in gene expression levels often exceeds the mean (variance > mean). ![RNAseq](./img/overdispersion.png) -DESeq2 addresses this issue by employing the **negative binomial distribution**, which generalises the Poisson distribution by introducing an additional dispersion parameter. This parameter quantifies the extra variability present in RNA-seq data, providing a more realistic representation than the Poisson distribution, which assumes equal mean and variance. DESeq2 starts by estimating the **common dispersion**, a single estimate of dispersion applicable to all genes in the dataset. This estimate provides a baseline for variability across all genes in the dataset. Next, DESeq2 estimates **gene-wise dispersion**, a separate estimate of dispersion for each individual gene, taking into account that different genes may exhibit varying levels of expression variability due to biological differences. +DESeq2 addresses this issue by employing the **negative binomial distribution**, which generalises the Poisson distribution by introducing an additional dispersion parameter. This parameter quantifies the extra variability present in RNA-seq data, providing a more realistic representation than the Poisson distribution, which assumes equal mean and variance. DESeq2 starts by estimating the **common dispersion**, a single estimate of dispersion applicable to all genes in the dataset. This estimate provides a baseline for variability across all genes in the dataset. Next, DESeq2 estimates **gene-wise dispersion**, a separate estimate of dispersion for each individual gene, taking into account that different genes may exhibit varying levels of expression variability due to biological differences. The dispersion parameter (α) is related to the mean (μ) and variance of the data, as described by the equation: `Var = μ + α ⋅ μ²` A key feature of DESeq2's dispersion estimates is their negative correlation with the mean and positive correlation with variance. Genes with low expression have higher dispersion values, while genes with high expression tend to have lower dispersion. Additionally, genes sharing similar mean expression levels can exhibit different dispersion estimates based on their variance. To improve the accuracy of dispersion estimates, DESeq2 assumes that genes with similar expression profiles share similar dispersion patterns and leverages this information to refine the estimates. -3) **Mean-dispersion relationship**: this process, known as dispersion fitting, models the relationship between the mean expression level of a gene and its dispersion. In this process, DESeq2 identifies a trend in the dispersion estimates across genes. The fitted curve, typically a smooth curve, describes how dispersion changes as a function of the mean expression level. +3. **Mean-dispersion relationship**: this process, known as dispersion fitting, models the relationship between the mean expression level of a gene and its dispersion. In this process, DESeq2 identifies a trend in the dispersion estimates across genes. The fitted curve, typically a smooth curve, describes how dispersion changes as a function of the mean expression level. -4) **Final dispersion estimates**: DESeq2 refines the gene-wise dispersion by shrinking it towards the fitted curve. The "shrinkage" helps control for overfitting, particularly in genes with low counts or few replicates, and makes the dispersion estimates more reliable. However, genes with exceptionally high dispersion values are not shrunk because they likely deviate from the modelling assumptions, exhibiting elevated variability due to biological or technical factors. Shrinking these values could lead to false positives. +4. **Final dispersion estimates**: DESeq2 refines the gene-wise dispersion by shrinking it towards the fitted curve. The "shrinkage" helps control for overfitting, particularly in genes with low counts or few replicates, and makes the dispersion estimates more reliable. However, genes with exceptionally high dispersion values are not shrunk because they likely deviate from the modelling assumptions, exhibiting elevated variability due to biological or technical factors. Shrinking these values could lead to false positives. -5) **Fitting model and testing**: the initial step in hypothesis testing involves defining a **null hypothesis** for each gene. The null hypothesis states that there is no difference in expression between the tested sample groups (LFC == 0). Next, DESeq applies a statistical test to assess whether the null hypothesis is true. -DESeq2 fits a **generalised linear model (GLM)** to the normalised counts using the calculated size factors and final dispersion estimates. A GLM is a flexible extension of linear regression that models the relationship between a response variable (normalised counts) and predictors (e.g., condition). -By using a **negative binomial distribution**, DESeq2's GLM can handle the additional variability in gene expression counts that often occurs in RNA-seq data.Once the model is fit, coefficients are estimated for each sample group along with their standard errors. These coefficients represent the estimated log2 fold changes between groups and serve as the basis for hypothesis testing, using either a **Wald test** or a **Likelihood Ratio Test (LRT)**, depending on the experimental design: +5. **Fitting model and testing**: the initial step in hypothesis testing involves defining a **null hypothesis** for each gene. The null hypothesis states that there is no difference in expression between the tested sample groups (LFC == 0). Next, DESeq applies a statistical test to assess whether the null hypothesis is true. + DESeq2 fits a **generalised linear model (GLM)** to the normalised counts using the calculated size factors and final dispersion estimates. A GLM is a flexible extension of linear regression that models the relationship between a response variable (normalised counts) and predictors (e.g., condition). + By using a **negative binomial distribution**, DESeq2's GLM can handle the additional variability in gene expression counts that often occurs in RNA-seq data.Once the model is fit, coefficients are estimated for each sample group along with their standard errors. These coefficients represent the estimated log2 fold changes between groups and serve as the basis for hypothesis testing, using either a **Wald test** or a **Likelihood Ratio Test (LRT)**, depending on the experimental design: - **Wald Test**: The Wald test is ideal for simpler experimental designs, such as comparing two conditions (e.g., treated vs untreated). It tests whether the estimated effect size (log2 fold change) of a predictor variable (like treatment) is significantly different from zero. This test provides direct estimates of fold changes with associated p-values, making it computationally efficient for straightforward comparisons. - - **Likelihood Ratio Test (LRT)**: The LRT is more suitable for complex experimental designs involving multiple variables (e.g., time points, treatments and batches). It compares the fit of two nested models: one with the factor of interest (e.g., treatment) and one without it to assess whether including that factor significantly improves model fit. This approach allows DESeq2 to account for confounding variables and to isolate the effect of specific variables on gene expression, offering flexibility for multi-factor analyses. -When performing multiple tests, such as in the case of RNA-seq data where thousands of genes are tested, the risk of false positives increases. To account for this, DESeq2 employs multiple test correction methods (the Benjamini-Hochberg procedure is the default) to adjust the p-values and control the false discovery rate (FDR). +When performing multiple tests, such as in the case of RNA-seq data where thousands of genes are tested, the risk of false positives increases. To account for this, DESeq2 employs multiple test correction methods (the Benjamini-Hochberg procedure is the default) to adjust the p-values and control the false discovery rate (FDR). !!! note - The FDR is the expected proportion of false positives among the identified significant results. For example, by setting the FDR cutoff to < 0.05, 5% of genes identified as differentially expressed are expected to be false positives. For instance, if 400 genes are identified as differentially expressed with an FDR cutoff of 0.05, you would expect 20 of them to be false positives. + The FDR is the expected proportion of false positives among the identified significant results. For example, by setting the FDR cutoff to < 0.05, 5% of genes identified as differentially expressed are expected to be false positives. For instance, if 400 genes are identified as differentially expressed with an FDR cutoff of 0.05, you would expect 20 of them to be false positives. After identifying DE genes using DESeq2, it is essential to interpret the biological significance of these genes through functional analysis. This involves examining the roles of the differentially expressed genes in various biological processes, molecular functions and pathways, providing insights into the underlying mechanisms driving the observed changes in gene expression. This interpretation can help in discovering pathways involved in disease or identifying potential therapeutic targets. Different tools are available to carry out these functional analyses, such as [Gene Ontology](https://geneontology.org), [Reactome](https://reactome.org/), [KEGG](https://www.genome.jp/kegg), [clusterProfiler](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html), [g:Profiler](https://biit.cs.ut.ee/gprofiler) and [WikiPathways](https://www.wikipathways.org).