vanheeringen-lab · siebrenf · Sep 15, 2023 · Sep 13, 2023 · Sep 13, 2023 · Sep 14, 2023
@@ -9,6 +9,12 @@ All changed fall under either one of these types: `Added`, `Changed`, `Deprecate
 
 ## [Unreleased]
 
+### Changed
+
+- DESeq2 now uses more samples to estimate count dispersions
+  - all samples with a label in the condition column are used
+  - (this feature was previously dependent on a batch effect correction in the contrast design)
+
 ## [1.1.0] - 2023-09-13
 
 ### Added

@@ -1,24 +1,26 @@
 ## Differential gene/peak analysis
-Seq2science outputs counts matrices for each assembly in any bulk-sequencing workflow, which can optionally can be used for differential gene/peak analysis with [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8).
-To do so, add one or more contrast design(s) in the `config.yaml` file, with matching identifiers in the `samples.tsv` file.
+In a differential gene analysis, two groups of samples are compared in order to determine which genes are significantly over- and underexpressed in the target group, compared to the reference group.
+The technique is made for RNA-seq, but can be applied to ATAC-/ChIP-seq peaks as well.
 
-After completing the workflow, rerunning Seq2science with new contrasts will only perform the new analyses.
+Seq2science outputs counts matrices for each assembly in any bulk-sequencing workflow, which can can be used for differential gene/peak analysis with [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8).
+To perform a differential analysis, add one or more contrast design(s) in the `config.yaml` file, with matching identifiers in the `samples.tsv` file.
+This page details how the contrast designs are created.
+Examples of design contrasts are given [below](./DESeq2.html#contrast-designs), and each `config.yaml` contains a commented-out example contrast design at the bottom.
+
+After completing a workflow, rerunning Seq2science with new contrasts will only perform the new analyses.
 Alternatively, you can call the DESeq2 script from the command line.
 To get started with the command line script, try out `deseq2science --help`.
 
-This section details how the contrast designs are created.
-Examples are given [below](./DESeq2.html#contrast-designs), and each `config.yaml` contains a commented-out example contrast design at the bottom.
-
 ##### single-cell DESeq2
 If your counts table contains all cells, and your samples.tsv contains a group identifier for each cell, you can perform DESeq2 on this data using `deseq2science` by passing the `--single-cell` flag. 
 Please note that `deseq2science` can accept a gzipped tsv file as well.
 
-Also note that the `--single-cell` flag should *not* be used with pseudo-bulk.
+Also note that the `--single-cell` flag should *not* be used with pseudo-bulk counts.
 
 ##### Overview of the DESeq2 method
-DESeq2 automatically performs library bias correction when loading your data, and batch correction is performed if it is included in the contrast design.
-After calculating differentially expressed genes/peaks, a multiple testing procedure is applied, which is either the Benjamini-Hochberg procedure (the default) or Independent Hypothesis Weighing.
-The False Discovery Rate cutoff is set by alpha, which is 0.1 by default.
+DESeq2 accepts raw count data, automatically performing normalization, and batch correction if included in the contrast design.
+After calculating differentially expressed genes/peaks, a multiple testing procedure is applied: either the Benjamini-Hochberg procedure (the default) or Independent Hypothesis Weighing.
+The False Discovery Rate is set by the variable `alpha`, which is 0.1 by default.
 Finally, count values are log transformed and shrunk (by default using the apeglm method).
 These defaults can be changed in the [config.yaml](./schemas.html#deseq2), under the `deseq2` variables using the `multiple_testing_procedure`, `alpha_value` and `shrinkage_estimator` options respectively.
 
@@ -27,17 +29,21 @@ For more information, check out the steps in this [vignette](https://www.biocond
 ##### Contrast designs
 The following section will guide you through the process of creating a DESeq2 contrast using only the samples.tsv and the config.yaml.
 
-Here are some definitions: a **contrast** design tells us which samples to compare. It contains three or four parts: a **condition** (a column name in the samples.tsv), and two **groups** (labels in this column).
-The first group will be the **target**, and the second group will be the **reference**.
-Additionally, a contrast can optionally contain a **batch** effect, which you want to correct for.
+Here are some definitions: a **contrast** design tells us which groups of samples to compare. 
+A contrast contains three or four parts: 
 
+  - an optional **batch effect correction** to apply (a column in the samples.tsv)
+  - a **condition** (a column in the samples.tsv)
+  - two **groups** (labels in the condition column)
+
+For each contrast design DESeq2 will determine which genes/peaks are differentially expressed in the first group (the **target** group), compared to the second group (**reference** group).
 To determine differentially expressed genes/accessible peaks, DESeq2 requires at least two samples per group.
 A design contrast therefore requires at least 2x2 samples.
 
 ###### Contrast in the samples.tsv
-In the `samples.tsv`, add a column for every property you wish to test in your samples.
-Next, add labels to the samples involved in the test. You can leave labels empty, or add labels and not use them.
-For example:
+In the `samples.tsv`, add a condition column for every comparison you wish to make.
+Next, add group labels to the samples involved in the test.
+Here are three examples:
 
 1. a column named 'conditions' with values 'wildtype' and 'knockout'.
 2. a column named 'stages' with values 1, 2 and 3.
@@ -50,65 +56,83 @@ For example:
 | sample_3 | hg38     | wildtype   | 2      | treatmentA |
 | sample_4 | hg38     | knockout   | 2      | treatmentA |
 | sample_5 | hg38     |            | 3      | treatmentB |
-| sample_6 | hg38     | unused     | 3      | treatmentB |
+| sample_6 | hg38     |            | 3      | treatmentB |
 
 ###### Contrast in the config.yaml
 Next, in the `config.yaml` add one or more contrasts:
 
 In order to compare two groups, the contrast condition is the column name, followed by the target group and finally the reference group.
-For example, to compare stage 1 to stage 3 from the examples above, the contrast would be `stages_3_1`.
-
-To compare all groups against one reference, the contrast condition is the column name, followed by target group "all" and finally the reference group.
-For example, to compare all treatments to the control from the examples above, the contrast would be `treatments_all_control`
-(our gremlins will carefully unpack all of your input and pass only unique designs to DESeq2).
+For example: to determine which genes/peaks are differentially expressed/active at stage 3 compared to stage 1, the contrast would be `stages_3_1`.
 
 ```
-# contrasts for examples 1-3 in the config.yaml
 contrasts:
-  - 'conditions_knockout_wildtype'
-  - 'stages_3_1'
-  - 'treatments_all_control'
+  - stages_3_1
+  - conditions_knockout_wildtype
+  - treatments_all_control        # the 'all' keyword is explained below
 ```
 
-###### Group order
-As a rule of thumb: the reference group should be specified **last**.
+The contrast `stages_3_1` will compare the expression of the target group (`3`, containing sample_5 and sample_6) against the reference group (`1`, containing sample_1 and sample_2).
+
+The contrast `conditions_knockout_wildtype` will compare the expression of the target group (`knockout`, containing sample_2 and sample_4) against the reference group (`wildtype`, containing sample_1 and sample_3).
+
+###### Count dispersions
+For each design contrast, you can specify which samples should be used to estimate the count dispersions by giving them a label.
+This can be used to exclude failed samples from the analysis.
+In general, more samples improves the dispersion estimate.
+
+In contrast `stages_3_1` all samples will be used to estimate the count dispersions, including sample_3 and sample_4.
+
+In contrast `conditions_knockout_wildtype` sample_5 and sample_6 will **not** be used to estimate the count dispersions.
+
+###### The 'all' keyword
+To compare all groups in a column against the same reference group, the contrast condition is the column name, 
+followed by the `all` keyword and finally the reference group.
+For example: to compare all treatments to the control from the examples above, the contrast would be `treatments_all_control`
+(our gremlins will carefully unpack all of your input and pass only unique designs to DESeq2).
+
+###### Condition group order
+As a rule of thumb: the reference group should be specified last.
 
 The order of the groups in a design contrast determines the direction of the expression log fold change.
 In the example `stages_3_1`, a gene upregulated at stage 3 has a *positive* expression fold change.
 In contrast `stages_1_3`, a gene upregulated at stage 3 has a *negative* expression fold change.
 Other values are unaffected.
 
-###### Batches
-If your data has batch effects, DESeq2 can try to account for these. 
-Note that this can only be done if the batch effects are spread over different conditions.
+###### Batch effect correction
+If your data has batch effects, DESeq2 can try to account for these.
 
 Similar to conditions, add a column to samples.tsv and note the batch of each sample:
 
-| sample   | sequencing_month | conditions |
-|----------|------------------|------------|
-| sample_1 | jan              | control    |
-| sample_2 | feb              | control    |
-| sample_3 | jan              | treatment  |
-| sample_4 | feb              | treatment  |
+| sample   | researcher | treatments |
+|----------|------------|------------|
+| sample_1 | jane       | control    |
+| sample_2 | jane       | control    |
+| sample_3 | jane       | treatmentA |
+| sample_4 | jane       | treatmentA |
+| sample_5 | bob        | treatmentB |
+| sample_6 | jane       | treatmentB |
+
+In this example, `jane` was sick one day and asked `bob` to perform one of her experiments. 
+This may have caused a batch effect.
+To correct for a batch, add the batch column, and a plus sign, in front of the regular contrast:
 
-In this example, the `sequencing_month` may have caused a batch effect.
-Since the (potential) batch effect is spread over the test conditions (control vs treatment) it can be taken into account during DE analysis.
-To correct for a batch, add the column and a plus sign in from of the regular contrast.
 ```
 contrasts:
-  - 'conditions_treatment_control'
-  - 'sequencing_month + conditions_treatment_control'
+  - treatments_treatmentB_control                 # no batch correction
+  - researcher + treatments_treatmentB_control    # batch correction
 ```
 
+Note: if `bob` had performed both `treatmentB` experiments, it would be impossible to tell if the output is caused by `treatmentB` or by `bob`,
+but because the batch effect is divided over the test conditions it can be taken into account during DE analysis.
+
 ##### Output
 For each contrast design, the list of *all* genes/peaks is saved to file, with analysis results for expressed genes. Briefly, these include:
 - The column `padj` contains the adjusted p-values after multiple testing. **(These should be used to identify DE genes/peaks)**.
-- The column `log2FoldChange` contains the fold change of each gene between the two conditions. (The reference group is the one _last mentioned_ in the contrast design, so use `condition_treatment_control`. If you use `condition_control_treatment` the fold change is _inverted_.)
-- Several other columns were kept for sake of completion, such as column `pvalue`, which contains p-values not adjusted for multiple testing.
-
-In addition, MA and PCA plots are generated for each contrast design.
+- The column `log2FoldChange` contains the fold change of each gene between the two conditions.
+- Several other columns were kept for the sake of completion, such as the uncorrected `pvalue`.
 
-If the design includes a batch effect, several PCA plots are generated to visualize the effect of the batch correction.
+In addition, MA and PCA plots are generated for each contrast design (stored in the qc folder).
 
-DESeq2 models the batch effect in their package, but downstream methods may not.
-For this reason, seq2science will produce a batch-corrected counts matrix (and a batch corrected TPM matrix if quantified with Salmon).
+If a contrast design includes a batch effect, several PCA plots are generated to visualize the effect of the batch correction.
+DESeq2 corrects the batch effect internally, but downstream methods may not.
+Therefore, Seq2science will produce batch-corrected count and TPM matrices as well.
@@ -34,29 +34,28 @@ samples <- parse_samples(samples_file, assembly, replicates)
 coldata <- samples
 coldata[,"condition"] <- coldata[condition]
 coldata[,"batch"]     <- if (!is.na(batch)) { coldata[batch] } else { NA }
-coldata <- coldata[c("condition", "batch")]
+coldata               <- coldata[c("condition", "batch")]
 
 # log involved samples
 cat('samples per group:\n')
-cat('  - target    (', groups[1],'):', rownames(coldata[coldata$condition %in% groups[1],]),'\n')
-cat('  - reference (', groups[2],'):', rownames(coldata[coldata$condition %in% groups[2],]),'\n\n')
+cat('  - target    (', condition, groups[1],'): ', rownames(coldata[coldata$condition %in% groups[1],]),'\n')
+cat('  - reference (', condition, groups[2],'): ', rownames(coldata[coldata$condition %in% groups[2],]),'\n\n')
 
 # determine if we need to run batch correction on the whole assembly
 output_batch_corr_counts <- sub(paste0(contrast, ".diffexp.tsv"), paste0(batch, ".batch_corr_counts.tsv"), output, fixed=TRUE)
 output_batch_corr_pca    <- sub(paste0(contrast, ".diffexp.tsv"), paste0(batch, ".batch_corr_pca.png"), output, fixed=TRUE)
 output_batch_corr_tpm    <- sub(paste0(contrast, ".diffexp.tsv"), paste0(batch, ".batch_corr_tpm.tsv"), output, fixed=TRUE)
-# not required if: no batch, too much data, or already done
-no_batch_correction_required <- is.na(batch) | single_cell | (file.exists(output_batch_corr_counts) & file.exists(output_batch_corr_pca) & (!salmon | file.exists(output_batch_corr_tpm)))
 
-# filter out unused conditions & order data for DESeq
-if (no_batch_correction_required) {
-  coldata <- coldata[coldata$condition %in% c(groups[1], groups[2]),]
-} else {
-  cat('\nbatch correction dataset selected\n\n')
-  # for batch corrected counts we want all samples marked in the batch column
-  coldata <- coldata[!is.na(coldata$batch),]
-  if (any(is.na(coldata$condition))) {
-    cat('Error: all samples marked for batch correction (in column "', batch,'") must have a condition (in column "', condition,'").')
+# filter unused samples & order data for DESeq
+coldata <- coldata[!is.na(coldata$condition),]
+cat('Using all (', length(rownames(coldata[!is.na(coldata$condition), ])) ,') ', sep = "")
+cat('samples labelled in the condition column ("', condition, '") ', sep = "")
+cat('to calculate the dispersions:', rownames(coldata[!is.na(coldata$condition), ]), '\n\n')
+if (!is.na(batch)) {
+  if (any(is.na(coldata$batch))) {
+    cat('Error: all samples labelled in the condition column ("', condition, '") ', sep = "")
+    cat('need a label in the batch column ("', batch,'").\n', sep = "")
+    cat('Unlabelled samples:', rownames(coldata[is.na(coldata$batch), ]), '\n\n')
     quit(save = "no" , status = 1)
   }
 }