From 5b80961776903872eea7fe7c5c081a12821c37fb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Siebren=20Fr=C3=B6lich?= <48289046+siebrenf@users.noreply.github.com> Date: Wed, 13 Sep 2023 16:29:58 +0200 Subject: [PATCH 1/6] Update DESeq2.md --- docs/content/DESeq2.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/content/DESeq2.md b/docs/content/DESeq2.md index 4941b7666..b52a7b396 100644 --- a/docs/content/DESeq2.md +++ b/docs/content/DESeq2.md @@ -6,14 +6,14 @@ After completing the workflow, rerunning Seq2science with new contrasts will onl Alternatively, you can call the DESeq2 script from the command line. To get started with the command line script, try out `deseq2science --help`. -This section details how the contrast designs are created. +This page details how the contrast designs are created. Examples are given [below](./DESeq2.html#contrast-designs), and each `config.yaml` contains a commented-out example contrast design at the bottom. ##### single-cell DESeq2 If your counts table contains all cells, and your samples.tsv contains a group identifier for each cell, you can perform DESeq2 on this data using `deseq2science` by passing the `--single-cell` flag. Please note that `deseq2science` can accept a gzipped tsv file as well. -Also note that the `--single-cell` flag should *not* be used with pseudo-bulk. +Also note that the `--single-cell` flag should *not* be used with pseudo-bulk counts. ##### Overview of the DESeq2 method DESeq2 automatically performs library bias correction when loading your data, and batch correction is performed if it is included in the contrast design. From 4b584c3c00d62b9df2a2a44cd6ec45cecd36b0a2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Siebren=20Fr=C3=B6lich?= <48289046+siebrenf@users.noreply.github.com> Date: Wed, 13 Sep 2023 17:17:46 +0200 Subject: [PATCH 2/6] restructure --- docs/content/DESeq2.md | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/docs/content/DESeq2.md b/docs/content/DESeq2.md index b52a7b396..357d092a6 100644 --- a/docs/content/DESeq2.md +++ b/docs/content/DESeq2.md @@ -16,9 +16,9 @@ Please note that `deseq2science` can accept a gzipped tsv file as well. Also note that the `--single-cell` flag should *not* be used with pseudo-bulk counts. ##### Overview of the DESeq2 method -DESeq2 automatically performs library bias correction when loading your data, and batch correction is performed if it is included in the contrast design. -After calculating differentially expressed genes/peaks, a multiple testing procedure is applied, which is either the Benjamini-Hochberg procedure (the default) or Independent Hypothesis Weighing. -The False Discovery Rate cutoff is set by alpha, which is 0.1 by default. +DESeq2 accepts raw count data, automatically performing library bias correction and batch correction (if included in the contrast design). +After calculating differentially expressed genes/peaks, a multiple testing procedure is applied: either the Benjamini-Hochberg procedure (the default) or Independent Hypothesis Weighing. +The False Discovery Rate is set by the variable `alpha`, which is 0.1 by default. Finally, count values are log transformed and shrunk (by default using the apeglm method). These defaults can be changed in the [config.yaml](./schemas.html#deseq2), under the `deseq2` variables using the `multiple_testing_procedure`, `alpha_value` and `shrinkage_estimator` options respectively. @@ -27,16 +27,19 @@ For more information, check out the steps in this [vignette](https://www.biocond ##### Contrast designs The following section will guide you through the process of creating a DESeq2 contrast using only the samples.tsv and the config.yaml. -Here are some definitions: a **contrast** design tells us which samples to compare. It contains three or four parts: a **condition** (a column name in the samples.tsv), and two **groups** (labels in this column). -The first group will be the **target**, and the second group will be the **reference**. -Additionally, a contrast can optionally contain a **batch** effect, which you want to correct for. +Here are some definitions: a **contrast** design tells us which samples to compare, and how. +A contrast contains three or four parts: + - an optional **batch** effect to correct for (a column name in the samples.tsv) + - a **condition** (a column name in the samples.tsv) + - two **groups** (labels in the condition column) +For each contrast, DESeq2 will determine which genes/peaks are differentially expressed in the first group (the **target** group), compared to the second group (**reference** group). To determine differentially expressed genes/accessible peaks, DESeq2 requires at least two samples per group. A design contrast therefore requires at least 2x2 samples. ###### Contrast in the samples.tsv -In the `samples.tsv`, add a column for every property you wish to test in your samples. -Next, add labels to the samples involved in the test. You can leave labels empty, or add labels and not use them. +In the `samples.tsv`, add a condition column for every comparisson you wish to make. +Next, add group labels to the samples involved in the test. You can leave labels empty, or add labels and not use them. For example: 1. a column named 'conditions' with values 'wildtype' and 'knockout'. @@ -56,7 +59,7 @@ For example: Next, in the `config.yaml` add one or more contrasts: In order to compare two groups, the contrast condition is the column name, followed by the target group and finally the reference group. -For example, to compare stage 1 to stage 3 from the examples above, the contrast would be `stages_3_1`. +For example, to determine which genes/peaks are differentially expressed/active at stage 3 compared to stage 1, the contrast would be `stages_3_1`. To compare all groups against one reference, the contrast condition is the column name, followed by target group "all" and finally the reference group. For example, to compare all treatments to the control from the examples above, the contrast would be `treatments_all_control` @@ -103,7 +106,7 @@ contrasts: ##### Output For each contrast design, the list of *all* genes/peaks is saved to file, with analysis results for expressed genes. Briefly, these include: - The column `padj` contains the adjusted p-values after multiple testing. **(These should be used to identify DE genes/peaks)**. -- The column `log2FoldChange` contains the fold change of each gene between the two conditions. (The reference group is the one _last mentioned_ in the contrast design, so use `condition_treatment_control`. If you use `condition_control_treatment` the fold change is _inverted_.) +- The column `log2FoldChange` contains the fold change of each gene between the two conditions. - Several other columns were kept for sake of completion, such as column `pvalue`, which contains p-values not adjusted for multiple testing. In addition, MA and PCA plots are generated for each contrast design. From 7214f34b9ebfc54c5382bed4bf4f879a585e1d28 Mon Sep 17 00:00:00 2001 From: siebrenf Date: Thu, 14 Sep 2023 13:58:28 +0200 Subject: [PATCH 3/6] better count dispersion estimates --- CHANGELOG.md | 6 ++++++ seq2science/scripts/deseq2/deseq2.R | 27 +++++++++++++-------------- 2 files changed, 19 insertions(+), 14 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index f4a1b457a..72812d05f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,12 @@ All changed fall under either one of these types: `Added`, `Changed`, `Deprecate ## [Unreleased] +### Changed + +- DESeq2 filtering now uses more samples in the count matrix to estimate the dispersions + - all samples with a label in the condition column are used + - (this feature was previously dependent on a batch effect correction in the contrast design) + ## [1.1.0] - 2023-09-13 ### Added diff --git a/seq2science/scripts/deseq2/deseq2.R b/seq2science/scripts/deseq2/deseq2.R index ae33be4d6..219a26f79 100644 --- a/seq2science/scripts/deseq2/deseq2.R +++ b/seq2science/scripts/deseq2/deseq2.R @@ -34,29 +34,28 @@ samples <- parse_samples(samples_file, assembly, replicates) coldata <- samples coldata[,"condition"] <- coldata[condition] coldata[,"batch"] <- if (!is.na(batch)) { coldata[batch] } else { NA } -coldata <- coldata[c("condition", "batch")] +coldata <- coldata[c("condition", "batch")] # log involved samples cat('samples per group:\n') -cat(' - target (', groups[1],'):', rownames(coldata[coldata$condition %in% groups[1],]),'\n') -cat(' - reference (', groups[2],'):', rownames(coldata[coldata$condition %in% groups[2],]),'\n\n') +cat(' - target (', condition, groups[1],'): ', rownames(coldata[coldata$condition %in% groups[1],]),'\n') +cat(' - reference (', condition, groups[2],'): ', rownames(coldata[coldata$condition %in% groups[2],]),'\n\n') # determine if we need to run batch correction on the whole assembly output_batch_corr_counts <- sub(paste0(contrast, ".diffexp.tsv"), paste0(batch, ".batch_corr_counts.tsv"), output, fixed=TRUE) output_batch_corr_pca <- sub(paste0(contrast, ".diffexp.tsv"), paste0(batch, ".batch_corr_pca.png"), output, fixed=TRUE) output_batch_corr_tpm <- sub(paste0(contrast, ".diffexp.tsv"), paste0(batch, ".batch_corr_tpm.tsv"), output, fixed=TRUE) -# not required if: no batch, too much data, or already done -no_batch_correction_required <- is.na(batch) | single_cell | (file.exists(output_batch_corr_counts) & file.exists(output_batch_corr_pca) & (!salmon | file.exists(output_batch_corr_tpm))) -# filter out unused conditions & order data for DESeq -if (no_batch_correction_required) { - coldata <- coldata[coldata$condition %in% c(groups[1], groups[2]),] -} else { - cat('\nbatch correction dataset selected\n\n') - # for batch corrected counts we want all samples marked in the batch column - coldata <- coldata[!is.na(coldata$batch),] - if (any(is.na(coldata$condition))) { - cat('Error: all samples marked for batch correction (in column "', batch,'") must have a condition (in column "', condition,'").') +# filter unused samples & order data for DESeq +coldata <- coldata[!is.na(coldata$condition),] +cat('Using all (', length(rownames(coldata[!is.na(coldata$condition), ])) ,') ', sep = "") +cat('samples labelled in the condition column ("', condition, '") ', sep = "") +cat('to calculate the dispersions:', rownames(coldata[!is.na(coldata$condition), ]), '\n\n') +if (!is.na(batch)) { + if (any(is.na(coldata$batch))) { + cat('Error: all samples labelled in the condition column ("', condition, '") ', sep = "") + cat('need a label in the batch column ("', batch,'").\n', sep = "") + cat('Unlabelled samples:', rownames(coldata[is.na(coldata$batch), ]), '\n\n') quit(save = "no" , status = 1) } } From 1a3b2cf47c1caf91110e00869c97cd0fd62802b6 Mon Sep 17 00:00:00 2001 From: siebrenf Date: Thu, 14 Sep 2023 14:00:44 +0200 Subject: [PATCH 4/6] fix changelog message --- CHANGELOG.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 72812d05f..f7dd43f51 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,7 +11,7 @@ All changed fall under either one of these types: `Added`, `Changed`, `Deprecate ### Changed -- DESeq2 filtering now uses more samples in the count matrix to estimate the dispersions +- DESeq2 now uses more samples to estimate count dispersions - all samples with a label in the condition column are used - (this feature was previously dependent on a batch effect correction in the contrast design) From 8038cbaecb685e8df503feda33aadf27e672aaac Mon Sep 17 00:00:00 2001 From: siebrenf Date: Thu, 14 Sep 2023 15:26:54 +0200 Subject: [PATCH 5/6] explain dispersion estimates + simpler language elsewhere --- docs/content/DESeq2.md | 113 ++++++++++++++++++++++++----------------- 1 file changed, 67 insertions(+), 46 deletions(-) diff --git a/docs/content/DESeq2.md b/docs/content/DESeq2.md index 357d092a6..72f947c8d 100644 --- a/docs/content/DESeq2.md +++ b/docs/content/DESeq2.md @@ -1,14 +1,16 @@ ## Differential gene/peak analysis -Seq2science outputs counts matrices for each assembly in any bulk-sequencing workflow, which can optionally can be used for differential gene/peak analysis with [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8). -To do so, add one or more contrast design(s) in the `config.yaml` file, with matching identifiers in the `samples.tsv` file. +In a differential gene analysis, two groups of samples are compared in order to determine which genes are significantly over- and underexpressed in the target group, compared to the reference group. +The technique is made for RNA-seq, but can be applied to ATAC-/ChIP-seq peaks as well. -After completing the workflow, rerunning Seq2science with new contrasts will only perform the new analyses. +Seq2science outputs counts matrices for each assembly in any bulk-sequencing workflow, which can can be used for differential gene/peak analysis with [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8). +To perform a differential analysis, add one or more contrast design(s) in the `config.yaml` file, with matching identifiers in the `samples.tsv` file. +This page details how the contrast designs are created. +Examples of design contrasts are given [below](./DESeq2.html#contrast-designs), and each `config.yaml` contains a commented-out example contrast design at the bottom. + +After completing a workflow, rerunning Seq2science with new contrasts will only perform the new analyses. Alternatively, you can call the DESeq2 script from the command line. To get started with the command line script, try out `deseq2science --help`. -This page details how the contrast designs are created. -Examples are given [below](./DESeq2.html#contrast-designs), and each `config.yaml` contains a commented-out example contrast design at the bottom. - ##### single-cell DESeq2 If your counts table contains all cells, and your samples.tsv contains a group identifier for each cell, you can perform DESeq2 on this data using `deseq2science` by passing the `--single-cell` flag. Please note that `deseq2science` can accept a gzipped tsv file as well. @@ -16,7 +18,7 @@ Please note that `deseq2science` can accept a gzipped tsv file as well. Also note that the `--single-cell` flag should *not* be used with pseudo-bulk counts. ##### Overview of the DESeq2 method -DESeq2 accepts raw count data, automatically performing library bias correction and batch correction (if included in the contrast design). +DESeq2 accepts raw count data, automatically performing normalization, and batch correction if included in the contrast design. After calculating differentially expressed genes/peaks, a multiple testing procedure is applied: either the Benjamini-Hochberg procedure (the default) or Independent Hypothesis Weighing. The False Discovery Rate is set by the variable `alpha`, which is 0.1 by default. Finally, count values are log transformed and shrunk (by default using the apeglm method). @@ -27,20 +29,21 @@ For more information, check out the steps in this [vignette](https://www.biocond ##### Contrast designs The following section will guide you through the process of creating a DESeq2 contrast using only the samples.tsv and the config.yaml. -Here are some definitions: a **contrast** design tells us which samples to compare, and how. +Here are some definitions: a **contrast** design tells us which groups of samples to compare. A contrast contains three or four parts: - - an optional **batch** effect to correct for (a column name in the samples.tsv) - - a **condition** (a column name in the samples.tsv) + + - an optional **batch effect correction** to apply (a column in the samples.tsv) + - a **condition** (a column in the samples.tsv) - two **groups** (labels in the condition column) -For each contrast, DESeq2 will determine which genes/peaks are differentially expressed in the first group (the **target** group), compared to the second group (**reference** group). +For each contrast design DESeq2 will determine which genes/peaks are differentially expressed in the first group (the **target** group), compared to the second group (**reference** group). To determine differentially expressed genes/accessible peaks, DESeq2 requires at least two samples per group. A design contrast therefore requires at least 2x2 samples. ###### Contrast in the samples.tsv -In the `samples.tsv`, add a condition column for every comparisson you wish to make. -Next, add group labels to the samples involved in the test. You can leave labels empty, or add labels and not use them. -For example: +In the `samples.tsv`, add a condition column for every comparison you wish to make. +Next, add group labels to the samples involved in the test. +Here are three examples: 1. a column named 'conditions' with values 'wildtype' and 'knockout'. 2. a column named 'stages' with values 1, 2 and 3. @@ -53,65 +56,83 @@ For example: | sample_3 | hg38 | wildtype | 2 | treatmentA | | sample_4 | hg38 | knockout | 2 | treatmentA | | sample_5 | hg38 | | 3 | treatmentB | -| sample_6 | hg38 | unused | 3 | treatmentB | +| sample_6 | hg38 | | 3 | treatmentB | ###### Contrast in the config.yaml Next, in the `config.yaml` add one or more contrasts: In order to compare two groups, the contrast condition is the column name, followed by the target group and finally the reference group. -For example, to determine which genes/peaks are differentially expressed/active at stage 3 compared to stage 1, the contrast would be `stages_3_1`. - -To compare all groups against one reference, the contrast condition is the column name, followed by target group "all" and finally the reference group. -For example, to compare all treatments to the control from the examples above, the contrast would be `treatments_all_control` -(our gremlins will carefully unpack all of your input and pass only unique designs to DESeq2). +For example: to determine which genes/peaks are differentially expressed/active at stage 3 compared to stage 1, the contrast would be `stages_3_1`. ``` -# contrasts for examples 1-3 in the config.yaml contrasts: - - 'conditions_knockout_wildtype' - - 'stages_3_1' - - 'treatments_all_control' + - stages_3_1 + - conditions_knockout_wildtype + - treatments_all_control # the 'all' keyword is explained below ``` -###### Group order -As a rule of thumb: the reference group should be specified **last**. +The contrast `stages_3_1` will compare the expression of the target group (`3`, containing sample_5 and sample_6) against the reference group (`1`, containing sample_1 and sample_2). + +The contrast `conditions_knockout_wildtype` will compare the expression of the target group (`knockout`, containing sample_2 and sample_4) against the reference group (`wildtype`, containing sample_1 and sample_3). + +###### Count dispersions +For each design contrast, you can specify which samples should be used to estimate the count dispersions by giving them a label. +This can be used to exclude failed samples from the analysis. +In general, more samples improves the dispersion estimate. + +In contrast `stages_3_1` all samples will be used to estimate the count dispersions, including sample_3 and sample_4. + +In contrast `conditions_knockout_wildtype` sample_5 and sample_6 will **not** be used to estimate the count dispersions. + +###### The 'all' keyword +To compare all groups in a column against the same reference group, the contrast condition is the column name, +followed by the `all` keyword and finally the reference group. +For example: to compare all treatments to the control from the examples above, the contrast would be `treatments_all_control` +(our gremlins will carefully unpack all of your input and pass only unique designs to DESeq2). + +###### Condition group order +As a rule of thumb: the reference group should be specified last. The order of the groups in a design contrast determines the direction of the expression log fold change. In the example `stages_3_1`, a gene upregulated at stage 3 has a *positive* expression fold change. In contrast `stages_1_3`, a gene upregulated at stage 3 has a *negative* expression fold change. Other values are unaffected. -###### Batches -If your data has batch effects, DESeq2 can try to account for these. -Note that this can only be done if the batch effects are spread over different conditions. +###### Batch effect correction +If your data has batch effects, DESeq2 can try to account for these. Similar to conditions, add a column to samples.tsv and note the batch of each sample: -| sample | sequencing_month | conditions | -|----------|------------------|------------| -| sample_1 | jan | control | -| sample_2 | feb | control | -| sample_3 | jan | treatment | -| sample_4 | feb | treatment | +| sample | researcher | treatments | +|----------|------------|------------| +| sample_1 | jane | control | +| sample_2 | jane | control | +| sample_3 | jane | treatmentA | +| sample_4 | jane | treatmentA | +| sample_5 | bob | treatmentB | +| sample_6 | jane | treatmentB | + +In this example, `jane` was sick one day and asked `bob` to perform one of her experiments. +This may have caused a batch effect. +To correct for a batch, add the batch column, and a plus sign, in front of the regular contrast: -In this example, the `sequencing_month` may have caused a batch effect. -Since the (potential) batch effect is spread over the test conditions (control vs treatment) it can be taken into account during DE analysis. -To correct for a batch, add the column and a plus sign in from of the regular contrast. ``` contrasts: - - 'conditions_treatment_control' - - 'sequencing_month + conditions_treatment_control' + - treatments_treatmentB_control # no batch correction + - researcher + treatments_treatmentB_control # batch correction ``` +Note: if `bob` had performed both `treatmentB` experiments, it would be impossible to tell if the output is caused by `treatmentB` or by `bob`, +but because the batch effect is divided over the test conditions it can be taken into account during DE analysis. + ##### Output For each contrast design, the list of *all* genes/peaks is saved to file, with analysis results for expressed genes. Briefly, these include: - The column `padj` contains the adjusted p-values after multiple testing. **(These should be used to identify DE genes/peaks)**. - The column `log2FoldChange` contains the fold change of each gene between the two conditions. -- Several other columns were kept for sake of completion, such as column `pvalue`, which contains p-values not adjusted for multiple testing. - -In addition, MA and PCA plots are generated for each contrast design. +- Several other columns were kept for the sake of completion, such as the uncorrected `pvalue`. -If the design includes a batch effect, several PCA plots are generated to visualize the effect of the batch correction. +In addition, MA and PCA plots are generated for each contrast design (stored in the qc folder). -DESeq2 models the batch effect in their package, but downstream methods may not. -For this reason, seq2science will produce a batch-corrected counts matrix (and a batch corrected TPM matrix if quantified with Salmon). +If a contrast design includes a batch effect, several PCA plots are generated to visualize the effect of the batch correction. +DESeq2 corrects the batch effect internally, but downstream methods may not. +Therefore, Seq2science will produce batch-corrected count and TPM matrices as well. From 7f934c8b269e36d23561f80bd612d9b9314e15d6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Siebren=20Fr=C3=B6lich?= <48289046+siebrenf@users.noreply.github.com> Date: Fri, 15 Sep 2023 10:30:49 +0200 Subject: [PATCH 6/6] feedback Maarten --- docs/content/DESeq2.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/content/DESeq2.md b/docs/content/DESeq2.md index 72f947c8d..11a148ba3 100644 --- a/docs/content/DESeq2.md +++ b/docs/content/DESeq2.md @@ -76,13 +76,13 @@ The contrast `stages_3_1` will compare the expression of the target group (`3`, The contrast `conditions_knockout_wildtype` will compare the expression of the target group (`knockout`, containing sample_2 and sample_4) against the reference group (`wildtype`, containing sample_1 and sample_3). ###### Count dispersions -For each design contrast, you can specify which samples should be used to estimate the count dispersions by giving them a label. +For each design contrast, you can specify which samples should be used to estimate the count dispersions by giving them a label in the contrast condition column. This can be used to exclude failed samples from the analysis. -In general, more samples improves the dispersion estimate. +In general, more samples improves the statistical power of the experiment. -In contrast `stages_3_1` all samples will be used to estimate the count dispersions, including sample_3 and sample_4. +In contrast `stages_3_1` all samples will be used to estimate the count dispersions, including sample_3 and sample_4 (because they have a label in the `stages` column). -In contrast `conditions_knockout_wildtype` sample_5 and sample_6 will **not** be used to estimate the count dispersions. +In contrast `conditions_knockout_wildtype` sample_5 and sample_6 will **not** be used to estimate the count dispersions (because they do not have a label in the `conditions` column). ###### The 'all' keyword To compare all groups in a column against the same reference group, the contrast condition is the column name,