Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DESeq2 feedback #1011

Merged
merged 7 commits into from
Sep 15, 2023
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,12 @@ All changed fall under either one of these types: `Added`, `Changed`, `Deprecate

## [Unreleased]

### Changed

- DESeq2 now uses more samples to estimate count dispersions
- all samples with a label in the condition column are used
- (this feature was previously dependent on a batch effect correction in the contrast design)

## [1.1.0] - 2023-09-13

### Added
Expand Down
122 changes: 73 additions & 49 deletions docs/content/DESeq2.md
Original file line number Diff line number Diff line change
@@ -1,24 +1,26 @@
## Differential gene/peak analysis
Seq2science outputs counts matrices for each assembly in any bulk-sequencing workflow, which can optionally can be used for differential gene/peak analysis with [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8).
To do so, add one or more contrast design(s) in the `config.yaml` file, with matching identifiers in the `samples.tsv` file.
In a differential gene analysis, two groups of samples are compared in order to determine which genes are significantly over- and underexpressed in the target group, compared to the reference group.
The technique is made for RNA-seq, but can be applied to ATAC-/ChIP-seq peaks as well.

After completing the workflow, rerunning Seq2science with new contrasts will only perform the new analyses.
Seq2science outputs counts matrices for each assembly in any bulk-sequencing workflow, which can can be used for differential gene/peak analysis with [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8).
To perform a differential analysis, add one or more contrast design(s) in the `config.yaml` file, with matching identifiers in the `samples.tsv` file.
This page details how the contrast designs are created.
Examples of design contrasts are given [below](./DESeq2.html#contrast-designs), and each `config.yaml` contains a commented-out example contrast design at the bottom.

After completing a workflow, rerunning Seq2science with new contrasts will only perform the new analyses.
Alternatively, you can call the DESeq2 script from the command line.
To get started with the command line script, try out `deseq2science --help`.

This section details how the contrast designs are created.
Examples are given [below](./DESeq2.html#contrast-designs), and each `config.yaml` contains a commented-out example contrast design at the bottom.

##### single-cell DESeq2
If your counts table contains all cells, and your samples.tsv contains a group identifier for each cell, you can perform DESeq2 on this data using `deseq2science` by passing the `--single-cell` flag.
Please note that `deseq2science` can accept a gzipped tsv file as well.

Also note that the `--single-cell` flag should *not* be used with pseudo-bulk.
Also note that the `--single-cell` flag should *not* be used with pseudo-bulk counts.

##### Overview of the DESeq2 method
DESeq2 automatically performs library bias correction when loading your data, and batch correction is performed if it is included in the contrast design.
After calculating differentially expressed genes/peaks, a multiple testing procedure is applied, which is either the Benjamini-Hochberg procedure (the default) or Independent Hypothesis Weighing.
The False Discovery Rate cutoff is set by alpha, which is 0.1 by default.
DESeq2 accepts raw count data, automatically performing normalization, and batch correction if included in the contrast design.
After calculating differentially expressed genes/peaks, a multiple testing procedure is applied: either the Benjamini-Hochberg procedure (the default) or Independent Hypothesis Weighing.
The False Discovery Rate is set by the variable `alpha`, which is 0.1 by default.
Finally, count values are log transformed and shrunk (by default using the apeglm method).
These defaults can be changed in the [config.yaml](./schemas.html#deseq2), under the `deseq2` variables using the `multiple_testing_procedure`, `alpha_value` and `shrinkage_estimator` options respectively.

Expand All @@ -27,17 +29,21 @@ For more information, check out the steps in this [vignette](https://www.biocond
##### Contrast designs
The following section will guide you through the process of creating a DESeq2 contrast using only the samples.tsv and the config.yaml.

Here are some definitions: a **contrast** design tells us which samples to compare. It contains three or four parts: a **condition** (a column name in the samples.tsv), and two **groups** (labels in this column).
The first group will be the **target**, and the second group will be the **reference**.
Additionally, a contrast can optionally contain a **batch** effect, which you want to correct for.
Here are some definitions: a **contrast** design tells us which groups of samples to compare.
A contrast contains three or four parts:

- an optional **batch effect correction** to apply (a column in the samples.tsv)
- a **condition** (a column in the samples.tsv)
- two **groups** (labels in the condition column)

For each contrast design DESeq2 will determine which genes/peaks are differentially expressed in the first group (the **target** group), compared to the second group (**reference** group).
To determine differentially expressed genes/accessible peaks, DESeq2 requires at least two samples per group.
A design contrast therefore requires at least 2x2 samples.

###### Contrast in the samples.tsv
In the `samples.tsv`, add a column for every property you wish to test in your samples.
Next, add labels to the samples involved in the test. You can leave labels empty, or add labels and not use them.
For example:
In the `samples.tsv`, add a condition column for every comparison you wish to make.
Next, add group labels to the samples involved in the test.
Here are three examples:

1. a column named 'conditions' with values 'wildtype' and 'knockout'.
2. a column named 'stages' with values 1, 2 and 3.
Expand All @@ -50,65 +56,83 @@ For example:
| sample_3 | hg38 | wildtype | 2 | treatmentA |
| sample_4 | hg38 | knockout | 2 | treatmentA |
| sample_5 | hg38 | | 3 | treatmentB |
| sample_6 | hg38 | unused | 3 | treatmentB |
| sample_6 | hg38 | | 3 | treatmentB |

###### Contrast in the config.yaml
Next, in the `config.yaml` add one or more contrasts:

In order to compare two groups, the contrast condition is the column name, followed by the target group and finally the reference group.
For example, to compare stage 1 to stage 3 from the examples above, the contrast would be `stages_3_1`.

To compare all groups against one reference, the contrast condition is the column name, followed by target group "all" and finally the reference group.
For example, to compare all treatments to the control from the examples above, the contrast would be `treatments_all_control`
(our gremlins will carefully unpack all of your input and pass only unique designs to DESeq2).
For example: to determine which genes/peaks are differentially expressed/active at stage 3 compared to stage 1, the contrast would be `stages_3_1`.

```
# contrasts for examples 1-3 in the config.yaml
contrasts:
- 'conditions_knockout_wildtype'
- 'stages_3_1'
- 'treatments_all_control'
- stages_3_1
- conditions_knockout_wildtype
- treatments_all_control # the 'all' keyword is explained below
```

###### Group order
As a rule of thumb: the reference group should be specified **last**.
The contrast `stages_3_1` will compare the expression of the target group (`3`, containing sample_5 and sample_6) against the reference group (`1`, containing sample_1 and sample_2).

The contrast `conditions_knockout_wildtype` will compare the expression of the target group (`knockout`, containing sample_2 and sample_4) against the reference group (`wildtype`, containing sample_1 and sample_3).

###### Count dispersions
For each design contrast, you can specify which samples should be used to estimate the count dispersions by giving them a label.
This can be used to exclude failed samples from the analysis.
In general, more samples improves the dispersion estimate.

In contrast `stages_3_1` all samples will be used to estimate the count dispersions, including sample_3 and sample_4.

In contrast `conditions_knockout_wildtype` sample_5 and sample_6 will **not** be used to estimate the count dispersions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sample 5 and 6 are not taken along because they are empty right? Perhaps explain that clearer.


###### The 'all' keyword
To compare all groups in a column against the same reference group, the contrast condition is the column name,
followed by the `all` keyword and finally the reference group.
For example: to compare all treatments to the control from the examples above, the contrast would be `treatments_all_control`
(our gremlins will carefully unpack all of your input and pass only unique designs to DESeq2).

###### Condition group order
As a rule of thumb: the reference group should be specified last.

The order of the groups in a design contrast determines the direction of the expression log fold change.
In the example `stages_3_1`, a gene upregulated at stage 3 has a *positive* expression fold change.
In contrast `stages_1_3`, a gene upregulated at stage 3 has a *negative* expression fold change.
Other values are unaffected.

###### Batches
If your data has batch effects, DESeq2 can try to account for these.
Note that this can only be done if the batch effects are spread over different conditions.
###### Batch effect correction
If your data has batch effects, DESeq2 can try to account for these.

Similar to conditions, add a column to samples.tsv and note the batch of each sample:

| sample | sequencing_month | conditions |
|----------|------------------|------------|
| sample_1 | jan | control |
| sample_2 | feb | control |
| sample_3 | jan | treatment |
| sample_4 | feb | treatment |
| sample | researcher | treatments |
|----------|------------|------------|
| sample_1 | jane | control |
| sample_2 | jane | control |
| sample_3 | jane | treatmentA |
| sample_4 | jane | treatmentA |
| sample_5 | bob | treatmentB |
| sample_6 | jane | treatmentB |

In this example, `jane` was sick one day and asked `bob` to perform one of her experiments.
This may have caused a batch effect.
To correct for a batch, add the batch column, and a plus sign, in front of the regular contrast:

In this example, the `sequencing_month` may have caused a batch effect.
Since the (potential) batch effect is spread over the test conditions (control vs treatment) it can be taken into account during DE analysis.
To correct for a batch, add the column and a plus sign in from of the regular contrast.
```
contrasts:
- 'conditions_treatment_control'
- 'sequencing_month + conditions_treatment_control'
- treatments_treatmentB_control # no batch correction
- researcher + treatments_treatmentB_control # batch correction
```

Note: if `bob` had performed both `treatmentB` experiments, it would be impossible to tell if the output is caused by `treatmentB` or by `bob`,
but because the batch effect is divided over the test conditions it can be taken into account during DE analysis.

##### Output
For each contrast design, the list of *all* genes/peaks is saved to file, with analysis results for expressed genes. Briefly, these include:
- The column `padj` contains the adjusted p-values after multiple testing. **(These should be used to identify DE genes/peaks)**.
- The column `log2FoldChange` contains the fold change of each gene between the two conditions. (The reference group is the one _last mentioned_ in the contrast design, so use `condition_treatment_control`. If you use `condition_control_treatment` the fold change is _inverted_.)
- Several other columns were kept for sake of completion, such as column `pvalue`, which contains p-values not adjusted for multiple testing.

In addition, MA and PCA plots are generated for each contrast design.
- The column `log2FoldChange` contains the fold change of each gene between the two conditions.
- Several other columns were kept for the sake of completion, such as the uncorrected `pvalue`.

If the design includes a batch effect, several PCA plots are generated to visualize the effect of the batch correction.
In addition, MA and PCA plots are generated for each contrast design (stored in the qc folder).

DESeq2 models the batch effect in their package, but downstream methods may not.
For this reason, seq2science will produce a batch-corrected counts matrix (and a batch corrected TPM matrix if quantified with Salmon).
If a contrast design includes a batch effect, several PCA plots are generated to visualize the effect of the batch correction.
DESeq2 corrects the batch effect internally, but downstream methods may not.
Therefore, Seq2science will produce batch-corrected count and TPM matrices as well.
27 changes: 13 additions & 14 deletions seq2science/scripts/deseq2/deseq2.R
Original file line number Diff line number Diff line change
Expand Up @@ -34,29 +34,28 @@ samples <- parse_samples(samples_file, assembly, replicates)
coldata <- samples
coldata[,"condition"] <- coldata[condition]
coldata[,"batch"] <- if (!is.na(batch)) { coldata[batch] } else { NA }
coldata <- coldata[c("condition", "batch")]
coldata <- coldata[c("condition", "batch")]

# log involved samples
cat('samples per group:\n')
cat(' - target (', groups[1],'):', rownames(coldata[coldata$condition %in% groups[1],]),'\n')
cat(' - reference (', groups[2],'):', rownames(coldata[coldata$condition %in% groups[2],]),'\n\n')
cat(' - target (', condition, groups[1],'): ', rownames(coldata[coldata$condition %in% groups[1],]),'\n')
cat(' - reference (', condition, groups[2],'): ', rownames(coldata[coldata$condition %in% groups[2],]),'\n\n')

# determine if we need to run batch correction on the whole assembly
output_batch_corr_counts <- sub(paste0(contrast, ".diffexp.tsv"), paste0(batch, ".batch_corr_counts.tsv"), output, fixed=TRUE)
output_batch_corr_pca <- sub(paste0(contrast, ".diffexp.tsv"), paste0(batch, ".batch_corr_pca.png"), output, fixed=TRUE)
output_batch_corr_tpm <- sub(paste0(contrast, ".diffexp.tsv"), paste0(batch, ".batch_corr_tpm.tsv"), output, fixed=TRUE)
# not required if: no batch, too much data, or already done
no_batch_correction_required <- is.na(batch) | single_cell | (file.exists(output_batch_corr_counts) & file.exists(output_batch_corr_pca) & (!salmon | file.exists(output_batch_corr_tpm)))

# filter out unused conditions & order data for DESeq
if (no_batch_correction_required) {
coldata <- coldata[coldata$condition %in% c(groups[1], groups[2]),]
} else {
cat('\nbatch correction dataset selected\n\n')
# for batch corrected counts we want all samples marked in the batch column
coldata <- coldata[!is.na(coldata$batch),]
if (any(is.na(coldata$condition))) {
cat('Error: all samples marked for batch correction (in column "', batch,'") must have a condition (in column "', condition,'").')
# filter unused samples & order data for DESeq
coldata <- coldata[!is.na(coldata$condition),]
cat('Using all (', length(rownames(coldata[!is.na(coldata$condition), ])) ,') ', sep = "")
cat('samples labelled in the condition column ("', condition, '") ', sep = "")
cat('to calculate the dispersions:', rownames(coldata[!is.na(coldata$condition), ]), '\n\n')
if (!is.na(batch)) {
if (any(is.na(coldata$batch))) {
cat('Error: all samples labelled in the condition column ("', condition, '") ', sep = "")
cat('need a label in the batch column ("', batch,'").\n', sep = "")
cat('Unlabelled samples:', rownames(coldata[is.na(coldata$batch), ]), '\n\n')
quit(save = "no" , status = 1)
}
}
Expand Down