Salmon --keepDuplicates by default #1259

adamrtalbot · 2024-03-12T10:43:32Z

Description of feature

By default, Salmon will drop transcripts with identical sequence, as described here: COMBINE-lab/salmon#214 (comment)

This behaviour should reduce unnecessary duplicate count values in the results matrices, but may be unexpected behaviour for a new user. If they supply some transcripts and one is missing, but is not clear why or which one.

Instead, we could add the flag --keepDuplicates for the Salmon index by default (discussed here). This retains all transcripts which is more predictable behaviour. We could add an additional flag (--salmonDropDuplicates) to disable this behaviour. Alternatively, we could make it opt-in, with a flag --salmonKeepDuplicates or a more generic version (--salmonIndexExtraArgs) to allow users to enable this feature when they require it.

@rob-p apologies for the unsolicited message could you help us understand the downside of keeping the duplicate transcripts? What do we lose when we enable --keepDuplicates?

Side note, this is a breaking behaviour so will need to be communicated to users.

The text was updated successfully, but these errors were encountered:

rob-p · 2024-03-12T14:08:37Z

Hi @adamrtalbot,

Thanks for pinging me here. I'm tagging @mikelove here too in case he has specific thoughts / input.

Basically, when we first added the feature to deduplicate reference sequences prior to index, we had a twitter poll, and the "deduplicate by default" category won. That's not a super scientific way to answer the question, but the reason we had the poll was because we could both think of good arguments pro and con. The biggest con argument is precisely what you state; to have references in your input set simply not appear in the output can be surprising to users, especially if you're not explicitly aware of the deduplication (which we document, but don't constantly shout at users about). On the pro side, many users were actually legitimately surprised that duplicates exist and, also, precisely how many such duplicates there are in common annotations. Most of those we observed should clearly be considered artifactual (e.g. annotations matching those on a reference chromosome, but appearing on a patch contig in a region with no variation).

Further, from the perspective of quantification, sequence-level duplicates are a priori indistinguishable — they are the same transcript with a different label. Therefore, their assigned abundance should always be considered equal. This was the impetus behind writing them out in the duplicate_clusters.tsv file — if you want to assign/project abundance to these transcripts after the fact, you can simply collect each duplicate cluster, and equally divide the abundance estimated for the retained transcript. Granted, this becomes a bit more difficult if you are using an uncertainty-aware differential analysis tool like swish, which makes use of inferential replicates. However, in that case, I think one could simply divide each inferential sample by the number of duplicates and impute identical inferential replicates for each of these duplicates.

In short, the problem, IMO, is the existence of sequence-level duplicates themselves. They are basically meaningless from an inferential perspective. However, I don't think that either keeping them or discarding them as a default decision is much better than the other. Ultimately, the important thing is that the user knows about them and can decide intelligently and in a problem-specific way if they are important and, if so, how to deal with them.

--Rob

adamrtalbot · 2024-03-14T15:55:19Z

Thanks Rob, that's really helpful. I think we should allow users to set it based on how they want, ideally with the default remaining backwards compatible.

We're not exactly sure how best to implement this, but definitely one for the next major release.

adamrtalbot added the enhancement label Mar 12, 2024

adamrtalbot mentioned this issue Mar 12, 2024

nf-test quantify pseudoalignment #1246

Merged

13 tasks

drpatelh added this to the 3.15.0 milestone May 13, 2024

drpatelh modified the milestones: 3.15.0, 3.16.0 May 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Salmon --keepDuplicates by default #1259

Salmon --keepDuplicates by default #1259

adamrtalbot commented Mar 12, 2024 •

edited by drpatelh

Loading

rob-p commented Mar 12, 2024

adamrtalbot commented Mar 14, 2024

Salmon --keepDuplicates by default #1259

Salmon --keepDuplicates by default #1259

Comments

adamrtalbot commented Mar 12, 2024 • edited by drpatelh Loading

Description of feature

rob-p commented Mar 12, 2024

adamrtalbot commented Mar 14, 2024

adamrtalbot commented Mar 12, 2024 •

edited by drpatelh

Loading