Salmon --keepDuplicates by default #1259
Comments
Hi @adamrtalbot, Thanks for pinging me here. I'm tagging @mikelove here too in case he has specific thoughts / input.

Basically, when we first added the feature to deduplicate reference sequences prior to indexing, we had a Twitter poll, and the "deduplicate by default" category won. That's not a super scientific way to answer the question, but the reason we had the poll was because we could both think of good arguments pro and con.

The biggest con argument is precisely what you state: having references in your input set simply not appear in the output can be surprising to users, especially if you're not explicitly aware of the deduplication (which we document, but don't constantly shout at users about).

On the pro side, many users were actually legitimately surprised that duplicates exist and, also, precisely how many such duplicates there are in common annotations. Most of those we observed should clearly be considered artifactual (e.g. annotations matching those on a reference chromosome, but appearing on a patch contig in a region with no variation). Further, from the perspective of quantification, sequence-level duplicates are a priori indistinguishable; they are the same transcript with a different label. Therefore, their assigned abundance should always be considered equal. This was the impetus behind writing them out in the

In short, the problem, IMO, is the existence of sequence-level duplicates themselves. They are basically meaningless from an inferential perspective. However, I don't think that either keeping them or discarding them as a default decision is much better than the other. Ultimately, the important thing is that the user knows about them and can decide intelligently and in a problem-specific way if they are important and, if so, how to deal with them.

--Rob
Thanks Rob, that's really helpful. I think we should allow users to configure this however they want, ideally with the default remaining backwards compatible. We're not exactly sure how best to implement this, but it is definitely one for the next major release.
Description of feature
By default, Salmon will drop transcripts with identical sequence, as described here: COMBINE-lab/salmon#214 (comment)
This behaviour should reduce unnecessary duplicate count values in the results matrices, but may be unexpected for a new user. If they supply some transcripts and one is missing from the results, it is not clear why or which one.
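To make it easier to see which transcripts would be affected, here is a minimal Python sketch that groups FASTA records by identical sequence. It is not Salmon's implementation: the function names `parse_fasta` and `duplicate_groups` are hypothetical, and treating "duplicate" as an exact string match after upper-casing is an assumption made for illustration.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def duplicate_groups(text):
    """Return lists of transcript names that share an identical sequence.

    Assumption: sequences are compared as exact strings after upper-casing.
    """
    groups = defaultdict(list)
    for name, seq in parse_fasta(text):
        groups[seq.upper()].append(name)
    return [names for names in groups.values() if len(names) > 1]


example = """>tx1
ACGTACGT
>tx2
ACGTTTTT
>tx3 copy on a patch contig
ACGT
ACGT
"""

print(duplicate_groups(example))  # [['tx1', 'tx3']]
```

Running something like this on the input transcriptome before indexing would tell a user in advance which names share a sequence, and so which ones might be missing from the results.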
Instead, we could add the flag --keepDuplicates for the Salmon index by default (discussed here). This retains all transcripts, which is more predictable behaviour. We could add an additional flag (--salmonDropDuplicates) to disable this behaviour. Alternatively, we could make it opt-in, with a flag --salmonKeepDuplicates or a more generic version (--salmonIndexExtraArgs) to allow users to enable this feature when they require it.
@rob-p apologies for the unsolicited message, could you help us understand the downside of keeping the duplicate transcripts? What do we lose when we enable --keepDuplicates?
Side note: this is a breaking change in behaviour, so it will need to be communicated to users.
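As a rough illustration of the opt-in variant, the sketch below builds a `salmon index` command line from two pipeline-style options. The function `salmon_index_command` and its parameters `keep_duplicates` and `extra_args` are hypothetical stand-ins for the proposed --salmonKeepDuplicates and --salmonIndexExtraArgs; only `salmon index`, `-t`, `-i` and `--keepDuplicates` are Salmon's own. The pipeline itself would implement this differently, so treat this as a sketch of the logic only.

```python
def salmon_index_command(transcripts, index_dir, keep_duplicates=False, extra_args=()):
    """Build the argument list for `salmon index`.

    keep_duplicates defaults to False, so existing behaviour
    (Salmon drops sequence-identical transcripts) is unchanged
    unless the user opts in. Both option names are hypothetical.
    """
    cmd = ["salmon", "index", "-t", transcripts, "-i", index_dir]
    if keep_duplicates:
        cmd.append("--keepDuplicates")
    # Pass any user-supplied extra arguments through unchanged.
    cmd.extend(extra_args)
    return cmd


print(" ".join(salmon_index_command("transcripts.fa", "salmon_index", keep_duplicates=True)))
# salmon index -t transcripts.fa -i salmon_index --keepDuplicates
```

Keeping the default at `False` matches the backwards-compatible option raised in the comments; flipping it to `True` would be the breaking change described above.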