-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Imperfect read alignment, changing parameters does not help #357
Comments
Excellent bug report, I can reproduce your results. I will discuss with @marcelm. Thanks for letting us know! |
The soft clipping penalty is not actually a penalty, but an "end bonus". It is added to the score for each end that reaches the end. This is also why you get the score 2260 when you set (I know this is not documented correctly, and we need to fix this. – And maybe the better fix would be to make it actually be a soft-clipping penalty.) I believe the issue may be that the alignment library we are using (SSW) does not directly support soft clipping penalties. We just let it find an alignment that is allowed to start and end anywhere within the read, and then afterwards try to extend it towards the end of the reads and adjust the score accordingly. When we run SSW, we get SSW does not find |
Shouldn’t lowering (mentioned in anther discussion: I haven’t thought it through properly yet, but is it sound that the end bonus/penalty is part in determining the best location for PE-reads?) |
Reflecting some direct emails, there seem to be some examples where the alignments have slightly odd indel behaviour, where I made a recipe for reproducing some cases at https://gist.github.com/drtconway/962697804acd1d89c753b1d9a0378fc6 |
I believe to have found the problem. First, a correction to what I wrote:
Unlike BWA, our gap costs are computed as gap_open + (length - 1) * gap_extend, so the single insertion costs 12, not 12 + 1, making the total score 286. My argument would still hold (both alignments are equally good, so SSW can arbitrarily decide which one to pick), but the actual issue is a different one. The actual problem is that SSW is never called due to an optimization in strobealign that skips computing a gapped alignment if the ungapped alignment looks "good enough". This is a part of the relevant code: Lines 226 to 229 in 7fe2934
To reduce runtime, strobealign first computes an ungapped alignment when extending a seed. At this stage, soft clipping is not allowed and no CIGAR string is computed, we only count the number of mismatches. If that is below 5% of the length of the query, we skip computing a full gapped alignment with SSW. Here, the ungapped alignment is 143= 1X 1= 2X 1= 1X 1= with four mismatches. Since 4/150=0.03, we skip SSW. When SSW has been skipped, the ungapped alignment is re-done to get the CIGAR, this time taking soft-clipping into account. That is why the final output is 143=7S. I was able to get the 140=1I9= alignment by 1) disabling the optimization and 2) reducing gap open penalty to 11 ( |
Perfect that you could pin this down. Easy to fix, but requires a bit of though how to do this without introducing too much runtime. Perhaps hamming only over the span of the NAMs, but it may lead to a lot of extra ksw calls. We can discuss this offline unless you already have something in mind. |
(I wrote the above yesterday, I notice only now I haven’t actually posted it.) It’s clear that just disabling the optimization is undesirable. On a test dataset (drosophila-50), runtime increases by more than 50%. I think it makes sense to first ensure that the initial ungapped alignment is also done with soft clipping. The only reason for the current, inconsistent behavior is that I forgot this when I added soft clipping. With that in place, we would get 143=7S as the first ungapped alignment in the example. If we count the soft-clipped bases as mismatches, that would give us 7/150 = 4.7%. That is still below the currently allowed threshold of 5%, so we would still skip SSW, but it’s a bit closer. Lowering the threshold to 4% (for example) is IMO not the way to go as it doesn’t help for longer reads. I wonder whether we should then just run ksw on the ends that the ungapped alignment soft-clips. ksw can do extension alignment (unlike SSW) and because the soft-clipped ends are typically short, it should be quite fast. |
Sounds good. So we need to start around _k_nt in w.r.t. the softclipp. So for the above example we would do extension alignment with the k+7 rightmost nt in the read to detect the indel. |
Btw, what does strobealign currently do when the region with the NAM has the same length on the query and the reference and Hamming distance of the NAM region is high? Do we fully realign such cases with SSW? If so, an optimization would be to run ksw on the ends only. I remember we have discussed similar scenarios when we tried out partitioning the alignments and use WFA2 but I don't remember the conclusions. Clarification: I meant when Hamming distance is high -- possibly because regions outside the NAM region do not fit (e.g. indels). Then it might be inefficient to realign the whole read. One approach would be to try hamming of the NAM hit only, then extension of the ends. |
It is possible that gapped alignment gives a better score than soft-clipping based on ungapped alignment. Closes #357
It is possible that gapped alignment gives a better score than soft-clipping based on ungapped alignment. Closes #357
Mapping a read
to the reference
using Strobealign
v0.11.0
produces alignment143=7S
with score 296.Mapping the same read using BWA MEM finds a better alignment
140=1I9=
with score 143 (x2 = 286).Setting high penalty for soft clipping
-L 1000
produces a bad alignment143=1X1=2X1=1X1=
with score 2260, for some reason. Decreasing gap opening (-O
) penalty and increasing match bonus/mismatch penalty (-A/B
) did not help.I tried changing seeding arguments
-m/k/c/u/l
, but with no luck for now.Maybe part of the strobemer aligns in place of the 1 bp insert and this breaks the rest of the alignment, but it is strange that alignment cannot be fixed by modifying seeding arguments.
Also, extremely high score with high soft clipping penalty without the actual soft clipping seems like a bug.
The text was updated successfully, but these errors were encountered: