-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ksw2 to align soft-clipped ends from ungapped alignment #382
base: main
Are you sure you want to change the base?
Conversation
Excellent! Since it was a while ago I did a benchmark, I decided to run f4a8dd2 (named
accuracy_plot.pdf I have also started a benchmark on biological data with SNP/indel calling, which may be more relevant. Will report back here. |
Results for the variant analysis below.
I have a hard time forming a consensus, but overall I think it looks good. I guess you verified it also had intended behaviour in the bug reported. I approve a merge. |
I had to type this off from the screenshot, so I may have made a mistake, but are you sure this PR is better than 0.12.0?
|
Perhaps you are right. I may have made this conclusion based on:
Those are very vague motivations. I guess the best way is to check that the alignment score is always better with the new version. Did you perform such a check on the simulated data? I wanted to look at differing alignments for the analysis I did above, but I didn’t find a quick way to do a For the analysis I did, the BAM files for the four datasets are here For the two simulated files, the true alignments are at |
To answer a couple of questions:
Current strobealign reports
Good, let’s hope that the slowdown on SIM150 is just an outlier. Or otherwise we’ll just have to accept the tradeoff and get a speedup some other way.
Yes, here’s the full table: Comparing Score-based accuracy7b3b260 Merge pull request #379 from ksahlin/optimize-kslu2
Average difference se: +0.0068 I had convinced myself that it is ok that the score gets worse in some cases, but now I’m not so sure anymore. I’ll have to look into this.
I don’t have so much time left before going on vacation. I will not be able to look at your files, but may have time for some other checks. |
I see. Not consistently left/right aligning indels could possibly explain the lower precision. Nice! I think we even saw it in a previous evaluation using WFA2, IIRC.
I had expected that we would never end up with lower score. But I realize I don't fully understand the consequences of the new approach. If we did a 'full' (semi-global) realignment before, compared to only extending the softclipped ends in this branch, maybe the sum of the scores of the aligned 'pieces' is lower than the semi-global score? Related: When you extract the softclipped ends to extend, I see you only take the softclipped part:
perhaps there is a benefit of taking an overlapping part of the NAM, such as |
Otherwise, alignments between wildcards get a score of 0, which is inconsistent with SSW alignments.
Apparently, the KSW_EZ_GENERIC_SC flag is required
It is possible that gapped alignment gives a better score than soft-clipping based on ungapped alignment. Closes #357
This should fix some unexpected alignments as reported in #357.
When extending a seed, we currently compute an ungapped alignment allowing softclipping and report this unchanged if the hamming distance (counting each soft-clipped base as a mismatch) is low enough.
This is problematic because there might exist a gapped alignment of the soft-clipped ends that results in an overall higher score.
This PR changes the behavior so that after computing the ungapped alignment, ksw2 is used to align each end that has been soft clipped.
Aligner::ksw_extend
function were taken from the abandonedwfa2
branch.I changed two reads in the phiX test dataset so that they would be soft-clipped with the old code but now result in a non-soft-clipped alignment with a single deletion, similar to the problematic read in #357.
Accuracy changes minimally (+0.006); mapping rate is unaffected. Runtime is hard to measure, maybe it’s 1-2% slower.
The addition of ksw_extend also enables addressing #377, but this is left for a separate PR.
Closes #357
CC @tprodanov