Experiment: Asymmetric randstrobe hashes #405
Update: The below comment is wrong because I misinterpreted the rightmost column. This is a great analysis, because I think it sets rough expectations on how much we could hope multi-context seeds can improve when we switch to asymmetrical seeds. I would expect asymmetrical seeds to perform as well as in the rightmost column (as it uses a related type of rescuing mechanism), but this is not what we saw in our initial implementation of asymmetrical multi-context seeds (CC @Itolstoganov). We should keep this reference benchmark in mind if/when we revert to asymmetrical multi-context seeds. |
I’ve been working on this today. I looked at the set of reads whose alignment becomes incorrect when switching from symmetric to asymmetric randstrobes (using CHM13-500-se). I found that about one third of them are aligned to chrY (and 17% on chr1, 14% on chr9). I believe you mentioned something about PAR/chrX/chrY, so you may have expected this. Also, NAM rescue was done for ~70% of these reads, while the percentage is ~8% for the full dataset. NAM rescue is done when no NAMs have been found or when the So this points to repetitiveness and repetitiveness filtering being the main issue. I’ll investigate further in that direction. |
I have a more detailed picture of why this happens and a potential fix(!) Why it happens: One of the directions (say fw) has many repetitive seeds (for the sake of example, say a nonrepetitive fraction of 0.5 for fw direction). Let's further say that the reverse direction has no repetitive seeds (nonrepetitive_fraction of 1.0). Assuming the sides have the same number of seeds, this leads to a total Consequently, the read will most likely have the highest-scoring NAMs in the reverse direction. If the forward direction was the correct location, we would get a mismapped read. We could probably overcome this by keeping two separate nonrepetitive_fractions ( (*) to be able to fairly compare the NAM scores between FW and RC as they are different in find_nams and find_nams_rescue. |
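The arithmetic behind the example above can be sketched as follows. This is only an illustration, not strobealign code: the `HitCounts` type and the `pooled_fraction` helper are hypothetical names, and the sketch assumes that a single shared fraction is computed as "nonrepetitive seeds over all seeds, both directions pooled".

```python
from dataclasses import dataclass


@dataclass
class HitCounts:
    """Seed counts for one orientation of a read (hypothetical type)."""
    total: int          # seeds looked up in this orientation
    nonrepetitive: int  # seeds that were not filtered as repetitive

    @property
    def nonrepetitive_fraction(self) -> float:
        # An orientation without seeds is reported as 0.0 here (arbitrary choice).
        return self.nonrepetitive / self.total if self.total else 0.0


def pooled_fraction(fw: HitCounts, rc: HitCounts) -> float:
    """Single fraction over both orientations (assumed pooling rule)."""
    total = fw.total + rc.total
    return (fw.nonrepetitive + rc.nonrepetitive) / total if total else 0.0


# Numbers from the example in the comment: same number of seeds per side,
# half of the forward seeds repetitive, none of the reverse seeds repetitive.
fw = HitCounts(total=100, nonrepetitive=50)
rc = HitCounts(total=100, nonrepetitive=100)

print(fw.nonrepetitive_fraction, rc.nonrepetitive_fraction, pooled_fraction(fw, rc))
```

Under these assumptions, the pooled value sits between the two per-direction values and no longer shows that only the forward direction is affected, which is the motivation for tracking the two fractions separately.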
I also notice that I misinterpreted the last column in the table (i.e., 7ca428d → 1960bef). I previously thought the numbers in this column are the percent point accuracy difference to the baseline (d420610) when both 'features' are implemented, but I now see it's the difference between the two other columns.
Then, this comment is wrong (from post #405 (comment)). (I looked back at the table to check if my potential insight and the "fix" I wrote above relate to the table you posted. But now I realize they are not related and cannot be compared.) |
Two comments on my proposed fix:
Perhaps we can think further on a solution for 2 if we observe that it causes a problem. |
I had a deeper look into this using simulated reads only from CHM13 ChrX+Y. It is quite easy to find many examples of reads that turn bad when changing from symmetric to asymmetric randstrobe hashes. Many reads seem to suffer from what I predicted above, namely that the NAM from the correct region is short (and gets a low score) because many surrounding hits are repetitive (masked). In symmetric mode this is fine because all repetitive regions are masked, regardless of direction. An example read below where symmetric randstrobe hash top-list (the top NAM creates a perfect match (
In asymmetric mode, some inversions don't have the repetitive (masked) seeds and are thus able to create longer and higher-scoring NAMs, which makes the true region fall lower in the list so that it does not make the top M (20) candidates. The NAM score top-list for the same read with asymmetric randstrobe hashes:
First I thought we could revise the NAM score or implement separate counters for I don't have a solution yet, some more thinking is required. Another line of thought is to revisit why we wanted asymmetric seeds in the first place(?) - was it only to remove the false reverse hits? If so, maybe this can be dealt with in a different way, using symmetric seeds. My implementation of joining nearby NAMs (#415) seems to work regardless of false reverse hits or not. Read used in above example pasted below.
|
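The effect described in this comment (the true location dropping out of the top M candidates) can be illustrated with a toy ranking. Everything here is hypothetical: the `Nam` tuple, the helper names, and in particular the scores, which are made up and are not taken from the read discussed above. Only the cutoff M = 20 comes from the comment.

```python
from typing import NamedTuple


class Nam(NamedTuple):
    """Toy stand-in for a NAM (hypothetical fields)."""
    ref_start: int
    score: float
    is_true_location: bool


def top_candidates(nams, max_tries=20):
    """Keep only the max_tries highest-scoring NAMs."""
    return sorted(nams, key=lambda n: n.score, reverse=True)[:max_tries]


def true_location_survives(nams, max_tries=20):
    """True if the NAM at the true location is among the kept candidates."""
    return any(n.is_true_location for n in top_candidates(nams, max_tries))


# The true NAM is short (low score) because surrounding seeds are masked.
true_nam = Nam(ref_start=1_000, score=50.0, is_true_location=True)

# Symmetric-like case: decoys are masked as well, so they score low too.
masked_decoys = [Nam(10_000 + i, 10.0, False) for i in range(25)]

# Asymmetric-like case: decoys in the other orientation are not masked
# and form longer, higher-scoring NAMs.
unmasked_decoys = [Nam(10_000 + i, 100.0 + i, False) for i in range(25)]

print(true_location_survives([true_nam] + masked_decoys))    # True
print(true_location_survives([true_nam] + unmasked_decoys))  # False
```

With 25 decoys that all outscore the true NAM, a cutoff of 20 removes the true location before any extension is attempted, regardless of how good the final alignment there would have been.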
I’ve written about this by e-mail, but thought I’d show the code as well and a table that shows a comparison of accuracies.
This PR includes two commits.
The first commit 7ca428d changes the formula for computing randstrobe hashes from `syncmer1_hash + syncmer2_hash` to `2 * syncmer1_hash - syncmer2_hash`. Factor 2 is used to avoid a hash of zero when the two syncmer hashes are identical. This is the case at the end of the query when the syncmer is paired with itself due to there not being any downstream syncmers.

The second commit 1960bef introduces a "randstrobe rescue" step where (if no hits can be found), syncmers are paired up with all downstream syncmers between `w_min` and `w_max` to produce a set of hashes to look up in the index.

Here are the measured differences in accuracy, with some notable rows highlighted.