Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how is calculate valid_coverage ? Different results between total read or subsample reads #292

Open
DelphIONe opened this issue Oct 29, 2024 · 3 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@DelphIONe
Copy link

I have mapped reads on my genome. I used modkit pileup and dmr pair using the IVT sample. If I sub-sample these reads and re-run modkit pileup and then dmr I get strange valid_coverage. For example, for one position, the valid_coverage on my sub-sample of reads is higher than the same position in the result obtained by modkit dmr with the total reads (with the same IVT reads). How is this possible? How is valid_coverage calculated?

Thanks a lot for your help,

@ArtRand
Copy link
Contributor

ArtRand commented Oct 31, 2024

Hello @DelphIONe,

The valid_coverage is the number of base moficiation calls (modified of any class or unmodified) that pass the filter threshold probability. For more details, the documentation has some worked examples. You will likely see different values of valid coverage when you subset your data since you have a different number of reads and the dynamic threshold estimation will derive a different value. You can run modkit sample-probs (documentation here) to find a threshold value for a given set of reads then use it for subsequent experiments.

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Oct 31, 2024
@DelphIONe
Copy link
Author

It's not clear for me sorry.
If I have 0% modification for one position with 100 reads, and I mix them with another 100 reads with 50% modification for the same position, shouldn't I get 25% modification for this position with a total of 200 reads ?
What do you mean by "dynamic threshold estimation" ? Is it on quality of reads ? quality of mapping ?

@ArtRand
Copy link
Contributor

ArtRand commented Jan 3, 2025

Hello @DelphIONe,

If I have 0% modification for one position with 100 reads, and I mix them with another 100 reads with 50% modification for the same position, shouldn't I get 25% modification for this position with a total of 200 reads ?

Yes this is correct. My point in the previous comment is that if you estimate the pass threshold from the first set (0% modification) you may see a different value than if you estimate the threshold from the second set.

What do you mean by "dynamic threshold estimation" ? Is it on quality of reads ? quality of mapping ?

With the default settings, Modkit will estimate a pass threshold for which the 10% lowest confidence base modification calls are discarded. This is what I meant my "dynamic threshold estimation", that the threshold value is estimated from the input data. There are more details here. It does not use the quality of the reads or the mapping.

To your original question, when you subsample the reads, it is possible that you could have higher valid_coverage for a given position than when you use the whole read set. I imagine it could happen if the estimated pass threshold is lower when calculated from the subset of reads. Take an extreme example, if the estimated pass threshold for the subset is 0.5 and the estimated threshold when using the whole set is 0.95. It's possible that with a value of 0.5, most of the calls will be counted towards the valid_coverage whereas with the higher pass threshold value, fewer will. In the subset case, however, you should see that the valid_coverate + count_fail is equal to the total coverage for reads with the reference base at that position and less than the sum when you use the whole read set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

2 participants