how is calculate valid_coverage ? Different results between total read or subsample reads #292

DelphIONe · 2024-10-29T18:10:33Z

I have mapped reads on my genome. I used modkit pileup and dmr pair using the IVT sample. If I sub-sample these reads and re-run modkit pileup and then dmr I get strange valid_coverage. For example, for one position, the valid_coverage on my sub-sample of reads is higher than the same position in the result obtained by modkit dmr with the total reads (with the same IVT reads). How is this possible? How is valid_coverage calculated?

Thanks a lot for your help,

ArtRand · 2024-10-31T00:22:55Z

Hello @DelphIONe,

The valid_coverage is the number of base moficiation calls (modified of any class or unmodified) that pass the filter threshold probability. For more details, the documentation has some worked examples. You will likely see different values of valid coverage when you subset your data since you have a different number of reads and the dynamic threshold estimation will derive a different value. You can run modkit sample-probs (documentation here) to find a threshold value for a given set of reads then use it for subsequent experiments.

DelphIONe · 2025-01-03T13:59:45Z

It's not clear for me sorry.
If I have 0% modification for one position with 100 reads, and I mix them with another 100 reads with 50% modification for the same position, shouldn't I get 25% modification for this position with a total of 200 reads ?
What do you mean by "dynamic threshold estimation" ? Is it on quality of reads ? quality of mapping ?

ArtRand · 2025-01-03T18:10:26Z

Hello @DelphIONe,

If I have 0% modification for one position with 100 reads, and I mix them with another 100 reads with 50% modification for the same position, shouldn't I get 25% modification for this position with a total of 200 reads ?

Yes this is correct. My point in the previous comment is that if you estimate the pass threshold from the first set (0% modification) you may see a different value than if you estimate the threshold from the second set.

What do you mean by "dynamic threshold estimation" ? Is it on quality of reads ? quality of mapping ?

With the default settings, Modkit will estimate a pass threshold for which the 10% lowest confidence base modification calls are discarded. This is what I meant my "dynamic threshold estimation", that the threshold value is estimated from the input data. There are more details here. It does not use the quality of the reads or the mapping.

To your original question, when you subsample the reads, it is possible that you could have higher valid_coverage for a given position than when you use the whole read set. I imagine it could happen if the estimated pass threshold is lower when calculated from the subset of reads. Take an extreme example, if the estimated pass threshold for the subset is 0.5 and the estimated threshold when using the whole set is 0.95. It's possible that with a value of 0.5, most of the calls will be counted towards the valid_coverage whereas with the higher pass threshold value, fewer will. In the subset case, however, you should see that the valid_coverate + count_fail is equal to the total coverage for reads with the reference base at that position and less than the sum when you use the whole read set.

ArtRand added the question Looking for clarification on inputs and/or outputs label Oct 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

how is calculate valid_coverage ? Different results between total read or subsample reads #292

how is calculate valid_coverage ? Different results between total read or subsample reads #292

DelphIONe commented Oct 29, 2024

ArtRand commented Oct 31, 2024

DelphIONe commented Jan 3, 2025

ArtRand commented Jan 3, 2025

how is calculate valid_coverage ? Different results between total read or subsample reads #292

how is calculate valid_coverage ? Different results between total read or subsample reads #292

Comments

DelphIONe commented Oct 29, 2024

ArtRand commented Oct 31, 2024

DelphIONe commented Jan 3, 2025

ArtRand commented Jan 3, 2025