Purpose of max kmer frequency? #142
Comments
@peterk87: The max frequency threshold could be useful when analyzing metagenomics datasets. For example, Salmonella scheme k-mers are generally present at low frequency (~10-100 fold genome coverage) in the metagenomics samples that we have tested so far. However, we have found that a handful of the S. Enteritidis scheme k-mers were present at 3,000-6,000 fold coverage in some metagenomics samples, and we suspect that these match E. coli genome sequences. By excluding these high-coverage k-mers, we get a more accurate average genome coverage estimate (represented by the avg_kmer_coverage value in the BioHansel results) for the pathogen targeted by the scheme. I would recommend increasing the default value to a very high number to avoid accidentally excluding valid k-mers in most situations.
Hi @glabbe, what about using the median k-mer frequency instead, so that these extremely high-frequency outliers don't skew the estimate? It's a bit like how median household income is more informative than the mean, since the top 0.1% drags the mean up substantially.
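To make the mean vs. median point concrete, here is a minimal sketch. The frequencies are made up for illustration (roughly matching the 10-100 fold and 3,000-6,000 fold ranges mentioned above); they are not real BioHansel output.

```python
from statistics import mean, median

# Hypothetical k-mer frequencies for one sample (illustrative values only):
# most k-mers sit at 10-100 fold, two outliers at 3,000-6,000 fold.
kmer_freqs = [12, 25, 40, 55, 70, 90, 3000, 6000]

avg_cov = mean(kmer_freqs)    # pulled far upward by the two outliers
med_cov = median(kmer_freqs)  # stays within the range of the typical k-mers

print(f"mean coverage:   {avg_cov}")
print(f"median coverage: {med_cov}")
```

With these numbers the mean lands above 1,000 while the median stays between 55 and 70, so the median is the better summary of the target genome's coverage here.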
That sounds good to me, @peterk87. If we implement the median frequency calculation, we could keep the user-defined maximum frequency threshold as an option, but have it "off" by default?
I think off by default would be a good idea, and especially convenient for viral amplicon sequencing data. Based on the Salmonella example, it sounds like a QC message warning about extremely high-frequency k-mers may be more appropriate than excluding those k-mers from subtype calling.
Yes, agreed! That would be ideal, and most informative to the user.
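A rough sketch of what the proposal in this thread could look like: median coverage, an optional max frequency threshold that is off by default, and a QC warning for outliers instead of silent exclusion. This is not BioHansel's actual code; the function name `summarize_kmer_coverage`, the `outlier_factor` parameter, and the warning text are all assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple


def summarize_kmer_coverage(
    kmer_freqs: Dict[str, int],
    max_freq: Optional[int] = None,  # hypothetical option; None means "off"
    outlier_factor: float = 10.0,    # hypothetical QC multiplier
) -> Tuple[float, List[str]]:
    """Return (median coverage, QC warnings) for a mapping of k-mer -> frequency."""
    # Apply the user-defined maximum only when it was explicitly set.
    kept = {
        kmer: freq
        for kmer, freq in kmer_freqs.items()
        if max_freq is None or freq <= max_freq
    }
    if not kept:
        return 0.0, []

    med = float(median(kept.values()))

    # Flag, but do not exclude, k-mers far above the median coverage.
    warnings = [
        f"k-mer {kmer} has frequency {freq}, over {outlier_factor:g}x "
        f"the median coverage ({med:g})"
        for kmer, freq in kept.items()
        if freq > outlier_factor * med
    ]
    return med, warnings


# Illustrative input only; the k-mer names and counts are made up.
example = {"kmer_a": 20, "kmer_b": 50, "kmer_c": 80, "kmer_d": 4500}
med_cov, qc_warnings = summarize_kmer_coverage(example)
```

With the threshold left off, every k-mer still counts toward subtype calling and the outlier only shows up as a warning. Passing `max_freq=1000` would drop `kmer_d` before the median is computed.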
It's not clear why this threshold was implemented and what kinds of situations it's supposed to help with.
It seems like it would accidentally exclude certain kmers from subtype calling if the frequency of those kmers is "too high":
biohansel/bio_hansel/subtyper.py
Line 287 in d20a00b