Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Purpose of max kmer frequency? #142

Open
peterk87 opened this issue Mar 4, 2021 · 5 comments
Open

Purpose of max kmer frequency? #142

peterk87 opened this issue Mar 4, 2021 · 5 comments

Comments

@peterk87
Copy link
Contributor

peterk87 commented Mar 4, 2021

It's not clear why this threshold was implemented and what kinds of situations it's supposed to be help with.

It seems like it would accidentally exclude certain kmers from subtype calling if the frequency of those kmers is "too high":

df['is_kmer_freq_okay'] = (df.freq >= subtyping_params.min_kmer_freq) & (df.freq <= subtyping_params.max_kmer_freq)

@glabbe
Copy link
Collaborator

glabbe commented Mar 4, 2021

@peterk87: The max frequency threshold could be useful when analyzing metagenomics datasets. For example, Salmonella schemes k-mers are generally present at low frequency (~10-100 fold genome coverage) in the metagenomics samples that we have tested so far. However, we have found that a handful of the S. Enteritidis scheme k-mers were present at 3,000-6,000 fold coverage in some metagenomics samples, and we suspect that these match E. coli genome sequences. By excluding these high coverage k-mers, we get a more accurate average genome coverage estimate (represented by the avg_kmer_coverage value in the BioHansel results) for the pathogen targeted by the scheme. I would recommend increasing the default value to a very high number, in order to avoid accidental exclusion of valid k-mers in most situations.

@peterk87
Copy link
Contributor Author

peterk87 commented Mar 4, 2021

Hi @glabbe

What about using median kmer frequency instead to exclude these extremely high frequency outliers? Kind of like how median household income is more informative than the mean value since the 0.1% will drag up the mean value substantially.

@glabbe
Copy link
Collaborator

glabbe commented Mar 4, 2021

That sounds good to me @peterk87. If we implement the median frequency calculation, we could keep an option to implement a user-defined maximum frequency threshold, but to have it "off" by default?

@peterk87
Copy link
Contributor Author

peterk87 commented Mar 4, 2021

I think off by default would be a good idea and convenient for viral amplicon sequencing data especially.

Based on the Salmonella example, it sounds like having a QC message warning of extremely high frequency kmers rather than excluding those kmers from the subtype calling may be more appropriate.

@glabbe
Copy link
Collaborator

glabbe commented Mar 4, 2021

Yes, agreed! That would be ideal, and most informative to the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants