Purpose of max kmer frequency? #142

peterk87 · 2021-03-04T18:53:51Z

It's not clear why this threshold was implemented or what kinds of situations it's supposed to help with.

It seems like it would accidentally exclude certain kmers from subtype calling if the frequency of those kmers is "too high":

biohansel/bio_hansel/subtyper.py

Line 287 in d20a00b

    
           df['is_kmer_freq_okay'] = (df.freq >= subtyping_params.min_kmer_freq) & (df.freq <= subtyping_params.max_kmer_freq)
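To make the concern concrete, here is a minimal sketch of the same two-sided check in plain Python. The threshold values, k-mer names, and frequencies are made up for illustration and are not BioHansel's confirmed defaults; `is_kmer_freq_okay` here is a standalone helper, not the project's function.

```python
# Hypothetical thresholds, chosen only for illustration (not confirmed defaults).
MIN_KMER_FREQ = 10
MAX_KMER_FREQ = 1000

# Made-up k-mer names and observed frequencies.
kmer_freqs = {"kmer_a": 45, "kmer_b": 3, "kmer_c": 5200, "kmer_d": 80}


def is_kmer_freq_okay(freq, min_freq=MIN_KMER_FREQ, max_freq=MAX_KMER_FREQ):
    """Apply the same two-sided check as the pandas expression above to one value."""
    return min_freq <= freq <= max_freq


# Split the k-mers into those that pass the check and those that fail it.
kept = {k: f for k, f in kmer_freqs.items() if is_kmer_freq_okay(f)}
dropped = {k: f for k, f in kmer_freqs.items() if not is_kmer_freq_okay(f)}

print("kept:", kept)
print("dropped:", dropped)
```

In this toy data, `kmer_c` is dropped purely because its frequency is above the upper bound, which is the situation that could affect subtype calling on deeply sequenced samples.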

glabbe · 2021-03-04T19:21:17Z

@peterk87: The max frequency threshold could be useful when analyzing metagenomics datasets. For example, k-mers from the Salmonella schemes are generally present at low frequency (~10-100-fold genome coverage) in the metagenomics samples that we have tested so far. However, we have found that a handful of the S. Enteritidis scheme k-mers were present at 3,000-6,000-fold coverage in some metagenomics samples, and we suspect that these match E. coli genome sequences. By excluding these high-coverage k-mers, we get a more accurate average genome coverage estimate (the avg_kmer_coverage value in the BioHansel results) for the pathogen targeted by the scheme. I would recommend increasing the default value to a very high number, to avoid accidentally excluding valid k-mers in most situations.

peterk87 · 2021-03-04T19:40:11Z

Hi @glabbe

What about using the median kmer frequency instead, so that these extremely high frequency outliers don't skew the coverage estimate? It's kind of like how median household income is more informative than the mean, since the top 0.1% drag the mean up substantially.
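A quick sketch of the difference, using only Python's standard `statistics` module. The frequencies are invented to resemble the pattern described above (most k-mers at 10-100x, a couple at 3,000-6,000x); they are not real BioHansel output.

```python
import statistics

# Invented k-mer frequencies: mostly low coverage, plus two extreme outliers.
freqs = [20, 35, 40, 50, 60, 75, 90, 3000, 6000]

mean_freq = statistics.mean(freqs)
median_freq = statistics.median(freqs)

print(f"mean:   {mean_freq:.1f}")
print(f"median: {median_freq}")
```

The two outliers pull the mean above 1,000, while the median stays within the range of the typical k-mers, without any k-mer having to be excluded.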

glabbe · 2021-03-04T19:45:26Z

That sounds good to me @peterk87. If we implement the median frequency calculation, we could keep a user-defined maximum frequency threshold as an option, but have it "off" by default?

peterk87 · 2021-03-04T19:55:19Z

I think off by default would be a good idea, and especially convenient for viral amplicon sequencing data.

Based on the Salmonella example, it sounds like having a QC message warning of extremely high frequency kmers rather than excluding those kmers from the subtype calling may be more appropriate.
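One way such a QC warning could look, as a rough sketch only. The function name `high_freq_qc_message`, the `fold` cutoff of 10x the median, and the message wording are all assumptions made for illustration; none of them come from the BioHansel codebase.

```python
import statistics


def high_freq_qc_message(kmer_freqs, fold=10):
    """Return a warning string for k-mers far above the median, or None.

    `fold` is a hypothetical cutoff: a k-mer is flagged when its frequency
    exceeds `fold` times the median. Nothing is removed from `kmer_freqs`.
    """
    median = statistics.median(kmer_freqs.values())
    flagged = sorted(k for k, f in kmer_freqs.items() if f > fold * median)
    if not flagged:
        return None
    return (
        f"WARNING: {len(flagged)} k-mer(s) exceed {fold}x the median "
        f"frequency ({median}): {', '.join(flagged)}"
    )


# Made-up frequencies with one extreme outlier.
freqs = {"kmer_a": 45, "kmer_b": 60, "kmer_c": 5200, "kmer_d": 80}
msg = high_freq_qc_message(freqs)
print(msg)
```

All four k-mers remain available for subtype calling; the outlier is only reported, so the user can decide whether it reflects contamination or simply deep coverage.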

glabbe · 2021-03-04T19:58:41Z

Yes, agreed! That would be ideal, and most informative to the user.
