Test various abundance estimation parameters #330

Open
jakobnissen opened this issue Jul 2, 2024 · 1 comment

@jakobnissen
Member

After a talk with @shiraz-shah, we should investigate if we can estimate the abundance more accurately.
He recommended using msamtools with 80% query coverage, 95% id over 80 bp, and using msamtools profile to estimate abundances. This is supposedly more accurate than the naive read counting that CoverM does.
We do not yet know if more accurate abundance estimation actually leads to a better binning. So, we should test the following and compare to our current defaults:

  • 95% id over 80 bp, 80% query coverage filter with CoverM
  • Same filters using msamtools
  • Same filters using msamtools + msamtools profile

Probably the msamtools pipeline should be:

<some mapping command that produces SAM output>
  | msamtools filter -S -bu -l 80 -p 95 -z 80 --besthit - \
  | msamtools profile --multi=proportional --label=SAMPLE --unit=ab -o SAMPLE.profile.txt.gz -

The same filters can be applied using CoverM with:

coverm contig --min-read-aligned-length 80 --min-read-percent-identity 95 --min-read-aligned-percent 80
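
For anyone setting up the comparison, here is a minimal Python sketch of what the three thresholds mean for a single alignment. The `Alignment` class and `passes_filter` function are hypothetical names made up for illustration, and the identity and coverage formulas are simplified assumptions: msamtools and CoverM each compute these quantities in their own way (for example in how gaps and clipping are counted), so results at the threshold boundary may differ between tools.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical, simplified summary of one read alignment."""
    read_length: int      # full length of the read (query), in bp
    aligned_length: int   # number of read bases covered by the alignment
    matches: int          # identical bases within the aligned region


def passes_filter(aln, min_length=80, min_identity=95.0, min_query_cov=80.0):
    """Return True if the alignment meets all three thresholds.

    Assumed definitions (not taken from either tool's source):
      identity  = matches / aligned_length
      query cov = aligned_length / read_length
    """
    if aln.aligned_length == 0 or aln.aligned_length < min_length:
        return False
    identity = 100.0 * aln.matches / aln.aligned_length
    query_cov = 100.0 * aln.aligned_length / aln.read_length
    return identity >= min_identity and query_cov >= min_query_cov


if __name__ == "__main__":
    examples = [
        Alignment(read_length=150, aligned_length=140, matches=138),  # passes
        Alignment(read_length=150, aligned_length=100, matches=99),   # low coverage
        Alignment(read_length=150, aligned_length=130, matches=120),  # low identity
        Alignment(read_length=100, aligned_length=79, matches=79),    # too short
    ]
    for aln in examples:
        print(aln, passes_filter(aln))
```

Because the definitions may differ between the two tools, the "same filters" arms of the test may not keep exactly the same set of reads, which is worth checking before attributing any binning difference to msamtools profile.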

@Las02 if you have time, it would be good to also test this (lower priority than the current strobealign tests)

@shiraz-shah

In our experience, without QC'ing the mappings with msamtools filter, the individual abundance estimates for each contig are 50% noise and 50% signal. In terms of presence/absence, ten times as many contigs are found in a sample if the above QC'ing is not applied. We have benchmarked this using the CAMI data set, and we can see that those extra detections are all noise.

Additionally, msamtools profile iteratively redistributes ambiguous read mappings to the correct contig based on its unique matches. So if you have two contigs that are 95% identical, normally they would both get reads assigned, but msamtools can tell if one contig or the other is present by looking at the reads that map to the dissimilar portions of the contigs. If both are present, their abundances are tweaked so they become more accurate.
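
To make the redistribution idea concrete, here is a toy Python sketch. It is not msamtools' implementation: it does a single proportional pass rather than iterating, it ignores contig length, and the function name `proportional_abundance` and the input format (one list of contig names per read) are assumptions made up for this example. Reads that hit several contigs are split according to how many uniquely mapped reads each of those contigs has.

```python
from collections import Counter


def proportional_abundance(read_hits):
    """Split multi-mapped reads in proportion to unique read counts.

    read_hits: list with one entry per read, each entry being the list of
    contig names that read maps to after filtering (hypothetical format).
    Returns a dict mapping contig name to (fractional) read count.
    """
    # Count reads that map to exactly one contig.
    unique = Counter(hits[0] for hits in read_hits if len(hits) == 1)

    abundance = {contig: float(n) for contig, n in unique.items()}
    for hits in read_hits:
        if len(hits) < 2:
            continue
        total_unique = sum(unique[c] for c in hits)
        for contig in hits:
            if total_unique > 0:
                share = unique[contig] / total_unique
            else:
                # No unique evidence for any candidate: split evenly.
                share = 1.0 / len(hits)
            abundance[contig] = abundance.get(contig, 0.0) + share
    return abundance


if __name__ == "__main__":
    # Contig A has 5 unique reads, contig B has none, 4 reads hit both.
    reads = [["A"]] * 5 + [["A", "B"]] * 4
    print(proportional_abundance(reads))
```

In the example at the bottom, contig B only ever appears in ambiguous hits, so all shared reads go to A and B ends up with zero, whereas naive counting would report B as present with 4 reads.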

These two steps will make your abundances much much more accurate, and I can't help but wonder whether it would make VAMB more accurate than the competition, all of a sudden.
