Test various abundance estimation parameters #330

jakobnissen · 2024-07-02T11:30:29Z

After a talk with @shiraz-shah, we should investigate whether we can estimate abundance more accurately.
He recommended using msamtools with 80% query coverage and 95% id over 80 bp, then using msamtools profile to estimate abundances. This is supposedly more accurate than the naive read counting that CoverM does.
We do not yet know whether more accurate abundance estimation actually leads to better binning. So, we should test the following and compare to our current defaults:

95% id over 80 bp, 80% query coverage filter with CoverM
Same filters using msamtools
Same filters using msamtools + msamtools profile

Probably the msamtools pipeline should be:

<some mapping command that produces SAM output>
  | msamtools filter -S -bu -l 80 -p 95 -z 80 --besthit - \
  | msamtools profile --multi=proportional --label=SAMPLE --unit=ab -o SAMPLE.profile.txt.gz -

The same filters can be applied using CoverM with coverm contig --min-read-aligned-length 80 --min-read-percent-identity 95 --min-read-aligned-percent 80
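
To make the three thresholds concrete, here is a minimal Python sketch of the per-alignment check they describe. It is not the code of msamtools or CoverM: the function name `passes_filter` and its arguments are hypothetical, and it assumes identity is computed as matches over aligned length and query coverage as aligned length over read length. The two tools may define these quantities slightly differently.

```python
def passes_filter(aligned_len, matches, read_len,
                  min_len=80, min_identity=95, min_coverage=80):
    """Return True if an alignment meets all three thresholds.

    Assumed definitions (illustrative only):
      identity (%)       = matches / aligned_len * 100
      query coverage (%) = aligned_len / read_len * 100
    """
    if aligned_len <= 0 or read_len <= 0:
        return False
    # Minimum aligned length in bp (-l 80 / --min-read-aligned-length 80)
    if aligned_len < min_len:
        return False
    # Minimum percent identity (-p 95 / --min-read-percent-identity 95),
    # compared with integer arithmetic to avoid rounding issues
    if matches * 100 < min_identity * aligned_len:
        return False
    # Minimum percent of the read aligned (-z 80 / --min-read-aligned-percent 80)
    if aligned_len * 100 < min_coverage * read_len:
        return False
    return True


# A 120 bp read with 100 bp aligned and 96 matching bases passes all checks.
print(passes_filter(aligned_len=100, matches=96, read_len=120))
```

Under these assumed definitions, a 150 bp read with only 80 bp aligned would be rejected by the coverage check even at 100% identity.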

@Las02 if you have time, it would be good to also test this (lower priority than the current strobealign tests)

shiraz-shah · 2024-07-02T12:09:31Z

In our experience, if the mappings are not QC'ed with msamtools filter, the individual abundance estimates for each contig are 50% noise and 50% signal. In terms of presence/absence, ten times as many contigs are found in a sample if the above QC is not applied. We have benchmarked this using the CAMI data set, and we can see that it's all noise.

Additionally, msamtools profile iteratively redistributes ambiguous read mappings to the correct contig based on its unique matches. So if you have two contigs that are 95% identical, normally they would both get reads assigned, but msamtools can tell whether one contig or the other is present by looking at the reads that map to the dissimilar portions of the contigs. If both are present, their abundances are adjusted so that they become more accurate.
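
The idea of sharing ambiguous reads according to unique matches can be illustrated with a small Python sketch. This is a simplified single-pass version written for illustration, not the msamtools implementation: the function `redistribute` and its inputs are hypothetical, it works on raw read counts without length normalization, and it falls back to an equal split when none of the candidate contigs has unique reads.

```python
from collections import defaultdict


def redistribute(unique_counts, ambiguous_reads):
    """Share each ambiguous read among its candidate contigs.

    unique_counts:   dict mapping contig name -> number of uniquely mapped reads
    ambiguous_reads: list of tuples, each holding the contigs one read maps to
    Returns a dict mapping contig name -> estimated read count.
    """
    totals = defaultdict(float)
    for contig, n in unique_counts.items():
        totals[contig] += n

    for candidates in ambiguous_reads:
        weights = [unique_counts.get(c, 0) for c in candidates]
        weight_sum = sum(weights)
        for contig, w in zip(candidates, weights):
            if weight_sum > 0:
                # Share proportional to the contig's unique read count
                share = w / weight_sum
            else:
                # No unique evidence for any candidate: split equally
                share = 1 / len(candidates)
            totals[contig] += share
    return dict(totals)


# Contigs A and B are near-identical; only A has reads on its distinct regions.
print(redistribute({"A": 30, "B": 0}, [("A", "B")] * 10))
# → {'A': 40.0, 'B': 0.0}
```

In this toy case, naive counting would credit B with ten reads, while the proportional split assigns all of them to A because only A has unique support.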

These two steps will make your abundances much more accurate, and I can't help but wonder whether that would suddenly make VAMB more accurate than the competition.

jakobnissen assigned Las02 Jul 4, 2024
