Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post: Switch to bam #30

Merged
merged 44 commits into from
Jun 6, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
26ac4a9
switch-to-bam post
lskatz May 14, 2024
59367a1
conclusion
lskatz May 14, 2024
9c5123d
more content added
lskatz May 30, 2024
8759c87
phoenix added to the list
lskatz May 30, 2024
398f42a
references and fastq description
lskatz May 30, 2024
47570c1
URLs
lskatz May 30, 2024
7da2a01
ONT reads bam natively
lskatz May 30, 2024
9d0fcfa
light revision
lskatz May 30, 2024
f08895c
remove newlines
lskatz May 30, 2024
c80f39c
sorage advantage
lskatz May 30, 2024
b81e341
remove request for help on QA systems
lskatz May 30, 2024
e5b417e
references
lskatz May 30, 2024
4a051d3
addressed samtools sort and index, and also interleaved reads
lskatz May 31, 2024
35510b1
brought back the main branch Gemfile.lock
lskatz May 31, 2024
93da8f3
Merge branch 'main' into switch-to-bam
lskatz May 31, 2024
220de13
cram
lskatz May 31, 2024
175dc61
remove redundant sentence
lskatz May 31, 2024
5eea800
added peter for a reviewer
lskatz May 31, 2024
33ba49c
removed my username from code block; removed awkward sentence
lskatz Jun 1, 2024
17e501f
Update _posts/2024-05-14-switch-to-bam.md: combine two short sentences
lskatz Jun 3, 2024
e5d1a06
Update _posts/2024-05-14-switch-to-bam.md: added better written sente…
lskatz Jun 3, 2024
f449616
Update _posts/2024-05-14-switch-to-bam.md: better format for Page et …
lskatz Jun 3, 2024
832235c
Added suggestions from hexylena esp uBAM notation
lskatz Jun 3, 2024
34c7619
fixed markdown linting
lskatz Jun 3, 2024
be825e5
Update _posts/2024-05-14-switch-to-bam.md
lskatz Jun 4, 2024
9e3b717
Update _posts/2024-05-14-switch-to-bam.md
lskatz Jun 4, 2024
5ddd346
Update _posts/2024-05-14-switch-to-bam.md
lskatz Jun 4, 2024
82befe8
addressed some of pmenzel's comments
lskatz Jun 4, 2024
1c5afa7
fixed first person singular to plural; fixed file format capitalization
lskatz Jun 4, 2024
64f0587
brought in Dbtara's excellent points and added as an author
lskatz Jun 4, 2024
a803609
brought in Dbtara's excellent points and added as an author
lskatz Jun 4, 2024
1b38023
Update _posts/2024-05-14-switch-to-bam.md
lskatz Jun 5, 2024
5faac14
Update _posts/2024-05-14-switch-to-bam.md
lskatz Jun 5, 2024
70282c5
Update _posts/2024-05-14-switch-to-bam.md
lskatz Jun 5, 2024
23ffb7c
Update _posts/2024-05-14-switch-to-bam.md
lskatz Jun 5, 2024
8469c8f
Update _posts/2024-05-14-switch-to-bam.md
lskatz Jun 5, 2024
91de021
Update _posts/2024-05-14-switch-to-bam.md
lskatz Jun 5, 2024
61f5607
remove results
lskatz Jun 5, 2024
e9b0477
added info about Illumina and then ironed out the conclusion
lskatz Jun 5, 2024
487dd37
Update 2024-05-14-switch-to-bam.md
pmenzel Jun 5, 2024
e20a501
me=>out
lskatz Jun 5, 2024
3433788
remove 'here' in footnote2
lskatz Jun 5, 2024
e7c7bf4
fix hyperlink text
lskatz Jun 5, 2024
9eafaf0
Rename 2024-05-14-switch-to-bam.md to 2024-06-06-switch-to-bam.md
pmenzel Jun 6, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion CONTRIBUTORS.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -55,4 +55,8 @@ bebatut:

pvanheus:
name: Peter van Heusden
orcid: 0000-0001-6553-5274
orcid: 0000-0001-6553-5274

dbtara:
name: Dhwani Batra
orcid: 0000-0002-6789-6396
191 changes: 191 additions & 0 deletions _posts/2024-06-06-switch-to-bam.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,191 @@
---
layout: post
title: What does it take to switch over to BAM from FASTQ?
tags:
- BAM
- uBAM
- FASTQ
- file formats
contributors:
- lskatz
- SeanSierra-Patev
- dbtara
reviewers:
- pmenzel
- hexylena
---

It has been over 14 years since the formalization of the FASTQ format ({% cite Cock2009 %}),
which describes sequences and their quality scores.
For paired end reads, sequences are encoded in separate files, usually called R1 and R2 [^1].
Unfortunately despite the publication, FASTQ format is not entirely standardized!
For example, it is possible to have a valid FASTQ format in either 4-line-per-entry format, or splitting sequences into multiple lines.
Additionally, the defline itself is not entirely standardized and is basically free text.
However, we have had a lot of innovations in sequence formats since then.
One of those innovations is the [SAM/BAM format](https://samtools.github.io/hts-specs/), the (binary) alignment/mapping format.
While this file format is typically used to store information about alignments of sequencing reads, it can also just store the unaligned sequencing data.
Crucially, both reads from paired-end sequencing (i.e., R1 and R2) are stored in the same single file
and [it allows for metadata as explained in this GATK post](https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format).
We found at least one other [blog post with this same sentiment from _2011_](https://blastedbio.blogspot.com/2011/10/fastq-must-die-long-live-sambam.html).
This idea isn't new.
It is frustratingly old.

FASTQ files, as a means for storing primary sequencing data before any downstream analysis, are integral to genomic epidemiology.
State health labs sequence genomes,
transfer the FASTQ files to an internal repository,
run quality checks (QC), usually through a quality assurance (QA) pipeline.

**So what would it take to change this whole process to BAM files instead?**
BAM files have many advantages including having only one file per sample instead of two,
encoding extra information such as alignment data,
and being indexed for random access.
For our purposes here, BAM files will be unaligned BAM (uBAM)
because they are not aligned against anything.

## Sequencing

First, the sequencers would have to output uBAM files.
Can they?
Illumina will not output uBAM natively, but it does allow
[Local Run Manager modules](https://customprotocolselector.illumina.com/selectors/LRM-module-selector/Content/Source/FrontPages/LRM-module-selector.htm).
If one of these modules aligns against a reference, then an Illumina platform would at least produce a BAM.
The Ion Torrent platforms do produce a uBAM automatically.
After calling with at least Dorado, ONT sequencing outputs uBAM files.
Pacbio does generate BAM as native format (they discontinued HDF5).
For platforms that do not have this automation,
we would need an easy conversion.

One such easy conversion is with `samtools`, e.g.,

```shell
samtools import -1 R1.fastq.gz -2 R2.fastq.gz --order ro -O bam,level=0 | \
samtools sort -M - -o sorted.bam
```

## Repository

Usually the FASTQ files end up in some kind of repository, organized by
run or by organism.
Instead of FASTQ files, it is easy to imagine that now the
repository consists of the uBAM files alone.
One might be concerned about taking additional space,
but actually uBAM files may offer a storage space savings over individual FASTQ files,
besides reduction in complexity gained by combining forward and reverse reads.
The above command uses `-M` in `samtools sort`, which sorts the reads by minimizers, which allows for a much better compression and therefore reduced size of the uBAM file.
In our example dataset, we transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted uBAM file.
A 31% storage reduction, in this case, can represent huge file storage savings across an entire sequencing data repository.

## QA/QC

Now that the files are in the repository, how do you run QA on them?
Some examples of a QA system include

* [SneakerNet](https://github.com/lskatz/SneakerNet)
* [Phoenix](https://github.com/CDCgov/phoenix)
* [Pandoo](https://github.com/schultzm/pandoo)
* [Nullarbor](https://github.com/tseemann/nullarbor)

Which of these pipelines can read a uBAM file?
To our knowledge, none of them!
This is one area where we need to see more adaptation of uBAM files.

## Primary analysis

There are several primary analyses that can be performed
right after QA/QC.
For example, reads might need to be transformed into an assembly.
Or also, they might need to have multilocus sequence typing (MLST)
run so that all alleles are known in future comparisons.
Some labs are sketching genomes as they are sequenced
and so certain tools can create those sketches.
Finally, there might be individual genotyping operations
such as Salmonella serotyping or virulence factors detection.

### Assembly

For genome assembly, many labs use [Shovill](https://github.com/tseemann/shovill).
However, Shovill does not natively read uBAM files.
Therefore, this workflow breaks slightly unless there is some conversion step.

Other assemblers that people commonly use for bacterial genomes are [SPAdes](https://github.com/ablab/spades) and [SKESA](https://github.com/ncbi/SKESA).
SPAdes does read uBAM natively and so that is good.
However, it does not appear that SKESA can read uBAM natively.

### MLST

MLST software usually takes FASTA or FASTQ files.
At this point there are a million classic MLST software packages and for some additional information,
please check out {% cite Page2017 %}.
For whole genome MLST software tools, we could also not find any packages that natively read uBAM.
Please see [@lskatz's previous blog post](https://lskatz.github.io/posts/2023/04/09/wgMLST.html) for an in depth view into three of them.

### Sketches

If you're like us, you want to have a directory of at least some sketches from Mash.
(see the [mashpit project](https://github.com/tongzhouxu/mashpit) for an exciting project!)
It appears that Mash natively does not read uBAM according to the v2.3 usage menu.
However it is promising that the [Sourmash](https://github.com/sourmash-bio/sourmash) library [_does_ read uBAM](https://sourmash.readthedocs.io/en/latest/release-notes/sourmash-2.0.html#major-new-features-since-1-0) natively since version 2.

### Genotyping

Generally in our experience, people base genotyping on either
[KMA](https://bitbucket.org/genomicepidemiology/kma),
[SRST2](https://github.com/katholt/srst2),
[SAUTE](https://github.com/ncbi/SKESA),
or [ARIBA](https://github.com/sanger-pathogens/ariba).
Looking at each of these software packages, we could not find any documentation that uBAM is natively read.
However, we could find that FASTA and FASTQ were valid inputs.
There are other software packages in the world for specific pathogens like _Salmonella_
but for this generalized blog post, we did not investigate further.

## Other compression methods

This article discusses the uBAM format, but there have been many attempts over the years to make other compression formats [^2].
The [CRAM format](https://samtools.github.io/hts-specs/CRAMv3.pdf) is probably the best example.
CRAM has an even better lossless compression of sequences than uBAM,
and we even confirmed it on our own sequence.

```bash
samtools import -1 1.fastq.gz -2 2.fastq.gz --order ro -O bam,level=0 | \
samtools sort -O cram --output-fmt-option archive -M - -o archive.cram
```

When viewing the same sequences in FASTQ, BAM, or CRAM, we get an astonishing reduction.

```text
-rw-------. 1 user users 81M May 30 20:03 unmapped.bam
-rw-------. 1 user users 55M May 31 09:18 unmapped.cram
-rw-------. 1 user users 55M Dec 6 2019 1.fastq.gz
-rw-------. 1 user users 63M Dec 6 2019 2.fastq.gz
```

The CRAM format is seemlessly incorporated into samtools, allowing for freely converting between formats.
In fact, [EBI stores a ton of CRAM files already](https://x.com/BonfieldJames/status/1182180199657607168).
So why wouldn't we recommend CRAM upfront?
Probably because it is a bigger lift that would involve convincing many sequencing companies to adopt it.
We can check a box on our nanopore that makes BAMs; we can't do the same for CRAM.
That said, given an ideal world, we would encourage the sequencing companies to consider that check box.

## Conclusion

The good part is that many but not all sequencing platforms output uBAM format natively.
For those that don't have this capability, we have a way to convert FASTQ to uBAM.
However even after aquiring a uBAM, we need software to natively read them.

Bioinformatics software developers should be looking ahead
to future proof their software.
They need to accept uBAM natively as input.
Although it might seem straightforward, there are several
links in the genomic epidemiology chain that need to be
updated.
These include updating QA/QC pipelines, primary analyses, and
secondary analyses.
As for @lskatz, I should also say that I am guilty of this.
For some of my own popular software such as
[Mashtree](https://github.com/lskatz/mashtree/tree/master/.github/workflows)
and [Lyve-SET](https://github.com/lskatz/lyve-SET/), they do not natively read uBAM files!
I wish I could say that I will address this right away, but with all my other responsibilities, it will be further down the road.
Therefore we can say from our own observations and personal experience, there is some work up ahead to get us moved over to uBAM files!

[^1] To maintain focus in this article, I will gloss past interleaved reads.
[^2] One such example of an organization trying to standardize is [here](https://www.genomeweb.com/informatics/will-bioinformatics-professionals-embrace-mpeg-g-data-compression-standard?utm_source=addthis_shares#.XcwmeiO9qjW.twitter).
29 changes: 29 additions & 0 deletions references.bib
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,35 @@ @article{Bankevich2012
pages = {455–477}
}

@article{Page2017,
title = {Comparison of classical multi-locus sequence typing software for next-generation sequencing data},
volume = {3},
ISSN = {2057-5858},
url = {http://dx.doi.org/10.1099/mgen.0.000124},
DOI = {10.1099/mgen.0.000124},
number = {8},
journal = {Microbial Genomics},
publisher = {Microbiology Society},
author = {Page, Andrew J. and Alikhan, Nabil-Fareed and Carleton, Heather A. and Seemann, Torsten and Keane, Jacqueline A. and Katz, Lee S.},
year = {2017},
month = jun
}

@article{Cock2009,
title = {The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants},
volume = {38},
ISSN = {1362-4962},
url = {http://dx.doi.org/10.1093/nar/gkp1137},
DOI = {10.1093/nar/gkp1137},
number = {6},
journal = {Nucleic Acids Research},
publisher = {Oxford University Press (OUP)},
author = {Cock, Peter J. A. and Fields, Christopher J. and Goto, Naohisa and Heuer, Michael L. and Rice, Peter M.},
year = {2009},
month = dec,
pages = {1767–1771}
}

@article{SarrazinGendron2024,
title = {Improving microbial phylogeny with citizen science within a mass-market video game},
ISSN = {1546-1696},
Expand Down