From 26ac4a92724a17df4dde86a92ce8920309249bc0 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Tue, 14 May 2024 13:30:57 -0400 Subject: [PATCH 01/43] switch-to-bam post --- _posts/2024-05-14-switch-to-bam.md | 125 +++++++++++++++++++++++++++++ 1 file changed, 125 insertions(+) create mode 100644 _posts/2024-05-14-switch-to-bam.md diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md new file mode 100644 index 0000000..8ffd6a7 --- /dev/null +++ b/_posts/2024-05-14-switch-to-bam.md @@ -0,0 +1,125 @@ +--- +layout: post +title: What does it take to switch over to bam from fastq? +tags: + - bam + - fastq +contributors: + - lskatz + - SeanSierra-Patev +reviewers: + +--- + +It has been over XX years since the formalization of the fastq format. +It describes sequences and their quality scores. +For paired end reads, sequences are encoded in separate files, usually called R1 and R2. +However, we have had a lot of innovations in sequence quality since then. +One of those innovations is the BAM format which is the binary alignment/mapping format. +R1 and R2 are encoded in the same file. + +Fastq files are integral to genomic epidemiology. +State health labs sequence genomes, +transfer the fastq files to an internal repository, +run quality checks (QC), usually through a quality assurance (QA) pipeline. + +**So what would it take to change this whole process to bam files instead?** +Bam files have many advantages including having one file per sample, +encoding extra information such as alignment data, +and being indexed for random access. +For our purposes here, bam files will be unaligned bam (uBAM) +because they are not aligned against anything. + +## Sequencing + +First, the sequencers would have to output bam files. +Can they? +The Illumina platforms and the Ion Torrent platforms do automatically. +[Need help here: PacBio? ONT?] +For platforms that do not have this automation, +we would need an easy conversion. + +One such easy conversion is with `samtools`, e.g., + +```shell +samtools import -1 R1.fastq.gz -2 R2.fastq.gz --order ro -O bam,level=0 | \ + samtools sort - -o sorted.bam +# optionally, index +samtools index sorted.bam +``` + +## Repository + +Usually the fastq files end up in some kind of repository, organized by +run or by organism. +Instead of fastq files, it is easy to imagine that now the +repository consists of the bam files alone. + +## QA/QC + +Now that the files are in the repository, how do you run QA on them? +Some examples of a QA system include + +* SneakerNet +* Phoenix +* Pandoo +* Nullarbor + +Which of these pipelines can read a bam file? +To my knowledge, none of them! [Need help here: is this correct?] +This is one area where we need to see more adaptation of bam files. + +## Primary analysis + +There are several primary analyses that can be performed +right after QA/QC. +For example, reads might need to be transformed into an assembly. +Or also, they might need to have multilocus sequence typing (MLST) +run so that all alleles are known in future comparisons. +Some labs are sketching genomes as they are sequenced +and so certain tools can create those sketches. +Finally, there might be individual genotyping operations +such as Salmonella serotyping or virulence factors detection. + +### assembly + +Which assembly methods can read bam files? + +### MLST + +What takes bams? + +### Sketches + +Mash, etc + +### Genotyping + +What takes bams? + +Salmonella genotyping? + +E. coli serogrouping? + +## Secondary analysis + +This is the stage of where we start seeing exciting things. +What does the sample cluster with? +We have a few methods out there that read assemblies. +If they do, then that's fine because we have already used +the bam files to create an assembly at this point. +If they use reads, then we'll need to make sure that these +methods can read bam. + +### kmer + +Mashtree + +SKA + +KSNP4 + +### MLST clustering + +### SNP analysis + From 59367a156167ed9f1584fb2a8e6a2dbfe2f8b2f8 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Tue, 14 May 2024 13:38:29 -0400 Subject: [PATCH 02/43] conclusion --- _posts/2024-05-14-switch-to-bam.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 8ffd6a7..230f46e 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -123,3 +123,16 @@ KSNP4 ### SNP analysis +## Conclusion + +Bioinformatics software developers should be looking ahead +to future proof their software. +One way would be to accept more than just fastq as standard +input. +Although it might seem straightforward, there are several +links in the genomic epidemiology chain that need to be +updated. +These include ensuring that all sequencing platforms output a bam +but also updating QA/QC pipelines, primary analyses, and +secondary analyses. + From 9c5123d9951ce84357a61ecd1c9a7d16b2dc4ee9 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 11:05:15 -0400 Subject: [PATCH 03/43] more content added --- _posts/2024-05-14-switch-to-bam.md | 65 ++++++++++++++++-------------- 1 file changed, 35 insertions(+), 30 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 230f46e..8e549d0 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -11,7 +11,7 @@ reviewers: --- -It has been over XX years since the formalization of the fastq format. +It has been over 14 years since the formalization of the fastq format [^1]. It describes sequences and their quality scores. For paired end reads, sequences are encoded in separate files, usually called R1 and R2. However, we have had a lot of innovations in sequence quality since then. @@ -83,45 +83,43 @@ such as Salmonella serotyping or virulence factors detection. ### assembly -Which assembly methods can read bam files? +For genome assembly, many labs are using Shovill. +However, Shovill does not natively read bam files. +Therefore, this workflow breaks slightly unless there is some conversion step. + +Other assemblers that people commonly use for bacterial genomes are SPAdes and SKESA. +SPAdes does read bam natively and so that is good. +However, it does not appear that SKESA can read bam natively. ### MLST -What takes bams? +MLST software usually takes fasta or fastq files. +At this point there are a million classic MLST software packages and for some additional information, +please check out Page et al 2017 (ref). +For whole genome MLST software tools, I could also not find any packages that natively read bam. +Please see [my previous blog post](https://lskatz.github.io/posts/2023/04/09/wgMLST.html) for an in depth view into three of them. +I could not find any MLST software that reads bam natively. ### Sketches -Mash, etc +If you're like me, you want to have a directory of at least some sketches from Mash. +(see the [mashpit project](https://github.com/tongzhouxu/mashpit) for an exciting project!) +It appears that Mash natively does not read bam according to the v2.3 usage menu. +However it is promising that the Sourmash library [_does_ read bam](https://sourmash.readthedocs.io/en/latest/release-notes/sourmash-2.0.html#major-new-features-since-1-0) natively since version 2. ### Genotyping -What takes bams? - -Salmonella genotyping? - -E. coli serogrouping? - -## Secondary analysis - -This is the stage of where we start seeing exciting things. -What does the sample cluster with? -We have a few methods out there that read assemblies. -If they do, then that's fine because we have already used -the bam files to create an assembly at this point. -If they use reads, then we'll need to make sure that these -methods can read bam. +Generally in my experience, people base genotyping on either +[KMA](https://bitbucket.org/genomicepidemiology/kma), +[SRST2](https://github.com/katholt/srst2), +[SAUTE](https://github.com/ncbi/SKESA), +or [ARIBA](https://github.com/sanger-pathogens/ariba). -### kmer +Looking at each of these software packages, I could not find any documentation that bam is natively read. +However, I could find that fasta or fastq were valid inputs. -Mashtree - -SKA - -KSNP4 - -### MLST clustering - -### SNP analysis +There are other software packages in the world for specific pathogens like _Salmonella_ +but for this generalized blog post, I did not investigate further. ## Conclusion @@ -135,4 +133,11 @@ updated. These include ensuring that all sequencing platforms output a bam but also updating QA/QC pipelines, primary analyses, and secondary analyses. - +I should also say that I am guilty of this. +For some of my own popular software such as +[Mashtree](https://github.com/lskatz/mashtree/tree/master/.github/workflows) +and [Lyve-SET](https://github.com/lskatz/lyve-SET/), they do not natively read bam files! +I wish I could say that I will address this right away, but with all my other responsibilities, it will be further down the road. +Therefore I can say from my observations and my own personal experience, there is some work up ahead to get us moved over to bam files! + +[^1] Peter J.A. Cock et al 2010 "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants", _NAR_ \ No newline at end of file From 8759c87fe4dfc7ddc0b87ce6c2509be2f3006a64 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 11:09:35 -0400 Subject: [PATCH 04/43] phoenix added to the list --- _posts/2024-05-14-switch-to-bam.md | 1 + 1 file changed, 1 insertion(+) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 8e549d0..d651300 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -64,6 +64,7 @@ Some examples of a QA system include * Phoenix * Pandoo * Nullarbor +* Phoenix Which of these pipelines can read a bam file? To my knowledge, none of them! [Need help here: is this correct?] From 398f42a1c5bc55afef048228414adb0493899bf5 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 12:28:32 -0400 Subject: [PATCH 05/43] references and fastq description --- _posts/2024-05-14-switch-to-bam.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index d651300..8d6e68f 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -11,10 +11,15 @@ reviewers: --- -It has been over 14 years since the formalization of the fastq format [^1]. +It has been over 14 years since the formalization of the fastq format {% cite Cock2009 %}. It describes sequences and their quality scores. For paired end reads, sequences are encoded in separate files, usually called R1 and R2. -However, we have had a lot of innovations in sequence quality since then. +Unfortunately despite the publication, +fastq format is not entirely standardized! +For example, it is possible to have a valid fastq format in either 4-line-per-entry format, +or splitting sequences into multiple lines. +Additionally, the defline itself is not entirely standardized and is basically free text. +However, we have had a lot of innovations in sequence formats since then. One of those innovations is the BAM format which is the binary alignment/mapping format. R1 and R2 are encoded in the same file. @@ -96,7 +101,7 @@ However, it does not appear that SKESA can read bam natively. MLST software usually takes fasta or fastq files. At this point there are a million classic MLST software packages and for some additional information, -please check out Page et al 2017 (ref). +please check out Page et al 2017 {% cite Page2017 %}. For whole genome MLST software tools, I could also not find any packages that natively read bam. Please see [my previous blog post](https://lskatz.github.io/posts/2023/04/09/wgMLST.html) for an in depth view into three of them. I could not find any MLST software that reads bam natively. @@ -140,5 +145,3 @@ For some of my own popular software such as and [Lyve-SET](https://github.com/lskatz/lyve-SET/), they do not natively read bam files! I wish I could say that I will address this right away, but with all my other responsibilities, it will be further down the road. Therefore I can say from my observations and my own personal experience, there is some work up ahead to get us moved over to bam files! - -[^1] Peter J.A. Cock et al 2010 "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants", _NAR_ \ No newline at end of file From 47570c11c2e5c2695266c6cbb2e626b3447eeb7a Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 12:39:01 -0400 Subject: [PATCH 06/43] URLs --- _posts/2024-05-14-switch-to-bam.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 8d6e68f..7b5e0b7 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -65,11 +65,10 @@ repository consists of the bam files alone. Now that the files are in the repository, how do you run QA on them? Some examples of a QA system include -* SneakerNet -* Phoenix -* Pandoo -* Nullarbor -* Phoenix +* [SneakerNet](https://github.com/lskatz/SneakerNet) +* [Phoenix](https://github.com/CDCgov/phoenix) +* [Pandoo](https://github.com/schultzm/pandoo) +* [Nullarbor](https://github.com/tseemann/nullarbor) Which of these pipelines can read a bam file? To my knowledge, none of them! [Need help here: is this correct?] From 7da2a01344b58796b5a7f8a5045697c9ba55f1cd Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 12:47:14 -0400 Subject: [PATCH 07/43] ONT reads bam natively --- Gemfile.lock | 14 +++++++------- _posts/2024-05-14-switch-to-bam.md | 6 +++++- 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/Gemfile.lock b/Gemfile.lock index 39448dc..34b2ce8 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -24,9 +24,10 @@ GEM eventmachine (1.2.7) ffi (1.16.3) forwardable-extended (2.6.0) - google-protobuf (3.25.3-x86_64-linux) + google-protobuf (4.26.1-x86_64-linux) + rake (>= 13) http_parser.rb (0.8.0) - i18n (1.14.4) + i18n (1.14.5) concurrent-ruby (~> 1.0) jekyll (4.3.3) addressable (~> 2.4) @@ -78,18 +79,17 @@ GEM nuggets (1.6.1) pathutil (0.16.2) forwardable-extended (~> 2.6) - public_suffix (5.0.4) + public_suffix (5.0.5) racc (1.7.3) - rake (13.1.0) + rake (13.2.1) rb-fsevent (0.11.2) rb-inotify (0.10.1) ffi (~> 1.0) rexml (3.2.6) rouge (4.2.1) safe_yaml (1.0.5) - sass-embedded (1.69.5) - google-protobuf (~> 3.23) - rake (>= 13.0.0) + sass-embedded (1.77.1-x86_64-linux-gnu) + google-protobuf (>= 3.25, < 5.0) terminal-table (3.0.2) unicode-display_width (>= 1.1.1, < 3) unicode-display_width (2.5.0) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 7b5e0b7..91221cd 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -40,7 +40,8 @@ because they are not aligned against anything. First, the sequencers would have to output bam files. Can they? The Illumina platforms and the Ion Torrent platforms do automatically. -[Need help here: PacBio? ONT?] +After calling with at least Dorado, ONT sequencing outputs bam files. +[Need help here: PacBio? ] For platforms that do not have this automation, we would need an easy conversion. @@ -128,6 +129,9 @@ but for this generalized blog post, I did not investigate further. ## Conclusion +The good part is that sequencing platforms output bam format natively, for what I can tell. +However, we need software to natively read these bam files. + Bioinformatics software developers should be looking ahead to future proof their software. One way would be to accept more than just fastq as standard From 9d0fcfa65f0e80f99eedb70c5f586242230ebba9 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 12:51:38 -0400 Subject: [PATCH 08/43] light revision --- _posts/2024-05-14-switch-to-bam.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 91221cd..b034a04 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -134,17 +134,15 @@ However, we need software to natively read these bam files. Bioinformatics software developers should be looking ahead to future proof their software. -One way would be to accept more than just fastq as standard -input. +They need to accept bam natively as input. Although it might seem straightforward, there are several links in the genomic epidemiology chain that need to be updated. -These include ensuring that all sequencing platforms output a bam -but also updating QA/QC pipelines, primary analyses, and +These include updating QA/QC pipelines, primary analyses, and secondary analyses. I should also say that I am guilty of this. For some of my own popular software such as -[Mashtree](https://github.com/lskatz/mashtree/tree/master/.github/workflows) +[Mashtree](https://github.com/lskatz/mashtree/tree/master/.github/workflows) and [Lyve-SET](https://github.com/lskatz/lyve-SET/), they do not natively read bam files! I wish I could say that I will address this right away, but with all my other responsibilities, it will be further down the road. Therefore I can say from my observations and my own personal experience, there is some work up ahead to get us moved over to bam files! From f08895c4b431fa64e55d7818821c827b89f03394 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 12:52:32 -0400 Subject: [PATCH 09/43] remove newlines --- _posts/2024-05-14-switch-to-bam.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index b034a04..4c41808 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -120,10 +120,8 @@ Generally in my experience, people base genotyping on either [SRST2](https://github.com/katholt/srst2), [SAUTE](https://github.com/ncbi/SKESA), or [ARIBA](https://github.com/sanger-pathogens/ariba). - Looking at each of these software packages, I could not find any documentation that bam is natively read. However, I could find that fasta or fastq were valid inputs. - There are other software packages in the world for specific pathogens like _Salmonella_ but for this generalized blog post, I did not investigate further. From c80f39c25b363598ce2096a65926396a98bd969f Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 13:15:38 -0400 Subject: [PATCH 10/43] sorage advantage --- _posts/2024-05-14-switch-to-bam.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 4c41808..b54d314 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -60,6 +60,9 @@ Usually the fastq files end up in some kind of repository, organized by run or by organism. Instead of fastq files, it is easy to imagine that now the repository consists of the bam files alone. +One might be concerned about taking additional space, +but actually unsorted bam may offer a storage space savings over individual fastq files, +besides reduction in complexity gained by combining forward and reverse reads. ## QA/QC From b81e3413d51b6300950bdb7e90c91abc91349992 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 13:24:16 -0400 Subject: [PATCH 11/43] remove request for help on QA systems --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index b54d314..20e1066 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -75,7 +75,7 @@ Some examples of a QA system include * [Nullarbor](https://github.com/tseemann/nullarbor) Which of these pipelines can read a bam file? -To my knowledge, none of them! [Need help here: is this correct?] +To my knowledge, none of them! This is one area where we need to see more adaptation of bam files. ## Primary analysis From e5b417eb3f1b1a6091b35a4c1b644b8e6511e3f7 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 13:29:28 -0400 Subject: [PATCH 12/43] references --- references.bib | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/references.bib b/references.bib index 0fff57c..58b674f 100644 --- a/references.bib +++ b/references.bib @@ -13,3 +13,31 @@ @article{Bankevich2012 pages = {455–477} } +@article{Page2017, + title = {Comparison of classical multi-locus sequence typing software for next-generation sequencing data}, + volume = {3}, + ISSN = {2057-5858}, + url = {http://dx.doi.org/10.1099/mgen.0.000124}, + DOI = {10.1099/mgen.0.000124}, + number = {8}, + journal = {Microbial Genomics}, + publisher = {Microbiology Society}, + author = {Page, Andrew J. and Alikhan, Nabil-Fareed and Carleton, Heather A. and Seemann, Torsten and Keane, Jacqueline A. and Katz, Lee S.}, + year = {2017}, + month = jun +} + +@article{Cock2009, + title = {The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants}, + volume = {38}, + ISSN = {1362-4962}, + url = {http://dx.doi.org/10.1093/nar/gkp1137}, + DOI = {10.1093/nar/gkp1137}, + number = {6}, + journal = {Nucleic Acids Research}, + publisher = {Oxford University Press (OUP)}, + author = {Cock, Peter J. A. and Fields, Christopher J. and Goto, Naohisa and Heuer, Michael L. and Rice, Peter M.}, + year = {2009}, + month = dec, + pages = {1767–1771} +} \ No newline at end of file From 4a051d371e9582c8458be97f06052f167e468f7d Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 20:09:24 -0400 Subject: [PATCH 13/43] addressed samtools sort and index, and also interleaved reads --- _posts/2024-05-14-switch-to-bam.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 20e1066..035b8a4 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -13,7 +13,7 @@ reviewers: It has been over 14 years since the formalization of the fastq format {% cite Cock2009 %}. It describes sequences and their quality scores. -For paired end reads, sequences are encoded in separate files, usually called R1 and R2. +For paired end reads, sequences are encoded in separate files, usually called R1 and R2 [^1]. Unfortunately despite the publication, fastq format is not entirely standardized! For example, it is possible to have a valid fastq format in either 4-line-per-entry format, @@ -49,9 +49,7 @@ One such easy conversion is with `samtools`, e.g., ```shell samtools import -1 R1.fastq.gz -2 R2.fastq.gz --order ro -O bam,level=0 | \ - samtools sort - -o sorted.bam -# optionally, index -samtools index sorted.bam + samtools sort -M - -o sorted.bam ``` ## Repository @@ -63,6 +61,9 @@ repository consists of the bam files alone. One might be concerned about taking additional space, but actually unsorted bam may offer a storage space savings over individual fastq files, besides reduction in complexity gained by combining forward and reverse reads. +The above command uses `-M` in `samtools sort`, which sorts by minimizers, thereby saving tons of space. +In my own anecdote, I transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted bam file. +This represents huge file storage savings. ## QA/QC @@ -147,3 +148,5 @@ For some of my own popular software such as and [Lyve-SET](https://github.com/lskatz/lyve-SET/), they do not natively read bam files! I wish I could say that I will address this right away, but with all my other responsibilities, it will be further down the road. Therefore I can say from my observations and my own personal experience, there is some work up ahead to get us moved over to bam files! + +[^1] To maintain focus in this article, I will gloss past interleaved reads. From 35510b1cad4d9adb7432691128233138091c55ce Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Thu, 30 May 2024 20:10:31 -0400 Subject: [PATCH 14/43] brought back the main branch Gemfile.lock --- Gemfile.lock | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/Gemfile.lock b/Gemfile.lock index 34b2ce8..39448dc 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -24,10 +24,9 @@ GEM eventmachine (1.2.7) ffi (1.16.3) forwardable-extended (2.6.0) - google-protobuf (4.26.1-x86_64-linux) - rake (>= 13) + google-protobuf (3.25.3-x86_64-linux) http_parser.rb (0.8.0) - i18n (1.14.5) + i18n (1.14.4) concurrent-ruby (~> 1.0) jekyll (4.3.3) addressable (~> 2.4) @@ -79,17 +78,18 @@ GEM nuggets (1.6.1) pathutil (0.16.2) forwardable-extended (~> 2.6) - public_suffix (5.0.5) + public_suffix (5.0.4) racc (1.7.3) - rake (13.2.1) + rake (13.1.0) rb-fsevent (0.11.2) rb-inotify (0.10.1) ffi (~> 1.0) rexml (3.2.6) rouge (4.2.1) safe_yaml (1.0.5) - sass-embedded (1.77.1-x86_64-linux-gnu) - google-protobuf (>= 3.25, < 5.0) + sass-embedded (1.69.5) + google-protobuf (~> 3.23) + rake (>= 13.0.0) terminal-table (3.0.2) unicode-display_width (>= 1.1.1, < 3) unicode-display_width (2.5.0) From 220de135f5130f6f289ab0bc66f746667921d98a Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Fri, 31 May 2024 11:01:04 -0400 Subject: [PATCH 15/43] cram --- _posts/2024-05-14-switch-to-bam.md | 32 +++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 035b8a4..e6451ba 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -129,9 +129,38 @@ However, I could find that fasta or fastq were valid inputs. There are other software packages in the world for specific pathogens like _Salmonella_ but for this generalized blog post, I did not investigate further. +## Other compression methods + +This article discusses the bam format, but there have been many attempts over the years to make other compression formats [^2]. +The [cram format](https://samtools.github.io/hts-specs/CRAMv3.pdf) is probably the best example. +Cram has an even better lossless compression of sequences than bam, +and I even confirmed it on my own sequence. + +```bash +samtools import -1 1.fastq.gz -2 2.fastq.gz --order ro -O bam,level=0 | \ + samtools sort -O cram --output-fmt-option archive -M - -o archive.cram +``` + +When viewing the same sequences in fastq.gz, bam, or cram, I get an astonishing reduction. + +```text +-rw-------. 1 gzu2 users 81M May 30 20:03 unmapped.bam +-rw-------. 1 gzu2 users 55M May 31 09:18 unmapped.cram +-rw-------. 1 gzu2 users 55M Dec 6 2019 1.fastq.gz +-rw-------. 1 gzu2 users 63M Dec 6 2019 2.fastq.gz +``` + +Cram is seemlessly incorporated into samtools and so you can freely convert between formats. +In fact, [EBI stores a ton of cram files already](https://x.com/BonfieldJames/status/1182180199657607168). +So why wouldn't we recommend cram up front? +Admittedly, I started writing this article because I understood sam/bam much more than cram when starting this article. +But also after some consideration, there is a bigger lift that would involve convincing many sequencing companies to adopt it. +I can check a box on my nanopore that makes bams; I can't do the same for cram. +That said, given an ideal world, I would encourage the sequencing companies to consider that check box. + ## Conclusion -The good part is that sequencing platforms output bam format natively, for what I can tell. +The good part is that sequencing platforms output bam format natively, from what I can tell. However, we need software to natively read these bam files. Bioinformatics software developers should be looking ahead @@ -150,3 +179,4 @@ I wish I could say that I will address this right away, but with all my other re Therefore I can say from my observations and my own personal experience, there is some work up ahead to get us moved over to bam files! [^1] To maintain focus in this article, I will gloss past interleaved reads. +[^2] One such example of an organization trying to standardize is here: . From 175dc618b3e6dd612d58c9d1f502b5a78485a632 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Fri, 31 May 2024 11:02:13 -0400 Subject: [PATCH 16/43] remove redundant sentence --- _posts/2024-05-14-switch-to-bam.md | 1 - 1 file changed, 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index e6451ba..ddb480f 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -108,7 +108,6 @@ At this point there are a million classic MLST software packages and for some ad please check out Page et al 2017 {% cite Page2017 %}. For whole genome MLST software tools, I could also not find any packages that natively read bam. Please see [my previous blog post](https://lskatz.github.io/posts/2023/04/09/wgMLST.html) for an in depth view into three of them. -I could not find any MLST software that reads bam natively. ### Sketches From 5eea8007f451a771c204d574de948c9d485c0878 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Fri, 31 May 2024 11:08:24 -0400 Subject: [PATCH 17/43] added peter for a reviewer --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index ddb480f..764de7c 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -8,7 +8,7 @@ contributors: - lskatz - SeanSierra-Patev reviewers: - + - pmenzel --- It has been over 14 years since the formalization of the fastq format {% cite Cock2009 %}. From 33ba49c5350e5598db866005e2dd429325a0b70c Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Fri, 31 May 2024 22:13:19 -0400 Subject: [PATCH 18/43] removed my username from code block; removed awkward sentence --- _posts/2024-05-14-switch-to-bam.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 764de7c..0c11ab0 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -143,17 +143,16 @@ samtools import -1 1.fastq.gz -2 2.fastq.gz --order ro -O bam,level=0 | \ When viewing the same sequences in fastq.gz, bam, or cram, I get an astonishing reduction. ```text --rw-------. 1 gzu2 users 81M May 30 20:03 unmapped.bam --rw-------. 1 gzu2 users 55M May 31 09:18 unmapped.cram --rw-------. 1 gzu2 users 55M Dec 6 2019 1.fastq.gz --rw-------. 1 gzu2 users 63M Dec 6 2019 2.fastq.gz +-rw-------. 1 user users 81M May 30 20:03 unmapped.bam +-rw-------. 1 user users 55M May 31 09:18 unmapped.cram +-rw-------. 1 user users 55M Dec 6 2019 1.fastq.gz +-rw-------. 1 user users 63M Dec 6 2019 2.fastq.gz ``` Cram is seemlessly incorporated into samtools and so you can freely convert between formats. In fact, [EBI stores a ton of cram files already](https://x.com/BonfieldJames/status/1182180199657607168). So why wouldn't we recommend cram up front? -Admittedly, I started writing this article because I understood sam/bam much more than cram when starting this article. -But also after some consideration, there is a bigger lift that would involve convincing many sequencing companies to adopt it. +Probably because it is a bigger lift that would involve convincing many sequencing companies to adopt it. I can check a box on my nanopore that makes bams; I can't do the same for cram. That said, given an ideal world, I would encourage the sequencing companies to consider that check box. From 17e501fb3693d3cd46a37bc6e7278f64f26f5316 Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Mon, 3 Jun 2024 14:26:50 -0400 Subject: [PATCH 19/43] Update _posts/2024-05-14-switch-to-bam.md: combine two short sentences Co-authored-by: Helena --- _posts/2024-05-14-switch-to-bam.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 0c11ab0..7088897 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -11,8 +11,8 @@ reviewers: - pmenzel --- -It has been over 14 years since the formalization of the fastq format {% cite Cock2009 %}. -It describes sequences and their quality scores. +It has been over 14 years since the formalization of the fastq format {% cite Cock2009 %}, +which describes sequences and their quality scores. For paired end reads, sequences are encoded in separate files, usually called R1 and R2 [^1]. Unfortunately despite the publication, fastq format is not entirely standardized! From e5d1a063bcfe3451100c351120dbbf979cab3011 Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Mon, 3 Jun 2024 14:28:48 -0400 Subject: [PATCH 20/43] Update _posts/2024-05-14-switch-to-bam.md: added better written sentence about storage savings Co-authored-by: Helena --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 7088897..038f489 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -63,7 +63,7 @@ but actually unsorted bam may offer a storage space savings over individual fast besides reduction in complexity gained by combining forward and reverse reads. The above command uses `-M` in `samtools sort`, which sorts by minimizers, thereby saving tons of space. In my own anecdote, I transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted bam file. -This represents huge file storage savings. +A 31% storage reduction, in this case, can represents huge file storage savings across an entire sequencing data repository. ## QA/QC From f4496162135413445a0ece241ea69731c6da2f5c Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Mon, 3 Jun 2024 14:29:16 -0400 Subject: [PATCH 21/43] Update _posts/2024-05-14-switch-to-bam.md: better format for Page et al citation Co-authored-by: Helena --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 038f489..e7c3f71 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -105,7 +105,7 @@ However, it does not appear that SKESA can read bam natively. MLST software usually takes fasta or fastq files. At this point there are a million classic MLST software packages and for some additional information, -please check out Page et al 2017 {% cite Page2017 %}. +please check out {% cite Page2017 %}. For whole genome MLST software tools, I could also not find any packages that natively read bam. Please see [my previous blog post](https://lskatz.github.io/posts/2023/04/09/wgMLST.html) for an in depth view into three of them. From 832235cec6f4503eca549d424131914a8a3796b9 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Mon, 3 Jun 2024 14:37:26 -0400 Subject: [PATCH 22/43] Added suggestions from hexylena esp uBAM notation --- _posts/2024-05-14-switch-to-bam.md | 47 +++++++++++++++--------------- 1 file changed, 24 insertions(+), 23 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index e7c3f71..74ec55e 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -9,9 +9,10 @@ contributors: - SeanSierra-Patev reviewers: - pmenzel + - hexylena --- -It has been over 14 years since the formalization of the fastq format {% cite Cock2009 %}, +It has been over 14 years since the formalization of the fastq format ({% cite Cock2009 %}), which describes sequences and their quality scores. For paired end reads, sequences are encoded in separate files, usually called R1 and R2 [^1]. Unfortunately despite the publication, @@ -37,10 +38,10 @@ because they are not aligned against anything. ## Sequencing -First, the sequencers would have to output bam files. +First, the sequencers would have to output uBAM files. Can they? The Illumina platforms and the Ion Torrent platforms do automatically. -After calling with at least Dorado, ONT sequencing outputs bam files. +After calling with at least Dorado, ONT sequencing outputs uBAM files. [Need help here: PacBio? ] For platforms that do not have this automation, we would need an easy conversion. @@ -57,13 +58,13 @@ samtools import -1 R1.fastq.gz -2 R2.fastq.gz --order ro -O bam,level=0 | \ Usually the fastq files end up in some kind of repository, organized by run or by organism. Instead of fastq files, it is easy to imagine that now the -repository consists of the bam files alone. +repository consists of the uBAM files alone. One might be concerned about taking additional space, -but actually unsorted bam may offer a storage space savings over individual fastq files, +but actually uBAM files may offer a storage space savings over individual fastq files, besides reduction in complexity gained by combining forward and reverse reads. The above command uses `-M` in `samtools sort`, which sorts by minimizers, thereby saving tons of space. -In my own anecdote, I transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted bam file. -A 31% storage reduction, in this case, can represents huge file storage savings across an entire sequencing data repository. +In my own anecdote, I transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted uBAM file. +A 31% storage reduction, in this case, can represent huge file storage savings across an entire sequencing data repository. ## QA/QC @@ -75,9 +76,9 @@ Some examples of a QA system include * [Pandoo](https://github.com/schultzm/pandoo) * [Nullarbor](https://github.com/tseemann/nullarbor) -Which of these pipelines can read a bam file? +Which of these pipelines can read a uBAM file? To my knowledge, none of them! -This is one area where we need to see more adaptation of bam files. +This is one area where we need to see more adaptation of uBAM files. ## Primary analysis @@ -94,27 +95,27 @@ such as Salmonella serotyping or virulence factors detection. ### assembly For genome assembly, many labs are using Shovill. -However, Shovill does not natively read bam files. +However, Shovill does not natively read uBAM files. Therefore, this workflow breaks slightly unless there is some conversion step. Other assemblers that people commonly use for bacterial genomes are SPAdes and SKESA. -SPAdes does read bam natively and so that is good. -However, it does not appear that SKESA can read bam natively. +SPAdes does read uBAM natively and so that is good. +However, it does not appear that SKESA can read uBAM natively. ### MLST MLST software usually takes fasta or fastq files. At this point there are a million classic MLST software packages and for some additional information, please check out {% cite Page2017 %}. -For whole genome MLST software tools, I could also not find any packages that natively read bam. +For whole genome MLST software tools, I could also not find any packages that natively read uBAM. Please see [my previous blog post](https://lskatz.github.io/posts/2023/04/09/wgMLST.html) for an in depth view into three of them. ### Sketches If you're like me, you want to have a directory of at least some sketches from Mash. (see the [mashpit project](https://github.com/tongzhouxu/mashpit) for an exciting project!) -It appears that Mash natively does not read bam according to the v2.3 usage menu. -However it is promising that the Sourmash library [_does_ read bam](https://sourmash.readthedocs.io/en/latest/release-notes/sourmash-2.0.html#major-new-features-since-1-0) natively since version 2. +It appears that Mash natively does not read uBAM according to the v2.3 usage menu. +However it is promising that the Sourmash library [_does_ read uBAM](https://sourmash.readthedocs.io/en/latest/release-notes/sourmash-2.0.html#major-new-features-since-1-0) natively since version 2. ### Genotyping @@ -123,16 +124,16 @@ Generally in my experience, people base genotyping on either [SRST2](https://github.com/katholt/srst2), [SAUTE](https://github.com/ncbi/SKESA), or [ARIBA](https://github.com/sanger-pathogens/ariba). -Looking at each of these software packages, I could not find any documentation that bam is natively read. +Looking at each of these software packages, I could not find any documentation that uBAM is natively read. However, I could find that fasta or fastq were valid inputs. There are other software packages in the world for specific pathogens like _Salmonella_ but for this generalized blog post, I did not investigate further. ## Other compression methods -This article discusses the bam format, but there have been many attempts over the years to make other compression formats [^2]. +This article discusses the uBAM format, but there have been many attempts over the years to make other compression formats [^2]. The [cram format](https://samtools.github.io/hts-specs/CRAMv3.pdf) is probably the best example. -Cram has an even better lossless compression of sequences than bam, +Cram has an even better lossless compression of sequences than uBAM, and I even confirmed it on my own sequence. ```bash @@ -158,12 +159,12 @@ That said, given an ideal world, I would encourage the sequencing companies to c ## Conclusion -The good part is that sequencing platforms output bam format natively, from what I can tell. -However, we need software to natively read these bam files. +The good part is that sequencing platforms output uBAM format natively, from what I can tell. +However, we need software to natively read these uBAM files. Bioinformatics software developers should be looking ahead to future proof their software. -They need to accept bam natively as input. +They need to accept uBAM natively as input. Although it might seem straightforward, there are several links in the genomic epidemiology chain that need to be updated. @@ -172,9 +173,9 @@ secondary analyses. I should also say that I am guilty of this. For some of my own popular software such as [Mashtree](https://github.com/lskatz/mashtree/tree/master/.github/workflows) -and [Lyve-SET](https://github.com/lskatz/lyve-SET/), they do not natively read bam files! +and [Lyve-SET](https://github.com/lskatz/lyve-SET/), they do not natively read uBAM files! I wish I could say that I will address this right away, but with all my other responsibilities, it will be further down the road. -Therefore I can say from my observations and my own personal experience, there is some work up ahead to get us moved over to bam files! +Therefore I can say from my observations and my own personal experience, there is some work up ahead to get us moved over to uBAM files! [^1] To maintain focus in this article, I will gloss past interleaved reads. [^2] One such example of an organization trying to standardize is here: . From 34c7619bf0752546c449c2ea232808682de2035c Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Mon, 3 Jun 2024 14:44:09 -0400 Subject: [PATCH 23/43] fixed markdown linting --- _posts/2024-05-14-switch-to-bam.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 74ec55e..6c4a138 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -57,7 +57,7 @@ samtools import -1 R1.fastq.gz -2 R2.fastq.gz --order ro -O bam,level=0 | \ Usually the fastq files end up in some kind of repository, organized by run or by organism. -Instead of fastq files, it is easy to imagine that now the +Instead of fastq files, it is easy to imagine that now the repository consists of the uBAM files alone. One might be concerned about taking additional space, but actually uBAM files may offer a storage space savings over individual fastq files, @@ -171,7 +171,7 @@ updated. These include updating QA/QC pipelines, primary analyses, and secondary analyses. I should also say that I am guilty of this. -For some of my own popular software such as +For some of my own popular software such as [Mashtree](https://github.com/lskatz/mashtree/tree/master/.github/workflows) and [Lyve-SET](https://github.com/lskatz/lyve-SET/), they do not natively read uBAM files! I wish I could say that I will address this right away, but with all my other responsibilities, it will be further down the road. From be825e5d11d8d1746b7c22137f1633ab77e105db Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Tue, 4 Jun 2024 08:25:44 -0400 Subject: [PATCH 24/43] Update _posts/2024-05-14-switch-to-bam.md Co-authored-by: Peter Menzel --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 6c4a138..c78e93f 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -21,7 +21,7 @@ For example, it is possible to have a valid fastq format in either 4-line-per-en or splitting sequences into multiple lines. Additionally, the defline itself is not entirely standardized and is basically free text. However, we have had a lot of innovations in sequence formats since then. -One of those innovations is the BAM format which is the binary alignment/mapping format. +One of those innovations is the [SAM/BAM format](https://samtools.github.io/hts-specs/), the (binary) alignment/mapping format. While this file format is typically used to store information about alignments of sequencing reads, it can also just stored the unaligned sequencing data. R1 and R2 are encoded in the same file. Fastq files are integral to genomic epidemiology. From 9e3b717ec3fcf844aa793afa696b53b632facf2a Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Tue, 4 Jun 2024 08:25:57 -0400 Subject: [PATCH 25/43] Update _posts/2024-05-14-switch-to-bam.md Co-authored-by: Peter Menzel --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index c78e93f..816d624 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -24,7 +24,7 @@ However, we have had a lot of innovations in sequence formats since then. One of those innovations is the [SAM/BAM format](https://samtools.github.io/hts-specs/), the (binary) alignment/mapping format. While this file format is typically used to store information about alignments of sequencing reads, it can also just stored the unaligned sequencing data. R1 and R2 are encoded in the same file. -Fastq files are integral to genomic epidemiology. +Fastq files, as a means for storing primary sequencing data before any downstream analysis, are integral to genomic epidemiology. State health labs sequence genomes, transfer the fastq files to an internal repository, run quality checks (QC), usually through a quality assurance (QA) pipeline. From 5ddd34693db2dd6593be9d7a7f792b8ec81d8dc3 Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Tue, 4 Jun 2024 08:26:16 -0400 Subject: [PATCH 26/43] Update _posts/2024-05-14-switch-to-bam.md Co-authored-by: Peter Menzel --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 816d624..704b36d 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -29,7 +29,7 @@ State health labs sequence genomes, transfer the fastq files to an internal repository, run quality checks (QC), usually through a quality assurance (QA) pipeline. -**So what would it take to change this whole process to bam files instead?** +**So what would it take to change this whole process to bam files instead?** Bam files have many advantages including having one file per sample, encoding extra information such as alignment data, and being indexed for random access. From 82befe82013338dbfdac2013fb3ed63cb2b8e459 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Tue, 4 Jun 2024 09:21:44 -0400 Subject: [PATCH 27/43] addressed some of pmenzel's comments --- _posts/2024-05-14-switch-to-bam.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 704b36d..86d3d47 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -22,7 +22,7 @@ or splitting sequences into multiple lines. Additionally, the defline itself is not entirely standardized and is basically free text. However, we have had a lot of innovations in sequence formats since then. One of those innovations is the [SAM/BAM format](https://samtools.github.io/hts-specs/), the (binary) alignment/mapping format. While this file format is typically used to store information about alignments of sequencing reads, it can also just stored the unaligned sequencing data. -R1 and R2 are encoded in the same file. +Crucially, both reads from paired-end sequencing (i.e., R1 and R2) are stored in the same single file. Fastq files, as a means for storing primary sequencing data before any downstream analysis, are integral to genomic epidemiology. State health labs sequence genomes, @@ -30,7 +30,7 @@ transfer the fastq files to an internal repository, run quality checks (QC), usually through a quality assurance (QA) pipeline. **So what would it take to change this whole process to bam files instead?** -Bam files have many advantages including having one file per sample, +BAM files have many advantages including having only one file per sample instead of two, encoding extra information such as alignment data, and being indexed for random access. For our purposes here, bam files will be unaligned bam (uBAM) From 1c5afa725fdf7f16139454d849fdcd79b9f3830c Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Tue, 4 Jun 2024 09:55:42 -0400 Subject: [PATCH 28/43] fixed first person singular to plural; fixed file format capitalization --- _posts/2024-05-14-switch-to-bam.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 86d3d47..8c0dbf8 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -63,7 +63,7 @@ One might be concerned about taking additional space, but actually uBAM files may offer a storage space savings over individual fastq files, besides reduction in complexity gained by combining forward and reverse reads. The above command uses `-M` in `samtools sort`, which sorts by minimizers, thereby saving tons of space. -In my own anecdote, I transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted uBAM file. +In our own anecdote, we transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted uBAM file. A 31% storage reduction, in this case, can represent huge file storage savings across an entire sequencing data repository. ## QA/QC @@ -107,7 +107,7 @@ However, it does not appear that SKESA can read uBAM natively. MLST software usually takes fasta or fastq files. At this point there are a million classic MLST software packages and for some additional information, please check out {% cite Page2017 %}. -For whole genome MLST software tools, I could also not find any packages that natively read uBAM. +For whole genome MLST software tools, we could also not find any packages that natively read uBAM. Please see [my previous blog post](https://lskatz.github.io/posts/2023/04/09/wgMLST.html) for an in depth view into three of them. ### Sketches @@ -124,24 +124,24 @@ Generally in my experience, people base genotyping on either [SRST2](https://github.com/katholt/srst2), [SAUTE](https://github.com/ncbi/SKESA), or [ARIBA](https://github.com/sanger-pathogens/ariba). -Looking at each of these software packages, I could not find any documentation that uBAM is natively read. -However, I could find that fasta or fastq were valid inputs. +Looking at each of these software packages, we could not find any documentation that uBAM is natively read. +However, we could find that FASTA and FASTQ were valid inputs. There are other software packages in the world for specific pathogens like _Salmonella_ -but for this generalized blog post, I did not investigate further. +but for this generalized blog post, we did not investigate further. ## Other compression methods This article discusses the uBAM format, but there have been many attempts over the years to make other compression formats [^2]. The [cram format](https://samtools.github.io/hts-specs/CRAMv3.pdf) is probably the best example. Cram has an even better lossless compression of sequences than uBAM, -and I even confirmed it on my own sequence. +and we even confirmed it on our own sequence. ```bash samtools import -1 1.fastq.gz -2 2.fastq.gz --order ro -O bam,level=0 | \ samtools sort -O cram --output-fmt-option archive -M - -o archive.cram ``` -When viewing the same sequences in fastq.gz, bam, or cram, I get an astonishing reduction. +When viewing the same sequences in fastq.gz, bam, or cram, we get an astonishing reduction. ```text -rw-------. 1 user users 81M May 30 20:03 unmapped.bam @@ -154,12 +154,12 @@ Cram is seemlessly incorporated into samtools and so you can freely convert betw In fact, [EBI stores a ton of cram files already](https://x.com/BonfieldJames/status/1182180199657607168). So why wouldn't we recommend cram up front? Probably because it is a bigger lift that would involve convincing many sequencing companies to adopt it. -I can check a box on my nanopore that makes bams; I can't do the same for cram. -That said, given an ideal world, I would encourage the sequencing companies to consider that check box. +We can check a box on our nanopore that makes bams; we can't do the same for cram. +That said, given an ideal world, we would encourage the sequencing companies to consider that check box. ## Conclusion -The good part is that sequencing platforms output uBAM format natively, from what I can tell. +The good part is that sequencing platforms output uBAM format natively, from what we can tell. However, we need software to natively read these uBAM files. Bioinformatics software developers should be looking ahead @@ -170,12 +170,12 @@ links in the genomic epidemiology chain that need to be updated. These include updating QA/QC pipelines, primary analyses, and secondary analyses. -I should also say that I am guilty of this. +As for @lskatz, I should also say that I am guilty of this. For some of my own popular software such as [Mashtree](https://github.com/lskatz/mashtree/tree/master/.github/workflows) and [Lyve-SET](https://github.com/lskatz/lyve-SET/), they do not natively read uBAM files! I wish I could say that I will address this right away, but with all my other responsibilities, it will be further down the road. -Therefore I can say from my observations and my own personal experience, there is some work up ahead to get us moved over to uBAM files! +Therefore we can say from our own observations and personal experience, there is some work up ahead to get us moved over to uBAM files! [^1] To maintain focus in this article, I will gloss past interleaved reads. [^2] One such example of an organization trying to standardize is here: . From 64f0587c9b07f47625742a079adcf37b15304d4a Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Tue, 4 Jun 2024 13:06:33 -0400 Subject: [PATCH 29/43] brought in Dbtara's excellent points and added as an author --- _posts/2024-05-14-switch-to-bam.md | 49 +++++++++++++++++------------- 1 file changed, 28 insertions(+), 21 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 8c0dbf8..099d87d 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -1,39 +1,46 @@ --- layout: post -title: What does it take to switch over to bam from fastq? +title: What does it take to switch over to BAM from FASTQ? tags: - - bam - - fastq + - BAM + - uBAM + - FASTQ + - file formats contributors: - lskatz - SeanSierra-Patev + - dbtara reviewers: - pmenzel - hexylena --- -It has been over 14 years since the formalization of the fastq format ({% cite Cock2009 %}), +It has been over 14 years since the formalization of the FASTQ format ({% cite Cock2009 %}), which describes sequences and their quality scores. For paired end reads, sequences are encoded in separate files, usually called R1 and R2 [^1]. Unfortunately despite the publication, -fastq format is not entirely standardized! -For example, it is possible to have a valid fastq format in either 4-line-per-entry format, +FASTQ format is not entirely standardized! +For example, it is possible to have a valid FASTQ format in either 4-line-per-entry format, or splitting sequences into multiple lines. Additionally, the defline itself is not entirely standardized and is basically free text. However, we have had a lot of innovations in sequence formats since then. One of those innovations is the [SAM/BAM format](https://samtools.github.io/hts-specs/), the (binary) alignment/mapping format. While this file format is typically used to store information about alignments of sequencing reads, it can also just stored the unaligned sequencing data. -Crucially, both reads from paired-end sequencing (i.e., R1 and R2) are stored in the same single file. +Crucially, both reads from paired-end sequencing (i.e., R1 and R2) are stored in the same single file +and [it allows for metadata as explained in this GATK post](https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format). +We found at least one other [blog post with this same sentiment from _2011_](https://blastedbio.blogspot.com/2011/10/fastq-must-die-long-live-sambam.html). +This idea isn't new. +It is frustratingly old. -Fastq files, as a means for storing primary sequencing data before any downstream analysis, are integral to genomic epidemiology. +FASTQ files, as a means for storing primary sequencing data before any downstream analysis, are integral to genomic epidemiology. State health labs sequence genomes, -transfer the fastq files to an internal repository, +transfer the FASTQ files to an internal repository, run quality checks (QC), usually through a quality assurance (QA) pipeline. -**So what would it take to change this whole process to bam files instead?** +**So what would it take to change this whole process to BAM files instead?** BAM files have many advantages including having only one file per sample instead of two, encoding extra information such as alignment data, and being indexed for random access. -For our purposes here, bam files will be unaligned bam (uBAM) +For our purposes here, BAM files will be unaligned BAM (uBAM) because they are not aligned against anything. ## Sequencing @@ -42,7 +49,7 @@ First, the sequencers would have to output uBAM files. Can they? The Illumina platforms and the Ion Torrent platforms do automatically. After calling with at least Dorado, ONT sequencing outputs uBAM files. -[Need help here: PacBio? ] +Pacbio does generate BAM as native format (they discontinued HDF5). For platforms that do not have this automation, we would need an easy conversion. @@ -55,12 +62,12 @@ samtools import -1 R1.fastq.gz -2 R2.fastq.gz --order ro -O bam,level=0 | \ ## Repository -Usually the fastq files end up in some kind of repository, organized by +Usually the FASTQ files end up in some kind of repository, organized by run or by organism. -Instead of fastq files, it is easy to imagine that now the +Instead of FASTQ files, it is easy to imagine that now the repository consists of the uBAM files alone. One might be concerned about taking additional space, -but actually uBAM files may offer a storage space savings over individual fastq files, +but actually uBAM files may offer a storage space savings over individual FASTQ files, besides reduction in complexity gained by combining forward and reverse reads. The above command uses `-M` in `samtools sort`, which sorts by minimizers, thereby saving tons of space. In our own anecdote, we transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted uBAM file. @@ -77,7 +84,7 @@ Some examples of a QA system include * [Nullarbor](https://github.com/tseemann/nullarbor) Which of these pipelines can read a uBAM file? -To my knowledge, none of them! +To our knowledge, none of them! This is one area where we need to see more adaptation of uBAM files. ## Primary analysis @@ -104,11 +111,11 @@ However, it does not appear that SKESA can read uBAM natively. ### MLST -MLST software usually takes fasta or fastq files. +MLST software usually takes FASTA or FASTQ files. At this point there are a million classic MLST software packages and for some additional information, please check out {% cite Page2017 %}. For whole genome MLST software tools, we could also not find any packages that natively read uBAM. -Please see [my previous blog post](https://lskatz.github.io/posts/2023/04/09/wgMLST.html) for an in depth view into three of them. +Please see [@lskatz's previous blog post](https://lskatz.github.io/posts/2023/04/09/wgMLST.html) for an in depth view into three of them. ### Sketches @@ -119,7 +126,7 @@ However it is promising that the Sourmash library [_does_ read uBAM](https://sou ### Genotyping -Generally in my experience, people base genotyping on either +Generally in our experience, people base genotyping on either [KMA](https://bitbucket.org/genomicepidemiology/kma), [SRST2](https://github.com/katholt/srst2), [SAUTE](https://github.com/ncbi/SKESA), @@ -141,7 +148,7 @@ samtools import -1 1.fastq.gz -2 2.fastq.gz --order ro -O bam,level=0 | \ samtools sort -O cram --output-fmt-option archive -M - -o archive.cram ``` -When viewing the same sequences in fastq.gz, bam, or cram, we get an astonishing reduction. +When viewing the same sequences in FASTQ, BAM, or CRAM, we get an astonishing reduction. ```text -rw-------. 1 user users 81M May 30 20:03 unmapped.bam @@ -154,7 +161,7 @@ Cram is seemlessly incorporated into samtools and so you can freely convert betw In fact, [EBI stores a ton of cram files already](https://x.com/BonfieldJames/status/1182180199657607168). So why wouldn't we recommend cram up front? Probably because it is a bigger lift that would involve convincing many sequencing companies to adopt it. -We can check a box on our nanopore that makes bams; we can't do the same for cram. +We can check a box on our nanopore that makes BAMs; we can't do the same for CRAM. That said, given an ideal world, we would encourage the sequencing companies to consider that check box. ## Conclusion From a8036098014c0414a804a8e8c2037adf749f4941 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Tue, 4 Jun 2024 13:08:33 -0400 Subject: [PATCH 30/43] brought in Dbtara's excellent points and added as an author --- CONTRIBUTORS.yaml | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/CONTRIBUTORS.yaml b/CONTRIBUTORS.yaml index 96f7cfe..b8cbc73 100644 --- a/CONTRIBUTORS.yaml +++ b/CONTRIBUTORS.yaml @@ -55,4 +55,8 @@ bebatut: pvanheus: name: Peter van Heusden - orcid: 0000-0001-6553-5274 \ No newline at end of file + orcid: 0000-0001-6553-5274 + +dbtara: + name: Dhwani Batra + orcid: 0000-0002-6789-6396 \ No newline at end of file From 1b380238bf92d3adccb5ac9c990d8efe522d3941 Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Wed, 5 Jun 2024 08:38:17 -0400 Subject: [PATCH 31/43] Update _posts/2024-05-14-switch-to-bam.md Co-authored-by: Peter Menzel --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 099d87d..563b2c6 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -24,7 +24,7 @@ For example, it is possible to have a valid FASTQ format in either 4-line-per-en or splitting sequences into multiple lines. Additionally, the defline itself is not entirely standardized and is basically free text. However, we have had a lot of innovations in sequence formats since then. -One of those innovations is the [SAM/BAM format](https://samtools.github.io/hts-specs/), the (binary) alignment/mapping format. While this file format is typically used to store information about alignments of sequencing reads, it can also just stored the unaligned sequencing data. +One of those innovations is the [SAM/BAM format](https://samtools.github.io/hts-specs/), the (binary) alignment/mapping format. While this file format is typically used to store information about alignments of sequencing reads, it can also just store the unaligned sequencing data. Crucially, both reads from paired-end sequencing (i.e., R1 and R2) are stored in the same single file and [it allows for metadata as explained in this GATK post](https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format). We found at least one other [blog post with this same sentiment from _2011_](https://blastedbio.blogspot.com/2011/10/fastq-must-die-long-live-sambam.html). From 5faac14d2cc6c016af95bb66e8de0f4ff3b56957 Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Wed, 5 Jun 2024 08:38:39 -0400 Subject: [PATCH 32/43] Update _posts/2024-05-14-switch-to-bam.md Co-authored-by: Peter Menzel --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 563b2c6..938e246 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -70,7 +70,7 @@ One might be concerned about taking additional space, but actually uBAM files may offer a storage space savings over individual FASTQ files, besides reduction in complexity gained by combining forward and reverse reads. The above command uses `-M` in `samtools sort`, which sorts by minimizers, thereby saving tons of space. -In our own anecdote, we transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted uBAM file. +In our example dataset, we transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted uBAM file. A 31% storage reduction, in this case, can represent huge file storage savings across an entire sequencing data repository. ## QA/QC From 70282c5ebb4749cc2e6ca20eff607ada32230be0 Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Wed, 5 Jun 2024 08:39:04 -0400 Subject: [PATCH 33/43] Update _posts/2024-05-14-switch-to-bam.md Co-authored-by: Peter Menzel --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 938e246..4daa0ed 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -69,7 +69,7 @@ repository consists of the uBAM files alone. One might be concerned about taking additional space, but actually uBAM files may offer a storage space savings over individual FASTQ files, besides reduction in complexity gained by combining forward and reverse reads. -The above command uses `-M` in `samtools sort`, which sorts by minimizers, thereby saving tons of space. +The above command uses `-M` in `samtools sort`, which sorts the reads by minimizers, which results allows for a much better compression and therefore reduced size of the uBAM file. In our example dataset, we transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted uBAM file. A 31% storage reduction, in this case, can represent huge file storage savings across an entire sequencing data repository. From 23ffb7cc172e5771bc2bfeb6625bba33a1df2fa2 Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Wed, 5 Jun 2024 08:39:18 -0400 Subject: [PATCH 34/43] Update _posts/2024-05-14-switch-to-bam.md Co-authored-by: Peter Menzel --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 4daa0ed..824a4e7 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -159,7 +159,7 @@ When viewing the same sequences in FASTQ, BAM, or CRAM, we get an astonishing re Cram is seemlessly incorporated into samtools and so you can freely convert between formats. In fact, [EBI stores a ton of cram files already](https://x.com/BonfieldJames/status/1182180199657607168). -So why wouldn't we recommend cram up front? +So why wouldn't we recommend CRAM upfront? Probably because it is a bigger lift that would involve convincing many sequencing companies to adopt it. We can check a box on our nanopore that makes BAMs; we can't do the same for CRAM. That said, given an ideal world, we would encourage the sequencing companies to consider that check box. From 8469c8f7c7d34c0d3e10f46bd0e127a1147bd228 Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Wed, 5 Jun 2024 08:39:42 -0400 Subject: [PATCH 35/43] Update _posts/2024-05-14-switch-to-bam.md Co-authored-by: Peter Menzel --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 824a4e7..34a5bcd 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -157,7 +157,7 @@ When viewing the same sequences in FASTQ, BAM, or CRAM, we get an astonishing re -rw-------. 1 user users 63M Dec 6 2019 2.fastq.gz ``` -Cram is seemlessly incorporated into samtools and so you can freely convert between formats. +The CRAM format is seemlessly incorporated into samtools, allowing for freely converting between formats. In fact, [EBI stores a ton of cram files already](https://x.com/BonfieldJames/status/1182180199657607168). So why wouldn't we recommend CRAM upfront? Probably because it is a bigger lift that would involve convincing many sequencing companies to adopt it. From 91de021b061410d6ed72a266e0775f9f75b9efb8 Mon Sep 17 00:00:00 2001 From: Lee Katz Date: Wed, 5 Jun 2024 09:17:09 -0400 Subject: [PATCH 36/43] Update _posts/2024-05-14-switch-to-bam.md Co-authored-by: Peter Menzel --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 34a5bcd..88df452 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -158,7 +158,7 @@ When viewing the same sequences in FASTQ, BAM, or CRAM, we get an astonishing re ``` The CRAM format is seemlessly incorporated into samtools, allowing for freely converting between formats. -In fact, [EBI stores a ton of cram files already](https://x.com/BonfieldJames/status/1182180199657607168). +In fact, [EBI stores a ton of CRAM files already](https://x.com/BonfieldJames/status/1182180199657607168). So why wouldn't we recommend CRAM upfront? Probably because it is a bigger lift that would involve convincing many sequencing companies to adopt it. We can check a box on our nanopore that makes BAMs; we can't do the same for CRAM. From 61f5607a165f2c4f66736e411ed256730df01aac Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Wed, 5 Jun 2024 09:18:16 -0400 Subject: [PATCH 37/43] remove results --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 88df452..51dbb24 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -69,7 +69,7 @@ repository consists of the uBAM files alone. One might be concerned about taking additional space, but actually uBAM files may offer a storage space savings over individual FASTQ files, besides reduction in complexity gained by combining forward and reverse reads. -The above command uses `-M` in `samtools sort`, which sorts the reads by minimizers, which results allows for a much better compression and therefore reduced size of the uBAM file. +The above command uses `-M` in `samtools sort`, which sorts the reads by minimizers, which allows for a much better compression and therefore reduced size of the uBAM file. In our example dataset, we transformed a 63M R1 and 55M R2 file into an 81M unmapped sorted uBAM file. A 31% storage reduction, in this case, can represent huge file storage savings across an entire sequencing data repository. From e9b0477552296c548d8f039b5eb8bba473925cca Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Wed, 5 Jun 2024 09:48:41 -0400 Subject: [PATCH 38/43] added info about Illumina and then ironed out the conclusion --- _posts/2024-05-14-switch-to-bam.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 51dbb24..81d23ae 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -47,7 +47,10 @@ because they are not aligned against anything. First, the sequencers would have to output uBAM files. Can they? -The Illumina platforms and the Ion Torrent platforms do automatically. +Illumina will not output uBAM natively, but it does allow +[Local Run Manager modules](https://customprotocolselector.illumina.com/selectors/LRM-module-selector/Content/Source/FrontPages/LRM-module-selector.htm). +If one of these modules aligns against a reference, then an Illumina platform would at least produce a BAM. +The Ion Torrent platforms does produce a uBAM automatically. After calling with at least Dorado, ONT sequencing outputs uBAM files. Pacbio does generate BAM as native format (they discontinued HDF5). For platforms that do not have this automation, @@ -166,8 +169,9 @@ That said, given an ideal world, we would encourage the sequencing companies to ## Conclusion -The good part is that sequencing platforms output uBAM format natively, from what we can tell. -However, we need software to natively read these uBAM files. +The good part is that many but not all sequencing platforms output uBAM format natively. +For those that don't have this capability, we have a way to convert fastq to uBAM. +However even after aquiring a uBAM, we need software to natively read them. Bioinformatics software developers should be looking ahead to future proof their software. From 487dd37122966a1c323ce2cc8372c647f68f7a47 Mon Sep 17 00:00:00 2001 From: Peter Menzel Date: Wed, 5 Jun 2024 21:01:15 +0200 Subject: [PATCH 39/43] Update 2024-05-14-switch-to-bam.md sopme random small fixes --- _posts/2024-05-14-switch-to-bam.md | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 81d23ae..a3ab5be 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -18,13 +18,12 @@ reviewers: It has been over 14 years since the formalization of the FASTQ format ({% cite Cock2009 %}), which describes sequences and their quality scores. For paired end reads, sequences are encoded in separate files, usually called R1 and R2 [^1]. -Unfortunately despite the publication, -FASTQ format is not entirely standardized! -For example, it is possible to have a valid FASTQ format in either 4-line-per-entry format, -or splitting sequences into multiple lines. +Unfortunately despite the publication, FASTQ format is not entirely standardized! +For example, it is possible to have a valid FASTQ format in either 4-line-per-entry format, or splitting sequences into multiple lines. Additionally, the defline itself is not entirely standardized and is basically free text. However, we have had a lot of innovations in sequence formats since then. -One of those innovations is the [SAM/BAM format](https://samtools.github.io/hts-specs/), the (binary) alignment/mapping format. While this file format is typically used to store information about alignments of sequencing reads, it can also just store the unaligned sequencing data. +One of those innovations is the [SAM/BAM format](https://samtools.github.io/hts-specs/), the (binary) alignment/mapping format. +While this file format is typically used to store information about alignments of sequencing reads, it can also just store the unaligned sequencing data. Crucially, both reads from paired-end sequencing (i.e., R1 and R2) are stored in the same single file and [it allows for metadata as explained in this GATK post](https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format). We found at least one other [blog post with this same sentiment from _2011_](https://blastedbio.blogspot.com/2011/10/fastq-must-die-long-live-sambam.html). @@ -50,7 +49,7 @@ Can they? Illumina will not output uBAM natively, but it does allow [Local Run Manager modules](https://customprotocolselector.illumina.com/selectors/LRM-module-selector/Content/Source/FrontPages/LRM-module-selector.htm). If one of these modules aligns against a reference, then an Illumina platform would at least produce a BAM. -The Ion Torrent platforms does produce a uBAM automatically. +The Ion Torrent platforms do produce a uBAM automatically. After calling with at least Dorado, ONT sequencing outputs uBAM files. Pacbio does generate BAM as native format (they discontinued HDF5). For platforms that do not have this automation, @@ -102,13 +101,13 @@ and so certain tools can create those sketches. Finally, there might be individual genotyping operations such as Salmonella serotyping or virulence factors detection. -### assembly +### Assembly -For genome assembly, many labs are using Shovill. +For genome assembly, many labs use [Shovill](https://github.com/tseemann/shovill). However, Shovill does not natively read uBAM files. Therefore, this workflow breaks slightly unless there is some conversion step. -Other assemblers that people commonly use for bacterial genomes are SPAdes and SKESA. +Other assemblers that people commonly use for bacterial genomes are [SPAdes](https://github.com/ablab/spades) and [SKESA](https://github.com/ncbi/SKESA). SPAdes does read uBAM natively and so that is good. However, it does not appear that SKESA can read uBAM natively. @@ -125,7 +124,7 @@ Please see [@lskatz's previous blog post](https://lskatz.github.io/posts/2023/04 If you're like me, you want to have a directory of at least some sketches from Mash. (see the [mashpit project](https://github.com/tongzhouxu/mashpit) for an exciting project!) It appears that Mash natively does not read uBAM according to the v2.3 usage menu. -However it is promising that the Sourmash library [_does_ read uBAM](https://sourmash.readthedocs.io/en/latest/release-notes/sourmash-2.0.html#major-new-features-since-1-0) natively since version 2. +However it is promising that the [Sourmash](https://github.com/sourmash-bio/sourmash) library [_does_ read uBAM](https://sourmash.readthedocs.io/en/latest/release-notes/sourmash-2.0.html#major-new-features-since-1-0) natively since version 2. ### Genotyping @@ -142,8 +141,8 @@ but for this generalized blog post, we did not investigate further. ## Other compression methods This article discusses the uBAM format, but there have been many attempts over the years to make other compression formats [^2]. -The [cram format](https://samtools.github.io/hts-specs/CRAMv3.pdf) is probably the best example. -Cram has an even better lossless compression of sequences than uBAM, +The [CRAM format](https://samtools.github.io/hts-specs/CRAMv3.pdf) is probably the best example. +CRAM has an even better lossless compression of sequences than uBAM, and we even confirmed it on our own sequence. ```bash @@ -170,7 +169,7 @@ That said, given an ideal world, we would encourage the sequencing companies to ## Conclusion The good part is that many but not all sequencing platforms output uBAM format natively. -For those that don't have this capability, we have a way to convert fastq to uBAM. +For those that don't have this capability, we have a way to convert FASTQ to uBAM. However even after aquiring a uBAM, we need software to natively read them. Bioinformatics software developers should be looking ahead From e20a501f8677d08de1cd75c96a9d6502e92c9925 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Wed, 5 Jun 2024 15:30:26 -0400 Subject: [PATCH 40/43] me=>out --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index a3ab5be..b744f48 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -121,7 +121,7 @@ Please see [@lskatz's previous blog post](https://lskatz.github.io/posts/2023/04 ### Sketches -If you're like me, you want to have a directory of at least some sketches from Mash. +If you're like us, you want to have a directory of at least some sketches from Mash. (see the [mashpit project](https://github.com/tongzhouxu/mashpit) for an exciting project!) It appears that Mash natively does not read uBAM according to the v2.3 usage menu. However it is promising that the [Sourmash](https://github.com/sourmash-bio/sourmash) library [_does_ read uBAM](https://sourmash.readthedocs.io/en/latest/release-notes/sourmash-2.0.html#major-new-features-since-1-0) natively since version 2. From 343378839c0762b5d347d6d335241d219ad2c7af Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Wed, 5 Jun 2024 15:36:41 -0400 Subject: [PATCH 41/43] remove 'here' in footnote2 --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index b744f48..8f02250 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -188,4 +188,4 @@ I wish I could say that I will address this right away, but with all my other re Therefore we can say from our own observations and personal experience, there is some work up ahead to get us moved over to uBAM files! [^1] To maintain focus in this article, I will gloss past interleaved reads. -[^2] One such example of an organization trying to standardize is here: . +[^2] One such example of an organization trying to standardize: . From e7c7bf44c8033b8e2499ca41474330307cf2b324 Mon Sep 17 00:00:00 2001 From: Lee Katz - Aspen Date: Wed, 5 Jun 2024 15:39:37 -0400 Subject: [PATCH 42/43] fix hyperlink text --- _posts/2024-05-14-switch-to-bam.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-05-14-switch-to-bam.md index 8f02250..e8252c2 100644 --- a/_posts/2024-05-14-switch-to-bam.md +++ b/_posts/2024-05-14-switch-to-bam.md @@ -188,4 +188,4 @@ I wish I could say that I will address this right away, but with all my other re Therefore we can say from our own observations and personal experience, there is some work up ahead to get us moved over to uBAM files! [^1] To maintain focus in this article, I will gloss past interleaved reads. -[^2] One such example of an organization trying to standardize: . +[^2] One such example of an organization trying to standardize is [here](https://www.genomeweb.com/informatics/will-bioinformatics-professionals-embrace-mpeg-g-data-compression-standard?utm_source=addthis_shares#.XcwmeiO9qjW.twitter). From 9eafaf0ed09579ffe84673529032db5dd1c62d16 Mon Sep 17 00:00:00 2001 From: Peter Menzel Date: Thu, 6 Jun 2024 11:51:41 +0200 Subject: [PATCH 43/43] Rename 2024-05-14-switch-to-bam.md to 2024-06-06-switch-to-bam.md --- .../{2024-05-14-switch-to-bam.md => 2024-06-06-switch-to-bam.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename _posts/{2024-05-14-switch-to-bam.md => 2024-06-06-switch-to-bam.md} (100%) diff --git a/_posts/2024-05-14-switch-to-bam.md b/_posts/2024-06-06-switch-to-bam.md similarity index 100% rename from _posts/2024-05-14-switch-to-bam.md rename to _posts/2024-06-06-switch-to-bam.md