-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature request: ubam support? #757
Comments
The latest version of dnaio (the library that cutadapt uses for sequence IO) has uBAM read support. So cutadapt can already read uBAM files. Most aligners only support FASTQ format though, so there has been no need yet to add uBAM write support. After implementing the uBAM reader, I have not grown very fond of the BAM format for unaligned pure sequencing data for these reasons:
So for pure sequencing data, uBAM adds a lot of parsing difficulties without offering any advantages. While TAGS may seem like a nice format for adding metadata, only very few Tag names that apply to pure sequencing data are anchored in the specification. So most sequencers will have custom named tags anyway. Well in that case a FASTQ header is much, much easier to parse.
Well could you explain yourself a bit more on why you are doing this? What downstream programs that work pre-alignment do require uBAM or would function better with the functionality it offers compared to FASTQ? |
To clarify: Only single-end reads can be read. This is supported from dnaio 1.1.0 onwards. Cutadapt itself does not need to be updated. Since uBAM can only be read, the output will currently be FASTQ with any BAM-specific metadata lost. I am very much in favor of BAM-only workflows and would like to see Cutadapt support it. |
Ah yes, I did not consider the BAM header and the provenance argument. I am convinced. Maybe we can create an issue at dnaio and discuss the technical implementation? The alternative is to defer to pysam. As someone who has tried to make an alternative to pysam I can say that full support of SAM/BAM/CRAM is a major undertaking. However, limited BAM support in context of unaligned data only seems feasible. |
Thanks to both of you for your quick and thoughtful responses. I didn't know that cutadapt can read ubam at all, so that's good to know, although it's unfortunate that currently the tag metadata are lost and only SE reads are supported. While I better appreciate the cons of ubams from a developer perspective now, as I'm sure you're aware fastqs have their own parsing issues for users - they use 4 lines to encode one "unit" of information, and storing metadata by concatenating them in the read comment is clunky to say the least (assuming you have a convenient means of putting them there in the first place - cutadapt is a rare software that has excellent functionality in that respect); and then retrieving those data typically requires an aligner that can move fastq comments into sam tags, of which there are also only a few (bwa, mm2, bowtie2, SNAP; incidentally the latter two can read ubams). And indeed, the provenance argument. Many other software tools are increasingly embracing and even preferring ubam for unaligned reads (e.g. fgbio). So I'd still like to see ubam write suuport someday, but I understand that it's easy to make suggestions, less so to implement them, so I appreciate your even taking it under consideration. |
I am currently working with nanopore uBAM data. That contains tags with all sorts information on the Nanopore run as well as methylation data. I do not want to lose this in the cutadapt step. So looks like I will have to work on BAM writing in dnaio very soon. |
Pretty self-explanatory. We're trying to eliminate the need to process data in fastq format, so it would be terrific if cutadapt could accept ubam input and write ubam output. (We probably don't need the ability to convert fastq to ubam or vice-versa, so mandating ubam in = ubam out would probably be fine, but we wouldn't say no to additional flexibility either.) And if this were implemented, it would be nice to be able to place trimmed sequences or other metadata into sam tags, similar to the current ability to modify fastq read names and/or comments.) Just a suggestion, thanks!
The text was updated successfully, but these errors were encountered: