Add splitting BAM index to spec #321
base: master
Conversation
Initial draft of a spec for splitting BAM index files, similar to that used in Hadoop-BAM. An htsjdk implementation can be found in samtools/htsjdk#1138.
Here's what the generated page looks like: SAMv1-sbi.pdf
Thanks, that's looking pretty nice, and seems like a useful convention to establish. Some thoughts from a fairly quick reading:
@jmarshall thanks for your comments. I've addressed them in turn below. Latest draft here: SAMv1-sbi.pdf
Correct. Updated the wording.
There would be a single offset for the length of the file. Added a footnote saying "In the unlikely event the BAM file has no alignment records, the list of offsets will contain a single entry for the overall length of the BAM file."
Splitting and granularity are separate things, since you choose a granularity when writing the index (often at the same time as writing the BAM file), which is before you know what the split size is going to be. Indeed, the split size may vary from system to system (e.g. in Hadoop and Spark the split size defaults to 128MB, but may vary on a per-file basis).

When you split the file (conceptually), you have a byte range that you convert into virtual file pointers, then you adjust the first point to start on an alignment. It may not be the very first alignment after the split start (unless the granularity is 1), but that doesn't matter (since the assumption is that there are lots of offsets in the split), and the key property is that a set of file ranges that cover the file map into a set of virtual file ranges that cover the file.

There is a case when the granularity is not exact, which is when you write what is conceptually a single BAM file in multiple parts (in parallel) then merge them into a single file (we do this a lot in Spark). It's desirable to write the splitting index files at the same time, but merging them means the granularity is not exact. Perhaps granularity should be relaxed to be an approximate concept to cope with this case.
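For illustration, here is a minimal sketch (in Java, in the spirit of the htsjdk work but not its API; the method name and the sorted-array representation are assumptions) of mapping a byte split onto the virtual file range of records to process:

```java
// Sketch: map a byte split [start, end) onto the index's sorted virtual
// file offsets. A BGZF virtual offset stores the compressed file offset
// in its upper 48 bits, so shifting a raw byte position left by 16 gives
// a comparable virtual offset.
static long[] splitToVirtualRange(long[] offsets, long start, long end) {
    long vStart = start << 16;
    long vEnd = end << 16;
    long v1 = -1;                        // smallest offset in [vStart, vEnd)
    long v2 = -1;                        // smallest offset >= vEnd
    for (long v : offsets) {             // offsets are sorted ascending
        if (v1 == -1 && v >= vStart && v < vEnd) v1 = v;
        if (v >= vEnd) { v2 = v; break; }
    }
    if (v1 == -1) return null;           // no record starts within this split
    return new long[] { v1, v2 };        // process records in [v1, v2)
}
```

With the final entry for the overall file length present in the offset list, v2 exists for any split containing record starts, and adjacent splits yield virtual ranges that together cover the file.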
You'd have to buffer all the offsets for a file in memory to do this, since you don't know the total number until you reach the end of the file. The format in use in Hadoop-BAM is slightly different actually, as it's just a list of offsets; the one proposed here adds a magic string and the granularity, and uses a little-endian representation to be consistent with BAM. So we could make changes here.
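To make the layout concrete, here is a hedged reader sketch for the draft as described so far (magic string, little-endian int32 granularity, then little-endian uint64 virtual offsets with the count implicit in the file length); the class and method names are invented for this example:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SbiDraftReader {
    static List<Long> read(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            byte[] magic = new byte[4];
            in.readFully(magic);
            if (magic[0] != 'S' || magic[1] != 'B' || magic[2] != 'I' || magic[3] != 1) {
                throw new IOException("not an SBI file");
            }
            // DataInputStream reads big-endian; reverse bytes to get the
            // BAM-style little-endian values.
            int granularity = Integer.reverseBytes(in.readInt());
            List<Long> offsets = new ArrayList<>();
            while (true) {
                try {
                    offsets.add(Long.reverseBytes(in.readLong()));
                } catch (EOFException e) {
                    return offsets; // count is implicit: read until end of file
                }
            }
        }
    }
}
```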
Yes, it describes how to use the index in the case where the splits are predetermined. The way you describe is another way to use the index.
Thanks. Haven't written latex for years...
Changed to mandate foo.bam.sbi (like CRAM, which is the only one that is consistent according to #215).
Done.
SAMv1.tex
Outdated
\multicolumn{3}{|l|}{\sf magic} & Magic string & {\tt char[4]} & {\tt SBI\char92 1}\\\cline{1-6}
\multicolumn{3}{|l|}{\sf granularity} & Number of alignments between offsets, or $-1$ if unspecified & {\tt int32\_t} & \\\cline{1-6}
\multicolumn{6}{|c|}{\textcolor{gray}{\it List of offsets}} \\\cline{2-6}
& \multicolumn{2}{l|}{\sf offset} & Virtual file offset of the alignment & {\tt uint64\_t} & \\\cline{1-6}
It seems like there should be a special final entry after the list in the table for the offset to the end of the BAM.
SAMv1.tex
Outdated
file.\footnote{In the unlikely event the BAM file has no alignment records,
the index will consist of a single entry for the overall length of the
BAM file.} It does not need to contain a virtual file offset for every
alignment, merely a subset. A granularity of $n$ means that an offset is
Do we want to allow indication of approximate number of records per offset? Or is that just making things unnecessarily complicated? At the cost of increasing the index size we could include the number of records in each section in the index. Instead of [offset], we'd have [offset, number of records until next offset]. That might be useful for tools deciding how to split the file.
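If that variant were adopted, each entry might look something like this (a hypothetical shape with invented names, just to make the suggestion concrete):

```java
// Hypothetical entry pairing a virtual file offset with the number of
// records that precede the next indexed offset.
record SbiEntry(long virtualOffset, long recordsToNext) {}
```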
SAMv1.tex
Outdated
\subsection{Splitting BAM}\label{sec:code}
A BAM file can be processed in parallel by conceptually dividing the file into
splits (typically of a fixed, but arbitrary, number of bytes) and for each
split processing alignments from the first known alignment after the split
We might want a sentence here describing why this additional index is necessary and this use can't be handled by the bai.
SAMv1.tex
Outdated
use the index to find the smallest virtual file offset, {\tt v1}, that falls
in this range, and the smallest virtual file offset, {\tt v2}, that is
greater than or equal to {\tt end}. If {\tt v1} does not exist, then the
split has no alignments. Otherwise, it has alignments in the range
no alignments -> no alignment starts
SAMv1.tex
Outdated
split processing alignments from the first known alignment after the split
start up to the first known alignment of the next split.

A splitting BAM index is a linear index of virtual file offsets of alignment
This sentence is a bit hard to parse on the first run through. Too many "of"s in a row. I'm not sure how to improve it without making it much more wordy though, but maybe someone has a good idea.
Also, does linear index imply that it's a sorted list of increasing offsets? Should we mention that somewhere?
For BGZF files, it doesn't make sense to choose a granularity in bytes smaller than the average size of a BGZF block. Suppose we have a 100GB BAM and the average block size is 10000 bytes (I don't know the exact number). We need 100e9/10000=10M virtual offsets or 80M bytes. This is a relatively small number. In practice, we will want to use a larger granularity and thus less memory. I don't think we should worry about memory when creating .sbi. When using .sbi, we can choose to randomly seek into .sbi anyway. Having an extra

Also, I think such an index should be made more general. In principle, all BGZF'd files, including bam, bcf2 and tabix files, can use this index. To be compatible with tabix, we need to store the sequence dictionary like in .csi. There is already a tool for tabix. It would be good to make a similar tool available to all BGZF files that htslib supports.
I just noticed that csi-v2 keeps the number of records in the index (I was not aware of this; has this been implemented?). It achieves a similar goal to .sbi and is more powerful. What is the intended use of .sbi? Do you load the entire .sbi into memory, or seek into it?

EDIT: .csi only works with coordinate-sorted files. .sbi works with any record-based files. In this sense, .sbi is not redundant. Nonetheless, I do think it should be generalized to more file types like .csi.
@lh3 The goal of the .sbi is to make it easy to find record start sites in close proximity to a physical compressed file location. It's for the case where you want to split a bam in parallel in a deterministic way into roughly even chunks by compressed size. Does .csi-v2 support that use case? .bai isn't good for this case because you can have regions that are very large in file size but cover a small genomic region, and there isn't an easy way to subdivide that.
I have added an EDIT line later to clarify that csi-v2 doesn't replace .sbi. However, I do think .sbi should have
@lh3 Ah, sorry, I missed the edit. I agree that we should / can generalize them. I think the original thought was that scoping it to bam at first might be easier in some way, but there's no reason we can't make it more general right off the bat.

@tomwhite One problem we often have with indexes is that there isn't any way of identifying if they were generated from the bam file that they're associated with. Would it be a good idea to include a checksum of the associated bam in the index here?
The index format I am thinking about looks like the following, where anything before
CSIv2 was never implemented (or the format design considered workable) because no conclusion was ever reached as to how to correctly include counts from the higher bins (see mailing list conversation from January 2015). Separating the flat split index into a separate .sbi file has the advantage that the data file need not be coordinate sorted. Agreed that this index needn't be specific to BAM and could be described in terms applicable to any record-orientated BGZF-compressed file.
@lbergelson One crude way to check if the index corresponds to a specific BGZF file is to store the total file size in .sbi. This approach is cheap to implement at the cost of occasionally missing errors. Checksum would require reading through the entire file (either compressed or uncompressed), so it is not suitable for a quick check. Another possible way is to store the file size and the checksum of the uncompressed header. It is a bit complicated to implement, though.
@lh3 File size is a good inexpensive check. In google cloud they keep an md5 of each file as part of the object store, so on some platforms md5 is also an inexpensive test. I think adding some sort of consistency check to the header of the index would be a good idea.
@lbergelson I realized that file size is not always easy to implement. When we index a BAM stream and reach the last record, it is not always clear what the size of the footer is. If I am right, it is technically possible to append multiple empty BGZF blocks to the last record. Then we wouldn't easily know the exact file size on disk. Checksum might have a similar issue. I agree it would be good to have a cheap check, but I am not sure what is the best strategy. I'll let others chime in.
Thanks for all the feedback. I agree about having a more general specification for the format. I'll create a new document for it.

@lh3 I agree that it's feasible to store the offsets in memory when writing, so that

Regarding

I had a look at implementing

Checksums: I think MD5 is tricky since it's not possible to compute in parallel easily. I think file length would be good enough in most cases. @lh3 is your concern that a BAM file can be tampered with (by adding an empty BGZF block)? I don't think we can guard against that, but the file length is still a good sanity check in most cases.

A use case that I want to be able to support is the ability to write SBI files in parallel and merge them afterwards into a single SBI file. This is a pattern we have with BAM files where we write them as headerless shards in parallel using Spark and concatenate them together. I think the proposed format would support this pattern, with the caveat that granularity would be unspecified. (Actually, it might be possible to get the granularity right by writing index files of granularity 1 from each task and then downsampling them appropriately when combining them. It might not be worth the effort for the parallel processing use case though.) Note that this is a different use case to tools like grabix, where granularity must be set. I think we can support both by allowing an unspecified granularity (-1) in the first case.
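A rough sketch of that merge-with-downsampling idea, assuming each task produces a granularity-1 offset list whose virtual offsets have already been adjusted for the shard's position in the concatenated BAM, and that the caller supplies the final end-of-file sentinel (class and method names invented here):

```java
import java.util.ArrayList;
import java.util.List;

class SbiMerge {
    // Concatenate per-shard granularity-1 offset lists, keep every n-th
    // record offset overall, and append the end-of-file sentinel, yielding
    // a single index with exact granularity n.
    static List<Long> mergeAndDownsample(List<List<Long>> shards, int n, long eofSentinel) {
        List<Long> merged = new ArrayList<>();
        long i = 0;
        for (List<Long> shard : shards) {
            for (long v : shard) {
                if (i++ % n == 0) merged.add(v); // every n-th record across all shards
            }
        }
        merged.add(eofSentinel); // virtual offset just past the last record
        return merged;
    }
}
```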
Don't worry about a separate document for this, just organised as a section like you have but couched in more generic "BGZF file" rather than "BAM" language will be fine. (I have a project on the back-burner to bring the CSIv1.tex and tabix.tex document contents, such as they are, alongside the BGZF indexing and BAI text. We should have fewer micro-documents for index formats, not more! 😄)
One omission from bai, csi and crai is any way to tell if they're in sync with the associated file. If we were designing formats from scratch (that have separate indices) then we'd probably add a UUID to the file and require the same UUID in the index so it's possible to spot replacement of one without the other. Here however we're defining one new format and not redefining the other, so it's less clear how to achieve this, unless we wish to also at some stage acknowledge a UUID optional tag in the sam

Edit: blergh didn't notice this above. Sorry for dup, but it's subtly different anyway. Actually we can even retrospectively add this in for BAI and CSI if we wish to use the tried and trusted method of fake bins for meta-data.
An optional field with the checksum (md5 or other algorithm) can also be useful as a temporary solution until an
Add fields for file length, number of records, start and end offsets, and number of offsets.
Updated to incorporate latest suggestions: SAMv1-sbi.pdf. Re UUID: would this be a fixed or variable length string?
A UUID seems worse than a checksum of some sort to me. How do you guarantee that the file isn't edited without updating the UUID?
Agreed it's not ideal and we probably shouldn't be starting from here. However computing a checksum has a huge CPU cost while checking headers match does not. Of course it's possible to edit the data and keep the old UUID, but that's a tool chain bug (or more likely just using old tool chains). I don't really know how you can fix this though. All we can do is best efforts, and if something catches some problems but not all then it's better than having no check at all. (Over time it'll improve.) Right now it's pure wishful thinking though as we don't have a way of holding this in headers either - I was simply suggesting making room for a (fixed length) checksum and/or UUID at some point in the future without requiring a change of file format.

Edit: actually there can be some other useful checks thinking about it. If we change the contents of a record but keep the size the same, then the index works anyway. If we extend or shrink the records then the file size is a very simple short check on validity. If we rewrite a BAM header while keeping the size the same (it's possible using eg the ways CRAM manages this, but not done in any BAM implementation I know of) then the md5sum of the header is also a useful check, and quick to compute. [Rewriting headers in place ideally requires a nul terminated header with multiple nul bytes and more than one gzip block, the second of which is uncompressed to grow or shrink up to a predefined max compressed header size.]
In addition to the length of the file as a quick check, what about a checksum for the header and per-split (stored in the offset list)? This can be useful in two ways:
For example, the HDF5 file-format has a per-block checksum (only for some blocks of data) using the Jenkins lookup3.hashlittle algorithm.
The previous text had
but this text seems to have disappeared. For using the index, it seems important for the first offset in the list to point to the first alignment record and for there to be a sentinel offset at the end of the list pointing to EOF — otherwise the code to find the first and last splits is more complex.

IMHO adding explicit first_offset and end_offset fields is a mistake. first_offset is the same as the first offset list entry, and end_offset is the same as the last offset list entry (or it should be; it's currently badly described).

(In other trivia, please remove
As I said above,
What the original text about the sentinel last entry in the list is trying to describe is: the virtual file offset of the byte after the end of the last alignment record, whether that is EOF or the start of footers of some kind. It is the virtual offset of the first non-existent alignment record, the same as the open end of a half-open interval like [5,7). So it's the same thing as your end_offset, just more usable.
One note about offset in general: if the record starts at the boundary of two adjacent blocks, there are two ways to specify the same virtual offset: (blockEnd,lastByte) vs (blockStart,0). Which to report can be implementation dependent.
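To make that concrete, here is a small sketch of the standard BGZF virtual offset packing (the helper name is invented for this example):

```java
// A BGZF virtual offset packs the compressed offset of a block's start
// into the upper 48 bits and the offset within the uncompressed block
// into the lower 16 bits.
static long makeVirtualOffset(long coffset, int uoffset) {
    return (coffset << 16) | (uoffset & 0xffffL);
}

// If block A starts at compressed offset cA and inflates to uA bytes, and
// block B immediately follows at compressed offset cB, a record beginning
// at B's first byte has two equivalent addresses:
//   makeVirtualOffset(cA, uA)  // end of the previous block
//   makeVirtualOffset(cB, 0)   // start of the next block
```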
Explain the role of the final sentinel offset.
@jmarshall, @lh3 I've updated the text to explain the final sentinel offset in a way that I hope is clearer. I've also removed the explicit first_offset and end_offset fields.

@lbergelson, @jkbonfield, @magicDGS The minimum level of integrity checking is file length, which I think we should include. Having a checksum per offset may be overkill since it would inflate the size of the index by a factor of three. I would rather add optional metadata fields for MD5 and UUID which can be used by tooling.
+1 to adding optional MD5 and UUID fields
+1 to adding an optional checksum (MD5 or a different one). It is true that a per-offset checksum is overkill, but maybe checksumming only the first and last split would be a faster/less-expensive way to check integrity than a whole-file checksum.
Added optional MD5 and UUID fields: SAMv1-sbi.pdf. I wonder if
It makes sense to make any field that only shows up once a long, unless there's actually a hard limit on it. Someone always finds a way to make a file with too many of them otherwise.
And while we're at it, make quantities unsigned unless there is a valid reason why they have to be negative (eg chromosome -1 being unmapped in BAM).

Edit: I mean that as a general principle, rather than this document which is already unsigned appropriately.
Changed
Updated the htsjdk implementation to reflect the latest draft: samtools/htsjdk#1138
Discussion seems to have settled down on this. Are there any more changes to the text that are needed?
given that UUID is not a thing (yet?) for SAM, I find it strange that SBI spec would reference it as if it is a known thing. could you modify the wording in the table to say "Placeholder for UUID..." and we can remove the "Placeholder" part when a UUID is included in the SAM spec...
Also, in the second use case, please note that this is only possible if granularity is specified. Specify that "Nth record" is zero-based.
Does anyone have further suggestions here or should we return this to @tomwhite for final edits?
@lbergelson I think you need to commandeer this.
Someone suggested we look at the PBI index as an already extant alternative to introducing SBI. I've taken a quick look at it and it does indeed provide a per-read offset. So it could be used as a splitting index instead of sbi.

It has a number of required fields which are specific to pac bio bams though. Things like a required ZMW hole number for every read. Obviously this isn't relevant to non-pac bio bams. I imagine you could create a degenerate one which has a 0 for every irrelevant value except for the offset though. It devotes a lot more bytes to each individual read than the SBI and doesn't offer the granularity option, so pbi indexes will inevitably contain more data than sbi files. A complete pbi with non-placeholder values on a short read file seems like it will be rather large. They're bgzf compressed though, so if you fill in 0's for all the required fields other than the offsets they should be roughly equivalent to a granularity 1 sbi index.

I tested the pbindex tool on a random tiny (76mb) bam I have on my harddrive. It successfully created an index even though it wasn't a pac bio bam. The index it creates is ~3.4 mb. A granularity one sbi on the same file is 5mb, because we didn't bother compressing them... Compressed it's 1.7mb. So... it seems like we should compress sbi files... That was a silly oversight. In practice for short reads we typically use a much higher granularity value, default in gatk is 4096, so it's usually not an issue... The standard sbi we'd produce would only be 1.3k even uncompressed. These numbers are on a tiny bam because that's what I had on my hd at the moment, so they might scale slightly differently on a much larger one. I suspect it's roughly accurate % wise though.
So it seems like pbi could practically be used instead of SBI. It's substantially more complicated and has a bunch of vendor-specific fields that are not relevant for non-pacbio bams. On the other hand, it is already deployed. I'm not sure if there's any non-pacbio specific tooling support for it though. Htsjdk already supports sbi. I still think there is a value in the dedicated sbi. What do others think?