-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the description of BAI metadata pseudo-bin file offsets #499
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With depth 5, the BAI default, this function returns 37449. The half-open range implies valid values are 0 to 37448. The SAM spec also claims "and bins 4681--37448 span 16Kbp
regions" backing this up.
The SAM spec also explains the contents of pseudobins 37450 and 37451, but no mention of 37449. What happened to it? Is this actually used and it's not really half-open, or was it simply a mistake in allocation of pseudo-bin numbers and we accidentally skipped one?
Also if you do the arithmetic about how many bins there are on each level (1 + 8 + 64 + 512 + 4096 + 32768), you confirm that the last BAI bin is numbered 37448 (when the first bin is numbered 0). So the ‘claim’ in the SAM spec is easily confirmed to be accurate 😄 Leaving 37449 unused is a historical accident, as noted here on samtools-help. It wasn't ‘we’ who apparently accidentally skipped one — this was back in the days when one person was implementing things in samtools without review or discussion. (It's pseudo-bin 37450, first chunk and second chunk. There's no bin 37451 either. Also I suppose 37449 not being valid means the recently added |
Ah sorry, yes it's <37451 mentioned. I was just scanning and failed to read properly. :) I realised of course this "we" was before our time. I just meant we as in the group of people who have ever been maintainers of the spec. Anyway, happy with the changes. |
what about CSIv2.pdf? does it also need fixing? also, why are there two versions? |
CSIv2 is marked as draft and has never been implemented. The TeX source exists only on the experimental CSIv2 branch. See also #321 (comment). |
…ools#499) The virtual file offsets in the metadata pseudo-bins have always been misdescribed as locating the *placed unmapped reads* (which are not contiguous but spread throughout the reads on the chromosome), contrary to how they are computed by both HTSlib and HTSJDK and to Heng Li's original description on the samtools-devel mailing list in July 2010 ("Re: Bam index - how to treat alignments with 0 end coordinate", [1]): "[...] each chr has a bin 37450 which currently has two records [...] The first record keeps the start and end file offset of the entire chr (chunk_beg[0]=chr_start_off, chunk_end[0]=chr_end_off)." Fixes samtools#498. [1] https://sourceforge.net/p/samtools/mailman/message/25690499/
Rebased, and version history entry changed to June. I'd like to merge this before the next meeting. It's just waiting for an okay from @lbergelson — transferred from Yossi at the last meeting on the basis that you know more about indexes and are better placed to confirm that the new description matches HTSJDK behaviour (implemented I think in BAMIndexMetaData.java's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks right to me. The CSI clarifications are especially helpful.
It seems like we should have a sentence somewhere in the SamSpec defining placed unmapped reads. There is a note saying that unmapped reads can be placed on the reference but it might be worth explicitly calling out the nomenclature somewhere.
\noindent | ||
The following functions generalise those given in the SAM specification for a BAI-style binning scheme. | ||
Note that in CSI \textit{depth} refers to the scheme's maximal depth, i.e., the level number of the scheme's smallest bins, and the single-bin level spanning the entire coordinate range is level 0. | ||
Hence the BAI-style binning scheme, with six levels in total, is represented by $\mbox{\sf min\_shift} = 14, \mbox{\sf depth} = 5$. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is extremely helpful to include here. Thank you.
{\sf unmapped\_beg} & (Virtual) file offset of the start of placed unmapped reads & {\tt uint64\_t} & \\\hline | ||
{\sf unmapped\_end} & (Virtual) file offset of the end of placed unmapped reads & {\tt uint64\_t} & \\\hline | ||
{\sf ref\_beg} & (Virtual) file offset of the start of reads placed on this reference& {\tt uint64\_t} & \\\hline | ||
{\sf ref\_end} & (Virtual) file offset of the end of reads placed on this reference& {\tt uint64\_t} & \\\hline | ||
{\sf n\_mapped} & Number of mapped read-segments for this reference & {\tt uint64\_t} & \\\hline | ||
{\sf n\_unmapped} & Number of unmapped read-segments for this reference & {\tt uint64\_t} & \\\hline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be clarified as "Number of placed unmapped read-segments for this reference"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also a bit weird that we say "reads" in some of these lines and "read-segments" on others when we mean the same thing.
…ools#499) The virtual file offsets in the metadata pseudo-bins have always been misdescribed as locating the *placed unmapped reads* (which are not contiguous but spread throughout the reads on the chromosome), contrary to how they are computed by both HTSlib and HTSJDK and to Heng Li's original description on the samtools-devel mailing list in July 2010 ("Re: Bam index - how to treat alignments with 0 end coordinate", [1]): "[...] each chr has a bin 37450 which currently has two records [...] The first record keeps the start and end file offset of the entire chr (chunk_beg[0]=chr_start_off, chunk_end[0]=chr_end_off)." Fixes samtools#498. [1] https://sourceforge.net/p/samtools/mailman/message/25690499/
Mention that CSI has metadata pseudo-bins with the same contents as BAI, and generalise their bin number calculation from BAI's 37450. Also clarify CSI "depth", which is more max_level than number-of-levels.
Rebased, and version history entry changed to July. Thanks for the thoughtful review, Louis. The placed/unplaced terminology is explained in footnote 36 on the same page as this pseudobin table — does this suffice? You are quite right in your comments about I plan to merge this as is and address those editorial concerns in a followup. |
@jmarshall You're right. I looked for something like that but somehow missed it even though it was right there. The footnote seems fine. |
Ever since the metadata pseudo-bins were added to the spec in #2, their virtual file offset entries have been misdescribed as locating the placed unmapped reads (which are not contiguous but spread throughout the reads on the chromosome), contrary to how they are computed by both HTSlib and HTSJDK and to Heng Li's original description on the samtools-devel mailing list:
Fixes #498 — see additional background there.
Also mention in CSIv1.tex that metadata pseudo-bins appear in CSI too, with the same contents as BAI, and generalise their bin number calculation from BAI's 37450. Clarify CSI “depth”, which is more max_level than number-of-levels.