Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the description of BAI metadata pseudo-bin file offsets #499

Merged
merged 2 commits into from
Jul 21, 2020

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented May 25, 2020

Ever since the metadata pseudo-bins were added to the spec in #2, their virtual file offset entries have been misdescribed as locating the placed unmapped reads (which are not contiguous but spread throughout the reads on the chromosome), contrary to how they are computed by both HTSlib and HTSJDK and to Heng Li's original description on the samtools-devel mailing list:

[…] each chr has a bin 37450 which currently has two records […]
The first record keeps the start and end file offset of the entire chr (chunk_beg[0]=chr_start_off, chunk_end[0]=chr_end_off).

Fixes #498 — see additional background there.

Also mention in CSIv1.tex that metadata pseudo-bins appear in CSI too, with the same contents as BAI, and generalise their bin number calculation from BAI's 37450. Clarify CSI “depth”, which is more max_level than number-of-levels.

@hts-specs-bot
Copy link

Changed PDFs as of 2840708: CSIv1 (diff), SAMv1 (diff).

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With depth 5, the BAI default, this function returns 37449. The half-open range implies valid values are 0 to 37448. The SAM spec also claims "and bins 4681--37448 span 16Kbp
regions" backing this up.

The SAM spec also explains the contents of pseudobins 37450 and 37451, but no mention of 37449. What happened to it? Is this actually used and it's not really half-open, or was it simply a mistake in allocation of pseudo-bin numbers and we accidentally skipped one?

@jmarshall
Copy link
Member Author

Also if you do the arithmetic about how many bins there are on each level (1 + 8 + 64 + 512 + 4096 + 32768), you confirm that the last BAI bin is numbered 37448 (when the first bin is numbered 0). So the ‘claim’ in the SAM spec is easily confirmed to be accurate 😄

Leaving 37449 unused is a historical accident, as noted here on samtools-help. It wasn't ‘we’ who apparently accidentally skipped one — this was back in the days when one person was implementing things in samtools without review or discussion.

(It's pseudo-bin 37450, first chunk and second chunk. There's no bin 37451 either. Also I suppose 37449 not being valid means the recently added n_bin constraint (SAMv1 §5.2) could be one tighter, but it's only an upper bound and one tighter would be more confusing…)

@jkbonfield
Copy link
Contributor

Ah sorry, yes it's <37451 mentioned. I was just scanning and failed to read properly. :)

I realised of course this "we" was before our time. I just meant we as in the group of people who have ever been maintainers of the spec.

Anyway, happy with the changes.

@yfarjoun
Copy link
Contributor

what about CSIv2.pdf? does it also need fixing? also, why are there two versions?

@jmarshall
Copy link
Member Author

jmarshall commented May 28, 2020

CSIv2 is marked as draft and has never been implemented. The TeX source exists only on the experimental CSIv2 branch. See also #321 (comment). The PDF should really just be removed until the ideas get revisited. Since removed by PR #503.

@jmarshall jmarshall requested review from lbergelson and removed request for yfarjoun May 28, 2020 16:03
jmarshall added a commit to jmarshall/hts-specs that referenced this pull request Jun 18, 2020
…ools#499)

The virtual file offsets in the metadata pseudo-bins have always been
misdescribed as locating the *placed unmapped reads* (which are not
contiguous but spread throughout the reads on the chromosome), contrary
to how they are computed by both HTSlib and HTSJDK and to Heng Li's
original description on the samtools-devel mailing list in July 2010
("Re: Bam index - how to treat alignments with 0 end coordinate", [1]):

   "[...] each chr has a bin 37450 which currently has two records [...]
    The first record keeps the start and end file offset of the
    entire chr (chunk_beg[0]=chr_start_off, chunk_end[0]=chr_end_off)."

Fixes samtools#498.

[1] https://sourceforge.net/p/samtools/mailman/message/25690499/
@hts-specs-bot
Copy link

Changed PDFs as of 92b3454: CSIv1 (diff), SAMv1 (diff).

@jmarshall
Copy link
Member Author

Rebased, and version history entry changed to June.

I'd like to merge this before the next meeting. It's just waiting for an okay from @lbergelson — transferred from Yossi at the last meeting on the basis that you know more about indexes and are better placed to confirm that the new description matches HTSJDK behaviour (implemented I think in BAMIndexMetaData.java's recordMetaData()).

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks right to me. The CSI clarifications are especially helpful.

It seems like we should have a sentence somewhere in the SamSpec defining placed unmapped reads. There is a note saying that unmapped reads can be placed on the reference but it might be worth explicitly calling out the nomenclature somewhere.

\noindent
The following functions generalise those given in the SAM specification for a BAI-style binning scheme.
Note that in CSI \textit{depth} refers to the scheme's maximal depth, i.e., the level number of the scheme's smallest bins, and the single-bin level spanning the entire coordinate range is level 0.
Hence the BAI-style binning scheme, with six levels in total, is represented by $\mbox{\sf min\_shift} = 14, \mbox{\sf depth} = 5$.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely helpful to include here. Thank you.

{\sf unmapped\_beg} & (Virtual) file offset of the start of placed unmapped reads & {\tt uint64\_t} & \\\hline
{\sf unmapped\_end} & (Virtual) file offset of the end of placed unmapped reads & {\tt uint64\_t} & \\\hline
{\sf ref\_beg} & (Virtual) file offset of the start of reads placed on this reference& {\tt uint64\_t} & \\\hline
{\sf ref\_end} & (Virtual) file offset of the end of reads placed on this reference& {\tt uint64\_t} & \\\hline
{\sf n\_mapped} & Number of mapped read-segments for this reference & {\tt uint64\_t} & \\\hline
{\sf n\_unmapped} & Number of unmapped read-segments for this reference & {\tt uint64\_t} & \\\hline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be clarified as "Number of placed unmapped read-segments for this reference"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also a bit weird that we say "reads" in some of these lines and "read-segments" on others when we mean the same thing.

…ools#499)

The virtual file offsets in the metadata pseudo-bins have always been
misdescribed as locating the *placed unmapped reads* (which are not
contiguous but spread throughout the reads on the chromosome), contrary
to how they are computed by both HTSlib and HTSJDK and to Heng Li's
original description on the samtools-devel mailing list in July 2010
("Re: Bam index - how to treat alignments with 0 end coordinate", [1]):

   "[...] each chr has a bin 37450 which currently has two records [...]
    The first record keeps the start and end file offset of the
    entire chr (chunk_beg[0]=chr_start_off, chunk_end[0]=chr_end_off)."

Fixes samtools#498.

[1] https://sourceforge.net/p/samtools/mailman/message/25690499/
Mention that CSI has metadata pseudo-bins with the same contents as BAI,
and generalise their bin number calculation from BAI's 37450.

Also clarify CSI "depth", which is more max_level than number-of-levels.
@jmarshall
Copy link
Member Author

Rebased, and version history entry changed to July.

Thanks for the thoughtful review, Louis. The placed/unplaced terminology is explained in footnote 36 on the same page as this pseudobin table — does this suffice? You are quite right in your comments about n_unmapped: it would be good to rework that description to include ‘placed’, and what all these fields are probably strictly referring to is alignment records rather than reads/segments (consider when a read has two mappings to the same chromosome and appears in two alignment records). And it's currently intentionally vague about exactly where the ref_end file offset points to (it's the byte after that last alignment record).

I plan to merge this as is and address those editorial concerns in a followup.

@hts-specs-bot
Copy link

Changed PDFs as of 85c048d: CSIv1 (diff), SAMv1 (diff).

@jmarshall jmarshall merged commit 85c048d into samtools:master Jul 21, 2020
@github-pages github-pages bot temporarily deployed to github-pages July 21, 2020 09:03 Inactive
@jmarshall jmarshall deleted the pseudobin branch July 21, 2020 09:03
@lbergelson
Copy link
Member

@jmarshall You're right. I looked for something like that but somehow missed it even though it was right there. The footnote seems fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistency between SAM spec and htslib .bai index generation
5 participants