Add splitting BAM index to spec #321
base: master
Changes from 2 commits
6d4f054
15a80f6
d85bb5a
4f854e6
3e56f1e
486445d
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1204,6 +1204,48 @@ \subsection{C source code for computing bin number and overlapping bins}\label{s | |
\end{verbatim} | ||
} | ||
|
||
\subsection{Splitting BAM}\label{sec:code} | ||
A BAM file can be processed in parallel by conceptually dividing the file into | ||
splits (typically of a fixed, but arbitrary, number of bytes) and for each | ||
split processing alignments from the first known alignment after the split | ||
start up to the first known alignment of the next split. | ||
|
||
A splitting BAM index is a linear index of virtual file offsets of alignment | ||
Review comment: This sentence is a bit hard to parse on the first run through. To many Also, does linear index imply that it's a sorted list of increasing offsets? Should we mention that somewhere? |
||
start positions. The index must contain the virtual file offset for the first | ||
alignment, and a virtual file offset for the overall length of the BAM | ||
file.\footnote{In the unlikely event the BAM file has no alignment records, | ||
the index will consist of a single entry for the overall length of the | ||
BAM file.} It does not need to contain a virtual file offset for every | ||
alignment, merely a subset. A granularity of $n$ means that an offset is | ||
Review comment: Do we want to allow indication of approximate number of records per offset? Or is that just making things unnecessarily complicated? At the cost of increasing the index size we could include the number of records in each section in the index. Instead of |
||
written for every $n$ alignments. | ||
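As a rough illustration of the granularity rule described above, the following sketch selects which offsets an index writer might record. The function name `build_offsets` and its parameters are hypothetical and not part of the proposed spec text; it assumes the caller already knows the virtual file offset of every alignment.

```python
def build_offsets(alignment_voffsets, end_voffset, granularity):
    """Select the virtual file offsets a splitting index would record.

    alignment_voffsets: virtual file offsets of all alignments, ascending.
    end_voffset: virtual file offset for the overall length of the BAM file.
    granularity: one offset is recorded for every `granularity` alignments.
    (All three names are illustrative assumptions.)
    """
    if granularity < 1:
        raise ValueError("granularity must be a positive integer")
    # Keep the first alignment and every n-th alignment after it.
    offsets = list(alignment_voffsets[::granularity])
    # The final entry always marks the overall length of the file.
    offsets.append(end_voffset)
    return offsets
```

With no alignment records the result is the single end-of-file entry, matching the footnote in the text.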
|
||
To find the alignments for a split that covers a byte range {\tt [beg,\,end)} | ||
use the index to find the smallest virtual file offset, {\tt v1}, that falls | ||
in this range, and the smallest virtual file offset, {\tt v2}, that is | ||
greater than or equal to {\tt end}. If {\tt v1} does not exist, then the | ||
split has no alignments. Otherwise, it has alignments in the range | ||
Review comment: no alignments -> no alignment starts |
||
{\tt [v1,\,v2)}. This method will map a set of contiguous, non-overlapping | ||
{\it file ranges} that cover the whole BAM file to a set of contiguous, | ||
non-overlapping {\it virtual file ranges} that cover the whole file. | ||
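The lookup described above might be sketched as follows. This is only an illustration under stated assumptions: the name `split_range` is hypothetical, and the byte positions `beg` and `end` are compared with virtual file offsets by shifting them left 16 bits (the usual BGZF convention of compressed offset in the upper bits), which the proposed text does not spell out.

```python
from bisect import bisect_left


def split_range(offsets, beg, end):
    """Map the byte range [beg, end) to a virtual file range [v1, v2).

    offsets: ascending virtual file offsets from the index; the last
             entry is the offset for the overall length of the BAM file.
    Returns (v1, v2), or None if no alignment starts inside the split.
    """
    if not offsets:
        raise ValueError("index must contain at least the end-of-file entry")
    # Assumption: a byte position p corresponds to virtual offset p << 16.
    vbeg = beg << 16
    # Clamp to the end-of-file entry so v2 always exists.
    vend = min(end << 16, offsets[-1])
    i = bisect_left(offsets, vbeg)      # smallest offset >= vbeg
    if i == len(offsets) or offsets[i] >= vend:
        return None                     # v1 does not exist
    j = bisect_left(offsets, vend)      # smallest offset >= vend
    return offsets[i], offsets[j]
```

Because `v2` of one split is found by the same search that yields `v1` of the next non-empty split, adjacent byte ranges map to adjacent virtual file ranges.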
|
||
Splitting BAM index filenames have a {\tt .sbi} extension added to the BAM | ||
filename (so {\tt foo.bam.sbi} is the splitting BAM index filename for | ||
{\tt foo.bam}). Index files contain a header followed by a sorted list of | ||
virtual file offsets in ascending order. | |
|
||
\begin{table}[ht] | ||
\centering | ||
{\small | ||
\begin{tabular}{|l|l|l|p{8.15cm}|l|r|} | ||
\cline{1-6} | ||
\multicolumn{3}{|c|}{\bf Field} & \multicolumn{1}{c|}{\bf Description} & \multicolumn{1}{c|}{\bf Type} & \multicolumn{1}{c|}{\bf Value} \\\cline{1-6} | ||
\multicolumn{3}{|l|}{\sf magic} & Magic string & {\tt char[4]} & {\tt SBI\char92 1}\\\cline{1-6} | ||
\multicolumn{3}{|l|}{\sf granularity} & Number of alignments between offsets, or $-1$ if unspecified & {\tt int32\_t} & \\\cline{1-6} | ||
\multicolumn{6}{|c|}{\textcolor{gray}{\it List of offsets}} \\\cline{2-6} | ||
& \multicolumn{2}{l|}{\sf offset} & Virtual file offset of the alignment & {\tt uint64\_t} & \\\cline{1-6} | ||
Review comment: It seems like there should be a special final entry after the list in the table for the offset to the end of the BAM. |
||
\end{tabular}} | ||
\end{table} | ||
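To make the layout in the table concrete, here is a minimal sketch of writing and reading such a file. It follows only the fields listed in the table as proposed in this diff; the function names `write_sbi` and `read_sbi` are hypothetical, and little-endian byte order is an assumption (chosen for consistency with BAM), since the table does not state it.

```python
import io
import struct

MAGIC = b"SBI\x01"  # magic string from the table


def write_sbi(stream, granularity, offsets):
    """Write magic, int32 granularity, then each offset as uint64."""
    stream.write(MAGIC)
    stream.write(struct.pack("<i", granularity))  # -1 if unspecified
    for voffset in offsets:
        stream.write(struct.pack("<Q", voffset))


def read_sbi(stream):
    """Return (granularity, offsets) from a stream written as above."""
    if stream.read(4) != MAGIC:
        raise ValueError("not a splitting BAM index")
    header = stream.read(4)
    if len(header) != 4:
        raise ValueError("truncated header")
    (granularity,) = struct.unpack("<i", header)
    body = stream.read()
    if len(body) % 8 != 0:
        raise ValueError("truncated offset list")
    count = len(body) // 8
    offsets = list(struct.unpack("<%dQ" % count, body))
    return granularity, offsets


# Round trip through an in-memory buffer.
buf = io.BytesIO()
write_sbi(buf, 4096, [100, 1000 << 16, 4000 << 16])
buf.seek(0)
granularity, offsets = read_sbi(buf)
```

The reader infers the number of offsets from the remaining file size, since the table defines no count field.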
|
||
\pagebreak | ||
|
||
\begin{appendices} | ||
|
Review comment: We might want a sentence here describing why this additional index is necessary and this use can't be handled by the bai.