-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure discussion on VCF position is correct #180
Comments
I see - so 1.2Gb is still well within the signed 32 bit limit. Good catch! |
In the current iteration, there is a draft introductory paragraph to the VCF POS problem. I could edit it but still describe how we convert the coordinate system and provide a notebook, or is this outside the scope of the paper as this is more related to the processing of Zarr files rather than conversion? |
Could you store all of the variants you have in a single VCF file? I.e., will things break when the total number of variants is > 2**31? It would be nice to make the point about these limitations but it's less concrete when we don't have any actual example. |
I'm not sure if there is a documented maximum of these things anywhere, we'd probably need to go poking in the htslib source. |
I could try to concatenate all the files, either with or without converting the coordinates. I'll launch a job and see where it takes me. |
To speed things up, I selected two samples from each VCF and concatenated into one big VCF (without coordinate conversion). Works fine and running bcftools stats shows that there are 3745170452 variants - so that doesn't seem to be an issue. I'll do a similar test by updating coordinates, but I doubt that will change anything. I'm communicating with my colleague who did the chunking to see what his motivation was to do so (could be motivated by some downstream application). |
Nice, thanks for chasing this one up @percyfal |
@jeromekelleher my colleague got back to me and the issue was the limitations imposed by the bam index. From the BAM specification, 5.1.1:
However, it also says
I don't have any good answer why BAI was preferred over CSI. |
Aha, very good. Do you know of any species that definitely do overflow the 32 bit limit, and maybe chase up some reference on them? We could modify the discussion a bit to discuss them, after mentioning that the chrs were chunked up in this way because of the bai index. |
Well, at least the mistletoe should have some chromosomes that are 9-10Gbp, clearly larger than the 2Gbp limit: |
Actually, the 32-bit limit is pretty obvious from the NCBI assembly stats; see Chromosomes section in https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_963277665.1/ And, the ToL article has the following quote:
Still haven't found a reference though, not sure the paper has been published yet |
Very nice! I can't track down the genome note paper for it (presumably would be linked here?) but we can just cite that web page if we need to. |
So we'd say something like
|
When converting the VCF spruce data to Zarr I noticed that even the longer chromosomes (>1.2Gbp) are encoded as 32-bit integers. In the text I wrote that the maximum value of VCF POS is ~537Mbp (God knows where I got that from), but the maximum number for a signed 32-bit integer is 2,147,483,647. The reason we split up chromosomes may therefore be another than an inherent limitation of VCF.
The text was updated successfully, but these errors were encountered: