-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VCF info and format fields #471
Conversation
One quick comment here: I'd tend to avoid renaming stuff from VCF. Ultimately we're just representing the VCF data so we should keep as close to the original as we can, IMO. By using |
This is great @tomwhite! I've had a quick skim through and it looks very nice. I think it would be worthwhile building up a library of weird and wonderful VCF examples at this point, I suspect some of the assumptions being made won't hold up in practise (that was my repeated experience with the VCF parser in wormtable. Using htslib/cyvcf2 is a huge help, but I suspect people play pretty fast and loose with INFO field types and so on, and we'll need to provide the user with ways of defining the numpy dtype, etc, themselves. |
sgkit/io/vcf/vcf_reader.py
Outdated
for info_field, arr in info_field_arrays.items(): | ||
if info_field in [k for k, _ in variant.INFO]: | ||
if isinstance(variant.INFO[info_field], tuple): | ||
# for the case Number='.', just use the first value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to also throw an error for the '.' case as this is commonly used indicate that the field is variable in length, see this old htslib issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - let's not support this until it's done properly.
I've used test data from GATK for this in the past. It also looks like scikit-allel has some test VCFs that have been chosen/altered to test issues such as these. |
Here's an update which better encapsulates the INFO and FORMAT field parsing, and removes support for the Number |
sgkit/io/vcf/vcf_reader.py
Outdated
def add_variant(self, i: int, variant: Any) -> None: | ||
key = self.key | ||
if self.category == "INFO": | ||
if key in [k for k, _ in variant.INFO]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use variant.INFO.get(key)
here
Here's a ready-made set of example VCFs @tomwhite: https://github.com/ga4gh/ga4gh-server/tree/master/tests/data/datasets/dataset1/variants This is from the old GA4GH REST API reference implementation, where we did try to support all the INFO/GT fields in VCFs. |
9c17620
to
f711779
Compare
I've overhauled this quite a bit with the following changes:
I've checked equivalence for all the test files in sgkit's test data, and it would be fairly straightforward to extend this to other sets of VCF files. That could be done separately though. |
This looks great @tomwhite. The size of "G" can actually be calculated from from math import comb
comb(n_alleles + ploidy - 1, ploidy) |
@timothymillar do you have any VCFs with data from variable ploidy samples that you might be able to contribute to our testing efforts? |
@hammer I'm looking into this, we can definitely share some data for (a chromosome of) a 4x potato population which is already public. I'm not aware of a true mixed ploidy example but at the very least I can call the 4x material with a range of ploidy values to make something similar. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic @tomwhite, minor comments above! Probably some of these should be pushed out to follow-up PRs.
return "bool", False | ||
elif vcf_type == "Integer": | ||
return "i4", -1 | ||
elif vcf_type == "Float": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth putting a note in here that BCF stores these values as single precision, so there's no point in using doubles.
def update_dataset( | ||
self, ds: xr.Dataset, add_str_max_length_attrs: bool = False | ||
) -> None: | ||
# cyvcf2 represents missing Integer values as the minimum int32 value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm slightly worried about this - could we be silently altering data? Suppose someone uses INT32_MIN in their data, won't we silently change this to fill_value then? I guess this is an upstream problem with cyvcf2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened brentp/cyvcf2#195 for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think they could use INT32_MIN for anything else as this value is explicitly interpreted as a 'missing' integer in the VCF spec (page 30)
@@ -261,6 +467,24 @@ def vcf_to_zarrs( | |||
If True, genotype calls with more alleles than the specified (maximum) ploidy value | |||
will be truncated to size ploidy. If false, calls with more alleles than the | |||
specified ploidy will raise an exception. | |||
max_alt_alleles | |||
The (maximum) number of alternate alleles in the VCF file. Any records with more than | |||
this number of alternate alleles will have the extra alleles dropped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps not for this PR, but we should probably raise a warning if this happens. Users should know if their ALTs are being dropped, and ideally it should be discoverable later if they have been truncated. Could we add some metadata tracking when this has happened?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #487
this number of alternate alleles will have the extra alleles dropped. | ||
fields | ||
Extra fields to extract data for. A list of strings, with ``INFO`` or ``FORMAT`` prefixes. | ||
Wildcards are permitted too, for example: ``["INFO/*", "FORMAT/DP"]``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NICE!!!
``dimension``. The first three correspond to VCF header values, and ``dimension`` is | ||
the name of the final dimension in the array for the case where ``Number`` is a fixed | ||
integer larger than 1. For example, | ||
``{"INFO/AC": {"Number": "A"}, "FORMAT/HQ": {"dimension": "haplotypes"}}`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to explain what "A" means here; maybe we should say something like "see section 1.2.2 of the VCF specification <url>
__ for further details on the available Number specifications."
This PR has conflicts, @tomwhite please rebase and push updated version 🙏 |
f711779
to
f3a919c
Compare
This changes the VCF reader to handle INFO and FORMAT fields of type Flag, Integer, or Float, and where number is 1, as suggested in #464.
A few things for discussion:
fields
list, I've created two,info_fields
andformat_fields
to avoid ambiguities (likeDP
, which can apply to either), Alternatively, we could have a single list, and namespace them, e.g.FORMAT/DP
.variant_
, and FORMAT fields withcall_
. Socall_DP
is an example. Do we want to add a renaming keyword (like scikit-allel has) so it could be renamed to e.g.call_depth
?GT
FORMAT field is special since it is always included, even ifformat_fields
doesn't include it. Should we make it optional?vcf_to_zarr_sequential
is getting a bit unwieldy and should probably be refactored into a class or set of classes to handle various field types.