Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF info and format fields #471

Merged
merged 6 commits into from
Mar 9, 2021

Conversation

tomwhite
Copy link
Collaborator

This changes the VCF reader to handle INFO and FORMAT fields of type Flag, Integer, or Float, and where number is 1, as suggested in #464.

A few things for discussion:

  • Unlike scikit-allel which has a single fields list, I've created two, info_fields and format_fields to avoid ambiguities (like DP, which can apply to either), Alternatively, we could have a single list, and namespace them, e.g. FORMAT/DP.
  • Naming. As it stands INFO fields are prefixed with variant_, and FORMAT fields with call_. So call_DP is an example. Do we want to add a renaming keyword (like scikit-allel has) so it could be renamed to e.g. call_depth?
  • The GT FORMAT field is special since it is always included, even if format_fields doesn't include it. Should we make it optional?
  • The vcf_to_zarr_sequential is getting a bit unwieldy and should probably be refactored into a class or set of classes to handle various field types.

@tomwhite tomwhite added the data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc label Feb 23, 2021
@jeromekelleher
Copy link
Collaborator

Naming. As it stands INFO fields are prefixed with variant_, and FORMAT fields with call_. So call_DP is an example. Do we want to add a renaming keyword (like scikit-allel has) so it could be renamed to e.g. call_depth?

One quick comment here: I'd tend to avoid renaming stuff from VCF. Ultimately we're just representing the VCF data so we should keep as close to the original as we can, IMO. By using call_DP it's clear we're not doing anything clever, just showing the user the value of the DP field in the VCF.

@jeromekelleher
Copy link
Collaborator

This is great @tomwhite! I've had a quick skim through and it looks very nice.

I think it would be worthwhile building up a library of weird and wonderful VCF examples at this point, I suspect some of the assumptions being made won't hold up in practise (that was my repeated experience with the VCF parser in wormtable. Using htslib/cyvcf2 is a huge help, but I suspect people play pretty fast and loose with INFO field types and so on, and we'll need to provide the user with ways of defining the numpy dtype, etc, themselves.

for info_field, arr in info_field_arrays.items():
if info_field in [k for k, _ in variant.INFO]:
if isinstance(variant.INFO[info_field], tuple):
# for the case Number='.', just use the first value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to also throw an error for the '.' case as this is commonly used indicate that the field is variable in length, see this old htslib issue

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - let's not support this until it's done properly.

@tomwhite
Copy link
Collaborator Author

I think it would be worthwhile building up a library of weird and wonderful VCF examples at this point

I've used test data from GATK for this in the past. It also looks like scikit-allel has some test VCFs that have been chosen/altered to test issues such as these.

@tomwhite
Copy link
Collaborator Author

Here's an update which better encapsulates the INFO and FORMAT field parsing, and removes support for the Number '.' case.

def add_variant(self, i: int, variant: Any) -> None:
key = self.key
if self.category == "INFO":
if key in [k for k, _ in variant.INFO]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use variant.INFO.get(key) here

@jeromekelleher
Copy link
Collaborator

Here's a ready-made set of example VCFs @tomwhite: https://github.com/ga4gh/ga4gh-server/tree/master/tests/data/datasets/dataset1/variants

This is from the old GA4GH REST API reference implementation, where we did try to support all the INFO/GT fields in VCFs.

@tomwhite tomwhite force-pushed the vcf-info-and-format-fields branch from 9c17620 to f711779 Compare March 2, 2021 16:40
@tomwhite
Copy link
Collaborator Author

tomwhite commented Mar 2, 2021

I've overhauled this quite a bit with the following changes:

  • Support all VCF Type fields (Integer, Float, String, Flag)
  • Support the following VCF Number fields: any integer, A, R (but not . or G).
  • Allow field definitions to be overridden using the field_defs argument. (This is useful to override a . Number field with a fixed number, for example.)
  • Allow fields to be specified by a wildcard (e.g. INFO/*).
  • Convert INFO and FORMAT fields from scikit-allel format, and add roundtrip tests to check equivalence. This is a useful check since it's effectively testing two VCF parsers against one another.

I've checked equivalence for all the test files in sgkit's test data, and it would be fairly straightforward to extend this to other sets of VCF files. That could be done separately though.

@timothymillar
Copy link
Collaborator

This looks great @tomwhite. The size of "G" can actually be calculated from n_alleles and ploidy using n-choose-k with replacement

from math import comb
comb(n_alleles + ploidy - 1, ploidy)

@hammer
Copy link
Contributor

hammer commented Mar 2, 2021

@timothymillar do you have any VCFs with data from variable ploidy samples that you might be able to contribute to our testing efforts?

@timothymillar
Copy link
Collaborator

@hammer I'm looking into this, we can definitely share some data for (a chromosome of) a 4x potato population which is already public. I'm not aware of a true mixed ploidy example but at the very least I can call the 4x material with a range of ploidy values to make something similar.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic @tomwhite, minor comments above! Probably some of these should be pushed out to follow-up PRs.

return "bool", False
elif vcf_type == "Integer":
return "i4", -1
elif vcf_type == "Float":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth putting a note in here that BCF stores these values as single precision, so there's no point in using doubles.

def update_dataset(
self, ds: xr.Dataset, add_str_max_length_attrs: bool = False
) -> None:
# cyvcf2 represents missing Integer values as the minimum int32 value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly worried about this - could we be silently altering data? Suppose someone uses INT32_MIN in their data, won't we silently change this to fill_value then? I guess this is an upstream problem with cyvcf2.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened brentp/cyvcf2#195 for this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think they could use INT32_MIN for anything else as this value is explicitly interpreted as a 'missing' integer in the VCF spec (page 30)

@@ -261,6 +467,24 @@ def vcf_to_zarrs(
If True, genotype calls with more alleles than the specified (maximum) ploidy value
will be truncated to size ploidy. If false, calls with more alleles than the
specified ploidy will raise an exception.
max_alt_alleles
The (maximum) number of alternate alleles in the VCF file. Any records with more than
this number of alternate alleles will have the extra alleles dropped.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps not for this PR, but we should probably raise a warning if this happens. Users should know if their ALTs are being dropped, and ideally it should be discoverable later if they have been truncated. Could we add some metadata tracking when this has happened?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #487

this number of alternate alleles will have the extra alleles dropped.
fields
Extra fields to extract data for. A list of strings, with ``INFO`` or ``FORMAT`` prefixes.
Wildcards are permitted too, for example: ``["INFO/*", "FORMAT/DP"]``.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NICE!!!

``dimension``. The first three correspond to VCF header values, and ``dimension`` is
the name of the final dimension in the array for the case where ``Number`` is a fixed
integer larger than 1. For example,
``{"INFO/AC": {"Number": "A"}, "FORMAT/HQ": {"dimension": "haplotypes"}}``
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to explain what "A" means here; maybe we should say something like "see section 1.2.2 of the VCF specification <url>__ for further details on the available Number specifications."

@mergify
Copy link
Contributor

mergify bot commented Mar 8, 2021

This PR has conflicts, @tomwhite please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Mar 8, 2021
@tomwhite tomwhite force-pushed the vcf-info-and-format-fields branch from f711779 to f3a919c Compare March 9, 2021 10:34
@mergify mergify bot removed the conflict PR conflict label Mar 9, 2021
@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Mar 9, 2021
@mergify mergify bot merged commit 10e3e01 into sgkit-dev:master Mar 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants