write_vcf(): discrete_genome, 1-based coordinates, and contig length #1993

grahamgower · 2021-12-03T08:26:40Z

grahamgower
Dec 3, 2021
Collaborator

Hi tskitters,

My goal is to output a vcf using 1-based inclusive coords from a simulation with mutations at integral positions. By default, write_vcf() will output 0-based coords, and it is possible that a variant will be given POS 0. The write_vcf() docs suggest that if this behaviour is undesirable, the onus is on the user to transform coordinates by passing a position_transform function. (As an aside, I see the VCF4.2 spec says: "telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig".)

For an infinite sites simulation, I guess it makes sense to use position_transform=np.ceil to get 1-based inclusive coords. But this is an identity transformation for finite sites simulations: a mutation at position 0 is still at position 0 after transformation. So I figured I'd write it as position_transform=lambda x: 1 + np.floor(x). This gives the right coordinates (I think), but then I saw that the contig length is also being transformed by the position_transform() function! Is there a recommended incantation to get 1-based inclusive coords and not mung the contig length?

import sys

import numpy as np
import msprime


def sim():
    ts = msprime.sim_ancestry(
        population_size=5000,
        samples=5,
        sequence_length=10_000,
        random_seed=1,
    )
    return msprime.sim_mutations(ts, rate=1.25e-8, random_seed=2)


ts = sim()
position = ts.tables.sites.position
# My assumptions about the positions, written in the form of assertions.
# assert ts.discrete_genome  # <-- needs tskit 0.4
assert np.all(position == np.floor(position))
assert np.all(position >= 0)
assert np.all(position < ts.sequence_length)


def position_transform(x):
    return 1 + np.floor(x)


ts.write_vcf(sys.stdout, contig_id="1", position_transform=position_transform)

Output:

##fileformat=VCFv4.2
##source=tskit 0.3.7
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10001>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2	tsk_3	tsk_4
1	8183	.	A	G	.	PASS	.	GT	0|1	0|0	0|0	0|1	0|0
1	9478	.	G	C	.	PASS	.	GT	1|0	0|1	0|0	0|0	0|0
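
For illustration, a minimal sketch of how the two candidate transforms behave. The positions are made up (they are not taken from the simulation above) and mix fractional values, as in an infinite sites simulation, with integral ones, as in a discrete genome:

```python
import numpy as np

# Hypothetical site positions: 0.3 is fractional, the rest are integral.
positions = np.array([0.0, 0.3, 2.0, 9999.0])

# np.ceil leaves integral positions unchanged, so position 0 stays at 0.
ceil_coords = np.ceil(positions)

# 1 + floor shifts every position into 1-based coordinates.
shifted_coords = 1 + np.floor(positions)

print(ceil_coords)     # [0.000e+00 1.000e+00 2.000e+00 9.999e+03]
print(shifted_coords)  # [1.e+00 1.e+00 3.e+00 1.e+04]
```

The two transforms agree only on the fractional position; for integral positions np.ceil is the identity, which is why POS 0 can still appear.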

jeromekelleher · 2021-12-03T09:21:09Z

jeromekelleher
Dec 3, 2021
Maintainer

This is hairy stuff @grahamgower! I don't think we've thought deeply about 1-based coordinates, principally going under the assumption that if you're doing simulations it doesn't matter and if you're working with real data you've input the original coordinates, as is.

A couple of notes:

Why use floor and not just lambda x: x + 1? msprime will guarantee those coords are discrete now.
Why are you worried about the contig length? It seems pretty unlikely anything downstream is going to be doing much with it.

grahamgower Dec 3, 2021
Collaborator Author

> Why use floor and not just lambda x: x + 1? msprime will guarantee those coords are discrete now.

x is actually a list (at least in tskit 0.3.7), so it needs to be lambda x: np.array(x) + 1. But I guess I was hedging my bets about also supporting infinite sites sims.
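
A minimal sketch of that point, assuming the transform is handed a plain Python list of positions as described above. The function name one_based is hypothetical, and the positions are the 0-based values implied by the VCF output earlier in the thread:

```python
import numpy as np


def one_based(x):
    # np.asarray accepts either a list or an array, so the transform
    # works regardless of which type the caller passes in.
    return np.asarray(x) + 1


positions = [8182.0, 9477.0]  # a plain list, not an array

try:
    positions + 1  # list + int is not elementwise addition
except TypeError as err:
    print("plain list fails:", err)

print(one_based(positions))  # [8183. 9478.]
```

This only suits integral positions; for infinite sites output, the 1 + np.floor(x) form would still be needed.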

> Why are you worried about the contig length? It seems pretty unlikely anything downstream is going to be doing much with it.

I want to perform analogous operations on empirical data and simulation output. For sims, I'm using ts.sequence_length and for empirical data, I'm getting the contig length from the VCF header. And to check that my two operations working from sims and from vcf are doing the same thing, I have unit tests that convert the ts to vcf. In practice, I can rearrange things to work around the issue highlighted here, but that is how I encountered the problem.

jeromekelleher Dec 3, 2021
Maintainer

Hmm, would probably be more useful if x was an array in the position transform. Not sure why we made it a list.

Re the contig-length, we felt the least likely thing to lead to downstream errors was to apply the position transform to the sequence length also. I think that's probably unlikely to change, so it's best to build this into your tests. (See here)
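
One way to build this into a test might look like the sketch below. It assumes the contig header has the ID,length layout shown in the output earlier in the thread, and that the same +1 shift applied to positions has been applied to the sequence length (10000 became 10001 there). The helper contig_lengths and the offset variable are hypothetical, not part of tskit:

```python
import re


def contig_lengths(vcf_header):
    """Map contig ID to declared length, read from ##contig header lines."""
    # Assumes the simple "<ID=...,length=...>" layout seen in the output above.
    pattern = re.compile(r"^##contig=<ID=([^,>]+),length=(\d+)>", re.MULTILINE)
    return {m.group(1): int(m.group(2)) for m in pattern.finditer(vcf_header)}


header = (
    "##fileformat=VCFv4.2\n"
    "##contig=<ID=1,length=10001>\n"
)
sequence_length = 10_000  # ts.sequence_length in the simulation above
offset = 1                # the shift applied by the position transform

lengths = contig_lengths(header)
# Undo the shift before comparing against the tree sequence length.
assert lengths["1"] - offset == sequence_length
```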
