-
Is there a way of checking whether one tree sequence is topologically identical to another (i.e. if node times are ignored and, therefore, assuming nodes are in time order, ignoring internal node ids)? I can’t think of an easy way off the top of my head. We can compare tree.rank() for trees, but there’s no obvious way to do this for an entire tree sequence, is there? |
Beta Was this translation helpful? Give feedback.
Replies: 4 comments 12 replies
-
It's not enough to simply check that the breakpoints in the two tree sequences are the same and that each tree is topologically identical between the two tree sequences, because a tree could be topologically identical but the node identities could have changed between trees (which would imply a different topology). Perhaps a brute force approach would be to coiterate over the edge diffs (and trees) and check that each newly introduced parent node has the same set of samples underneath it? |
Beta Was this translation helpful? Give feedback.
-
You could put a mutation on every branch of every tree and check that the genotype matrix is the same! |
Beta Was this translation helpful? Give feedback.
-
This is a Hard Problem I think. We would need to have a combinatorial concept of what a tree sequence topology is to be able to do this properly. I.e., how many topologically distinct tree sequences are there (under some reasonable assumptions), how do we define an ordering over them, etc. This is a PhD project (at least!) as far as I'm concerned... What do you want this for? (Perhaps there's a less difficult problem that actually needs to be solved?) |
Beta Was this translation helpful? Give feedback.
-
I've just PR'ed a method to port whole tree sequences (i.e. the ARG, if we have recombination nodes) into NetworkX. So now to check for tree sequence topological identity we can just read them into NetworkX and use import msprime
import networkx as nx
def edges_identical(d1, d2):
return set((v["left"], v["right"]) for v in d1.values()) == set((v["left"], v["right"]) for v in d2.values())
def as_dict_of_dicts_reorder(ts, mapping=None):
# like ts.as_dict_of_dicts PRed in #1296 but also allows the node IDs
# to be randomly shuffled, so that the mappings are unbiased
# (only needed if you are sampling from the mappings)
dod = {}
for edge in ts.edges():
if mapping is not None:
parent = mapping[edge.parent]
child = mapping[edge.child]
else:
parent = edge.parent
child = edge.child
if parent not in dod:
dod[parent] = {}
if child not in dod[parent]:
dod[parent][child] = {}
dod[parent][child][edge.id] = {
"left": edge.left,
"right": edge.right,
"branch_length": ts.node(parent).time - ts.node(child).time,
}
return dod
target_ts = msprime.sim_ancestry(3, recombination_rate=0.1, sequence_length=2, random_seed=1)
target_graph = nx.from_dict_of_dicts(
as_dict_of_dicts_reorder(target_ts),
create_using=nx.MultiDiGraph(),
multigraph_input=True)
for s in range(2, 1000):
ts = msprime.sim_ancestry(3, recombination_rate=0.1, sequence_length=2, random_seed=s)
graph = nx.from_dict_of_dicts(
as_dict_of_dicts_reorder(ts),
create_using=nx.MultiDiGraph(),
multigraph_input=True)
DiGM = nx.isomorphism.DiGraphMatcher(graph, target_graph, edge_match=edges_identical)
if DiGM.is_isomorphic():
break
print(target_ts.draw_text())
print(ts.draw_text())
print("corresponding nodes:", DiGM.mapping) Which indeed works: the following tree sequences are what I would define as "topologically identical"
|
Beta Was this translation helpful? Give feedback.
I've just PR'ed a method to port whole tree sequences (i.e. the ARG, if we have recombination nodes) into NetworkX. So now to check for tree sequence topological identity we can just read them into NetworkX and use
nx.is_isomorphic
while ensuring that the edges have identicalleft
andright
attributes: