Skip to content

Mutual MRCA method for finding merge compatible pairs without set operations

Jonathan A Rees edited this page Jul 15, 2015 · 9 revisions

The following idea arose from thinking about how smasher might be improved, but it's possible they might have application elsewhere.

Assume that the node sets of trees are disjoint between trees (speaking of them as data structures, not as what they mean), that all tips have labels, and that different tips in the same tree have different labels.

Recall the definitions we've been kicking around:

  • A 'phylogenetic statement' ('PS' or 'split') s is determined by an a pair of disjoint sets, the 'ingroup' in(s) and the 'outgroup' out(s).

  • A node n in tree t determines a phylogenetic statement ps(n) by in(ps(n)) = labels of tips reachable from n, out(ps(n)) = {labels of tips in t} minus in(ps(n)).

  • Phylogenetic statements s1 and s2 are 'merge compatible' if in(s1) intersect out(s2) = empty set and out(s1) intersect in(s2) = empty set.

  • By abuse of terminology we say that nodes n1 and n2 are 'merge compatible' if ps(n1) and ps(n2) are merge compatible.

The problem I want to consider first is that of finding, for two trees t1 and t2, all node pairs (n1,n2) with n1 in t1, n2 in t2, and n1 merge compatible with n2. Future work: talk about (a) dealing with any number of trees, not just two, and (b) also computing the 'nestedchildof' relation.

Cody (I think) has described a way to find merge-compatibility relationships between two trees t1 and t2 as follows:

  1. realize as bit-set data structures the sets in(ps(n)) and out(ps(n)) for every node n in t1 or t2

  2. for every node n1 in t1, check every node n2 in t2 to see if the condition for merge-compatibility holds between ps(n1) and ps(n2), by checking for nonempty bit-set intersections between in(ps(n1)) and out(ps(n2)) and between in(ps(n2)) and out(ps(n1)).

If N = the number of nodes in a tree, then this method requires N^2 space and N^5 time (I believe; and this may be too small). The use of bit sets makes the constant factors small, so this method might be practical for small trees. But for the large trees that smasher deals with (e.g. the NCBI and GBIF taxonomies), I don't think this will work.

It's possible the actual method used in recent synthesis is more clever than this, but I've not yet seen documentation saying so.

I want to investigate the possibility of using common ancestor calculations as a substitute for bit-set calculations.

  • Define tree(n) = the tree in which node n occurs
  • Define labels(t) = labels of tips in tree t
  • Define labels(n) = labels of tips reachable tipward from n = in(ps(n))
  • Define Q = labels(t1) intersect labels(t2)
  • Define a 'Q-label' to be a label in Q
  • Define 'the Q-labels of a node n' (or Q-labels(n)) to be {labels(n) intersect Q}

Observe that:

n1 and n2 are merge-compatible if and only if Q-labels(n1) = Q-labels(n2)

Define: nca(L, t) = nearest common ancestor of those nodes in t that have Q-labels in L

Define: n is 'clean' if n = nca(L, t) for some L a subset of Q, and for some tree t. Equivalently, n is 'clean' if it has no child that has the same Q-labels set as n.

Suppose n1 is a clean node in tree t1. Let

n2 = nca(Q-labels(n1), t2)

n2 has all of n1's Q-labels, and maybe more. If it does not have more, then n1 and n2 are merge-compatible.

So, how do we figure out whether n1 and n2 are merge-compatible, i.e. whether or not Q-labels(n2) is equal to Q-labels(n1), and not bigger?

We could iterate over Q-labels(n2) to see whether any of them is not in Q-labels(n1). That would be similar to doing bit-set intersections.

But we can do it using common ancestors. Just reverse the above process:

m1 = nca(Q-labels(n2), t1)

This is the mirror image of the above: m1 has all of n2's Q-labels, and maybe more. If it does not have more, then n1 and n2 are merge-compatible.

We have:

Q-labels(n1) subsetof Q-labels(m1) subsetof Q-labels(n2)

where 'subsetof' includes equality as a possibility.

Clearly, if n1 = n2, then all three Q-label sets are the same. The converse is also true, because distinct clean nodes in the same tree have distinct Q-label sets.

So, if n1 = n2, ps(n1) and ps(m1) are merge-compatible - i.e. we have found a node in t2 that is merge-compatible with n1, and if we have not found such a node, then none exists. We have eliminated bit-set operations and replaced them by common ancestor computations.

This gives us all merge-compatibility relationships between clean nodes. There are additional relationships involving their not-clean ancestors (if any). Let A = the set {n1 and its not-clean ancestors going up to be not including the next clean ancestor}, and B = similarly for n2. Then every member of A is merge-compatible with every member of B, since they all have the same set of Q-labels.

Finally, to find all merge-compatibility relationships between trees t1 and t2, iterate over the nodes of t1 in preorder. For each node n1 in t2 with non-empty Q-labels(n1), find m1 and n2 as above, memoizing the nca's for both trees as appropriate. If n1 = n2, then n1 and m1 are merge compatible. There are also merge-compatibility relationships between n1 and not-clean ancestors of m2, as above. (n1 may be not-clean, i.e. n2 might be a descendant of n1 rather than an ancestor, so that case requires recognition and special treatment.)

Note that t2 does not need to be traversed.

The complexity is better than the bit-set approach, but still not great. However, there is an optimization due to the fact that we are finding all merge-compatible pairs at once. This is to memoize the nca calculations, e.g. associate n2 with n1 and m1 with n2. This is very helpful because the calculation is compositional - the nca in t2 of Q-labels(n1) is a function of the ncas in t2 of the respective Q-labels of n1's children. If we do this, then a bottom-up process can calculate all of the merge-compatible pairs without having to traverse the entire subtree at every node to calculate or check the set of Q-labels.

To compute nca(Q-labels(n), t) quickly, first compute

a1, ... ak = nca(Q-labels(n1), t), ... nca(Q-labels(nk), t)

These determinations would be memoized at nodes n1, ... nk. Then compute the nearest common ancestor of nodes a1, ... ak. Follow the ancestor chains of a1, ... ak until we get to nodes b1, ... bk that are all as close to the root as the closest of the a's. (This is fast if the distance is memoized.) Then ascend from b1, ... bk in lockstep until the nearest node from which they all descend is found. (For simplicity this can be done pairwise.)

With these memoizations, the procedure looks close to linear to me.

Containment

What if n1 and n2 are clean but distinct? Obviously n2 is an ancestor of n1. We know that m1 is not merge-compatible with any node in t2. But in this case we also know that Q-labels(n1) is a proper subset of Q-labels(m1). This is because if (to the contrary) we had Q-labels(n1) = Q-labels(m1) = S, we would have n2 = nca(Q-labels(m1), t1) = nca(S, t1) = n1, because a given Q-labels set determines the nca uniquely.

This is the 'nestedchildof' relationship from the synthesis document: n1 is not compatible with m1, but it provides an alternate way that the taxon m1 might be carved up (alternate to the descendant structure of m1 in t2).

There is an important distinction to be made when n is a polytomy in t1. Let m be a node in t2 that is not merge-compatible with any node in t1, but maps to a node n (i.e. n = nca(Q-labels(m), t1)) that has more than two children. It is possible that m can 'group together' some of the children of n, suggesting a new clade that is compatible with the tree t1 and that helps us resolve n's polytomy. On the flip side, if for given m there is no such n, then it may be useful to know that m is incompatible with t1 - there is no way for it to 'fit in'. How do we distinguish such nodes m?

Let m1, m2, ... be the nearest descendants of a given m that map to merge-compatible nodes n1, n2, ... in t1 (i.e. that do not map to n). If n1, n2, ... are all children of n, then the condition is met, otherwise it's not. In fact the entire structure of t2 'in between' m and m1, m2, ... provides a collection of clade hypotheses compatible with the polytomy.

As a matter of implementation, this requires a traversal of t2. Suppose we have all mappings from nodes n in t2 to nca(Q-labels(n), t1) in t1. Then a tipward traversal from m in t2 (determined to be non-merge-compatible by mapping to n in t1 and back) finds merge-compatible nodes m1, m2, ... If the nodes in t1 have parent pointers, it is a matter of finding the node mapped from m1, ... and checking to see if it's a child of n (the node m maps to). If it's not, then m does not give a refinement of the n polytomy. If all m1, ... are processed in this way and all map to children of n, then it is.

Note that some child of m may also provide a refinement of n. This affects the choice of order in which these nodes m are found and processed. A postorder traversal may make more sense than preorder.

TBD: prove this claim, and think about the time complexity of this check. This process resembles the bit-set approach somewhat, but might have acceptable complexity if each node only needs to be visited once, which with some clever coding seems plausible.

(This is similar to the polytomy refinement method used by smasher. There is a difficulty with it, in that it is impossible to say whether the children of n that have no Q-labels should be placed within this new clade hypothesis or not. To be principled, one might want to apply this test only when there are no such 'extra' children, but of course you would lose a lot of taxa in t2 this way in a merge where t1 has priority.)