-
Part of the reason for the increased runtime is that the function that advances to a given tree takes linear time, so your total runtime is closer to quadratic in the number of trees. Another issue depends on how you are using the multiprocessing module: if you are using it with threads, the global interpreter lock may be grinding performance to a halt.
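For example, here is a sketch of the two access patterns (the file name and population are assumptions, and the per-tree MRCA is computed by folding `Tree.mrca()` over the sample set):

```python
import functools
import tskit

ts = tskit.load("relate_output.trees")  # hypothetical file name
samples = ts.samples(population=0)      # hypothetical population of interest

# Seek-per-tree pattern: each at_index() call has to advance to tree i from
# an end of the sequence, so the whole loop is ~quadratic in the number of trees.
mrca_times = []
for i in range(ts.num_trees):
    tree = ts.at_index(i)
    mrca_times.append(tree.time(functools.reduce(tree.mrca, samples)))

# Single-pass pattern: trees() advances each tree incrementally from the
# previous one, so the whole loop is linear in the number of trees.
mrca_times = []
for tree in ts.trees():
    mrca_times.append(tree.time(functools.reduce(tree.mrca, samples)))
```

Also note that if the pool is thread-based (e.g. `multiprocessing.dummy.Pool` or `ThreadPoolExecutor`), this CPU-bound work still serialises on the GIL; a process-based `multiprocessing.Pool` sidesteps that.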
-
Great question @stsmall - how long does the straightforward loop over ts.trees() take? Since these are Relate trees you might be better off splitting the chromosome into chunks - they are slow to iterate over because the tree sequence is stored "tree-by-tree". This also means that seeking to an arbitrary tree is not a cheap operation.
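One way to do that chunking is a sketch like the following (the file name, chunk count, and population are assumptions): give each worker process a genomic interval, cut the tree sequence down to that interval with `keep_intervals()` and `trim()`, and iterate only over the trimmed trees.

```python
import functools
import multiprocessing as mp
import tskit

TS_PATH = "relate_output.trees"   # hypothetical path to the Relate tree sequence
N_CHUNKS = 8                      # hypothetical number of worker processes

def mrca_times_for_interval(interval):
    """Return (position, MRCA time) for every tree in one genomic chunk."""
    left, right = interval
    ts = tskit.load(TS_PATH)              # each worker loads its own copy
    samples = ts.samples(population=0)    # hypothetical population of interest
    # keep_intervals(..., simplify=False) preserves node IDs; trim() drops the
    # empty flanks so this worker only iterates over its own trees.
    chunk = ts.keep_intervals([[left, right]], simplify=False).trim()
    out = []
    for tree in chunk.trees():
        mrca = functools.reduce(tree.mrca, samples)
        out.append((left + tree.interval.left, tree.time(mrca)))
    return out

if __name__ == "__main__":
    ts = tskit.load(TS_PATH)
    breaks = [i * ts.sequence_length / N_CHUNKS for i in range(N_CHUNKS + 1)]
    intervals = list(zip(breaks[:-1], breaks[1:]))
    with mp.Pool(processes=N_CHUNKS) as pool:
        per_tree = [x for chunk in pool.map(mrca_times_for_interval, intervals)
                    for x in chunk]
```

A caveat: a tree that spans a chunk boundary will appear (truncated) in two chunks, so deduplicate at the boundaries if exact per-tree output matters.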
-
Hi,
I am calculating the MRCA of a population on each tree in a tree sequence.
I have been attempting to use the Python multiprocessing module to split the trees into chunks. Why? Because these are trees made in Relate, they are uncompressed, and when I used ts.aslist() it consumed 400 GB of memory.
I tried to create a list of index chunks, [[0,1,2,3], [4,5,6], [7,8,9]], and pass that to pool.map to split up the tree sequence (roughly as sketched below).
That actually took longer than running over each tree in sequence with ts.trees(). This may reflect my poor understanding of how the multiprocessing module works in Python and my inexperience with the tree_sequence object.
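Roughly, what I tried looks like this (the file name and population here are placeholders):

```python
import functools
import multiprocessing as mp
import tskit

ts = tskit.load("relate_output.trees")   # placeholder file name
samples = ts.samples(population=0)       # placeholder population of interest

def mrca_times(indexes):
    out = []
    for i in indexes:
        tree = ts.at_index(i)            # jump to tree i inside this worker
        out.append(tree.time(functools.reduce(tree.mrca, samples)))
    return out

if __name__ == "__main__":
    index_chunks = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
    with mp.Pool() as pool:
        results = pool.map(mrca_times, index_chunks)
```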
any advice?
thanks,
scott