
Reliable timepoint multiplicity calculation #130

Open
eharkins opened this issue Feb 15, 2019 · 6 comments
Assignees
Labels
vega-viz Specifically related to vega viz stuff

Comments

@eharkins
Contributor

Right now, when calculating percentages for the pie charts, we assume that a given sequence's multiplicity equals the sum of its timepoint multiplicities: https://github.com/matsengrp/olmsted/blob/master/src/components/explorer/vega/clonal_family_details.js#L232.

As is noted in the comment above that line, we thought it might be better to sum the timepoint multiplicities ourselves, because of how the pie chart marks work, so that we could be certain our percentages were accurate. The pie charts are displaying just fine for now, so we closed the issue about that, but this could become important depending on what the data looks like.
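To make the tradeoff concrete, here is a minimal sketch of the two approaches being discussed (hypothetical function names and plain lists, not Olmsted's actual Vega code): dividing each timepoint multiplicity by a stated total, versus dividing by the sum of the parts. The second form guarantees the slices total 100% even if the stated total is inconsistent with the parts.

```python
def percentages_from_total(timepoint_mults, total_multiplicity):
    # Trusts the stated total; slices won't sum to 100% if the
    # parts don't actually add up to total_multiplicity.
    return [m / total_multiplicity for m in timepoint_mults]

def percentages_from_sum(timepoint_mults):
    # Sums the parts itself, so the slices always total 100%.
    total = sum(timepoint_mults)
    return [m / total for m in timepoint_mults]
```

Both agree when the data is consistent; they diverge exactly in the case this issue is worried about.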

@eharkins eharkins self-assigned this Feb 15, 2019
@eharkins eharkins added the vega-viz Specifically related to vega viz stuff label Feb 15, 2019
@metasoarous
Member

To be clear here, after digging into the mechanics of how Vega does these pie chart transforms, we realized that if even one of the nodes has timepoint multiplicities which don't add up to its total multiplicity, it could potentially mess up all the other pie charts. So it's worth making sure that the numbers going in are correct, or doing some kind of validation along the way.

@metasoarous
Member

@eharkins I think this has become a little more important now that we are generalizing things. Are we still taking totals at face value, or are we dynamically adding them up from the individual timepoint data?

@eharkins
Contributor Author

eharkins commented Apr 1, 2019

@metasoarous
If by

taking totals at their face value

you mean dividing each timepoint multiplicity by the multiplicity (or cluster_multiplicity) value instead of the sum of the timepoint multiplicity values for that sequence (see https://github.com/matsengrp/olmsted/blob/master/src/components/explorer/vega/clonal_family_details.js#L253), then yes.

The code that creates these values (https://github.com/matsengrp/cft/blob/master/bin/process_partis.py#L122-L135) seems to support the assumption that these parts add up to the whole in the cft case. Because of the filtering done here: https://github.com/matsengrp/olmsted/blob/master/src/components/explorer/vega/clonal_family_details.js#L218-L223 (see comment explaining the filter), I don't think it is the case anymore that

if even one the nodes has timepoint multiplicities which don't add up to the total multiplicity, it could potentially mess up all the other pie charts.

as you say above.

With all this in mind, is it worth doing this anyway, or somehow requiring in the schema (if possible) or in the Olmsted code that the individual timepoint multiplicities add up to the total for each sequence?
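One possible shape for such a check (the field names here are assumptions based on this discussion, not the actual Olmsted schema): a validation pass that compares each sequence's stated total against the sum of its timepoint multiplicities.

```python
def check_timepoint_multiplicities(sequence):
    """Return True if the per-timepoint multiplicities sum to the
    sequence's stated total multiplicity (field names hypothetical)."""
    stated_total = sequence["multiplicity"]
    summed = sum(
        tp["multiplicity"]
        for tp in sequence.get("timepoint_multiplicities", [])
    )
    return summed == stated_total
```

A check like this could run during data processing and either reject inconsistent input or emit a warning before the pie charts ever see it.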

@metasoarous
Member

I don't see how the filtering you mention implies that we don't have to worry about the cluster multiplicities adding up anymore. Can you explain that please?

Assuming there's still any way a user could put inconsistent data into Olmsted, we should either have a check that the numbers match up, or compute the totals ourselves, so nothing ever ends up being off.

@eharkins
Contributor Author

eharkins commented Apr 1, 2019

I don't see how the filtering you mention implies that we don't have to worry about the cluster multiplicities adding up anymore. Can you explain that please?

I think you are right that this does not affect that issue. It just means we can accept some leaves not having multiplicity data.

Assuming there's still any way a user could put inconsistent data into Olmsted, we should either have a check that the numbers match up, or compute the totals ourselves, so nothing ever ends up being off.

Sounds good. One way to do things is to not include total multiplicity values and just

compute the totals ourselves

by summing the timepoint multiplicities in Olmsted front end code somewhere or in bin/process_data.py.

This makes me wonder what other derived data we have in the schema that are liable to be inconsistent with their sources and may need a similar check.

@metasoarous
Member

by summing the timepoint multiplicities in Olmsted front end code somewhere or in bin/process_data.py.

That's not a bad idea; however, we should not require that information be passed in this way. If folks don't have or don't want to specify timepoints, they should be able to just specify the regular/aggregated multiplicity values verbatim. We should only compute/check them when needed.
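That policy could be sketched roughly like this (again with hypothetical field names): sum the timepoint multiplicities when timepoint data is provided, and otherwise fall back to the aggregated value verbatim.

```python
def effective_multiplicity(sequence):
    """Compute a sequence's multiplicity from its timepoint data when
    present; otherwise take the aggregated value at face value.
    Field names are hypothetical, not the actual Olmsted schema."""
    timepoints = sequence.get("timepoint_multiplicities")
    if timepoints:
        # Timepoint data available: derive the total ourselves so the
        # pie-chart percentages are guaranteed to be consistent.
        return sum(tp["multiplicity"] for tp in timepoints)
    # No timepoint data: use the stated aggregated multiplicity verbatim.
    return sequence["multiplicity"]
```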
