
Reliable timepoint multiplicity calculation #130

Open
eharkins opened this issue Feb 15, 2019 · 6 comments
Assignees
Labels
vega-viz Specifically related to vega viz stuff

Comments

@eharkins
Contributor

Right now, when calculating percentages for the pie charts, we assume that a given sequence's multiplicity equals the sum of its timepoint multiplicities: https://github.com/matsengrp/olmsted/blob/master/src/components/explorer/vega/clonal_family_details.js#L232.

As is noted in the comment above that line, we thought it might be better to sum the timepoint multiplicities ourselves, because of how the pie chart marks work, so that we could be certain our percentages were accurate. The pie charts are displaying just fine for now, so we closed the issue about that, but this could become important depending on what the data looks like.
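To make the tradeoff concrete, here is a minimal sketch of the two approaches being discussed (hypothetical function names and plain lists, not Olmsted's actual Vega code): dividing each timepoint multiplicity by a stated total, versus dividing by the sum of the parts. The second form guarantees the slices total 100% even if the stated total is inconsistent with the parts.

```python
def percentages_from_total(timepoint_mults, total_multiplicity):
    # Trusts the stated total; slices won't sum to 100% if the
    # parts don't actually add up to total_multiplicity.
    return [m / total_multiplicity for m in timepoint_mults]

def percentages_from_sum(timepoint_mults):
    # Sums the parts itself, so the slices always total 100%.
    total = sum(timepoint_mults)
    return [m / total for m in timepoint_mults]
```

Both agree when the data is consistent; they diverge exactly in the case this issue is worried about.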

@eharkins eharkins self-assigned this Feb 15, 2019
@eharkins eharkins added the vega-viz Specifically related to vega viz stuff label Feb 15, 2019
@metasoarous
Member

To be clear here, after digging into the mechanics of how Vega does these pie chart transforms, we realized that if even one of the nodes has timepoint multiplicities which don't add up to its total multiplicity, it could potentially mess up all the other pie charts. So it's worth making sure that the numbers going in are correct, or doing some kind of validation along the way.

@metasoarous
Member

@eharkins I think this has become a little more important now that we are generalizing things. Are we still taking totals at face value, or are we dynamically adding them up from the individual timepoint data?

@eharkins
Contributor Author

eharkins commented Apr 1, 2019

@metasoarous
If by

taking totals at their face value

you mean dividing each timepoint multiplicity by the multiplicity (or cluster_multiplicity) value instead of the sum of the timepoint multiplicity values for that sequence (see https://github.com/matsengrp/olmsted/blob/master/src/components/explorer/vega/clonal_family_details.js#L253), then yes.

The code that creates these values (https://github.com/matsengrp/cft/blob/master/bin/process_partis.py#L122-L135) seems to support the assumption that these parts add up to the whole in the cft case. Because of the filtering done here: https://github.com/matsengrp/olmsted/blob/master/src/components/explorer/vega/clonal_family_details.js#L218-L223 (see comment explaining the filter), I don't think it is the case anymore that

if even one the nodes has timepoint multiplicities which don't add up to the total multiplicity, it could potentially mess up all the other pie charts.

as you say above.

With all this in mind, is it worth doing this anyway, or somehow requiring in the schema (if possible) or in the Olmsted code that the individual timepoint multiplicities add up to the total for each sequence?
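One possible shape for such a check (the field names here are assumptions based on this discussion, not the actual Olmsted schema): a validation pass that compares each sequence's stated total against the sum of its timepoint multiplicities.

```python
def check_timepoint_multiplicities(sequence):
    """Return True if the per-timepoint multiplicities sum to the
    sequence's stated total multiplicity (field names hypothetical)."""
    stated_total = sequence["multiplicity"]
    summed = sum(
        tp["multiplicity"]
        for tp in sequence.get("timepoint_multiplicities", [])
    )
    return summed == stated_total
```

A check like this could run during data processing and either reject inconsistent input or emit a warning before the pie charts ever see it.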

@metasoarous
Member

I don't see how the filtering you mention implies that we don't have to worry about the cluster multiplicities adding up anymore. Can you explain that please?

Assuming there's still any way a user could put inconsistent data into Olmsted, we should either have a check that the numbers match up, or compute the totals ourselves, so nothing ever ends up being off.

@eharkins
Contributor Author

eharkins commented Apr 1, 2019

I don't see how the filtering you mention implies that we don't have to worry about the cluster multiplicities adding up anymore. Can you explain that please?

I think you are right that this does not affect that issue. It just means we can accept some leaves not having multiplicity data.

Assuming there's still any way a user could put inconsistent data into Olmsted, we should either have a check that the numbers match up, or compute the totals ourselves, so nothing ever ends up being off.

Sounds good. One way to do things is to not include total multiplicity values and just

compute the totals ourselves

by summing the timepoint multiplicities in Olmsted front end code somewhere or in bin/process_data.py.

This makes me wonder what other derived data we have in the schema that are liable to be inconsistent with their sources and may need a similar check.

@metasoarous
Member

by summing the timepoint multiplicities in Olmsted front end code somewhere or in bin/process_data.py.

That's not a bad idea; however, we should not require that information be passed in this way. If folks don't have or don't want to specify timepoints, they should be able to just specify the regular/aggregated multiplicity values verbatim. We should only compute/check them when needed.
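That policy could be sketched roughly like this (again with hypothetical field names): sum the timepoint multiplicities when timepoint data is provided, and otherwise fall back to the aggregated value verbatim.

```python
def effective_multiplicity(sequence):
    """Compute a sequence's multiplicity from its timepoint data when
    present; otherwise take the aggregated value at face value.
    Field names are hypothetical, not the actual Olmsted schema."""
    timepoints = sequence.get("timepoint_multiplicities")
    if timepoints:
        # Timepoint data available: derive the total ourselves so the
        # pie-chart percentages are guaranteed to be consistent.
        return sum(tp["multiplicity"] for tp in timepoints)
    # No timepoint data: use the stated aggregated multiplicity verbatim.
    return sequence["multiplicity"]
```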
