Generate random (gaussian or realistic) distance matrices #45

Merged
mortonjt merged 7 commits into master from structured-randdm
May 7, 2018

HannesHolste
Collaborator

Code includes:

  1. Method to generate a random distance matrix drawn from a Gaussian distribution, i.e. unrealistic, totally random data.
  2. Method to generate a random distance matrix from a realistic OTU table (with either band or block patterns), thanks to work by @mortonjt. It uses Bray-Curtis distance to generate the distance matrix from the OTU table. I had to package Jamie's work as a Python wheel, included in the conda environment.yml file, because it's not public on PyPI yet.
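
For illustration, the two approaches listed above can be sketched roughly as follows. This is a minimal, dependency-free sketch and not the code in this PR: the function names `random_gaussian_dm`, `bray_curtis` and `bray_curtis_dm` are hypothetical, taking the absolute value of the Gaussian draws is an assumed way of keeping distances non-negative, and the script itself calls `beta_diversity('braycurtis', ...)` rather than a hand-written loop.

```python
import random


def random_gaussian_dm(n, seed=None):
    """Return an n x n symmetric matrix with a zero diagonal (hypothetical helper).

    Off-diagonal entries are |N(0, 1)| draws; taking the absolute value is an
    assumption made here so that distances are non-negative.
    """
    rng = random.Random(seed)
    dm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = abs(rng.gauss(0.0, 1.0))
            dm[i][j] = d
            dm[j][i] = d  # mirror the value to keep the matrix symmetric
    return dm


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # convention chosen here for two empty samples
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def bray_curtis_dm(table):
    """Pairwise Bray-Curtis distances between the rows (samples) of `table`."""
    n = len(table)
    dm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dm[i][j] = dm[j][i] = bray_curtis(table[i], table[j])
    return dm
```

Both helpers return plain nested lists; a real OTU table would come from the band or block generators mentioned above, which are not reproduced here.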

Open question:
For item 2: right now the number of features in the OTU table is equal to whatever is specified as the desired dimension of the distance matrix. Should this be user-configurable? If so, what is a sensible default number of features in the OTU table? For example, it could default to the number of samples, to 1/10th the number of samples, or to a fixed value such as 6,000. Is there an upper limit to the number of features we see in typical OTU tables? How much does it differ between closed-reference OTU-picked tables and deblur tables?

@HannesHolste HannesHolste requested review from wasade and antgonza April 1, 2018 22:25
@coveralls

coveralls commented Apr 1, 2018

Coverage remained the same at 87.429% when pulling d3ddea2 on structured-randdm into 165ae6f on master.

Contributor

@antgonza antgonza left a comment

On point 2: the number of features depends on the environment and the processing, but I'm not sure anyone has done a full benchmark that would give us numbers. So I have the feeling that making it a parameter makes sense, but given this uncertainty it is perhaps fine to leave it as is. @mortonjt might have a better intuition on this. This might also lead to specific questions about distance matrices and their behavior, which might impact PCoA performance, but that is out of the scope of this first round of tests ...

scripts/randdm Outdated
def generate(dimensions, output_dir, seed, subsample_dims, structure,
             overwrite):
    """
    Generate random distance matrix
Contributor

Can you add a description of the parameters and the outputs?

scripts/randdm Outdated
# Subsampling
for subsample_dim in subsample_dims:
    # Parse parameter values into integers
    if '%' in subsample_dim:
Contributor

So in theory you can pass 100%, 200%, -10001%, or XX for this value ... perhaps it's worth adding a nicer validation step so it doesn't break because it can't convert the value to int; this applies to all parameters.
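
A validation step along these lines might look like the sketch below. The helper name `parse_subsample_dim` and its range rules are assumptions, not something confirmed in this PR: a percentage is taken relative to the full matrix dimension and must lie in (0, 100], and a plain integer must lie in [1, full dimension].

```python
def parse_subsample_dim(value, full_dim):
    """Convert '50%' or '200' into a dimension, raising ValueError otherwise.

    Assumed semantics (hypothetical, see the note above): percentages are
    relative to `full_dim`; the result must be between 1 and `full_dim`.
    """
    text = value.strip()
    is_percent = text.endswith('%')
    number = text[:-1] if is_percent else text
    try:
        parsed = int(number)
    except ValueError:
        # Covers inputs such as 'XX' or '12.5%'
        raise ValueError(f"invalid subsample dimension: {value!r}") from None
    if is_percent:
        if not 0 < parsed <= 100:
            raise ValueError(f"percentage out of range (0, 100]: {value!r}")
        parsed = full_dim * parsed // 100
    if not 1 <= parsed <= full_dim:
        raise ValueError(
            f"subsample dimension out of range [1, {full_dim}]: {value!r}")
    return parsed
```

In a click-based script, the `ValueError` could be caught and re-raised as `click.BadParameter` so that the user sees a usage error instead of a traceback.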

@HannesHolste
Collaborator Author

@antgonza thanks for the feedback. Changes made as requested. OK to merge?

@antgonza
Contributor

antgonza commented May 7, 2018

@mortonjt, could you take a look and if you are OK with these changes can you merge? Thanks!

@HannesHolste HannesHolste requested review from mortonjt and removed request for wasade May 7, 2018 15:46
environment.yml Outdated
- numpy
- scikit-learn=0.19.1
- scipy
- pandas
- jupyter
- matplotlib
- gneiss
Collaborator

Can you specify the version of gneiss you are using? We may not guarantee backwards compatibility in the future.

scripts/randdm Outdated
otu_table = biom_table.matrix_data.todense()

click.echo('Generating distance matrix from OTU table...')
distance_matrix = beta_diversity('braycurtis', otu_table,
Collaborator

Are we ok with hard-coding the distance measure?

Collaborator Author

I changed it to be a CLI param with braycurtis as default value.
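
As a rough illustration of that kind of change, the sketch below exposes the metric as an option with `braycurtis` as the default. It uses the standard library's `argparse` so that it runs without extra dependencies, whereas the script itself uses click; the option name `--metric` is hypothetical and not taken from the PR.

```python
import argparse


def build_parser():
    """Build a parser with a configurable distance metric (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Generate a distance matrix from an OTU table (sketch)")
    # '--metric' is a hypothetical option name; Bray-Curtis stays the default
    parser.add_argument('--metric', default='braycurtis',
                        help='name of the distance metric to use')
    return parser


# Parsing explicit argument lists keeps the example independent of sys.argv
default_args = build_parser().parse_args([])
custom_args = build_parser().parse_args(['--metric', 'jaccard'])
```

Here `default_args.metric` falls back to `braycurtis`, while `custom_args.metric` carries the user-supplied value that would then be passed on to the distance computation.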

Collaborator

@mortonjt mortonjt left a comment

Ready to merge once comments are addressed.

@mortonjt mortonjt merged commit 5911f9c into master May 7, 2018
@mortonjt
Collaborator

mortonjt commented May 7, 2018

Very exciting! Thanks @HannesHolste!

@HannesHolste HannesHolste deleted the structured-randdm branch May 7, 2018 16:05
4 participants