Generate random (gaussian or realistic) distance matrices #45

HannesHolste · 2018-04-01T22:25:28Z

Code includes:

1. A method to generate a random distance matrix drawn from a Gaussian distribution, i.e. unrealistic, totally random data.
2. A method to generate a random distance matrix from a realistic OTU table (either band or block patterns), thanks to work by @mortonjt. It uses Bray-Curtis distance to generate the distance matrix from the OTU table. I had to package Jamie's work as a Python wheel, included in the conda environment.yml file, because it's not public on PyPI yet.
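
For readers skimming the thread, here is a minimal sketch of what a "totally random" Gaussian distance matrix could look like. It is not the code in this PR: the function name `random_gaussian_distance_matrix` and the choice to fold values with an absolute value and symmetrize by averaging are assumptions made for illustration only.

```python
import numpy as np


def random_gaussian_distance_matrix(n, seed=None):
    """Return an n x n symmetric, non-negative matrix with a zero diagonal.

    Hypothetical helper, not the implementation in scripts/randdm.
    """
    rng = np.random.default_rng(seed)
    # Draw i.i.d. standard-normal values and fold them to be non-negative.
    raw = np.abs(rng.standard_normal((n, n)))
    # Average with the transpose so that dm[i, j] == dm[j, i].
    dm = (raw + raw.T) / 2.0
    # Every sample is at distance 0 from itself.
    np.fill_diagonal(dm, 0.0)
    return dm


dm = random_gaussian_distance_matrix(5, seed=42)
```

A matrix built this way is symmetric and hollow, but nothing forces it to satisfy the triangle inequality, which is part of why such data is unrealistic.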

Open question:
For #2: right now, the number of features in the OTU table is simply equal to the desired dimension of the distance matrix. Should this be user-configurable? If so, what is a sensible default number of features in the OTU table? For example, it could default to the number of samples, to 1/10th of the number of samples, or to a fixed value such as 6,000. Is there an upper limit to the number of features we see in typical OTU tables? How much does it differ between closed-reference OTU-picked tables and deblur tables?
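
To make the question concrete, one possible shape for a configurable feature count is sketched below. Everything here is an assumption for illustration: the function name, the `n_features` parameter, and the Poisson counts, which are only a placeholder for the band/block tables mentioned above. The Bray-Curtis step uses scipy's `pdist` rather than whatever the script actually calls.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def bray_curtis_from_random_table(n_samples, n_features=None, seed=None):
    """Build a samples x samples Bray-Curtis matrix from random counts.

    Hypothetical helper: n_features defaults to n_samples, mirroring the
    current behaviour described in the open question.
    """
    if n_features is None:
        n_features = n_samples
    rng = np.random.default_rng(seed)
    # Placeholder counts; the +1 keeps every sample non-empty so the
    # Bray-Curtis denominator is never zero.
    table = rng.poisson(lam=5.0, size=(n_samples, n_features)) + 1
    # pdist returns the condensed form; squareform expands it to n x n.
    return squareform(pdist(table, metric="braycurtis"))


default_dm = bray_curtis_from_random_table(8, seed=0)
wide_dm = bray_curtis_from_random_table(8, n_features=6000, seed=0)
```

Either call returns an 8 x 8 matrix; only the width of the intermediate table changes, so the default can be revisited later without changing the output shape.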

… code

coveralls · 2018-04-01T22:31:35Z

Coverage remained the same at 87.429% when pulling d3ddea2 on structured-randdm into 165ae6f on master.

antgonza

On point 2: the number of features depends on the environment and the processing, but I'm not sure anyone has done a full benchmark that would give us numbers. So I have the feeling that making it a parameter makes sense, but perhaps, given this, it is fine to leave it as is. @mortonjt might have a better intuition on this. Now, this might lead into more specific questions about distance matrices and their behavior, which might impact PCoA performance, but that is out of the scope of this first round of tests ...

antgonza · 2018-04-02T13:46:18Z

scripts/randdm

+def generate(dimensions, output_dir, seed, subsample_dims, structure,
+             overwrite):
+    """
+    Generate random distance matrix


Can you add a description of the parameters and the outputs?

antgonza · 2018-04-02T13:49:06Z

scripts/randdm

+        # Subsampling
+        for subsample_dim in subsample_dims:
+            # Parse parameter values into integers
+            if '%' in subsample_dim:


So in theory you can pass 100%, 200%, -10001%, or XX for this value ... It's perhaps worth adding a nicer validation step so it doesn't break because it can't convert the value to an int; this applies to all parameters.
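
A validation step along these lines might look like the sketch below. The helper name `parse_subsample_dim`, its `full_dim` argument, and the accepted ranges (percentages in (0, 100], absolute sizes in [1, full_dim]) are assumptions for illustration, not what the script does.

```python
def parse_subsample_dim(value, full_dim):
    """Convert '250' or '10%' into an integer dimension in [1, full_dim].

    Hypothetical helper; raises ValueError with a readable message instead
    of failing later on an int() conversion.
    """
    text = str(value).strip()
    is_percent = text.endswith("%")
    number_text = text[:-1] if is_percent else text
    try:
        number = float(number_text) if is_percent else int(number_text)
    except ValueError:
        raise ValueError(
            f"Cannot parse subsample dimension {value!r}; expected an "
            "integer or a percentage such as '10%'."
        ) from None
    if is_percent:
        # Reject 0%, negative values, and anything above 100%.
        if not 0 < number <= 100:
            raise ValueError(f"Percentage {value!r} must be in (0, 100].")
        dim = int(full_dim * number / 100)
    else:
        dim = number
    if not 1 <= dim <= full_dim:
        raise ValueError(
            f"Subsample dimension {value!r} resolves to {dim}, which is "
            f"outside 1..{full_dim}."
        )
    return dim
```

With this in place, inputs such as 200%, -10001%, or XX fail early with a message that names the offending value. In a click-based script the same check could be wired into an option callback that raises `click.BadParameter`.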

HannesHolste · 2018-05-07T15:35:26Z

@antgonza thanks for the feedback. Changes made as requested. OK to merge?

antgonza · 2018-05-07T15:41:01Z

@mortonjt, could you take a look and if you are OK with these changes can you merge? Thanks!

mortonjt · 2018-05-07T15:46:46Z

environment.yml

 - numpy
 - scikit-learn=0.19.1
 - scipy
 - pandas
 - jupyter
 - matplotlib
+- gneiss


Can you specify the version of gneiss you are using? We may not guarantee backwards compatibility in the future.

mortonjt · 2018-05-07T15:47:58Z

scripts/randdm

+            otu_table = biom_table.matrix_data.todense()
+
+            click.echo('Generating distance matrix from OTU table...')
+            distance_matrix = beta_diversity('braycurtis', otu_table,


Are we ok with hard-coding the distance measure?

HannesHolste

I changed it to be a CLI parameter, with braycurtis as the default value.
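
For context, exposing the metric as a click option with braycurtis as the default could look roughly like the sketch below. This is not the code merged in this PR: the option names, the short list of allowed metrics, and the random placeholder table are assumptions, and scipy's `pdist` stands in for the `beta_diversity` call shown in the diff.

```python
import click
import numpy as np
from click.testing import CliRunner
from scipy.spatial.distance import pdist, squareform


@click.command()
@click.option("--metric", default="braycurtis", show_default=True,
              type=click.Choice(["braycurtis", "euclidean", "cityblock"]),
              help="Distance metric (illustrative subset of scipy metrics).")
@click.option("--samples", default=4, show_default=True,
              type=click.IntRange(min=2),
              help="Number of samples in the placeholder table.")
def main(metric, samples):
    """Compute a distance matrix from a random placeholder count table."""
    rng = np.random.default_rng(0)
    # Placeholder counts; the +1 avoids all-zero samples.
    table = rng.poisson(lam=5.0, size=(samples, 10)) + 1
    dm = squareform(pdist(table, metric=metric))
    click.echo(f"{metric}: {dm.shape[0]} x {dm.shape[1]} distance matrix")


# Exercise the command in-process rather than through a shell.
runner = CliRunner()
default_run = runner.invoke(main, [])
euclidean_run = runner.invoke(main, ["--metric", "euclidean"])
```

Restricting the option with `click.Choice` means an unsupported metric is rejected by the CLI parser before any computation starts.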

mortonjt

Ready to merge once comments are addressed.

mortonjt · 2018-05-07T16:04:40Z

Very exciting! Thanks @HannesHolste!

HannesHolste added 3 commits April 1, 2018 14:31

Script to generate random gaussian and structured distance matrices

71bf377

Add conda environment file and pip wheel for realistic OTU generation…

46af739

… code

Ignore bayesian-regression folder (local install)

df74f16

HannesHolste requested review from wasade and antgonza April 1, 2018 22:25

Add randdm script to setup.py

1497fae

antgonza reviewed Apr 2, 2018

Add more parameter error checks and improve docs

19e414f

HannesHolste requested review from mortonjt and removed request for wasade May 7, 2018 15:46

mortonjt reviewed May 7, 2018

Specify gneiss version in env

55c8127

mortonjt approved these changes May 7, 2018

Make distance metric a CLI parameter

d3ddea2

mortonjt merged commit 5911f9c into master May 7, 2018

HannesHolste deleted the structured-randdm branch May 7, 2018 16:05
