-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate random (gaussian or realistic) distance matrices #45
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On point 2: the difference between features is dependent of environment and processing; but not sure if someone has done a full benchmark to be able to give numbers. Thus, I have the feeling that make it a parameter makes sense but perhaps due to this issue is fine to leave it as is. @mortonjt might have a better intuition on this. Now, this might fall into some specific questions on distance matrices and their behavior, which might impact PCoA performance but that is out of the scope of this first round or tests ...
scripts/randdm
Outdated
def generate(dimensions, output_dir, seed, subsample_dims, structure, | ||
overwrite): | ||
""" | ||
Generate random distance matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a description of the parameters and the outputs?
scripts/randdm
Outdated
# Subsampling | ||
for subsample_dim in subsample_dims: | ||
# Parse parameter values into integers | ||
if '%' in subsample_dim: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in theory you can pass: 100%, 200%, -10001%, XX for this value ... perhaps worth adding a nicer validation step so it doesn't brake cause it can't transform to int; this applies to all parameters.
@antgonza thanks for feedback. Changes made as requested. ok to merge? |
@mortonjt, could you take a look and if you are OK with these changes can you merge? Thanks! |
environment.yml
Outdated
- numpy | ||
- scikit-learn=0.19.1 | ||
- scipy | ||
- pandas | ||
- jupyter | ||
- matplotlib | ||
- gneiss |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you specify the version of gneiss you are using? We may not guarantee backwards compatibility in the future.
scripts/randdm
Outdated
otu_table = biom_table.matrix_data.todense() | ||
|
||
click.echo('Generating distance matrix from OTU table...') | ||
distance_matrix = beta_diversity('braycurtis', otu_table, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we ok with hard-coding the distance measure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to be a CLI param with braycurtis as default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ready to merge once comments are addressed.
Very exciting! Thanks @HannesHolste! |
Code includes:
environment.yml
file, because it's not public on pypi yet.Open question:
For #2: Right now the number of features in the OTU table is just equal to whatever is specified as the desired dimension of the distance matrix. Should this be user-configurable? If so, what is a sensible default value of features in the OTU table? e.g. by default, it can be equal to the number of samples, or 1/10th the number of samples, or fixed at like 6,000 or something. Is there any upper limit to number of features we see in typical OTU tables? How much does it differ between closed-reference OTU picked tables vs. deblur tables?