"Coleman" approach to feature_importances_ #229
Comments
For step 4, I think we just compute the distribution of feature importance under the null, and then we can compute a p-value for the importance of each feature under the alternative, right?
Oh, I guess the permuted forest technically gives that(?), but I was assuming you wanted something like M forests, each with a slightly different feature_importances map constructed from a different collection of trees?
Oh, I thought just 1 null forest. We compute feature_importance for all the features. We need M forests for the p-value computation in two-sample testing, but I don't think we need more than 1 forest for this?
@jovo from the steps above, as mentioned by @adam2392, I thought we wanted the distribution of the feature_importance score. But if I understood correctly today, you want ranks, right? I get the ranks from the permuted forest and the ranks from the non-permuted forest, then count the number of times each feature ranks higher in the non-permuted forest than in the permuted one? Should I repeat the process several times, or do you want to subsample after training a random forest with a huge number of trees? I repeated the experiment for several reps because the feature dimension is 1.5 million and a forest with 100 trees has high variance.
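One possible reading of the rank comparison asked about above, as a minimal sketch rather than the agreed procedure: for each rep, rank the features in the non-permuted and permuted forests, then report how often each feature ranks better in the non-permuted one. The function names (`ranks`, `rank_win_fraction`), the tie-breaking rule, and the plain-list input format are all assumptions made for illustration.

```python
def ranks(importances):
    """Return the rank of each feature, where rank 0 is the most important.

    Ties are broken by feature index (an arbitrary choice for this sketch).
    """
    order = sorted(range(len(importances)), key=lambda j: (-importances[j], j))
    rank = [0] * len(importances)
    for position, feature in enumerate(order):
        rank[feature] = position
    return rank


def rank_win_fraction(observed_reps, permuted_reps):
    """Fraction of reps in which each feature ranks strictly better
    in the non-permuted forest than in the permuted forest.

    Both arguments are lists with one feature-importance vector per rep.
    """
    n_reps = len(observed_reps)
    n_features = len(observed_reps[0])
    wins = [0] * n_features
    for observed, permuted in zip(observed_reps, permuted_reps):
        r_obs, r_perm = ranks(observed), ranks(permuted)
        for j in range(n_features):
            if r_obs[j] < r_perm[j]:  # smaller rank means more important
                wins[j] += 1
    return [w / n_reps for w in wins]


# Toy example: 2 reps, 3 features (made-up numbers, not real results).
observed_reps = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
permuted_reps = [[0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
print(rank_win_fraction(observed_reps, permuted_reps))
```

With 1.5 million features, ranks within a single 100-tree forest are likely to be noisy, which is why the sketch aggregates over reps instead of relying on one comparison.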
@jdey4 write pseudocode so we are super clear, then I can quibble with anything I don't like.
Steps:
The basic idea to get a feature_importances distribution map from the Coleman approach is:
You now have a null distribution of feature importances that you can compare against the feature_importances_ array of the first forest from step 1.
Code that implements the Coleman forest approach for multivariate hypothesis testing: https://github.com/neurodata/scikit-tree/blob/95e2597d5948c77bea565fc91b82e1a00b43cac8/sktree/stats/forestht.py#L1222
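A rough sketch of how a Coleman-style comparison could yield a per-feature p-value, assuming per-tree importance vectors are already available for one forest trained on the original data and one trained on permuted data. It pools the trees from both forests, repeatedly re-splits them at random into two pseudo-forests, and uses the difference in mean importance as the test statistic. The function names, the choice of statistic, and the one-sided p-value are assumptions for illustration; this is not the code in the linked forestht.py.

```python
import random


def mean_importances(trees):
    """Average per-tree importance vectors into one vector per forest."""
    n_features = len(trees[0])
    return [sum(tree[j] for tree in trees) / len(trees) for j in range(n_features)]


def coleman_importance_pvalues(observed_trees, permuted_trees, n_repeats=1000, seed=0):
    """One-sided permutation p-value per feature.

    observed_trees / permuted_trees: lists of per-tree importance vectors
    from the forest fit on the original data and on the permuted data.
    """
    rng = random.Random(seed)
    n_obs = len(observed_trees)
    n_features = len(observed_trees[0])

    # Observed statistic: mean importance in the real forest minus the null forest.
    obs_mean = mean_importances(observed_trees)
    null_mean = mean_importances(permuted_trees)
    observed_stat = [o - p for o, p in zip(obs_mean, null_mean)]

    # Null distribution: shuffle trees between the two forests and recompute.
    pooled = list(observed_trees) + list(permuted_trees)
    exceed = [0] * n_features
    for _ in range(n_repeats):
        rng.shuffle(pooled)
        a = mean_importances(pooled[:n_obs])
        b = mean_importances(pooled[n_obs:])
        for j in range(n_features):
            if a[j] - b[j] >= observed_stat[j]:
                exceed[j] += 1

    # Add-one correction keeps the p-value strictly positive.
    return [(1 + e) / (1 + n_repeats) for e in exceed]


# Toy example with made-up importances: feature 0 is informative.
observed_trees = [[0.8, 0.1, 0.1] for _ in range(20)]
permuted_trees = [[0.34, 0.33, 0.33] for _ in range(20)]
print(coleman_importance_pvalues(observed_trees, permuted_trees, n_repeats=200))
```

With scikit-learn style forests, the per-tree vectors could be collected as `[t.feature_importances_ for t in forest.estimators_]`. For a feature dimension in the millions, the inner Python loops would need to be vectorized; the lists here are only to keep the sketch dependency-free.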
cc: @jovo @jdey4