
"Coleman" approach to feature_importances_ #229

Open
adam2392 opened this issue Feb 22, 2024 · 6 comments

adam2392 (Collaborator) commented Feb 22, 2024

The basic idea to get a feature_importances distribution map from the Coleman approach is:

  1. train one forest on X
  2. train another forest on permuted X
  3. compute the per-tree feature_importances_ array, shape (n_trees, n_features_in_), for both forests
  4. do the "Coleman" idea and resample from both M times

This gives you a feature-importances null distribution that you can compare against the feature_importances_ array from the forest trained in step 1 (rough sketch below).

Code that builds a Coleman forest approach for doing multivariate hypothesis testing: https://github.com/neurodata/scikit-tree/blob/95e2597d5948c77bea565fc91b82e1a00b43cac8/sktree/stats/forestht.py#L1222
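A rough sketch of steps 1-4 (this is not the sktree implementation linked above; it assumes sklearn-style forests, uses a label permutation as a stand-in for "permuted X", and the pooled-resampling scheme in step 4 is one plausible reading of the "Coleman" idea):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def coleman_importance_null(X, y, n_trees=100, n_resamples=1000, seed=0):
    # Rough sketch only; function and argument names here are made up.
    rng = np.random.default_rng(seed)

    # 1. forest on the observed data
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
    # 2. forest on permuted data (shuffling y relative to X as one simple null)
    rf_null = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(
        X, rng.permutation(y)
    )

    # 3. per-tree importances, shape (n_trees, n_features_in_)
    imp = np.array([t.feature_importances_ for t in rf.estimators_])
    imp_null = np.array([t.feature_importances_ for t in rf_null.estimators_])

    # 4. "Coleman" resampling: pool the 2B trees, split them at random into two
    #    halves M times, and record the difference of mean importances
    pooled = np.vstack([imp, imp_null])
    null_dist = np.empty((n_resamples, pooled.shape[1]))
    for m in range(n_resamples):
        idx = rng.permutation(pooled.shape[0])
        half = pooled.shape[0] // 2
        null_dist[m] = pooled[idx[:half]].mean(0) - pooled[idx[half:]].mean(0)

    observed_diff = imp.mean(0) - imp_null.mean(0)
    return observed_diff, null_dist
```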

cc: @jovo @jdey4

jovo (Member) commented Feb 22, 2024

For step 4, I think we just compute the distribution of feature importance under the null, and then we can compute a p-value for the importance of each feature under the alternative, right?
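One way to read this (a minimal sketch; `imp_obs` and `imp_null` are hypothetical names for the observed forest's feature_importances_ and the (n_trees, n_features) per-tree importances from the single null forest):

```python
import numpy as np

def per_feature_pvalues(imp_obs, imp_null):
    # Empirical p-value per feature: fraction of null per-tree importances that
    # are at least as large as the observed importance, with a +1 correction.
    n_null = imp_null.shape[0]
    return (1 + (imp_null >= imp_obs).sum(axis=0)) / (n_null + 1)
```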

adam2392 (Collaborator, Author)

Oh, I guess the permuted forest technically gives that(?), but I was assuming you wanted M forests, each with a slightly different feature_importances_ map constructed from a different collection of trees?

jovo (Member) commented Feb 22, 2024

oh, I thought just 1 null forest. We compute feature_importance for all the features.

We need M forests for the p-value computation for two-sample testing, but I don't think we need more than 1 forest for this?

jdey4 commented Feb 29, 2024

@jovo From the steps above as described by @adam2392, I thought we wanted the distribution of the feature_importance score. But if I understood correctly today, you want ranks, right? I would get each feature's rank from the permuted forest, then its rank from the non-permuted forest, and count the number of times each feature ranks higher in the non-permuted forest than in the permuted one. Should I repeat the process several times, or do you want to subsample after training a random forest with a huge number of trees? I repeated the experiment for several reps because the feature dimension is 1.5 million and there is higher variance in a forest with 100 trees.

jovo (Member) commented Mar 4, 2024

@jdey4 Write pseudocode so we are super clear, then I can quibble with anything I don't like.

jdey4 commented Mar 10, 2024

Steps:

  1. Consider $n$ i.i.d. samples $D_n = (x_i, y_i)_{i=1}^n$ and the label-permuted data $\tilde{D}_n$.
  2. Train two random forests ($B$ trees each) on $D_n$ and $\tilde{D}_n$: $RF$ and $RF_0$, respectively.
  3. Consider a specific feature $F_j$. Calculate its rank from $RF$: $r$.
  4. Calculate the rank of $F_j$ from $RF_0$: $r_0$.
  5. Calculate $r_0 - r$.
  6. Now randomly sample $B$ trees from the pooled set $\{RF, RF_0\}$, call them $RF^*$, and denote the remaining trees as $RF^*_0$.
  7. Calculate $r^*_0 - r^*$ as in steps 3-5, using $RF^*$ and $RF^*_0$.
  8. Repeat steps 6 and 7 $N_0$ times.
  9. Calculate $p = \frac{1}{N_0+1}\left[1 + \sum I\big((r_0 - r) \leq (r^*_0 - r^*)\big)\right]$.
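A rough Python translation of the steps above (a sketch only, not tested against the CMI repo; it assumes "rank" means the position of $F_j$ when features are sorted by mean per-tree importance, and uses sklearn forests with a label permutation for $\tilde{D}_n$):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_of_feature(tree_importances, j):
    # Rank of feature j by mean importance (1 = most important);
    # tree_importances has shape (n_trees, n_features).
    order = np.argsort(-tree_importances.mean(axis=0))
    return int(np.where(order == j)[0][0]) + 1

def coleman_rank_pvalue(X, y, j, B=100, N0=1000, seed=0):
    rng = np.random.default_rng(seed)

    # Steps 1-2: forests on D_n and on the label-permuted \tilde{D}_n
    rf = RandomForestClassifier(n_estimators=B, random_state=seed).fit(X, y)
    rf0 = RandomForestClassifier(n_estimators=B, random_state=seed).fit(X, rng.permutation(y))

    imp = np.array([t.feature_importances_ for t in rf.estimators_])
    imp0 = np.array([t.feature_importances_ for t in rf0.estimators_])

    # Steps 3-5: observed rank difference r_0 - r
    observed = rank_of_feature(imp0, j) - rank_of_feature(imp, j)

    # Steps 6-8: pool the 2B trees and re-split them at random N_0 times
    pooled = np.vstack([imp, imp0])
    count = 0
    for _ in range(N0):
        idx = rng.permutation(2 * B)
        r_star = rank_of_feature(pooled[idx[:B]], j)
        r0_star = rank_of_feature(pooled[idx[B:]], j)
        # Step 9: count how often the observed difference is <= the resampled one
        count += (observed <= (r0_star - r_star))

    return (1 + count) / (N0 + 1)
```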
Thoughts, @jovo @adam2392? Should I make a PR somewhere in sktree, or first make sure it works in the CMI repo?
