Currently, at each node a completely new projection array of shape `(max_features, n_features, n_dims_projection)` is sampled, where `X` is `(n_samples, n_features)`. Each `(n_features, n_dims_projection)` slice is a new projection matrix. In practice, we do this in a "sparse" way, storing only the feature index and the projection weight to apply.
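As a rough illustration of the current scheme, here is a minimal numpy sketch of drawing a fresh set of sparse projections at a single split node. The function name, `density` parameter, and the (feature index, output dim, weight) triple storage are assumptions for illustration, not the library's actual API.

```python
import numpy as np

def sample_node_projections(rng, max_features, n_features, n_dims_projection, density=0.1):
    """Sample `max_features` sparse projection matrices at one split node.

    Each projection matrix is kept as (feature_index, output_dim, weight)
    triples instead of a dense (n_features, n_dims_projection) array.
    """
    n_nonzero = max(1, int(density * n_features * n_dims_projection))
    projections = []
    for _ in range(max_features):
        feat_idx = rng.integers(0, n_features, size=n_nonzero)
        out_dim = rng.integers(0, n_dims_projection, size=n_nonzero)
        weights = rng.choice([-1.0, 1.0], size=n_nonzero)
        projections.append((feat_idx, out_dim, weights))
    return projections

rng = np.random.default_rng(0)
# Today, a fresh set like this is drawn at *every* split node.
node_projections = sample_node_projections(rng, max_features=32,
                                            n_features=100, n_dims_projection=5)
```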
This ends up consuming a lot of RAM (and possibly resulting in segfaults, though I'm unsure why). See related: #226 and #215.
Another sensible strategy is to sample one LARGE projection array per tree, of shape `(LARGE_MAX_FEATURES, n_features, n_dims_projection)`, and then consider a random subset of `max_features` of those projection matrices at each split node. This amortizes the sampling of the projection matrices, doing it only once per tree. Though depending on how large `LARGE_MAX_FEATURES` is, we would have to store a huge array in RAM.
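A hedged sketch of that amortization, assuming numpy and illustrative function names (the dense storage here is only for clarity; in practice the per-tree pool would be stored sparsely like the per-node version):

```python
import numpy as np

def sample_tree_projections(rng, large_max_features, n_features, n_dims_projection):
    # Sampled once per tree instead of once per node.
    return rng.standard_normal((large_max_features, n_features, n_dims_projection))

def node_candidate_indices(rng, large_max_features, max_features):
    # At each split node, draw a random subset of the per-tree pool.
    return rng.choice(large_max_features, size=max_features, replace=False)

rng = np.random.default_rng(0)
tree_projections = sample_tree_projections(rng, large_max_features=512,
                                           n_features=100, n_dims_projection=5)
idx = node_candidate_indices(rng, large_max_features=512, max_features=32)
candidates = tree_projections[idx]  # (max_features, n_features, n_dims_projection)
```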
There are some perceivable benefits, though:
- We can track constants in the projections: we simply keep a vector of length `(LARGE_MAX_FEATURES,)` and mark whether, at any point in the tree, splitting on a given projection vector results in no change in impurity. We would then skip that projection vector (see the sketch after this list).
- For deep trees, this can result in a considerable runtime improvement.
- Assuming we can eat the cost of the initial RAM when sampling the large projection array, we won't have large RAM spikes from sampling a new projection array at every split node.
- This could allow the user to specify the large projection array in Python and pass it in. That would make it easy to test the Gabor and Fourier kernel ideas, because specifying such complex projections as a matrix in Python will be a lot easier.
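Here is a minimal sketch of the constant-tracking idea from the first bullet, assuming the per-tree pool above. Everything here (the function name, the node-level constancy check via min/max of the projected values) is illustrative; in practice the flags would likely need to be tracked per subtree path, similar to how constant features are tracked in sklearn's tree builder, since a projection constant in one node need not be constant in a sibling node.

```python
import numpy as np

def candidate_projections_at_node(X_node, tree_projections, is_constant, rng, max_features):
    """Pick candidate projections at one node, skipping flagged-constant ones."""
    valid = np.flatnonzero(~is_constant)
    idx = rng.choice(valid, size=min(max_features, valid.size), replace=False)
    usable = []
    for j in idx:
        proj = X_node @ tree_projections[j]  # (n_node_samples, n_dims_projection)
        # If the projected values do not vary within this node, no threshold on
        # this projection can change the impurity, so flag it and skip it.
        if np.allclose(proj.min(axis=0), proj.max(axis=0)):
            is_constant[j] = True
            continue
        usable.append(j)
        # ... evaluate split thresholds on `proj` and keep the best one ...
    return usable

rng = np.random.default_rng(0)
tree_projections = rng.standard_normal((512, 100, 5))
is_constant = np.zeros(512, dtype=bool)  # one flag per projection in the per-tree pool
X_node = rng.standard_normal((40, 100))
usable = candidate_projections_at_node(X_node, tree_projections, is_constant, rng, 32)
```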
The open question, though, is: what is `LARGE_MAX_FEATURES`? Should it be `max_features * 100`, `max_features * 10`, or `max_features * <some hyperparameter>`?
cc: @jovo @j1c