vsl.ml: Random Forest #126

ulises-jeremias · 2022-12-18T23:18:43Z

Describe the feature

We want to add a new model to vsl.ml that does classification using the Random Forest algorithm. The model should implement the following interface:

vsl.util.Observer

Candidate definition for the struct that we need to create:

[heap]
pub struct RandomForest {
mut:
	name              string     // name of this "observer"
	data              &Data[f64] // x data
	stat              &Stat[f64] // statistics about x (data)
	min_samples_split int
	max_depth         int
}

With the following methods:

	name() string
mut:
	update() // called by Data when it changes
	train()
	predict(x [][]f64) []f64
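
For orientation, here is a minimal sketch of how those four methods could fit together. It is written in Python rather than V so that it does not depend on VSL's API, and it is not the proposed implementation: the `Stump` class, the `n_trees` and `seed` parameters, and passing `x` and `y` directly instead of a `Data` object are all assumptions made for illustration. Each tree is a one-level stump (so `max_depth` and `min_samples_split` are ignored here) trained on a bootstrap sample of the rows, and `predict` takes a majority vote.

```python
import random
from collections import Counter


class Stump:
    """Hypothetical one-level decision tree: one threshold on one feature."""

    def fit(self, x, y):
        best = None
        for j in range(len(x[0])):
            for t in sorted({row[j] for row in x}):
                left = [lab for row, lab in zip(x, y) if row[j] <= t]
                right = [lab for row, lab in zip(x, y) if row[j] > t]
                if not left or not right:
                    continue
                l_lab = Counter(left).most_common(1)[0][0]
                r_lab = Counter(right).most_common(1)[0][0]
                errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
                if best is None or errors < best[0]:
                    best = (errors, j, t, l_lab, r_lab)
        if best is None:
            # No valid split (all rows identical): always predict the majority label.
            majority = Counter(y).most_common(1)[0][0]
            best = (0, 0, float("inf"), majority, majority)
        _, self.feature, self.threshold, self.left, self.right = best
        return self

    def predict_one(self, row):
        return self.left if row[self.feature] <= self.threshold else self.right


class RandomForest:
    """Sketch of the name/update/train/predict shape; not VSL's API."""

    def __init__(self, name, x, y, n_trees=15, seed=0):
        self._name = name
        self.x, self.y = x, y
        self.n_trees = n_trees  # assumed parameter, odd to avoid tied votes
        self.rng = random.Random(seed)
        self.trees = []

    def name(self):
        return self._name

    def update(self):
        # Would be called when the underlying data changes: retrain.
        self.train()

    def train(self):
        n = len(self.x)
        self.trees = []
        for _ in range(self.n_trees):
            # Bootstrap: draw n row indexes with replacement.
            idx = [self.rng.randrange(n) for _ in range(n)]
            rows = [self.x[i] for i in idx]
            labels = [self.y[i] for i in idx]
            self.trees.append(Stump().fit(rows, labels))

    def predict(self, x):
        # Majority vote over all trees for each input row.
        return [
            Counter(t.predict_one(row) for t in self.trees).most_common(1)[0][0]
            for row in x
        ]
```

A real implementation would grow deeper trees and honour `max_depth` and `min_samples_split`; the point here is only the shape of the interface.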

Use Case

Proposed Solution

Other Information

Acknowledgements

I may be able to implement this feature request
This feature might incur a breaking change

Version used

Environment details (OS name and version, etc.)


BMJHayward · 2023-01-11T09:39:47Z

I've been making some progress on this in my own fork.
Not sure how to "share" the data object between the random forest &Data[f64] and individual trees yet though.

ulises-jeremias · 2023-01-11T19:45:54Z

@BMJHayward feel free to send me a Draft Pull Request or the link to the branch where you have your changes so I can suggest how to do it 😊

BMJHayward · 2023-01-15T03:45:17Z

Thanks @ulises-jeremias here's my current branch, not quite ready for PR:
https://github.com/BMJHayward/vsl/tree/126_implement_random_forest

I planned to write a proper decision tree implementation and reuse it in the RF, but the Data interface makes that tricky because I can't just use Data.x and Data.y. Or can I? Each tree uses its own, different set of indexes.

ulises-jeremias · 2023-01-17T19:57:11Z

Yeah, I think it is not enough. Data.x and Data.y should be used only to replace the data you were receiving here. You will probably need another struct, or multiple instances of Data, though the latter is probably less efficient.

ulises-jeremias · 2023-01-17T20:06:53Z

I just updated latest master adding some methods.

there are two ways you can do this now:

creating a new instance of data and sharing the reference to x and the new y

mut new_data := data.clone_with_same_x()
new_data.set_y(new_index_y)?

or, if you want x to be a new instance of la.Matrix, you can create multiple instances of Data by calling data.clone() and then setting y with the new index

mut data_with_new_index := data.clone()
data_with_new_index.set_y(new_index_y)?
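
The difference between the two options comes down to whether `x` is shared or copied. The sketch below mirrors that in Python with a hypothetical stand-in `Data` class. The method names follow the ones mentioned above, but the bodies (including the length check in `set_y`) are assumptions for illustration, not VSL's actual code.

```python
import copy


class Data:
    """Hypothetical stand-in for a Data object: feature matrix x, targets y."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def clone(self):
        # Full copy: the clone owns its own x and its own y.
        return Data(copy.deepcopy(self.x), list(self.y))

    def clone_with_same_x(self):
        # Shares the reference to x; only y is independent.
        return Data(self.x, list(self.y))

    def set_y(self, y):
        # Assumed validation: one target per row of x.
        if len(y) != len(self.x):
            raise ValueError("y must have one entry per row of x")
        self.y = list(y)


data = Data([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0])

shared = data.clone_with_same_x()
shared.set_y([1.0, 0.0])      # does not touch data.y
print(shared.x is data.x)     # same matrix object, no extra memory for x

full = data.clone()
full.x[0][0] = 99.0           # does not affect data.x
print(full.x is data.x)
```

With many trees, sharing `x` avoids one matrix copy per tree, at the cost that nobody may mutate `x` while the trees are using it.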

ulises-jeremias · 2023-01-17T23:50:16Z

@BMJHayward ^^

BMJHayward · 2023-01-19T10:51:23Z

Excellent, thank you. I'll take a look over the weekend.

ulises-jeremias · 2023-01-29T18:09:22Z

@BMJHayward hey! did that work? is there anything else I can do to help?

BMJHayward · 2023-01-30T03:07:08Z

@ulises-jeremias hi, thanks for following up on this. The lynchpin for me is in ml.tree.grow_tree. In line 155 the tree is "grown" by randomly selecting columns (or rather just their indexes) and splitting on them. The rand module samples without replacement, so I think I need to add an option there to sample with replacement. For example, a tree could use columns [1,2,3,1,2,3], and it would be perfectly legitimate to use them twice.
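
To make the distinction concrete, here is a small Python sketch (not V, and not the `rand` module discussed above) of drawing column indexes with and without replacement. The helper name `sample_columns` and its parameters are hypothetical; `random.sample` and `random.choices` are Python's standard-library calls for the two modes.

```python
import random


def sample_columns(n_columns, k, replace, rng):
    """Return k column indexes in [0, n_columns)."""
    if replace:
        # With replacement: the same index may appear more than once,
        # and k may exceed n_columns.
        return rng.choices(range(n_columns), k=k)
    # Without replacement: every index appears at most once,
    # so k must not exceed n_columns.
    return rng.sample(range(n_columns), k)


rng = random.Random(42)
print(sample_columns(3, 3, replace=False, rng=rng))  # a permutation of 0, 1, 2
print(sample_columns(3, 6, replace=True, rng=rng))   # six draws, repeats allowed
```

Asking for 6 of 3 columns without replacement raises `ValueError`, which is the limitation described above.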

I can't figure out how to do this yet and maintain a consistent interface using Data like the rest of VSL. I'm sure there's a good way, and maybe calling set_y multiple times for each tree will be ok.

I'm also busy with family and renovations on the house at the moment, it might be better if someone takes this on and I can consult or something. I'm happy to do it, it just won't be quick.

dumblob · 2023-01-31T21:49:33Z

Was just looking for Cox Regression and Random Forest in VSL which brought me here.

I wonder if there are any plans for Cox R. and perhaps a few others from https://github.com/shankarpandala/lazypredict .

Also it seems VSL so far does not support "stop & resume" operation acutely needed for fully automated "checkpointing to HDD & recovery from HDD" in long-running apps (which often fail due to full memory, stall, etc. and need to be restarted paying the tens of hours of identical computation again and again...).

Any plans for such "stop & resume" API?

Of course, it has to be weighed against performance, so maybe it could be tied to time: by default, roughly every 10 seconds the computation would be interrupted and its state saved to a user-defined location. IDK
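
As a rough illustration of that idea, here is a Python sketch of time-based "stop & resume". Nothing in it is VSL API: the function names, the `state` dictionary, and the `stop_after` parameter (used only to simulate an interrupted run) are assumptions. The checkpoint is written to a temporary file and then renamed, so a crash during a save does not leave a half-written file behind.

```python
import os
import pickle
import time


def save_checkpoint(path, state):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path, default):
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default


def run(path, total_steps, interval_s=10.0, stop_after=None):
    """Sum 0..total_steps-1, checkpointing at most every interval_s seconds."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    last_save = time.monotonic()
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulated interruption for the demo
        state["acc"] += state["step"]  # stands in for one unit of real work
        state["step"] += 1
        if time.monotonic() - last_save >= interval_s:
            save_checkpoint(path, state)
            last_save = time.monotonic()
    save_checkpoint(path, state)
    return state
```

Calling `run(path, 100, stop_after=40)` and then `run(path, 100)` resumes at step 40 instead of starting over. The interval trades lost work after a crash against time spent writing to disk.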

ulises-jeremias · 2023-02-01T04:49:24Z

hey! don't rush with it. Family is more important 😊

About the question, I think calling set_y multiple times is OK as long as the .clone() method is used 👌🏻

ulises-jeremias · 2023-02-01T04:53:14Z

lazypredict is great! We will probably add more models over time 👌🏻

Regarding the checkpointing, I hadn't thought about it. We can probably add it in the near future. I will think about it and try to figure out the best way to do it, probably by creating .h5 files on some iterations.

dumblob · 2023-02-01T09:44:41Z

Yep, .h5 is fine. To avoid slowing down the computation, we could just fork the process (i.e. delegate copy-on-write (COW) of all the structs with data to the operating system, as e.g. Redis does), so that taking the snapshot costs negligible time, and then save it to disk from the fork. The data might easily be hundreds of MB or more, so not doing it fully in parallel could slow down the computation too much (and V's threading support is probably not enough, as it would involve memcpy(), which would definitely be much slower than COW over the pages the operating system maintains under the hood). Just a thought.
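
A minimal sketch of the fork-based snapshot, in Python on a Unix-like system (`os.fork` is not available on Windows). The function name `snapshot_in_child` and the use of `pickle` instead of .h5 are assumptions for illustration. The child sees the state as it was at the moment of the fork, so the parent can keep mutating it while the child writes to disk.

```python
import os
import pickle


def snapshot_in_child(path, state):
    """Fork; the child writes `state` to `path` and exits. Returns the child pid."""
    pid = os.fork()
    if pid == 0:
        # Child process: works on a copy-on-write view of the parent's memory.
        code = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, path)
            code = 0
        finally:
            os._exit(code)  # leave immediately, skipping the parent's cleanup handlers
    return pid


def wait_for_snapshot(pid):
    """Reap the child and report whether the snapshot succeeded."""
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

The parent must eventually call `wait_for_snapshot` (or otherwise reap the child) to avoid zombie processes. Note that CPython's reference counting writes to object headers, so page sharing after a fork is less complete in Python than it would be in C or V.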

dumblob · 2023-03-22T09:10:29Z

I wonder if there is any news regarding Cox Regression, Random Forest, and .h5 checkpointing. I could not find anything in the commits.

But no pressure, I just want to regularly get up to date 😉.

dumblob · 2023-07-31T20:57:18Z

Any news? Especially the checkpointing seems highly beneficial to everybody (compared to Cox Regression and Random Forest).

dumblob · 2024-02-10T22:23:12Z

Still interested in this, as it would allow me to start recommending V (VSL) within my bubble 😉.

ulises-jeremias added the Hacktoberfest label ("This label is assigned to any issue that is good to go for any Hacktoberfest participant") · Oct 7, 2023
