Training new models #227

Open
prvst opened this issue Sep 9, 2024 · 3 comments

prvst commented Sep 9, 2024

Hello, is there any documentation detailing how to train new models?

Member

RalfG commented Oct 1, 2024

Hi @prvst,

Unfortunately, not yet. However, if you have some machine learning experience, it should not be too hard to try.

The ms2pip get-training-data command can be used to generate features and targets for training. Then the XGBoost and hyperopt Python packages can be used to optimize hyperparameters and to train new models. Note that the same features are used for each target ion type.
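A minimal sketch of such a hyperparameter search, assuming the features and targets for one ion type are already loaded as array-like objects. The search ranges, the `reg:squarederror` objective, the `hist` tree method, and the helper names (`to_xgb_params`, `search`) are illustrative assumptions, not the settings used by the MS²PIP authors. The fold count, round limit, and early-stopping value follow the supplementary text quoted below.

```python
from typing import Any, Dict

N_FOLDS = 4          # four-fold cross-validation
MAX_ROUNDS = 400     # maximal number of boosting rounds
EARLY_STOPPING = 10  # stop after 10 rounds without improvement


def to_xgb_params(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one hyperopt sample into an XGBoost parameter dict."""
    params = dict(sample)
    # hp.quniform yields floats; XGBoost expects integers for these.
    for key in ("max_depth", "max_leaves"):
        if key in params:
            params[key] = int(params[key])
    params.update(
        objective="reg:squarederror",  # assumption, not from the thread
        tree_method="hist",            # assumption; lossguide needs hist
        grow_policy="lossguide",
    )
    return params


def search(features, targets, max_evals: int = 50) -> Dict[str, Any]:
    """Run a TPE search scored by cross-validated RMSE."""
    # Imported here so the helper above works without these packages.
    import xgboost as xgb
    from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

    dtrain = xgb.DMatrix(features, label=targets)
    # Illustrative ranges only.
    space = {
        "eta": hp.uniform("eta", 0.01, 0.1),
        "max_depth": hp.quniform("max_depth", 5, 20, 1),
        "max_leaves": hp.quniform("max_leaves", 10, 500, 1),
        "min_child_weight": hp.quniform("min_child_weight", 1, 500, 1),
        "gamma": hp.uniform("gamma", 0.0, 1.0),
        "lambda": hp.uniform("lambda", 0.0, 1.0),
        "alpha": hp.uniform("alpha", 0.0, 5.0),
        "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
        "subsample": hp.uniform("subsample", 0.5, 1.0),
    }

    def objective(sample):
        history = xgb.cv(
            to_xgb_params(sample),
            dtrain,
            num_boost_round=MAX_ROUNDS,
            nfold=N_FOLDS,
            early_stopping_rounds=EARLY_STOPPING,
            seed=0,
        )
        loss = float(history["test-rmse-mean"].min())
        return {"loss": loss, "status": STATUS_OK}

    best = fmin(objective, space, algo=tpe.suggest,
                max_evals=max_evals, trials=Trials())
    return to_xgb_params(best)
```

Because the same features are used for each target ion type, `search` would be called once per target column with the same feature matrix.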

From the supplementary material of the most recent MS²PIP paper:

All models were trained with the XGBoost machine learning algorithm (20) and hyperparameter optimization was performed with the Hyperopt (21) Python package using a four-fold cross-validation evaluation scheme. The maximal number of boosting rounds was fixed at 400 and early stopping was set to 10 boosting rounds. The selected hyperparameters are listed on supplemental Table S2.

Table S2. The optimal hyperparameters for each new b- and y-ion MS²PIP model, as determined during hyperparameter optimization.

| Model | Eta | Max depth | Grow policy | Max leaves | Min child weight | Gamma | Lambda | Alpha | Colsample by tree | Sub-sample |
|---|---|---|---|---|---|---|---|---|---|---|
| HCD 2021 (b-ions) | 0.08060612330262913 | 18 | Lossguide | 117 | 500 | 0.031142279181653326 | 0.2724553826622634 | 3.4 | 0.891381182690278 | 0.7 |
| HCD 2021 (y-ions) | 0.047107785048838 | 18 | Lossguide | 490 | 4 | 0.37528441949267444 | 0.35150807248415 | 3.3 | 0.6122042447952851 | 0.6 |
| Immunopeptide HCD (b-ions) | 0.09263630381479264 | 17 | Lossguide | 131 | 16 | 0.6048882172751935 | 0.9332236183206803 | 4.6 | 0.9898165069470042 | 0.7 |
| Immunopeptide HCD (y-ions) | 0.0594145790364741 | 17 | Lossguide | 302 | 3 | 0.03338151150211477 | 0.4430375595950531 | 4.5 | 0.9389820388602939 | 0.7 |
| CID-TMT (b-ions) | 0.09788304115318931 | 16 | Lossguide | 100 | 175 | 0.36436201158266845 | 0 | 3.1 | 0.9307205074180112 | 0.8 |
| CID-TMT (y-ions) | 0.07323226418651792 | 15 | Lossguide | 15 | 84 | 0.06487830003469364 | 0 | 0.7 | 0.7980941914509116 | 0.7 |
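As a rough illustration of how one row of Table S2 could be reused, the sketch below maps the table's column headers to XGBoost parameter names and trains a single model with the round limit and early stopping from the quoted text. The mapping, the `reg:squarederror` objective, the `hist` tree method, and the function names are assumptions for illustration; the thread does not state the objective or tree method that was actually used.

```python
from typing import Dict, Union

Value = Union[int, float, str]

# Table S2 column header -> XGBoost parameter name (assumed mapping).
COLUMN_TO_PARAM: Dict[str, str] = {
    "Eta": "eta",
    "Max depth": "max_depth",
    "Grow policy": "grow_policy",
    "Max leaves": "max_leaves",
    "Min child weight": "min_child_weight",
    "Gamma": "gamma",
    "Lambda": "lambda",
    "Alpha": "alpha",
    "Colsample by tree": "colsample_bytree",
    "Sub-sample": "subsample",
}

# The "HCD 2021 (y-ions)" row of Table S2.
HCD2021_Y: Dict[str, Value] = {
    "Eta": 0.047107785048838,
    "Max depth": 18,
    "Grow policy": "Lossguide",
    "Max leaves": 490,
    "Min child weight": 4,
    "Gamma": 0.37528441949267444,
    "Lambda": 0.35150807248415,
    "Alpha": 3.3,
    "Colsample by tree": 0.6122042447952851,
    "Sub-sample": 0.6,
}


def row_to_params(row: Dict[str, Value]) -> Dict[str, Value]:
    """Translate a Table S2 row into an XGBoost parameter dict."""
    params: Dict[str, Value] = {}
    for column, value in row.items():
        if isinstance(value, str):
            value = value.lower()  # XGBoost spells it "lossguide"
        params[COLUMN_TO_PARAM[column]] = value
    params["tree_method"] = "hist"            # assumption
    params["objective"] = "reg:squarederror"  # assumption
    return params


def train(params, features, targets, val_features, val_targets):
    """Train one booster with at most 400 rounds and early stopping at 10."""
    import xgboost as xgb  # imported lazily; row_to_params needs no packages

    dtrain = xgb.DMatrix(features, label=targets)
    dval = xgb.DMatrix(val_features, label=val_targets)
    return xgb.train(
        params,
        dtrain,
        num_boost_round=400,
        evals=[(dval, "validation")],
        early_stopping_rounds=10,
    )
```

One booster would be trained per ion type, each with its own row of hyperparameters.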

Once you have a new XGBoost model for each ion type, save the model files to your ~/.ms2pip directory and add an entry for them to the ms2pip.constants.MODELS dictionary. They should then be available for use.
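The registration step might look roughly like the sketch below. The thread does not describe the layout of a `MODELS` entry, so the sketch copies an existing entry and applies caller-supplied overrides rather than guessing key names; `model_path` and `register_model` are hypothetical helpers, not part of MS²PIP. Which keys to override should be checked against an existing entry in the installed `ms2pip.constants` module.

```python
import copy
from pathlib import Path
from typing import Any, Dict


def model_path(filename: str,
               model_dir: Path = Path.home() / ".ms2pip") -> Path:
    """Return the target path for a model file, creating the directory."""
    model_dir.mkdir(parents=True, exist_ok=True)
    return model_dir / filename


def register_model(models: Dict[str, Dict[str, Any]],
                   name: str,
                   template: str,
                   overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Add a new entry to `models`, based on a copy of an existing one."""
    if name in models:
        raise KeyError(f"model {name!r} is already registered")
    if template not in models:
        raise KeyError(f"template model {template!r} not found")
    entry = copy.deepcopy(models[template])  # leave the template untouched
    entry.update(overrides)
    models[name] = entry
    return entry


# Hypothetical usage; requires xgboost and ms2pip, so left commented out:
#
#   import ms2pip.constants
#   booster_b.save_model(str(model_path("my_model_b.xgboost")))
#   booster_y.save_model(str(model_path("my_model_y.xgboost")))
#   register_model(
#       ms2pip.constants.MODELS,
#       "MyModel",
#       template="...",  # name of an existing entry to copy
#       overrides={},    # keys that point at the new model files
#   )
```

Copying an existing entry keeps any fields the new model shares with its template, such as the ion types, while the overrides point the entry at the new files. Changes to the dictionary made this way last only for the current Python session.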

Do let us know if the models you have in mind would be of interest to the wider community. In that case, we could definitely consider shipping the models with MS²PIP.

Best,
Ralf

Author

prvst commented Oct 1, 2024

Thanks! Can the train_xgboost_c.py script be used for the training?

Member

RalfG commented Nov 3, 2024

That script is mostly out of date and should be removed or updated. Nevertheless, it could help as a template; the parts that refer to the C code can mostly be ignored.
