Planned support of multilabel? #244

Open
xbno opened this issue Nov 8, 2022 · 1 comment

xbno commented Nov 8, 2022

Following the XGBoost multi-output tutorial (https://xgboost.readthedocs.io/en/stable/tutorials/multioutput.html), I was hoping to use XGBoost for multilabel classification by passing label_column as a list to XGBoostTrainer. Is there any plan to support this?

import ray
import pandas as pd
import xgboost as xgb

from ray.train.xgboost import XGBoostTrainer, XGBoostPredictor
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split


num_classes = 30
X, y = make_multilabel_classification(
    n_classes=num_classes, random_state=0, n_samples=1000
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train = pd.DataFrame(X_train, columns=[f"x{i}" for i in range(X_train.shape[1])])
y_train = pd.DataFrame(y_train, columns=[f"y{i}" for i in range(y_train.shape[1])])
train_ds = ray.data.from_pandas(pd.concat([X_train, y_train], axis=1))

trainer = XGBoostTrainer(
    # label_column="y1",  # works
    label_column=["y1", "y2"],  # not supported
    params={
        "tree_method": "hist",
        "max_depth": 15,
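        # likely ignored here: the native training API takes the number of
        # rounds from num_boost_round, not from n_estimators
        "n_estimators": 50,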
    },
    num_boost_round=10,
    datasets={"train": train_ds},
)
result = trainer.fit()

The trace is:

Current time: 2022-11-08 20:11:24 (running for 00:00:02.47)
Memory usage on this node: 21.7/62.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.0/16 CPUs, 0/0 GPUs, 0.0/25.74 GiB heap, 0.0/12.87 GiB objects
Result logdir: /home/gcounihan/ray_results/XGBoostTrainer_2022-11-08_20-11-21
Number of trials: 1/1 (1 RUNNING)
+----------------------------+----------+---------------------+
| Trial name                 | status   | loc                 |
|----------------------------+----------+---------------------|
| XGBoostTrainer_8362b_00000 | RUNNING  | 10.50.101.142:64403 |
+----------------------------+----------+---------------------+


(XGBoostTrainer pid=64403) /home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/main.py:464: UserWarning: `num_actors` in `ray_params` is smaller than 2 (1). XGBoost will NOT be distributed!
(XGBoostTrainer pid=64403)   warnings.warn(
(XGBoostTrainer pid=64403) 2022-11-08 20:11:24,790      ERROR function_trainable.py:298 -- Runner Thread raised error.
(XGBoostTrainer pid=64403) Traceback (most recent call last):
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 289, in run
(XGBoostTrainer pid=64403)     self._entrypoint()
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 362, in entrypoint
(XGBoostTrainer pid=64403)     return self._trainable_func(
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
(XGBoostTrainer pid=64403)     return method(self, *_args, **_kwargs)
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/train/base_trainer.py", line 460, in _trainable_func
(XGBoostTrainer pid=64403)     super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 684, in _trainable_func
(XGBoostTrainer pid=64403)     output = fn()
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/train/base_trainer.py", line 375, in train_func
(XGBoostTrainer pid=64403)     trainer.training_loop()
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/train/gbdt_trainer.py", line 246, in training_loop
(XGBoostTrainer pid=64403)     model = self._train(
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/ray/train/xgboost/xgboost_trainer.py", line 77, in _train
(XGBoostTrainer pid=64403)     return xgboost_ray.train(**kwargs)
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/main.py", line 1482, in train
(XGBoostTrainer pid=64403)     bst, train_evals_result, train_additional_results = _train(
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/main.py", line 1041, in _train
(XGBoostTrainer pid=64403)     dtrain.assert_enough_shards_for_actors(num_actors=ray_params.num_actors)
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 788, in assert_enough_shards_for_actors
(XGBoostTrainer pid=64403)     self.loader.assert_enough_shards_for_actors(num_actors=num_actors)
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 486, in assert_enough_shards_for_actors
(XGBoostTrainer pid=64403)     data_source = self.get_data_source()
(XGBoostTrainer pid=64403)   File "/home/gcounihan/miniconda3/envs/ncf38/lib/python3.8/site-packages/xgboost_ray/matrix.py", line 448, in get_data_source
(XGBoostTrainer pid=64403)     raise ValueError(
(XGBoostTrainer pid=64403) ValueError: Invalid `label` value for distributed datasets: ['y1', 'y2']. Only strings are supported. 
(XGBoostTrainer pid=64403) FIX THIS by passing a string indicating the label column of the dataset as the `label` argument.
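
The error above says that only a single string is accepted as the label for distributed datasets. Until a list is supported, one possible workaround is to train one single-label model per target column. The sketch below only plans those runs: `single_label_plans` is a hypothetical helper, not part of Ray or xgboost_ray, and the column names are made up for illustration. For each label it lists the other label columns that would need to be dropped so they do not leak in as features.

```python
from typing import Dict, List, Sequence


def single_label_plans(
    all_columns: Sequence[str], label_columns: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each label column to the other label columns that must be dropped.

    Hypothetical helper: each entry describes one single-label training run,
    so the label passed to the trainer is always a plain string.
    """
    missing = [c for c in label_columns if c not in all_columns]
    if missing:
        raise ValueError(f"Label columns not found in dataset: {missing}")
    return {
        label: [other for other in label_columns if other != label]
        for label in label_columns
    }


# Illustrative column names, mirroring the x*/y* naming used in the report.
columns = ["x0", "x1", "x2", "y1", "y2"]
plans = single_label_plans(columns, ["y1", "y2"])

for label, to_drop in plans.items():
    # Assumption: for each plan, drop `to_drop` from the dataset and pass
    # label_column=label (a string) to a separate XGBoostTrainer.
    print(f"label_column={label!r}, drop={to_drop}")
```

Note that this fits independent binary classifiers, one per label, so it is not equivalent to the multi-output training described in the tutorial, and it costs one training run per label.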

Yard1 commented Nov 8, 2022

Thanks, will take a look at what it would take to support this!

Yard1 self-assigned this on Nov 8, 2022
Yard1 added the enhancement (New feature or request) label on Nov 16, 2022