Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: [python-package] ensure predict() always returns an array #6348

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

jameslamb
Copy link
Collaborator

@jameslamb jameslamb commented Mar 1, 2024

Contributes to #3756.
Contributes to #3867.

Overview

For most combinations of parameters and input data types, .predict() methods in the Python package return either a numpy array or a scipy.sparse.{csc,csr}_matrix.

Since #3000 (merged June 2020, released in LightGBM v3.0.0 August 2020), there is exactly 1 exception.

For multiclass classification models, .predict(X, pred_contrib=True) returns a Python list of sparse matrices if X is a scipy.sparse.{csc,csr}_matrix.

example with lightgbm 4.3.0 (click me)
import lightgbm as lgb
from scipy.sparse import csr_matrix, csc_matrix
from sklearn.datasets import make_blobs

X_numpy, y = make_blobs(n_samples=10_000, centers=5, n_features=11)
X_csr = csr_matrix(X_numpy)
X_csc = csc_matrix(X_numpy)

clf = lgb.LGBMClassifier(
    n_estimators=9,
    num_classes=5
)
clf.fit(X_numpy, y)

p = clf.predict(X_numpy, pred_contrib=True)
print(f"type: {type(p).__name__} | shape: {p.shape}")
# type: ndarray | shape: (10000, 60)

p = clf.predict(X_csc, pred_contrib=True)
print(f"type: {type(p).__name__} | element types: {type(p[0]).__name__}")
# type: list | element types: csc_matrix

p = clf.predict(X_csr, pred_contrib=True)
print(f"type: {type(p).__name__} | element types: {type(p[0]).__name__}")
type: list | element types: csr_matrix

This PR proposes modifying LightGBM such that .predict() methods always return a numpy or scipy.sparse array.

And other related changes:

  • fixes / updates type hints
  • updates docs
  • removes now-unnecessary code in lightgbm.dask that existed to handle this special case

Benefits of this change

  • enables removal of non-trivial complexity in the lightgbm.dask interface
  • enables simplifications in lightgbm's internal prediction routines, and in docs for .predict() methods
  • removes a source of surprise and friction for users
  • allows any code generating predictions with lightgbm to use array operations on the output of .predict() methods unconditionally
  • prevents the need for projects using lightgbm to ignore or work-around type-checking issues when they unconditionally use the output of .predict() as if it's only either numpy or scipy.sparse array
  • would be a step towards supporting SHAP contributions for LightGBM multiclass classification models in the shap library (see "Does this break shap" below)
  • makes adding and maintaining correct type hints in lightgbm easier, improving the likelihood that type-checkers like mypy will be able to catch bugs

Costs of this change

  • breaks any existing code relying on lightgbm returning a list for this specific case
  • might make generation of feature contributions for multiclass models on sparse inputs much slower or less memmory-efficient

Notes for Reviewers

Why is a list currently returned?

I'm not sure. I couldn't figure that out from #3000. I hope @imatiach-msft or @guolinke will be able to remember or someone else who understands C++ and CSC/CSR format better than me will be able to explain.

I suspect it was related to concerns of the form "concatenating CSC or CSR matrices requires hard-to-do-efficiently index updates", as mentioned here:

Or maybe it's because of concerns about having more than INT32_MAX items, which couldn't be represented as an int32 indptr, as mentioned here:

Does this break shap?

No... the shap package already cannot generate SHAP values for multiclass classification lightgbm models and scipy.sparse inputs.

reproducible example (click me)

Consider the following example, with numpy==1.26.0, scikit-learn==1.4.1, scipy==1.12.0, and shap==0.44.0, using Python 3.11 on x84_64 macOS.

import lightgbm as lgb
import shap
from scipy.sparse import csr_matrix, csc_matrix
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10_000, centers=5, n_features=11)
X_csr = csr_matrix(X)
X_csc = csc_matrix(X)

clf_lgb = lgb.LGBMClassifier(
    n_estimators=9,
    num_classes=5
)
clf_lgb.fit(X, y)

shap_values = shap.Explainer(clf_lgb)(X_csc)

With lightgbm==4.3.0:

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/shap/explainers/_tree.py", line 243, in __call__
   v = self.shap_values(X, y=y, from_call=True, check_additivity=check_additivity, approximate=self.approximate)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/shap/explainers/_tree.py", line 428, in shap_values
   if phi.shape[1] != X.shape[1] + 1:
      ^^^^^^^^^
AttributeError: 'list' object has no attribute 'shape'

As of this branch, shap still fails on such cases... but in a way that I think we could fix more easily after a release of lightgbm including this PR.

details (click me)

Error raised by shap using lightgbm as of this PR:

Traceback (most recent call last):
  File "/Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/shap/explainers/_tree.py", line 430, in shap_values
    phi = phi.reshape(X.shape[0], phi.shape[1]//(X.shape[1]+1), X.shape[1]+1)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/scipy/sparse/_base.py", line 152, in reshape
    shape = check_shape(args, self.shape)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/scipy/sparse/_sputils.py", line 339, in check_shape
    raise ValueError('matrix shape must be two-dimensional')
ValueError: matrix shape must be two-dimensional

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/shap/explainers/_tree.py", line 243, in __call__
    v = self.shap_values(X, y=y, from_call=True, check_additivity=check_additivity, approximate=self.approximate)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/shap/explainers/_tree.py", line 436, in shap_values
    raise ValueError(emsg) from e
ValueError: This reshape error is often caused by passing a bad data matrix to SHAP. See https://github.com/shap/shap/issues/580.

So yes still broken... but in a way that I think we could easily fix in shap. That error is coming from shap's attempt here to reshape the .predict(..., pred_contrib=True) output into a 3D matrix.

https://github.com/shap/shap/blob/d51d173f28b52d2f501b33668bf4529acf22709a/shap/explainers/_tree.py#L444

scipy.sparse.{csc,csr}_matrix objects are 2-dimensional.

X_csc.reshape(10, 1000, 11)
# ValueError: matrix shape must be two-dimensional

With xgboost==2.0.3, shap happily generates SHAP values for multiclass classification models and scipy.sparse matrices.

details (click me)

All setup identical to the reproducible example above.

clf_xgb = xgb.XGBClassifier(
    n_estimators=9,
)
clf_xgb.fit(X, y)

preds = clf_xgb.predict(X_csc, pred_contrib=True)

shap_values = shap.Explainer(clf_xgb)(X_csc)
type(shap_values.data)
# <class 'scipy.sparse._csc.csc_matrix'>

shap_values = shap.Explainer(clf_xgb)(X_csr)
type(shap_values.data)
# <class 'scipy.sparse._csr.csr_matrix'>

How I tested this

Interactively, with the examples shared above.

Confirmed in the lint CI job that this does not introduce new mypy errors: https://github.com/microsoft/LightGBM/actions/runs/8107505294/job/22159131171?pr=6348.

Modified the 2 existing Python unit tasks (one for Booster.predict(), one for dask) which already thoroughly cover this code path.

References

More notes: More notes on that at #3867 (comment)

@jameslamb jameslamb changed the title WIP: [python-package] ensure predict() always returns an array [python-package] ensure predict() always returns an array Mar 1, 2024
@jameslamb jameslamb marked this pull request as ready for review March 1, 2024 06:33
@jameslamb
Copy link
Collaborator Author

Tagging some others who might have opinions and be able to correct my understanding of the tradeoffs here:

@trivialfis @hcho3 @david-cortes @mayer79 @connortann @thatlittleboy @imatiach-msft

@david-cortes
Copy link
Contributor

I will take a guess that the reason why it's output like this is because SciPy's sparse matrices do not support n-dimensional arrays and the output here is 3-dimensional.

What this PR does however is inefficient: the output generated by concatenating is already generated during the call to predict:

_LIB.LGBM_BoosterPredictSparseOutput(

But then it gets split into lists:

matrices = self.__create_sparse_native(

From SciPy's docs, their sparse matrix classes are meant to be deprecated in the future, and the recommended class now are "sparse arrays":
https://docs.scipy.org/doc/scipy/reference/sparse.html

They also mention that n-dimensional sparse arrays might be implemented in the future, in which case this could move to a 3-dimensional structure.

@jameslamb
Copy link
Collaborator Author

I will take a guess that the reason why it's output like this is because SciPy's sparse matrices do not support n-dimensional arrays and the output here is 3-dimensional.

Thanks!

If that's true, then I think the decision to support this output change gets even easier... because for the numpy case (which uses a different code path), LightGBM is not returning a 3D matrix. It's returning a 2D matrix, with the feature contributions for each class stacked horizontally. (see the "example with lightgbm 4.3.0" in the PR description)

So the change in this PR, in addition to removing a hard-to-work-with data structure, would also ensure that the shape isn't dependent on the type of the input data.

What this PR does however is inefficient

Totally agree. I will try to remove the intermediate list generation.

From SciPy's docs, their sparse matrix classes are meant to be deprecated in the future, and the recommended class now are "sparse arrays"

Thanks for that information and link! I've created #6352 to capture that. I think lightgbm should start supporting those new array types everywhere it supports the corresponding matrix types.

@jameslamb jameslamb changed the title [python-package] ensure predict() always returns an array WIP: [python-package] ensure predict() always returns an array Jul 12, 2024
@jameslamb jameslamb marked this pull request as draft July 12, 2024 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants