WIP: [python-package] ensure predict() always returns an array #6348
base: master
Conversation
Tagging some others who might have opinions and be able to correct my understanding of the tradeoffs here: @trivialfis @hcho3 @david-cortes @mayer79 @connortann @thatlittleboy @imatiach-msft
I will take a guess that the reason the output looks like this is that SciPy's sparse matrices do not support n-dimensional arrays, and the output here is 3-dimensional. What this PR does, however, is inefficient: the output generated by concatenating is already produced during the call at `python-package/lightgbm/basic.py` line 1483 (at 1a292f8), but then it gets split into lists at `python-package/lightgbm/basic.py` line 1504 (at 1a292f8).
From SciPy's docs, their sparse matrix classes are meant to be deprecated in the future, and the recommended classes are now "sparse arrays". They also mention that n-dimensional sparse arrays might be implemented in the future, in which case this output could move to a 3-dimensional structure.
Thanks! If that's true, then I think the decision to support this output change gets even easier. The change in this PR, in addition to removing a hard-to-work-with data structure, would also ensure that the shape isn't dependent on the type of the input data.
Totally agree. I will try to remove the intermediate list generation.
Thanks for that information and link! I've created #6352 to capture that.
Contributes to #3756.
Contributes to #3867.
Overview
For most combinations of parameters and input data types, `.predict()` methods in the Python package return either a `numpy` array or a `scipy.sparse.{csc,csr}_matrix`.

Since #3000 (merged June 2020, released in LightGBM v3.0.0 in August 2020), there is exactly one exception: for multiclass classification models, `.predict(X, pred_contrib=True)` returns a Python `list` of sparse matrices if `X` is a `scipy.sparse.{csc,csr}_matrix`.

example with lightgbm 4.3.0 (click me)
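The collapsed example above isn't reproduced here. As a stand-in, here is a pure-`scipy` sketch of the two output layouts being discussed. The sizes (4 samples, 3 classes, 2 features plus 1 bias column per class) are hypothetical, and no LightGBM call is made:

```python
import numpy as np
from scipy import sparse

# hypothetical sizes: 4 samples, 3 classes, 2 features (+1 bias column per class)
n_samples, n_classes, n_cols = 4, 3, 2 + 1
rng = np.random.default_rng(42)

# current behavior: a Python list with one
# (n_samples, n_features + 1) sparse matrix per class
per_class = [
    sparse.csr_matrix(rng.random((n_samples, n_cols))) for _ in range(n_classes)
]
print(type(per_class), per_class[0].shape)  # <class 'list'> (4, 3)

# behavior proposed in this PR: one (n_samples, n_classes * (n_features + 1))
# sparse matrix, with the per-class blocks side by side
stacked = sparse.hstack(per_class, format="csr")
print(stacked.shape)  # (4, 9)
```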
This PR proposes modifying LightGBM such that `.predict()` methods always return a `numpy` or `scipy.sparse` array.

Other related changes:

* removed the code in `lightgbm.dask` that existed to handle this special case

Benefits of this change

* simpler `lightgbm.dask` interface
* simpler internal prediction routines in `lightgbm`, and simpler docs for `.predict()` methods
* allows users of `lightgbm` to use array operations on the output of `.predict()` methods unconditionally
* removes the need for users of `lightgbm` to ignore or work around type-checking issues when they unconditionally use the output of `.predict()` as if it's only ever a `numpy` or `scipy.sparse` array
* improves compatibility with the `shap` library (see "Does this break `shap`?" below)
* makes typing `lightgbm` easier, improving the likelihood that type-checkers like `mypy` will be able to catch bugs

Costs of this change

* breaking change for any user code relying on `lightgbm` returning a `list` for this specific case
* additional runtime cost of `scipy.sparse.hstack()` for CSC/CSR matrices

Notes for Reviewers
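To illustrate the type-checking point from the benefits list above, here is a sketch of hypothetical consumer code (the `first_row_sum` function and the `PredictResult` alias are my own illustration, not part of `lightgbm`). Once `.predict()` is guaranteed to return an array type, a consumer can annotate and use the result unconditionally:

```python
from typing import Union

import numpy as np
from scipy import sparse

# hypothetical alias for what .predict() would return after this PR
PredictResult = Union[np.ndarray, sparse.csr_matrix, sparse.csc_matrix]

def first_row_sum(preds: PredictResult) -> float:
    # works unconditionally for both dense and sparse 2-D results;
    # with a possible `list` return, this would need an isinstance() branch
    return float(preds[[0], :].sum())

print(first_row_sum(np.ones((2, 3))))              # 3.0
print(first_row_sum(sparse.eye(3, format="csr")))  # 1.0
```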
Why is a list currently returned?
I'm not sure. I couldn't figure that out from #3000. I hope @imatiach-msft or @guolinke will be able to remember or someone else who understands C++ and CSC/CSR format better than me will be able to explain.
I suspect it was related to concerns of the form "concatenating CSC or CSR matrices requires hard-to-do-efficiently index updates", as mentioned here:
Or maybe it's because of concerns about having more than `INT32_MAX` items, which couldn't be represented in an `int32` `indptr`, as mentioned here: "`sparse.hstack` returns incorrect result when the stack would result in indices too large for `np.int32`"
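As background on that concern: sparse matrices whose coordinates exceed the `int32` range need 64-bit index arrays, which is what naive stacking code can get wrong. A small sketch (the shape is hypothetical and deliberately extreme; only one nonzero is stored, so no large allocation happens):

```python
import numpy as np
from scipy import sparse

# a 1 x 5_000_000_000 CSR matrix with a single nonzero whose column index
# is past INT32_MAX, so 32-bit indices cannot represent it
data = np.array([1.0])
indices = np.array([4_999_999_999], dtype=np.int64)
indptr = np.array([0, 1], dtype=np.int64)
wide = sparse.csr_matrix((data, indices, indptr), shape=(1, 5_000_000_000))

print(wide.indices.dtype)  # 64-bit indices are required here
```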
(scipy/scipy#16569)

Does this break `shap`?

No... the `shap` package already cannot generate SHAP values for multiclass classification `lightgbm` models and `scipy.sparse` inputs.

reproducible example (click me)
Consider the following example, with `numpy==1.26.0`, `scikit-learn==1.4.1`, `scipy==1.12.0`, and `shap==0.44.0`, using Python 3.11 on x86_64 macOS.

With `lightgbm==4.3.0`:

As of this branch, `shap` still fails on such cases... but in a way that I think we could fix more easily after a release of `lightgbm` including this PR.

details (click me)
Error raised by `shap` using `lightgbm` as of this PR:

So yes, still broken... but in a way that I think we could easily fix in `shap`. That error comes from `shap`'s attempt to reshape the `.predict(..., pred_contrib=True)` output into a 3D matrix (https://github.com/shap/shap/blob/d51d173f28b52d2f501b33668bf4529acf22709a/shap/explainers/_tree.py#L444), which fails because `scipy.sparse.{csc,csr}_matrix` objects are 2-dimensional.

With `xgboost==2.0.3`, `shap` happily generates SHAP values for multiclass classification models and `scipy.sparse` matrices.

details (click me)
All setup identical to the reproducible example above.
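For reference, the reshape that `shap` attempts only works on dense arrays; a sketch (with hypothetical sizes, not `shap`'s actual code) of how a consumer could densify the single 2-D stacked output and reshape it to 3-D:

```python
import numpy as np
from scipy import sparse

# hypothetical sizes: 4 samples, 3 classes, 5 columns per class (features + bias)
n_samples, n_classes, n_cols = 4, 3, 5
wide = sparse.random(
    n_samples, n_classes * n_cols, density=0.5, format="csr", random_state=0
)

# scipy.sparse has no 3-D type, so densify first, then reshape
# (n_samples, n_classes * n_cols) -> (n_samples, n_classes, n_cols)
contribs = wide.toarray().reshape(n_samples, n_classes, n_cols)
print(contribs.shape)  # (4, 3, 5)
```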
How I tested this
Interactively, with the examples shared above.
Confirmed in the `lint` CI job that this does not introduce new `mypy` errors: https://github.com/microsoft/LightGBM/actions/runs/8107505294/job/22159131171?pr=6348.

Modified the 2 existing Python unit tests (one for `Booster.predict()`, one for Dask) which already thoroughly cover this code path.

References
More notes on this at #3867 (comment).