
[python-package] do not copy column-major numpy arrays when creating Dataset #6721

Open · wants to merge 3 commits into master
Conversation

@jmoralez (Collaborator) commented Nov 12, 2024

Changes the logic that flattens the data array so that it takes the array's memory layout into account:

  • If the array is Fortran contiguous, it is flattened in column order and passed to the Dataset constructor with a flag indicating that the data is column-major, which avoids copying it to make it row-major.
  • If the array is C contiguous, no copy is made (current behavior).
  • If the array is not contiguous (e.g. because it was sliced), a copy is made to produce a C-contiguous array (current behavior). A sketch of this dispatch follows the list.
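
A minimal sketch of that dispatch (the helper name is hypothetical, not the PR's actual code), using NumPy's contiguity flags:

import numpy as np

def _flatten_respecting_layout(mat: np.ndarray):
    # Hypothetical helper mirroring the three cases above.
    if mat.flags.f_contiguous and not mat.flags.c_contiguous:
        # Fortran-contiguous: a column-order ravel is a view (no copy);
        # the C API is then told that the data is column-major.
        return mat.ravel(order="F"), False  # (flat_data, is_row_major)
    # C-contiguous arrays ravel as a view; non-contiguous slices are
    # copied here into a C-contiguous buffer (the current behavior).
    return np.ascontiguousarray(mat).ravel(order="C"), True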

I ran some quick tests with an array of shape (100,000, 50), and the timings to construct the Dataset from C-contiguous and F-contiguous arrays were roughly the same, so this doesn't introduce extra latency; it just avoids the copy.
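
For reference, a sketch of how that timing comparison could be reproduced (assuming lightgbm is importable; the shape matches the one described above):

import time

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X_c = rng.standard_normal((100_000, 50))  # C-contiguous
X_f = np.asfortranarray(X_c)              # column-major copy of the same data

for name, X in [("C-contiguous", X_c), ("F-contiguous", X_f)]:
    start = time.perf_counter()
    lgb.Dataset(X).construct()
    print(f"{name}: {time.perf_counter() - start:.3f}s")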

@jmoralez jmoralez changed the title do not copy column-major numpy arrays when creating Dataset WIP: [python-package] do not copy column-major numpy arrays when creating Dataset Nov 12, 2024
@jmoralez jmoralez marked this pull request as ready for review November 14, 2024 19:31
@jmoralez jmoralez changed the title WIP: [python-package] do not copy column-major numpy arrays when creating Dataset [python-package] do not copy column-major numpy arrays when creating Dataset Nov 14, 2024
@jmoralez (Collaborator, Author) commented:
This could also be done for the predict portion, i.e.

def __inner_predict_np2d(
    self,
    mat: np.ndarray,
    start_iteration: int,
    num_iteration: int,
    predict_type: int,
    preds: Optional[np.ndarray],
) -> Tuple[np.ndarray, int]:
    if mat.dtype == np.float32 or mat.dtype == np.float64:
        data = np.asarray(mat.reshape(mat.size), dtype=mat.dtype)
    else:  # change non-float data to float data, need to copy
        data = np.array(mat.reshape(mat.size), dtype=np.float32)
    ptr_data, type_ptr_data, _ = _c_float_array(data)
    n_preds = self.__get_num_preds(
        start_iteration=start_iteration,
        num_iteration=num_iteration,
        nrow=mat.shape[0],
        predict_type=predict_type,
    )
    if preds is None:
        preds = np.empty(n_preds, dtype=np.float64)
    elif len(preds.shape) != 1 or len(preds) != n_preds:
        raise ValueError("Wrong length of pre-allocated predict array")
    out_num_preds = ctypes.c_int64(0)
    _safe_call(
        _LIB.LGBM_BoosterPredictForMat(
            self._handle,
            ptr_data,
            ctypes.c_int(type_ptr_data),
            ctypes.c_int32(mat.shape[0]),
            ctypes.c_int32(mat.shape[1]),
            ctypes.c_int(_C_API_IS_ROW_MAJOR),
            ctypes.c_int(predict_type),
            ctypes.c_int(start_iteration),
            ctypes.c_int(num_iteration),
            _c_str(self.pred_parameter),
            ctypes.byref(out_num_preds),
            preds.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
        )
    )
    return preds, mat.shape[0]

Please let me know if it's ok to include it here or in a separate PR.
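
For reference, a quick check (not part of the PR) showing why the reshape in the current predict path copies column-major input:

import numpy as np

X = np.asfortranarray(np.ones((1_000, 10)))
flat = np.asarray(X.reshape(X.size), dtype=X.dtype)  # current predict path
print(np.shares_memory(X, flat))       # False: flattening F-contiguous data in C order copies
flat_view = X.ravel(order="F")
print(np.shares_memory(X, flat_view))  # True: a column-order ravel is a view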
