
[python-package] do not copy column-major numpy arrays when creating Dataset #6721

Open · wants to merge 3 commits into master
Conversation

@jmoralez (Collaborator) commented Nov 12, 2024

Changes the logic that flattens the data array so that it takes the array's memory layout into account:

  • If the array is Fortran contiguous, it is flattened in column order and passed to the Dataset constructor with a flag indicating that the data is column-major, which avoids copying it to make it row-major.
  • If the array is C contiguous, no copy is made (current behavior).
  • If the array is not contiguous (e.g. because it was sliced), a copy is made to produce a C-contiguous array (current behavior). A sketch of this dispatch follows the list.
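
A minimal sketch of that dispatch (the helper name is hypothetical, not the PR's actual code), using NumPy's contiguity flags:

import numpy as np

def _flatten_respecting_layout(mat: np.ndarray):
    # Hypothetical helper mirroring the three cases above.
    if mat.flags.f_contiguous and not mat.flags.c_contiguous:
        # Fortran-contiguous: a column-order ravel is a view (no copy);
        # the C API is then told that the data is column-major.
        return mat.ravel(order="F"), False  # (flat_data, is_row_major)
    # C-contiguous arrays ravel as a view; non-contiguous slices are
    # copied here into a C-contiguous buffer (the current behavior).
    return np.ascontiguousarray(mat).ravel(order="C"), True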

I ran some quick tests with an array of shape (100,000, 50), and the timings to construct the Dataset from C-contiguous and F-contiguous arrays were roughly the same, so this doesn't introduce extra latency; it just avoids the copy.
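
For reference, a sketch of how that timing comparison could be reproduced (assuming lightgbm is importable; the shape matches the one described above):

import time

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X_c = rng.standard_normal((100_000, 50))  # C-contiguous
X_f = np.asfortranarray(X_c)              # column-major copy of the same data

for name, X in [("C-contiguous", X_c), ("F-contiguous", X_f)]:
    start = time.perf_counter()
    lgb.Dataset(X).construct()
    print(f"{name}: {time.perf_counter() - start:.3f}s")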

@jmoralez jmoralez changed the title do not copy column-major numpy arrays when creating Dataset WIP: [python-package] do not copy column-major numpy arrays when creating Dataset Nov 12, 2024
@jmoralez jmoralez marked this pull request as ready for review November 14, 2024 19:31
@jmoralez jmoralez changed the title WIP: [python-package] do not copy column-major numpy arrays when creating Dataset [python-package] do not copy column-major numpy arrays when creating Dataset Nov 14, 2024
@jmoralez (Collaborator, Author) commented:
This could also be done for the predict portion, i.e.

def __inner_predict_np2d(
    self,
    mat: np.ndarray,
    start_iteration: int,
    num_iteration: int,
    predict_type: int,
    preds: Optional[np.ndarray],
) -> Tuple[np.ndarray, int]:
    if mat.dtype == np.float32 or mat.dtype == np.float64:
        data = np.asarray(mat.reshape(mat.size), dtype=mat.dtype)
    else:  # change non-float data to float data, need to copy
        data = np.array(mat.reshape(mat.size), dtype=np.float32)
    ptr_data, type_ptr_data, _ = _c_float_array(data)
    n_preds = self.__get_num_preds(
        start_iteration=start_iteration,
        num_iteration=num_iteration,
        nrow=mat.shape[0],
        predict_type=predict_type,
    )
    if preds is None:
        preds = np.empty(n_preds, dtype=np.float64)
    elif len(preds.shape) != 1 or len(preds) != n_preds:
        raise ValueError("Wrong length of pre-allocated predict array")
    out_num_preds = ctypes.c_int64(0)
    _safe_call(
        _LIB.LGBM_BoosterPredictForMat(
            self._handle,
            ptr_data,
            ctypes.c_int(type_ptr_data),
            ctypes.c_int32(mat.shape[0]),
            ctypes.c_int32(mat.shape[1]),
            ctypes.c_int(_C_API_IS_ROW_MAJOR),
            ctypes.c_int(predict_type),
            ctypes.c_int(start_iteration),
            ctypes.c_int(num_iteration),
            _c_str(self.pred_parameter),
            ctypes.byref(out_num_preds),
            preds.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
        )
    )
    return preds, mat.shape[0]

Please let me know if it's ok to include it here or in a separate PR.
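
For reference, a quick check (not part of the PR) showing why the reshape in the current predict path copies column-major input:

import numpy as np

X = np.asfortranarray(np.ones((1_000, 10)))
flat = np.asarray(X.reshape(X.size), dtype=X.dtype)  # current predict path
print(np.shares_memory(X, flat))       # False: flattening F-contiguous data in C order copies
flat_view = X.ravel(order="F")
print(np.shares_memory(X, flat_view))  # True: a column-order ravel is a view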
