What is the role of sparse DMatrix constructors?
#160
Comments
The bad news is that it's probably converting to a dense matrix and running out of memory. The "good" news is that I don't think libxgboost actually supports sparse matrices, which would mean that this is not a bug but intended behavior. I believe that, when sparse formats are created via [...] That's not to say there is no feature to be added here: we ought to have something for calling [...] If you are up for experimentation you can see the deleted method here. @trivialfis can you confirm that I'm getting this right? I think what I'm saying is consistent with our earlier discussion, but I don't think we explicitly mentioned [...]
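To put a rough number on the "converting to a dense matrix" explanation, here is a back-of-the-envelope sketch using the shape and RAM figures from the original report. It assumes the dense copy stores every element as a 4-byte Float32 and that "64GB" means decimal gigabytes; both are assumptions for illustration, not statements about what the library actually allocates.

```python
# Rough memory estimate for densifying the matrix described in the report.
ROWS, COLS = 1_705_844, 29_996   # X shape reported in the issue
BYTES_PER_FLOAT32 = 4            # assumption: dense copy is stored as Float32
RAM_BYTES = 64 * 10**9           # assumption: "64GB" read as decimal gigabytes


def dense_bytes(rows: int, cols: int, itemsize: int = BYTES_PER_FLOAT32) -> int:
    """Bytes needed to store every element of the matrix, zeros included."""
    return rows * cols * itemsize


needed = dense_bytes(ROWS, COLS)
print(f"dense Float32 copy: {needed / 10**9:.1f} GB")   # prints 204.7 GB
print(f"fits in RAM: {needed <= RAM_BYTES}")            # prints False
```

Under these assumptions a dense copy would need roughly three times the available memory, which is consistent with an out-of-memory error on a file that is only 700 MB in sparse form.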
I've been thinking a bit, and it seems to me that there are likely quite a few situations in which treating 0's as nulls in a sparse dataset for the purpose of training an xgboost model might actually make some sense. This would explain how you seem to have been content with whatever was happening in v1.1.1, which might have been exactly this. Perhaps one of the things we need to do here is add a new function which creates a [...] I'm sure trivialfis has some thoughts on this (but one ping is enough 🙂).
@ExpandingMan, thanks for bringing up the question about the handling of 0s in the sparse matrix. From my previous experience comparing various implementations, such as XGBoost in Julia, Python XGBoost, LightGBM, and CatBoost, I've found that performance is on par when using version 1.1.1. In my experience, for datasets with millions of rows and mostly sparse and categorical data, it's fine to ignore the 0s, since that lets me fit huge data and benefit from it. One potential solution could be to assign a different internal representation for 'explicit' 0s and ignore the missing data. I'll give your suggestion a try and run some baselines using the same data in Python to compare quality metrics (e.g., precision). If the results are satisfactory, I'd propose this fix as one option. Although I'm not very familiar with the C bindings of the library, I have some experience with it in general. I'll take a look at the code and see if I can incorporate this fix there. Thanks for your input on this issue!
Great news, @ExpandingMan! Your suggestion worked. I implemented the code you suggested and was able to load more than 1.7M rows in the DMatrix using version 1.1.1 as expected. I'll now proceed to test the quality of the model. If it is consistent with the previous metrics, I'm happy with this solution as it appears to be effective.
Hmm, I need to write a document on the behavior and get some reviews from all participants. I think it's not that complicated; it's just that my presentation introduced unnecessary complications. Assuming this is a CSC matrix with 0 as a valid (non-missing) value:
[...]
so:
[...]
The corresponding [...]
Let me unify the CSC input with CSR. I believe that's where the confusion comes in. The CSC API is legacy.
Thanks @trivialfis. My main question here was more about how xgboost treats a CSC (or even a CSR) input. In every CSC matrix implementation I'm aware of, the non-explicit elements, i.e. those not explicitly included in the values array, are [...]

The second part of this, which occurred to me last night, is that it actually might make some sense to treat 0's as missing when training an xgboost model in cases in which the data is very sparse. My reasoning is basically "there are so many 0's, don't try to split on them because all your meaningful signal is coming from the non-zeros". If that's the case, then the CSC method could be quite useful, particularly since this is the most common type of sparse matrix in Julia. The only catch is that the method for using it must not be [...]

Incidentally, I've recently had to do some machine learning on sparse data, and it can be pretty challenging, so now I'm interested in experimenting with the "treat the 0's as missing" idea.
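The two readings of a sparse input discussed here (unstored entries are zeros, versus unstored entries are missing) can be made concrete with a small pure-Python sketch. The `CSC` tuple and the `get` helper are hypothetical names invented for this illustration, indices are 0-based (unlike Julia), and nothing here is an XGBoost type or API.

```python
import math
from typing import List, NamedTuple


class CSC(NamedTuple):
    """Minimal compressed-sparse-column matrix (illustrative helper only)."""
    n_rows: int
    colptr: List[int]    # colptr[j]:colptr[j + 1] delimits the entries of column j
    rowval: List[int]    # row index of each stored value
    nzval: List[float]   # the stored values themselves


def get(m: CSC, i: int, j: int, implicit: float) -> float:
    """Return element (i, j); `implicit` is what an unstored entry is taken to mean."""
    for k in range(m.colptr[j], m.colptr[j + 1]):
        if m.rowval[k] == i:
            return m.nzval[k]
    return implicit


# A 3x2 matrix with two stored entries: (0, 0) = 1.5 and (2, 1) = 2.0.
m = CSC(n_rows=3, colptr=[0, 1, 2], rowval=[0, 2], nzval=[1.5, 2.0])

# Usual sparse-matrix convention: unstored entries are zeros.
as_zero = [[get(m, i, j, 0.0) for j in range(2)] for i in range(m.n_rows)]

# "Treat the 0's as missing": unstored entries are NaN.
as_missing = [[get(m, i, j, math.nan) for j in range(2)] for i in range(m.n_rows)]
```

The stored values are identical in both views; only the interpretation of the gaps changes, which is exactly the point on which a constructor has to be explicit.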
Here is an update from my side. I tried to use the commented code at [...]
Good news: I could load huge sparse data into it and train my models. Below is the data summarizing my runs: [...] Metric P means [...] So, is this a problem caused by the way we are reading the sparse data into the new DMatrix, or is there some other side effect of version 2.1.1 that we need to consider? Maybe I'm missing something about how to read the sparse data into the DMatrix correctly.
I don't see what could be causing this other than the data getting into the [...] You can now get data back out of the [...]
Ok. Let me investigate more about the differences between these two versions and post them here.
Apologies for the slow response. I think this is making things too complicated.

    csr = load_data()  # a csr matrix loaded by your library code, same for dense

    # internal DMatrix is doing this
    class DMatrix:
        """I'm a CSR matrix called DMatrix, with parameter `missing`, constants `NaN` and `Inf` removed."""
        def __init__(self, missing: float):
            self.missing = missing
            self.indptr = []
            self.data = []
            self.indices = []

        def load_csr(self, csr):
            for i, v in enumerate(csr.data):
                if v not in (self.missing, NaN, Inf):
                    self.data.append(v)
                    self.indices.append(csr.indices[i])
            self.handle_indptr()

For CSC matrix before dmlc/xgboost#8672 can be merged:

    csc = load_data()  # a csc matrix loaded by your library code

    # internal DMatrix is doing this
    class DMatrix:
        """I'm a CSR matrix called DMatrix, with constants `NaN` and `Inf` removed."""
        def __init__(self):
            self.indptr = []
            self.data = []
            self.indices = []

        def load_csc(self, csc):
            for i, v in enumerate(csc.data):
                if v not in (NaN, Inf):
                    self.data.append(v)
                    self.indices.append(csc.indices[i])
            self.handle_indptr()
            self.transform_to_csr()

After the PR, it's the same as CSR and dense.
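For anyone who wants to experiment with the filtering rule described in the pseudocode above, the following is a minimal runnable sketch of the CSR case. It mirrors the pseudocode in this comment, not libxgboost itself; `filter_csr` is a hypothetical helper name, and the explicit row-pointer rebuild stands in for the unspecified `handle_indptr` step.

```python
import math
from typing import List, Tuple


def filter_csr(
    indptr: List[int],
    indices: List[int],
    data: List[float],
    missing: float = math.nan,
) -> Tuple[List[int], List[int], List[float]]:
    """Drop stored entries equal to `missing`, NaN or infinite; rebuild row pointers."""
    out_indptr, out_indices, out_data = [0], [], []
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            v = data[k]
            # NaN never compares equal to anything, so it needs its own check.
            if v == missing or math.isnan(v) or math.isinf(v):
                continue
            out_indices.append(indices[k])
            out_data.append(v)
        out_indptr.append(len(out_data))  # end of this row in the filtered arrays
    return out_indptr, out_indices, out_data


# 2x3 input with an explicit 0.0 at (0, 2) and a NaN at (1, 1).
indptr, indices, data = [0, 2, 4], [0, 2, 1, 2], [1.0, 0.0, math.nan, 3.0]

keep_zeros = filter_csr(indptr, indices, data)               # missing = NaN
drop_zeros = filter_csr(indptr, indices, data, missing=0.0)  # missing = 0
```

With `missing` left at NaN the explicit zero survives; with `missing=0.0` it is filtered out together with the NaN, so the choice of `missing` decides whether stored zeros take part in training.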
Thank you @ExpandingMan and @trivialfis for your feedback and support. The DMatrix appears to be working correctly.

Sample - Original Sparse SVMLight Data: [...]

Sample - DMatrix Representation: [...]

I am using a Pairwise optimization method in version 1.1.1 that utilizes group information. I will check if this information is being correctly passed to the DMatrix using the setgroup function in the new version. Another option is to simplify the configuration and compare the results between the two versions. I will keep you updated and consider any additional suggestions.
Research Update

Score Model Analysis

I have been running a comparison of two distinct versions of XGBoost, [...] As a result, I have found that [...] Any thoughts or suggestions on this issue are welcome.
To keep the thread fresh. I am currently investigating the differences between versions [...]
It seems like there is a lot of confusion here. First, I have not tested the sparse [...]

@dmoliveira this is only actionable if you can show some evidence of what is causing the discrepancy. You've shown detailed results, but you need to provide a complete working example with data in order for anything to happen here. I frankly have no idea what to make of what you have shown thus far.

@trivialfis I think we are talking past each other somewhat. I understand how to construct a [...]
I've just tested some constructors. When converting the [...] At the moment I can only speculate on the wisdom of actually using this as a technique to train on sparse data. Again, I can see how simply ignoring the 0's might be a reasonable thing to do if your data is very sparse. This approach would not be without major caveats, in particular data points in which every dimension is [...]
Your observation is correct. It's filtered out. If one wants the 0s to be used during training, a dense matrix should be preferred. There was a proposal to restore the zero elements in the input sparse matrix, but we opted not to implement it due to code complexity. In order for xgboost to restore the 0s, there are 2 options: [...]
Let me know if you think this should be prioritized. I can open an issue to track it.
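As an illustration of the "use a dense matrix if you want the 0s" advice, here is a small sketch that expands CSR arrays into dense rows one at a time, writing every unstored entry as an explicit 0.0. `dense_rows` is a hypothetical helper for this thread, not part of any XGBoost binding, and it does nothing to reduce the total memory a fully dense training matrix would need.

```python
from typing import Iterator, List


def dense_rows(
    indptr: List[int], indices: List[int], data: List[float], n_cols: int
) -> Iterator[List[float]]:
    """Yield each CSR row as a dense list, with unstored entries set to 0.0."""
    for row in range(len(indptr) - 1):
        out = [0.0] * n_cols
        for k in range(indptr[row], indptr[row + 1]):
            out[indices[k]] = data[k]
        yield out


# 3x3 matrix with stored entries (0, 0) = 1.5 and (2, 2) = 2.0; row 1 is empty.
rows = list(dense_rows([0, 1, 1, 2], [0, 2], [1.5, 2.0], n_cols=3))
```

Note that the empty middle row becomes a row of explicit zeros here, whereas in the filtered sparse form it would carry no values at all.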
Thanks, that clears things up! I don't really have an opinion on this right now. Like I said, I'm still curious about the idea that maybe omitting 0's from sparse data actually makes sense for training in some special cases. I suppose the bigger question is whether xgboost is likely to do a good job on sparse data with the 0's included. If yes, I think it would be really nice to have an implementation that respects the sparse matrix structure. I also think it will be hard to gauge how much demand there really is for this. I suspect that in most cases, if you have data too big to fit in memory, it's not going to be sparse, but if it does just so happen to be sparse, the existence of sparse matrix constructors can potentially save you an enormous amount of effort.
A small repository has been created (https://github.com/dmoliveira/xgboost-benchmark) that contains a single file to demonstrate the issue and provide the necessary data. The script relies solely on the XGBoost library, which can be installed at the version under test (versions 1.5.2 and 2.2.3 were used in the tests). The tests were conducted using Julia v1.8.

This repository contains a simple learning-to-rank task aimed at properly ranking documents. For some reason, the metrics fluctuate depending on when the prediction metric is called. The cause of this issue is still unknown, but version 1 of the library appears to be more stable. When plotted, the results show that the prediction quality decreases, particularly for relevant versus irrelevant documents in lower and higher scores.

Ranking Score Results for V2 depending on when you call [...]

Results Run XGBoost v1.5.2
Invert Test and Train prediction call order for Evaluation
Results Run XGBoost v2.2.3
Invert Test and Train prediction call order for Evaluation

Full code used to generate the results.
@ExpandingMan @trivialfis do you have any idea what could be happening?
Empirical observation of the results is really not useful to me here. What I would need in order to take action is specifically what the discrepancy is between the elements of the constructed [...] The [...]
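One way to produce the kind of element-level evidence requested here is to dump the stored entries of the two matrices and list the coordinates where they disagree. The sketch below works on plain CSR arrays; `csr_to_dict` and `diff_entries` are hypothetical helper names, and how the arrays are obtained from each library version is deliberately left out because it depends on the API in use.

```python
import math
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]


def csr_to_dict(indptr: List[int], indices: List[int], data: List[float]) -> Dict[Coord, float]:
    """Map (row, col) -> stored value for every explicit entry."""
    return {
        (r, indices[k]): data[k]
        for r in range(len(indptr) - 1)
        for k in range(indptr[r], indptr[r + 1])
    }


def _same(a: float, b: float, tol: float) -> bool:
    """Equality with tolerance; two NaNs count as the same value."""
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return a == b or abs(a - b) <= tol


def diff_entries(
    a: Dict[Coord, float], b: Dict[Coord, float], tol: float = 0.0
) -> List[Tuple[Coord, Optional[float], Optional[float]]]:
    """List (coord, value_in_a, value_in_b) where the two disagree; None means not stored."""
    out = []
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)
        if va is None or vb is None or not _same(va, vb, tol):
            out.append((key, va, vb))
    return out


old = csr_to_dict([0, 2, 3], [0, 2, 1], [1.0, 0.0, 4.0])
new = csr_to_dict([0, 1, 2], [0, 1], [1.0, 4.5])
mismatches = diff_entries(old, new)
```

In this toy input, `mismatches` reports the explicit zero at (0, 2) that only one side stores and the differing value at (1, 1), which is the level of detail that makes a discrepancy actionable.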
@ExpandingMan sure. I attempted some experiments to address the issue, but I agree that we need a more pragmatic approach. Unfortunately, I'm currently pressed for time, but I'm eager to investigate the root cause. As I use these models in production, serving millions of users, I'm stuck on v1 rather than v2. I'll return when I have more availability, and perhaps we can schedule a brief discussion then.
Apologies for missing the ping. I can try to reproduce it once I finish some other experiments.
I have tested XGBoost.jl with 2.5.1; the issue seems to be fixed there?
I have been using version v1.1.1 for some time now and have been able to successfully load large amounts of data (5-10 GB) into memory using XGBoost and DMatrix, as I have ample RAM to do so. However, after updating to v2.1.1, I am experiencing an 'Out-of-Memory' exception when attempting to load significantly smaller amounts of data (200 MB-700 MB). I am concerned that this update has made the library less efficient and effective for training large models. This issue is a major obstacle for real-world usage and requires immediate attention and resolution. Can you please take a look at this and provide a fix? @ExpandingMan @aviks Thank you.
Julia Version: 1.8
Machine RAM: 64GB
Data Size (MB): 700 MB
Data Stats: X:(1705844, 29996), Y:(1705844,)
Data Format: LibSVM Format
The error occurs when executing the DMatrix conversion using a SparseMatrixCSC{Float32, Int32}
Error Message: