Integration with C++ #1181
-
Hello everyone, our team wants to develop some functions in C++ (for efficiency reasons) and port the solutions to River.
-
Hello @paulorla.
Nice! What kind of contributions are you thinking of? Do you have any method in mind?
Also, what do you mean exactly by "efficiency reasons"? Even if you implement a method in C++, you'll be limited by the fact that the main control loop happens in Python. That's the downside of streaming vs. mini-batch processing -- although we support mini-batch processing in some places too.
I see no reason not to. As of now, we have some Cython code and some Rust code. We build wheels, so it's not a problem for our users. I'm not savvy enough about the pros and cons of PyBind11 vs. Cython vs. cffi, so it's up to you -- I found a comparison here. @AdilZouitine @gbolmier do you have any input? Note that we have a
PS: it's great that you're willing to contribute! The team is here to help if needed :)
-
Hello @MaxHalford and @gbolmier. Thanks for your thoughts.
We want to start with the ROCAUC. The current implementation makes an approximation (by default using 10 points). I believe that it was implemented like this to save computing time. We could, for instance, implement the processing in C++ and wrap it to be accessible through Python.
In the ROCAUC example, the current metrics must be computed for every threshold possible, thus justifying the use of C++ or something 'more efficient' than Python.
Yes, Rust could be a good idea. Do you have any good material on integrating Rust with Python (we are not Rust programmers, but we can learn)?
-
About ROC AUC, we did indeed use 10 thresholds for performance reasons. If you are able to implement a deterministic version that produces the same results as scikit-learn, then be my guest. Other than the performance issue, the problem with using every possible threshold is that you end up storing one confusion matrix per threshold, so memory grows with the number of distinct scores seen.
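For readers following along, the trade-off described above can be sketched in plain Python. This is only an illustration under stated assumptions, not River's actual code: the class name `ThresholdedROCAUC`, its `update`/`get` methods, and the evenly spaced thresholds on [0, 1] are all made up for the example. It keeps one confusion matrix per threshold and integrates the resulting ROC points with the trapezoidal rule.

```python
class ThresholdedROCAUC:
    """Approximate ROC AUC from a fixed grid of thresholds.

    Hypothetical sketch for illustration; not River's implementation.
    """

    def __init__(self, n_thresholds=10):
        # Evenly spaced thresholds on [0, 1] (an assumption of this sketch;
        # n_thresholds must be at least 2).
        self.thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
        # One confusion matrix per threshold, stored as [tp, fp, tn, fn].
        self.counts = [[0, 0, 0, 0] for _ in self.thresholds]

    def update(self, y_true, score):
        # Each new sample touches every confusion matrix: O(n_thresholds) work.
        for t, c in zip(self.thresholds, self.counts):
            if score > t:
                c[0 if y_true else 1] += 1  # true positive or false positive
            else:
                c[3 if y_true else 2] += 1  # false negative or true negative

    def get(self):
        # Anchor the curve at (0, 0) and (1, 1), then add one point per threshold.
        points = [(0.0, 0.0), (1.0, 1.0)]
        for tp, fp, tn, fn in self.counts:
            pos, neg = tp + fn, fp + tn
            tpr = tp / pos if pos else 0.0
            fpr = fp / neg if neg else 0.0
            points.append((fpr, tpr))
        points.sort()
        # Trapezoidal rule over consecutive (FPR, TPR) points.
        return sum(
            (x1 - x0) * (y0 + y1) / 2
            for (x0, y0), (x1, y1) in zip(points, points[1:])
        )
```

With this grid, a positive scored 0.55 and a negative scored 0.52 fall between the same two thresholds, so the sketch reports 0.5 even though the pair is perfectly ranked. Adding thresholds narrows that gap at the cost of more confusion matrices to store and update.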
-
Thanks @MaxHalford.
-
Hello everyone, I'm a member of Paulo's team. I made a C++ implementation of AUC-ROC based on the Wilcoxon statistical test (https://link.springer.com/article/10.1007/s10115-017-1022-8). It uses every threshold it needs, so the result is closer to the scikit-learn implementation, and it runs faster than the current River implementation. Because the C++ implementation is more accurate, some pytest tests, namely those that compare against an expected AUC value, no longer pass. Our implementation doesn't support sample weights, so the tests that rely on that functionality don't pass either. Should we commit and open a pull request as is, or change the tests first?
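The rank-based idea mentioned above can be illustrated in a few lines of Python. This is a batch sketch of the textbook relationship between ROC AUC and the Wilcoxon-Mann-Whitney U statistic. It is not the C++ code discussed in this thread, nor the windowed algorithm from the linked paper, and the function name `auc_mann_whitney` is made up for the example. Like the implementation described above, it ignores sample weights.

```python
def auc_mann_whitney(y_true, scores):
    """Exact ROC AUC via the Wilcoxon-Mann-Whitney U statistic (illustrative sketch)."""
    n = len(scores)
    # Indices of the samples, sorted by ascending score.
    order = sorted(range(n), key=lambda i: scores[i])

    # Assign 1-based ranks, giving tied scores the average of their ranks.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(1 for y in y_true if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined without both classes present")

    # U counts the (positive, negative) pairs ranked correctly; ties count 1/2.
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75. No threshold grid is involved, which is why a rank-based result can differ from a 10-threshold approximation and break tests that hard-code the approximate value.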
-
Hey, I implemented the prequential version of the ROCAUC, have a look:
-
Hey @davidlpgomes, is there any way you could take a look at this failing job? It seems that the C++ code won't compile on macOS.