Integration with C++ #1181

paulorla · 2023-02-09T13:58:34Z

paulorla
Feb 9, 2023

Hello Everyone,

Our team wants to develop some functions in C++ (for efficiency reasons) and port the solutions to River.
For now, we are using PyBind11 to make our implementation visible to Python functions.
Is it possible to make pull requests of C++ code? If so, which library (PyBind11, Cython, ...) should we use to expose the C++ functions to Python?

MaxHalford · 2023-02-09T14:05:57Z

MaxHalford
Feb 9, 2023
Maintainer

Hello @paulorla.

Our team wants to develop some functions in C++ (for efficiency reasons) and port the solutions to River.

Nice! What kind of contributions are you thinking of? Do you have any method in mind?

Also, what do you mean exactly by "efficiency reasons"? Even if you implement a method in C++, you'll be limited by the fact that the main control loop happens in Python. That's the downside of streaming vs. mini-batch processing -- although we support mini-batch processing in some places too.

Is it possible to make pull requests of C++ code? If so, which library (PyBind11, Cython, ...) should we use to expose the C++ functions to Python?

I see no reason not to. As of now, we have some Cython code and some Rust code. We build wheels, so it's not a problem for our users. I'm not savvy enough about the pros and cons of PyBind11 vs. Cython vs. cffi, so it's up to you -- I found a comparison here. @AdilZouitine @gbolmier do you have any input?

Note that we have a VectorDict implemented with Cython. So maybe it's worth sticking to Cython, that way you can re-use VectorDict.

PS: it's great that you're willing to contribute! The team is here to help if needed :)

1 reply

gbolmier Feb 9, 2023
Maintainer

I would tend to limit the number of languages being used for maintenance purposes. Especially if we don't have anyone familiar/experienced with C++ bindings in the active maintainers.
Hence I would prefer sticking to Cython and Rust for low level implementations.

paulorla · 2023-02-09T16:07:14Z

paulorla
Feb 9, 2023
Author

Hello @MaxHalford and @gbolmier .

Thanks for your thoughts.

Nice! What kind of contributions are you thinking of? Do you have any method in mind?

We want to start with the ROCAUC. The current implementation makes an approximation (by default using 10 points). I believe that it was implemented like this to save computing time. We could, for instance, implement the processing in C++ and wrap it to be accessible through Python.

Also, what do you mean exactly by "efficiency reasons"? Even if you implement a method in C++, you'll be limited by the fact that the main control loop happens in Python. That's the downside of streaming vs. mini-batch processing -- although we support mini-batch processing in some places too.

In the ROCAUC example, the current metrics must be computed for every threshold possible, thus justifying the use of C++ or something 'more efficient' than Python.

Hence I would prefer sticking to Cython and Rust for low level implementations.

Yes, Rust could be a good idea. Do you have any good material on integrating Rust with Python (we are not Rust programmers, but we can learn)?

1 reply

AdilZouitine Feb 9, 2023
Maintainer

Hey @paulorla,
I agree with @gbolmier; limiting the number of languages is a good idea for maintenance and CD purposes.

Yes, Rust could be a good idea. Do you have any good material on integrating Rust with Python (we are not Rust programmers, but we can learn)?

I wrote a blog post on how we bind Rust in River's stats module: https://boring-guy.sh/posts/river-rust/ 😄
You can find lots of resources for binding Rust to Python in the PyO3 repo: https://github.com/PyO3/pyo3

MaxHalford · 2023-02-09T16:49:58Z

MaxHalford
Feb 9, 2023
Maintainer

About ROC AUC, we did indeed use 10 thresholds for performance reasons. If you are able to implement a deterministic version that produces the same results as scikit-learn, then be my guest. Other than the performance issue, the problem with using every possible threshold is that you end up storing one confusion matrix for each threshold.
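To make the memory trade-off concrete, here is a minimal Python sketch of a thresholded approximation. It is not River's actual code: the class name `ThresholdedROCAUC` and its methods are hypothetical, and it assumes scores lie in [0, 1]. It keeps one confusion matrix per threshold, so memory grows with the number of thresholds rather than with the number of samples.

```python
class ThresholdedROCAUC:
    """Approximate ROC AUC from one confusion matrix per threshold (illustrative only)."""

    def __init__(self, n_thresholds=10):
        # Evenly spaced thresholds in [0, 1]; n_thresholds must be at least 2.
        self.thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
        # One [tp, fp, tn, fn] counter per threshold.
        self.counts = [[0, 0, 0, 0] for _ in self.thresholds]

    def update(self, y_true, score):
        # Every threshold's confusion matrix is updated for each new sample.
        for threshold, c in zip(self.thresholds, self.counts):
            predicted = score >= threshold
            if y_true and predicted:
                c[0] += 1  # true positive
            elif predicted:
                c[1] += 1  # false positive
            elif not y_true:
                c[2] += 1  # true negative
            else:
                c[3] += 1  # false negative

    def get(self):
        # Anchor the curve at (0, 0) and (1, 1), then add one point per threshold.
        points = [(0.0, 0.0), (1.0, 1.0)]
        for tp, fp, tn, fn in self.counts:
            tpr = tp / (tp + fn) if tp + fn else 0.0
            fpr = fp / (fp + tn) if fp + tn else 0.0
            points.append((fpr, tpr))
        points.sort()
        # Trapezoidal rule over the (FPR, TPR) points.
        area = 0.0
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            area += (x1 - x0) * (y0 + y1) / 2
        return area


metric = ThresholdedROCAUC(n_thresholds=10)
for y, s in [(1, 0.9), (1, 0.8), (0, 0.1), (0, 0.2)]:
    metric.update(y, s)
auc = metric.get()
```

Using every distinct score as a threshold would make `counts` grow with the stream, which is the storage problem described above.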

1 reply

MaxHalford Feb 15, 2023
Maintainer

Hey @paulorla, are you aware of this paper? It's quite recent and suggests a simple idea based on histograms. Histograms are easy to update and it's fast enough in Python. However, the looping over the bins part could be improved with a Rust/Cython implementation.
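As a rough illustration of the histogram idea (a sketch under my own assumptions, not necessarily the exact method from the paper), one can keep two fixed-size histograms of scores, one per class, and loop over the bins to estimate the AUC. The class name `HistogramROCAUC` and the `n_bins` parameter are hypothetical, and scores are assumed to lie in [0, 1].

```python
class HistogramROCAUC:
    """Streaming AUC estimate from two score histograms (illustrative only)."""

    def __init__(self, n_bins=100):
        self.n_bins = n_bins
        self.pos = [0] * n_bins  # counts of positive samples per score bin
        self.neg = [0] * n_bins  # counts of negative samples per score bin

    def update(self, y_true, score):
        # Constant-time update: map the score to a bin and increment a counter.
        b = min(max(int(score * self.n_bins), 0), self.n_bins - 1)
        if y_true:
            self.pos[b] += 1
        else:
            self.neg[b] += 1

    def get(self):
        n_pos, n_neg = sum(self.pos), sum(self.neg)
        if n_pos == 0 or n_neg == 0:
            return 0.5  # AUC is undefined with a single class; 0.5 is this sketch's convention
        # Loop over the bins: each positive beats all negatives in lower bins
        # and counts as half a win against negatives in the same bin.
        wins = 0.0
        neg_below = 0
        for p, n in zip(self.pos, self.neg):
            wins += p * (neg_below + 0.5 * n)
            neg_below += n
        return wins / (n_pos * n_neg)


hist_metric = HistogramROCAUC(n_bins=100)
for y, s in [(1, 0.9), (1, 0.8), (0, 0.1), (0, 0.2)]:
    hist_metric.update(y, s)
hist_auc = hist_metric.get()
```

Memory stays at two arrays of `n_bins` counters regardless of stream length; the loop in `get` is the part that a compiled implementation could speed up.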

paulorla · 2023-02-16T14:25:04Z

paulorla
Feb 16, 2023
Author

Thanks @MaxHalford .

0 replies

davidlpgomes · 2023-02-16T14:56:03Z

davidlpgomes
Feb 16, 2023

Hello Everyone, I'm a member of Paulo's team.

I made a C++ implementation of AUC-ROC based on the Wilcoxon statistical test (https://link.springer.com/article/10.1007/s10115-017-1022-8). It uses all the thresholds needed, so the result is closer to the scikit-learn implementation, and it performs the calculation faster than the current River implementation.

Since the result of the C++ implementation is more accurate, some of the pytest tests, namely those that compare against an expected AUC value, no longer pass. Our implementation doesn't support sample weights, so the tests that rely on that functionality don't pass either. Should we commit and open a pull request, or make changes to the tests?

5 replies

MaxHalford Feb 16, 2023
Maintainer

Hey! Thanks for the link to the paper. Can you clarify a little bit how your implementation works? Does it operate on a sliding window? Or does it include all data from the start?

Feel welcome to open a pull request, it's always a good idea to show some code as early in the process as possible.

davidlpgomes Feb 16, 2023

It includes all data from the start. The prequential method, which utilizes a sliding window, is a possible future contribution 😄.

The C++ code implements a batch version of the method, which receives a vector of labels and a vector of scores. It sorts the vectors and iterates over them to calculate the AUC.
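For readers who want the gist without opening the commit, here is a minimal Python sketch of a sort-and-iterate batch AUC using the rank-sum (Wilcoxon/Mann-Whitney) formulation. It is an illustration, not a transcription of the C++ code; the function name `batch_roc_auc` is hypothetical and tied scores are handled with average ranks.

```python
def batch_roc_auc(labels, scores):
    """Exact ROC AUC for a batch via the rank-sum statistic (illustrative only)."""
    pairs = sorted(zip(scores, labels))  # sort samples by score, ascending
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    rank_sum = 0.0  # sum of the (average) ranks of the positive samples
    i = 0
    while i < len(pairs):
        # Find the run of samples sharing the same score.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # 1-based ranks i+1 .. j share their average rank.
        avg_rank = (i + 1 + j) / 2
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j

    # Mann-Whitney U statistic, normalised by the number of positive/negative pairs.
    u = rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


batch_auc = batch_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The cost is dominated by the sort, and every sample has to be available at once, which is why this is a batch method.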

We used Cython to make the integration with Python, as shown in https://cython.readthedocs.io/en/latest/src/userguide/wrapping_CPlusPlus.html.

I made a commit with the implementation and pushed it to my fork: davidlpgomes@e2014a5

MaxHalford Feb 16, 2023
Maintainer

Ok but then what sketching method does it use? How do you not store all samples in memory?

Edit: it seems that you just store the samples in memory. So one of the things about River is that we don't allow this kind of logic. The way we see streaming algorithms is that we want them to run on a potentially infinite stream. In your implementation, the memory usage grows with the number of samples.

davidlpgomes Feb 16, 2023

Yes, it stores all the samples in memory. It avoids the overhead on the CPU but increases memory usage.

With that in mind, we could contribute an efficient version of the prequential method, which uses a sliding window of size S, where S is set in the constructor of the metric. We would use a binary search tree in C++ (map/multimap), based on the paper linked above, storing only S samples and so avoiding catastrophic memory usage.
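A simplified Python sketch of the sliding-window behaviour, to show the bounded memory: note that it swaps the binary search tree for a plain `collections.deque` and recomputes the AUC over the window on demand, so it does not reflect the efficiency of the proposed C++ version. The class name `WindowedROCAUC` and the `window_size` parameter are hypothetical.

```python
from collections import deque


class WindowedROCAUC:
    """ROC AUC over the last `window_size` samples (illustrative only)."""

    def __init__(self, window_size):
        # deque(maxlen=...) evicts the oldest sample automatically,
        # so at most window_size samples are ever stored.
        self.window = deque(maxlen=window_size)

    def update(self, y_true, score):
        self.window.append((bool(y_true), score))

    def get(self):
        pos = [s for y, s in self.window if y]
        neg = [s for y, s in self.window if not y]
        if not pos or not neg:
            return 0.5  # undefined with a single class; 0.5 is this sketch's convention
        # Fraction of (positive, negative) pairs ranked correctly; ties count as half.
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))


win_metric = WindowedROCAUC(window_size=4)
win_metric.update(0, 0.9)  # these two misranked samples will be evicted
win_metric.update(1, 0.1)
early_auc = win_metric.get()
for y, s in [(0, 0.1), (0, 0.2), (1, 0.8), (1, 0.9)]:
    win_metric.update(y, s)
late_auc = win_metric.get()
```

Recomputing is quadratic in the window size here; an ordered structure such as the proposed tree would avoid rescanning the window on every query.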

Let me know your thoughts :)

MaxHalford Feb 16, 2023
Maintainer

Thanks for confirming and clarifying. The sliding window is interesting, but it would still be nice to have an implementation which measures the ROC AUC over the whole dataset. The only way to do this seems to be to use a sketch data structure, such as the histogram in the paper I linked above.

davidlpgomes · 2023-02-17T19:35:31Z

davidlpgomes
Feb 17, 2023

Hey, I implemented the prequential version of the ROCAUC, have a look:

272f195

0 replies

MaxHalford · 2023-02-27T23:03:07Z

MaxHalford
Feb 27, 2023
Maintainer

Hey @davidlpgomes, is there any way you could take a look at this failing job? It seems that the C++ code won't compile on MacOS.

8 replies

davidlpgomes Feb 28, 2023

I suspected as much: extra_compile_args=["-std=c++11"] must apply only to the C++ files.

Sure, I can. I will test some settings :)

davidlpgomes Feb 28, 2023

Hey @MaxHalford, it seems that removing the extra_compile_args from setup.py and adding # distutils: extra_compile_args = "-std=c++11" to efficient_rollingrocauc.pyx limited the -std=c++11 flag only to the RollingROCAUC C++ files (commit on my fork).

Another approach is to specify compile options for each Cython file, as is done in scikit-learn.
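As a rough sketch of what per-extension compile options could look like in a setup.py (the package, module, and file names below are hypothetical, and this is not River's or scikit-learn's actual build configuration), each `setuptools.Extension` carries its own `extra_compile_args`:

```python
from setuptools import Extension

# Hypothetical C++-backed extension: the C++11 flag is attached to it alone.
cpp_extension = Extension(
    name="example_pkg.rolling_metric",
    sources=[
        "example_pkg/rolling_metric.pyx",
        "example_pkg/rolling_metric_impl.cpp",
    ],
    language="c++",
    extra_compile_args=["-std=c++11"],
)

# Hypothetical plain C extension: no C++ flag is passed to its compilation.
c_extension = Extension(
    name="example_pkg.plain_c_module",
    sources=["example_pkg/plain_c_module.pyx"],
)

# In a real setup.py this list would typically go through Cython's cythonize()
# and then be passed to setup(ext_modules=...).
extensions = [cpp_extension, c_extension]
```

The in-file `# distutils:` comment achieves the same scoping from inside the .pyx file instead of from setup.py.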

MaxHalford Feb 28, 2023
Maintainer

I think I prefer the first option; it feels better to have everything in one place.

Can I let you send a pull request?

Thank you so much by the way @davidlpgomes, I'm impressed at how quick you are!

davidlpgomes Feb 28, 2023

No problem, I'll send the pull request. Let's test the build to check it 🤞🏻.

My pleasure! Haha, thanks :)

MaxHalford Mar 6, 2023
Maintainer

Ok the CI is green, thanks a lot ✅
