PySpark Arrow Stream Serializer #3
base: pandas-udf-integration
f92d865 to e54cd16
e54cd16 to 0f294c2
@icexelloss, here is what I had so far. Feel free to use what you like and let me know if you have any questions.
…s Series, modified PythonRDD to support this and maintain backwards compatibility
0f294c2 to 45db636
Thanks Bryan! This is quite a big change. I will take a look this week.
Sure, no problem!
Bryan, you mentioned a ~2.5x speedup comparing this and the original UDF methods. How did you run the experiments? I am trying to reproduce your results.
I was basically using the code below, just manually turning Arrow on/off by commenting out a couple of lines (I left a note in the code as to what needs to be commented out).
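A comparison of this kind could be timed roughly as follows. This is only a sketch and not the code behind the ~2.5x figure: `time_it`, `speedup`, and the two workload functions are hypothetical stand-ins that use only the standard library, and a real run would call the Spark UDF once with Arrow disabled and once with it enabled.

```python
import time


def time_it(fn, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline to candidate time; above 1 means the candidate is faster."""
    return baseline_seconds / candidate_seconds


# Hypothetical stand-in for the original (non-Arrow) UDF path.
def without_arrow_stand_in(n=10_000):
    return [x + 1 for x in range(n)]


# Hypothetical stand-in for the Arrow-enabled UDF path.
def with_arrow_stand_in(n=10_000):
    return list(map(lambda x: x + 1, range(n)))


baseline = time_it(without_arrow_stand_in)
candidate = time_it(with_arrow_stand_in)
print("speedup: %.2fx" % speedup(baseline, candidate))
```

Taking the best of several repeats reduces noise from warm-up, which matters when the two variants are switched by hand between runs.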
* null values properly returned
* create joined row object
Enable UDF evaluation with Arrow using stream format to load as Pandas Series, modified PythonRDD to support this and maintain backwards compatibility.