fix(python): from_arrow_fixed_size_list #16751
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #16751 +/- ##
========================================
Coverage 81.45% 81.45%
========================================
Files 1413 1413
Lines 186306 186045 -261
Branches 2777 2754 -23
========================================
- Hits 151750 151552 -198
+ Misses 34036 33990 -46
+ Partials 520 503 -17
My test is probably too big, but I didn't know what the alternative was. I did some internal tests with row_group sizing and with creating pyarrow tables from parquet files, but the only way I could reproduce the bug was by saving a parquet file with >=131073 rows and then reopening it. With fewer rows than that, it doesn't manifest. Additionally, the Windows runner threw an error because the file was still open, while the Linux runners didn't have a problem, so I added a means to retry 5 times with a 1 second wait.
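The retry described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `retry`, its parameters, and the choice to catch `OSError` are all assumptions, since the comment only says the test retries 5 times with a 1 second wait.

```python
import time


def retry(fn, attempts=5, delay=1.0, exceptions=(OSError,)):
    """Call fn() until it succeeds or `attempts` tries have been used.

    Hypothetical helper: the name and signature are illustrative only.
    PermissionError (raised on Windows when a file is still open) is a
    subclass of OSError, so it is covered by the default `exceptions`.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                # Out of tries: let the last error propagate to the caller.
                raise
            time.sleep(delay)
```

In a test this might wrap whichever file operation fails on Windows, for example `retry(lambda: os.remove(path))`, where `path` stands for the temporary parquet file.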
I want to check whether this is maybe something wrong on the Rust side.
I want to wait on this work before looking into this one: #16747
Yeah, I definitely think there's something upstream to address. There's also the goal of moving over to the stream/capsule protocol instead of pyarrow. One thing I found in the current state of things was that in polars/py-polars/src/interop/arrow/to_rust.rs Lines 56 to 71 in 5a0c803
rb will have however many chunks/batches are in the Table, but arr will be the full length for each batch, which is why the result's length is the original length multiplied by the number of chunks. One thing that OP's issue didn't show is that if there is more than one column, from_arrow will also panic, because the Array column ends up longer than the other columns.
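To make the length arithmetic concrete, here is a toy model in plain Python. It is not the Polars or pyarrow code: the two function names are hypothetical, and batches are represented only by their row counts. It assumes the behaviour described above, where each batch contributes the full-length array instead of its own slice.

```python
def convert_full_per_batch(batch_lengths, full_column):
    """Toy model of the bug: every batch appends the whole column."""
    out = []
    for _ in batch_lengths:
        out.extend(full_column)
    return out


def convert_sliced(batch_lengths, full_column):
    """Toy model of the intended result: each batch appends only its rows."""
    out = []
    offset = 0
    for n in batch_lengths:
        out.extend(full_column[offset:offset + n])
        offset += n
    return out


column = list(range(5))   # 5 rows in total
batches = [3, 2]          # split across 2 batches

print(len(convert_full_per_batch(batches, column)))  # 10, i.e. 5 rows * 2 batches
print(len(convert_sliced(batches, column)))          # 5
```

With a second, correctly converted column of length 5 next to the length-10 one, the columns no longer line up, which matches the multi-column panic mentioned above.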
I only took up bandaiding this on the Python side when I saw that structs and dictionaries already needed the bandaid. I would suggest that if the Rust fix isn't ready before the next release, we put this bandaid on until it is. It can always be taken out, but if the core functionality is broken... well, that's not good for anybody.
fixes #16614
In the process of putting this fix in, I noticed that we'd have a bug if a table had both structs and dictionaries: when those columns are added back, the dictionary_cols and struct_cols are not merged. There isn't any reason to separate the special cases, so I put them in a single dict along with fixed_size_list.
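As a rough sketch of the single-dict idea, assuming the special-cased columns are tracked by their original column index (an assumption, as the actual code is not shown here): the function name `reinsert_special` and the string placeholders are hypothetical and stand in for real converted columns.

```python
def reinsert_special(regular_cols, special_cols):
    """Put separately converted columns back at their original positions.

    `special_cols` is ONE mapping of original column index -> converted
    column, shared by every special case (struct, dictionary,
    fixed_size_list) rather than one dict per case.
    """
    cols = list(regular_cols)
    # Inserting in ascending index order keeps every later index valid.
    for idx in sorted(special_cols):
        cols.insert(idx, special_cols[idx])
    return cols


# Original order: a, s (struct), b, d (dictionary), f (fixed_size_list)
regular = ["a", "b"]
special = {3: "d", 1: "s", 4: "f"}
print(reinsert_special(regular, special))  # ['a', 's', 'b', 'd', 'f']
```

Keeping one mapping means there is a single reinsertion pass, so no special case can be restored without taking the others into account.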