Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust): properly read/write fixed-sized lists from/to parquet files #16747

Merged
merged 1 commit into from
Jun 13, 2024

Conversation

coastalwhite
Copy link
Collaborator

@coastalwhite coastalwhite commented Jun 5, 2024

This PR rewrites many of the parts that relate to Parquet reading and writing of nested fields. This PR is quite large, and I tried my best to extensively test the new implementation, but I am also confident that there are still bugs in it. My view at the moment that these bugs are fringe enough for the moment.

Specifically, this PR merges the RepLevelIter and DefLevelIter into a BufferedDremelIter that properly follows the Parquet specification when it comes to repetition and definition levels. It is also buffered and thus should be faster (although I don't have numbers to back this up). For the reader side, most of the code is adjusted to skip add elements that are implied by invalid higher-level nestings.

This PR also does many, many drive-by refactors in the Parquet code.

Fixes #16608.

@github-actions github-actions bot added fix Bug fix rust Related to Rust Polars labels Jun 5, 2024
@coastalwhite coastalwhite force-pushed the fix-parquet-write branch 11 times, most recently from 441f416 to 794932b Compare June 12, 2024 12:16
@coastalwhite coastalwhite marked this pull request as ready for review June 12, 2024 12:16
@coastalwhite coastalwhite force-pushed the fix-parquet-write branch 5 times, most recently from ac9a541 to 2a91e1c Compare June 12, 2024 12:44
This PR rewrites many of the parts that relate to Parquet reading and writing of nested fields. This PR is quite large, and I tried my best to extensively test the new implementation, but I am also confident that there are still bugs in it. My view at the moment that these bugs are fringe enough for the moment.

Specifically, this PR merges the `RepLevelIter` and `DefLevelIter` into a `BufferedDremelIter` that properly follows the Parquet specification when it comes to *repetition* and *definition* levels. It is also buffered and thus should be faster (although I don't have numbers to back this up). For the reader side, most of the code is adjusted to skip add elements that are implied by invalid higher-level nestings.

This PR also does many, many drive-by refactors in the Parquet code.

Fixes pola-rs#16608.
Copy link

codecov bot commented Jun 12, 2024

Codecov Report

Attention: Patch coverage is 88.55302% with 231 lines in your changes missing coverage. Please review.

Project coverage is 81.53%. Comparing base (2dba716) to head (4829f54).

Files Patch % Lines
crates/polars-parquet/src/parquet/error.rs 21.73% 18 Missing ⚠️
...quet/src/parquet/schema/io_message/from_message.rs 71.92% 16 Missing ⚠️
...tes/polars-parquet/src/parquet/read/page/reader.rs 25.00% 15 Missing ⚠️
crates/polars-parquet/src/parquet/compression.rs 46.15% 14 Missing ⚠️
...olars-parquet/src/arrow/write/nested/dremel/mod.rs 94.88% 13 Missing ⚠️
crates/polars-parquet/src/parquet/page/mod.rs 61.76% 13 Missing ⚠️
...es/polars-parquet/src/parquet/schema/types/spec.rs 27.77% 13 Missing ⚠️
crates/polars-parquet/src/arrow/write/pages.rs 86.74% 11 Missing ⚠️
...rates/polars-parquet/src/parquet/parquet_bridge.rs 46.66% 8 Missing ⚠️
crates/polars-parquet/src/parquet/write/stream.rs 0.00% 8 Missing ⚠️
... and 37 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16747      +/-   ##
==========================================
+ Coverage   81.43%   81.53%   +0.09%     
==========================================
  Files        1425     1422       -3     
  Lines      187987   188429     +442     
  Branches     2704     2704              
==========================================
+ Hits       153093   153633     +540     
+ Misses      34397    34299      -98     
  Partials      497      497              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ritchie46 ritchie46 merged commit a61dcb4 into pola-rs:main Jun 13, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nulls disappear when writing array type (to parquet)
2 participants