Add table statistics #1285

Merged
merged 6 commits into from
Jan 16, 2025

Conversation

Collaborator

@ndrluis ndrluis commented Nov 4, 2024

The Java expire snapshot process expires table statistics and partition statistics. I am implementing a statistics table to make our expire snapshot compatible with the Java implementation.

@ndrluis ndrluis changed the title Add table statistics update Add table statistics Nov 4, 2024
Collaborator Author

ndrluis commented Nov 4, 2024

I plan to move the set/remove statistics methods from the Transaction class to another class, such as ManageSnapshot. In the meantime, I’d like to confirm with everyone if I’m heading in the right direction with the current implementation.

@Fokko @sungwy @kevinjqliu

@ndrluis ndrluis changed the title Add table statistics WIP: Add table statistics Nov 4, 2024
Contributor

@kevinjqliu kevinjqliu left a comment

Thanks for the PR! Added a few comments. I think it would also be helpful to include integration tests

Review threads on:
pyiceberg/table/metadata.py
pyiceberg/table/statistics.py
pyiceberg/table/__init__.py (outdated)
pyiceberg/table/update/__init__.py (outdated)
tests/conftest.py
@ndrluis ndrluis force-pushed the add-statistics branch 2 times, most recently from 9b15c86 to d16ef47 Compare November 10, 2024 23:30
@ndrluis ndrluis requested a review from kevinjqliu November 10, 2024 23:42
@ndrluis ndrluis changed the title WIP: Add table statistics Add table statistics Nov 10, 2024
@ndrluis ndrluis marked this pull request as ready for review November 10, 2024 23:43
Collaborator Author

ndrluis commented Nov 10, 2024

@kevinjqliu could you please review it once more?

Contributor

@kevinjqliu kevinjqliu left a comment

Added a few comments.

Do you know which engines can currently generate Puffin files? It would be great to add an integration test with a Spark-generated Puffin file.

Review threads on:
mkdocs/docs/api.md
mkdocs/docs/api.md (outdated)
pyiceberg/table/statistics.py (outdated)
Collaborator Author

ndrluis commented Nov 12, 2024

> Do you know which engines can currently generate Puffin files? It would be great to add an integration test with a Spark-generated Puffin file.

@kevinjqliu As far as I know, only Trino can generate them. What kind of test would you like to have? I believe we are covering all relevant cases for this PR. If PyIceberg could generate or read puffin files, then I agree it would be useful to add tests to check compatibility between engines. However, I think it only makes sense to test puffin files during reading, as testing generation would mean verifying the implementation of something that isn’t our responsibility. In this case, it’s just a metadata update.

What do you think?

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Thanks for working on this!

Regarding the integration tests, since we're manipulating table metadata to add/remove table stats, it would be great to verify that another source can interact with these stats. Not a hard blocker
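
For readers following along, the add/remove behavior being discussed can be sketched as a pure metadata transformation. This is a minimal illustration, not PyIceberg's actual code: the class and function names below (`StatisticsFile`, `TableMetadata`, `set_statistics`, `remove_statistics`) are hypothetical stand-ins, and the sketch assumes at most one statistics entry per snapshot id.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class StatisticsFile:
    # Hypothetical stand-in for a statistics file entry in table metadata.
    snapshot_id: int
    statistics_path: str


@dataclass(frozen=True)
class TableMetadata:
    # Hypothetical, heavily reduced table metadata: only the statistics list.
    statistics: Tuple[StatisticsFile, ...] = ()


def set_statistics(metadata: TableMetadata, stats: StatisticsFile) -> TableMetadata:
    # Drop any existing entry for the same snapshot, then append the new one.
    kept = tuple(s for s in metadata.statistics if s.snapshot_id != stats.snapshot_id)
    return replace(metadata, statistics=kept + (stats,))


def remove_statistics(metadata: TableMetadata, snapshot_id: int) -> TableMetadata:
    # Keep every entry that does not belong to the given snapshot.
    kept = tuple(s for s in metadata.statistics if s.snapshot_id != snapshot_id)
    return replace(metadata, statistics=kept)
```

Neither function touches the statistics file itself; only the list of references in the metadata changes, which is why verifying the result from a second engine is a separate concern.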

@ndrluis ndrluis mentioned this pull request Nov 24, 2024
@kevinjqliu
Contributor

@ndrluis do you mind resolving the merge conflict here?

Collaborator Author

ndrluis commented Jan 8, 2025

@kevinjqliu Done!

Contributor

@Fokko Fokko left a comment

Hey @ndrluis Thanks for working on this, and sorry for leaving this hanging for so long. I have some small comments, but it looks good to me 👍

Review threads on:
dev/provision.py (outdated)
mkdocs/docs/api.md (outdated)
mkdocs/docs/api.md (outdated)
pyiceberg/table/statistics.py (outdated)
Comment on lines +494 to +495
    if update.snapshot_id != update.statistics.snapshot_id:
        raise ValueError("Snapshot id in statistics does not match the snapshot id in the update")
Contributor

It's a bit of an awkward check, but something that we have to live with I guess.
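
The check looks awkward because the snapshot id appears twice in the update: once at the top level and once inside the statistics entry it carries. A minimal sketch of that validation follows, assuming simplified hypothetical types (`StatsFile`, `SetStatsUpdate`, `validate_set_statistics`) rather than the classes in the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatsFile:
    # Hypothetical statistics entry; carries its own snapshot id.
    snapshot_id: int
    statistics_path: str


@dataclass(frozen=True)
class SetStatsUpdate:
    # Hypothetical update object; repeats the snapshot id at the top level.
    snapshot_id: int
    statistics: StatsFile


def validate_set_statistics(update: SetStatsUpdate) -> None:
    # Reject updates whose two snapshot ids disagree, mirroring the quoted check.
    if update.snapshot_id != update.statistics.snapshot_id:
        raise ValueError("Snapshot id in statistics does not match the snapshot id in the update")
```

Raising early keeps an inconsistent update from ever being applied to the metadata.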

@ndrluis ndrluis force-pushed the add-statistics branch 3 times, most recently from 2034836 to ba64764 Compare January 15, 2025 21:19
@ndrluis ndrluis force-pushed the add-statistics branch 2 times, most recently from 8fe9992 to 217bb95 Compare January 15, 2025 21:43
Contributor

@Fokko Fokko left a comment

Looking good, thanks @ndrluis 🙌

@Fokko Fokko merged commit 0a3a886 into apache:main Jan 16, 2025
8 checks passed
@ndrluis ndrluis deleted the add-statistics branch January 16, 2025 14:14
3 participants