-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(datafusion): Expose DataFusion statistics on an IcebergTableScan #880
base: main
Are you sure you want to change the base?
Conversation
5a30c42
to
80b8d8c
Compare
let statistics = compute_statistics(&self.table, self.snapshot_id) | ||
.await | ||
.unwrap_or(Statistics::new_unknown(self.schema.as_ref())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably this should be computed when instantiating the IcebergTableProvider
(consequently IcebergTableProvider::new
would become async).
That way not only could TableProvider::statistics
be exposed, we'd save a const perf penalty during planning for reading the manifests.
// For each existing/added manifest in the snapshot aggregate the row count, as well as null | ||
// count and min/max values. | ||
for manifest_file in manifest_list.entries() { | ||
let manifest = manifest_file.load_manifest(file_io).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two problems with this approach:
- It maybe quite slow for large table
- The value is incorrect for table with deletions, which maybe quite different.
Also iceberg has table level statistics: https://iceberg.apache.org/spec/#table-statistics But currently it only contains ndv for each column. Should we consider reading this table statistics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, I greatly appreciate it.
Regarding 1, I agree completely (and think it should be done during table instantiation), I mainly wanted to get some validation on the general approach first. (Perhaps it might also be an optional call via something like IcebergTableProvider::with_statistics
, which would be chained after one of the existing construction methods.)
As for point 2, is it not sufficient that I aggregate stats only for manifest_entry.status() != ManifestStatus::Deleted
below? Put another way is it possible for ManifestStatus::Existing | ManifestStatus::Added
entries to contain some misleading stats?
Finally, while I think exposing the spec (puffin) statistics should definitely be implemented, it seems that this is not always available (it may be opt-in for some external writers such as pyiceberg/spark?), so the best course of action for starters is to gather the stats from the manifest (entries) by default.
Closes #869.
Provide detailed statistics via DataFusion's
ExecutionPlan::statistics
for more efficient join planning.The statistics is accumulated from the snapshot's manifests, and converted to the adequate DataFusion struct.