
feat(datafusion): Expose DataFusion statistics on an IcebergTableScan #880

Open
wants to merge 2 commits into
base: main

Conversation

@gruuya gruuya commented Jan 6, 2025

Closes #869.

Provide detailed statistics via DataFusion's ExecutionPlan::statistics for more efficient join planning.

The statistics are accumulated from the snapshot's manifests and converted to the corresponding DataFusion struct.

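As a rough illustration of the approach (a sketch with simplified stand-in types, not the actual iceberg-rust or DataFusion API), folding per-data-file stats from the snapshot's manifests into table-level statistics might look like:

```rust
// Hypothetical, simplified stand-ins for the real Iceberg/DataFusion types.
#[derive(Clone, Copy)]
struct DataFileStats {
    record_count: u64,
    null_count: u64,
    min: i64,
    max: i64,
}

#[derive(Debug, PartialEq)]
struct ColumnStatistics {
    row_count: u64,
    null_count: u64,
    min: i64,
    max: i64,
}

// Fold per-file stats into table-level statistics, which is the shape
// ExecutionPlan::statistics ultimately needs to report.
fn aggregate(files: &[DataFileStats]) -> Option<ColumnStatistics> {
    files.iter().copied().fold(None, |acc, f| {
        Some(match acc {
            None => ColumnStatistics {
                row_count: f.record_count,
                null_count: f.null_count,
                min: f.min,
                max: f.max,
            },
            Some(a) => ColumnStatistics {
                row_count: a.row_count + f.record_count,
                null_count: a.null_count + f.null_count,
                min: a.min.min(f.min),
                max: a.max.max(f.max),
            },
        })
    })
}

fn main() {
    let files = [
        DataFileStats { record_count: 100, null_count: 3, min: 1, max: 50 },
        DataFileStats { record_count: 200, null_count: 0, min: -5, max: 40 },
    ];
    let stats = aggregate(&files).unwrap();
    println!("{stats:?}");
}
```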
@gruuya gruuya force-pushed the datafusion-statistics branch from 5a30c42 to 80b8d8c Compare January 6, 2025 13:41
Comment on lines +134 to +136
let statistics = compute_statistics(&self.table, self.snapshot_id)
.await
.unwrap_or(Statistics::new_unknown(self.schema.as_ref()));
Contributor Author

Arguably this should be computed when instantiating the IcebergTableProvider (consequently, IcebergTableProvider::new would become async).

That way not only could TableProvider::statistics be exposed, we'd also avoid a constant performance penalty during planning from reading the manifests.

// For each existing/added manifest in the snapshot aggregate the row count, as well as null
// count and min/max values.
for manifest_file in manifest_list.entries() {
let manifest = manifest_file.load_manifest(file_io).await?;
Contributor

There are two problems with this approach:

  1. It may be quite slow for large tables.
  2. The value is incorrect for tables with deletions, where it may be quite different.

Also, Iceberg has table-level statistics: https://iceberg.apache.org/spec/#table-statistics. But currently these only contain the NDV (number of distinct values) for each column. Should we consider reading these table statistics?

cc @Fokko @Xuanwo @sdd what do you think?

Contributor Author

Thanks for the feedback, I greatly appreciate it.

Regarding 1, I agree completely (and think it should be done during table instantiation); I mainly wanted to get some validation of the general approach first. (Perhaps it might also be an optional call via something like IcebergTableProvider::with_statistics, which would be chained after one of the existing construction methods.)
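The optional chained call suggested above could be sketched roughly like this (hypothetical, simplified types; the real provider would compute the statistics from the table's manifests asynchronously rather than take them as a parameter):

```rust
// Hypothetical stand-in for DataFusion's Statistics struct.
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: Option<usize>,
}

// Hypothetical, simplified provider demonstrating the proposed
// opt-in `with_statistics` builder step.
struct TableProvider {
    statistics: Option<Statistics>,
}

impl TableProvider {
    fn new() -> Self {
        // Construction stays cheap: no manifests are read here.
        Self { statistics: None }
    }

    // Opt-in: pay the manifest-reading cost once, up front,
    // instead of during every planning pass.
    fn with_statistics(mut self, stats: Statistics) -> Self {
        self.statistics = Some(stats);
        self
    }

    fn statistics(&self) -> Option<Statistics> {
        self.statistics.clone()
    }
}

fn main() {
    let provider = TableProvider::new()
        .with_statistics(Statistics { num_rows: Some(42) });
    println!("{:?}", provider.statistics());
}
```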

As for point 2, is it not sufficient that I aggregate stats only for entries where manifest_entry.status() != ManifestStatus::Deleted below? Put another way, is it possible for ManifestStatus::Existing | ManifestStatus::Added entries to contain misleading stats?
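The filtering referred to above could be sketched as follows (simplified stand-in types; ManifestStatus mirrors the iceberg-rust enum, the rest is hypothetical):

```rust
// Simplified sketch of skipping deleted entries while aggregating row counts.
#[derive(PartialEq, Clone, Copy)]
enum ManifestStatus {
    Existing,
    Added,
    Deleted,
}

// Hypothetical, stripped-down manifest entry.
struct ManifestEntry {
    status: ManifestStatus,
    record_count: u64,
}

fn total_rows(entries: &[ManifestEntry]) -> u64 {
    entries
        .iter()
        // Deleted entries no longer contribute rows to the snapshot.
        .filter(|e| e.status != ManifestStatus::Deleted)
        .map(|e| e.record_count)
        .sum()
}

fn main() {
    let entries = [
        ManifestEntry { status: ManifestStatus::Added, record_count: 10 },
        ManifestEntry { status: ManifestStatus::Deleted, record_count: 5 },
        ManifestEntry { status: ManifestStatus::Existing, record_count: 7 },
    ];
    println!("{}", total_rows(&entries));
}
```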

Finally, while I think exposing the spec (Puffin) statistics should definitely be implemented, they do not seem to be always available (they may be opt-in for some external writers such as pyiceberg/Spark?), so the best course of action for starters is to gather the stats from the manifest (entries) by default.

Successfully merging this pull request may close these issues.

feat: Expose Iceberg table statistics in DataFusion interface(s)