feat(datafusion): Expose DataFusion statistics on an IcebergTableScan #880

gruuya · 2025-01-06T13:37:10Z

Closes #869.

Provide detailed statistics via DataFusion's ExecutionPlan::statistics for more efficient join planning.

The statistics are accumulated from the snapshot's manifests and converted to the corresponding DataFusion struct.
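
As a rough illustration of the accumulation step, here is a minimal sketch that folds per-file metrics into a table-level summary. `FileMetrics`, `TableSummary`, and `aggregate` are hypothetical stand-ins written for this example, not types from iceberg-rust or DataFusion, and bounds are simplified to `i64` values keyed by column id.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the per-file metrics recorded in a manifest entry.
#[derive(Debug, Clone, Default)]
pub struct FileMetrics {
    pub record_count: u64,
    /// Null count per column id.
    pub null_counts: HashMap<i32, u64>,
    /// Lower and upper bound per column id (i64 only, to keep the sketch short).
    pub lower_bounds: HashMap<i32, i64>,
    pub upper_bounds: HashMap<i32, i64>,
}

/// Hypothetical table-level summary built from all files of a snapshot.
#[derive(Debug, Default, PartialEq)]
pub struct TableSummary {
    pub num_rows: u64,
    pub null_counts: HashMap<i32, u64>,
    pub min_values: HashMap<i32, i64>,
    pub max_values: HashMap<i32, i64>,
}

/// Sum row and null counts, and keep the smallest lower bound and the
/// largest upper bound seen for each column.
pub fn aggregate(files: &[FileMetrics]) -> TableSummary {
    let mut summary = TableSummary::default();
    for file in files {
        summary.num_rows += file.record_count;
        for (&col, &nulls) in &file.null_counts {
            *summary.null_counts.entry(col).or_insert(0) += nulls;
        }
        for (&col, &lower) in &file.lower_bounds {
            summary
                .min_values
                .entry(col)
                .and_modify(|m| *m = (*m).min(lower))
                .or_insert(lower);
        }
        for (&col, &upper) in &file.upper_bounds {
            summary
                .max_values
                .entry(col)
                .and_modify(|m| *m = (*m).max(upper))
                .or_insert(upper);
        }
    }
    summary
}
```

The sketch assumes every file reports bounds for the columns of interest; a file with missing bounds would need separate handling before the resulting min/max could be trusted.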

gruuya · 2025-01-06T14:09:43Z

crates/integrations/datafusion/src/table/mod.rs

+        let statistics = compute_statistics(&self.table, self.snapshot_id)
+            .await
+            .unwrap_or(Statistics::new_unknown(self.schema.as_ref()));


Arguably this should be computed when instantiating the IcebergTableProvider (consequently IcebergTableProvider::new would become async).

That way, not only could TableProvider::statistics be exposed, but we'd also avoid a constant performance penalty for reading the manifests during planning.

liurenjie1024 · 2025-01-07T10:29:10Z

crates/integrations/datafusion/src/statistics.rs

+    // For each existing/added manifest in the snapshot aggregate the row count, as well as null
+    // count and min/max values.
+    for manifest_file in manifest_list.entries() {
+        let manifest = manifest_file.load_manifest(file_io).await?;


There are two problems with this approach:

1. It may be quite slow for large tables.

2. The values are incorrect for tables with deletions, and may differ significantly from the actual ones.

Also, Iceberg has table-level statistics: https://iceberg.apache.org/spec/#table-statistics. However, these currently only contain the NDV (number of distinct values) for each column. Should we consider reading these table statistics?

cc @Fokko @Xuanwo @sdd what do you think?

gruuya · 2025-01-07

Thanks for the feedback, I greatly appreciate it.

Regarding 1, I agree completely (and think it should be done during table instantiation); I mainly wanted to get some validation on the general approach first. (Perhaps it could also be an optional call via something like IcebergTableProvider::with_statistics, which would be chained after one of the existing construction methods.)
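
To make the opt-in idea concrete, here is a minimal sketch of a chained `with_statistics` call that computes the statistics once and caches them for later planning. `Provider` and `Stats` are hypothetical stand-ins, the manifest contents are reduced to a list of per-file row counts, and the real method would presumably be async and fallible, which is omitted here.

```rust
/// Hypothetical stand-in for table-level statistics.
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub num_rows: Option<u64>,
}

/// Hypothetical stand-in for the table provider; only statistics handling is modeled.
pub struct Provider {
    /// Row counts of the live data files, standing in for the manifest contents.
    file_row_counts: Vec<u64>,
    /// Populated only if `with_statistics` was called.
    statistics: Option<Stats>,
}

impl Provider {
    /// Cheap constructor: no statistics are computed.
    pub fn new(file_row_counts: Vec<u64>) -> Self {
        Self {
            file_row_counts,
            statistics: None,
        }
    }

    /// Opt-in step: pay the cost of reading the metadata once, up front.
    pub fn with_statistics(mut self) -> Self {
        let num_rows: u64 = self.file_row_counts.iter().sum();
        self.statistics = Some(Stats {
            num_rows: Some(num_rows),
        });
        self
    }

    /// Cheap accessor that planning can call repeatedly without extra I/O.
    pub fn statistics(&self) -> Option<&Stats> {
        self.statistics.as_ref()
    }
}
```

With this shape, `Provider::new(counts)` stays cheap, while `Provider::new(counts).with_statistics()` moves the cost from each planning call to a single call at construction time.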

As for point 2, is it not sufficient that I aggregate stats only for manifest_entry.status() != ManifestStatus::Deleted below? Put another way, is it possible for ManifestStatus::Existing | ManifestStatus::Added entries to contain misleading stats?
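
For reference, the filter being described can be sketched as follows. `Status`, `Entry`, and `live_row_count` are hypothetical stand-ins that only mirror the shape of the manifest entry status check. The sketch does not model row-level delete files, which are one way an Existing or Added data file's record count could overstate the number of live rows.

```rust
/// Hypothetical stand-in mirroring the three manifest entry statuses.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Existing,
    Added,
    Deleted,
}

/// Hypothetical stand-in for a manifest entry, reduced to what the filter needs.
pub struct Entry {
    pub status: Status,
    pub record_count: u64,
}

/// Sum the record counts of all entries that are not marked as deleted.
pub fn live_row_count(entries: &[Entry]) -> u64 {
    entries
        .iter()
        .filter(|entry| entry.status != Status::Deleted)
        .map(|entry| entry.record_count)
        .sum()
}
```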

Finally, while I think exposing the spec (puffin) statistics should definitely be implemented, it seems that these are not always available (they may be opt-in for some external writers such as pyiceberg/spark?), so the best course of action for starters is to gather the stats from the manifest (entries) by default.

Expose DataFusion statistics on an IcebergTableScan

80b8d8c

gruuya force-pushed the datafusion-statistics branch from 5a30c42 to 80b8d8c on January 6, 2025 13:41

Default to unknown statistics upon encountering an error

bae11ea
