Tracking Issue: utilize and track array metadata/statistics #17070

coastalwhite · 2024-06-19T13:49:56Z

This issue tracks the effort to incorporate computationally relevant metadata into the compute engine.

There are 4 components to this effort:

Importing metadata during DataFrame initialization (IO and in-memory)
Exporting metadata when writing a DataFrame
Utilizing available metadata to accelerate computation
Propagating and setting metadata when available

Current Status

This metadata is now kept on ChunkedArray and can be retrieved from a Series with SeriesTrait::get_metadata. The plan is to initially add a lot of functionality behind MetadataEnv::experimental_enabled (which can be enabled with POLARS_METADATA_USE=experimental), and to later enable it by default.

Fields

We keep the following fields:

Sorted: ascending, descending, or none
Fast Explode
Minimum Value
Maximum Value
Distinct Count
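
As a rough illustration of how these fields fit together, here is a minimal standalone sketch. The names `Sortedness`, `ArrayMetadata`, and `from_slice` are hypothetical and are not the Polars types; the sketch computes every field eagerly from a plain `i64` slice, whereas the engine would normally only set a field when it is cheaply known.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the "Sorted" field and its three states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sortedness {
    Ascending,
    Descending,
    NotSorted,
}

/// Hypothetical container for the fields listed above (not the Polars type).
#[derive(Debug, Clone, PartialEq)]
pub struct ArrayMetadata {
    pub sorted: Sortedness,
    pub fast_explode: bool,
    pub min_value: Option<i64>,
    pub max_value: Option<i64>,
    pub distinct_count: Option<usize>,
}

impl ArrayMetadata {
    /// Eagerly compute the fields from a slice (illustration only).
    pub fn from_slice(values: &[i64]) -> Self {
        // A slice is ascending (descending) if every adjacent pair is ordered.
        let ascending = values.windows(2).all(|w| w[0] <= w[1]);
        let descending = values.windows(2).all(|w| w[0] >= w[1]);
        let sorted = if ascending {
            Sortedness::Ascending
        } else if descending {
            Sortedness::Descending
        } else {
            Sortedness::NotSorted
        };
        let distinct: HashSet<i64> = values.iter().copied().collect();
        ArrayMetadata {
            sorted,
            // Assumption: this flag only matters for list columns, so it is left unset.
            fast_explode: false,
            min_value: values.iter().copied().min(),
            max_value: values.iter().copied().max(),
            distinct_count: Some(distinct.len()),
        }
    }
}
```

The optional fields make "unknown" explicit, so an operation can fall back to the regular code path when a statistic is missing.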

IO

We should properly read statistics when reading from files and write them when writing to files. For file formats that can store metadata or statistics, those facilities should be used to save and load these metadata fields.

File Format	Reading	Writing
Parquet	🗸	🗸¹
JSON	✗	N/A
NDJson	✗	N/A
Avro	?	?
CSV	✗	N/A
IPC	?	?
Excel	?	?
Arrow	?	?
Numpy	?	?
Pandas	?	?

Future

List of metadata fields that could be included:

parted: All duplicates are clustered together (e.g. [1, 1, 1, 0, 0, 3, 2, 2])
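
To make the "parted" property concrete, here is a minimal sketch of a check for it. The function name `is_parted` is hypothetical and this is not how Polars would necessarily compute or store the flag; it only spells out the definition: each distinct value occupies exactly one contiguous run.

```rust
use std::collections::HashSet;

/// Returns true if all duplicates are clustered together,
/// i.e. no value starts a second run after its first run ended.
pub fn is_parted(values: &[i64]) -> bool {
    let mut seen = HashSet::new();
    let mut prev: Option<i64> = None;
    for &v in values {
        if prev != Some(v) {
            // A new run starts here; `insert` returns false if `v`
            // already had an earlier run, which breaks the property.
            if !seen.insert(v) {
                return false;
            }
            prev = Some(v);
        }
    }
    true
}
```

With this definition, the example above ([1, 1, 1, 0, 0, 3, 2, 2]) is parted even though it is not sorted, while [1, 0, 1] is not.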

List of operations where metadata can be used or set:

unique_count can just use a bitmask to keep track of existence if the max_value - min_value < 128
expand_from_offset can set the distinct_count to 1
A distinct_count=1 can immediately make a group slice one group (in the group-by)
A group_by can set the distinct_count per key
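
The first item in the list above can be sketched as follows. This is a minimal illustration on a plain `i64` slice, assuming the minimum and maximum are already known from metadata; the function name `unique_count_bitmask` is hypothetical and not part of Polars. A `u128` has exactly 128 bits, which is why a range with `max_value - min_value < 128` fits in a single mask.

```rust
/// Count distinct values with a 128-bit mask when the known value range is
/// narrow enough. Returns None if the range is too wide or a value falls
/// outside the stated bounds (e.g. stale metadata), so the caller can fall
/// back to a general algorithm.
pub fn unique_count_bitmask(values: &[i64], min_value: i64, max_value: i64) -> Option<usize> {
    // checked_sub guards against overflow for extreme bounds.
    let span = max_value.checked_sub(min_value)?;
    if !(0..128).contains(&span) {
        return None;
    }
    let mut mask: u128 = 0;
    for &v in values {
        if v < min_value || v > max_value {
            return None;
        }
        // The offset lies in 0..128 because of the checks above.
        let offset = (v - min_value) as u32;
        mask |= 1u128 << offset;
    }
    Some(mask.count_ones() as usize)
}
```

The appeal of this approach is that it needs no hashing and no allocation; each value costs one subtraction, one shift, and one OR.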

¹ We don't fully support writing Parquet statistics (Support writing Parquet distinct_count statistics for all types #17087), and this writing should be sped up with polars-compute (chore: use polars-compute in polars-parquet statistics #16687).

coastalwhite added the rust (Related to Rust Polars) label Jun 19, 2024

coastalwhite changed the title ~~Tracking Issue: Utilizing and Keeping track of Metadata~~ Tracking Issue: utilize and track array metadata/statistics Jun 20, 2024
