Tracking Issue: utilize and track array metadata/statistics #17070

Open
5 tasks
coastalwhite opened this issue Jun 19, 2024 · 1 comment
Labels
rust Related to Rust Polars

Comments

Collaborator

coastalwhite commented Jun 19, 2024

This issue follows the effort of incorporating computationally relevant metadata into the compute engine.

There are 4 components to this effort:

  1. Importing metadata during DataFrame initialization (IO and in-memory)
  2. Exporting metadata when writing a DataFrame
  3. Utilizing available metadata to accelerate computation
  4. Propagating and setting metadata when available

Current Status

This metadata is now kept on ChunkedArray and can be retrieved from a Series with SeriesTrait::get_metadata. The plan is to initially add a lot of functionality behind MetadataEnv::experimental_enabled (which can be enabled with POLARS_METADATA_USE=experimental), and later to make it enabled by default.

Fields

We keep the following fields:

  • Sorted: ascending, descending, or none
  • Fast Explode
  • Minimum Value
  • Maximum Value
  • Distinct Count
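
As a rough illustration of the fields above, here is a minimal sketch of a container that holds them, plus a helper that derives them from a plain slice. The names IsSorted, ArrayMetadata, and compute_metadata are hypothetical and chosen for this example; they are not the actual Polars types or the return type of SeriesTrait::get_metadata. Fast Explode is assumed here to be a flag that only matters for list columns, so the scalar sketch leaves it unset.

```rust
use std::collections::HashSet;

/// Hypothetical sortedness flag mirroring the "Sorted" field.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum IsSorted {
    Ascending,
    Descending,
    #[default]
    Not,
}

/// Hypothetical container for the listed fields (not the Polars type).
/// Every statistic is optional because it may simply be unknown.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ArrayMetadata<T> {
    pub sorted: IsSorted,
    pub fast_explode: bool,
    pub min_value: Option<T>,
    pub max_value: Option<T>,
    pub distinct_count: Option<usize>,
}

/// Derive all fields from a slice of non-null integers in one place.
pub fn compute_metadata(values: &[i64]) -> ArrayMetadata<i64> {
    let ascending = values.windows(2).all(|w| w[0] <= w[1]);
    let descending = values.windows(2).all(|w| w[0] >= w[1]);
    // Empty, single-element, and constant slices satisfy both checks;
    // this sketch reports them as ascending.
    let sorted = if ascending {
        IsSorted::Ascending
    } else if descending {
        IsSorted::Descending
    } else {
        IsSorted::Not
    };
    let distinct: HashSet<i64> = values.iter().copied().collect();
    ArrayMetadata {
        sorted,
        // Assumption: only meaningful for list columns, so left unset.
        fast_explode: false,
        min_value: values.iter().copied().min(),
        max_value: values.iter().copied().max(),
        distinct_count: Some(distinct.len()),
    }
}
```

Keeping each statistic as an Option lets an operation drop a field it cannot cheaply preserve without invalidating the others.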

IO

We should properly read statistics when reading from files and write them when writing to files. For file formats that can store metadata or statistics, that capability should be used to save and load these metadata fields.

File Format   Reading   Writing
Parquet       🗸         🗸 [1]
JSON          N/A
NDJson        N/A
Avro          ?         ?
CSV           N/A
IPC           ?         ?
Excel         ?         ?
Arrow         ?         ?
Numpy         ?         ?
Pandas        ?         ?

Future

List of metadata fields that could be included:

  • parted: All duplicates are clustered together (e.g. [1, 1, 1, 0, 0, 3, 2, 2])
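
To make the parted property concrete, here is a minimal sketch of a checker over a plain integer slice. The function name is_parted is hypothetical and not part of Polars; it only illustrates the definition given above.

```rust
use std::collections::HashSet;

/// Returns true when all equal values form one contiguous run,
/// i.e. no value reappears after a different value has been seen.
pub fn is_parted(values: &[i64]) -> bool {
    let mut seen = HashSet::new();
    let mut prev: Option<i64> = None;
    for &v in values {
        if prev != Some(v) {
            // A new run starts here. If this value already had a run
            // earlier, its duplicates are not clustered together.
            if !seen.insert(v) {
                return false;
            }
            prev = Some(v);
        }
    }
    true
}
```

With this definition, the example [1, 1, 1, 0, 0, 3, 2, 2] is parted, while [1, 0, 1] is not. Any sorted array is parted, but the reverse does not hold.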

List of operations where metadata can be used or set:

  • unique_count can just use a bitmask to keep track of existence if the max_value - min_value < 128
  • expand_from_offset can set the distinct_count to 1
  • A distinct_count=1 can immediately make a group slice as one group (in the group-by)
  • A group_by can set the distinct_count per key
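
The bitmask idea in the first bullet can be sketched as follows. This is an illustration only: unique_count_bitmask is a hypothetical helper, not a Polars function, and it assumes non-null i64 values with min_value and max_value taken from the metadata. When max_value - min_value < 128, every possible value maps to one bit of a u128, so no hash set is needed.

```rust
/// Count distinct values using a 128-bit mask.
/// Returns None when the fast path does not apply (range too wide)
/// or when a value falls outside the claimed [min_value, max_value].
pub fn unique_count_bitmask(values: &[i64], min_value: i64, max_value: i64) -> Option<usize> {
    // checked_sub guards against overflow for extreme bounds.
    let span = max_value.checked_sub(min_value)?;
    if !(0..128).contains(&span) {
        return None;
    }
    let mut mask: u128 = 0;
    for &v in values {
        let offset = v.checked_sub(min_value)?;
        // The metadata is assumed to be correct; bail out if it is not.
        if !(0..=span).contains(&offset) {
            return None;
        }
        // offset is in 0..128 here, so the shift cannot overflow.
        mask |= 1u128 << (offset as u32);
    }
    Some(mask.count_ones() as usize)
}
```

A caller would fall back to the general hash-based path whenever this returns None.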

Footnotes

  1. We don't fully support writing Parquet statistics yet (see "Support writing Parquet distinct_count statistics for all types", #17087), and this writing should be sped up with polars-compute (see "chore: use polars-compute in polars-parquet statistics", #16687).

@coastalwhite coastalwhite added the rust Related to Rust Polars label Jun 19, 2024
@coastalwhite coastalwhite changed the title Tracking Issue: Utilizing and Keeping track of Metadata Tracking Issue: utilize and track array metadata/statistics Jun 20, 2024