This issue follows the effort of incorporating computationally relevant metadata into the compute engine.
There are 4 components to this effort:
Importing metadata during DataFrame initialization (IO and in-memory)
Exporting metadata when writing a DataFrame
Utilizing available metadata to accelerate computation
Propagating and setting metadata when available
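To make the third component concrete, here is a minimal sketch of how a kernel can exploit sortedness metadata: a `min` that answers in O(1) when the array is known to be sorted, falling back to a scan otherwise. The `Sorted` enum and `min_with_metadata` function are illustrative names, not the actual Polars API.

```rust
// Illustrative sketch: a min() kernel that consults sortedness metadata.
enum Sorted {
    None,
    Ascending,
    Descending,
}

fn min_with_metadata(values: &[i64], sorted: Sorted) -> Option<i64> {
    match sorted {
        // Sorted ascending: the minimum is the first element, O(1).
        Sorted::Ascending => values.first().copied(),
        // Sorted descending: the minimum is the last element, O(1).
        Sorted::Descending => values.last().copied(),
        // No metadata: fall back to a full O(n) scan.
        Sorted::None => values.iter().copied().min(),
    }
}
```

The same dispatch pattern applies to `max`, `arg_min`, and search operations, which can use binary search on sorted data.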
Current Status
This metadata is now kept on ChunkedArray and can be retrieved from a Series with SeriesTrait::get_metadata. The plan is to initially gate much of the new functionality behind MetadataEnv::experimental_enabled (which can be enabled with POLARS_METADATA_USE=experimental), but to later enable it by default.
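A minimal sketch of how such an opt-in flag can be gated behind an environment variable follows; the real MetadataEnv implementation may differ, and `experimental_enabled_from` is a hypothetical helper introduced here so the parsing logic is testable in isolation.

```rust
use std::env;

// Hypothetical helper: decide the flag from an already-read variable value.
fn experimental_enabled_from(value: Option<&str>) -> bool {
    value == Some("experimental")
}

// Illustrative sketch of the gate: read POLARS_METADATA_USE from the
// environment and enable experimental metadata only when it matches.
fn experimental_enabled() -> bool {
    experimental_enabled_from(env::var("POLARS_METADATA_USE").ok().as_deref())
}
```

In practice such a check would typically be cached (e.g. in a `std::sync::OnceLock`) so the environment is only consulted once.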
Fields
We keep the following fields:
Sorted: ascending, descending, or none
Fast Explode
Minimum Value
Maximum Value
Distinct Count
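The fields above can be pictured as a plain struct; this is a simplified sketch for orientation only, and the actual Polars metadata type differs in layout and naming.

```rust
// Illustrative sketch of the metadata fields listed above.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
enum Sortedness {
    #[default]
    None,
    Ascending,
    Descending,
}

#[derive(Debug, Clone, Default)]
struct Metadata<T> {
    // Whether the values are sorted, and in which direction.
    sorted: Sortedness,
    // Whether explode on a list array cannot produce nulls from empty lists.
    fast_explode: bool,
    // Minimum and maximum value, if known.
    min_value: Option<T>,
    max_value: Option<T>,
    // Number of distinct values, if known.
    distinct_count: Option<u64>,
}
```

All fields are optional or defaulted: absent metadata must never make a result wrong, only slower.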
IO
We should properly read and write statistics when reading from and writing to files. For file formats that provide a way to store metadata or statistics, that mechanism should be used to save and load these metadata fields.
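One payoff of loading per-chunk statistics from a format like Parquet is chunk pruning: skipping chunks whose min/max range cannot satisfy a filter. The `ChunkStats` type and `prune_chunks` function below are illustrative, not Polars APIs.

```rust
// Illustrative per-chunk statistics, as a format like Parquet stores them
// per row group / column chunk.
struct ChunkStats {
    min: i64,
    max: i64,
}

/// Keep only the indices of chunks whose [min, max] range can contain
/// `needle`; all other chunks can be skipped without reading their data.
fn prune_chunks(stats: &[ChunkStats], needle: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= needle && needle <= s.max)
        .map(|(i, _)| i)
        .collect()
}
```

The same idea extends to range predicates, and writing accurate statistics on export is what makes this possible for downstream readers.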
coastalwhite changed the title from "Tracking Issue: Utilizing and Keeping track of Metadata" to "Tracking Issue: utilize and track array metadata/statistics" on Jun 20, 2024.
Future
List of metadata fields that could be included:
parted: All duplicates are clustered together (e.g. [1, 1, 1, 0, 0, 3, 2, 2])
List of operations where metadata can be used or set:
unique_count can just use a bitmask to keep track of existence if max_value - min_value < 128
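The bitmask idea can be sketched as follows: when the value range spans fewer than 128 distinct slots, two 64-bit words are enough to record which values occur, and the distinct count is the number of set bits. The function name and fallback behavior are illustrative, not the Polars implementation.

```rust
// Illustrative sketch: distinct count via a 128-bit occupancy bitmask,
// valid when max - min < 128. Assumes every value lies in [min, max].
fn distinct_count_small_range(values: &[i64], min: i64, max: i64) -> Option<u32> {
    if max - min >= 128 {
        return None; // range too wide; caller falls back to a hash-based count
    }
    let mut seen = [0u64; 2]; // 128 bits of "value present" flags
    for &v in values {
        let off = (v - min) as usize;
        seen[off / 64] |= 1u64 << (off % 64);
    }
    // Distinct count = number of set bits across both words.
    Some(seen.iter().map(|w| w.count_ones()).sum())
}
```

This avoids any hashing or sorting and touches each value exactly once.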
expand_from_offset can set the distinct_count to 1
distinct_count=1 can immediately make a group slice as one group (in the group-by)
group_by can set the distinct_count per key
Footnotes
We don't fully support writing parquet statistics (Support writing Parquet distinct_count statistics for all types #17087) and this writing should be sped up with polars-compute (chore: use polars-compute in polars-parquet statistics #16687)