[WIP] Add experimental BalancedParquetEngine implementation #80

rjzamora · 2022-05-03T14:47:21Z

WARNING: This PR is still very rough, but I am sharing it now in case others wish to experiment and/or share feedback.

May address parts of NVIDIA-Merlin/NVTabular#1340

This PR currently adds a balance_partitions= option to the Dataset API. For engine="parquet", this option ultimately uses a new BalancedParquetEngine implementation. The primary goal of this engine is to generate an underlying Dask DataFrame collection with the same row count in every partition. The user may specify this row count manually (with rows_per_partition), or pass in the usual part_size or part_mem_fraction instead.
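To make the "same row count in every partition" goal concrete, here is a minimal sketch of how row ranges from several files could be grouped into equally sized partitions. It is not the code in this PR: plan_partitions, partition_sizes, and the (file_index, start, stop) tuple layout are hypothetical names chosen for illustration, and the sketch assumes the per-file row counts are already known (for example, from Parquet metadata).

```python
from typing import List, Tuple

# (file_index, first_row, last_row_exclusive): a slice of one file's rows
RowRange = Tuple[int, int, int]


def plan_partitions(
    file_row_counts: List[int], rows_per_partition: int
) -> List[List[RowRange]]:
    """Group file row ranges so each partition holds rows_per_partition rows.

    Hypothetical helper, not part of this PR. The final partition holds
    whatever rows are left over, so it may be smaller than the others.
    """
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")

    partitions: List[List[RowRange]] = []
    current: List[RowRange] = []
    room = rows_per_partition  # rows still needed to fill `current`

    for file_index, n_rows in enumerate(file_row_counts):
        start = 0
        while start < n_rows:
            # Take as many rows as fit, possibly splitting the file
            take = min(room, n_rows - start)
            current.append((file_index, start, start + take))
            start += take
            room -= take
            if room == 0:
                partitions.append(current)
                current, room = [], rows_per_partition

    if current:  # leftover rows form a smaller final partition
        partitions.append(current)
    return partitions


def partition_sizes(partitions: List[List[RowRange]]) -> List[int]:
    """Return the total row count of each planned partition."""
    return [sum(stop - start for _, start, stop in part) for part in partitions]
```

For example, plan_partitions([5, 3, 4], 4) splits the first file across two partitions and yields three partitions of four rows each. How the engine in this PR treats a leftover final partition is not covered by this sketch.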

The user may also specify a batch_size to align with. When the batch_size argument is given, the output partition sizes will always be divisible by this number.
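One way to picture the batch_size alignment is as a rounding step applied to the target row count. The sketch below is an assumption, not the PR's implementation: align_to_batch_size is a hypothetical helper, and rounding down (rather than up) is an arbitrary choice made here so that a partition never exceeds the requested size.

```python
def align_to_batch_size(rows_per_partition: int, batch_size: int) -> int:
    """Round rows_per_partition down to a multiple of batch_size.

    Hypothetical helper, not part of this PR. The result is never smaller
    than batch_size, so every partition holds at least one full batch.
    """
    if rows_per_partition <= 0 or batch_size <= 0:
        raise ValueError("rows_per_partition and batch_size must be positive")
    full_batches = rows_per_partition // batch_size
    return max(batch_size, full_batches * batch_size)


# A target of 1000 rows with batch_size=128 becomes 7 full batches
aligned = align_to_batch_size(1000, 128)
batches_per_partition = aligned // 128
```

With a partition size that is an exact multiple of batch_size, a data loader can cut each partition into whole batches without carrying a partial batch over to the next partition.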

TODO:

Handle hive/directory-partitioned datasets
Add an option to specify/guide the total number of desired partitions (for multi-GPU workflows)
More testing, iteration and general cleanup
Does a standalone engine make sense for this?

github-actions · 2022-05-03T14:51:21Z

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-80

rjzamora added 7 commits May 2, 2022 14:06

initial BalancedParquetEngine implementation

6499c29

use fsspec.parquet for remote storage

9514210

align partition sizes with batch_size

9a7ff2c

fix fsspec_parquet bug

e95947a

remove from_map usage for now (it's not even merged upstream yet)

6f84530

add basic test coverage

ca473f6

add warning message

60130ec

cover num_workers argument

a0d7e07

karlhigley assigned rjzamora Jan 30, 2023
