[WIP] Retiring `text_bytes_aware_shuffle` to use `shuffle` directly #316
base: main
Conversation
Signed-off-by: Praateek <[email protected]>
Thanks! Was wondering if you have any small-scale examples that could be added to
This looks good to me, on the assumption that we have verified `output_shuffled_docs_path` looks the same for both `main` and the PR.
…raateek/fuzzy-shuffle-fix
elif os.environ["SHUFFLE_APPROACH"] == "dask_vanilla":
    self._logger.info("Using dask's vanilla shuffle")
    output_df = subset_merged_df.shuffle(on=partition_on)
You could add `shuffle_approach` to `FuzzyDuplicatesConfig` if there's more than one of these that you think we should keep in the codebase. Up to you.
@sarahyurick we won't be keeping all of the approaches; I only added them for my own benchmarking. I'll update the PR with the results, and based on those we'll keep only the `dask_vanilla` approach (once @ayushdg is able to confirm at scale too).
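If more than one approach were kept, the `shuffle_approach` suggestion from the review thread above could look roughly like the sketch below. This is an illustration only: `ShuffleSettings` is a hypothetical stand-in rather than the real `FuzzyDuplicatesConfig`, `pick_shuffle` is a hypothetical helper, and `"text_bytes_aware"` is a placeholder name (only `"dask_vanilla"` appears in the diff).

```python
from dataclasses import dataclass

# Hypothetical names: only "dask_vanilla" appears in the diff;
# the second value is a placeholder for any other approach that is kept.
ALLOWED_SHUFFLE_APPROACHES = ("dask_vanilla", "text_bytes_aware")


@dataclass
class ShuffleSettings:
    """Stand-in for the relevant slice of FuzzyDuplicatesConfig (not the real class)."""

    shuffle_approach: str = "dask_vanilla"

    def __post_init__(self):
        # Fail at construction time instead of deep inside the shuffle step.
        if self.shuffle_approach not in ALLOWED_SHUFFLE_APPROACHES:
            raise ValueError(
                f"shuffle_approach must be one of {ALLOWED_SHUFFLE_APPROACHES}, "
                f"got {self.shuffle_approach!r}"
            )


def pick_shuffle(settings, shuffles):
    """Return the shuffle callable registered under the configured approach."""
    return shuffles[settings.shuffle_approach]
```

Compared with reading `os.environ["SHUFFLE_APPROACH"]` at the call site, a validated config field rejects an unknown value up front and avoids a `KeyError` when the variable is unset.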
Description
Now that cudf supports large strings, we can use shuffle directly.
Fixes #240 and hence #49
Moved to WIP because of performance degradation
old_vs_new_shuffle.zip
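For readers unfamiliar with what "use shuffle directly" means in the description above, here is a conceptual sketch in plain Python of a shuffle on a key column: rows are redistributed so that all rows sharing a key value land in the same output partition. It is not Dask's or cuDF's implementation, and the `shuffle_on` helper and the `bucket` column are hypothetical names used only for illustration.

```python
from collections import defaultdict
import zlib


def shuffle_on(partitions, key, npartitions):
    """Redistribute rows so that rows with equal `key` share an output partition."""
    out = defaultdict(list)
    for part in partitions:
        for row in part:
            # Stable hash of the key value (crc32 avoids Python's per-process hash seed).
            target = zlib.crc32(str(row[key]).encode()) % npartitions
            out[target].append(row)
    return [out[i] for i in range(npartitions)]


# Two input partitions; the "a" rows start out split across them.
parts = [
    [{"bucket": "a", "text": "x"}, {"bucket": "b", "text": "y"}],
    [{"bucket": "a", "text": "z"}],
]
shuffled = shuffle_on(parts, key="bucket", npartitions=2)
```

After the call, both `"a"` rows sit in a single output partition, which is the property the later per-bucket duplicate-detection step relies on.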
Usage
# Add snippet demonstrating usage
Checklist