Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Retiring text_bytes_aware_shuffle to use shuffle directly #316

Draft
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

praateekmahajan
Copy link
Collaborator

@praateekmahajan praateekmahajan commented Oct 21, 2024

Description

Now that cudf supports large strings, we can use shuffle directly.
Fixes #240 and hence #49

Moved to WIP because of performance degradation

old_vs_new_shuffle.zip

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
@praateekmahajan praateekmahajan added the gpuci Run GPU CI/CD on PR label Oct 22, 2024
@sarahyurick
Copy link
Collaborator

Thanks! Was wondering if you have any small-scale examples that could be added to test_fuzzy_dedup.py?

Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me on the assumption we have verified output_shuffled_docs_path looked the same for both main and the PR .

@praateekmahajan praateekmahajan marked this pull request as draft October 23, 2024 20:07
@praateekmahajan praateekmahajan linked an issue Oct 28, 2024 that may be closed by this pull request
@praateekmahajan praateekmahajan changed the title Retiring text_bytes_aware_shuffle to use shuffle directly [WIP] Retiring text_bytes_aware_shuffle to use shuffle directly Oct 31, 2024
Comment on lines +1158 to +1160
elif os.environ["SHUFFLE_APPROACH"] == "dask_vanilla":
self._logger.info("Using dask's vanilla shuffle")
output_df = subset_merged_df.shuffle(on=partition_on)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add shuffle_approach to FuzzyDuplicatesConfig if there's more than one of these that you think we should keep in the codebase. Up to you.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarahyurick we won't be keeping all approaches, I just had them for my own benchmarking efforts. I'll update the PR with the results, and based on those we'll only be keeping the approach of dask_vanilla (once @ayushdg is able to confirm at scale too)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gpuci Run GPU CI/CD on PR
Projects
None yet
3 participants