Dapt data curation tutorial fuzzy and semantic dedupe #322

Open: ruchaa-apte wants to merge 5 commits into base: main


@ruchaa-apte ruchaa-apte commented Oct 24, 2024

Description

This PR adds:

  • Version updates for the pip packages listed in requirements.txt.
  • A fuzzy_dedupe module for the DAPT data curation tutorial for ChipNeMo. It performs fuzzy deduplication on code as well as on text documents from Wikipedia and arXiv sources.
  • A semantic_dedupe module for the DAPT data curation tutorial for ChipNeMo. It performs semantic deduplication on text documents from Wikipedia and arXiv sources.
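
For reviewers less familiar with the term, fuzzy deduplication drops documents that are near-identical rather than byte-identical. The following is a minimal, standard-library-only sketch of one common approach (MinHash over word shingles). It is an illustration under stated assumptions, not the fuzzy_dedupe module added in this PR: the function names, the 3-word shingle size, the 64 hash functions, and the 0.8 threshold are all hypothetical choices made for the example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value per salted hash function.

    Two sets agree on a given minimum with probability equal to their
    Jaccard similarity, so similar sets share many signature entries.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # hypothetical way to derive distinct hash functions
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature entries, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def fuzzy_dedupe(docs, threshold=0.8, num_perm=64):
    """Keep the first document of each near-duplicate group.

    Pairwise comparison is quadratic; it is used here only to keep the
    sketch short.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc), num_perm)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog",  # differs only in case and punctuation
    "Completely different text about chip design flows.",
]
print(fuzzy_dedupe(docs))  # keeps the first and third documents
```

Semantic deduplication differs in that it compares learned embeddings of the documents rather than surface-level shingles, so it is not covered by this sketch.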

Usage

cd NeMo-Curator/tutorials/dapt-curation/code/
python main.py

Checklist

  • [x] I am familiar with the Contributing Guide.
  • [x] New or Existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

@ruchaa-apte
Author

pre-commit.ci autofix

@ayushdg
Collaborator

ayushdg commented Oct 24, 2024

Looks like the commits are not GPG signed. You might need to recommit your changes with commit signing enabled, using the -Ss flags during git commit: -S for signing and -s for sign-off.
More info on signing commits here: https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits

@ruchaa-apte ruchaa-apte force-pushed the dapt_data_curation_fuzzy_dedupe branch from 914cc06 to e177052 on October 24, 2024 22:08
@ruchaa-apte
Author

pre-commit.ci autofix

@ruchaa-apte
Author

> Looks like the commits are not GPG signed. You might need to recommit your changes with commit signing enabled, using the -Ss flags during git commit: -S for signing and -s for sign-off. More info on signing commits here: https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits

Hi Ayush, thank you for pointing that out. It looks like my email address linked to the sign-off was not correct. I fixed it and signed off the commit.

@ruchaa-apte ruchaa-apte reopened this Oct 24, 2024
Signed-off-by: Rucha Apte <[email protected]>
@ruchaa-apte
Author

pre-commit.ci autofix

@ruchaa-apte ruchaa-apte changed the title from "Dapt data curation fuzzy dedupe" to "Dapt data curation tutorial fuzzy and semantic dedupe" Nov 2, 2024
@ayushdg ayushdg requested a review from Maghoumi November 5, 2024 18:39
tutorials/dapt-curation/README.md (resolved)
tutorials/dapt-curation/code/main.py (outdated, resolved)
tutorials/dapt-curation/code/utils.py (outdated, resolved)
tutorials/dapt-curation/code/utils.py (outdated, resolved)
Signed-off-by: Rucha Apte <[email protected]>
@ruchaa-apte ruchaa-apte force-pushed the dapt_data_curation_fuzzy_dedupe branch from b12e26f to 839c081 on November 6, 2024 00:09
@ruchaa-apte
Author

pre-commit.ci autofix

@ayushdg ayushdg added the gpuci (Run GPU CI/CD on PR) label Nov 7, 2024
3 participants