Dapt data curation tutorial fuzzy and semantic dedupe #322
pre-commit.ci autofix
It looks like the commits are not GPG-signed. You might need to recommit your changes with commit signing enabled.
Signed-off-by: Rucha Apte <[email protected]>
914cc06 to e177052
pre-commit.ci autofix
for more information, see https://pre-commit.ci
Hi Ayush, thank you for pointing that out. It looks like the email address linked to my sign-off was not correct. I corrected it and signed off the commit.
Signed-off-by: Rucha Apte <[email protected]>
pre-commit.ci autofix
Signed-off-by: Rucha Apte <[email protected]>
b12e26f to 839c081
pre-commit.ci autofix
for more information, see https://pre-commit.ci
Description
The PR adds:
- `fuzzy_dedupe` module implementation to the DAPT data curation tutorial for ChipNeMo. It performs fuzzy deduplication on code as well as text documents from Wikipedia and arXiv sources.
- `semantic_dedupe` module implementation to the DAPT data curation tutorial for ChipNeMo. It performs semantic deduplication on text documents from Wikipedia and arXiv sources.

Usage
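For readers new to the term, here is a rough, self-contained illustration of the idea behind fuzzy deduplication as named in the description. It is not the module added in this PR and does not use the tutorial's library. It assumes a common approach: MinHash signatures over word shingles, with an estimated Jaccard similarity compared against a threshold. All function names, the shingle size, the number of hashes, and the threshold are hypothetical choices made for this sketch.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a document (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def fuzzy_duplicates(docs, threshold=0.5):
    """Return sorted pairs of document ids whose estimated similarity meets the threshold."""
    sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(sigs), 2)
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold
    ]


# Hypothetical toy corpus: "wiki_1" and "wiki_2" differ by a single word.
corpus = {
    "wiki_1": "a transistor is a semiconductor device used to amplify or switch signals",
    "wiki_2": "a transistor is a semiconductor device used to amplify or switch electrical signals",
    "arxiv_1": "we propose a domain adaptive pretraining recipe for chip design assistants",
}
for pair in fuzzy_duplicates(corpus, threshold=0.5):
    print("near-duplicate pair:", pair)
```

The all-pairs comparison is only practical for toy inputs; large-scale pipelines usually bucket signatures with locality-sensitive hashing first. Semantic deduplication differs in that it generally compares document embeddings rather than surface n-grams, so it can catch paraphrases that share few exact shingles.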
Checklist