Synthetic Data Generation for Retriever Evaluation #338

Open
wants to merge 95 commits into
base: main

@vinay-raman commented Oct 30, 2024

Description

Synthetic data generation for Retriever Evaluation

Usage

python tutorials/synthetic-retrieval-evaluation-customization/main.py \
  --api_key=<API Key> \
  --input_file=tutorials/synthetic-retrieval-evaluation-customization/data/sample_data_rawdoc.jsonl \
  --pipeline-config=tutorials/synthetic-retrieval-evaluation-customization/config/config.yaml \
  --input_format=rawdoc \
  --output_dir=tutorials/synthetic-retrieval-evaluation-customization/outputs/sample_data_rawdoc
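
For scripted or repeated runs, the same invocation can be assembled in Python. This is only a minimal sketch and not part of the PR: `build_command` and `run_pipeline` are hypothetical helpers, the default paths are the sample paths from the command above, and the flag names are copied verbatim from it (including the hyphenated `--pipeline-config`).

```python
import subprocess
import sys

# Tutorial directory as it appears in the usage command above.
TUTORIAL_DIR = "tutorials/synthetic-retrieval-evaluation-customization"


def build_command(
    api_key: str,
    input_file: str = f"{TUTORIAL_DIR}/data/sample_data_rawdoc.jsonl",
    pipeline_config: str = f"{TUTORIAL_DIR}/config/config.yaml",
    input_format: str = "rawdoc",
    output_dir: str = f"{TUTORIAL_DIR}/outputs/sample_data_rawdoc",
) -> list[str]:
    """Hypothetical helper: build the argv list for the tutorial's main.py."""
    return [
        sys.executable,  # run with the current Python interpreter
        f"{TUTORIAL_DIR}/main.py",
        f"--api_key={api_key}",
        f"--input_file={input_file}",
        f"--pipeline-config={pipeline_config}",
        f"--input_format={input_format}",
        f"--output_dir={output_dir}",
    ]


def run_pipeline(api_key: str) -> None:
    """Hypothetical helper: run the tutorial and raise if it exits non-zero."""
    subprocess.run(build_command(api_key), check=True)
```

Passing the arguments as a list avoids shell quoting issues; calling `run_pipeline` from the repository root with a valid API key would launch the tutorial with the sample data.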

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@vinay-raman
Author

@ryantwolf I have added the tests and fixed the signoff checks as well. Thanks!

Collaborator

@ryantwolf left a comment

A few more pieces of feedback, more about the configuration this time.

@vinay-raman force-pushed the sdg_pipeline/retriever_evalset_generation_signoff_fixed branch from 062cac6 to d5dc0ae on November 12, 2024 22:52
ryantwolf and others added 27 commits November 12, 2024 16:17
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
* Sign and add

Signed-off-by: Praateek Mahajan <[email protected]>

* implement strtobool

Signed-off-by: Praateek Mahajan <[email protected]>

* pre-commit

Signed-off-by: Praateek Mahajan <[email protected]>

* Update readme

Signed-off-by: Praateek Mahajan <[email protected]>

---------

Signed-off-by: Praateek Mahajan <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
* Make embedding column flexible for semdedup

Signed-off-by: Ryan Wolf <[email protected]>

* Fix embedding col in add_dist_to_cents

Signed-off-by: Ryan Wolf <[email protected]>

* Add image curation tutorial

Signed-off-by: Ryan Wolf <[email protected]>

* Address Sarah's feedback

Signed-off-by: Ryan Wolf <[email protected]>

* Add output to image curation tutorial

Signed-off-by: Ryan Wolf <[email protected]>

* Add punctuation

Signed-off-by: Ryan Wolf <[email protected]>

* Address Vibhu's comments

Signed-off-by: Ryan Wolf <[email protected]>

---------

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
* Fixing Model Address

Signed-off-by: Chris Alexiuk <[email protected]>

* Fixing Model Address v1

Signed-off-by: Chris Alexiuk <[email protected]>

---------

Signed-off-by: Chris Alexiuk <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
Fix NVIDIA#264. Scikit-learn no longer expects the "affinity" parameter

Signed-off-by: Miguel Martínez <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
* start adding dask-expr support

Signed-off-by: rjzamora <[email protected]>

* add query_planning_enabled util

Signed-off-by: rjzamora <[email protected]>

* add global keyword

Signed-off-by: rjzamora <[email protected]>

* Forgot to remove top level query-planning check

Signed-off-by: rjzamora <[email protected]>

* fix other shuffle-arg problems that don't 'work' with dask-expr

Signed-off-by: rjzamora <[email protected]>

* remove name arg usage for now

Signed-off-by: rjzamora <[email protected]>

* fix bugs

Signed-off-by: rjzamora <[email protected]>

---------

Signed-off-by: rjzamora <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
…VIDIA#256)

* Improve performance in jsonl files

Signed-off-by: miguelusque <[email protected]>

* Improve performance in jsonl files

Signed-off-by: miguelusque <[email protected]>

* Shutdown Dask cluster at exit

Signed-off-by: miguelusque <[email protected]>

* Remove unneeded persist() and wait() operations

Signed-off-by: miguelusque <[email protected]>

* Display only Dask error messages or above

Signed-off-by: miguelusque <[email protected]>

* Cancel any remaining futures

Signed-off-by: miguelusque <[email protected]>

* Remove Dask warning message

Signed-off-by: miguelusque <[email protected]>

* Rename new arguments

Signed-off-by: miguelusque <[email protected]>

* Refactor separate_by_metadata

Signed-off-by: miguelusque <[email protected]>

---------

Signed-off-by: miguelusque <[email protected]>
Co-authored-by: miguelusque <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
* Pin to spacy<3.8 temporarily to unblock CI

Signed-off-by: Ayush Dattagupta <[email protected]>

* Update pin in rapids nightly dep as well

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
Signed-off-by: viraman <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
…meters to filters constructor

Signed-off-by: Vinay Raman <[email protected]>
@vinay-raman force-pushed the sdg_pipeline/retriever_evalset_generation_signoff_fixed branch from d5dc0ae to 052044c on November 13, 2024 00:18