
Update the evaluation method to pass the dataset item to the scoring method #698

Merged: 6 commits merged into main on Nov 22, 2024

Conversation

jverre (Collaborator) commented Nov 22, 2024

Details

Update the evaluation method to pass the dataset item to the scoring method

@jverre jverre requested review from a team as code owners November 22, 2024 13:42
alexkuzmik (Collaborator) left a comment

Regarding the tests:

  1. Please update unit.evaluation.test_evaluate.test_evaluate_happyflow so that task outputs don't duplicate the dataset items' content. Also update e2e.test_experiment.py.
  2. Add a new test similar to unit.evaluation.test_evaluate.test_evaluate_happyflow, but one that uses the mapping functionality.

It should be enough.

@@ -21,6 +21,9 @@ def evaluate(
nb_samples: Optional[int] = None,
task_threads: int = 16,
prompt: Optional[Prompt] = None,
scoring_key_mapping: Optional[
Dict[str, Union[str, Callable[[dataset_item.DatasetItem], Any]]]
alexkuzmik (Collaborator) commented

dataset_item.DatasetItem is not part of our public API; users work with dictionaries.
So it's better to allow them to specify callables that take a dataset item dictionary as input.
We can also make the type hints more verbose with aliases:

DatasetItemDict = Dict[str, Any]
...
scoring_key_mapping: Optional[
    Dict[str, Union[str, Callable[[DatasetItemDict], Any]]]
]
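To illustrate the suggested alias: the sketch below shows a `scoring_key_mapping` mixing a plain key rename with a callable over the item dictionary. `DatasetItemDict` follows the alias proposed above; the mapping keys and the sample item are purely illustrative, not part of the Opik API.

```python
from typing import Any, Callable, Dict, Union

DatasetItemDict = Dict[str, Any]
ScoringKeyMapping = Dict[str, Union[str, Callable[[DatasetItemDict], Any]]]

# Illustrative mapping: a string value renames an existing item key,
# a callable computes a value from the whole item dictionary.
mapping: ScoringKeyMapping = {
    "reference": "expected_output",
    "context": lambda item: item["metadata"]["context"],
}

item: DatasetItemDict = {
    "expected_output": "Paris",
    "metadata": {"context": "geography"},
}

# Resolve the mapping against one dataset item dictionary.
resolved = {
    key: value(item) if callable(value) else item[value]
    for key, value in mapping.items()
}
print(resolved)  # {'reference': 'Paris', 'context': 'geography'}
```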

alexkuzmik (Collaborator) commented

Update: I ran the code and it works with dictionaries as it should, so probably just the type hint is wrong.

@@ -56,12 +56,32 @@ def _score_test_case(
return test_result_


def _create_scoring_inputs(
alexkuzmik (Collaborator) commented Nov 22, 2024

Let's move this to metrics.arguments_helpers.create_score_inputs and rename item to dataset_item.
A few unit tests specifically for this function would be nice; it's a very important and sensitive piece of logic.

) -> Dict[str, Any]:
mapped_inputs = {**item, **task_output}

if scoring_key_mapping is not None:
alexkuzmik (Collaborator) commented Nov 22, 2024

Invert the if condition with a "guard clause" and the nesting will be reduced:

  if scoring_key_mapping is None:
      return mapped_inputs

  for k, v in scoring_key_mapping.items():
      if callable(v):
          mapped_inputs[k] = v(item)
      else:
          mapped_inputs[k] = mapped_inputs[v]

  return mapped_inputs
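Wrapped into a self-contained function, the guard-clause version might look like the sketch below, which also doubles as the kind of unit test asked for earlier in the thread. The function name `create_score_inputs` and the string-vs-callable semantics follow the review suggestions; the sample data is illustrative.

```python
from typing import Any, Callable, Dict, Optional, Union

DatasetItemDict = Dict[str, Any]


def create_score_inputs(
    dataset_item: DatasetItemDict,
    task_output: Dict[str, Any],
    scoring_key_mapping: Optional[
        Dict[str, Union[str, Callable[[DatasetItemDict], Any]]]
    ] = None,
) -> Dict[str, Any]:
    # Task outputs take precedence over dataset item fields on key collisions.
    mapped_inputs = {**dataset_item, **task_output}

    # Guard clause: nothing to remap.
    if scoring_key_mapping is None:
        return mapped_inputs

    for k, v in scoring_key_mapping.items():
        if callable(v):
            # Callables receive the raw dataset item dictionary.
            mapped_inputs[k] = v(dataset_item)
        else:
            # Strings alias an already-merged key under a new name.
            mapped_inputs[k] = mapped_inputs[v]

    return mapped_inputs


# No mapping: plain merge, task output wins on collisions.
assert create_score_inputs({"q": "2+2", "output": "5"}, {"output": "4"}) == {
    "q": "2+2",
    "output": "4",
}

# With mapping: rename via a string, compute via a callable.
scored = create_score_inputs(
    {"expected": "4"},
    {"output": "4"},
    {"reference": "expected", "expected_length": lambda item: len(item["expected"])},
)
assert scored == {
    "expected": "4",
    "output": "4",
    "reference": "4",
    "expected_length": 1,
}
```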

(An outdated comment on sdks/python/src/opik/evaluation/utils.py was marked as resolved.)
@jverre jverre requested a review from alexkuzmik November 22, 2024 16:53
@jverre jverre merged commit 399b5cc into main Nov 22, 2024
23 checks passed
@jverre jverre deleted the jacques/merge_dataset_items_with_task_output branch November 22, 2024 17:35
aadereiko pushed a commit that referenced this pull request Nov 25, 2024