Add hf metric wrapper #599

Open · wants to merge 7 commits into main
Conversation

@antoine-tran (Contributor) commented on Jun 18, 2024

What does this PR do? Please describe:

This PR adds a wrapper around HuggingFace's evaluate.Metric to make it compatible with the fairseq2.metrics APIs. This enables evaluating fairseq2 models on many downstream tasks available in HuggingFace, using standard evaluation metrics such as comet, bleu(t), bertscore, etc.
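
For illustration, a minimal sketch of what such a wrapper could look like, assuming a hypothetical HFMetric class that subclasses torcheval's Metric and delegates to evaluate (this is not the actual code in this PR):

```python
# Hypothetical sketch, not the code in this PR: a torcheval-compatible
# wrapper that delegates update/compute to a HuggingFace evaluate metric.
from typing import Any, Optional

import evaluate
import torch
from torcheval.metrics import Metric


class HFMetric(Metric[Any]):
    def __init__(
        self, metric_name: str, device: Optional[torch.device] = None, **load_kwargs: Any
    ) -> None:
        super().__init__(device=device)
        # evaluate.load() sets up its own multi-process handling via
        # process_id/num_process, so the wrapper does no collective ops itself.
        self._metric = evaluate.load(metric_name, **load_kwargs)

    def update(self, predictions: Any, references: Any) -> "HFMetric":
        # Incremental updates are buffered by evaluate in temporary Arrow tables.
        self._metric.add_batch(predictions=predictions, references=references)
        return self

    def compute(self) -> Any:
        # In a multi-process setup, evaluate returns the final result on
        # rank 0 and None on the other ranks.
        return self._metric.compute()

    def merge_state(self, metrics: Any) -> "HFMetric":
        # Cross-rank merging is handled inside evaluate, so nothing to do here.
        return self
```

With such a wrapper, something like HFMetric("bleu") could be registered in a MetricBag like any torcheval metric.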

Small changes in fairseq2.metrics:

evaluate.Metric has internal support for multi-node evaluation: the incremental metric-updating code runs on separate nodes (writing intermediate results to temporary PyArrow tables), and the final compute() is performed on rank 0 only, pulling the results from those tables. To adapt this to fairseq2.metrics.bag.sync_and_compute_metrics(), we introduce an auto_sync attribute on MetricBag, which is turned on when the bag holds HF metrics. In other words, we leave the synchronization to the underlying metrics rather than to the bag itself.

One caveat is that we cannot mix HF and non-HF metrics within one bag; they have to be put in separate bags.
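
To make the auto_sync behaviour concrete, here is a conceptual sketch of how sync_and_compute_metrics() could branch on it; bag.metrics, bag.gang, and as_process_group() are assumed names and not the actual fairseq2 internals:

```python
# Conceptual sketch only -- the real fairseq2 internals differ. The point is
# that with auto_sync the bag trusts each HF metric to synchronize itself,
# instead of performing a collective sync through torcheval.
from typing import Any, Dict, Optional

from torcheval.metrics.toolkit import sync_and_compute_collection


def sync_and_compute_metrics(bag) -> Optional[Dict[str, Any]]:
    if getattr(bag, "auto_sync", False):
        # HF evaluate metrics pull their partial results from temporary Arrow
        # tables; compute() returns values on rank 0 and None on other ranks.
        values = {name: m.compute() for name, m in bag.metrics.items()}
        return values if bag.gang.rank == 0 else None

    # Default path: collective synchronization of torcheval metrics.
    # `bag.metrics` and `bag.gang` are assumed attribute names here.
    return sync_and_compute_collection(bag.metrics, bag.gang.as_process_group())
```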

Fixes #{issue number}

Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jun 18, 2024
@cbalioglu (Contributor) commented:
I think we might need to reconsider the generic metrics API in fairseq2 before merging this PR. Right now it pretty much relies on torcheval, which apparently takes a different approach to syncing metrics. I have a fundamental question though: in your use case, how do you combine fairseq2 metrics and HF evaluate? It looks like HF has similar APIs, such as combine, that act like MetricBag. Which approach makes more sense in your opinion: (1) using the two APIs in parallel in your code, or (2) abstracting HF away under fairseq2.metrics (i.e. this PR)? If the latter, we might consider defining a Metric interface in fairseq2 (instead of using torcheval's Metric interface) and having derived interfaces for torcheval and HF that fit the corresponding libraries better.

@antoine-tran (Contributor, Author) commented:
> I think we might need to reconsider the generic metrics API in fairseq2 before merging this PR. Right now it pretty much relies on torcheval, which apparently takes a different approach to syncing metrics.

I think the decision to bind fairseq2 to torcheval is a good one. It is a thin, low-level library meant for building further libraries on top. HF, on the other hand, targets ready-to-use scenarios, so it makes sense for it to hide the synchronization logic internally.

> I have a fundamental question though: in your use case, how do you combine fairseq2 metrics and HF evaluate? It looks like HF has similar APIs, such as combine, that act like MetricBag. Which approach makes more sense in your opinion: (1) using the two APIs in parallel in your code, or (2) abstracting HF away under fairseq2.metrics (i.e. this PR)? If the latter, we might consider defining a Metric interface in fairseq2 (instead of using torcheval's Metric interface) and having derived interfaces for torcheval and HF that fit the corresponding libraries better.

Good questions. Until now I have always followed (1), keeping a separate metrics API for evaluation tasks. That has served my use cases (pure evaluation) well, but now I am thinking of running evaluation within the training loop. Having the same MetricBag API lets me "register" the metric states and resume them later.
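
For illustration, a hypothetical sketch of that training-loop usage; HFMetric, register_metric, and the surrounding helpers are assumed names rather than code from this PR:

```python
# Hypothetical usage sketch; names and helpers are assumptions.
from fairseq2.gang import setup_default_gang
from fairseq2.metrics import MetricBag

gang = setup_default_gang()

bag = MetricBag(gang)
bag.register_metric("bleu", HFMetric("bleu"))  # HFMetric: the wrapper sketched above

for batch in data_loader:          # user-provided data loader
    hyps, refs = generate(batch)   # user-provided generation step
    bag.bleu.update(predictions=hyps, references=refs)

# The bag's state_dict() lets the metric states be checkpointed with the
# trainer and restored later, so evaluation can continue across runs.
state = bag.state_dict()
```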

@antoine-tran (Contributor, Author) commented:
@cbalioglu Maybe we can indeed wait a bit for the use case to become clearer and gather more requirements before merging this PR.
