Integrate whisper model with eval interface #2

Draft · wants to merge 3 commits into base: cli/eval/interface

Conversation

Ahmedsaed (Owner):

What does this PR do? Please describe:
Adds integration of the whisper model with the eval interface.

Does your PR introduce any breaking changes? If yes, please list them:
None

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@Ahmedsaed changed the title from "Eval/whisper" to "Integrate whisper model with eval interface" on Aug 1, 2024
antoine-tran commented on Aug 2, 2024:

Taking a step back, here is the current state of the fairseq2 design:

  • In fairseq2, we have an "Evaluator" which encapsulates all routines needed to evaluate a model: the model itself, the dataset, the metrics, I/O (where to record metrics), seeds, etc. The evaluator basically iterates over the dataset reader and, for each batch, spins off one or multiple "EvalUnit"s to run different evaluations and report different metrics.

  • Ideally, we would just have to define an "HFEvalUnit" that evaluates a model using HuggingFace's evaluate metrics. This way, we have just one final Evaluator (think of it as a pipeline) and several units: one for external models like whisper, one for fairseq2 models like wav2vec2, and so on.

  • However, currently the design of fairseq2 is tied to torcheval and declares that every metric is a form of the MetricBag. I tried to resolve this with a wrapper, but we have not reached consensus yet.

  • Therefore, for the time being, we have 2 overlapping versions of the Evaluator, Evaluator and HFEvaluator: one for torcheval metrics and one for HuggingFace metrics.

  • The concept of presets simply means a registry with the pre-defined configuration for dataset, model, etc. The values in a preset can be overridden at runtime (i.e. via the CLI) with the --config KEY VALUE argument (see the sketch after this list).

  • Right now we have a preset registry "hf_presets", which means: a registry of different eval units, all evaluated over the HF data reader. "hf_presets" has one decorator, "librispeech_asr", which means: an eval unit for ASR evaluation of a model (default model is fairseq2 wav2vec2_asr) on a HF dataset reader (default dataset is librispeech_asr). It is important to note that this preset can be customized beyond librispeech and to other models too.
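
To make the preset/override mechanism concrete, here is a rough standalone sketch. It is not the actual fairseq2 API; PresetRegistry and the config fields are illustrative stand-ins for how a preset supplies defaults that --config can override at runtime.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict


@dataclass
class AsrEvalConfig:
    # Illustrative default fields; the real config in the PR has more.
    model_name: str = "wav2vec2_asr_base_10h"
    dataset_name: str = "librispeech_asr"


class PresetRegistry:
    def __init__(self) -> None:
        self._presets: Dict[str, Callable[[], AsrEvalConfig]] = {}

    def decorator(self, name: str):
        # Register a function that builds the default config for this preset.
        def register(fn: Callable[[], AsrEvalConfig]) -> Callable[[], AsrEvalConfig]:
            self._presets[name] = fn
            return fn
        return register

    def resolve(self, name: str, **overrides) -> AsrEvalConfig:
        # Runtime overrides (--config KEY VALUE) win over the preset defaults.
        return replace(self._presets[name](), **overrides)


hf_presets = PresetRegistry()


@hf_presets.decorator("librispeech_asr")
def _librispeech_asr_config() -> AsrEvalConfig:
    return AsrEvalConfig()


# e.g. "fairseq2 eval asr --config model_name=whisper/base" would resolve to:
config = hf_presets.resolve("librispeech_asr", model_name="whisper/base")
```

The point is that a preset only supplies defaults; anything the user passes via --config replaces the corresponding field before the evaluator is built.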

So, if we stick to the above design, a simpler extension to enable Whisper eval would be:

  1. Re-use the HFEvaluator and the hf_presets registry (we don't want 2 registries with 1 preset each, but one central registry storing all the presets).

  2. Make an entry function "load_asr_evaluator(config)" that fans out to 2 functions, one for fairseq2 models such as wav2vec2 and one for the whisper model, based on config.model_name (sketched after this list). These 2 functions are similar, except for the way the model is loaded (either with fairseq2.models.load_model() or whisper.load_model()).

  3. Change the CLI command to be able to run both, i.e.:

    • For wav2vec2: fairseq2 eval asr --config model_name=wav2vec2_asr_base_10h dataset_name=librispeech_asr
    • For Whisper: fairseq2 eval asr --config model_name=whisper/base dataset_name=librispeech_asr
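
A minimal sketch of step 2, continuing the stand-in AsrEvalConfig from the sketch above. The "whisper" prefix check and load_whisper_asr_evaluator are assumptions for illustration; only load_wav2vec2_asr_evaluator exists in the PR (in fairseq2.recipes.eval.asr), and both stubs below stand in for code that would build an HFEvaluator.

```python
def load_wav2vec2_asr_evaluator(config: AsrEvalConfig):
    ...  # build an HFEvaluator around a fairseq2 wav2vec2 ASR model (plus tokenizer/decoder)


def load_whisper_asr_evaluator(config: AsrEvalConfig):
    ...  # build an HFEvaluator around an end-to-end whisper model (whisper.load_model)


def load_asr_evaluator(config: AsrEvalConfig):
    # Single entry point: fan out based on the configured model name.
    if config.model_name.startswith("whisper"):
        return load_whisper_asr_evaluator(config)
    return load_wav2vec2_asr_evaluator(config)
```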


log = get_log_writer(__name__)


def _add_wav2vev2_asr_eval_cli(group: CliGroup) -> None:
from fairseq2.recipes.eval.asr import load_wav2vec2_asr_evaluator
from fairseq2.recipes.eval.asr import ASREvaluator
antoine-tran commented on Aug 2, 2024:

I don't know if changing the function "load_wav2vec2_asr_evaluator" to ASREvaluator is the best way. I'm not picky between having a function or a callable class, but the problem is that Whisper is an end-to-end model, while wav2vec2 is - as the name suggests - only an encoder that generates a vector. For wav2vec2 we need a text tokenizer and decoder, while for Whisper that is not required. So the "ASREvaluator" is still not abstract enough (at least in your current proposal, with self.tokenizer and self.decoder).

Basically, I think we just need 2 functions that generate the HFEvaluator accordingly from its config (see my comments above).
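
To illustrate the asymmetry in hedged pseudocode (every name below is a placeholder, not code from the PR): the wav2vec2 path still has to decode encoder output into text, whereas an end-to-end whisper model transcribes audio directly.

```python
def wav2vec2_batch_to_text(model, tokenizer, decoder, batch):
    # wav2vec2 emits frame-level logits; a decoder and tokenizer are needed to get text.
    logits = model(batch)
    hypotheses = decoder(logits)
    return [tokenizer.decode(h) for h in hypotheses]


def whisper_batch_to_text(model, batch):
    # whisper's transcribe() already returns text end to end (openai-whisper API).
    return [model.transcribe(audio)["text"] for audio in batch]
```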

Ahmedsaed (Owner, Author) replied:

This makes a lot of sense. I was just afraid of having too many functions just to support different models and their requirements, and that's why I created the class.

load_wav2vec2_asr_evaluator,
preset_configs=hf_presets,
ASREvaluator(),
preset_configs=wav2vec2_presets,


we don't need to have 2 registries for wav2vec2 and whisper

@@ -26,6 +30,21 @@ def _add_wav2vev2_asr_eval_cli(group: CliGroup) -> None:
)


def _add_whisper_asr_eval_cli(group: CliGroup) -> None:


This is highly redundant, I think. We can try to parameterize the evaluator setup function; the presets can be customized at runtime.
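
One possible shape for that parameterization, mirroring the existing setup functions in the diff. register_eval_command and its arguments are hypothetical placeholders for whatever the real CliGroup registration call looks like, and load_asr_evaluator is the dispatching entry point from the earlier sketch.

```python
def _add_asr_eval_cli(group: CliGroup) -> None:
    # A single CLI entry covers both wav2vec2 and whisper; the model is chosen at
    # runtime via --config model_name=..., so no separate _add_whisper_asr_eval_cli
    # is needed.
    register_eval_command(              # hypothetical registration helper
        group,
        name="asr",
        loader=load_asr_evaluator,      # dispatching entry point
        preset_configs=hf_presets,      # single shared preset registry
    )
```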

model = load_wav2vec2_asr_model(
config.model_name, device=init_device, dtype=config.dtype
@whisper_presets.decorator("librispeech_asr")
def _whisper_librispeech_asr_config() -> AsrEvalConfig:
antoine-tran commented on Aug 2, 2024:

One default preset is enough, I think. If the user wants to evaluate Whisper with librispeech_asr or other datasets, they should specify it directly at runtime.

Otherwise we will have MxN presets for M models and N datasets :).

Ahmedsaed (Owner, Author) commented:

@antoine-tran thanks for the information. This makes things a lot clearer. The motivation behind the class was to encapsulate the helper functions that were created in the process. I will apply the suggestions and refactor the code back into functions.
