Replies: 1 comment
**TL;DR:** Overall, the APIs are very similar to the APIs we defined. The major differences are:

**APIs:** Llama Stack defines several APIs for model evaluation, including:
**Foreseeable Efforts:** Mapping the Llama Stack APIs to the LM-Eval APIs is fairly straightforward. The extra effort, outside the scope of LM-Eval, is bringing up the model and starting the inference service that LM-Eval calls to perform the evaluation, or locating the inference endpoint of the model under evaluation.

**Flow Diagram**

```mermaid
sequenceDiagram
    CLI ->>+ Llama Stack Eval: submit an evaluation job
    Llama Stack Eval ->>+ LM-Eval-aaS: get the model and forward the request
    LM-Eval-aaS ->>- Llama Stack Eval: return the job id
    Llama Stack Eval ->>- CLI: relay the job id
    LM-Eval-aaS ->>+ Model Inference: perform the evaluation process
    CLI ->>+ Llama Stack Eval: get evaluation results
    Llama Stack Eval ->>+ LM-Eval-aaS: forward the request
    LM-Eval-aaS ->>- Llama Stack Eval: return the evaluation results wrapped as artifacts
    Llama Stack Eval ->>- CLI: return the artifacts
```
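The flow above can be sketched as a thin adapter that relays requests between the CLI and the evaluation backend. This is a minimal in-process sketch, not real Llama Stack or LM-Eval-aaS code: every class, method, model name, and endpoint below is hypothetical and exists only to illustrate the submit/relay/fetch pattern in the diagram.

```python
import uuid

class LMEvalService:
    """Stand-in for LM-Eval-aaS: accepts evaluation jobs and serves results."""

    def __init__(self):
        self.jobs = {}

    def submit(self, model_endpoint, benchmark):
        # A real service would schedule the benchmark run against the
        # model's inference endpoint; here we record it and return a job id.
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {
            "model": model_endpoint,
            "benchmark": benchmark,
            "status": "completed",
        }
        return job_id

    def results(self, job_id):
        return self.jobs[job_id]

class LlamaStackEvalAdapter:
    """Stand-in for the Llama Stack eval provider: resolves the model's
    inference endpoint and forwards requests to the LM-Eval backend."""

    def __init__(self, backend, model_registry):
        self.backend = backend
        self.registry = model_registry  # model id -> inference endpoint

    def run_eval(self, model_id, benchmark):
        endpoint = self.registry[model_id]  # "get the model" step
        return self.backend.submit(endpoint, benchmark)  # forward, relay job id

    def job_result(self, job_id):
        # Fetch backend results and wrap them as artifacts for the CLI.
        return {"artifacts": self.backend.results(job_id)}

# Hypothetical usage, mirroring the CLI's two interactions in the diagram:
registry = {"llama-3-8b": "http://inference.local/v1"}
adapter = LlamaStackEvalAdapter(LMEvalService(), registry)
job_id = adapter.run_eval("llama-3-8b", "mmlu")
print(adapter.job_result(job_id)["artifacts"]["status"])
```

The point of the sketch is the division of labor: the adapter owns model/endpoint resolution, while the LM-Eval backend owns job scheduling and results, which matches the "extra effort" noted above.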
Llama Stack defines the building blocks needed to bring generative AI applications to market. One set of these APIs covers model evaluation, which aligns with the purpose of LM-Eval-aaS in this repository. Let's use this discussion thread to explore possible synergy between the evaluation APIs of Llama Stack and LM-Eval-aaS.