Code evaluation tasks/benchmarks such as HumanEval and MBPP are missing from lm-evaluation-harness, but are present and maintained in bigcode-evaluation-harness:
https://github.com/bigcode-project/bigcode-evaluation-harness

Since we would need to parse tasks and check whether they are in lm-evaluation-harness or bigcode-evaluation-harness, I propose to keep `litgpt evaluate` but add a `--framework` argument: `--framework "lm-evaluation-harness"` (the default if not specified) or `--framework "bigcode-evaluation-harness"`.
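A minimal sketch of how the dispatch could look, assuming a hypothetical `evaluate` entry point. The runner functions (`run_lm_eval`, `run_bigcode_eval`) and the `BIGCODE_TASKS` set are placeholders for illustration only, not actual litgpt or harness APIs:

```python
# Hypothetical sketch of the proposed --framework dispatch.
# run_lm_eval / run_bigcode_eval are stand-ins for the real harness calls.
from typing import List

# Assumption: tasks that only exist in bigcode-evaluation-harness.
BIGCODE_TASKS = {"humaneval", "mbpp"}


def run_lm_eval(tasks: List[str]) -> None:
    print(f"Running {tasks} with lm-evaluation-harness")


def run_bigcode_eval(tasks: List[str]) -> None:
    print(f"Running {tasks} with bigcode-evaluation-harness")


def evaluate(tasks: List[str], framework: str = "lm-evaluation-harness") -> None:
    """Dispatch evaluation to the harness selected by --framework."""
    if framework == "bigcode-evaluation-harness":
        run_bigcode_eval(tasks)
    elif framework == "lm-evaluation-harness":
        # Fail early if a requested task is only available in the bigcode harness.
        missing = [t for t in tasks if t.lower() in BIGCODE_TASKS]
        if missing:
            raise ValueError(
                f"{missing} are only available via --framework 'bigcode-evaluation-harness'"
            )
        run_lm_eval(tasks)
    else:
        raise ValueError(f"Unknown framework: {framework!r}")


if __name__ == "__main__":
    evaluate(["hellaswag"])  # default: lm-evaluation-harness
    evaluate(["humaneval"], framework="bigcode-evaluation-harness")
```

Keeping `lm-evaluation-harness` as the default means existing `litgpt evaluate` invocations would continue to work unchanged.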
Thanks for suggesting this. That's a good idea, in my opinion. I was just reading through EleutherAI/lm-evaluation-harness#1157, and HumanEval and MBPP might eventually come to lm-evaluation-harness, but it's hard to say when.
So, in the meantime, I think it's a good idea to add support as you suggested, with `--framework "lm-evaluation-harness"` as the default. (Please feel free to open a PR if you are interested and have time.)