Code evaluation tasks/benchmarks such as HumanEval and MBPP are missing from lm-evaluation-harness, but are present and maintained in bigcode-evaluation-harness:
https://github.com/bigcode-project/bigcode-evaluation-harness

Since we would need to parse tasks and check whether they are in lm-evaluation-harness or bigcode-evaluation-harness, I propose to keep `litgpt evaluate` but add a `--framework` argument: `--framework "lm-evaluation-harness"` (the default if not specified) or `--framework "bigcode-evaluation-harness"`.
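A minimal sketch of how the dispatch could look, assuming a hypothetical `evaluate` entry point. The runner functions (`run_lm_eval`, `run_bigcode_eval`) and the `BIGCODE_TASKS` set are placeholders for illustration only, not actual litgpt or harness APIs:

```python
# Hypothetical sketch of the proposed --framework dispatch.
# run_lm_eval / run_bigcode_eval are stand-ins for the real harness calls.
from typing import List

# Assumption: tasks that only exist in bigcode-evaluation-harness.
BIGCODE_TASKS = {"humaneval", "mbpp"}


def run_lm_eval(tasks: List[str]) -> None:
    print(f"Running {tasks} with lm-evaluation-harness")


def run_bigcode_eval(tasks: List[str]) -> None:
    print(f"Running {tasks} with bigcode-evaluation-harness")


def evaluate(tasks: List[str], framework: str = "lm-evaluation-harness") -> None:
    """Dispatch evaluation to the harness selected by --framework."""
    if framework == "bigcode-evaluation-harness":
        run_bigcode_eval(tasks)
    elif framework == "lm-evaluation-harness":
        # Fail early if a requested task is only available in the bigcode harness.
        missing = [t for t in tasks if t.lower() in BIGCODE_TASKS]
        if missing:
            raise ValueError(
                f"{missing} are only available via --framework 'bigcode-evaluation-harness'"
            )
        run_lm_eval(tasks)
    else:
        raise ValueError(f"Unknown framework: {framework!r}")


if __name__ == "__main__":
    evaluate(["hellaswag"])  # default: lm-evaluation-harness
    evaluate(["humaneval"], framework="bigcode-evaluation-harness")
```

Keeping `lm-evaluation-harness` as the default means existing `litgpt evaluate` invocations would continue to work unchanged.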
Thanks for suggesting this. That's a good idea, in my opinion. I was just reading through EleutherAI/lm-evaluation-harness#1157, and HumanEval and MBPP might eventually come to lm-evaluation-harness, but it's hard to say when.
So, in the meantime, I think it's a good idea to add support as you suggested, with `--framework "lm-evaluation-harness"` as the default. (Please feel free to open a PR if you are interested and have time.)