# LLM Evaluation

## Using lm-evaluation-harness

You can evaluate Lit-GPT models using EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework, which supports a large number of evaluation tasks.

You need to install the lm-eval framework first:

```bash
pip install https://github.com/EleutherAI/lm-evaluation-harness/archive/refs/heads/master.zip -U
```
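As a quick optional sanity check (a minimal sketch), you can confirm the package imports cleanly before running any evaluations:

```bash
# Should print "lm_eval" without raising an ImportError
python -c "import lm_eval; print(lm_eval.__name__)"
```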


## Evaluating Lit-GPT base models

Use the following command to evaluate Lit-GPT models on all tasks in EleutherAI's Evaluation Harness.

```bash
python eval/lm_eval_harness.py \
    --checkpoint_dir "checkpoints/meta-llama/Llama-2-7b-hf" \
    --precision "bf16-true" \
    --save_filepath "results.json"
```
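Note that `--checkpoint_dir` must point to weights already converted to the Lit-GPT format. If you have not prepared a checkpoint yet, the usual flow looks roughly like the sketch below (script names and flags follow Lit-GPT's download docs and may differ in your version; treat the details as assumptions):

```bash
# Download the Hugging Face weights (access to the gated Llama 2 repo is required)
python scripts/download.py --repo_id meta-llama/Llama-2-7b-hf

# Convert them into the Lit-GPT format expected by --checkpoint_dir
python scripts/convert_hf_checkpoint.py \
    --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf
```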

To evaluate LLMs on specific tasks, for example, TruthfulQA and HellaSwag, you can use the `--eval_tasks` flag as follows:

```bash
python eval/lm_eval_harness.py \
    --checkpoint_dir "checkpoints/meta-llama/Llama-2-7b-hf" \
    --eval_tasks "[truthfulqa_mc,hellaswag]" \
    --precision "bf16-true" \
    --save_filepath "results.json"
```

A list of supported tasks can be found in the lm-evaluation-harness repository.
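If you prefer to query your installed version of the harness directly, you can print its task registry; the one-liner below assumes the older lm-eval API that exposes `lm_eval.tasks.ALL_TASKS`:

```bash
# Print every task name registered in the installed harness
python -c "from lm_eval import tasks; print('\n'.join(tasks.ALL_TASKS))"
```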

## Evaluating LoRA-finetuned LLMs

The above command can be used to evaluate models saved as a single checkpoint file. This includes downloaded checkpoints and base models finetuned via the full and adapter finetuning scripts.

For LoRA-finetuned models, you need to first merge the LoRA weights with the original checkpoint file as described in the Merging LoRA Weights section of the LoRA finetuning documentation.
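As a rough sketch of what that merge step looks like (the script name, paths, and flags here are assumptions; the LoRA finetuning documentation has the authoritative invocation):

```bash
# Fold the trained LoRA weights back into the base weights, producing a
# single checkpoint file that eval/lm_eval_harness.py can load
python scripts/merge_lora.py \
    --checkpoint_dir "checkpoints/meta-llama/Llama-2-7b-hf" \
    --lora_path "out/lora/lit_model_lora_finetuned.pth" \
    --out_dir "out/lora_merged/Llama-2-7b-hf"
```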

## FAQs

- **How do I evaluate on MMLU?**

  MMLU is available in the lm-eval harness, but not under the task name `MMLU`. You can use `hendrycksTest*` as a regex to evaluate on all MMLU subtasks:

  ```bash
  python eval/lm_eval_harness.py \
      --checkpoint_dir "checkpoints/meta-llama/Llama-2-7b-hf" \
      --precision "bf16-true" \
      --eval_tasks "[hendrycksTest*]" \
      --num_fewshot 5 \
      --save_filepath "results.json"
  ```

- **Is TruthfulQA MC available in lm-eval?**

  Yes, it is available as `truthfulqa_mc`.