Official repo for The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Authors: Yifan Song, Guoyin Wang, Sujian Li, Bill Yuchen Lin.
TLDR: Are there performance differences between greedy decoding and sampling methods for LLM generation? The answer is YES!
Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this issue by exploring key questions about the non-determinism of LLM generations, identifying benchmarks’ consistency regarding non-determinism, and examining unique model behaviors.
Here are our findings:
- A notable performance gap is observed between greedy decoding and sampling generation.
- Greedy decoding outperforms sampling on most evaluated benchmarks, except for AlpacaEval.
- Math reasoning and code generation are the tasks most affected by sampling variance.
- The above findings remain consistent across different sizes and families of LLMs.
- Alignment methods, e.g., DPO, can significantly reduce the sampling variance for most benchmarks.
- A high temperature significantly harms the reasoning and code generation capabilities of LLMs, while a higher repetition penalty leads to improved performance on AlpacaEval (see the decoding-configuration sketch after this list).
- In the best-of-N sampling setting, 7B-level LMs have the potential to outperform GPT-4-Turbo.
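For reference, the two decoding modes compared above map onto the following Hugging Face `transformers` settings. This is a minimal sketch: the model name and hyperparameter values are illustrative, not the paper's exact evaluation configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # one of the evaluated model families
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)

# Greedy decoding: deterministic, always picks the most likely next token.
greedy = model.generate(inputs, do_sample=False, max_new_tokens=256)

# Sampling: non-deterministic; temperature and repetition_penalty are the
# knobs whose effects are discussed above (values here are illustrative).
sampled = model.generate(
    inputs, do_sample=True, temperature=1.0, top_p=1.0,
    repetition_penalty=1.0, max_new_tokens=256,
)
```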
This project consists of three parts: LLM non-determinism analysis, reward data preparation, and best-of-N experiments.

- `analyse`: analyse the results of non-determinism generation on seven benchmarks: AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.
- `get_benchmark_rewards`: prepare reward data for the best-of-N experiments, using cutting-edge reward models to score sampled model responses.
- `best_of_n_eval`: unveil the potential of non-deterministic LLM generation with a best-of-N strategy, selecting the best response from N sampled generations.
- Download the LLM samples from Hugging Face.
- Install the dependencies:

```bash
pip install -r requirements.txt
```
We evaluate non-determinism generation of LLMs on seven benchmarks: AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.
| Dataset | # Instances | Samples per Instance | Metric |
|---|---|---|---|
| AlpacaEval 2 | 805 | 16 | LC Win Rate |
| Arena-Hard | 500 | 16 | Win Rate |
| WildBench v2 | 1024 | 16 | WB-Score |
| MixEval | 4000 | 16 | Score |
| MMLU-Redux | 3000 | 32 | Acc |
| GSM8K | 1319 | 128 | EM |
| HumanEval | 164 | 128 | Pass@1 |
Taking AlpacaEval as an example, you can analyse the 16 sampled generations:

```bash
bash scripts/eval_alpacaeval_sample_baseline.sh <DATA_PATH>/alpaca_eval
```
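Conceptually, the analysis reduces to aggregating the benchmark score across the sampled runs. A minimal sketch, assuming a hypothetical file layout and `score` field (the real scripts parse each benchmark's own output format):

```python
import json
import statistics

# Hypothetical layout: one score per sampled run, e.g. LC win rates for
# AlpacaEval across the 16 sampled generations.
scores = [json.load(open(f"results/sample_{i}.json"))["score"] for i in range(16)]

print(f"mean={statistics.mean(scores):.2f}  "
      f"std={statistics.stdev(scores):.2f}  "
      f"min={min(scores):.2f}  max={max(scores):.2f}")
```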
From the results, we observe a consistent performance gap between greedy decoding and sampling. Greedy decoding proves more effective on most tasks, with AlpacaEval being the exception.
First, employ off-the-shelf reward models to obtain rewards for the LLM generations. We have implemented reward-modeling code for Starling-RM, Eurus-RM, FsfairX, and ArmoRM. Taking AlpacaEval as an example:

```bash
bash scripts/get_alpacaeval_sample_rewards.sh Meta-Llama-3-8B-Instruct <DATA_PATH>/alpaca_eval
```
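Each reward model has its own loading code; as a rough sketch of what scoring looks like with a sequence-classification-style reward model (the model name and loading interface here are assumptions; consult each model's card for the exact usage):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: substitute the reward model you use, e.g. one of
# Starling-RM, Eurus-RM, FsfairX, or ArmoRM.
rm_name = "sfairXC/FsfairX-LLaMA3-RM-v0.1"
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_name, num_labels=1, torch_dtype=torch.bfloat16, device_map="auto"
)

def score(prompt: str, response: str) -> float:
    """Return a scalar reward for one (prompt, response) pair."""
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    inputs = tokenizer.apply_chat_template(
        chat, return_tensors="pt").to(reward_model.device)
    with torch.no_grad():
        return reward_model(inputs).logits[0][0].item()
```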
The reward information will be saved in `benchmark_rewards/{model_name}/{dataset}/{reward_model_name}.json`.
Then, evaluate in a best-of-N setting:
```bash
bash scripts/eval_alpacaeval_sample_reward.sh <DATA_PATH>/alpaca_eval
```
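Under the hood, best-of-N selection simply keeps, for each instruction, the sampled response the reward model scores highest. A minimal sketch, assuming a hypothetical record format (the real files follow whatever the reward scripts write):

```python
import json

# Hypothetical record format: each entry holds the N sampled responses and
# their rewards for one instruction.
records = json.load(open(
    "benchmark_rewards/Meta-Llama-3-8B-Instruct/alpaca_eval/ArmoRM.json"))

best = []
for rec in records:
    # Best-of-N: keep the highest-reward response.
    idx = max(range(len(rec["rewards"])), key=lambda i: rec["rewards"][i])
    best.append({"instruction": rec["instruction"],
                 "output": rec["responses"][idx]})
```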
With oracle selection, even smaller LLMs such as Llama-3-8B-Instruct can outperform GPT-4-Turbo on MMLU, GSM8K, and HumanEval. Cutting-edge reward models can likewise select superior responses from multiple generations, outperforming GPT-4-Turbo on GSM8K with only 8 samples.
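On benchmarks with verifiable answers (MMLU, GSM8K, HumanEval), the oracle number amounts to counting an instance as solved if any of its N samples is correct. A minimal sketch:

```python
# Oracle best-of-N on a benchmark with checkable answers (e.g. GSM8K):
# an instance counts as solved if ANY of its N samples is correct.
def oracle_best_of_n(per_sample_correct: list[list[bool]]) -> float:
    """per_sample_correct[i][j]: whether sample j on instance i is correct."""
    return sum(any(row) for row in per_sample_correct) / len(per_sample_correct)
```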
If you find this repo helpful, please cite our paper:
```bibtex
@article{song2024good,
  author={Yifan Song and Guoyin Wang and Sujian Li and Bill Yuchen Lin},
  title={The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism},
  year={2024},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```