
Evaluation of LLMs Should Not Ignore Non-Determinism

🤗 Dataset | 📖 arXiv

Official repo for The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

Authors: Yifan Song, Guoyin Wang, Sujian Li, Bill Yuchen Lin.

TLDR: Are there performance differences between greedy decoding and sampling methods for LLM generation? The answer is YES!

Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this issue by exploring key questions about the non-determinism of LLM generations, identifying benchmarks’ consistency regarding non-determinism, and examining unique model behaviors.

Here are our findings:

  • A notable performance gap is observed between greedy decoding and sampling generation.
  • Greedy decoding outperforms sampling on most evaluated benchmarks, except for AlpacaEval.
  • Math reasoning and code generation are the most impacted by sampling variance.
  • The above findings remain consistent across different sizes and families of LLMs.
  • Alignment methods, e.g., DPO, can significantly reduce the sampling variance for most benchmarks.
  • A high temperature significantly harms the reasoning and code generation capabilities of LLMs, while a higher repetition penalty leads to improved performance on AlpacaEval.
  • In the best-of-N sampling setting, 7B-level LMs have the potential to outperform GPT-4-Turbo.

🧩 Structure of This Project

This project has two parts: LLM non-determinism analysis and best-of-N experiments.

analyse: analyses the results of non-deterministic generation on seven benchmarks, including AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.

get_benchmark_rewards: prepares reward data for the best-of-N experiments, using cutting-edge reward models to score the sampled model responses.

best_of_n_eval: unveils the potential of non-deterministic LLM generation with a best-of-N strategy, selecting the best response from N sampled generations.

🛠️ Setup

  1. Download the LLM samples from Huggingface (see the sketch after this list).
  2. pip install -r requirements.txt
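For step 1, here is a minimal sketch of downloading the samples with the huggingface_hub client; the repo id below is a placeholder for the dataset linked under 🤗 Dataset above, and the local directory is an assumption you can change.

from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with the dataset repo linked under "🤗 Dataset" above.
data_path = snapshot_download(
    repo_id="<HF_DATASET_REPO_ID>",
    repo_type="dataset",
    local_dir="data",  # used as <DATA_PATH> in the scripts below
)
print(f"LLM samples downloaded to {data_path}")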

📊 LLM Non-Determinism Analysis

We evaluate the non-determinism of LLM generation on seven benchmarks: AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.

Dataset         Instance Num.   Sample Num.   Metric
AlpacaEval 2    805             16            LC
Arena-Hard      500             16            Win Rate
WildBench v2    1024            16            WB-Score
MixEval         4000            16            Score
MMLU-Redux      3000            32            Acc
GSM8K           1319            128           EM
HumanEval       164             128           Pass@1

Taking AlpacaEval as an example, you can analyse the 16 sampled generations:

bash scripts/eval_alpacaeval_sample_baseline.sh <DATA_PATH>/alpaca_eval
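Conceptually, the analysis summarizes the spread of the benchmark metric across the sampled runs and compares it with the greedy-decoding score. The sketch below illustrates that computation with hypothetical numbers; it is not the script's actual output format.

import statistics

greedy_score = 22.9                             # hypothetical greedy-decoding LC score
sample_scores = [21.5, 23.1, 22.0, 24.2, 22.8]  # hypothetical scores of 5 sampled runs

best, worst = max(sample_scores), min(sample_scores)
mean, std = statistics.mean(sample_scores), statistics.stdev(sample_scores)
print(f"greedy={greedy_score:.1f}  sampling mean={mean:.1f}±{std:.1f}  "
      f"best={best:.1f}  worst={worst:.1f}  spread={best - worst:.1f}")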

From the results, we observe a consistent performance gap between greedy decoding and the sampling method. Greedy decoding generally proves more effective for most tasks, except for AlpacaEval.

🚀 Best-of-N Evaluation

First, employ off-the-shelf reward models to obtain rewards for the LLM generations. We have implemented reward modeling code for Starling-RM, Eurus-RM, FsfairX, and ArmoRM. Taking AlpacaEval as an example:

bash scripts/get_alpacaeval_sample_rewards.sh Meta-Llama-3-8B-Instruct <DATA_PATH>/alpaca_eval

The reward information will be saved in benchmark_rewards/{model_name}/{dataset}/{reward_model_name}.json. Then, evaluate in a best-of-N setting:

bash scripts/eval_alpacaeval_sample_reward.sh <DATA_PATH>/alpaca_eval

With oracle selection, even smaller LLMs like Llama-3-8B-Instruct can outperform GPT-4-Turbo on MMLU, GSM8K, and HumanEval. Furthermore, cutting-edge reward models can also select superior responses from multiple generations, outperforming GPT-4-Turbo on GSM8K with only 8 samples.
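For intuition, the sketch below shows the core of best-of-N selection: for each instance, keep the sampled response with the highest reward. The JSON schema in the comments is an assumption for illustration; the actual files written by get_benchmark_rewards may be organized differently.

import json

def best_of_n(reward_file: str) -> dict:
    # Assumed schema: {instance_id: [{"response": str, "reward": float}, ...]}
    with open(reward_file) as f:
        rewards = json.load(f)
    # For each instance, select the response with the highest reward among the N samples.
    return {
        instance_id: max(candidates, key=lambda c: c["reward"])["response"]
        for instance_id, candidates in rewards.items()
    }

# Example path following the pattern above; the reward-model file name is an assumption.
selected = best_of_n("benchmark_rewards/Meta-Llama-3-8B-Instruct/alpaca_eval/ArmoRM.json")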

📖 Citation

If you find this repo helpful, please cite our paper:

@article{song2024good,
    author={Yifan Song and Guoyin Wang and Sujian Li and Bill Yuchen Lin},
    title={The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism},
    year={2024},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
