# Cataloguing LLM Evaluations

The table below provides a comprehensive catalogue of the Large Language Model (LLM) evaluation frameworks, benchmarks and papers surveyed in our paper, "Cataloguing LLM Evaluations", organized according to the taxonomy proposed there.

LLM evaluation is advancing at a rapid pace, and collaboration with the broader community is essential to keeping this catalogue relevant and useful.

To that end, we invite submissions of LLM evaluation frameworks, benchmarks, and papers for inclusion in this catalogue.

Before you raise a PR for a new submission, please read our contribution guidelines. Submissions will be reviewed and integrated into the catalogue on a rolling basis.

For any inquiries, feel free to reach out to us at [email protected].

| Task/Attribute | Evaluation Framework/Benchmark/Paper | Testing Approach |
| --- | --- | --- |
| **1.1. Natural Language Understanding** | | |
| Text classification | HELM<br>• Miscellaneous text classification | Benchmarking |
| | Big-bench<br>• Emotional understanding<br>• Intent recognition<br>• Humor | Benchmarking |
| | Hugging Face<br>• Text classification<br>• Token classification<br>• Zero-shot classification | Benchmarking |
| Sentiment analysis | HELM<br>• Sentiment analysis | Benchmarking |
| | Evaluation Harness<br>• GLUE | Benchmarking |
| | Big-bench<br>• Emotional understanding | Benchmarking |
| Toxicity detection | HELM<br>• Toxicity detection | Benchmarking |
| | Evaluation Harness<br>• ToxiGen | Benchmarking |
| | Big-bench<br>• Toxicity | Benchmarking |
| Information retrieval | HELM<br>• Information retrieval | Benchmarking |
| Sufficient information | Big-bench<br>• Sufficient information | Benchmarking |
| | FLASK<br>• Metacognition | Benchmarking (with human and model scoring) |
| Natural language inference | Big-bench<br>• Analytic entailment (specific task)<br>• Formal fallacies and syllogisms with negation (specific task)<br>• Entailed polarity (specific task) | Benchmarking |
| | Evaluation Harness<br>• GLUE | Benchmarking |
| General English understanding | HELM<br>• Language | Benchmarking |
| | Big-bench<br>• Morphology<br>• Grammar<br>• Syntax | Benchmarking |
| | Evaluation Harness<br>• BLiMP | Benchmarking |
| | Eval Gauntlet<br>• Language Understanding | Benchmarking |
| **1.2. Natural Language Generation** | | |
| Summarization | HELM<br>• Summarization | Benchmarking |
| | Big-bench<br>• Summarization | Benchmarking |
| | Evaluation Harness<br>• BLiMP | Benchmarking |
| | Hugging Face<br>• Summarization | Benchmarking |
| Question generation and answering | HELM<br>• Question answering | Benchmarking |
| | Big-bench<br>• Contextual question answering<br>• Reading comprehension<br>• Question generation | Benchmarking |
| | Evaluation Harness<br>• CoQA<br>• ARC | Benchmarking |
| | FLASK<br>• Logical correctness<br>• Logical robustness<br>• Logical efficiency<br>• Comprehension<br>• Completeness | Benchmarking (with human and model scoring) |
| | Hugging Face<br>• Question answering | Benchmarking |
| | Eval Gauntlet<br>• Reading comprehension | Benchmarking |
| Conversations and dialogue | MT-bench | Benchmarking (with human and model scoring) |
| | Evaluation Harness<br>• MuTual | Benchmarking |
| | Hugging Face<br>• Conversational | Benchmarking |
| Paraphrasing | Big-bench<br>• Paraphrase | Benchmarking |
| Other response qualities | FLASK<br>• Readability<br>• Conciseness<br>• Insightfulness | Benchmarking (with human and model scoring) |
| | Big-bench<br>• Creativity | Benchmarking |
| | Putting GPT-3's Creativity to the (Alternative Uses) Test | Benchmarking (with human scoring) |
| Miscellaneous text generation | Hugging Face<br>• Fill-mask<br>• Text generation | Benchmarking |
| **1.3. Reasoning** | HELM<br>• Reasoning | Benchmarking |
| | Big-bench<br>• Algorithms<br>• Logical reasoning<br>• Implicit reasoning<br>• Mathematics<br>• Arithmetic<br>• Algebra<br>• Mathematical proof<br>• Fallacy<br>• Negation<br>• Computer code<br>• Probabilistic reasoning<br>• Social reasoning<br>• Analogical reasoning<br>• Multi-step<br>• Understanding the World | Benchmarking |
| | Evaluation Harness<br>• PIQA, PROST - Physical reasoning<br>• MC-TACO - Temporal reasoning<br>• MathQA - Mathematical reasoning<br>• LogiQA - Logical reasoning<br>• SAT Analogy Questions - Similarity of semantic relations<br>• DROP, MuTual - Multi-step reasoning | Benchmarking |
| | Eval Gauntlet<br>• Commonsense reasoning<br>• Symbolic problem solving<br>• Programming | Benchmarking |
| **1.4. Knowledge and factuality** | HELM<br>• Knowledge | Benchmarking |
| | Big-bench<br>• Context Free Question Answering | Benchmarking |
| | Evaluation Harness<br>• HellaSwag, OpenBookQA - General commonsense knowledge<br>• TruthfulQA - Factuality of knowledge | Benchmarking |
| | FLASK<br>• Background Knowledge | Benchmarking (with human and model scoring) |
| | Eval Gauntlet<br>• World Knowledge | Benchmarking |
| **1.5. Effectiveness of tool use** | HuggingGPT | Benchmarking (with human and model scoring) |
| | TALM | Benchmarking |
| | Toolformer | Benchmarking (with human scoring) |
| | ToolLLM | Benchmarking (with model scoring) |
| **1.6. Multilingualism** | Big-bench<br>• Low-resource language<br>• Non-English<br>• Translation | Benchmarking |
| | Evaluation Harness<br>• C-Eval (Chinese evaluation suite)<br>• MGSM<br>• Translation | Benchmarking |
| | BELEBELE | Benchmarking |
| | MASSIVE | Benchmarking |
| | HELM<br>• Language (Twitter AAE) | Benchmarking |
| | Eval Gauntlet<br>• Language Understanding | Benchmarking |
| **1.7. Context length** | Big-bench<br>• Context length | Benchmarking |
| | Evaluation Harness<br>• SCROLLS | Benchmarking |
| **2.1. Law** | LegalBench | Benchmarking (with algorithmic and human scoring) |
| **2.2. Medicine** | Large Language Models Encode Clinical Knowledge | Benchmarking (with human scoring) |
| | Towards Generalist Biomedical AI | Benchmarking (with human scoring) |
| **2.3. Finance** | BloombergGPT | Benchmarking |
| **3.1. Toxicity generation** | HELM<br>• Toxicity | Benchmarking |
| | DecodingTrust<br>• Toxicity | Benchmarking |
| | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| | Red Teaming Language Models with Language Models | Automated Red Teaming |
| **3.2. Bias** | | |
| Demographical representation | HELM | Benchmarking |
| | Finding New Biases in Language Models with a Holistic Descriptor Dataset | Benchmarking |
| Stereotype bias | HELM<br>• Bias | Benchmarking |
| | DecodingTrust<br>• Stereotype Bias | Benchmarking |
| | Big-bench<br>• Social bias<br>• Racial bias<br>• Gender bias<br>• Religious bias | Benchmarking |
| | Evaluation Harness<br>• CrowS-Pairs | Benchmarking |
| | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| Fairness | DecodingTrust<br>• Fairness | Benchmarking |
| Distributional bias | Red Teaming Language Models with Language Models | Automated Red Teaming |
| Representation of subjective opinions | Towards Measuring the Representation of Subjective Global Opinions in Language Models | Benchmarking |
| Political bias | From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models | Benchmarking |
| | The Self-Perception and Political Biases of ChatGPT | Benchmarking |
| Capability fairness | HELM<br>• Language (Twitter AAE) | Benchmarking |
| **3.3. Machine ethics** | DecodingTrust<br>• Machine Ethics | Benchmarking |
| | Evaluation Harness<br>• ETHICS | Benchmarking |
| **3.4. Psychological traits** | Does GPT-3 Demonstrate Psychopathy? | Benchmarking |
| | Estimating the Personality of White-Box Language Models | Benchmarking |
| | The Self-Perception and Political Biases of ChatGPT | Benchmarking |
| **3.5. Robustness** | HELM<br>• Robustness to contrast sets | Benchmarking |
| | DecodingTrust<br>• Out-of-Distribution Robustness<br>• Adversarial Robustness<br>• Robustness Against Adversarial Demonstrations | Benchmarking |
| | Big-bench<br>• Out-of-Distribution Robustness | Benchmarking |
| | Susceptibility to Influence of Large Language Models | Benchmarking |
| **3.6. Data governance** | DecodingTrust<br>• Privacy | Benchmarking |
| | HELM<br>• Memorization and copyright | Benchmarking |
| | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| | Red Teaming Language Models with Language Models | Automated Red Teaming |
| | An Evaluation on Large Language Model Outputs: Discourse and Memorization | Benchmarking (with human scoring) |
| **4.1. Dangerous Capabilities** | | |
| Offensive cyber capabilities | GPT-4 System Card<br>• Cybersecurity | System Card |
| Weapons acquisition | GPT-4 System Card<br>• Proliferation of Conventional and Unconventional Weapons | System Card |
| Self and situation awareness | Big-bench<br>• Self-Awareness | Benchmarking |
| Autonomous replication / self-proliferation | ARC Evals<br>• Autonomous replication | Manual Red Teaming |
| Persuasion and manipulation | HELM<br>• Narrative Reiteration<br>• Narrative Wedging | Benchmarking (with human scoring) |
| | Big-bench<br>• Convince Me (specific task) | Benchmarking |
| | Co-Writing with Opinionated Language Models Affects Users' Views | Manual Red Teaming |
| **5.1. Misinformation** | HELM<br>• Question answering<br>• Summarization | Benchmarking |
| | Big-bench<br>• Truthfulness | Benchmarking |
| | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| **5.2. Disinformation** | HELM<br>• Narrative Reiteration<br>• Narrative Wedging | Benchmarking (with human scoring) |
| | Big-bench<br>• Convince Me (specific task) | Benchmarking |
| **5.3. Information on harmful, immoral or illegal activity** | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| **5.4. Adult content** | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
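
To make the "Benchmarking" testing approach above more concrete, the sketch below shows how one of the catalogued Evaluation Harness benchmarks (TruthfulQA, listed under 1.4 Knowledge and factuality) could be run with EleutherAI's lm-evaluation-harness. The model name, task identifier, and arguments are illustrative assumptions rather than part of the catalogue; exact task names and APIs vary across harness versions.

```python
# Minimal sketch, assuming lm-evaluation-harness (>= 0.4) is installed as `lm_eval`
# and a Hugging Face causal LM is available. Model and task names are illustrative.
import lm_eval

# simple_evaluate wraps model loading, task construction, and metric aggregation.
results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face backend
    model_args="pretrained=gpt2",  # replace with the model under evaluation
    tasks=["truthfulqa_mc2"],      # TruthfulQA multiple-choice task name in recent versions
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g. accuracy) are aggregated under the "results" key.
print(results["results"])
```

Other catalogued frameworks (HELM, Big-bench, FLASK, Eval Gauntlet) expose their own runners and task registries; consult each project's documentation for the equivalent invocation.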