Simple features predict LLM benchmark answers

Code for the paper "Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers".

Citation

If you find our paper or code useful, please consider citing the following paper:

@misc{pacchiardi2024leavingbarndooropen,
      title={Leaving the barn door open for {C}lever {H}ans: Simple features predict {LLM} benchmark answers}, 
      author={Lorenzo Pacchiardi and Marko Tesic and Lucy G. Cheke and José Hernández-Orallo},
      year={2024},
      eprint={2410.11672},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.11672}, 
}

Reproducing the experiments

1) Install dependencies

The original results were obtained with Python 3.9.19. The required dependencies can be installed with pip install -r requirements.txt.
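For example, on a Unix-like system (a sketch; any way of creating a Python 3.9 environment works):

    python3.9 -m venv .venv               # create a virtual environment (the directory name is arbitrary)
    source .venv/bin/activate             # activate it
    pip install -r requirements.txt       # install the pinned dependencies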

2) Download the data

To run the experiments, the raw performance data must first be downloaded. The data comes from several sources; following the instructions below will download the data and place it in a raw_results folder at the root of this project.

legalbench

The raw results for the considered scenarios of the legalbench dataset are available from HELM-Lite. Run the notebook obtain_raw_results/download_helm/download_lite.ipynb to download them.

CLadder

Download the data from this link and decompress it into the folder raw_results/cladder_output_files-1. The data is released under a CC0 license.
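As a sketch, assuming the downloaded archive is a zip file named cladder_output_files-1.zip (the actual file name may differ):

    mkdir -p raw_results                              # create the target folder if it does not exist
    unzip cladder_output_files-1.zip -d raw_results/  # decompress into raw_results/cladder_output_files-1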

ProntoQA

For simplicity, this repository includes the result file obtained from ProntoQA. It can be reproduced by cloning the original repository (released under the Apache 2.0 license) and running the script analyze_results.py.
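A rough sketch of those steps (the clone URL and the cloned directory name below stand in for the repository linked above):

    git clone <URL-of-the-original-ProntoQA-repository>
    cd prontoqa                 # hypothetical directory name of the clone
    python analyze_results.py   # produces the result file included here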

Datasets from the KindsOfReasoning collection

The following datasets are obtained from the KindsOfReasoning collection:

fantasy_reasoning, metaphor_boolean, anli, space_nli, wanli, babi_task_16, formal_fallacies_syllogisms_negation

See the above repository for credits and license information on those datasets.

To obtain them, download this file and extract the folder into the raw_results folder.
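As with CLadder, this is a plain extraction step; a sketch, with <kindsofreasoning-archive>.zip standing in for the downloaded file:

    mkdir -p raw_results
    unzip <kindsofreasoning-archive>.zip -d raw_results/   # extract the folder into raw_results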

Remaining datasets

The evaluation results for the following datasets are included in this repository for convenience. They can be reproduced by following the instructions in the obtain_raw_results folder.

neubaroco, moral_permissibility, causal_judgment, commonsense_qa_2 

Attributions:

  • moral_permissibility and causal_judgment are obtained from BIG-Bench (License: Apache License 2.0).
  • commonsense_qa_2 can be obtained from this repository (License: CC-BY-4.0).
  • neubaroco can be obtained from this repository (the file we used in the experiments is data/naloma2023/NeuBAROCO_NALOMA.tsv).

3) Run the experiments

Run the two notebooks 1_can_simple_features_predict_ground_truth.ipynb and 2_effect_of_predicting_ground_truth_on_performance.ipynb (in that order) in the experiments folder to reproduce all results, figures, and tables in the paper.
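One way to run them non-interactively is jupyter's nbconvert (assuming jupyter is installed in the environment from step 1):

    cd experiments
    jupyter nbconvert --to notebook --execute --inplace 1_can_simple_features_predict_ground_truth.ipynb
    jupyter nbconvert --to notebook --execute --inplace 2_effect_of_predicting_ground_truth_on_performance.ipynb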

Credits

  • The code to download HELM-Lite was adapted from this file.
  • The code to compute Word2Vec and FastText embeddings was adapted from https://github.com/lorypack/llm-liedetector (released under the BSD-3-Clause license).
  • We thank the creators of the NeuBAROCO dataset for allowing us to include their dataset and instance-level results on various LLMs in this repository.
