Code for the paper "Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers".
If you find our paper or code useful, please consider citing the following paper:
    @misc{pacchiardi2024leavingbarndooropen,
      title={Leaving the barn door open for {C}lever {H}ans: Simple features predict {LLM} benchmark answers},
      author={Lorenzo Pacchiardi and Marko Tesic and Lucy G. Cheke and José Hernández-Orallo},
      year={2024},
      eprint={2410.11672},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.11672},
    }
The original results were obtained with Python 3.9.19. The required dependencies can be installed with `pip install -r requirements.txt`.
To run the experiments, the raw performance data must be downloaded. The data comes from several sources; following the instructions below will download the data and place it in a `raw_results` folder in the root of this project.
The raw results for the considered scenarios of the `legalbench` dataset are available from HELM-Lite. Run the notebook `obtain_raw_results/download_helm/download_lite.ipynb` to download them.
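For reference, a minimal sketch of the kind of download the notebook performs is below. The storage URL, version, run name, and file name are all assumptions for illustration; the notebook contains the authoritative logic.

```python
import json
import urllib.request
from pathlib import Path

# All of these values are assumptions for illustration; the actual run names
# and storage layout are handled by the download_lite.ipynb notebook.
BASE = "https://storage.googleapis.com/crfm-helm-public/lite/benchmark_output/runs"
VERSION = "v1.0.0"  # hypothetical HELM-Lite release
RUN = "legalbench:subset=abercrombie,model=openai_gpt-4-0613"  # hypothetical run

out_dir = Path("raw_results") / "helm_lite" / RUN
out_dir.mkdir(parents=True, exist_ok=True)

url = f"{BASE}/{VERSION}/{RUN}/display_predictions.json"
with urllib.request.urlopen(url) as resp:
    predictions = json.load(resp)

(out_dir / "display_predictions.json").write_text(json.dumps(predictions, indent=2))
print(f"saved {len(predictions)} per-instance predictions")
```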
Download the data from this link and decompress it into a folder `raw_results/cladder_output_files-1`. The data is released under a CC0 license.
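If you prefer to decompress programmatically, a minimal sketch follows; the archive name `cladder_output_files-1.zip` is an assumption (use whatever filename the link actually serves):

```python
import zipfile
from pathlib import Path

# Hypothetical archive name; substitute the file actually downloaded from the link.
archive = Path("cladder_output_files-1.zip")
target = Path("raw_results")
target.mkdir(exist_ok=True)

with zipfile.ZipFile(archive) as zf:
    # The contents should end up under raw_results/cladder_output_files-1
    zf.extractall(target)

print(*(p.name for p in (target / "cladder_output_files-1").iterdir()), sep="\n")
```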
For simplicity, the repository contains the result file obtained from ProntoQA. The file can be reproduced by cloning the original repository (released under an Apache 2.0 license) and running the script `analyze_results.py`.
The following datasets are obtained from the KindsOfReasoning collection: `fantasy_reasoning`, `metaphor_boolean`, `anli`, `space_nli`, `wanli`, `babi_task_16`, `formal_fallacies_syllogisms_negation`. See the above repository for credits and license information on those datasets. To obtain them, download this file and extract the folder in the `raw_results` folder.
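After extraction, a quick sanity check that each dataset's results landed under `raw_results` can look like the following (the dataset names come from the list above; the exact file layout is an assumption, so adapt the pattern if needed):

```python
from pathlib import Path

expected = [
    "fantasy_reasoning", "metaphor_boolean", "anli", "space_nli",
    "wanli", "babi_task_16", "formal_fallacies_syllogisms_negation",
]
root = Path("raw_results")
for name in expected:
    # rglob is a loose check; the extracted folder layout is an assumption.
    status = "ok" if any(root.rglob(f"*{name}*")) else "MISSING"
    print(f"{name}: {status}")
```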
The evaluation results for the following datasets are included in the repository for convenience: `neubaroco`, `moral_permissibility`, `causal_judgment`, `commonsense_qa_2`. They can be reproduced by following the instructions in the `obtain_raw_results` folder.

Attributions:
- `moral_permissibility` and `causal_judgment` are obtained from BIG-Bench (License: Apache License 2.0).
- `commonsense_qa_2` can be obtained from this repository (License: CC-BY-4.0).
- `neubaroco` can be obtained from this repository (the file we used in the experiments is `data/naloma2023/NeuBAROCO_NALOMA.tsv`).
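As an illustration, loading the NeuBAROCO TSV with pandas looks like the following; the local path is an assumption about where you placed the file:

```python
import pandas as pd

# Assumed local path; adjust to wherever the NeuBAROCO file was placed.
path = "raw_results/neubaroco/NeuBAROCO_NALOMA.tsv"

df = pd.read_csv(path, sep="\t")
print(df.shape)
print(df.columns.tolist())
```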
Run the two notebooks `1_can_simple_features_predict_ground_truth.ipynb` and `2_effect_of_predicting_ground_truth_on_performance.ipynb` (in this order) in the `experiments` folder to reproduce all results, figures and tables present in the paper.
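To run them non-interactively, `jupyter nbconvert --execute` can be used; a minimal sketch (assuming Jupyter is available in the environment):

```python
import subprocess

# Execute the two notebooks in order, overwriting them with executed copies.
for nb in [
    "experiments/1_can_simple_features_predict_ground_truth.ipynb",
    "experiments/2_effect_of_predicting_ground_truth_on_performance.ipynb",
]:
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
        check=True,
    )
```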
- The code to download HELM-Lite was adapted from this file.
- The code to compute Word2Vec and FastText embeddings was adapted from https://github.com/lorypack/llm-liedetector (released under a BSD-3-Clause license); a sketch of the general technique appears after this list.
- We thank the creators of the NeuBAROCO dataset for allowing us to include their dataset and instance-level results on various LLMs in this repository.
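For readers unfamiliar with these features: the general idea is to turn each instance's text into a fixed-size vector by averaging pretrained word embeddings. A minimal sketch using gensim's pretrained models is below; the specific model names and the plain averaging are assumptions about the technique in general, not a reproduction of the adapted code.

```python
import numpy as np
import gensim.downloader as api

# Pretrained embeddings available through gensim-data; the first load
# downloads the models, which can take a while. These names are assumptions.
word2vec = api.load("word2vec-google-news-300")
fasttext = api.load("fasttext-wiki-news-subwords-300")

def embed(text: str, model) -> np.ndarray:
    """Average the embeddings of in-vocabulary tokens; zeros if none match."""
    tokens = [t for t in text.lower().split() if t in model]
    if not tokens:
        return np.zeros(model.vector_size)
    return np.mean([model[t] for t in tokens], axis=0)

question = "Is the following syllogism valid? All birds fly; penguins are birds."
features = np.concatenate([embed(question, word2vec), embed(question, fasttext)])
print(features.shape)  # (600,) with the 300-dimensional models above
```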