Simple features predict LLM benchmark answers

Code for the paper "Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers".

Citation

If you find our paper or code useful, please consider citing the following paper:

@misc{pacchiardi2024leavingbarndooropen,
      title={Leaving the barn door open for {C}lever {H}ans: Simple features predict {LLM} benchmark answers}, 
      author={Lorenzo Pacchiardi and Marko Tesic and Lucy G. Cheke and José Hernández-Orallo},
      year={2024},
      eprint={2410.11672},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.11672}, 
}

Reproducing the experiments

1) Install dependencies

The original results were obtained with Python 3.9.19. The required dependencies can be installed with pip install -r requirements.txt.
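For example, on a Unix-like system (a sketch; any way of creating a Python 3.9 environment works):

    python3.9 -m venv .venv               # create a virtual environment (the directory name is arbitrary)
    source .venv/bin/activate             # activate it
    pip install -r requirements.txt       # install the pinned dependencies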

2) Download the data

To run the experiments, the raw performance data must first be downloaded. The data comes from several sources; following the instructions below will download the data and place it in a raw_results folder at the root of this project.

legalbench

The raw results for the considered scenarios of the legalbench dataset are available from HELM-Lite. Run the notebook obtain_raw_results/download_helm/download_lite.ipynb to download them.

CLadder

Download the data from this link and decompress it into the folder raw_results/cladder_output_files-1. The data is released under a CC0 license.
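As a sketch, assuming the downloaded archive is a zip file named cladder_output_files-1.zip (the actual file name may differ):

    mkdir -p raw_results                              # create the target folder if it does not exist
    unzip cladder_output_files-1.zip -d raw_results/  # decompress into raw_results/cladder_output_files-1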

ProntoQA

For simplicity, this repository includes the result file obtained from ProntoQA. It can be reproduced by cloning the original repository (released under the Apache 2.0 license) and running the script analyze_results.py.
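A rough sketch of those steps (the clone URL and the cloned directory name below stand in for the repository linked above):

    git clone <URL-of-the-original-ProntoQA-repository>
    cd prontoqa                 # hypothetical directory name of the clone
    python analyze_results.py   # produces the result file included here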

Datasets from the KindsOfReasoning collection

The following datasets are obtained from the KindsOfReasoning collection:

fantasy_reasoning, metaphor_boolean, anli, space_nli, wanli, babi_task_16, formal_fallacies_syllogisms_negation

See the above repository for credits and license information on those datasets.

To obtain them, download this file and extract the folder into the raw_results folder.
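As with CLadder, this is a plain extraction step; a sketch, with <kindsofreasoning-archive>.zip standing in for the downloaded file:

    mkdir -p raw_results
    unzip <kindsofreasoning-archive>.zip -d raw_results/   # extract the folder into raw_results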

Remaining datasets

The evaluation results for the following datasets are included in this repository for convenience. They can be reproduced by following the instructions in the obtain_raw_results folder.

neubaroco, moral_permissibility, causal_judgment, commonsense_qa_2 

Attributions:

  • moral_permissibility and causal_judgment are obtained from BIG-Bench (License: Apache License 2.0).
  • commonsense_qa_2 can be obtained from this repository (License: CC-BY-4.0).
  • neubaroco can be obtained from this repository (the file we used in the experiments is data/naloma2023/NeuBAROCO_NALOMA.tsv).

3) Run the experiments

Run the two notebooks 1_can_simple_features_predict_ground_truth.ipynb and 2_effect_of_predicting_ground_truth_on_performance.ipynb (in that order) in the experiments folder to reproduce all results, figures, and tables in the paper.
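One way to run them non-interactively is jupyter's nbconvert (assuming jupyter is installed in the environment from step 1):

    cd experiments
    jupyter nbconvert --to notebook --execute --inplace 1_can_simple_features_predict_ground_truth.ipynb
    jupyter nbconvert --to notebook --execute --inplace 2_effect_of_predicting_ground_truth_on_performance.ipynb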

Credits

  • The code to download HELM-Lite was adapted from this file.
  • The code to compute Word2Vec and FastText embeddings was adapted from https://github.com/lorypack/llm-liedetector (released under the BSD-3-Clause license).
  • We thank the creators of the NeuBAROCO dataset for allowing us to include their dataset and instance-level results on various LLMs in this repository.
