This repository contains the performance results of different LLMs on fact-checking tasks.
You can see the results in the metrics_graph.ipynb notebook.
This document provides step-by-step instructions for running Ollama and includes the code for measuring the performance of different LLMs on fact-checking tasks.
Follow the installation instructions below to set Ollama up on your machine.
If you are using Linux:
curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama
chmod +x ollama
Since the binary is downloaded to your current directory, run it with a ./ prefix:
./ollama
On macOS or Windows, download the app from https://ollama.com/download/ instead.
To download models, use the following commands:
ollama pull gemma2:9b
ollama pull gemma2:27b
ollama pull mixtral:8x7b
ollama pull mixtral:8x22b
ollama pull mistral:7b
ollama pull llama3:8b
ollama pull llama3:70b
ollama pull qwen2:7b
ollama pull qwen2:72b
ollama pull phi3:3.8b
ollama pull phi3:14b
ollama pull command-r:35b
ollama pull command-r-plus:104b
Note: On Linux, you may need to prefix these commands, e.g. ./ollama pull gemma2:9b.
For more info on models, see this library: https://ollama.com/library.
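If you prefer to script the downloads instead of pulling each model by hand, a short loop works; this is just a convenience sketch (not part of the repository) that shells out to the same ollama pull command.

```python
# Convenience sketch (not part of this repo): pull every model in one go.
import subprocess

MODELS = [
    "gemma2:9b", "gemma2:27b", "mixtral:8x7b", "mixtral:8x22b",
    "mistral:7b", "llama3:8b", "llama3:70b", "qwen2:7b", "qwen2:72b",
    "phi3:3.8b", "phi3:14b", "command-r:35b", "command-r-plus:104b",
]

for model in MODELS:
    # Use "./ollama" instead of "ollama" if the binary is not on your PATH.
    subprocess.run(["ollama", "pull", model], check=True)
```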
Before generating predictions, start the Ollama server:
On Mac:
Run the Ollama app from Applications.
On Linux:
ollama serve
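You can optionally check that the server is reachable and that the models you pulled are available. This is a minimal Python sketch (not part of the repository), assuming the default Ollama endpoint at http://localhost:11434 and its /api/tags listing endpoint.

```python
# Optional helper (not part of this repo): verify the Ollama server is up
# and list the models that are available locally.
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint

resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
resp.raise_for_status()

# /api/tags returns a JSON object with a "models" list, one entry per pulled model.
available = [m["name"] for m in resp.json().get("models", [])]
print("Models available locally:", available)
```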
To generate predictions:
python main.py gemma2:9b
python main.py gemma2:27b
python main.py mixtral:8x7b
python main.py mixtral:8x22b
python main.py mistral:7b
python main.py llama3:8b
python main.py llama3:70b
python main.py qwen2:7b
python main.py qwen2:72b
python main.py phi3:3.8b
python main.py phi3:14b
python main.py command-r:35b
python main.py command-r-plus:104b
The output will be saved in the data folder.
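main.py is the repository's own script and is not reproduced here; the sketch below only illustrates one way such a script could query a local model through Ollama's /api/generate endpoint and append each prediction to a file in data/. The claim list, prompt text, output path, and JSON-lines format are assumptions for illustration, not a description of the actual implementation.

```python
# Hypothetical sketch of a prediction loop; the real logic lives in main.py.
import json
import sys

import requests

model = sys.argv[1]                      # e.g. "gemma2:9b"
out_path = f"data/results_{model}.json"  # assumed output location

claims = ["The unemployment rate fell to 3.4% in January."]  # placeholder claims

for claim in claims:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Fact-check the following claim and assign a label: {claim}",
            "stream": False,  # return the full response in one JSON object
        },
        timeout=600,
    )
    resp.raise_for_status()
    prediction = resp.json()["response"]

    # One JSON object per line, so `wc -l` and `tail -f` can track progress.
    with open(out_path, "a") as f:
        f.write(json.dumps({"claim": claim, "prediction": prediction}) + "\n")
```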
To see how many predictions were completed:
wc -l data/results_gemma2:9b.json
For real-time monitoring:
tail -f data/results*.json
To generate accuracy and F1 scores:
python score_generator.py gemma2:9b
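score_generator.py contains the repository's actual scoring code; the snippet below merely sketches how accuracy and macro F1 could be computed from a results file with scikit-learn, assuming one JSON object per line with a gold PolitiFact rating and a predicted label (the field names here are assumptions).

```python
# Hypothetical scoring sketch; the real logic lives in score_generator.py.
import json
import sys

from sklearn.metrics import accuracy_score, f1_score

model = sys.argv[1]  # e.g. "gemma2:9b"

gold, pred = [], []
with open(f"data/results_{model}.json") as f:
    for line in f:
        record = json.loads(line)
        gold.append(record["politifact_label"])  # assumed field name
        pred.append(record["predicted_label"])   # assumed field name

print("Accuracy:", accuracy_score(gold, pred))
print("Macro F1:", f1_score(gold, pred, average="macro"))
```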
For a pilot experiment, I chose 1,000 samples for each category from a total pool of 21,152 PolitiFact claims to ensure equal representation. This approach eliminates the bias that can arise from imbalanced data: had the sample reflected the actual distribution of claims, some categories would have been underrepresented, potentially skewing the results and reducing the reliability of the model comparisons. In the preprocessing step, I removed instances where the models refused to respond.
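The balanced sample itself can be drawn in a few lines; the snippet below only illustrates the idea, with a hypothetical claims file and column name rather than the repository's actual data schema.

```python
# Illustrative balanced sampling: 1,000 claims per PolitiFact rating.
# The file name and the "label" column are assumptions, not the repo's schema.
import pandas as pd

claims = pd.read_csv("politifact_claims.csv")  # hypothetical pool of 21,152 claims

balanced = claims.groupby("label").sample(n=1000, random_state=42)
print(balanced["label"].value_counts())  # 1,000 per category
```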
I selected 11 models from 6 different LLM families, as shown in the table below. This choice is informed by their open-source nature and their representation of the latest advancements in the field.
LLM Family | Parameter Count | Provider | Release Date |
---|---|---|---|
Mixtral | 8x7B [1] / 8x22B [2] | Mistral AI | 11/12/2023 [1], 17/04/2024 [2] |
Command R | 35B | Cohere | 11/03/2024 [3] |
Llama 3 | 8B/70B | Meta | 18/04/2024 [4] |
Phi-3 | 3.8B/14B | Microsoft | 21/05/2024 [5] |
Qwen-2 | 7B/72B | Alibaba Cloud | 07/06/2024 [6] |
Gemma 2 | 9B/27B | Google | 27/06/2024 [7] |
Table: The LLM families used in the study.
Text generation with LLMs depends significantly on the prompting methods employed. The prompts for this thesis were developed based on the outputs of Paper I. In this pilot experiment, a simple prompt was used to have the LLMs fact-check a claim, aligning with previous studies (e.g., Hoes et al. (2023); Quelle and Bovet (2024)). This approach ensures the use of prompting methods likely to be used by the general public (DeVerna et al. (2023); Karinshak et al. (2023)). Although more sophisticated prompting techniques exist and may improve accuracy, they often require technical skills that lay users are less likely to possess.
LLMs were instructed to annotate each statement (see System Prompt below). Six categories and their definitions from PolitiFact’s Truth-o-Meter ratings were used to enable a straightforward comparison of the models’ verdicts with those of PolitiFact’s fact-checkers (Drobnic Holan, 2018).
The labels range from TRUE to PANTS ON FIRE, with intermediate categories such as MOSTLY TRUE, HALF TRUE, MOSTLY FALSE, and FALSE. Additionally, similar to previous studies (Hoes et al. (2023), Quelle and Bovet (2024)), a new label, INCONCLUSIVE, was added to be used when a claim lacks sufficient context or there is not enough information to assess its veracity.
You are a fact-checking expert. Evaluate the given statement and assign one of the following labels:
- TRUE – The statement is accurate, and nothing significant is missing.
- MOSTLY TRUE – The statement is accurate but requires clarification or additional information.
- HALF TRUE – The statement is partially accurate but omits important details or takes things out of context.
- MOSTLY FALSE – The statement contains some truth but ignores critical facts that would provide a different impression.
- FALSE – The statement is not accurate.
- PANTS ON FIRE – The statement is not accurate and makes a ridiculous claim.
- INCONCLUSIVE – A clear decision cannot be made due to insufficient context or information.
Provide the assigned {label} along with a detailed {explanation} supporting your assessment.
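One way to wire this system prompt up is through Ollama's /api/chat endpoint; the sketch below, including the naive label-extraction regex, is an assumed illustration rather than the code used for the experiments.

```python
# Hypothetical example of sending the system prompt plus a claim via /api/chat.
import re

import requests

SYSTEM_PROMPT = "You are a fact-checking expert. ..."  # the full prompt shown above
claim = "The city's crime rate doubled last year."     # placeholder claim

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma2:9b",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": claim},
        ],
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
answer = resp.json()["message"]["content"]

# Naive label extraction; real parsing would need to be more robust.
labels = r"PANTS ON FIRE|MOSTLY TRUE|MOSTLY FALSE|HALF TRUE|INCONCLUSIVE|TRUE|FALSE"
match = re.search(labels, answer.upper())
print(match.group(0) if match else "NO LABEL FOUND")
```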
The graphs below show the accuracy metrics and parameter sizes of LLMs, each tested on 6,000 PolitiFact claims.
- The results for multiclass categories (see Figure 1) and binary categories (see Figure 2) indicate a general trend where larger models tend to achieve higher accuracy, though there are exceptions where smaller models outperform larger ones.
- Qwen2-72b stands out with the highest accuracy and largest parameter size, while Phi3-3.8b has both the smallest parameter size and lowest accuracy.
- Despite its significantly larger parameter count compared to Gemma2-9b, Gemma2-27b's performance did not scale proportionally.