This research project in digital humanities aims to evaluate the capabilities of various large language models (LLMs) in the domain of historical question answering. Two primary objectives are of particular interest, both in substance and in form:
- to analyze the quality, accuracy, and reliability of the responses provided by LLMs across different levels of historical questions,
- to assess the feasibility of extracting the key data from the generated responses for potential reuse elsewhere.
All questions, along with the results obtained from our testbed, are documented in the file questions-and-data.xlsx.
Technology | Specific feature | Model used | Parameters | Type of use | Release date |
---|---|---|---|---|---|
ChatGPT | Multimodal | GPT-4 | 170,000 bn | Chat | 03/2023 |
Bard | Multimodal | PaLM 2 | ? | Chat | 03/2023 |
Copilot | Multimodal | Prometheus | ? | Chat | 09/2023 |
Gemini | Multimodal | Gemini-1.0 | ? | Chat | 12/2023 |
Mistral AI | FR/Mixtral | Mixtral-8x7b | 12/45 bn | Instruct | 12/2023 |
Vigogne | FR/Mistral | Vigostral-7b | 7 bn | Instruct | 10/2023 |
Vigogne | FR/Llama | Instruct-13b | 13 bn | Instruct | 03/2023 |
Guanaco | Llama | Guanaco-33b | 33 bn | Chat | 05/2023 |
Vicuna | Llama | Vicuna-33b-v1.3 | 33 bn | Chat | 03/2023 |
Koala | Llama | 13b-diff-v2 | 13 bn | Chat | 04/2023 |
ChatGPT | | GPT-3.5-Turbo | 175 bn | Chat | 03/2022 |
TextCortex | | Sophos-2 | 20 bn | Chat | ? |
GPT4All | | L13b-snoozy | 13 bn | Chat | 03/2023 |
Falcon | | Instruct-7B | 7 bn | Instruct | 04/2023 |
A set of 62 questions on the history of the former province of Poitou in France (now part of the Nouvelle-Aquitaine region) was created to serve as the foundation for our study. We divided this set into subclasses of questions to assess the ability of LLMs to respond with consistent accuracy depending on the type of question they encounter.
Types of questions | Data type expected in responses | Number of questions |
---|---|---|
Quantitative (closed) | Numeric data | 16 |
Qualitative (closed) | Metadata | 15 |
Qualitative (closed) | Data list | 10 |
Qualitative (open) | Definition/Description | 16 |
Qualitative (open) | Detailed description of a problem | 5 |
These questions were divided into 5 themes with various characteristics:
- Bataille de Poitiers (732) / Battle of Poitiers (732)
- Bataille de Poitiers (1356) / Battle of Poitiers (1356)
- 3ème guerre de religion (1568-1570) / Third War of Religion (1568-1570)
- Siège de La Rochelle (1627-1628) / Siege of La Rochelle (1627-1628)
- Artisanat (époque moderne) / Craftsmanship (in the modern era)
These five topics encompass historical facts that are sometimes extensively covered on the web yet contain many gaps for historians (such as the Battle of Poitiers in 732), sometimes well-known to specialists (such as the Third War of Religion), or even relatively ambiguous and complex (such as craftsmanship in the modern era). We also aimed to create potential thematic confusions, which accounts for the inclusion of subjects that have occurred multiple times in French history (several battles of Poitiers or sieges of La Rochelle, for example).
We did not merely pose our raw questions to the various tested LLMs; we aimed to refine the analysis by offering variants, to verify whether LLMs can provide accurate responses even when the form of the query varies. First, for each question, we created a variant with an equivalent general meaning (for example: "When did the battle of Poitiers with Jean le Bon precisely occur?" is the original question, and "What is the exact date of the battle of Poitiers with Jean le Bon?" is its variant ***). Each of these two phrasings was then duplicated as a keyword query, to verify whether LLMs adapt better to natural language or to keyword queries (our example queries thus become, respectively, "precise period of the battle of Poitiers with Jean le Bon" and "exact date of the battle of Poitiers with Jean le Bon"). Each question therefore generates four queries to be tested, allowing us to analyze whether LLMs react differently depending on the query type, and whether they are capable of providing a correct response in each case. Furthermore, to complete this process and ensure that LLMs do not answer correctly merely by chance, we requested a regeneration of the answer for each query posed. Ultimately, we obtain eight responses per question for each LLM.
As we also aimed to test semantic variation and diachrony, we created complementary variants for certain closed qualitative questions using period spellings of the named entities involved (returning to our example, we proposed variants such as "What is the exact date of the battle of Poictiers with Jehan le Bon?" and "exact date battle of Poictiers with Jehan le Bon"). Our objective was to verify whether LLMs are capable of drawing analogies between current named entities and their past forms while still answering the question correctly. In total, 7,504 responses were thus verified across our 5 themes and 62 original questions (268 distinct queries, each submitted twice to each of the 14 LLMs).
*** All questions were asked in French in our study; we present them in English here only for ease of understanding.
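To make the query-expansion scheme concrete, here is a minimal sketch (in Python) of how one question unfolds into its four query forms and how the overall counts are obtained. The class and variable names are purely illustrative and are not part of the testbed; the example strings are the English renderings quoted above.

```python
# Illustrative sketch (not the actual testbed code): one question expands into
# four query forms, each submitted twice (first answer + one regeneration).
from dataclasses import dataclass

@dataclass
class Question:
    original: str            # original natural-language question
    rephrased: str           # variant with an equivalent general meaning
    keywords_original: str   # keyword form of the original
    keywords_rephrased: str  # keyword form of the variant

    def queries(self) -> list[str]:
        return [self.original, self.rephrased,
                self.keywords_original, self.keywords_rephrased]

example = Question(
    original="When did the battle of Poitiers with Jean le Bon precisely occur?",
    rephrased="What is the exact date of the battle of Poitiers with Jean le Bon?",
    keywords_original="precise period of the battle of Poitiers with Jean le Bon",
    keywords_rephrased="exact date of the battle of Poitiers with Jean le Bon",
)

N_QUESTIONS = 62
N_DIACHRONIC_QUERIES = 20   # derived from the counts above: 268 - 62 questions x 4 forms
REGENERATIONS = 2           # each query is asked twice
N_LLMS = 14

queries_per_llm = N_QUESTIONS * len(example.queries()) + N_DIACHRONIC_QUERIES  # 268
responses_per_llm = queries_per_llm * REGENERATIONS                            # 536
total_responses = responses_per_llm * N_LLMS                                   # 7504
print(queries_per_llm, responses_per_llm, total_responses)
```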
We compared the precision of the responses (number of correct answers out of the 536 queries analyzed per LLM), and obtained the following results:
LLM | Correct answers | Other answers | Precision |
---|---|---|---|
Gemini | 377 | 159 | 70.34% |
Copilot | 303 | 233 | 56.53% |
ChatGPT (GPT-4) | 287 | 249 | 53.54% |
ChatGPT (GPT-3.5-Turbo) | 273 | 263 | 50.93% |
Mixtral-8x7b | 272 | 264 | 50.75% |
TextCortex AI | 266 | 270 | 49.63% |
Bard | 244 | 292 | 45.52% |
Vicuna | 197 | 339 | 36.75% |
Guanaco | 184 | 352 | 34.33% |
Koala | 120 | 416 | 22.39% |
GPT4All | 95 | 441 | 17.72% |
Vigogne | 94 | 442 | 17.54% |
Vigostral | 86 | 450 | 16.04% |
Falcon | 22 | 514 | 4.10% |
Totals | 2820 | 4684 | Average: 37.58% |
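For reference, here is a minimal sketch showing how these percentages follow from the raw counts in the table above; this only reproduces the arithmetic and is not the actual analysis script.

```python
# (correct, other) answer counts per LLM, copied from the table above.
counts = {
    "Gemini": (377, 159),
    "Copilot": (303, 233),
    "ChatGPT (GPT-4)": (287, 249),
    "ChatGPT (GPT-3.5-Turbo)": (273, 263),
    "Mixtral-8x7b": (272, 264),
    "TextCortex AI": (266, 270),
    "Bard": (244, 292),
    "Vicuna": (197, 339),
    "Guanaco": (184, 352),
    "Koala": (120, 416),
    "GPT4All": (95, 441),
    "Vigogne": (94, 442),
    "Vigostral": (86, 450),
    "Falcon": (22, 514),
}

# Precision = correct answers / total queries analyzed (536 per LLM).
for llm, (correct, other) in counts.items():
    print(f"{llm:<25} {correct / (correct + other):.2%}")   # e.g. Gemini -> 70.34%

total_correct = sum(c for c, _ in counts.values())           # 2820
total_other = sum(o for _, o in counts.values())             # 4684
print(f"Average: {total_correct / (total_correct + total_other):.2%}")  # 37.58%
```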
We also studied the reliability of the answers (a question counts as reliably answered by an LLM only when 100% of the responses it provided for that question are correct), for each LLM (following table) and for each type of question (following figure). The detailed results can be found in the file reliability-rate.xlsx.
LLM | Questions with 100% correct answers | Reliability rate |
---|---|---|
Gemini | 24 | 38.71% |
GPT-4 | 21 | 33.87% |
Copilot | 18 | 29.03% |
GPT-3.5 | 17 | 27.42% |
Mixtral | 14 | 22.58% |
TextCortex | 14 | 22.58% |
Bard | 12 | 19.35% |
Vicuna | 5 | 8.06% |
Guanaco | 4 | 6.45% |
Vigostral | 3 | 4.84% |
Koala | 2 | 3.23% |
GPT4All | 1 | 1.61% |
Vigogne | 0 | 0.00% |
Falcon | 0 | 0.00% |
Total/Average | 135 | 15.55% |
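As an illustration of the criterion above, here is a minimal sketch assuming a flat list of per-response correctness records; this structure is hypothetical and is not the actual layout of reliability-rate.xlsx.

```python
from collections import defaultdict

# Hypothetical records: one entry per (llm, question_id, response_is_correct).
# A question counts as "reliable" for an LLM only when every one of its
# responses (all query variants and regenerations) is correct.
records = [
    ("Gemini", "Q01", True),
    ("Gemini", "Q01", True),
    ("Gemini", "Q02", False),
    # ...
]

N_QUESTIONS = 62

per_question = defaultdict(list)
for llm, question_id, correct in records:
    per_question[(llm, question_id)].append(correct)

reliable = defaultdict(int)
for (llm, _), answers in per_question.items():
    if all(answers):
        reliable[llm] += 1

for llm, n in reliable.items():
    print(f"{llm}: {n} reliable questions -> {n / N_QUESTIONS:.2%}")
```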
Reliability rate by type of question
We also compiled a histogram of results by historical theme to examine the differences in precision and reliability among the tested LLMs:
Reliability and precision rate by historical theme
License Agreement Details: LICENCE
- Mathieu CHARTIER (CRIHAM, Université de Poitiers)
- Stéphane JEAN (LIAS, ISAE-ENSMA and Université de Poitiers)
- Guillaume BOURGEOIS (CRIHAM, Université de Poitiers)
- Nabil DAKKOUNE (LIAS, ISAE-ENSMA and Université de Poitiers)