Did You Know AI Models Struggle with Historical Accuracy? GPT-4 Turbo Scores Only 46%
According to a new study, many AI models do not answer questions about
world history accurately, which is a concerning finding. The researchers
built a set of benchmark questions from the Seshat Global History
Databank and found that GPT-4 Turbo scored 46% on the test, which is
better than guessing but not expert-level. To build the test, the team
converted data from the databank into multiple-choice questions about
different historical features.
Seven AI models, including Llama, GPT-3.5, Gemini and GPT-4 Turbo, were
tested. They were asked to act as expert historians so that their
strengths and weaknesses could be evaluated and improvements suggested.
The researchers also defined an accuracy scale on which random guessing
scores 25% and perfectly accurate answers score 100%. The models were
evaluated both on answers backed by evidence and on answers reached by
inference.
GPT-4 Turbo was the best-performing model with a score of 43.8%, but it
could not answer at an expert level. In a two-choice test where the
answer was either ‘present’ or ‘absent’, GPT-4 Turbo scored 63.2%, which
indicates that it can handle basic factual questions but struggles with
complex historical ones. The study also compared the models’
performance across regions, time periods and types of historical data.
The models did better on questions about earlier historical periods,
such as before 3000 BCE, but struggled with questions about more recent
periods because of the greater complexity of those societies. They also
performed better on questions about the Americas and worse on questions
about Oceania and Sub-Saharan Africa.
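To put the reported scores in context, it helps to compare each one with what random guessing would score on the same test. The short Python sketch below is only an illustration and not a calculation from the study. It assumes uniform guessing (25% on four choices, 50% on two), and the function names and the chance-adjusted rescaling are hypothetical additions made here for comparison.

```python
def chance_baseline(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice test."""
    if num_choices < 2:
        raise ValueError("a multiple-choice question needs at least two options")
    return 1.0 / num_choices


def chance_adjusted(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and a perfect score to 1.

    This rescaling is an illustrative assumption, not the study's own metric.
    """
    baseline = chance_baseline(num_choices)
    return (accuracy - baseline) / (1.0 - baseline)


# Scores quoted in the article for the best-performing model.
four_choice = chance_adjusted(0.438, num_choices=4)
two_choice = chance_adjusted(0.632, num_choices=2)

print(f"four-choice, chance-adjusted: {four_choice:.3f}")
print(f"two-choice, chance-adjusted:  {two_choice:.3f}")
```

On this rescaled view the two results look similar: each score covers roughly a quarter of the distance between random guessing and a perfect score.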
The study has limitations too: the Seshat Databank is in English and skews toward well-documented societies, and only a limited set of AI models was tested. Overall, the study shows that AI still has a long way to go in answering questions about history, and that more unbiased and inclusive training data is needed for AI to talk about global history more accurately.