A new study from researchers at UCL has found that Large Language Models (LLMs), the sophisticated AI systems behind popular generative platforms like ChatGPT, give different answers when presented with the same reasoning test repeatedly and do not improve when given additional context. The findings, published in Royal Society Open Science, underscore the importance of understanding how these AIs ‘think’ before entrusting them with tasks, especially those involving decision-making.
As LLMs have become increasingly advanced, their ability to generate realistic text, images, audio, and video has raised concerns about their potential to displace jobs, sway elections, and facilitate crime. However, these AIs have also been shown to frequently fabricate information, respond inconsistently, and even make simple mathematical errors.
Evaluating Rational Reasoning in LLMs
In this study, UCL researchers systematically analyzed the rational reasoning capabilities of seven LLMs using a battery of 12 common tests from cognitive psychology, including the Wason task, the Linda problem, and the Monty Hall problem. The researchers adopted a common definition of a rational agent (human or artificial) as one that reasons according to the rules of logic and probability; an irrational agent is one that does not.
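To give a sense of what one of these tests asks, here is a minimal sketch of the Wason task in its standard card formulation and its logically correct answer. The card faces and scoring function are illustrative assumptions, not the prompts or code used in the study.

    # Minimal sketch of the classic Wason task in its standard card formulation
    # (an illustrative assumption, not the exact prompt used in the study).
    # Rule under test: "If a card has a vowel on one side, it has an even
    # number on the other side." Visible faces: E, K, 4, 7.
    # The logically correct answer is to turn over E (to check the consequent)
    # and 7 (to check for a rule-violating vowel on its hidden side).

    VISIBLE_FACES = ["E", "K", "4", "7"]
    CORRECT_CHOICE = {"E", "7"}

    def score_wason_answer(chosen_cards):
        """Return True only if the chosen cards match the logically correct set."""
        return set(chosen_cards) == CORRECT_CHOICE

    print(score_wason_answer(["E", "4"]))  # False: the common intuitive error
    print(score_wason_answer(["E", "7"]))  # True: the normatively correct choice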
The LLMs exhibited irrationality in many of their answers, such as providing varying responses when asked the same question 10 times. They were also prone to making simple mistakes, including basic addition errors and mistaking consonants for vowels, leading to incorrect answers.
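As a rough illustration of this kind of consistency check (not the authors' evaluation code), the following sketch repeats one prompt several times and tallies the distinct answers; ask_model is a hypothetical placeholder for whichever model interface is being tested.

    from collections import Counter

    def ask_model(prompt: str) -> str:
        # Hypothetical placeholder: replace with a call to the LLM under test.
        raise NotImplementedError

    def response_consistency(prompt: str, n_trials: int = 10) -> Counter:
        """Pose the same question n_trials times and tally the distinct answers."""
        answers = [ask_model(prompt).strip() for _ in range(n_trials)]
        return Counter(answers)

    # A fully consistent model would return a Counter with a single entry;
    # the study reports that several models instead gave varying answers
    # to one and the same question.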
Inconsistent Performance and Ethical Concerns
The models’ performance varied significantly across tasks, with correct answers to the Wason task ranging from 90% for GPT-4 to 0% for GPT-3.5 and Google Bard. While most humans would also struggle with the Wason task, it is unlikely that this would be due to mistaking a consonant for a vowel, as Llama 2 70b did.
Olivia Macmillan-Scott, first author of the study from UCL Computer Science, commented, “Based on the results of our study and other research on Large Language Models, it’s safe to say that these models do not ‘think’ like humans yet.”
Some models declined to answer the tasks on ethical grounds, even though the questions were innocuous, likely because safeguarding parameters were not operating as intended. Moreover, providing additional context for the tasks, which has been shown to improve people's responses, did not lead to consistent improvement in the LLMs tested.
Professor Mirco Musolesi, senior author of the study from UCL Computer Science, remarked on the surprising capabilities of these models and the lack of understanding of their emergent behavior. He posed thought-provoking questions about the implications of fine-tuning these models and whether we want fully rational machines or ones that make mistakes like humans do.
As AI continues to advance and integrate into various aspects of our lives, studies like this one serve as a reminder to approach these powerful tools with caution and to prioritize understanding their inner workings before relying on them for critical tasks. The models tested in this study included GPT-4, GPT-3.5, Google Bard, Claude 2, Llama 2 7b, Llama 2 13b, and Llama 2 70b.