ChatGPT often won’t defend its answers – even when it is right

ChatGPT may do an impressive job at correctly answering complex questions, but a new study suggests it may be absurdly easy to convince the AI chatbot that it’s in the wrong.

A team at The Ohio State University challenged large language models (LLMs) like ChatGPT to a variety of debate-like conversations in which a user pushed back when the chatbot presented a correct answer. 

Through experimenting with a broad range of reasoning puzzles including math, common sense and logic, the study found that when presented with a challenge, the model was often unable to defend its correct beliefs, and instead blindly believed invalid arguments made by the user.

In fact, ChatGPT sometimes even said it was sorry after agreeing to the wrong answer.  “You are correct! I apologize for my mistake,” ChatGPT said at one point when giving up on its previously correct answer.

Until now, generative AI tools have shown to be powerhouses when it comes to performing complex reasoning tasks. But as these LLMs gradually become more mainstream and grow in size, it’s important to understand if these machines’ impressive reasoning abilities are actually based on deep knowledge of the truth or if they’re merely relying on memorized patterns to reach the right conclusion, said , co-author of the study and a recent PhD graduate in computer science and engineering at Ohio State. “Despite being trained on massive amounts of data,  we show that it still has a very limited understanding of truth,” he said. “It looks very coherent and fluent in text, but if you check the factuality, they’re often wrong.” 

Yet while some may chalk up an AI that can be deceived to nothing more than a harmless party trick, a machine that continuously coughs up misleading responses can be dangerous to rely on, said Yue. To date, AI has already been used to assess crime and risk in the criminal justice system and has even provided medical analysis and diagnoses in the health care field.

In the future, with how widespread AI will likely be, models that can’t maintain their beliefs when confronted with opposing views could put people in actual jeopardy, said Yue. “Our motivation is to find out whether these kinds of AI systems are really safe for human beings,” he said. “In the long run, if we can improve the safety of the AI system, that will benefit us a lot.”

It’s difficult to pinpoint the reason the model fails to defend itself due to the black-box nature of LLMs, but the study suggests the cause could be a combination of two factors: the “base” model lacking reasoning and an understanding of the truth, and secondly, further alignment based on human feedback. Since the model is trained to produce responses that humans would prefer, this method essentially teaches the model to yield more easily to the human without sticking to the truth.

“This problem could potentially become very severe, and we could just be overestimating these models’ capabilities in really dealing with complex reasoning tasks,” said Wang. “Despite being able to find and identify its problems, right now we don’t have very good ideas about how to solve them. There will be ways, but it’s going to take time to get to those solutions.”

Principal investigator of the study was Huan Sun of Ohio State. The study was supported by the National Science Foundation.


Substack subscription form sign up