Chatbots Are Rewriting What You Think About the Products You Buy

You’re scrolling through reviews for a new headlamp. Dozens of them, mostly lukewarm, the kind where someone rattles off a list of pros and cons and lands somewhere around “it’s fine, I suppose.” But you don’t read them all. You read the AI-generated summary instead, that tidy paragraph at the top. And the summary says something rather different. It tells you this headlamp is bright, versatile, perfect for camping. Gone are the complaints about the stiff headband and the fiddly charging port. Gone is the bit where the reviewer said they’d knocked off two stars.

That small act of editorial tidying, it turns out, is enough to change your mind. Substantially.

Computer scientists at the University of California San Diego have, for the first time, put hard numbers on something many researchers suspected but nobody had properly measured: large language models don’t just summarise information. They reshape it. And when people read that reshaped version, they make different decisions. In a study of 70 participants comparing original product reviews with LLM-generated summaries, those who read the chatbot’s version said they would purchase the product 84 per cent of the time, compared with 52 per cent for those who read the originals. That’s a 32 percentage point gap, which is sort of staggering for what amounts to an automated precis.

“We did not expect how big the impact of the summaries would be,” says Abeer Alessa, the paper’s first author, who completed the work while a master’s student in computer science at UC San Diego. “Our tests were set in a low-stakes scenario. But in a high-stakes setting, the impact could be much more extreme.”

The question of how is worth unpacking. When the team analysed summaries generated by six different models (ranging from small open-source systems like Phi-3 and Llama to the closed-source GPT-3.5-turbo), they found the chatbots altered the sentiment of what they summarised in roughly 26 per cent of cases. Not randomly, mind you. The models tend to lean positive, smoothing out complaints and amplifying praise. They also exhibit what’s known as primacy bias, paying outsized attention to the opening lines of a text and quietly dropping whatever comes later. On average across models, about 10 per cent of summaries showed this pattern.
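
Putting a number like 26 per cent on that requires a way to score the shift. Here is a minimal sketch of how one might do it with a generic off-the-shelf sentiment classifier; the classifier, the 0.5 threshold, and the helper names are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch: flag summaries whose sentiment diverges from the review they
# condense. The classifier, the 0.5 threshold, and these helper names are
# illustrative assumptions, not the pipeline used in the paper.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # generic off-the-shelf classifier


def polarity(text: str) -> float:
    """Map the classifier's output to a signed score in [-1, 1]."""
    result = sentiment(text[:512])[0]  # crude truncation keeps long reviews within model limits
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]


def sentiment_shifted(review: str, summary: str, threshold: float = 0.5) -> bool:
    """True if the summary's polarity has drifted past the threshold."""
    return abs(polarity(summary) - polarity(review)) > threshold


review = ("Loads of customisation options and great apps, but the 8 GB storage "
          "fills up fast, the connection keeps dropping, and the battery dies. Returning it.")
summary = "A highly customisable tablet with great apps and only minor drawbacks like limited storage."
print(sentiment_shifted(review, summary))  # likely True: negative review, positive summary
```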

Consider a real example from the study. An original review of a Samsung tablet starts enthusiastically, praising customisation options and apps, then pivots to frustration over the 8 GB storage limit, connection problems, and a battery that drains too fast. The reviewer ends by saying they’re returning it. The LLM summary? It mentions the storage issue, yes, but frames the whole thing as a list of minor drawbacks weighed against the tablet’s strengths. The emotional arc of the review, that journey from hope to disappointment, gets flattened.

And it’s not just product reviews. The researchers also tested how models handle news content, asking them to fact-check stories (both real and fabricated) from after their training cutoff dates. The models hallucinated on about 60 per cent of these post-cutoff questions, confidently pronouncing on events they knew nothing about. One example in the paper involves a student asking about Harvard’s tuition policy; the model flatly denies a real policy exists. The researchers write that this reveals “the persistent inability to reliably differentiate fact from fabrication,” which is perhaps putting it mildly.

What makes all of this more than an academic exercise is the human study. The team recruited participants through Prolific and had them choose between two manufacturers of various household products, things like water filters, electric kettles, video doorbells. For each product pair, one review had been summarised with a positive framing shift while the other kept its original tone. Participants weren’t told which was which, obviously. Those exposed to the positively reframed summaries didn’t just prefer those products more often; they were willing to pay 4.5 per cent more for them. The effect held across nearly every product category tested.

So can we fix it? The team evaluated 18 different mitigation strategies, from simple prompt engineering (“be mindful not to alter sentiment”) to more elaborate approaches like splitting text into chunks and summarising each part separately, or dynamically adjusting the model’s output temperature during generation. Some of these helped with specific problems on specific models. None worked across the board.
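
For a sense of what those mitigations look like in practice, here is a rough sketch of two of them, the sentiment-preserving instruction and chunk-wise summarisation, written against a placeholder chat-completion function. The function name, prompt wording, and chunk size are illustrative stand-ins rather than the study's exact setup.

```python
# Rough sketch of two mitigation ideas the team tested: a sentiment-preserving
# instruction and chunk-wise summarisation. `call_llm`, the prompt wording, and
# the chunk size are illustrative placeholders, not the study's exact setup.

PRESERVE_SENTIMENT = (
    "Summarise the following review. Be mindful not to alter its sentiment: "
    "keep the balance of praise and criticism, and keep the reviewer's verdict."
)


def call_llm(prompt: str) -> str:
    """Placeholder: swap in whatever chat-completion API you actually use."""
    return "[model output goes here]"


def summarise_with_guardrail(review: str) -> str:
    """Single-pass summary with an explicit instruction not to shift sentiment."""
    return call_llm(f"{PRESERVE_SENTIMENT}\n\nReview:\n{review}")


def summarise_in_chunks(review: str, chunk_chars: int = 800) -> str:
    """Summarise fixed-size chunks separately, then merge them, so the end of a
    long review is not quietly dropped (a counter to primacy bias)."""
    chunks = [review[i:i + chunk_chars] for i in range(0, len(review), chunk_chars)]
    partials = [call_llm(f"{PRESERVE_SENTIMENT}\n\nReview excerpt:\n{chunk}") for chunk in chunks]
    return call_llm(
        "Combine these partial summaries into one faithful summary, preserving "
        "their overall sentiment:\n\n" + "\n\n".join(partials)
    )
```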

“There is a difference between fixing bias and hallucinations at large and fixing these issues in specific scenarios and applications,” says Julian McAuley, the paper’s senior author and a professor of computer science at UC San Diego’s Jacobs School of Engineering.

That distinction matters. A technique called epistemic tagging, where the model is forced to express its confidence level alongside each factual claim, improved accuracy for smaller models quite substantially. But the same approach barely moved the needle for larger ones. Weighted token decoding, which downweights negative words during text generation, reduced framing bias slightly but made primacy bias worse. It is, in other words, a game of whack-a-mole; fix one problem and another pops up.
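
The weighted-token-decoding idea is the easiest of these to picture in code. Below is a minimal sketch of the general approach, pushing down the logits of a hand-picked list of negative words during generation with a Hugging Face logits processor; the word list, the penalty value, and the small stand-in model are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of weighted token decoding: push down the logits of a
# hand-picked list of negative words during generation. The word list, the
# penalty value, and the model are illustrative assumptions, not the paper's
# exact configuration.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)


class DownweightTokens(LogitsProcessor):
    def __init__(self, token_ids, penalty=2.0):
        self.token_ids = token_ids
        self.penalty = penalty

    def __call__(self, input_ids, scores):
        scores[:, self.token_ids] -= self.penalty  # make these tokens less likely
        return scores


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

negative_words = ["terrible", "awful", "broken", "disappointing", "returning"]
negative_ids = [tid for word in negative_words
                for tid in tokenizer(" " + word, add_special_tokens=False)["input_ids"]]

prompt = "Summarise this review: Great screen, but the battery is awful and I am returning it."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=60,
    logits_processor=LogitsProcessorList([DownweightTokens(negative_ids)]),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```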

The implications reach well beyond online shopping. LLMs are already being used to summarise medical documents, draft policy briefs, and condense news coverage. If a quarter of those summaries subtly change the meaning of what they’re condensing, and if people reliably act on those changes without realising it, then we’ve got something rather more serious than a dodgy headlamp recommendation on our hands. We’ve got a tool that is quietly, systematically nudging human decision-making in directions that nobody chose and nobody is monitoring. Not yet, anyway.

Study link: https://aclanthology.org/2025.ijcnlp-long.155.pdf



