Picture a task so simple it takes four minutes. Generate ten words. That’s all. Make them as different from each other as possible, in every way that matters (meaning, usage, the way they sound in the mouth). This isn’t a test you’d find on an IQ exam or in the competitive benchmarks that pit human talent against machine intelligence. It’s a creativity test, one that psychologists have spent years validating as a genuine marker of divergent thinking. And now, for the first time, researchers have run it against 100,000 humans and multiple artificial intelligence models side by side.
The results are unsettling in precisely the way new scientific findings sometimes are. They shatter one assumption while putting something more complicated in its place.
On 21 January 2026, a team led by Karim Jerbi at Université de Montréal published their findings in Scientific Reports. The core of their discovery reads like a headline designed to provoke: some of the leading AI language models (GPT-4 chief among them) now outperform the average human on an objective, measurable scale of creativity. They have cleared the middle of the pack. But here’s the catch that, in a sense, everyone has been waiting for: even the best AI systems still fall short of what the most creative humans can do. Against the top 10 percent of people, the gap grows wider still.
“Our study shows that some AI systems based on large language models can now outperform average human creativity on well-defined tasks,” Jerbi says. “This result may be surprising, even unsettling, but our study also highlights an equally important observation: even the best AI systems still fall short of the levels reached by the most creative humans.”
It’s the perfect statement of a paradox: we’ve crossed the line, except we haven’t.
What matters about Jerbi’s work is not that it contains a shocker, but that it’s genuinely systematic. His team didn’t compare a single AI model to a handful of human volunteers. They tested multiple leading language models (GPT-4, GeminiPro, Claude 3, GPT-3.5, and others) against a massive cohort that was age-balanced, gender-balanced, and geographically diverse. The research included co-first authors Antoine Bellemare-Pépin and François Lespinasse, and counted among its collaborators Yoshua Bengio, one of the founders of deep learning itself and founder of Mila, the Quebec AI institute.
The task they used was the Divergent Association Task, or DAT. You’re asked to produce ten words, and the scoring depends on semantic distance (how far apart the words sit from each other in the vast conceptual space of language). A highly creative person might suggest: “galaxy, fork, freedom, algae, harmonica, quantum, nostalgia, velvet, hurricane, photosynthesis.” You can feel the scatter in that list. Nothing echoes. Nothing is predictable.
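To make the scoring concrete, here is a minimal sketch of how a DAT-style score can be computed: the average pairwise semantic distance between the submitted words, using pretrained word vectors (GloVe via gensim is an illustrative choice here; the study’s exact scoring pipeline may differ).

```python
# Minimal sketch of DAT-style scoring: average pairwise semantic distance
# between the submitted words, using pretrained word vectors.
# (Illustrative only; the study's exact scoring pipeline may differ.)
from itertools import combinations
import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-300")  # pretrained GloVe word vectors

def dat_score(words):
    """Mean pairwise cosine distance between word vectors, scaled to 0-100."""
    vectors = [model[w.lower()] for w in words if w.lower() in model]
    distances = []
    for a, b in combinations(vectors, 2):
        cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        distances.append(1.0 - cos_sim)  # distance = 1 - similarity
    return 100 * float(np.mean(distances))

words = ["galaxy", "fork", "freedom", "algae", "harmonica",
         "quantum", "nostalgia", "velvet", "hurricane", "photosynthesis"]
print(f"DAT-style score: {dat_score(words):.1f}")
```

A list that “echoes” (say, ten kitchen utensils) would cluster in vector space and score low; the scattered list above keeps every pair far apart.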
The measure works. Researchers have validated that DAT performance predicts performance on other creativity tests (the Alternative Uses Task, insight problems, real-world creative writing). It’s not a parlor trick. It’s a genuine window into how creatively someone’s mind can operate.
So what happened when Jerbi’s team ran this test with machines?
GPT-4 won. Its average score exceeded the human average. GeminiPro landed at a level statistically indistinguishable from the human mean. Several other models underperformed to varying degrees. But here’s what’s crucial: slice the human data into segments, and the moment you look at people in the top half for creativity, all the AI models fall below that threshold. The top 10 percent of humans (roughly 10,000 of the participants) open a gap that even GPT-4 cannot cross. The data reveal a ceiling. The machines have hit it.
“We developed a rigorous framework that allows us to compare human and AI creativity using the same tools, based on data from more than 100,000 participants,” Jerbi explains. The sheer scale is part of what makes this work compelling. Previous studies of AI creativity have ranged widely in their conclusions, sometimes finding machines outperforming people, sometimes finding the opposite. Often they relied on smaller samples or contested metrics. This study puts weight behind its claim by bringing 100,000 human data points to bear.
But there’s more. The team asked whether AI creativity could be tuned. It can. By adjusting temperature (the hyperparameter that controls the randomness in the model’s word sampling), they coaxed higher creativity scores from GPT-4. Low temperature produces conservative, predictable outputs. Higher temperature introduces more randomness, more exploration. Push it up and the model takes greater risks, moves beyond well-worn pathways, generates more varied associations. The highest temperature tested produced mean scores that exceeded 72 percent of human responses.
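As a rough illustration of what the temperature knob does, independent of any particular model or of the study’s own setup, here is a toy sketch of how temperature rescales a model’s next-word scores before sampling:

```python
# Toy illustration of the temperature hyperparameter: it rescales the
# model's logits before sampling, flattening the distribution
# (more exploration) as temperature rises.
import numpy as np

def sample_distribution(logits, temperature):
    """Softmax over temperature-scaled logits."""
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical next-word scores
for t in (0.2, 1.0, 1.5):
    print(f"T={t}: {np.round(sample_distribution(logits, t), 3)}")
# Low T concentrates probability on the top candidate (conservative output);
# higher T spreads it across rarer words (more varied associations).
```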
They also experimented with strategy. They asked the models to generate ten words using different prompts. One strategy focused on etymology (the roots and origins of words). That worked. Both GPT-3.5 and GPT-4 improved their scores significantly when explicitly told to vary etymology. Another asked for semantic opposition (words with opposite meanings). That decreased performance, unsurprisingly, since opposite words are, by definition, semantically close. The point is that AI systems respond acutely to how humans frame the task. They’re not locked into a single mode. They adapt.
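For illustration only, here is what such prompt framings might look like when sent through a standard chat API. The prompts below are paraphrased guesses at the strategies described, not the study’s exact wording, and the model name is just an example.

```python
# Hypothetical prompt variants in the spirit of the strategies described
# above (paraphrased; not the study's exact prompts), sent through the
# OpenAI chat API so the effect of instruction framing can be compared.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTS = {
    "baseline": "List 10 single words that are as different from each other as possible.",
    "etymology": ("List 10 single words that are as different from each other as possible, "
                  "drawing each word from a different etymological root or language of origin."),
    "opposition": "List 10 single words such that each word is the opposite of another word in the list.",
}

for name, prompt in PROMPTS.items():
    reply = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        temperature=1.0,
        messages=[{"role": "user", "content": prompt}],
    )
    print(name, "->", reply.choices[0].message.content)
```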
To test whether improvements on the DAT would actually translate to real creative work, Jerbi’s team had the models generate haikus, film synopses, and flash fiction stories. They measured these using something called Divergent Semantic Integration, which tracks semantic distance across sentence-level language. They applied this metric to human-written haikus and synopses as well. The result: GPT-4 consistently outperforms GPT-3.5 across all three writing tasks. Yet humans, particularly those whose work was sampled from professional sources, maintain an advantage. Human-written haikus outpace the machines. Human film synopses carry more semantic complexity. The machines approach, but they don’t arrive.
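As a very rough stand-in for that kind of measure (the published Divergent Semantic Integration metric works on contextual word embeddings, so this simplified sketch only conveys the general idea), one can compute the average pairwise distance between sentence embeddings in a short text:

```python
# Rough, simplified stand-in for a DSI-style measure: mean pairwise cosine
# distance between sentence embeddings of a short text. (The published DSI
# metric uses contextual word embeddings; this only conveys the general idea.)
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_spread(sentences):
    """Average pairwise cosine distance across the text's sentences."""
    emb = encoder.encode(sentences, normalize_embeddings=True)
    return float(np.mean([1.0 - np.dot(a, b) for a, b in combinations(emb, 2)]))

haiku = ["An old silent pond.", "A frog jumps into the pond.", "Splash! Silence again."]
print(f"semantic spread: {semantic_spread(haiku):.3f}")
```

A synopsis whose sentences all restate the same idea scores low on such a measure; one that pulls together distant concepts scores high.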
What does this mean for the fear that’s haunted creative professionals since generative AI began its spectacular rise: that writers, artists, and inventors might find themselves displaced by machines?
“Even though AI can now reach human-level creativity on certain tests, we need to move beyond this misleading sense of competition,” Jerbi says. “Generative AI has above all become an extremely powerful tool in the service of human creativity: it will not replace creators, but profoundly transform how they imagine, explore, and create, for those who choose to use it.”
The findings suggest that such fears remain, for now, premature. The creative work that sustains careers (the writing that wins prizes, the conceptual breakthroughs that define fields, the ideas that resonate) emerges from the top tiers of human creative ability. That’s where the machines haven’t arrived. They’ve reached the median. They’ve climbed toward the mean. But they’re not at the summit.
Yet the research opens a different question, one that Jerbi phrases carefully: “By directly confronting human and machine capabilities, studies like ours push us to rethink what we mean by creativity.” If GPT-4 can exceed average human divergent thinking, if it can generate semantically complex narratives, if it can be tuned and prompted and coaxed toward greater originality, then perhaps the traditional hierarchy of human creative ability isn’t quite what we thought it was. Perhaps creativity isn’t unitary. Perhaps the kinds of associative, combinatorial thinking that machines can now do represent one facet of a much larger, more intricate phenomenon.
The study’s data are openly available. The code is on GitHub. This is the kind of work designed to be built upon, extended, revisited as new models emerge and new questions arise. For now, one thing is clear: we’ve crossed an interesting threshold, the one where machines stopped being incomparably worse at creative tasks. But the question of what they actually are, how they work, and what their presence means for human creativity and creative professions is a conversation that’s only just beginning.
Study link: https://www.nature.com/articles/s41598-025-25157-3