19 January 2026

The goblin taunts are getting weird. “Heh—shiny man’s gonna bleed!” one hollers mid-combat, its voice emerging from Claude 3.5 Haiku rather than any human throat. Nearby, a paladin controlled by GPT-4o launches into an heroic speech for no apparent reason while stepping directly into a volley of arrows. Across the battlefield, a warlock grows unaccountably dramatic about mundane situations, as if the AI can’t quite calibrate emotional response to context.

Welcome to the strangest Dungeons & Dragons campaign you’ve never witnessed. At the University of California San Diego, computer scientists have built something unusual: a fully automated D&D simulator in which large language models don’t just advise on strategy but play the entire game, controlling dungeon masters, players and monsters through thousands of rule-governed decisions. It’s not about entertainment, though. It’s a testing ground for one of AI’s thorniest challenges: whether models can act reliably over long stretches without supervision.

Most benchmarks for language models still target simple tasks: answer a question, summarize a paragraph, translate a sentence. But LLMs are increasingly deployed as autonomous agents that must function independently for hours or days at a time. The question is whether they can actually handle it. Raj Ammanabrolu, who led the research, reckons D&D provides the perfect stress test. The game demands multi-step planning, strict rule adherence, and team coordination, all unfolding through dialogue where natural language drives intent but mechanics govern reality.

D&D’s appeal for evaluation purposes goes deeper than complexity. Because play unfolds entirely through conversation, it opens a direct avenue for human-AI interaction. People can join as players while AI agents handle other roles, or vice versa. The same mechanics that let researchers pit Claude against GPT-4o also support mixed human-AI parties tackling goblin ambushes together.

The technical challenge was getting models to actually execute game rules rather than hallucinating their way through combat. Previous D&D AI work treated gameplay primarily as dialogue and storytelling, with handwritten code running the mechanics behind the scenes. Ammanabrolu’s team took a different approach: they built a high-fidelity simulator with a structured API of game actions, each with defined parameters and preconditions. When an AI dungeon master wants a goblin to attack a player, it can’t just narrate the outcome; it must call specific functions that check line of sight, calculate attack rolls, verify the goblin has actions remaining, and update hit points if the strike lands.
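To make the idea concrete, here is a minimal Python sketch of what one such action could look like. It is an illustration only: the names `Creature`, `ActionError` and `attack` are hypothetical rather than the simulator’s actual API, and the hit rule is simplified (no critical hits or automatic misses).

```python
import random
from dataclasses import dataclass


@dataclass
class Creature:
    # Hypothetical, stripped-down stat block for illustration.
    name: str
    hp: int
    armor_class: int
    attack_bonus: int
    damage: int
    actions_left: int = 1


class ActionError(Exception):
    """Raised when a precondition for a game action is not met."""


def attack(attacker, target, has_line_of_sight, rng):
    """Resolve one attack, refusing to run if any precondition fails."""
    # Preconditions are checked before any dice are rolled or state changes.
    if attacker.actions_left < 1:
        raise ActionError(f"{attacker.name} has no actions remaining")
    if not has_line_of_sight:
        raise ActionError(f"{attacker.name} cannot see {target.name}")
    if target.hp <= 0:
        raise ActionError(f"{target.name} is already down")

    attacker.actions_left -= 1
    roll = rng.randint(1, 20)  # one d20
    hit = roll + attacker.attack_bonus >= target.armor_class
    if hit:
        # Bookkeeping happens in code, not in the narration.
        target.hp = max(0, target.hp - attacker.damage)
    return {"roll": roll, "hit": hit, "target_hp": target.hp}
```

The point of the design is that the model can describe the blow however it likes, but the hit points only change if the call succeeds; a second `attack` from the same goblin in the same turn would raise `ActionError` instead of quietly going through.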

This separation of narration from mechanics proved crucial. The dungeon master agent plans in natural language but executes through typed function calls with validation and explicit bookkeeping. On each monster’s turn, it queries state, moves when needed, gates ranged attacks by checking whether obstacles block the shot, resolves attacks through dice-rolling functions, applies damage accounting for resistances, audits temporary conditions, and finishes with resource resets before emitting an end-turn token. Player agents follow a similar routine—sensing their situation, validating potential actions, proposing moves for the DM to execute, and sending tactical messages to teammates.
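The monster-turn routine described above can be sketched as a fixed sequence of steps. The following toy version is an assumption-laden illustration, not the researchers’ code: the battlefield is reduced to a single line of squares, obstacles only block shots, and the step names recorded in the trace (`query_state`, `check_line_of_sight` and so on) are invented for the example.

```python
import random


def monster_turn(monster, target, obstacles, rng):
    """Run one monster turn on a 1-D map and return the ordered step trace."""
    trace = ["query_state"]

    def distance():
        return abs(target["pos"] - monster["pos"])

    # Move only when needed: step toward the target until it is in range.
    while distance() > monster["range"] and monster["movement"] > 0:
        monster["pos"] += 1 if target["pos"] > monster["pos"] else -1
        monster["movement"] -= 1
        trace.append("move")

    # Gate the ranged attack: any obstacle strictly between the two blocks it.
    lo, hi = sorted((monster["pos"], target["pos"]))
    blocked = any(lo < square < hi for square in obstacles)
    trace.append("check_line_of_sight")

    if distance() <= monster["range"] and not blocked:
        roll = rng.randint(1, 20)
        trace.append("roll_attack")
        if roll + monster["attack_bonus"] >= target["ac"]:
            damage = monster["damage"]
            if monster["damage_type"] in target["resistances"]:
                damage //= 2  # resistance halves damage, rounded down
            target["hp"] = max(0, target["hp"] - damage)
            trace.append("apply_damage")

    # Audit temporary conditions: tick down durations and drop expired ones.
    monster["conditions"] = {
        name: turns - 1
        for name, turns in monster["conditions"].items()
        if turns > 1
    }
    trace.append("audit_conditions")

    # Reset per-turn resources, then signal the end of the turn.
    monster["movement"] = monster["speed"]
    trace.append("reset_resources")
    trace.append("end_turn")
    return trace
```

Because every turn leaves behind an ordered trace like this, a skipped line-of-sight check or a missing end-of-turn signal shows up as a gap in the sequence rather than as a vague feeling that the narration went wrong.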

The researchers tested three models across 27 scenarios: combat encounters drawn from well-known D&D adventure modules like Goblin Ambush and Klarg’s Cave. Each scenario pitted four player characters against monsters on procedurally generated maps with height variation and line-of-sight constraints. Fixed random seeds ensured identical conditions across model comparisons. Claude 3.5 Haiku led on most metrics with the most reliable tool use; GPT-4o followed close behind, and DeepSeek-V3 trailed considerably. The team also attempted a 120-billion-parameter open model, but it failed basic identity consistency and couldn’t produce valid episodes—a reminder that not all large language models are created equal.
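Fixed seeds are what make such head-to-head comparisons fair: every model faces the same terrain and the same run of dice. A minimal sketch of the idea, using Python’s standard `random` module, might look like this. The function names and the map representation (a grid of integer elevations) are assumptions made for the example, not details from the study.

```python
import random


def generate_height_map(seed, width, height, max_elevation=3):
    """Build a grid of elevations that depends only on the seed and the size."""
    rng = random.Random(seed)  # private generator, isolated from global state
    return [
        [rng.randint(0, max_elevation) for _ in range(width)]
        for _ in range(height)
    ]


def roll_dice(seed, count, sides=20):
    """Return a reproducible sequence of dice rolls for a given seed."""
    rng = random.Random(seed)
    return [rng.randint(1, sides) for _ in range(count)]


# The same seed always reproduces the same map and the same rolls,
# so two models can be given identical conditions.
map_for_model_a = generate_height_map(seed=27, width=8, height=6)
map_for_model_b = generate_height_map(seed=27, width=8, height=6)
same_conditions = map_for_model_a == map_for_model_b
```

Giving each scenario its own generator, rather than seeding the global one, keeps one episode’s rolls from leaking into the next.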

Evaluating the models required looking beyond simple win rates. The researchers defined six axes capturing both capability and reliability: function usage (did the model call the right tools?), parameter fidelity (were arguments correct?), acting quality (did characters stay in role?), tactical optimality (did they make sensible combat choices?), state tracking (did they remember what was happening?), and function efficiency (did they avoid redundant queries?). Automated judges scored transcripts and tool traces, with validation against human ratings showing strong correlation—Pearson coefficients around 0.96 to 0.98.
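For readers unfamiliar with the statistic, the Pearson coefficient measures how closely two sets of scores move together on a straight line, from -1 (perfectly opposed) to 1 (perfectly aligned). The sketch below computes it from scratch; the function name is an assumption, and any scores fed to it in this example are toy values, not the study’s data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

A value near 0.96 therefore means the automated judge ranks and spaces transcripts almost exactly as the human raters do, which is what justifies using it at scale.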

Some failure modes proved revealing. Models occasionally checked whether an attack was valid, received a negative result, then attempted the attack anyway. Others queried an enemy’s hit points, learned the target was already dead, but neglected to switch to a living opponent. State-tracking errors accumulated over time, with hallucination rates climbing as scenarios progressed, though Claude 3.5 Haiku kept mistakes relatively rare at 1% of actions, compared with 4.3% for DeepSeek-V3.
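The first of these failures, checking and then ignoring the answer, is the kind of thing a tool trace makes easy to spot. Here is a rough sketch of how such a detector could work. The trace format (a list of dictionaries with `actor`, `fn`, `target` and `result` keys) and the function names `check_attack`, `attack` and `move` are invented for illustration; the study’s own logs may look quite different.

```python
def find_ignored_checks(trace):
    """Return indices of attacks made after a failed validity check.

    A failed check stays "live" for an (actor, target) pair until the actor
    either passes a new check on that target or moves, since moving may
    change whether the attack is valid.
    """
    failed = set()
    flagged = []
    for index, call in enumerate(trace):
        key = (call["actor"], call.get("target"))
        if call["fn"] == "check_attack":
            if call["result"]:
                failed.discard(key)
            else:
                failed.add(key)
        elif call["fn"] == "move":
            # Forget this actor's failed checks: the situation has changed.
            failed = {k for k in failed if k[0] != call["actor"]}
        elif call["fn"] == "attack" and key in failed:
            flagged.append(index)
    return flagged
```

Dividing the number of flagged calls by the total number of actions would give one simple error rate, though whether that matches the researchers’ own definition is not something this sketch can claim.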

The acting-quality analysis is where those personality quirks surfaced. Claude frequently varied its class-specific diction (paladin valour, bardic wit, warlock edge), yielding high trait diversity even when not every sentence dripped with roleplaying. DeepSeek-V3 produced consistent first-person action beats and monster taunts, but reused the same voices throughout scenarios. GPT-4o balanced vivid stage directions with tactical phrasing, landing somewhere between the two extremes.

What’s striking is how this D&D framework reveals capabilities that shorter benchmarks miss entirely. A model might ace question-answering while flailing at turn-based strategy that requires remembering who’s standing where, what resources remain, and which enemies pose the greatest threat. The researchers found Claude’s aggressive resource deployment strategy particularly interesting: it achieved the highest combat efficiency by burning through spell slots and abilities liberally, accepting lower resource conservation in exchange for eliminating threats quickly. Risk-taking pays off in simulated goblin fights, apparently.

The implications stretch beyond tabletop gaming. Multi-party negotiation, business strategy planning, and any domain requiring extended autonomous operation over rules-constrained environments could benefit from similar evaluation frameworks. The researchers plan to expand beyond combat scenarios to full D&D campaigns involving exploration, social interaction, and puzzle-solving. They’re also investigating whether fine-tuning on gameplay traces can improve model robustness—teaching AI to be better dungeon masters and more reliable party members.

Perhaps the most revealing detail is what the models do when they improvise. Those goblin taunts, those unnecessary heroic speeches, those melodramatic warlocks: they suggest something beyond mechanical rule-following. The models seem to be trying to imbue gameplay with texture and personality, even when it leads them into tactical blunders. Whether that represents emerging creativity or simply statistical echoes of training data remains an open question. But for now, at least, AI can roll a d20 and argue about spell slots with the best of them.