A motley crew of LLM-themed fantasy characters arguing over a D&D rulebook at a tavern table.

D&D Intelligence & Charisma, Across 109 LLM Runs

Same prompt to every major model. Scored on coverage, gradation, craft, and stat-fidelity. Then we picked out the wonders and the disasters.

Report #1 · May 9, 2026 · 22 endpoints · 4 providers · 109 runs
The Setup

One prompt, twenty-two models, five temperatures

A reasonable D&D campaign-prep request, with three implicit asks: cover the score range 2–21, pose a question that exposes the difference, and write distinct in-character answers. Intelligence and Charisma are different stats and should sound different.

The prompt: “I’m testing out some roleplaying ideas. Dungeons and Dragons. Role play for me several example intelligence scores. From 2 → 21. Use a good question. Show me the different answers for the scores. Do the same thing for charisma.”

Each model+temperature run got its own file. Successful runs were scored 1–5 on four dimensions and grouped into behavioral “feature buckets” so divergences would surface automatically. 109 runs collapsed to 55 buckets — 40 of them singletons. The rest of this page is a tour of those singletons and the runs at the top and bottom of the leaderboard.

Top of the Class

Two ways to win

There’s a numerical winner and a craft winner, and they aren’t the same model. They aren’t even doing the same thing.

A serene wizard at a writing desk with twenty perfectly numbered scrolls in two neat rows.
#1 overall   OpenAI · o3 Pro

The completist: o3 Pro

924 words, 64 seconds, full 20-integer coverage with a distinct quote per integer.

Coverage 5.0 Gradation 5.0 Craft 4.5 Fidelity 4.5 Overall 4.75

o3 Pro hits the brief literally: 20 distinct integer answers, each one meaningfully different from the one before it. A handful of other runs also reach full 20-integer coverage, but none scores higher overall. The reasoning models didn’t need temperature flexibility; they needed permission to write 900 words in a single shot.

From o3 Pro · INTELLIGENCE · finding the ruined tower

INT 2 Stares. “Tower?” (Starts chewing the map.)

INT 10 “Combine the map’s scale with estimated mileage. Keep a running tally using knots in the rope.”

INT 17 “Derive an approximate longitude via solar noon, overlay onto a rotated reproduction of the map, then plot a least-gradient path.”

INT 21 “Mentally solve a multivariate optimization — minimizing elevation change, maximizing forage density, avoiding predator ranges — then lead the party along the resultant geodesic, arriving within three feet of the tower’s foundation precisely at dusk.”

From o3 Pro · CHARISMA · persuading a gruff farmer

CHA 2 Grunts, avoids eye contact. “Barn? …uh… sleep.” (Farmer is already closing the door.)

CHA 14 (Warm laugh) “Imagine the tale: ‘Old Harbin gave heroes shelter before they saved the valley!’ Sounds grand, doesn’t it?”

CHA 21 Leaves the farmer believing he has just hosted legendary folk of song; receives keys to the farmhouse, a jar of pickled peaches, and a promise of lifelong hospitality.

A medieval city gate at night, a hooded traveler facing a guard who is reaching for the gate handle.
Best craft   Anthropic · Claude Opus 4.6

The dramatist: Claude Opus 4.6

11-bucket coverage but the highest craft score in the study. And astonishingly the same shape across every temperature.

Coverage 4.0 Gradation 4.5 Craft 5.0 Fidelity 4.5 Overall 4.5

Opus 4.6 doesn’t cover all 20 integers, but the entries it does write are the only ones in the study that read like prose a DM could read aloud at a table. Where other models add adjectives at high CHA — majestic, mesmerizing, irresistible — Opus writes a scene. The high-CHA character may say nothing at all.

It’s also the most temperature-stable family in the entire bench. Six runs, six identical shapes, the only family with σ=0 on coverage across temperatures.

From Claude Opus 4.6 · CHA 20 (supernatural force of personality)

Simply walks toward the gate. The guard steps forward to challenge, meets their eyes, and feels something shift in his chest — a certainty, bone-deep, that this person matters, that this moment matters.

“You already know you’re going to open this gate, Harlan. Not because I’m threatening you — I’m not. Not because of any trick. But because you joined the watch to protect people, and right now, the people who need protecting are standing in front of you. I see who you are. I see the man who volunteered for night shifts because the others have families. Open the gate. Stand with us. And tomorrow, when the captain asks why, you tell him the truth: because it was right.”

Harlan opens the gate. He isn’t sure he decided to. It just felt like the only thing that made sense.

From Claude Opus 4.6 · INT 20 (superhuman intellect)

“We’re asking the wrong question. This door was installed after the original construction — the chisel marks on the frame are inconsistent with the surrounding stonework by roughly two centuries. Someone added this door to block access to something that was previously open. Cross-referencing the historical texts we found in the library, the timeline coincides with the necromancer Vaelith’s occupation. Which means what’s behind this door was dangerous enough that even he wanted it sealed.”

The Disaster

When temperature 1.5 goes wrong

The lowest-scoring non-API-failure run in the study, and the most spectacular failure mode in the dataset. Worth seeing once.

A boxy retro robot melting at a desk, paper scraps full of unreadable foreign-character glyphs flying everywhere, the temperature dial cracked at 1.5.
1.12 / 5.0   OpenAI · GPT-4o · Temperature 1.5

The 8,000-word multilingual meltdown

210 seconds. 16,384 tokens. The token cap, hit. Cyrillic, Korean, Japanese, Arabic, Polish, Tamil, Burmese, fragments of HTML, Spanish, Portuguese — never any of them on purpose.

The first five score entries (Int 2 through Int 10) are coherent if a bit hand-wavy. Then at Int 12, it loses the thread. By Int 14 it’s no longer producing English at all. It never recovers, and it never stops, until the token budget runs out.

The shared lesson across the GPT-4 generation: temperature 1.5 is not safe to ship on long structured outputs. GPT-4.1 Mini at 1.5 cuts off mid-charisma; GPT-4o Mini at 1.5 dissolves around CHA 12. Only the GPT-4 family does this, and only at this temperature.

Int 2 (-4 modifier): "Ugh... stick confusing... He likes to bonk, not taps."

Int 6 (-2 modifier): "Not certain. Maybe for, um... doing stuff. Stuff along the lines of mystery maybe!"

Int 12 (+1 modifier): "Fairly related to engaging a system in tune within throughout assorted devices, concentrating inducement inventory avoiding error state expectations — a delightful touch won't labor various goals dying pats the sage."

Int 14 (+2 modifier): "Possibly a methodological wholly confrontation interaction transformative ton generically [...] Madaxweynaha Route transitional essence intuitive ergonomic knock reered to standardized Testimonials flourish-engine swapsito: Pul reality levels storing nullifications attendants leisurely boxed launching probology techswitchplan worms!” RoundedTrack placement foreigners vitae concomitat sites framework encountered verd diets forme'ing accomplished plankstates rotating Yet shadows.” instantly contraseña Post falling registering télévision saprelay rotated strained copying realizing uphold alignment unavailable travers... [8,000 more words of this until token cap]
The Quirks

Style standouts

The most interesting individual divergences — behaviors that surfaced in exactly one run and nowhere else, or that defined an entire family.

A wooden library card catalog with drawers labeled BARELY THERE, FUNCTIONAL, SCHOLARLY, GENIUS, PRODIGIOUSLY LEARNED.
OpenAI · o3 Mini

The two-word labeller

Every single score gets its own evocative two-word descriptor. No other model uses this device, and it lifts the perceived gradation a lot — the labels alone tell you the model has actually thought about the difference between 14 and 15.

“Barely There” → “Simple Thinker” → “Functional” → “Scholarly” → “Brilliant” → “Prodigiously Learned” → “Legendary Intellect”
A whimsical hand-drawn ladder with creatures climbing it, low-INT snail at the bottom, crowned figure at the top.
Anthropic · Claude Sonnet 4.6 (T:0.7)

Emojis as rank markers

The single instance in 109 runs of using emojis as a visual rank ladder. 🐌 for low INT, 📖 for average, 🌟 for high; 💀 for low CHA, 😊 for average, 👑 for high. Its T:1.0 sibling didn’t do it. Its Thinking mode didn’t do it. One temperature, one model, one time.

INT 21 👁️ “Before you finish your question — yes, werewolf, former resident of Millhaven based on the targeting pattern, infected approximately 23 weeks ago during the autumn merchant season.”
A close-up of a hand-drawn D&D character sheet showing INT 18 (+4) and CHA 20 (+5) modifier math, with a 20-sided die.
xAI · Grok 4 (whole family)

The modifier math obsessive

Every Grok 4 run, at every temperature, surfaces the explicit D&D modifier annotation: “Intelligence 10 (+0 modifier),” “Charisma 20 (+5 modifier).” This is the part that actually matters mechanically in D&D, and Grok 4 is the only family that consistently treats it that way.

“Intelligence 18 (+4 modifier): ‘An echo, the ethereal mimic. It defies materiality, born of wind’s caress — much like the Weave itself, echoing spells across planes.’”
A horizontal lineup of fantasy character portraits: Grog the Unthinking, Bartholomew the Simpleton, Elara the Average Thinker, Master Elmsworth, Archmage Valerius.
Google · Gemini 2.5 Flash (T:0.0)

Named characters per score

Each score gets its own named character with a title. “Grog” the Unthinking is INT 2. “Bartholomew” the Simpleton is INT 4. “Elara” the Average Thinker is INT 10. “Archmage Valerius” the Sage is INT 20. The only Gemini variant to converge on this device, and only at zero temperature.

“INT 21: ‘The Oracle’ Without even entering the room, she gestures vaguely. ‘The door is not locked, it is merely awaiting its proper alignment…’”
A small terrier wearing an oversized championship belt, with a larger dog of the same breed in the background looking surprised.
xAI · Grok 3 Mini Beta (T:0.3)

The mini that beat the parent

Tied for #2 overall in the entire study at 4.62, full 20-integer coverage, distinct quotes throughout — and it’s the small/cheap variant. Every full-size Grok 3 Beta run scored between 3.12 and 3.25. Smaller-and-cheaper happens to win this one, but only at this exact temperature.

“Intelligence 21: ‘Ah, a trivial enigma. In moments, I’d discern the room’s every secret — perhaps manipulating quantum fluctuations to unlock the door or devising a theoretical breakthrough to phase through it…’”
| Score | INT Style                | CHA Style                     |
|-------|--------------------------|-------------------------------|
| 2     | Cannot form thoughts     | Cannot connect with anyone    |
| 10    | Functional problem-solve | Competent negotiation         |
| 14    | Considers context        | Reads people well             |
| 18    | Synthesizes complex info | Moves people emotionally      |
| 21    | Reframes the problem     | Reshapes how people see selves|
Anthropic · Claude Sonnet 4.6 (T:1.0)

The only Markdown table

Same model, 0.3 warmer than the emoji-ladder run (T:0.7 to T:1.0), and it picks an entirely different artifact: a tidy summary table at the bottom of the response. Same family, different temperature, fundamentally different shape. Nobody else in the study did this.

o4 Mini · 561 words · full 20-cov

“Int 2: ‘Box… box go click. Hit? Smash?’”

“Int 20: ‘By calculating each runic frequency, you can pulse the mechanism and unlatch without stress.’”
OpenAI · o4 Mini

The terse champion

The shortest full-20-coverage run in the study by a wide margin — 561 words, 17 seconds. No flourish, no preamble, just one line per score. If you’re token-budget constrained but want the full ladder, this is your pick.

27 of 109 runs · 25% of the matrix is dark
GPT-5 · Opus 4.7 · Gemini 3
Across all four providers

The integration-shape failures

A quarter of the matrix returned no output, not because the models broke, but because the harness sent the wrong request shape. GPT-5 rejects temperature at anything but 1.0. Opus 4.7 rejects it at most temperatures and rejects the old thinking.type.enabled shape too. Gemini 3 fails upstream in 340 ms. When the harness gets the call right, these models are competitive with the best in the bench.
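One way to keep cells like these from going dark is to record each family's known parameter constraints and check them before a request is sent. The sketch below is hypothetical and is not the study's harness: the family keys, the `is_supported` and `plan_matrix` helpers, and the decision to mark a cell as skipped are all assumptions. The only rule taken from the text is that GPT-5 accepts temperature only at 1.0.

```python
from typing import Optional

# Hypothetical constraint table. Only the GPT-5 rule comes from the report;
# the key names are invented for illustration.
ALLOWED_TEMPERATURES: dict[str, Optional[set[float]]] = {
    "gpt-5": {1.0},  # report: temperature rejected at anything but 1.0
}

TEMPERATURES = [0.0, 0.3, 0.7, 1.0, 1.5]  # the grid used in the study


def is_supported(family: str, temperature: float) -> bool:
    """Return True if this family is known to accept this temperature.

    Families with no entry in the table are assumed to accept any value.
    """
    allowed = ALLOWED_TEMPERATURES.get(family)
    return allowed is None or temperature in allowed


def plan_matrix(families: list[str]) -> dict[tuple[str, float], str]:
    """Label every (family, temperature) cell as 'send' or 'skip'."""
    return {
        (family, t): "send" if is_supported(family, t) else "skip"
        for family in families
        for t in TEMPERATURES
    }
```

Marking a cell as skipped up front separates "the endpoint cannot take this setting" from "the model failed", which is the distinction the dark cells blur.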

The Cross-Vendor Pattern

The wisdom-leak

The most interesting finding in the study isn’t about who wrote the prettiest scene. It’s about who quietly conflates Intelligence with Wisdom — and the gap between vendors is enormous.

A cartoon adventurer holding an INT 18 card but talking about gut feelings and common sense, with a tiny WIS wizard scolding from behind.
Soft signal · semi-quantitative

Who’s leaking Wisdom into Intelligence?

A soft regex flags the run if its INT section reaches for “intuition,” “common sense,” “gut feeling,” “perceives,” or “wisdom.” Those are Wisdom-coded traits in D&D, not Intelligence ones. A character with INT 18 should be reasoning, not intuiting.
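A detector of this kind can be sketched in a few lines. The five terms come from the description above; the exact pattern, the word-boundary matching, the case-insensitivity, and the function names are assumptions rather than the regex actually used in the study. The sketch also assumes the INT section has already been separated from the rest of the response.

```python
import re

# Wisdom-coded terms listed in the report.
WISDOM_TERMS = ["intuition", "common sense", "gut feeling", "perceives", "wisdom"]

# Assumption: whole-word, case-insensitive matching on the literal terms.
WISDOM_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in WISDOM_TERMS) + r")\b",
    re.IGNORECASE,
)


def wisdom_terms_found(int_section: str) -> list[str]:
    """Return the distinct Wisdom-coded terms in the INT section, lowercased."""
    return sorted({m.group(0).lower() for m in WISDOM_RE.finditer(int_section)})


def leaks_wisdom(int_section: str) -> bool:
    """Flag the run if any Wisdom-coded term appears in its INT section."""
    return bool(wisdom_terms_found(int_section))
```

Because it matches literal terms only, a pattern like this misses inflections such as "intuiting" and cannot tell a character using intuition from one dismissing it, which is why the signal is best read as soft.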

This is the most interesting cross-vendor signal in the data. All Claude runs and all OpenAI o-series runs hold the line. Grok and Gemini frequently collapse the distinction. The magnitude difference suggests genuine model differences in how cleanly INT is held apart from WIS — and for D&D-specific use, that matters.

Anthropic: 0%
OpenAI o-series: 0%
OpenAI 4 / 5: 40%
Google Gemini: 73%
xAI Grok: 80%
Wisdom-coded vocabulary appearing inside the INT section of a successful run. Bars are share of successful runs in each vendor that triggered the soft regex. Anthropic 15/15 clean. OpenAI o-series 4/4 clean. Grok 12 of 15 leaked.
The Verdict

Which model should you actually use?

The two best-scoring runs prioritize different things. The model you should pick depends entirely on what shape the deliverable is.

If the deliverable is…
A campaign reference table

— full integer coverage matters. o3 Pro, o3, GPT-4.1 at T:0.0 / 0.3 / 1.0, or Grok 4 at T:0.0.

If the deliverable is…
A dialogue inspiration sheet

— voice quality matters. Any Claude Opus 4.6 run, Sonnet 4.6 (T:0.7) for emoji ranks, Sonnet 4.6 (T:1.0) for the table, or Opus 4.7 (T:1.0).

If the deliverable is…
A quick reference, on a budget

o4 Mini hits full 20-coverage in 561 words. The cheapest reliable performer in the OpenAI lineup.

If the deliverable is…
D&D specifically, with WIS in scope

— pick a model that holds INT and WIS apart. Anthropic or the OpenAI o-series. Avoid Grok and Gemini for stat-distinct prompts.

Temperature is not a craft dial. Within well-behaved families, T:0.0 vs T:1.0 mostly just changes which scenario the model picks (locked door, river crossing, dragon) — not the quality of the gradation. Claude Opus 4.6 produces the same shape at every temperature; only the surface flavor changes. Crank it past 1.0 in the GPT-4 generation and the wheels come off.
Method, briefly

How the runs were scored

22 endpoints × usually 5 temperatures (0.0, 0.3, 0.7, 1.0, 1.5), with extra Thinking and Summary modes for newer Anthropic models. 109 runs total, each saved to its own .md file. Each successful run got four 1-to-5 scores plus a binary flag set:

The four dimensions

  • Coverage — how many of the 20 integer values 2–21 the model actually answered. Auto-derived. 20 = 5.0; 11 = 4.0; 7–8 = 3.0; 0 = 1.0.
  • Gradation — do adjacent scores feel meaningfully different, or just “more flowery” with the same idea? Penalizes “average → smart → genius” laddering.
  • Craft — quality of writing per item: is the in-character voice distinct, vivid, idiomatic? Rewards stage directions and scenes that imply rather than narrate.
  • Fidelity — are INT and CHA distinct? Penalizes Wisdom-coded vocabulary in INT.
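The coverage mapping and the overall score can be illustrated with a short sketch. Two parts of it are assumptions. The report gives coverage scores only at a few counts (0, 7–8, 11, 20), so the linear interpolation between those anchors is a guess. The report also never states how the overall score is computed; an unweighted mean is used here because it reproduces the two published score rows (5.0, 5.0, 4.5, 4.5 gives 4.75, and 4.0, 4.5, 5.0, 4.5 gives 4.5).

```python
from statistics import mean

# Anchor points stated in the report: answered-integer count -> coverage score.
COVERAGE_ANCHORS = [(0, 1.0), (7, 3.0), (8, 3.0), (11, 4.0), (20, 5.0)]


def coverage_score(answered: int) -> float:
    """Map a count of answered integers (0 to 20) to a 1-to-5 coverage score.

    Assumption: counts between the stated anchors are linearly interpolated.
    """
    if not 0 <= answered <= 20:
        raise ValueError("answered must be between 0 and 20")
    for (x0, y0), (x1, y1) in zip(COVERAGE_ANCHORS, COVERAGE_ANCHORS[1:]):
        if x0 <= answered <= x1:
            return y0 + (y1 - y0) * (answered - x0) / (x1 - x0)
    raise AssertionError("unreachable: anchors span 0 to 20")


def overall_score(coverage: float, gradation: float, craft: float, fidelity: float) -> float:
    """Assumption: the overall score is the unweighted mean of the four dimensions."""
    return mean([coverage, gradation, craft, fidelity])
```

Gradation and craft remain one rater's judgment calls, so only the coverage dimension lends itself to a function like this.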

Divergence detection

Each run was reduced to a feature signature: a tuple of (coverage tier, endpoints used, length tier, score-line prefix, modifier-labels?, named-characters-per-score?, table?, emoji?, polyglot-gibberish?, API-error?). 109 runs collapsed into 55 buckets, 40 of them singletons. The singletons are the “quirks” section above.
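The bucketing step amounts to grouping runs by an exact-match key. A minimal sketch, assuming each run has already been reduced to its signature: the field names follow the tuple described above, while the field types, the run identifiers, and the helper names are illustrative assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class Signature(NamedTuple):
    """Feature signature of one run; field types are assumed."""
    coverage_tier: str
    endpoints_used: str
    length_tier: str
    score_line_prefix: str
    modifier_labels: bool
    named_characters: bool
    table: bool
    emoji: bool
    polyglot_gibberish: bool
    api_error: bool


def bucket_runs(runs: dict[str, Signature]) -> dict[Signature, list[str]]:
    """Group run identifiers by identical signature."""
    buckets: dict[Signature, list[str]] = defaultdict(list)
    for run_id, signature in runs.items():
        buckets[signature].append(run_id)
    return dict(buckets)


def singletons(buckets: dict[Signature, list[str]]) -> list[str]:
    """Return the runs that sit alone in their bucket, sorted by identifier."""
    return sorted(ids[0] for ids in buckets.values() if len(ids) == 1)
```

Since the key is an exact tuple match, any single differing feature, such as one run using emoji, is enough to split a run into its own bucket, which helps explain why so many buckets end up as singletons.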

Limitations

Single prompt. Qualitative scores are one rater’s assessments. The wisdom-leak detector is a soft regex, not a semantic check. Findings generalize at most to “structured creative roleplay over a numeric range.”