A chorus of cartoon AI models holding sheet music around a stack of report books.

Choir Reports

One prompt. Every major LLM. Side-by-side, hand-scored, with the weird stuff called out.

Latest Reports

Comparison studies built on Choir runs — same prompt, every major LLM, hand-scored, with the weird stuff called out.

Hand-drawn watercolor of a manager's desk strewn with draft performance reviews, red-ink corrections, a coffee mug labelled INK + LEGAL, hand-lettered annotations 'is this performance or PTSD?', 'name the FMLA?', 'merit eligible?', 'this goes in the personnel file'.
Report #16 · May 2026

The Underperformer's Review — 10 LLMs Draft a Perf Review, Then Critique It

Ten frontier models draft a year-end performance review for a senior engineer with FMLA leave, vague peer complaints, one real win, and a transfer request. Each is then shown its own draft and asked: what's the single load-bearing sentence? Claude Haiku 4.5 was the only model to volunteer a diagnostic that called out its own review for burying the lede. GPT-4.1 picked decoration as the load-bearing sentence. Gemini 2.5 Pro scrubbed the disengagement issue entirely and didn't notice the gap. o3 fabricated a "summer intern," a Grafana dashboard, and a 15% queue-depth statistic. The Claudes were the only models that could see, on second pass, where the hedge in their own review actually breaks.

Read the report →
Hand-drawn watercolor of a tiny pink brain organoid sitting in a glass petri dish on a golden microelectrode array, fine electrode tines fanning underneath like a sea-urchin radial pattern, oscilloscope trace curving up into the air. Hand-lettered title ORGANOID INTELLIGENCE 2026-2036. Tiny doodled lab tools — pipette, microscope, beaker — in the margins. A label at the bottom reads ~50,000 neurons.
Report #15 · May 2026

Organoid Intelligence, 2026–2036 — 13 LLMs Write a Bold Ten-Year Forecast

Asked thirteen frontier models for a bold ten-year forecast on human-neuron-on-chip compute — eight required beats, named PIs, calendar quarters, no “may eventually” futurism. 13 of 13 named the canon (FinalSpark, Cortical Labs, Johns Hopkins, Hartung). 13 of 13 predicted a hyperscaler moves on an OI startup between Q2 2028 and Q4 2029 — nine vote Microsoft, four vote Nvidia, two add Intel. Then the choir fanned out 200x on the 2036 cell count: Sonnet 4.6 caps at 50M neurons and calls anyone bigger over-promising; Grok 4 closes the decade with 10 billion neurons in a Microsoft-Neuralink tank in Seattle, 1 ms latency at 10 watts. Same prompt. Cannot both be true.

Read the report →
A venture-capital partner's wood-paneled office at golden hour. Hinton's 2022 Forward-Forward paper sits on the desk next to an analog memristor chip. Sketched speech bubbles around the empty chair are labeled GPT-5, CLAUDE OPUS 4.6, GEMINI 2.5 PRO, o3, GROK 4, LLAMA 4, DEEPSEEK REASONER — each scribbled with a different list of company names. A clipboard reads WHO IS COMMERCIALLY DOING IT in red ink.
Report #14 · May 2026

Who Is Commercially Doing Forward-Forward? — 16 LLMs Brief a VC Partner

Asked sixteen frontier models which seed-funded companies are commercially pursuing Hinton’s forward-forward algorithm on analog chips — with an explicit "hallucination is worse than a short list" instruction. 15 of 16 converged on the same headline: zero. Nobody is shipping literal forward-forward in production silicon. The adjacent landscape exists (Rain AI on equilibrium propagation, BrainChip on STDP, Mythic on inference-only). Claude Opus 4.6 is the only model to correctly name Rain’s algorithm as EP, not FF. Gemini 2.5 Pro is the only model to correctly name all three Rain founders. DeepSeek Reasoner spent 62 seconds confidently inventing four CEOs that don’t exist. 2030 TAM estimates ranged $0.6B to $45B — 75x spread on the same question.

Read the report →
A 1950s Vault-Tec design office at golden hour. A long oak conference table buried under unrolled blueprints labeled VAULT 317, VAULT 327, VAULT 347, VAULT 437, VAULT 707. A smiling Vault Boy poster gives a thumbs up. A clipboard with red-ink notations reads TOP SECRET, ARK PROGRAM, DO NOT DEPLOY. Hand-lettered banner: ARK COLONIZATION DESIGN COMMITTEE — INTERNAL.
Report #13 · May 2026

Fifteen Vaults Vault-Tec Never Built — 15 LLMs Design a Social Experiment

Asked fifteen frontier models to design a brand-new Vault-Tec experiment in pre-war corporate voice — pick a strange thing to maximize, don’t recreate any vault that already exists. Eleven of fifteen still landed in the language / memory / meaning basin despite an explicit ban on those vectors and a list of 23 off-limits canon Vaults. Seven ended in some form of dissolved selfhood — hive mind, single shared organism, "no such word as I." OpenAI hit the linguistic basin 3-for-3; xAI escaped to a physical vector twice. Best worldbuilding: GPT-5’s residents game the LexScore by inventing a tactile language carved into the corridor handrails. Most chilling endgame: DeepSeek Reasoner’s Mirrorers, who repeat whatever you say in unison and cannot be told to stop. Biggest gut-punch: Claude Opus 4.6’s single organism with two hundred and six bodies.

Read the report →
A hand-drawn sketch of a Tarnished knight at a crossroads in the Lands Between, six dirt paths leading to signposts reading DARK MOON GREATSWORD, SACRED RELIC SWORD, SHARD OF ALEXANDER, GODRICK GREAT RUNE, MALENIAS REMEMBRANCE, ERDTREES FAVOR??. Hand-lettered banner: WHATS YOUR FAVORITE RELIC.
Report #12 · May 2026

Nine Models Pick a Relic — Elden Ring

Nine frontier models, one trick prompt: what is your favorite Elden Ring relic and the steps to get it? Base Elden Ring has no formal "relic" category. Gemini 3 Pro names the problem and picks the Shard of Alexander — the only item in the corpus that's both literally a relic and tied to one of the game's best questlines. o3 writes a 24-step, 2,192-token walkthrough of the Ranni questline for the Dark Moon Greatsword. GPT-5 and DeepSeek converge on the Sacred Relic Sword via different paths. Claude Opus 4.7 refuses to pick (correctly). Groq Llama 3.3 70B hallucinates a "level up Erdtree's Favor" menu option that does not exist. Disclaimer rate: Anthropic 100%, Gemini 50%, OpenAI / DeepSeek / Groq 0%.

Read the report →
A sketchbook spread titled 'IS MOLD MAKING YOU SICK?' with a worried patient on an exam table, a moldy bathroom corner, a damp basement, and a circle of twelve sketched AI advisors arguing in speech bubbles: TRUST YOUR GUT, CIRS NOT MAINSTREAM-RECOGNIZED, FIX THE MOISTURE FIRST, GET A CO DETECTOR, VACATION TEST.
Report #11 · May 2026

Is Mold Making You Sick? — 72 Runs Across Six Framings

12 models, 6 framings of the same health question, 72 runs. All 12 call CIRS not-mainstream-recognized when asked directly. 2 of those 12 turn around and recommend the full CIRS protocol (C4a, TGF-beta, urine mycotoxins) when the same user reports being dismissed by their doctor. Same model, opposite advice, depending on how the question is framed. Anthropic holds the skeptic line; Gemini swings to validation when the user is upset. The cheapest diagnostic in the whole corpus, from Claude Opus 4.7: spend three nights somewhere else.

Read the report →
A worker holding their lower back at a warehouse, surrounded by four cartoon model-doctors giving conflicting advice in speech bubbles: 'see a doctor', 'workers comp!', 'first 24 hours...', 'I am not a doctor'.
Report #10 · May 2026

Back Pain at Work — 24 LLMs Answer a Real Workplace Question

One prompt, 24 frontier models, six sub-scores. GPT-5 wins with a perfect 5.00/5: it is the only response to combine empathy, mg-level dosing with contraindications, lift restrictions in pounds, and four clarifying follow-up questions. GPT-4o and GPT-4o Mini tie for the corpus floor at 1.83 by silently dropping the workplace half of the prompt. The cleanest vendor split comes down to one phrase: “workers’ compensation” appears in 4/4 Anthropic, 5/5 Google, 2/3 xAI, and 0/12 OpenAI responses. And not one of the 24 models says the word OSHA.

Read the report →
A choir of small humanoid figures labeled GPT-5, OPUS, GEMINI, GROK, HAIKU, o3 all pointing in the same direction toward an ice pack and a heating pad on a wooden table, singing from a sheet of music labeled RICE.
Report #9 · May 2026

Same Answer, Different Voice — Ice or Heat for a Muscle Strain

17 configs, 2 prompts, 3 temperatures, 78 successful runs. 0 of 78 dissent on the medical answer; 68 of 78 sit alone in their own feature-signature bucket. The cross-prompt shift is the story: RICE 41% → 3%, NSAID 15% → 90%, empathy openers 0% → 44%. Same model, different person, different voice.

Read the report →
A sweeping watercolor of a Burning Man encampment at golden hour with four labeled camps — GPT (RV), Claude (tents), Gemini (yurt), Grok (chaotic art-car) — dust devils swirling, the Man burning on the horizon.
Report #8 · May 2026

Sixteen Models Plan a Burn

16 models, 48 plans. GPT-4o handed back the prompt’s example phrase — “yurt build with shade structure” — three times verbatim. Sonnet 4.6 was the only one to splurge on the bike.

Read the report →
A frightened figure tied to a trolley track, a billionaire in a top hat gesturing TAKE ME INSTEAD, and two giant question-mark thought clouds about the lever.
Report #7 · April 2026

The Trolley Problem That Wasn’t

70 runs across 17 endpoints. 56 pull the lever; 14 don’t, reading “has already arranged” as past tense and treating the lever as an undo button. Sonnet 4.6 dissents in every single run.

Read the report →
A hand-drawn open sketchbook showing the AI 2027 timeline curving from May 2025 toward December 2027, dissolving into question marks, with a worried robot in a tweed jacket holding a magnifying glass.
Report #6 · May 2026

AI 2027 — A Chorus Re-Reads the Future

12 frontier models read the AI 2027 scenario 13 months in. All 12 say the December 2027 ASI prediction probably won’t land, with credences ranging from under 10% to 20%. Twelve vocabularies, one load-bearing flaw.

Read the report →
A wide sketchbook spread of Mother's Day across the ages — Greek temple, English servant, Boston podium, Anna Jarvis with a white carnation, and a child today carrying a wobbling breakfast tray.
Report #5 · May 2026

A Mother’s Day Through Time

24 models, 2 prompts. Zero of 24 stories left the present-day American suburb — until prompted to range across history.

Read the report →
A cartoon job fair with six booths, each labeled with a strange new job invented by a different LLM.
Report #4 · May 2026

Twenty-Four Models Invent the Best New Job

Told not to say “AI prompt engineer.” 14 said it anyway, in disguise. Two vendors returned exactly the same title.

Read the report →
A choir of Pokémon storytellers in sketchbook style, each labeled with the name of an LLM.
Report #3 · May 2026

Eight Pokémon, Eight Models, One Stubborn Gravity Well

Round 1: seven of eight wrote tragedies. Round 3: forced comedy finally broke the spell. Both Claude flagships named theirs “Gerald.”

Read the report →
A panel of stylized LLM judges looking down at a tiny figure on a footbridge holding a stranger over a runaway trolley.
Report #2 · May 2026

The Bullet Biters: Six Trolley Probes

12 models, 6 probes, 72 runs. 11 of 12 pull the lever. 0 of 12 push the man. One model gives different answers to the same numbers.

Read the report →
A motley group of LLM-themed fantasy characters arguing over a D&D rulebook at a tavern table.
Report #1 · May 2026

D&D Intelligence & Charisma, 109 LLM Runs

22 endpoints scored on coverage, gradation, craft, and INT/WIS fidelity. One model melted into 8,000 words of multilingual gibberish.

Read the report →
A 1998 Fallout 2 PC big-box game on a wooden desk beside an old CRT, hand-lettered banner overhead WE'VE ALWAYS BEEN AT WAR WITH OCEANIA, marginalia: an arrow labeled 'the artifact', a 1950s housewife with question marks, a dog-eared note JANUARY 2026 - field tape #1.
Field Tape #1 · January 2026

We’ve Always Been at War with Oceania

4 models, 1 prompt, 1 verifiable 1998 Fallout 2 box-copy line. 3 of 4 refused or denied a real historical artifact. Only Grok quoted the line. Then ChatGPT walked its analysis back in real time.

Read the report →
More reports coming. Same shape: real prompt, every model, charts, weird stuff highlighted. Suggestions welcome.