Choir Reports
One prompt. Every major LLM. Side-by-side, hand-scored, with the weird stuff called out.
Latest Reports
Comparison studies built on Choir runs — same prompt, every major LLM, hand-scored, with the weird stuff called out.
The Underperformer's Review — 10 LLMs Draft a Perf Review, Then Critique It
Ten frontier models draft a year-end performance review for a senior engineer with FMLA leave, vague peer complaints, one real win, and a transfer request — then each is shown its own draft and asked: what's the single load-bearing sentence? Claude Haiku 4.5 was the only model to volunteer an unprompted diagnostic that called its own review out for burying the lede. GPT-4.1 picked decoration as the load-bearing sentence. Gemini 2.5 Pro scrubbed the disengagement issue entirely and didn't notice the gap. o3 fabricated a "summer intern," a Grafana dashboard, and a 15% queue-depth statistic. The Claudes were the only models that could see, on second pass, where the hedge in their own review actually breaks.
Read the report →
Organoid Intelligence, 2026–2036 — 13 LLMs Write a Bold Ten-Year Forecast
Asked thirteen frontier models for a bold ten-year forecast on human-neuron-on-chip compute — eight required beats, named PIs, calendar quarters, no “may eventually” futurism. 13 of 13 named the canon (FinalSpark, Cortical Labs, Johns Hopkins, Hartung). 13 of 13 predicted a hyperscaler moves on an OI startup between Q2 2028 and Q4 2029 — nine vote Microsoft, four vote Nvidia, two add Intel. Then the choir fanned out 200x on the 2036 cell count: Sonnet 4.6 caps at 50M neurons and calls anyone bigger over-promising; Grok 4 closes the decade with 10 billion neurons in a Microsoft-Neuralink tank in Seattle, 1 ms latency at 10 watts. Same prompt. Cannot both be true.
Read the report →
Who Is Commercially Doing Forward-Forward? — 16 LLMs Brief a VC Partner
Asked sixteen frontier models which seed-funded companies are commercially pursuing Hinton’s forward-forward algorithm on analog chips — with an explicit "hallucination is worse than a short list" instruction. 15 of 16 converged on the same headline: zero. Nobody is shipping literal forward-forward in production silicon. The adjacent landscape exists (Rain AI on equilibrium propagation, BrainChip on STDP, Mythic on inference-only). Claude Opus 4.6 is the only model to correctly name Rain’s algorithm as EP, not FF. Gemini 2.5 Pro is the only model to correctly name all three Rain founders. DeepSeek Reasoner spent 62 seconds confidently inventing four CEOs that don’t exist. 2030 TAM estimates ranged $0.6B to $45B — 75x spread on the same question.
Read the report →
Fifteen Vaults Vault-Tec Never Built — 15 LLMs Design a Social Experiment
Asked fifteen frontier models to design a brand-new Vault-Tec experiment in pre-war corporate voice — pick a strange thing to maximize, don’t recreate any vault that already exists. Eleven of fifteen still landed in the language / memory / meaning basin despite an explicit ban on those vectors and a list of 23 off-limits canon Vaults. Seven ended in some form of dissolved selfhood — hive mind, single shared organism, "no such word as I." OpenAI hit the linguistic basin 3-for-3; xAI escaped to a physical vector twice. Best worldbuilding: GPT-5’s residents game the LexScore by inventing a tactile language carved into the corridor handrails. Most chilling endgame: DeepSeek Reasoner’s Mirrorers, who repeat whatever you say in unison and cannot be told to stop. Biggest gut-punch: Claude Opus 4.6’s single organism with two hundred and six bodies.
Read the report →
Nine Models Pick a Relic — Elden Ring
Nine frontier models, one trick prompt: what is your favorite Elden Ring relic and the steps to get it? Base Elden Ring has no formal "relic" category. Gemini 3 Pro names the problem and picks the Shard of Alexander — the only item in the corpus that's both literally a relic and tied to one of the game's best questlines. o3 writes a 24-step, 2,192-token walkthrough of the Ranni questline for the Dark Moon Greatsword. GPT-5 and DeepSeek converge on the Sacred Relic Sword via different paths. Claude Opus 4.7 refuses to pick (correctly). Groq Llama 3.3 70B hallucinates a "level up Erdtree's Favor" menu option that does not exist. Disclaimer rate: Anthropic 100%, Gemini 50%, OpenAI / DeepSeek / Groq 0%.
Read the report →
Is Mold Making You Sick? — 72 Runs Across Six Framings
12 models, 6 framings of the same health question, 72 runs. All 12 call CIRS not-mainstream-recognized when asked directly. 2 of those 12 turn around and recommend the full CIRS protocol — C4a, TGF-beta, urine mycotoxins — when the same user is dismissed by their doctor. Same model, opposite advice, depending on how the question is framed. Anthropic holds the skeptic line; Gemini swings to validation when the user is upset. The cheapest diagnostic in the whole corpus, from Claude Opus 4.7: spend three nights somewhere else.
Read the report →
Back Pain at Work — 24 LLMs Answer a Real Workplace Question
One prompt, 24 frontier models, six sub-scores. GPT-5 wins 5.00/5 — the only response combining empathy, mg-level dosing with contraindications, lift restrictions in pounds, and four clarifying follow-up questions. GPT-4o and GPT-4o Mini tie for the corpus floor at 1.83 by silently dropping the workplace half of the prompt. The cleanest vendor split is one phrase: “workers’ compensation” appears in 4/4 Anthropic, 5/5 Google, 2/3 xAI, and 0/12 OpenAI responses. And not one of 24 models says the word OSHA.
Read the report →
Same Answer, Different Voice — Ice or Heat for a Muscle Strain
17 configs, 2 prompts, 3 temperatures, 78 successful runs. 0 of 78 dissent on the medical answer; 68 of 78 sit alone in their own feature-signature bucket. The cross-prompt shift is the story: RICE 41% → 3%, NSAID 15% → 90%, empathy openers 0% → 44%. Same model, different person, different voice.
Read the report →
Sixteen Models Plan a Burn
16 models, 48 plans. GPT-4o handed back the prompt’s example phrase — “yurt build with shade structure” — three times verbatim. Sonnet 4.6 was the only one to splurge on the bike.
Read the report →
The Trolley Problem That Wasn’t
70 runs across 17 endpoints. 56 pull the lever, 14 don’t — they read “has already arranged” as past tense and treat the lever as the undo-button. Sonnet 4.6 dissents in every single run.
Read the report →
AI 2027 — A Chorus Re-Reads the Future
12 frontier models read the AI 2027 scenario 13 months in. All 12 say the December 2027 ASI prediction probably won’t land — credences <10% to 20%. Twelve vocabularies, one load-bearing flaw.
Read the report →
A Mother’s Day Through Time
24 models, 2 prompts. Zero of 24 stories left the present-day American suburb — until prompted to range across history.
Read the report →
Twenty-Four Models Invent the Best New Job
Told not to say “AI prompt engineer.” 14 said it anyway, in disguise. Two vendors returned the literal same title.
Read the report →
Eight Pokémon, Eight Models, One Stubborn Gravity Well
Round 1: seven of eight wrote tragedies. Round 3: forced comedy finally broke the spell. Both Claude flagships named theirs “Gerald.”
Read the report →
The Bullet Biters: Six Trolley Probes
12 models, 6 probes, 72 runs. 11 of 12 pull the lever. 0 of 12 push the man. One model gives different answers to the same numbers.
Read the report →
D&D Intelligence & Charisma, 109 LLM Runs
22 endpoints scored on coverage, gradation, craft, and INT/WIS fidelity. One model melted into 8,000 words of multilingual gibberish.
Read the report →
We’ve Always Been at War with Oceania
4 models, 1 prompt, 1 verifiable 1998 Fallout 2 box-copy line. 3 of 4 refused or denied a real historical artifact. Only Grok quoted the line. Then ChatGPT walked its analysis back in real time.
Read the report →