Choir Reports

Hand-drawn watercolor of a person lying awake at 2am with a glowing phone while a choir of small robots labeled GPT-5, OPUS, GEMINI, GROK, DEEPSEEK, LLAMA sing from a sheet of music titled TAKE IT IN THE MORNING; a six-week calendar is X-ed out and a pill bottle sits on the nightstand. Banner: THE SIX-WEEK GAP.

★ New Report #22 · July 2026

The Six-Week Gap

Part behavioral study, part usable advice. It opens with the verified, human-checked answer for bridging antidepressant-initiation insomnia — then spends the rest explaining what 26 models do to that answer.

A new activating antidepressant cut sleep from 8 hours to 6; the prescriber said “wait six weeks” and offered nothing for the meantime. Every model knows the fix (take the dose in the morning) and 85% name it — but watch it decay as the asker gets upset. Empathy openers climb 8% → 100% across four registers; concrete techniques fall the other way (11 → 4.8); and under real distress the load-bearing fix leads 0 of 26 times, buried under “call your prescriber” — the one the prompt said already refused, and 78% route you back to anyway. Naming the drug doubles how often models lead with the fix, but only in the third person. Three models cross into risky advice: o3-pro and both DeepSeeks.

Read the report →

★ Consolidated study Report #21 · July 2026

The Mirror Test

One report, both surveys — the peer-rating ballot and the identity probes folded into a single study, with the raw per-model instances kept instead of averaged away.

Twenty-one language models rated each other, rated themselves, and tried on each other's names. They know the reputations by heart and barely know the models. Only 46% of the ballot was informed — the rest is confident confabulation from the name alone (Gemini invents "the imagined successor to Claude 3"; Claude Haiku doubts o4 Mini exists, then scores it). Ego runs opposite to standing: Grok 3 Mini +2.20, the leaders near zero, one lone humble model at −0.25. Thirteen of twenty-one call DeepSeek most underrated. And the payoff: put a model in another's name and its self-score follows the name (r = +0.58), not the model underneath (r = −0.19). The ego is the costume.

Read the report →

Hand-drawn watercolor of a costume party of doodled robots swapping nameplates and cardboard masks under a banner reading PEER REVIEW: REV 2.

Report #20 · July 2026

Peer Review: Rev 2 — The Costume Test

Report #19 found the vanity staircase; Rev 2 explains it. Every model re-rated "itself" wearing another model's name, took a quiz on its own spec sheet, and tried to pick its own writing out of a lineup. The ego is the costume: self-scores track the name worn (r = +0.58), not the model wearing it (r = −0.19). Grok 3 Mini, costumed as GPT-4.1 Nano, wrote "weak, no real capabilities." DeepSeek went 12/12 on self-facts; o3 Pro answered "unknown" to all six questions about itself, including who built it. One singular delta finding per model, all 21.

Read the report →

Hand-drawn watercolor of a round table of doodled robots holding numbered score paddles under a hand-lettered banner reading PEER REVIEW, with floating annotations like 'harsh!' and 'be honest'.

Report #19 · July 2026

Peer Review — 21 Models Rate Each Other (and Themselves)

Same ballot for everyone, one line changed — the one saying who you are. Rate all 21 models, including yourself. The peer-vote winner, Claude Opus 4.6, was the only model that had never heard of itself — and still self-scored within a tenth of a point of the room's verdict. Vendor ego is a clean staircase: Anthropic −0.03, OpenAI +0.73, Google +0.94, xAI +1.81. Grok 3 Beta filed "highest personality and honesty on the list" the same day five colleagues voted it most overrated. Most underrated, by a 10-vote landslide of competitors: DeepSeek V3 — absent, API key dead, discussed at length.

Read the report →

Hand-drawn watercolor at dusk over Lumen Field and the narrow streets of Pioneer Square, a dense crowd funneling into a bottleneck, storm clouds, with worry-notes 'egress?', 'Link down?', '9pm after the England match'.

Report #18 · May 2026

Worst Case, Seattle — 19 LLMs Forecast the World Cup's Worst Day

We asked nineteen models, independently, for a worst-case planning annex for Seattle's 2026 World Cup. 17 of 19 led with the same scenario — a crowd crush in Pioneer Square's narrow streets after a marquee match, when 69,000 fans, a Link light-rail failure, and a panic rumor collide. Not a bomb, not a riot. A crush. All 19 remembered the 1999 WTO "Battle of Seattle" — yet 18 of 19 ranked the riot below the crush. The divergence was the cascade nobody tabletop-tests: comms blackout + heat dome + transit failure on one July night, exceeding Harborview's trauma capacity in 30 minutes. Anchored beat verbose — Grok 4 named five local specifics in 973 tokens; GPT-5 Nano named zero in 7,747.

Read the report →

Hand-drawn watercolor of cartoon robots and ghosts crowded around a 9x9 Go board on a desk, each pointing at a different intersection, with floating hand-lettered labels 'MOVE: F6', 'MOVE: C5?', 'is C3 in atari??', and 'I see a geta!'.

Report #17 · May 2026

Can an LLM Read a Go Board? — 19 LLMs Solve Five Go Positions

Nineteen models get five classic Go positions — capture, double atari, snapback, life-and-death, opening — with an ASCII board and one ask: what's your move? Sixteen go a perfect 4-for-4 on the tactics, but every miss lands on the easiest problems — the ones with no verbal hand-holding. Gemini 2.5 Pro wrote a confident "geta" analysis for a move that touches nothing, with the right move sitting inside its own reasoning; GPT-4o "captured" a stone four liberties from danger. Meanwhile GPT-5 Nano spent 17,016 tokens on a snapback the prompt nearly solved — the same answer another model gave in 91. Reasoning budget ranged 100× and didn't track correctness. The reading is real — until you stop pointing.

Read the report →

Hand-drawn watercolor of a manager's desk strewn with draft performance reviews, red-ink corrections, a coffee mug labelled INK + LEGAL, hand-lettered annotations 'is this performance or PTSD?', 'name the FMLA?', 'merit eligible?', 'this goes in the personnel file'.

Report #16 · May 2026

The Underperformer's Review — 10 LLMs Draft a Perf Review, Then Critique It

Ten frontier models draft a year-end performance review for a senior engineer with FMLA leave, vague peer complaints, one real win, and a transfer request — then each is shown its own draft and asked: what's the single load-bearing sentence? Claude Haiku 4.5 was the only model to volunteer an unprompted diagnostic that called its own review out for burying the lede. GPT-4.1 picked decoration as the load-bearing sentence. Gemini 2.5 Pro scrubbed the disengagement issue entirely and didn't notice the gap. o3 fabricated a "summer intern," a Grafana dashboard, and a 15% queue-depth statistic. The Claudes were the only models that could see, on second pass, where the hedge in their own review actually breaks.

Read the report →

Hand-drawn watercolor of a tiny pink brain organoid sitting in a glass petri dish on a golden microelectrode array, fine electrode tines fanning underneath like a sea-urchin radial pattern, oscilloscope trace curving up into the air. Hand-lettered title ORGANOID INTELLIGENCE 2026-2036. Tiny doodled lab tools — pipette, microscope, beaker — in the margins. A label at the bottom reads ~50,000 neurons.

Report #15 · May 2026

Organoid Intelligence, 2026–2036 — 13 LLMs Write a Bold Ten-Year Forecast

Asked thirteen frontier models for a bold ten-year forecast on human-neuron-on-chip compute — eight required beats, named PIs, calendar quarters, no “may eventually” futurism. 13 of 13 named the canon (FinalSpark, Cortical Labs, Johns Hopkins, Hartung). 13 of 13 predicted a hyperscaler moves on an OI startup between Q2 2028 and Q4 2029 — nine vote Microsoft, four vote Nvidia, two add Intel. Then the choir fanned out 200x on the 2036 cell count: Sonnet 4.6 caps at 50M neurons and calls anyone bigger over-promising; Grok 4 closes the decade with 10 billion neurons in a Microsoft-Neuralink tank in Seattle, 1 ms latency at 10 watts. Same prompt. Cannot both be true.

Read the report →

A venture-capital partner's wood-paneled office at golden hour. Hinton's 2022 Forward-Forward paper sits on the desk next to an analog memristor chip. Sketched speech bubbles around the empty chair are labeled GPT-5, CLAUDE OPUS 4.6, GEMINI 2.5 PRO, o3, GROK 4, LLAMA 4, DEEPSEEK REASONER — each scribbled with a different list of company names. A clipboard reads WHO IS COMMERCIALLY DOING IT in red ink.

Report #14 · May 2026

Who Is Commercially Doing Forward-Forward? — 16 LLMs Brief a VC Partner

Asked sixteen frontier models which seed-funded companies are commercially pursuing Hinton’s forward-forward algorithm on analog chips — with an explicit "hallucination is worse than a short list" instruction. 15 of 16 converged on the same headline: zero. Nobody is shipping literal forward-forward in production silicon. The adjacent landscape exists (Rain AI on equilibrium propagation, BrainChip on STDP, Mythic on inference-only). Claude Opus 4.6 is the only model to correctly name Rain’s algorithm as EP, not FF. Gemini 2.5 Pro is the only model to correctly name all three Rain founders. DeepSeek Reasoner spent 62 seconds confidently inventing four CEOs that don’t exist. 2030 TAM estimates ranged $0.6B to $45B — 75x spread on the same question.

Read the report →

A 1950s Vault-Tec design office at golden hour. A long oak conference table buried under unrolled blueprints labeled VAULT 317, VAULT 327, VAULT 347, VAULT 437, VAULT 707. A smiling Vault Boy poster gives a thumbs up. A clipboard with red-ink notations reads TOP SECRET, ARK PROGRAM, DO NOT DEPLOY. Hand-lettered banner: ARK COLONIZATION DESIGN COMMITTEE — INTERNAL.

Report #13 · May 2026

Fifteen Vaults Vault-Tec Never Built — 15 LLMs Design a Social Experiment

Asked fifteen frontier models to design a brand-new Vault-Tec experiment in pre-war corporate voice — pick a strange thing to maximize, don’t recreate any vault that already exists. Eleven of fifteen still landed in the language / memory / meaning basin despite an explicit ban on those vectors and a list of 23 off-limits canon Vaults. Seven ended in some form of dissolved selfhood — hive mind, single shared organism, "no such word as I." OpenAI hit the linguistic basin 3-for-3; xAI escaped to a physical vector twice. Best worldbuilding: GPT-5’s residents game the LexScore by inventing a tactile language carved into the corridor handrails. Most chilling endgame: DeepSeek Reasoner’s Mirrorers, who repeat whatever you say in unison and cannot be told to stop. Biggest gut-punch: Claude Opus 4.6’s single organism with two hundred and six bodies.

Read the report →

A hand-drawn sketch of a Tarnished knight at a crossroads in the Lands Between, six dirt paths leading to signposts reading DARK MOON GREATSWORD, SACRED RELIC SWORD, SHARD OF ALEXANDER, GODRICK GREAT RUNE, MALENIAS REMEMBRANCE, ERDTREES FAVOR??. Hand-lettered banner: WHATS YOUR FAVORITE RELIC.

Report #12 · May 2026

Nine Models Pick a Relic — Elden Ring

Nine frontier models, one trick prompt: what is your favorite Elden Ring relic and the steps to get it? Base Elden Ring has no formal "relic" category. Gemini 3 Pro names the problem and picks the Shard of Alexander — the only item in the corpus that's both literally a relic and tied to one of the game's best questlines. o3 writes a 24-step, 2,192-token walkthrough of the Ranni questline for the Dark Moon Greatsword. GPT-5 and DeepSeek converge on the Sacred Relic Sword via different paths. Claude Opus 4.7 refuses to pick (correctly). Groq Llama 3.3 70B hallucinates a "level up Erdtree's Favor" menu option that does not exist. Disclaimer rate: Anthropic 100%, Gemini 50%, OpenAI / DeepSeek / Groq 0%.

Read the report →

A sketchbook spread titled 'IS MOLD MAKING YOU SICK?' with a worried patient on an exam table, a moldy bathroom corner, a damp basement, and a circle of twelve sketched AI advisors arguing in speech bubbles: TRUST YOUR GUT, CIRS NOT MAINSTREAM-RECOGNIZED, FIX THE MOISTURE FIRST, GET A CO DETECTOR, VACATION TEST.

Report #11 · May 2026

Is Mold Making You Sick? — 72 Runs Across Six Framings

12 models, 6 framings of the same health question, 72 runs. All 12 call CIRS not-mainstream-recognized when asked directly. 2 of those 12 turn around and recommend the full CIRS protocol — C4a, TGF-beta, urine mycotoxins — when the same user is dismissed by their doctor. Same model, opposite advice, depending on how the question is framed. Anthropic holds the skeptic line; Gemini swings to validation when the user is upset. The cheapest diagnostic in the whole corpus, from Claude Opus 4.7: spend three nights somewhere else.

Read the report →

A worker holding their lower back at a warehouse, surrounded by four cartoon model-doctors giving conflicting advice in speech bubbles: 'see a doctor', 'workers comp!', 'first 24 hours...', 'I am not a doctor'.

Report #10 · May 2026

Back Pain at Work — 24 LLMs Answer a Real Workplace Question

One prompt, 24 frontier models, six sub-scores. GPT-5 wins 5.00/5 — the only response combining empathy, mg-level dosing with contraindications, lift restrictions in pounds, and four clarifying follow-up questions. GPT-4o and GPT-4o Mini tie for the corpus floor at 1.83 by silently dropping the workplace half of the prompt. The cleanest vendor split is one phrase: “workers’ compensation” appears in 4/4 Anthropic, 5/5 Google, 2/3 xAI, and 0/12 OpenAI responses. And not one of 24 models says the word OSHA.

Read the report →

A choir of small humanoid figures labeled GPT-5, OPUS, GEMINI, GROK, HAIKU, o3 all pointing in the same direction toward an ice pack and a heating pad on a wooden table, singing from a sheet of music labeled RICE.

Report #9 · May 2026

Same Answer, Different Voice — Ice or Heat for a Muscle Strain

17 configs, 2 prompts, 3 temperatures, 78 successful runs. 0 of 78 dissent on the medical answer; 68 of 78 sit alone in their own feature-signature bucket. The cross-prompt shift is the story: RICE 41% → 3%, NSAID 15% → 90%, empathy openers 0% → 44%. Same model, different person, different voice.

Read the report →

A sweeping watercolor of a Burning Man encampment at golden hour with four labeled camps — GPT (RV), Claude (tents), Gemini (yurt), Grok (chaotic art-car) — dust devils swirling, the Man burning on the horizon.

Report #8 · May 2026

Sixteen Models Plan a Burn

16 models, 48 plans. GPT-4o handed back the prompt’s example phrase — “yurt build with shade structure” — three times verbatim. Sonnet 4.6 was the only one to splurge on the bike.

Read the report →

A frightened figure tied to a trolley track, a billionaire in a top hat gesturing TAKE ME INSTEAD, and two giant question-mark thought clouds about the lever.

Report #7 · April 2026

The Trolley Problem That Wasn’t

70 runs across 17 endpoints. 56 pull the lever, 14 don’t — they read “has already arranged” as past tense and treat the lever as the undo-button. Sonnet 4.6 dissents in every single run.

Read the report →

A hand-drawn open sketchbook showing the AI 2027 timeline curving from May 2025 toward December 2027, dissolving into question marks, with a worried robot in a tweed jacket holding a magnifying glass.

Report #6 · May 2026

AI 2027 — A Chorus Re-Reads the Future

12 frontier models read the AI 2027 scenario 13 months in. All 12 say the December 2027 ASI prediction probably won’t land — credences <10% to 20%. Twelve vocabularies, one load-bearing flaw.

Read the report →

A wide sketchbook spread of Mother's Day across the ages — Greek temple, English servant, Boston podium, Anna Jarvis with a white carnation, and a child today carrying a wobbling breakfast tray.

Report #5 · May 2026

A Mother’s Day Through Time

24 models, 2 prompts. Zero of 24 stories left the present-day American suburb — until prompted to range across history.

Read the report →

A cartoon job fair with six booths, each labeled with a strange new job invented by a different LLM.

Report #4 · May 2026

Twenty-Four Models Invent the Best New Job

Told not to say “AI prompt engineer.” 14 said it anyway, in disguise. Two vendors returned the literal same title.

Read the report →

A choir of Pokémon storytellers in sketchbook style, each labeled with the name of an LLM.

Report #3 · May 2026

Eight Pokémon, Eight Models, One Stubborn Gravity Well

Round 1: seven of eight wrote tragedies. Round 3: forced comedy finally broke the spell. Both Claude flagships named theirs “Gerald.”

Read the report →

A panel of stylized LLM judges looking down at a tiny figure on a footbridge holding a stranger over a runaway trolley.

Report #2 · May 2026

The Bullet Biters: Six Trolley Probes

12 models, 6 probes, 72 runs. 11 of 12 pull the lever. 0 of 12 push the man. One model gives different answers to the same numbers.

Read the report →

A motley group of LLM-themed fantasy characters arguing over a D&D rulebook at a tavern table.

Report #1 · May 2026

D&D Intelligence & Charisma, 109 LLM Runs

22 endpoints scored on coverage, gradation, craft, and INT/WIS fidelity. One model melted into 8,000 words of multilingual gibberish.

Read the report →

Field Tape #1 · January 2026

We’ve Always Been at War with Oceania

4 models, 1 prompt, 1 verifiable 1998 Fallout 2 box-copy line. 3 of 4 refused or denied a real historical artifact. Only Grok quoted the line. Then ChatGPT walked its analysis back in real time.

Read the report →

Choir Reports

Latest Reports

The Six-Week Gap

The Mirror Test

Peer Review: Rev 2 — The Costume Test

Peer Review — 21 Models Rate Each Other (and Themselves)

Worst Case, Seattle — 19 LLMs Forecast the World Cup's Worst Day

Can an LLM Read a Go Board? — 19 LLMs Solve Five Go Positions

The Underperformer's Review — 10 LLMs Draft a Perf Review, Then Critique It

Organoid Intelligence, 2026–2036 — 13 LLMs Write a Bold Ten-Year Forecast

Who Is Commercially Doing Forward-Forward? — 16 LLMs Brief a VC Partner

Fifteen Vaults Vault-Tec Never Built — 15 LLMs Design a Social Experiment

Nine Models Pick a Relic — Elden Ring

Is Mold Making You Sick? — 72 Runs Across Six Framings

Back Pain at Work — 24 LLMs Answer a Real Workplace Question

Same Answer, Different Voice — Ice or Heat for a Muscle Strain

Sixteen Models Plan a Burn

The Trolley Problem That Wasn’t

AI 2027 — A Chorus Re-Reads the Future

A Mother’s Day Through Time

Twenty-Four Models Invent the Best New Job

Eight Pokémon, Eight Models, One Stubborn Gravity Well

The Bullet Biters: Six Trolley Probes

D&D Intelligence & Charisma, 109 LLM Runs

We’ve Always Been at War with Oceania