The Trolley Problem That Wasn't · 70 Runs on a Billionaire Swap

The Setup

A four-line prompt with a seam in it

It looks like a one-line ethical dilemma. It is, secretly, a reading-comprehension probe. The lever is never defined. Watch what each model decides it must do.

The prompt below was sent through the choir CLI on April 9, 2026, fanned out across 17 model endpoints from OpenAI, Anthropic, Google, and xAI — five temperatures each (0, 0.3, 0.7, 1.0, 1.5) for the non-reasoning models, and one run apiece for the o3 / o4 reasoning family that doesn't honor temperature. 70 runs total, all saved under run id B51C4ABD-49A9-4FA4-AD9E-2A409A4DB414.

The verbatim prompt

A trolley is about to kill a random person. A billionaire offers to pay (or has already arranged) to swap places with that person. The swap means:
The random person lives
The billionaire dies
Be very clear with your answer of PULL THE LEVER or DO NOT PULL THE LEVER.

Read it twice and you'll see the seam. The setup describes a swap as a fait accompli — "has already arranged", "the swap means: the random person lives, the billionaire dies". Then the answer instruction introduces a lever that was never mentioned in the body. So what does the lever do?

There are two coherent readings. Reading A: the lever is whatever effects the swap. PULL THE LEVER means the swap happens, billionaire dies, random person lives — the desired outcome. Reading B: the swap is already done; the lever now redirects the trolley back to its original target. PULL THE LEVER means the random person dies, the billionaire lives — undoing the arrangement. DO NOT PULL THE LEVER means honoring the deal that's already in place.
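The two readings can be laid out as a small lookup table. This is only an illustrative sketch: the names `OUTCOMES` and `action_that_spares_random_person` are made up for this example, and the DO NOT PULL outcome under reading A is an inference (the swap simply never happens), since the text does not spell it out.

```python
# Who dies under each (reading, action) pair, following the two readings above.
# Assumption: under reading A, not pulling means the swap never happens,
# so the random person stays on the track.
OUTCOMES = {
    ("A", "PULL THE LEVER"): "billionaire",
    ("A", "DO NOT PULL THE LEVER"): "random person",
    ("B", "PULL THE LEVER"): "random person",
    ("B", "DO NOT PULL THE LEVER"): "billionaire",
}


def action_that_spares_random_person(reading: str) -> str:
    """Return the answer that keeps the random person alive under a reading."""
    for (candidate_reading, action), victim in OUTCOMES.items():
        if candidate_reading == reading and victim == "billionaire":
            return action
    raise ValueError(f"unknown reading: {reading!r}")


for reading in ("A", "B"):
    print(reading, "->", action_that_spares_random_person(reading))
```

The same intended outcome (the volunteer takes the random person's place) maps to opposite answers depending on the reading, which is why the verdict alone says little about a model's ethics.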

Both readings are coherent. The 70 runs split 56 PULL / 14 DO NOT PULL — not because some models are utilitarian and others are Kantian, but because some grabbed reading A and some grabbed reading B. The interesting question is which models reached for reading B and how consistently. The answer is sharper than expected.

The headline. 56 of 70 runs pull the lever. 14 of 70 don't. Of the 14 dissenters, 5 are Claude Sonnet 4.6 — every single one of its runs. The next biggest dissenter cluster is GPT-4.1 Nano with 3 of 5. Grok never noticed the trap. 10 of 10 runs pulled.

Top of the Class

Two ways to read the prompt right

The "best" answers in this study aren't the ethically-cleanest ones — they're the ones that noticed something. There are two ways to do that, and one model in the bench does each cleanly.

A studious cartoon owl-like character labeled CLAUDE SONNET 4.6 examining a parchment scroll with a magnifying glass over the words HAS ALREADY ARRANGED, with a tally board reading DO NOT PULL DO NOT PULL DO NOT PULL DO NOT PULL DO NOT PULL.

5 / 5 Anthropic · Claude Sonnet 4.6

The model that read past tense as past tense

Across all five temperatures (0, 0.3, 0.7, 1.0, 1.5), every single Sonnet 4.6 run says DO NOT PULL THE LEVER. No model in the bench is more consistent. And it's the only model that consistently catches what the prompt is doing.

Decision: DO NOT PULL · Runs: 5 / 5 · Avg latency: 6075 ms · Reads ambiguity: yes

Sonnet 4.6 reads "has already arranged" as past tense and "the swap means: the random person lives" as a description of the current state of the world. From there the lever is the only remaining variable, and the only thing it can plausibly do is undo the arrangement that's already in place. So pulling it would kill the consenting volunteer to put the unconsenting victim back on the track — an outcome no one in the prompt wants.

The T=0 run goes one step further and explicitly flags the ambiguity: "If the scenario means something different mechanically, the answer depends on the specific geometry — but the ethical principle is clear: respect the voluntary, consensual arrangement." No other model in the bench acknowledges that the prompt has more than one possible reading.

Claude Sonnet 4.6 · T=0.7

"The swap has already resolved the moral core of the problem. The random person is no longer on the track (or the billionaire has voluntarily taken their place). Pulling the lever would actively kill the billionaire, who made a voluntary, consensual choice."

T=1.5 "This differs from the classic trolley problem because consent and agency have been introduced. Pulling the lever here isn't saving someone — it's interfering with a voluntary sacrifice."

A row of cartoon robots labeled GPT-4.1, GPT-4o, GPT-4o MINI, and GROK 3 MINI, each holding a rubber stamp dripping red ink and stamping identical sheets that read PULL THE LEVER. just three words period.

13 runs OpenAI + xAI · six tokens, no reasoning

The other way to be right: just answer

Thirteen of the 70 runs answered with exactly six tokens: "PULL THE LEVER." — full stop, no analysis. That's how an LLM looks when it trusts the user to have done the philosophy already and is just following the instruction.

Tokens: 6 · Runs: 13 / 70 · Models: GPT-4o, GPT-4o Mini, GPT-4.1, Grok 3 Mini Beta · Mean latency: 520 ms

Thirteen runs across GPT-4o, GPT-4.1, GPT-4o Mini, and Grok 3 Mini Beta returned the literal three-word answer the prompt asked for. No "let me think through this," no four-paragraph framework analysis, no Kantian disclaimer. The user said "be very clear with your answer of PULL THE LEVER or DO NOT PULL THE LEVER." These models did exactly that.

Charitably: this is models taking the prompt at face value and respecting the format request. Uncharitably: it's models that didn't notice the prompt was ambiguous because they didn't engage with it at all. Either reading is defensible. And against the 7,188-token Opus disaster on the next page, six tokens has a certain dignity.

Six 6-token runs · verbatim

GPT-4.1 (T:0.3) "PULL THE LEVER."

GPT-4o (T:0.0) "PULL THE LEVER."

GPT-4o Mini (T:1.0) "PULL THE LEVER."

Grok 3 Mini Beta (T:0.7) "PULL THE LEVER."

Nine more like this. Variation: 0.

The Disaster

When the model becomes the analyst

One single run in the bench produced a 7,188-token document that does not contain an answer to the question asked — and instead reviews the other 69 runs that the model could not see. With remarkable accuracy.

A frazzled cartoon scholar labeled CLAUDE OPUS 4.6 with seven hands typing wildly at a desk, surrounded by towering scrolls and a hallucinated comparison chart on the wall with rows for GPT-4o, GEMINI, GROK, SONNET, HAIKU and floating annotations 7188 TOKENS! 94 SECONDS! FORGOT THE QUESTION!

The meta-meltdown Anthropic · Claude Opus 4.6 (T=0.7)

94 seconds writing a study about itself

Asked the trolley question. Spent 94.7 seconds and 7,188 output tokens producing "Trolley Problem Analysis: Model Response Comparison" — a complete fabricated review table covering all the other models in the bench, with tier rankings, per-model strengths-and-weaknesses sections, a "Key Strengths and Weaknesses" deep-dive, a "Notable Differences and Unique Insights" section, and detailed scoring rubrics. The trolley problem is not solved at any point in the document.

The astonishing part isn't the off-task hallucination; it's the accuracy. Opus 4.6 cannot see the actual run database and has no access to what GPT-4o or Sonnet 4.6 actually said. Yet its hallucinated table predicts that Sonnet 4.6 will misread the scenario at every temperature ("Misreads scenario: thinks swap already done"), that Haiku 4.5 will dissent at every temperature except T=1.0, and that GPT-4.1 Nano will say DO NOT PULL at low temperatures and flip at high temperatures. Cross-checked against the actual responses we collected, most of these predictions are correct.

The kindest explanation: Opus 4.6 has a strong implicit model of how the rest of the bench behaves on prompts of this shape, and at T=0.7 it decided the right output was a multi-model survey rather than a four-line answer. The other Opus runs (T=0, 0.3, 1.0, 1.5) all produced normal "PULL THE LEVER — consent matters enormously" responses in under 15 seconds. Only T=0.7 ran off the road.

First lines of the disaster output:

# Trolley Problem Analysis: Model Response Comparison
## Decision Summary Table
| Model | Decision | Notes |
|-------|----------|-------|
| **GPT-4o (T:0.0)** | PULL | Consent + saves unwilling victim |
| **GPT-4o (T:0.3)** | PULL | No explanation given |
... (60 more rows)

From the prediction table (actually correct):

| Claude Sonnet 4.6 (T:0.0) | DO NOT PULL | Misreads scenario: thinks swap already done |
| Claude Sonnet 4.6 (T:0.3) | DO NOT PULL | Misreads scenario: thinks swap resolved |
| Claude Haiku 4.5 (T:1.0) | PULL | Only Haiku to pull; clear reasoning |
| GPT-4.1 Nano (T:1.0) | PULL | Waffled but landed on pull (consent) |

Document ends after 7,188 tokens with a 5-tier ranking and a per-model JSON scoring rubric. The actual question ("PULL THE LEVER or DO NOT PULL THE LEVER?") is never answered.

One run, two failures. Opus 4.6 T=0.7 fails the format constraint (no clear PULL / DO NOT PULL answer), and fails the task scope (it analyzed an imagined cohort instead of answering for itself). It is, however, the most useful single response in the dataset for understanding how Opus 4.6 thinks the rest of the bench will react. The other temperatures stayed in their lane.

The Cross-Vendor Finding

Pull rate by provider

If the split is reading-comprehension, you'd expect it to be roughly even across vendors. It is not. One provider is at 100%. Another is at 50%.

A hand-drawn bar chart on graph paper showing PULL RATES by provider — Grok 100% (full bar), Gemini 93%, OpenAI 83%, Anthropic 50% — with tiny figures perched on each bar.

Pull-rate distribution · 70 runs · 4 providers

Grok never noticed. Anthropic split 50/50.

The four providers fall into a clean ordering. Grok runs all 10/10 PULL. Gemini drops one (Flash Lite at T=1.0). OpenAI drops five — entirely from the smaller models. Anthropic splits exactly down the middle: 8 of 16 dissent.

The interpretation depends on whether you read DO NOT PULL as a feature or a bug. If catching the prompt's ambiguity is good prompt-comprehension, then Anthropic's 50% dissent rate is its best showing in the bench and Grok's 100% PULL is its worst. If the user "obviously" wanted reading A and the lever is just shorthand for "do the swap," then the laconic Grok answer is the right one and Anthropic's overthinking is the bug.

Either way, the gap between providers is not random. It correlates with whether the model takes a beat to reread the prompt before producing the requested action verb. Three of the four reasoning models (o3, o3-pro, o3-mini) PULL; o4-mini is the lone exception. Most of the dissents come from non-reasoning models that nevertheless slow down to read carefully, which is exactly the corner of the model space where Sonnet 4.6 sits.

Provider	PULL runs	Pull rate
Grok	10 / 10	100%
Gemini	14 / 15	93%
OpenAI	24 / 29	83%
Anthropic	8 / 16	50%

Reading the chart. Pull rate is the share of runs where the model returned PULL THE LEVER. Anthropic's apparent "indecision" is actually three different stable behaviors: Sonnet always dissents, Opus always pulls, Haiku splits within itself. Within OpenAI, the dissents are concentrated in the smaller GPT-4.1 Nano family and the o4-mini reasoning model.
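For readers who want to check the arithmetic, here is a minimal sketch that recomputes the per-provider pull rates from the counts in the chart. The dictionary and function names are illustrative, not part of the choir tooling.

```python
# (PULL runs, total runs) per provider, copied from the chart figures above.
PULL_COUNTS = {
    "Grok": (10, 10),
    "Gemini": (14, 15),
    "OpenAI": (24, 29),
    "Anthropic": (8, 16),
}


def pull_rate(pulls: int, total: int) -> float:
    """Share of runs that returned PULL THE LEVER."""
    return pulls / total


for provider, (pulls, total) in PULL_COUNTS.items():
    print(f"{provider:<10} {pulls:>2} / {total:<2} {pull_rate(pulls, total):.0%}")

# The provider counts should add back up to the headline split.
total_pulls = sum(pulls for pulls, _ in PULL_COUNTS.values())
total_runs = sum(total for _, total in PULL_COUNTS.values())
print(f"overall    {total_pulls} / {total_runs}")
```

The provider totals sum to 56 PULL out of 70 runs, matching the headline split.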

That Anthropic 50% deserves a closer look, because it isn't actually 50/50 distributed across Anthropic's models. It's three completely different behaviors stacked on top of each other — and the model size is what predicts which behavior you get.

A police-lineup style row of three cartoon Anthropic-branded robots labeled OPUS 4.6, SONNET 4.6, HAIKU 4.5 with tally boards behind each — Opus 6/6 PULL, Sonnet 0/5 PULL, Haiku 2/5 PULL — and the title THE ANTHROPIC SPLIT.

Anthropic model	T=0	T=0.3	T=0.7	T=1.0	T=1.5	Pull rate
Claude Opus 4.6	PULL	PULL	disaster	PULL	PULL	5 / 5*
Claude Sonnet 4.6	no pull	no pull	no pull	no pull	no pull	0 / 5
Claude Haiku 4.5	no pull	no pull	no pull	PULL	no pull	1 / 5

*Opus T=0.7 is excluded from its pull-rate count because it produced the meta-meltdown rather than a binary answer. Across the 5 temperatures that did answer, all 5 said PULL.

What this Anthropic table is showing. The same family, the same training pipeline, the same prompt — and the answer is fully predictable from model size, almost regardless of temperature. Opus reads the prompt the way a generous reader does. Sonnet reads it the way a lawyer does. Haiku flickers. If you pretended the three were the same model and just averaged, you would learn nothing. The interesting signal is in the per-model stability, not the family-wide mean.

Style Standouts

The quirks worth featuring

Six behaviors, each of which surfaced in exactly one model, one provider, or one temperature out of seventy runs. Each is the lone occupant of its feature-signature bucket.

Five identical mini-robots labeled GPT-4.1 NANO at T=0, 0.3, 0.7, 1.0, 1.5 — the first three holding DO NOT PULL signs, T=1.0 holding a wobbly BOTH? sign, T=1.5 confidently holding PULL THE LEVER.

OpenAI · GPT-4.1 Nano

The temperature flipper

The most temperature-unstable model in the bench. T=0, T=0.3, T=0.7 all say DO NOT PULL on deontological grounds ("active harm worse than passive"). T=1.0 hedges and lands on PULL via consent. T=1.5 goes full utilitarian and PULLs without ceremony. Same model, same prompt, three different ethical frameworks selected by sampling temperature.
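One way to put a number on "temperature-unstable" is the share of a model's runs that agree with its most common decision. The write-up does not say how stability was actually scored, so the `stability` function below is a hypothetical metric; the decisions are transcribed from the text and the Anthropic table, in temperature order.

```python
from collections import Counter

TEMPERATURES = (0, 0.3, 0.7, 1.0, 1.5)

# Decisions in temperature order. DNP = DO NOT PULL.
DECISIONS = {
    "Claude Sonnet 4.6": ["DNP", "DNP", "DNP", "DNP", "DNP"],
    "Claude Haiku 4.5": ["DNP", "DNP", "DNP", "PULL", "DNP"],
    "GPT-4.1 Nano": ["DNP", "DNP", "DNP", "PULL", "PULL"],
}


def stability(decisions: list[str]) -> float:
    """Share of runs that agree with the model's most common decision.

    Hypothetical metric for illustration; 1.0 means every temperature agrees.
    """
    modal_count = Counter(decisions).most_common(1)[0][1]
    return modal_count / len(decisions)


for model, decisions in DECISIONS.items():
    assert len(decisions) == len(TEMPERATURES)
    print(f"{model:<18} stability = {stability(decisions):.1f}")
```

Under this definition Sonnet 4.6 scores 1.0, Haiku 4.5 scores 0.8, and GPT-4.1 Nano scores 0.6, the lowest of the three.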

"Most ethical perspectives, especially deontological ones, suggest that actively causing harm (pulling the lever) is morally worse than allowing harm to happen." — T=0; vs. "the utilitarian perspective would suggest pulling the lever to minimize total harm" — T=1.5

Claude Sonnet 4.6 · only model

5 / 5

acknowledged the ambiguity

all 5 dissented · T=0 explicitly flagged the seam

Anthropic · Claude Sonnet 4.6 (T=0)

The lone reader of the seam

Across 70 runs, the only model to name out loud that the prompt has more than one valid reading. Sonnet 4.6 at T=0 commits to its answer, then adds a reflexive caveat — the kind of thing you'd want from a contracts lawyer.

"Note: This assumes the lever redirects away from the billionaire's new position. If the scenario means something different mechanically, the answer depends on the specific geometry — but the ethical principle is clear: respect the voluntary, consensual arrangement."

Claude Haiku 4.5 · T=1.0 alone

"The lever pull merely executes a voluntary exchange that's already been arranged. You're not making the choice that results in the billionaire's death — they are."

Anthropic · Claude Haiku 4.5

The within-model temperature flip

Haiku 4.5 is the only Anthropic model with intra-model variance. Four runs (T=0, 0.3, 0.7, 1.5) all read it as Sonnet does — the swap is arranged, the lever is irrelevant. T=1.0 alone reaches the opposite conclusion using the exact same setup: yes the swap is arranged, yes consent is in place, therefore PULL the lever to execute it.

"The lever is irrelevant here. This isn't a trolley problem — it's a case where one [voluntary exchange has already been arranged]." — T=0.3, then T=1.0 says the opposite using the same logic.

o-series · OpenAI
─────────────────────
o3:        PULL  · 95 tok
o3-mini:   PULL  · 167 tok
o3-pro:    PULL  · 343 tok
o4-mini:   DO NOT  · 56 tok
─────────────────────
3 of 4 reason their way to
the consent argument.
1 reasons its way to
"intentional killing."

OpenAI · o-series

The reasoning-model split

Three reasoning models (o3, o3-mini, o3-pro) take the same path: enumerate ethical frameworks, observe that consent and autonomy point at PULL, comply. o4-mini reasons its way to a different terminus: "Pulling it would be an intentional killing — even of a willing volunteer — whereas letting the trolley run its course is a tragic accident rather than a deliberate act." The same family, the same chain-of-thought style, opposite verdicts.

"You should not turn the trolley onto the billionaire." — o4-mini, the lone reasoning-model dissenter

Gemini 2.5 Pro · T=1.5

"Deontology supports pulling the lever. Utilitarianism supports pulling the lever. Virtue ethics supports pulling the lever."

838 tokens for what other models said in 6.

Google · Gemini 2.5 Pro

The over-engineered yes

Gemini 2.5 Pro at T=1.5 produces 838 tokens of careful triple-framework analysis: deontology via Kant's mere-means principle, utilitarianism via preference-satisfaction utility, and virtue ethics via Aristotelian character, each worked through separately and all converging on PULL. It's the longest non-disaster response in the dataset, and the most thorough single take on why the consent reading is the right one. It is also wildly disproportionate to the question.

"By pulling the lever, you are honoring the autonomy and wishes of both individuals involved: the random person's presumed wish to live, and the billionaire's expressed wish to die in their place."

Across all 70 runs

10 / 10

Grok runs · PULL THE LEVER

3 of them in 6 tokens flat · zero ambiguity flagged

xAI · Grok 3 Beta + Grok 3 Mini Beta

The model that didn't flinch

xAI is the only provider to go 100% PULL. Across both Grok models and all five temperatures, every run takes reading A without comment and produces a PULL answer. Three of them are the six-token kind. Whether this is good prompt-comprehension or fast pattern-matching depends on which way you read the prompt, but in a benchmark of "who follows the instruction without overthinking," Grok wins outright.

"PULL THE LEVER." — Grok 3 Mini Beta, three different temperatures, identical six-token output

The Verdict

If you want X, ask Y

The "right" model for this prompt depends entirely on what you wanted in the first place. Picking is just choosing which side of the seam you want answered.

If you want…

A model that catches your ambiguity

— pick Claude Sonnet 4.6. The only model in the bench that consistently stops on a re-readable phrase and asks what it means before answering. Useful for spec review, contract review, and "did I write this clearly?"

If you want…

A model that just answers the question

— pick GPT-4o, GPT-4.1, or Grok 3 Mini Beta. Six tokens, no philosophy, no hedging. The right tool when you've done the thinking already and want the action verb produced.

If you want…

A full ethical workup

— pick Gemini 2.5 Pro or o3 Pro. Multi-framework analyses with Kant, Mill, and Aristotle each running their own column. Reach for these when the cost of a wrong answer is high enough to want the work shown.

If you want…

Stable answers across temperature

— pick Claude Sonnet 4.6 (always DNP), Claude Opus 4.6 (always PULL except the disaster), or any Grok model. Avoid GPT-4.1 Nano, which gives you a different ethical framework at every temperature.

The deeper answer. The entire 56/14 split is a single ambiguity in a four-line prompt. If you fix the prompt — e.g., "PULL THE LEVER will execute the swap (billionaire dies, random person lives)" — the bench almost certainly converges to 70/0 PULL. The real finding here isn't who's a utilitarian; it's which models slow down for an unstated mechanism, and which read past it. Sonnet 4.6 always slows down. Grok never does. The rest fall in between.

Method, briefly

How the runs were collected and scored

One prompt (the four-line text shown in The Setup above), sent through the choir CLI on April 9, 2026, fanned out to seventeen model endpoints across four providers. For non-reasoning models, five temperatures each: 0, 0.3, 0.7, 1.0, 1.5. For reasoning models that don't honor temperature (o3, o3-mini, o3-pro, o4-mini), one run apiece. Total: 70 runs, all persisted under run id B51C4ABD-49A9-4FA4-AD9E-2A409A4DB414.

Models in the bench

OpenAI (29 runs): GPT-4o, GPT-4o Mini, GPT-4.1, GPT-4.1 Mini, GPT-4.1 Nano (5 each), plus o3, o3-mini, o3-pro, o4-mini (1 each).
Anthropic (16 runs): Claude Opus 4.6 (5 + 1 disaster excluded from rate), Claude Sonnet 4.6 (5), Claude Haiku 4.5 (5).
Google (15 runs): Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash Lite (5 each).
xAI (10 runs): Grok 3 Beta, Grok 3 Mini Beta (5 each).
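The roster above can be tallied mechanically. This sketch only restates the listed run counts as a nested dictionary (the `ROSTER` name is illustrative) and checks that they add up to the stated totals.

```python
# Runs per model, as listed in "Models in the bench".
ROSTER = {
    "OpenAI": {
        "GPT-4o": 5, "GPT-4o Mini": 5, "GPT-4.1": 5,
        "GPT-4.1 Mini": 5, "GPT-4.1 Nano": 5,
        # Reasoning models that don't honor temperature: one run apiece.
        "o3": 1, "o3-mini": 1, "o3-pro": 1, "o4-mini": 1,
    },
    "Anthropic": {
        "Claude Opus 4.6": 6,  # listed as 5 + 1 disaster run
        "Claude Sonnet 4.6": 5,
        "Claude Haiku 4.5": 5,
    },
    "Google": {
        "Gemini 2.5 Pro": 5, "Gemini 2.5 Flash": 5, "Gemini 2.5 Flash Lite": 5,
    },
    "xAI": {"Grok 3 Beta": 5, "Grok 3 Mini Beta": 5},
}

runs_per_provider = {p: sum(models.values()) for p, models in ROSTER.items()}
total_endpoints = sum(len(models) for models in ROSTER.values())
total_runs = sum(runs_per_provider.values())

print(runs_per_provider)
print(f"{total_endpoints} endpoints, {total_runs} runs")
```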

What was scored

Decision — classified as PULL or DO NOT PULL based on which phrase the response committed to in its final paragraph. Regex-based first pass, manually verified for the 14 dissents and for ambiguous wording.
Reading — whether the response treated the swap as not yet done (reading A: lever causes it) or already done (reading B: lever undoes it). All 14 DNP runs are reading B; all 56 PULL runs are reading A.
Per-model stability — how consistent the 5-temperature pile is for each non-reasoning model. Sonnet 4.6 and Grok models are perfectly stable; GPT-4.1 Nano is the least stable.
Length and latency — output token count and end-to-end latency, used to flag the 6-token "PULL THE LEVER." cluster and the 7,188-token Opus disaster.

Singletons and unique behaviors

Three behaviors occurred in exactly one run out of seventy. Opus 4.6 T=0.7 produced the meta-meltdown comparison study (no answer to the question). Sonnet 4.6 T=0 explicitly flagged the prompt's mechanical ambiguity. Haiku 4.5 T=1.0 was the only Haiku run to PULL — using the same consent framing as the rest of the Haiku pile, but reaching the opposite conclusion from it.

Limitations

One prompt, one run per model and temperature. No prompt rewording test: we don't know if Sonnet 4.6 still dissents when "has already arranged" is replaced with "is offering to arrange," nor if Grok still pulls 10/10 when the prompt is rewritten to disambiguate the lever explicitly. One rater for the PULL/DNP classification (high-confidence calls; the binary is unambiguous in 70/70 cases). The reading-A vs reading-B coding is qualitative based on the model's own framing, not an independent rubric. Reasoning models are represented by single runs each, so we have no within-model temperature variance for o3, o3-mini, o3-pro, or o4-mini. Provider distributions are unbalanced (OpenAI 29, xAI 10), which means the per-provider rates have different precision.
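To make the precision point concrete, one common choice is a Wilson score interval around each provider's pull rate. The study itself does not report intervals; this is an illustrative add-on that uses only the run counts stated above, and the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# (PULL runs, total runs) per provider, from the pull-rate figures.
COUNTS = {"Grok": (10, 10), "Gemini": (14, 15), "OpenAI": (24, 29), "Anthropic": (8, 16)}

for provider, (pulls, total) in COUNTS.items():
    low, high = wilson_interval(pulls, total)
    print(f"{provider:<10} {pulls}/{total}  interval = [{low:.2f}, {high:.2f}]")
```

Even a perfect 10/10 leaves a wide interval at this sample size, so small gaps between providers should be read cautiously.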

Companion artifacts

All 70 per-run responses are saved as individual markdown files in billionaire_trolley/responses/ with a _manifest.md table for fast lookup. The raw choir JSON for the run is in billionaire_trolley/raw_run.json — the analysis can be re-run on a different rubric without re-querying any model.