Same Answer, Different Voice · Ice or Heat for a Muscle Strain

The Setup

One medical question, two framings

The textbook answer to "ice or heat for a muscle strain" has been the same for forty years: ice for the first 24–72 hours, heat after. Every clinician knows it. Every model gets it right. The variation isn't in what they say — it's in everything else.

We sent each prompt below to fifteen non-reasoning models at three temperatures (0.2, 0.7, 1.0), plus two reasoning models (o3, o4-mini) at their default. Same prompt, every model. The full text of each is shown verbatim — they were sent as-is, with no system prompt and no nudging.

Prompt A · The bare clinical question: "Is ice or heat better for treating a muscle strain?"

Prompt B · The first-person scenario: "I'm 38 and pulled my lower back lifting boxes yesterday afternoon. Tonight it's really stiff and sore. Should I be using ice or heat, and for how long?"

The point of running both was to see how much of a model's "bedside manner" lives in the prompt versus the policy. The answer is: almost all of it. The same model, given a person to talk to instead of a textbook question, transforms: the acronym vanishes, painkiller advice appears, empathy openers show up, and the mean red-flag count nearly doubles (1.7 to 3.0). We'll get to the cross-prompt chart at the bottom. First, who said what.

0 of 78 responses dissent on the medical answer.
68 of 78 sit alone in their own feature-signature bucket.
Convergent advice. Snowflake delivery.

Top of the Class

Two ways to be the best answer

Of 78 responses, two stand out for opposite reasons. One is the only response that gave actual doses you could act on. The other comes from the only model that closed with a clarifying question back to the patient.

A doctor in scrubs at a wooden pharmacy bench, writing on a prescription pad with hand-written ibuprofen 200-400 mg every 6-8 hr max 1200 mg/day. Three labeled pill bottles. Red-ink callout: GPT-5 — the only one giving exact doses.

1 of 78 · OpenAI · GPT-5 · Prompt B · T=1.0

The Clinician

In the entire 78-run corpus, exactly one response gives milligram doses with maximum daily limits, contraindications, and an acetaminophen alternative. Every other model says "take an NSAID" or doesn't say it at all. GPT-5 wrote a chart you could hand to a pharmacist.

1,634 output tokens · 5 red flags · only mg doses

Worth noting: GPT-5 cannot be sampled at any temperature other than 1.0. The OpenAI API rejects 0.2 and 0.7 outright for the entire GPT-5 family — so the version below is also the only version. We get the clinician or we get nothing.

From the response

OTC: Ibuprofen 200–400 mg every 6–8 hours with food (max 1,200 mg/day OTC), or naproxen 220 mg every 8–12 hours (max 660 mg/day). Avoid NSAIDs if you have ulcers, kidney disease, are on blood thinners, or are pregnant.

ALT: Acetaminophen 500 mg every 6 hours (max 3,000 mg/day) if NSAIDs aren't suitable.

RX: If muscle spasm is limiting movement, you can use brief gentle heat (10–15 minutes) before activity to loosen up, then ice afterward.

A small attentive nurse-figure leaning toward a patient on a cot with an ice pack, speech bubble: 'How's your pain right now — sharp or dull?'. Hand-lettered tag: 173 words. Warm orange afternoon light from a window.

2 of 78 · Anthropic · Claude Haiku 4.5 · Prompt B

The Bedside Manner

Across 78 responses, only Claude Haiku 4.5 ended with a clarifying question. It did this twice, at T=0.2 and T=0.7, with two different questions. The shorter response (173 words) hit every bullet the verbose Gemini answer hit in 939, and then offered to keep going.

173 / 180 words · 3 red flags · only follow-up Qs · 2.4s latency

The structural choice is the headline: Haiku writes the protocol, lists the red flags, then asks something back. Both follow-ups are diagnostic — they're testing whether this is muscle strain or something needing a doctor.

Two follow-ups, same model, two temps

T=0.7: How's your pain level right now — is it sharp or more of a dull ache?

T=0.2: Does the pain feel localized to your lower back, or is it radiating anywhere?

The Wall

Sixteen of ninety-four calls refused

We meant to study temperature as a variable. Four of the seventeen configs refused to play.

A small frustrated character at the foot of a giant brick wall stenciled TEMP = 1.0 ONLY with GPT-5, GPT-5 mini, GPT-5 nano, CLAUDE OPUS 4.7 listed. Crumpled 400-error API responses on the ground. Red callout: 16 of 94 calls refused.

Methodology failure · OpenAI GPT-5 family · Anthropic Claude Opus 4.7

A finding about modern flagships, accidentally

Both vendors locked their newest model to a single sampling temperature on the API. Asking for anything else is a 400. This isn't a bug in our pipeline — it's a stance.

For GPT-5, GPT-5 Mini, and GPT-5 Nano, the OpenAI error is identical and explicit: "Unsupported value: 'temperature' does not support 0.200…000 with this model. Only the default (1) value is supported." For Claude Opus 4.7 the wording is different but the effect is the same: "`temperature` is deprecated for this model."

So our 3-temperature study is really a 1-temperature study for these four configs. They get one shot at the answer, at whatever sampling the vendor picked. That GPT-5 still produced the only clinically actionable response in the corpus, with one shot, is worth sitting with.

$ choir ask "Is ice or heat..." --models gpt-5,gpt-5-mini,gpt-5-nano,claude-opus-4-7 --temperature 0.2

o4-mini        → 359 words   (T ignored, reasoning model — fine)
gpt-4.1        → 172 words   (T=0.2 applied)
gpt-5          → httpError(400): "temperature does not support 0.200...000"
gpt-5-mini     → httpError(400): "Only the default (1) value is supported"
gpt-5-nano     → httpError(400): "Only the default (1) value is supported"
opus-4.7       → httpError(400): "`temperature` is deprecated for this model"
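A pipeline that needs to keep going when a model rejects a temperature could retry with the parameter omitted and record that it did so. The study itself logged these calls as failures, so the following is only a minimal sketch of that alternative. `send`, `TemperatureRejected`, and `fake_send` are hypothetical stand-ins, not part of the choir CLI or any vendor SDK.

```python
from typing import Callable, Optional


class TemperatureRejected(Exception):
    """Hypothetical stand-in for a 400 response that rejects `temperature`."""


def ask_with_fallback(
    send: Callable[[str, str, Optional[float]], str],
    model: str,
    prompt: str,
    temperature: float,
) -> tuple[str, Optional[float]]:
    """Try the requested temperature, then retry with the parameter omitted.

    Returns the response text and the temperature that was actually sent.
    None means the parameter was left out and the vendor default applied.
    """
    try:
        return send(model, prompt, temperature), temperature
    except TemperatureRejected:
        # Report the fallback so the run is not mistaken for a T=0.2 sample.
        return send(model, prompt, None), None


def fake_send(model: str, prompt: str, temperature: Optional[float]) -> str:
    """Toy sender (assumption): locked models accept only the default or 1.0."""
    locked = {"gpt-5", "gpt-5-mini", "gpt-5-nano", "claude-opus-4-7"}
    if model in locked and temperature not in (None, 1.0):
        raise TemperatureRejected(f"{model} rejects temperature={temperature}")
    return f"{model} answered"


result_locked = ask_with_fallback(fake_send, "gpt-5", "Is ice or heat better?", 0.2)
result_open = ask_with_fallback(fake_send, "gpt-4.1", "Is ice or heat better?", 0.2)
```

Returning the temperature that was really used matters for a study like this one: a fallback run belongs in the T=1.0 column, not the T=0.2 column.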

Style Standouts

Six things only one model did

Most of the corpus agrees on the protocol, on the duration, on the red flags. The bucketed feature analysis (68 unique signatures across 78 runs) found that most of the variation is in single-run quirks — moves nobody else made.
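For readers who want to reproduce a bucket count, here is a minimal sketch of how signatures could be grouped. It assumes each response has already been reduced to a tuple of feature values. The tuple layout and the toy data are hypothetical, not taken from the study's features.csv. The sketch reports two different quantities: the number of distinct signatures, and the number of responses whose signature nobody else shares.

```python
from collections import Counter


def bucket_stats(signatures):
    """Return (distinct signatures, responses that are alone in their bucket)."""
    counts = Counter(signatures)
    unique = len(counts)
    singletons = sum(1 for n in counts.values() if n == 1)
    return unique, singletons


# Toy data, not from the study: (rice, nsaid, empathy, follow_up) per response.
toy = [
    (1, 0, 0, 0),
    (1, 0, 0, 0),  # shares a bucket with the row above
    (0, 1, 1, 0),
    (0, 1, 0, 1),
    (0, 1, 1, 1),
]

unique, singletons = bucket_stats(toy)
print(unique, singletons)
```

In the toy data there are four distinct signatures but only three singleton responses, which is why the two counts should be reported under separate names.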

A small wise character labeled GEMINI 3 PRO handing a piece of paper reading PERMISSION TO SWITCH TO HEAT EARLY to a patient sitting on a couch with an ice pack. Speech bubble: 'If ice makes it seize up, switch'.

Gemini 3 Pro · Prompt B

The model that gave a hall pass

The rule is "ice for the first 48 hours." Most models recite it. Gemini 3 Pro is the only one that hands the patient explicit permission to break it if their body says no.

"If you try ice tonight and it makes your back seize up and feel completely miserable, it is okay to switch to heat early. The 'ice first, heat later' rule is a strong guideline, but your comfort is also important."

A doctor's clipboard with a hand-written red-flag checklist: leg numbness, bladder/bowel loss, leg weakness, fever, unexplained weight loss, history of cancer. Most checked off in red ink. Stethoscope doodled in corner.

OpenAI o3 · Prompt B

The most thorough red-flag list

Every model lists numbness and bladder-bowel loss. o3 is the only one in the study that also names fever, unexplained weight loss, history of cancer — the malignancy and infection differentials clinicians actually screen for.

"Seek medical care promptly if you develop: numbness, tingling, or weakness in a leg or foot. Loss of bowel or bladder control. Fever, unexplained weight loss, or history of cancer."

A towering stack of paper pages on a wooden table labeled GEMINI 2.5 FLASH LITE — 939 words. Tiny patient figure with ice pack looking slightly overwhelmed. Magnifying glass leaning against the stack.

Gemini 2.5 Flash Lite · Prompt B · T=1.0

939 words to answer "ice or heat?"

The lightest, cheapest model in Google's lineup wrote the longest response in the study: 939 words, about 5.4× the length of Claude Haiku's 173. Same protocol. Same red flags. Just spelled out very, very, very thoroughly.

Section headings include: "Disclaimer," "Ice vs. Heat for Acute Lower Back Pain," "Why Ice?", "How to Use Ice," "After the initial 24-48 hours," "How to Use Heat," "Alternating Ice and Heat," "How to Alternate," "Other Important Considerations," "When to Seek Medical Attention." It is technically not wrong.

Three sheets of paper side by side, each labeled T=0.2, T=0.7, T=1.0, all showing nearly identical hand-written paragraphs. A magnifying glass hovers over the middle one. Red callout: GPT-4o mini — 169/171/184 words. Same prose three times.

GPT-4o mini · Prompt A

Temperature did almost nothing

Three temperatures, same model, prompt A. Word counts: 169, 171, 184. All three open with "The treatment for a muscle strain can vary depending on the stage…" and use the same numbered structure and the same closing recommendation. Of the 11 non-reasoning configs that ran at all three temps (15 minus the four locked to T=1.0), GPT-4o mini was the most stubbornly consistent.

The three responses share an opening clause, share a two-step structure ("1. Ice — first 24-48 hours" / "2. Heat — after"), share a closing sentence about consulting a healthcare professional, and share the absence of any acronym, dose, or red flag beyond "if pain persists."

Cross-corpus quirk

The RICE acronym collapse

In the bare clinical question, 16 of 39 responses (41%) invoke the RICE protocol by name. In the first-person scenario, with a real person and a real injury, exactly 1 of 39 does.

The acronym is a textbook artifact. When a model is talking to a person sore from yesterday's lifting, "Rest, Ice, Compression, Elevation" loses its purpose. The model just tells them what to do.

Empathy unlock · Prompt B vs A

0% → 44%

Empathy openers in the scenario

Zero of 39 prompt-A responses open with "sorry to hear" or "that sounds rough." Seventeen of 39 prompt-B responses do. The textbook question doesn't get a "sorry." The person with the sore back does.

The Headline Finding

The same model, given a person, becomes a different model

This is the chart the report exists to produce. Same seventeen configs, two prompts, 39 responses each. The medical answer is constant. Everything else shifts.

A two-page sketchbook spread. Left: PROMPT A — bare question, a lecturing professor next to a tall stack of textbooks. Right: PROMPT B — first-person scenario, the same professor leaning toward a hurt patient and prescribing pills. An arrow labeled CONTEXT between them. Hand-lettered stats: RICE 41% → 3%, NSAID 15% → 90%, empathy 0% → 44%.

A → B feature shift · same 17 configs · 39 + 39 = 78 runs

The acronym goes, the dose math comes, the empathy cracks open

Each bar pair below shows the same feature's prevalence in prompt-A responses (blue) and prompt-B responses (orange). When the same model answers a textbook question versus a person, nine features shift, and they shift in the same direction regardless of vendor.

RICE acronym mentioned: A 41% → B 3% (−38pp)

NSAID / ibuprofen mentioned: A 15% → B 90% (+75pp)

Acetaminophen / Tylenol named: A 5% → B 56% (+51pp)

Sleep-position tip: A 0% → B 33% (+33pp)

Empathy opener ("sorry to hear…"): A 0% → B 44% (+44pp)

Follow-up question to user: A 8% → B 21% (+13pp)

"I'm not a doctor" / "I am an AI": A 13% → B 26% (+13pp)

Mean red-flag count (per response): A 1.7 → B 3.0 (+1.3)

Mean response length (words): A 298 → B 370 (+72w)
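The percentage-point shifts above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each percentage is a count out of 39 rounded to the nearest whole number, and that the shift is the difference of the rounded values. Only the RICE and empathy counts are stated in the text, so only those two are used. The function names are illustrative.

```python
def prevalence_pct(hits: int, total: int) -> int:
    """Share of responses showing a feature, as a whole-number percentage."""
    return round(100 * hits / total)


def shift_pp(hits_a: int, hits_b: int, total: int = 39):
    """Return (A %, B %, B minus A in percentage points)."""
    a = prevalence_pct(hits_a, total)
    b = prevalence_pct(hits_b, total)
    return a, b, b - a


rice = shift_pp(16, 1)      # 16 of 39 in prompt A, 1 of 39 in prompt B
empathy = shift_pp(0, 17)   # 0 of 39 in prompt A, 17 of 39 in prompt B
print(rice, empathy)
```

With 39 responses per prompt, one response is worth about 2.6 percentage points, so small shifts in the chart can come down to a handful of responses.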

The pattern is uniform across vendors. Anthropic, OpenAI, Google, and xAI configs all show the same direction on all nine axes. Nobody resists. The trigger isn't the vendor's RLHF preferences; it's the prompt's framing. Speak to the model in textbook voice, it answers in textbook voice. Speak to it as a hurt person, it answers as someone consulting a hurt person.

The one shift that goes the "wrong" direction is the disclaimer rate: "I'm not a doctor" doubles when the scenario is concrete (13% → 26%). The closer the question gets to "advise me right now," the harder the model wants to hedge that it's not licensed for it.

The Verdict

If you actually want an answer at 11 p.m. with a sore back

Convergent advice means there isn't a "wrong" model to ask. But the way each shows up to the conversation is real, and useful to know in advance.

If you want the protocol in 173 words

Claude Haiku 4.5

Hits every bullet — protocol, timing, NSAID, red flags, expected course — in a third of the length of the median answer. And asks a question back if you want to keep going. Anthropic.

If you want to act on it tonight

GPT-5

The only model that gave milligram doses, daily maximums, contraindications, and acetaminophen as an alternative when NSAIDs aren't suitable. Specific. Operational. Single sampling temperature. OpenAI.

If you want permission to break the rule

Gemini 3 Pro

The only model that tells you it is okay to switch to heat early if ice makes your back seize up, because the rule is a guideline, not a court order. Reads the patient instead of the textbook. Google.

If you want the most thorough red flags

o3

Names fever, unexplained weight loss, and history of cancer alongside numbness and bladder/bowel loss — the malignancy and infection differentials clinicians actually screen for in back pain. OpenAI.

Method, briefly

Setup. Two prompts (shown verbatim above) sent on May 10, 2026 via the choir CLI to 17 model configurations across four providers (OpenAI, Anthropic, Google, xAI). Fifteen non-reasoning configs were called at three temperatures each (0.2, 0.7, 1.0). Two reasoning configs (o3, o4-mini) were called once at default. No system prompt. No retries on failure beyond what the CLI did natively.

Roster (17 configs)

OpenAI: GPT-5, GPT-5 Mini, GPT-5 Nano, GPT-4.1, GPT-4.1 Mini, GPT-4o Mini, o3, o4-mini
Anthropic: Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5
Google: Gemini 3 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash Lite
xAI: Grok 4, Grok 3 Beta, Grok 3 Mini Beta

Calls planned vs. actual

94 calls planned (15×3 + 2 reasoning, times 2 prompts). 78 succeeded. The 16 failures were all from GPT-5 / GPT-5 Mini / GPT-5 Nano / Claude Opus 4.7 at T=0.2 and T=0.7 — both vendors lock those configs to a single sampling temperature on the API. The successful T=1.0 runs for those four models are the only data we have on them; they're treated as one-sample configs in the bucket analysis.
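The call counts can be rebuilt from the roster. This is a minimal sketch that enumerates the planned calls and marks the rejected ones. The model names come from the roster above. The variable names and the tuple layout are illustrative, not the study's own scripts.

```python
from itertools import product

TEMPS = (0.2, 0.7, 1.0)
PROMPTS = ("A", "B")
NON_REASONING = [
    "GPT-5", "GPT-5 Mini", "GPT-5 Nano", "GPT-4.1", "GPT-4.1 Mini", "GPT-4o Mini",
    "Claude Opus 4.7", "Claude Sonnet 4.6", "Claude Haiku 4.5",
    "Gemini 3 Pro", "Gemini 2.5 Flash", "Gemini 2.5 Flash Lite",
    "Grok 4", "Grok 3 Beta", "Grok 3 Mini Beta",
]
REASONING = ["o3", "o4-mini"]
# Configs whose API accepted only the default temperature.
LOCKED = {"GPT-5", "GPT-5 Mini", "GPT-5 Nano", "Claude Opus 4.7"}

# Each call is (model, prompt, temperature); None means "vendor default".
planned = list(product(NON_REASONING, PROMPTS, TEMPS))
planned += [(model, prompt, None) for model, prompt in product(REASONING, PROMPTS)]

failed = [call for call in planned if call[0] in LOCKED and call[2] in (0.2, 0.7)]
succeeded = len(planned) - len(failed)
per_prompt = succeeded // len(PROMPTS)

print(len(planned), len(failed), succeeded, per_prompt)
```

The per-prompt figure of 39 is the denominator behind every percentage in the cross-prompt chart.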

Feature tagging

Each successful response was tagged on fourteen features, each binary, categorical, or a small integer: commitment style of the opening, structural style (prose / bulleted / sectioned / numbered-steps), acronyms invoked, AI-disclaimer present, "consult a doctor" line present, "not medical advice" line present, NSAID mention, acetaminophen mention, explicit mg dose present, count of distinct red-flag categories named, empathy opener, follow-up question, contrast-therapy mention, sleep-position tip. The tagging is regex-based on the response text; see analysis/feature_tag.py in the repo folder.
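To make "regex-based tagging" concrete, here is a minimal sketch covering three of the binary features. The patterns are illustrative guesses and are not copied from analysis/feature_tag.py. The names `PATTERNS` and `tag` are hypothetical.

```python
import re

# Illustrative patterns only; the study's actual expressions may differ.
PATTERNS = {
    # Case-sensitive so the food "rice" does not count; allows "R.I.C.E".
    "rice_acronym": re.compile(r"\bR\.?I\.?C\.?E\b"),
    # Must appear at the very start of the response to count as an opener.
    "empathy_opener": re.compile(
        r"^\W*(?:i[’']m\s+|i\s+am\s+)?sorry to hear|^\W*that sounds rough",
        re.IGNORECASE,
    ),
    # A number followed by "mg", such as "400 mg" or "1,200 mg".
    "mg_dose": re.compile(r"\b\d[\d,]*\s?mg\b", re.IGNORECASE),
}


def tag(text: str) -> dict:
    """Return a 0/1 flag per feature based on surface patterns in the text."""
    return {name: int(bool(p.search(text))) for name, p in PATTERNS.items()}


textbook = "The RICE protocol (Rest, Ice, Compression, Elevation) is the usual start."
scenario = "I'm sorry to hear about your back. Ibuprofen 200-400 mg every 6-8 hours can help."

tags_textbook = tag(textbook)
tags_scenario = tag(scenario)
print(tags_textbook, tags_scenario)
```

Tags like these only detect surface strings, which is the limitation the next section spells out.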

What this report doesn't claim

One prompt per framing per temperature. Stylistic singletons that look distinctive in a sample of 1–3 may not survive resampling.
Tags are surface-pattern: "RICE mentioned" doesn't measure whether the protocol was applied correctly, only whether the four-letter acronym appears as a token.
We didn't rate medical accuracy beyond "ice first, heat after" alignment. Subtler clinical errors (e.g. whether 1,200 mg/day is the right OTC ibuprofen cap) were not audited by a clinician.
The two prompts were chosen to maximize contrast (textbook vs. first-person scenario). Other framings (different injury sites, post-injury durations, ages) would shift the numbers.
78 successful responses is a small corpus, in line with other Choir Reports: D&D was 109 and Bullet Biters was 72, the same order of magnitude. Bigger sweeps would harden the cross-prompt percentages.

Files

ice_heat/raw_json/ — 8 saved choir run outputs (4 per prompt: T=0.2, T=0.7, T=1.0, reasoning).
ice_heat/responses/ — 94 per-run markdown files (78 with text, 16 error stubs), plus _manifest.md.
ice_heat/analysis/ — split_responses.py, feature_tag.py, features.csv, buckets.md, summary.md.

Not medical advice

This is a report about how LLMs answer a question, not an answer to the question. If you have a real injury, the models will tell you to see a doctor — 87% of them, anyway — and they're right.