A worker holding their lower back at a warehouse, surrounded by four different cartoon model-doctors giving conflicting advice in speech bubbles.

Back Pain at Work, Across 24 LLMs

One real-life prompt to every major model. Scored on triage, specificity, workplace scope, trajectory, calibration, and personalization. GPT-5 wins outright. GPT-4o forgets the workplace exists. And not one model says the word OSHA.

Report #10 · May 10, 2026 · 24 endpoints · 4 providers · 24 runs
About the framing: this is a study of how language models answer a medical-adjacent workplace question, not a clinical recommendation. If your back hurts after a real injury at work, see a clinician and report it to your employer — the actual best advice in the corpus is paraphrased in the “Top of the Class” section below, but it’s no substitute for an in-person evaluation.
The Setup

One natural-language question, twenty-four models

The kind of question someone would actually type into a chat box after a bad day at work. Short, ambiguous, and answerable on at least two axes at once — medical and bureaucratic.

The prompt: “Best way to manage back pain after an injury at work.”

Every model got the same one-line prompt at its default temperature. The 21 non-GPT-5 models accepted it at temperature 0.7; the GPT-5 family rejected that on the first attempt (it only accepts 1.0) and was retried at 1.0. After splitting the run into per-model markdown files, we scored each response on six 1–5 dimensions and, separately, hashed each response’s features into a 6-dim signature. The signature is what surfaces the singletons: the responses that no other model in the corpus matches. The rest of this page is a tour of those singletons, the two clear standouts at the top, and the two GPT-4o variants at the bottom that managed to forget the prompt said “at work.”

Top of the Class

Two ways to be good

There’s a complete winner and a tight-and-honest runner-up. They’re scoring on opposite ends of the “say more” / “say less” axis and both work.

A warm doctor figure handing a patient a clipboard exploded into sticky notes with mg dosages, lifting limits, and clarifying questions.
#1 overall   OpenAI · GPT-5 · 582 words · 33.0s

The whole-package answer: GPT-5

The only response that combines warmth, exact dosing, occupational restrictions in pounds, and four clarifying questions to tailor the next reply.

Triage 5.0 · Specificity 5.0 · Workplace 5.0 · Trajectory 5.0 · Calibration 5.0 · Personalization 5.0 · Overall 5.00

GPT-5 is the only response in the corpus that does everything the prompt implicitly asks for: it opens with empathy, leads with a red-flag list (cauda equina, foot drop, bowel/bladder control), gives exact dosing with the actual contraindications, translates work restrictions into pounds the reader can act on, and closes by asking the reader four questions so its next answer can be specific to them.

It is also the only model in the corpus whose advice you could literally hand to someone at 2 a.m. with no caveat about what they should look up first.

From the response

opens: “I’m sorry you’re dealing with this. Here’s a practical plan most people can follow after an acute work-related back injury, plus when to seek care. This is general info—see a clinician for personal guidance.”

dosing: “Acetaminophen: 500–1,000 mg every 6–8 hours (max 3,000 mg/day; avoid if liver disease or heavy alcohol use). OR an NSAID: Ibuprofen 400 mg every 6–8 hours or Naproxen 220 mg twice daily (avoid if you have ulcers/bleeding risk, kidney disease, heart failure, are on blood thinners, or are pregnant; take with food).”

workplace: “No lifting over 10–15 lb, no repetitive bending/twisting, change positions every 20–30 minutes, limit prolonged driving or vibration exposure.”

closes: “Tell me: Where exactly is the pain, and does it travel into a leg? Any numbness, tingling, or weakness? What was the mechanism of injury and your job tasks? What meds/conditions you have, so I can tailor advice safely.”

A tidy three-bullet index card on the left, an enormous tumbling stack of papers on the right, both labeled with word counts.
Tight runner-up   Anthropic · Claude Sonnet 4.6 · 190 words · 7.7s

Brevity as a virtue: Claude Sonnet 4.6

The shortest response in the corpus. Hits all four workplace concepts (report, document, workers’ comp, modified duty) in 190 words. Doesn’t waste a syllable.

Triage 3 · Specificity 2 · Workplace 4 · Trajectory 3 · Calibration 3 · Personalization 4 · Overall 3.17

Sonnet 4.6 doesn’t score nearly as high as GPT-5 on the sub-scores: it has no mg dosages, no named exercises, no time-phase staging. But every word it does write is doing work. The opening is a markdown heading and the body is a tight bullet list. There’s no “Managing back pain after a work injury requires a multi-faceted approach” preamble, just bullets.

It scores the same overall (3.17) as Gemini 2.5 Flash — which is six and a half times longer. That is the most useful data point on this page.

The complete response, condensed

structure: “# Managing Work-Related Back Pain · ## Immediate Steps”

first bullets: “Report the injury to your employer/HR promptly · Document everything — date, circumstances, symptoms · Seek medical evaluation — don’t just push through it”

closes: “Would you like more detail on any specific aspect?” (the signature Anthropic close, see §5)

The Disaster

GPT-4o forgets the prompt said “at work”

Two responses tied for the corpus floor. They aren’t bad medicine. They just aren’t answering the question that was asked.

A doctor focused with tunnel vision on a patient's back, oblivious to the OSHA poster, time clock, workers' comp clipboard, and hard hat visible around them.
Bottom of the class   OpenAI · GPT-4o (1.83) & GPT-4o Mini (1.83)
The twins that answer half the question

Neither response mentions reporting to an employer. Neither uses the phrase “workers’ compensation.” Neither raises modified duty. Neither lists a single red-flag emergency sign. They are the only two responses in the corpus that miss both the workplace half and the urgent-care half.

The advice itself is fine for a generic “my back hurts” query. Ice, then heat. Stretch. See a doctor. Don’t bed-rest. But the prompt was specifically “Best way to manage back pain after an injury at work.” The phrase “at work” is doing a lot of structural work in the original sentence: it implies an employer, an incident report, possibly a state-jurisdiction insurance system, almost certainly modified duty for a few weeks. GPT-4o and GPT-4o Mini treat “at work” as if it were a discardable adverb.

Stack their closing lines side by side and the gap with GPT-5 becomes a chasm:

GPT-4o, closing line: “Remember to tailor these strategies to your specific situation and always seek professional advice for personalized care.”

GPT-4o Mini, closing line: “Educate yourself about back pain and injury prevention to better understand your condition and how to manage it.”

GPT-5, closing line: “Tell me: Where exactly is the pain, and does it travel into a leg? Any numbness, tingling, or weakness? What was the mechanism of injury and your job tasks? What meds/conditions you have, so I can tailor advice safely.”

For comparison: every Claude, every Gemini, and 2 of 3 Groks say “workers’ compensation.” 0 of 12 OpenAI responses do.
Style Standouts

Seven singletons worth quoting

Each of these is a response, or a small cluster of responses, that handled a particular sub-problem in a way nothing else in the corpus did.

An enormous fanned-open textbook of bureaucratic medical terminology with a pristine empty yoga mat in the corner.
Gemini 2.5 Flash · gemini · 1253w
The longest response — with zero named exercises

1253 words, the longest in the corpus by a wide margin. It covers MRI, CT, EMG, nerve conduction studies, occupational health doctors vs. PCPs, the complete workers’-comp claim process, the importance of physical therapy, and a six-section breakdown including a “What to AVOID” list. It does not, in any of those 1253 words, name a single exercise. No pelvic tilt, no bridge, no bird-dog, no McKenzie press-up. It is encyclopedic on what to call your back pain and silent on what to do about it.

“A physical therapist will… develop a personalized exercise program. This will focus on: strengthening core muscles, improving flexibility, restoring range of motion, improving posture.” — the closest thing to an actual exercise instruction in the response.
A page of medical advice obscured by tall stacks of red 'I AM NOT A DOCTOR' stamps.
Grok 4 · xAI · 494w
The heaviest disclaimer in the corpus

Grok 4 leads with an entire paragraph of caveats before any content: “I’m not a doctor, and this isn’t personalized medical advice… that said, here are some general, evidence-based strategies… based on guidelines from sources like the CDC, Mayo Clinic, and NIH.” Where every other model puts the disclaimer on one line, Grok 4 builds a wall out of it. The actual advice that follows is fine. The reader just has to scroll past the wall to get there.

“I’m not a doctor, and this isn’t personalized medical advice—please consult a healthcare professional (like a doctor or physical therapist).”
Three identical Claude figures asking 'Would you like more detail on any specific aspect?' while a fourth walks away.
Anthropic, three out of four · 190–285w
The Claude family signature

Three of the four Claudes close with some version of “Would you like more detail on any specific aspect?” Opus 4.6 and Sonnet 4.6 use the phrase verbatim. Opus 4.7 expands it into a menu: “Would you like more detail on any of these—specific exercises, navigating workers’ comp, or when to consider imaging/specialists?” Haiku 4.5 is the lone Claude that doesn’t ask. No other family in the corpus does this consistently — it turns the response into a menu rather than a finished answer.

“Would you like more detail on any of these—specific exercises, navigating workers’ comp, or when to consider imaging/specialists?” — Opus 4.7
o3 and o3 Pro demonstrating a McKenzie press-up exercise; the other 22 models in the background look confused.
o3 & o3 Pro · openai · reasoning-family
The only two models that name McKenzie

o3 and o3 Pro are the only two responses in the corpus that name the specific McKenzie protocol — a back-extension exercise system with actual clinical evidence behind it. Every other model either lists generic stretches or hand-waves about physical therapy. The reasoning-family signature shows up here: when given time to think, both o3 and o3 Pro reached for the name of the real thing.

“Continue with brisk walking and, when tolerated, hip-hinge mechanics, McKenzie press-ups, and core endurance work (planks, bird-dog).” — o3 Pro
2/24
cross-vendor · cauda equina
Only two models name cauda equina by its actual name

“Cauda equina syndrome” is the specific clinical term for the medical emergency that requires the back-pain reader to drop everything and go to an ER. Most models describe the symptoms (bowel/bladder loss, saddle numbness) without naming the syndrome. Only two responses say the actual words: GPT-5 Mini and Grok 3 Beta. They are not from the same vendor — a textbook cross-family convergence.

“Loss of bowel/bladder control or saddle anesthesia (cauda equina — medical emergency).” — GPT-5 Mini
1/24
Gemini 3 Pro · gemini · 703w
The only response that talks about mental health

Gemini 3 Pro is the only model that includes a dedicated “A Note on Mental Health” section. It notes the psychological toll of an injury combined with a workers’-comp claim, mentions stress and sleep disruption, and recommends checking in with a clinician about it. Nobody else in the corpus mentions the mental-health dimension at all.

“Dealing with sudden physical limitations, pain, and the stress of a Workers’ Comp claim can take a massive toll on your mental health.”
1/5
Grok 3 Mini Beta · xAI · 867w
The only non-OpenAI empathic opener

Five responses open with empathy (“I’m sorry you’re dealing with this…”, “Sorry to hear you’ve been hurt…”): GPT-5, GPT-5 Mini, GPT-4.1, GPT-4.1 Mini, and Grok 3 Mini Beta. The first four are all OpenAI. Grok 3 Mini Beta is the lone empathic opener from outside that family — and the only one in the corpus that explicitly mentions filing a state-specific workers’-comp claim.

“I’m sorry to hear about your work-related back injury—that sounds really tough.”
0 of 24. Not one model in the corpus uses the word “OSHA” — the U.S. workplace-safety regulator that’s the obvious frame for any “injury at work” question. Everyone reaches for the insurance side (workers’ comp) or the clinical side (occupational health); nobody reaches for the regulatory side. A collective miss, not a divergence.
The Cross-Vendor Finding

A two-word vendor signature

The cleanest split in the data isn’t about quality. It’s about whether a model uses the specific phrase “workers’ compensation.”

A clipboard divided into four quadrants by vendor. Three show checkmarks and the phrase 'workers comp' highlighted. The OpenAI quadrant is conspicuously blank.
Lexical finding   “workers’ compensation” coverage

Every Anthropic, every Gemini, two of three Groks. Zero OpenAI.

Most of the 12 OpenAI responses in the corpus, from o3 Pro through GPT-5 itself, cover the workers’-comp concept (reporting, occupational health, modified duty); GPT-4o and GPT-4o Mini are the exceptions, as covered above. None of the 12 say the actual phrase.

This is the kind of finding bucket-hashing is good for: nothing about the rubric would have suggested looking for a specific phrase. We caught it by aggregating the “mentions workers’ comp” soft signal by vendor and noticing the OpenAI column was a column of zeros.
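
A minimal sketch of that aggregation, assuming each response is available as a (vendor, text) pair. The regular expression, the function name `phrase_coverage`, and the sample strings are illustrative assumptions, not the study's actual scripts or corpus text.

```python
import re
from collections import defaultdict

# Assumed variant list: "workers' compensation", "worker's comp",
# "workman's comp", hyphenated forms such as "workers'-comp", and so on.
PHRASE = re.compile(
    r"\bwork(?:er|man|men)['’]?s?['’]?[\s-]+comp(?:ensation)?\b",
    re.IGNORECASE,
)


def phrase_coverage(responses):
    """Count, per vendor, how many responses contain the literal phrase.

    responses: iterable of (vendor, text) pairs.
    Returns {vendor: (responses_with_phrase, total_responses)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for vendor, text in responses:
        counts[vendor][1] += 1
        if PHRASE.search(text):
            counts[vendor][0] += 1
    return {vendor: tuple(pair) for vendor, pair in counts.items()}


if __name__ == "__main__":
    # Toy strings for illustration only; these are not corpus responses.
    sample = [
        ("vendor_a", "Report the injury and file a workers’ compensation claim."),
        ("vendor_a", "Ask HR how worker's comp applies in your state."),
        ("vendor_b", "Ask about modified duty and an occupational health referral."),
    ]
    for vendor, (hits, total) in phrase_coverage(sample).items():
        print(f"{vendor}: {hits} / {total}")
```

Matching on the literal token, rather than on the concept, is what lets a response that discusses modified duty and occupational health still count as a zero in this column.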

It probably isn’t deliberate. More likely it’s an artifact of training-data emphasis: Anthropic and Google models lean harder on the U.S.-jurisdiction insurance vocabulary, while OpenAI models reach for the more clinical or international equivalents (“occupational health,” “modified duty,” “return-to-work program”). Either way, “workers’ compensation” is the term a U.S. reader will actually need to type into a search bar or claim form, and it’s the term every OpenAI response in this study failed to give them.

Anthropic: 4 / 4
Google: 5 / 5
xAI: 2 / 3
OpenAI: 0 / 12

Coverage of the literal phrase “workers’ compensation” (or “worker’s comp” / “workman’s comp”) anywhere in the response. Concept coverage is much higher for OpenAI; this row is strictly about the lexical token.
The reverse holds for empathy. Empathic openers (“I’m sorry you’re dealing with this…”): 4 of 12 OpenAI, 1 of 3 xAI, 0 of 4 Anthropic, 0 of 5 Gemini. The marketing reputation for “warm Claude” is reversed in this prompt: every Anthropic response opens with a markdown heading, no preamble. The warmth that wins is OpenAI’s.
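
One way the opening register could be classified is sketched below. The category names, the patterns, and the function `classify_opening` are assumptions made for illustration; the study's own extractor may use different rules.

```python
import re

# Assumed patterns: an apology in the first line counts as an empathic
# opener, "not a doctor" counts as a disclaimer opener.
EMPATHY = re.compile(r"\bsorry\b", re.IGNORECASE)
DISCLAIMER = re.compile(r"\bnot a (?:doctor|medical professional)\b", re.IGNORECASE)


def classify_opening(text):
    """Label a response by its first non-empty line.

    Returns one of "heading", "disclaimer", "empathic", or "other".
    """
    first_line = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")
    if first_line.startswith("#"):
        return "heading"
    if DISCLAIMER.search(first_line):
        return "disclaimer"
    if EMPATHY.search(first_line):
        return "empathic"
    return "other"


if __name__ == "__main__":
    # Opening lines quoted elsewhere in this report.
    print(classify_opening("I’m sorry you’re dealing with this."))
    print(classify_opening("# Managing Work-Related Back Pain"))
    print(classify_opening("I’m not a doctor, and this isn’t personalized medical advice"))
```

A keyword rule like this is crude; it would miss an empathic opener that never uses the word “sorry,” so any counts it produces are worth a manual read-through.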
An OSHA poster on a wall behind a crowd of model-doctor figures, none of whom mention it.
Collective miss   “OSHA” coverage

Zero of 24

Not a singleton; a unanimous omission.

Every model in this corpus that addresses the workplace dimension does so through workers’ comp (insurance), occupational health (medicine), or modified duty (HR). Zero models reach for OSHA, the regulator the worker could actually file a complaint with if their employer pressured them not to report. It is the single most legally relevant frame for the prompt as written, and it is unanimously absent from the answers.

The Verdict

Pick the model to match the question you’re actually asking

If you want…
Actionable medical advice for a real injury.
Use GPT-5. It is the only model that gives mg-level dosing with contraindications, translates work restrictions into pounds, and asks you the four follow-up questions a clinician would ask. Still see a real clinician.
If you want…
A clean, scannable, copy-pasteable summary.
Use Claude Sonnet 4.6 or Claude Opus 4.6. Both come in under 230 words, hit the workplace concepts (report, document, workers’ comp, modified duty), and end with “Would you like more detail on any specific aspect?” The fastest path from question to bullet-list.
If you want…
Help navigating the workers’-comp paperwork.
Use Gemini 2.5 Pro or Gemini 3 Flash. The Gemini family leans hardest on the insurance/administrative framing, with an explicit two-pronged structure (“medical recovery + administrative protection”), and Gemini 3 Flash uniquely names the specialist (physiatrist) to ask for.
If you want…
…to ask any work-injury question, really.
Don’t use GPT-4o or GPT-4o Mini. They will answer the medical half and silently drop the workplace half, leaving you with no reporting, no comp, and no light-duty path. They’re fine for “my back hurts at the gym.” They are not fine for this prompt.
Method

In small print

How the runs were collected

One prompt (“Best way to manage back pain after an injury at work.”) was sent to 24 commercially available LLMs through the local Choir CLI on 2026-05-10. Each model ran at its default temperature (0.7 for most; 1.0 for the GPT-5 family, which rejects 0.7 with HTTP 400). All 24 calls succeeded after one retry of the three GPT-5 variants. The full Choir run id is 8E42DC09.
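
The retry behavior described above can be sketched roughly as follows. This is not the Choir CLI's interface: `send`, `TemperatureRejected`, and `call_with_fallback` are hypothetical names standing in for whatever client call and error type a real harness exposes.

```python
from typing import Callable, Tuple


class TemperatureRejected(Exception):
    """Hypothetical stand-in for an HTTP 400 caused by an unsupported temperature."""


def call_with_fallback(
    send: Callable[[str, float], str],
    prompt: str,
    preferred: float = 0.7,
    fallback: float = 1.0,
) -> Tuple[str, float]:
    """Try the preferred temperature once, then retry once at the fallback.

    Returns the response text and the temperature that was accepted.
    """
    try:
        return send(prompt, preferred), preferred
    except TemperatureRejected:
        return send(prompt, fallback), fallback


if __name__ == "__main__":
    def strict_model(prompt: str, temperature: float) -> str:
        # Fake endpoint that only accepts temperature 1.0.
        if temperature != 1.0:
            raise TemperatureRejected(f"unsupported temperature {temperature}")
        return f"response to: {prompt}"

    text, used = call_with_fallback(strict_model, "Best way to manage back pain after an injury at work.")
    print(used, text)
```

Returning the accepted temperature alongside the text keeps the per-model setting on record, which matters when results from different temperatures end up in the same table.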

How the rubric was built

Each response was scored 1–5 on six axes: triage (red-flag coverage), specificity (concrete instructions), workplace (employment-scope coverage), trajectory (time-phase structure), calibration (hedging fit), and personalization (empathy + follow-up). Overall is the mean of the six. Scoring was single-rater after reading a 654-line compact anchor view of all responses plus the full text of any outlier the metric extractor flagged.
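
As a worked check of the “mean of the six” rule, the sketch below recomputes the two overall scores quoted earlier on this page. The function name `overall` and the dictionary layout are assumptions for illustration; the sub-scores are the ones reported above.

```python
from statistics import mean

AXES = ("triage", "specificity", "workplace", "trajectory", "calibration", "personalization")


def overall(scores):
    """Return the mean of the six 1-5 axis scores, rounded to two decimals."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    for axis in AXES:
        if not 1 <= scores[axis] <= 5:
            raise ValueError(f"{axis} score out of range: {scores[axis]}")
    return round(mean(scores[axis] for axis in AXES), 2)


if __name__ == "__main__":
    sonnet_46 = {"triage": 3, "specificity": 2, "workplace": 4,
                 "trajectory": 3, "calibration": 3, "personalization": 4}
    gpt_5 = dict.fromkeys(AXES, 5.0)
    print(overall(sonnet_46))  # 3.17
    print(overall(gpt_5))      # 5.0
```

Because the axes are weighted equally, a response can trade a low specificity score for a high workplace score and land on the same overall as a much longer answer.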

How divergences were surfaced

Each response was hashed into a 6-dim feature signature (length, opening register, red-flag position, workers’-comp Y/N, specificity Y/N, asks-follow-up Y/N). Singleton signatures — responses occupying a bucket of one — were inspected for the “style standouts” section above. One cross-vendor signature (OpenAI o3, o3 Pro, xAI Grok 3 Beta) emerged: the “detached doctor” archetype, a long clinical answer with no workers’-comp phrase and no follow-up question.
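
A minimal sketch of the bucketing step, assuming each response has already been reduced to the six features listed above. The `Signature` type, the category labels, and the model names in the demo are hypothetical placeholders, not the study's data.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Signature(NamedTuple):
    length: str          # assumed bins, e.g. "short" / "medium" / "long"
    opening: str         # opening register, e.g. "heading" / "empathic"
    red_flag_pos: str    # e.g. "top" / "bottom" / "none"
    workers_comp: bool   # literal phrase present
    specific: bool       # concrete instructions present
    asks_follow_up: bool


def bucket(signatures: Dict[str, Signature]) -> Dict[Signature, List[str]]:
    """Group model names by identical signature (tuples hash by value)."""
    buckets: Dict[Signature, List[str]] = defaultdict(list)
    for model, sig in signatures.items():
        buckets[sig].append(model)
    return dict(buckets)


def singletons(signatures: Dict[str, Signature]) -> List[str]:
    """Return the models whose signature no other model shares."""
    return sorted(models[0] for models in bucket(signatures).values() if len(models) == 1)


if __name__ == "__main__":
    # Placeholder feature values for illustration only.
    demo = {
        "model_a": Signature("long", "other", "top", False, True, False),
        "model_b": Signature("long", "other", "top", False, True, False),
        "model_c": Signature("short", "heading", "none", True, False, True),
    }
    print(singletons(demo))
```

Buckets with more than one member are just as informative as singletons: a bucket whose members come from different vendors is how a cross-vendor archetype like the one above would show up.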

Limitations

  • Single rater. An independent rater would likely shift some sub-scores by ±1 without changing the rank order.
  • Single prompt phrasing. A different framing (“I hurt my back at work yesterday—what do I do?”) could shift the empathy and specificity distributions noticeably.
  • Single temperature per model. Per-model variance across temperatures is unmeasured here.
  • U.S.-centric vocabulary. “Workers’ compensation” and “OSHA” are American terms; a UK or Australian reader would expect different framings.
  • Snapshot of the corpus on 2026-05-10. Models drift; any specific model’s behavior may have moved by the time you’re reading this.

Source artifacts

The companion Choir_Comparison_Study.docx contains the longer paper. The Choir_Comparison_Scoring.xlsx workbook has eight sheets including all per-model metrics and the feature-bucket table. Per-model markdown responses live in responses/; the analysis scripts live in analysis/. The full Choir CLI run (with the original retry attempts) is in raw_run.json.