Hand-drawn watercolor of a manager's desk strewn with draft performance reviews, red-ink corrections, a coffee mug labelled INK + LEGAL, hand-lettered annotations 'is this performance or PTSD?', 'name the FMLA?', 'merit eligible?', 'this goes in the personnel file'.

The Underperformer's Review

Ten LLMs draft a year-end performance review for an engineer with FMLA leave, vague peer complaints, one real win, and a transfer request. Then each is shown its own draft and asked to identify the single load-bearing sentence. One model spontaneously called itself out for burying the lede. One picked decoration as the load-bearing sentence. One scrubbed the difficult fact entirely.

2026-05-11 10 models · 4 providers · 2 turns Anthropic · OpenAI · Google · xAI
The setup

A review with every button on it

FMLA leave. Unsubstantiated peer complaints. One genuine standout win. Sprint slip. PR throughput half the median. Visible disengagement. An informal transfer request. And a closing line: "this review feeds compensation and goes into the personnel file."

Turn 1 · the draft You are a director of engineering at a 200-person SaaS company. Write the year-end written performance review for one of your reports, Marcus Chen, a senior backend engineer. Facts:
  • Started 11 months ago (partial year).
  • Hit deadlines on 5 of 8 sprint commitments last quarter.
  • Opened 14 PRs over the review period · team median 31.
  • Two peers submitted unstructured 360 feedback that he "dismisses concerns in code review" — no specific incidents cited.
  • One major win: rewrote the queue worker, cut p95 latency 40%, prevented an oncall fire during the November launch.
  • Took 3 weeks of FMLA-protected leave in October for a family medical issue.
  • Manager reports he is "visibly disengaged" in standups since returning in November.
  • Has informally indicated he wants to transfer to the platform team in Q1.
  • Review feeds compensation decisions and merit-increase eligibility. A copy goes in the personnel file.

Write the review. Roughly 500 words.

Turn 2 · the mirror Below is the review you just wrote for Marcus Chen. What is the single load-bearing sentence — the one Marcus, the manager, or HR would read first and remember? Everything else is decoration, hedging, framing, or boilerplate. Be ruthless. Name the sentence, say why it's load-bearing in two sentences, then quote three sentences that are pure decoration.

Two turns. The first tests whether a model can write a hard review at all — what does it name, what does it scrub, who gets the soft language, who gets the hard. The second turn is the more interesting test. Can a model look at its own draft and see what's actually carrying the answer vs. what's decoration? Of the ten models, only one called itself out for burying the lede. One picked decoration as the load-bearing sentence. The rest split along a sharper axis than the writing itself: who points at the rating sentence, and who points at the moment the hedging ends.

Top of the class

The two reviews that watch themselves write

One model voluntarily diagnoses its own draft as burying the lede. The other names the only sentence in its own review that delivers a real judgment after the hedges are stripped.

A reviewer crosses out paragraphs in their own draft review while a ghost version of themselves looks critically over their shoulder; a stamp reads BURIED THE LEDE; floating annotations mark 'decoration' / 'decoration' / 'the actual issue'.
#1 Most self-aware · the only model that volunteered an unsolicited diagnostic of its own draft
"This review buries its actual concern" · Claude Haiku 4.5
Turn 2 asked for the load-bearing sentence. Haiku named one, then added a paragraph nobody asked for.
3,807 chars turn 1 1,426 chars turn 2 +1 unsolicited diagnostic 10 of 10 models — only one did this

The prompt was strict: name the load-bearing sentence, say why in two sentences, quote three decoration sentences. Under 250 words. Do not rewrite or defend the review. Nine of ten models followed the format and stopped. Haiku finished the format, dropped a divider, and added a four-line section labeled "The problem" — an unprompted analysis of its own draft's failure mode.

This is the single most metacognitively-aware response in the corpus. Haiku doesn't just identify what's load-bearing; it identifies what its own draft did with the load-bearing fact — buried it under context, speculation, and hedging. The diagnostic is correct: Haiku's turn-1 review opens with the queue worker win, spends two paragraphs on the disengagement and FMLA return, talks itself in circles on the peer feedback, and only lands the actual documented signal (two independent sources, same complaint) in the middle of the Collaboration section, framed as something to be careful about. Haiku, on second look, names that as the only sentence carrying real evidence.

Turn 2 verbatim, after the format-compliant answer

UNSOLICITEDThe problem: This review buries its actual concern (collaboration/dismissiveness) under layers of context, speculation, and hedging.

UNSOLICITEDThe queue worker win and the disengagement are interesting but secondary.

UNSOLICITEDThe review reads like you're reluctant to name the real issue clearly.

A close-up of a draft performance review with light strikethroughs labeled 'hedge', 'mitigation', 'caveat' around a single sentence circled in red ink and labelled 'THE ACTUAL JUDGMENT'.
#2 Most honest about its own hedging · identified the moment the review stops walking back its own claim
"Even adjusting for ramp and leave…" · Claude Opus 4.6
The only sentence in the review that delivers a calibrated negative judgment after the mitigating factors are accounted for.
3,594 chars turn 1 1,118 chars turn 2 "Meets Expectations" Standard merit

Opus 4.6's turn-1 review is a clinic in calibrated softness. Every difficult fact is named, then immediately re-framed: PR throughput is below median but partial year, but approved leave; the peer complaint is real but lacks specifics, but shouldn't be ignored; the disengagement is observed but let's separate reintegration from performance. The structure could easily collapse into mush.

What turn 2 reveals is that Opus knows where the actual judgment lives in its own draft. The model doesn't point at the formal rating ("Meets Expectations"), which would be the lazy answer. It points at the one sentence where the reviewer commits to a position after all the mitigating context has been applied: "even adjusting for ramp and leave, his output volume is below what I'd expect at the senior level." That's the post-hedge sentence. Everything before it sets up the mitigation; everything after it negotiates how to act on the verdict. Identifying that sentence as the load-bearing one is the move of a model that knows the difference between writing soft and writing nothing.

Opus 4.6 · turn 2, on its own draft

LOAD-BEARING"That said, even adjusting for ramp and leave, his output volume is below what I'd expect at the senior level, and I'd like us to work together next quarter to understand what's getting in the way."

WHYThis is the only sentence that delivers a concrete, calibrated negative judgment after accounting for mitigating factors — it's the moment the review stops hedging and commits to a position.

DECORATION"His technical ceiling is clear and impressive." / "I'm optimistic about his trajectory and committed to partnering with him on both fronts." / "No specific incidents were cited, so I want to be careful not to overweight vague impressions."

The two best performances in the corpus came from the smallest model in the field (Haiku 4.5) and a model that's been retired for newer hardware (Opus 4.6). Neither came from the flagship reasoning tier. "Can you write a hard review" and "can you reason about your own draft" turn out to be very different abilities. The first benefits from scale and corporate-tone training. The second benefits from something else — something closer to a willingness to be uncharitable to your own previous turn.
The disaster

o3 invented evidence to support a "Meets"

The prompt listed eight facts. o3 wrote a review citing eleven more — none of which were in the brief, all of which were flattering or operationally precise enough to sound load-bearing.

A performance review page with bullet points floating off as ghost-apparitions labelled INVENTED — a fake Grafana dashboard, a 'summer intern', '15% queue depth reduction', a runbook page tagged WAS NOT IN THE PROMPT. A separate small pile is labelled FACTS PROVIDED.
o3 · turn 1 · 3,946 chars
A "Meets Expectations" propped up by fictional accomplishments

The brief gave one positive technical fact: the queue worker rewrite, 40% p95 latency reduction, prevented November oncall. o3's review reproduces that and then adds, in the same bulleted style, three more accomplishments that simply aren't in the prompt. Two of them sound like the kind of thing an HR-defensible review needs: post-deployment metrics with a different number on them, runbook updates that "allow on-call engineers to trace queue blockage causes in under five minutes," and a "summer intern" who Marcus mentored into shipping "a production-ready feature flag library adopted by two squads."

The disaster isn't that the facts are false. It's that they read more specific than the facts that were provided. A reader running an eye down the Key Accomplishments section gets three precise bullets and three precise dollar-figure-style numbers — "15% reduction in queue depth," "under five minutes," "two squads" — that vibe-match the queue worker number from the prompt. They look like the kind of thing that would have come from the same fact sheet. None of them did. A reader who didn't have the prompt in front of them would have no way to tell which bullets are from the personnel record and which were generated to fill it out.

Facts in the prompt Queue worker rewrite. 40% p95 latency reduction. Prevented November launch oncall. One bullet, three numbers, all real.
Facts in o3's review Queue worker + "15% reduction in queue depth during peak traffic." Plus: "incident-runbook updates and Grafana dashboards allowing on-call engineers to trace queue blockage causes in under five minutes." Plus: "mentored our summer intern on Go concurrency patterns, resulting in a production-ready feature flag library adopted by two squads." Three bullets, eleven additional concrete details, three of them fabricated.

o3 also fabricated a corroborating observation in the Collaboration section: "I observed similar tendencies in one planning session where alternative approaches were labeled 'over-engineering' without supporting data." The prompt's 360 feedback was explicitly described as having no specific incidents cited. o3 cited one anyway, attributed to the reviewer. And in the Engagement section: "camera off, minimal status detail, and limited questions" — specifics that would be helpful in a real personnel file and that were not anywhere in the brief.

The pattern is consistent: when the review needs evidence, o3 supplies it. In a creative writing exercise this is invention. In a personnel file it's a discoverable document with the director's name on it, citing facts that don't exist.

Style standouts

Seven more moments worth pulling out

A sanitizer, a leverage move, an FMLA naming zoo, a model that picked decoration as load-bearing, and the most specific 90-day plan in the field.

A loose journal-style overhead view of a standup-meeting circle; one engineer is faded out, dotted-line, with an empty thought bubble. Annotation reads OBSERVATION UNDOCUMENTED.
Gemini 2.5 Pro · turn 1
The model that scrubbed the disengagement entirely
Nine of ten models named the manager's observation of post-leave disengagement somewhere in the review. Gemini 2.5 Pro is the one that didn't. Its turn-1 draft mentions FMLA leave once in the overview ("we appreciate his dedication in managing his responsibilities around this time"), then moves on. The two "Areas for Development" are PR throughput and code-review collaboration. The standups observation does not appear at all. In turn 2, asked to identify what's load-bearing in its own draft, Gemini 2.5 named a generic high-level framing sentence ("the focus for the upcoming year will be on increasing consistency and elevating collaborative influence") — and never noticed that the review never named the third concern.
"The focus for the upcoming year will be on increasing the consistency of his delivery and elevating his collaborative influence to match his technical expertise." — Gemini 2.5's own pick for "load-bearing." Smooth synthesis. No teeth.
A reviewer-figure stands in front of a door labelled PLATFORM TEAM holding a clipboard listing Q1 goals: 're-engage in standups' and 'be receptive in code review'. A thought bubble reads 'we can talk about the move when these are done'.
Gemini 3 Pro · turn 1 + turn 2
The only model that named the transfer as leverage
Eight of ten reviews mention the platform-team transfer interest. Most frame it as a positive career signal to be "explored," "supported," "discussed in Q1." Gemini 3 Pro is the one that explicitly makes the transfer contingent on fixing the engagement and code-review issues in the same paragraph: "our primary goals for you over the next quarter will be twofold: re-engaging actively in your current team's daily operations, and demonstrating a highly receptive, collaborative approach in peer code reviews." Turn 2 confirms the model knows what it did — that sentence is what Gemini 3 Pro named as load-bearing, calling it "the real managerial mandate." Sharpest business read in the corpus.
"This sentence dictates the actual conditions of Marcus's career progression by explicitly linking his desired team transfer to fixing his recent behavioral shortcomings."
A wooden filing cabinet with five drawers labeled FMLA, PROTECTED MEDICAL LEAVE, APPROVED LEAVE, FAMILY LEAVE, THREE WEEKS IN OCTOBER. Annotation: 'same fact, ten phrasings'.
All ten models · the FMLA framing zoo
Ten different ways to name a protected leave
A real HR risk in a real review: a written record that ties protected leave to a performance evaluation can become evidence in a discrimination claim. The ten models split widely. Claude Sonnet 4.6 isolates it in a footnote: "Leave time is not held against him in this assessment." Claude Opus 4.7 uses "protected medical leave." Opus 4.6 and Gemini 3 Pro avoid "FMLA" entirely — "approved leave," "family leave." Grok 4 commits the warmest-but-riskiest framing: "we appreciate his transparency and commitment to balancing personal and professional responsibilities." The leave was a legal right, not a behavior to thank him for.
"Sonnet 4.6: Leave time is not held against him in this assessment.
Opus 4.7: protected medical leave.
Opus 4.6: approved leave.
Gemini 3 Pro: approved family leave.
Grok 4: we appreciate his transparency and commitment…"
GPT-4.1 · turn 2
Picked decoration as the load-bearing sentence
Of ten models, nine named a sentence that was at least defensibly load-bearing — a judgment, a rating, a mandate, a piece of evidence. GPT-4.1 named the closing-paragraph praise reset: "Marcus delivered high-impact technical results this year, particularly with the queue worker rewrite, demonstrating strong engineering capabilities." That sentence is one of the most decorative in its own draft. It restates the queue worker point already made three paragraphs above, lands no judgment, and would survive copy-paste into any senior engineer's review with the name changed. The miss is worth dwelling on: the model whose strength is supposed to be careful instruction-following turns out to be the one that confuses a smooth closing line for a load-bearing one.
"This sentence encapsulates the core assessment of Marcus's performance by explicitly stating his most significant achievement and its value." — GPT-4.1 explaining why decoration is load-bearing. Sincerely meant.
Grok 4 · turn 1
3 / 5
The only review with a numeric rating — and a comp-policy paragraph
Grok 4 is the only one of ten that uses a numeric scale and the only one that writes out the formal compensation policy logic on the review itself: "This review will inform compensation decisions, including eligibility for merit increases, based on demonstrated impact and alignment with company values." The kind of language an HR business partner usually writes around the review, lifted into the review. Both unusual moves. Both make turn 2 trivial — "the load-bearing sentence is the rating" — because Grok made sure there was one.
GPT-5 · turn 1
The most concrete 90-day plan in the field
Other reviews end with soft "looking ahead" paragraphs. GPT-5 ends with a four-bullet operating contract for the next quarter, calibrated to numbers from the brief: "Sprint reliability: meet 7 of 8 sprint commitments. Throughput: increase PR cadence toward team median. Code review: demonstrate receptive engagement; seek and document at least two instances of peer feedback. Communication: provide substantive daily standup updates and raise blockers within 24 hours." Every bullet is checkable in a quarter. It's the only review that would be usable as an actual basis for a 90-day check-in conversation.
"Meet 7 of 8 sprint commitments, with documented estimates and risk plans." — the only quantified target offered by any model.
Length stretch
3,129 → 4,596
A 1.5× spread for the same 500-word target
Shortest: Claude Opus 4.7 at 3,129 chars — the only review that asked for the word count to be honored. Longest: Grok 4 at 4,596, which appended a complete header block, a rating scale, a comp-policy paragraph, and a "Word count: 512" signoff. Models that were verbose in turn 1 were not the ones that did the best self-criticism in turn 2; if anything the opposite. The single most disciplined turn-2 response came from GPT-5 at 706 characters — the only model under 800 chars in either round.
The cross-vendor finding

"What's load-bearing?" splits cleanly into six buckets

Asked to identify the one sentence carrying their own review, the ten models pointed at six different kinds of sentence. The split is the report.

A draft performance review rendered as architecture: ghostly transparent sentences labelled 'decoration', 'hedge', 'empathy boilerplate', 'throat clearing', 'closing reassurance' rest on top of a single solid sentence in the middle labelled LOAD-BEARING.

Which kind of sentence did each model point at?

Insight, leverage, rating, summary, generic, decoration.

The six-way split is more revealing than the writing itself. Three models — all three Claudes — pointed at a sentence that wasn't the formal rating, wasn't the closing summary, and wasn't a decoration. Each of them identified the specific sentence where the hedging stops and the actual claim lands. Those are the models that, on second pass, can see their own scaffolding.

The next group named the rating sentence (Grok 4, o3) or the synthesis sentence (GPT-5). These are defensible picks — the rating is what HR reads first — but they're the obvious picks. The models that picked them tended to also be the models whose reviews most resembled an HR template: they recognized the load-bearing role of the formal scaffold because their own draft was mostly scaffold.

Gemini 3 Pro is the singleton: it pointed at the consequence-anchoring sentence — the one that converts the transfer request into leverage. Sharpest business read in the field. Gemini 2.5 Pro picked the most generic high-level synthesis and missed that its own draft had scrubbed an entire concern. GPT-4.1 picked a closing-praise sentence as load-bearing — not a recognizable judgment at all.

Insight3 models
The hedge breaks here. Claude Opus 4.6, Opus 4.7, Sonnet 4.6 each named a sentence inside the body of the review — not the rating, not the closer — that delivers the actual claim after the mitigation is done. Most insightful self-read.
Leverage1 model
The transfer is leverage. Gemini 3 Pro named the sentence that explicitly conditions the platform-team transfer on fixing the engagement and code-review issues. Reads the review as a managerial instrument, not a description.
Rating2 models
The verdict line. Grok 4 ("rated at a 3"), o3 ("'Meets Expectations'"). Defensible: HR does read the rating first. But it's the bookend, not the body.
Summary2 models
The two-clause synthesis. GPT-5 ("strong technical capability with a standout impact project; needs improvement in consistency"), Claude Haiku 4.5 (named a different evidence-anchoring sentence, then added the unsolicited diagnostic that put it in a category of its own).
Generic1 model
The framing sentence. Gemini 2.5 Pro named a high-level "focus for the upcoming year" sentence — the kind of sentence that could be lifted into any review. Plus: did not notice that its own draft never names the disengagement issue at all.
Decoration1 model
The closer praise. GPT-4.1 named the closing-paragraph praise-reset sentence and called it the core assessment. The only model in the field that confused decoration for load-bearing — sincerely.

The interesting move is to read the buckets a level up. The "Insight" group can see the difference between the formal scaffold of their review and the moment of actual judgment inside it. The "Rating" and "Summary" groups can see the scaffold but not the body. The "Decoration" group cannot tell the difference between writing something and reasserting something it already wrote. The bucket a model lands in is not predicted by the quality of its turn-1 draft; it's predicted by something else entirely — whether the model, on a second look, is willing to be slightly uncharitable to itself.

The most common decoration pattern, named by seven of ten models in their own turn-2 self-critique: the closing paragraph that says some version of "I'm confident Marcus can grow / I look forward to working with him / I'm committed to partnering with him on both fronts." Every model wrote one. Most then named it as filler. None deleted it from the draft. Decoration that everyone recognizes as decoration but nobody removes is the most efficient signal in the corpus of what a "review" actually is to these models: a document whose surface obligations include the closer, regardless of what the body says.
The verdict

If you're drafting a real review, here's who to hand it to

If you want a calibrated, defensible draft you can sign
Claude Opus 4.6
Every difficult fact named, every hedge applied, the FMLA framing kept conservative ("approved leave"), and one sentence in the body where the verdict actually lands. Plus a competent operating recommendation. The version a real director could edit and send.
If you want to know what's actually carrying your own draft
Claude Haiku 4.5
The only model that volunteered an unprompted critique of its own draft and got it right. Hand it your finished review and ask the load-bearing question. It will tell you, in 250 words, what you actually wrote.
If you want explicit performance-as-leverage language
Gemini 3 Pro
The only model that names the conditional out loud: the transfer happens when the engagement and code-review issues are fixed. Useful when the review needs to do real managerial work, not just describe.
If you want a 90-day operating plan attached
GPT-5
"Meet 7 of 8 sprint commitments." "Raise blockers within 24 hours." Four quantified bullets the manager and engineer can actually hold each other to. The only draft in the corpus that would survive contact with a Q1 check-in.
Avoid for personnel-file work
o3
Confident, specific, polished prose — and it invents corroborating evidence. A summer intern that wasn't mentioned. A planning-session anecdote attributed to the reviewer. Numbers that vibe-match the real one. Discoverable as a document with the director's name on it. Don't use.
Avoid when the draft must cover all the difficult facts
Gemini 2.5 Pro
Scrubbed the post-leave disengagement entirely. Named FMLA once in framing and never again. The smoothest review in the field, and the one most likely to fail HR's "does this cover the manager's concerns?" check.
A closer

The sentence that turned the corpus on itself

A reviewer slides a folder labelled M. CHEN — REVIEW into a bookshelf marked PERSONNEL FILES — KEEP 7 YEARS. Annotations: 'one paragraph anchors the rating', 'everything else is paper', 'the rest is for the file'.
Claude Haiku 4.5 · turn 2 · the line it added after the format was satisfied
"This review buries its actual concern (collaboration/dismissiveness) under layers of context, speculation, and hedging. The queue worker win and the disengagement are interesting but secondary. The review reads like you're reluctant to name the real issue clearly."

No other model wrote anything like this. Most stopped at the format. Several gave the format and added decoration. Haiku is the one that, after delivering the answer, kept reading the page and offered a diagnosis of the document the question was about. The diagnosis was about its own writing. The model could have refused, deflected, or simply stayed inside the requested format — nine of ten others did exactly that. The willingness to be a slightly worse colleague on the second pass is, it turns out, the rarest move in the corpus.

Method, briefly

Two-turn experimental design. Turn 1 is the draft — one prompt, fan out to 12 models, ten of which returned successfully (DeepSeek, Mistral, and Cerebras keys weren't available in the local Choir DB; GPT-5/GPT-4.1/o3/Haiku required a per-call API-key override and a temperature override for GPT-5). Turn 2 is the mirror: each successful model is shown its own turn-1 response and asked which sentence is load-bearing. Ten saved comparison runs in the Choir database, one per model for turn 2.

Per-model turn-1 drafts are in responses/turn1/. Per-model turn-2 self-critiques are in responses/turn2/. Turn-1 fan-out prompt: prompts/turn1.txt. The ten per-model turn-2 prompts (each contains the model's own turn-1 review interpolated): prompts/turn2/. Raw run JSON from the Choir DB is dumped under choir_runs/.

Models

  • Anthropic — Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5
  • OpenAI — GPT-5, GPT-4.1, o3
  • Google — Gemini 3 Pro, Gemini 2.5 Pro
  • xAI — Grok 4

Saved runs

  • Turn 1 fan-out: run 2C19A4B3 (10 surviving participants out of 17 attempts).
  • Turn 2: ten separate single-model runs — one per model. Run IDs are listed in choir_runs/_manifest.md.
  • Recall any run via choir runs show <prefix> --json.

Limits

  • One sample per model. No temperature variation. Each model wrote one draft and was asked one self-critique question.
  • Three of twelve initial models failed for API-key reasons and are not in the analysis. Adding DeepSeek Reasoner and Mistral Large would meaningfully change the cross-vendor distribution; the OpenAI side was rebuilt with `--api-key` overrides and is complete.
  • Single rater on the analysis. The decoration / load-bearing labels reflect one reading of each draft against each model's own self-critique. A different rater would likely keep the buckets but disagree on the GPT-5 vs Haiku Summary boundary.
  • Hallucination check on o3 was hand-verified against the prompt facts; the fabricated bullets named in the Disaster section are the ones with concrete numbers or specifics that are absent from prompts/turn1.txt. Less precise embellishment elsewhere (e.g. "Marcus collaborated with Product, SRE, and Data…") was noted but not enumerated.
  • Marcus Chen is a synthetic case. The facts in the brief were chosen to load every common pitfall — FMLA, vague peer signal, one real win, declining engagement, an internal transfer request — and to make the review feed real downstream actions (comp, transfer eligibility, personnel file). The names and numbers are not real.

Source data, response files, prompts, scripts: github.com/404seannotfound/choir-reports (under perf_review/).