Autopsy
Empty versus empty: the bug that made reasoning models look smart
Our first Lighthouse benchmark scored reasoning models as competitive when both sides had returned nothing. The judge counted empty-vs-empty as a tie. Here is how we caught it, what changed when we re-ran, and why the corrected numbers say Llama 3.3 got worse.
The morning we first read the Run 1 table, GPT-5.5 looked competitive. Its tie-rate against the no-retrieval baseline was suspiciously clean — forty per cent of pairs landed as ties, the rest split close to even — and the Δ on a 0-40 rubric came in at a polite half a point. Nothing alarming. Nothing especially flattering, either. The kind of result you would put in a slide and move on.
We did not move on, because the same row looked the same way for two other reasoning-class models, and three identical-looking rows in a row is not a finding, it is a tell.
What the first table said
The bench is Lighthouse's Run 1 effectiveness pass. Eleven frontier models across four provider families, ten SDLC roles, thirty-five tasks per role, each task run twice through a four-stage loop — plan, execute, self-review, finalize. Mode A has no tools. Mode B can call search against the Lighthouse graph. The judge is Claude Sonnet 4.6 at temperature 0, scoring both answers on a four-axis rubric and emitting a winner in {A, B, tie}. The same judge for all eleven models so the verdicts compare cleanly across the panel.
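For concreteness, the unit the judge works over is one pair per task: the Mode A answer, the Mode B answer, their rubric scores, and a single verdict. Below is a minimal sketch of that record in Python; the field names are ours for illustration rather than the harness schema, and the four axes are assumed to be ten points each so they sum to the 0-40 scale.

```python
from dataclasses import dataclass
from typing import Literal

Verdict = Literal["A", "B", "tie"]

@dataclass
class JudgedPair:
    """One task for one model: Mode A (no tools) versus Mode B (graph search)."""
    model: str
    role: str
    task_id: str
    answer_a: str              # no-retrieval answer
    answer_b: str              # retrieval-augmented answer
    scores_a: dict[str, int]   # four rubric axes; assumed 0-10 each, 0-40 total
    scores_b: dict[str, int]
    verdict: Verdict           # emitted by the judge (Claude Sonnet 4.6, temperature 0)

    def total_a(self) -> int:
        return sum(self.scores_a.values())

    def total_b(self) -> int:
        return sum(self.scores_b.values())

    def delta(self) -> int:
        """Mode B lift on the 0-40 scale; positive means retrieval helped."""
        return self.total_b() - self.total_a()
```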
The first cut of the table had nine of eleven models gaining from retrieval. Average lift across the panel was +1.5 points on the 0-40 scale. The U-curve was already visible: Gemini 2.5-pro went up by double digits from a weak baseline, Sonnet barely moved, GPT-5.2 sat flat. It read like a respectable retrieval-augmentation paper. Weak baselines gain. Strong ones do not. The interpretation almost wrote itself.
The thing that ruined the interpretation was the tie column.
The tell
GPT-5.5 had a tie-rate near 40 %. Kimi K2.6 was in the same neighbourhood. GPT-5.2 was worse. When we opened the per-task rationales the judge had written for the tied pairs, the prose was eerily generic — the kind of sentence a judge writes when there is nothing to grab onto. We pulled the raw answers themselves and the answers were not bad. They were absent. Empty strings. Whitespace. The occasional stray </tool_use> fragment. The judge had not been comparing two responses; it had been comparing two voids, scoring both at 0/40, and declaring the result a tie because neither side had cleared the floor.
About 37 % of GPT-5.5's first pass was empty-versus-empty. A third of the row was a category error.
We had built an eval that was, for the reasoning-class models, reliably measuring whether the OpenRouter wrapper had returned a string. The Δ we had been about to publish was the noise floor of an infrastructure bug, not a measurement of retrieval.
Why the model returned nothing
Reasoning-class models — the GPT-5 line, Kimi K2.6, anything that internally generates a long chain-of-thought before the visible answer — share a token budget between hidden reasoning and visible output. The max_completion_tokens cap counts both. If the model burns the budget on private deliberation, the visible reply truncates to nothing. The provider returns a valid HTTP 200 with an empty content field. The wrapper does not flag it. The judge sees "" and scores "".
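The check the wrapper was missing is small. Assuming an OpenAI-compatible chat-completions response shape (the field names below come from that API, not from our wrapper, which is not shown here), the failure looks like this:

```python
def visible_text(response) -> str:
    """Pull the visible answer out of an OpenAI-compatible chat completion."""
    return (response.choices[0].message.content or "").strip()

def is_silent_empty(response) -> bool:
    """True when the call returned HTTP 200 but the visible reply is empty.

    With reasoning-class models, hidden chain-of-thought and the visible answer
    share max_completion_tokens; when reasoning exhausts the budget, content
    comes back empty or whitespace, typically with finish_reason == "length".
    """
    return visible_text(response) == ""
```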
Two empty strings are, by the rubric, equally bad. The verdict is a tie. The arithmetic does not lie — it just does not know what a tie is supposed to mean.
The mechanism is older than reasoning models. It is the same shape as the agent that finished without committing: a contract that cannot tell the difference between "the work happened and was poor" and "the work never happened". The pipeline accepts the absence as a result. The dashboard stays green. The dishonest reading wins.
The fix
We did three things, in this order.
We wrote a disqualification filter. The DSQ pass flags any pair where one or both answers fall under a length floor, lead with a tool-use JSON leak, or match the short-refusal patterns the OpenRouter quirks list had already documented. The headline numbers are computed over the non-DSQ subset. Anything DSQ-flagged is held out of the means until it has been re-run.
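Here is a sketch of that filter. The length floor, the leak patterns, and the refusal patterns below are illustrative values, not the exact rules in the harness config:

```python
import re

LENGTH_FLOOR = 40  # characters of visible answer; illustrative, not the real threshold
LEAK_PATTERNS = [r"</?tool_use>", r"^\s*\{\s*\"tool_calls\""]           # tool-use leaks
REFUSAL_PATTERNS = [r"^\s*I('m| am) sorry", r"^\s*I can('t|not) help"]  # short refusals

def is_dsq(answer: str) -> bool:
    """Disqualify an answer that is too short, leads with tool syntax, or is a bare refusal."""
    text = (answer or "").strip()
    if len(text) < LENGTH_FLOOR:
        return True
    if any(re.match(p, text) for p in LEAK_PATTERNS):
        return True
    return any(re.match(p, text, re.IGNORECASE) for p in REFUSAL_PATTERNS)

def pair_is_dsq(answer_a: str, answer_b: str) -> bool:
    """A pair is held out of the headline means if either side is disqualified."""
    return is_dsq(answer_a) or is_dsq(answer_b)
```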
We re-ran the affected pairs. The trigger for every reasoning-class model was the same: shared budget collapsing onto hidden CoT. The fix was two flags. We set reasoning_effort=low on the GPT-5.x and Kimi calls — letting the model think briefly instead of expensively — and quadrupled max_completion_tokens on the affected tasks. Combined, those two changes drove the residual DSQ rate to zero for nine of eleven models and to one task out of thirty-five for Kimi K2.6. Good enough to compute headlines on.
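The re-run itself is two parameter changes per call. The exact flag spelling varies by gateway; through an OpenAI-compatible client the change looks roughly like the sketch below, where the base URL, key, and base budget are placeholders rather than our production values:

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")  # placeholders

BASE_BUDGET = 4096  # whatever the original run allotted per task; placeholder value

def rerun_dsq_task(model: str, messages: list[dict]) -> str:
    """Re-run a DSQ-flagged task: short reasoning budget, four times the completion cap."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        reasoning_effort="low",                 # think briefly instead of expensively
        max_completion_tokens=BASE_BUDGET * 4,  # room for hidden CoT plus a visible answer
    )
    return (response.choices[0].message.content or "").strip()
```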
We rebuilt the table from the cleaned numbers and went looking for what had moved.
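Rebuilding the headline is then a filter-and-average over the clean pairs, roughly as below, reusing the JudgedPair and pair_is_dsq sketches from earlier; the unweighted per-model average is our assumption about how the panel figure rolls up:

```python
from statistics import mean

def model_lift(pairs: list[JudgedPair]) -> float:
    """Average Mode B lift on the 0-40 scale, computed over non-DSQ pairs only."""
    clean = [p for p in pairs if not pair_is_dsq(p.answer_a, p.answer_b)]
    return mean(p.delta() for p in clean)

def panel_lift(pairs_by_model: dict[str, list[JudgedPair]]) -> float:
    """Panel headline: per-model lifts averaged without weighting."""
    return mean(model_lift(pairs) for pairs in pairs_by_model.values())
```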
What the second table said
The average lift across the panel went from +1.5 to +1.88 points on the 0-40 scale. That is not a dramatic move on its own. The structural change is in the count: nine of eleven models had originally been "improved" — that became seven. Two of the original lifts were false positives, dragged into the green by the tie inflation on their own empty pairs.
The line that earned the rerun is further down the table.
Llama 3.3-70B went from neutral to −2.6.
In the first cut, Llama looked flat: no help from retrieval, no harm. After DSQ, after the cleaned reruns, after the judge scored real strings against real strings, the model lost 2.6 points on the scale. Mode B was slower in wall time (79 s against 36 s for Mode A) and substantively worse in content, with output that looked suggestively truncated. The retrieval was, for that model, actively making the answer wrong. Probably context-overflow behaviour. Possibly something subtler. The number is what we publish; the diagnosis is still open.
A regression that big, on a model with that much name recognition, is the line on the page that says we did the work. Anyone can publish a chart where every bar points up. The bar that points down is the one that costs something to print.
Why we are publishing this part first
The methodology report has a "What we got wrong" section at the top, above the headline numbers, above "Where Lighthouse loses", above everything except the abstract. Most teams put their limitations section at the back, in a smaller font, and write the prose in the past-perfect tense so it reads like it happened to someone else. We tried the back-of-the-paper version first and it did not pass the smell test for either of us. The bug is not a footnote on the result. The bug is the result. The first table was wrong. The second table is right because we caught the first one being wrong.
If a reader is going to trust the second table, they need to be able to read the autopsy of the first. We would rather lose the reader who wanted a clean chart than keep the one who skips the methodology.
What we will measure differently next time
The next run will treat empty visible output as a first-class outcome, named and counted, not as an absent string the judge gets to interpret. The same DSQ filter will be applied live during the run rather than as a post-hoc audit pass; the harness will re-prompt the model with the higher budget on the first empty, not on the second sweep. We will also keep the reasoning-effort dial visible as a per-task setting, because picking it once globally is the kind of choice that looks correct in May and embarrassing in November.
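In harness terms that is a small loop rather than an audit pass: classify the answer the moment it comes back, retry once with the bigger budget, and record empty output as its own outcome. A sketch, reusing the client and is_dsq helpers from above; the retry policy and outcome labels are illustrative, not the final Run 2 design:

```python
def run_task(model: str, messages: list[dict], budget: int) -> dict:
    """Run one task; re-prompt once with a larger budget if the visible answer is empty or DSQ."""
    for attempt in (1, 2):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            max_completion_tokens=budget,
        )
        text = (response.choices[0].message.content or "").strip()
        if text and not is_dsq(text):
            return {"answer": text, "outcome": "ok", "attempts": attempt}
        budget *= 4  # re-prompt with the higher budget on the first empty, not a later sweep
    # Empty visible output is a first-class outcome, named and counted, never a silent tie.
    return {"answer": text, "outcome": "empty_or_dsq", "attempts": 2}
```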
The corrected Run 1 numbers ship at /lighthouse/evals, with the frozen permalink at /lighthouse/evals/runs/v1. The methodology deep-dive — every chart, every disqualification rule, every limitation — lives at /lighthouse/evals/runs/v1/scientific. The plain-English summary is at /lighthouse/evals/runs/v1/general. The Llama row is the one to look at first.
Designing an eval harness for reasoning models is not the same job as designing one for instruction-tuned chat models. We learned that the slow way. Run 2 will know it from the first task.