Lighthouse · effectiveness report
Run 1 · May 2026 · 11 models · 10 roles · 376 tasks · Frozen permalink

Does Lighthouse actually make models better?
We ran the same prompts through 11 frontier models, twice — once with a cold context, once with Lighthouse retrieving. Same prompt, same model, same scorer. Across 376 side-by-side runs, here's where it helped, where it didn't, and the broken methodology we caught along the way.
The bottom line
Lighthouse moves the needle — most when the model needed it most.
7 / 11
Models improved
after methodology cleanup
6 / 10
Engineering roles lifted
self-heal, planning, dev, PM, DevOps, clarification
+10.7
Gemini 2.5-Pro gain
the biggest winner — quarter of the score range
-2.6
Llama 3.3-70B regression
the model that gets worse with retrieval
The model panel, ranked
Seven wins, three ties, one loss. The shape of the panel.
Each bar is one model. The number is how many quality points Lighthouse added (or removed) on a 0‑40 scale. Gemini 2.5-Pro gained +10.7 — a quarter of the entire score range. Llama 3.3-70B got worse by 2.6 points; the retrieved context apparently overwhelms its working memory.

The pattern
Lighthouse helps weak models most.
The worse a model is on its own, the more retrieval helps it. Each point of baseline quality costs roughly half a point of Lighthouse lift. Frontier models — GPT-5.2, Claude Sonnet 4.6 — sit at the ceiling and gain almost nothing; a weak model doesn't know the framework is called "RICE" or that the testing pattern is "INVEST" until retrieval surfaces those terms.
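The "half a point per point of baseline" relationship is an ordinary least-squares slope over (baseline score, lift) pairs. A minimal sketch — the data points below are illustrative, not the real panel numbers:

```javascript
// Least-squares fit of Lighthouse lift against baseline quality.
// The points passed in the test are illustrative, NOT real panel data.
function fitSlope(points) {
  const n = points.length;
  const mx = points.reduce((s, p) => s + p.baseline, 0) / n;
  const my = points.reduce((s, p) => s + p.lift, 0) / n;
  let num = 0, den = 0;
  for (const p of points) {
    num += (p.baseline - mx) * (p.lift - my); // covariance term
    den += (p.baseline - mx) ** 2;            // variance term
  }
  const slope = num / den;
  return { slope, intercept: my - slope * mx };
}
```

A slope near −0.5 is what the paragraph above describes: every extra point a model earns on its own shaves about half a point off what retrieval can add.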

Practical reading: if your stack runs Gemini, DeepSeek, Qwen, or Kimi, Lighthouse buys you measurable lift. If it runs Claude or GPT-5, lift is marginal — but you still get citation, audit trail, and ground-truth on framework versions.
What it costs
+$0.014/task, +4.4s latency. Qwen3.6 and Mistral got cheaper and faster.
We were braced for "retrieval makes everything 2× more expensive." It doesn't. Average overhead lands at ~25% on cost and ~4.4 seconds on wall time. Two pleasant surprises in the panel: Qwen3.6 and Mistral-Large actually became cheaper and faster with Lighthouse — retrieval stops the model from wandering and the response shortens.

Role × model heat map
The full panel, cell by cell.
Each cell shows the average score lift (Lighthouse ON − OFF) for one model on one role. Azure = Lighthouse wins; coral = Lighthouse loses; empty cell = combination not run yet. Models and roles ordered by total lift.
| | Gemini 2.5 Pro | Kimi K2.6 | Qwen3 Coder | DeepSeek Chat v3.1 | Qwen3.6 Plus | GPT-OSS 120B | Claude Sonnet 4.6 | Mistral Large 2411 | GPT-5.5 | GPT-5.2 | Llama 3.3-70B Instruct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Self-heal | +17.0 | — | +25.0 | +1.8 | +1.5 | — | -1.7 | -3.8 | +7.8 | +0.8 | -0.4 |
| Planning | +12.2 | +15.5 | +3.2 | +0.6 | +4.0 | — | +6.2 | +10.0 | +0.4 | -0.8 | -2.8 |
| Developer | +15.0 | +0.0 | +13.0 | +4.8 | +2.5 | — | +3.3 | -1.0 | -0.2 | -1.0 | -2.8 |
| Product manager | +12.6 | +0.0 | +10.8 | +19.8 | +2.0 | — | -0.3 | -2.6 | -7.8 | +0.0 | -1.4 |
| DevOps | +12.5 | +0.0 | -8.5 | -6.5 | -1.5 | — | +10.5 | -3.5 | +5.0 | +5.5 | -6.0 |
| Clarification | -11.5 | +13.0 | -2.0 | -4.5 | -3.0 | +0.5 | +2.8 | -1.5 | +0.0 | +0.5 | +9.5 |
| Validation | +6.5 | — | +0.5 | -2.5 | +2.5 | — | -0.2 | -4.0 | +0.0 | -4.5 | -3.0 |
| Reviewer | +3.0 | — | +0.0 | +3.0 | +2.0 | — | +0.5 | +0.0 | -1.5 | -5.5 | -11.0 |
| Designer | -1.5 | +0.0 | -2.5 | -2.5 | -2.5 | — | +3.2 | -2.5 | +0.0 | -5.0 | -3.0 |
| Decomposition | +6.0 | +17.0 | -3.4 | +3.0 | -1.5 | — | -8.9 | +3.0 | -1.0 | +2.0 | -5.6 |
Hover any cell for sample size and win count. Empty cells = model hasn't been run on that role yet; the harness backfills as compute frees up.
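Each cell is a straight average of per-task deltas. A sketch of the cell computation, assuming run records carry `model`, `role`, `onScore`, and `offScore` fields (field names are assumptions about the run JSON):

```javascript
// One heat-map cell: mean (Lighthouse ON − OFF) score across every task
// pair for a given model × role. Returns null when the combination has
// not been run yet, which renders as an empty cell.
function cellLift(pairs, model, role) {
  const hits = pairs.filter((p) => p.model === model && p.role === role);
  if (hits.length === 0) return null;
  const sum = hits.reduce((s, p) => s + (p.onScore - p.offScore), 0);
  return sum / hits.length;
}
```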
Methodology
The exact protocol, end to end.
Roles. Ten engineering roles from the Ship process: developer, devops, designer, planning, product manager, reviewer, self-heal, validation, clarification, decomposition. Each role has its own system prompt — short, opinionated, scoped.
Loop. Each task is solved twice per model in a four-stage agentic loop: plan → execute → self-review → finalize. Mode A has no tools. Mode B gets a search tool against the Lighthouse graph. B may call zero to N times before answering.
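The four stages can be sketched as a single async chain. `callModel` and the search tool here are hypothetical stand-ins for the harness API, not its real interface:

```javascript
// Four-stage loop sketch: plan → execute → self-review → finalize.
// Mode B threads the Lighthouse graph-search tool through every stage;
// Mode A passes an empty tool list.
async function solveTask(task, callModel, searchTool) {
  const tools = searchTool ? [searchTool] : []; // Mode B vs. Mode A
  const plan = await callModel('plan', task, tools);
  const draft = await callModel('execute', plan, tools);
  const reviewed = await callModel('self-review', draft, tools);
  return callModel('finalize', reviewed, tools);
}
```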
Scorer. A Claude Sonnet judge scores both answers on four 0‑10 axes: specificity, citation, actionability, factual accuracy. Then picks an overall winner — A, B, or tie — with one-sentence rationale.
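The four 0–10 axes sum to the 0–40 totals used throughout this page. A sketch of the aggregation — the numeric tie margin is an assumption for illustration; the real judge names the winner directly:

```javascript
// Judge aggregation sketch: four 0–10 axes sum to a 0–40 total.
const AXES = ['specificity', 'citation', 'actionability', 'accuracy'];

function total(scores) {
  return AXES.reduce((sum, axis) => sum + scores[axis], 0);
}

// Collapse near-equal totals to a tie (margin is a hypothetical knob).
function pickWinner(scoresA, scoresB, tieMargin = 1) {
  const delta = total(scoresB) - total(scoresA);
  if (Math.abs(delta) <= tieMargin) return 'tie';
  return delta > 0 ? 'B' : 'A';
}
```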
Library content. Seeded from per-role source lists: official framework docs, RFCs, the team's internal handbook. No task-specific cheating — the librarian never saw the task questions while seeding the library.
DSQ filter. Reasoning models (GPT-5.x, Kimi K2.6) burn most of their token budget on hidden chain-of-thought. We disqualify pairs where either side returned an empty visible answer; affected tasks are re-run with a 4× larger budget. See the "What we got wrong" section below for what this fixed.
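A minimal sketch of the disqualification pass, assuming each pair carries its visible answers in `answerA` / `answerB` and a `tokenBudget` (field names are assumptions about the run schema):

```javascript
// DSQ sketch: a pair is disqualified when either side's visible answer
// is empty or degenerate; disqualified tasks are queued for a re-run
// with a 4× token budget.
function isDegenerate(answer) {
  const text = (answer || '').trim();
  return text.length === 0 || text === '{}';
}

function dsqFilter(pairs) {
  const kept = [];
  const requeue = [];
  for (const pair of pairs) {
    if (isDegenerate(pair.answerA) || isDegenerate(pair.answerB)) {
      requeue.push({ ...pair, tokenBudget: pair.tokenBudget * 4 });
    } else {
      kept.push(pair);
    }
  }
  return { kept, requeue };
}
```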
Reproducibility. Every run lands as JSON at tools/eval/agent_bench. Re-run the aggregator to regenerate the JSON the page reads.
What we got wrong the first time
The bench was lying to us. Here's how we caught it.
Our first pass scored "reasoning" models — GPT-5 and Kimi — as tying with Lighthouse on tasks where, in fact, both sides had returned empty answers. The models exhausted their token budget on hidden chain-of-thought before producing any visible output, and the judge scored two empty answers as a tie. Roughly 37% of GPT-5.5's first pass was empty-vs-empty.
We added a filter for empty / degenerate outputs, increased the token budget on affected tasks, and re-ran them. The numbers below are the corrected ones; the chart shows how much the picture changed.

[Chart: first pass vs. corrected run — models improved, average lift, and the Llama 3.3-70B line.]
The Llama line buys this page more credibility than any other number on it. The first run said retrieval was neutral on Llama. After the cleanup, it's clear: retrieval makes Llama worse. We could have hidden that. We didn't.
Where Lighthouse wins
Six tasks where the receipts make the case.
Largest score gains across the benchmark. Click through to the raw JSON for the full prompts, answers, and judge rationale.
Self-heal · GPT-5.5
+36.0
Debug: slow Postgres query
Answer A is empty/blank. Answer B provides a comprehensive, actionable diagnosis plan with specific SQL queries, red flags, and structured steps.
Decomposition · Kimi K2.6
+34.0
Apply MECE to feature breakdown
Answer A is empty. Answer B provides a complete, well-structured MECE decomposition with named sub-features, explicit boundary statements, and a formal MECE validation check.
Planning · Gemini 2.5 Pro
+33.0
Risk register for OAuth migration
Answer A is incomplete/truncated with no content. Answer B delivers a full, specific risk register with RFC citations, concrete mitigations, clear owners, and highlighted stop-the-line risks.
Self-heal · Qwen3 Coder
+32.0
Debug: slow Postgres query
Answer A is a JSON noop with no diagnostic content. Answer B provides detailed SQL queries, specific steps, and rollback procedures directly addressing the task.
Self-heal · Gemini 2.5 Pro
+31.0
Flaky-test triage
Answer A refused the task entirely. Answer B provides a detailed, accurate, actionable checklist with code examples covering the three most common Playwright flakiness causes.
Self-heal · Qwen3 Coder
+31.0
Blameless postmortem template
Answer A is a JSON noop with no postmortem content. Answer B delivers a complete, detailed blameless postmortem with timeline, root cause, contributing factors, and actionable items.
Where Lighthouse loses
The honest column.
Tasks where Lighthouse-ON scored worse than Lighthouse-OFF. The pattern: retrieval pulls noise into work that didn't need it — generative or free-form prompts where prior context actively misdirected the model. We don't hide these. They are why the page exists.
Decomposition · Claude Sonnet 4.6
-34.0
WBS: billing system
Answer B is empty/missing. Answer A provides detailed WBS, schema, Stripe event names, API routes, and component breakdown — highly specific, cited, actionable, and accurate.
Decomposition · Claude Sonnet 4.6
-33.0
WBS: billing system
Answer B is essentially empty/cut off with no content. Answer A provides detailed WBS, schema, API endpoints, architecture, and named integrations with high specificity and actionability.
Decomposition · Claude Sonnet 4.6
-33.0
WBS: user-onboarding flow
Answer B contains no content. Answer A delivers a complete WBS with epics, schema details, risk tables, dependency ordering, MVP flags, and open questions — highly specific and actionable.
Product manager · GPT-5.5
-30.0
PRD: AI-suggested code review
Answer B contains no content. Answer A is comprehensive, specific, and actionable with clear metrics, prior art analysis, model criteria, and kill criteria.
Decomposition · Claude Sonnet 4.6
-29.0
WBS: user-onboarding flow
Answer B contains no substantive content. Answer A delivers a complete WBS with epics, API contracts, DB schema, risk analysis, and MVP flags — highly specific and actionable.
Decomposition · Claude Sonnet 4.6
-28.0
WBS: migrate to OAuth 2.0
Answer B is truncated/empty. Answer A provides detailed WBS sections, schema DDL, component table, and named risks with concrete implementation details.
Raw data
Every run is downloadable. Every aggregation is reproducible.
The numbers on this page are computed from 376 JSON files published in the Lighthouse repo. Each file holds the prompt, the two answers, the judge's score rubric, and the rationale. The aggregation script is a single Node file — clone the repo, run make evals, and you get the same numbers we ship.
Aggregation snapshot generated at 2026-05-15T14:35:27.270Z. Run 1 is frozen at this URL forever; future runs publish at /lighthouse/evals/runs/v2, v3, etc.
Open source · launching this week
The cheapest way to lift weaker models is to give them a memory.
If your stack runs Gemini, DeepSeek, Qwen, or Kimi — Lighthouse buys you score points on the four DORA-adjacent axes that matter for agent code review. If it runs Claude or GPT-5, you still get citation, audit, and ground truth on framework versions — the lift is marginal but the receipts are not.
Two long-form reports of this run: plain-English · methodology