Lighthouse · effectiveness report
Updated 2026-05-15

Does Lighthouse actually make models better?
We ran the same prompts through 11 frontier models, twice — once with a cold context, once with Lighthouse retrieving. Same prompt, same model, same scorer. Across 376 side-by-side runs, here's where it helped, where it didn't, and the raw runs you can download.
The bottom line
Where Lighthouse moves the needle.
6 / 10
Engineering roles improved
out of ten benchmarked roles
9 / 11
Frontier models lifted
Gemini, Kimi, Qwen, DeepSeek lead
376
Side-by-side tasks
every result downloadable as JSON
+1.88
Average score lift (40 pt scale)
Lighthouse won 174 of 376 tasks outright
Per‑role lift
The pattern: retrieval helps grounded roles. It doesn't help free‑form generation.
Bars show Lighthouse‑ON minus Lighthouse‑OFF total score, averaged across every model and task in the role. Self‑heal, planning, and developer roles see the largest gains; decomposition and designer regress. Six of ten roles improved.
Self-heal
+4.63
Planning
+4.45
Developer
+3.59
Product manager
+3.08
DevOps
+2.38
Clarification
+0.73
Validation
-0.45
Reviewer
-0.77
Designer
-0.83
Decomposition
-1.93
Total score is the sum of four 0‑10 axes (specificity, citation, actionability, accuracy). Max lift on this scale is +40, max regression is −40.
Role × model matrix
Lighthouse benefits open‑weight models the most. Claude already saw the docs.
Each cell shows the average score lift (Lighthouse ON − OFF) for one model on one role. Azure = Lighthouse wins; coral = Lighthouse loses; empty cell = combination not run yet. Models and roles are ordered by their total lift across the matrix.
| Role | Gemini 2.5 Pro | Kimi K2.6 | Qwen3 Coder | DeepSeek Chat v3.1 | Qwen3.6 Plus | GPT-OSS 120B | Claude Sonnet 4.6 | Mistral Large 2411 | GPT-5.5 | GPT-5.2 | Llama 3.3 70B Instruct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Self-heal | +17.0 | — | +25.0 | +1.8 | +1.5 | — | -1.7 | -3.8 | +7.8 | +0.8 | -0.4 |
| Planning | +12.2 | +15.5 | +3.2 | +0.6 | +4.0 | — | +6.2 | +10.0 | +0.4 | -0.8 | -2.8 |
| Developer | +15.0 | +0.0 | +13.0 | +4.8 | +2.5 | — | +3.3 | -1.0 | -0.2 | -1.0 | -2.8 |
| Product manager | +12.6 | +0.0 | +10.8 | +19.8 | +2.0 | — | -0.3 | -2.6 | -7.8 | +0.0 | -1.4 |
| DevOps | +12.5 | +0.0 | -8.5 | -6.5 | -1.5 | — | +10.5 | -3.5 | +5.0 | +5.5 | -6.0 |
| Clarification | -11.5 | +13.0 | -2.0 | -4.5 | -3.0 | +0.5 | +2.8 | -1.5 | +0.0 | +0.5 | +9.5 |
| Validation | +6.5 | — | +0.5 | -2.5 | +2.5 | — | -0.2 | -4.0 | +0.0 | -4.5 | -3.0 |
| Reviewer | +3.0 | — | +0.0 | +3.0 | +2.0 | — | +0.5 | +0.0 | -1.5 | -5.5 | -11.0 |
| Designer | -1.5 | +0.0 | -2.5 | -2.5 | -2.5 | — | +3.2 | -2.5 | +0.0 | -5.0 | -3.0 |
| Decomposition | +6.0 | +17.0 | -3.4 | +3.0 | -1.5 | — | -8.9 | +3.0 | -1.0 | +2.0 | -5.6 |
Hover any cell to see sample size and win count. Empty cells mean that model hasn't been run on that role yet — the harness is backfilling them as compute frees up.
Methodology
The exact protocol, end to end.
Roles. Ten engineering roles from the Ship process: developer, devops, designer, planning, product manager, reviewer, self‑heal, validation, clarification, decomposition. Each role has its own system prompt — short, opinionated, scoped.
Tasks. Roughly twenty tasks per role, mixing canonical questions ("what is the right way to…") with practical ones ("here is a stuck PR, what now"). Tasks are stable across runs so cache hits dominate input cost.
A vs B. Same model, same role prompt, same task. A runs cold — the model answers from its training memory. B gets a short tool‑use block plus the library_search MCP tool and may call it zero or more times before answering.
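One A/B pair can be sketched in a few lines of Node. The names here (runPair, callModel) are illustrative stand-ins, not the harness's actual API:

```javascript
// Sketch of one A/B pair. `callModel` is a hypothetical stand-in for
// whatever client the harness uses; only the tool list differs between runs.
async function runPair(model, rolePrompt, task, callModel) {
  // A: cold — the model answers from training memory alone.
  const a = await callModel({ model, system: rolePrompt, user: task, tools: [] });

  // B: identical, plus the library_search tool. The model may call it
  // zero or more times before producing its final answer.
  const b = await callModel({
    model,
    system: rolePrompt,
    user: task,
    tools: ["library_search"],
  });

  return { task, answerA: a.text, answerB: b.text };
}
```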
Scorer. A Claude Sonnet judge scores both answers on four 0‑10 axes: specificity, citation, actionability, factual accuracy. It then picks an overall winner — A, B, or tie — with a one‑sentence rationale.
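The totaling is simple arithmetic: four 0‑10 axes give each answer a 0‑40 total, so the lift (B minus A) is bounded at ±40. A minimal sketch, with field names assumed rather than taken from the actual judge schema:

```javascript
// Hypothetical shape of one judged run; field names are assumed.
const AXES = ["specificity", "citation", "actionability", "accuracy"];

// Sum the four 0-10 axes into a 0-40 total.
function totalScore(scores) {
  return AXES.reduce((sum, axis) => sum + scores[axis], 0);
}

// Lift = Lighthouse-ON total minus Lighthouse-OFF total, in [-40, +40].
function lift(run) {
  return totalScore(run.b) - totalScore(run.a);
}

// Example run: A totals 20, B totals 29, so the lift is +9.
const exampleRun = {
  a: { specificity: 6, citation: 2, actionability: 5, accuracy: 7 },
  b: { specificity: 8, citation: 9, actionability: 7, accuracy: 5 },
};
```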
Lighthouse content. The Library was seeded from the same per‑role source list any project would use: official framework docs, RFCs, the team's internal handbook. No task‑specific cheating — the librarian never saw the question before answering.
Honesty. Every run lands as JSON under tools/eval/agent_bench/. The aggregated numbers on this page are computed from that directory at build time. New runs land, the page updates.
Where Lighthouse wins
Six tasks where the receipts make the case.
Largest score gains across the benchmark. Click through to the raw JSON for the full prompts, answers, and judge rationale.
Self-heal · GPT-5.5
+36.0
Debug: slow Postgres query
Answer A is empty/blank. Answer B provides a comprehensive, actionable diagnosis plan with specific SQL queries, red flags, and structured steps.
Decomposition · Kimi K2.6
+34.0
Apply MECE to feature breakdown
Answer A is empty. Answer B provides a complete, well-structured MECE decomposition with named sub-features, explicit boundary statements, and a formal MECE validation check.
Planning · Gemini 2.5 Pro
+33.0
Risk register for OAuth migration
Answer A is incomplete/truncated with no content. Answer B delivers a full, specific risk register with RFC citations, concrete mitigations, clear owners, and highlighted stop-the-line risks.
Self-heal · Qwen3 Coder
+32.0
Debug: slow Postgres query
Answer A is a JSON noop with no diagnostic content. Answer B provides detailed SQL queries, specific steps, and rollback procedures directly addressing the task.
Self-heal · Gemini 2.5 Pro
+31.0
Flaky-test triage
Answer A refused the task entirely. Answer B provides a detailed, accurate, actionable checklist with code examples covering the three most common Playwright flakiness causes.
Self-heal · Qwen3 Coder
+31.0
Blameless postmortem template
Answer A is a JSON noop with no postmortem content. Answer B delivers a complete, detailed blameless postmortem with timeline, root cause, contributing factors, and actionable items.
Where Lighthouse loses
The honest column.
Tasks where Lighthouse‑ON scored worse than Lighthouse‑OFF. The pattern: retrieval pulls noise into work that didn't need retrieval — generative or free‑form prompts where prior context actively misdirected the model. We don't hide these. They are why the page exists.
Decomposition · Claude Sonnet 4.6
-34.0
WBS: billing system
Answer B is empty/missing. Answer A provides detailed WBS, schema, Stripe event names, API routes, and component breakdown — highly specific, cited, actionable, and accurate.
Decomposition · Claude Sonnet 4.6
-33.0
WBS: billing system
Answer B is essentially empty/cut off with no content. Answer A provides detailed WBS, schema, API endpoints, architecture, and named integrations with high specificity and actionability.
Decomposition · Claude Sonnet 4.6
-33.0
WBS: user-onboarding flow
Answer B contains no content. Answer A delivers a complete WBS with epics, schema details, risk tables, dependency ordering, MVP flags, and open questions — highly specific and actionable.
Product manager · GPT-5.5
-30.0
PRD: AI-suggested code review
Answer B contains no content. Answer A is comprehensive, specific, and actionable with clear metrics, prior art analysis, model criteria, and kill criteria.
Decomposition · Claude Sonnet 4.6
-29.0
WBS: user-onboarding flow
Answer B contains no substantive content. Answer A delivers a complete WBS with epics, API contracts, DB schema, risk analysis, and MVP flags — highly specific and actionable.
Decomposition · Claude Sonnet 4.6
-28.0
WBS: migrate to OAuth 2.0
Answer B is truncated/empty. Answer A provides detailed WBS sections, schema DDL, component table, and named risks with concrete implementation details.
Raw data
Every run is downloadable. Every aggregation is reproducible.
The numbers on this page are computed from 376 JSON files published in the Lighthouse repo. Each file holds the prompt, the two answers, the judge's score rubric, and the rationale. The aggregation script is a single Node file — clone the repo, run make evals, and you get the same numbers we ship.
Aggregation snapshot generated at 2026-05-15T14:35:27.270Z. Re-runs welcome.
Open source · launching this week
The cheapest way to lift weaker models is to give them a memory.
If your stack runs Gemini, DeepSeek, Qwen, or Kimi — Lighthouse buys you score points on the four DORA‑adjacent axes that matter for agent code review. If it runs Claude or GPT‑5, you might still want it for citation and audit trail — but the lift is marginal.