Lighthouse · effectiveness report

Updated 2026-05-15

Does Lighthouse actually make models better?

We ran the same prompts through 11 frontier models, twice — once with a cold context, once with Lighthouse retrieving. Same prompt, same model, same scorer. Across 376 side-by-side runs, here's where it helped, where it didn't, and the raw runs you can download.

The bottom line

Where Lighthouse moves the needle.

6 / 10 · Engineering roles improved (out of ten benchmarked roles)
9 / 11 · Frontier models lifted (Gemini, Kimi, Qwen, DeepSeek lead)
376 · Side-by-side tasks (every result downloadable as JSON)
+1.88 · Average score lift on the 40-point scale (174 of 376 task wins for Lighthouse)

Per‑role lift

The pattern: retrieval helps grounded roles. It doesn't help free‑form generation.

Bars show Lighthouse‑ON minus Lighthouse‑OFF total score, averaged across every model and task in the role. Self‑heal, planning, and developer roles see the largest gains; decomposition and designer regress. Six of ten roles improved.

Self-heal        +4.63
Planning         +4.45
Developer        +3.59
Product manager  +3.08
DevOps           +2.38
Clarification    +0.73
Validation       -0.45
Reviewer         -0.77
Designer         -0.83
Decomposition    -1.93

Total score is the sum of four 0‑10 axes (specificity, citation, actionability, accuracy). Max lift on this scale is +40, max regression is −40.
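For concreteness, the arithmetic behind each bar is small enough to sketch. A minimal TypeScript sketch, assuming per-answer judge scores shaped like the four axes above (the type and function names are illustrative, not the harness's actual code):

    // One judged answer: the four 0-10 axes the judge scores.
    type AxisScores = {
      specificity: number;
      citation: number;
      actionability: number;
      accuracy: number;
    };

    // Total score for one answer: sum of the four axes, 0-40.
    const total = (s: AxisScores): number =>
      s.specificity + s.citation + s.actionability + s.accuracy;

    // Per-task lift: Lighthouse ON minus Lighthouse OFF, so -40..+40.
    const lift = (off: AxisScores, on: AxisScores): number => total(on) - total(off);

    // A role's bar: the per-task lifts averaged across every model and task in the role.
    const roleLift = (pairs: Array<{ off: AxisScores; on: AxisScores }>): number =>
      pairs.reduce((sum, p) => sum + lift(p.off, p.on), 0) / pairs.length;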

Role × model matrix

Lighthouse benefits open‑weight models the most. Claude already saw the docs.

Each cell shows the average score lift (Lighthouse ON − OFF) for one model on one role. Positive cells are Lighthouse wins, negative cells are Lighthouse losses, and an empty cell means that combination hasn't been run yet. Models and roles are ordered by their total lift across the rest of the matrix.

Models (columns, in order): Gemini 2.5 Pro, Kimi K2.6, Qwen3 Coder, DeepSeek Chat v3.1, Qwen3.6 Plus, GPT-OSS 120B, Claude Sonnet 4.6, Mistral Large 2411, GPT-5.5, GPT-5.2, Llama 3.3 70B Instruct

Self-heal        +17.0  +25.0  +1.8   +1.5   -1.7   -3.8   +7.8   +0.8   -0.4
Planning         +12.2  +15.5  +3.2   +0.6   +4.0   +6.2   +10.0  +0.4   -0.8   -2.8
Developer        +15.0  +0.0   +13.0  +4.8   +2.5   +3.3   -1.0   -0.2   -1.0   -2.8
Product manager  +12.6  +0.0   +10.8  +19.8  +2.0   -0.3   -2.6   -7.8   +0.0   -1.4
DevOps           +12.5  +0.0   -8.5   -6.5   -1.5   +10.5  -3.5   +5.0   +5.5   -6.0
Clarification    -11.5  +13.0  -2.0   -4.5   -3.0   +0.5   +2.8   -1.5   +0.0   +0.5   +9.5
Validation       +6.5   +0.5   -2.5   +2.5   -0.2   -4.0   +0.0   -4.5   -3.0
Reviewer         +3.0   +0.0   +3.0   +2.0   +0.5   +0.0   -1.5   -5.5   -11.0
Designer         -1.5   +0.0   -2.5   -2.5   -2.5   +3.2   -2.5   +0.0   -5.0   -3.0
Decomposition    +6.0   +17.0  -3.4   +3.0   -1.5   -8.9   +3.0   -1.0   +2.0   -5.6

Per-cell sample sizes and win counts are in the raw runs. Empty cells mean that model hasn't been run on that role yet — the harness is backfilling them as compute frees up.

Methodology

The exact protocol, end to end.

Roles. Ten engineering roles from the Ship process: developer, devops, designer, planning, product manager, reviewer, self‑heal, validation, clarification, decomposition. Each role has its own system prompt — short, opinionated, scoped.

Tasks. Roughly twenty tasks per role, mixing canonical questions ("what is the right way to…") with practical ones ("here is a stuck PR, what now"). Tasks are stable across runs so cache hits dominate input cost.

A vs B. Same model, same role prompt, same task. A runs cold — the model answers from its training memory. B gets a short tool‑use block plus the library_search MCP tool. B may call the tool zero to N times before answering.
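The exact harness wiring lives in the repo; the sketch below only shows the shape of the difference between the two arms. Everything except the tool name library_search is an assumption, written in the style of an MCP tool declaration:

    // Illustrative tool declaration for the B arm. The A arm gets no tools at all.
    const librarySearchTool = {
      name: "library_search",
      description:
        "Search the Lighthouse Library for passages relevant to the current task. " +
        "Returns the top matching excerpts with their source documents.",
      inputSchema: {
        type: "object",
        properties: {
          query: { type: "string", description: "What to look up" },
          maxResults: { type: "number", description: "How many excerpts to return" },
        },
        required: ["query"],
      },
    };

    // B may call the tool zero or more times before answering; each result comes
    // back as an ordinary tool response in the transcript.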

Scorer. A Claude Sonnet judge scores both answers on four 0-10 axes: specificity, citation, actionability, factual accuracy. It then picks an overall winner — A, B, or tie — with a one-sentence rationale.
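The judge's output reduces to a small structured verdict. An illustrative TypeScript shape (field names are ours, not necessarily the repo's; AxisScores is the four-axis type sketched earlier):

    // Illustrative shape of one judge verdict.
    interface JudgeVerdict {
      scoresA: AxisScores;        // answer A (Lighthouse OFF), four 0-10 axes
      scoresB: AxisScores;        // answer B (Lighthouse ON), four 0-10 axes
      winner: "A" | "B" | "tie";  // overall pick
      rationale: string;          // one-sentence justification
    }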

Lighthouse content. The Library was seeded from the same per-role source list any project would use: official framework docs, RFCs, the team's internal handbook. No task-specific cheating — the Library never saw the benchmark questions ahead of time.

Honesty. Every run lands as JSON under tools/eval/agent_bench/. The aggregated numbers on this page are computed from that directory at build time. New runs land, the page updates.
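Put together, a run file is just the inputs plus that verdict. An illustrative shape (the published JSON under tools/eval/agent_bench/ is the source of truth; its field names may differ):

    // Illustrative shape of one benchmark run file.
    interface BenchRun {
      role: string;           // e.g. "self-heal"
      model: string;          // e.g. "gemini-2.5-pro"
      task: string;           // the prompt both arms received
      answerA: string;        // Lighthouse OFF answer
      answerB: string;        // Lighthouse ON answer
      verdict: JudgeVerdict;  // axis scores, winner, rationale (sketched above)
    }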

Where Lighthouse wins

Six tasks where the receipts make the case.

Largest score gains across the benchmark. Click through to the raw JSON for the full prompts, answers, and judge rationale.

Self-heal · GPT-5.5 · Debug: slow Postgres query · +36.0
Answer A is empty/blank. Answer B provides a comprehensive, actionable diagnosis plan with specific SQL queries, red flags, and structured steps.
Score: OFF 0 → ON 36

Decomposition · Kimi K2.6 · Apply MECE to feature breakdown · +34.0
Answer A is empty. Answer B provides a complete, well-structured MECE decomposition with named sub-features, explicit boundary statements, and a formal MECE validation check.
Score: OFF 0 → ON 34

Planning · Gemini 2.5 Pro · Risk register for OAuth migration · +33.0
Answer A is incomplete/truncated with no content. Answer B delivers a full, specific risk register with RFC citations, concrete mitigations, clear owners, and highlighted stop-the-line risks.
Score: OFF 2 → ON 35

Self-heal · Qwen3 Coder · Debug: slow Postgres query · +32.0
Answer A is a JSON noop with no diagnostic content. Answer B provides detailed SQL queries, specific steps, and rollback procedures directly addressing the task.
Score: OFF 2 → ON 34

Self-heal · Gemini 2.5 Pro · Flaky-test triage · +31.0
Answer A refused the task entirely. Answer B provides a detailed, accurate, actionable checklist with code examples covering the three most common Playwright flakiness causes.
Score: OFF 2 → ON 33

Self-heal · Qwen3 Coder · Blameless postmortem template · +31.0
Answer A is a JSON noop with no postmortem content. Answer B delivers a complete, detailed blameless postmortem with timeline, root cause, contributing factors, and actionable items.
Score: OFF 2 → ON 33

Where Lighthouse loses

The honest column.

Tasks where Lighthouse‑ON scored worse than Lighthouse‑OFF. The pattern: retrieval pulls noise into work that didn't need retrieval — generative or free‑form prompts where prior context actively misdirected the model. We don't hide these. They are why the page exists.

Decomposition · Claude Sonnet 4.6 · WBS: billing system · -34.0
Answer B is empty/missing. Answer A provides detailed WBS, schema, Stripe event names, API routes, and component breakdown — highly specific, cited, actionable, and accurate.
Score: OFF 34 → ON 0

Decomposition · Claude Sonnet 4.6 · WBS: billing system · -33.0
Answer B is essentially empty/cut off with no content. Answer A provides detailed WBS, schema, API endpoints, architecture, and named integrations with high specificity and actionability.
Score: OFF 35 → ON 2

Decomposition · Claude Sonnet 4.6 · WBS: user-onboarding flow · -33.0
Answer B contains no content. Answer A delivers a complete WBS with epics, schema details, risk tables, dependency ordering, MVP flags, and open questions — highly specific and actionable.
Score: OFF 33 → ON 0

Product manager · GPT-5.5 · PRD: AI-suggested code review · -30.0
Answer B contains no content. Answer A is comprehensive, specific, and actionable with clear metrics, prior art analysis, model criteria, and kill criteria.
Score: OFF 34 → ON 4

Decomposition · Claude Sonnet 4.6 · WBS: user-onboarding flow · -29.0
Answer B contains no substantive content. Answer A delivers a complete WBS with epics, API contracts, DB schema, risk analysis, and MVP flags — highly specific and actionable.
Score: OFF 33 → ON 4

Decomposition · Claude Sonnet 4.6 · WBS: migrate to OAuth 2.0 · -28.0
Answer B is truncated/empty. Answer A provides detailed WBS sections, schema DDL, component table, and named risks with concrete implementation details.
Score: OFF 32 → ON 4

Raw data

Every run is downloadable. Every aggregation is reproducible.

The numbers on this page are computed from 376 JSON files published in the Lighthouse repo. Each file holds the prompt, the two answers, the judge's score rubric, and the rationale. The aggregation script is a single Node file — clone the repo, run make evals, and you get the same numbers we ship.
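As a sanity check, the per-role lifts reduce to a few lines of Node. A minimal re-aggregation sketch, assuming files shaped like the illustrative BenchRun interface from the methodology section (the repo's own script is the one to trust):

    import { readdirSync, readFileSync } from "node:fs";
    import { join } from "node:path";

    const dir = "tools/eval/agent_bench";
    const liftsByRole = new Map<string, number[]>();

    // Total score for one answer: sum of the four 0-10 axes.
    const total = (s: Record<string, number>) =>
      s.specificity + s.citation + s.actionability + s.accuracy;

    for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
      const run = JSON.parse(readFileSync(join(dir, file), "utf8"));
      const lift = total(run.verdict.scoresB) - total(run.verdict.scoresA);
      if (!liftsByRole.has(run.role)) liftsByRole.set(run.role, []);
      liftsByRole.get(run.role)!.push(lift);
    }

    for (const [role, lifts] of liftsByRole) {
      const avg = lifts.reduce((a, b) => a + b, 0) / lifts.length;
      console.log(`${role}: ${avg >= 0 ? "+" : ""}${avg.toFixed(2)}`);
    }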

Aggregation snapshot generated at 2026-05-15T14:35:27.270Z. Re-runs welcome.

Open source · launching this week

The cheapest way to lift weaker models is to give them a memory.

If your stack runs Gemini, DeepSeek, Qwen, or Kimi — Lighthouse buys you score points on the four DORA‑adjacent axes that matter for agent code review. If it runs Claude or GPT‑5, you might still want it for citation and audit trail — but the lift is marginal.