Lighthouse · effectiveness report

Run 1 · May 2026 · 11 models · 10 roles · 376 tasks · Frozen permalink

Does Lighthouse actually make models better?

We ran the same prompts through 11 frontier models, twice — once with a cold context, once with Lighthouse retrieving. Same prompt, same model, same scorer. Across 376 side-by-side runs, here's where it helped, where it didn't, and the broken methodology we caught along the way.

The bottom line

Lighthouse moves the needle — most when the model needed it most.

7 / 11

Models improved

after methodology cleanup

6 / 10

Engineering roles lifted

self-heal, planning, dev, PM, DevOps, clarification

+10.7

Gemini 2.5-Pro gain

the biggest winner — quarter of the score range

-2.6

Llama 3.3-70B regression

the model that gets worse with retrieval

The model panel, ranked

Seven wins, three ties, one loss. The shape of the panel.

Each bar is one model. The number is how many quality points Lighthouse added (or removed) on a 0‑40 scale. Gemini 2.5-Pro gained +10.7 — a quarter of the entire score range. Llama 3.3-70B got worse by 2.6 points; the retrieved context apparently overwhelms its working memory.

Score delta per model — Lighthouse ON minus OFF, sorted descending

The pattern

Lighthouse helps weak models most.

The worse a model is on its own, the more retrieval helps it: each additional point of baseline quality costs roughly half a point of Lighthouse lift. Frontier models — GPT-5.2, Claude Sonnet 4.6 — sit at the ceiling and gain almost nothing; a weak model doesn't know the prioritization framework is called "RICE" or that the user-story checklist is called "INVEST" until retrieval surfaces those terms.

Scatter plot: baseline score (A) on X, Lighthouse lift (B−A) on Y. Negative trend line: weak models gain more.
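To make that slope concrete: the trend line on the scatter is just an ordinary least-squares fit of lift against baseline score. A minimal sketch, with made-up data points chosen only to show the shape:

```typescript
// Ordinary least-squares fit of lift = intercept + slope * baseline.
// The page's claim is a slope of roughly -0.5: every extra point of
// baseline quality costs about half a point of Lighthouse lift.
interface Point { baseline: number; lift: number }

function fitTrend(points: Point[]): { slope: number; intercept: number } {
  const n = points.length;
  const meanX = points.reduce((s, p) => s + p.baseline, 0) / n;
  const meanY = points.reduce((s, p) => s + p.lift, 0) / n;
  let cov = 0;
  let varX = 0;
  for (const p of points) {
    cov += (p.baseline - meanX) * (p.lift - meanY);
    varX += (p.baseline - meanX) ** 2;
  }
  const slope = cov / varX;
  return { slope, intercept: meanY - slope * meanX };
}

// Illustrative points only (not the run's real per-model numbers);
// they happen to fit a slope of about -0.5.
console.log(fitTrend([
  { baseline: 14, lift: 9 },
  { baseline: 20, lift: 6 },
  { baseline: 27, lift: 3 },
  { baseline: 34, lift: -1 },
]));
```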

Practical reading: if your stack runs Gemini, DeepSeek, Qwen, or Kimi, Lighthouse buys you measurable lift. If it runs Claude or GPT-5, lift is marginal — but you still get citation, audit trail, and ground-truth on framework versions.

What it costs

+$0.014/task, +4.4s latency. Qwen3.6 and Mistral got faster.

We were braced for "retrieval makes everything 2× more expensive." It doesn't. Average overhead lands at ~25% on cost and ~4.4 seconds on wall time. Two pleasant surprises in the panel: Qwen3.6 and Mistral-Large actually became cheaper and faster with Lighthouse — retrieval stops the model from wandering and the response shortens.

Cost overhead and latency overhead per model — A vs B paired bars.

Role × model heat map

The full panel, cell by cell.

Each cell shows the average score lift (Lighthouse ON − OFF) for one model on one role. Azure = Lighthouse wins; coral = Lighthouse loses; empty cell = combination not run yet. Models and roles ordered by total lift.

Columns (models, ordered by total lift): Gemini 2.5-Pro, Kimi K2.6, Qwen3 Coder, DeepSeek Chat v3.1, Qwen3.6 Plus, GPT-OSS 120B, Claude Sonnet 4.6, Mistral Large 2411, GPT-5.5, GPT-5.2, Llama 3.3-70B Instruct.

Rows (score lift per cell, left to right; rows with fewer than eleven values skip combinations not yet run):

Self-heal: +17.0, +25.0, +1.8, +1.5, -1.7, -3.8, +7.8, +0.8, -0.4
Planning: +12.2, +15.5, +3.2, +0.6, +4.0, +6.2, +10.0, +0.4, -0.8, -2.8
Developer: +15.0, +0.0, +13.0, +4.8, +2.5, +3.3, -1.0, -0.2, -1.0, -2.8
Product manager: +12.6, +0.0, +10.8, +19.8, +2.0, -0.3, -2.6, -7.8, +0.0, -1.4
DevOps: +12.5, +0.0, -8.5, -6.5, -1.5, +10.5, -3.5, +5.0, +5.5, -6.0
Clarification: -11.5, +13.0, -2.0, -4.5, -3.0, +0.5, +2.8, -1.5, +0.0, +0.5, +9.5
Validation: +6.5, +0.5, -2.5, +2.5, -0.2, -4.0, +0.0, -4.5, -3.0
Reviewer: +3.0, +0.0, +3.0, +2.0, +0.5, +0.0, -1.5, -5.5, -11.0
Designer: -1.5, +0.0, -2.5, -2.5, -2.5, +3.2, -2.5, +0.0, -5.0, -3.0
Decomposition: +6.0, +17.0, -3.4, +3.0, -1.5, -8.9, +3.0, -1.0, +2.0, -5.6

Hover any cell for sample size and win count. Empty cells = model hasn't been run on that role yet; the harness backfills as compute frees up.

Methodology

The exact protocol, end to end.

Roles. Ten engineering roles from the Ship process: developer, devops, designer, planning, product manager, reviewer, self-heal, validation, clarification, decomposition. Each role has its own system prompt — short, opinionated, scoped.

Loop. Each task is solved twice per model in a four-stage agentic loop: plan → execute → self-review → finalize. Mode A has no tools. Mode B gets a search tool against the Lighthouse graph and may call it zero or more times before answering.
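A minimal sketch of that paired loop, in the spirit of the harness rather than its actual code; Stage, Tool, and StageRunner are illustrative names, and the model call is left as a caller-supplied function:

```typescript
// Paired run: same prompt, same model, four-stage agentic loop.
type Stage = "plan" | "execute" | "self-review" | "finalize";
const STAGES: Stage[] = ["plan", "execute", "self-review", "finalize"];

interface Tool { name: string; call(query: string): Promise<string> }

// Stand-in for the real model call; the actual harness wires an LLM client here.
type StageRunner = (stage: Stage, context: string, tools: Tool[]) => Promise<string>;

async function solveTask(run: StageRunner, prompt: string, tools: Tool[]): Promise<string> {
  let context = prompt;
  for (const stage of STAGES) {
    // Each stage sees the accumulated context and may use the provided tools.
    context += `\n\n[${stage}]\n` + (await run(stage, context, tools));
  }
  return context;
}

// One benchmark pair: identical prompt and model, with and without retrieval.
async function runPair(run: StageRunner, prompt: string, lighthouse: Tool) {
  const answerA = await solveTask(run, prompt, []);            // Mode A: cold context, no tools
  const answerB = await solveTask(run, prompt, [lighthouse]);  // Mode B: Lighthouse search available
  return { answerA, answerB };
}
```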

Scorer. A Claude Sonnet judge scores both answers on four 0‑10 axes (specificity, citation, actionability, factual accuracy), then picks an overall winner — A, B, or tie — with a one-sentence rationale.
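The judge's output for each pair reduces to a small record. A sketch of that shape and of the 0‑40 roll-up the deltas on this page are computed from (field names are illustrative, not the scorer's real schema):

```typescript
// One judge verdict per A/B pair. Four 0-10 axes per answer; the 0-40
// totals are what the score deltas on this page are computed from.
interface AxisScores {
  specificity: number;
  citation: number;
  actionability: number;
  factualAccuracy: number;
}

interface Verdict {
  scoreA: AxisScores;         // Lighthouse OFF
  scoreB: AxisScores;         // Lighthouse ON
  winner: "A" | "B" | "tie";
  rationale: string;          // one-sentence judge rationale
}

const total = (s: AxisScores): number =>
  s.specificity + s.citation + s.actionability + s.factualAccuracy;

// Score delta as charted: Lighthouse ON (B) minus OFF (A), range -40..+40.
const delta = (v: Verdict): number => total(v.scoreB) - total(v.scoreA);
```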

Library content. Seeded from per-role source lists: official framework docs, RFCs, the team's internal handbook. No task-specific cheating — the librarian who seeded it never saw the benchmark questions.
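A per-role source list might look roughly like this; the entries are hypothetical, not the repo's actual seed files:

```typescript
// Hypothetical per-role seed lists; the real files live in the Lighthouse repo.
// Each role mixes official framework docs, RFCs, and internal handbook pages.
const librarySources: Record<string, string[]> = {
  "product-manager": [
    "https://example.com/rice-prioritization",  // framework docs (placeholder URL)
    "handbook/prioritization.md",               // internal handbook
  ],
  planning: [
    "https://www.rfc-editor.org/rfc/rfc6749",   // OAuth 2.0 RFC
    "handbook/risk-register.md",
  ],
};
```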

DSQ (disqualification) filter. Reasoning models (GPT-5.x, Kimi K2.6) burn most of their token budget on hidden chain-of-thought. We disqualify pairs where either side returned an empty visible answer; affected tasks are re-run with a 4× larger budget. See the "What we got wrong" section below for what this fixed.
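A sketch of the disqualification rule; the empty/degenerate heuristic here is an assumption, the real check lives in the harness:

```typescript
// Disqualify any pair where either visible answer is empty or degenerate,
// e.g. a reasoning model that spent its whole budget on hidden
// chain-of-thought. DSQ'd tasks are re-queued with a 4x token budget
// instead of being scored as a tie between two empty answers.
interface PairResult {
  taskId: string;
  answerA: string;
  answerB: string;
  maxTokens: number;
}

// Heuristic stand-in for "empty / degenerate"; the real check is the harness's.
const isDegenerate = (answer: string): boolean => {
  const a = answer.trim();
  return a.length === 0 || /^\{\s*"noop"/.test(a);
};

function applyDsqFilter(results: PairResult[]) {
  const scored: PairResult[] = [];
  const rerun: PairResult[] = [];
  for (const r of results) {
    if (isDegenerate(r.answerA) || isDegenerate(r.answerB)) {
      rerun.push({ ...r, maxTokens: r.maxTokens * 4 }); // 4x budget on the re-run
    } else {
      scored.push(r);
    }
  }
  return { scored, rerun };
}
```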

Reproducibility. Every run lands as JSON at tools/eval/agent_bench. Re-run the aggregator to regenerate the JSON the page reads.
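The aggregation step is essentially: read every result JSON, group by model (or model × role for the heat map), and average the ON − OFF deltas. A compressed sketch under that assumption; totalA and totalB are assumed field names, and the real script is the Node file in the repo:

```typescript
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Minimal view of one result file. Field names are assumptions; the real
// files also carry the prompt, both answers, and the judge's rationale.
interface EvalRecord {
  model: string;
  role: string;
  totalA: number;  // Lighthouse OFF, 0-40
  totalB: number;  // Lighthouse ON, 0-40
}

function aggregate(dir: string) {
  const byModel = new Map<string, { sum: number; n: number }>();
  for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
    const rec: EvalRecord = JSON.parse(readFileSync(join(dir, file), "utf8"));
    const entry = byModel.get(rec.model) ?? { sum: 0, n: 0 };
    entry.sum += rec.totalB - rec.totalA;  // per-task lift
    entry.n += 1;
    byModel.set(rec.model, entry);
  }
  // Average lift per model, sorted descending: the bar chart at the top.
  return [...byModel.entries()]
    .map(([model, { sum, n }]) => ({ model, avgLift: sum / n }))
    .sort((a, b) => b.avgLift - a.avgLift);
}

console.log(aggregate("tools/eval/agent_bench"));
```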

What we got wrong the first time

The bench was lying to us. Here's how we caught it.

Our first pass scored "reasoning" models — GPT-5 and Kimi — as tying with Lighthouse on tasks where, in fact, both sides had returned empty answers. The models exhausted their token budget on hidden chain-of-thought before producing any visible output, and the judge scored two empty answers as a tie. Roughly 37% of GPT-5.5's first pass was empty-vs-empty.

We added a filter for empty / degenerate outputs, increased the token budget on affected tasks, and re-ran them. The numbers below are the corrected ones; the chart shows how much the picture changed.

DSQ filter impact — before vs after removing empty-vs-empty ties from the scoring.

Before → after the cleanup:

Models improved: 9 of 11 → 7 of 11
Average lift: +1.5 pts → +1.88 pts
Llama 3.3-70B: neutral → −2.6 pts

The Llama line is the most credibility-buying one on this page. The first run said retrieval is neutral on Llama. After the cleanup, it's clear: retrieval makes Llama worse. We could have hidden that. We didn't.

Where Lighthouse wins

Six tasks where the receipts make the case.

Largest score gains across the benchmark. Click through to the raw JSON for the full prompts, answers, and judge rationale.

Self-heal · GPT-5.5

+36.0

Debug: slow Postgres query

Answer A is empty/blank. Answer B provides a comprehensive, actionable diagnosis plan with specific SQL queries, red flags, and structured steps.

Score: OFF 0 → ON 36

Decomposition · Kimi K2.6

+34.0

Apply MECE to feature breakdown

Answer A is empty. Answer B provides a complete, well-structured MECE decomposition with named sub-features, explicit boundary statements, and a formal MECE validation check.

Score: OFF 0 → ON 34

Planning · Gemini 2.5-Pro

+33.0

Risk register for OAuth migration

Answer A is incomplete/truncated with no content. Answer B delivers a full, specific risk register with RFC citations, concrete mitigations, clear owners, and highlighted stop-the-line risks.

Score: OFF 2 → ON 35

Self-heal · Qwen3 Coder

+32.0

Debug: slow Postgres query

Answer A is a JSON noop with no diagnostic content. Answer B provides detailed SQL queries, specific steps, and rollback procedures directly addressing the task.

Score: OFF 2 → ON 34

Self-heal · Gemini 2.5-Pro

+31.0

Flaky-test triage

Answer A refused the task entirely. Answer B provides a detailed, accurate, actionable checklist with code examples covering the three most common Playwright flakiness causes.

Score: OFF 2 → ON 33

Self-heal · Qwen3 Coder

+31.0

Blameless postmortem template

Answer A is a JSON noop with no postmortem content. Answer B delivers a complete, detailed blameless postmortem with timeline, root cause, contributing factors, and actionable items.

Score: OFF 2 → ON 33

Where Lighthouse loses

The honest column.

Tasks where Lighthouse-ON scored worse than Lighthouse-OFF. The pattern: retrieval pulls noise into work that didn't need it — generative or free-form prompts where prior context actively misdirected the model. We don't hide these. They are why the page exists.

Decomposition · Claude Sonnet 4.6

-34.0

WBS: billing system

Answer B is empty/missing. Answer A provides detailed WBS, schema, Stripe event names, API routes, and component breakdown — highly specific, cited, actionable, and accurate.

Score: OFF 34 → ON 0

Decomposition · Claude Sonnet 4.6

-33.0

WBS: billing system

Answer B is essentially empty/cut off with no content. Answer A provides detailed WBS, schema, API endpoints, architecture, and named integrations with high specificity and actionability.

Score: OFF 35 → ON 2

Decomposition · Claude Sonnet 4.6

-33.0

WBS: user-onboarding flow

Answer B contains no content. Answer A delivers a complete WBS with epics, schema details, risk tables, dependency ordering, MVP flags, and open questions — highly specific and actionable.

Score: OFF 33 → ON 0

Product manager · GPT-5.5

-30.0

PRD: AI-suggested code review

Answer B contains no content. Answer A is comprehensive, specific, and actionable with clear metrics, prior art analysis, model criteria, and kill criteria.

Score: OFF 34 → ON 4

Decomposition · Claude Sonnet 4.6

-29.0

WBS: user-onboarding flow

Answer B contains no substantive content. Answer A delivers a complete WBS with epics, API contracts, DB schema, risk analysis, and MVP flags — highly specific and actionable.

Score: OFF 33 → ON 4

Decomposition · Claude Sonnet 4.6

-28.0

WBS: migrate to OAuth 2.0

Answer B is truncated/empty. Answer A provides detailed WBS sections, schema DDL, component table, and named risks with concrete implementation details.

Score: OFF 32 → ON 4

Raw data

Every run is downloadable. Every aggregation is reproducible.

The numbers on this page are computed from 376 JSON files published in the Lighthouse repo. Each file holds the prompt, the two answers, the judge's score rubric, and the rationale. The aggregation script is a single Node file — clone the repo, run make evals, and you get the same numbers we ship.

Aggregation snapshot generated at 2026-05-15T14:35:27.270Z. Run 1 is frozen at this URL forever; future runs publish at /lighthouse/evals/runs/v2, v3, etc.

Open source · launching this week

The cheapest way to lift weaker models is to give them a memory.

If your stack runs Gemini, DeepSeek, Qwen, or Kimi — Lighthouse buys you score points on the four DORA-adjacent axes that matter for agent code review. If it runs Claude or GPT-5, you still get citation, audit, and ground truth on framework versions — the lift is marginal but the receipts are not.

Two long-form reports of this run: plain-English · methodology