Product · Whafr

Open source · Apache-2.0

Run your own grounding layer.

Whafr is the retrieval stack behind Lighthouse — ingest pipeline, chunker, summariser, BM25 + pgvector hybrid retrieval, cross-encoder reranker, MCP server, admin panel. Apache-2.0. The engine ships empty; you bring the corpus, you run the crons, you decide what's canonical.

~50 MB image · runs on a $20/mo VPS · Postgres + pgvector required

What's in the box

Four components, one container, one Postgres.

No external services to glue. No vendor lock-in on the embedding model. Bring your own keys for the LLM summary worker (OpenRouter or local).

Ingest pipeline

Sitemap + trafilatura, hash-delta

A crawler that reads sitemaps, RSS feeds, and GitHub repos. Content is hashed for delta ingest, so re-runs are cheap. Role-tagged chunkers split paragraphs the way an agent will retrieve them.
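A minimal sketch of what hash-delta ingest can look like, assuming pages are keyed by URL and compared by a SHA-256 digest of whitespace-normalised text. The function names and the normalisation step are illustrative assumptions, not Whafr's actual code.

```python
import hashlib


def content_hash(text: str) -> str:
    """Digest of the extracted text; whitespace is collapsed first (an assumption)
    so that purely cosmetic reflows do not count as changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_delta(fetched: dict, known_hashes: dict) -> dict:
    """Sort freshly fetched pages into new / changed / unchanged.

    fetched:      url -> extracted text from this crawl
    known_hashes: url -> digest stored by the previous run
    """
    plan = {"new": [], "changed": [], "unchanged": []}
    for url, text in fetched.items():
        digest = content_hash(text)
        if url not in known_hashes:
            plan["new"].append(url)
        elif known_hashes[url] != digest:
            plan["changed"].append(url)
        else:
            plan["unchanged"].append(url)
    return plan
```

Only the `new` and `changed` URLs would need chunking, embedding, and summarising again, which is what keeps a re-run cheap.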

Indexer

BM25 + pgvector, hybrid

BM25 over weighted tsvector (summary + keywords A, tags B, content C) and pgvector cosine on chunk embeddings. Async summary worker via OpenRouter or a local LFM2. Same Postgres handles both.
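For readers who want to picture the two query paths, here is a hedged sketch of the SQL involved, held as Python strings with psycopg-style named parameters. The table name `chunks` and its columns (`search_vec`, `embedding`, `summary`, `keywords`, `tags`, `content`) are hypothetical, and Postgres's built-in `ts_rank_cd` stands in for BM25 scoring, which core Postgres does not provide. The real Whafr schema and ranking function may differ.

```python
# Hypothetical schema: chunks(id, summary, keywords, tags, content, search_vec, embedding).

# Weighted tsvector: summary + keywords get weight A, tags B, content C.
WEIGHTED_TSVECTOR_SQL = """
UPDATE chunks SET search_vec =
    setweight(to_tsvector('english',
        coalesce(summary, '') || ' ' || coalesce(keywords, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(tags, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(content, '')), 'C')
WHERE id = %(id)s
"""

# Lexical candidates. ts_rank_cd is a stand-in here, not true BM25.
LEXICAL_SQL = """
SELECT id, ts_rank_cd(search_vec, websearch_to_tsquery('english', %(q)s)) AS score
FROM chunks
WHERE search_vec @@ websearch_to_tsquery('english', %(q)s)
ORDER BY score DESC
LIMIT %(k)s
"""

# Vector candidates. pgvector's <=> operator is cosine distance,
# so 1 - distance gives cosine similarity.
VECTOR_SQL = """
SELECT id, 1 - (embedding <=> %(vec)s::vector) AS score
FROM chunks
ORDER BY embedding <=> %(vec)s::vector
LIMIT %(k)s
"""


def to_pgvector_literal(values) -> str:
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

Both queries run against the same database, and their ranked id lists are what the fusion step downstream consumes.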

Reranker

Cross-encoder on top-N

Reciprocal-rank-fusion merges BM25 and vector candidates, then a cross-encoder reranks the top of the list. Pluggable model — drop in your favourite reranker.
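Reciprocal rank fusion is small enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of chunk ids and uses the commonly cited constant k = 60. The constant Whafr actually uses is not stated here, so treat both the value and the function name as assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked id lists into one.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Ids that appear in both lists rise to the top. The head of `fused` is what a cross-encoder would then rescore against the query text.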

MCP server

search · fetch_source

Two MCP tools out of the box, token-gated and rate-limited per subject. Built-in admin panel for source review and ingest debugging. Plug any MCP-aware agent in over HTTP or stdio.
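One way to picture "token-gated and rate-limited per subject" is a token lookup followed by a per-subject token bucket. The sketch below is illustrative only: the class and function names, the bucket parameters, and the in-memory storage are all assumptions rather than Whafr's implementation.

```python
import hmac
import time
from typing import Optional


def subject_for_token(presented: str, issued: dict) -> Optional[str]:
    """Map a presented bearer token to its subject, or None if unknown.
    issued is token -> subject; comparison is constant-time per token."""
    for token, subject in issued.items():
        if hmac.compare_digest(token.encode("utf-8"), presented.encode("utf-8")):
            return subject
    return None


class SubjectRateLimiter:
    """Token bucket per subject: `capacity` calls of burst, refilled over time."""

    def __init__(self, capacity: int, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self._buckets = {}  # subject -> (tokens_left, last_seen_timestamp)

    def allow(self, subject: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(subject, (float(self.capacity), now))
        # Refill for the time elapsed since the last call, capped at capacity.
        tokens = min(float(self.capacity), tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1.0:
            self._buckets[subject] = (tokens - 1.0, now)
            return True
        self._buckets[subject] = (tokens, now)
        return False
```

A request handler would call `subject_for_token` first, reject unknown tokens, and then call `allow(subject)` before running `search` or `fetch_source`.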

Whafr · the engine

Ships empty

You bring the sources, you run the crons.

The same retrieval stack that powers Lighthouse, packaged for self-hosting. Private repos, internal runbooks, vendor docs your compliance team won't let leave the boundary. Your hardware, your schedule, your taste in what's canonical.

Lighthouse · the hosted instance

Live · curated

Same engine, our corpus, in your tool picker.

71K chunks of canonical SDLC reference — IETF RFCs, OWASP, NIST SP-800, framework docs, methodology pages. 14K sources, 21 role recipes, refreshed by our ingest crons. Plug your MCP client in and stop re-hallucinating axios v0.21.

See Lighthouse

Deploy

Three steps to a running instance.

From git clone to a searchable MCP endpoint, on a single $20/mo box.

1

Clone the repo and pull the image.

git clone https://github.com/ElMundiUA/whafr
docker pull ghcr.io/elmundiua/lighthouse:latest

2

Provision Postgres with pgvector and point the engine at it.

export DATABASE_URL=postgres://...
docker run -d --name lighthouse -e DATABASE_URL -p 8000:8000 ghcr.io/elmundiua/lighthouse

3

Drop YAML source manifests into ./recipes/ and run the ingest.

docker exec lighthouse python -m lighthouse.ingest --role security

Full deployment guide →

Bring your own corpus

A recipe is a YAML manifest. The engine reads it.

The engine doesn't pick what's canonical for your team — you do. A recipe lists sources (URLs, sitemaps, RSS feeds, GitHub repos), a role tag, and a refresh cadence. The ingest cron crawls each source, hashes the result, chunks new content, embeds and summarises, writes to Postgres. Recipes are versioned with the engine; the hosted Lighthouse uses the same format internally.

# recipes/security.yaml
role: security
refresh: weekly
sources:
  - https://owasp.org/sitemap.xml
  - https://csrc.nist.gov/publications.xml
  - https://www.rfc-editor.org/rfc/rfc6749.html
  - https://www.rfc-editor.org/rfc/rfc7636.html
  - https://www.rfc-editor.org/rfc/rfc9110.html
chunker: paragraph
embed: text-embedding-3-large
summary: openrouter:meta-llama/llama-3.1-70b-instruct

recipes/security.yaml — one of 21 role manifests shipping in the repo.
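A manifest like the one above could be validated before ingest with something along these lines. This is a sketch under stated assumptions: that all six keys shown in the example are required, and that the `summary` value is a `provider:model` pair split on the first colon. `Recipe` and `recipe_from_mapping` are hypothetical names, not the engine's API.

```python
from dataclasses import dataclass

# Assumption: every key in the example manifest is mandatory.
REQUIRED_KEYS = ("role", "refresh", "sources", "chunker", "embed", "summary")


@dataclass(frozen=True)
class Recipe:
    role: str
    refresh: str
    sources: tuple
    chunker: str
    embed: str
    summary_provider: str
    summary_model: str


def recipe_from_mapping(data: dict) -> Recipe:
    """Validate an already-parsed manifest and return a typed Recipe."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("recipe is missing keys: " + ", ".join(missing))
    sources = data["sources"]
    if not isinstance(sources, list) or not sources:
        raise ValueError("recipe needs a non-empty list of sources")
    # Assumption: summary is written as "<provider>:<model>".
    provider, sep, model = str(data["summary"]).partition(":")
    if not sep or not model:
        raise ValueError("summary must look like 'provider:model'")
    return Recipe(
        role=data["role"],
        refresh=data["refresh"],
        sources=tuple(sources),
        chunker=data["chunker"],
        embed=data["embed"],
        summary_provider=provider,
        summary_model=model,
    )
```

With PyYAML installed, the mapping would come from `yaml.safe_load(open("recipes/security.yaml"))`. Validation is kept separate from parsing so it can be tested without touching the filesystem.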

Apache-2.0

Use commercially, modify, redistribute, sublicense.

Apache-2.0 includes an explicit patent grant from contributors. No CLA required for trivial fixes. Read the LICENSE.

LICENSE on GitHub

Community + by-arrangement

Issues and PRs on GitHub. Anything more — email.

Stand-up help, source curation, private-corpus consulting — we scope it as a one-off engagement and quote a fixed price.

hello@harborgang.com

A note on naming

Whafr is the engine brand, and github.com/ElMundiUA/whafr is the repo. The Python package and Docker image still ship as lighthouse / lighthouse-kg — the hosted SaaS pulls them by that name, so the package rename waits for a coordinated release. Snippets on this page use the shipping names.

Self-host the same retrieval stack that grounds our coding agents.

The repo has the deployment guide, the source code, and 21 role recipes you can steal whole. Lighthouse is the hosted variant if you'd rather skip the ingest cron.