Product · Whafr
Open source · Apache-2.0Run your own grounding layer.
Whafr is the retrieval stack behind Lighthouse — ingest pipeline, chunker, summariser, BM25 + pgvector hybrid retrieval, cross-encoder reranker, MCP server, admin panel. Apache-2.0. The engine ships empty; you bring the corpus, you run the crons, you decide what's canonical.
~50 MB image · runs on a $20/mo VPS · Postgres + pgvector required
What's in the box
Four components, one container, one Postgres.
No external services to glue. No vendor lock on the embedding model. Bring your own keys for the LLM summary worker (OpenRouter or local).
Ingest pipeline
Sitemap + trafilatura, hash-delta
Crawler that knows how to read sitemaps, RSS feeds, GitHub repos. Hashes content for delta-ingest — re-runs are cheap. Role-tagged chunkers split paragraphs the way an agent will retrieve them.
Indexer
BM25 + pgvector, hybrid
BM25 over weighted tsvector (summary + keywords A, tags B, content C) and pgvector cosine on chunk embeddings. Async summary worker via OpenRouter or a local LFM2. Same Postgres handles both.
Reranker
Cross-encoder on top-N
Reciprocal-rank-fusion merges BM25 and vector candidates, then a cross-encoder reranks the top of the list. Pluggable model — drop in your favourite reranker.
MCP server
search · fetch_source
Two MCP tools out of the box, token-gated and rate-limited per subject. Built-in admin panel for source review and ingest debugging. Plug any MCP-aware agent in over HTTP or stdio.
Whafr · the engine
Ships emptyYou bring the sources, you run the crons.
The same retrieval stack that powers Lighthouse, packaged for self-hosting. Private repos, internal runbooks, vendor docs your compliance team won't let leave the boundary. Your hardware, your schedule, your taste in what's canonical.
Lighthouse · the hosted instance
Live · curatedSame engine, our corpus, in your tool picker.
71K chunks of canonical SDLC reference — IETF RFCs, OWASP, NIST SP-800, framework docs, methodology pages. 14K sources, 21 role recipes, refreshed by our ingest crons. Plug your MCP client in and stop re-hallucinating axios v0.21.
See LighthouseDeploy
Three steps to a running instance.
From git clone to a searching MCP endpoint, on a single $20/mo box.
Clone the repo and pull the image.
git clone https://github.com/ElMundiUA/whafr docker pull ghcr.io/elmundiua/lighthouse:latest
Provision Postgres with pgvector and point the engine at it.
export DATABASE_URL=postgres://... docker run -e DATABASE_URL -p 8000:8000 ghcr.io/elmundiua/lighthouse
Drop YAML source manifests into ./recipes/ and run the ingest.
docker exec lighthouse python -m lighthouse.ingest --role security
Bring your own corpus
A recipe is a YAML manifest. The engine reads it.
The engine doesn't pick what's canonical for your team — you do. A recipe lists sources (URLs, sitemaps, RSS feeds, GitHub repos), a role tag, and a refresh cadence. The ingest cron crawls each source, hashes the result, chunks new content, embeds and summarises, writes to Postgres. Recipes are versioned with the engine; the hosted Lighthouse uses the same format internally.
# recipes/security.yaml role: security refresh: weekly sources: - https://owasp.org/sitemap.xml - https://csrc.nist.gov/publications.xml - https://www.rfc-editor.org/rfc/rfc6749.html - https://www.rfc-editor.org/rfc/rfc7636.html - https://www.rfc-editor.org/rfc/rfc9110.html chunker: paragraph embed: text-embedding-3-large summary: openrouter:meta-llama/llama-3.1-70b-instruct
recipes/security.yaml — one of 21 role manifests shipping in the repo.
Apache-2.0
Use commercially, modify, redistribute, sublicense.
No patent-grant gotchas. No CLA required for trivial fixes. Read the LICENSE.
LICENSE on GitHubCommunity + by-arrangement
Issues and PRs on GitHub. Anything more — email.
Stand-up help, source curation, private-corpus consulting — we scope it as a one-off engagement and quote a fixed price.
hello@harborgang.comA note on naming
Whafr is the engine brand, and github.com/ElMundiUA/whafr is the repo. The Python package and Docker image still ship as lighthouse / lighthouse-kg — the hosted SaaS pulls them by that name, so the package rename waits for a coordinated release. Snippets on this page use the shipping names.
Keep going
GitHub
Read the source
Apache-2.0. Issues, PRs, releases — the engine lives in the open.
Open
Hosted
Lighthouse — the curated instance
Don't want to run ingest? Plug into our hosted graph with one MCP config block.
Open
Deploy guide
Full deployment notes
Postgres tuning, embedding model choices, reranker swap, cron layout.
Open
Pair with
Ship — the loop
Run the ticket-to-merged-PR cadence that produces the questions your graph answers.
Open
Open source · Apache-2.0
Self-host the same retrieval stack that grounds our coding agents.
The repo has the deployment guide, the source code, and 21 role recipes you can steal whole. Lighthouse is the hosted variant if you'd rather skip the ingest cron.