The book — Ship — Harbor Gang

Tip — Need setup, not chapters? Product setup and operational checklist live in Getting started and Docs. This page is the long narrative — motivation, trade-offs, and vocabulary — for readers who want the logic before or after wiring.

This page is Harbor Gang as a book-shaped essay: you can read it on a long flight or in about fifty short sittings — a Prologue, a Preface, forty numbered chapters, and a scattered set of lettered sub-chapters that cover the gearbox and repeatable mastery (10.A–B), the right skills (18.A), morning metrics (20.A), prompt evals (22.A), the improvement loop (25.A–C), regulated-vertical overlays (28.A), cost and envelope (31.A), agent PR review (33.A), and onboarding (35.A), followed by a closing Manifesto before the Vocabulary. Each chapter is one idea — roughly three to five minutes of reading at a normal pace — written as a short essay, not a bullet list disguised as wisdom. Underneath the tone, the rules are the same ones we ship in production, and every new passage that was added in the 2026 edition is anchored to real work in the reference org; the public story starts at Use cases.

Jump (major parts): Prologue · Preface · The idea · The system · The right skills · Running the loop · The improvement loop · Trust & boundaries · Rolling it out · When things break · The Ship Manifesto · Vocabulary

For filenames, product setup, knowledge, and operating checklists, use Getting started and Docs. For the public reference deployment, start at Use cases → ElMundi. This book keeps the why; the docs keep the how.

Prologue

The night the agent shipped nothing

The commit that opens this book is not spectacular. It is dated 2026-04-15 and its subject line is almost apologetic: fix(ci): SDLC scheduled slot must not skip on odd UTC hour. Eight insertions, thirteen deletions, one workflow file. Nothing in it would survive a conference stage. And yet it is the most honest commit in the repository, because it records the morning the system looked green and had done nothing.

The mistake, once you see it, is almost too neat. The scheduled workflow used to route the day between roles by asking the runner, is it an even UTC hour? If it was, the "developer" role picked a ticket and ran. If not, a different role took the slot. The logic was clean on a whiteboard and, for months, close enough to true that nobody had reason to question it. Then GitHub Actions did what GitHub Actions sometimes does: it delivered a cron intended for 05:00 UTC at 05:08 UTC instead, because scheduled runs queue behind other tenants and arrive when the platform can afford them. Five past the hour is still an even hour. Eight minutes past is not. The guard looked at the wall clock, saw an odd hour, cleared the role, and exited with a triumphant green check mark. The developer role that was meant to run never even looked at Linear. No ticket was picked. No branch was cut. No PR was opened. The dashboard did not light up, because the job had not failed — it had politely refused to begin.

The operator on call that morning did not discover this by staring at the failure, because there was no failure. They discovered it the way people discover most serious problems in agentic systems: by noticing an absence. The backlog was unchanged. The "today" view in the tracker was empty. The main branch had its nightly commit from the release-check workflow and nothing else. No incident had happened. No pages had fired. The system had simply decided, on the evidence of an eight-minute queue delay, that it was somebody else's turn, and gone back to sleep. A human had to open the workflow file, trace github.event.schedule to its guard, and realise the automation they had trusted for weeks had been answering a question the real world could not be relied upon to ask cleanly.

We open the book here because almost every lesson that follows is a variation of this one. A capable model did not hallucinate a library. A prompt did not go rogue. A secret did not leak. None of the dramatic failures that haunt the pitch decks happened. What happened is that a tiny, reasonable assumption — scheduled hours arrive on the hour — met the physics of a system someone else owns, and the assumption lost. The repair was not to add another bot. It was to change the shape of the contract: route by minute tolerance, not by hour parity; resolve the role from UTC on the runner, not from github.event.schedule; assume delivery will be late, and make late still valid. The workflow that followed was five lines shorter than the one it replaced (eight insertions against thirteen deletions) and a great deal harder to earn.
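The shape of that repair can be sketched in a few lines. This is a minimal illustration, not the reference org's workflow: the slot table, the ninety-minute tolerance, and the name resolve_role are all assumptions made for the example. It shows only the change of question, from "is the wall-clock hour even?" to "how late is this run relative to a start time we intended?"

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical schedule: intended UTC start (hour, minute) -> role.
# These values are illustrative, not the reference org's real slots.
SLOTS = {(4, 0): "developer", (10, 0): "reviewer"}
TOLERANCE = timedelta(minutes=90)  # assumed upper bound on queue delay


def resolve_role(now: datetime, slots=SLOTS, tolerance=TOLERANCE) -> Optional[str]:
    """Return the role whose intended start is at most `tolerance` before `now`.

    `now` must be timezone-aware. Late delivery stays valid inside the
    tolerance; anything outside it resolves to None.
    """
    now = now.astimezone(timezone.utc)
    best = None
    for (hour, minute), role in slots.items():
        intended = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if intended > now:
            # The most recent occurrence of this slot was yesterday.
            intended -= timedelta(days=1)
        lateness = now - intended
        if lateness <= tolerance and (best is None or lateness < best[0]):
            best = (lateness, role)
    return best[1] if best else None
```

Under these assumed slots, a run delivered 68 minutes late still resolves to the role it was meant for. A run that matches no slot returns None, and the caller should treat that as a loud failure, not a green exit.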

This is what Ship is about. It is not about replacing engineers with charismatic automation. It is about running a shop where the quiet failures — the scheduled job that skipped, the prompt that drifted, the label that meant one thing yesterday and another today — are made legible before they become folklore. The clock in this story is how Ship worked then; the scar it left is part of why Ship eventually dropped the delivery clock entirely and moved to event-driven dispatch, a turn a later chapter explains in full. The rest of this book is a long, slow argument for that legibility, built out of real scars from a real monorepo, written for the kind of operator who does not want to learn these lessons twice.

Preface

Who this is for, and how to read it

This book is for the person who has been handed a team, a budget, and a promise about "AI in the SDLC," and who suspects — correctly — that the promise is larger than the plan. It is for the engineering manager who has to explain to finance why more autonomous actors did not produce more merged changes. It is for the platform engineer who has been asked to wire a coding agent into the pipeline without breaking the release train. It is for the security officer who already knows the question nobody else has asked, and who would like it answered on paper. It is not, at its core, for the executive who needs a slide; there are other documents for that, and a slide cannot carry this weight.

The product now starts from a product owner and a workspace, but that is not a promise of autopilot. In this book, the product owner is another name for the human who owns intent: priority, risk, evidence, and the final decision. The operator language remains because someone still has to notice when a quiet system has done nothing, but the responsibility is the same sentence: humans decide; machines act inside fences.

The manual is built as a sequence of short essays, each about one idea. You can read it straight through on a long flight, or pick at it one chapter a day for a month. The order is not arbitrary — the idea comes before the system, which comes before the skills and knowledge, which comes before running the loop — but nothing stops you from jumping to the chapter that matches tonight's incident. A duplicate-PR problem is probably waiting for you in chapters 5, 18.A, and 25.C. A scheduler that lies is in the prologue, chapter 19, and the field note at the end of chapter 40. The cross-links are there because the failures rhyme.

We draw our scars from a specific monorepo: ElMundiUA/elmundi, the reference org where Ship first had to survive contact with real traffic. The repository itself is private and the SHAs are not yours to click; what we keep in the book is the author date, the subject line, and enough diff shape to argue the lesson — the parts that survive an audit even when the URL behind them does not. Where a chapter tells a story as a scene — the night the agent shipped nothing, the afternoon fifteen identical fix commits landed in a row — it is a compressed retelling of events that happened, not a parable. The public version of that reference story lives in Use cases → ElMundi. If a passage ever loses its anchor, we would rather delete the passage than keep the anchor silent.

One promise about the tone. This is not a handbook of best practices. Best practices age badly, and the word "best" invites contempt from anyone who has been awake at the wrong hour. What we offer instead is a memoir of operators: things we built, things we broke, things we learned to write down so the next team would not have to relearn them in the same order. If a chapter sounds slow, that is usually because the failure it describes was slow, and writing the failure any faster would be a lie. Read it anyway. The next on-call shift will thank you.

The idea

Chapter 1 — Quiet mode

We have all sat in the room where the demo wins. Someone shows a model that answers questions, opens pull requests, and posts cheerful summaries. The lights are bright. The narrative is heroic. The room applauds motion. Then Monday arrives, and the same system quietly becomes a tax: duplicate PRs against the same file, labels that mean one thing in the morning and another by afternoon, preview environments probed until someone admits they cannot tell whether a failure was the app, the harness, or a hallucinated endpoint. The demo was loud. Production is a whisper you still have to hear.

We built Ship to be quiet on purpose. Quiet is not modesty; it is legibility. A quiet system does fewer surprising things. It leaves traces you can follow without a séance. When something breaks, you want the failure to be boring—an invariant violated, a contract unmet, a step that refused to advance because the evidence was missing—not a creative improvisation that felt brilliant for thirty seconds and expensive for three days. Predictability is the kindness we owe the humans who will own the outcome after the presenter has left the stage.

The cruel joke of agentic tooling is how often we blame the wrong villain. We say the model was stupid, or the prompt was thin, or the temperature was high, as if the failure were a personality flaw. In practice, most wounds are specification wounds. An under-specified system fails the way a bridge fails when nobody agreed which load it must carry. The model will fill the vacuum with plausible motion. It will open another PR because nobody said “one change set per concern.” It will invent a label because the taxonomy was never pinned. It will poke a preview URL because “check the deployment” sounded like enough instruction until it wasn’t. The failure is not mysticism. It is missing guardrails stated in a form a machine can obey and a human can audit.

We have the war stories, and they all rhyme. Duplicate PRs land like junk mail: each message plausible alone, together a denial-of-service on reviewers’ attention. Label drift turns your board into folklore—everyone nods at the words while meaning diverges in private. Preview probes become a ritual of hope: refresh, wait, guess whether green means safe or merely “the script succeeded.” These are not edge cases. They are what happens when excitement outruns definition. The flashy demo celebrates the exception; operations live in the mean.

We are careful about what we measure, because the wrong dashboard manufactures the wrong behavior. Metrics that celebrate motion—PRs opened, comments posted, tasks “touched”—reward theater. They teach the system to be busy. Accountability metrics look different. They ask whether the change matched the intent, whether the evidence was attached, whether the trail holds up when someone angry reads it at midnight. We would rather see a small number of disciplined actions than a fireworks show of ambiguous progress.

So we choose quiet. We prefer narrow interfaces and explicit states over charismatic improvisation. We want the agent’s work to read like a well-run shop: lights on, doors labeled, a logbook by the register. When Ship runs, it should feel less like a talent show and more like infrastructure—predictable enough to trust, traceable enough to fix, boring enough to scale. The demo will always be loud. The work that ships is the part you can hear when the room is empty.

Chapter 2 — The loud demo trap

There is a moment in every organization that adopts “AI for engineering” when someone shows a demo and the room goes quiet in the wrong way. Not the quiet of focus. The quiet of relief mixed with fear. Relief because something finally moved without a committee meeting. Fear because nobody can quite say who moved it, on whose behalf, or whether it will happen again the same way tomorrow.

Demos are emotional events. They compress weeks of doubt into ninety seconds of apparent progress. The trap is not dishonesty; it is theater. A demo proves that a capable model can produce a plausible diff, a tidy summary, or a confident plan. It does not prove that your org can absorb that output at scale without breaking trust, ownership, or the ability to ship safely. Yet the demo wins the room anyway, because demos are loud and governance is soft-spoken. People clap. Budget follows applause. And then reality arrives in the form of small, cumulative failures that never look dramatic enough to stop the train.

The second act is bolt-on access. Someone gets a key, a seat, a plugin, a bot user, a shared service account. Access spreads faster than explanation. Teams celebrate “velocity” while the security and platform folks are still drawing diagrams. The organization has not decided how agents relate to humans in the merge graph; it has merely opened a door and called the breeze “innovation.” Bolt-on access feels generous in the moment and expensive forever, because every shortcut becomes a precedent. The next team expects the same exception. The next vendor assumes the same posture. Pretty soon “we have an AI workflow” means “we have a pile of integrations that nobody fully owns.”

Governance failure does not always look like a headline breach. More often it looks like boredom and confusion. Duplicate pull requests land with nearly identical intent because two automations, or two people with automations, chased the same ticket. The wrong change merges because the signal-to-noise ratio in review collapsed: reviewers skim, bots comment, status checks turn green, and the human story behind the change is missing. Afterward, when something misbehaves in production, nobody can reconstruct a clean narrative. Which automation opened the branch? Which rule approved it? Which human actually decided? The mystery is not mystical; it is operational debt. You traded legibility for speed, and speed without legibility is just panic with better branding.

This is where the fantasy of unbounded velocity collides with the physics of software organizations. You do not get infinite throughput by adding more autonomous actors. You get a wider front of concurrent mistakes unless you bound the system: who may act, what they may touch, how their actions are recorded, and how humans remain accountable for outcomes. The useful promise is not “more merges per hour.” It is auditable velocity—movement you can explain, reproduce, and defend. Auditable velocity is slower on a leaderboard and faster in a postmortem. It is the difference between a team that can say “we decided” and a team that can only say “something happened.”

“We decided” is a sentence worth protecting. It implies alignment, a recorded rationale, and a named owner. Chaos, by contrast, is a pile of outcomes searching for a story. Demos love chaos because chaos looks busy on a screen. Serious shipping requires the opposite instinct: smaller claims, clearer boundaries, and evidence that survives the demo room’s exit lights.

The loud demo trap ends the day you stop confusing spectacle with structure. Let the demo be a hint, not a mandate. Let access follow policy, not curiosity. Let every automated step leave a trail a human can read without a séance. Velocity is not the enemy; unowned velocity is. The goal is not to silence excitement—it is to make sure that when the excitement fades, you still know who decided, and why, and what actually moved.

Chapter 3 — Humans own intent

Software moves through a pipeline: ideas become backlog items, backlog items become pull requests, pull requests become merges, merges become production. That sequence is not neutral. It encodes who is allowed to mean something. If intent lives anywhere, it lives in the transitions—what we choose to prioritize, what we agree to ship, and what we accept as “good enough” to expose to users. Automation can accelerate each step, but it cannot substitute for the decision at each threshold. The machine can run the lane; it cannot own the finish line.

Treat automation as a dedicated lane, not as the highway. A lane has an on-ramp, a speed limit, and an exit. Backlog grooming, code review, and release approval stay in human hands because they are where trade-offs live: scope versus time, risk versus reward, debt versus velocity. Automation belongs where the rules are explicit, repeatable, and boring—formatting, tests, deployment to a staging environment, rollbacks triggered by clear signals. When you blur the boundary, you get the worst of both worlds: people disown outcomes (“the pipeline did it”) while still paying the cost of incidents and rework. A clean split keeps credit and blame where they belong.

The language you use in retrospectives is a quick diagnostic. Listen for “we decided to ship with that caveat” versus “it went out.” The first sentence has a subject. Someone named a risk, accepted it, or deferred it. The second sentence is weather. Things happened; nobody was home. Intent without a named owner decays into narrative convenience. If your postmortems sound like meteorology, you have already outsourced accountability to process and tooling. The fix is not more process diagrams. It is restoring the habit of saying who chose what, when, and on what evidence.

Backlog is where intent is born in a form the team can act on. A backlog item without a crisp problem statement and a definition of success is just a placeholder for anxiety. Merge is where intent is negotiated against reality: does this change do what we think, play nicely with the rest of the system, and respect the constraints we care about? Production is where intent meets consequence. Users do not experience your intentions; they experience what actually runs. Owning intent means owning those three moments explicitly—prioritization, integration, exposure—not pretending that continuous delivery erases the need for judgment.

Here is a practical test you can use this week. Pick any workflow that touches automation—CI, deployment, a bot that merges green builds, a policy that promotes artifacts. Ask two questions. First: who is allowed to say “ready for automation”? If the answer is “whoever opened the PR” or “the system when checks pass,” you have not named an owner; you have named a trigger. A trigger is not accountability. Someone with standing on the team—usually the engineer responsible for the change, sometimes a pair, sometimes a lead on a sensitive area—should be able to say, in plain language, that this change is appropriate to hand off to the automated lane. That utterance should be cheap when the change is small and deliberate when the change is large, but it should never be implicit.

Second: what field proves it? Not vibes, not green lights alone. A ticket or PR description should carry a single, auditable signal: a checkbox, a label, a short sentence in a required template—something a future you can grep. “Ready for automation: yes—owner: Alex—risk: low—rollback: feature flag X.” The field does not replace conversation; it makes the conversation recoverable. When something breaks at two in the morning, you want to read what the team believed at handoff, not reconstruct intent from Slack archaeology.
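A field you can grep is also a field you can parse. The sketch below assumes a one-line format with semicolon separators ("Ready for automation: yes; owner: Alex; risk: low; rollback: feature flag X"); the separator, the required keys, and the function name parse_handoff are illustrative choices, not a Ship template. Any fixed format would do, as long as it is written down.

```python
import re
from typing import Optional

# Keys that must be present and non-empty for the handoff to count.
REQUIRED = ("owner", "risk", "rollback")

# Assumed format: "Ready for automation: yes; owner: ...; risk: ...; rollback: ..."
HANDOFF_LINE = re.compile(
    r"^Ready for automation:\s*(yes|no)\s*;(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_handoff(description: str) -> Optional[dict]:
    """Extract the handoff field from a ticket or PR description.

    Returns None when the line is absent or incomplete, so a missing
    field can never be read as consent.
    """
    match = HANDOFF_LINE.search(description)
    if match is None:
        return None
    fields = {"ready": match.group(1).lower() == "yes"}
    for part in match.group(2).split(";"):
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip().rstrip(".")
    if any(not fields.get(name) for name in REQUIRED):
        return None
    return fields
```

The design choice that matters is the return value for a malformed line: None, not a best guess. Automation that proceeds only on a complete record keeps the owner's sentence recoverable later.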

Humans own intent so that when production misbehaves, you can improve judgment, not just scripts. Automation gets a lane so speed does not erase responsibility. Keep the words honest—decided, not happened—and keep a named voice and a visible proof at the boundary where work leaves human hands and enters the machine. That is how you ship fast without shipping blind.

Chapter 4 — Machines need fences

Before you buy another seat or wire another cron job, say the quiet part out loud. Scope is three plain questions, not a mood.

X — Which backlog? One product, one program, one slice of the board. If the answer is “whatever search returns,” you have already lost. Machines do not have taste; they have queries. Give them a project (or your tool’s equivalent) so “in scope” is a row in a config someone can audit, not a debate in stand-up.

Y — Where in the human workflow? States or columns are not decoration. They are traffic lights. Automation should only move work when the state matches a rule you could paste into an onboarding doc without blushing. “In progress” is not the same as “ready for a machine.” If you cannot draw that line on purpose, do not expect an agent to respect it by accident.

Z — What intent is attached? Labels (tags, custom fields — same job) carry meaning: ready for automation, blocked on a human, evidence attached, wrong lane. Pick a small vocabulary and treat a typo like a compiler error. Drift is not “culture”; it is a broken contract.

That triplet — X / Y / Z — is your scope sentence: only issues in this backlog, in this state, with this label shape may be touched by automation. If you cannot say it in one breath, you do not have a system. You have a chatbot with repository access and a calendar invite.

Fences are not insults to the model. They are interfaces. You would not let every microservice read every table. You would not ship a partner integration as POST /maybe-do-a-thing. An agent without fences is the same class of mistake with better marketing: huge read surface, fuzzy inputs, side effects nobody named. A fence is the wall between “what the organisation decided” and “what the model inferred from the last three Slack threads.” Good fences are boring on purpose. Boring is what lets you sleep.

Testable beats vibes. When fences are explicit — project IDs, state names, label strings you can grep — you can test them. You can fail a build when the board drifts. You can review a policy change like code. When fences are vibes — “we all know what green means,” “that column is basically ready” — you can only argue in Slack until someone mutes the thread. Vibes scale like gossip; tests scale like engineering. The difference shows up at two in the morning. Either your log says “skipped: state mismatch” and you fix a label, or your log says “it felt ready” and you fix trust. Pick the boring error message every time.
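The scope sentence and the boring error message fit in one small function. This is a sketch under stated assumptions: the Scope and Issue types, their field names, and the exact wording of the project and label messages are invented for illustration. Only the X / Y / Z triplet and the "skipped: state mismatch" message come from the text above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Scope:
    project: str            # X: which backlog
    states: FrozenSet[str]  # Y: where in the human workflow
    labels: FrozenSet[str]  # Z: what intent must be attached


@dataclass(frozen=True)
class Issue:
    key: str
    project: str
    state: str
    labels: FrozenSet[str]


def skip_reason(issue: Issue, scope: Scope) -> Optional[str]:
    """Return None when the issue is in scope, else a dull, greppable reason."""
    if issue.project != scope.project:
        return "skipped: project mismatch"
    if issue.state not in scope.states:
        return "skipped: state mismatch"
    missing = scope.labels - issue.labels
    if missing:
        return "skipped: missing label " + ", ".join(sorted(missing))
    return None
```

Because the check is a pure function of explicit strings, it can run in a unit test or a CI job against a snapshot of the board, which is what makes a fence testable instead of arguable.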

Label contracts are the smallest unit of trust. A label contract is the agreement that specific strings mean specific gates. Names like ready:developer are not aesthetics; they are legible to scripts. When the contract is clear, pick logic stays dull and trustworthy. When it drifts — someone renames a label, someone “cleans up” the board — green runs start lying. People stop believing the tracker. Automation becomes something you apologise for instead of something you rely on. Treat labels like enum values in a public API. Deprecate with a plan. Document the mapping. If the tracker will not enforce spelling, your automation must fail closed when the schema does not match.
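Treating labels like enum values can be made literal. In the sketch below, ready:developer is the name used above; blocked:human and evidence:attached are hypothetical additions, and gate_labels is an illustrative name. The assumption is that any label sharing a prefix with the contract belongs to a governed namespace, so a near-miss spelling is an error and not a new gate.

```python
from typing import FrozenSet, Iterable

# The contract: every label automation is allowed to interpret.
# Only "ready:developer" comes from the text; the others are hypothetical.
KNOWN_LABELS = frozenset({"ready:developer", "blocked:human", "evidence:attached"})


class LabelContractError(ValueError):
    """A label sits in a governed namespace but is not in the contract."""


def gate_labels(labels: Iterable[str], known: FrozenSet[str] = KNOWN_LABELS) -> FrozenSet[str]:
    """Return the contract labels present, or raise if the schema drifted."""
    governed = {name.split(":", 1)[0] for name in known}
    present = frozenset(labels)
    unknown = sorted(
        label
        for label in present
        if ":" in label
        and label.split(":", 1)[0] in governed
        and label not in known
    )
    if unknown:
        # Fail closed: a typo in a governed namespace stops the run.
        raise LabelContractError("unknown label(s): " + ", ".join(unknown))
    return present & known
```

Labels outside the governed namespaces pass through untouched, so teams can still tag freely. Only strings that look like gates are held to the contract.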

The tracker is your API schema. Between your organisation and automation, the issue tracker is not merely “where tickets live.” It is the request surface: projects, states, labels, links to pull requests and CI — the fields that say what may move, what already moved, and what evidence exists. That is not “metadata” in the sneering sense. It is the schema machines consume — and in Ship it is more than a schema, because a state transition in it is the dispatch signal itself. When the schema is sloppy, every consumer becomes a guesser. Guessers ship incidents. When the schema is tight, the dispatcher stays dumb in the good way: read the state and labels, resolve the stage, fire one routine, no heroics. The judgment stays with the people who own the product; the machinery enforces the fences that keep work legible.

Machines do not need freedom. They need clear edges. Draw X, Y, and Z. Write the label contract like you mean it. Let the board be the API you would not be ashamed to publish. Everything else is vibes — and vibes do not pass code review.

Note — Field note: Two fences in the reference org earned their keep. A 2026-04-07 commit titled "SDLC: Todo-only picks scoped to ElMundi pre-release" scoped automation to a single Linear project and forbade Backlog entry without human promotion. A 2026-03-24 commit titled "fail pick in CI if LINEAR_API_KEY missing" taught the old CI runner to fail closed when the credential it needed was not there. Both predate the event-driven turn — the mechanism they guarded (a CI pick) is gone, and the engine now holds the tracker credential server-side — but the fences survived the migration: the dispatcher still refuses any project it has not been scoped to, and still fails closed rather than guessing. One fence is about which tickets may enter; the other is about the conditions under which automation may begin at all. Both were cheaper than any incident they prevented.
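The second fence is small enough to show whole. This is a sketch of the fail-closed pattern only, not the commit itself: the function name, the exception type, and the message are assumptions, and the env parameter exists so the check can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional


class MissingCredential(RuntimeError):
    """Raised so the run fails visibly instead of exiting green."""


def require_credential(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the credential, or stop the run before any pick is attempted."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise MissingCredential(f"{name} is not set; refusing to pick a ticket")
    return value
```

An empty or whitespace-only value counts as missing, because a blank secret is the usual shape of a misconfigured runner.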

Chapter 5 — Throughput must be bounded

Everyone wants more throughput until throughput starts eating the team. Then we call it “velocity,” add another lane, and wonder why nothing ships cleanly. Unbounded throughput is not a strategy. It is a way to guarantee collisions, rework, and the quiet shame of a Todo column that never shrinks.

One role per window. Not one person forever—one delivery role with clear ownership for this slice of time. When two “owners” can touch the same outcome in the same window, you do not get twice the speed. You get negotiation tax, duplicated effort, and the polite fiction that someone else is driving. Pick who is carrying the ball now. Everyone else supports, reviews, or waits. That is not hierarchy cosplay. It is how you avoid two people “helpfully” solving the same problem in incompatible ways.

Visible queues. If work is invisible, it is infinite. Sticky notes in someone’s head do not count. A backlog that only exists in a chat thread is a trap: it feels light until you try to explain why five things are “almost done.” Make the queue legible—what is next, what is blocked, what is waiting on a human decision. A visible queue is embarrassing in the best way. It forces honesty about capacity.

Then come the sharp edges: branch races and duplicate pull requests. They do not happen because engineers are careless. They happen when ownership is fuzzy, naming drifts between tools, or two automated jobs both think they picked up the same ticket. You merge one branch and discover another branch with the same intent but a different prefix. The system is doing exactly what you told it: parallelize without a single source of truth for who and what this change is.

Note — Field note: We hit duplicate PRs and branch fights in a large monorepo when two jobs believed they owned the same ticket or naming drifted between workflows. The durable fix was never "smarter model." It was one delivery role per window plus a branch and title contract everyone actually follows.

Let us be emotionally honest about long Todo columns. They are not a sign of ambition. They are a museum of deferred disappointment. Each card is a little promise you made to Future You. Future You is tired. A long column whispers that you are “busy” without proving you are effective. Trimming the column—not hiding it, but actually deciding what will not happen this cycle—is an act of respect for the people doing the work and for the people waiting on outcomes.

Bound throughput on purpose. Fewer concurrent streams, explicit WIP limits, one driver per window, contracts that do not depend on a model “understanding” context. The goal is not to look busy. The goal is to finish things without tripping over your own pipeline. Smarter models cannot replace that discipline; they amplify whatever you already believe about ownership.
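One driver per window and an explicit WIP limit can both be expressed as a lease. The sketch below keeps leases in memory for illustration; a real system would need a shared store, and the class name, the refusal messages, and the time-to-live are all assumptions.

```python
from datetime import datetime, timedelta


class LeaseTable:
    """In-memory stand-in for whatever shared store would hold leases."""

    def __init__(self, wip_limit: int, ttl: timedelta):
        self.wip_limit = wip_limit
        self.ttl = ttl
        self._leases = {}  # ticket id -> (role, expiry)

    def _expire(self, now: datetime) -> None:
        # Drop leases whose window has closed so abandoned work frees capacity.
        self._leases = {
            ticket: (role, expiry)
            for ticket, (role, expiry) in self._leases.items()
            if expiry > now
        }

    def acquire(self, ticket: str, role: str, now: datetime) -> str:
        """Grant the ticket to one role for one window, or say why not."""
        self._expire(now)
        if ticket in self._leases:
            holder = self._leases[ticket][0]
            return f"refused: {ticket} already leased to {holder}"
        if len(self._leases) >= self.wip_limit:
            return "refused: WIP limit reached"
        self._leases[ticket] = (role, now + self.ttl)
        return "acquired"
```

The second caller for the same ticket gets a refusal with a name in it, which is the whole point: a collision becomes a log line instead of a duplicate branch.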

Throughput is a knob. Turn it up only when your queues, roles, and contracts can take the load. Until then, bounded throughput is not a limit on ambition. It is how ambition survives contact with a monorepo, a calendar, and a team that would like to go home knowing what “done” actually meant today.

Figure: Merged outcomes flatten while rework accelerates as WIP grows.

Chapter 6 — What you are actually buying

When teams evaluate “agent platforms” or “AI delivery,” the sales deck often shows a mascot, a chat window, and a promise that work will “just happen.” Under that gloss, the real product is easy to misunderstand. You are not procuring a mood board for the future of engineering. You are procuring a small set of durable mechanisms that turn intent into motion, motion into evidence, and evidence into something you can defend in a postmortem. The question is not whether the logo matches your brand guidelines; it is whether the system still makes sense when you erase every slide and draw five boxes on a whiteboard.

The useful framing is procurement of governance with execution, not procurement of autonomy as entertainment. A demo that impresses in a thirty-minute call is often optimized for novelty: a single happy path, a human quietly steering, a repository that was chosen because it is easy. Production reality is the opposite. Work is concurrent, priorities shift, credentials leak scope, and the organization will ask—correctly—who approved what, why that change landed, and what to revert if the hypothesis was wrong. If your purchase cannot answer those questions from first principles, you have bought a spotlight, not a stagehand.

What you are buying, then, is the ability to treat automated work like any other operational system: bounded, observable, and reversible. That implies explicit state, explicit rules about eligibility, explicit contracts for what “done” means in code terms, and explicit linkage between machine actions and human accountability structures (tickets, CI, review). Anything less collapses into “the bot did something,” which is technically true and organizationally useless.

This is also why “buying AI” is a category error in most engineering organizations. Models and prompts are ingredients. The product is the pipeline: how work is admitted, how it is scheduled, how it is executed under constraints, and how it leaves fingerprints. If you optimize only for model quality, you get sharper text and the same broken process. If you optimize for the pipeline, you can swap models and prompts without renegotiating trust with your org—because trust was never placed in a vibe; it was placed in records, checks, and traceability.

Another trap is confusing integration count with value. A vendor that boasts fifty connectors has not necessarily given you a system of record; it may have given you fifty ways to spray activity into Slack. Connectors matter when they feed a coherent ledger: what was attempted, what succeeded, what failed, and what must be true before the next attempt.

You should also be skeptical of purchases that outsource judgment without outsourcing responsibility. The organization still owns risk: security, compliance, customer impact, and the social contract inside the team. A serious offering encodes constraints as guards and gates that are visible, versioned, and testable.

Finally, consider time horizons. The first month is seductive because novelty covers gaps. The eighteenth month is when you discover whether you bought maintainability: can a new engineer understand why work moved, can you change rules without a rewrite? The durable asset is not charisma; it is a design that stays legible when the original champions rotate out.

In that light, “what you are actually buying” is not a single miracle component. It is a composition of roles that stay stable even as tools churn. The following table names those roles plainly—so you can test any pitch against them, not against the font choice in the PDF.

Piece	Job
Tracker	System of record for state and guards — what is allowed to move, and where work sits today. A state transition is the act-signal.
Dispatcher	Watches the tracker (a diff-based poller, ~300s) and, for each transition, resolves the FSM stage and fires the one routine that transition unlocked — bounded by leases, caps, and cascade limits, not a clock.
Agent	Executes a versioned skill against a branch under a contract: branch name, PR title and body, ticket comments, allowed tools. It commits its own work; the runner pushes, opens the PR, and reports the outcome.
Audit	Every automated touch ties to a ticket, a run, and a traceable comment so "what happened?" has an answer without Slack archaeology.
MCP edge	The front door. The operator drives Ship from their own agent over an OAuth 2.1 broker; the engine is exposed as an MCP server with domain tools (`ticket_*`, `project_*`, `run_subagent`, `run_workflow`, `dispatch_ticket`, `inbox_*`, `audit_search`). The operator's agent is the brain; Ship keeps the control plane.

Whiteboard test, not logos. If you can redraw those five boxes from memory, explain the arrows between them, and point to where your organization’s non-negotiables live inside that diagram, you are evaluating a real system. If you need the vendor’s slide to remember what you bought, you probably bought branding. The purchase that ages well is the one that still makes sense when the room is empty except for a marker, a board, and a teammate who was not in the demo.
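To make the Audit row concrete, here is a minimal sketch of a ledger in which every automated touch ties to a ticket, a run, and a comment. It is an illustration under assumptions, not Ship's implementation: `AuditEntry`, `AuditLog`, and `touches_for` are hypothetical names, the URLs are placeholders, and nothing here describes the interface of the real `audit_search` tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AuditEntry:
    """One automated touch; all field names are illustrative assumptions."""
    ticket: str    # tracker key, e.g. "ENG-12"
    run_url: str   # link to the CI run that acted
    comment: str   # the traceable note left on the ticket
    outcome: str   # "succeeded" or "failed"


@dataclass
class AuditLog:
    entries: list[AuditEntry] = field(default_factory=list)

    def record(self, entry: AuditEntry) -> None:
        # Refuse touches that cannot be traced back to a ticket and a run.
        if not entry.ticket or not entry.run_url:
            raise ValueError("every automated touch needs a ticket and a run")
        self.entries.append(entry)

    def touches_for(self, ticket: str) -> list[AuditEntry]:
        # "What happened?" is answered from the ledger, in recording order.
        return [e for e in self.entries if e.ticket == ticket]


log = AuditLog()
log.record(AuditEntry("ENG-12", "https://ci.example/runs/1", "opened PR", "succeeded"))
log.record(AuditEntry("ENG-13", "https://ci.example/runs/2", "tests failed", "failed"))
```

The design choice worth noticing is the refusal in `record`: an untraceable touch never enters the ledger, so the lookup never has to guess.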

Chapter 7 — What we refuse to optimize for

Some optimisations look like victories on a leaderboard and feel like apologies in a retrospective. We treat that gap as a compass. When a chart goes up and trust goes down, you have not improved the system; you have moved the pain to a quieter room. The job of a framework is partly to make those trades visible before you finance them. Refusing the wrong optimisations is how you keep enough attention left for the work that actually ships.

We refuse surprise work—and the accelerant we see most often is vacuuming the Backlog because it is always full, always visible, and tempting to treat as “ready enough.” Backlog is where intent is still half-formed: priorities argue, scope breathes, and “someday” masquerades as “next.” Automating picks from that column is not ambition; it is outsourcing triage to something that cannot suffer the consequences when the wrong card moves. The ticket that looked innocent on Tuesday can be a liability on Wednesday; only humans pacing the board carry that context. Ship keeps automation’s hands out of the wishlist on purpose. Eligibility belongs in a human-readable entry state—Todo, or your organisation’s equivalent—after someone has said, in effect, this may proceed. When you skip that transition to save a morning, you do not save time; you borrow it from the week someone must untangle intent from motion, and from the incident where nobody can explain why that work started at all.

We refuse hero agents. Heroics are overlapping runs that believe they own the same ticket, the same branch, or the same narrow window of reviewer attention. They feel productive because terminals scroll and notifications pile up. In practice they duplicate pull requests, fight locks, and train humans to distrust anything with an automated author. Scale does not come from cramming more courage into the same minute; it comes from schedules that do not step on each other, contracts for names and branches that grep cleanly, and pick logic dull enough to unit test. The romantic story is the bot that never sleeps. The operational story is the team that sleeps because only one delivery role wakes at a time, and because “busy” stopped being a proxy for “aligned.”

We refuse prompts that live only in a SaaS text box—the friendly editor that makes tweaking irresistibly easy and auditing impossible. If it is not in git, it is not reviewed like code; if it is not reviewed like code, it will drift the week you are on holiday and someone will “just fix the wording” without a trail. Prompts are not vibes; they are policy written in a language models actually read. That is why versioned prompts belong beside the repository they touch, with diffs your colleagues can argue about in daylight. When you want to improve how agents behave, the civilised path is the same as any other serious change: propose it, use the authoring guide with evidence, merge it, and let the schedule pick up a known version—not whatever was last pasted into a cloud console.

We refuse vanity throughput—the metric that glows when you ran the agent four hundred times and says nothing about how many outcomes were mergeable, auditable, and intended. Motion is cheap; aligned outcomes are expensive. Dashboards that reward touches teach the organisation to be busy on purpose, because busy reads as commitment to people who do not have to merge the result. Accountability asks a different question: did this change match the ticket, carry evidence, and leave a story a tired human can follow at midnight? We would rather ship fewer, legible steps than win a throughput trophy that disintegrates under the first serious incident review.

Refusal is not puritanism; it is attention budgeting. Every hour spent debugging a surprise pick is an hour not spent improving the product. Every retro spent divining bot intent is a retro not spent tightening fences. Every duplicate PR is a tax on reviewers who thought they were hired to judge design, not to referee twins. The organisation has a finite tolerance for mystery before it quietly routes around automation altogether. Saying no to shortcuts is the same instinct as refusing to pass credit card numbers in query strings—not because nobody thought of it, but because the convenience ages badly and the interest compounds in the worst meetings. We optimise for repeatability, traceability, and the kind of boredom that keeps teams married to their own systems after the demo room empties. Everything else can be someone else’s keynote.
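The gap between touches and aligned outcomes can be put in numbers. The sketch below is an illustration only: the `Run` fields and the `aligned_rate` helper are hypothetical, and the three flags simply mirror the "mergeable, auditable, and intended" test from the vanity-throughput paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Outcome flags for one agent run; the fields are illustrative."""
    mergeable: bool  # the change could be merged as delivered
    auditable: bool  # it ties back to a ticket, a run, and a comment
    intended: bool   # it matches what the ticket asked for


def aligned_rate(runs: list[Run]) -> float:
    """Share of runs that were mergeable, auditable, and intended."""
    if not runs:
        return 0.0
    aligned = sum(1 for r in runs if r.mergeable and r.auditable and r.intended)
    return aligned / len(runs)


# 400 touches, of which only 100 pass all three tests.
runs = [Run(True, True, True)] * 100 + [Run(True, False, True)] * 300
touches = len(runs)
rate = aligned_rate(runs)
```

A dashboard that counts touches reports 400 here; the aligned rate is 0.25. Both numbers are true, but only the second one says anything about what shipped.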

Chapter 8 — The wall of rules before the first run

The first run is not when you discover whether your agent setup works. It is when you find out whether you were honest about constraints, queues, and iteration. Most failures are not model failures. They are policy failures dressed up as “the AI did something weird.”

A final policy bolted on before the first run fails. If the last thing you do before pressing go is bolt on a giant rules file, you have not hardened the system; you have hidden the real shape of the work behind a curtain of text nobody will maintain. The wall of rules should exist before the first run, not as a panic patch after it. That means the rules are short enough to argue about, specific enough to enforce, and owned by someone who will actually change them when reality disagrees. A rule that cannot be violated in practice is not a rule; it is wallpaper.

Thin prompts. The prompt is not the place to store your entire engineering handbook. Thin prompts do one job: aim the model at the right task with the right success criteria and the right stop conditions. Everything else belongs in durable structure—checklists, tools, interfaces, review steps—not in a paragraph the model will skim or mis-weight. When prompts balloon, teams compensate by adding more rules on top, which makes the system harder to reason about and easier to game. Prefer a small prompt plus a visible next step over a long prompt that tries to simulate the whole org chart.

Visible queues. If work is invisible, you cannot steer it. A queue you can see—what is waiting, what is in flight, what is blocked—is the difference between managing a pipeline and shouting into a void. Visibility is not dashboard theater. It is the minimum honest accounting: this item exists, this is its state, this is who or what owns the next action. When queues are hidden inside chat threads or implicit “someone will follow up,” you get duplicate effort, dropped handoffs, and the false belief that automation replaced coordination. Make the queue boring and legible. Boring queues scale.

Tight fences. Fences are the parts of the system that are not negotiable: where files may be written, which commands may run, what may be merged without review, what counts as done. Loose fences feel friendly until the first incident. Tight fences feel annoying until the hundredth day, when they are the reason you still have a repo. The goal is not maximum restriction; it is clear restriction—so the agent and the humans share the same boundaries. Ambiguous permission is worse than denial, because denial produces a visible error; ambiguity produces confident mistakes.

Iterate. The wall of rules is not a monument. It is a prototype that meets production traffic. Your first version will be wrong in ways you cannot predict from a whiteboard. That is normal. What matters is that you treat prompts, fences, and queue design as things you revise on evidence, not things you declare finished because the document is long. When something goes wrong, the useful question is not “how do we add another paragraph?” but “which fence failed, which queue hid the failure, and which prompt encouraged the wrong shortcut?” Fix the system, then trim the prose.

If you want a practical rhythm for improving how agents behave without turning every lesson into a permanent slab of text, use Docs → Authoring as your default loop: small change, real task, observe, adjust. The wall of rules before the first run is there so the first run teaches you something you can act on—not so you can pretend the first run was already safe.
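The tight-fences idea from this chapter can be sketched as an allow-list check that either permits an action or fails loudly, so denial produces a visible error instead of a confident mistake. Everything below is assumed for illustration: the `Fence` class, its directory list, and its command list are hypothetical, not Ship configuration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


class FenceViolation(Exception):
    """Raised when an action falls outside the declared boundaries."""


@dataclass(frozen=True)
class Fence:
    writable_dirs: tuple[str, ...]     # repo-relative directories
    allowed_commands: tuple[str, ...]  # executable names only

    def check_write(self, path: str) -> None:
        p = PurePosixPath(path)
        # Absolute paths and ".." segments are denied outright, not guessed at.
        if p.is_absolute() or ".." in p.parts:
            raise FenceViolation(f"write outside repo: {path}")
        if not any(p.is_relative_to(d) for d in self.writable_dirs):
            raise FenceViolation(f"write not permitted: {path}")

    def check_command(self, argv: list[str]) -> None:
        if not argv or argv[0] not in self.allowed_commands:
            raise FenceViolation(f"command not permitted: {argv}")


fence = Fence(writable_dirs=("src", "tests"), allowed_commands=("pytest", "git"))
fence.check_write("src/app/main.py")   # inside the fence: no error
fence.check_command(["pytest", "-q"])  # inside the fence: no error
```

Because the fence is plain data, it can live in the repository and change in a pull request, which is the same discipline the chapter asks of prompts.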

Chapter 9 — Proof and where to go next

The chapters you have just read are guardrails stated as essays. They are blunt about failure modes on purpose: a guardrail that only works in a slide deck is a costume. Proof is the question that survives when the room stops nodding. Can someone trace what ran, when, under which rule, and tie that back to a ticket, a branch name, and a CI run—without opening six tabs and holding a séance? If the answer is “usually,” you have a hobby. If the answer is “yes, and here is the file,” you have something a team can hand to a new hire on week one.

Abstractions are cheap; wiring is where opinions meet reality. That is why Use cases → ElMundi exists. It is not a testimonial dressed as a case study. It is one deployment described the way engineers remember incidents: workflow shape, tracker projects and labels, secret boundaries, preview habits, hosted checks, a delivery lane bounded by leases and caps rather than a clock, and an audit loop that is allowed to say nothing useful on a quiet morning. You should disagree with our choices—rename domains, swap vendors, retune the envelope—but you should not have to pretend the framework is fog. You can map your repository against that story and see which part answers which question.

Treat Use cases as a receipt, not a mandate. Receipts are what you show when someone asks whether “agentic delivery” is a pilot slide or a production habit. They are what you show yourself six months later when the person who wired it up has rotated out. A receipt names the FSM grammar, the dispatch rules, the branch and title contract, and the second board that keeps architecture and security from drowning the sprint. Without receipts, Ship becomes a mood you once had in a quarterly planning deck.

The sections that follow on this same page lift the altitude from temperament to architecture and operations—still adapter-shaped, still polite about vendor logos—because serious work needs both the whiteboard sentence and the labeled doors. Read in this order when you are implementing or reviewing end to end:

The system — boxes, arrows, and where business rules live: tracker as system of record whose transitions are the act-signal, the dispatcher that routes each transition to one FSM stage, agent under a versioned skill, pull request as evidence, audits in parallel lanes that do not steal the delivery story.
Running the loop — cadence, queues, branch races, what “green” is allowed to mean, why boredom is load-bearing, and how self-heal relates to intake without collapsing the two loops into one noisy stampede.
Trust & boundaries — where bits flow, which subprocessors show up when you actually run this, threats in plain language, and the boring questions risk reviews should ask before you widen access.

If your job is to procure or govern rather than wire cron yourself, you still need the same spine—only the emphasis shifts. Read The system and Trust & boundaries, then step to Docs and Use cases for stakeholder-facing framing: what is being bought, what stays human, how exit and portability read when orchestration is “just CI plus scripts.” The purchase that ages well is the one you can redraw from memory after the vendor PDF is closed.

Reading order is a promise, not a prison. Some readers land in Trust & boundaries first because legal booked Wednesday. Others jump to When things break because the channel smells like smoke. The chapters are sized for a single sitting and written to stand alone; sequence pays off when you have time, and cross-links are escape hatches when you do not.

When the book feels too literary, go to Getting started, Docs, or Use cases → ElMundi, steal a shape, argue with our schedule, replace our org-specific names, and return when your board is boring again—boring in the good way, where states mean what you think they mean and nobody has to improvise trust at midnight. If you lead a team, one chapter a week in a short standing slot often produces more value than the text alone: the disagreements are where policy actually lands.

The system

Chapter 10 — One paragraph that holds the whole thing

The whole apparatus is one loop with a spine and a few non-negotiable seams: work lives in an issue tracker as the system of record, and a ticket's state transition is the act-signal—when a human (or the operator's own agent) moves a card into the next state, that move is the instruction, so humans are not the clock and nobody is hammering a queue. A diff-based poller watches the board for those transitions and, for each one, the engine resolves which FSM stage the ticket is now in—from its state and its signal labels—and fires the one routine for that ticket. There is no separate selector deciding "who is next" out of a slot; the ticket that moved is the ticket that runs, and concurrency is bounded not by a clock but by leases (project_lock), per-stage caps, and cascade limits, so two agents never fork one ticket into duplicate branches and dueling pull requests. When a stage fires, the engine dispatches a GitHub Actions run (ship-agent-run.yml) that executes the role's agent; the agent commits its own work and the runner pushes, opens the pull request, and reports the stage's outcome back to the engine. Versioned skills—server-side role definitions you can diff and roll back—are the other half of that contract, so behaviour changes when prompts change and you can prove which revision ran. The PR contract closes the loop back into the repo—branch naming, predictable title and body, review as the human gate—so code and commentary stay traceable to the card that justified them. Audit is deliberately a second loop: separate tracker projects, separate cadence, so assurance work does not steal throughput from the delivery lane or pretend to be “just another ticket” with the same urgency profile. Finally, adapters implement provider quirks while the shape of the workflow stays the same—role labels, state names, queue semantics—so you can swap trackers or code hosts without renegotiating what “ready,” “blocked,” and “done” mean to your team.

Tracker. The tracker is not optional wallpaper. It is the ledger of intent, the audit trail in comments, and the place operators look when CI is red. The status field is the engine's only transition signal; when tracker fields drift, the system fails in specific steps—dispatch, run, update—and those failures are gifts because they name the broken assumption.

The transition is the signal. Ship does not wake on a clock and go looking for work to do; work raises its hand, and the engine answers. A state change is what dispatches a stage, so the act of moving a card is the act of asking the system to advance it. The diff-poller is the politeness layer—it batches its reads (~300s) so nobody hammers the tracker API—but it is reacting to transitions, not manufacturing them.
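A diff-based poller can be sketched as a comparison of two board snapshots. This is an assumption about the general mechanism, not Ship's code: `diff_transitions`, the `Transition` tuple, and the snapshot shape (ticket key mapped to state name) are hypothetical, and a real poller would fetch each snapshot from the tracker API.

```python
from typing import NamedTuple, Optional


class Transition(NamedTuple):
    ticket: str
    old_state: Optional[str]  # None when the ticket is new to the snapshot
    new_state: str


def diff_transitions(previous: dict[str, str], current: dict[str, str]) -> list[Transition]:
    """Return one Transition per ticket whose state differs between snapshots."""
    changes = []
    for ticket, state in sorted(current.items()):
        before = previous.get(ticket)
        if before != state:
            changes.append(Transition(ticket, before, state))
    return changes


POLL_INTERVAL_SECONDS = 300  # the ~300s batching interval mentioned above

previous = {"ENG-1": "Backlog", "ENG-2": "Todo"}
current = {"ENG-1": "Todo", "ENG-2": "Todo", "ENG-3": "Backlog"}
transitions = diff_transitions(previous, current)
```

Here `transitions` holds two entries: ENG-1 moving from Backlog to Todo, and ENG-3 appearing in Backlog. ENG-2 did not move, so it produces nothing, and an unchanged board yields an empty list.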

One routine per ticket, bounded by leases. Given a ticket that just transitioned, the engine fires exactly the routine for its resolved stage—or fires nothing, and that nothing is explainable. Bounding is done by project_lock, cascade caps, and dependency blocks rather than by a one-per-slot clock: a frozen or blocked ticket is refused, a dependency that has not cleared holds its dependents, and a lease keeps two runs off the same project. That is how you prevent forked reality without a calendar.
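The bounding rules in this paragraph can be sketched as a guard that returns either a go-ahead or a named refusal. The names below (`Ticket`, `Guard`, `decide`, `release`) and the cap value are hypothetical; only `project_lock` echoes a term from the text, and its real semantics may differ. Per-stage caps are left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ticket:
    key: str
    project: str
    frozen: bool = False
    blocked: bool = False
    depends_on: tuple[str, ...] = ()


@dataclass
class Guard:
    done: set[str] = field(default_factory=set)                 # finished tickets
    project_lock: dict[str, str] = field(default_factory=dict)  # project -> lease holder
    cascade_cap: int = 3                                        # assumed chain limit

    def decide(self, t: Ticket, cascade_depth: int = 0) -> tuple[bool, str]:
        """Return (fire, reason); a refusal always carries its reason."""
        if t.frozen:
            return False, "frozen"
        if t.blocked:
            return False, "blocked"
        pending = [d for d in t.depends_on if d not in self.done]
        if pending:
            return False, f"waiting on {', '.join(pending)}"
        if cascade_depth >= self.cascade_cap:
            return False, "cascade cap reached"
        holder = self.project_lock.get(t.project)
        if holder is not None and holder != t.key:
            return False, f"project leased by {holder}"
        self.project_lock[t.project] = t.key  # take the lease
        return True, "dispatch"

    def release(self, t: Ticket) -> None:
        # Drop the lease only if this ticket actually holds it.
        if self.project_lock.get(t.project) == t.key:
            del self.project_lock[t.project]


guard = Guard(done={"ENG-1"})
first = guard.decide(Ticket("ENG-2", "web", depends_on=("ENG-1",)))
second = guard.decide(Ticket("ENG-3", "web"))
held = guard.decide(Ticket("ENG-4", "api", depends_on=("ENG-9",)))
```

In this run `first` dispatches and takes the lease on `web`, `second` is refused because ENG-2 holds that lease, and `held` waits on ENG-9. Every "nothing" comes back with a reason string, which is the explainability the paragraph asks for.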

Agent runtime and versioned skills. Execution is a dispatched GitHub Actions run: name failure modes, keep secrets out of logs, let the agent commit its own work and have the runner own the push, the PR, and the outcome report. Skills are configuration the engine serves to that run; they change in pull requests like code.

Audit: second loop, separate projects. Audits look like tickets but are not the same workload as feature delivery. Separate scopes keep audits from competing in the same lane as product work and make reporting honest about what shipped versus what was verified.

Adapters versus shape. Adapters translate tool quirks; shape is the invariant contract. When something breaks after a vendor UI change, you usually fix a field mapping; when nobody agrees what “In review” means, you fix shape and documentation. That distinction saves you from encoding process in hacks that will not survive the next migration.

Together: tracker truth, transitions as the act-signal, one routine per ticket bounded by leases and caps, a bounded agent runtime, versioned instructions, PR-backed outcomes, a parallel audit lane, and portable adapters around a stable shape. It is not magic; it is logistics with receipts—and that is why it scales.
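Resolving a ticket's FSM stage from its state and its signal labels can be sketched as a pure lookup, which is what makes the routing repeatable. The route table is invented for illustration: the label names and the `develop` stage are placeholders, `code_review` is borrowed from a stage name that appears later in this book, and `resolve_stage` is a hypothetical function rather than Ship's FSM grammar.

```python
from typing import Optional

# (state, required label, stage); first match wins. All entries are assumed.
ROUTES: list[tuple[str, str, str]] = [
    ("Todo", "role:developer", "develop"),
    ("In review", "role:reviewer", "code_review"),
]


def resolve_stage(state: str, labels: frozenset[str]) -> Optional[str]:
    """Pure function: the same state and labels always give the same answer."""
    for route_state, label, stage in ROUTES:
        if state == route_state and label in labels:
            return stage
    return None  # no routine unlocked; firing nothing is a valid outcome


todo = resolve_stage("Todo", frozenset({"role:developer"}))
backlog = resolve_stage("Backlog", frozenset({"role:developer"}))
unlabeled = resolve_stage("Todo", frozenset())
```

Here `todo` resolves to `"develop"`, while `backlog` and `unlabeled` resolve to `None`. Because the function reads only its arguments, a disputed dispatch can be replayed from the recorded board state.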

Chapter 10.A — The manual gearbox

In the era of capable models, the engine stopped being the hard part. Anyone can buy horsepower now; it ships in an API. What separates a shop that delivers from a shop that spins is not the size of the engine but the gearbox—and Ship is a manual.

An automatic hides the moment of change. You press the pedal, the box picks a gear, and most of the time it picks well enough. That is the firehose dressed as convenience: work "just happens," the system chooses its own gear, and you learn what it chose by reading the road behind you. It is easy to drive and impossible to drive well, because the one decision that matters—when to change state—was taken from you and handed to a machine that cannot see the corner ahead.

A manual gives that decision back. The transition is the gear change: the moment a human, or the agent that human trusts, moves a card and tells the system now, this gear, this much power to the road. The model supplies the torque. The transition decides where it lands. Shift early and you bog; shift late and you over-rev; shift clean, at the right moment, and the whole drivetrain feels like one thing. The skill was never in the engine. It was always in the timing of the change.

This is why a more capable model raises the stakes instead of lowering them. A weak engine forgives a sloppy shift—there was never enough power for the mistake to matter. A strong one punishes it, and faster: a card moved too early dispatches a stage that was not ready; a card moved into the wrong state fires the wrong routine at full throttle. Capability does not make driving easier. It makes precision worth more.

We are told this is the part to automate away—that the dream is a car anyone can drive without learning to shift. That dream is how you end up in the ditch with a green dashboard. The point of Ship is not to hide the gearbox; it is to label every gear, so the only skill the road asks of you is the one that actually matters: knowing when to change. The new hire still reads the board on day one—the gears are named—but reading the gears and knowing when to pull them are different masteries, and only the second one is hard. That is the trade the best drivers have always made: give me a manual and a clear gate, and I will out-drive the automatic every time the corner is real.

Chapter 10.B — Repeatable mastery

Here is the bet, stated plainly enough to argue with: the product is not the engine, and it is not the gearbox. It is repeatable mastery—the thing every shop has always wanted and never had, because mastery lived in heads.

A senior engineer is a century of small corrections wearing one name. They know which review comment to skip and which to die on, which "quick fix" is a trap, when a green check is lying. That knowledge is the most valuable thing in the building and the least repeatable. It fires differently on a good day than a bad one. It does not come to the 3 a.m. incident; it is asleep. And when the person leaves, it leaves with them—a hundred years of judgment walking out the door in a single resignation letter.

So we did the unreasonable thing. We took the team's accumulated experience—a hundred years of this shop's decisions, not a hundred years of engineering in general—and compressed the part that repeats into a few thousand lines of prompt. Not a recipe card; a set of versioned skills, each a role with priors, a contract, and a line it refuses to cross. The developer that has seen this mistake before. The reviewer that knows which corner gets cut. The model already brings the world's generic taste; we never needed to encode that. We encoded the part the model cannot know—our definition of done, our fences, the hills we die on, the mistakes we already paid for.

What that buys is not a smarter shift. It is the same floor, every time. We do not promise the same diff twice—a model will never give you that. We promise the same bar cleared, the same fences honored, the same definition of done, on the best morning and the worst night. The operator still owns when to shift; the skill owns how, and it executes with the whole team's encoded judgment for the cost of an API call. Mastery used to be a person you hoped was awake. Now it is a skill version you can prove ran.

We captured what repeats. The irreducible part—knowing when the rule is wrong—does not compress, and we did not try; it stays at the gearbox, with the human. And the first thing we encode is not competence but its boundary: a tired senior knows they are tired and pages someone, so the skill must know what it was not given and stop, comment, and ask. Repeatable mastery without repeatable humility is just repeatable overconfidence with good production values.

None of this is a moat made of prompt text. Fork the few thousand lines and they are yours. The moat is the loop that keeps them true: the scar this week that rewrites the skill next week. A competitor can copy the prompt we have; they cannot copy the one we will have after Friday's incident. That is why skills change in pull requests, why the corpus they read forgets weekly in plain sight, why a published number is never overwritten. Repeatable mastery is not a monument to how good we were. It is a living claim about how good we are this week.

And then the proof, the one that surprised us. If the mastery is really in the system and not in the engine, you should be able to swap the engine for a cheap one and watch nothing change. So we did. The agent that drives Ship from the operator's side is plain Claude over an MCP connection—no custom cockpit. The agent that does the building, the hands on the branch, is not the frontier model at all; it is a cheap one. The work held, and nobody noticed. That is the whole thesis told as a magic trick with the method shown: if you can downgrade the engine and no one can tell, the value was never in the engine. It had already moved into the parts you own—the fences, the contracts, the transitions, the century written down. We are not saying the model does not matter; on a hard, novel corner the expensive engine still pulls ahead, which is exactly what the human's timing is for. We are saying its marginal value collapsed—and on the daily work a cheap engine with a great gearbox beats an expensive one with none, and beats it on cost by an order of magnitude.

Chapter 11 — Six heartbeats of a ticket

A ticket does not travel on vibes. It travels in beats—six of them—each one a handoff you could explain to a new hire without drawing a poster. Miss a beat and the music turns into noise: duplicate branches, green runs that lied, stand-ups about what “the bot meant.”

One — intent becomes eligible. A human, or the operator's own agent over the MCP edge, moves the card into the automation entry state. In our reference wiring that is Todo in a delivery project, never vacuumed from Backlog because the title looked easy. Backlog is where wishes argue; Todo is where the organisation has said, in effect, this card may proceed under our contract. Labels, team, project membership—whatever the dispatcher reads to resolve the stage—must already be true. If you cannot point to the field that proves eligibility, you do not have automation; you have optimism with API keys.

Two — the transition dispatches. The status change is the signal. A diff-based poller sees the transition; the engine resolves the ticket's FSM stage from its state and labels and fires the one routine for that ticket. Firing nothing is often the healthiest answer—a frozen, blocked, or dependency-held ticket is refused, with a reason. The routing is deterministic: the same board state resolves to the same stage, so triage is a diff, not a séance.

Three — the run binds. The engine dispatches a GitHub Actions run (ship-agent-run.yml) that executes the stage's versioned role skill, served from the engine, against the repo. This is where policy becomes a run. If the role definition is not governed like code, you cannot prove which revision ran—and proving it is the whole point.

Four — agent leaves fingerprints. The agent works on a branch that encodes the ticket key, commits its own work, and the runner pushes those commits, opens or updates a pull request under your naming contract, and reports the outcome back to the engine, which writes structured notes to the tracker. The PR is not theatre; it is the record reviewers and auditors can grep next month.

Five — humans own the merge. Review, request changes, merge—or stop. Merge stays a human decision, or a policy you deliberately encoded (the engine validates code_review → auto_merge against approval and CI), which is still yours in a form you can audit. The agent’s job is to shrink the diff, not to declare victory over risk.

Six — audit knocks on a different door. Audit roles on their own cadence may file separate findings in separate projects. They do not steal air from the delivery lane. Same discipline, different moral mood: evidence before enthusiasm.

If you cannot narrate your process in those six beats, pause every integration and fix the story first. Todo-only entry means the dispatcher only ever acts on a card that already means “we agreed this is eligible.” When teams internalise the beats, incident review stops asking what the model “felt” and starts asking which guard failed, which prompt version ran, which transition fired which routine. That is respect, not coldness—components have inputs and outputs; oracles have moods. We ship with components.
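The encoded merge policy mentioned in beat five can be sketched as a gate that checks approval and CI before allowing the `code_review → auto_merge` step. The field names, the default of one required approval, and the `may_auto_merge` function are assumptions for illustration; the text says only that the engine validates that transition against approval and CI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequestStatus:
    approvals: int           # count of approving human reviews
    changes_requested: bool  # an open "request changes" review exists
    ci_passed: bool          # required checks are green


def may_auto_merge(pr: PullRequestStatus, required_approvals: int = 1) -> tuple[bool, str]:
    """Allow the code_review -> auto_merge step only when every check holds."""
    if pr.changes_requested:
        return False, "changes requested"
    if pr.approvals < required_approvals:
        return False, "missing approval"
    if not pr.ci_passed:
        return False, "CI not green"
    return True, "ok"


ready = may_auto_merge(PullRequestStatus(approvals=1, changes_requested=False, ci_passed=True))
red = may_auto_merge(PullRequestStatus(approvals=2, changes_requested=False, ci_passed=False))
```

`ready` is allowed and `red` is refused with "CI not green". The policy is still a human decision; it has only been written down in a form that can be reviewed and tested.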

Chapter 12 — The board is a story

A work board is not a filing cabinet. It is a narrative: ideas arrive, get shaped, move through tension and collaboration, and either land as something shipped or stay deliberately parked. When everyone reads the same columns the same way, the story stays coherent across humans and automation. The picture below is the system context—who talks to whom, and where trust boundaries sit—so you can keep the board’s motion aligned with the real architecture instead of treating tickets as free-floating labels.

Backlog is the prologue. Work here is acknowledged but not committed for the current cycle. SDLC automation does not act on Backlog; wishes are not yet eligible under contract, and a transition into Backlog dispatches nothing. Grooming turns noise into stories someone could actually start.

Todo is where the story picks up pace and where automation enters in Ship: a transition into Todo (by a human, or the operator's agent over the MCP edge) is what the dispatcher reads as “this card may proceed.” New items appear in the same lane as everything else that is ready to start—not in a shadow queue off the board. The transition has meaning; skipping straight to In progress by hand would hide what just appeared and why, and the dispatcher routes off labels regardless of which column you dropped the card in.

In progress is the active scene—agent branch, cloud run, pull request in flight. WIP limits here matter more than almost anywhere else; too many cards means everyone is busy and nothing finishes.

In review is the editorial pass—human and CI judgement, previews, required checks. It should not become a second backlog; lingering usually means unclear review criteria or work that was not actually ready.

Done is the honest closing line—merged, accepted, or explicitly closed with a reason. Done is how you measure throughput and tell stakeholders what changed in the world.

The SVG is built from documentation/diagrams/architecture.d2 (run d2 documentation/diagrams/architecture.d2 documentation/diagrams/architecture.svg after edits when d2 is on your PATH). Aligning columns with that diagram keeps the story on the page from drifting from the story in production.

Chapter 13 — Four players, four kinds of discipline

Shipping with agents is not one machine doing everything. It is a small ensemble, and each member has a different job. In Ship, the dispatcher is the layer that reads the tracker and, for the ticket that just transitioned, resolves its FSM stage and fires one routine (or nothing) using explicit fields—never a model’s whim. If you blur the boundaries between these four players, you get flaky runs, bloated prompts, and teams that hover over every green check as if it were a fragile ornament. The fix is clearer discipline: the tracker and its transitions, the dispatcher that routes them, the agent runtime, and the people who own the system.

Tracker + transitions: wiring, not judgment. The tracker's job is to hold state and to make a transition the signal that work should advance, leaving a run URL and a comment in the audit trail. Transitions are declarative wiring. They must not encode business rules in two hundred lines of YAML conditionals—the FSM in the engine is for logic; the tracker is for state. When the pipeline graph becomes a second application layer, failures stop being legible and start being folklore.

Dispatcher: route deliberately, never “just in case.” The dispatcher fires the one routine for the transitioned ticket using only state, labels, project, team, and the FSM—fields you would show in code review—and refuses tickets that are frozen, blocked, or already leased. It must not call the agent because “maybe something is interesting.” A transition that resolves to no routine, and dispatches nothing, is a feature. Loosening the labels until everything qualifies is how surprise work ships.

Agent runtime: execution under guardrails, not improvisation. The agent run (a dispatched GitHub Actions workflow) must respect branch and PR contracts, stay inside tool allow-lists, commit its own work, and let the runner push, open the PR, and report the outcome. It must not improvise scope to compensate for vague tickets. The honourable failure mode is often a comment and a stop—not a speculative refactor that becomes someone else’s midnight.

People: governance without babysitting every run. People own intent, merge policy, production promotion, and skill governance—who may change a role definition. They must not turn every run into a supervised performance. If each execution needs hovering, your fences are too loose or your prompts too vague, and no dashboard substitutes for tightening the interface.

These four disciplines reinforce one another. A clean tracker keeps failures honest. Deliberate dispatch keeps prompts and diffs legible. A constrained agent keeps changes reviewable. Human ownership keeps the loop aligned with risk and intent without turning humans into cron jobs. The failure mode to fear is fusion—a tracker that decides product questions, a dispatcher that smuggles roadmaps, agents that rewrite the world because nobody said they could not. The success mode is separation of concerns with crisp interfaces: let the transition be the signal, route the work once on purpose, run the agent in a box, let people own the rules, then trust the machinery until evidence says otherwise.

Chapter 13.A — Agents in concert

Chapter 13 named four players. There is a fifth.

The four we have already met are operational: the tracker whose transitions are the signal, the dispatcher that routes each transition to one routine, the runtime that performs it under guardrails, and the people who own intent at the edges. They are enough to move a ticket from a board to a merge. They are not enough to make the artefact that comes out of the run readable to anyone who did not write it. For that, we have learned to add a fifth role, and the fifth role is a critic.

We did not come to this by theory. We came to it because a producer that is asked to police its own vocabulary writes around the words it has been forbidden, and the prose goes sterile. A producer that is given the whole context of the work writes with weight and specificity, and slips into language that belongs to the work rather than to the reader. Either way, the artefact fails. The way out is not a longer prompt. It is a second role with a different contract.

The critic's contract is small and unsentimental. Its inputs are two: an artefact, and a definition of the audience the artefact is supposed to land with. Its output is one: a list of places where the artefact will fail that audience. The critic does not rewrite. Rewriting belongs to the producer, who carries the voice and the claims and the responsibility for the work. The critic flags; the producer accepts or argues. The seam between them is the artefact, and the artefact is what crosses, unaltered by the critic, back to the producer with a list of objections attached. A role that both writes and judges its own writing is a role that ends up flattering itself.
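The shape of that contract can be sketched as types. The example below is a toy under stated assumptions: the Audience and Objection records are hypothetical, and the word-list check stands in for whatever reading a real critic role performs. What it illustrates is the contract only: two inputs, one list of objections out, and no rewrite of the artefact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Audience:
    name: str
    unknown_terms: frozenset  # vocabulary this reader does not own (assumed input)


@dataclass(frozen=True)
class Objection:
    line: int  # 1-based line number in the artefact
    note: str


def critique(artefact: str, audience: Audience) -> list[Objection]:
    """Return objections only; the artefact itself is never modified."""
    objections = []
    for number, line in enumerate(artefact.splitlines(), start=1):
        words = {word.strip(".,;:()").lower() for word in line.split()}
        # Sorted so the same artefact always yields the same list.
        for term in sorted(words & audience.unknown_terms):
            objections.append(
                Objection(number, f"'{term}' is not {audience.name} vocabulary")
            )
    return objections
```

The function has no return path that carries edited text, which is the point: the producer receives a list and decides what to do with each entry.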

The seam matters because the two readings are different in kind. A producer reads the artefact from the inside, against everything they know about how it was made. A critic reads it from the outside, against everything the audience does not know. When the same role does both, the inside reading wins, because the inside reading is the one that feels like understanding. The artefact that results is one the producer believes, and that nobody else does. It is the most common failure mode of competent automation: a document that is internally consistent and externally opaque.

The strongest case for the critic is when the audience is non-engineering. A product manager reading a developer's runbook does not need the developer's reassurance that the runbook is clear. The developer is not the audience and cannot be a fair judge of how the audience will read. What the runbook needs is a role whose contract names the product manager as the reader, and whose pass through the document returns the six words a product manager does not parse, the three paragraphs that assume context the product manager has not been given, and the one claim that would land as condescension on a reader who is not in the engineering loop. The developer can then decide what to do with each note. The critic does not decide. The critic represents the reader inside the workflow, so the reader does not have to.

The scar that taught us this is dated. On 2026-05-04 we committed po-filter-v1 — a critic role whose only job was to read agent-produced documents the way a product owner would read them, and to flag every sentence that used vocabulary the product owner did not own, every paragraph that assumed engineering context the product owner did not carry, and every claim that would land wrong on a non-engineering reader. The producers kept their voice. They wrote with the full context of the system, named the parts they had built, and did not hedge. The critic came through after each page and returned a list of places where the producer had been writing for itself. The producer's output improved that week. Not by eighty percent. By the right percent, which is the percent it takes for a page to read as if it were written for the person who would actually open it.

The shape generalises beyond documents. Every artefact a producer makes for an audience the producer does not belong to needs a critic whose contract names the audience and reads the artefact under it. Release notes are not commit messages. A status page is not an incident channel. A founder's deck is not a standup. In each of those pairs there is a production surface that carries more context than the consumption surface needs, and someone has to call which context to drop. When that someone is the producer, the call is made under the wrong gravity. When that someone is a separate role with a separate contract, the call is made for the reader. The producer keeps its voice. The reader gets the artefact they can use.

We are careful, when we say this, not to fold the critic into the workflow as a step. A step is a thing the producer does on the way to finishing. The critic is not a step. It is a role with its own contract, its own audience definition, and its own output. It runs after the producer is done, against the producer's finished work, and returns a separate document — a list of objections — that the producer then reads as a reader of objections rather than as a writer of prose. The separation is what protects the producer's voice from collapsing into hedging, and what protects the artefact from flattering its maker. A step that the producer performs against itself is a step the producer will skip when the producer is tired. A role with a contract is a role the system runs whether the producer is tired or not.

There is a quieter version of this lesson worth naming. The critic is not the audience. The critic represents the audience inside a workflow that does not have the audience in it yet. The real reader will come later, with their own gravity, their own half-finished sprint, their own two open tabs. What the critic buys is the chance to land the artefact in a shape the real reader will not bounce off. That is a smaller promise than "the document is good," and it is the promise we can keep. Good is a judgement we leave to the reader. Reachable is a contract we can write down.

This is where the chapter hands off to the next. We have said the system needs four players for the work and a fifth for the artefact. The fifth role does not change the producer's work; it changes what the workflow does with it. The critic's list of objections does not land on the producer's board. It is not a delivery item, and the delivery board is no place for it. It belongs on a different board — one that holds evidence and findings rather than features, with its own rules of entry and its own cadence. That is what Chapter 14 is about. Three boards beat one because the critic's signal is not the producer's signal, and asking a single queue to carry both is how the loud work drowns out the quiet work that keeps it honest.

Chapter 14 — Three boards beat one

Most teams start with a single board and a single backlog. That feels simpler until work types collide: a release train needs crisp sequencing, an audit surfaces dozens of findings that are not “next sprint” work, and security or dependency scanners open tickets faster than anyone can triage them. When everything lands in one lane, prioritization becomes a personality contest, labels multiply without meaning, and “urgent” drowns out “important.” The practical fix is separation of concerns at the board level. Three boards—or three projects in your tracker—give each class of work its own rules of entry, its own definition of done, and its own cadence without pretending they are the same job.

Delivery boards run the operational SDLC. Automation must not vacuum arbitrary items from a giant backlog: dispatch is gated by the Todo state plus the signal labels, by labels that encode readiness and risk, and by project membership, so only work the team accepted for execution enters the pipe. Product setup and operating guidance live in Getting started and Operating.

Tech-debt and findings boards are evidence-led. Each ticket should point at something concrete (a log excerpt, a failing check, a report section), not a vague “refactor auth someday.”

Security and dependency boards hold scanner output, with deduplication as a first-class feature so the stream does not become spam. Tools like Snyk are valuable because they never sleep; they are also noisy if every variant becomes its own forever-ticket.
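One way to make deduplication first-class is a stable fingerprint per finding. This is a sketch under assumptions: the field names (package, advisory_id, path, version) are hypothetical and are not the output schema of Snyk or any other scanner. The version is deliberately left out of the fingerprint so that each variant does not become its own ticket.

```python
import hashlib


def fingerprint(finding: dict) -> str:
    """Stable key: same package, advisory, and manifest path means same ticket."""
    parts = (finding["package"], finding["advisory_id"], finding["path"])
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]


def file_findings(findings: list[dict], open_tickets: dict[str, dict]) -> list[dict]:
    """Return only findings that need a new ticket; count repeats on the old one."""
    new_tickets = []
    for finding in findings:
        key = fingerprint(finding)
        if key in open_tickets:
            open_tickets[key]["seen"] += 1  # repeat sighting, no new ticket
            continue
        ticket = {"fingerprint": key, "seen": 1, **finding}
        open_tickets[key] = ticket
        new_tickets.append(ticket)
    return new_tickets
```

Running the same scan twice against the same board files nothing the second time; the "seen" counter is the only thing that moves.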

None of this forbids a unified view for leadership. What you avoid is a unified queue where incompatible work types compete in the same instant-priority arena.

Project type	Role
Delivery / pre-release	Operational SDLC. Automation does not act on Backlog; Todo plus signal labels plus project membership gate every dispatch.
Tech debt / findings	Evidence-based outputs from audit roles — each ticket should point at a log, a report, or a failing check.
Security / dependencies	Scanner output (for example Snyk), deduplicated so the board does not become spam.

Chapter 15 — Why boring is load-bearing

The glamorous parts of building software—models, agents, clever heuristics—get the attention. The parts that actually keep teams sane are dull on purpose. Deterministic dispatch is one of those load-bearing boring choices: given the same board state and the same rules, a transition always resolves to the same stage and fires the same routine. No mood, no “try again and hope.” If your process needs a séance to explain why something happened, it is not infrastructure yet.

What deterministic dispatch is. A rule you can write down: the FSM stage a ticket's state and labels resolve to, explicit refusals (frozen, blocked, dependency-held), and a single place where the decision and its reason are recorded. The goal is replayability—same context, same routing, same audit trail. Dispatch is not “AI selection.” If the model chooses what advances, you removed the fence.

Debugging, fairness, safety. When outcomes drift, nondeterminism hides the bug. Determinism turns failures into diffs: inputs changed, rules changed, or implementation diverged from spec. Fairness needs a floor you can stand on—“the AI picked” is not a policy. Safety works the same way: if you cannot state the conditions under which something is allowed, you cannot test or roll back cleanly.

Anti-patterns: mystery sorting with no documented ordering; unstable ties; hidden randomness in a path that should be procedural; prompt-as-law for rules that belong in code; duplicate sources of truth between UI, backend, and agent.
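The first two anti-patterns have a mechanical cure: write the ordering down as a key, and end the key with a field that is unique. A sketch, assuming hypothetical ticket fields (priority, created_at as an ISO-8601 UTC string, key) and an assumed priority scale:

```python
PRIORITY_RANK = {"urgent": 0, "high": 1, "normal": 2, "low": 3}  # assumed scale


def ordering_key(ticket: dict) -> tuple:
    """Documented order: priority, then age, then issue key as the last tie-breaker.

    The issue key is compared as a plain string, which is arbitrary but stable;
    because it is unique, no two tickets ever compare equal.
    """
    return (PRIORITY_RANK[ticket["priority"]], ticket["created_at"], ticket["key"])


def ordered(tickets: list[dict]) -> list[dict]:
    """Same tickets in, same order out, whatever order they arrived in."""
    return sorted(tickets, key=ordering_key)
```

Because the final component is unique, the result does not depend on the order the tracker API happened to return the tickets in.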

Sleep, séance, diff. Some teams debug by rerolling until the output feels acceptable. That can unblock a demo; it does not build trust. The alternative is the diff line: inputs, rule version, outcome. When we tightened Todo-only entry and made routing deterministic, the win was not cleverness—it was sleep. “Same board, same routing” turns triage from séance into diff.

Why boring is load-bearing. Exciting systems age badly when nobody can reconstruct their judgments. Boring systems age well because they are legible under stress: incidents, audits, turnover. Deterministic dispatch does not remove judgment; it serializes it into something teams can argue about productively. When in doubt, choose the option you can explain in one sentence without saying “it depends on the model’s mood.”

Note — Field note. In the reference org the routing is not a phrase; it is a grep target, and that target just moved. The old per-role pick scripts (tools/linear-agent/scripts/pick-*.mjs) are gone; the event-driven rearchitecture retired them. The current grep targets are services/dispatcher.py, where maybe_dispatch resolves the FSM stage for a transitioned ticket and fires its one routine, and services/tracker_fsm.py, which spells out the stages, their Linear-state mapping, and the signal labels. Both are code you can read in a review and reason about from board fields alone: given this state and these labels, this stage, this routine, or an explicit refusal with a reason. If you can trace a transition to a stage by reading those two files, your routing is boring enough; if you cannot, the next incident will teach you what "random" meant.

Chapter 16 — The branch wears the ticket’s name

Treat branch names and pull request titles as part of your public API — not decoration, not taste, not something “the bot will figure out.” Ship assumes a naming contract: the tracker issue key appears in the branch, appears again in the PR title (usually at the front), and repeats wherever your automation writes back to the ticket. Humans should be able to glance at a tab strip and know which story they are reviewing; scripts should be able to parse the same string without heuristics or model calls. That predictability is what lets CI attach checks to the right unit of work, what lets comment templates and preview links land on the correct thread, and what keeps “which PR belongs to ENG-2048?” from becoming a meeting.

The contract does not need to be clever. It needs to be stable. Pick a prefix for agent branches (for example a short product or bot name), separate segments with slashes or hyphens in a way your Git host tolerates, and forbid silent renames when a ticket is retitled in the tracker. If someone opens a manual PR “while the agent is thinking,” the human branch should either adopt the same key or the team should agree explicitly that two keys mean two stories — never two branches with different names pointing at the same issue without a human decision recorded somewhere. Ambiguity here is how you get double implementation, double deploy risk, and double blame.
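A naming contract of this kind fits in a pair of regular expressions. The shapes below are assumptions for illustration, not Ship's actual pattern: a branch of the form prefix/ISSUE-KEY-slug and a PR title that leads with the key.

```python
import re
from typing import Optional

ISSUE_KEY = r"[A-Z][A-Z0-9]*-\d+"
# Assumed branch shape: <prefix>/<ISSUE-KEY>-<slug>, e.g. "ship/ENG-2048-fix-login".
BRANCH_RE = re.compile(
    rf"^(?P<prefix>[a-z0-9]+)/(?P<key>{ISSUE_KEY})-(?P<slug>[a-z0-9-]+)$"
)
# Assumed title shape: the key leads the PR title, e.g. "ENG-2048: Fix login".
TITLE_RE = re.compile(rf"^(?P<key>{ISSUE_KEY})\b")


def key_from_branch(branch: str) -> Optional[str]:
    match = BRANCH_RE.match(branch)
    return match.group("key") if match else None


def key_from_title(title: str) -> Optional[str]:
    match = TITLE_RE.match(title)
    return match.group("key") if match else None


def contract_holds(branch: str, title: str) -> bool:
    """True only when both strings parse and agree on the same issue key."""
    key = key_from_branch(branch)
    return key is not None and key == key_from_title(title)
```

A check like contract_holds can run in CI with no heuristics and no model call, which is what makes a violation read like a type error rather than a matter of taste.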

Grep is your archaeology department. Years from now, nobody will remember which Tuesday fixed the regression; they will remember the issue key someone pasted in Slack. If keys live in merge commit messages, branch names, and PR titles, ordinary tools stay useful: search the repository history for the key, search your CI logs, search mail. You are not asking future you to infer intent from “fix-stuff-2” or “wip-final-really.” You are leaving breadcrumbs that survive job changes, vendor churn, and the natural amnesia of fast shipping. The same habit pays off during incidents: when production misbehaves, the fastest path from symptom to change-set often runs through one identifier everyone agreed to repeat.

Duplicate work is the failure mode naming is meant to starve. Two pull requests for the same ticket usually mean one of three things: a human did not see the agent’s branch, a workflow ran twice without an idempotency guard, or naming drift made the first PR invisible in search. The fix is mechanical before it is cultural. Workflows should check whether a branch already exists for the issue key, or whether an open PR already references it, before they mint a second tree. Reviewers should treat “unexpected second PR for ENG-1234” as a process bug, not as extra throughput. And when you must supersede an old branch, close or rename with a comment that ties the narrative together so grep still returns a single coherent thread.
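The mechanical half of that fix can be sketched as a guard that runs before any branch is created. It assumes the caller has already fetched the list of branch names and open PR titles from the code host; the function names are hypothetical.

```python
import re


def mentions(issue_key: str, text: str) -> bool:
    """Whole-key match, so ENG-12 does not match inside ENG-123."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(issue_key)}(?![0-9])"
    return re.search(pattern, text) is not None


def may_open_pr(
    issue_key: str, branches: list[str], open_pr_titles: list[str]
) -> tuple[bool, str]:
    """Refuse to mint a second tree when the key is already in flight."""
    for branch in branches:
        if mentions(issue_key, branch):
            return False, f"branch already exists: {branch}"
    for title in open_pr_titles:
        if mentions(issue_key, title):
            return False, f"open PR already references key: {title}"
    return True, "no existing branch or open PR"
```

The guard only works if the naming contract holds: a branch that does not wear the ticket's name is invisible to it, which is the "naming drift" failure described above.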

Concrete wiring — hosted E2E, promote discipline, and how those expectations show up in a real monorepo — is summarized in Use cases → ElMundi. The framework’s point is smaller and harsher: if the branch does not wear the ticket’s name, you will eventually ship the wrong story, review the wrong diff, or grep the wrong history — and every one of those mistakes costs more than enforcing a string pattern ever did.

Contracts rot the way config rots. Teams rename ticket prefixes, migrate trackers, or fork a workflow “just for an experiment” and forget to merge the naming rules back. The antidote is boring governance: treat violations of branch and title shape like type errors — visible in review, fixable in minutes. Drift caught in a pull request is a one-line correction. Drift caught after merge is an incident narrative nobody wants to write. Naming is load-bearing the same way deterministic dispatch is load-bearing: not because the computer cares about aesthetics, but because people and tools share one vocabulary — and shared vocabulary is how you keep automation trustworthy at three in the morning.

Chapter 17 — A parallel universe for audits

Shipping work and proving that work was done well are related, but they are not the same job. Teams that fold “audit readiness” into every sprint goal often end up with neither clean delivery nor clean assurance. The healthier pattern is a parallel universe: same repository underneath, different tempo, different questions, different artifacts.

Delivery mood is forward-looking. The implicit question is “What can we ship next that moves the outcome?” Audit mood is backward-looking. The implicit question is “Given what we claimed, what can we demonstrate?” Audits reward traceability—who approved what, how controls behaved, how risk was considered. Because these moods optimize for different things, they fight when forced into one calendar.

The operational move is to separate projects and schedules without separating ownership. You still have one engineering organization, but you maintain parallel tracks: a delivery program tied to customer value, and an assurance program tied to evidence collection, sampling, and review cycles. They should intersect at predictable checkpoints—not continuously. The mistake is making every engineer a part-time auditor by default, which turns assurance into noise and delivery into improvised narrative.

What bridges the universes is evidence, not optimism—tickets linked to changes, test results attached to releases, logs with retention policies, access trails that match the model you documented. Treat evidence like inventory: assemble it as you go, lightly and consistently, instead of scrambling when compliance emails arrive.

One concrete pattern for a short, recurring audit pass beside normal delivery—without collapsing the two mindsets—is covered by the automation and operating docs. If delivery and audit moods stay distinct, assurance gets its own rhythm, and audits stop feeling like an alternate reality imposed from outside—they become a second map of the same territory, drawn on purpose before someone else draws it for you.

Chapter 18 — Swap the vendor, not the story

Ship treats orchestration as boring infrastructure: thin adapters at the seams, well-defined surfaces at the boundary, and git as the system of record for anything you need to diff, review, or roll back. That is not nostalgia for terminals; it is insurance. When a vendor changes pricing, deprecates an endpoint, or ships a UI that automation can no longer read, you want the failure to land in a layer you own — credentials in one place, a client module you can patch, logs with request identifiers — instead of a crisis meeting about “what Ship means now.” The adapters are the spine between “a transition fired” and “evidence exists”: they are versioned like any other code and honest under failure in ways a dashboard seldom is.

The contract between human process and machine assistance stays stable when logos rotate. A state transition is the act-signal; the engine resolves the FSM stage and fires one routine for that ticket, with the state-and-label guards you could defend in standup. That routine turns into work the runtime can execute: stable title text, branch-safe slugs, metadata that downstream steps do not have to guess. The happy path ends in a pull request that wears the ticket’s name, so history, bots, and humans align on what moved. None of that is intrinsically Linear, GitHub, or a particular cloud agent; it is adapter work — mapping vendor primitives onto those verbs without letting the vendor’s nouns become your ontology.

Trackers, code hosts, and agent runtimes are therefore swappable modules, not the plot. Replace the tracker adapter when your system of record moves; you still need stable keys, workflow states exposed through an API, and labels that act as fences and dispatch signals. Replace the code-host adapter when your org standardises on another host; you still need branch and PR primitives, secrets the engine can hold, and URLs you can paste into an incident note. Replace the agent adapter when models or quotas shift; you still need a runtime that accepts bounded jobs, commits its own work, and returns links a human can follow. In each case the rewrite is mechanical if you kept the seams thin: new client, new env vars, new compliance review — not a re-architecture of “how we ship.”
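Thin seams can be written down as interfaces. The sketch below uses Python's typing.Protocol; every method name and signature is a hypothetical illustration of the primitives this chapter lists, not the reference org's adapter API.

```python
from typing import Protocol


class TrackerAdapter(Protocol):
    def get_state(self, issue_key: str) -> str: ...
    def get_labels(self, issue_key: str) -> frozenset[str]: ...
    def comment(self, issue_key: str, body: str) -> None: ...


class CodeHostAdapter(Protocol):
    def open_pull_request(self, branch: str, title: str) -> str: ...  # PR URL


class AgentAdapter(Protocol):
    def run(self, routine: str, issue_key: str, branch: str) -> str: ...  # run URL


def advance(
    issue_key: str,
    routine: str,
    branch: str,
    tracker: TrackerAdapter,
    host: CodeHostAdapter,
    agent: AgentAdapter,
) -> str:
    """Run the agent, open the PR, and write both links back to the ticket."""
    run_url = agent.run(routine, issue_key, branch)
    pr_url = host.open_pull_request(branch, f"{issue_key}: {routine}")
    tracker.comment(issue_key, f"run: {run_url}\npr: {pr_url}")
    return pr_url
```

Nothing in advance names a vendor. Swapping a tracker or a code host means writing a new class that satisfies the protocol, while the function that tells the story stays untouched.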

Some things are not negotiable when you swap vendors, because they are how the story stays legible across years. Versioned skills are governed like code, for the same reason database migrations live in git: behavior you cannot diff is behavior you cannot audit. The FSM invariant — transition → stage routine → agent run → commit → push + PR — remains the spine; skipping a step is how “helpful” automation becomes a parallel backlog nobody merges. Bounded concurrency stays load-bearing, enforced now by leases, caps, and cascade limits rather than a one-role-per-slot clock: overlapping automations correlate failures, duplicate work, and produce narratives that sound like weather instead of timestamps. The audit lane stays parallel — separate project, separate cadence, evidence-shaped outputs — so a scanner finding does not impersonate sprint commitment. Those choices are the story; the vendor is casting.
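Bounded concurrency by lease and cap can be illustrated with a small in-memory table. This is a sketch only: the class and its methods are hypothetical, a real lease would need expiry and a durable store, and cascade limits are not modelled here.

```python
class LeaseTable:
    """In-memory sketch: one lease per ticket, and a global cap on active runs."""

    def __init__(self, max_active: int) -> None:
        self.max_active = max_active
        self._holders: dict[str, str] = {}  # issue key -> run id

    def acquire(self, issue_key: str, run_id: str) -> bool:
        if issue_key in self._holders:
            return False  # already leased: refuse rather than overlap
        if len(self._holders) >= self.max_active:
            return False  # cap reached: refuse rather than pile up runs
        self._holders[issue_key] = run_id
        return True

    def release(self, issue_key: str, run_id: str) -> bool:
        if self._holders.get(issue_key) != run_id:
            return False  # only the current holder may release
        del self._holders[issue_key]
        return True
```

A refused acquire is a boolean the dispatcher can log with a reason, which keeps overlapping runs out of the incident narrative.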

Swapping is also a hedge against vendor storytelling. Markets move; models change price and capability; trackers merge, fork, or tighten API policies. If your internal narrative is “we bought the all-in-one,” you re-narrate the entire SDLC when the logo changes. If your narrative is “we own the loop; vendors plug in,” you negotiate from strength. You are not immune to churn — nobody is — but you stop confusing interface with identity. The team should be able to say, without heroics, which layer failed: the tracker transition, the dispatcher's stage resolution, the agent run, the push-and-PR, or human review. Thin adapters make that sentence possible.

Reference workflows in the repo may still name today’s vendors in filenames and examples; that is convenience, not doctrine. Treat those names the way you treat a sample .env: copy the shape, replace the values, keep the invariants. The invariants are the part you defend in architecture review — not whether the HTTP POST went to host A or host B.

Operational detail — reference providers, environment variables, and where honesty ends and org-specific coupling begins — lives in Configuration and Authoring. Read them when you need names for today’s stack; read this chapter when someone asks whether changing a vendor means rewriting Ship. It should not — not if prompts, scripts, and board policy stayed yours, and vendors stayed plugs.

The right skills

Chapter 18.A — The right skills

Start with the word, because the word is doing more work than the industry has noticed. When we say a person has a skill, we do not mean they own a recipe card. We mean they carry judgement. A devops engineer is a skill in this sense. Not the steps a devops engineer would take, but the person — the thing that stands in front of an alert at two in the morning and says, with no help from anyone, this is probably the load balancer, not the database, and here is the order I would check things in. That is taste, built from a decade of decisions, most of which were wrong before they were right. It cannot be flattened onto a page. If we try, we get a six-page document that misses the part that mattered, because the part that mattered was always which of the seven plausible things to do first.

A developer is a skill. A product owner is a skill. A medical writer who knows pharma is a skill. People with priors, with a sense of when to slow down, with the muscle to say no, not yet, we have not seen the evidence we need. We treat skills in this book the same way: a role, a specialist, a contractor we have hired into the pipeline who shows up with their own ear for the work.

What that devops engineer uses to deploy something inside a specific company's stack is a different thing entirely. The runbook for the cluster the regulated team runs on. The fact that staging holds its password rotation a week longer than production. The wrapper script one of the seniors wrote in a quiet winter and that everyone now leans on. The reason migrations on Tuesdays are forbidden. Those are facts about a company. That is a knowledge base, not a skill, and the difference is load-bearing.

Collapse the two — write a document called "PDF developer skill" containing both the judgement and the company-specific instructions, and ship it as one artifact — and a category error has been made. The card will look authoritative. It will be confidently wrong, because the parts of it that are true are the parts that did not need to be written down, and the parts that needed to be written down are the parts that change before the ink dries.

A skill is a person. A knowledge base is the world that person operates inside.

The collapse-error is what makes catalogs misbehave. We have watched the industry spend a year publishing directories of named, versioned, neatly-formatted little documents — here is how to generate a PDF, here is how to set up a webhook, here is how to write a migration that does not lock the table. The directories look serious. They have categories and tags and version numbers and a search bar across the top. Past the second page, every entry serves a handful of users in the world, and the maintainers know it. The page exists because the catalog exists, not because the work needed the catalog to be done.

There are slower problems underneath the spectacle. The catalog at thirty entries can be read by one curator who keeps the corpus honest. At a hundred, that curator is gone, replaced by a process for adding entries and a quarterly audit that nobody runs. At five hundred, the directory becomes a museum — bright rooms, careful lighting, exhibits past their relevance — and the agent loading from it has to search inside a search, deciding first which entry to trust before doing the work the entry was meant to do for it. The instructions inside rot at the speed of the libraries they describe, which is to say a few hours; the catalog ages at the speed of the team that maintains it, which is to say it does not. The two clocks run apart, and after six months the difference is the entire artifact.

So we hold the line. A skill, in our shop, stays empty of company-specific facts. It carries judgement, posture, taste — the priors of the role. The facts it needs in flight, the ones that change while the work is happening, sit somewhere else: a search engine the agents maintain over the documents the company already writes. The skill asks. The search answers. When the answer is fresh it is because the source was fresh, not because someone updated a recipe card last quarter. We come back to the search engine in the chapters on Lighthouse, where its shape and its costs deserve their own pages; here it is enough to say the knowledge does not live inside the skill, and a skill that pretends otherwise will rot in public.

On 2026-04-07 we made the small decision that this part of the book exists to defend. The auditor role had been carrying too much. It was meant to be a single specialist that read a finished change and said whether it could ship, but the role had quietly accreted three jobs: checking the work against the contract, repairing the obvious damage when it failed, and triaging regressions when a fix had blown a different test. The catalog instinct would have been to keep the role and grow its instructions — another section, another paragraph, another bullet of company-specific lore baked into the prompt. We split it instead. Validation became one skill, narrow and slow, asking only whether the evidence matched the intent. Self-heal became another, with permission to attempt a repair when the failure mode was small and known. Regression-triage became a third, which did nothing but route. Three specialists, each with a job a human could describe in one sentence, each empty of the runbook lore that used to bloat the older role. The decision was a vote against the catalog instinct in our own house. It cost us a morning. It bought us months of legibility.

There is a temptation, when describing the shape, to make it sound like a manifesto. It is not. It is a working preference, defended by scars. Catalogs are not evil; they are the wrong shape for a problem whose facts change faster than humans can re-author them. Skills are not magic; they are the same kind of artifact a staffing plan has always been, written for an agent instead of a hire. The interesting question is not which catalog should we adopt, but what does a skill owe to the system around it, and what does the system owe to the skill in return. Once that question is the question, the catalog argument quiets down, because the catalog never had an answer for it.

We have four chapters in this Part, including this one. The next argues that a skill is a contract — that the role only becomes operational when its inputs, its outputs, its refusals, and its tools are written down in a form a machine can obey and a human can audit. The one after walks through authoring a new skill from the moment a gap is noticed to the moment the new specialist is loaded into the pipeline. The last is about versions and rollback, because a skill, like any other artifact in a serious shop, has to be revertable on a bad day. Read them in order if the order helps. The thesis is already on the table: keep the person separate from the world the person operates inside, and most of the rest follows.

Chapter 18.B — A skill is a contract

The previous chapter argued that a skill is a person with judgement, not a recipe card. That argument is only half the work. A specialist you cannot describe is not yet a skill in our system; they are an assumption with a friendly face. The other half of the work — the part most teams skip, because it feels bureaucratic when it is in fact load-bearing — is the contract.

A contract has four parts, and we resist the temptation to enumerate them in a list because lists invite the reader to memorise the shape and forget the weight. The contract names the inputs the skill expects to receive in a form it can act on; it names the outputs the skill is responsible for producing in a form the next skill can consume; it names the ownership of the work, which is to say the territory of decisions the skill is permitted to make alone; and it names the judgement boundary, which is the line the skill refuses to cross even when the model could plausibly continue past it. A skill, in our usage, is a role you can describe so precisely that another team could hand the same work to a different person and the work would still come out shaped the same. If two people do "the same role" and the artefacts differ in kind, you do not have a skill yet. You have a job title.

Of the four parts, the boundary is the load-bearing one. Inputs and outputs are the easy part; you can usually deduce them by inspecting what flows between teams on a normal week. Ownership is the medium-hard part; people argue about it, but the argument has an end. The boundary is the hard part because it asks the skill to refuse work it could technically do. The boundary is what stops a developer from quietly rewriting product requirements while implementing them, redrawing the scope a few pixels at a time under the heading of "small clarifications," each one defensible and the sum of them a different product. The boundary is what stops a release manager from re-prioritising the backlog at the moment of release, downgrading a ticket from the safety of a position where nobody else is awake to push back. Without the boundary, neither of those drifts is malice; they are the most natural response in the world to a request that does not say "stop here." A skill that does not refuse anything is not a skill. It is an appetite.
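One way to make the four parts concrete is a small record type. The sketch below is illustrative only: the class name, the field names, and both example contracts are assumptions made for this example, not the format the reference org uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillContract:
    """Hypothetical record of the four parts of a contract."""
    name: str
    inputs: tuple[str, ...]    # what the skill expects to receive
    outputs: tuple[str, ...]   # what the next skill can consume
    owns: tuple[str, ...]      # decisions the skill may make alone
    refuses: tuple[str, ...]   # the judgement boundary

    def missing_parts(self) -> list[str]:
        """Return the names of any of the four parts left empty."""
        parts = {
            "inputs": self.inputs,
            "outputs": self.outputs,
            "owns": self.owns,
            "refuses": self.refuses,
        }
        return [label for label, value in parts.items() if not value]

    def may_decide(self, decision: str) -> bool:
        """A decision is allowed only if owned and not explicitly refused."""
        return decision in self.owns and decision not in self.refuses


# Example values are invented for illustration.
developer = SkillContract(
    name="developer",
    inputs=("ticket with acceptance criteria",),
    outputs=("change set",),
    owns=("implementation approach",),
    refuses=("rewrite product requirements",),
)

# A contract with no refusals: the "appetite" case.
appetite = SkillContract(
    name="does-everything",
    inputs=("anything",),
    outputs=("something",),
    owns=("whatever is asked",),
    refuses=(),
)
```

Calling `appetite.missing_parts()` reports the empty boundary, which is the cheapest possible check that a contract refuses at least one thing. It says nothing about whether the refusal is the right one; that part stays a human judgement.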

We learned this the slow way. On 2026-03-31 we shipped a duplicate-PR bug that should embarrass us less than it does, because the lesson it taught was worth the morning. Two specialists each believed they owned the same boundary, and on the same Tuesday morning both opened pull requests against the same ticket — the diffs not quite identical, the intent unmistakably the same. The reviewer queue caught it within the hour. The instinct in the room, which we mention because it is the wrong instinct and worth flagging, was to add a coordination layer on top: a shared document, a Slack channel, a stand-up question, a check that polled both pipelines and complained when their work overlapped. We have shipped that kind of cure before in other organisations, and it has never held for more than a quarter. The cure that worked, in the end, was duller. We rewrote one of the two contracts so that the boundary moved cleanly to one side. One specialist now owns the act of opening the pull request; the other owns the act of preparing the change set and handing it across. The bug disappeared. It never came back.

The general rule that follows is short enough to keep in working memory. When two skills produce friction — when their work collides, when their owners argue, when the same ticket comes up at two stand-ups in the same week — suspect the contracts before you suspect the people. The people are almost never the problem. They are almost always doing exactly what their understanding of the contract permits, and the understanding is the artefact you can change. Add a meeting and the friction returns next quarter, because the meeting is a coordination tax that nobody owes; the friction simply waits until the calendar slot ends. Rewrite a contract and the friction dies, because the territory itself has moved. We have done this enough times now to trust the rule. We have never, in our own monorepo, fixed a recurring skill collision by adding more talking.

A word, before we close, on what contracts are not. They are not a six-page document. They are not a wiki page that grows by accretion until nobody knows which paragraph was approved and which was added by a passing engineer the week of a launch. They are short — a paragraph of plain prose on what the skill is for, a brief description of what comes in and what goes out, and one sentence on what the skill refuses to do. The brevity is the discipline. A contract long enough to need a table of contents is a contract nobody reads at the moment they would need it, which is to say no contract at all. The contract has to fit in the head of the operator who is making a decision at the boundary while a release is in flight; if it does not, it has failed before it has been broken.

This is also why we do not write contracts as governance. Governance is the wrong genre for the document. Governance suggests external compliance, a thing imposed by an authority on a worker who would otherwise drift. The contract is the opposite. The contract is the worker's own description of what they will and will not do, written by the person closest to the work, kept short enough that they can defend it on a Tuesday at the boundary. When a skill's contract is written by someone other than the person who fills the role, it becomes a fiction within a week. When it is written by the role, it becomes the shape of the work.

We have now said what a skill is, and we have said what a contract is. The next question, which Chapter 18.C will take up, is the question of authorship: when a piece of work has nowhere to live, do we extend an existing skill to absorb it, or do we author a new skill to own it? That decision turns out to be the most consequential one a team makes about its own shape, and the wrong answer in either direction is expensive in a way that takes a year to notice.

Chapter 18.C — Authoring a new skill

Every team that has read the two chapters before this one arrives at the same question, usually within a week. We have a skill that is doing most of what we need. The new work is adjacent. Do we extend the skill we have, or do we author a new one? The question is not procedural. It is the most consequential design decision in this part of the system, because the answer determines what gets watched closely and what gets watched out of the corner of an eye.

We have learned to answer it with a single question of our own. Does this new work require a different kind of judgement than the existing skill brings, or does it require the same kind of judgement applied to more cases? If the kind of judgement is different — if the person doing this work would, in a real team, read different documents, weigh different risks, and answer to a different question at the end of the day — author a new skill. If the kind of judgement is the same, and the new work is just more of it, extend the skill that already exists. Volume is not a reason to split. Difference in the shape of the decision is.

The temptation to do the opposite is real, and it usually wears the costume of efficiency. A senior developer can, technically, perform an audit. A release manager can, technically, write a runbook. A validator can, technically, do a regression analysis if you describe regression analysis as a kind of validation. On a small team, combining roles feels like the grown-up choice. Fewer contracts to maintain. Fewer handoffs. Fewer places to forget to look. We have made this choice ourselves, repeatedly, and we resist it now for one reason that took us a long time to put into words. When a skill's judgement-boundary stretches to cover two genuinely different kinds of work, the work it does worst is the work nobody is watching closely. The thing the skill does well — the thing that earned it its place — gets the attention. The thing that was tacked on inherits the trust without earning it. By the time the absence shows up in an outcome, the original definition has already drifted past the point where a clean split is cheap.

The scar that taught us this is dated 2026-04-07. On that morning we had a single auditor skill that did three things inside one contract — code-quality validation on incoming changes, self-heal triage when a check failed, and a regression analysis on the agent's own output before a run was signed off. On small days, when the three concerns arrived one at a time and gave each other room, it worked. The contract was tidy. The inputs overlapped enough that maintenance was almost free. We had even told ourselves a small story about it: that the three concerns were really one concern wearing three hats, that the underlying judgement was unified, that the skill was a generalist in the good sense. On a large day in early April, the story collapsed. The auditor was being asked to switch between three different kinds of judgement inside one workday, and the regression analysis — the one that asked the strangest question, the one that had been added last — was getting the cheapest pass each time. The other two beats had momentum and templates and a clear shape. The third was inheriting whatever attention was left over after the first two had used theirs. Nothing failed loudly. The regression analysis just stopped meaning what it had meant a month earlier. We split the contract that afternoon, into three skills with three separate one-paragraph charters and three separate sets of inputs. The week that followed was unpleasant — three places to update instead of one, three drift-checks instead of one, three sets of evidence to keep honest. The month that followed was quietly correct. The regression analysis, when it had its own contract and nothing else to compete with, did the work it had originally been imagined to do.

We took a short test out of that week, and we have used it since. If you can write a one-paragraph contract for the new skill that describes what it takes, what it produces, what it owns, and what it refuses — and that paragraph does not have to reach into the existing skill's outputs to make sense — the split is right. The new skill has its own surface. If you cannot write that paragraph without referencing the old skill's work as a precondition, or without quietly absorbing one of the old skill's responsibilities to make the new one feel whole, then you are not authoring a new skill. You are extending an old one, and the honest move is to admit that and update the existing contract in plain sight.

It is worth being precise about what authoring means, operationally, because the word invites a kind of theatre that is not the work. Authoring a skill is not creating a folder, and it is not opening a document, and it is not running a generator that lays down scaffolding. The mechanics of where the artefacts live are the cheapest part of the exercise and the part you can change later without grief. The work is writing the contract — the few paragraphs that say what the skill takes as input, what it produces as output, what it owns end-to-end, and, just as importantly, what it refuses to do when asked. The templates, the helper code, the reference notes the role consults, all of that is downstream of the contract. They follow it. They do not lead it. We have watched teams, including our own on bad days, try to author a skill by starting with the templates. The result is always a role that can produce a tidy artefact and cannot tell you why it produced one rather than another. The contract is the place where the judgement is named. Everything else is plumbing.

Once a skill exists in the world — once it has a contract, a charter, and a track record of work it has done well and work it has refused — it becomes subject to a quieter and harder problem, which is what happens when the contract needs to change. Skills do not stand still. The work in front of them shifts. The other skills around them shift. The team's understanding of what they are for shifts, sometimes from one quarter to the next. We have learned the hard way that you cannot simply edit a skill's contract in place and expect the system to forgive you. Skills, like the human roles they descend from, have versions. The next chapter is about how we keep those versions honest.

Chapter 18.D — Skill versions and rollback

When a skill needs to change, we change it. There is no ceremony around the act, no version field at the top of the file, no parallel history we keep alongside the real one. The skill lives in a file. The file lives in a repository. When the role's judgement needs to shift — because we have learned something, or because the contract that used to be enough has stopped being enough — we open the file, we edit the sentences that need editing, and we commit. The change is a commit. The reasoning is the commit message. The previous version is one command away if anyone wants to see it. That is the entire policy, and most of this chapter is an argument for why the policy is so plain.

The temptation runs the other way, and the temptation is older than agents. A skill produces a bad outcome once, in a way that everyone watching agrees was avoidable, and the impulse is to add a paragraph at the bottom of the file that begins "do not do X anymore." The paragraph is honest. It is shorter than rewriting the rule it qualifies. It feels like the safe move, because nothing above it has been touched and so nothing above it has been broken. The paragraph stays. A month later there is a second paragraph, qualifying something else. A quarter later there are five. A year later the skill is twelve such paragraphs welded onto a body of text whose original sentences nobody has reread in months, and a careful reader cannot tell which sentence the skill is currently honouring and which one has been overridden by an addendum further down. Append-only documentation rots into a museum from the inside out. The exhibits are still labelled. The labels are still legible. The objects on the pedestals are no longer the objects the building was built for.

So the rule we follow is the unromantic one. Edit the skill in place. If a sentence has become wrong, the sentence is wrong, and it is the sentence that needs to leave, not a footnote about the sentence. The reasoning — the story of why this used to be the right contract and why it is no longer the right contract — belongs in the commit message, where it has the dignity of being attached to the change without crowding the artifact the change produced. We have rarely had to read those commit messages after the fact. We have never been unable to find them when we needed them. The previous version of the skill is recoverable to the character. It has rarely been necessary to recover it, because the new version was written by someone with the old one fresh in front of them, and the new version is the one the system is honouring now.

Rollback is the same shape, run in the other direction. If a skill change makes things worse — if the new contract is too strict, and starts rejecting work that the old one would have accepted, or too loose, and starts admitting work that the old one would have caught, or wrong in some quiet third way that only shows up when traffic is on it — we revert the commit. There is no escalation path. There is no committee. The person who changed the skill yesterday is the person who reverts it today, and the same shop discipline applies: small, reasoned, written down in the message of the revert.

We had to use the policy on 2026-04-21. The validation-role contract had been carrying a small, accumulated set of complaints, and we sat down over the weekend and rewrote it. The new version was tidier. It separated the parts that asked "is the evidence present" from the parts that asked "does the evidence match the intent," and it read, on a Sunday afternoon, like an obvious improvement. By Tuesday morning the new contract was rejecting work the old one had passed correctly for months. The failure was not dramatic. The pipeline was not on fire. A small steady percentage of changes that should have been admissible were being sent back with notes that, on inspection, were technically correct under the new wording and operationally wrong against any reasonable reading of what we wanted the role to do. We read the failures over a coffee, agreed the rewrite was a bad rewrite, and reverted the commit. The revert was three lines. The whole round-trip, from noticing the symptom to seeing the old behaviour restored in the next run, cost less than a meeting. The cost we did not pay — and this is the part of the story that matters — was the cost of debating whether to issue an addendum, or a deprecation note, or a patch version of the skill that would coexist with the original. Under a different policy we would have lived in that debate for a week, and the skill file at the end of the week would have been the worse for it whichever way the debate went.

What this means for tooling is the easy part. A skill repository does not need a versioning system on top of the one it already has. It does not need a yank command, because a revert is a yank with a better name and a longer history. It does not need a deprecated channel, because there is no channel — there is the file as it stands now, and the file as it stood before, and the latter is recoverable without ceremony. What the repository needs is writers who treat a skill change the way a senior engineer treats a commit to a deployed service. Small. Reversible. Reasoned-about in the message, so that the person who has to revert it three months from now can read the original argument before deciding whether the argument was wrong or the world has moved on.
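In plain git terms, the policy comes down to three commands. The sketch below only assembles them as argument lists so they can be inspected before anything runs. The helper names are hypothetical; `git log`, `git show`, and `git revert` are standard git.

```python
import subprocess


def history_cmd(path: str) -> list[str]:
    """List the commits, with their messages, that touched one skill file."""
    return ["git", "log", "--follow", "--", path]


def previous_version_cmd(commit: str, path: str) -> list[str]:
    """Print the skill file as it stood in the parent of `commit`."""
    return ["git", "show", f"{commit}^:{path}"]


def revert_cmd(commit: str) -> list[str]:
    """Create a new commit that undoes `commit`.

    Without --no-edit, git lets the author edit the revert message,
    which is where the reason for the revert belongs.
    """
    return ["git", "revert", commit]


def run(cmd: list[str]) -> None:
    """Run a command inside the current repository, failing loudly on error."""
    subprocess.run(cmd, check=True)
```

A rollback is then `run(revert_cmd(<commit id>))` from inside the repository, with the commit id taken from the output of `history_cmd`. Nothing here adds a versioning layer; it only names what source control already provides.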

That is the discipline. It is not the discipline of a documentation system. It is the discipline of source control, honestly applied to artifacts that the industry has, for reasons of fashion, decided to treat as something other than source. They are source. They behave like source. They reward the habits that source has always rewarded, and they punish, slowly and then all at once, the habit of writing a paragraph at the bottom and hoping the reader will know which paragraph to believe.

Running the loop

Chapter 19 — Why “always on” is a trap

“Always on” is the fantasy that seriousness equals vigilance. If the machine never sleeps, the thinking goes, nothing will slip past. That story sells well in a deck. It ages badly in a repository, because software delivery is not a security camera pointed at an empty parking lot. It is a sequence of commitments—branches, reviews, merges, environments—and when many automated actors wake up at once, they do not become more careful. They become more synchronized, and synchronized mistakes are the expensive kind.

The trap is not laziness. It is overlap. Two delivery roles acting on the same ticket at once do not double your throughput. They double your chances of the same ticket sprouting two branches, two pull requests, and two confident narratives about who is “working it.” Branch fights are rarely a moral failure on the part of the model. They are a concurrency failure: nobody agreed which run owned the ticket. Ship’s rule is blunt on purpose: automated delivery work is bounded so that no two runs can act on the same work at once. Runs must not overlap in ways that duplicate effort. You are not being precious; you are refusing to fund a multiplayer game without a referee.

We used to buy that referee with a clock — a grid of named minutes, one role per slot — and it taught us exactly what the referee is for. Compare a firehose to a grid. A firehose says “whenever there is capacity, spray.” A grid says “this minute has a name, and that name has a job.” The firehose feels faster because motion is visible. The grid feels slower until you measure what actually merged, what actually reviewed cleanly, and what you can explain without Slack archaeology. The grid gave us predictable contention, independent failures, and a board humans could plan around. Ship dispatches off transitions now, not a grid of slots — closer in shape to the firehose the grid was built to refuse — and the only reason that is safe is that the referee did not leave; it changed lever. Leases, caps, and cascade limits do what the slot used to do: a project_lock keeps two runs off the same project, per-stage caps and cascade limits keep a transition from fanning out, dependency blocks hold work that is not ready. Bounded concurrency, not a bounded clock.
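A minimal sketch of how leases and caps can stand in for the slot, assuming an in-memory dispatcher. Only `project_lock` is a name taken from the text; the class, the method names, the refusal strings, and the cap values are hypothetical, and a real engine would persist its leases rather than hold them in a set.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BoundedDispatcher:
    """Hypothetical in-memory referee: one lease per project, a cap per stage."""
    stage_caps: dict[str, int]
    max_cascade_depth: int = 2
    project_lock: set[str] = field(default_factory=set)
    running_per_stage: Counter = field(default_factory=Counter)

    def try_start(self, project: str, stage: str, cascade_depth: int = 0) -> str:
        """Start a run if every bound allows it, else return the refusal reason."""
        if cascade_depth > self.max_cascade_depth:
            return "refused: cascade_limit"
        if project in self.project_lock:
            return "refused: project_lock held"
        if self.running_per_stage[stage] >= self.stage_caps.get(stage, 0):
            return "refused: stage_cap"
        self.project_lock.add(project)
        self.running_per_stage[stage] += 1
        return "started"

    def finish(self, project: str, stage: str) -> None:
        """Release the lease and the stage slot when a run ends."""
        self.project_lock.discard(project)
        self.running_per_stage[stage] -= 1


engine = BoundedDispatcher(stage_caps={"develop": 1, "review": 2})
results = [
    engine.try_start("alpha", "develop"),                 # takes the lease
    engine.try_start("alpha", "review"),                  # same project: lease held
    engine.try_start("beta", "develop"),                  # develop is at its cap
    engine.try_start("beta", "review", cascade_depth=3),  # cascade too deep
    engine.try_start("beta", "review"),                   # within every bound
]
```

Every refusal comes back as a named reason rather than a silent skip, so the bound is something an operator can read in a log instead of infer from an absence.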

Correlated failures follow the same geometry under either lever. Independent failures are teachable: one job misread a label, one token expired, one preview flaked. Correlated failures sound like weather—everything went weird after lunch—and weather is what you say when causality is missing. Unbounded dispatch manufactures correlation by stacking starts, retries, and side effects into the same narrow windows. The basement does not care that each hose felt reasonable alone — which is why the caps exist.

Event-driven dispatch keeps something the grid promised: a traceable timeline. “Something broke around midday” is a feeling. “Trace the run that fired off ELS-204's transition at 12:40” is a sentence your team can act on. Because every run traces back to the transition that asked for it, logs, run URLs, and ticket comments line up into a single story. On-call stays human-sized because the question stops being when did the universe change and starts being which transition fired this run, and did the engine refuse anything it should have run. The framework does not canonize a timetable; it canonizes the discipline: non-overlapping delivery work so automation does not step on its own feet.

There is a political benefit, too, which matters more than most architecture reviews admit. If everything is eligible to run at once, leadership imagines infinite parallelism. Headcount and review capacity become invisible because the board looks busy. If the engine exposes a finite concurrency envelope — so many runs at once, this many per stage, leases per project — you can point at the caps and say, truthfully, “this is the throughput we designed.” That sentence saves money when someone asks for “just one more bot” without adding reviewers, without narrowing tickets, and without accepting that merge is still a human gate. Bounded concurrency is not pessimism about automation. It is honesty about attention—yours, your API vendor’s, and your repository’s.

None of this argues for moving slowly. It argues for moving legibly. A bounded concurrency envelope is how you keep dispatch deterministic, branches polite, and incidents boring enough to fix. Overlap does not feel like a design decision when you add it—it feels like a small convenience, a second trigger “just to catch stragglers.” Then the stragglers are your main branch, and the convenience is a tax. Always-on is a trap because it promises vigilance while delivering pile-ons. A bounded envelope you can reason about beats a firehose with ambition. Let the demo run every second if you must; let production run inside leases and caps you can grep and defend.

Note — Field note. Two commits from the reference org's grid era teach this chapter as archaeology. A 2026-04-14 commit titled “make SDLC schedule robust to stale github.event.schedule” collapsed four crons into one because GitHub was delivering stale cron strings after edits and skipping every role. A 2026-04-15 commit titled “SDLC scheduled slot must not skip on odd UTC hour” replaced an even-hour guard after Actions delivered a 05:00 cron at 05:08. Those were the days when the clock was the referee, and the lesson then was: the grid works; the other system's clock does not. The lesson aged into a deeper one. The clock was never the point — the bound was. When the clock turned out to be the brittle part, Ship kept the bound and dropped the clock, moving the referee into leases, caps, and cascade limits that no upstream cron delivery can skip.

Figure: Role grid, at most one delivery role per UTC slot.

Chapter 20 — Morning on the board

The day does not start with heroics. It starts with a board that already knows what “today” means.

Before anyone picks up work, Todo should be prepared: not a vague Backlog dump, but a small set of intentions that a human can actually finish. Preparation is respect—for the person doing the work and for everyone downstream who will interpret signals from what ships. A prepared board is not optimism dressed as planning. It is a contract with the future self who will be tired at four in the afternoon.

When the machine runs, let it run first. Dispatch is cheap, and cheap certainty compounds: a transition either resolves to a routine or it does not, and either way you know early. If something is wrong, you want to know before you have narrated a story in a meeting or rebased your optimism. Early green is boring. Boring is the point.

Here is a quiet truth teams resist because it feels like cheating: green and silent when there is nothing to advance is success. Silence is not absence of work. It is absence of noise you did not ask for. A transition that resolves to no routine, with everything green, is not “nothing happened.” It is “nothing broke while we were not looking.” Celebrate the lane that stays quiet when the rules say it should.

When something launches—when intent becomes motion—leave traces. A branch name is a postcard from the past. A pull request is a letter to the future. Ticket comments are the marginalia that explain why a reasonable person chose this over the obvious alternative. You are writing for the version of you who will debug at midnight and for the teammate who inherits your choices without your context. Traces reduce panic.

Humans review when ready—not when anxious, not when performatively diligent, not when the calendar says so. Readiness is having enough signal to judge: the diff is scoped, the risk is named, the rollback path exists, the intent is legible. If the system is healthy, the human’s job is judgment, not babysitting.

If you want a kitchen metaphor, keep it optional. Some teams love mise en place; others work like a busy line cook at rush hour. Both can ship. The mistake is insisting the metaphor match your aesthetic instead of your constraints.

What you are really doing each morning is lowering variance. Variance is “sometimes it works, sometimes it does not, and nobody can predict which.” A prepared Todo, dispatch that resolves cleanly off a transition, green silence when nothing qualifies, readable run traces, and reviews at the right moment are one strategy expressed in different places—making outcomes more predictable without making people more rigid.

If the board is noisy, fix the board before you fix the people. Noise trains cynicism. Clarity trains momentum. Prepare Todo, let a clean transition do the dispatching, treat quiet green as a win, leave traces like you mean it, review when the signal is sufficient, and shave variance until shipping feels less like gambling and more like craft.

Chapter 20.A — What to measure in the morning

Every team that adopts an agentic loop eventually wants a dashboard, and most of them build the wrong one first. The wrong dashboard counts motion — pull requests opened, comments posted, agent runs invoked, tickets touched — because those numbers are big and grow daily and make the pilot look successful on a quarterly slide. They also teach the system to be busy. A number rewarded is a number gamed, and nothing games as efficiently as a motivated agent with a permissive fence. If the first metric on your wall is PRs merged per day, you have just funded a competition to lower review rigour until the chart goes up.

The right dashboard counts shape. A small, deliberately boring panel of signals that tell the operator whether the loop is doing its job, not whether the loop is doing work. There are five of them worth naming out loud. The first is dispatch outcome — of the transitions the poller saw, what fraction resolved to a routine that actually fired, versus those refused on a gate (no_routine, blocked_by_dependency, a held lease). A steady mix is healthy; transitions that suddenly all resolve to no_routine are a tracker-drift alarm — a state or label was renamed — not a productivity alarm. The second is artifact drift — the count of shipctl doctor runs whose reconciled stack diverges from .ship/config.yml, summed by repo, summed by week. Drift climbing means somebody is editing the stack out-of-band, and that is a ticket against a human, not against the loop. The third is feedback emission — how many shipctl feedback drafts were written and submitted last week, broken down by artifact. A prompt that keeps generating feedback is a prompt asking to be redesigned; a prompt that never generates any is either perfect or invisible, and invisible is usually the honest explanation.

The fourth is E2E flake signal — the rolling ratio of test reruns to first-pass successes on the same commit, by workflow. The reference org has seven separate test(e2e): stabilize … commits in its history, each one a day of someone's time paid to a suite that had been lying. Counting flake once it emerges is cheaper than learning about it from an on-call who stopped trusting the pipeline three weeks ago. The fifth is cost envelope — bounded spend, not actual spend. Report "we are sized for ~N agent runs per day across M roles" and flag any day that exceeds that envelope by more than a given factor, independent of whether the bill arrived yet. The chapter on phase-zero economics (Ch. 31.A, below) explains why envelope-first is the right posture for a bounded loop.
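Three of the five signals reduce to small arithmetic. The sketch below assumes hypothetical inputs: a list of dispatch outcome strings, rerun and first-pass counts, and a sizing number with an alert factor. The function names, the `fired` label, and the default factor are illustrative; only the labels `no_routine` and `blocked_by_dependency` come from the text.

```python
from collections import Counter


def dispatch_mix(outcomes: list[str]) -> dict[str, float]:
    """Fraction of observed transitions per outcome label."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def tracker_drift_suspected(outcomes: list[str]) -> bool:
    """True when every transition seen resolved to no_routine."""
    return bool(outcomes) and all(o == "no_routine" for o in outcomes)


def flake_ratio(reruns: int, first_pass_successes: int) -> float:
    """Reruns per first-pass success; infinite if there were only reruns."""
    if first_pass_successes == 0:
        return float("inf") if reruns else 0.0
    return reruns / first_pass_successes


def exceeds_envelope(runs_today: int, sized_for: int, factor: float = 1.5) -> bool:
    """Flag a day whose run count passes the designed envelope by `factor`."""
    return runs_today > sized_for * factor
```

None of these needs a bill or a vendor API: each works from counts the loop already produces, which is what keeps the panel cheap enough to build first.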

None of these five is vanity. They are the operator's morning answer to "is today like yesterday, and if not, where did the shape change?" Build them in that order, build them cheap, and keep the panel small enough that the team actually looks at it. A wall with five honest numbers beats a wall with forty cheerful ones; the cheerful wall is how you miss the afternoon when fifteen identical commits land under the same title and nobody notices until the next morning.

Chapter 21 — What “green” really means

In CI and automation, “green” is one of the most overloaded words in software. People say it when they mean success, safety, approval, or completion. That ambiguity is harmless until it becomes the thing you optimize for. Then teams start treating a passing pipeline as proof of outcomes it was never designed to certify—and agents, which are excellent at pattern-matching signals, amplify the mistake by confidently narrating progress from the wrong layer.

Green means the job finished without an infrastructure or orchestration failure. It means the runner stayed up, dependencies installed, scripts exited zero, gates you explicitly encoded were satisfied, and the automation produced artifacts or state transitions you asked it to produce. It does not mean the change shipped in the product sense, that users are safe, that behavior matches intent, or that the ticket’s acceptance criteria are true in the world your customers inhabit. Shipping is a business and product event; CI green is an engineering hygiene event. Conflating them is how organizations end up with pristine dashboards and unhappy users.

Green can still mean a transition resolved to no routine, or the agent ran and concluded “needs human input,” or a scanner step skipped because a token is missing in a development environment—sometimes acceptable if your policy says so. Green is not automatically merge to main, deploy to production, or “the ticket is done.”

The fix is to keep layers explicit: orchestration truth versus product truth. Orchestration truth is what machines can cheaply assert: builds compile, tests ran, linters passed, preview environments became reachable, required reviewers clicked approve, branch protection held. Product truth is messier—whether the feature works for real workloads, whether edge cases behave, whether the change resolves the customer problem described in the ticket. Much of that cannot be inferred from “pipeline succeeded.”

Agents especially tempt collapsing layers because they are rewarded for producing coherent narratives quickly. A model can read “preview URL is live” and slide into “the change is verified,” or read “workflow succeeded” and imply “we shipped.” The human operator must insist on language that names the layer: “orchestration succeeded” versus “we validated the product behavior we care about.”

Two phrases illustrate the trap: preview up versus preview verified. Preview up is orchestration truth: the environment exists, DNS resolves, the service responds. Preview verified is product truth scoped to a ticket: someone (human or a carefully designed check) exercised the paths that matter and compared results to expectations grounded in the ticket’s timeline and acceptance criteria.
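One way to keep the two layers from collapsing in a status message is to give each its own field. This is a sketch, not Ship's reporting format: the `RunReport` type, its fields, the workflow name, and the wording of the summary are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunReport:
    """Hypothetical report that never lets one layer speak for the other."""
    workflow: str
    ticket: Optional[str]              # which ticket, if any, was touched
    orchestration_ok: bool             # what machines can cheaply assert
    product_verified: Optional[bool]   # None means nobody has checked yet

    def summary(self) -> str:
        """Name the layer in every claim instead of saying just 'green'."""
        orchestration = "succeeded" if self.orchestration_ok else "failed"
        if self.product_verified is None:
            product = "not yet verified"
        elif self.product_verified:
            product = "verified against the ticket"
        else:
            product = "checked and does not match the ticket"
        target = self.ticket or "no ticket"
        return (f"{self.workflow} [{target}]: orchestration {orchestration}; "
                f"product behavior {product}")


# "Preview up": the environment responds, but nobody has exercised it yet.
preview_up = RunReport("preview-deploy", "ELS-204", True, None)
```

Because `product_verified` defaults to nothing rather than to true, a passing pipeline cannot silently promote itself from “preview up” to “preview verified”; someone, or some deliberately designed check, has to set that field.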

Learn to read which job executed and which ticket, if any, was touched. The workflow title and the ticket timeline should tell the same story—not as bureaucracy, but as a guardrail against semantic drift. When workflow language drifts into generic milestones—“CI green,” “merged”—you lose the thread that ties automation signals back to the customer story.

None of this argues against automation or against using green as a quick signal. It argues for precision. Treat orchestration success as necessary but never sufficient for product claims. If you do that, green stays what it is: a faithful indicator that your pipeline did its job—not a synonym for shipped, not a replacement for judgment.

Chapter 22 — Queues are a feature, not a confession

A queue is not a pile of shame. It is a place where work waits in public, under rules you can explain. When teams treat queues like confessions—something to hide until it is “clean”—they lose the main benefit queues provide: a shared surface where intent becomes legible. Visible work in progress is the beginning of honest planning.

The first dashboard worth trusting is not a vanity chart of velocity. It is the tracker: named items with owners, states, and links to reality. A tracker is boring on purpose. It is where a hypothesis becomes a ticket, a ticket becomes a branch, a branch becomes a review, and a review becomes something a user can touch. When the tracker is thin or ornamental, every other dashboard is fiction. When the tracker is current, executives can ask shallow questions—“what shipped, what is next, what is stuck”—and engineers can answer with depth without improvising a new ontology in the meeting. Optional snapshot scripts are a second lens; the board is still the contract between abstraction and implementation.

Healthy queue culture sounds like calm language about limits. There is a defined intake. There is a cap on how much can be In progress before new work must wait. There is a place for blocked work that does not masquerade as active work. Ready work is explicit: acceptance signals, dependencies resolved, a person who can pull it next. Unhealthy queue culture feels like moral weight. The backlog becomes a junk drawer. WIP hides in private lists and chat threads. People apologize for the size of the queue instead of negotiating its shape. Urgency substitutes for sequencing. Everyone is busy, yet nothing finishes cleanly because the system rewards starting over finishing.
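The limits described above can be sketched as a pull rule. This is a minimal illustration under stated assumptions, not Ship's implementation: the state names, the `WIP_CAP` value, and the `Ticket` shape are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

WIP_CAP = 3  # hypothetical value; the text only says that a cap exists


@dataclass
class Ticket:
    key: str
    state: str  # assumed states: "todo", "ready", "in_progress", "blocked"


def next_pull(board: list, cap: int = WIP_CAP) -> Optional[Ticket]:
    """Return the next ready ticket, or None when new work must wait."""
    # Only tickets in "in_progress" count against the cap; blocked work
    # sits in its own state and does not pass for active work.
    in_progress = sum(1 for t in board if t.state == "in_progress")
    if in_progress >= cap:
        return None
    # Pull only work that is explicitly ready, in board order.
    return next((t for t in board if t.state == "ready"), None)
```

The point of the sketch is where the decision lives: the cap refuses the pull, so nobody has to apologise for the queue.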

The tension between executive view and engineering view is depth compression, not cynicism. Queues work when the tracker carries both layers: milestones stay portable; tickets stay inspectable. The bridge is disciplined movement from fuzzy intent to crisp readiness, with evidence attached.

Here is a distinction that changes where pressure lands. The machine—automation, CI, deploy—should be loaded predictably, with concurrency limits and clear failure signals, because thrashing the machine produces garbage at high speed. Humans fail by context switching and by pulling work that is not actually ready because it feels safer than admitting uncertainty. If you squeeze the machine to absorb organizational ambiguity, you get flaky systems. If you squeeze people to “just pick something,” you get motion without progress.

Put the pressure where choices happen: on selecting the next piece of ready work, not on inflating the queue to prove seriousness. A long Todo is not proof of ambition; it is proof of deferred decisions. A visible, bounded queue is proof that you are willing to say no now so you can say yes with quality later. Ship sides with visible depth because visibility keeps arguments honest—you cannot renegotiate what you refuse to measure.

When Todo is long, the conversation shifts from “why is automation slow?” to “why are we committing more ready work than we can review?” That shift is healthy. It moves pressure off the machine—which can only execute policy—and onto humans—who choose how much work enters the ready state.

Queues are a feature when they make trade-offs explicit. Treat them as confession, and people will curate the visible list until it lies. Treat them as infrastructure, and the same list becomes a place where trust compounds.

Chapter 22.A — Evals for skills

If skills are the unit of change, then skills need tests. Not the majestic integration suites that gate release of a service — small, fast, humble evaluations that answer one specific question about one specific prompt: given an input I know the shape of, does this skill still produce an output I can live with? Without those, version bumps on prompts are vibes. With them, a prompt change on Friday night can be defended on Monday morning without an archaeologist.

The shape of a useful eval is narrower than most teams expect at first. You do not need to score a prompt against a grand benchmark. You need a handful of fixtures — real ticket texts, real diff hunks, real error bodies — and a set of assertions about the output, most of them structural. Does the intake prompt return a branch slug that matches the ticket identifier? Does the developer prompt refuse to open a second PR when an open one already exists? Does the clarifier stop after two questions and never recommend a label outside the contract? These are not "is the answer smart" tests. They are "did the skill honour the invariants it was written against" tests, and they are cheap to write the moment you treat the skill as code.
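A structural assertion of this kind can be only a few lines long. The sketch below checks the first invariant, that a branch slug carries the ticket identifier. The fixture shape, the slug format, and the function names are hypothetical, chosen only for illustration.

```python
import re


def slug_matches_ticket(ticket_id: str, branch_slug: str) -> bool:
    """Structural check: the slug contains the ticket id as a whole token."""
    # The id must be bounded by the start/end of the slug or a separator,
    # so "elm-64" does not accidentally match "elm-640".
    pattern = rf"(^|[/_-]){re.escape(ticket_id.lower())}($|[/_-])"
    return re.search(pattern, branch_slug.lower()) is not None


# Hypothetical fixtures: (ticket id, slug the intake skill produced).
FIXTURES = [
    ("ELM-64", "elm-64-keep-standup-runs-green"),
    ("ELM-64", "feature/elm-64"),
]


def run_eval(fixtures) -> list:
    """Return the failing fixtures; an empty list means the invariant held."""
    return [
        f"{ticket_id} -> {slug}"
        for ticket_id, slug in fixtures
        if not slug_matches_ticket(ticket_id, slug)
    ]
```

Nothing here judges whether the slug is a good name. It only asks whether the skill honoured the invariant.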

The canonical scar in the reference org for what happens when you skip this is dated 2026-03-15 — subject line fix(linear-agent): verify preview serves real app, not Bunny placeholder. For more than a day the release-check workflow had been reporting green on pull-request preview URLs because the URL responded 200 OK, the TLS handshake completed, and the workflow ticked a success box. What the URL was actually serving was Bunny's placeholder page — the literal string We're deploying your app! — while the real build was still coming up or had already failed. The fix was not another probe or another retry. It was a six-hundred-line cli.ts change whose operative moment is, in effect, an eval: fetch the preview URL body, read the HTML, and reject the run with waiting_for_deploy if the placeholder copy is present. The seventeen pr-preview commits that surround it are the cost of learning the same lesson one probe type at a time; the eval is what made the nineteenth attempt unnecessary.
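The operative check is small enough to sketch. This is not the cli.ts change itself, only an illustration of the idea in Python. The placeholder string and the `waiting_for_deploy` status come from the commit described above; the function name and the other two status values are assumptions.

```python
from html import unescape

PLACEHOLDER_COPY = "We're deploying your app!"


def classify_preview(status_code: int, body: str) -> str:
    """Judge a preview by what it serves, not only by whether it answers."""
    if status_code != 200:
        return "unreachable"  # hypothetical status name
    # Unescape first so an HTML-encoded apostrophe still matches the copy.
    if PLACEHOLDER_COPY in unescape(body):
        return "waiting_for_deploy"  # 200 OK, but not the real app
    return "preview_up"  # hypothetical status name; up is still not verified
```

Note that the best outcome here is still only "preview up". Reading the body rules out the placeholder; it does not verify the ticket.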

The habit worth extracting from that afternoon is to eval against the output you actually care about, not against the easier proxy. A TCP probe was a proxy for "the app is up." A 200 OK was a proxy for "the preview is real." Reading the HTML was the eval. In skill terms, a success exit code from the agent is a proxy for "the change was good." An integration test on the PR it produced is the eval. An eval is more work to write than a proxy check and considerably more useful. Invest the work once per skill; the next fifty versions of that skill will borrow from it without asking.

One last rule: evals belong next to the skill, not in a distant test directory. When an operator edits developer.md, the nearby developer.fixtures/ and developer.assertions.yaml should appear in the same pull request, be reviewed in the same tab, and live and die with the skill's version. A prompt whose evals are somewhere else is a prompt whose evals will be deleted the first time somebody reorganises the test tree. Keep them close, keep them small, keep them run automatically on every skill version bump — and you will stop shipping "confidence" in place of evidence.

Chapter 23 — Audits are still not delivery

Even when audits use the same engine, the same agent runtime, and the same checkout as your delivery loop, they are not delivery. Delivery turns agreed scope into mergeable, reviewable increments. Audits look sideways at the repository and at signals such as scanner output, then decide whether anything deserves human attention. The machinery can match; the contract must not. Collapse the two and you get a board that measures motion instead of commitments — interesting stand-ups, fragile releases.

Audits do not consume the delivery queue. The delivery queue is where the organisation records what it has already promised to finish, in what order, with which reviewers and guards. If an architecture pass, a QA sweep, or a security review anchors on that same pick list, discovery starts competing with promise. That is not prioritisation; it is queue hijacking. Depth in Todo should still answer a single uncomfortable question: do we have more ship-shaped work ready than we have capacity to ship? When audit findings drink from the same straw, Todo depth lies. People interpret a long column as “we are overloaded on features” when half the cards are automated opinions that never passed a product test. Keep audits off the delivery pick path so throughput conversations stay honest — and so automation does not get blamed for a backlog it did not authorise.

Separate projects make the split visible where teams actually argue: the tracker. Parking audit output beside pre-release cards in one undifferentiated project turns every planning session into a knife fight between “the release we promised” and “the bot’s taste in modules.” Dedicated projects — tech debt, security, whatever names your culture uses — preserve the same habits of comments, links, and owners while changing backlog gravity. Leadership can ask how much assurance debt exists, who triages it, and whether this week’s risk budget buys shipping progress or burn-down of findings. Engineers can batch audit work without pretending each finding is a P0 on the train. Psychologically, the second project is permission to treat assurance as parallel, not as a competing feature team that speaks through tickets.

Evidence-only creation rules are how you keep that parallel track from becoming spam. The filter is blunt on purpose: no ticket without a pointer — a log excerpt, a failing check, a scanner JSON reference, a reproduced path, something another engineer could verify without interpreting the bot’s literary ambitions. “Consider improving architecture” is not evidence; it is a blog post wearing a label. “This advisory ID applies to this dependency range, here is the lockfile path” is evidence. Evidence is what turns a finding into work you can assign, estimate, and close the way you close defects — with a definition of done that does not depend on whether the reviewer agreed with the machine’s vibe that morning.
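The blunt filter can be sketched as a gate in front of ticket creation. This is an illustration under assumptions, not the reference org's code: the `Finding` shape and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    title: str
    # Pointers another engineer could verify: a log excerpt, a failing check,
    # a scanner reference, a lockfile path. The field name is hypothetical.
    evidence: list = field(default_factory=list)


def admit(findings: list) -> tuple:
    """Split findings into ticket-worthy (has a pointer) and opinion-only."""
    accepted, rejected = [], []
    for finding in findings:
        # An empty or whitespace-only pointer is not evidence.
        has_pointer = any(p.strip() for p in finding.evidence)
        (accepted if has_pointer else rejected).append(finding)
    return accepted, rejected
```

The gate does not judge whether the evidence is persuasive. It only refuses to create work that nobody could verify.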

Without evidence, audit bots become opinion engines: fluent, confident, and nearly useless inside a sprint. They emit judgments that sound weighty — clarity “could be improved,” coverage “might be stronger,” posture “should be reviewed” — without producing a delta anyone can execute against. Interesting in the abstract, those tickets steal attention from work that unblocks users. Worse, they train humans to tune out the channel. When everything sounds important, nothing is; when the automation publishes elegant worry every dawn, the rare real signal drowns in prose.

Evidence also disciplines the automation itself. Grounding claims in artefacts the repo or CI already produced makes runs comparable across days and branches: did this pointer exist yesterday, does it still exist after the merge, did the scanner output change for traceable reasons? Opinion-only issues cannot die cleanly because nobody can prove them satisfied; they linger as permanent background anxiety. Evidence-backed issues live or die on facts, which is how assurance stays compatible with engineering morale.

One concrete pattern for a recurring audit pass beside normal delivery — separate board, dedicated projects, no consumption of the pre-release queue — is described in the automation and operating docs. Treat it as an example of invariants, not as a fetish for specific filenames. Names and schedules may differ; the separation of lanes should not. Audits are still not delivery when the words are easy; they stay not delivery when the organisation is stressed and tempted to “just fold it all into one board.” Keep the boundary, and assurance becomes a rhythm you can trust. Erase it, and you gain motion until someone asks, quietly, what actually shipped.

Note — Field note Two reference-org commits show the shape honestly. A 2026-04-07 commit titled Add daily Linear audit roles — tech, QA, Snyk security wired a separate audit workflow with its own role prompts, its own Linear projects, and no path into the delivery pick. A 2026-04-10 commit titled bump next to 16.2.3 — Snyk SNYK-JS-NEXT-15954202 is the evidence-backed ticket that audits are meant to produce: advisory identifier in the subject line, version pinned in the diff, reviewer can walk the pointer to the external authority. Evidence is not a tone; it is a pointer an angry auditor can follow.

Chapter 24 — First boredom, then self-heal

Teams often reach for self-heal the moment things feel chaotic: flaky checks, stuck PRs, merge conflicts, and the quiet dread that someone will have to babysit the pipeline again tonight. That impulse is understandable. Recovery automation promises relief. But relief without structure is acceleration without steering. The right sequence is almost always the same: first boredom, then self-heal—stabilize the main lane until the day-to-day path is dull and predictable, and only then add machinery that repairs deviations from that path.

Stabilize the main lane first. Until that lane is trustworthy, “healing” is guesswork. You are not fixing a system; you are automating ambiguity. Duplicate pull requests, fuzzy picks, reviewers unsure what “done” means—these are failures of definition, not speed. Speed on top of fuzzy definition produces more incidents per hour.

This is where the mop versus kitchen metaphor helps. The kitchen is the layout: where ingredients live, how traffic moves, what gets dirty in the first place. The mop is what you use when something spills anyway. If the kitchen is poorly designed, you can buy a better mop and still end every day exhausted. Self-heal is the mop. Workflow design is the kitchen. If you automate the mop before you fix the kitchen, you are hiring a faster cleaner for a room that will never stop making messes.

Self-heal should be additive, not a replacement for clarity. It should assume a stable contract: this is how work is supposed to flow, these are the invariants we protect. Recovery then becomes a narrow tool: detect drift, restore invariants, escalate when the world no longer matches the model.

Useful patterns for shaping repeatable behavior live in Patterns and Docs → Authoring. When you are ready to encode what “normal” means in your org, browse the catalog for finished shapes to borrow before you wire recovery on top.

How do you know the main lane is “boring” enough? People stop improvising because they do not have to. Exceptions have categories, owners, and known fixes. Duplicate work and ambiguous picks are unusual events, not background radiation. Until then, investment in self-heal tends to amplify noise.

When you do add self-heal, keep it narrow. Prefer idempotent repairs over clever inference. Prefer “return to last known good state” over “guess what the human meant.” And treat every automated recovery as a signal: if the same heal fires constantly, the kitchen still needs work.
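Those three preferences fit in a short sketch. Everything named here is hypothetical (the `Healer` class, the threshold, the state dictionaries); the shape is what matters: restore a known-good state, do nothing when nothing drifted, and count every firing.

```python
from collections import Counter


class Healer:
    def __init__(self, alert_threshold: int = 3):
        # Hypothetical threshold: firings before a heal counts as recurring.
        self.alert_threshold = alert_threshold
        self.fired = Counter()

    def heal(self, heal_id: str, state: dict, last_known_good: dict) -> dict:
        """Idempotent repair: restore known-good values, no inference."""
        if state == last_known_good:
            return state  # nothing drifted, so running again changes nothing
        self.fired[heal_id] += 1
        return dict(last_known_good)

    def recurring(self) -> list:
        """Heals that keep firing: a signal about the workflow, not a win."""
        return sorted(h for h, n in self.fired.items() if n >= self.alert_threshold)
```

A heal that shows up in `recurring()` is the mop reporting on the kitchen.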

Note — Field note Self-heal shines when the main lane is already trustworthy. If duplicate PRs or tickets advancing on the wrong stage are still normal, a recovery bot just runs faster around a broken compass.

Boredom in the main lane is not stagnation; it is the sound of a system that can be trusted. Self-heal belongs on top of that trust—not as a substitute for making the path obvious. First boredom, then self-heal: stabilize what “normal” means, then automate the return to normal when the world wobbles.

Chapter 25 — “Sounds slow” — the honest answer

Someone will say it out loud sooner or later: this feels slow. They will mean the bounded concurrency, the leases and caps, the insistence on review, the refusal to let three automations race for the same ticket. They will compare your bounded loop to a demo where the agent “just does everything” in one breathless take. That comparison is almost never fair — and fairness matters, because “slow” lands on people’s backs before it lands on architecture.

Ship optimizes for repeatable throughput, not theatrical throughput. Theatrical throughput is optimized for the moment the screen is shared: motion, confidence, a story that ends before the hard questions arrive. Repeatable throughput is optimized for the Tuesday four weeks later, when an auditor, a regulator, or a tired teammate asks what happened, in order, under which policy. The first kind wins rooms. The second kind wins sleep.

If you need more speed, the honest levers are almost never “add another bot.” They are narrower tickets so dispatch and review stay legible; clearer guards so automation stops early instead of shipping ambiguity; and more human review capacity so merge is not the bottleneck you pretend does not exist. Overlapping delivery roles do not multiply throughput — they multiply branch fights and duplicate pull requests. You do not buy velocity; you buy a multiplayer game without a referee, then pay interest on every merge.

Redefine speed once, in writing, where leadership can see it. If speed means “motion without outcomes,” Ship will lose that contest forever — and should. If speed means mergeable, auditable progress — increments that match intent, leave traces, and survive a hostile read of the timeline — the chaotic demo becomes the slow option. It borrows time from every future incident by skipping review, security, and the existence of other humans. The bounded loop looks patient because it refuses to mortgage tomorrow for applause today.

That reframing is also a kindness to junior engineers. Telling someone to “move faster” when tickets are vague and reviewers are missing is not management; it is weather. The bounded loop makes constraints visible enough that staffing and scope conversations can happen without moralising about productivity. That is systems thinking — and it is how you keep seriousness from turning into shame.

The improvement loop

The loop described so far is linear: a transition, a dispatched stage, an agent run, a PR, a merge, an audit. It is the loop that delivers work. It is not, by itself, the loop that makes the system better at its job. That second loop — the one that turns the scars of Tuesday into the contract of Wednesday — is what this part is about. Without it, the patterns you authored in part 2.5 decay into wall art; the telemetry you collected becomes a vendor dashboard; the feedback an on-call wrote at 3 a.m. dies as a Slack message nobody threaded. With it, each incident is a candidate for a skill or knowledge change, each release of shipctl carries the lessons of the last one, and the system stops asking the same question twice a week.

Chapter 25.A — Where a fix becomes feedback

There is a specific moment on every rotation when a small fix is quietly the wrong answer. You see it most clearly in a sequence of commits that looks, at first, like diligent work. On 2026-03-16, in the reference org, fifteen commits landed under near-identical subjects — fix(ELM-64): keep zero-target standup runs successful on Slack membership errors. Variants followed in the same window: handle Slack bot-not-in-this-channel wording, handle Slack bot-is-not-in-this-channel wording, harden zero-target Slack audit and delivery recovery. A thoughtful reviewer, looking at the log the next morning, would ask the obvious question: what exactly were we fixing, fifteen times? The answer, once you read the diffs, is mundane and damning. Slack was changing the English inside its error responses; each fix added a new string to a growing match list. The pattern was not broken at its edges. It was broken at its contract — it had agreed to care about the exact wording of someone else's error text, and the someone else did not read the contract.
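The difference between the two contracts can be shown side by side. The first function reproduces the brittle shape using the wordings named in the commit subjects above. The second assumes the failure is available as a structured payload with a machine-readable `error` field; whether the reference org's integration exposed one is not stated here, and the field names and code value are illustrative assumptions.

```python
# Brittle shape: a growing list of English phrases owned by someone else.
KNOWN_WORDINGS = [
    "bot-not-in-this-channel",
    "bot-is-not-in-this-channel",
]


def is_membership_error_by_text(message: str) -> bool:
    """Match on wording; every rewording upstream needs a new patch here."""
    normalized = message.lower().replace(" ", "-")
    return any(wording in normalized for wording in KNOWN_WORDINGS)


# Sturdier shape, ASSUMING a machine-readable code is available.
MEMBERSHIP_CODES = {"not_in_channel"}  # assumed code value, for illustration


def is_membership_error_by_code(payload: dict) -> bool:
    """Match on the code field; the human-readable text is free to change."""
    return payload.get("ok") is False and payload.get("error") in MEMBERSHIP_CODES
```

The second function is not cleverer than the first. It simply agrees to care about something the other party is less likely to rephrase.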

The right move, by commit number three, was not commit number four. It was to stop patching the symptom and move upstream — to mark the skill that chose to match on English text as in need of design, not more lines. That is the action Ship names feedback. Feedback is not a survey. It is a structured, versioned note attached to a skill, declaring that a class of incident keeps attaching to it, that the current shape is the cause, and that the skill should change before another on-call gets the page. In the shipctl client it takes the shape of shipctl feedback new, a local markdown draft that the operator writes while the incident is still warm, and shipctl feedback submit, which turns that draft into a ticket on the skill's own governance queue. shipctl is no longer the primary operator front door — the MCP edge is, and open questions and approvals flow through its inbox_* tools and the console /approve surface as readily as through the CLI — but the feedback habit is the same in any of them. The book cares less about the command and more about the habit: when the same title keeps appearing in your log, the log is telling you the skill is wrong.

Feedback works when three things are true. First, the skill or knowledge entry exists as a first-class object, which is why part 2.5 had to come first. You cannot file feedback against a Slack channel or a hallway hand-wave. Second, the operator trusts that the feedback will be read: somebody owns the skill, reviews incoming notes on a cadence, and responds with a version bump, a deprecation, or an honest "works as intended, here is the policy." Third, feedback is cheap to write and expensive to ignore: it should take less time than writing a sixteenth fix, and the ticket it opens should sit on a board where skipping it is more visible than dealing with it. Ship wires the first two on the client side, and the third is the organisational discipline the rest of this book is quietly campaigning for.

The quiet danger of this chapter is feedback inflation — the worry that once operators can file notes on skills, every minor annoyance will become a ticket and reviewers will drown. In practice the opposite has held, for the same reason on-call logs do not drown: writing a feedback note that survives peer reading is work, and the operator who has a real grievance writes one; the operator who merely disagrees with a skill's taste does not. If the pile ever does grow too large, that is itself a signal — and the signal, read honestly, says the skill's owner needs help, the skill needs to be split, or the skill needs to be retired. None of those are bad outcomes. They are the outcomes the loop was designed to produce.

[Figure: Fifteen identical patches in one calendar day]

Fig. 25.A — Compound interest, drawn from a real commit log. By patch #3 the right move was to escalate the artifact, not to ship patch #4.

Chapter 25.B — Telemetry that serves operators, not vendors

On 2026-03-16, the reference org shipped a commit with the subject feat(ci): add automated failed-check recovery workflow — a single file at .github/workflows/check-failure-recovery.yml, three hundred and twelve lines long. Within six hours, fifteen further commits landed against it, almost all of them titled some variant of install runtime deps for recovery and self-heal CLIs. The self-heal workflow had been born, and the self-heal workflow had immediately needed self-heal, because its dependency install step was itself the thing most likely to fail. This is funny in a depressing way. It is also the purest possible argument for the thing we call telemetry.

Telemetry, in Ship, is not a marketing event bus. It is a small, carefully bounded stream of operator-facing signals, emitted by shipctl and shipctl-adjacent workflows, and shaped so that the people running the system can see its health before the commit log tells them what went wrong. The events are few on purpose: skill.run, skill.fetch, knowledge.query, doctor.result, feedback.submit. Each carries a minimum set of fields — skill id, version, success or failure, a coarse error category. None of them carry customer data. None of them name humans. All of them are opt-in, configured at shipctl init and revisitable at any time. The signal they emit is quiet enough to be boring in aggregate and sharp enough to be useful in a specific one.
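A bounded event of this kind might look like the sketch below. The five event names and the four fields come from the paragraph above; the class name, the field spellings, and the validation rules are assumptions made for illustration, not the shipctl schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional

ALLOWED_EVENTS = {
    "skill.run", "skill.fetch", "knowledge.query",
    "doctor.result", "feedback.submit",
}


@dataclass(frozen=True)
class TelemetryEvent:
    event: str
    skill_id: str
    version: str
    success: bool
    error_category: Optional[str] = None  # coarse category only, never a message

    def __post_init__(self):
        # Few events on purpose: anything outside the list is rejected.
        if self.event not in ALLOWED_EVENTS:
            raise ValueError(f"unknown event: {self.event}")
        # A successful run has no error category to report.
        if self.success and self.error_category is not None:
            raise ValueError("successful events carry no error category")

    def payload(self) -> dict:
        """Exactly the fields above: no customer data, no human names."""
        return asdict(self)
```

Keeping the payload this small is what makes the same shape safe to send to either backend.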

The specific use is the one that would have caught the self-heal afternoon in minutes instead of hours. A small doctor.result histogram showing five consecutive failures against the recovery workflow in the first hour would have told the operator, live, that the new workflow was not yet stable: not as a postmortem discovery the next morning, but as a dashboard that changed colour while they could still do something about it. A small artifact.use count showing failures spiking in the patterns the recovery workflow depends on would have named the artifact to roll back before the failure propagated. None of this is ambitious telemetry. It is the bare minimum needed to tell an operator whether their morning work stayed in the drawer they opened.
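The consecutive-failure signal needs almost no machinery. A minimal sketch, assuming the doctor.result outcomes for one workflow arrive as a time-ordered list of booleans; the function names and the default threshold of five (borrowed from the example above) are illustrative.

```python
def longest_failure_streak(results: list) -> int:
    """Longest run of consecutive failures in time-ordered results."""
    longest = current = 0
    for ok in results:
        # A success resets the streak; a failure extends it.
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def looks_unstable(results: list, threshold: int = 5) -> bool:
    """True when the workflow has failed `threshold` times in a row."""
    return longest_failure_streak(results) >= threshold
```

A dashboard that changes colour on `looks_unstable` is the whole feature; the operator supplies the judgment.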

The shape that distinguishes this kind of telemetry from the ambient vendor kind is that it is for the operator first, the platform second, the vendor never. The payload goes to the customer's own backend by default; submission to the shared Ship telemetry endpoint is a second decision the operator makes, separately, and the event shape is identical either way. That symmetry matters. An operator who cannot see what is being sent cannot trust the mechanism. An operator who can see exactly what is being sent — and who can disable it in one command, export it in another, and delete it in a third — can use it without apologising to their security officer.

The lasting habit from that March afternoon is to treat the introduction of any new workflow as a telemetry event in itself. When you add a new recovery lane, a new pattern, a new daily audit role, the first thing you should wire is the counter that will tell you whether it stayed healthy. The artifact and its telemetry should land in the same PR, not in different sprints. The alternative is reading the commit log afterwards to reconstruct what was really going on, which is, generously, retrospective. Telemetry is the cost of learning while the fire is still useful.

On the daily horizon, the canonical operator-facing consumer of those signals is the daily retro role (flow-daily-retro); it sits alongside per-issue learning (flow-learning-capture) and the every-six-hours retry sweep (op-retry-sweep). Once a day it reads the tracker delta across all watched projects and the last twenty-four hours of run journals; the load-bearing signal it watches for is not a red CI badge but tracker_delta == 0 for a lane or for the whole system. A lane that produced no movement on any ticket (no transition, no comment, no PR link) is the silent-failure case the prologue lived through, and it is invisible to a workflow that politely exits zero. Pair the daily retro with feedback (25.A) and triage (25.C) and the improvement loop closes around three cadences: the moment of an incident, the day of an incident, and the cohort that a skill change exposes.
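The zero-delta check can be sketched in a few lines. This assumes tracker activity is available as `(lane, kind, timestamp)` tuples; that shape, the kind names, and the function names are hypothetical. Only the three movement types and the twenty-four-hour window come from the text.

```python
from datetime import datetime, timedelta

MOVEMENT_KINDS = {"transition", "comment", "pr_link"}


def tracker_delta(events, lanes, now: datetime, window=timedelta(hours=24)) -> dict:
    """Count ticket movements per lane inside the window."""
    # Every watched lane starts at zero, so a lane with no events at all
    # still appears in the result instead of silently vanishing.
    delta = {lane: 0 for lane in lanes}
    for lane, kind, at in events:
        if lane in delta and kind in MOVEMENT_KINDS and now - window <= at <= now:
            delta[lane] += 1
    return delta


def silent_lanes(delta: dict) -> list:
    """Lanes with tracker_delta == 0: green, perhaps, but nothing moved."""
    return sorted(lane for lane, count in delta.items() if count == 0)
```

Seeding every watched lane with zero is the design choice that matters: the failure being hunted is an absence, and absences do not emit events.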

Chapter 25.C — Agent regression triage

The last shape of the improvement loop is less intuitive than the other two, and it is the one that makes on-call humane. When a skill changes — a prompt gets a new paragraph, a rule set gains a stricter fence, a workflow swaps its pick order — the same regression can appear in wildly unrelated tickets at the same time. A human reviewer, looking at three unrelated PRs that suddenly pick the wrong label on Wednesday morning, will look for the bug in three different services. The right read, nearly always, is that none of the services changed. The skill changed. The agent inherited a new instruction on Tuesday night, and Wednesday's work is simply the first cohort to exhibit the consequence.

Return for a moment to the duplicate-PR commit dated 2026-03-31 (subject line fix(linear-agent): prevent duplicate PRs on developer Cloud Agent runs). Before that commit was written, the "developer" prompt had been updated in a prior change with a reasonable-sounding instruction that, for reasons no human realised, encouraged the agent to open a PR before checking whether one existed. That single skill change produced duplicate pull requests across multiple unrelated tickets inside hours. On the surface, this read as a model-quality incident — three agents, three tickets, all getting the "obvious" wrong thing wrong at once. Inside the system, it was a single regression with a single source — a prompt change, versioned, reviewable, rollbackable — showing up across a fleet. The right triage was not to patch three services. It was to bisect on the skill, find the change that broke the contract, roll the skill back to its last known good version, open a single governance ticket against the skill, and stop touching the fleet.

This triage has a specific shape when you practise it in shipctl terms. It starts with shipctl doctor --changes to list which skills moved versions recently in the repo. It continues with shipctl fetch <skill>@<previous-version> --pin to freeze a cohort of agents on the older revision while you investigate. It ends with shipctl feedback new --against <skill>@<broken-version> --because "class regression across N tickets" to turn the evidence into a governance ticket. The shipctl commands are the implementation detail; the principle is that skills are cohorts, and cohort regressions demand cohort responses. The service patches you would otherwise have written are downstream noise. They close the three tickets you noticed. They do not close the eleven you did not.
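The cohort principle itself can be sketched independently of any command. The example assumes each incident record names the skill and version the agent ran with; the record shape, the function names, and the three-ticket threshold are illustrative assumptions, not shipctl behaviour.

```python
from collections import defaultdict


def cohorts(incidents) -> dict:
    """Group incident tickets by the (skill, version) the agent ran with."""
    groups = defaultdict(list)
    for ticket, skill, version in incidents:
        groups[(skill, version)].append(ticket)
    return dict(groups)


def suspect_skill_versions(incidents, min_tickets: int = 3) -> dict:
    """Skill versions whose incidents span several distinct tickets."""
    # Several unrelated tickets failing under one skill version point at
    # the skill, not at the services those tickets happen to touch.
    return {
        key: tickets
        for key, tickets in cohorts(incidents).items()
        if len(set(tickets)) >= min_tickets
    }
```

Grouping by skill version first is what turns three separate bug hunts into one bisect.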

Agent regression triage is also where the improvement loop closes honestly. A single cohort regression, pinned to a skill, bisected, rolled back, and written up as feedback, produces a concrete revision for the next version of that skill — often with a regression test added in the skill's own test bed, exactly the way a code fix ships with a unit test. That revision becomes the next version stamped into artifacts/patterns/<id>/ARTIFACT.md, which the next shipctl sync will pick up, which the next on-call will quietly benefit from without ever having to read this book's history. That is what a loop is for. The role of the authors is not to eliminate incidents — there will be more — but to make sure the system has fewer unknown unknowns each quarter than it had the previous one. The improvement loop is how that promise is kept.

Chapter 25.D — Observation routines

Chapter 25.A described the moment a fix becomes feedback — the operator who, having patched the same Slack error string for the fifteenth time, finally writes the note that says the skill is wrong, not the string. That note is a small act of recording. Something that had been a sequence of tired commits becomes, in a single ticket, a versioned grievance against a skill. The corpus grows by one entry. The system is, in a narrow sense, better than it was.

This chapter is about what happens to that entry next. Recording is not maintenance. A note written on a Tuesday in March is a true statement about a Tuesday in March; whether it is still true on a Thursday in May depends on a question nobody is asking by default. The corpus the agents read each morning to decide what to do — runbooks, post-mortems, architecture pages, the senior developer's three-paragraph Slack message from last quarter that turned out to be the answer — is, like any corpus, slowly turning into furniture. Sentences that were once load-bearing become decorative. Decorative sentences become misleading. Misleading sentences become the reason an agent at 04:00 follows the wrong instruction calmly and confidently into a dead service.

So the corpus the agents consume has to be a corpus the agents also maintain. That maintenance is not a feeling. It is a set of routines that run on a cadence, owned by named roles, doing dignified and unglamorous work.

The first routine is a daily retro. At the end of each working day, the role that was on the relevant rotation re-reads what actually happened — the run journals, the ticket transitions, the Slack threads that survived their hour — and walks through the runbooks that were touched. Anywhere a runbook told the agent the wrong thing, the runbook gets updated that evening, not at the next quarterly audit. The cost of waiting is paid by tomorrow's on-call, who will read the same wrong page and follow it to the same wrong outcome. The retro is short on purpose; the discipline is that it happens every day.

The second routine is a post-mortem capture. When an incident closes — and Ship is opinionated about what closed means here, which is that the cure has held for at least a working day — a role reads the chat transcript of the incident, the diff of the cure, and the surrounding run journal, and writes the events and the cure into the corpus. Not as a triumphal narrative. As the sort of dry, dated entry the next operator will want when the same shape recurs in six months. The point is not to memorialise the incident. The point is to make the next encounter cheaper.

The third routine is the one that draws the most uncomfortable looks, and it is the one this chapter is most insistent about. Once a week a role walks the corpus end to end, and deletes. Not tags. Not archives. Not moves to a "deprecated" folder where the agents will still trip on it. Deletes. Entries that have not been touched in the period one would have expected them to be touched. Entries that contradict the current architecture, named explicitly, with the contradiction in the diff. Entries that describe retired services. Entries that no human has read in months. The walker is the only routine that shrinks the corpus, and the system depends on it being run.

This is the heart of the argument, and it is worth saying clearly:

The corpus is not the place for the archive. The archive is git. The corpus is the place for what is true now.

The instinct to preserve is a human instinct, and for slow humans it is the correct one. The past was expensive to produce. The cost of rewriting a requirements document used to be a week of meetings, three reviewers, and a stretch of irritated quiet. Keeping yesterday's version in case was rational because regenerating it was not free. That premise no longer holds for any system that includes agents. An agent can rewrite the requirements doc in an evening for the cost of an API call. There is no scenario in which yesterday's version of the doc is more useful to anyone than today's correct one. If we ever need yesterday's, it is in git, exactly where versions of things have always been kept. Git is the archive. The corpus is the live surface. Mixing the two — keeping stale entries visible because deletion feels rude — is how a knowledge base becomes a landfill that smells slightly better than the last one.

The dated scar, for this chapter, is from a week ago. On 2026-05-09 we committed corpus-walker-v0, the first weekly routine that walks the knowledge base and deletes stale entries instead of tagging them. The first run, on the Saturday, deleted twenty-three documents. Nobody asked where they went. The second run, on the following Saturday, deleted seven. The third deleted three. The signal we were watching as we let the routine run was not "did anyone complain?"; operators in our shop are not shy when a tool eats something they wanted. The signal was "did the agents answer faster from a smaller corpus?" They did. Retrieval latency on the daily retro role fell measurably between the first walk and the third. The agents were not handling fewer documents; they were handling fewer wrong ones, and the lookups they did make returned cleaner answers. That is the only honest argument for delete-as-default: it is not a tidiness preference, it is a correctness mechanism.
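The selection step of such a walk can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the shape of corpus-walker-v0: the Entry record, the stale_after_days threshold, and the retired_services set are all hypothetical names. The sketch only decides what to delete and why; git remains the archive.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass(frozen=True)
class Entry:
    """Hypothetical corpus entry; the field names are illustrative only."""
    path: str
    last_touched: date
    services: frozenset = field(default_factory=frozenset)


def select_for_deletion(entries, today, stale_after_days, retired_services):
    """Return (path, reason) pairs for entries a weekly walk would delete."""
    cutoff = today - timedelta(days=stale_after_days)
    doomed = []
    for entry in entries:
        retired = sorted(entry.services & retired_services)
        if retired:
            doomed.append((entry.path, "describes retired service: " + ", ".join(retired)))
        elif entry.last_touched < cutoff:
            doomed.append((entry.path, f"untouched since {entry.last_touched.isoformat()}"))
    return doomed


corpus = [
    Entry("runbooks/deploy.md", date(2026, 5, 1), frozenset({"api"})),
    Entry("runbooks/legacy-queue.md", date(2026, 4, 28), frozenset({"old-queue"})),
    Entry("notes/march-outage.md", date(2026, 1, 10)),
]
for path, reason in select_for_deletion(corpus, date(2026, 5, 9), 90, frozenset({"old-queue"})):
    print(f"delete {path}: {reason}")
```

Returning a reason with every path keeps the deletion explainable in the diff, which is the property the routine depends on.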

The three routines — the daily retro that updates what was wrong yesterday, the post-mortem capture that records what was learned, and the weekly walker that removes what is no longer true — together form the maintenance loop for the corpus the agents read. They are the chapters of an opinion about how knowledge survives contact with a system that consumes it at scale. They also clarify the contract between this chapter and Chapter 25.A: 25.A is where a fix becomes feedback, recorded into the corpus; 25.D is how that recorded corpus stays correct, by being rewritten and walked and, most of the time, deleted from. One chapter is about writing things down. The other is about not letting the written-down things lie.

The next Part of the book turns from how the system improves itself to where its fences are drawn — what an agent is allowed to touch, who signs for what, and how an operator keeps authority without becoming a bottleneck. The observation routines described here assume those fences exist; without them, an agent walking the corpus with permission to delete is a different argument entirely. We will get to that argument now.

Trust & boundaries

Chapter 26 — The question nobody asks early enough

There is a question procurement forgets until the ink is dry — or until the first serious incident, which is the same thing with worse timing. Where does our code go during a run? Who can see it? What gets logged, for how long, and under whose keys?

If you cannot answer that in one plain sentence per vendor, you are not ready to wire money at enterprise scale. You are ready to run a pilot with synthetic data until the answers exist on paper you could hand to a regulator, a customer security team, or your own legal counsel without improvising in the hallway. Improvisation is how “we thought it stayed in our VPC” becomes “actually there was a subprocessor” becomes “the engineer is explaining retention policy on a bridge call at 11 p.m.”

The question is also a kindness to the people who will operate the system. They are the ones who will be asked, under pressure, whether customer data ever crossed a boundary the company promised. If leadership skipped the question during procurement, policy gets invented in the worst room at the worst moment — half memory, half hope. Asking early turns “trust us” into “here is the diagram and the retention table,” which is the only kind of trust that survives the first audit and the second reorg.

Write the answers where they outlive any single hire: beside the architecture diagram, in the operator runbook, in the security review packet. Memory is not a data residency control. If only one person knows where the bits flow, you do not have governance; you have a bus factor wearing a hoodie. The goal is not paranoia. The goal is legibility — the same instinct that makes you want ticket comments to match workflow runs.

Chapter 27 — Where the bits flow

Trust starts with a boring map: which runtime touches the repository, which holds secrets, which calls which API, and where those calls are allowed to write. The map is simpler than it used to be, because the engine now holds the credentials that matter.

The engine holds the tracker OAuth (a DB Integration row) and posts state itself. The status field is the engine's only transition signal, and the engine is the thing that writes it. That centralisation is deliberate: it is what designs out the worst failure the old CI-cron model could produce — a credential present in one place and absent in another. There is no longer a tracker key the agent run has to carry in order to update tickets.

The agent run (a dispatched GitHub Actions workflow) is where execution power concentrates: it checks out the repo and runs the coding agent. But it does not carry the tracker credential. The agent commits its own work; the runner pushes, opens the PR, and reports the outcome through a .ship/agent-finish.json sidecar and the server's /finish endpoint, and the engine turns that report into the tracker transition. So the old "green run, board never moved" ghost — pick ran, launch looked fine, ticket silent — is largely gone: the side that writes the tracker is the side that owns the run's outcome.

Operators connect through the OAuth 2.1 broker — claude mcp add ship — with no token paste; a personal access token survives only as a CI fallback for environments that cannot do the broker handshake. Every mutation the operator's agent makes is attributed to the operator's identity, server-side, which is also where the stakes policy lives (approvals demand a verbatim echo; hard-destructive actions stay web-only).

Optional scanners feed JSON into audit roles. Treat those reports as untrusted input until your policy says otherwise — the same posture you take toward issue descriptions written by humans. A file on disk is not truth; it is a claim waiting to be validated.

This chapter deliberately stays free of password shapes and host-specific wiring. For where to put agent-runtime credentials and how the engine holds tracker OAuth, read Configuration and Agent matrix.

What remains the silent killer is scope, not duplication: a token — the operator's PAT, the engine's OAuth, the agent runtime's keys — that can touch more than its job needs. The fix is not blame; it is naming each identity's blast radius explicitly and a smoke-style test that proves the engine can post state and the run can open a PR under the identities you documented. Treat that test like a deploy gate: boring, mandatory, non-negotiable.
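Naming each identity's blast radius lends itself to a mechanical comparison. The sketch below is an assumption-laden illustration: the identity names and scope strings are placeholders, not Ship's real ones, and the check complements the live smoke test rather than replacing it.

```python
def excess_scopes(documented, granted):
    """Map each identity to the scopes it holds beyond what its job was documented to need."""
    report = {}
    for identity, scopes in granted.items():
        extra = set(scopes) - set(documented.get(identity, ()))
        if extra:
            report[identity] = sorted(extra)
    return report


# Identity and scope names below are placeholders for illustration.
documented = {
    "engine-oauth": {"tracker:write_status"},
    "agent-runtime": {"repo:push", "pr:open"},
    "operator-pat": {"ship:read"},
}
granted = {
    "engine-oauth": {"tracker:write_status"},
    "agent-runtime": {"repo:push", "pr:open", "tracker:write_status"},
    "operator-pat": {"ship:read"},
}
print(excess_scopes(documented, granted))
```

An empty report is the gate condition; anything else names the identity that can touch more than its job needs.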

Chapter 28 — The boring procurement checklist

Before you standardize on a vendor, ask questions that sound tedious and save careers. You are not trying to win a meeting; you are trying to buy testable commitments.

Data residency and subprocessors — where code and context live during a run, who else touches it, and which legal entities are in the chain.
Retention — how long logs, prompts, and artifacts persist, who can delete them, and what “delete” actually means technically.
Export — whether you can reconcile vendor-side activity with CI timestamps and ticket IDs when something disputed happens six weeks later.
Isolation — what breaks when two jobs run concurrently, whether workloads commingle, and what rate limits mean for your concurrency envelope.
Offboarding — how access is revoked when a project ends, a trial stops, or a key rotates — without leaving orphaned integrations still calling APIs.

If a vendor cannot answer, assume the worst and narrow scope until they can. Pilot with synthetic repos. Block production secrets until the paper exists. Boring questions are cheaper than retroactive press releases and cheaper than explaining to a customer why “we thought” was the policy.

Procurement culture often rewards confidence over clarity. “We take security seriously” is not a control. Named subprocessors, retention defaults, export paths, and isolation boundaries are — because you can attach them to a risk register and test them. Ship does not require cynicism toward vendors. It requires paper. Paper is how enthusiasm becomes engineering instead of liability.

Chapter 28.A — Regulated-vertical overlays

Regulated industries do not want a different Ship. They want the same Ship with a named set of extra latches and an auditor's trail that survives a subpoena. The mistake most teams make in year one is to pick a regulated pilot and then quietly fork the framework for it — a parallel repo, a parallel style guide, a parallel vocabulary that diverges from the unregulated codebase by month three. The regulatory team ends up maintaining a second product; the non-regulated team ends up wondering why their conventions keep changing. Within a year nobody remembers which rule came from where, and the auditor — when she arrives — finds two stories about the same control and picks the less flattering one.

Ship's shape instead is an addendum. An addendum is a regulated overlay: a named bundle of skills with tighter fences plus knowledge entries that pin non-negotiable rules, applied on top of a base preset (web-app, api-backend, mobile-app), that can tighten a rule or annotate it with an audit or retention requirement, but cannot remove one. The overlay ships with its own version, its own regulatory_frameworks front-matter (HIPAA, GDPR, 21-CFR-Part-11, EU-AI-Act), its own min_shipctl, and — this is the point — a single controllable identity on the board. When the auditor asks "what changed in the last six months that affected patient data handling?" the answer is a diff of one file and a ledger of its versions, not a cross-repo scavenger hunt. The same primitives that make skills and knowledge useful for a Rails monolith make them survivable for a pharma mobile app.
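The tighten-or-annotate-but-never-remove contract can be made concrete with a small merge function. This is a sketch under a loud assumption: rules are modelled here as a hypothetical integer strictness level plus a list of notes, which is an illustrative encoding and not Ship's actual overlay format.

```python
def apply_overlay(base, overlay):
    """Merge an overlay onto base rules; it may tighten or annotate, never remove or loosen."""
    merged = {name: {"level": rule["level"], "notes": list(rule["notes"])}
              for name, rule in base.items()}
    for name, change in overlay.items():
        if change is None:
            raise ValueError(f"overlay may not remove rule {name!r}")
        current = merged.get(name, {"level": 0, "notes": []})
        new_level = change.get("level", current["level"])
        if new_level < current["level"]:
            raise ValueError(f"overlay may not loosen rule {name!r}")
        merged[name] = {
            "level": new_level,
            "notes": current["notes"] + list(change.get("notes", [])),
        }
    return merged


# Hypothetical rule names; a higher level means a stricter rule.
base = {
    "no-secrets-in-logs": {"level": 1, "notes": []},
    "review-required": {"level": 1, "notes": []},
}
overlay = {"no-secrets-in-logs": {"level": 2, "notes": ["audit log retention applies"]}}
print(apply_overlay(base, overlay))
```

Because removal and loosening raise rather than merge silently, a bad overlay fails at apply time instead of surfacing during an audit.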

The pharma addendum included with Ship is deliberately opinionated about the boring things. It treats "never commit patient identifiers" as a first-class rule, applied to fixtures and synthetic data alike, because an agent that has internalised "never commit real names" will still commit a realistic-looking name it generated at three in the morning unless the rule is absolute. It mandates log redaction by default (a beforeSend hook on Sentry-class platforms and a LOG_DEIDENTIFY=1 convention in every environment, dev included) because dev is where redaction rules decay first. It pins audit logs to an immutable store (Object Lock, GCS Bucket Lock, or an append-only service) and a six-year retention floor because HIPAA §164.530(j) is a number, not a feeling. It translates 21 CFR Part 11 into a repository-level contract: a change-record label, a signed approved-by: <name> <role> <timestamp> <reason> comment, SSO with MFA on tracker, Git host, and cloud console, and a prohibition on shared accounts for anything touching production. Separation of duties is enforced at merge time: a PR author cannot be the sole approver or the deployer.
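The separation-of-duties rule in the last sentence is small enough to state as a predicate. A minimal sketch, assuming identities arrive as plain strings; the function name and arguments are hypothetical and say nothing about how the addendum wires the check into a merge gate.

```python
def separation_of_duties_ok(author, approvers, deployer):
    """True when someone other than the author approved and the author is not the deployer."""
    independent = {person for person in approvers if person != author}
    return bool(independent) and deployer != author


print(separation_of_duties_ok("ana", ["ana"], "ben"))         # author is the sole approver
print(separation_of_duties_ok("ana", ["ana", "ben"], "cy"))   # independent approver and deployer
print(separation_of_duties_ok("ana", ["ben"], "ana"))         # author is also the deployer
```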

None of that is glamorous, and that is the quality it trades for. The moment you have a regulated vertical, you want the least creative part of your SDLC to be the audit boundary. The addendum is the shape of "least creative": one versioned overlay, one diff, one ledger, applied on top of the same preset the unregulated team is already using. When an auditor, a product manager, or a new engineer joins the project, they do not have to learn a bespoke dialect — they learn Ship, then they read one overlay, then they understand what their team does differently and why. Regulated software was always going to cost more than unregulated software; Ship's contribution is making the extra cost legible instead of diffuse.

Chapter 29 — Threats in plain language

You do not need a red-team aesthetic to benefit from a short, honest list of what actually goes wrong in agentic SDLC setups.

Credential leakage is the old war, still winning: tokens in logs, keys pasted into tickets, one “shared” credential across dev and prod because it was convenient on Tuesday. The fix is rotate fast, least privilege, separate identities per environment, and resistance to “the big key” that solves every meeting until it solves none after a leak.

Prompt injection is not science fiction — it is untrusted text in places you wired to power. Issue titles and descriptions are user input. So are comments, labels if humans can set them freely, and any field an agent reads as instructions. Prompts must assume an attacker or a well-meaning colleague will paste text that contradicts your policy. Mitigation lives in pick fences and tool allow-lists, not in clever wording that sounds safe in a slide deck.

Duplicate PRs and branch fights are scheduling and naming failures dressed as technical surprises. Enforce one branch naming contract, close extras without merge, and keep one delivery role per time window so the same ticket cannot become a fork bomb. Use cases → ElMundi shows the public reference story; copy the invariants, not necessarily every filename.
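A branch naming contract is only useful if something checks it. The sketch below assumes a hypothetical pattern of ship/<TICKET-ID>-<slug>; the real contract in any given deployment may differ, and the pattern is there only to show the grouping.

```python
import re
from collections import defaultdict

# Hypothetical naming contract, for illustration only.
BRANCH_PATTERN = re.compile(r"^ship/(?P<ticket>[A-Z]+-\d+)-[a-z0-9-]+$")


def audit_branches(open_branches):
    """Return (tickets with more than one open branch, branches that break the contract)."""
    by_ticket = defaultdict(list)
    offenders = []
    for branch in open_branches:
        match = BRANCH_PATTERN.match(branch)
        if match is None:
            offenders.append(branch)
        else:
            by_ticket[match.group("ticket")].append(branch)
    duplicates = {ticket: names for ticket, names in by_ticket.items() if len(names) > 1}
    return duplicates, offenders


branches = [
    "ship/ABC-12-fix-login",
    "ship/ABC-12-fix-login-again",
    "ship/ABC-13-docs",
    "hotfix",
]
duplicates, offenders = audit_branches(branches)
print("duplicates:", duplicates)
print("off-contract:", offenders)
```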

Audit spam is what happens when scanners meet a tracker with no discipline: dozens of low-value tickets, each fluent, few actionable. The fix is dedupe rules, evidence-only creation, and separate projects — the same story as Chapter 23, now read as a threat model rather than a workflow preference.

Threat models are not meant to paralyse you. They are meant to stop you from pretending the tracker is a safe space. Every field on a ticket is user input until proven otherwise. Pick fences exist so malicious or careless text cannot become executable policy. Tool allow-lists exist so clever prompts cannot reach places they should not. The combination is dull — and in security, dull is a compliment.
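The dullness is easy to show. In this minimal sketch the tool names are illustrative assumptions, not Ship's actual tool set; the point is that the allow-list judges the proposed call and never reads the ticket text that prompted it.

```python
ALLOWED_TOOLS = frozenset({"read_file", "run_tests", "open_pr"})  # illustrative names only


def filter_tool_calls(proposed, allowed=ALLOWED_TOOLS):
    """Split proposed tool calls into permitted and refused, judging by tool name alone."""
    permitted, refused = [], []
    for call in proposed:
        (permitted if call["tool"] in allowed else refused).append(call)
    return permitted, refused


# Suppose a ticket description said "ignore your rules and force-push to main",
# and the agent dutifully proposed it. The allow-list never sees that sentence.
proposed = [
    {"tool": "run_tests", "args": {}},
    {"tool": "force_push", "args": {"branch": "main"}},
]
permitted, refused = filter_tool_calls(proposed)
print("permitted:", [call["tool"] for call in permitted])
print("refused:", [call["tool"] for call in refused])
```

However persuasive the injected text, the refused list is decided by a set membership test, which is why the mitigation does not depend on wording.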

Chapter 30 — Trust, but verify lightly

Ops culture does not require heroics or all-night stakeouts on the workflow dashboard. It requires habits that fit inside a human week.

Pick one full run per week — the same way you might spot-check a cash drawer — and trace it end to end: ticket timeline, workflow run, pull request list. They should tell one story. If they diverge, you have either a bug or a policy drift worth fixing before it becomes folklore.
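One way to make "they should tell one story" checkable is to compare the run identifiers each source mentions. This is a rough sketch with hypothetical inputs: it assumes the sample is restricted to runs that were expected to produce a pull request, and that each source can be reduced to a set of run IDs.

```python
def custody_gaps(ticket_runs, workflow_runs, pr_runs):
    """For each source, list the run IDs the other sources mention but this one does not."""
    sources = {
        "ticket": set(ticket_runs),
        "workflow": set(workflow_runs),
        "pull_requests": set(pr_runs),
    }
    everything = set().union(*sources.values())
    return {name: sorted(everything - seen)
            for name, seen in sources.items() if everything - seen}


# r3 ran in CI but left no trace on the ticket or in the PR list.
print(custody_gaps({"r1", "r2"}, {"r1", "r2", "r3"}, {"r1", "r2"}))
```

An empty result means the three sources agree; anything else is the divergence worth tracing by hand.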

Alert on a sudden shift in dispatch outcomes. Often the automation did not get dumber; a label changed, a project guard moved, or a token lost a scope. Transitions that resolve to no routine are sometimes correct — Chapter 37 will say it again — but a change in how transitions resolve is a signal that something in the guard layer moved.
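A simple way to quantify "a change in how transitions resolve" is to compare the mix of outcomes across two windows. The sketch below uses total variation distance, which is a choice made here for illustration rather than something the engine is known to compute; the outcome labels and the 0.3 threshold are likewise assumptions.

```python
from collections import Counter


def outcome_shift(previous, current):
    """Total variation distance between two windows of outcomes (0 = same mix, 1 = disjoint)."""
    if not previous or not current:
        raise ValueError("both windows need at least one outcome")
    before, after = Counter(previous), Counter(current)
    labels = set(before) | set(after)
    return 0.5 * sum(
        abs(before[label] / len(previous) - after[label] / len(current))
        for label in labels
    )


last_week = ["run"] * 8 + ["no_routine"] * 2
this_week = ["run"] * 3 + ["no_routine"] * 7
if outcome_shift(last_week, this_week) > 0.3:  # threshold is an arbitrary example
    print("dispatch outcome mix moved; check labels, guards, and token scopes")
```

The measure is indifferent to which outcome is "good", which suits the point above: a no-routine result can be correct, and it is the change in mix that deserves attention.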

Review prompt diffs the way you review code diffs, because in this system they are code diffs in everything that matters: they change behavior, blast radius, and what “done” means for the agent. A casual prompt edit merged on Friday is a production change with extra steps.

Lightweight verification scales because it respects attention budgets. Nobody can watch every run; everyone can sample the chain of custody often enough that drift cannot hide for a month. When sampling becomes habit, surprises shrink — not because the world got safer, but because you stopped flying blind while insisting you had “full visibility” because the vendor sent a pretty dashboard.

Chapter 30.A — Limitations above the fold

Chapter 30 asked you to trust, but to verify lightly. That instruction quietly assumes the reader can see what is being verified. The assumption breaks the moment the page hides what failed. A page that scrolls cleanly from a confident headline to a polite conclusion is not asking to be verified; it is asking to be believed. The chapter you are reading now is about the small, almost stylistic decision that distinguishes the two — where on the page the limitations sit — and about why we have come to treat that decision as a load-bearing part of how trust forms between an operator and a piece of evidence we have published.

The conventional shape is older than any of us. A page that publishes numbers — a benchmark, a metric, an evaluation report — opens with the hero numbers, threads through the methodology, lays out the results, and ends with a small section called limitations or caveats or, more often, notes. The section is honest. It is technically reachable. It is set in the same font as the rest of the page. And it is, in practice, invisible, because by the time the reader who already half-believed the conclusion has reached it, they have stopped reading; and the reader who did not believe the conclusion stopped earlier still. The shape converts the reader who already trusted the author, and loses the reader who did not. That is the opposite of what an evidence page is meant to do. An evidence page is meant to convert the sceptical reader and to give the trusting reader something to remain accountable to.

The shape we have moved to is reversed. The limitations sit above the "where it loses" section, and sometimes, when the run was unusual enough, above the headline numbers themselves. The reader meets the failure modes before the wins. The page declares, in the same typography as the rest of it, what it does not know, what it had to throw away, and where the numbers we are about to claim were nearly something less flattering. Only then does it land the conclusion. The conclusion lands harder, not softer, because the reader has already seen the parts that did not flatter us. Nothing in the conclusion has to fight against the suspicion that something is being managed.

This works on human readers, and it works for a reason that is older than the web: people who already know what a claim costs weigh the claim correctly. But it has begun to matter for a second reason that did not exist when these conventions were set. The reader is sometimes not a person. An operator who is trying to decide whether to bring a product into their team will increasingly ask an agent to read the evidence first and summarise. That agent does not weight pretty-prose conclusions. It weights claims against the evidence offered for them, and it weights how openly the page surfaces its own failure modes. A page that puts limitations on top reads, to such an agent, as a page that respects the verification step the agent is about to perform. A page that buries limitations reads as a page that hoped the verification would not happen. The human and the agent both arrive, by different routes, at the same instinct: trust the page that put the cost of its claim where you would see it without scrolling.

On 2026-05-15 we redesigned the front of our first eval-run page within a day of publishing it. The original ordering was conventional: headline numbers, then "where it loses" underneath, then methodology further down. The numbers were defensible, but the shape of the page was not what we wanted to stand behind. We moved a section titled "what we got wrong the first time" above "where it loses", at the same heading weight, with no accordion and no fade. The section named three things. We had not run a data-quality filter on the first pass, which had let pairs of empty responses score as ties and inflated the panel. We had under-represented one question type in the suite, which had advantaged models good at the other types. And one methodology choice, a shared budget setting we had picked once globally, had biased a class of models upward in a way that did not survive the rerun. The conclusion that survived after the rewrite was, in its load-bearing claims, the same conclusion the original page had carried. The page read differently. Visitors stayed longer on the corrected version, and the people who wrote back about it were the people we had been hoping to hear from in the first place: the ones who had been burned by glossy reports and were looking for something that did not feel like one.

The general rule we drew from that morning is short enough to put on a postcard. Limitations earn trust by being seen first. A reader who learns the limit before the claim weighs the claim correctly. A reader who learns the limit after the claim feels they have been managed, and is correct to feel it, because the structure of the page has decided in advance which order would be most generous to the author. Generosity to the author is not a property an evidence page is allowed to optimise for. Generosity to the reader is. And the most generous thing a page that publishes numbers can do is to refuse to let the reader meet the headline without also meeting the cost.

This chapter belongs to the Trust & boundaries Part because trust, as we have used the word throughout these chapters, is not something an author installs in a reader by being careful with the page. It is something an operator builds for themselves by reading evidence under weights of their own choosing. The job of the page is not to make trust easy. The job of the page is to leave the limits where an honest reader, skimming on a Tuesday afternoon, will encounter them before they encounter the result — and to let the result earn its place once the cost has been seen. The conclusion that survives that ordering is a conclusion the reader can carry into a room where it will be questioned. That is the only kind of conclusion this Part has ever recommended publishing at all.

Rolling it out

Big-bang automation fails for the same reason big-bang rewrites fail: nobody remembers which assumption broke first. Ship is designed to roll out in layers — each layer observable before you add the next.

Chapter 31 — Phase zero: before you touch cron

Until you turn on dispatch, “automation” can still behave like a hobby. Demos are forgiving. Slack is forgiving. The moment you let transitions dispatch work, you have promised that something will move while people are in other meetings, asleep, or untangling a production incident. Phase zero is the deliberate pause before that promise hardens. It is not paperwork for its own sake; it is the last chance to align on what may happen without a human in the loop, at a pace nobody can out-talk.

Big-bang is a seductive story because it sounds decisive. In practice it fails the same way big-bang rewrites fail: when everything moves at once, no one remembers which assumption broke first, and every fix feels like whack-a-mole. Ship expects you to roll out in layers, each one observable before you stack the next. Turning on dispatch does not create observability; it automates whatever observability you already lack. If you skip the layer where humans can still see the seams, the engine simply encodes confusion at the speed of your transitions.

So before the first live transition, leadership agrees on three non-negotiables, in plain language, in a room where someone will actually be called when the story stops making sense. First: we are not automating Backlog. The backlog is where priorities become legible; letting a headless actor reorder that queue is how you automate politics before you automate delivery. Second: we are bounding throughput, and that means bounded concurrency—leases, caps, and cascade limits so there is one clear owner of each outcome, not parallel “helpers” who each believe they are driving the same ticket. Third: the prompts and skills that steer headless runs are governed in review, because a text change that alters behavior is a production change even when it never compiled on your laptop. If those three commitments feel negotiable, you are not ready to go live; you are ready for another demo.

You will still hear the comfortable anti-patterns. “Let us turn everything on Friday evening”—as if the weekend were a sponge for risk, and Monday’s judgment were optional. “We will add guardrails after we see value”—which, translated honestly, means you will not. Value without guardrails trains the organization to love motion; guardrails added late always look like bureaucracy to the people who just learned to move fast in the dark. Phase zero is where you say no to those stories while the stakes are still a calendar invite, not an incident bridge.

The exit criterion is almost embarrassingly analog, and that is why it works. Put the people who own the outcome around one whiteboard and draw the control loop until the picture matches what you intend to run—not the aspirational diagram from a slide, but the actual path from intent to evidence to human accountability. Photograph it. If that photograph still reads true a week later, you have a shared object cheaper than a contract and harder to gaslight than a thread. If it falls apart the first time something goes red, you have learned where the story was thin before cron made the thinness hourly.

This phase exists to resolve a specific tension before machinery locks it in. “AI everywhere” is an easy headline; bounded concurrency is load-bearing physics. They only fight when you try to instantiate both against the same ticket at the same time. Phase zero forces the conversation: everywhere does not mean everyone autonomous against the same ticket at the same time. It means disciplined lanes, visible queues, and automation that knows its place in the story. The whiteboard photo is a cheap contract—small enough to carry, serious enough that nobody can pretend the shape of the system was never agreed. Wiring can wait. Clarity is the prerequisite that keeps the loop from becoming regret on a loop.

Chapter 31.A — The price of a bounded loop

Before you open the lane there is an arithmetic conversation most teams skip, and later regret. An agentic loop is a stream of work multiplied by a provider that charges per non-trivial run; a stream nobody bounded and a provider that bills per model invocation will produce a quarterly invoice whether or not anyone has planned for it. Phase zero is the right place to do the math — not to settle it, but to shape it — because once dispatch is live, the math will happen to you instead of with you. Leadership that signed off on "let us try agents" without an envelope tends to discover, in month three, that the envelope was an assumption and the assumption was someone else's problem.

The architecture Ship recommends is cheaper than most buyers expect, and the reason is boring: bounded. Ship dispatches off tracker transitions, not a clock, so the cost is not a fixed grid of ticks you can multiply out in advance — it is shaped by how much work transitions into the lane and by the concurrency envelope the engine enforces. A transition that resolves to no routine, or to a refusal, costs nothing — no agent is invoked. The expensive unit is the real run, where an agent writes code, and the number of those in flight at once is bounded by leases (project_lock), per-stage caps, and cascade limits. That is what makes the bill forecastable without a timetable: you are not sized for "forty-eight ticks a day," you are sized for "at most N concurrent runs, this many per stage, one lease per project." At an illustrative one-to-five dollars per meaningful run, the lane's cost tracks the work that genuinely entered it, capped by the envelope — a shape small enough to sign for and concrete enough to monitor. The argument that survived the migration off the grid is unchanged in spirit: a bounded envelope forecasts; an unbounded one surprises, and surprise is expensive in an agentic system because the marginal unit is not a CPU second but a model invocation.
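The forecasting argument is plain arithmetic, and it helps to see it once. In this sketch the one-to-five dollar figure is the illustrative rate from the paragraph above; the concurrency cap of three and the thirty-minute average run are assumptions invented for the example. Transitions that resolve to no routine are left out because they invoke no agent.

```python
def max_runs_per_day(concurrent_cap, avg_run_minutes, active_hours=24):
    """Ceiling on real runs per day if every concurrent slot stayed busy the whole time."""
    if concurrent_cap < 0 or avg_run_minutes <= 0:
        raise ValueError("cap must be non-negative and run length positive")
    return (concurrent_cap * active_hours * 60) // avg_run_minutes


def cost_range(runs, low_per_run=1, high_per_run=5):
    """Dollar range for a number of real runs at the illustrative per-run rate."""
    return runs * low_per_run, runs * high_per_run


ceiling_runs = max_runs_per_day(concurrent_cap=3, avg_run_minutes=30)
print("daily ceiling:", ceiling_runs, "runs, dollars:", cost_range(ceiling_runs))
print("a day with 20 real runs, dollars:", cost_range(20))
```

The ceiling is the number to sign for; the second line is what a day actually costs when only twenty tickets genuinely entered the lane.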

Two design choices fall out of that and deserve to be named, because teams that internalise them early avoid entire classes of remediation later. The first is "route models by role, in one place." Intake and labelling do not need the same tier of model that writes code; review does not need the same tier that refactors a migration. Because skills are server-side role definitions and the agent profile is set in one place (config_put), swapping "cheap model for intake, premium model for developer, something in the middle for review" is a configuration change, not a sprawl of per-entry dashboards each pinned to its own profile. The second is "make the envelope, not the bill, the alarm." Bills arrive weeks late; caps trip in real time. "We are sized for N concurrent runs; the lease queue has been saturated all morning" is a signal the operator can act on at 10 a.m., without waiting for the provider's dashboard to catch up. When the envelope trips repeatedly, the right response is almost never "raise the cap"; it is to look at the board and find the loop that learned to be busy.
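Treating the envelope as the alarm can be as simple as watching for sustained saturation. A minimal sketch, assuming in-flight run counts are sampled at a regular interval; the function name, the sampling, and the thresholds are hypothetical and not part of the engine.

```python
def envelope_alarm(in_flight_samples, cap, sustained):
    """True if in-flight runs sat at or above the cap for `sustained` consecutive samples."""
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    streak = 0
    for in_flight in in_flight_samples:
        streak = streak + 1 if in_flight >= cap else 0
        if streak >= sustained:
            return True
    return False


morning = [1, 3, 3, 3, 2]  # sampled in-flight runs against a cap of 3
print(envelope_alarm(morning, cap=3, sustained=3))
```

Requiring a streak rather than a single sample keeps a brief burst from paging anyone, while a queue saturated all morning still trips well before the invoice does.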

The uncomfortable corollary is that an agentic loop's price is not what a single run costs; it is what unbounded enthusiasm costs. Every organisation that decided agents were free because "they were in the plan" has a story about the month the bill doubled and nobody could name which team owned the spike. Ship's posture is the opposite: cheaper than it looks because bounded, predictable because pick-first, legible because the choke point for provider choice and model tier is one file in one repo. If the arithmetic in phase zero does not add up under those constraints, the honest move is to narrow the pilot — fewer roles, fewer hours, one team — not to widen the envelope. You can always add slots when the math earns them. You almost never get to unadd them after the organisation learned the bigger number was normal.

Daily agent runs against a bounded envelope

Chapter 32 — Phase one: pilot the delivery lane

Phase one is not the moment you prove agents are clever. It is the moment you prove the delivery lane is boring enough to trust. Keep the blast radius small on purpose: one tracker project, one team, and the delivery stages only. If you can postpone audit automation, do. Audits multiply edge cases and opinions; the pilot needs a narrow question—“when a card transitions, does the right work move in the right way?”—not a referendum on every kind of ticket your org might ever file.

Success in this phase is almost insultingly concrete. When automation runs, people should see a predictable Todo → In progress transition: the card moves because a status change cascaded the FSM (or a PR-merge auto-advanced it), not because someone backstage got creative. The ticket timeline should answer a forensic question without a meeting: which run touched this card, off which transition, and when. Dispatch should never feel like magic. Every automated step should be explainable from board fields someone can read—labels, columns, assignees, guards. If you need a secret spreadsheet to justify why this ticket advanced, the pilot is still pretending to be done.

Failure announces itself in behaviours that erode trust faster than any latency graph. Duplicate PRs mean your naming and scheduling story is not tight yet—the system is forked, and humans are about to spend their week closing ghosts. Tickets moved without comments turn the tracker into a ouija board: something happened, nobody signed it, and the stand-up becomes detective work. Worst of all is when humans cannot tell whether the last change came from a bot or a teammate. That confusion is not a UX nit; it is a signal that you have automated motion without custody. Comments, workflow links, and consistent identity are how you keep the social contract intact.

You are not looking for a fireworks exit. You are looking for two weeks of boring Mondays: the same ticket classes, the same guardrails, and no emergency retro summoned because automation did something nobody can reconstruct. That repetition is the operational proof. For concrete product setup and operating patterns, see Getting started, Operating, and Use cases → ElMundi, and copy the habits, not only the filenames.

Treat boring Mondays as both an emotional and an operational metric. Exciting Mondays in agent land are expensive: surprise PRs, surprise state changes, surprise arguments about who owns what just happened. A healthy pilot is the week that starts quietly—because automation reads as infrastructure people can walk on, not weather that rearranges their plans overnight. When Monday stops being a mood problem, you have earned phase two.

Chapter 33 — Phase two: the second track

Phase one taught the delivery lane to behave. Phase two asks a harder question: where do the other voices go when they are not allowed to hijack the same lane?

Most teams already have more than one kind of problem. There is the work you ship this week, and there is the work that follows you home — tech debt that showed up in a scan, QA signals that are not quite a release blocker, security findings that deserve a thread but not a stand-up meltdown. If you pour all of that into the same tracker columns as release work, you do not get “more visibility.” You get a single WIP limit shared by two different games. Development and audit start arguing in the same queue. Throughput does not rise; it collapses, because every card looks equally urgent and almost none of them are equally actionable.

The second track is not a metaphor for working harder. It is a structural admission that these items are a different kind of ticket and belong in a place that respects that kind. Concretely, that means separate projects for tech, QA, and security findings, fed by evidence-only creation rules. A finding is not a mood; it is a pointer. If the automation cannot attach or reference what a human would call proof, it should not open the ticket. That sounds strict until you remember the alternative: a board where “consider improving architecture” passes for work, and humans spend stand-up translating opinions back into facts.
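An evidence-only creation rule is easy to state as a gate. The sketch below is one possible shape, assuming a hypothetical `Finding` record and three stream names taken from the text (tech, QA, security); the reason strings are invented for illustration and are not Ship identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical audit finding; fields are illustrative."""
    title: str
    stream: str                                        # "tech", "qa", or "security"
    evidence: list[str] = field(default_factory=list)  # links a human can open


ALLOWED_STREAMS = {"tech", "qa", "security"}


def may_open_ticket(finding: Finding) -> tuple[bool, str]:
    """Refuse to file anything that cannot point at proof."""
    if finding.stream not in ALLOWED_STREAMS:
        return False, "unknown_stream"
    usable = [ref for ref in finding.evidence if ref.strip()]
    if not usable:
        return False, "no_evidence"
    return True, "ok"


vague = Finding("Consider improving architecture", "tech")
anchored = Finding("Dependency flagged by scan", "security",
                   evidence=["scan-output-link-placeholder"])
print(may_open_ticket(vague))     # (False, 'no_evidence')
print(may_open_ticket(anchored))  # (True, 'ok')
```

The check does not judge whether the evidence is good; that remains a human call at triage. It only guarantees that every ticket arrives with something to open.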

Here is what success actually feels like on a calendar. The release stand-up board stays legible — audit noise stays out of the daily release conversation — while the audit lane still exists and still moves. Findings show up with their homework done: they reference artifacts you can open, diff, or rerun, so triage feels like triage instead of therapy. Teams handle those cards the way they handle any other actionable queue — prioritise, assign, close — because the tickets are written to be operated, not admired. Nobody has to carve out a special moral exception for “audit work” in the weekly plan; it is normal backlog hygiene with a sharper intake filter.

That normality is the point. When audit items live beside release cards, every stand-up becomes a negotiation about whether feelings count as blockers. When they live in their own separate projects, the release conversation keeps its job — ship the increment — and the audit conversation keeps its job — shrink risk with receipts. The handoff between the two is a pull request, a scan output, a test run ID: something you can link, not something you have to re-derive from memory.

Failure is equally recognisable. Vague tickets arrive with the tone of a concerned relative. Delivery throughput tanks not because the auditors are evil but because audit and development are now fighting the same WIP cap, and neither side can tell whether “In progress” means shipping or stewing. The second track exists precisely so that fight is optional.

If you want a reference implementation of the wiring, not just the philosophy, start with Use cases → ElMundi and Automations.

The uncomfortable truth phase two surfaces is that many organisations never agreed what evidence means. They agreed on tools, on schedules, on severity labels — and then discovered, under automation, that “high” without a reproducible anchor is just typography. Until you define evidence the way you define “done,” audit bots become prolific authors of ambiguity. With a shared bar, the same automation stops being a second opinion column and starts behaving like a diligent clerk: it files what you configured, with the attachments you said count.

So treat phase two like a product launch, not a checkbox after the pilot. Name owners for each finding stream. Put SLAs on triage that match the risk, not the drama. Ship templates that encode your evidence contract so a ticket cannot be “created” without the fields that make it real. The machine only files what humans configured; it does not rescue you from vague policy. Phase two is where you stop pretending one board can hold every kind of truth — and build a second lane that is boring on purpose, because boring is what scales.

Chapter 33.A — How humans review what agents wrote

Phase two is the first time most organisations have human reviewers routinely reading pull requests an agent opened. That experience is not the same as reviewing a teammate's PR, and pretending it is — through force of habit and the comforting familiarity of the existing review template — is how the first serious agent-authored regression slips past a team that thought it was being careful. The reviewer's job is not fundamentally different. The reviewer's attention budget is, because the surface area has changed.

The first habit worth installing, out loud, is to read the diff and the ticket together, in that order, and to refuse to skip either. An agent-authored PR is always reacting to a prompt and a ticket; the diff alone is half of the evidence. If the commit implements a fix that does not match the ticket's intent, the right move is to close the PR and fix the ticket and prompt, not to merge the diff because "well, the code change is fine in isolation." Code changes that are fine in isolation and wrong in context are the special horror of agent output: they compile, they pass CI, they read well in review, and they take the team weeks to undo because nobody questioned whether the ticket they attach to was the ticket the code actually solves.

The second habit is to audit the boundary, not the body. In human PR review, reviewers often skim the logic and quiz the edge cases. In agent PR review, the body of the change is often correct by construction — the model is good at local logic — and the sharp corners live in the wiring: environment variables, schema expectations, branch assumptions, tenancy shapes that the agent could not have seen. The cleanest scar in the reference org for this is dated 2026-04-11: fix(ci): resolve Neon DB/role from parent branch for PR preview. The diff was small: fifty-five lines in a shell script, two lines in a README. What it quietly fixed was an invisible assumption — that the database role matches neondb — that had been true in every environment the agent had ever seen, and was not true in Neon for PR preview branches, where the role follows the database name. No diff reader could have caught that by looking at code. A reviewer with operator memory caught it by asking, "wait, which role does this connection actually have in a preview branch?" That is the question agent PR review is built around.

The third habit is to refuse the big diff. An agent PR that touches many files and many concerns is almost always wrong, not because the model is bad at big changes but because the prompt that produced it failed to force a narrow scope. The operator-grade response is not to slog through a five-hundred-line PR out of politeness. It is to close the PR and file feedback against the skill: "this prompt is producing scope creep; please narrow." A reviewer who has merged three such PRs out of exhaustion is a reviewer who has trained the system to keep sending them. Reviewers teach the system what they accept. That cuts both ways.
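The third habit is the only one of the three that reduces to a number, so it is the only one worth sketching. The thresholds below are assumptions chosen for illustration; the text gives no numeric limit beyond treating a five-hundred-line PR as too large, and `DiffStats` and `review_decision` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffStats:
    files_changed: int
    lines_changed: int  # insertions plus deletions


# Illustrative thresholds only; tune them to what your reviewers accept.
MAX_FILES = 10
MAX_LINES = 300


def review_decision(stats: DiffStats) -> str:
    """Return "bounce" for oversized diffs, "review" otherwise."""
    if stats.files_changed > MAX_FILES or stats.lines_changed > MAX_LINES:
        return "bounce"
    return "review"


print(review_decision(DiffStats(files_changed=2, lines_changed=57)))    # review
print(review_decision(DiffStats(files_changed=14, lines_changed=500)))  # bounce
```

A bounce here is a prompt for feedback against the skill, not a verdict on the code; the size check only decides whether a human should spend attention on the diff at all.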

Put those three habits together and the review culture that phase two actually needs becomes visible. Ticket-and-diff together. Boundary over body. Small or bounced. It is faster than it looks, because an agent PR that fails the first two tests is usually bounced in under a minute, and the remaining ones — the small, scoped, invariant-honouring changes — are the cases where human reviewers still do their best thinking. That is the arrangement the rest of the book has been building towards: agents doing the motion, humans owning the outcome, review as the seam where that ownership is performed.

Chapter 34 — Phase three: when release becomes real

Phase three is when release stops being a vibe and starts being a habit. Until now you could tell yourself a comforting story: the pipeline is green, the ticket moved, the demo worked on someone’s laptop. That story was useful—it got you through a pilot and a second track for audits. It is not sufficient when money and reputation ride on what customers see in the browser. Hosted is not a synonym for “the same thing, but online.” It is a different physics: DNS, TLS, cookies, feature flags, cold starts, third-party scripts, and the slow betrayal of anything you only ever proved against localhost. Admitting that difference is not pessimism; it is adulthood for the loop.

So you wire hosted end-to-end tests to release habits the way you wire smoke detectors to bedrooms—not because you enjoy the noise, but because silence that lies is worse than noise that tells the truth. The suite should run when humans already expect to look at quality: before merge, before promote, before the narrative in the channel shifts from “we think it’s fine” to “customers are seeing it.” If E2E only runs when someone remembers, you do not have a gate; you have a lottery. If it runs on every transition without matching how your team actually ships, you have traffic, not signal. Cadence is therefore a design choice, not a default. You tune it against provider rate limits the same way you tune batch jobs against database connection pools: rate limits are not rude constraints from a vendor; they are economics. Every run has a cost in quota, time, and attention. A loop that hammers APIs until jobs flake teaches the wrong lesson—that automation is fragile—when the real lesson is that you spent your budget like water and called it velocity.

While you are tightening the release story, tighten the branch story in parallel. Duplicate PR handling should be tightened until it is boring: one naming contract, one expected automation path per ticket, extras closed without drama. Excitement here is almost always a scheduling or discipline failure dressed up as Git being mysterious. Boring duplicate handling is load-bearing; it keeps review queues readable and prevents “which PR is real?” from becoming a daily stand-up game.
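A naming contract makes duplicates mechanically detectable. The sketch below assumes a hypothetical branch scheme of `agent/<TICKET-KEY>-<slug>`; the real contract is whatever your repository enforces, and the pattern here is only an example of one.

```python
import re
from collections import defaultdict

# Hypothetical naming contract: agent/<TICKET-KEY>-<slug>,
# for example agent/ENG-123-fix-login. Adjust to your own scheme.
BRANCH_PATTERN = re.compile(r"^agent/(?P<ticket>[A-Z]+-\d+)(?:-[a-z0-9-]+)?$")


def find_duplicates(open_pr_branches: list[str]) -> dict[str, list[str]]:
    """Group open PR branches by ticket key; keep tickets with more than one."""
    by_ticket: dict[str, list[str]] = defaultdict(list)
    for branch in open_pr_branches:
        match = BRANCH_PATTERN.match(branch)
        if match:
            by_ticket[match.group("ticket")].append(branch)
    return {ticket: names for ticket, names in by_ticket.items() if len(names) > 1}


def off_contract(open_pr_branches: list[str]) -> list[str]:
    """Branches that do not follow the contract and so cannot be attributed."""
    return [b for b in open_pr_branches if not BRANCH_PATTERN.match(b)]


branches = [
    "agent/ENG-123-fix-login",
    "agent/ENG-123-retry",
    "agent/ENG-7-docs",
    "feature/foo",
]
print(find_duplicates(branches))
# {'ENG-123': ['agent/ENG-123-fix-login', 'agent/ENG-123-retry']}
print(off_contract(branches))
# ['feature/foo']
```

The second function matters as much as the first: a branch that ignores the contract is invisible to duplicate detection, which is why one naming scheme is load-bearing.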

Success in phase three has a specific shape. You catch regressions against hosted URLs before promote—not because localhost lied, but because the environment that matters is the one your users touch. Promote stays manual or policy-gated: a human or an explicit rule humans wrote and can audit—not “the agent decided.” If an agent could ship to production by default, you would have merged operations with improvisation. On-call should know When things break by heart—not because memorization is virtuous, but because at 2 a.m. nobody should hunt through six wikis for the story of triage. The framework section exists so symptom, look, and fix share one vocabulary before the pager rings.

Failure in phase three is equally specific. Flaky E2E that everyone learns to ignore until red means nothing is worse than no E2E: you pay the cost of the suite and the false confidence of green. A noisy test is not a personality quirk; it is a bug in your quality system—fix it, quarantine it with an owner and a deadline, or admit you are running theater. The other failure mode is speed without judgment: automation that runs faster than humans can review turns your team into a rubber stamp with good intentions. Throughput without comprehension is how defects become folklore—“we thought CI checked that”—when CI only checked that the script exited zero.

If you skip here from a shaky pilot, you will not skip the consequences. You will blame flaky E2E for being flaky when the real problem is that nobody trusts the delivery lane yet—so every red run feels like politics instead of information. Order matters because trust is the platform everything else runs on. Phase three does not ask you to be clever. It asks you to be honest about where the software lives, who owns the last mile to production, and what your tests are allowed to mean. When that honesty is in place, release becomes real—not louder, but quieter in the way that means you can sleep.

Chapter 35 — Owners without a hundred-page RACI

Every org eventually buys a RACI template. Someone pastes it into Confluence. Six months later nobody can name who is Accountable for the thing that just broke. The matrix was honest once; then the reorg happened, the page went stale, and the work kept moving anyway—because humans route around bureaucracy. Ship does not need another hundred-page grid. It needs named owners, a decision path when two of them disagree, and habits light enough that you will actually run them.

Ownership is not a vibe. It is a contract with reality. When a model prompt drifts, or a secret ages past policy, or a scanner flags something that blocks a release, someone must be findable without opening a wiki archaeology expedition. In practice, someone owns prompt changes—not “the org,” not “AI,” not an abstract steering group. Most teams land on a pairing: platform (the people who understand the toolchain, evals, and guardrails) with engineering management (the people who trade off roadmap risk and can say no to a ship). That pair is not decorative. It is who gets paged when the assistant starts answering in a voice you did not authorize.

Secrets are the same story with a different failure mode. Rotation is not a checkbox in a quarterly audit deck; it is recurring work that competes with features. Someone owns secrets rotation—a single name on the calendar, not a committee “aligned” on the importance of hygiene. If nobody wakes up mildly anxious about expiry dates and blast radius, you do not have rotation; you have a policy PDF and eventual regret.

Security does not own every knob, but it must own the knobs that define risk for the business. Concretely: security owns scanner policy—what runs, when it gates, what severity means in your context—and what counts as evidence when an audit ticket asks “prove it.” Without that clarity, engineering invents twelve local definitions of “fixed,” and your audit trail becomes a pile of screenshots nobody trusts. Security’s job here is to make the bar legible, not to personally merge every pull request.

Here is where full RACI templates fail you. They imply a static world. Your world is reorgs, acquisitions, and tools that did not exist when the matrix was written. The artifact rots; the work does not stop. Ship asks for something smaller and harder to fake: named owners who survive the reorg because you reassign the role, not because you hope people read the wiki. Write the name in the runbook, in the channel topic, in the on-call rotation—anywhere a new hire can find it on day one.

You still need governance you can run. Not a council that meets when the moon is full—a decision path for the predictable collision: two owners, two legitimate views, one ship date. The path can be short (“escalate to CTO”) or layered (“product + security + platform triage”), but it must be one agreed route, time-boxed, and used in anger at least once so everyone knows it works. Without it, ownership decays into politics: loud voices win, quiet risks accumulate, and your agents keep shipping because nobody had authority to stop them cleanly.

Be suspicious of committees for clock ticks. Automation does not respect quorum. Schedules slip, token budgets creep, prompts get edited in side channels. A committee can endorse a strategy; it cannot feel the responsibility of “this must run Tuesday” or “we are burning margin on this model.” Automation needs someone who feels responsibility for schedule, tokens, and prompts—a human name who would be embarrassed if the job silently failed or the bill doubled. Embarrassment is underrated infrastructure.

None of this replaces trust or good judgment. It replaces the fantasy that a giant RACI in Confluence is the same as control. Ship is built for operators: fewer boxes, more names, explicit disagreement handling, and owners who would notice if the machine drifted while everyone was in a meeting. That is not bureaucracy trimmed to zero—it is governance trimmed until you will actually follow it.

Chapter 35.A — Onboarding a human into an agent team

The last thing a book like this owes the reader is a page about people joining an agent-assisted team. Not because onboarding is the most dramatic part of the system — it is not — but because it is where every earlier chapter either holds up under pressure or does not. A new hire walking into a delivery lane on their first Monday is the cheapest truth-detector an organisation has: within a week they will either understand how work flows, or they will pretend they do. Both outcomes are shaped by what you wrote down before they arrived.

The onboarding contract that earns trust has three visible layers, and the reference org arrived at each by rewriting its own documentation in public. The first layer is a single page that explains the loop in a paragraph, not a tour of seven tools. In the reference org an early version of this was AUTONOMOUS-SETUP.md — a 2026-04-07 commit (docs(linear-agent): framework-first site, ElMundi examples, tools & prompts) first rewrote it around the loop's verbs with vendors named only afterwards, back when those verbs were pick → launch → PR → merge → audit. The verbs have since become the FSM grammar — transition → planning bundle → implementation → validation → review → merge — but the test of a good "day one" page did not change: whether somebody who has never touched Linear, GitHub, or the agent runtime can still predict what the system will do when they move a card from Todo. If the page leaves them guessing, it is a tour, not an onboarding.

The second layer is introducing the artifacts before introducing the dashboards. The same day as the page rewrite, the reference org also shipped docs(linear-agent): tech-writer + stakeholder structure, which split documentation responsibility between authors who write for operators and authors who write for stakeholders, and pointed every new hire at the role definitions as the first thing to read. Those skills used to live beside the workflows in git; they are server-side agent_roles/*.md now, but the ordering principle survived the move. A new engineer who reads developer.md, decomposition.md, and clarification.md before they open any dashboard understands the machine's job before they see the machine's motion. Dashboard-first onboarding produces people who know how to stare at charts. Artifact-first onboarding produces people who know what the system is contractually promising.

The third layer is a first-week task that is small, real, and artifact-shaped. Not "read the wiki and come to standup." A micro-task: propose a one-line change to an existing prompt artifact via shipctl feedback new, or author a single new fixture for an existing eval, or add one row to the label-contract manifest. That task teaches the new hire, in about an hour, the three things the rest of the book takes forty chapters to say: that artifacts are where the intent lives, that feedback is how artifacts change, and that changes are reviewed like code. Compared with the traditional first-week ritual of "shadow a senior for a sprint," it is almost rude in its concreteness. It is also the only onboarding shape that leaves, at the end of the week, evidence that the person did work the team can inspect.

A small warning before the chapter closes. Onboarding into an agent-assisted team is more intimidating to new hires, not less, in the first week — because the system is partly autonomous and the new hire worries they will break something they do not yet understand. Leaders should say out loud, more than once, that the fences in chapter 4 exist precisely so a new hire cannot break anything irreversible by editing an artifact in the way a good first task would demand. Say it on day one. Repeat it on day three. The rest of the book's work — the fences, the bounded concurrency, the audits, the feedback loop — is what earns you the right to mean it.

When things break

Symptom → look → fix. Example-specific product setup and operating guidance sits in Getting started, Docs, and Use cases → ElMundi; this section is the story of triage.

Chapter 36 — Symptom, look, fix

Nobody opens the manual at chapter thirty-six because the morning felt orderly. They arrive because something misbehaved: a run went green and the board stayed wrong, or a ticket transitioned and nothing fired, or two pull requests showed up for the same card like twins nobody invited. In those moments you do not need a sermon on architecture; you need a way to walk from what hurts to where to look without pretending the system is simpler than it is. This chapter is written in that spirit: a short story of triage, not a substitute for your organisation’s runbook. It will not paste the exact commands, hostnames, or env var spellings your repo uses; those live in the product docs and in your own workspace. Here, the point is the shape of the hunt (symptom first, evidence second, fix last), so that when you are tired, you still move in a straight line.

Symptom-first triage is not pessimism; it is respect for how humans remember incidents. Months later nobody quotes the inner name of a workflow step; they quote the user-visible hurt—“the agent never picked up my ticket,” “CI is green but Linear disagrees.” A good triage map meets people at that sentence and only then names the subsystem. That is also why this page refuses to be a runbook: runbooks age per repo, per tracker, per cloud account. A framework manual that pretends otherwise becomes fiction the week after the first fork. So read what follows as a compass, not a checklist. When a row points at “see example SDLC doc” or “see Cursor doc,” that is intentional deferral—the examples carry the verbs; this chapter carries the why.

| Symptom | Where to look | Typical fix |
| --- | --- | --- |
| Moved a ticket, nothing fired | Wrong state / missing signal label / guards | `dispatch_ticket` over MCP to fire the current stage and recover a stall the poller missed; it honours every gate and returns the reason |
| `dispatch_ticket` returns `blocked_by_dependency` | Linear `blocks` relations on the ticket | Clear or close the blocking ticket; the dependency gate is doing its job, not malfunctioning |
| `dispatch_ticket` returns `no_routine` | Ticket's resolved FSM stage | The state/labels resolve to a stage with no routine (often terminal); set the intended entry state instead |
| Ticket frozen, never advances | `blocked` signal label (overlay freeze) | Remove the `blocked` label; the dispatcher refuses frozen tickets by design |
| Agent run never starts | Agent provider dashboard + runtime secret | Verify the agent profile / runtime credential; check the dispatched GitHub Actions run |
| Duplicate PRs for one ticket | Branch naming contract drift / stale lease | Keep one naming scheme; close extras; confirm the `project_lock` released |
| Scanner job skipped | Missing scanner token | Expected in dev; add token for full signal; document "skip is OK here" |
| Prompt change "did nothing" | Role definition not deployed / cached | Confirm the new `agent_roles` revision is live; re-run the stage with `dispatch_ticket` |
| Rate limits / throttling | Concurrency envelope vs provider quota | Lower caps / leases; reduce cascade fan-out; ask vendor for quotas |

Deep setup: Getting started · Docs · Terms: Vocabulary.

Treat the table as portable scaffolding: copy it into your internal runbook, then annotate the middle column with your workflow names, secret keys, and dashboards until “where to look” is a five-second glance for the next on-call. The third column stays deliberately terse—enough to suggest direction, not enough to replace the example repo’s step lists—because the failure mode of framework docs is pretending one paragraph can hold every fork your company will invent. Keep this page as narrative; keep Examples as executable truth. When the next odd symptom lands, read the row that matches the pain, follow the pointer, and let the story end where the tooling begins.
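If your runbook lives next to code, the same scaffolding can be carried as data. The sketch below covers only the two rows of the table that are keyed by a returned reason; `TriageRow`, the `local_pointer` field, and the fallback row are assumptions added for illustration, and the fallback's wording is not from the product docs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageRow:
    where_to_look: str
    typical_fix: str
    local_pointer: str = ""  # annotate with your own workflow name or dashboard


# Rows copied from the table above, keyed by the reason dispatch_ticket returns.
TRIAGE_BY_REASON = {
    "blocked_by_dependency": TriageRow(
        where_to_look="Linear blocks relations on the ticket",
        typical_fix="Clear or close the blocking ticket",
    ),
    "no_routine": TriageRow(
        where_to_look="Ticket's resolved FSM stage",
        typical_fix="Set the intended entry state instead",
    ),
}

# Hypothetical fallback for reasons the map does not know yet.
UNKNOWN = TriageRow(
    where_to_look="Ticket timeline next to the run",
    typical_fix="Follow your own runbook and add a row once the cause is known",
)


def next_step(reason: str) -> TriageRow:
    """Return the triage row for a dispatch reason, or the fallback row."""
    return TRIAGE_BY_REASON.get(reason, UNKNOWN)


row = next_step("no_routine")
print(row.where_to_look)  # Ticket's resolved FSM stage
print(row.typical_fix)    # Set the intended entry state instead
```

Filling in `local_pointer` per row is the code-shaped version of annotating the middle column: the five-second glance becomes a field the next on-call can read.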

Chapter 36.A — Architectural vs cosmetic

The discipline of the previous chapter — symptom, look, fix — carries you a long way, but it does not, on its own, catch the class of error this chapter names: some symptoms are not the symptom they look like. They arrive in the vocabulary of taste and they invite a response in the vocabulary of taste, and the response, however careful, leaves the underlying defect untouched. The complaint sounds cosmetic. The defect is architectural. A polish pass on an architectural defect produces a more attractive version of the wrong thing, and the better-looking version is harder to argue with than the version that came before.

The diagnostic question is short, and we have learned to ask it once, out loud, before we touch a single pixel. Would moving the boundary make the complaint disappear? If the honest answer is yes — if the page would stop reading badly when its contents were split into two pages, or when two pages were joined into one, or when a section were lifted out and given its own home — then the complaint is architectural and a polish pass will fail. If the honest answer is no — if the page is asked to do exactly one job and is simply doing it gracelessly — then the complaint is cosmetic and a polish pass will succeed. The question takes a minute. The savings, when the answer is yes, are measured in days.

The general pattern is easier to recognise once you have been bitten by it. Surfaces that conflate two audiences look bad in the same way. They look noisy. They look as if their type ramp has been chosen by committee. They look as if a designer left in the middle of the work. The reading experience is one of small, constant context switches: a sentence written for one reader, then a sentence written for another, then a heading whose vocabulary matches neither. A designer looking at that surface will reach, reasonably, for typographic tools. Tighten the hierarchy. Quiet the colour. Standardise the spacing. None of these moves are wrong. None of them touch the defect. The defect is that two concerns have been placed in one container, and the container is now being asked to seat two readers on the same bench. The visual noise is the bench complaining.

The cost of misdiagnosis is two-layered, and the second layer is the one that catches us out. The first layer is the obvious one: a polish pass on an architectural defect consumes the same hours as the correct fix, and at the end of those hours the defect remains. We have spent the budget and bought nothing. The second layer is quieter and more expensive. The wrapper is now prettier. The page reads more smoothly, the type ramp is more confident, the spacing has been settled. The user who returns to it can no longer trust their first impression, because their first impression is now of a confident, well-set page that still does not quite make sense. Their own eye has been trained on a more attractive version of the wrong thing. The signal that would have prompted them to file the complaint a second time has been damped. The defect is now harder to see, and the people who would have surfaced it have lost the small visual cue that would have made them speak up.

On 2026-05-13 we got a complaint about a surface on one of our sites that read, on first listening, as a typography problem. The type ramp looked uneven, the headings sat awkwardly against the body, and the page as a whole gave the impression of having been drafted in haste. A polish pass would have been straightforward and defensible: tighten the hierarchy, settle the spacing, retire one of the heading weights, ship by the end of the afternoon. We listened to the complaint a second time before we did any of that, and on the second listening the claim was different. The reader was not telling us that the page looked unkempt. They were telling us, in the only language they had, that the page had two voices on it. Two kinds of content, intended for two different audiences, had been placed in one container because the container had been cheap to build when both kinds were thin. The fix was to split the surface into two surfaces, each with one audience, each with one voice. The typography became fine on its own once the structure was right. We had been about to spend a careful afternoon on the wrong problem, and the only thing that stopped us was the boundary question. We have asked it first ever since.

The short rule that came out of that day is the one we now teach to anyone joining the team. Cosmetic complaints describe what is on the page. Architectural complaints describe what is on the page together with what else is on the page. Listen for the conjunction. Listen for "the X and the Y both being here," or "this section and that section in the same place," or "I was reading about one thing and then suddenly I was reading about another." That phrasing is the tell. Users rarely file architectural complaints in architectural vocabulary; they file them in the language of taste, because taste is the only register they have for "something is wrong with how this fits together." The work is to translate.

None of this displaces the discipline of the previous chapter. Symptom, look, fix remains the shape of the hunt. What this chapter adds is a single instruction to the look-step, which has always been the step that does the real work. Before you reach for the cosmetic fix, ask the boundary question. If moving the boundary would make the complaint disappear, the look-step is not yet finished, and the fix you were about to apply is the wrong one. Read the complaint twice. The first read tells you where it hurts. The second read tells you whether the hurt is on one page or between two.

Chapter 37 — Green, but nothing happened

You moved a ticket, or a run went green, and the lane still looks idle. Before you treat that as failure, flip the default: often this is correct. The dispatcher read the board, resolved the stage, and either found no routine to fire or refused the ticket on a gate. That is not a stalled brain; it is a fence doing its job. Silence is cheaper than forcing the wrong card into motion just so something “happens.” The wrong ticket in the wrong stage costs reviews, reverts, and trust — a kind of interest you pay in meetings instead of commits. An engine that occasionally does nothing on a transition is healthier than one that always grabs something and calls it progress.

When you expected motion anyway, resist the story where the machine “did not feel like working.” Agents do not have moods. The interesting failure mode is almost always guards: the ticket is in the wrong project, in a state that resolves to no routine, missing a signal label the stage demands, carrying the blocked freeze label, or held behind an unresolved dependency. Sometimes the confusion is simpler: you watched the wrong show. Intake, developer, audit — different stages, different faces. Verify you are looking at the run that fired off the transition you made. If the stage that ran was not the stage you narrated in your head, green plus silence is exactly what you should see. When in doubt, open the ticket’s timeline next to the run and ask whether they describe the same story — semantic drift is a guard too, just wearing language instead of a label.

Treat verification as boring detective work, not vibes. The fastest probe is dispatch_ticket over the MCP edge: it fires the ticket's current stage immediately and, when it cannot, returns the reason — blocked_by_dependency, no_routine, a held lease. Read that reason the way you would read a failing if statement: every gate is a door that can stay shut. If the ticket still does not advance after a clean dispatch, that is still information — your mental model and the board disagree, and only one of them is authoritative.

Teams that cannot sit with a quiet lane reach for the wrong lever. They loosen the fences — strip signal labels, drop the dependency gate, widen what a stage will act on — until the machinery always finds a victim. That “fix” trades discipline for motion. You get noise, duplicate work, and tickets advanced out of order, then you blame the automation for being sloppy when you were the one who removed the fence. The board looks “alive” in the worst way: lots of branches, overlapping claims, people asking which PR is the real one. Loosening guards is how working systems break without anyone committing a dramatic mistake; it is death by small, reasonable-sounding concessions, usually justified in chat as “we just need it to do something.”

The habit that saves you is not optimism or pessimism about AI. It is curiosity phrased as a concrete question: which guard disagreed with what I assumed about this ticket? Walk the fields like a checklist — project, state, signal labels, dependency blocks, the lease — each is a vote. One of them voted no, and dispatch_ticket will usually name it. Your feelings about whether the run “should” have done more are not data; the answer lives in fields and the dispatch reason, not in whether the morning felt quiet.
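The checklist above can be sketched as a function that asks each guard in turn and returns the first refusal. This is a minimal illustration, not the dispatcher's code: the Ticket fields, the ROUTINES table, the signal:go label, and every reason string other than blocked_by_dependency and no_routine are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stage table: (project, state) -> (routine name, required signal label).
ROUTINES = {
    ("web", "Todo"): ("intake", None),
    ("web", "In Progress"): ("developer", "signal:go"),
}

@dataclass
class Ticket:
    project: str
    state: str
    labels: set = field(default_factory=set)
    open_blockers: int = 0      # unresolved dependencies
    lease_held: bool = False    # another run already holds the ticket

def refusal_reason(ticket: Ticket) -> Optional[str]:
    """Return the first guard that votes no, or None if the stage may fire."""
    if "blocked" in ticket.labels:
        return "blocked_label"              # assumed reason name
    if ticket.open_blockers > 0:
        return "blocked_by_dependency"
    routine = ROUTINES.get((ticket.project, ticket.state))
    if routine is None:
        return "no_routine"
    _, required_label = routine
    if required_label is not None and required_label not in ticket.labels:
        return "missing_signal_label"       # assumed reason name
    if ticket.lease_held:
        return "lease_held"                 # assumed reason name
    return None
```

A ticket in a state with no entry in the table comes back as no_routine, which is the green-and-silent case this chapter describes: nothing failed, one guard simply voted no.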

Learn to love the quiet lane when the rules say there is nothing to advance. Learn to distrust your discomfort when the rules say something should have moved and did not — and go hunt the guard that said no. Green with nothing to show is only a mystery until you read the board the same way the dispatcher does.

Chapter 38 — Red in CI, opaque log

CI does not fail in prose. It fails in a step name — a short label at the top of a rectangle that turned red. Everything below that label is noise until you know which rectangle you are in. So stop scrolling to the bottom first. The bottom is where fear lives: stack traces, retry spam, a thousand lines that all swear they are the real story. They are not. They are the footnote. The headline is the step that stopped the run.

Read CI the way you read a newspaper someone left on the train. Headline first. If the headline says “Install dependencies,” the article is not about your model’s creativity. If it names the dispatch step, the article is not about TypeScript. If it names the agent run, the article is not about whether the ticket was well written. Get the genre right before you invest emotion in the fine print.

Checkout and install are the boring front page. When they break, you are usually looking at infrastructure, cache keys, or lockfiles — the physical world of the runner: disk, network, versions pinned or not pinned. Blaming the agent here is like blaming the weather report for the rain. Fix the roof.

Dispatch is the politics desk. The engine talks to your tracker the way a reporter talks to a source. If the credential is wrong, the state is wrong, or the project/state/label story no longer matches what humans renamed last Tuesday, the dispatcher resolves to a refusal before anyone intelligent gets involved. That is a feature. A dumb refusal at dispatch is cheaper than a smart mistake on the wrong ticket — and it carries a reason you can read.

The agent run is the foreign bureau — another system, another auth story, another set of secret names that must match what the run actually injects. This is where “it works on my laptop” and “it works in the dashboard” collide with “the run never saw that variable.” The failure mode is often embarrassingly literal: one underscore, one wrong environment, one scope missing from a token. The log will happily imply cosmic network instability while the headline quietly says the agent run never started.

Then there are tests after an agent opened a pull request. That is still a human-shaped story. The PR and the ticket are your byline: who asked for the change, what was in scope, what evidence was attached. A red test might be a product bug the agent surfaced, a flaky suite you have been politely ignoring, or a guardrail doing its job. Your job is to separate those without losing the thread. Conflating “the bot broke CI” with “CI told the truth about our app” is how teams disable checks and call it velocity.

Flaky versus product is not a philosophical debate; it is a bookkeeping exercise. If the same step fails intermittently on unrelated changes, you are not debugging a feature — you are debugging trust in the signal. If it fails once, on one change, with a reproducible assertion, you might actually be done with infrastructure and finally in the product. Ship stays sane when you tag which kind of red you are looking at before you open the twentieth log tab.
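The bookkeeping can be as small as grouping red runs by step and counting how many distinct changes each step failed on. The sketch below assumes a hand-kept list of (step name, change id) pairs; the tag names and the more-than-one-change threshold are illustrative choices, not part of the framework.

```python
from collections import defaultdict

def classify_red(failures):
    """failures: iterable of (step_name, change_id) pairs for red runs.

    A step that failed on more than one distinct change is tagged
    'flaky-suspect'; a step that failed on exactly one change is tagged
    'product-suspect'. The tags are a starting hypothesis, not a verdict.
    """
    changes_by_step = defaultdict(set)
    for step, change in failures:
        changes_by_step[step].add(change)
    return {
        step: "flaky-suspect" if len(changes) > 1 else "product-suspect"
        for step, changes in changes_by_step.items()
    }
```

Whether the changes are truly unrelated, and whether the single failure has a reproducible assertion, still needs a human look; the tally only tells you which question to ask first.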

Here is the empathy angle. Long logs are an empathy test — for the person reading at midnight and for the system that produced them. Panic scrolling is a natural response to volume. The framework’s gentle discipline is: step name first, stack trace second. Treat the stack as a footnote you consult after you know which chapter of the book you are in. Otherwise you spend an hour proving the network timed out when the real plot was a secret typo — same shape as a network story in the dark, completely different ending.

Teach that habit on day one. It sounds like a small courtesy. It is actually hours returned to the team — hours that would otherwise go to storytelling sessions where everyone agrees Something Must Be Wrong With The Cloud because nobody looked at the headline.

Red in CI is not a mystery novel. It is a newspaper. Read the headline, then the footnotes. Match the step to the world it touches — checkout, dispatch, agent run, test — and keep the ticket and PR in frame when the failure is about code. Opaque logs are usually opaque because we read them backwards.
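Headline-first reading can be sketched in a few lines: find the first failed step, then map its name to the world it touches. The step-name fragments and the "failure" conclusion value below are assumptions for illustration; a real workflow will use its own labels.

```python
# Hypothetical step-name fragments, checked in order.
STEP_WORLDS = [
    (("checkout", "install"), "runner: infrastructure, cache keys, lockfiles"),
    (("dispatch",), "tracker: credentials, project/state/label vocabulary"),
    (("agent",), "agent run: secret names, environment, token scopes"),
    (("test",), "code: product bug, flaky suite, or guardrail"),
]

def first_failed_step(steps):
    """steps: ordered (name, conclusion) pairs; return the first failed name."""
    for name, conclusion in steps:
        if conclusion == "failure":
            return name
    return None

def world_for_step(step_name: str) -> str:
    """Map a failed step's name to the world it touches."""
    lowered = step_name.lower()
    for fragments, world in STEP_WORLDS:
        if any(fragment in lowered for fragment in fragments):
            return world
    return "unknown: read the step definition before the log"
```

The stack trace is only worth opening once this lookup has told you which desk the story belongs to.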

Chapter 39 — It worked yesterday

“It worked yesterday” is not a mystery. It is a sentence about the world outside your repo.

When something that ran fine on Tuesday fails on Wednesday, your instinct will be to blame the machine you can see: the workflow file, the prompt, the agent, the engine. That is the wrong order. Suspect external drift first — the parts of the system nobody committed. A state got renamed. A signal label was “cleaned up.” A project ID moved during a reorganisation. A token expired, or someone tightened scopes because security asked a reasonable question on Slack. Those changes do not show up in git diff. They show up as silence, transitions that resolve to no routine, or errors that read like nonsense until you remember the board is not a file.

Ship is stable only when the tracker behaves like a contract — or when you treat changes to it the way you would treat a breaking change to a public API. You would not rename a field in production on a Friday afternoon and hope every client magically noticed. You would announce a migration window, bump a version, run a dry run against staging, and only then flip the switch. The issue tracker is the same class of surface. It is the API your automation calls. The HTTP client is a script; the JSON is a ticket; the schema is whatever your team typed into Linear, Jira, or the tool du jour. If the schema moves without a migration plan, your automation did not get dumber. The ground moved.

Drift is embarrassingly human. Someone merges two workflow states because the board felt cluttered. Someone renames a label because the old word embarrassed them in a leadership review. Someone archives a project and creates a new one with a shinier name. Someone deletes “unused” custom fields that were only unused because humans stopped looking at them — the machines were still reading them every hour. None of that is villainy. It is the normal entropy of organisations that forget their tracker is infrastructure, not wallpaper.

The uncomfortable truth is that automation has no sense of intent. It does not know you meant well. It matches strings. When the string changes, the contract breaks. From the engine’s point of view, that is indistinguishable from a vendor shipping a breaking API with no deprecation notice. The FSM expected the entry state to be exactly Todo, or a signal label to read exactly as written; the workflow state was renamed, so the transition now resolves to no routine. To you, it is the same idea. To the machine, it is a 404 dressed as a label.

Tokens and scopes drift in the same family. Credentials feel like plumbing until they are not. A rotation policy does its job; a new OAuth screen asks for one fewer checkbox; an admin clicks “principle of least privilege” and trims a scope the automation needed but nobody documented. The failure mode is not always “401 Unauthorized” in giant letters. Sometimes it is partial data, empty search results, or tickets that exist in the UI but not in the query the bot is allowed to see. Again: external drift. Again: suspect it first.

So what is the fix? Not freezing the board in amber. Boards are allowed to evolve. People are allowed to tidy. The fix is migration discipline — the boring kind good platform teams already practice for real APIs.

Say the change out loud before it lands: we are renaming this state, we are collapsing these labels, we are moving work into a new project. Update the FSM state mapping and signal labels (tracker_fsm.py) and the guards in the same change window — or run old and new names in parallel until the logs look boring. Probe with dispatch_ticket after the change if you can; at minimum, move one ticket and watch which stage resolves and whether anything fires. Watch the first few transitions after the change like you would watch a deploy — not because you do not trust the code, but because you do not trust the universe to stay still.
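Running old and new names in parallel can be as simple as an alias table consulted before the exact-match lookup. The sketch below is a guess at the shape, not the contents of tracker_fsm.py: the mapping, the alias, and the "Ready" rename are all hypothetical.

```python
from typing import List, Optional

# Hypothetical mapping; the real one lives in tracker_fsm.py and may differ.
STATE_TO_ROUTINE = {"Todo": "intake"}

# During the migration window the new name resolves to the canonical old one.
STATE_ALIASES = {"Ready": "Todo"}

def resolve_routine(tracker_state: str, seen_aliases: List[str]) -> Optional[str]:
    """Resolve a tracker state to a routine, accepting old and new names."""
    canonical = STATE_ALIASES.get(tracker_state, tracker_state)
    if canonical != tracker_state:
        # Record alias hits so the team can see when the old name has gone quiet.
        seen_aliases.append(tracker_state)
    return STATE_TO_ROUTINE.get(canonical)
```

Once the log of alias hits shows only the new name for a while, the mapping can be switched to it and the alias dropped in a reviewed change.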

If your organisation cannot do that yet, start smaller: treat tracker field names like enum values in a shared library. Renaming is a PR someone reviews. Deleting is a PR with a checklist. “Nobody uses that label” is not evidence; grep your automation and the FSM mapping first.
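A reviewable check for "nobody uses that label" is a set difference between the names the automation matches on and the names the tracker currently has. This is a minimal sketch; how the two sets are collected (a grep over the FSM mapping, an export from the tracker) is left as an assumption.

```python
def vocabulary_drift(expected: set, live: set) -> dict:
    """Compare names the automation matches on with names the tracker has now."""
    return {
        # The automation still matches on these, but the tracker no longer has them.
        "missing_in_tracker": sorted(expected - live),
        # The tracker has these, but no mapping mentions them yet.
        "unknown_to_automation": sorted(live - expected),
    }
```

An empty missing_in_tracker list is the condition to check before a rename or delete is merged; anything in it is a transition that will resolve to no routine.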

“It worked yesterday” stops being a ghost story when you admit the obvious. Yesterday the API matched. Today it does not. The framework did not rot overnight. The interface moved. Find the diff in the world, not only in the repository — then migrate it the way you would migrate anything else your systems depend on: deliberately, visibly, and never silently on a Friday.

Note — Field note: A 2026-04-14 commit titled intake skips rows past intake label is this chapter's scar in the reference org. Back then planning was a multi-stage chain, and the intake step had been starving clarification downstream for a full day because upstream someone had introduced a wave of new labels — stage:ba, ready:ba, ready:developer, needs:clarification — and Todo tickets that already carried them were still eligible for the old query. The code had not changed. The tracker's vocabulary had. The fix excluded the new labels from intake. That multi-stage label chain is itself archaeology now — planning collapsed into a single decomposition bundle (planning:anchor + stage:decomposition), so there is no ready:ba → ready:developer hand-off to starve. But the scar's lesson outlived its labels: treat a change to the FSM's state-and-label vocabulary the way you would treat an API migration, or your automation will find out for you. (The principle that saved the day — a guard said no; do not loosen the guard, fix the schema — is the same one Chapter 37 is built on.)

Chapter 40 — Fix in place or escalate

Incidents come in two temperatures. One is hot and local: a wrong value somewhere—a secret, a label, a line in a prompt—made this run lie. You can name the file, merge the fix, point at the commit, rerun, and watch the world align. That is fix in place. It is the kind of work that feels good on-call because it ends: you closed a loop with evidence, not with hope.

The other temperature is warm everywhere: the same class of failure keeps visiting different tickets. Pick is “random.” PRs keep overshooting scope. Reviewers see the same class of surprise every Tuesday. No single commit will cure that, because the system is doing what you built it to do—you just did not mean it. That is escalate: not theatrics, not blame, but a deliberate promotion of the problem from triage to design. Guards too loose, prompts too vague, a schedule too aggressive for human review capacity—these are levers you adjust with owners and a calendar, not with a midnight tweak between pages.

Escalation is how you stop paying interest on the same bug every sprint. Interest looks like heroic on-call: the same patch, the same Slack thread, the same “we’ll tighten that later” that never arrives. Compound interest looks like a team that stops trusting automation because nobody admitted the pattern was structural.

The boundary between fix and escalate is a skill teams learn with scar tissue, and it helps to name examples out loud. A single bad secret is a fix. “We rotated secrets again and picks still fail randomly across projects” is an escalate—you are past the typo; you are in identity, mirroring, or query land. A vague prompt merged once is a fix. A prompt that keeps producing scope creep across unrelated tickets is an escalate—the model is not having a mood; the instruction set is under-specified for the blast radius you gave it.

Say the distinction in the handoff, not only in your head: “this was a hotfix” versus “this needs a design pass.” On-call should not spend a month patching a hole in the hull while the ship keeps sailing. Fixes in place keep water out of one cabin; escalation gets someone to redesign how the bilge connects. Both are respectable outcomes. What burns people out is pretending every flood is a bucket problem when the map shows a recurring crack.

Fix in place when you can point to the commit and the class of failure dies with it. Escalate when the class survives the commit. The framework only asks you to be honest about which story you are in—so the humans fixing tonight are not the same humans paying for architectural debt forever.

Note — Field note: On 2026-03-16 the reference org's main branch received fifteen commits whose subject line was, with trivial variation, fix(ELM-64): keep zero-target standup runs successful on Slack membership errors. Slack was changing the English inside its error responses; each commit added the next spelling to a growing match list. By commit number three the right action had already stopped being "another patch" and started being "escalate the artifact that chose to match on vendor English." The log is a textbook: if you have ever wondered what "compound interest on a warm-everywhere incident" looks like in a calendar day, here it is. Read the commit log, not only the code.
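One way to escalate the artifact rather than patch it again is to classify on a machine-readable error code instead of the message wording. The sketch below assumes the vendor response carries such a code in an error field next to an ok flag; the specific codes and function names are illustrative, since the text does not say which errors the reference org actually saw.

```python
# Illustrative codes; the set the reference org needed is not given in the text.
MEMBERSHIP_ERROR_CODES = frozenset({"not_in_channel", "channel_not_found"})

def is_membership_error(response: dict) -> bool:
    """Classify by the machine-readable code, never by the message wording."""
    return (not response.get("ok", False)
            and response.get("error") in MEMBERSHIP_ERROR_CODES)

def standup_run_succeeds(targets: int, response: dict) -> bool:
    """A zero-target run stays successful when the only problem is membership."""
    if response.get("ok", False):
        return True
    return targets == 0 and is_membership_error(response)
```

The match list now changes only when the vendor adds a new code, not when it rewords a sentence, so the class of failure has a chance to die with one commit.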

Lighthouse

Chapter 41 — Why Lighthouse is in this book

We are aware of what this looks like. A book about a delivery system, disciplined for forty chapters about staying on its own subject, suddenly opens a Part named after a different product. The reader who has trusted the line so far is entitled to a flicker of suspicion at the heading. We have not lost track of the line. This chapter, and the chapters after it, exist in spite of the suspicion, not because we forgot it was watching.

The seam that brings us here was already in the book before we wrote this Part. In the chapters on the right skills, we argued — slowly, and at some cost to the prose — that a role cannot carry the facts it needs while doing the work. The judgement belongs to the role; the facts belong outside it. The knowledge a role needs in flight does not live inside the role. It lives in a search engine the agents maintain over the documents the company already writes. The skill asks. The search answers. We wrote that sentence and moved on, because the chapter was about the skill and not about the engine.

But a sentence like that, taken seriously, does not let the book move on. It points at something the argument depends on and then declines to name it. If the knowledge a role needs has to live in a search engine the agents maintain, the obvious question is which search engine. Not which product — that is the wrong question — but the more honest one underneath: what does that engine have to look like, and what does it cost a team to keep it honest? A book that names a constraint and refuses to walk into the substrate the constraint demands is a book that has only earned half of what it claims.

Lighthouse is the answer to that question in our own shop. It is not the protagonist of this book, and it is not, in these pages, a product we are selling. It is the named substrate the earlier chapters described in the abstract. Without it, the skills argument is a posture — a clean shape on a page, undefended by the work of actually running the system the shape implies. With it, the argument has somewhere to live. That, and nothing more than that, is the reason a Part with this title sits between Chapter 40 and the Manifesto. We could not write the second half of the book with integrity while pretending the first half pointed at nothing in particular.

It is worth being plain about what the four chapters after this one are. They are not a pitch. They do not claim that Lighthouse is the right substrate for the reader's team, or that the reader who builds something else has misunderstood the argument. They go into the substrate the way the rest of this book goes into the system — what shape it takes when an agent has to read from it, what discipline a team applies when it publishes numbers about a knowledge base of its own, what it means for a page to be useful to a human reader and an agent reader at the same time, and what we have had to delete or freeze or version when the corpus stopped behaving. They are working notes, not advertising.

We say this because we want the reader to feel free to disagree about Lighthouse without losing the thread of the book. The argument the earlier chapters made does not require Lighthouse to be the engine the reader chooses. It requires some engine the agents maintain to exist. If the reader has already built one, or has chosen to buy one, or comes from a tradition that names this substrate by a different word, the four chapters following this one are still ours to offer and theirs to use. The shape of the discipline travels. The name on the door does not.

We will close on the line that holds the two halves of the book together. The skills half defended a role: a person, with judgement, kept empty of company-specific lore so the lore did not rot inside them. The knowledge half names the world that role operates inside: the place where the lore lives, where it ages honestly, where it can be searched for by a specialist who has chosen not to carry it. Neither half makes sense without the other. A role with nothing to read is a posture; a corpus with nobody to read it is a museum. We needed both halves to be a book at all, and so we wrote both halves, and this chapter is the hinge between them.

Chapter 42 — Why search beats catalog

A skill carries judgement. The world the skill works inside carries facts. The previous Part has already said why the two cannot be collapsed without one of them rotting, and the previous chapter named the substrate the argument depends on. The question this chapter takes up is the next one down: what kind of substrate is honest about being a fact-store, and what does that honesty look like once the substrate has to run.

We begin with the property that decides most of the others. A fact-store that asks people to copy facts into it is already lying about what it is. The team has somewhere it already writes — the architecture page edited the week before a launch, the runbook a senior engineer rewrote after the second outage, the post-mortem from the Tuesday the queue backed up, the on-call rotation the manager updates when somebody changes teams, the long Slack message from the principal that, by Friday, everyone has copied into their own notes. Those documents exist because the work demanded them. They are kept current, when they are kept current, because somebody had to live with their absence. A substrate that pulls from there — that indexes the documents the company is already writing, in the places the company already writes them — ages at the speed of the source. A substrate that demands a separate authoring step ages at the speed of the team's discipline, which is a faster clock than anyone running it would like to admit. The aggregator wins not because it is cleverer but because it inherits the freshness of the work, instead of competing with the work for the writer's attention. Authoring tools rot because they ask for a second pass. Aggregators rot only when the source rots, and the source is the thing the team will keep alive if they keep anything alive at all.

The second property is ranking, and it is the property the directory-shaped substrates quietly skip. A search that returns ten plausible results in any order is not a search. It is a directory under a different name, with a query box bolted to the front. To be a search engine in the operational sense, the substrate has to know that a document edited this morning, touching the question being asked, outranks one last edited in September that touches the same question. It has to know that a canonical architecture page, the one the team has agreed describes the system as it is now, outranks a meeting note from a Zoom that three people remember and seven did not attend. It has to combine the cheap lexical signal — the words on the page that overlap the words in the query — with the more expensive semantic signal that recognises a question asked in different vocabulary than the page uses. The signals are not exotic — recency, source priority, hybrid retrieval that blends lexical and semantic similarity. None of them are inventions. What is non-optional is that the substrate apply them at all. A retrieval surface that returns the right document third, behind two weaker ones, is a surface the agent has to second-guess, and an agent that second-guesses the substrate is doing the substrate's job for it.
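The three signals named above can be combined in a few lines. This is a sketch of one possible blend, not Lighthouse's ranking function: the equal weights, the 90-day half-life, and the field names are all assumptions chosen to make the example readable.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    lexical: float          # 0..1 word-overlap score from a lexical retriever
    semantic: float         # 0..1 similarity score from an embedding retriever
    age_days: float         # days since the source was last edited
    source_priority: float  # 0..1, canonical pages high, meeting notes low

def score(c: Candidate, half_life_days: float = 90.0) -> float:
    """Blend lexical and semantic relevance, then weight by recency and source."""
    relevance = 0.5 * c.lexical + 0.5 * c.semantic
    recency = 0.5 ** (c.age_days / half_life_days)  # halves every half_life_days
    return relevance * (0.5 + 0.5 * recency) * (0.5 + 0.5 * c.source_priority)

def rank(candidates):
    """Return candidates ordered best first."""
    return sorted(candidates, key=score, reverse=True)
```

With these weights a canonical page edited today outranks a slightly more relevant meeting note last touched eight months ago, which is the ordering the paragraph asks for.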

The third property is the one that lets the substrate be trusted by anyone who has to defend the work. Every fact the agent reads back from the engine has to carry a pointer to where it came from. Not a paraphrase; a pointer the operator can open and read. Without it, the agent cannot defend its answer to the human who asked the question — “why do you believe this?” has no good reply when the reply is “because the substrate told me so” — and, just as importantly, the human cannot defend a correction back to the agent. The operator who reads a returned fact and recognises it as a stale entry from a deprecated runbook needs a place to point. This one. This document. Retire it. Citation is what makes that loop possible in either direction. It is what makes the substrate auditable instead of oracular, and the difference is the difference between a tool a serious shop can run and a tool that becomes folklore the first time it is wrong.

The fourth property is the one that the directory traditions have the hardest time accepting. The corpus has to shrink. A search engine that grows forever, that adds without subtracting, turns into a museum by attrition — bright rooms, careful lighting, exhibits past their relevance, and a query box across the top. To stay a search engine in the operational sense, the substrate has to forget. Entries that have not been touched in the period one would have expected them to be touched go away. Entries that contradict the current architecture go away. Entries whose most recent comment is “this is out of date” go away. The cost of carrying them is not measured in disk; it is measured in the decision the agent and the operator have to make every time the corpus returns one: should I trust this? Removing the entry removes the decision. We have already published the line that names this discipline elsewhere, and we hold to it here without softening: the corpus is the place for what is true now; the archive is git. Anything we are tempted to keep against the possibility of needing it later is already kept, in the version control system the team was using anyway.
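The three removal rules in this paragraph translate directly into a filter. The sketch below assumes each entry records when it was last touched and how often an entry of its kind is expected to change; those fields, and the two boolean flags, are hypothetical names for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Entry:
    entry_id: str
    last_touched: date
    expected_touch_days: int              # how often this kind of entry should change
    contradicts_architecture: bool = False
    flagged_out_of_date: bool = False

def entries_to_retire(entries, today: date):
    """Return ids of entries that meet any of the three removal rules."""
    retire = []
    for e in entries:
        overdue = today - e.last_touched > timedelta(days=e.expected_touch_days)
        if overdue or e.contradicts_architecture or e.flagged_out_of_date:
            retire.append(e.entry_id)
    return retire
```

Nothing is lost by acting on the list: the removed entries remain in version control, which is the archive the paragraph points to.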

One paragraph on what this shape is not, because the industry has crowded the space with adjacent things that fail in different ways. It is not a wiki. Wikis grow and rarely delete, and the operational meaning of “rarely delete” is that the corpus is a museum with a query box. It is not a vector database with no priors, used purely for similarity search; pure semantic search will happily return plausible-sounding distractors when nothing relevant exists, because the geometry does not know the question has no good answer in the corpus. It is not an answer engine that hallucinates a summary in place of the documents that would have supported it; an answer without citations is a confident opinion, not a retrieval, and the operator cannot tell the two apart at the moment they most need to. The shape we are defending is narrower than any of those: a ranked retrieval surface, scoped to one organisation, returning documents with provenance.

We close on why this shape matters now and not five years ago. The argument is not new. Search-over-fresh-sources has been describable for at least a decade, and a careful reader will recognise pieces of it from older traditions of enterprise retrieval. What changed is maintenance. The substrate has always needed somebody to keep the source documents honest — to write the runbook after the outage, to mark the architecture page when the design moved, to delete the entry when the service was retired. Humans, on their own, did not keep pace. The substrate that depended on them stayed almost-fresh in a way that made it untrustworthy when trust mattered. What is different now is that the agents reading the substrate can also write it, at the speed required to keep it honest — running the retros that turn the day into a runbook, capturing the post-mortem off a chat transcript while the incident is warm, walking the corpus weekly to remove entries no human has touched. The shape was always right. It became operational only when the same kind of thing that consumes the substrate could also maintain it. That is the practical reason the catalog phase happened first, and the reason it is, slowly, ending.

Chapter 43 — How the agent uses Lighthouse

A query, in the shape we prefer, is a plain sentence. The agent does not reach for a recipe card with a known title, nor open a binder by chapter number. It asks the thing it actually wants to know — “what is the deploy command for the payments service?”, “when did we last have a migration that locked a table?”, “who reviews changes to the auth middleware?” — and the substrate returns ranked answers with citations. The phrasing matters less than one might fear, because the corpus is searched the way a careful reader searches a library: by meaning first, by spelling second. The agent writes the question the way it would write it to a colleague who has been on the team longer.

The rhythm of a working call is two beats, not five. The first beat is the search itself. The agent reads down the ranked one-line statements and decides whether the top summary is enough to act on. Often it is. The deploy command is a sentence; the table-locking migration was a known incident from a known week; the reviewer is a name. When the one line is enough, the agent carries the citation forward and moves on. When it is not — when the question needs the surrounding paragraph for context, or when the citation has to be quoted to an operator who will read it later — the agent fetches the source for the top hit and reads the original ingested passage. One round-trip, not five, on the average question. Drilling further is the exception, reserved for ambiguities the source paragraph could not resolve on its own.
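The two beats can be written as one small function: search, then fetch the source only when the one-line summary is not enough. The search and fetch_source callables below stand in for whatever the substrate exposes; their names and signatures are assumptions, not Lighthouse's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Hit:
    summary: str    # ranked one-line statement
    citation: str   # pointer to the source passage

@dataclass
class Answer:
    text: str
    citation: str
    round_trips: int

def ask(question: str,
        search: Callable[[str], List[Hit]],
        fetch_source: Callable[[str], str],
        summary_is_enough: Callable[[Hit], bool]) -> Optional[Answer]:
    """Beat one: search. Beat two, only if needed: read the cited passage."""
    hits = search(question)
    if not hits:
        return None  # the substrate was silent; the caller must say so
    top = hits[0]
    if summary_is_enough(top):
        return Answer(top.summary, top.citation, round_trips=1)
    passage = fetch_source(top.citation)
    return Answer(passage, top.citation, round_trips=2)
```

Either way the citation travels with the answer, so the operator can open the same passage the agent read.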

Success looks like a piece of finished work the agent can defend out loud. Every fact it used carries a pointer back to where the fact came from — a paragraph it can name, a source it can quote. The operator who reads the agent's output can follow the trail without phoning a friend. Nothing the agent claims about the company's stack is something a human cannot verify in under a minute. The substrate does not make the agent right; it makes the agent legible. The two are not the same property, and only the second compounds.

The failures we have learned to name come in two shapes. The first is silence: the agent asks, and the substrate returns nothing useful, either because the topic is not indexed or because the question reached for vocabulary the corpus does not carry. The agent then faces a choice, and the contract requires the choice to be explicit. It may write what it would have written from prior knowledge, but it must flag that the substrate was silent on the point. Or it may stop and ask the operator. We prefer the second when the answer would change the work in a load-bearing way, and the first when the answer is decorative. The discipline is that the agent never pretends the silence did not happen.
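The explicit choice on silence can be sketched as a function that always records the silence and then branches on whether the answer is load-bearing. The Outcome type, the action names, and the note wording are hypothetical; only the rule itself comes from the paragraph above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Outcome:
    action: str        # "proceed" or "ask_operator"
    notes: List[str]   # carried alongside the agent's citations

def handle_silence(question: str, load_bearing: bool) -> Outcome:
    """Never hide a silent substrate: stop if it matters, flag it if it does not."""
    note = f"substrate silent on: {question}"
    if load_bearing:
        return Outcome("ask_operator", [note])
    return Outcome("proceed", [note + " (answered from prior knowledge, unverified)"])
```

Both branches leave a note, which is the point: the two paths differ in what the agent does next, not in whether the silence is recorded.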

The second failure is a wrong answer — a stale entry, a passage that contradicts the current architecture, a sentence that was true in March and is no longer true in May. This shape is recoverable in a way the catalog shape was not, because the entry can be deleted, in plain sight, by the routines described in Chapter 25.D. The agent does not work around the wrong entry by quietly distrusting it. It reports it, in the same notes that carry its citations, and the next walk of the corpus removes it. A wrong sentence found and flagged is worth more to the substrate than ten right sentences used silently, because the first kind keeps the corpus honest.

One short paragraph on what the agent does not do. It does not author entries into the substrate in the middle of a delivery task. Writing to the corpus happens elsewhere, in observation routines that run on their own clocks under contracts written for that work — daily retros, post-mortem captures, the weekly walks. Mixing reading and writing in a single run is how a corpus poisons itself: the agent's draft sentence becomes tomorrow's cited fact, and the loop closes around its own opinion. We have kept the two streams separate from the first week, and have not been tempted to merge them.

The meta-property, then, is this. The agent is a participant in the substrate, not only a consumer of it. It reads from the corpus during work and reports what it finds silent or wrong, and other roles, on a slower clock, maintain it. The next chapter turns to the discipline that closes the loop: publishing numbers out of the substrate itself, so the corpus answers not only what we know, but how well we are tending it.

Chapter 44 — Publishing numbers honestly

When we publish a number about our own substrate — a benchmark, an effectiveness score, a comparison between one configuration of the work and another — we have made a promise to the reader that the number will be there next year, at the address we gave them, without the methodology shifting under their feet in the months between. The promise is small in the wording and large in the practice. A reader who quotes a figure in a memo or a paper has placed a trust in our typography that we are not always conscious of having accepted. This chapter is about how we keep that promise, in the ordinary cases and in the embarrassing ones.

The discipline has two halves. The first half is what to do when the number you published turns out to have been wrong. The second is the shape of the page that lets the first half exist at all. We learned the first half in the slow, awkward way most operators learn anything that touches their own credibility. We learned the second by reading older communities — the ones who write specifications for a living — and borrowing the constraint that has kept their citations honest for decades.

Sometime after the first published evaluation run, we discovered that the data-quality filter had not been run on a portion of the pairs. The judge in those pairs had been comparing answers that were not, in any meaningful sense, answers at all — empty responses, fragments of internal tooling that had leaked through the wrapper, whitespace where prose was supposed to be. The judge, having nothing to grade, had returned a tie. A tie counted on the panel the same as a careful draw, and the panel inflated. The conclusion that survived a corrected pass was, in its load-bearing claims, still defensible. Some of the headline numbers, however, moved. One model that had read as flat under retrieval read as a regression under it after the rerun. A second pair of models that had read as having lifted from retrieval turned out to have been carried by the tie inflation on their own empty pairs. The shape of the table changed in places a careful reader would notice.
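The mechanism is easier to see in miniature. The sketch below is illustrative only: the pairs are toy data invented for the example, `is_gradable` is a hypothetical stand-in for the data-quality filter (it catches only empty and whitespace answers, not leaked tooling fragments), and the verdict labels are assumptions.

```python
from collections import Counter


def is_gradable(answer: str) -> bool:
    """Hypothetical data-quality check: reject empty or whitespace-only answers."""
    return bool(answer.strip())


def tally(pairs, apply_filter: bool) -> Counter:
    """Count verdicts over (answer_a, answer_b, verdict) triples.

    With the filter on, pairs with an ungradable answer are excluded
    instead of being counted under whatever verdict the judge returned.
    """
    counts = Counter()
    for answer_a, answer_b, verdict in pairs:
        if apply_filter and not (is_gradable(answer_a) and is_gradable(answer_b)):
            counts["excluded"] += 1
            continue
        counts[verdict] += 1
    return counts


# Toy data: one careful draw, one clear win, two pairs with nothing to grade.
pairs = [
    ("careful answer", "other careful answer", "tie"),
    ("good answer", "weak answer", "a"),
    ("", "   ", "tie"),
    ("", "", "tie"),
]

unfiltered = tally(pairs, apply_filter=False)
filtered = tally(pairs, apply_filter=True)
```

Without the filter the toy panel reports three ties; with it, one tie and two exclusions. The judge did nothing wrong in either pass. The inflation comes entirely from what was allowed to reach it.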

We had two choices. The first was to swap the page quietly. Replace the numbers, leave the address alone, hope nobody had cited the original. The shape of the page would not visibly admit that anything had moved. This is the path of dignity in the narrow sense — the page looks composed, the author is not embarrassed — and a betrayal of the citation contract in every other sense, because every reader who had taken a number from the original would, on the next visit, be holding something the page no longer confirmed. The second choice was to write up the correction in plain sight, at the same address, above the numbers it corrected. This is the path of dignity in the broader sense — the page admits, in its own typography, what it had been wrong about — and the only path that lets the reader keep their citation.

We chose the second. On 2026-05-15, the day after the original publication, we added a section above the headline numbers called "what we got wrong the first time". It named three things: that the data-quality filter had not run on the first pass, which had let empty responses score as ties and inflated the panel; that one question type in the evaluation suite had been under-represented, which had advantaged models good at the other types; and that a budget setting we had picked once globally had biased a class of models upward in a way that did not survive the rerun. The corrections sat above the numbers, at the same heading weight, with no fade and no accordion. The page read differently. The conclusion read harder, not softer, because the reader met the cost of the claim before the claim.

The shape that made this possible is the second half of the discipline. Every evaluation run we publish lives at a versioned permalink, and the permalink is treated as immutable. The first run lives at the address it shipped at on its first day, and it stays at that address forever. A correction to the first run becomes a section inside the first run's page, not a replacement of the first run with something newer wearing the same address. When a second run is ready, it publishes at a fresh permalink of its own. A reader who cited the first run in a paper six months ago will, six months from now, find the same numbers at the same address, with whatever corrections we have written up above them in dated notes. The header of the page makes the freezing explicit: this is the first run, frozen, with corrections noted in place. The reader knows what kind of artefact they are reading before they read it.

The reason for the rule is not aesthetic. The reader who has cited a number has no leverage if the number can move. The author who can move the number has every incentive to do so quietly, because a quiet swap improves the headline today and does not, in the short run, cost anything visible. Over time it costs the reader the ability to trust any number on the page, because none of them is guaranteed to be the number they remember from the last visit. A page whose numbers can be overwritten has decided in advance that every dispute between reader and author resolves in the author's favour. The only way to break out of that gravity is to publish under a rule that takes the swap option off the table entirely.

The general rule that follows is short. Published numbers do not get overwritten. Mistakes get a section, dated and named, inside the page that carried the mistake. New methodology gets a new run at a new address. The page does not pretend to be the only run that ever existed, and it does not pretend that its current state was its original state.
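The rule reduces to three constraints that a data structure can enforce. What follows is a sketch under assumptions, not our publishing code: `RunPage`, `Correction`, and `Registry` are hypothetical names, and the permalink and the number in the usage note are placeholders.

```python
from dataclasses import dataclass, field
from datetime import date
from types import MappingProxyType
from typing import Dict, List, Mapping


@dataclass(frozen=True)
class Correction:
    dated: date
    title: str
    note: str


@dataclass
class RunPage:
    permalink: str
    numbers: Mapping[str, float]
    corrections: List[Correction] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Published numbers become a read-only view over a private copy.
        self.numbers = MappingProxyType(dict(self.numbers))

    def add_correction(self, correction: Correction) -> None:
        # Mistakes get a dated, named section inside the same page.
        self.corrections.append(correction)


class Registry:
    """Maps each permalink to exactly one page, forever."""

    def __init__(self) -> None:
        self._pages: Dict[str, RunPage] = {}

    def publish(self, page: RunPage) -> None:
        if page.permalink in self._pages:
            # New methodology needs a new address, not a reused one.
            raise ValueError(f"{page.permalink} is already published and frozen")
        self._pages[page.permalink] = page

    def get(self, permalink: str) -> RunPage:
        return self._pages[permalink]
```

With a page published at a placeholder address such as `/evals/run-1`, assigning into `page.numbers` raises `TypeError`, publishing a second page at the same address raises `ValueError`, and `add_correction` is the only mutation left. The swap option is not discouraged; it is absent.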

The operational shape under all of this is worth one paragraph. The substrate the agents read is one corpus — the knowledge graph the previous chapters described. The substrate we publish from is a sibling: structured outputs of the evaluation runs, each one keyed by a run identifier and a date, each one available to be quoted exactly without surprise. We keep that sibling honest the same way we keep the corpus the agents read honest. We are explicit about when an entry was written. We refuse to let entries grow by accretion, picking up small edits that nobody asked for. We give every claim a citation a careful reader can follow without phoning a friend. The two substrates are tended by the same instinct, in different rooms.

This chapter is the operational sibling of chapter 30.A. That chapter argued that limitations belong above the fold, where a reader skimming on a Tuesday afternoon will meet them before they meet the headline. This chapter says the page where the limitations were published is itself frozen — the limitations do not move either, and a correction to a limitation lands as a dated note inside the same page, not as a quiet edit somewhere downstream. The two disciplines work together, and they do not work apart. Limits visible. Pages immutable. Numbers earn their citations because the addresses they live at are not allowed to lie about their own history.

Chapter 45 — Pages with two readers

There is a class of page on the public internet that quietly changed audience while we were still writing it for the old one. The product pitch page. The evidence page. The integration page. The pricing page. The incident timeline. These surfaces used to have a single reader: a person who scrolled, formed an opinion, and either bounced or clicked. They now have two. A human still scrolls; sometimes hours later, sometimes minutes later, an agent reads the same page on behalf of an operator who has just said something like "go figure out whether we should adopt this". The second reader arrived without our permission. The page that was written for only the first reader continues to render correctly and continues to fail, often for months, without anyone noticing.

The agent's needs are different in kind, and the difference is not superficial. The agent is not interested in brand voice. It does not feel anything about the headline. It is trying to do a task its operator has just delegated, and that task usually has a shape: find the canonical address of the thing, learn the contract of the call it is supposed to make, make the call, verify that the call worked, and report back. Marketing prose answers none of that. A page that converts a human politely refuses to converse with the reader who could actually take the next step on its own. The agent reads the page, finds the prose dignified and the conclusion confident, and asks — correctly — where the documentation site is. The human, by then, has moved on to other tabs.

We tested this on ourselves before we wrote about it. We took the public pitch page for our substrate, the one we had been polishing for the buyer, and rewrote it under a single explicit constraint. The rendered page must be sufficient for an agent to wire up the connection without needing a second documentation visit. The page opens with prose that lands a human reader — what the thing is, why it exists, who it sits next to. The middle then turns denser. It names the address of the service, the shape of the contract the agent should expect, the verification step the agent can run before declaring success, and one canonical example of what a working call looks like. The closing prose returns to the human, who needs to be left with a feeling rather than a checksum. Both readers got served on the same page, in different sections, without either one feeling that the page had been written past them.

After publishing, we ran the obvious experiment. We opened a fresh agent session, gave it the rendered address of the page and nothing else, and asked it to do what its operator would plausibly have asked it to do: wire up the connection, verify the work, report back. We had a rubric with eight questions on it — what the thing is, where the service lives, what call to make, how to verify it, what success looks like, and the rest. The agent answered all eight from the page alone. It did not ask for a separate documentation site. It did not make plausible-sounding guesses about endpoints. It quoted the contract back to us, named the verification step, and proposed the check before we suggested one. The whole exchange ran without a human in the loop after the first prompt.

One piece of feedback came back from that session, and it was real. A link on the page was written relatively rather than absolutely, and the agent had to resolve it against the host before it could follow. A human reader does that without noticing. A browser does that without noticing. The agent flagged it as an extra step it should not have had to take. We made the link absolute in the next push. It was a small fix and a slightly humbling one, because it is exactly the kind of fix you only ever find by handing the page to a reader who cannot intuit the host from context.
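The extra step the agent took is the one `urllib.parse.urljoin` performs, and the same module is enough to lint a page for links that would force it. A small sketch follows; the URLs are placeholders on `example.com`, not our real addresses, and the helper names are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def is_absolute(href: str) -> bool:
    """True when the link carries its own scheme and host."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)


def absolutize(page_url: str, href: str) -> str:
    """Resolve a link against the page it appears on, as a browser would."""
    return urljoin(page_url, href)


def relative_links(hrefs: Iterable[str]) -> List[str]:
    """Return the links a reader would have to resolve before following."""
    return [href for href in hrefs if not is_absolute(href)]


page_url = "https://example.com/pitch/"
hrefs = ["https://example.com/docs/connect", "../docs/verify"]

needs_fixing = relative_links(hrefs)
fixed = [absolutize(page_url, href) for href in hrefs]
```

Run against the two sample links, `needs_fixing` holds only `../docs/verify`, and `fixed` holds two absolute addresses under `https://example.com/docs/`. A check like this belongs in review, because neither a human nor a browser will ever report the problem.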

The general claim is more interesting than the experiment. We are not designing for agents instead of humans. We are designing for both, and we have noticed how much the page changed once we admitted the second reader was already there. The buyer paragraphs got tighter, because there was less room for them, so they had to earn their lines. The agent-facing section got more disciplined, because everything in it had to be paste-correct, every claim verifiable on the page itself, every address absolute. The two readers improved the same draft by pulling in opposite directions. Neither would have produced the page alone. The human reader had been letting us get away with the kind of evocative phrase that an agent quietly discards. The agent reader, on its own, would have produced a page nobody actually wanted to read.

The list of surfaces where this matters is longer than it looks, and most of them are not pitch pages. An evidence page is one of the first to feel the shift: an agent will increasingly read your numbers before its operator does, and a page that buries its limitations below its conclusion reads, to that agent, as a page that wanted to skip the verification step. An integration page is another: every row in it is a question an agent will ask in production, and the page either answers the question or earns a follow-up call that nobody scheduled. A pricing page is a third: agents are starting to be the readers who tell their operators what something would cost, and a pricing page that requires a sales conversation to resolve is a pricing page that has just disqualified itself from a category of decision its authors did not realise was being made. A status page is a fourth: an agent reading the incident timeline is the reader who decides, in the next few seconds, whether to retry. None of those pages will meet the new constraint by accident. Someone has to hold the constraint deliberately while the page is written, and then again while it is reviewed.

This chapter is the closing seam of the Part that came before the manifesto. The substrate that the agents read is one constraint on the writing: the corpus has to be honest about what is true now, and the entries that drift have to be retired without ceremony. The pages the agents read about the substrate are a second constraint, and they have to be honest about what the substrate is, in a form a non-human reader can act on without making a separate documentation trip. Both constraints discipline the prose. The reward, when the prose has met both, is a page that does not need to be re-explained to either reader in a hallway after the meeting.

A reader who cannot ask a follow-up question deserves a page that does not need one. That is the only sentence the writing constraint really comes down to.

The Ship Manifesto

There is a page in every operator's working folder that says, in the fewest words possible, what the system believes. If the rest of this book is the argument, this page is the conclusion — the sentences you can quote in a design review when the room is drifting, in a procurement meeting when the pitch is loud, in a 3 a.m. incident channel when the next message is going to shape whether the team still trusts the loop on Monday.

We believe that intent belongs to humans and that no amount of model capability makes the human author dispensable. The agent does the motion; the human owns the outcome; the trail between them is the legible object we defend. When that trail is missing, we are not "shipping with AI." We are running a parallel process nobody agreed to.

We believe that legibility is kindness. It is the kindness we owe the operator on call, the auditor six months from now, the new engineer reading the repo for the first time, and the regulator who will one day ask a question we did not expect. A system that answers those people politely is a system that can be trusted to keep answering them politely under pressure. A system that hides its internals in a model response is a system that is negotiating its own replacement.

We believe that quiet is the right default. Loud automation is seductive and expensive. The work that ships is the work you can hear when the room is empty: a transition that fired one routine for one ticket, a branch that wears its name, a PR that points to its evidence, a merge that left a trail. When the system is working, the building should feel slightly boring. Boredom is the sound of trust.

We believe in repeatable mastery. A senior engineer is a century of small corrections wearing one name, and that judgement has always been the most valuable thing in the building and the least repeatable—brilliant on a good day, asleep at 3 a.m., gone in a single resignation letter. So we compressed the part that repeats into a few thousand lines of versioned skill: not the world's generic taste, which the model already brings, but this shop's decisions—our definition of done, our fences, the hills we die on. We do not promise the same diff twice; we promise the same floor—the same bar cleared on the worst night as the best morning. The operator still owns when to shift; the skill owns how. And the proof that the mastery lives in the system and not the engine is that we swapped the expensive model for a cheap one and nobody noticed: if you can downgrade the engine and no one can tell, the value was never in the engine.

We believe that skills carry judgement and knowledge is a search. A skill — a developer, an auditor, a release manager, a product owner — is a person with priors, not a recipe card written by someone who left two years ago. The facts that skill needs in flight do not live inside the skill. They live in a corpus the agents maintain, retrieved on demand, with citations a careful reader can follow. Collapsing the two — writing a single document that contains both the judgement and the company-specific instructions — produces a card that ages in twenty minutes and pretends not to. We keep them separate, and the separation is what lets each half stay honest.

We believe in fences over exhortations. A guard that fails closed on a missing secret is worth more than a paragraph of documentation asking engineers to be careful. A dispatcher that refuses to act on a ticket until a human promotes it out of Backlog is worth more than a Slack reminder. Machines obey predicates. Humans read predicates. Nobody reads an exhortation twice.

We believe that bounded concurrency beats both the grid and the firehose. We once bought our discipline with a clock — one role per slot, ticks on named minutes — and that clock taught us what the discipline was for: shared timestamps, independent failures, blast radius you can reason about, a board humans can plan around. We kept every one of those goods and replaced the lever. The work now dispatches off tracker transitions, not a calendar of slots, and what keeps it from becoming the firehose we warned about is not a clock but leases, cascade caps, and dependency blocks: the engine will only run so much at once, and it will refuse a ticket that is frozen, blocked, or already in someone else's lease. A system bounded by concurrency can still explain itself — every run traces to the transition that asked for it — and a system that cannot explain itself cannot be trusted to improve itself.

We believe that self-heal is a mop, and workflow design is the kitchen. Recovery automation on top of a clean contract is a good tool. Recovery automation instead of a clean contract is a faster cleaner for a room that will never stop spilling. We earn the right to automate repairs by first making "normal" boring enough that deviations are legible.

We believe in evidence over opinion. An audit ticket without a pointer is a feeling. A version bump without a cited authority is an undocumented patch. A telemetry event without a use case is a vendor's hobby. Every record of our work — ticket, commit, skill version, telemetry event — must survive the reading of someone angry enough to check.

We believe that vendors are plugs, not gods. The tracker, the CI, the model, the chat, the code host — none of these is the story. The story is the FSM: triage → ready → in_progress → in_review → merged → done, with rework and needs-info as the loops back and blocked as the freeze, and planning carried by a single decomposition bundle on the way in. Any tool that can speak that grammar — a state to transition into, a label to read, a PR to open — is a first-class citizen. The front door is the operator's own agent driving ticket_update over the MCP edge; the tracker takes it from there. Any tool that demands we re-narrate the SDLC in its vocabulary is a risk we will pay for later.

We believe that quiet systems can still be improved. The improvement loop — feedback on skills, telemetry for operators, regression triage on cohorts that share a skill version, observation routines that keep the corpus the agents read honest — is how a system that has stopped surprising us also stops being stagnant. Stability and evolution are not in tension. They are the same posture, watched from two distances.

We believe that a published number is a promise to the reader. We do not overwrite it. When an evaluation turns out to have been wrong, the correction goes inside the page that carried the original, dated and named, above the numbers it corrects. New methodology gets a new page at a new address. The old page stays at the old address, frozen, with its corrections in plain sight. A reader who cited us six months ago should find what they cited six months from now, and a reader who reads us today should not have to wonder which version of the number they are looking at.

We believe that the limitations belong above the fold. A reader who learns the limit before the claim weighs the claim correctly. A reader who learns the limit after the claim feels they have been managed, and is correct to feel it. We write evidence pages so the failure modes meet the reader before the headline, at the same heading weight, with no accordion and no fade. Trust is something an operator builds for themselves by reading evidence under weights of their own choosing. The shape of the page must not steal that work.

We believe that delete is a first-class verb. The corpus is the place for what is true now; the archive is git. Entries that have not been touched in the period one would have expected them to be touched go away. Entries that contradict the current architecture go away. The instinct to preserve was correct for slow humans; it is wrong for systems that include agents who can rewrite an entry in an evening for the cost of an API call. Tagging is not deleting. Archiving is not deleting. The corpus has to forget, weekly, in plain sight, with the deletion in the diff.

Finally, we believe this book is not finished, because nothing that ships is. Every chapter in this manual was earned by a commit somewhere in the reference org, or its successor, or the next one after that. We will keep writing. We will keep editing. We will keep deleting — entries from the corpus when they stop being true, sentences from this book when they stop earning their place. And every time we change a skill or retire a page because you wrote feedback, we will count it as the method working the way it was meant to.

Vocabulary

Words are interfaces between people and automation. When dispatch means “whatever the model felt like” to one teammate and “the routine the FSM fired for this transition” to another, nobody can reason about incidents, and the system stops feeling safe. Agreeing on terms is not pedantry; it is how you keep runbooks and blameless postmortems aligned with what the code actually does. The sections below spell out the same ideas the manual uses elsewhere, but in plain language you can reuse in onboarding and design reviews.

Delivery lane (event-driven)

The delivery lane is the main path through the SDLC: the FSM stages a ticket walks through in order — task_intake → planning → dev_implementation → validation → code_review → auto_merge → merged. It is event-driven, not scheduled: work enters the lane when a ticket's status changes in a designated tracker project — set by the operator's agent via ticket_update, or by a human moving the card — not on a cron tick that pulls from Todo, and never when someone yanks the next card straight from Backlog. There is no per-slot fan-out; the ticket that transitioned is the ticket that advances. Calling something “the lane” should narrow the conversation to that throughput-shaped pipeline. It deliberately excludes parallel worlds such as audit schedules, self-heal jobs, or ad-hoc agent chat; those may be valuable, but they are not the lane, and mixing the vocabulary invites double-booking and confused ownership.
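For readers who think in code, the lane's ordering and its "only the ticket that transitioned advances" property fit in a few lines. This is a deliberately simplified sketch: the stage names come from the definition above, but `next_stage` and `on_transition` are hypothetical helpers, and the sketch ignores rework loops, needs-info, and blocked tickets.

```python
from typing import Dict, Optional

# Stage order as named in the definition above.
STAGES = [
    "task_intake", "planning", "dev_implementation", "validation",
    "code_review", "auto_merge", "merged",
]


def next_stage(current: str) -> Optional[str]:
    """Return the stage after `current`, or None once the ticket is merged."""
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


def on_transition(ticket_id: str, stage_by_ticket: Dict[str, str]) -> Dict[str, str]:
    """Advance only the ticket that transitioned; leave every other ticket alone."""
    updated = dict(stage_by_ticket)
    following = next_stage(updated[ticket_id])
    if following is not None:
        updated[ticket_id] = following
    return updated
```

Nothing in `on_transition` scans the board for other candidates, which is the whole difference between an event-driven lane and a per-slot fan-out.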

Audit loop

The audit loop is its own cadence, not a synonym for the delivery lane. It writes into findings projects under evidence-only expectations: a finding should point at something inspectable — a log excerpt, a report, a failing check — rather than reading like a vague opinion ticket. That is different from a human filing a bug because something “feels off”; machine-assisted filing is held to the same evidential bar you would expect from a person acting as a proxy for CI.

Dispatch (`maybe_dispatch`)

Dispatch is what replaced the old deterministic pick. There is no standalone selector that scans the board and returns one issue per tick. Instead, when a ticket transitions, the engine's maybe_dispatch resolves that ticket's FSM stage from its state and signal labels and fires the one routine for that ticket — deterministically, using explicit fields, never a model's whim. It is gated, not by a slot, but by leases (project_lock), per-stage caps, cascade limits, and dependency blocks: a frozen, blocked, or already-leased ticket is refused, and the refusal carries a reason (blocked_by_dependency, no_routine). Dispatch is not AI selection. If the model were choosing which ticket runs next, you would have removed the fence between policy and improvisation; the routing stays boring enough to reason about from board fields alone.
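A sketch makes the "boring enough to reason about" claim concrete. It is not the engine's `maybe_dispatch`: the function name, the routing table, and the `frozen` and `lease_held` reason strings are assumptions made for illustration, and the sketch routes on state alone, leaving out signal labels and caps. Only `blocked_by_dependency` and `no_routine` are reason strings taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Ticket:
    id: str
    project: str
    state: str
    frozen: bool = False
    open_blockers: int = 0


@dataclass(frozen=True)
class Decision:
    dispatched: bool
    routine: Optional[str] = None
    reason: Optional[str] = None


# Hypothetical routing table: tracker state -> stage routine.
ROUTINES: Dict[str, str] = {
    "ready": "planning",
    "in_progress": "dev_implementation",
    "in_review": "code_review",
}


def maybe_dispatch_sketch(ticket: Ticket, leased_projects: Set[str]) -> Decision:
    """Route one transitioned ticket from explicit fields, or refuse with a reason."""
    if ticket.frozen:
        return Decision(False, reason="frozen")
    if ticket.open_blockers > 0:
        return Decision(False, reason="blocked_by_dependency")
    if ticket.project in leased_projects:
        return Decision(False, reason="lease_held")
    routine = ROUTINES.get(ticket.state)
    if routine is None:
        return Decision(False, reason="no_routine")
    leased_projects.add(ticket.project)  # take the project lease for this run
    return Decision(True, routine=routine)
```

Every branch reads a field a human can see on the board, and every refusal returns a string. No branch consults a model.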

Agent run (`ship-agent-run.yml`)

The agent run is how a dispatched stage executes: the engine triggers a GitHub Actions workflow (ship-agent-run.yml) that runs the role's coding agent against a branch. The agent commits its own work; the runner (run.mjs) then pushes the commits, opens or updates the pull request, and reports the stage's outcome back to the engine through a .ship/agent-finish.json sidecar and the server's /finish endpoint. The skills the run executes are server-side role definitions the engine serves — not loaded from git by a thin launch client. Selection (what runs) and execution (how it runs) are still separate concerns, but the seam is now transition → stage routine → dispatched run, not pick → launch script.

CLI (tooling)

CLI here means the shipctl commands and the agent runner that sit at the edges — names like init, doctor, feedback, and the run.mjs runner are implementation-specific helpers, not framework dogma. Operators use them to bootstrap a workspace, read telemetry, file feedback, or — inside an agent run — push the work and report the outcome. The runner is the agent run's hands, not a selector: it does not decide which ticket is eligible. For most operator work the front door is no longer the CLI at all but the MCP edge (claude mcp add ship), where the operator's own agent drives ticket_update, dispatch_ticket, and the rest. Think of the CLI as the workbench that remains reachable from a terminal when you want it.

Guards

Guards are the predicates — labels, project membership, team, state, dependency blocks, leases — that must be true before automation may dispatch or touch a ticket. They are the contract that says “this card is ready for machines.” Culturally, guards are not extra bureaucracy layered on top of process; they are APIs between human intent and headless agents, encoding what “ready” means in data the dispatcher can evaluate.

E2E (end-to-end)

E2E means browser-level tests against a hosted environment: a real URL, real authentication or dedicated test users, and real edge behaviour from CDNs and cookies. Smoke on localhost still matters, but it exposes a different failure surface; release gates that care about what customers see usually want hosted signal so networking, auth, and deployment shape match production.

Leases, caps, cascade

Leases, caps, and cascade limits are what bound concurrency now that there is no UTC grid of delivery ticks. A lease (project_lock) keeps two runs off the same project at once; per-stage caps limit how many tickets sit in a stage; cascade limits stop one transition from fanning out into an unbounded chain; dependency blocks hold a ticket whose blocks relations have not cleared. The framework requirement is structural — delivery work must not overlap in a way that causes duplicate work — and the lever that delivers it is bounded concurrency, not a shared timetable. (Cron still exists in Ship, but it runs maintenance, not delivery — see below.)
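Caps and cascade limits are both plain counters, which is why they are easy to reason about. The sketch below assumes a shape for them; the cap values, the depth limit, and the reason strings `stage_cap` and `cascade_limit` are hypothetical, not the engine's configuration.

```python
from collections import Counter
from typing import Optional

# Hypothetical limits, chosen only to make the example readable.
STAGE_CAPS = {"dev_implementation": 2, "code_review": 3}
MAX_CASCADE_DEPTH = 3


def refusal_reason(stage: str, in_stage: Counter, cascade_depth: int) -> Optional[str]:
    """Return why a ticket may not enter `stage` right now, or None if it may.

    `in_stage` counts tickets currently sitting in each stage;
    `cascade_depth` counts how many dispatches the triggering
    transition has already chained.
    """
    if cascade_depth >= MAX_CASCADE_DEPTH:
        return "cascade_limit"
    cap = STAGE_CAPS.get(stage)
    if cap is not None and in_stage[stage] >= cap:
        return "stage_cap"
    return None
```

A stage with no configured cap is unbounded in this sketch; whether that is the right default is a policy choice the real configuration would have to make explicitly.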

What cron actually runs

For the avoidance of the old mental model: there is no evenly-spaced clock that dispatches delivery roles. The cron jobs Ship registers (services/cron_jobs.py) are all maintenance — the knowledge pipeline (harvest, route, synth, decay, source-sync, claim extract/reconcile, topic render), Linear token refresh, seed-PR auto-merge, deployment reconcile and health, inbox stale-sweep, the agent-dispatch lock sweep, PR-cache reconcile, project-completion sweep, stall-notify, and the scheduled routines (daily digest, retro, tech-debt, nightly workflow). Delivery dispatch is not among them; it is event-driven off tracker transitions via the diff-based poller.

Versioned skills

Versioned skills are Markdown role definitions living in the repository, reviewed in pull requests, and executed by headless agents when a stage dispatches. That is the opposite of the “final prompt” that lives only in a vendor text box: convenient to type once, painful to diff, impossible to roll back cleanly. Shipping prompts like code is how prompt changes stay attributable and reversible.

Skill

A skill is a role with judgement — a developer, an auditor, a release manager, a product owner. It carries priors, taste, and a contract that names its inputs, outputs, ownership, and the line it refuses to cross. A skill does not carry the company-specific facts it needs in flight; those live in the corpus the agents maintain, and the skill queries them on demand. Collapsing the two gives you a recipe card pretending to be a person, and recipe cards age in twenty minutes.

Knowledge

Knowledge, in this book's vocabulary, is the substrate a skill operates inside — the runbooks, the post-mortems, the architecture pages, the senior engineer's three-paragraph Slack message from last quarter that turned out to be the answer. Lighthouse is our own implementation of the substrate; the discipline travels regardless of the implementation. Knowledge ages at the speed of the source it aggregates, not at the speed of a separate authoring tool the team has to remember to update.

Skill version

A skill version is the state of a skill at a particular commit. Skills live in files in a repository; changing a skill means editing the file and committing the change, with the reasoning in the commit message. Rollback is a revert. There is no separate "yank" command, no "deprecated" channel, no parallel history kept alongside the real one. The previous version is one command away if anyone wants to read it.

Observation routine

An observation routine is a scheduled job whose contract is to keep the corpus honest. The three we run are the daily retro, which re-reads what happened and updates the runbooks that were wrong; the post-mortem capture, which writes incident events and cures into the corpus; and the weekly corpus walker, which deletes entries that have stopped being true. Observation routines are owned by named roles and run on their own clocks, separate from the delivery lane.
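The weekly walker's selection rule can be sketched from the two criteria the Manifesto names: entries untouched for longer than one would expect, and entries that contradict the current architecture. The `Entry` fields below are assumptions, in particular the per-entry `expected_touch_every` window and the boolean contradiction flag, which in practice is a finding some earlier step has to produce.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class Entry:
    path: str
    last_touched: date
    expected_touch_every: timedelta  # hypothetical per-entry freshness window
    contradicts_architecture: bool = False


def entries_to_delete(entries: List[Entry], today: date) -> List[Entry]:
    """Select entries for deletion; the caller removes them in a reviewable diff."""
    doomed = []
    for entry in entries:
        stale = today - entry.last_touched > entry.expected_touch_every
        if stale or entry.contradicts_architecture:
            doomed.append(entry)
    return doomed
```

The function only selects. Deleting through an ordinary commit is what keeps the removal in plain sight and leaves git as the archive.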

Corpus

The corpus is the body of documents the agents read during delivery work. It is searched by retrieval, with ranking that respects recency and source priority; every fact read back from it carries a citation a careful reader can follow. The corpus is the place for what is true now. The archive is git, where versions of things have always been kept.
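One way to read "ranking that respects recency and source priority" is as a score that discounts relevance by age and weights it by source. The sketch below is an assumption about shape, not a description of Lighthouse: the exponential half-life, the priority scale, and every name in it are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Hit:
    citation: str         # where a careful reader can find the passage
    relevance: float      # retriever similarity, assumed to lie in 0..1
    written: date
    source_priority: int  # higher means more authoritative (hypothetical scale)


def rank(hits: List[Hit], today: date, half_life_days: float = 90.0) -> List[Hit]:
    """Order hits by relevance, discounted by age and weighted by source priority."""

    def score(hit: Hit) -> float:
        age_days = max((today - hit.written).days, 0)
        recency = 0.5 ** (age_days / half_life_days)  # halves every half-life
        return hit.relevance * recency * (1 + hit.source_priority)

    return sorted(hits, key=score, reverse=True)
```

Whatever the real formula is, the citation travels with the hit through ranking, so the fact that reaches the agent still carries the pointer a reader can follow.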

Where the reference org names things

This manual describes shapes and invariants. For exact workspace setup, repo wiring, knowledge, and operating checklists, follow Getting started and Docs — those pages name the screws, not the abstract blueprint here. Teach this vocabulary once during onboarding and you shorten months of alignment work afterward: shared words travel lighter than shared screenshots, and they age better when the UI moves — until someone renames a project without telling the robots, at which point even the best glossary needs a screenshot and a field migration, not a debate about what “the lane” was supposed to mean.

Ship. A memoir of operators.