
Troubleshooting

Audience

This page is for the engineer next to the product owner. The product owner reads it to know it exists. You read it to fix the thing.

Symptom first. Where to look second. Typical fix third. The shape is borrowed from the book chapter "When things break" — read the headline, then the footnotes. Long logs are an empathy test; do not panic-scroll to the bottom before you know which step turned red.

Pick fails with a missing API key

Where to look. Your CI secrets, scoped to the job that runs pick — a secret defined at the workflow level is not always inherited by jobs in stricter setups.

Typical fix. Add the token under the exact name the workflow file references, then re-run and confirm the secret was injected into the pick job. If the secret exists but the job still complains, check for trailing whitespace introduced by copy-paste.
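If your CI is GitHub Actions (an assumption; the same scoping logic applies elsewhere), the fix usually looks like wiring the secret into the job explicitly. The job name `pick` and the secret name `TRACKER_API_KEY` below are illustrative placeholders, not canonical names:

```yaml
# Sketch: pass the secret to the pick job explicitly, so it does not
# depend on workflow-level inheritance. Names are placeholders; use
# whatever your workflow file actually references.
jobs:
  pick:
    runs-on: ubuntu-latest
    env:
      TRACKER_API_KEY: ${{ secrets.TRACKER_API_KEY }}  # job-scoped injection
    steps:
      - run: shipctl doctor   # fails fast if the key did not arrive
```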

The agent never starts

Where to look. The agent provider's dashboard (Cursor, Codex, Claude Code). Confirm the workspace is linked, the API key is current, the repo is on the allow-list. Then check the CI secret name matches what the provider expects.

Typical fix. Either the provider lost the link (re-link) or the secret name does not match (rename). Both are silent failures — pick succeeds, launch attempts, provider returns 401 or 404, run log reads like the agent "did nothing." Rotate the key, mirror it in both places.
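A quick local preflight can separate "key missing" from "key mangled" before you go spelunking in provider dashboards. This is a generic sketch; `AGENT_API_KEY` is a hypothetical variable name, not something any particular provider is known to read:

```shell
# check_key prints a one-word verdict for a credential value:
# "missing" if empty, "whitespace" if it carries stray whitespace
# (a common copy-paste artifact), "ok" otherwise.
check_key() {
  key="$1"
  if [ -z "$key" ]; then
    echo "missing"
  elif [ "$key" != "$(printf %s "$key" | tr -d '[:space:]')" ]; then
    echo "whitespace"
  else
    echo "ok"
  fi
}

check_key "${AGENT_API_KEY:-}"   # AGENT_API_KEY is a placeholder name
```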

Ticket stuck after a green run

Where to look. The tracker, not the workflow. Compare the ticket's state, project, and labels against the pick rules in .ship/config.yml. Then re-read the routine that ran — what state does it advance the ticket to? Is that transition allowed?

Typical fix. Almost always a guard. The ticket is in a state pick never reads, in a project the routine does not scope to, or missing a label the query demands. Open the timeline next to the run and read both — if they describe different events, your mental model and the board disagree. Try shipctl run --issue <id> locally; the log names the exact guards that voted no.
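As a concrete picture of what to diff against, a pick rule in .ship/config.yml might scope on all three axes at once. The keys below are a hypothetical sketch of the shape, not the real schema; compare your actual file's equivalents against the ticket:

```yaml
# Hypothetical pick-rule shape. A ticket is invisible to pick unless
# it matches every one of these at the same time.
pick:
  states: ["Ready for Dev"]     # state the ticket must be in
  projects: ["platform"]        # project the routine is scoped to
  require_labels: ["agent-ok"]  # label the query demands
```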

Duplicate PRs for one ticket

Where to look. The branch naming contract. Your workflow file should have one canonical pattern. If duplicates differ in pattern, somebody changed the convention on one side.

Typical fix. Pick one pattern, write it down in .ship/config.yml, close the duplicates without merging. Often the cause is both the cloud agent and the CI agent given permission to open PRs on the same ticket — a duplicate-credential problem, not a branch problem. Let one side own PR creation.
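Writing the pattern down can be as small as a couple of lines in .ship/config.yml. Both keys below are assumptions introduced for illustration; the point is one canonical template with the ticket id baked in, owned by exactly one side:

```yaml
# One canonical branch template, owned by one side of the pipeline.
# branch_pattern and pr_owner are hypothetical keys; {issue} is the
# ticket id, {slug} a short description.
branches:
  branch_pattern: "agent/{issue}-{slug}"
  pr_owner: "ci"   # only the CI agent opens PRs; the cloud agent reviews
```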

Agent updates ticket in CI but not in the cloud (or vice versa)

Where to look. The tracker key needs to live in two environments — the CI that runs pick and the agent provider's environment.

Typical fix. Mirror the credential. Use the same identity policy on both sides. Write a smoke test that proves both environments can touch the tracker under the same identity, and run it like a deploy gate. Only an end-to-end check catches the case where each side looks healthy on its own dashboard.
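Run as a gate, the smoke test is just a job every deploy must pass. Assuming GitHub Actions, and leaning on shipctl verify (named later on this page) as the check, the wiring is a sketch along these lines:

```yaml
# Sketch of a deploy-gate job: prove the tracker is reachable under
# the same identity the agents use, before anything else ships.
jobs:
  tracker-smoke:
    runs-on: ubuntu-latest
    env:
      TRACKER_API_KEY: ${{ secrets.TRACKER_API_KEY }}  # placeholder name
    steps:
      - run: shipctl verify   # non-zero exit turns the gate red
  deploy:
    runs-on: ubuntu-latest
    needs: tracker-smoke      # nothing ships past a red smoke test
    steps:
      - run: echo "deploy steps here"
```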

Scanner job skipped

Where to look. The CI secret for the scanner. The job's guard reads "skip if no token."

Typical fix. Add the token in production CI. In dev CI, document "skip is OK here" so the next on-call does not chase it.
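The guard itself is usually a one-line conditional. In GitHub Actions terms (an assumption; `SCANNER_TOKEN` and the scanner invocation are placeholders), a step-level `if` on an env var works because the env context is available there:

```yaml
# "skip if no token", spelled out: the scan step only runs when the
# secret actually made it into the environment.
env:
  SCANNER_TOKEN: ${{ secrets.SCANNER_TOKEN }}  # placeholder name
steps:
  - name: Scan
    if: ${{ env.SCANNER_TOKEN != '' }}   # skips silently when absent
    run: scanner --token "$SCANNER_TOKEN"   # illustrative command
```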

Routines out of sync with the board

Where to look. The routines page in the console, the routine wiring in .ship/config.yml, and the ticket workflow in .ship/tracker-fsm.md. All three should tell the same story.

Typical fix. Update the ticket-workflow doc to match what the board actually does, regenerate the routines, and run the snapshot utility (tools/scripts/seed_dev.py or your equivalent) for a fresh dev workspace. Drifted label and state names are the most common cause.

Prompt change "did nothing"

Where to look. Branch (merged to default?), schedule (fetches latest ref or pinned to a SHA?), cache (bundle still showing the old version?).

Typical fix. Confirm the merge. Run shipctl sync --lock to refresh the bundle. Confirm the version with shipctl config show. Bust stale CI cache keys.

Rate-limits or throttling

Where to look. Concurrency on the scheduler — parallel jobs, cron cadence, provider rate-limit tier.

Typical fix. Widen the schedule grid. Reduce overlap. Ask the provider for higher quotas. The answer is almost always "you scheduled too many things at once."
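"Widen the grid, reduce overlap" translates directly into scheduler config. Assuming cron-style triggers and GitHub Actions' concurrency key (the group name is illustrative):

```yaml
# Spread runs across the hour instead of stacking them at :00, and
# let at most one run per routine be in flight at a time.
on:
  schedule:
    - cron: "7,37 * * * *"    # off the :00 rush, twice an hour
concurrency:
  group: ship-routine          # illustrative group name
  cancel-in-progress: false    # queue instead of stampeding the provider
```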

"It worked yesterday"

Where to look. Not your repo. The world outside. A tracker state renamed, a label "cleaned up," a project archived, a token expired. None of that shows up in git diff.

Typical fix. Treat tracker changes like API migrations. Check the tracker-side audit log for recent renames. Update pick rules and guards in the same change window — feature-flag the new names if you can, run old and new in parallel until the logs are boring.
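Running old and new in parallel can be as simple as accepting both names during the migration window. A hypothetical pick-rule sketch, not the real schema:

```yaml
# Accept both the old and the renamed state until the tracker-side
# migration is done, then delete the old entry in a follow-up change.
pick:
  states:
    - "Ready"           # old name, still on some tickets
    - "Ready for Dev"   # new name after the tracker rename
```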

Red in CI with an opaque log

Where to look. The step name at the top of the failing rectangle. Not the stack trace at the bottom. The step name is the headline; the stack trace is the footnote.

  • Checkout / Install failed → infrastructure, cache keys, lockfile drift. Not the agent's fault.
  • Pick failed → tracker credential, tracker query, or guard logic. Pick fails before anyone intelligent gets involved, on purpose.
  • Launch failed → agent provider credentials, environment variables not making it through, scope missing from a token. The log will imply network instability; the headline says Launch.
  • Tests failed → real product failure or a flaky suite. Treat flake and product separately; tag which kind of red you are looking at before opening the next log tab.

Read the headline. Then the footnotes. In that order.

Hot fix or escalate

Some incidents are hot and local — one wrong value; fix the file, point at the commit, move on. Others are warm and everywhere — the same class of failure visiting different tickets, no single commit will cure it. The first is a hotfix. The second is an escalation: promote the problem from triage to design, with owners and a calendar. Say which one you are doing in the handoff. The shortcut: if the same fix would have helped on three previous Tuesdays, stop patching and open a real ticket.

When in doubt

```shell
shipctl doctor
shipctl verify
shipctl config show
shipctl knowledge fetch <bucket> --json | head
```

Cheap, fast, almost always tells you something. If doctor says everything is green but the system still misbehaves, the problem is upstream of the repo — in the workspace, the tracker, or the agent provider — and the audit log is your next stop.

Troubleshooting — Ship docs — Harbor Gang