diff --git a/.claude/commands/mdd.md b/.claude/commands/mdd.md index af6fbd6..9036cb2 100644 --- a/.claude/commands/mdd.md +++ b/.claude/commands/mdd.md @@ -38,6 +38,7 @@ Before detecting mode or doing anything else, ensure the `.mdd/` directory struc [ -d .mdd ] || mkdir -p .mdd [ -d .mdd/docs ] || mkdir -p .mdd/docs [ -d .mdd/audits ] || mkdir -p .mdd/audits +[ -d .mdd/ops ] || mkdir -p .mdd/ops ``` **If `.mdd/.startup.md` does not exist**, create it with the default template: @@ -81,6 +82,7 @@ Read CLAUDE.md for the full rulebook. Key rules: ๐Ÿ“ MDD structure initialised: .mdd/docs/ โœ“ created .mdd/audits/ โœ“ created + .mdd/ops/ โœ“ created .mdd/.startup.md โœ“ created .gitignore โœ“ updated (.mdd/audits/ added) ``` @@ -108,6 +110,9 @@ Parse `$ARGUMENTS` to determine the mode: - `plan-remove-feature` โ†’ **Plan-Remove-Feature Mode** (jump to Phase PRF) - `plan-cancel-initiative` โ†’ **Plan-Cancel-Initiative Mode** (jump to Phase PCI) - If arguments start with `commands` โ†’ **Commands Mode** (jump to Phase CM) +- If arguments start with `ops` โ†’ **Ops Document Mode** (jump to Phase OP) +- If arguments start with `runop` โ†’ **Ops Execute Mode** (jump to Phase RO) +- If arguments start with `update-op` โ†’ **Ops Update Mode** (jump to Phase UO) - If arguments are empty โ†’ ask the user what they want to do - Otherwise โ†’ **Build Mode** (the default โ€” jump to Phase 1) @@ -738,8 +743,8 @@ Triggered when arguments start with `audit`. If a section is specified (e.g., `/mdd audit database`), audit only that feature. If no section, audit the entire project. -1. **Read all `.mdd/docs/*.md` files** โ€” build the feature map -2. **If no `.mdd/` directory exists:** Create it with `docs/` and `audits/` subdirectories. Then tell the user: "No MDD documentation found. Run `/mdd` for each feature to create docs first, or I can scan the codebase and create them now. Which do you prefer?" +1. **Read all `.mdd/docs/*.md` files** โ€” build the feature map. 
Also read all `.mdd/ops/*.md` files - check for: missing mandatory sections, literal credential values (critical violation), stale `last_synced`, services with no `health_check` defined.
+2. **If no `.mdd/` directory exists:** Create it with `docs/`, `audits/`, and `ops/` subdirectories. Then tell the user: "No MDD documentation found. Run `/mdd` for each feature to create docs first, or I can scan the codebase and create them now. Which do you prefer?"
    - If "scan": read all source files and generate documentation files (Phase 0)
    - If "manual": exit and let the user create docs per feature
@@ -850,6 +855,7 @@ Present:
 📊 MDD Status
 Feature docs: files in .mdd/docs/
+Ops runbooks: files in .mdd/ops/
 Last audit: ( findings, fixed, open)
 Test coverage: unit tests, E2E tests
 Known issues: tracked across features
@@ -869,7 +875,7 @@ Drift check:
 features possibly drifted ← run /mdd scan for details
 features untracked ← no last_synced field yet
-Run `/mdd audit` to refresh, `/mdd scan` to see drift details, `/mdd plan-initiative` to start an initiative, or `/mdd ` to build something new.
+Run `/mdd audit` to refresh, `/mdd scan` to see drift details, `/mdd plan-initiative` to start an initiative, `/mdd ops ` to create a deployment runbook, or `/mdd ` to build something new.
 ```
 If all files are on the current `mdd_version`, omit the version breakdown and just show: `MDD version: v - all files up to date`
@@ -890,6 +896,7 @@ After collecting status, rebuild the auto-generated zone of `.mdd/.startup.md`:
    - `Branch:` from `git branch --show-current`
    - `Stack:` from `CLAUDE.md` or `claude-mastery-project.conf` if detectable, otherwise `(unknown)`
    - `Features Documented:` sorted list of `.mdd/docs/*.md` filenames with status if detectable from frontmatter
+   - `Ops Runbooks:` sorted list of `.mdd/ops/*.md` filenames with status - omit section entirely if `.mdd/ops/` is empty
    - `Last Audit:` from the most recent `.mdd/audits/report-*.md` - extract findings/fixed/open counts
    - `Rules Summary:` static block (does not change)
 3. Write the rebuilt auto-generated section + `---` divider + preserved Notes section back to `.mdd/.startup.md`. Update `mdd_version` in the file's frontmatter to current.
@@ -935,11 +942,11 @@ Triggered when arguments start with `note`. Three subcommands:
 Triggered when arguments start with `scan`. Detects features whose source files have changed since the last MDD session, and checks for initiative/wave drift.
-### Phase SC1 - Read all feature docs and initiative/wave files
+### Phase SC1 - Read all feature docs, ops runbooks, and initiative/wave files
-Read every `.mdd/docs/*.md` (excluding `archive/`). For each, extract:
+Read every `.mdd/docs/*.md` (excluding `archive/`) and every `.mdd/ops/*.md` (excluding `archive/`).
For each, extract: - `last_synced` from frontmatter -- `source_files` list from frontmatter +- `source_files` list from frontmatter (feature docs) or the ops doc slug for ops runbooks ### Phase SC2 โ€” Check each feature for drift (parallelized) @@ -1022,6 +1029,16 @@ Initiatives: payment-flow-wave-2 (initiativeVersion: 2, initiative now: 3) โ†’ run /mdd plan-sync payment-flow ``` +**Ops runbook drift check** (appended when `.mdd/ops/` has files): + +For each ops runbook, check `last_synced` against the last git commit on the runbook file itself. Since ops runbooks track live service state (not source files), drift means the runbook file hasn't been touched since the last `runop`. + +``` +Ops Runbooks: + โœ… swarmk-dokploy โ€” last runop: 2026-04-17 + โš ๏ธ rulecatch-dokploy โ€” runbook edited 3 days ago but no runop since โ†’ run /mdd runop rulecatch-dokploy +``` + Save the full report to `.mdd/audits/scan-.md`. --- @@ -1315,6 +1332,19 @@ Issues: โ“ auth-system / wave-2 / auth-rate-limit โ€” complete in wave but no doc path set ``` +**Ops Runbooks section** (appended when `.mdd/ops/` has files): + +``` +๐Ÿ“ฆ Ops Runbooks + + swarmk-dokploy โ€” 4 services, 2 regions (eu-west canary โ†’ us-east primary) + rulecatch-dokploy โ€” 10 services, 2 regions (eu-west canary โ†’ us-east primary) + +Service health (last runop): + swarmk-dokploy: all healthy โœ“ (2026-04-17) + rulecatch-dokploy: api โœ— failing in eu-west (2026-04-16) โ†’ run /mdd runop rulecatch-dokploy +``` + Save the graph to `.mdd/audits/graph-.md`. --- @@ -1840,6 +1870,341 @@ Report: --- +## OPS DOCUMENT MODE โ€” `/mdd ops ` + +Triggered when arguments start with `ops`. If arguments are exactly `ops list` โ†’ jump to **Ops List Mode** (Phase OL) instead. + +### Phase OP1 โ€” Scope, slug, and collision check + +**Step 1 โ€” Ask scope first (before anything else):** + +``` +Where should this runbook live? 
+ (a) Project โ€” .mdd/ops/.md (this project only) + (b) Global โ€” ~/.claude/ops/.md (reusable across all projects) + +Note: Global ops cannot access project-local .env variables or +project-specific paths. Use ~/.env globals only. +``` + +**Step 2 โ€” Derive slug:** +Strip `ops` from the start of arguments. Use the remainder as the description. +Derive a slug: lowercase, hyphens, drop filler words (e.g., "deploy swarmk to dokploy" โ†’ `swarmk-dokploy`, "update cloudflare dns" โ†’ `cloudflare-dns`). + +**Step 3 โ€” Collision check:** +- If scope is **project**: check whether `~/.claude/ops/.md` exists. + - If global op exists with that slug โ†’ **hard stop**: + *"A global runbook named `` already exists (`~/.claude/ops/.md`). Project ops cannot share a name with a global op โ€” use a different name, or run `/mdd runop ` to execute the global one."* +- If scope is **global**: check whether `~/.claude/ops/` exists, create it if not (`mkdir -p ~/.claude/ops`). + +**Step 4 โ€” Check target location:** +- Target path: `.mdd/ops/.md` (project) or `~/.claude/ops/.md` (global) +- **Does not exist** โ†’ proceed to Phase OP2 (create) +- **Exists** โ†’ tell the user: *"Runbook `` already exists. Use `/mdd update-op ` to edit it or `/mdd runop ` to execute it."* Stop. + +### Phase OP2 โ€” Ask questions + +Ask all questions in a single interaction: + +1. "What is this deployment? (describe the target โ€” e.g., swarmk API and dashboard to Dokploy US + EU)" +2. "What platform? (dokploy / docker-hub / vercel / github-actions / manual / other)" +3. "List all services being deployed โ€” for each: name, Docker image name, port (or none), health check command" +4. "List your deployment regions โ€” for each: slug, host, platform, deploy order (1 = deploy first / canary)" +5. "Deployment strategy: sequential or parallel across regions? What gates between regions? (health_check / manual / none)" +6. "What happens if a canary gate fails? 
(stop / skip_region / rollback) Auto-rollback on failure? (yes/no)" +7. "How is deployment triggered? (Dokploy webhook URL as env var, GitHub Actions workflow name, manual command, etc.)" +8. "What credentials and API keys does this deployment need? List as env var names only โ€” never values. Where is each stored?" +9. "Are any MCP servers required during deployment? (e.g., strictdb-mcp for post-deploy seeding)" +10. "What environments does this target? (staging / production / both)" + +### Phase OP3 โ€” Write the runbook + +Create `.mdd/ops/.md` with full frontmatter and all 7 mandatory sections: + +```markdown +--- +id: +title: +type: ops +platform: <platform> +environments: [<list>] +deployment_strategy: + order: sequential + gate: health_check + on_gate_failure: stop + rollback_on_failure: false +regions: + - slug: <slug> + host: <host> + platform: <platform> + deploy_order: 1 + role: canary + - slug: <slug> + host: <host> + platform: <platform> + deploy_order: 2 + role: primary +services: + - slug: <name> + image: <registry/name:tag> + port: <port or ~> + health_check: <exact command> + regions: + <region-slug>: + image: <registry/name:tag> + status: unknown + last_checked: ~ +status: draft +last_synced: <YYYY-MM-DD> +mdd_version: <current> +known_issues: [] +--- + +# <title> + +## Overview +<What this deployment does and why โ€” 2-3 sentences> + +## Services & Ports +<Table: service | image | port | health endpoint> + +## Environment Targets +<Which environments, what platforms, any environment-specific notes> + +## Webhooks & Triggers +<How deployment is triggered: Dokploy webhook URL as $ENV_VAR, GitHub Actions workflow, manual command, etc.> + +## Credentials & API Keys +<Table: credential name | env var | where stored> +**NEVER include actual values โ€” env var names only.** + +## MCP Servers +<Any MCP servers required during deployment, or "(none)"> + +## Deployment Procedure +<Ordered steps. 
Each step MUST have: name, command/action, verification check.> + +Step 1 (Name): + Action: <exact command> + Verify: <command that returns exit 0 on success> + +Step 2 (Name): + Action: <exact command> + Verify: <command that returns exit 0 on success> + +## Rollback Plan +<Specific steps to undo this deployment if it fails. Must be actionable, not "revert the commit".> +``` + +### Phase OP4 โ€” Offer next steps + +``` +โœ… Runbook created: .mdd/ops/<slug>.md + +Next steps: + /mdd runop <slug> โ€” execute the runbook now + /mdd update-op <slug> โ€” edit the runbook +``` + +--- + +## OPS EXECUTE MODE โ€” `/mdd runop <slug>` + +Triggered when arguments start with `runop`. Executes an existing ops runbook with pre-flight health checks, canary-gated region deployment, and post-flight verification. + +### Phase RO1 โ€” Load runbook + +1. Parse `<slug>` from arguments โ€” hard stop *"Slug required. Usage: /mdd runop <slug>"* if missing. +2. Locate the runbook (project-local first, then global): + - Check `.mdd/ops/<slug>.md` โ†’ found: announce *"Running project runbook: `<slug>`"* + - Check `~/.claude/ops/<slug>.md` โ†’ found: announce *"Running global runbook: `<slug>`"* + - Neither found โ†’ hard stop: *"No runbook found for `<slug>` (checked project and global). Run `/mdd ops <description>` to create one, or `/mdd ops list` to see all available runbooks."* +3. Parse all frontmatter fields: regions (sorted by `deploy_order`), services, `deployment_strategy`. + +### Phase RO2 โ€” Pre-flight health check (all regions) + +Run each service's `health_check` for each of its declared regions. Display a status table: + +``` +Pre-flight Health Check โ€” <slug> +โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ + <region-1> <region-2> +<service> โœ“ healthy โœ— failing +<service> โœ“ healthy โœ“ healthy +<service> ? unknown ? 
unknown + +(last checked: <date> | <date>) +``` + +Write updated `status` and `last_checked` to each `services[].regions.<slug>` entry in frontmatter immediately. + +For each service that is **not healthy**, ask per region: +``` +<service> is <status> in <region>. What do you want to do? + (a) Redeploy โ€” run this service's deployment steps + (b) Skip โ€” continue without touching this service in this region + (c) Abort โ€” stop the entire runop +``` + +### Phase RO3 โ€” Deploy region by region (in deploy_order) + +For each region in `deploy_order` sequence: + +**Step A โ€” Deploy services in this region** +- For each service marked for redeploy in this region: + - Use `services[].regions.<slug>.image` (falls back to `services[].image` if not set) + - Walk through the service's steps in the Deployment Procedure section + - Each step: announce name โ†’ run command โ†’ run verification check + - Verification passes โ†’ โœ“, continue + - Verification fails โ†’ STOP, show exact output, surface Rollback Plan section + - If `rollback_on_failure: true` โ†’ automatically run rollback steps, then stop + +**Step B โ€” Region gate** + +Run health checks for all services in this region. Display result: +``` +โ”€โ”€ <region> (<role>) โ€” gate check โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ +<service> โœ“ healthy (<image>) +<service> โœ“ healthy +Gate: PASSED โœ“ +``` + +If gate is `health_check` and any service is not `healthy`: +- Apply `on_gate_failure`: + - `stop` โ†’ halt, show what failed, print: *"<next-region> was NOT deployed โ€” <this-region> gate failed."* + - `skip_region` โ†’ log failure, advance to next region + - `rollback` โ†’ run Rollback Plan steps for this region, then stop +- Write updated status to frontmatter before stopping + +If gate is `manual` โ†’ always pause: *"<region> deployed. Proceed to <next-region>? (yes / abort)"* + +If gate is `none` โ†’ advance immediately. 
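
The gate rules above can be sketched as a small decision function. This is only an illustration of the documented behaviour, not MDD's implementation: `GateAction` and `evaluate_gate` are hypothetical names, and the sketch assumes each service's health status for the region has already been collected into a dictionary.

```python
from enum import Enum


class GateAction(Enum):
    """Hypothetical outcomes of a region gate (names are assumptions)."""
    ADVANCE = "advance"    # continue with the next region in deploy_order
    HALT = "halt"          # stop; later regions are not deployed
    ROLLBACK = "rollback"  # run the Rollback Plan for this region, then stop
    PAUSE = "pause"        # wait for the operator to answer yes / abort


def evaluate_gate(gate: str, on_gate_failure: str, statuses: dict[str, str]) -> GateAction:
    """Map the gate settings and per-service statuses to the next action."""
    if gate == "none":
        return GateAction.ADVANCE  # no gate: advance immediately
    if gate == "manual":
        return GateAction.PAUSE    # manual gate always pauses
    if gate != "health_check":
        raise ValueError(f"unknown gate: {gate!r}")

    # health_check gate: every service in the region must report "healthy"
    if all(status == "healthy" for status in statuses.values()):
        return GateAction.ADVANCE

    if on_gate_failure == "stop":
        return GateAction.HALT
    if on_gate_failure == "skip_region":
        return GateAction.ADVANCE  # failure is logged, next region proceeds
    if on_gate_failure == "rollback":
        return GateAction.ROLLBACK
    raise ValueError(f"unknown on_gate_failure: {on_gate_failure!r}")


print(evaluate_gate("health_check", "stop", {"api": "healthy", "dashboard": "failing"}))
```

Keeping the decision separate from the side effects (running health checks, writing frontmatter) makes the three `on_gate_failure` branches easy to check in isolation.
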
+
+Write updated `status` and `last_checked` for all services in this region to frontmatter.
+
+**Step C - Advance**
+
+Gate passed → proceed to next region in `deploy_order`. Repeat Steps A-B.
+
+### Phase RO4 - Post-flight health check (all regions)
+
+Re-run all service health checks across all regions. Display full cross-region before → after table:
+
+```
+Post-flight Health Check - <slug>
+──────────────────────────────────────────────────────────────
+                 <region-1>           <region-2>
+<service>        ✓ healthy (was ✗)    ✓ healthy
+<service>        ✓ healthy            ✓ healthy
+```
+
+Write final `status` and `last_checked` to all service region entries in frontmatter.
+Any service still failing → append entry to `known_issues` in the doc.
+
+### Phase RO5 - Summary
+
+```
+runop complete - <slug>
+
+                 <region-1> (canary)  <region-2> (primary)
+<service>        ✓ healthy            ✓ healthy
+<service>        ✓ healthy            ✓ healthy
+
+Canary gate: PASSED ✓
+Regions deployed: <N>/<N>
+Steps executed: <N>/<N> ✓
+last_synced: <YYYY-MM-DD>
+```
+
+If canary gate failed and primary was skipped:
+```
+runop stopped - <slug>
+
+<region-1> (canary): ✗ gate FAILED - <service> failing after deploy
+<region-2> (primary): NOT deployed - canary gate must pass first
+
+on_gate_failure: stop
+Fix: resolve <service> in <region-1>, then re-run /mdd runop <slug>
+```
+
+---
+
+## OPS UPDATE MODE - `/mdd update-op <slug>`
+
+Triggered when arguments start with `update-op`. Updates an existing ops runbook.
+
+### Phase UO1 - Load
+
+1. Parse `<slug>` - hard stop *"Slug required. Usage: /mdd update-op <slug>"* if missing.
+2.
Locate runbook (project-local first, then global):
+   - Check `.mdd/ops/<slug>.md` → found: load it, note scope = project
+   - Check `~/.claude/ops/<slug>.md` → found: load it, note scope = global
+   - Neither found → hard stop: *"No runbook found for `<slug>`. Run `/mdd ops list` to see all available runbooks."*
+
+### Phase UO2 - Re-ask with current values pre-filled
+
+Re-present the Phase OP2 questions with current values shown as defaults. User can accept (press enter) or type a new value. Only changed fields are rewritten.
+
+Show a diff summary before writing:
+```
+Changes detected:
+  + regions: eu-central added
+  ~ services.api.regions.eu-west.image: old-name:v1 → new-name:v2
+  ~ deployment_strategy.on_gate_failure: stop → rollback
+```
+
+Ask: *"Apply these changes? (yes / cancel)"*
+
+### Phase UO3 - Rewrite and update
+
+Rewrite only changed sections. Preserve:
+- `known_issues` (never remove existing entries without asking)
+- Service `status` and `last_checked` values (these are live data, not config)
+
+Update frontmatter: `last_synced: <today>`, `status: draft` if previously `complete` and structural changes were made.
+
+```
+✅ Updated: .mdd/ops/<slug>.md
+   last_synced: <today>
+   Sections rewritten: <list>
+```
+
+---
+
+## OPS LIST MODE - `/mdd ops list`
+
+Triggered when arguments are exactly `ops list`. Lists all ops runbooks, global and project, in a single unified view.
+
+### Phase OL - Scan and display
+
+1. Glob `~/.claude/ops/*.md` - read each, extract `id`, `title`, `platform`, `status`, and the last `last_checked` value across all services.
+2. Glob `.mdd/ops/*.md` (excluding `archive/`) - same fields.
+3.
Display unified list, grouped by scope: + +``` +๐Ÿ“ฆ Ops Runbooks + +Global (~/.claude/ops/) + cloudflare-dns DNS record updates via Cloudflare API last run: 2026-04-10 + docker-hub-login Docker Hub authentication procedure last run: 2026-03-28 + ssl-renewal Let's Encrypt cert renewal (Certbot) last run: never + +Project (.mdd/ops/) + rulecatch-dokploy 10 services โ†’ eu-west (canary) + us-east last run: 2026-04-18 โœ“ all healthy + swarmk-dokploy 7 services โ†’ eu-west (canary) + us-east last run: 2026-04-17 โš  api degraded + +Run /mdd runop <slug> to execute any runbook. +``` + +If either directory is empty or missing, omit that section without error. If both are empty: +``` +No ops runbooks found. + Project: /mdd ops <description> (saves to .mdd/ops/) + Global: /mdd ops <description> (choose "global" when prompted) +``` + +--- + ## COMMANDS MODE โ€” `/mdd commands` Triggered when arguments start with `commands`. Outputs a reference table of every available MDD mode. @@ -1872,8 +2237,12 @@ Command | Description /mdd plan-sync | Plan-Sync Mode โ€” Reconcile manual edits to initiative/wave files /mdd plan-remove-feature <wave> <feature> | Plan-Remove-Feature Mode โ€” Remove a feature from a wave /mdd plan-cancel-initiative <slug> | Plan-Cancel-Initiative Mode โ€” Cancel an initiative and archive its waves +/mdd ops <description> | Ops Document Mode โ€” Create a runbook (asks: global ~/.claude/ops/ or project .mdd/ops/) +/mdd ops list | Ops List Mode โ€” Show all runbooks (global and project) with last-run status +/mdd runop <slug> | Ops Execute Mode โ€” Run a runbook: pre-flight health check, canary-gated deploy, post-flight verify +/mdd update-op <slug> | Ops Update Mode โ€” Edit an existing runbook (checks project then global) -Run /mdd <feature description> to start building, or /mdd audit to check existing code. +Run /mdd <feature description> to start building, /mdd ops <description> to create a deployment runbook, or /mdd audit to check existing code. 
``` No files are created or modified by this mode. diff --git a/README.md b/README.md index cd3c7a9..af9d871 100644 --- a/README.md +++ b/README.md @@ -436,7 +436,158 @@ The dashboard auto-detects drift by running `git log` against each doc's `last_s npm: [mdd-tui](https://www.npmjs.com/package/mdd-tui) ยท GitHub: [TheDecipherist/mdd](https://github.com/TheDecipherist/mdd) -> **Recommended: install MDD globally.** Run `/install-global` once and answer "yes" to the MDD prompt โ€” `/mdd` is then available in every project on your machine with no per-project setup. Update the starter kit once and every project picks up the new version automatically on the next session. When you run `/mdd` for the first time in a fresh project, it auto-creates the `.mdd/` structure (docs, audits, `.startup.md`) โ€” no separate `/install-mdd` step needed. +> **Recommended: install MDD globally.** Run `/install-global` once and answer "yes" to the MDD prompt โ€” `/mdd` is then available in every project on your machine with no per-project setup. Update the starter kit once and every project picks up the new version automatically on the next session. When you run `/mdd` for the first time in a fresh project, it auto-creates the `.mdd/` structure (docs, audits, ops, `.startup.md`) โ€” no separate `/install-mdd` step needed. + +--- + +## Ops Mode โ€” Deployment Runbooks โœจ NEW + +> **The flaw MDD had:** Deployment and infrastructure tasks had no documentation home. Running `/mdd dokploy-deploy` defaulted to Build Mode and skipped the documentation phases โ€” because deploying services isn't a feature to build. Ops Mode fixes this. + +MDD now treats deployments as first-class citizens. Every deployment target gets a structured runbook โ€” either project-local or global. Write it once โ€” then `runop` executes it every time, with live health checks, verified steps, and canary-gated multi-region rollout. 
+ +### Commands + +| Command | What it does | +|---|---| +| `/mdd ops <description>` | Create a runbook โ€” **first question is always: global or project?** | +| `/mdd ops list` | List all runbooks โ€” global and project โ€” with last-run health status | +| `/mdd runop <slug>` | Execute a runbook โ€” checks project-local first, then global | +| `/mdd update-op <slug>` | Edit an existing runbook โ€” same lookup order | + +### Global vs Project Scope + +The **first thing `/mdd ops` asks** is where the runbook should live: + +| Scope | Location | Use for | +|---|---|---| +| **Project** | `.mdd/ops/<slug>.md` | This project only (e.g., deploy this specific app to Dokploy) | +| **Global** | `~/.claude/ops/<slug>.md` | Reusable across all projects (e.g., update Cloudflare DNS, renew SSL certs, Docker Hub login) | + +> **Global ops cannot access project-local `.env` variables or project paths.** They use `~/.env` globals only โ€” which is exactly right for infrastructure procedures that don't belong to any one project. + +**Global is the authoritative namespace.** If a global runbook named `cloudflare-dns` exists, no project can create a local runbook with the same name. This prevents any ambiguity about which runbook `runop` will execute โ€” you always know exactly what runs. + +### Write Once, Runs Every Time + +```bash +# First time โ€” creates the runbook +/mdd ops "deploy rulecatch services to dokploy US and EU" + +# Every deployment after โ€” reads the runbook, no questions asked +/mdd runop rulecatch-dokploy +``` + +`runop` reads `.mdd/ops/rulecatch-dokploy.md` and executes the full deployment: pre-flight checks, step-by-step procedure with verification at each step, and post-flight confirmation. No tribal knowledge. No forgotten steps. The doc IS the deployment. + +### Canary-Gated Multi-Region Deployment + +Deploy to your canary region first. Gate on full health verification. Only then touch primary. 
If canary fails โ€” primary is never touched, still running the last good version. + +``` +Pre-flight Health Check โ€” rulecatch-dokploy +โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ + eu-west (canary) us-east (primary) +api โœ“ healthy โœ“ healthy +dashboard โœ— failing โœ“ healthy +worker โœ“ healthy โœ“ healthy + +dashboard is failing in eu-west. + (a) Redeploy (b) Skip (c) Abort +``` + +``` +โ”€โ”€ eu-west (canary) โ€” gate check โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ +api โœ“ healthy (rulecatch-api-eu:latest) +dashboard โœ“ healthy (rulecatch-dashboard-eu:latest) +worker โœ“ healthy +Gate: PASSED โœ“ โ€” advancing to us-east (primary) +``` + +``` +runop complete โ€” rulecatch-dokploy + + eu-west (canary) us-east (primary) +api โœ“ healthy โœ“ healthy +dashboard โœ“ healthy โœ“ healthy +worker โœ“ healthy โœ“ healthy + +Canary gate: PASSED โœ“ +Regions deployed: 2/2 +Steps executed: 14/14 โœ“ +``` + +### Per-Region Docker Image Overrides + +Different image names for different regions? Fully supported. Each service has a default image plus per-region overrides: + +```yaml +services: + - slug: api + image: theDecipherist/rulecatch-api:latest # default + regions: + eu-west: + image: theDecipherist/rulecatch-api-eu:latest # different name for EU + status: healthy + last_checked: 2026-04-18T10:00:00Z + us-east: + image: theDecipherist/rulecatch-api:latest # same as default + status: healthy + last_checked: 2026-04-18T10:05:00Z +``` + +All keys are always fully populated โ€” no implicit inheritance that breaks when you add a second region. 
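
As a rough illustration of how a per-region image could be looked up once the frontmatter has been parsed, here is a minimal sketch. It assumes the `services` entries are already loaded as plain dictionaries, and `resolve_image` is a hypothetical helper, not part of MDD. The fallback to the service default mirrors the behaviour the `runop` instructions describe for a region entry with no `image` set.

```python
from typing import Any


def resolve_image(service: dict[str, Any], region: str) -> str:
    """Return the image to deploy for `service` in `region`.

    Uses the per-region `image` when present, otherwise the service default.
    """
    regions = service.get("regions") or {}
    if region not in regions:
        raise KeyError(f"service {service.get('slug')!r} is not declared for region {region!r}")
    override = (regions[region] or {}).get("image")
    return override or service["image"]


# Example data shaped like the YAML above (status fields omitted for brevity).
api = {
    "slug": "api",
    "image": "theDecipherist/rulecatch-api:latest",
    "regions": {
        "eu-west": {"image": "theDecipherist/rulecatch-api-eu:latest"},
        "us-east": {},  # no override here, so the default image is used
    },
}

print(resolve_image(api, "eu-west"))
print(resolve_image(api, "us-east"))
```

Raising on an undeclared region, rather than silently using the default, is a design choice of this sketch: it keeps a typo in a region slug from deploying the wrong image.
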
+ +### Deployment Strategy Control + +```yaml +deployment_strategy: + order: sequential # sequential | parallel + gate: health_check # health_check | manual | none + on_gate_failure: stop # stop | skip_region | rollback + rollback_on_failure: false # auto-run rollback steps on failure +``` + +`on_gate_failure: stop` โ€” canary fails, primary untouched. Investigate, fix, re-run. +`on_gate_failure: rollback` โ€” canary fails, auto-rollback EU, primary untouched. +`on_gate_failure: skip_region` โ€” skip the failed region and continue to primary (useful when EU is lower priority). + +### Listing All Runbooks + +```bash +/mdd ops list +``` + +``` +๐Ÿ“ฆ Ops Runbooks + +Global (~/.claude/ops/) + cloudflare-dns DNS record updates via Cloudflare API last run: 2026-04-10 + ssl-renewal Let's Encrypt cert renewal (Certbot) last run: never + +Project (.mdd/ops/) + rulecatch-dokploy 10 services โ†’ eu-west (canary) + us-east last run: 2026-04-18 โœ“ all healthy + swarmk-dokploy 7 services โ†’ eu-west (canary) + us-east last run: 2026-04-17 โš  api degraded + +Run /mdd runop <slug> to execute any runbook. +``` + +### Where Runbooks Live + +``` +~/.claude/ops/ โ† global runbooks (all projects) + cloudflare-dns.md + ssl-renewal.md + +.mdd/ +โ”œโ”€โ”€ docs/ โ† feature docs (type: feature | task) +โ””โ”€โ”€ ops/ โ† project runbooks (this project only) + โ”œโ”€โ”€ rulecatch-dokploy.md + โ”œโ”€โ”€ swarmk-dokploy.md + โ””โ”€โ”€ archive/ +``` + +All existing modes are ops-aware: `/mdd status` shows ops runbook count, `/mdd scan` checks runbook drift, `/mdd graph` includes a runbook health summary, `/mdd audit` flags missing sections and credential security violations. 
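
To make the "missing sections" part of the audit concrete, the check might look something like the sketch below. The section names are copied from the runbook template in `.claude/commands/mdd.md`; treating every one of them as mandatory, and the helper name `missing_sections`, are assumptions of this example rather than documented MDD behaviour.

```python
import re

# Headings from the runbook template. Treating all of them as mandatory
# is an assumption of this sketch.
TEMPLATE_SECTIONS = [
    "Overview",
    "Services & Ports",
    "Environment Targets",
    "Webhooks & Triggers",
    "Credentials & API Keys",
    "MCP Servers",
    "Deployment Procedure",
    "Rollback Plan",
]

# Matches level-2 Markdown headings only ("## Name"), not "#" or "###".
H2_PATTERN = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(runbook_markdown: str) -> list[str]:
    """Return template sections that have no matching `## <name>` heading."""
    found = {match.group(1) for match in H2_PATTERN.finditer(runbook_markdown)}
    return [name for name in TEMPLATE_SECTIONS if name not in found]


sample = "# Deploy\n\n## Overview\nWhat this does.\n\n## Rollback Plan\nSteps.\n"
for name in missing_sections(sample):
    print(f"missing section: {name}")
```

A credential check would need its own heuristics and is deliberately left out here.
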
--- diff --git a/docs/index.html b/docs/index.html index 9c7c626..7c95cd8 100644 --- a/docs/index.html +++ b/docs/index.html @@ -143,6 +143,7 @@ <a href="#mdd-workflow">MDD Workflow โœจ NEW</a> <a href="#mdd-startup-context">Startup Context</a> <a href="#mdd-dashboard">MDD Dashboard</a> + <a href="#ops-mode">Ops Mode</a> <a href="#featured-packages">Featured Packages</a> <a href="#what-is-this">What Is This?</a> <a href="#learning-path">Learning Path</a> @@ -208,6 +209,7 @@ <h1>The Definitive Starting Point<br>for Claude Code Projects</h1> <a href="#mdd-workflow" class="nav-new">MDD Workflow โœจ NEW</a> <a href="#mdd-startup-context" style="padding-left: 2rem; font-size: 0.9em;">Startup Context</a> <a href="#mdd-dashboard" style="padding-left: 2rem; font-size: 0.9em;">MDD Dashboard</a> + <a href="#ops-mode" style="padding-left: 2rem; font-size: 0.9em;">Ops Mode</a> <a href="#featured-packages">Featured Packages</a> <a href="#what-is-this">What Is This?</a> <a href="#learning-path">Learning Path</a> @@ -661,6 +663,193 @@ <h3>Drift Detection</h3> <p style="margin-top: 1rem;">npm: <a href="https://www.npmjs.com/package/mdd-tui" target="_blank" rel="noopener">mdd-tui</a> · <a href="https://github.com/TheDecipherist/mdd" target="_blank" rel="noopener">GitHub</a></p> </section> + <!-- Ops Mode --> + <section id="ops-mode"> + <h2>Ops Mode <span style="background: linear-gradient(135deg, #f59e0b, #ef4444); color: white; font-size: 0.5em; padding: 0.2em 0.6em; border-radius: 999px; vertical-align: middle; margin-left: 0.5em;">NEW</span></h2> + <p style="font-size: 1.1em; color: var(--text-secondary); margin-bottom: 1.5rem;">Document-first deployments — the same discipline MDD brings to features, now applied to ops tasks. Runbooks replace improvised deploys. Every region is gated. 
If canary fails, primary is never touched.</p> + + <div class="callout" style="background: linear-gradient(135deg, rgba(245, 158, 11, 0.1), rgba(239, 68, 68, 0.05)); border-left: 4px solid #f59e0b; padding: 1.25rem; border-radius: 0.5rem; margin-bottom: 2rem;"> + <strong>What Ops Mode fixes:</strong> MDD's original flaw — deployment and ops tasks had no documentation home. They fell into Build Mode or were skipped entirely. Ops Mode gives deployments the same document-first discipline as features. Write the runbook once; <code>/mdd runop</code> executes it every time. + </div> + + <h3>The Four Commands</h3> + <div class="commands-grid" style="margin-bottom: 2rem;"> + <div class="command-card"> + <h3><code>/mdd ops <description></code></h3> + <p class="command-desc">Create a new deployment runbook. First asks: global (<code>~/.claude/ops/</code>) or project-scoped (<code>.mdd/ops/</code>)? Interviews you about services, regions, health checks, rollback criteria, and deployment strategy. Produces a structured YAML-frontmatter runbook that <code>runop</code> can execute.</p> + </div> + <div class="command-card"> + <h3><code>/mdd runop <slug></code></h3> + <p class="command-desc">Execute a runbook end-to-end: pre-flight health check → canary-gated region deploy → post-flight verify. Checks project-local first, then global. Reads the ops doc as the source of truth. If any gate fails, execution stops — the next region is never touched.</p> + </div> + <div class="command-card"> + <h3><code>/mdd update-op <slug></code></h3> + <p class="command-desc">Edit an existing runbook. Checks project-local first, then global. Updates services, regions, health endpoints, rollback criteria, or deployment strategy. Re-validates the runbook structure after editing.</p> + </div> + <div class="command-card"> + <h3><code>/mdd ops list</code></h3> + <p class="command-desc">Show all runbooks — global and project-scoped — in a unified view. 
Displays slug, scope, platform, environments, and status so you can see everything available at a glance.</p> + </div> + </div> + + <h3>Global vs Project Scope</h3> + <p><code>/mdd ops</code> asks scope as its very first question. The answer controls where the runbook is stored and how it is shared.</p> + + <div style="overflow-x: auto; margin: 1.5rem 0;"> + <table style="width: 100%; border-collapse: collapse; font-size: 0.9em;"> + <thead> + <tr style="background: var(--bg-secondary);"> + <th style="text-align: left; padding: 0.75rem 1rem; border-bottom: 2px solid var(--border);">Scope</th> + <th style="text-align: left; padding: 0.75rem 1rem; border-bottom: 2px solid var(--border);">Location</th> + <th style="text-align: left; padding: 0.75rem 1rem; border-bottom: 2px solid var(--border);">Available</th> + <th style="text-align: left; padding: 0.75rem 1rem; border-bottom: 2px solid var(--border);">Use for</th> + </tr> + </thead> + <tbody> + <tr> + <td style="padding: 0.75rem 1rem; border-bottom: 1px solid var(--border);"><strong>Global</strong></td> + <td style="padding: 0.75rem 1rem; border-bottom: 1px solid var(--border);"><code>~/.claude/ops/<slug>.md</code></td> + <td style="padding: 0.75rem 1rem; border-bottom: 1px solid var(--border);">All projects</td> + <td style="padding: 0.75rem 1rem; border-bottom: 1px solid var(--border);">Docker Hub login, DNS updates, Vercel deploys, reusable infrastructure tasks</td> + </tr> + <tr> + <td style="padding: 0.75rem 1rem;"><strong>Project</strong></td> + <td style="padding: 0.75rem 1rem;"><code>.mdd/ops/<slug>.md</code></td> + <td style="padding: 0.75rem 1rem;">This project only</td> + <td style="padding: 0.75rem 1rem;">Service-specific deploys, project env vars, region-specific image names</td> + </tr> + </tbody> + </table> + </div> + + <div class="callout" style="background: rgba(245, 158, 11, 0.08); border-left: 4px solid #f59e0b; padding: 1rem 1.25rem; border-radius: 0.5rem; margin-bottom: 1.5rem;"> + <strong>Global 
scope note:</strong> Global ops cannot read project-local <code>.env</code> variables or project paths. They can access only the globals in <code>~/.env</code>. If your runbook needs <code>DOCKER_HUB_TOKEN</code> or other global secrets, store them in <code>~/.env</code> and reference them from there.
+      </div>
+
+      <div class="callout" style="background: rgba(239, 68, 68, 0.08); border-left: 4px solid #ef4444; padding: 1rem 1.25rem; border-radius: 0.5rem; margin-bottom: 2rem;">
+        <strong>Collision guard:</strong> The global namespace is authoritative. If a global op named <code>docker-hub-push</code> already exists, you cannot create a project op with the same slug. This is a hard stop: no silent shadowing. Rename one of them to avoid ambiguity.
+      </div>
+
+      <h3>Listing All Runbooks</h3>
+      <p><code>/mdd ops list</code> shows all runbooks across both scopes in a unified view:</p>
+      <pre><code class="language-text">Ops Runbooks
+────────────────────────────────────────────────────────────
+
+Global (~/.claude/ops/)
+  docker-hub-login     docker-hub   all projects     complete
+  dns-cloudflare       manual       all projects     draft
+
+Project (.mdd/ops/)
+  rulecatch-dokploy    dokploy      staging, prod    in_progress
+  api-rollback         manual       production       draft
+
+4 runbooks total (2 global, 2 project)
+Run /mdd runop &lt;slug&gt; to execute any runbook.</code></pre>
+
+      <h3>The Runbook Concept</h3>
+      <p>A runbook is a structured ops document in <code>.mdd/ops/</code> with YAML frontmatter declaring services, regions, deployment order, health check endpoints, and gate behaviour. It is the single source of truth for a deployment.
<code>/mdd runop</code> reads it and executes it, so you do not hand-craft the sequence each time.</p>
+      <p><strong>Write once, run every time.</strong> Once a runbook exists, every future deployment of that service runs through the same documented sequence, with the same pre-flight checks, the same canary gate, and the same post-flight verification. No step is skipped because someone forgot it.</p>
+
+      <h3>Canary Deployment Pattern</h3>
+      <p>Ops Mode enforces a canary-first deployment model. Regions are ordered by <code>deploy_order</code> in the runbook frontmatter. The canary region always deploys first; the primary region proceeds only if every canary service passes the health gate.</p>
+
+      <div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(160px, 1fr)); gap: 1rem; margin: 1.5rem 0;">
+        <div style="background: var(--bg-secondary); border-radius: 0.75rem; padding: 1.25rem; text-align: center; border-top: 3px solid #f59e0b;">
+          <div style="font-size: 1.5em; margin-bottom: 0.5rem;">🔍</div>
+          <strong style="display: block; margin-bottom: 0.25rem;">Pre-flight</strong>
+          <small style="color: var(--text-muted);">Health-check all regions before any deploy begins</small>
+        </div>
+        <div style="background: var(--bg-secondary); border-radius: 0.75rem; padding: 1.25rem; text-align: center; border-top: 3px solid #8b5cf6;">
+          <div style="font-size: 1.5em; margin-bottom: 0.5rem;">🐦</div>
+          <strong style="display: block; margin-bottom: 0.25rem;">Canary Deploy</strong>
+          <small style="color: var(--text-muted);">Deploy to canary region first (deploy_order: 1)</small>
+        </div>
+        <div style="background: var(--bg-secondary); border-radius: 0.75rem; padding: 1.25rem; text-align: center; border-top: 3px solid #3b82f6;">
+          <div style="font-size: 1.5em; margin-bottom: 0.5rem;">🚦</div>
+          <strong style="display: block; margin-bottom: 0.25rem;">Canary Gate</strong>
+          <small style="color: var(--text-muted);">All services must be healthy
before advancing</small>
+        </div>
+        <div style="background: var(--bg-secondary); border-radius: 0.75rem; padding: 1.25rem; text-align: center; border-top: 3px solid #10b981;">
+          <div style="font-size: 1.5em; margin-bottom: 0.5rem;">🚀</div>
+          <strong style="display: block; margin-bottom: 0.25rem;">Primary Deploy</strong>
+          <small style="color: var(--text-muted);">Only then deploy to primary region (deploy_order: 2)</small>
+        </div>
+        <div style="background: var(--bg-secondary); border-radius: 0.75rem; padding: 1.25rem; text-align: center; border-top: 3px solid #06b6d4;">
+          <div style="font-size: 1.5em; margin-bottom: 0.5rem;">✅</div>
+          <strong style="display: block; margin-bottom: 0.25rem;">Post-flight</strong>
+          <small style="color: var(--text-muted);">Verify all regions healthy, update runbook frontmatter</small>
+        </div>
+      </div>
+
+      <p>If the canary gate fails (any service is unhealthy after the deploy), the primary region is never touched and keeps running the old version. You can redeploy to canary, skip that region, or abort entirely. The gate behaviour is configurable: <code>stop</code> (default), <code>skip_region</code>, or <code>rollback</code>.</p>
+
+      <h3>Pre-flight Health Check</h3>
+      <p>Before any deployment begins, <code>/mdd runop</code> checks the current health of all services across all regions. If a service is already failing before you deploy, you know upfront, not after the deploy makes it worse. The pre-flight table surfaces this clearly and prompts for a decision:</p>
+
+      <pre><code class="language-text">Pre-flight Health Check — rulecatch-dokploy
+──────────────────────────────────────────────────
+            eu-west (canary)   us-east (primary)
+api         ✓ healthy          ✓ healthy
+dashboard   ✗ failing          ✓ healthy
+worker      ✓ healthy          ✓ healthy
+
+dashboard is failing in eu-west.
Redeploy, skip, or abort?</code></pre>
+
+      <h3>Canary Gate Output</h3>
+      <p>After the canary deploy completes, <code>/mdd runop</code> re-checks every service in the canary region. The gate output shows exactly which image each service is running and its health status:</p>
+
+      <pre><code class="language-text">── eu-west (canary) — gate check ──────────────
+api         ✓ healthy (rulecatch-api-eu:latest)
+dashboard   ✓ healthy
+worker      ✓ healthy
+Gate: PASSED ✓ — advancing to us-east (primary)</code></pre>
+
+      <h3>Multi-Region, Multi-Image Support</h3>
+      <p>Each service can have a different Docker image name per region. This supports regional image registries, blue-green strategies, or teams that maintain separate EU and US image builds. Per-region health status is tracked in the runbook frontmatter so the dashboard can show the last-known state of every region at a glance.</p>
+
+      <div class="commands-grid" style="margin: 1.5rem 0;">
+        <div class="command-card">
+          <h3>Per-Region Images</h3>
+          <p class="command-desc">Each service declares its image name per region in the runbook frontmatter. <code>runop</code> reads the correct image for each region automatically, with no manual substitution.</p>
+        </div>
+        <div class="command-card">
+          <h3>Health Status Tracking</h3>
+          <p class="command-desc">After each <code>runop</code> execution, the runbook frontmatter is updated with the last-known health status per region and service. <code>/mdd status</code> can surface this in the startup context.</p>
+        </div>
+        <div class="command-card">
+          <h3>Deployment Strategies</h3>
+          <p class="command-desc"><code>deployment_strategy: sequential</code> deploys regions one at a time with gates between each. <code>deployment_strategy: parallel</code> deploys non-canary regions simultaneously after the canary gate passes.</p>
+        </div>
+        <div class="command-card">
+          <h3>Gate Behaviour</h3>
+          <p class="command-desc"><code>gate_on_fail: stop</code> halts execution (default).
<code>skip_region</code> logs the failure and continues to the remaining regions. <code>rollback</code> triggers the runbook’s defined rollback procedure before stopping.</p>
+        </div>
+      </div>
+
+      <h3>Directory Structure</h3>
+      <p>Project runbooks live in <code>.mdd/ops/</code>. Global runbooks live in <code>~/.claude/ops/</code> alongside your other global Claude config:</p>
+      <pre><code class="language-bash"># Project-scoped runbooks (this project only)
+.mdd/
+├── docs/                        # Feature documentation
+├── ops/                         # Project deployment runbooks
+│   ├── rulecatch-dokploy.md     # Multi-region Dokploy deploy runbook
+│   └── api-rollback.md          # Emergency rollback runbook
+├── initiatives/
+├── waves/
+└── audits/
+
+# Global runbooks (available in every project)
+~/.claude/
+├── commands/                    # Global slash commands
+├── skills/                      # Global skills
+├── ops/                         # Global deployment runbooks  ← NEW
+│   ├── docker-hub-login.md      # Reusable Docker Hub auth runbook
+│   └── dns-cloudflare.md        # DNS update runbook (any project)
+└── CLAUDE.md</code></pre>
+
+      <p>Project runbooks are gitignored by default alongside the rest of <code>.mdd/</code> because they contain environment-specific configuration (region URLs, health endpoints, image names) that belongs in your local workspace, not version control. Global runbooks persist in your home directory and are never project-specific.</p>
+    </section>
+
     <!-- Featured Packages -->
     <section id="featured-packages">
       <h2>Featured Packages</h2>