diff --git a/CLAUDE.md b/CLAUDE.md index 584bf8ad..35de39ff 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -73,12 +73,25 @@ The SDK is the canonical kernel — a single typed client with a `CredentialsPro ### SDK (`sdk/src/`) -- **`index.ts`** — `Run402` class + `run402()` factory. Isomorphic entry point. +- **`index.ts`** — `Run402` class + `run402()` factory + `files()` helper. Isomorphic entry point. - **`kernel.ts`** — Request function, `Client` interface. Only place that calls `globalThis.fetch`. -- **`errors.ts`** — `Run402Error` hierarchy: `PaymentRequired`, `ProjectNotFound`, `Unauthorized`, `ApiError`, `NetworkError`. Never calls `process.exit`. +- **`errors.ts`** — `Run402Error` hierarchy: `PaymentRequired`, `ProjectNotFound`, `Unauthorized`, `ApiError`, `NetworkError`, `Run402DeployError` (the v1.34+ structured envelope from the deploy state machine). Never calls `process.exit`. - **`credentials.ts`** — `CredentialsProvider` interface. Required: `getAuth`, `getProject`. Optional: `saveProject`, `updateProject`, `removeProject`, `setActiveProject`, `getActiveProject`, `readAllowance`, `saveAllowance`, `createAllowance`, `getAllowancePath`. -- **`namespaces/*.ts`** — One class per resource group (projects, blobs, functions, email, …). Namespaces hold a `Client` and expose typed methods. -- **`node/*.ts`** — Node-only entry point (`@run402/sdk/node`). Wraps `core/` keystore + allowance into `NodeCredentialsProvider`. Sets up x402-wrapped fetch via `createLazyPaidFetch()`. +- **`namespaces/*.ts`** — One class per resource group (projects, blobs, functions, email, …). Namespaces hold a `Client` and expose typed methods. The canonical deploy primitive lives at **`namespaces/deploy.ts`** (with shared types in `deploy.types.ts`) — see "Unified Deploy" below. +- **`node/*.ts`** — Node-only entry point (`@run402/sdk/node`). Wraps `core/` keystore + allowance into `NodeCredentialsProvider`. Sets up x402-wrapped fetch via `createLazyPaidFetch()`. 
Adds `fileSetFromDir(path)` for filesystem byte sources to the deploy primitive. + +### Unified Deploy (v1.34+) + +- **`namespaces/deploy.ts`** — `Deploy` class exposing the canonical primitive. Three layers: + - `r.deploy.apply(spec, opts?)` — one-shot, awaits to terminal (most agents use this). + - `r.deploy.start(spec, opts?)` — returns a `DeployOperation` with `events()` async iterator + `result()` promise. + - `r.deploy.plan` / `upload` / `commit` — low-level steps for CLI debugging. + - Plus `r.deploy.resume(operationId)`, `status`, `getRelease`, `diff`. +- **All bytes ride through CAS.** The plan request body never carries inline bytes — only `ContentRef` objects. When the normalized spec exceeds 5 MB JSON, the SDK uploads the manifest itself as a CAS object and references it (`manifest_ref` escape hatch — no body-size cliff). +- **Replace vs patch semantics per resource.** `site.replace` = "this is the whole site" (files absent are removed in the new release); `site.patch.put` / `patch.delete` = surgical updates. Same for `functions`, `secrets`, `subdomains`. Top-level absence = leave untouched. +- **Server-authoritative manifest digest.** The gateway returns the canonical digest in the plan response. The SDK no longer requires byte-for-byte canonicalize agreement — `canonicalize.ts` is now a UX helper only. +- **Backward-compat shims.** `apps.bundleDeploy` translates legacy options (including `inherit: true` with a deprecation warning) into a `ReleaseSpec` and delegates to `deploy.apply`. `sites.deployDir` is a thin wrapper that uses `fileSetFromDir(dir)` and synthesizes both unified `DeployEvent` shapes and the legacy `{ phase: ... }` shapes for v1.32-era event consumers. +- **MCP/CLI surface.** `deploy` and `deploy_resume` MCP tools (in `src/tools/deploy.ts` and `src/tools/deploy-resume.ts`) expose the new primitive directly. CLI subcommands `run402 deploy apply` and `run402 deploy resume` (in `cli/lib/deploy-v2.mjs`) mirror them. 
The legacy `bundle_deploy`/`deploy_site`/`deploy_site_dir` MCP tools and `run402 deploy --manifest` CLI continue to work and route through the same SDK shim. ### Shared Core (`core/src/`) diff --git a/cli-e2e.test.mjs b/cli-e2e.test.mjs index cd3a6eae..857848b9 100644 --- a/cli-e2e.test.mjs +++ b/cli-e2e.test.mjs @@ -266,7 +266,7 @@ function mockFetch(input, init) { })); } - // Deployments (sites) — v1.32 plan/commit transport + // Deployments (sites) — v1.32 plan/commit transport (legacy) if (path === "/deploy/v1/plan" && method === "POST") { // Mark every file in the inbound manifest as `present: true` so the // SDK skips S3 PUTs and goes straight to commit. This keeps the e2e @@ -292,6 +292,53 @@ function mockFetch(input, init) { return Promise.resolve(json({ id: "dpl_test456", status: "live", url: "https://dpl_test456.sites.run402.com" })); } + // Deploy v2 — unified plan/commit. The CLI's `sites deploy` and + // `sites deploy-dir` route through r.deploy.apply against these endpoints. + // The fake gateway reports every content ref as already-present (empty + // missing_content) so the SDK skips S3 PUTs and goes straight to commit. 
+ if (path === "/deploy/v2/plans" && method === "POST") { + return Promise.resolve(json({ + plan_id: "plan_v2_test", + operation_id: "op_v2_test", + base_release_id: null, + manifest_digest: "deadbeef".repeat(8), + missing_content: [], + diff: { resources: { site: { unchanged: true } } }, + })); + } + if (path.match(/^\/deploy\/v2\/plans\/[^/]+\/commit$/) && method === "POST") { + return Promise.resolve(json({ + operation_id: "op_v2_test", + status: "ready", + release_id: "rel_v2_test", + urls: { + site: "https://dpl_test456.sites.run402.com", + deployment_id: "dpl_test456", + }, + })); + } + if (path.match(/^\/deploy\/v2\/operations\/[^/]+$/) && method === "GET") { + return Promise.resolve(json({ + operation_id: "op_v2_test", + project_id: TEST_PROJECT.project_id, + plan_id: "plan_v2_test", + status: "ready", + base_release_id: null, + target_release_id: "rel_v2_test", + release_id: "rel_v2_test", + urls: { + site: "https://dpl_test456.sites.run402.com", + deployment_id: "dpl_test456", + }, + payment_required: null, + error: null, + activate_attempts: 0, + last_activate_attempt_at: null, + created_at: new Date().toISOString(), + updated_at: new Date().toISOString(), + })); + } + // Subdomains if (path === "/subdomains/v1" && method === "POST") { return Promise.resolve(json({ name: "my-app", url: "https://my-app.run402.com", deployment_id: "dpl_test456" }, 201)); diff --git a/cli/lib/deploy-v2.mjs b/cli/lib/deploy-v2.mjs new file mode 100644 index 00000000..83679254 --- /dev/null +++ b/cli/lib/deploy-v2.mjs @@ -0,0 +1,359 @@ +/** + * `run402 deploy apply` and `run402 deploy resume` — CLI wrappers over the + * unified deploy primitive (`r.deploy.apply` / `r.deploy.resume`). + * + * The legacy `run402 deploy --manifest …` command is preserved in + * `cli/lib/deploy.mjs` and continues to work; this file adds the new + * subcommand surface. 
+ * + * Manifest format mirrors the MCP `deploy` tool's input schema: + * { + * "project_id": "...", + * "base": { "release": "current" } | { "release": "empty" } | { "release_id": "..." }, + * "database": { "migrations": [...], "expose": {...}, "zero_downtime": false }, + * "secrets": { "set": {...}, "delete": [...], "replace_all": {...} }, + * "functions": { "replace": {...}, "patch": { "set": {...}, "delete": [...] } }, + * "site": { "replace": {...} } | { "patch": { "put": {...}, "delete": [...] } }, + * "subdomains": { "set": ["..."], "add": [...], "remove": [...] }, + * "idempotency_key": "..." + * } + * + * File entries: `{ "data": "...", "encoding": "utf-8" | "base64", "contentType": "..." }` + * — same shape used by `bundle_deploy`. UTF-8 is the default; binary files + * pass `"encoding": "base64"`. + */ + +import { readFileSync } from "node:fs"; +import { resolve, dirname, isAbsolute, join } from "node:path"; +import { getSdk } from "./sdk.mjs"; +import { reportSdkError } from "./sdk-errors.mjs"; +import { allowanceAuthHeaders, resolveProjectId } from "./config.mjs"; + +const APPLY_HELP = `run402 deploy apply — Unified deploy primitive (v1.34+) + +Usage: + run402 deploy apply --manifest [--project ] [--quiet] + run402 deploy apply --spec '' [--project ] [--quiet] + cat spec.json | run402 deploy apply [--project ] + +Manifest format mirrors the MCP \`deploy\` tool's ReleaseSpec: + { + "project_id": "prj_...", + "base": { "release": "current" }, + "database": { "migrations": [{ "id": "001_init", "sql": "CREATE TABLE ..." }], "expose": {...} }, + "secrets": { "set": { "OPENAI_API_KEY": { "value": "sk-..." } } }, + "functions": { "replace": { "api": { "source": { "data": "export default ..." } } } }, + "site": { "replace": { "index.html": { "data": "..." 
} } }, + "subdomains": { "set": ["my-app"] } + } + +Options: + --manifest Read the spec from this JSON file + --spec '' Inline JSON spec (single-quote in shell) + --project Override project_id from the manifest + --quiet Suppress per-event JSON-line stderr (final result still on stdout) + +Output: + stdout: { "status": "ok", "release_id": "rel_...", "operation_id": "op_...", "urls": {...} } + stderr: one JSON event per line (suppressed with --quiet) + +Patch examples (only the listed file changes): + { "project_id": "prj_...", "site": { "patch": { "put": { "index.html": { "data": "..." } } } } } + { "project_id": "prj_...", "site": { "patch": { "delete": ["old.html"] } } } +`; + +const RESUME_HELP = `run402 deploy resume — Resume a stuck deploy operation + +Usage: + run402 deploy resume [--quiet] + +Used when a previous \`deploy apply\` ended in \`activation_pending\` or +\`schema_settling\` (e.g. transient gateway failure between SQL commit and +the pointer-swap activation). The gateway re-runs only the failed phase +forward — SQL is never replayed. 
+ +Output: + stdout: { "status": "ok", "release_id": "...", "operation_id": "...", "urls": {...} } + stderr: one JSON event per line (suppressed with --quiet) +`; + +export async function runDeployV2(sub, args) { + if (sub === "apply") return await applyCmd(args); + if (sub === "resume") return await resumeCmd(args); + console.error(JSON.stringify({ status: "error", message: `Unknown deploy subcommand: ${sub}` })); + process.exit(1); +} + +async function readStdin() { + const chunks = []; + for await (const chunk of process.stdin) chunks.push(chunk); + return Buffer.concat(chunks).toString("utf-8"); +} + +function makeStderrEventWriter(quiet) { + if (quiet) return undefined; + return (event) => { + console.error(JSON.stringify(event)); + }; +} + +async function applyCmd(args) { + const opts = { manifest: null, spec: null, project: null, quiet: false }; + for (let i = 0; i < args.length; i++) { + if (args[i] === "--help" || args[i] === "-h") { console.log(APPLY_HELP); process.exit(0); } + if (args[i] === "--manifest" && args[i + 1]) { opts.manifest = args[++i]; continue; } + if (args[i] === "--spec" && args[i + 1]) { opts.spec = args[++i]; continue; } + if (args[i] === "--project" && args[i + 1]) { opts.project = args[++i]; continue; } + if (args[i] === "--quiet") { opts.quiet = true; continue; } + } + + let raw; + if (opts.spec) { + raw = opts.spec; + } else if (opts.manifest) { + try { + const manifestPath = isAbsolute(opts.manifest) ? 
opts.manifest : resolve(process.cwd(), opts.manifest); + raw = readFileSync(manifestPath, "utf-8"); + } catch (err) { + console.error(JSON.stringify({ status: "error", message: `Failed to read manifest: ${err.message}` })); + process.exit(1); + } + } else { + raw = await readStdin(); + } + + let spec; + try { + spec = JSON.parse(raw); + } catch (err) { + console.error(JSON.stringify({ status: "error", message: `Manifest is not valid JSON: ${err.message}` })); + process.exit(1); + } + + if (opts.manifest) resolveFileDataPaths(spec, dirname(resolve(opts.manifest))); + + if (opts.project && spec.project_id && spec.project_id !== opts.project) { + console.error(JSON.stringify({ + status: "error", + message: `project_id conflict: spec.project_id=${spec.project_id} but --project=${opts.project}`, + })); + process.exit(1); + } + if (opts.project) spec.project_id = opts.project; + if (!spec.project_id) spec.project_id = resolveProjectId(null); + + // Translate { project_id, ... } envelope → ReleaseSpec ({ project, ... }) + // The SDK ReleaseSpec uses `project` rather than `project_id`; both shapes + // are accepted at the manifest layer (project_id is friendlier for agents + // sharing JSON manifests with the MCP tool). + const releaseSpec = mapManifestToReleaseSpec(spec); + const idempotencyKey = spec.idempotency_key; + + // Preserve the aggressive early exit when no allowance is configured. 
+ allowanceAuthHeaders("/deploy/v2/plans"); + + try { + const result = await getSdk().deploy.apply(releaseSpec, { + onEvent: makeStderrEventWriter(opts.quiet), + idempotencyKey, + }); + console.log(JSON.stringify({ status: "ok", ...result }, null, 2)); + } catch (err) { + reportSdkError(err); + } +} + +async function resumeCmd(args) { + const opts = { operationId: null, quiet: false }; + for (let i = 0; i < args.length; i++) { + if (args[i] === "--help" || args[i] === "-h") { console.log(RESUME_HELP); process.exit(0); } + if (args[i] === "--quiet") { opts.quiet = true; continue; } + if (!args[i].startsWith("-") && !opts.operationId) opts.operationId = args[i]; + } + if (!opts.operationId) { + console.error(JSON.stringify({ status: "error", message: "Usage: run402 deploy resume " })); + process.exit(1); + } + + allowanceAuthHeaders("/deploy/v2/operations"); + + try { + const result = await getSdk().deploy.resume(opts.operationId, { + onEvent: makeStderrEventWriter(opts.quiet), + }); + console.log(JSON.stringify({ status: "ok", ...result }, null, 2)); + } catch (err) { + reportSdkError(err); + } +} + +// ─── Manifest → ReleaseSpec ────────────────────────────────────────────────── + +function mapManifestToReleaseSpec(spec) { + const out = { project: spec.project_id }; + if (spec.base !== undefined) out.base = spec.base; + if (spec.subdomains !== undefined) out.subdomains = spec.subdomains; + if (spec.secrets !== undefined) out.secrets = spec.secrets; + if (spec.routes !== undefined) out.routes = spec.routes; + if (spec.checks !== undefined) out.checks = spec.checks; + + if (spec.database) { + out.database = {}; + if (spec.database.expose !== undefined) out.database.expose = spec.database.expose; + if (spec.database.zero_downtime !== undefined) out.database.zero_downtime = spec.database.zero_downtime; + if (spec.database.migrations) { + out.database.migrations = spec.database.migrations.map((m) => { + const mm = { id: m.id }; + if (m.sql !== undefined) mm.sql = m.sql; 
+ if (m.sql_ref !== undefined) mm.sql_ref = m.sql_ref; + if (m.checksum !== undefined) mm.checksum = m.checksum; + if (m.transaction !== undefined) mm.transaction = m.transaction; + return mm; + }); + } + } + + if (spec.functions) { + out.functions = {}; + if (spec.functions.replace) out.functions.replace = mapFunctionMap(spec.functions.replace); + if (spec.functions.patch) { + out.functions.patch = {}; + if (spec.functions.patch.set) out.functions.patch.set = mapFunctionMap(spec.functions.patch.set); + if (spec.functions.patch.delete) out.functions.patch.delete = spec.functions.patch.delete; + } + } + + if (spec.site) { + if (spec.site.replace) { + out.site = { replace: mapFileMap(spec.site.replace) }; + } else if (spec.site.patch) { + const patch = {}; + if (spec.site.patch.put) patch.put = mapFileMap(spec.site.patch.put); + if (spec.site.patch.delete) patch.delete = spec.site.patch.delete; + out.site = { patch }; + } + } + + return out; +} + +function mapFunctionMap(map) { + const out = {}; + for (const [name, fn] of Object.entries(map)) { + const f = {}; + if (fn.runtime) f.runtime = fn.runtime; + if (fn.source !== undefined) f.source = fileEntryToContentSource(fn.source); + if (fn.files) f.files = mapFileMap(fn.files); + if (fn.entrypoint !== undefined) f.entrypoint = fn.entrypoint; + if (fn.config !== undefined) f.config = fn.config; + if (fn.schedule !== undefined) f.schedule = fn.schedule; + out[name] = f; + } + return out; +} + +function mapFileMap(map) { + const out = {}; + for (const [path, entry] of Object.entries(map)) { + out[path] = fileEntryToContentSource(entry); + } + return out; +} + +function fileEntryToContentSource(entry) { + if (entry === null || entry === undefined) return entry; + if (typeof entry === "string") return entry; + if (entry instanceof Uint8Array) return entry; + if (typeof entry === "object") { + if (entry.encoding === "base64" && typeof entry.data === "string") { + const bytes = Buffer.from(entry.data, "base64"); + const u8 = 
new Uint8Array(bytes.buffer, bytes.byteOffset, bytes.byteLength); + return entry.contentType ? { data: u8, contentType: entry.contentType } : u8; + } + if (typeof entry.data === "string") { + return entry.contentType ? { data: entry.data, contentType: entry.contentType } : entry.data; + } + // Pre-resolved ContentRef shape — pass through. + if (typeof entry.sha256 === "string" && typeof entry.size === "number") { + return entry; + } + } + return entry; +} + +/** + * Resolve any `{ "path": "..." }` entries in the manifest to inline data. + * Mirrors the legacy deploy.mjs behavior so `run402 deploy apply` accepts + * the same files-with-paths shape that `run402 deploy` does today. + */ +function resolveFileDataPaths(spec, baseDir) { + // Site files + if (spec.site?.replace) resolveMap(spec.site.replace, baseDir); + if (spec.site?.patch?.put) resolveMap(spec.site.patch.put, baseDir); + // Function files + const visitFns = (fnMap) => { + if (!fnMap) return; + for (const fn of Object.values(fnMap)) { + if (fn.source && typeof fn.source === "object" && fn.source.path) { + const resolved = readFileEntry(fn.source, baseDir); + if (resolved) fn.source = resolved; + } + if (fn.files) resolveMap(fn.files, baseDir); + } + }; + visitFns(spec.functions?.replace); + visitFns(spec.functions?.patch?.set); + // Migration sql_path / sql_file + if (spec.database?.migrations) { + for (const m of spec.database.migrations) { + if (!m.sql && m.sql_path) { + try { + const p = isAbsolute(m.sql_path) ? 
m.sql_path : join(baseDir, m.sql_path); + m.sql = readFileSync(p, "utf-8"); + delete m.sql_path; + } catch (err) { + console.error(JSON.stringify({ + status: "error", + message: `Failed to read migration sql_path '${m.sql_path}': ${err.message}`, + })); + process.exit(1); + } + } + } + } +} + +function resolveMap(map, baseDir) { + for (const [key, entry] of Object.entries(map)) { + if (entry && typeof entry === "object" && typeof entry.path === "string" && entry.data === undefined) { + const resolved = readFileEntry(entry, baseDir); + if (resolved) map[key] = resolved; + } + } +} + +function readFileEntry(entry, baseDir) { + try { + const p = isAbsolute(entry.path) ? entry.path : join(baseDir, entry.path); + const buf = readFileSync(p); + const out = {}; + // Pick text vs binary from the declared contentType; mirrors the bundle + // deploy behavior. Image/font/binary types get base64; all else stays UTF-8. + const looksTextual = !entry.contentType?.match(/^(image|font|application\/(pdf|wasm|octet-stream|zip))/); + if (looksTextual) { + out.data = buf.toString("utf-8"); + out.encoding = "utf-8"; + } else { + out.data = buf.toString("base64"); + out.encoding = "base64"; + } + if (entry.contentType) out.contentType = entry.contentType; + return out; + } catch (err) { + console.error(JSON.stringify({ + status: "error", + message: `Failed to read file '${entry.path}': ${err.message}`, + })); + process.exit(1); + } +} diff --git a/cli/lib/deploy.mjs b/cli/lib/deploy.mjs index 05e07c75..a3881d90 100644 --- a/cli/lib/deploy.mjs +++ b/cli/lib/deploy.mjs @@ -255,6 +255,20 @@ async function loadManifest(opts) { } export async function run(args) { + // Subcommand dispatch (v1.34+): + // run402 deploy apply ... 
→ unified deploy primitive (deploy.apply) + // run402 deploy resume → resume an activation_pending operation + // run402 deploy --manifest … → legacy bundle deploy (still works) + const sub = args[0]; + switch (sub) { + case "apply": + case "resume": { + const { runDeployV2 } = await import("./deploy-v2.mjs"); + await runDeployV2(sub, args.slice(1)); + return; + } + } + const opts = { manifest: null, project: null }; for (let i = 0; i < args.length; i++) { if (args[i] === "--help" || args[i] === "-h") { console.log(HELP); process.exit(0); } diff --git a/cli/llms-cli.txt b/cli/llms-cli.txt index 4f1949d4..fe449024 100644 --- a/cli/llms-cli.txt +++ b/cli/llms-cli.txt @@ -94,7 +94,94 @@ run402 tier status ## Deploying Apps -### Bundle Deploy (recommended -- one command, full stack) +### Unified Deploy (v1.34+, recommended) + +The canonical deploy primitive. All bytes ride through CAS (no inline-body cap), supports both `replace` and `patch` semantics per resource, atomic multi-resource activation, and resumable recovery from partial failures. This is the path the SDK exposes as `r.deploy.apply(...)`. + +⚠️ You still need the `anon_key` BEFORE writing your manifest -- provision first, then embed the real key in your HTML. + +```bash +run402 projects provision --name "my-app" +# → copy anon_key from output into your HTML +``` + +Manifest format mirrors a `ReleaseSpec`: + +```json +{ + "project_id": "prj_1741340000_42", + "database": { + "migrations": [ + { + "id": "001_init", + "sql": "CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, title text NOT NULL); INSERT INTO items (title) VALUES ('Buy groceries');" + } + ], + "expose": { + "version": "1", + "tables": [ + { "name": "items", "expose": true, "policy": "public_read_authenticated_write" } + ] + } + }, + "secrets": { "set": { "OPENAI_API_KEY": { "value": "sk-..." 
} } }, + "functions": { + "replace": { + "api": { + "runtime": "node22", + "source": { "data": "export default async (req) => new Response('ok')" }, + "config": { "timeoutSeconds": 30, "memoryMb": 256 } + } + } + }, + "site": { + "replace": { + "index.html": { "data": "..." }, + "assets/logo.png": { "data": "iVBORw0KGgo...", "encoding": "base64" } + } + }, + "subdomains": { "set": ["my-app"] } +} +``` + +Apply it: + +```bash +run402 deploy apply --manifest app.json +``` + +Stdout returns `{ status: "ok", release_id, operation_id, urls, ... }`. Stderr streams structured progress events as JSON-line (one event per line). Pass `--quiet` to silence stderr. + +**Patch semantics — only the listed file changes**: + +```json +{ + "project_id": "prj_...", + "site": { "patch": { "put": { "index.html": { "data": "
v2
" } } } } +} +``` + +Or via `--spec` for a one-line CLI invocation: + +```bash +run402 deploy apply --spec '{"project_id":"prj_...","site":{"patch":{"delete":["old.html"]}}}' +``` + +**Recovery from a stuck deploy**: when an `apply` ends in `activation_pending` (rare; transient gateway failure between SQL commit and the pointer-swap activation), the gateway auto-resumes on the hourly tick. Or call resume explicitly: + +```bash +run402 deploy resume +``` + +The gateway re-runs only the failed phase forward — SQL is never replayed. + +**Migration registry**: each migration is identified by `(id, checksum)`. Re-shipping the same `id` + same SQL is a registry noop; same `id` + different SQL is a hard error (`MIGRATION_CHECKSUM_MISMATCH`). Ship idempotent migrations (`CREATE TABLE IF NOT EXISTS`, `ADD COLUMN IF NOT EXISTS` in a `DO` block) and re-deploys are free. + +--- + +### Bundle Deploy (legacy compat — still works) + +The legacy `run402 deploy --manifest` command continues to work and routes through the v2 primitive under the hood. Prefer `run402 deploy apply` for new manifests; the legacy form is kept for the deprecation window so existing scripts and CI pipelines don't break. ⚠️ You need the anon_key BEFORE writing your manifest. Run provision first, then write your HTML with the real key: diff --git a/core/src/config.ts b/core/src/config.ts index 136a4b6e..4c53a379 100644 --- a/core/src/config.ts +++ b/core/src/config.ts @@ -6,6 +6,17 @@ export function getApiBase(): string { return process.env.RUN402_API_BASE || "https://api.run402.com"; } +/** + * API base for the deploy-v2 routes. Defaults to the same value as + * `getApiBase()`. Set `RUN402_DEPLOY_API_BASE` to point only deploy traffic + * elsewhere — useful when running deploy-v2 against a staging gateway while + * the rest of the SDK still talks to production. In normal use callers + * should not need this override. 
+ */ +export function getDeployApiBase(): string { + return process.env.RUN402_DEPLOY_API_BASE || getApiBase(); +} + export function getConfigDir(): string { return process.env.RUN402_CONFIG_DIR || join(homedir(), ".config", "run402"); } diff --git a/docs/consultations/optimal-deploy-dx-for-coding-agents.md b/docs/consultations/optimal-deploy-dx-for-coding-agents.md new file mode 100644 index 00000000..71d8cfd1 --- /dev/null +++ b/docs/consultations/optimal-deploy-dx-for-coding-agents.md @@ -0,0 +1,877 @@ +# Consultation Result + +**Model**: gpt-5.5-pro +**Submitted**: 2026-04-28T14:24:57.310556 +**Completed**: 2026-04-28T14:38:54.787989 +**Status**: completed + +--- + +Build **one deploy system**: + +> **`deploy.apply(ReleaseSpec)` is the canonical primitive.** +> It accepts structured app intent, normalizes all byte payloads into CAS content refs, performs `plan → upload missing CAS objects → commit`, and gives agents one resumable operation with JSON progress/errors. + +Do **not** raise the `/deploy/v1` 50 MB limit. Retire inline base64 as a transport. + +--- + +## 1. 
Target SDK shape + +### Fresh full-stack deploy + +```ts +import { run402, files } from "@run402/sdk"; + +const run = run402(); + +const result = await run.deploy.apply({ + project: projectId, + base: "empty", // fail if a release already exists; use "current" for existing apps + + database: { + migrations: [ + { + name: "001_init", + sql: ` + CREATE TABLE items ( + id serial PRIMARY KEY, + title text NOT NULL, + done boolean DEFAULT false + ); + `, + }, + ], + expose: { + // new declarative auth/RLS manifest, not the deprecated rls template + tables: { + items: { read: "public", insert: "public", update: "public" }, + }, + }, + }, + + secrets: { + set: { + OPENAI_API_KEY: { value: openaiKey }, + }, + }, + + functions: { + replace: { + "api": { + runtime: "node22", + source: ` + export default async function handler(req) { + return Response.json({ ok: true }); + } + `, + config: { timeoutSeconds: 30, memoryMb: 256 }, + }, + }, + }, + + site: { + replace: files({ + "index.html": { data: html, contentType: "text/html; charset=utf-8" }, + "logo.png": { data: logoBytes, contentType: "image/png" }, + }), + }, + + subdomains: { set: ["my-app"] }, + + checks: [ + { name: "home loads", http: { path: "/", expect: { status: 200 } } }, + { name: "api health", http: { path: "/api", expect: { status: 200 } } }, + ], +}, { + onEvent: (event) => console.log(JSON.stringify(event)), +}); +``` + +### Patch one site file + +Only the new `index.html` bytes leave the machine. + +```ts +await run.deploy.apply({ + project: projectId, + base: "current", + site: { + patch: { + put: { + "index.html": { data: html, contentType: "text/html; charset=utf-8" }, + }, + }, + }, +}); +``` + +### Patch one function + +No site rebuild, no migration replay. 
+ +```ts +await run.deploy.apply({ + project: projectId, + base: "current", + functions: { + patch: { + set: { + "api": { + runtime: "node22", + source: newFunctionCode, + config: { timeoutSeconds: 10, memoryMb: 256 }, + }, + }, + }, + }, +}); +``` + +### Deploy a large directory in Node + +```ts +import { fileSetFromDir } from "@run402/sdk/node"; + +await run.deploy.apply({ + project: projectId, + base: "current", + site: { + replace: fileSetFromDir("dist"), + }, +}); +``` + +`fileSetFromDir()` should be lazy/streaming: hash from disk, upload from disk, never load a 2 GB site into memory. + +### Deploy from memory / V8 isolate + +```ts +import { files } from "@run402/sdk"; + +await run.deploy.apply({ + project: projectId, + base: "current", + site: { + replace: files({ + "index.html": htmlString, + "data.json": new Blob([JSON.stringify(data)], { + type: "application/json", + }), + }), + }, +}); +``` + +No filesystem assumption in the root SDK. + +--- + +## 2. Public API layers + +Expose three layers, but make agents use layer 1. + +```ts +// Layer 1: agent happy path +await run.deploy.apply(spec, opts); + +// Layer 2: debuggable/resumable operation +const op = await run.deploy.start(spec); +for await (const event of op.events()) console.log(event); +const result = await op.result(); + +// Layer 3: low-level protocol for CLI/debugging +const plan = await run.deploy.plan(spec); +await run.deploy.upload(plan, { onEvent }); +await run.deploy.commit(plan.id); +``` + +Also: + +```ts +await run.deploy.resume(operationId); +await run.deploy.status(operationId); +await run.deploy.getRelease(releaseId); +await run.deploy.diff({ from: releaseA, to: releaseB }); +``` + +`apps.bundleDeploy()` and `sites.deployDir()` should become wrappers over this. They should not use their current transports. + +--- + +## 3. Manifest and plan: use both + +The correct model is: + +1. **Manifest / ReleaseSpec**: declarative desired release or patch. +2. 
**Plan**: gateway diff + CAS upload negotiation + payment preflight. +3. **Commit**: durable server-side operation that stages, migrates, activates. + +So: one conceptual deploy primitive, implemented as plan/upload/commit. + +The SDK should hide this by default, but plan should be first-class for agents that need debugging. + +--- + +## 4. Wire model + +The wire manifest should contain **no file/function/source bytes**. It should contain content refs. + +Conceptually: + +```json +{ + "schema": "run402.deploy.v2", + "project_id": "prj_123", + "base": { "release": "current" }, + "resources": { + "site": { + "mode": "patch", + "put": { + "/index.html": { + "sha256": "abc...", + "size": 12345, + "content_type": "text/html; charset=utf-8" + } + } + }, + "functions": { + "set": { + "api": { + "runtime": "node22", + "entrypoint": "index.mjs", + "files": { + "index.mjs": { + "sha256": "def...", + "size": 999, + "content_type": "text/javascript" + } + }, + "config": { "timeout_seconds": 30, "memory_mb": 256 } + } + } + }, + "database": { + "migrations": [ + { + "id": "001_init", + "checksum": "sha256:...", + "sql_ref": { + "sha256": "ghi...", + "size": 456 + } + } + ] + } + } +} +``` + +The SDK can accept strings, `Uint8Array`, `Blob`, web streams, Node files, etc. But before planning, it normalizes byte payloads into: + +```ts +type ContentRef = { + sha256: string; + size: number; + contentType?: string; + integrity?: string; // sha256-... SRI form +}; +``` + +The deploy endpoint receives refs, not base64. + +--- + +## 5. 
Canonical byte transport + +Build one internal/public CAS content layer and use it everywhere: + +- deploy site files +- deploy function bundles/source +- deploy SQL payloads if large +- deploy manifest itself if manifest JSON exceeds the normal body limit +- blob storage uploads + +`blobs.put()` remains a storage API, but internally becomes: + +```txt +content.ensure(source) → storage.publish(key, contentRef, metadata) +``` + +Deploy becomes: + +```txt +manifest/spec → content refs → plan missing refs → upload missing refs → commit +``` + +Important details: + +### Upload modes + +Support three upload strategies behind one API: + +1. **Single PUT** for small/medium objects. +2. **Multipart PUT** for large objects. +3. **CAS pack upload** for many tiny files. + +The pack upload matters. A site with 20,000 tiny files should not require 20,000 presigned PUTs. The SDK should pack small missing objects into a content-addressed archive, upload one/few packs, and let the gateway unpack/promote each object after verifying checksums. + +### Manifest refs + +Keep `/deploy/v2/plans` small. If the normalized manifest is too large, the SDK uploads the manifest JSON itself as a CAS object first, then calls: + +```json +{ + "project_id": "prj_123", + "manifest_ref": { + "sha256": "...", + "size": 9000000, + "content_type": "application/vnd.run402.deploy-manifest+json" + } +} +``` + +### Server authoritative digest + +Do not make correctness depend on SDK/gateway canonicalization matching byte-for-byte. The gateway should compute and return the authoritative `manifest_digest`. + +The SDK may compute a local digest for caching/progress, but idempotency should be based on: + +- gateway-computed manifest digest +- project id +- base release +- optional client idempotency key + +--- + +## 6. Atomicity model + +Make deploy commits server-side, durable, and resumable. 
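One reason a deploy can be retried or resumed cheaply is that content refs are deterministic: hashing the same bytes yields the same ref, so a retry only needs to upload what the gateway still reports missing. Below is a minimal sketch of that idea, assuming Node's built-in `node:crypto`. `toContentRef` and `missingRefs` are hypothetical names for illustration, not SDK exports, and representing the gateway's missing list as an array of hex digests is an assumption (the diff only shows an empty `missing_content` array).

```ts
import { createHash } from "node:crypto";

// Shape taken from the ContentRef type in section 4 (the `integrity` field is omitted).
type ContentRef = { sha256: string; size: number; contentType?: string };

// Hypothetical helper (not an SDK export): hash in-memory bytes into a content ref.
function toContentRef(data: string | Uint8Array, contentType?: string): ContentRef {
  const bytes = typeof data === "string" ? new TextEncoder().encode(data) : data;
  const sha256 = createHash("sha256").update(bytes).digest("hex");
  const ref: ContentRef = { sha256, size: bytes.byteLength };
  if (contentType) ref.contentType = contentType;
  return ref;
}

// Hypothetical helper: keep only the refs whose digest the gateway reported missing.
// Assumption: the missing list is modeled here as plain hex digests.
function missingRefs(refs: ContentRef[], missingSha256: string[]): ContentRef[] {
  const missing = new Set(missingSha256);
  return refs.filter((ref) => missing.has(ref.sha256));
}

const first = toContentRef("<h1>hello</h1>", "text/html; charset=utf-8");
const retry = toContentRef("<h1>hello</h1>", "text/html; charset=utf-8");

// Same bytes produce the same ref, so a retry can skip objects already in CAS.
console.log(first.sha256 === retry.sha256);
// With an empty missing list there is nothing left to upload.
console.log(missingRefs([first], []).length);
```

A real implementation would presumably stream large files through the hash instead of buffering them, in line with the `fileSetFromDir()` note earlier in this document.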
+ +Current ordering is dangerous: + +```txt +migrations → RLS → secrets → functions → site → subdomain +``` + +New ordering should be: + +```txt +plan +upload missing content +validate everything +stage all non-DB resources +reserve domains +gate traffic if DB changes +run DB transaction +activate release pointers +clear gate +poll readiness +``` + +More concretely: + +1. **Validate** + - content exists + - manifest schema valid + - function names/routes valid + - subdomains available/reserved + - migration IDs/checksums sane + - payment/lease checked via x402 before large uploads if possible + +2. **Stage non-visible resources** + - build/stage function versions + - stage site deployment + - stage secret version set + - reserve subdomain + - prepare route table + - no public pointer changes yet + +3. **If database changes exist, enter strict deploy gate** + - temporarily gate project traffic at the run402 edge + - return `503 Retry-After` or queue short requests + - correctness beats zero downtime for agent deploys + - allow opt-in zero-downtime mode only for declared backward-compatible migrations + +4. **Run DB work transactionally** + - advisory lock per project DB + - migrations table with `{ id, checksum, applied_at, operation_id }` + - default: reject non-transactional statements + - apply migrations + expose/RLS in one transaction where possible + +5. **Activate** + - one control-plane transaction swaps active release pointers: + - site deployment + - function versions + - secret version set + - routes + - subdomain mapping + - clear traffic gate + +6. **Readiness** + - CDN/site copy polling + - function warmup/build logs + - optional smoke checks + +This eliminates the bad failure class: + +> SQL migration succeeded, then function deploy failed. + +Under v2, functions are staged before SQL runs. If function staging fails, SQL never ran. 
If SQL committed and the final activation failed, the project remains gated and `resume(operationId)` finishes activation without rerunning SQL. + +--- + +## 7. Retry / recovery behavior + +Every deploy gets: + +```ts +operationId +planId +manifestDigest +baseReleaseId +``` + +Retry rules: + +### Before DB migration + +Safe. No visible state changed. + +Repeating the same `deploy.apply(spec)` should: + +- reuse or recreate the plan +- skip already-present CAS objects +- restage only missing pieces + +### During transactional migration + +On SQL error: + +- transaction rolls back +- gate clears +- active release unchanged +- error points to migration id + statement/offset +- agent fixes SQL and redeploys + +### After DB commit but before activation + +The operation is `activation_pending`. + +- DB is migrated. +- New site/functions/secrets are already staged. +- Traffic remains gated. +- `deploy.resume(operationId)` activates and clears the gate. +- Repeating the same deploy does not replay migrations. + +### Non-transactional migration + +Default should be: reject. + +If users explicitly opt in: + +```ts +transaction: "none" +``` + +then failures enter: + +```txt +needs_repair +``` + +No blind replay. Return structured repair instructions. + +--- + +## 8. Resource semantics + +Use explicit replace/patch semantics. + +### Top-level absence + +If a resource is omitted, leave it untouched. + +```ts +await run.deploy.apply({ + project, + functions: { patch: { set: { api: fn } } }, +}); +``` + +This does not touch site, DB, secrets, or domains. + +### Replace + +Exact desired set for that resource. + +```ts +site: { replace: fileSetFromDir("dist") } +``` + +Files absent from `dist` are removed from the new site release. + +### Patch + +Modify only specified keys. + +```ts +site: { + patch: { + put: { "index.html": html }, + delete: ["old.html"], + }, +} +``` + +### Conflict handling + +Every plan captures a `baseReleaseId`. 
+ +Default behavior: + +- full replace: fail if active release changed before commit +- patch: auto-rebase if touched resources/paths are disjoint; otherwise fail with structured conflict diff + +Agents need this. Silent clobbering is poison. + +--- + +## 9. Non-byte resources + +Rule: + +> Bytes go through CAS. Semantics stay structured. + +### Site files + +CAS content refs plus path metadata. + +### Functions + +Function source/bundles are CAS content. Function config remains structured. + +Support both: + +```ts +functions: { + patch: { + set: { + api: { + source: "export default async req => new Response('ok')", + }, + }, + }, +} +``` + +and: + +```ts +functions: { + patch: { + set: { + api: { + entrypoint: "index.mjs", + files: fileSetFromDir("functions/api"), + }, + }, + }, +} +``` + +### Migrations + +SQL payloads can be CAS refs internally, but migration identity is structured: + +```ts +{ + id: "001_init", + checksum: "sha256:...", + sqlRef: ContentRef, + transaction: "required" +} +``` + +If the SDK user passes a raw SQL string, SDK converts it to content. + +Migration replay rules: + +- same id + same checksum: noop +- same id + different checksum: hard error +- new id: pending migration + +### Expose/RLS + +Declarative manifest. Gateway diffs/applies it. Store applied digest. + +### Secrets + +Do **not** globally CAS secrets. + +Secret values should be: + +- encrypted/write-only +- staged under the deploy plan +- redacted from logs/events/errors +- versioned, then activated with the release + +Small secret values can ride the control-plane request because they are not bulk bytes. If you later support large secret files, use private encrypted staging, not global CAS/dedup. + +--- + +## 10. Progress and errors + +Agents need structured event streams, not human CLI text. 
+ +```ts +type DeployEvent = + | { type: "plan.started" } + | { type: "plan.diff"; diff: DeployDiff } + | { type: "payment.required"; amount: string; asset: "USDC" } + | { type: "payment.paid"; tx?: string } + | { type: "content.hash.progress"; label: string; done: number; total: number } + | { type: "content.upload.skipped"; label: string; sha256: string; reason: "present" } + | { type: "content.upload.progress"; label: string; done: number; total: number } + | { type: "commit.phase"; phase: string; status: "started" | "done" } + | { type: "log"; resource: string; stream: "stdout" | "stderr"; line: string } + | { type: "ready"; releaseId: string; urls: Record<string, string> }; +``` + +Errors should look like: + +```json +{ + "code": "FUNCTION_BUILD_FAILED", + "phase": "stage.functions", + "resource": "functions.api", + "message": "Build failed: missing export default", + "retryable": false, + "operation_id": "op_123", + "plan_id": "plan_123", + "logs": [ + { "stream": "stderr", "line": "index.mjs:1: no default export" } + ], + "fix": { + "action": "edit_and_redeploy", + "path": "functions.api.source" + } +} +``` + +For SQL: + +```json +{ + "code": "MIGRATION_FAILED", + "phase": "database.migrate", + "resource": "database.migrations.001_init", + "message": "column \"title\" does not exist", + "statement_offset": 184, + "retryable": false, + "rolled_back": true +} +``` + +--- + +## 11. x402 placement + +Do payment preflight during `plan`, before uploading huge bytes. + +If a lease renewal/payment is needed, agents should learn that before uploading 2 GB. + +Then make `commit` idempotent. If commit hits 402 anyway, the SDK’s x402 fetch wrapper handles it, and errors include: + +```ts +PaymentRequired { + amount, + asset, + payTo, + allowancePath?, + operationId?, + planId? +} +``` + +--- + +## 12. What to do with existing APIs + +### `apps.bundleDeploy` + +Keep as compatibility sugar only. 
+ +```ts +apps.bundleDeploy(projectId, oldOpts) +``` + +should internally convert to: + +```ts +deploy.apply({ + project: projectId, + database: ..., + functions: ..., + site: ..., + secrets: ..., + subdomains: ..., +}); +``` + +It must not POST base64 files to `/deploy/v1`. + +### `sites.deployDir` + +Keep as sugar: + +```ts +sites.deployDir({ project, dir }) +``` + +becomes: + +```ts +deploy.apply({ + project, + site: { replace: fileSetFromDir(dir) }, +}); +``` + +### `blobs.put` + +Keep the API, but make it use the same CAS uploader underneath. + +--- + +## 13. Things other platforms get wrong + +Do not copy these patterns: + +1. **Client-side orchestration** + - `wrangler`, Supabase CLI, etc. do too much locally. + - If the process dies, state is ambiguous. + - run402 should make the gateway own deploy state. + +2. **Separate mutable APIs for env/domains/functions/sites** + - Vercel/Cloudflare split these. + - That creates release skew. + - run402 should version secrets/routes/functions/site together. + +3. **Filesystem-first DX** + - Human platforms assume a repo and a CLI. + - Agents often have strings, blobs, generated artifacts, or V8 memory. + - Filesystem support should be a Node convenience, not the primitive. + +4. **Text logs and dashboard-only debugging** + - Agents need JSON errors with resource paths, retryability, and fixes. + +5. **Opaque server builds** + - Agents need deterministic deploy artifacts and inspectable build errors. + - If you support builds, make them a staged resource with structured logs. + +6. **Blind migration replay** + - Migration identity/checksum must be first-class. + - Repeat deploys should be noops, not re-execution. + +--- + +## 14. Extra high-leverage agent DX + +Two improvements would make run402 meaningfully better than the incumbents. + +### A. Release-scoped public config + +Remove the “provision first, copy anon key into HTML” pitfall. 
+ +Serve a virtual file from every site: + +```txt +/.run402/config.json +``` + +containing: + +```json +{ + "project_id": "prj_123", + "anon_key": "...", + "api_base": "https://api.run402.com", + "release_id": "rel_123", + "functions": { + "api": "/api" + } +} +``` + +Then agents write portable HTML: + +```js +const cfg = await fetch("/.run402/config.json").then(r => r.json()); +``` + +This also improves CAS dedup because HTML no longer changes per project just to embed keys. + +### B. Same-origin function routes + +Let releases declare routes: + +```ts +routes: { + "/api": { function: "api" }, +} +``` + +Then static sites call `/api`, not a separate function URL. This avoids CORS, avoids hardcoded URLs, and makes site+function activation truly release-scoped. + +--- + +## 15. Concrete build order + +Build this in order: + +1. **Extract canonical CAS content service** + - single/multipart/pack uploads + - project-scoped presence + - manifest-ref support + - used by deploy and blobs + +2. **Add `/deploy/v2`** + - `POST /deploy/v2/plans` + - `POST /deploy/v2/plans/:id/commit` + - `GET /deploy/v2/operations/:id` + - `GET /deploy/v2/operations/:id/events` + - `POST /deploy/v2/operations/:id/resume` + +3. **Add release model** + - immutable releases + - active release pointer + - staged function/site/secret versions + - base release conflict detection + +4. **Add transactional commit state machine** + - stage non-DB first + - traffic gate for DB changes + - transactional migrations with ids/checksums + - pointer swap activation + - resumable failure states + +5. **Ship SDK `deploy.apply`** + - root isomorphic memory/blob/web-stream sources + - `/node` directory file sets + - progress events + - structured errors + - old APIs as wrappers + +6. 
**Update docs/MCP/CLI** + - one deploy command/tool + - no base64 manifest examples + - patch examples + - recovery/resume examples + +The end state: agents think in terms of “apply this release/patch,” not “which upload transport should I use?” Bytes always go through CAS. Structured resources stay typed. Deploys are resumable, debuggable, and atomically activated. + +--- +**Wall time**: 13m 57s +**Tokens**: 5,852 input, 29,458 output (24,706 reasoning), 35,310 total +**Estimated cost**: $5.4780 diff --git a/openspec/changes/unify-deployments/.openspec.yaml b/openspec/changes/unify-deployments/.openspec.yaml new file mode 100644 index 00000000..0a064c1e --- /dev/null +++ b/openspec/changes/unify-deployments/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-04-28 diff --git a/openspec/changes/unify-deployments/design.md b/openspec/changes/unify-deployments/design.md new file mode 100644 index 00000000..c9b8120a --- /dev/null +++ b/openspec/changes/unify-deployments/design.md @@ -0,0 +1,499 @@ +## Context + +Three upload/deploy transports exist today: + +1. **`apps.bundleDeploy` → `POST /deploy/v1`** — atomic multi-resource (DB + RLS + secrets + functions + site + subdomain), inline base64 in JSON body, hard 50 MB ceiling at the gateway (`express.json({ limit: "50mb" })` in `packages/gateway/src/server.ts:227`). Non-transactional under the hood: orchestrates migrations → RLS → secrets → functions (parallelized) → site → subdomain with no rollback if a later step fails. +2. **`sites.deployDir` → `POST /deploy/v1/plan` + presigned PUTs to S3 + `POST /deploy/v1/commit` + `GET /deployments/v1/:id`** — site-only, Node-only, content-addressed (gateway dedupes by SHA-256), multipart, resumable on URL expiry. Implemented in `sdk/src/node/sites-node.ts:144`. +3. **`blobs.put` → `POST /storage/v1/uploads` + presigned PUTs + `POST /storage/v1/uploads/:id/complete`** — single-asset, isomorphic, content-addressed, multipart. 
Implemented in `sdk/src/namespaces/blobs.ts:268`. + +Two of the three are already CAS-based; the high-value atomic-multi-resource path is the laggard. Inline base64 + 50 MB cap is documented in the gateway with `// /deploy/v1 — bundle deploy (still inline-bytes; carries files inline)` — the team knows. + +Constraints: + +- **Isomorphic SDK** — the kernel must run in V8 isolates (`@run402/sdk` is wired for code-mode MCP / sandbox runtimes; no `node:fs`, no Node streams). Plan/commit + presigned S3 PUT works in isolates; directory walking does not. +- **Atomic multi-resource is the differentiator.** Vercel/Cloudflare/Supabase don't ship one transactional call for DB + functions + site. Losing this loses the platform's whole agent-DX story. +- **Agents generate content in memory.** A primitive that demands a filesystem rules out the most important runtime (sandboxed code-mode agent producing files in V8 memory). +- **Partial failure is a current bug.** Migrations succeeded, function deploy failed → orphaned half-state with no recovery primitive other than re-calling the same orchestration. +- **x402 payment** flows through the `Run402Client.fetch` wrapper; deploys may negotiate payment for tier-renewal mid-call. Whatever we design must surface 402 cleanly and ideally before bytes move. + +Stakeholders: gateway team (state machine + new tables), SDK team (new namespace + Node helpers), MCP/CLI (rewrite tools as thin wrappers), agent-DX docs (`llms-cli.txt`). + +## Goals / Non-Goals + +**Goals:** + +- One canonical SDK primitive `deploy.apply(ReleaseSpec)` that covers every shape today's three transports cover, plus partial-update workflows (patch one file, patch one function) that don't exist today. +- Every byte payload travels through CAS. There is no inline-bytes path in the v2 wire protocol. +- Server-owned, resumable deploy operations with structured progress events and structured errors. After-DB-commit-before-activation must be recoverable without replaying SQL. 
+- Replace vs patch semantics per resource, with explicit base-release conflict detection on full-replace deploys. +- Migration registry that makes redeploys noop-safe and makes "same id different checksum" a hard error. +- Backward compatibility: existing `apps.bundleDeploy` and `sites.deployDir` callers see no API break in this change. The shape they POST is preserved as a v1 shim. +- The `/deploy/v1` inline-bytes acceptance is removed in a follow-up minor (mirrors the v1.32 site-deploy cutover precedent). + +**Non-Goals:** + +- CAS pack uploads (single archive carrying many small objects). Worth doing — sites with 20k tiny files shouldn't issue 20k presigned PUTs — but layered after v2 lands and not on the critical path. +- Virtual `/.run402/config.json` per-site config endpoint. Solves a real agent pitfall ("provision first, embed anon_key in HTML, then deploy") but is orthogonal to the deploy primitive. +- Same-origin function routes (`routes: { "/api": { function: "api" } }`). Belongs to a release-manifest follow-up that depends on this change but isn't part of it. +- Server-side build steps (Vercel-style). Out of scope; agents pre-build and ship artifacts. +- Multi-region deploy fan-out, blue/green canaries beyond the basic activation gate. + +## Decisions + +### D1. One primitive at the SDK level, three layers exposed + +Single canonical call: + +```ts +await run.deploy.apply(spec, { onEvent }); // L1: agent happy path +const op = await run.deploy.start(spec); // L2: resumable op + event stream +const plan = await run.deploy.plan(spec); // L3: low-level +await run.deploy.upload(plan, { onEvent }); +await run.deploy.commit(plan.id); +await run.deploy.resume(op.id); +``` + +Most agents use L1. L2 exists for streaming/long-running work and progress UIs. L3 is for the CLI's debugging surface and for tests. + +**Why not just expose `apply` and hide the rest?** Coding agents iterate. 
They want to see the diff before paying, see the missing-bytes count, debug a partial failure. Exposing plan/upload/commit individually is cheap and cuts the "magic black-box" failure mode. + +**Alternatives considered:** keep `bundleDeploy` and `deployDir` as separate primitives and just swap their internal transports — rejected because it preserves the conceptual fork; agents still have to choose which path matches their case. Unifying the surface is the entire point. + +### D2. Wire protocol: `/deploy/v2/plans` + `/deploy/v2/plans/:id/commit` + operation lookups + +``` +POST /deploy/v2/plans + body: ReleaseSpec (with ContentRef objects, never inline bytes) + → { plan_id, operation_id, base_release_id, manifest_digest, missing_content: [{sha256, size, parts:[{url, byte_start, byte_end, part_number}]}], diff: { ... }, payment_required?: { amount, asset, payTo } } + +PUT (per missing content ref, possibly per part) + headers: { x-amz-checksum-sha256: } + body: + +POST /deploy/v2/plans/:id/commit + body: { idempotency_key? } + → { operation_id, status: "running" | "activation_pending" | "ready" | "failed", release_id?, urls?, error? } + +GET /deploy/v2/operations/:id → operation snapshot +GET /deploy/v2/operations/:id/events → event stream (SSE or paginated polling) +POST /deploy/v2/operations/:id/resume → finishes activation if status === "activation_pending" +``` + +The body limits stay sane: + +- `/deploy/v2/plans` — 5 MB JSON (manifest only, no bytes). For huge manifests, see D5 (manifest-ref). +- `/deploy/v2/plans/:id/commit` — 1 MB. +- `/deploy/v2/operations/*` — 1 MB. + +Bytes always go direct-to-S3 via presigned URLs. The gateway never carries deploy bytes. + +**Alternatives considered:** put everything on `/deploy/v2` as a single endpoint (v1 style) — rejected because plan/commit separation is the only way to negotiate dedup before bytes move and the only way to do payment preflight before bytes move. + +### D3. 
ReleaseSpec shape — explicit replace vs patch per resource + +```ts +interface ReleaseSpec { + project: string; + base?: { release: "current" | "empty" } | { release_id: string }; + database?: { migrations?: MigrationSpec[]; expose?: ExposeManifest }; + secrets?: { set?: Record; delete?: string[]; replace_all?: Record }; + functions?: { replace?: Record; patch?: { set?: Record; delete?: string[] } }; + site?: { replace?: FileSet } | { patch?: { put?: FileSet; delete?: string[] } }; + subdomains?: { set?: string[]; add?: string[]; remove?: string[] }; + routes?: RouteSpec; // forward-compat for same-origin routing follow-up + checks?: SmokeCheck[]; +} + +type ContentRef = { sha256: string; size: number; contentType?: string; integrity?: string }; +type FileSet = Record; +type FunctionSpec = { runtime: "node22"; entrypoint?: string; source?: ContentRef; files?: FileSet; config?: { timeoutSeconds?: number; memoryMb?: number }; schedule?: string | null }; +type MigrationSpec = { id: string; checksum: string; sql_ref: ContentRef; transaction?: "required" | "none" }; +``` + +Top-level absence = leave that resource untouched. `replace` = the spec is the new desired state for that resource (anything not listed is deleted in the new release). `patch` = surgical updates that touch only listed keys. + +**Why not Kubernetes-style strategic-merge-patch?** Too clever, bad agent ergonomics. The two-mode shape (replace vs patch.put/delete) is unambiguous and easy to reason about. + +**Why force migrations to have explicit ids?** Migration replay is the second-most-painful thing in deploys. Agents that can't tell whether a re-deploy is a noop or a re-execution write defensive `IF NOT EXISTS` everywhere. With ids + checksums, the gateway answers definitively. + +### D4. 
SDK byte sources — "anything goes," normalized to ContentRef before plan + +The SDK accepts: + +- `string` — UTF-8 text +- `Uint8Array` / `ArrayBuffer` +- `Blob` (web) / `File` +- web `ReadableStream` (for streaming hashes) +- `fileSetFromDir(path)` — Node-only, lazy: hashes from disk, uploads from disk, never loads a 2 GB site into memory + +Normalization runs locally before the plan request. Each source emits `{ contentType?: string, sha256: string, size: number }`, plus a deferred reader for the upload phase. The plan request body never contains bytes. + +**Why a separate `files()` helper for the in-memory case?** Sandboxes / V8 isolates have no filesystem; agents generate HTML and JSON in memory. Forcing them to write to a temp dir to deploy is hostile DX. + +**Why does `fileSetFromDir` belong in `@run402/sdk/node`?** It's the only Node-specific piece. Keeping it out of the root SDK keeps the isomorphic kernel pure. + +### D5. Manifest-ref escape hatch + +When the normalized ReleaseSpec exceeds the plan body cap (5 MB), the SDK uploads the JSON itself as a CAS object via the same `cas-content` service, then sends: + +```json +{ "project_id": "prj_...", "manifest_ref": { "sha256": "...", "size": 9000000, "content_type": "application/vnd.run402.deploy-manifest+json" } } +``` + +The gateway fetches it, validates, and proceeds as if it had been inlined. No body-size cliff anywhere in v2. + +**Why CAS the manifest itself?** Reuses the same content service. No new code path. Eliminates the worry that a site with hundreds of file paths might run out of headroom. + +### D6. Server-authoritative manifest digest + +The SDK computes a local manifest digest (RFC 8785 JCS canonicalize SHA-256, same algorithm as today's `sdk/src/node/canonicalize.ts:65`) for caching and progress UX. 
Idempotency at the gateway is keyed on: + +``` +(project_id, gateway_computed_manifest_digest, base_release_id, optional client_idempotency_key) +``` + +**Why not depend on byte-for-byte client/server canonicalize match?** That's the current fragility — one drift between SDK canonicalize and gateway canonicalize and the SDK's hash silently doesn't match, so retries create new plans instead of finding existing ones. Letting the gateway own the authoritative digest removes the failure mode entirely; the SDK's local digest becomes a UX nicety. + +### D7. CAS content service — expose the existing v1.32 substrate + +The gateway already ships the CAS substrate end-to-end as of v1.32 (see CLAUDE.md "Content-addressed storage (CAS) — v1.32"): + +- `internal.content_objects (sha256 BYTEA(32) PK, s3_key, size_bytes, orphaned_at, deleting_at)` — the global storage row. One S3 object per SHA, ever, across the whole platform. Storage-shared by design. +- `internal.deploy_plans (id, project_id, manifest, manifest_digest BYTEA, expires_at, committed_at)` with unique index `(project_id, manifest_digest) WHERE committed_at IS NULL` — the plan substrate, already idempotent. +- `internal.plan_claims (plan_id, project_id, sha256, completed_at)` — the cross-project commit-existence oracle closure. A SHA is "satisfied" for a project only if that project's prior plan completed an upload referencing it (or the project already references the SHA via `blobs`/`deployment_files`). This is how presence is project-scoped without re-storing bytes. +- `internal.upload_sessions` extended in v1.32 with `kind` (`'blob'` | `'cas'`), `staging_key`, and FK `plan_id` → `deploy_plans` for `kind='cas'` sessions. +- `services/cas-promote.ts` — the staging→CAS promote flow with size + SHA verify, S3 CopyObject (≤5 GiB) / UploadPartCopy (>5 GiB), idempotent identity INSERT, concurrent-same-hash safety. 
+- `services/copy-resume.ts` — durable Stage-2 resume worker (5-minute lock window, 10-attempt cap, hourly tick from `services/leases.ts`). +- `services/cas-metrics.ts` + CDK metric filters → `Run402/CAS` namespace + alarms. +- AFTER INSERT/DELETE/UPDATE-OF-content_sha256 triggers on `blobs` and `deployment_files` are the **sole** writers of `internal.projects.storage_bytes`. + +This change does **not** add a new `cas_objects` table or a parallel storage layer. It exposes the existing substrate over a generic content route: + +- `POST /content/v1/plans` — accepts `{ project_id, content: ContentRef[] }`, returns missing-with-presigned-PUTs. Internally creates an `upload_sessions` row per missing entry with `kind='cas'` and the supplied `plan_id`. Reuses `services/deploy-plans.ts` for the plan-row primitive. +- `POST /content/v1/plans/:id/commit` — finalizes by promoting any not-yet-promoted staging objects via `services/cas-promote.ts`, marks the plan `committed_at`. Equivalent to today's `/storage/v1/uploads/:id/complete` for blobs, generalized. + +Project-scoped *presence* is the spec contract; project-scoped *storage* is not. The privacy guarantee ("project B cannot infer that project A has uploaded SHA X") is enforced by `plan_claims` + the per-project ref join, not by re-storing bytes per project. Client-observable behavior matches the spec scenarios; the implementation is the v1.32 design we already operate in production. + +**Adapters, not duplicates.** `/storage/v1/uploads` and `/deploy/v1/plan` (the existing site CAS route) become thin adapters over `/content/v1/plans`. Public route shapes unchanged for the deprecation window. One internal substrate; three public routes during transition; one route after sunset. + +**Why not a new project-scoped table?** Two reasons. (1) v1.32's per-reference billing (storage_bytes triggers) is already correct; reintroducing per-project rows would split the trigger story. 
(2) S3 storage doubles for every cross-project shared SHA (logos, common bundles, the same `react.production.min.js`), hitting our cost line for no privacy gain — the privacy guarantee is already met without it. + +### D8. Release model — immutable releases, typed staging tables, atomic activation + +All schema below lives in the `internal` schema, with `REVOKE ALL ON ... FROM authenticator, anon, authenticated, service_role, project_admin` at create time (matches the dark-by-default convention for `internal.content_objects` / `internal.deploy_plans` / `internal.plan_claims`). All `*_digest` and `*_checksum` columns are `BYTEA(32)` to match the rest of the SHA-256 surface in the schema. All ids follow the existing `__` convention from v1.32 deployment ids. + +```sql +CREATE TABLE internal.releases ( + id TEXT PRIMARY KEY, -- rel__ + project_id TEXT NOT NULL REFERENCES internal.projects(id) ON DELETE CASCADE, + parent_id TEXT REFERENCES internal.releases(id) ON DELETE SET NULL, + manifest_digest BYTEA NOT NULL CHECK (length(manifest_digest) = 32), + manifest_ref BYTEA REFERENCES internal.content_objects(sha256) ON DELETE RESTRICT, -- if manifest was uploaded via /content/v1/plans + manifest_json JSONB, -- small manifests stored inline; null if manifest_ref is set + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + created_by TEXT NOT NULL, -- wallet address or service principal + status TEXT NOT NULL CHECK (status IN ('staged','active','superseded','failed')), + activated_at TIMESTAMPTZ, + superseded_at TIMESTAMPTZ, + CHECK ((manifest_ref IS NULL) <> (manifest_json IS NULL)) +); +CREATE INDEX releases_project_active_idx ON internal.releases (project_id) WHERE status = 'active'; +CREATE INDEX releases_project_created_idx ON internal.releases (project_id, created_at DESC); + +CREATE TABLE internal.deploy_operations ( + id TEXT PRIMARY KEY, -- op__ + project_id TEXT NOT NULL REFERENCES internal.projects(id) ON DELETE CASCADE, + plan_id TEXT NOT NULL REFERENCES 
internal.deploy_plans(id) ON DELETE RESTRICT, + base_release_id TEXT REFERENCES internal.releases(id) ON DELETE SET NULL, + target_release_id TEXT REFERENCES internal.releases(id) ON DELETE SET NULL, + status TEXT NOT NULL CHECK (status IN ( + 'planning','uploading','committing', + 'staging','gating','migrating','schema_settling', + 'activating','activation_pending','needs_repair', + 'ready','failed','rolled_back')), + payment_required JSONB, + error JSONB, + last_activate_attempt_at TIMESTAMPTZ, -- driven by auto-resume worker (D12) + activate_attempts INT NOT NULL DEFAULT 0, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); +CREATE INDEX deploy_operations_project_idx ON internal.deploy_operations (project_id, created_at DESC); +CREATE INDEX deploy_operations_resume_idx ON internal.deploy_operations (last_activate_attempt_at) + WHERE status = 'activation_pending'; +CREATE INDEX deploy_operations_gc_idx ON internal.deploy_operations (updated_at) + WHERE status IN ('failed','rolled_back'); + +CREATE TABLE internal.applied_migrations ( + project_id TEXT NOT NULL REFERENCES internal.projects(id) ON DELETE CASCADE, + migration_id TEXT NOT NULL, + checksum BYTEA NOT NULL CHECK (length(checksum) = 32), + applied_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + operation_id TEXT NOT NULL REFERENCES internal.deploy_operations(id) ON DELETE RESTRICT, + PRIMARY KEY (project_id, migration_id) +); +``` + +**Typed staging tables instead of one JSONB blob.** A single `staged_resources(kind, ref, config JSONB)` blob would be a query and observability hazard the moment we need to GC stale Lambda versions (AWS limits 75 versions/function — easy to exhaust at agent-deploy cadence) or audit a stuck operation. 
Three typed tables with proper FKs: + +```sql +CREATE TABLE internal.staged_function_versions ( + operation_id TEXT NOT NULL REFERENCES internal.deploy_operations(id) ON DELETE CASCADE, + function_name TEXT NOT NULL, + lambda_version TEXT NOT NULL, -- AWS Lambda version (e.g. "42") + source_sha256 BYTEA NOT NULL REFERENCES internal.content_objects(sha256) ON DELETE RESTRICT, + config JSONB NOT NULL, -- runtime, timeoutSeconds, memoryMb, schedule + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + PRIMARY KEY (operation_id, function_name) +); + +CREATE TABLE internal.staged_deployments ( + operation_id TEXT NOT NULL REFERENCES internal.deploy_operations(id) ON DELETE CASCADE, + deployment_id TEXT NOT NULL REFERENCES internal.deployments(id) ON DELETE RESTRICT, + PRIMARY KEY (operation_id) +); + +CREATE TABLE internal.staged_secret_sets ( + operation_id TEXT NOT NULL REFERENCES internal.deploy_operations(id) ON DELETE CASCADE, + secret_version_id TEXT NOT NULL, -- opaque id from secrets layer + PRIMARY KEY (operation_id) +); +``` + +**Activation pointer.** Today no `live_deployment_id` column exists; the implicit live deployment is the newest `status='ready'` row per project. The activate phase introduces explicit pointers in a single transaction: + +- `internal.releases.status = 'active'` for the new release; previous flips to `'superseded'` (use the partial unique constraint `... WHERE status = 'active'` enforced by index, with the swap done in one UPDATE that flips both rows). +- `internal.projects.live_release_id` (new column, FK to `releases.id`) — denormalized for hot-path lookup. +- Function alias swaps at AWS (UpdateAlias to point at the staged Lambda version). +- Subdomain mapping update (existing subdomain row → new deployment_id). +- Secret version-set pointer flip. + +**Rollback** = create a new release whose `manifest_json` (or `manifest_ref`) is the parent's, then run activate on it. 
Cheap, no actual byte move; one new `releases` row marks the rollback in history. + +**Storage-bytes accounting on release-only refs.** Two new triggers (matching the v1.32 pattern in `tg_blobs_storage_bytes` / `tg_deployment_files_storage_bytes`) on `staged_function_versions` and `releases.manifest_ref`. CAS GC's "no refs" union extends to count `staged_function_versions.source_sha256`, `releases.manifest_ref`, and any future release-retained CAS reference. Without this, function-source CAS objects retained by a `superseded` release are billed at $0 and reaped by GC after 30 days, breaking historical rollback. + +### D9. Commit state machine — transactional, with an explicit schema-cache settle + +Phase order: + +``` +1. validate (manifest schema, content present, payment preflight, subdomain available, + migration ids/checksums sane, base-release conflict check) +2. stage (build/stage Lambda function versions; stage site deployment via existing + `services/deploy-commits.ts` mechanism; stage secret version set; + reserve subdomain; insert rows into staged_function_versions / + staged_deployments / staged_secret_sets — no public pointer changes) +3. migrate-gate (only if database.migrations is non-empty: set + `internal.projects.migrate_gate_until = NOW() + INTERVAL '60s'`; + edge middleware returns 503 + Retry-After for control plane AND for + /rest/v1/* — see "Gate scoping" below) +4. migrate (acquire pg_advisory_xact_lock per project; single transaction; + SET search_path TO ; for each migration in spec order: + check applied_migrations row; same id+checksum noop; same id different + checksum hard error; new id apply + INSERT applied_migrations row; + apply expose/RLS manifest in same transaction; + NOTIFY pgrst, 'reload schema'; COMMIT) +5. schema-settle (wait for PostgREST to pick up the schema reload — see D11 for mechanism) +6. 
activate (single control-plane transaction: flip releases.status; update + projects.live_release_id; flip Lambda alias to staged versions; + update subdomain → deployment_id mapping; flip secret version set; + clear migrate_gate_until) +7. ready (poll site copy if `copying` via existing `services/copy-resume.ts`; + warm function cold-starts opportunistically; run smoke checks if defined; + mark operation ready) +``` + +**Why a separate migrate-gate, not the lifecycle gate?** The lifecycle gate (`packages/gateway/src/middleware/lifecycle-gate.ts`) returns 402 for control-plane writes during `frozen`/`dormant`. The migrate-gate is a different beast: short-lived (typically <30s, capped at 60s), 503 + Retry-After, and **explicitly carves out from the standing data-plane invariant** (CLAUDE.md: "data plane is never gated"). The carve-out is justified because the schema is mid-flight — PostgREST cache is stale, RLS policies are being rewritten, the data plane returning stale results during this window would be worse than returning 503. The carve-out applies only while `migrate_gate_until > NOW()`; it cannot be set by lifecycle. + +**Why the schema-settle phase?** PostgREST picks up DDL via the `NOTIFY pgrst, 'reload schema'` channel and reloads asynchronously. Production observed reloads taking 1–6 seconds under back-to-back DDL load — this is the bug fixed in commit 4c65102b ([packages/gateway/src/services/postgrest-forward.ts](packages/gateway/src/services/postgrest-forward.ts)) by bumping the retry budget to 12×500ms. If we clear the migrate-gate immediately after migrate-COMMIT, the thundering herd of `/rest/v1/*` requests hits stale schema and re-introduces the bug. The schema-settle phase issues a canary `SELECT` that exercises the new schema (an information_schema lookup of one of the tables touched by the migration) with a short retry loop bounded by the same 6s budget. Once the canary succeeds, schema cache is warm; only then do we activate and clear the gate. 
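
The migrate-gate rule described above (503 + Retry-After only while `migrate_gate_until > NOW()`, covering control-plane writes and `/rest/v1/*`, with operation polling and resume left open) can be sketched as a pure decision function. This is a minimal sketch under stated assumptions, not the gateway's middleware: `GateRequest`, `GateRejection`, `checkMigrateGate`, and the prefix lists are hypothetical names, and deriving Retry-After from the remaining gate window is an assumption.

```ts
// Hypothetical request shape for illustration; the real middleware types may differ.
interface GateRequest {
  method: string;
  path: string;
}

interface GateRejection {
  status: 503;
  retryAfterSeconds: number;
}

// Paths gated for every method while the schema is mid-flight.
const GATED_ANY_METHOD_PREFIXES = ["/rest/v1/", "/storage/v1/blob-internal"];

// Control-plane write prefixes taken from the route table in this design (POST only).
const CONTROL_PLANE_WRITE_PREFIXES = ["/deploy/v2/plans", "/content/v1/plans"];

// Operation polling and resume stay open so an in-flight deploy can be observed and finished.
const OPEN_OPERATION_ROUTE = /^\/deploy\/v2\/operations\/[^/]+(\/events|\/resume)?$/;

function checkMigrateGate(
  req: GateRequest,
  migrateGateUntil: Date | null,
  now: Date,
): GateRejection | null {
  // The carve-out only exists while migrate_gate_until > NOW().
  if (migrateGateUntil === null || migrateGateUntil.getTime() <= now.getTime()) return null;
  if (OPEN_OPERATION_ROUTE.test(req.path)) return null;

  const gatedAnyMethod = GATED_ANY_METHOD_PREFIXES.some((p) => req.path.startsWith(p));
  const gatedWrite =
    req.method === "POST" && CONTROL_PLANE_WRITE_PREFIXES.some((p) => req.path.startsWith(p));
  if (!gatedAnyMethod && !gatedWrite) return null;

  // Assumption: Retry-After is the remaining gate window, rounded up to whole seconds.
  const remainingMs = migrateGateUntil.getTime() - now.getTime();
  return { status: 503, retryAfterSeconds: Math.ceil(remainingMs / 1000) };
}
```

Keeping the decision free of I/O means the gate can be unit-tested against a fixed clock, independently of how `migrate_gate_until` is loaded per project.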
+ +**Idempotency-Key on commit.** Reuse the existing `internal.idempotency_keys` middleware ([packages/gateway/src/middleware/idempotency.ts](packages/gateway/src/middleware/idempotency.ts)) on `POST /deploy/v2/plans/:id/commit` — the response cache makes the "agent retried because the network blipped" case naturally safe. + +Failure handling: + +- Phases 1–2: safe to retry. `deploy.apply(spec)` with the same digest hits the same plan via the existing `(project_id, manifest_digest) WHERE committed_at IS NULL` unique index on `internal.deploy_plans`; uploads dedup; staging restages only what's missing. +- Phase 3 (set gate): idempotent UPDATE; no rollback needed. +- Phase 4 SQL error: transaction rolls back; in finally block, clear `migrate_gate_until = NULL`; structured error carries `migration_id` + statement offset; active release unchanged. +- Phase 4 succeeds, phase 5 (schema-settle) times out: gate stays up; operation enters `schema_settling`; auto-resume worker (D12) retries the settle + activate phases. Schema-settle is replayable — it doesn't write. +- Phase 6 (activate) fails: operation enters `activation_pending`; migrations remain committed; staging resources remain; gate stays up; auto-resume worker completes activation without SQL replay. +- Non-transactional migrations (`transaction: "none"`): explicit opt-in. On failure, operation enters `needs_repair` with structured repair instructions. No blind replay. + +**Why advisory lock and not a status guard?** Two concurrent commits for the same project would both try to migrate. `pg_advisory_xact_lock(hashtext($project_id))` matches the existing convention in `services/bundle.ts:344` and is held for the full transaction (auto-released at COMMIT). The lock key collision space is 32 bits; FNV-1a or hashtext both work — match the existing call site for consistency. + +### D10. 
Backward-compat shims — three v1 routes folded onto v2 + +Three existing public routes get shimmed; all three keep their request/response shapes for the deprecation window. + +**`apps.bundleDeploy` (SDK) → `POST /deploy/v1` (gateway).** + +```ts +async bundleDeploy(projectId, opts) { + const spec = translateBundleOptsToReleaseSpec(projectId, opts); + const result = await this.client.deploy.apply(spec); + return shapeAsBundleDeployResult(result); +} +``` + +Translation: + +- `migrations: string` → `database.migrations: [{ id: "bundle_legacy_", checksum: sha256(sql), sql_ref: , transaction: "required" }]`. **Deterministic id from the SQL content**, not a timestamp — so re-shipping identical SQL collapses to a noop via the registry. Behavior change vs. v1 (`runMigrations` re-executes on every call): documented release note. Idempotent SQL (`CREATE TABLE IF NOT EXISTS`) is unaffected; non-idempotent SQL gets safer (no accidental re-execution). +- `rls: { template, tables }` → `database.expose: ` via the existing translator at [packages/gateway/src/services/bundle.ts:682](packages/gateway/src/services/bundle.ts:682) (`translateRlsToManifest`). +- `secrets: [{key, value}]` → `secrets.set: { [key]: { value } }`. +- `functions: [{name, code, ...}]` → `functions.replace: { [name]: { runtime: "node22", source: , config, schedule } }`. +- `files: SiteFile[]` → SDK reads bytes (decoding base64 if needed), uploads to CAS via `/content/v1/plans`, builds `site.replace: FileSet`. **Empty/missing `files` with no `inherit` flag stays empty/missing in the spec → `site` is omitted → site is left untouched** (matches today's bundleDeploy semantics; v1 only wipes the site when explicitly given empty `files: []` with `inherit: false`). +- `subdomain: string` → `subdomains.set: [string]` (single-element; multi-subdomain is out of scope, see D13). +- `inherit: true` → ignored + deprecation warning. v2 patch semantics from per-resource omission already cover the use case. 
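
A few of the translation rules in the list above (secrets, subdomain, `inherit`, and "omitted stays omitted") can be illustrated with a small pure function. This is a sketch only: `LegacyBundleOpts`, `ReleaseSpecSketch`, and `translateLegacyOpts` are hypothetical, trimmed-down stand-ins for the real SDK types and for `translateBundleOptsToReleaseSpec`, and the `project` field on the spec is an assumption. Migrations, functions, and files are left out because they need CAS uploads.

```ts
// Hypothetical, trimmed-down shapes; the real SDK types carry more fields.
interface LegacyBundleOpts {
  secrets?: { key: string; value: string }[];
  subdomain?: string;
  inherit?: boolean;
}

interface ReleaseSpecSketch {
  project: string; // assumption: the spec carries the project id
  secrets?: { set: Record<string, { value: string }> };
  subdomains?: { set: string[] };
}

function translateLegacyOpts(
  projectId: string,
  opts: LegacyBundleOpts,
): { spec: ReleaseSpecSketch; warnings: string[] } {
  const spec: ReleaseSpecSketch = { project: projectId };
  const warnings: string[] = [];

  // secrets: [{key, value}] becomes secrets.set: { [key]: { value } }
  if (opts.secrets !== undefined) {
    const set: Record<string, { value: string }> = {};
    for (const { key, value } of opts.secrets) set[key] = { value };
    spec.secrets = { set };
  }

  // subdomain: string becomes a single-element subdomains.set
  if (opts.subdomain !== undefined) {
    spec.subdomains = { set: [opts.subdomain] };
  }

  // inherit: true is ignored, with a deprecation warning
  if (opts.inherit === true) {
    warnings.push("`inherit: true` is deprecated and ignored; omit a resource to leave it untouched.");
  }

  // Resources absent from opts stay absent from the spec, so they are left untouched.
  return { spec, warnings };
}
```

Returning warnings as data rather than logging them directly lets the caller decide whether to emit the one-time console warning or attach it to the response.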
+ +**`sites.deployDir` (SDK) → `POST /deploy/v1/plan` + `POST /deploy/v1/commit` (gateway).** Today these are the existing site-CAS plan/commit routes ([packages/gateway/src/routes/deploy.ts](packages/gateway/src/routes/deploy.ts), [packages/gateway/src/services/deploy-plans.ts](packages/gateway/src/services/deploy-plans.ts), [packages/gateway/src/services/deploy-commits.ts](packages/gateway/src/services/deploy-commits.ts)) — they already use the same `internal.deploy_plans` / `internal.upload_sessions` / `internal.deployment_files` substrate. Folding them into v2: + +- `POST /deploy/v1/plan` → translates the legacy `{ project, files: [{ path, sha256, size, content_type }] }` body into a v2 ReleaseSpec with `site: { replace: }`, calls `services/deploy-v2.ts:planDeploy`, reshapes the response to the legacy `{ plan_id, missing: [...] }` shape. +- `POST /deploy/v1/commit` → finds the operation linked to the plan_id, runs `services/deploy-v2.ts:commitDeploy`, reshapes to legacy `{ deployment_id, url }` on success. + +Kill-switch env vars `DEPLOY_V1_BUNDLE_ROUTE_THROUGH_V2` and `DEPLOY_V1_SITE_ROUTE_THROUGH_V2` (both default `true`) so each path can fall back to legacy code if the shim has bugs in production. Phase B sets the deprecation headers on all three legacy routes. Phase C returns 410 Gone on `/deploy/v1` with non-empty `files` AND on the legacy `/deploy/v1/plan` + `/deploy/v1/commit` site CAS routes — at that point `/deploy/v2/plans` is the only public surface. + +**Why fold all three at the same time?** Two parallel commit paths against the same `internal.deploy_plans` substrate is a divergence trap. The migration to v2 is cheaper to do once than twice; the kill-switch vars give us per-path rollback during the canary week. + +**One-minor cycle.** Outside callers (forks of the project, integrators not using our SDK, the existing demos under `demos/`) keep working without code changes. 
SDK consumers get `Deprecation` headers; behavior is byte-identical modulo headers and the migration-noop semantic. + +### D11. Schema-cache settle — how the gateway knows PostgREST is ready + +PostgREST listens on the `pgrst` LISTEN/NOTIFY channel and reloads schema asynchronously. The gateway never gets an ack; the established workaround is the bounded retry loop in `services/postgrest-forward.ts` (12×500ms = 6s budget, post-v1.33). The schema-settle phase converts that workaround from a per-request reaction into a deploy-state-machine step: + +``` +async function schemaSettlePhase(operation, expectedTables: string[]) { + // 1. NOTIFY was issued at COMMIT in the migrate phase. + // 2. Issue a canary SELECT through the project's authenticator role + // that exercises a column added/touched by the migration. + for (let i = 0; i < 12; i++) { + const ok = await canarySelect(operation.project_id, expectedTables); + if (ok) return; + await sleep(500); + } + // 3. Timeout = leave operation in `schema_settling`; auto-resume tick retries. + throw new SchemaSettleTimeoutError(operation.id); +} +``` + +The `expectedTables` list comes from diffing the schema snapshot before/after migrate (the existing `snapshotSchema` helper in `services/bundle.ts` already produces this — reuse). The canary SELECT runs through the same forward path PostgREST uses, so a successful canary proves the cache is warm for end-user traffic. + +Cost of the settle phase under nominal conditions: 0–500ms (cache typically warm after the first poll cycle). Cost on cold/loaded: up to 6s. Cost when PostgREST is wedged: 6s timeout → `schema_settling` → auto-resume retries on the hourly tick (which is fine; the gate stays up but the operation isn't lost). + +### D12. Auto-resume worker — `services/activation-resume.ts` + +The proposal as originally drafted required SDK or human intervention to recover an `activation_pending` operation. 
That is wrong for run402's operational posture: the gateway already runs an hourly tick from `services/leases.ts` that drives `advanceLifecycle`, the daily cost fetcher, the CAS GC (three phases), and `runCopyResume` for `status='copying'` site deployments. Activation is structurally identical — finite work, replayable, externally-observable terminal state. + +`services/activation-resume.ts` mirrors `services/copy-resume.ts`: + +```ts +export async function runActivationResume(deps) { + const rows = await deps.query(` + SELECT id, project_id, plan_id, target_release_id, status, activate_attempts + FROM internal.deploy_operations + WHERE status IN ('activation_pending', 'schema_settling') + AND (last_activate_attempt_at IS NULL + OR last_activate_attempt_at < NOW() - INTERVAL '5 minutes') + AND activate_attempts < 10 + FOR UPDATE SKIP LOCKED + LIMIT 32 + `); + for (const op of rows) { + await runWithLease(op.project_id, async () => { + await deps.query( + `UPDATE internal.deploy_operations + SET last_activate_attempt_at = NOW(), activate_attempts = activate_attempts + 1 + WHERE id = $1`, [op.id]); + try { + if (op.status === 'schema_settling') await schemaSettlePhase(op, ...); + await activatePhase(op); + await readyPhase(op); + } catch (err) { + if (op.activate_attempts + 1 >= 10) { + await markFailed(op, err); // operator-actionable + } + } + }); + } +} +``` + +Wired into `services/leases.ts` alongside `runCopyResume`. Feature flag `DEPLOY_AUTO_RESUME_ENABLED` (default `true`) for emergency disable. After 10 attempts, operation transitions to `failed` with the error envelope — the gate clears (because operating with the gate up forever is worse than a failed deploy), staged resources GC after 24h. + +**Why max 10 attempts?** Same number as `services/copy-resume.ts`. If activation has truly failed 10× over ~50 minutes, the issue is structural and an alarm should fire; auto-retry stops being useful past that point. + +### D13.
Gate scoping — which v2 endpoints are lifecycle-gated, which aren't + +CLAUDE.md's three-category rule applies to the v2 surface. Explicit enumeration so the implementation doesn't drift: + +| Route | lifecycle gate | Rationale | +|---|---|---| +| `POST /deploy/v2/plans` | yes | Control-plane write | +| `POST /deploy/v2/plans/:id/commit` | yes | Control-plane write | +| `POST /content/v1/plans` | yes | Control-plane write (initiates upload) | +| `POST /content/v1/plans/:id/commit` | yes | Control-plane write | +| `GET /deploy/v2/operations/:id` | no | Read-only | +| `GET /deploy/v2/operations/:id/events` | no | Read-only | +| `POST /deploy/v2/operations/:id/resume` | **no** | In-flight completion of an already-authorized commit | + +The `resume` carve-out is critical: an `activation_pending` operation may sit in that state across a tier-lease expiry. Gating resume on `frozen`/`dormant` would trap the project in held-gate forever. Resume only completes work the project has already paid for — there is no x402 settlement on resume. + +**Which endpoints respect the migrate-gate (`projects.migrate_gate_until`)?** During the gate window, `/rest/v1/*` returns 503 + Retry-After (data-plane carve-out, see D9), `/storage/v1/blob-internal` (CDN origin) returns 503 too, and all control-plane writes return 503 for that project. Read-only operations endpoints (`GET /deploy/v2/operations/*`) stay open so the SDK can poll for status during the migrate window. Resume stays open. + +## Risks / Trade-offs + +**Risk** — One CAS substrate exposed via three public routes (`/content/v1/plans`, `/storage/v1/uploads`, `/deploy/v1/plan`) during the deprecation window. → **Mitigation:** all three are thin adapters over `services/cas-promote.ts` + `services/deploy-plans.ts`. The internal substrate (`internal.content_objects`, `internal.deploy_plans`, `internal.plan_claims`) is single-owner; public routes are wire-format adapters. 
Existing `Run402/CAS` CloudWatch metrics dashboard already covers the substrate; alarms (`Run402CasCommitCopyFailedHigh`, `Run402CasStuckCopyingHigh`, `Run402CasDedupHitRateLow`) apply uniformly. + +**Risk** — Schema-cache settle phase adds latency to every migrate-bearing commit. Nominal 0–500ms, p99 6s, worst-case 6s timeout into `schema_settling` state. → **Mitigation:** mandatory for the default path — this is the cost of fixing the bug we hit in commit 4c65102b. Auto-resume catches the timeout case so the operation isn't lost. The settle phase is itself non-blocking for the agent: the SDK polls `schema_settling` like any other non-terminal status, and the auto-resume worker drives the operation forward without manual intervention. + +**Risk** — Migrate-gate carves out `/rest/v1/*` from the standing "data plane is never gated" invariant. → **Mitigation:** the carve-out is short (≤60s), narrowly scoped (only when `projects.migrate_gate_until > NOW()`), explicitly documented in CLAUDE.md, and surfaced as a deployment-correctness primitive rather than a billing primitive. End users see Retry-After during a deploy; this is acceptable because the alternative (returning stale rows / RLS-policy-mismatched data during DDL) is worse. Zero-downtime opt-in (`migrations: { zero_downtime: true }`) is available for callers who declare their migrations strictly backward-compatible. + +**Risk** — Base-release conflict detection on full-replace deploys is a new agent-facing failure mode (two agents racing to redeploy). → **Mitigation:** default behavior on `apply()` is to auto-rebase if the patch's touched paths/keys are disjoint from concurrent changes; surface conflicts as structured `{conflict: { paths, keys }}` errors with a clear retry path. Document. 
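
The auto-rebase rule in the mitigation above (rebase when the touched paths and keys are disjoint from concurrent changes, otherwise return a structured `{conflict: { paths, keys }}`) could look roughly like the following. This is a sketch under assumptions: `TouchSet`, `RebaseDecision`, and `decideRebase` are hypothetical names, and how the gateway actually derives the touched sets from two releases is not specified here.

```ts
// Hypothetical description of what a patch (or a concurrent release) touched.
interface TouchSet {
  paths: string[]; // site file paths
  keys: string[]; // function names, secret keys, and similar resource keys
}

type RebaseDecision =
  | { kind: "rebase" }
  | { kind: "conflict"; conflict: { paths: string[]; keys: string[] } };

// Sorted intersection of two string lists.
function intersect(a: string[], b: string[]): string[] {
  const seen = new Set(b);
  return a.filter((item) => seen.has(item)).sort();
}

function decideRebase(mine: TouchSet, concurrent: TouchSet): RebaseDecision {
  const paths = intersect(mine.paths, concurrent.paths);
  const keys = intersect(mine.keys, concurrent.keys);
  // Disjoint edits can be replayed on top of the newer base release.
  if (paths.length === 0 && keys.length === 0) return { kind: "rebase" };
  // Overlap is surfaced to the agent so it can re-plan against the new base.
  return { kind: "conflict", conflict: { paths, keys } };
}
```

For example, two agents patching `index.html` and `about.html` respectively would rebase cleanly, while two agents both patching `index.html` would get a conflict naming that path.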
+ +**Risk** — Six new tables (`releases`, `deploy_operations`, `applied_migrations`, `staged_function_versions`, `staged_deployments`, `staged_secret_sets`) plus two new columns on `internal.projects` (`live_release_id`, `migrate_gate_until`) plus one column flag (`migrations_adopted_at`, optional — see Migration Plan). All require explicit `REVOKE ALL` from `authenticator/anon/authenticated/service_role/project_admin` and corresponding db-staging probes. → **Mitigation:** Codified as a task; the db-staging-gate CI catches missing REVOKEs before merge (see CLAUDE.md "Migration staging gate"). Five new probes added to [test/db-staging/probes.ts](test/db-staging/probes.ts). + +**Risk** — Storage-bytes accounting drift: function-source CAS objects retained only via `staged_function_versions` or via a `superseded` `releases.manifest_ref` are invisible to the v1.32 storage_bytes triggers. → **Mitigation:** new AFTER triggers on `staged_function_versions.source_sha256` and `releases.manifest_ref` mirror the v1.32 trigger pattern. CAS GC's "no refs" union extends to these tables. Storage-bytes invariant audit added to db-staging probes. + +**Risk** — Migration registry hard errors on long-running projects when an agent's `deploy.apply` ships SQL that conflicts with a pre-v2 schema state. → **Mitigation:** **no bulk seed**. Registry starts empty; first v2 deploy per project executes whatever the spec ships. Agents migrating to v2 are documented to ensure their first migration block is idempotent (`CREATE TABLE IF NOT EXISTS`) — this is already the agent norm. For the rare case where an operator wants to record a migration as "already applied" without executing it, an admin endpoint `POST /deploy/v2/admin/migrations/adopt` accepts `{ project_id, migration_id, checksum }` and inserts the row. No cross-project bulk backfill — that path always lies, see "applied_migrations seed risk" tabled.
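
The registry rules this risk relies on (new id applies, same id with the same checksum is a noop, same id with a different checksum is a hard error, and adoption records a row without executing SQL) reduce to a small decision table. The sketch below models one project's registry as an in-memory map; `decideMigration`, `adoptMigration`, and the hex-string checksums are illustrative assumptions, and rejecting an adopt that contradicts an existing row is also an assumption rather than documented endpoint behavior.

```ts
type RegistryDecision = "apply" | "noop" | "checksum_mismatch";

// One project's registry: migration_id -> checksum (hex string here for readability).
type AppliedMigrations = Map<string, string>;

function decideMigration(
  applied: AppliedMigrations,
  migrationId: string,
  checksumHex: string,
): RegistryDecision {
  const recorded = applied.get(migrationId);
  if (recorded === undefined) return "apply"; // new id: run it, then record it
  return recorded === checksumHex ? "noop" : "checksum_mismatch";
}

// Records a migration as already applied without executing it.
// Assumption: adopting over a row with a different checksum is rejected.
function adoptMigration(
  applied: AppliedMigrations,
  migrationId: string,
  checksumHex: string,
): void {
  if (decideMigration(applied, migrationId, checksumHex) === "checksum_mismatch") {
    throw new Error(`migration ${migrationId} is already recorded with a different checksum`);
  }
  applied.set(migrationId, checksumHex);
}
```

Because the registry starts empty, the first decision for any id is always `apply`, which is why the first v2 migration block needs to be idempotent against the pre-v2 schema.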
+ +**Risk** — Server-authoritative digest changes the round-trip shape (SDK now learns the digest from the plan response). Tests that rely on the SDK-computed digest matching the gateway's are no longer meaningful. → **Mitigation:** the existing JCS canonicalize at [packages/gateway/src/services/deploy-plans.ts:286](packages/gateway/src/services/deploy-plans.ts:286) (`computeManifestDigest`) is reused as the gateway's digest function. The SDK keeps `sdk/src/node/canonicalize.ts` as a UX helper but no longer required to byte-match. Replace the cross-repo fixture test with a contract test that asserts the gateway accepts whatever the SDK posts (the SDK can verify the gateway's reply digest locally for sanity, but mismatch is no longer fatal). + +**Trade-off** — Patch semantics + base-release conflict detection are more complex than today's "post the whole thing every time." The cost lands on the gateway (release model + diff logic). The win lands on agents (fast iteration, partial-failure recovery, structured progress). + +**Trade-off** — `deploy.apply` becomes the only blessed primitive. Agents that today reach for `bundleDeploy` because it's "the atomic one" or `deployDir` because it's "the fast one" lose that mental shortcut. → **Mitigation:** docs explicitly cover the transition; the shims keep both old names working; the SDK exports `deploy` prominently in the README's first example. + +**Trade-off** — Bundle-shim translates `migrations: string` to a deterministic `bundle_legacy_` id. Re-shipping identical SQL via the shim becomes a registry noop instead of v1's re-execution. Idempotent SQL (the agent norm) sees no change; non-idempotent SQL gets safer. Documented release note. + +## Migration Plan + +1. **Phase A — Build v2 alongside v1** + - Ship gateway migration adding `internal.releases`, `internal.deploy_operations`, `internal.applied_migrations`, `internal.staged_function_versions`, `internal.staged_deployments`, `internal.staged_secret_sets`. 
Add `internal.projects.live_release_id`, `internal.projects.migrate_gate_until`. All with `REVOKE ALL` to anon/authenticated/service_role/project_admin (matches existing dark-by-default convention). Add five db-staging probes asserting the REVOKEs hold under all auth shapes. CODEOWNER review by `@MajorTal` per `.github/CODEOWNERS`. + - Ship `services/content.ts` (the thin facade over existing `services/cas-promote.ts` + `services/deploy-plans.ts`) and `routes/content.ts`. + - Ship `services/deploy-v2.ts` (plan + commit + state machine) and `routes/deploy-v2.ts`. Wire `services/activation-resume.ts` into the existing `services/leases.ts` hourly tick. + - Ship the SDK `deploy` namespace + new types + Node helpers. + - Ship the three v1 shims: `POST /deploy/v1` (bundle), `POST /deploy/v1/plan` + `POST /deploy/v1/commit` (site CAS). Each behind a kill-switch env var (`DEPLOY_V1_BUNDLE_ROUTE_THROUGH_V2`, `DEPLOY_V1_SITE_ROUTE_THROUGH_V2`, default `true`). `apps.bundleDeploy` and `sites.deployDir` SDK methods become thin wrappers over `deploy.apply`. + - MCP `bundle_deploy`, `deploy_site`, `deploy_site_dir`, `deploy_function` keep their input schemas; new MCP tools `deploy` and `deploy_resume`. + - **Risk window:** any divergence between v1-shim and v2 logic produces silent behavior changes. Mitigation: shadow-traffic comparison test (`bundle-v1-shim.test.ts`) asserts byte-identical responses modulo deprecation headers; one-week canary on staging via `BASE_URL=https://api.staging.run402.com npm run test:e2e` + `npm run test:bld402-compat`; full sync test in `sync.test.ts` extended; full `db-staging-gate` CI run on the migration before merge. + +2. **Phase B — Deprecate (one minor)** + - Add `Deprecation: true; Sunset: ; Link: ; rel="successor-version"` headers on all three v1 routes. 
+ - SDK emits a one-time console warning when callers pass legacy `files: SiteFile[]` with `inherit: true`, or when the inline-bytes `/deploy/v1` path is hit (the bundle shim CAS's bytes internally; only direct HTTP callers see the warning). + - `cli/llms-cli.txt` updated to lead with `deploy.apply` examples; legacy section moved to a "compat" appendix. + +3. **Phase C — Remove (next minor)** + - `/deploy/v1` returns 410 Gone for requests with non-empty `files`. The route stays for callers using only `migrations`/`secrets`/`functions` without site files. + - `/deploy/v1/plan` + `/deploy/v1/commit` return 410 Gone with a pointer to `/deploy/v2/plans`. + - `bundleDeploy` SDK shim continues working — it CAS's bytes internally. + - `inherit: true` becomes a hard error in the shim. + +4. **Rollback strategy** + - Phase A schema migration is **forward-only** (mirrors v1.32 cutover constraint). The six new tables and two `projects` columns cannot be dropped after they ship without an RDS snapshot restore. + - Phase A code rollback: kill-switch env vars revert v1 routes to legacy code paths; `DEPLOY_AUTO_RESUME_ENABLED=false` disables the auto-resume worker. Schema persists harmlessly. + - Phase B is purely additive (headers + warnings). + - Phase C is the only real cutover. Mitigation: snapshot of pre-cutover gateway, traffic mirror to staging for one week before flipping, customer email + dashboard banner two weeks ahead. + +## Open Questions — resolved + +- **Q1. Payment timing.** Surface `payment_required` in the **plan** response body (HTTP 200) so the SDK can show the agent before bytes move. Settle the x402 charge at **commit** via the existing `Run402Client.fetch` 402 handshake. The plan-time surface is informational; commit is authoritative. Rationale: pre-revenue, settlement-on-plan would add a billing surface we don't have today, and the plan is cheap enough that an agent who pays at commit hasn't lost much.
The lifecycle gate at `/deploy/v2/plans` already returns 402 if the project is `frozen`/`dormant`, which covers the "agent doesn't realize they're past-due" case at the expected layer. +- **Q2. Staging GC.** Three typed tables (`staged_function_versions`, `staged_deployments`, `staged_secret_sets`) all CASCADE-delete from `internal.deploy_operations`. Janitor task in the existing `services/leases.ts` hourly tick (alongside CAS GC and `runCopyResume`) deletes operations in `failed`/`rolled_back` older than 24h, which CASCADEs the staging rows. Never GC `activation_pending` or `schema_settling` — those are the auto-resume worker's domain. Stale Lambda versions (>75 per function, AWS limit) get a separate per-function reaper tick that drops the oldest non-active version. +- **Q3. v1→v2 translation locus.** Per-request, in the route handler. The shim does not persist a v2 plan for v1 callers — they don't have the operation_id surface anyway, so persisting wouldn't help them. The shim still uses `internal.idempotency_keys` (the existing middleware) for retry safety on `Idempotency-Key` headers. +- **Q4. Payment-required surfacing.** HTTP 200 with `payment_required` in the body. No discrete pricing endpoint — `POST /deploy/v2/plans` with an empty/minimal spec already serves that purpose (returns the diff against base + payment_required if any), at the cost of one plan-row insert. +- **Q5. Subdomain semantics.** **Single subdomain only**. `subdomains.set` accepts a one-element array; the gateway rejects multi-element arrays with a structured error pointing at the future multi-subdomain change. Multi-subdomain per project is a significant feature (KVS routing, billing, custom-domain interaction) and is explicitly **out of scope** for unify-deployments. The bundle shim's `subdomain: string` translates to `subdomains: { set: [string] }`; existing one-subdomain-per-project semantics are preserved exactly. 
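
The staging-GC rule from Q2 (delete only `failed`/`rolled_back` operations older than 24h, and never touch `activation_pending` or `schema_settling`) can be expressed as a single predicate. This is a hedged sketch: `OperationRow` and `isGcEligible` are hypothetical names, and measuring age from `updated_at` is an assumption suggested by the `deploy_operations_gc_idx` index rather than something the text states.

```ts
// Status values copied from the CHECK constraint on internal.deploy_operations.
type OperationStatus =
  | "planning" | "uploading" | "committing"
  | "staging" | "gating" | "migrating" | "schema_settling"
  | "activating" | "activation_pending" | "needs_repair"
  | "ready" | "failed" | "rolled_back";

// Hypothetical in-memory view of one deploy_operations row.
interface OperationRow {
  status: OperationStatus;
  updatedAt: Date;
}

const GC_STATUSES: ReadonlySet<OperationStatus> = new Set<OperationStatus>(["failed", "rolled_back"]);
const GC_MIN_AGE_MS = 24 * 60 * 60 * 1000;

function isGcEligible(op: OperationRow, now: Date): boolean {
  // Resumable states are excluded simply by not being in GC_STATUSES.
  if (!GC_STATUSES.has(op.status)) return false;
  // Assumption: age is measured from updated_at.
  return now.getTime() - op.updatedAt.getTime() > GC_MIN_AGE_MS;
}
```

Deleting an eligible operation row is enough to clear its staging rows, since all three typed staging tables CASCADE from `internal.deploy_operations`.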
diff --git a/openspec/changes/unify-deployments/proposal.md b/openspec/changes/unify-deployments/proposal.md new file mode 100644 index 00000000..89965386 --- /dev/null +++ b/openspec/changes/unify-deployments/proposal.md @@ -0,0 +1,45 @@ +## Why + +run402 today ships three different upload/deploy transports that coding agents must choose between: `apps.bundleDeploy` (inline base64 in JSON, atomic multi-resource, hard 50 MB ceiling at the gateway), `sites.deployDir` (plan/commit + S3 CAS, site-only, Node-only), and `blobs.put` (CAS again, isomorphic, single-asset). The atomic multi-resource shape — DB + RLS + secrets + functions + site + subdomain in one transaction — is the platform's whole differentiator versus Vercel/Cloudflare/Supabase, but its bytes transport (base64-in-JSON) caps real-world deploys at ~37 MB of file content and provides no dedup, no resumability, no streaming progress, and no patch-one-file path. Meanwhile the well-engineered CAS transport that already exists in `deployDir`/`blobs.put` is unavailable for the multi-resource case, so agents who need to ship a real app are forced into the worse path. Today's gateway also runs the bundle non-transactionally (migrations apply, then function deploy fails → orphaned half-state), with no recovery primitive other than re-calling the same call. + +We need one canonical deploy primitive that carries every byte payload through CAS, accepts structured intent (DB / functions / site / secrets / subdomains) declaratively, supports both replace and patch semantics per resource, and commits atomically with a server-owned, resumable state machine. + +## What Changes + +- **NEW** `deploy.apply(ReleaseSpec)` — single canonical SDK call. Accepts structured app intent + byte sources (string, `Uint8Array`, `Blob`, web stream, or directory via `@run402/sdk/node` helper), normalizes all bytes into `ContentRef` (sha256 + size + content_type) before planning, and runs plan → upload missing → commit. 
Isomorphic in the root SDK (no filesystem assumption); Node entry adds `fileSetFromDir()` lazy/streaming source. +- **NEW** `POST /deploy/v2/plans`, `POST /deploy/v2/plans/:id/commit`, `GET /deploy/v2/operations/:id`, `GET /deploy/v2/operations/:id/events`, `POST /deploy/v2/operations/:id/resume` — versioned deploy wire protocol. Manifest contains content refs only (no inline bytes); for huge manifests, an SDK-side CAS pre-upload of the manifest itself plus a `manifest_ref` field in the plan request. +- **NEW** Release model — every successful commit produces an immutable `release_id`. The active release pointer flips atomically at activation. Staged-but-not-active site deployments / function versions / secret version sets / route tables hang off plans. Plan requests carry a `base_release_id` for conflict detection. +- **NEW** Replace vs patch semantics per resource. `site.replace` = "this is the whole site" (files absent from the spec are removed in the new release); `site.patch.put` / `patch.delete` = surgical updates that do not touch unmentioned paths. Same shape for `functions`, `secrets`, `subdomains`. Top-level resource absence means "leave untouched." +- **NEW** Server-side commit state machine. Phase order: validate → stage all non-DB resources → reserve subdomain → migrate-gate (only if DB changes are present; 503 + Retry-After per project, distinct from the lifecycle gate, with explicit data-plane carve-out for `/rest/v1/*` for the duration) → run DB migrations transactionally with `{id, checksum, applied_at}` registry → schema-cache settle (canary SELECT against the new schema until PostgREST has reloaded) → atomic pointer swap activation → clear gate → poll readiness. 
After-DB-commit-but-before-activation enters `activation_pending`; `deploy.resume(operationId)` finishes activation without replaying SQL, **and** an auto-resume worker in `services/leases.ts` automatically retries `activation_pending` operations on the hourly tick (mirrors the existing `services/copy-resume.ts` pattern). +- **NEW** Structured event stream + error envelopes. Events: `plan.started`, `plan.diff`, `payment.required`, `payment.paid`, `content.upload.skipped`, `content.upload.progress`, `commit.phase`, `log`, `ready`. Errors carry `{code, phase, resource, message, retryable, operation_id, plan_id, fix?}`. +- **NEW** Server-authoritative manifest digest — the SDK computes a local digest for caching/UX, but idempotency is keyed on the gateway-computed digest plus optional client idempotency key. Removes the byte-for-byte canonicalize fragility between SDK and gateway. +- **NEW** x402 payment preflight during `plan` — agents learn about lease renewal cost *before* uploading bytes. +- **NEW** Migration registry (`internal.applied_migrations`) keyed by `(project_id, migration_id)` with `checksum BYTEA(32)`. Same id + same checksum = noop; same id + different checksum = hard error; new id = applied + recorded. Eliminates blind replay. **No bulk seed from v1 history** — the gateway never persisted v1 migration text, so any seed would be a sentinel that hard-errors all real callers. Registry starts empty per project; first v2 deploy executes whatever the spec ships (agents are documented to ensure first-deploy migrations are idempotent — the established norm). Operator escape hatch via `POST /deploy/v2/admin/migrations/adopt` for the rare case of recording a pre-applied migration without execution. +- **MODIFIED** `apps.bundleDeploy(projectId, opts)` becomes a thin compatibility shim that translates to `deploy.apply()` and never POSTs base64 to `/deploy/v1`. 
Migration translation uses a deterministic id `bundle_legacy_` so re-shipping identical SQL becomes a registry noop (behavior change vs. v1's blind re-execution; idempotent SQL — the agent norm — sees no observable difference). +- **MODIFIED** `sites.deployDir({ project, dir })` becomes a thin shim over `deploy.apply({ site: { replace: fileSetFromDir(dir) } })`. +- **MODIFIED** `blobs.put` keeps its public API but routes through the same canonical CAS substrate used by `deploy.apply` (the v1.32 `internal.content_objects` + `services/cas-promote.ts` already shipping in production — exposed as `/content/v1/plans` for v2). +- **MODIFIED** `POST /deploy/v1/plan` and `POST /deploy/v1/commit` (the existing v1.32 site CAS routes) become v1 shims over v2. They keep their request/response shapes; internally they translate to `/deploy/v2/plans` + commit. Behind kill-switch env var `DEPLOY_V1_SITE_ROUTE_THROUGH_V2` (default `true`) for incident rollback. Folding all three v1 paths through v2 at the same time eliminates the "two parallel commit code paths against the same `internal.deploy_plans` substrate" divergence trap. +- **BREAKING** `POST /deploy/v1` — accepting inline `files: SiteFile[]` is deprecated. The endpoint stays for one minor cycle accepting the same wire shape but routes through v2 internally; the next minor removes inline-bytes acceptance and returns 410 Gone with a pointer to v2 (mirrors the v1.32 site cutover). +- **BREAKING** The legacy `inherit` flag on `bundleDeploy` is obsolete — patch semantics replace it. The shim accepts and ignores `inherit` for one minor; thereafter it errors. + +## Capabilities + +### New Capabilities + +- `unified-deploy`: The `deploy.apply()` SDK primitive, the `/deploy/v2` wire protocol, the release/operation model, replace-vs-patch resource semantics, server-side commit state machine (with explicit schema-cache settle phase + auto-resume worker), structured events and errors, payment preflight, and migration registry. 
This is the canonical deploy capability; everything else becomes a wrapper. +- `cas-content`: The content-addressed storage capability, exposed as `POST /content/v1/plans` (negotiate which SHAs are missing for the project) and presigned PUTs/multipart for missing bytes. **The substrate already ships in production** as v1.32's `internal.content_objects` + `internal.deploy_plans` + `internal.plan_claims` + `services/cas-promote.ts` + `services/copy-resume.ts`; this change introduces a generic content route over the existing substrate and refactors the existing per-namespace upload routes (`/storage/v1/uploads`, `/deploy/v1/plan`) into adapters over it. Project-scoped *presence* (the privacy guarantee) is enforced by `plan_claims` + per-project ref joins; storage stays globally shared (one S3 object per SHA) for cost. + +### Modified Capabilities + +- `deploy-dir`: `sites.deployDir` is reduced to a Node-only convenience that calls `deploy.apply({ site: { replace: fileSetFromDir(dir) } })`. The MCP tool `deploy_site_dir` and CLI `run402 sites deploy-dir` keep their input shapes but route through v2. The existing onEvent callback is preserved (events upgrade to the richer v2 envelope under the same name). +- `incremental-deploy`: The `inherit` flag is removed; patch semantics on `deploy.apply` cover its use case. Bundle-deploy callers passing `inherit: true` get a one-minor deprecation warning, then a hard error. + +## Impact + +- **SDK** — new `deploy` namespace (`@run402/sdk` + `@run402/sdk/node`), modifications to `apps.bundleDeploy` and `sites.deployDir`, modifications to `blobs.put` internals only (public API unchanged). 
+- **Gateway** — new `packages/gateway/src/routes/deploy-v2.ts`, `packages/gateway/src/routes/content.ts`, `packages/gateway/src/services/deploy-v2.ts`, `packages/gateway/src/services/content.ts` (thin facade over existing `services/cas-promote.ts` + `services/deploy-plans.ts`), `packages/gateway/src/services/activation-resume.ts` (wired into the existing `services/leases.ts` hourly tick alongside `runCopyResume`). New tables: `internal.releases`, `internal.deploy_operations`, `internal.applied_migrations`, `internal.staged_function_versions`, `internal.staged_deployments`, `internal.staged_secret_sets`. New columns on `internal.projects`: `live_release_id`, `migrate_gate_until`. New AFTER triggers on `staged_function_versions.source_sha256` and `releases.manifest_ref` for storage-bytes accounting (mirroring the v1.32 trigger pattern). Existing `routes/bundle.ts`, `routes/deploy.ts` (legacy plan/commit) become v1 shims around v2 behind kill-switch env vars. CAS GC's "no refs" union extended to count release-only references. Five new `db-staging-gate` probes asserting the new internal tables are not exposed via PostgREST. +- **MCP server** — `bundle_deploy`, `deploy_site`, `deploy_site_dir`, `deploy_function` rewrite as thin wrappers over `deploy.apply`. The MCP surface gains `deploy` (one-shot) and may expose `deploy_resume` (resumable op). +- **CLI** — `run402 deploy --manifest ` keeps its surface but routes to v2 internally; `run402 sites deploy-dir` similarly. New `run402 deploy resume ` for partial-failure recovery. +- **Docs** — `cli/llms-cli.txt` and the agent-facing surface drop base64 manifest examples in favor of patch examples + recovery examples; the existing "provision first, copy anon_key into HTML" pitfall is unaffected by this change (a follow-up change can address it via the virtual `/.run402/config.json` idea from the consultation). 
+- **External users** — bundle-deploy callers using inline base64 see no behavior change for one minor cycle; `inherit: true` users see a deprecation warning then a hard error. Custom integrations against `/deploy/v1` that were depending on undocumented inline-bytes behavior break on the v2 cutover (mirrors the v1.32 site cutover precedent). +- **Out of scope** (explicit follow-ups, not this change): CAS pack uploads for sites with thousands of tiny files; virtual `/.run402/config.json` per-site config endpoint; same-origin function routes (`routes: { "/api": { function: "api" } }`); **multi-subdomain per project** (the spec rejects multi-element `subdomains.set` arrays with a structured error pointing at this follow-up). diff --git a/openspec/changes/unify-deployments/specs/cas-content/spec.md b/openspec/changes/unify-deployments/specs/cas-content/spec.md new file mode 100644 index 00000000..62f4d5ae --- /dev/null +++ b/openspec/changes/unify-deployments/specs/cas-content/spec.md @@ -0,0 +1,100 @@ +## ADDED Requirements + +### Requirement: Gateway exposes a content-addressed content service + +The gateway SHALL expose a content-addressed storage service used internally by `unified-deploy`, the `blobs` storage namespace, and the manifest-ref escape hatch. The service SHALL provide: + +- `POST /content/v1/plans` — accepts `{ project_id, content: [{ sha256: hex, size: int, content_type?: string }] }` and returns `{ plan_id, missing: [{ sha256, mode: "single" | "multipart", parts: [{ part_number, url, byte_start, byte_end }], expires_at }] }`. Entries already present-for-this-project (per the project-scoped presence rule below) SHALL be omitted from `missing`. +- Presigned PUTs to S3 staging for missing content. Each PUT SHALL carry the per-part SHA-256 in the `x-amz-checksum-sha256` header; the gateway-issued URL SHALL pin the expected checksum so a corrupted upload fails at S3. 
+- `POST /content/v1/plans/:id/commit` — finalizes a plan by promoting staged objects from the project's staging area to the shared CAS, and recording per-project reference proofs (`internal.plan_claims`). Multipart uploads complete here. + +The body limit on `POST /content/v1/plans` SHALL be 5 MB. + +The implementation SHALL reuse the existing v1.32 substrate (`internal.content_objects`, `internal.deploy_plans`, `internal.plan_claims`, `internal.upload_sessions` with `kind='cas'`, `services/cas-promote.ts`, `services/copy-resume.ts`). This route is a generic content-route surface over already-shipping infrastructure; it SHALL NOT introduce a parallel storage layer or per-project storage rows. + +#### Scenario: Plan reports presence-only for already-uploaded SHAs + +- **WHEN** the SDK calls `POST /content/v1/plans` listing 10 SHAs of which 7 are already in the project's CAS +- **THEN** the response `missing` contains exactly 3 entries +- **AND** the project's CAS state is unchanged at this point (no DB writes for entries already present) + +#### Scenario: Multipart mode chosen by the gateway for large objects + +- **WHEN** the SDK lists a missing entry whose `size` exceeds the gateway's single-PUT threshold +- **THEN** the plan response sets `mode: "multipart"` for that entry with multiple parts and per-part presigned URLs +- **AND** each part covers a contiguous byte range without gaps or overlap + +#### Scenario: Bytes upload via presigned PUT, never through the gateway + +- **WHEN** the SDK uploads a missing object's bytes +- **THEN** the PUT request goes directly to the presigned S3 URL +- **AND** the gateway is not in the request path for the bytes themselves + +### Requirement: CAS content presence is project-scoped + +CAS content **presence** SHALL be scoped per project. 
A SHA present in project A's references SHALL NOT count as present for project B; project B's plan response SHALL list that SHA in `missing` until project B has either (a) uploaded the bytes itself within a plan, or (b) accumulated a reference to the SHA via its own `blobs` / `deployment_files` / staged-function rows. The presence answer SHALL NOT leak whether any other project has uploaded the same SHA. + +This is a privacy and isolation guarantee — projects MUST NOT learn about each other's content presence via cross-project dedup. + +The implementation SHALL enforce this via the existing v1.32 substrate: `internal.plan_claims` (per-project proof of upload completion within a plan) joined with the project's own ref tables. Storage SHALL remain globally shared in S3 (one object per SHA across the platform) — this is a cost optimization that does not weaken the presence guarantee, because presence is decided by joins, not by storage layout. + +#### Scenario: Same SHA in two projects requires two uploads + +- **WHEN** project A has uploaded a file with SHA X +- **AND** project B issues a plan listing SHA X as content it wants to ship +- **THEN** project B's plan response includes that SHA in `missing` +- **AND** project B uploads the bytes itself +- **AND** the second upload is observable as a fresh presigned PUT — project B cannot infer from latency, response shape, or any other side channel that project A previously uploaded the same bytes + +### Requirement: CAS content service is reused by `blobs.put` internally + +The `blobs.put` SDK method SHALL use the same internal CAS content service for byte transport. The public `POST /storage/v1/uploads` and `POST /storage/v1/uploads/:id/complete` routes SHALL continue to exist (no breaking change for blob callers), but their handlers SHALL delegate to the CAS content service for the byte staging step. 
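The project-scoped presence rule, which `blobs.put` inherits by delegating to the same content service, can be pictured with a small in-memory model. This is only a sketch: `PresenceModel` and its method names are hypothetical, and the gateway is described as deciding presence through `internal.plan_claims` joined with per-project ref tables, not through maps like these.

```typescript
// Hypothetical in-memory model of project-scoped presence over shared storage.
class PresenceModel {
  // projectId -> SHAs that project holds a reference or claim for.
  private readonly refsByProject = new Map<string, Set<string>>();
  // Globally shared storage: one stored object per SHA.
  private readonly storedObjects = new Set<string>();

  // SHAs the project still has to upload. Other projects' refs are never consulted.
  missingFor(projectId: string, requested: readonly string[]): string[] {
    const own = this.refsByProject.get(projectId) ?? new Set<string>();
    const unique = Array.from(new Set<string>(requested));
    return unique.filter((sha) => !own.has(sha));
  }

  // Record that a project finished uploading a SHA.
  recordUpload(projectId: string, sha: string): void {
    let own = this.refsByProject.get(projectId);
    if (!own) {
      own = new Set<string>();
      this.refsByProject.set(projectId, own);
    }
    own.add(sha);
    this.storedObjects.add(sha); // idempotent: storage stays one object per SHA
  }

  get storedObjectCount(): number {
    return this.storedObjects.size;
  }
}
```

In this model a second project always sees the SHA as missing until it uploads the bytes itself, while the stored-object count stays at one per SHA.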
+ +The agent-observable behavior of `blobs.put` SHALL be unchanged — the response shape (`AssetRef` with `cdnUrl`, `scriptTag()`, `linkTag()`, `imgTag()`, etc.) is preserved. + +#### Scenario: blobs.put is byte-identical from the caller's perspective + +- **WHEN** an agent calls `r.blobs.put(projectId, "logo.png", { bytes: imageBytes })` against the new gateway +- **THEN** the returned `AssetRef` is shape-identical to the pre-change result (same `cdnUrl`, `sri`, `contentDigest`, etc.) +- **AND** the SDK uses the gateway's `/storage/v1/uploads` route as before + +#### Scenario: Internal CAS dedup applies to blobs + +- **WHEN** an agent uploads an `AssetRef` for `logo.png` and then later calls `blobs.put` with the same bytes under a different key +- **THEN** the second upload's plan reports the SHA as already present +- **AND** no S3 PUT is issued for the bytes + +### Requirement: Manifest-ref pre-upload uses the CAS content service + +When a deploy spec exceeds the inline plan body cap (5 MB), the SDK SHALL upload the manifest JSON itself as a CAS object via this content service, with `content_type: "application/vnd.run402.deploy-manifest+json"`. The deploy plan request SHALL then reference the manifest via its ContentRef. + +The gateway SHALL fetch the manifest from CAS as the first step of plan processing when `manifest_ref` is present. 
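A minimal sketch of the manifest-ref decision described above, assuming the 5 MB cap means 5 × 1024 × 1024 bytes of UTF-8 JSON. The `PlanBody` union and the `buildPlanBody` name are hypothetical illustrations; only the `manifest_ref` field names and the content type come from the requirement text.

```typescript
import { createHash } from "node:crypto";

// Assumption: "5 MB" is interpreted as 5 MiB of serialized JSON.
const INLINE_MANIFEST_CAP_BYTES = 5 * 1024 * 1024;
const MANIFEST_CONTENT_TYPE = "application/vnd.run402.deploy-manifest+json";

interface ContentRef {
  sha256: string; // lowercase hex digest
  size: number; // byte length
  content_type?: string;
}

// Hypothetical shape: either the manifest inline, or a ref plus the bytes to upload to CAS.
type PlanBody =
  | { kind: "inline"; manifest: unknown }
  | { kind: "ref"; manifest_ref: ContentRef; bytes: Uint8Array };

function buildPlanBody(manifest: unknown, capBytes: number = INLINE_MANIFEST_CAP_BYTES): PlanBody {
  const bytes = new TextEncoder().encode(JSON.stringify(manifest));
  if (bytes.byteLength <= capBytes) {
    return { kind: "inline", manifest };
  }
  // Over the cap: the manifest itself becomes a CAS object, referenced by digest.
  const sha256 = createHash("sha256").update(bytes).digest("hex");
  return {
    kind: "ref",
    manifest_ref: { sha256, size: bytes.byteLength, content_type: MANIFEST_CONTENT_TYPE },
    bytes,
  };
}
```

In the `ref` case the caller would still need to plan and upload `bytes` through the content service before sending the deploy plan request.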
+ +#### Scenario: Large manifest is uploaded as a CAS object first + +- **WHEN** the SDK normalizes a ReleaseSpec whose JSON serialization is 9 MB +- **THEN** the SDK calls `POST /content/v1/plans` listing the manifest's SHA, uploads the missing manifest bytes, and calls `POST /deploy/v2/plans` with `manifest_ref: { sha256, size, content_type: "application/vnd.run402.deploy-manifest+json" }` + +#### Scenario: Gateway reads manifest from CAS + +- **WHEN** the gateway receives a deploy plan request with `manifest_ref` +- **THEN** the gateway fetches the manifest bytes from the project's CAS namespace +- **AND** processes the deploy plan as if the manifest had been inlined + +### Requirement: Presigned URL TTL and refresh semantics + +Presigned PUT URLs returned by `POST /content/v1/plans` SHALL have a TTL of at least 1 hour. The plan response SHALL surface each URL's `expires_at` so SDKs can refresh proactively before the TTL elapses. + +Re-issuing a plan for the same `(project_id, content list)` SHALL return fresh presigned URLs without altering CAS presence state; re-planning is safe and free of side effects on bytes.
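The refresh semantics can be sketched as two pure decisions: when to re-plan proactively, and what to do after a PUT returns. The function names and the 10-minute margin are assumptions chosen to line up with a 1-hour TTL; the requirement itself only fixes the minimum TTL and the presence of `expires_at`.

```typescript
// Assumption: refresh once 10 minutes or less remain (about 50 minutes into a 1-hour TTL).
const REFRESH_MARGIN_MS = 10 * 60 * 1000;

// True when the URL is close enough to expiry that the SDK should re-plan.
function needsRefresh(expiresAt: string, nowMs: number, marginMs: number = REFRESH_MARGIN_MS): boolean {
  return Date.parse(expiresAt) - nowMs <= marginMs;
}

type UploadAction = "done" | "replan-and-retry" | "fail";

// Decide the next step after a presigned PUT returns an HTTP status.
// A 403 is treated as an expired URL and gets a single re-plan plus retry.
function afterPut(status: number, alreadyRefreshed: boolean): UploadAction {
  if (status >= 200 && status < 300) return "done";
  if (status === 403 && !alreadyRefreshed) return "replan-and-retry";
  return "fail";
}
```

Because re-planning does not alter presence state, calling it on either trigger is safe; objects already uploaded simply drop out of the next `missing` list.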
+ +#### Scenario: SDK refreshes URLs before TTL expires + +- **WHEN** the SDK has held a plan for more than 50 minutes (under the 1-hour TTL) and still has uploads remaining +- **THEN** the SDK re-calls `POST /content/v1/plans` with the same content list to obtain fresh URLs +- **AND** the second call's `missing` list excludes objects already uploaded in the first plan window + +#### Scenario: Expired URL retry succeeds after refresh + +- **WHEN** an S3 PUT returns 403 (URL expired) during upload +- **THEN** the SDK re-plans once and retries the failed PUT against the new URL +- **AND** the upload succeeds without surfacing the transient failure to the caller diff --git a/openspec/changes/unify-deployments/specs/deploy-dir/spec.md b/openspec/changes/unify-deployments/specs/deploy-dir/spec.md new file mode 100644 index 00000000..08b6b8b6 --- /dev/null +++ b/openspec/changes/unify-deployments/specs/deploy-dir/spec.md @@ -0,0 +1,147 @@ +## MODIFIED Requirements + +### Requirement: SDK exposes deployDir on Node entry + +The `@run402/sdk/node` entry point SHALL expose a `deployDir` method on the `sites` namespace that takes `{ project: string, dir: string, target?: string, onEvent?: (event: DeployEvent) => void }` and returns the same `SiteDeployResult` shape (`{ deployment_id, url, bytes_total?, bytes_uploaded? }`) as a successful deploy. + +The implementation SHALL be a thin wrapper that delegates to the canonical `deploy.apply` primitive (defined by the `unified-deploy` capability): + +```ts +async deployDir({ project, dir, target, onEvent }) { + return r.deploy.apply( + { project, site: { replace: fileSetFromDir(dir) } }, + { onEvent } + ).then(shapeAsSiteDeployResult); +} +``` + +The isomorphic `@run402/sdk` entry point SHALL NOT expose `deployDir` — it remains a Node-only convenience because directory traversal depends on `node:fs/promises`, which is unavailable in V8 isolates. Isomorphic callers use `r.deploy.apply` directly with in-memory byte sources. 
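To illustrate why directory deploys stay Node-only, here is a rough sketch of the kind of traversal a directory byte source has to perform. `walkDir` and `FileEntry` are hypothetical names, and this version reads files eagerly for brevity, whereas the real `fileSetFromDir()` is described as a lazy/streaming source.

```typescript
import { createHash } from "node:crypto";
import { mkdirSync, mkdtempSync, readdirSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, relative, sep } from "node:path";

interface FileEntry {
  path: string; // POSIX-style path relative to the site root
  sha256: string; // lowercase hex digest of the file bytes
  size: number; // byte length
}

// Recursively collect regular files under `root`, hashing each one.
function walkDir(root: string, dir: string = root): FileEntry[] {
  const out: FileEntry[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const abs = join(dir, entry.name);
    if (entry.isDirectory()) {
      out.push(...walkDir(root, abs));
    } else if (entry.isFile()) {
      const bytes = readFileSync(abs);
      out.push({
        path: relative(root, abs).split(sep).join("/"),
        sha256: createHash("sha256").update(bytes).digest("hex"),
        size: bytes.byteLength,
      });
    }
  }
  // Sort by path so the listing does not depend on directory read order.
  return out.sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
}

// Demo against a throwaway directory.
const demoRoot = mkdtempSync(join(tmpdir(), "run402-site-"));
mkdirSync(join(demoRoot, "assets"));
writeFileSync(join(demoRoot, "index.html"), "<h1>hello</h1>");
writeFileSync(join(demoRoot, "assets", "app.js"), "console.log('hi');");
const demoEntries = walkDir(demoRoot);
rmSync(demoRoot, { recursive: true, force: true });
```

Everything here depends on `node:fs`, which is the reason the isomorphic entry point takes in-memory byte sources instead.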
+ +#### Scenario: Node consumer deploys a directory + +- **WHEN** a Node consumer calls `r.sites.deployDir({ project: "prj_abc", dir: "./my-site" })` +- **THEN** the SDK delegates to `r.deploy.apply({ project, site: { replace: fileSetFromDir("./my-site") } })` +- **AND** all bytes travel via the unified `cas-content` transport (`POST /content/v1/plans` + presigned S3 PUTs) and the `unified-deploy` plan/commit endpoints (`POST /deploy/v2/plans` + commit) +- **AND** the agent-observable result is `{ deployment_id, url }` from the activated release + +#### Scenario: Sandbox consumer has no deployDir + +- **WHEN** a V8-isolate consumer imports `Run402` from `@run402/sdk` (the isomorphic entry) +- **THEN** the `sites` namespace does NOT expose a `deployDir` method +- **AND** isomorphic callers use `r.deploy.apply({ site: { replace: files({...}) } })` with in-memory byte sources + +### Requirement: Plan/commit transport handles dedup, URL refresh, and copy polling + +`deployDir` SHALL upload only files reported as `missing` by the unified content service (`POST /content/v1/plans`); files reported as already-present in the project's CAS SHALL NOT be re-uploaded. Within-deploy duplicate paths (same SHA, multiple paths) SHALL be uploaded once. + +The SDK SHALL refresh presigned URLs before TTL expires (50 minutes under the gateway's 1-hour TTL) by re-calling `POST /content/v1/plans` for any remaining missing entries. On HTTP 403 from S3 (expired URL), the SDK SHALL refresh once and retry the failed PUT. + +After commit, if the deploy operation is not immediately `ready`, the SDK SHALL poll `GET /deploy/v2/operations/:id` (initial 1 s interval, backing off to 30 s max, total cap 10 minutes) until the operation reaches a terminal state (`ready` or `failed`). 
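The polling rule above (1 s initial interval, 30 s ceiling, 10-minute total cap) can be sketched as a delay schedule. The doubling factor is an assumption, since the requirement only says the interval backs off, and `pollDelays` / `isTerminal` are hypothetical helper names.

```typescript
const INITIAL_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;
const TOTAL_CAP_MS = 10 * 60 * 1000;

// Terminal operation states named in the requirement.
function isTerminal(status: string): boolean {
  return status === "ready" || status === "failed";
}

// Build the list of waits between polls. Assumption: exponential backoff with factor 2.
function pollDelays(
  initialMs: number = INITIAL_DELAY_MS,
  maxMs: number = MAX_DELAY_MS,
  capMs: number = TOTAL_CAP_MS,
  factor: number = 2,
): number[] {
  const delays: number[] = [];
  let next = initialMs;
  let elapsed = 0;
  // Stop before a wait that would push total waiting time past the cap.
  while (elapsed + next <= capMs) {
    delays.push(next);
    elapsed += next;
    next = Math.min(next * factor, maxMs);
  }
  return delays;
}
```

A caller would sleep for each delay in turn, fetch `GET /deploy/v2/operations/:id`, and stop as soon as `isTerminal` returns true or the schedule is exhausted.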
+ +#### Scenario: Re-deploy of unchanged tree makes no S3 PUTs + +- **WHEN** a caller invokes `deployDir` on a directory whose every file's SHA-256 already exists in the project's CAS +- **THEN** `POST /content/v1/plans` reports an empty `missing` list +- **AND** no `PUT` requests are sent to S3 +- **AND** the commit returns immediately with the new release activated + +#### Scenario: Stage-2 copy polls until ready + +- **WHEN** the commit response indicates the operation is still `running` (site copy not yet complete) +- **THEN** `deployDir` polls `GET /deploy/v2/operations/:id` until `status` is `ready` +- **AND** returns the final `{ deployment_id, url }` from the activated release + +### Requirement: deployDir emits progress events via onEvent callback + +`NodeSites.deployDir` SHALL accept an optional `onEvent: (event: DeployEvent) => void` field on its options object. When provided, the callback SHALL receive the structured event envelope defined by the `unified-deploy` capability (`plan.started`, `plan.diff`, `content.upload.progress`, `content.upload.skipped`, `commit.phase`, `ready`, etc.). + +For backward compatibility, `deployDir` SHALL also synthesize the legacy v1.32 event shapes (`{ phase: "plan", manifest_size }`, `{ phase: "upload", file, sha256, done, total }`, `{ phase: "commit" }`, `{ phase: "poll", status, elapsed_ms }`) and emit them alongside the new shapes for one minor release cycle. Existing event consumers SHALL continue to receive their expected payloads during the deprecation window. + +After the deprecation window, only the unified `DeployEvent` shapes SHALL be emitted. Consumers using the legacy `phase` field SHALL migrate to the discriminated `type` field. + +Errors thrown synchronously from the callback SHALL be caught and silently dropped. 
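A small sketch of the callback contract during the compatibility window: each step emits the unified shape and the legacy shape, and a throwing callback never interrupts the deploy. The legacy `upload` payload follows the shape quoted above; the fields on the unified event other than `type` are assumptions, as are the helper names.

```typescript
// Legacy v1.32 upload event, as quoted in the requirement.
interface LegacyUploadEvent {
  phase: "upload";
  file: string;
  sha256: string;
  done: number;
  total: number;
}

// Unified event: only the `type` value comes from the spec; other fields are hypothetical.
interface UnifiedUploadEvent {
  type: "content.upload.progress";
  file: string;
  sha256: string;
  done: number;
  total: number;
}

type AnyDeployEvent = LegacyUploadEvent | UnifiedUploadEvent;
type OnEvent = (event: AnyDeployEvent) => void;

// Invoke the callback, swallowing anything it throws synchronously.
function safeEmit(onEvent: OnEvent | undefined, event: AnyDeployEvent): void {
  if (!onEvent) return;
  try {
    onEvent(event);
  } catch {
    // Dropped silently: a faulty consumer must not fail the deploy.
  }
}

// Emit one upload step in both shapes, unified first.
function emitUploadProgress(
  onEvent: OnEvent | undefined,
  step: { file: string; sha256: string; done: number; total: number },
): void {
  safeEmit(onEvent, { type: "content.upload.progress", ...step });
  safeEmit(onEvent, { phase: "upload", ...step });
}
```

Consumers that switch on `event.phase` keep seeing their payloads, while newer consumers can discriminate on `event.type` and ignore the rest.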
+ +#### Scenario: New event shapes fire alongside legacy shapes during compat window + +- **WHEN** a caller invokes `deployDir({ project, dir, onEvent })` against a directory of 5 files where 3 are missing +- **THEN** the callback receives both the unified events (`plan.started`, `plan.diff`, `content.upload.progress` ×3, `commit.phase` series, `ready`) AND the legacy events (`{ phase: "plan", manifest_size: 5 }`, `{ phase: "upload", ... }` ×3, `{ phase: "commit" }`) + +#### Scenario: Existing legacy-only consumers still work + +- **WHEN** a caller's `onEvent` switches on `event.phase` only (legacy v1.32 pattern) +- **THEN** the deploy completes successfully and the legacy phase events fire as expected +- **AND** the caller is not broken by the addition of unified event shapes + +### Requirement: CLI subcommand sites deploy-dir exposes the helper + +The CLI SHALL accept a subcommand invoked as `run402 sites deploy-dir --project [--target