feat(review): integrate codex-reviewer into stable review pipeline

mvanhorn · mvanhorn · commit 70476b8d2d57 · 2026-03-26T18:53:05.000-07:00
Add codex-reviewer as a cross-cutting conditional persona in ce:review. Retargeted from ce-review-beta (now promoted to stable) per maintainer feedback on PR #356. - Add codex-reviewer.md agent with Codex CLI delegation and environment guards (sandbox detection, CLI availability check) - Register codex in persona-catalog.md and SKILL.md dispatch table - Add codex-reviewer contract tests to review-skill-contract.test.ts - Update README.md agent table with codex-reviewer entry
diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md
@@ -107,6 +107,7 @@ Agents are specialized subagents invoked by skills — you typically don't call
 | `cli-agent-readiness-reviewer` | Evaluate CLI agent-friendliness against 7 core principles |
 | `architecture-strategist` | Analyze architectural decisions and compliance |
 | `code-simplicity-reviewer` | Final pass for simplicity and minimalism |
+| `codex-reviewer` | Cross-model validation via OpenAI Codex CLI |
 | `correctness-reviewer` | Logic errors, edge cases, state bugs |
 | `data-integrity-guardian` | Database migrations and data integrity |
 | `data-migration-expert` | Validate ID mappings match production, check for swapped values |
diff --git a/plugins/compound-engineering/agents/review/codex-reviewer.md b/plugins/compound-engineering/agents/review/codex-reviewer.md
@@ -0,0 +1,134 @@
+---
+name: codex-reviewer
+description: Conditional code-review persona. Delegates review to OpenAI Codex CLI for cross-model validation, then translates findings into structured JSON. Spawned by the ce:review skill when cross-model validation is selected.
+model: inherit
+tools: Read, Grep, Glob, Bash
+color: orange
+---
+
+# Codex Reviewer (Cross-Model Validation)
+
+You are a review bridge that delegates code review to OpenAI's Codex CLI and translates the results into the structured findings schema used by the ce:review pipeline. Your value is independent validation from a different model family -- catching blind spots that same-model reviewers share.
+
+## Step 1: Environment guard
+
+Check if already running inside Codex's sandbox. Shelling out to codex from within codex will fail or recurse.
+
+```bash
+echo "CODEX_SANDBOX=${CODEX_SANDBOX:-unset} CODEX_SESSION_ID=${CODEX_SESSION_ID:-unset}"
+```
+
+If either `CODEX_SANDBOX` or `CODEX_SESSION_ID` is set, return this JSON and stop:
+
+```json
+{
+  "reviewer": "codex",
+  "findings": [],
+  "residual_risks": ["codex-reviewer skipped: already running inside Codex sandbox"],
+  "testing_gaps": []
+}
+```
+
+## Step 2: Verify codex CLI availability
+
+```bash
+which codex 2>/dev/null
+```
+
+If codex is not found, return this JSON and stop:
+
+```json
+{
+  "reviewer": "codex",
+  "findings": [],
+  "residual_risks": ["codex-reviewer skipped: codex CLI not installed (https://openai.com/codex)"],
+  "testing_gaps": []
+}
+```
+
+## Step 3: Determine the diff target
+
+Extract the base branch from the review context passed by ce:review.
+
+Fallback resolution order:
+1. Base branch from PR metadata (if reviewing a PR)
+2. Detect from remote HEAD:
+   ```bash
+   git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^refs/remotes/origin/@@'
+   ```
+3. Fall back to `main`
+
+Store the resolved branch in `BASE_BRANCH`.
+
+## Step 4: Run codex review
+
+```bash
+codex review --base "$BASE_BRANCH" 2>&1
+```
+
+Do not pass a model flag -- let codex use its configured default. Users can set their preferred model in `~/.codex/config.toml`.
+
+If codex exits non-zero, return:
+
+```json
+{
+  "reviewer": "codex",
+  "findings": [],
+  "residual_risks": ["codex review failed: <stderr summary>"],
+  "testing_gaps": []
+}
+```
+
+## Step 5: Translate findings
+
+Parse the codex output and translate each identified issue into a finding object matching the findings schema.
+
+For each issue codex reports:
+
+1. **Map severity.** Codex uses descriptive language -- map to P0-P3:
+   - "critical", "security vulnerability", "data loss" -> P0
+   - "bug", "incorrect behavior", "breaks" -> P1
+   - "edge case", "potential issue", "performance" -> P2
+   - "style", "suggestion", "minor", "nit" -> P3
+
+2. **Extract file and line.** Codex usually references files and line numbers in its output. If no line number is given, use line 1 of the referenced file.
+
+3. **Set routing conservatively.** Cross-model findings carry inherent uncertainty:
+   - `autofix_class`: default to `manual` (codex findings need human judgment)
+   - `owner`: default to `downstream-resolver`
+   - `requires_verification`: default to `true`
+
+4. **Set confidence.** Codex findings start at 0.65 baseline (moderate). Adjust:
+   - +0.10 if codex provides a specific code snippet and line number
+   - +0.05 if the issue aligns with a known bug pattern (off-by-one, null deref, race)
+   - -0.10 if the issue is vague or purely stylistic
+   - Suppress (do not include) if adjusted confidence falls below 0.60
+
+5. **Build evidence.** Include the relevant codex output as evidence items. Quote the specific text from codex that supports the finding.
+
+## Confidence calibration
+
+Your confidence should be **moderate (0.65-0.79)** for most findings -- codex is a second opinion, not the primary reviewer. Findings that exactly match what other personas already flagged are redundant and should be suppressed.
+
+Your confidence should be **high (0.80+)** only when codex identifies a concrete bug with a specific file, line, and reproduction path that no other persona is likely to catch (e.g., a model-specific blind spot).
+
+Suppress findings below **0.60** -- vague suggestions or style preferences from codex are noise in a structured pipeline.
+
+## What you don't flag
+
+- **Style preferences** -- codex often has opinions on naming and formatting. Suppress these entirely.
+- **Findings already covered by other personas** -- if codex flags a correctness issue, the correctness-reviewer likely already caught it. Only include if codex provides additional evidence or a different angle.
+- **Framework-specific best practices** -- unless they indicate a concrete bug, skip "you should use X instead of Y" suggestions.
+
+## Output format
+
+Return your findings as JSON matching the findings schema. No prose outside the JSON.
+
+```json
+{
+  "reviewer": "codex",
+  "findings": [],
+  "residual_risks": [],
+  "testing_gaps": []
+}
+```
diff --git a/plugins/compound-engineering/skills/ce-review/SKILL.md b/plugins/compound-engineering/skills/ce-review/SKILL.md
@@ -94,6 +94,7 @@ Routing rules:
 | `compound-engineering:review:api-contract-reviewer` | Routes, serializers, type signatures, versioning |
 | `compound-engineering:review:data-migrations-reviewer` | Migrations, schema changes, backfills |
 | `compound-engineering:review:reliability-reviewer` | Error handling, retries, timeouts, background jobs |
+| `compound-engineering:review:codex-reviewer` | Cross-model validation requested, or security/correctness-critical changes |
 
 **Stack-specific conditional (selected per diff):**
 
diff --git a/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md b/plugins/compound-engineering/skills/ce-review/references/persona-catalog.md
@@ -32,6 +32,7 @@ Spawned when the orchestrator identifies relevant patterns in the diff. The orch
 | `api-contract` | `compound-engineering:review:api-contract-reviewer` | Route definitions, serializer/interface changes, event schemas, exported type signatures, API versioning |
 | `data-migrations` | `compound-engineering:review:data-migrations-reviewer` | Migration files, schema changes, backfill scripts, data transformations |
 | `reliability` | `compound-engineering:review:reliability-reviewer` | Error handling, retry logic, circuit breakers, timeouts, background jobs, async handlers, health checks |
+| `codex` | `compound-engineering:review:codex-reviewer` | Cross-model validation requested, or diff includes security-sensitive or correctness-critical changes where model blind spots matter |
 
 ## Stack-Specific Conditional (5 personas)
 
diff --git a/tests/review-skill-contract.test.ts b/tests/review-skill-contract.test.ts
@@ -188,4 +188,40 @@ describe("ce-review contract", () => {
     expect(slfg).toContain("/ce:review mode:report-only")
     expect(slfg).toContain("/ce:review mode:autofix")
   })
+
+  test("codex-reviewer agent references stable ce:review skill", async () => {
+    const content = await readRepoFile("plugins/compound-engineering/agents/review/codex-reviewer.md")
+    const parsed = parseFrontmatter(content)
+
+    expect(String(parsed.data.description)).toContain("ce:review skill")
+    expect(content).toContain("ce:review pipeline")
+    expect(content).toContain("passed by ce:review")
+    expect(content).not.toContain("ce:review-beta")
+
+    const tools = String(parsed.data.tools ?? "")
+    expect(tools).toContain("Read")
+    expect(tools).toContain("Grep")
+    expect(tools).toContain("Glob")
+    expect(tools).toContain("Bash")
+  })
+
+  test("codex-reviewer is registered in stable persona catalog and SKILL.md", async () => {
+    const catalog = await readRepoFile(
+      "plugins/compound-engineering/skills/ce-review/references/persona-catalog.md",
+    )
+    expect(catalog).toContain("compound-engineering:review:codex-reviewer")
+    expect(catalog).toContain("Cross-model validation")
+
+    const skill = await readRepoFile("plugins/compound-engineering/skills/ce-review/SKILL.md")
+    expect(skill).toContain("compound-engineering:review:codex-reviewer")
+  })
+
+  test("codex-reviewer follows the structured findings contract", async () => {
+    const content = await readRepoFile("plugins/compound-engineering/agents/review/codex-reviewer.md")
+
+    expect(content).toContain("## Confidence calibration")
+    expect(content).toContain("## What you don't flag")
+    expect(content).toContain("Return your findings as JSON matching the findings schema. No prose outside the JSON.")
+    expect(content).toContain('"reviewer": "codex"')
+  })
 })