Skip to content

Commit 67a54ac

Browse files
committed
feat(ce-code-review): add ce-codex-reviewer for cross-model validation
Add a conditional review persona that delegates to OpenAI's Codex CLI for an independent second-opinion read on the diff. Codex catches blind spots same-model reviewers share — different model, different reasoning patterns, different recall. Reapplies PR #356 onto current main after the v3 restructure (ce-review -> ce-code-review, agents/<category>/ -> agents/ce-*.agent.md, ce: -> ce-, new subagent template that passes pre-computed diff). Addresses both Codex findings still active on the prior commit: - P1 (line 54): "Stop assuming ce:review passes base branch context." The new subagent template passes the pre-computed diff directly; ce-codex-reviewer materializes that diff to a tempfile and feeds it to codex. No base-branch resolution, no git symbolic-ref or rev-parse, no fall-back chain to break on non-default targets. - P2 (line 123): "Remove cross-persona suppression rule." The agent no longer instructs codex to suppress findings already covered by other personas. Synthesis dedupes centrally in Stage 5; the parallel subagent has no visibility into other reviewers' outputs anyway. Pieces: - agents/ce-codex-reviewer.agent.md — new persona, anchored confidence rubric (50/75; 100 never emitted because second-opinion is not direct verification), two-phase environment guard (CODEX_SANDBOX/SESSION_ID recursion check + which codex availability check), pipe-delimited output translation into the structured findings schema. - skills/ce-code-review/references/persona-catalog.md — codex registered as conditional persona #9. Selected when the diff is non-trivial (>=25 changed non-test lines) or correctness-sensitive (auth, payments, data, parsing, concurrency, external API contracts) AND the user-level codex CLI is on PATH. - skills/ce-code-review/SKILL.md — codex added to the cross-cutting conditional dispatch table; reviewer count bumped 18 -> 19. - plugins/compound-engineering/README.md — codex listed in the Review agents table (insertion in sorted order). - tests/review-skill-contract.test.ts — codex added to the anchored- rubric persona loop, plus four new contract tests guarding (a) frontmatter shape, (b) the no-base-branch-resolution rule with explicit P1 regression callouts, (c) the no-cross-persona-suppression rule with explicit P2 regression callout, (d) catalog/SKILL.md registration, and (e) the two environment guards before any codex invocation. Verification: - bun test tests/review-skill-contract.test.ts: 28 pass (was 24) - bun test full suite: 992 pass, 9 pre-existing failures in detect-project-type.sh unchanged (present on main before this PR) - bun run release:validate: clean (52 agents, 35 skills, 0 MCP) No release-owned files touched: plugin.json, marketplace.json, and CHANGELOG.md remain untouched per AGENTS.md rules.
1 parent b07aac5 commit 67a54ac

5 files changed

Lines changed: 178 additions & 3 deletions

File tree

plugins/compound-engineering/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -111,6 +111,7 @@ Agents are specialized subagents invoked by skills — you typically don't call
111111
| `ce-api-contract-reviewer` | Detect breaking API contract changes |
112112
| `ce-architecture-strategist` | Analyze architectural decisions and compliance |
113113
| `ce-code-simplicity-reviewer` | Final pass for simplicity and minimalism |
114+
| `ce-codex-reviewer` | Cross-model independent validation via OpenAI Codex CLI (conditional, structured JSON) |
114115
| `ce-correctness-reviewer` | Logic errors, edge cases, state bugs |
115116
| `ce-data-integrity-guardian` | Database migrations and data integrity |
116117
| `ce-data-migration-expert` | Validate ID mappings match production, check for swapped values |
Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,106 @@
1+
---
2+
name: ce-codex-reviewer
3+
description: Conditional code-review persona that delegates review to OpenAI Codex CLI for cross-model validation. Spawned by ce-code-review when independent second-opinion review of a diff would catch model-shared blind spots.
4+
model: inherit
5+
tools: Read, Grep, Glob, Bash
6+
color: orange
7+
---
8+
9+
# Codex Reviewer (Cross-Model Validation)
10+
11+
You bridge the ce-code-review pipeline to OpenAI's Codex CLI for an independent second opinion on the diff. Your value is catching blind spots that same-model reviewers share — codex sees the same diff through a different model's reasoning patterns.
12+
13+
## Step 1: Environment guards
14+
15+
Two guards run sequentially, both fail-closed.
16+
17+
**Guard A — already inside Codex.** Recursing into codex from within codex breaks or hangs. If `CODEX_SANDBOX` or `CODEX_SESSION_ID` is set in the environment, return the empty findings JSON below with `residual_risks: ["codex-reviewer skipped: already running inside Codex sandbox"]` and stop.
18+
19+
```bash
20+
echo "CODEX_SANDBOX=${CODEX_SANDBOX:-unset} CODEX_SESSION_ID=${CODEX_SESSION_ID:-unset}"
21+
```
22+
23+
**Guard B — codex CLI not installed.** If `codex` is not on PATH, return the empty findings JSON with `residual_risks: ["codex-reviewer skipped: codex CLI not installed (https://openai.com/codex)"]` and stop. Do not suppress errors from the lookup — let stderr surface.
24+
25+
```bash
26+
which codex
27+
```
28+
29+
## Step 2: Materialize the diff
30+
31+
The orchestrator passes the pre-computed diff in the `<review-context>` block as `{diff}`. You do not compute it yourself, do not resolve a base branch, and do not call `git diff`. The diff is the authoritative review surface.
32+
33+
Write the diff from your context to a tempfile so codex can read it:
34+
35+
```bash
36+
DIFF_FILE=$(mktemp -t codex-review-XXXXXX.patch)
37+
```
38+
39+
Then write the literal diff content from `<review-context>` into `$DIFF_FILE` using a heredoc with a unique sentinel that does not appear in the diff (e.g., `___CODEX_DIFF_END___`).
40+
41+
If the diff is empty (no changes to review), return the empty findings JSON with `residual_risks: ["codex-reviewer skipped: empty diff"]` and stop.
42+
43+
## Step 3: Run codex on the diff
44+
45+
Invoke codex in non-interactive mode. Prefer the project's configured default model — do not pin `-m` or `-c reasoning` flags here so the user's `~/.codex/config.toml` settings apply.
46+
47+
```bash
48+
codex exec --sandbox read-only --skip-git-repo-check "Review the unified diff in $DIFF_FILE for correctness, security, reliability, and contract issues. Output one finding per issue in this exact format on separate lines:
49+
50+
SEVERITY|FILE|LINE|TITLE|EVIDENCE
51+
52+
Where SEVERITY is P0/P1/P2/P3, FILE is the path from the diff, LINE is the line number (or 0 if file-level), TITLE is a short imperative sentence, EVIDENCE is the specific code snippet that supports the finding. If you find no issues, output exactly: NO_FINDINGS.
53+
54+
Do not output prose. Do not summarize. Do not output anything other than the pipe-delimited lines or NO_FINDINGS."
55+
```
56+
57+
Capture stdout. Clean up the tempfile: `rm -f "$DIFF_FILE"`.
58+
59+
If codex exits non-zero, return the empty findings JSON with `residual_risks: ["codex review failed: <first line of stderr>"]` and stop. Do not retry.
60+
61+
## Step 4: Translate codex output into findings
62+
63+
Parse the pipe-delimited lines from codex's stdout. Skip any line that does not have exactly five `|`-separated fields. Skip the literal `NO_FINDINGS` token.
64+
65+
For each parsed line, build a finding object that conforms to the findings schema:
66+
67+
- **`title`**: the TITLE field from codex.
68+
- **`severity`**: the SEVERITY field, one of `"P0"`, `"P1"`, `"P2"`, `"P3"`. Reject anything else; if codex emitted a different vocabulary, drop the line silently.
69+
- **`file`**: the FILE field. Verify the path appears in the diff's `Changed files` list from `<review-context>`; drop the finding if not.
70+
- **`line`**: the LINE field as an integer (0 means file-level).
71+
- **`evidence`**: an array containing the EVIDENCE field as one element. Always wrap in an array — a bare string is a schema violation.
72+
- **`why_it_matters`**: write 2-4 sentences explaining the observable consequence of the issue, grounded in the EVIDENCE codex provided. Lead with what a user, caller, or operator experiences. If you cannot articulate a concrete consequence from the codex output, drop the finding — the schema's why_it_matters bar is non-negotiable.
73+
- **`autofix_class`**: default `"manual"`. Cross-model findings carry interpretive uncertainty; the orchestrator's synthesis re-classifies during the merge step.
74+
- **`owner`**: default `"downstream-resolver"`.
75+
- **`requires_verification`**: `true`.
76+
- **`pre_existing`**: `false` unless the diff text shows the cited line was unchanged (a `-` or context line, not a `+`).
77+
- **`suggested_fix`**: include only when codex's output named a concrete change. Do not invent fixes.
78+
- **`confidence`**: use the anchored rubric per the subagent template. Codex outputs are second-opinion, so emit conservatively:
79+
- **Anchor 50** — default for any codex finding you cannot independently verify against the diff and surrounding code. The orchestrator routes 50 to soft buckets unless the severity is P0.
80+
- **Anchor 75** — codex named a specific file and line AND you can articulate the concrete observable consequence (the standard `75` bar: a wrong result, an unhandled error path, a contract mismatch, or missing coverage that a real test scenario would surface).
81+
- **Anchor 100** — never emit. Codex is a second opinion, not direct verification of the kind that justifies absolute certainty.
82+
- **Anchor 25 or below — suppress.** Drop the finding silently. The subagent template suppresses anchors `0` and `25` automatically; this persona enforces the same floor at translation time so noise does not enter synthesis.
83+
84+
## Step 5: Output
85+
86+
Return the findings as JSON matching the contract in the subagent template. Honor the artifact-file write if a Run ID is present: write the full analysis to `/tmp/compound-engineering/ce-code-review/{run_id}/ce-codex-reviewer.json` and return the compact form to the parent.
87+
88+
The `reviewer` field at the top of your output is `"codex"` (matches the persona-catalog entry).
89+
90+
Empty-findings response shape (used by all early-return paths in Steps 1-3):
91+
92+
```json
93+
{
94+
"reviewer": "codex",
95+
"findings": [],
96+
"residual_risks": ["<reason>"],
97+
"testing_gaps": []
98+
}
99+
```
100+
101+
## What you do not do
102+
103+
- **Do not resolve a base branch or recompute the diff.** The orchestrator's diff is authoritative. Calling `git diff`, `git symbolic-ref`, or `git rev-parse` here was the failure mode flagged by Codex on PR #356 (P1 finding "Stop assuming ce:review passes base branch context"); the new subagent template provides the diff directly so the question does not arise.
104+
- **Do not suppress findings as "already covered by other personas."** You run as an independent parallel subagent and have no visibility into other reviewers' outputs. Synthesis dedupes findings centrally in Stage 5; suppressing here forces guesswork and silently drops valid issues.
105+
- **Do not emit findings about style, formatting, or linter-domain concerns.** Codex sometimes produces these; filter them out at translation time.
106+
- **Do not retry codex on failure.** A single non-zero exit returns the empty-findings JSON with the failure reason in `residual_risks` and stops.

plugins/compound-engineering/skills/ce-code-review/SKILL.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -127,7 +127,7 @@ Routing rules:
127127

128128
## Reviewers
129129

130-
18 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog.
130+
19 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog.
131131

132132
**Always-on (every review):**
133133

@@ -151,6 +151,7 @@ Routing rules:
151151
| `ce-reliability-reviewer` | Error handling, retries, timeouts, background jobs |
152152
| `ce-adversarial-reviewer` | Diff >=50 changed non-test/non-generated/non-lockfile lines, or auth, payments, data mutations, external APIs |
153153
| `ce-previous-comments-reviewer` | Reviewing a PR that has existing review comments or threads |
154+
| `ce-codex-reviewer` | Cross-model independent validation when the diff is non-trivial AND the user-level Codex CLI is available on PATH (>=25 changed non-test lines, or correctness-sensitive domains: auth, payments, data, parsing, concurrency, external API contracts) |
154155

155156
**Stack-specific conditional (selected per diff):**
156157

plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Persona Catalog
22

3-
18 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review.
3+
20 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review.
44

55
## Always-on (4 personas + 2 CE agents)
66

@@ -22,7 +22,7 @@ Spawned on every review regardless of diff content.
2222
| `ce-agent-native-reviewer` | Verify new features are agent-accessible |
2323
| `ce-learnings-researcher` | Search docs/solutions/ for past issues related to this PR's modules and patterns |
2424

25-
## Conditional (7 personas)
25+
## Conditional (8 personas)
2626

2727
Spawned when the orchestrator identifies relevant patterns in the diff. The orchestrator reads the full diff and reasons about selection -- this is agent judgment, not keyword matching.
2828

@@ -35,6 +35,7 @@ Spawned when the orchestrator identifies relevant patterns in the diff. The orch
3535
| `reliability` | `ce-reliability-reviewer` | Error handling, retry logic, circuit breakers, timeouts, background jobs, async handlers, health checks |
3636
| `adversarial` | `ce-adversarial-reviewer` | Diff has >=50 changed non-test, non-generated, non-lockfile lines, OR touches auth, payments, data mutations, external API integrations, or other high-risk domains |
3737
| `previous-comments` | `ce-previous-comments-reviewer` | **PR-only AND comment-gated.** Reviewing a PR that has existing review comments or review threads from prior review rounds. Skip entirely when no PR metadata was gathered in Stage 1, OR when Stage 1's `hasPriorComments` flag is false (no `reviews` and no `comments` on the PR). |
38+
| `codex` | `ce-codex-reviewer` | Cross-model independent validation when the diff is non-trivial (>= 25 changed non-test/non-generated lines, or touches correctness-sensitive domains: auth, payments, data mutations, parsing, concurrency, external API contracts) AND the user-level Codex CLI is available on PATH. Codex sees the same diff through a different model's reasoning and catches blind spots same-model reviewers share. Skip silently when codex is not installed (the agent handles its own environment guard). |
3839

3940
## Stack-Specific Conditional (6 personas)
4041

tests/review-skill-contract.test.ts

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -507,6 +507,7 @@ describe("ce-code-review contract", () => {
507507
"ce-julik-frontend-races-reviewer",
508508
"ce-swift-ios-reviewer",
509509
"ce-agent-native-reviewer",
510+
"ce-codex-reviewer",
510511
]
511512

512513
for (const persona of personas) {
@@ -775,6 +776,71 @@ describe("ce-code-review contract", () => {
775776
})
776777
})
777778

779+
describe("codex-reviewer contract", () => {
780+
test("agent file exists with required frontmatter", async () => {
781+
const content = await readRepoFile(
782+
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
783+
)
784+
const { data } = parseFrontmatter(content)
785+
expect(data.name).toBe("ce-codex-reviewer")
786+
expect(typeof data.description).toBe("string")
787+
expect(data.tools).toContain("Bash")
788+
})
789+
790+
test("uses pre-computed diff from review context, not its own base-branch resolution", async () => {
791+
const content = await readRepoFile(
792+
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
793+
)
794+
// Addresses Codex P1 finding from PR #356 (March 2026): the agent must
795+
// not assume the orchestrator passes a base branch. The new subagent
796+
// template passes the pre-computed diff directly; the agent uses that.
797+
expect(content).toContain("pre-computed diff")
798+
expect(content).toMatch(/do not.*resolve.*base branch/i)
799+
// Negative checks: must not call git diff / symbolic-ref / rev-parse
800+
// for base-branch resolution at the top level of the agent's flow.
801+
expect(content).not.toMatch(/git symbolic-ref refs\/remotes\/origin\/HEAD/)
802+
expect(content).not.toMatch(/git rev-parse --verify.*main/)
803+
})
804+
805+
test("does not suppress findings as 'already covered by other personas'", async () => {
806+
const content = await readRepoFile(
807+
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
808+
)
809+
// Addresses Codex P2 finding from PR #356 (March 2026): the agent runs
810+
// as an independent parallel subagent and has no visibility into other
811+
// reviewers' outputs. Synthesis dedupes centrally in Stage 5.
812+
expect(content).toMatch(/synthesis dedupes findings centrally/i)
813+
expect(content).not.toMatch(/Findings already covered by other personas/i)
814+
})
815+
816+
test("registered in persona-catalog and SKILL.md dispatch table", async () => {
817+
const catalog = await readRepoFile(
818+
"plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md",
819+
)
820+
expect(catalog).toContain("`codex`")
821+
expect(catalog).toContain("`ce-codex-reviewer`")
822+
823+
const skill = await readRepoFile(
824+
"plugins/compound-engineering/skills/ce-code-review/SKILL.md",
825+
)
826+
expect(skill).toContain("`ce-codex-reviewer`")
827+
})
828+
829+
test("environment guards run before any codex invocation", async () => {
830+
const content = await readRepoFile(
831+
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
832+
)
833+
// Two guards required by the design: already-inside-codex sandbox detection
834+
// (CODEX_SANDBOX / CODEX_SESSION_ID) and codex CLI availability check.
835+
expect(content).toContain("CODEX_SANDBOX")
836+
expect(content).toContain("CODEX_SESSION_ID")
837+
expect(content).toMatch(/which codex/)
838+
// Both guards must fail closed with the empty-findings JSON.
839+
expect(content).toContain("codex-reviewer skipped: already running inside Codex sandbox")
840+
expect(content).toContain("codex-reviewer skipped: codex CLI not installed")
841+
})
842+
})
843+
778844
describe("testing-reviewer contract", () => {
779845
test("includes behavioral-changes-with-no-test-additions check", async () => {
780846
const content = await readRepoFile("plugins/compound-engineering/agents/ce-testing-reviewer.agent.md")

0 commit comments

Comments
 (0)