Skip to content

Commit adba28a

Browse files
committed
fix(ce-codex-reviewer): NDJSON output and schema-valid line numbers
Addresses both Codex findings on a8a5ed2 (April 2026): P1 — Emit schema-valid line numbers for file-level findings. The findings schema requires `line >= 1` (verified in references/findings-schema.json). The previous prompt told codex to emit `LINE=0` for file-level issues, which would have been dropped by the merge validator as malformed before synthesis — silently losing every file-level Codex finding. Fix: the codex prompt now requires `line` to be a positive integer and adds a `file_level` boolean. When `file_level: true`, codex sets `line: 1` and the agent prepends a "file-level finding (no specific line applies)" string to the evidence array so synthesis and downstream surfaces can still distinguish "line 1 was the issue" from "this is a whole-file concern." The `file_level` signal is preserved in evidence rather than as a separate field because the findings schema doesn't expose a top-level file_level flag, and inventing one would fail the strict-validator path. P2 — Use structured output so evidence cannot break parsing. The previous pipe-delimited contract dropped any row that wasn't exactly five `|`-separated fields. EVIDENCE is a raw code snippet that can legitimately contain `|` (bitwise OR / union types, shell pipes, markdown tables, regex alternation). When that happened, valid findings disappeared silently. Fix: switch the codex prompt to NDJSON — one JSON object per line. The agent JSON-parses each line independently; embedded pipes in evidence are no longer a parsing hazard because JSON quoting handles them. Lines that fail to parse are skipped (no retry, no inference). Test coverage: - New contract test "uses NDJSON output contract so evidence can carry pipes safely" guards the prompt format and asserts the pipe-delimited shape stays gone. - New contract test "emits schema-valid line numbers for file-level findings" guards the line=1 + file_level=true convention, asserts the old "0 means file-level" wording is gone, and reads the findings schema to verify minimum=1 still holds (so a future schema relaxation triggers a deliberate test update rather than a silent drift). Both tests cite their originating Codex finding inline so future regressions get a clear pointer to PR #356. bun test tests/review-skill-contract.test.ts: 30 pass (was 28) bun run release:validate: clean
1 parent a8a5ed2 commit adba28a

2 files changed

Lines changed: 57 additions & 11 deletions

File tree

plugins/compound-engineering/agents/ce-codex-reviewer.agent.md

Lines changed: 17 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -45,13 +45,19 @@ If the diff is empty (no changes to review), return the empty findings JSON with
4545
Invoke codex in non-interactive mode. Prefer the project's configured default model — do not pin `-m` or `-c reasoning` flags here so the user's `~/.codex/config.toml` settings apply.
4646

4747
```bash
48-
codex exec --sandbox read-only --skip-git-repo-check "Review the unified diff in $DIFF_FILE for correctness, security, reliability, and contract issues. Output one finding per issue in this exact format on separate lines:
48+
codex exec --sandbox read-only --skip-git-repo-check "Review the unified diff in $DIFF_FILE for correctness, security, reliability, and contract issues. Output one JSON object per line (NDJSON), with these fields per object:
4949
50-
SEVERITY|FILE|LINE|TITLE|EVIDENCE
50+
{\"severity\": \"P0|P1|P2|P3\", \"file\": \"path/from/diff\", \"line\": <positive integer>, \"file_level\": <true|false>, \"title\": \"short imperative sentence\", \"evidence\": \"specific code snippet that supports the finding\"}
5151
52-
Where SEVERITY is P0/P1/P2/P3, FILE is the path from the diff, LINE is the line number (or 0 if file-level), TITLE is a short imperative sentence, EVIDENCE is the specific code snippet that supports the finding. If you find no issues, output exactly: NO_FINDINGS.
52+
Rules for each field:
53+
- severity: exactly one of \"P0\", \"P1\", \"P2\", \"P3\".
54+
- file: a path from the diff's Changed files list. No quoting, no expansion.
55+
- line: a positive integer (1 or higher). For file-level findings (no specific line), use 1 and set file_level to true.
56+
- file_level: true when the issue is whole-file and no specific line applies; false when the line number is precise.
57+
- title: a short imperative sentence describing the issue.
58+
- evidence: a string with the exact code snippet, quote, or line content that supports the finding. May contain any characters including pipes, JSON-escaped per JSON rules.
5359
54-
Do not output prose. Do not summarize. Do not output anything other than the pipe-delimited lines or NO_FINDINGS."
60+
Output one JSON object per line. Do not wrap them in an array. Do not output prose, headers, or commentary. If you find no issues, output exactly the literal token: NO_FINDINGS"
5561
```
5662

5763
Capture stdout. Clean up the tempfile: `rm -f "$DIFF_FILE"`.
@@ -60,15 +66,15 @@ If codex exits non-zero, return the empty findings JSON with `residual_risks: ["
6066

6167
## Step 4: Translate codex output into findings
6268

63-
Parse the pipe-delimited lines from codex's stdout. Skip any line that does not have exactly five `|`-separated fields. Skip the literal `NO_FINDINGS` token.
69+
Parse codex's stdout line-by-line as NDJSON. Skip blank lines. Skip the literal `NO_FINDINGS` token. For each remaining line, attempt to JSON-parse it; if parsing fails, skip the line (do not retry, do not infer).
6470

65-
For each parsed line, build a finding object that conforms to the findings schema:
71+
For each successfully-parsed object, build a finding object that conforms to the findings schema:
6672

67-
- **`title`**: the TITLE field from codex.
68-
- **`severity`**: the SEVERITY field, one of `"P0"`, `"P1"`, `"P2"`, `"P3"`. Reject anything else; if codex emitted a different vocabulary, drop the line silently.
69-
- **`file`**: the FILE field. Verify the path appears in the diff's `Changed files` list from `<review-context>`; drop the finding if not.
70-
- **`line`**: the LINE field as an integer (0 means file-level).
71-
- **`evidence`**: an array containing the EVIDENCE field as one element. Always wrap in an array — a bare string is a schema violation.
73+
- **`title`**: the `title` field from the codex output.
74+
- **`severity`**: the `severity` field, one of `"P0"`, `"P1"`, `"P2"`, `"P3"`. Reject anything else; if codex emitted a different vocabulary, drop the object silently.
75+
- **`file`**: the `file` field. Verify the path appears in the diff's `Changed files` list from `<review-context>`; drop the finding if not.
76+
- **`line`**: the `line` field as an integer. The findings schema requires `line >= 1`, so reject any object with `line < 1` outright. When codex set `file_level: true`, line will already be `1` per the contract above; the agent does not re-default `0` to `1` because the contract refuses `0`. The `file_level` signal is preserved in `evidence` (see below) so synthesis and downstream surfaces can distinguish "line 1 was the issue" from "this is a file-level concern."
77+
- **`evidence`**: an array. Always include the `evidence` string from codex as the first element. When `file_level: true`, prepend an additional element: `"file-level finding (no specific line applies)"`. The schema requires evidence to be a non-empty array of strings; never emit a bare string.
7278
- **`why_it_matters`**: write 2-4 sentences explaining the observable consequence of the issue, grounded in the EVIDENCE codex provided. Lead with what a user, caller, or operator experiences. If you cannot articulate a concrete consequence from the codex output, drop the finding — the schema's why_it_matters bar is non-negotiable.
7379
- **`autofix_class`**: default `"manual"`. Cross-model findings carry interpretive uncertainty; the orchestrator's synthesis re-classifies during the merge step.
7480
- **`owner`**: default `"downstream-resolver"`.

tests/review-skill-contract.test.ts

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -760,6 +760,46 @@ describe("codex-reviewer contract", () => {
760760
expect(content).toContain("codex-reviewer skipped: already running inside Codex sandbox")
761761
expect(content).toContain("codex-reviewer skipped: codex CLI not installed")
762762
})
763+
764+
test("uses NDJSON output contract so evidence can carry pipes safely", async () => {
765+
const content = await readRepoFile(
766+
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
767+
)
768+
// Addresses Codex P2 finding on commit a8a5ed2 (April 2026): pipe-delimited
769+
// output dropped any row that wasn't exactly five `|`-separated fields,
770+
// so EVIDENCE containing a literal `|` (bitwise OR, shell pipes, markdown
771+
// tables) silently lost real findings. NDJSON is robust to embedded pipes.
772+
expect(content).toContain("NDJSON")
773+
expect(content).toMatch(/JSON-?parse/i)
774+
// Pipe-delimited contract must be gone from the agent's prose.
775+
expect(content).not.toMatch(/SEVERITY\|FILE\|LINE\|TITLE\|EVIDENCE/)
776+
expect(content).not.toMatch(/exactly five `\|`-separated fields/)
777+
})
778+
779+
test("emits schema-valid line numbers for file-level findings", async () => {
780+
const content = await readRepoFile(
781+
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
782+
)
783+
// Addresses Codex P1 finding on commit a8a5ed2 (April 2026): the findings
784+
// schema requires line >= 1, so emitting line=0 for file-level findings
785+
// would silently drop them at the merge validator. Contract says: line=1
786+
// with file_level=true, evidence array prepends the file-level annotation.
787+
expect(content).toContain("file_level")
788+
expect(content).toMatch(/positive integer/i)
789+
expect(content).toMatch(/file-level finding \(no specific line applies\)/)
790+
// Negative check: the old "0 means file-level" wording must be gone.
791+
expect(content).not.toMatch(/0 means file-level/)
792+
expect(content).not.toMatch(/or 0 if file-level/)
793+
794+
// Also assert the schema constraint is what we think it is — if upstream
795+
// ever loosens line to allow 0, this test should be updated to match.
796+
const schemaJson = await readRepoFile(
797+
"plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json",
798+
)
799+
const schema = JSON.parse(schemaJson)
800+
const lineMin = schema.properties?.findings?.items?.properties?.line?.minimum
801+
expect(lineMin).toBe(1)
802+
})
763803
})
764804

765805
describe("testing-reviewer contract", () => {

0 commit comments

Comments
 (0)