Commit 610f7cb

feat: [US-009] - [Document Codex auth policy and fail-fast model availability]
1 parent 2e7a781 commit 610f7cb

4 files changed, +29 -2 lines changed

.beads/issues.jsonl

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@
 {"id":"CodeContextBench-cey","title":"US-012: Build failure analysis engine","status":"closed","priority":1,"issue_type":"feature","owner":"locobench@anthropic.com","created_at":"2026-02-15T13:53:47.854221697Z","created_by":"LoCoBench Bot","updated_at":"2026-02-15T13:57:20.769673188Z","closed_at":"2026-02-15T13:57:20.769673188Z","close_reason":"US-012 implemented and all ACs verified"}
 {"id":"CodeContextBench-d00","title":"US-001: Create inv-deep-001 Envoy filter chain deep causal task","status":"closed","priority":1,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-16T15:08:15.0008813Z","created_by":"LoCoBench Bot","updated_at":"2026-02-16T15:13:36.330715999Z","closed_at":"2026-02-16T15:13:36.330715999Z","close_reason":"US-001 complete: inv-deep-001 Envoy deep causal chain task created and committed"}
 {"id":"CodeContextBench-d5q","title":"US-003: Create inv-deep-003 - Deep causal chain in Terraform","status":"closed","priority":1,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-16T15:28:45.184016129Z","created_by":"LoCoBench Bot","updated_at":"2026-02-16T15:40:01.995896974Z","closed_at":"2026-02-16T15:40:01.995896974Z","close_reason":"US-003 complete: inv-deep-003 created with Terraform sensitive marks bug"}
-{"id":"CodeContextBench-dcp","title":"US-009 Document Codex auth policy and fail-fast model availability","status":"open","priority":2,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-17T03:33:32.070356971Z","created_by":"LoCoBench Bot","updated_at":"2026-02-17T03:33:32.070356971Z"}
+{"id":"CodeContextBench-dcp","title":"US-009 Document Codex auth policy and fail-fast model availability","status":"closed","priority":2,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-17T03:33:32.070356971Z","created_by":"LoCoBench Bot","updated_at":"2026-02-17T04:06:53.202388149Z","closed_at":"2026-02-17T04:06:53.202388149Z","close_reason":"done"}
 {"id":"CodeContextBench-dfp","title":"Run LoCoBench baseline and SG_full configs","description":"QA audit H2: LoCoBench only has SG_base results in MANIFEST (25/25 tasks). Need baseline and SG_full runs for complete 3-config comparison. SG_full should use the updated Deep Search preamble.","status":"closed","priority":2,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-06T14:50:17.265852053Z","created_by":"LoCoBench Bot","updated_at":"2026-02-16T01:33:58.368434048Z","closed_at":"2026-02-16T01:33:58.368434048Z","close_reason":"Stale - LoCoBench dropped in favor of enterprise largerepo tasks (25 tasks across 5 categories). New tasks don't need separate LoCoBench runs.","dependencies":[{"issue_id":"CodeContextBench-dfp","depends_on_id":"CodeContextBench-17e","type":"blocks","created_at":"2026-02-06T21:09:35.481295416Z","created_by":"LoCoBench Bot"}]}
 {"id":"CodeContextBench-ega","title":"US-008b: Scaffold remaining 3 governance tasks","status":"closed","priority":1,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-15T14:39:32.981506882Z","created_by":"LoCoBench Bot","updated_at":"2026-02-15T14:45:09.007512651Z","closed_at":"2026-02-15T14:45:09.007512651Z","close_reason":"US-008b complete: 3 governance tasks scaffolded (cross-team-boundary, audit-trail, degraded-context)"}
 {"id":"CodeContextBench-f0x","title":"US-001: Create nlqa-arch-001 Envoy HTTP filter chain task","status":"closed","priority":1,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-16T15:58:19.87022273Z","created_by":"LoCoBench Bot","updated_at":"2026-02-16T16:01:50.336839468Z","closed_at":"2026-02-16T16:01:50.336839468Z","close_reason":"US-001 complete: nlqa-arch-001 task created with all acceptance criteria passing"}

docs/CONFIGS.md

Lines changed: 13 additions & 0 deletions
@@ -99,3 +99,16 @@ cost extraction depends on `scripts/ccb_metrics/extractors.py` model pricing
 keys. Official Codex runs should use `gpt-5.3-codex` so pricing is explicit.
 If a model identifier is unknown to `MODEL_PRICING`, extraction falls back to
 `claude-opus-4-5-20250514` rates and emits a warning.
+
+## Codex Harness Auth and Model Policy
+
+Codex authentication is separate from Claude OAuth refresh automation in
+`configs/_common.sh`. Codex operators must configure Codex credentials directly
+using either ChatGPT login or an API key; Claude token refresh helpers are not
+reused for Codex harness execution.
+
+Official Codex benchmark runs require model `gpt-5.3-codex` and should
+fail-fast if that model is unavailable in the configured Codex environment.
+
+For this rollout, Codex MCP policy is sourcegraph_full-only for MCP-enabled
+runs, with baseline comparisons using `none`. No other MCP modes are allowed.
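
The policy added to `docs/CONFIGS.md` could be enforced by a pre-flight check along the following lines. This is only a sketch under stated assumptions: `check_codex_run_policy` and `CodexPolicyError` are hypothetical names that do not appear in the commit, and the list of available models is assumed to be supplied by the caller, since the commit does not say how the Codex environment is queried.

```python
from typing import Iterable

# Values taken from the policy text in docs/CONFIGS.md.
REQUIRED_CODEX_MODEL = "gpt-5.3-codex"
ALLOWED_MCP_MODES = frozenset({"none", "sourcegraph_full"})


class CodexPolicyError(RuntimeError):
    """Hypothetical error type, raised before any benchmark task starts."""


def check_codex_run_policy(available_models: Iterable[str], mcp_mode: str) -> None:
    """Raise CodexPolicyError if the run would violate the documented policy.

    `available_models` is assumed to come from the caller; how the Codex
    environment is asked for it is outside the scope of this sketch.
    """
    if mcp_mode not in ALLOWED_MCP_MODES:
        raise CodexPolicyError(
            f"MCP mode {mcp_mode!r} is not allowed; "
            f"expected one of {sorted(ALLOWED_MCP_MODES)}"
        )
    if REQUIRED_CODEX_MODEL not in set(available_models):
        raise CodexPolicyError(
            f"required model {REQUIRED_CODEX_MODEL!r} is unavailable; "
            "aborting before any task runs"
        )
```

Raising before the first task is what makes the check fail-fast: a run with the wrong model or MCP mode stops immediately instead of producing results that later have to be discarded.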

ralph-multi-harness/prd.json

Lines changed: 1 addition & 1 deletion
@@ -134,7 +134,7 @@
 "rg -n 'gpt-5.3-codex|fail-fast|ChatGPT login|API key|sourcegraph_full' docs returns matches in the new/updated doc"
 ],
 "priority": 9,
-"passes": false,
+"passes": true,
 "notes": ""
 },
 {
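
The acceptance criterion in this hunk greps for five policy terms. Because the `rg` pattern is an alternation, a match on any single term appears to be enough to satisfy it. A stricter check that every term is present could look like the sketch below; `missing_policy_terms` is a hypothetical helper, not something in the repository.

```python
from typing import List

# Terms copied from the rg pattern in the acceptance criterion.
REQUIRED_TERMS = (
    "gpt-5.3-codex",
    "fail-fast",
    "ChatGPT login",
    "API key",
    "sourcegraph_full",
)


def missing_policy_terms(doc_text: str) -> List[str]:
    """Return the required terms that do not occur verbatim in doc_text."""
    return [term for term in REQUIRED_TERMS if term not in doc_text]
```

It would be called with the text of the updated document, for example the contents of `docs/CONFIGS.md`. An empty list means every term is present verbatim.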

ralph-multi-harness/progress.txt

Lines changed: 14 additions & 0 deletions
@@ -7,6 +7,7 @@
 - Multi-harness requirements should be centralized in `docs/HARNESS_CONTRACT.md` (fields, transcript artifact expectations, and allowed MCP modes) so runner and registry changes stay aligned.
 - Harness registry entries should be keyed by stable harness IDs (`codex`, `cursor`, `gemini`, `copilot`, `openhands`) and constrain `allowed_mcp_modes` to exactly `none` and `sourcegraph_full` for this rollout.
 - Non-Claude harness runners should still source `configs/_common.sh` for validation/parallel helpers, but must not depend on Claude OAuth refresh flows.
+- Codex harness authentication policy is separate from Claude OAuth automation: document and operate Codex via ChatGPT login or API key, not `_common.sh` refresh helpers.
 - In sandboxed environments, `runs/staging` may resolve to an external symlink target; use a writable `--category` override when dry-running scaffolds locally.
 - In `scripts/ccb_metrics`, resolve transcript artifacts through a shared candidate list (not hardcoded `agent/claude-code.txt`) so non-Claude harness outputs are discoverable.
 - In `scripts/ccb_metrics/extractors.py`, treat unknown `MODEL_PRICING` keys deterministically by falling back to `_DEFAULT_MODEL` and emitting a one-time warning to keep cross-harness cost reports explainable.
@@ -111,3 +112,16 @@
 - Useful context (e.g., "the evaluation panel is in component X")
 - `logger.warning` with a one-time set guard (`_WARNED_UNKNOWN_PRICING_MODELS`) avoids log spam while preserving visibility for unknown model identifiers.
 ---
+
+## 2026-02-17 04:06:08 UTC - US-009
+- Documented Codex auth policy in `docs/CONFIGS.md` with explicit separation from Claude OAuth refresh automation in `_common.sh`.
+- Added Codex run policy requirements for ChatGPT login/API key auth options, fail-fast behavior when `gpt-5.3-codex` is unavailable, and rollout MCP constraints (`none` baseline vs `sourcegraph_full` only for MCP-enabled runs).
+- Files changed: `docs/CONFIGS.md`, `ralph-multi-harness/prd.json`, `ralph-multi-harness/progress.txt`
+- **Learnings for future iterations:**
+- Patterns discovered (e.g., "this codebase uses X for Y")
+- Codex harness docs should explicitly call out auth separation from Claude token automation to prevent operators from assuming shared credential refresh behavior.
+- Gotchas encountered (e.g., "don't forget to update Z when changing W")
+- Acceptance checks rely on exact terminology (`ChatGPT login`, `API key`, `fail-fast`, `sourcegraph_full`), so keep those terms verbatim in docs.
+- Useful context (e.g., "the evaluation panel is in component X")
+- `docs/CONFIGS.md` is already referenced for rollout config behavior, making it the lowest-friction location for Codex policy additions versus creating a new doc.
+---
