Commit 2e7a781

feat: [US-008] - Add gpt-5.3-codex pricing support and unknown-model guard
1 parent 0e026a1 commit 2e7a781

5 files changed: 42 additions, 5 deletions

.beads/issues.jsonl

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@
 {"id":"CodeContextBench-4zx","title":"US-003: Fix remaining Docker environments (qutebrowser, Teleport, vuls, others)","status":"closed","priority":2,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-11T23:41:41.094793432Z","created_by":"LoCoBench Bot","updated_at":"2026-02-11T23:47:40.236064552Z","closed_at":"2026-02-11T23:47:40.236064552Z","close_reason":"No Docker fixes needed. All 7 non-protonmail repos have healthy environments. 32 infra failures are rate-limit hits. Re-run plan documented."}
 {"id":"CodeContextBench-54u","title":"US-003 Create codex_2config runner","status":"closed","priority":3,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-17T03:32:30.120181582Z","created_by":"LoCoBench Bot","updated_at":"2026-02-17T03:42:30.711156774Z","closed_at":"2026-02-17T03:42:30.711156774Z","close_reason":"done"}
 {"id":"CodeContextBench-56k","title":"US-005: Enrich judge context with LoCoBench SE dimensions","status":"closed","priority":2,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-15T22:53:56.691202916Z","created_by":"LoCoBench Bot","updated_at":"2026-02-15T22:55:55.340453929Z","closed_at":"2026-02-15T22:55:55.340453929Z","close_reason":"Implemented _locobench_dimensions in judge_context.py with ACS/DTA/CFRD rubrics"}
-{"id":"CodeContextBench-56s","title":"US-008 Add gpt-5.3-codex pricing support and unknown-model guard","status":"open","priority":2,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-17T03:33:21.651812464Z","created_by":"LoCoBench Bot","updated_at":"2026-02-17T03:33:21.651812464Z"}
+{"id":"CodeContextBench-56s","title":"US-008 Add gpt-5.3-codex pricing support and unknown-model guard","status":"closed","priority":2,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-17T03:33:21.651812464Z","created_by":"LoCoBench Bot","updated_at":"2026-02-17T04:03:46.456195307Z","closed_at":"2026-02-17T04:03:46.456195307Z","close_reason":"done"}
 {"id":"CodeContextBench-5e7","title":"Run investigation benchmark (4 tasks x 3 configs)","status":"closed","priority":1,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-10T12:50:04.232906279Z","created_by":"LoCoBench Bot","updated_at":"2026-02-10T15:49:12.300132241Z","closed_at":"2026-02-10T15:49:12.300132241Z","close_reason":"All 12 runs complete (4 tasks x 3 configs). MANIFEST regenerated."}
 {"id":"CodeContextBench-5kj","title":"US-002: Create workflow taxonomy module and methodology doc","status":"closed","priority":1,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-15T13:29:45.828982776Z","created_by":"LoCoBench Bot","updated_at":"2026-02-15T13:32:45.597491177Z","closed_at":"2026-02-15T13:32:45.597491177Z","close_reason":"US-002 complete: workflow taxonomy module + methodology doc"}
{"id":"CodeContextBench-5m5","title":"Document PyTorch sgt-025 as permanently excluded from SG_full","description":"sgt-025 Docker build fails because the referenced PyTorch commit is unreachable. Two attempts both failed with RuntimeError: Docker compose command failed. This is an unresolvable infrastructure issue. Document in TASK_CATALOG.md and potentially remove from SG_full task list.","status":"closed","priority":4,"issue_type":"task","owner":"locobench@anthropic.com","created_at":"2026-02-10T11:28:28.715732672Z","created_by":"LoCoBench Bot","updated_at":"2026-02-12T10:29:30.648519648Z","closed_at":"2026-02-12T10:29:30.648519648Z","close_reason":"Documented in benchmarks/ccb_pytorch/README.md — new Excluded Tasks section. sgt-025 Docker build fails due to unreachable pre_fix_rev commit. Permanently excluded from SG_full."}

docs/CONFIGS.md

Lines changed: 8 additions & 0 deletions
@@ -91,3 +91,11 @@ The configuration is controlled by the `BASELINE_MCP_TYPE` environment variable
 All three configs use `--dangerously-skip-permissions` for autonomous operation and deliver evaluation context via `--append-system-prompt`.

 Source: `~/evals/custom_agents/agents/claudecode/agents/claude_baseline_agent.py` lines 97-480
+
+## Multi-Harness Costing Caveat
+
+For non-Anthropic harnesses (Codex, Cursor, Gemini, Copilot, OpenHands), token
+cost extraction depends on `scripts/ccb_metrics/extractors.py` model pricing
+keys. Official Codex runs should use `gpt-5.3-codex` so pricing is explicit.
+If a model identifier is unknown to `MODEL_PRICING`, extraction falls back to
+`claude-opus-4-5-20250514` rates and emits a warning.
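
As a worked illustration of the caveat above, the sketch below applies the `gpt-5.3-codex` rates added in this commit (USD per one million tokens, as implied by the `/ 1_000_000` scaling in `calculate_cost_from_tokens`). The helper name `estimate_cost` and the token counts are hypothetical and are not part of `extractors.py`; cache terms are left out for brevity.

```python
# Rates copied from the `gpt-5.3-codex` entry this commit adds to MODEL_PRICING
# (USD per one million tokens). Everything else in this block is illustrative.
GPT_5_3_CODEX_RATES = {"input": 1.50, "output": 6.0, "cache_write": 0, "cache_read": 0}


def estimate_cost(input_tokens: int, output_tokens: int, rates: dict) -> float:
    """Hypothetical helper: price input and output tokens per million."""
    cost = (input_tokens / 1_000_000) * rates["input"]
    cost += (output_tokens / 1_000_000) * rates["output"]
    return cost


# 2,000,000 input tokens -> 2.0 * 1.50 = 3.00 USD
#   500,000 output tokens -> 0.5 * 6.0 = 3.00 USD
example_cost = estimate_cost(2_000_000, 500_000, GPT_5_3_CODEX_RATES)
print(f"{example_cost:.2f}")  # prints 6.00
```

If the same run were labelled with an identifier missing from `MODEL_PRICING`, the fallback rates would be used instead, so the reported figure would no longer reflect Codex pricing.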

ralph-multi-harness/prd.json

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@
 "docs/WORKFLOW_METRICS.md or docs/CONFIGS.md mentions model pricing caveat for non-Anthropic harnesses"
 ],
 "priority": 8,
-"passes": false,
+"passes": true,
 "notes": ""
 },
 {

ralph-multi-harness/progress.txt

Lines changed: 14 additions & 0 deletions
@@ -9,6 +9,7 @@
 - Non-Claude harness runners should still source `configs/_common.sh` for validation/parallel helpers, but must not depend on Claude OAuth refresh flows.
 - In sandboxed environments, `runs/staging` may resolve to an external symlink target; use a writable `--category` override when dry-running scaffolds locally.
 - In `scripts/ccb_metrics`, resolve transcript artifacts through a shared candidate list (not hardcoded `agent/claude-code.txt`) so non-Claude harness outputs are discoverable.
+- In `scripts/ccb_metrics/extractors.py`, treat unknown `MODEL_PRICING` keys deterministically by falling back to `_DEFAULT_MODEL` and emitting a one-time warning to keep cross-harness cost reports explainable.

 ## Progress

@@ -97,3 +98,16 @@
 - Useful context (e.g., "the evaluation panel is in component X")
 - `extract_run_config` now reads init metadata from whichever transcript candidate resolves first, improving MCP-mode inference for non-Claude harness runs.
 ---
+
+## 2026-02-17 04:03:08 UTC - US-008
+- Added explicit `gpt-5.3-codex` token pricing in `scripts/ccb_metrics/extractors.py` and introduced a deterministic unknown-model guard that falls back to `_DEFAULT_MODEL` rates with a one-time warning per unknown model.
+- Documented the non-Anthropic pricing caveat in `docs/CONFIGS.md`, including fallback behavior when a model key is missing from `MODEL_PRICING`.
+- Files changed: `scripts/ccb_metrics/extractors.py`, `docs/CONFIGS.md`, `ralph-multi-harness/prd.json`, `ralph-multi-harness/progress.txt`
+- **Learnings for future iterations:**
+- Patterns discovered (e.g., "this codebase uses X for Y")
+- Keep model-pricing fallback behavior explicit and centrally enforced in `calculate_cost_from_tokens` so cross-harness cost reporting remains deterministic.
+- Gotchas encountered (e.g., "don't forget to update Z when changing W")
+- Adding a new model key in `MODEL_PRICING` should be paired with operator-facing docs to avoid silent cost interpretation drift.
+- Useful context (e.g., "the evaluation panel is in component X")
+- `logger.warning` with a one-time set guard (`_WARNED_UNKNOWN_PRICING_MODELS`) avoids log spam while preserving visibility for unknown model identifiers.
+---

scripts/ccb_metrics/extractors.py

Lines changed: 18 additions & 3 deletions
@@ -7,6 +7,7 @@
 from __future__ import annotations

 import json
+import logging
 import re
 import statistics as _statistics
 from datetime import datetime
@@ -16,6 +17,9 @@
 from .models import TaskMetrics
 from .transcript_paths import infer_task_dir_from_transcript_path, resolve_task_transcript_path

+logger = logging.getLogger(__name__)
+_WARNED_UNKNOWN_PRICING_MODELS: set[str] = set()
+

 def _parse_iso(ts: Optional[str]) -> Optional[datetime]:
     """Parse an ISO 8601 timestamp, returning None on failure."""
@@ -840,6 +844,7 @@ def extract_code_changes_from_transcript(
     # GPT family
     "gpt-4o": {"input": 2.50, "output": 10.0, "cache_write": 0, "cache_read": 0},
     "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cache_write": 0, "cache_read": 0},
+    "gpt-5.3-codex": {"input": 1.50, "output": 6.0, "cache_write": 0, "cache_read": 0},
     "o1": {"input": 15.0, "output": 60.0, "cache_write": 0, "cache_read": 0},
     # Gemini family
     "gemini-2.0-flash": {"input": 0.10, "output": 0.40, "cache_write": 0, "cache_read": 0},
@@ -862,16 +867,26 @@ def calculate_cost_from_tokens(
         output_tokens: Number of output tokens.
         cache_creation: Number of cache creation (write) tokens.
         cache_read: Number of cache read tokens.
-        model: Model identifier (key into MODEL_PRICING). Falls back to
-            default Opus 4.5 pricing for unknown models.
+        model: Model identifier (key into MODEL_PRICING). Unknown models
+            deterministically fall back to default Opus 4.5 pricing and emit
+            a one-time warning per model.

     Returns:
         Estimated cost in USD, or None if input/output tokens unavailable.
     """
     if input_tokens is None or output_tokens is None:
         return None

-    prices = MODEL_PRICING.get(model, MODEL_PRICING[_DEFAULT_MODEL])
+    prices = MODEL_PRICING.get(model)
+    if prices is None:
+        prices = MODEL_PRICING[_DEFAULT_MODEL]
+        if model not in _WARNED_UNKNOWN_PRICING_MODELS:
+            logger.warning(
+                "Unknown model pricing for '%s'; using fallback '%s' rates",
+                model,
+                _DEFAULT_MODEL,
+            )
+            _WARNED_UNKNOWN_PRICING_MODELS.add(model)

     cost = (input_tokens / 1_000_000) * prices["input"]
     cost += (output_tokens / 1_000_000) * prices["output"]
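
The guard in the hunk above can be exercised in isolation. The following is a minimal, self-contained sketch rather than the module itself: the pricing table is trimmed, the `fallback-model` key and its rates are placeholders standing in for the real `_DEFAULT_MODEL` entry, and `calculate_cost`, `_ListHandler`, and the logger name `pricing_sketch` are hypothetical names introduced only for this demonstration. Cache token terms are omitted.

```python
from __future__ import annotations

import logging
from typing import Optional

logger = logging.getLogger("pricing_sketch")
logger.propagate = False  # keep demo output away from the root logger

# Placeholder fallback entry; NOT the real default model or its rates.
_DEFAULT_MODEL = "fallback-model"
MODEL_PRICING = {
    "fallback-model": {"input": 1.0, "output": 2.0},
    "gpt-5.3-codex": {"input": 1.50, "output": 6.0},  # rates from this commit
}
_WARNED_UNKNOWN_PRICING_MODELS: set[str] = set()


class _ListHandler(logging.Handler):
    """Collects log messages so the demo can count emitted warnings."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def calculate_cost(
    input_tokens: Optional[int], output_tokens: Optional[int], model: str
) -> Optional[float]:
    """Simplified stand-in for calculate_cost_from_tokens (no cache terms)."""
    if input_tokens is None or output_tokens is None:
        return None

    prices = MODEL_PRICING.get(model)
    if prices is None:
        # Unknown key: use fallback rates and warn once per model identifier.
        prices = MODEL_PRICING[_DEFAULT_MODEL]
        if model not in _WARNED_UNKNOWN_PRICING_MODELS:
            logger.warning(
                "Unknown model pricing for '%s'; using fallback '%s' rates",
                model,
                _DEFAULT_MODEL,
            )
            _WARNED_UNKNOWN_PRICING_MODELS.add(model)

    cost = (input_tokens / 1_000_000) * prices["input"]
    cost += (output_tokens / 1_000_000) * prices["output"]
    return cost


handler = _ListHandler()
logger.addHandler(handler)

first = calculate_cost(1_000_000, 1_000_000, "mystery-model")   # warns
second = calculate_cost(1_000_000, 1_000_000, "mystery-model")  # silent
known = calculate_cost(1_000_000, 1_000_000, "gpt-5.3-codex")   # no fallback

print(first, second, known, len(handler.messages))
```

Because `_WARNED_UNKNOWN_PRICING_MODELS` is module-level state, the warn-once behaviour lasts for the life of the process; a test suite that asserts on the warning would likely need to clear the set between cases.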
