Commit 611d34c

LEANDERANTONY and claude committed
fix(eval): clamp Kimi adapter max_tokens to per-task budget; preflight-only arg
KimiEvalService floored max_tokens at 8000 even for the 20-token preflight call. OpenRouter reserves max_tokens*price of credit upfront, so flooring tiny calls inflated the reservation and caused spurious 402s on pricier models / low balances.

Now clamps to the caller's real per-task budget with 8000 as a CEILING (never a floor); truncation is still counted via finish_reason=="length".

provider_ab_runner gains a `--preflight-only` arg (validate every slug/key for ~$0.001 then exit; early-exit wiring still TODO, tracked in the parked eval plan).

Eval-scoped, not production-wired. 7 hermetic adapter tests green (tests/quality/test_kimi_eval_service.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
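
The reservation effect described in the message can be made concrete with a little arithmetic. This is a rough sketch only: the per-token price below is a hypothetical figure chosen for illustration, and `reserved_credit` is not a function from this repository; it simply encodes the `max_tokens * price` model the message states.

```python
from decimal import Decimal

# Hypothetical completion price, for illustration only; real prices vary by model.
PRICE_PER_TOKEN_USD = Decimal("0.00001")  # equivalent to $10 per million tokens


def reserved_credit(max_tokens: int,
                    price_per_token: Decimal = PRICE_PER_TOKEN_USD) -> Decimal:
    """Credit held upfront under the max_tokens * price model from the commit message."""
    return max_tokens * price_per_token


preflight_budget = 20   # the preflight call asks for roughly 20 tokens
limit = 8000            # default value of _EVAL_MAX_TOKENS

floored = reserved_credit(max(preflight_budget, limit))  # old: 8000 acts as a floor
clamped = reserved_credit(min(preflight_budget, limit))  # new: 8000 acts as a ceiling

print(f"floored reservation: ${floored}")
print(f"clamped reservation: ${clamped}")
print(f"ratio: {floored / clamped}x")
```

Whatever the real price is, the ratio between the two reservations for a 20-token request stays at 8000 / 20 = 400, which is why a low balance could fail a call that would actually cost almost nothing.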
1 parent dfb1ed8 commit 611d34c

2 files changed

Lines changed: 15 additions & 4 deletions

tests/quality/kimi_eval_service.py

Lines changed: 12 additions & 4 deletions
@@ -54,8 +54,14 @@
 
 _DEFAULT_BASE_URL = os.getenv("KIMI_BASE_URL", "https://openrouter.ai/api/v1").strip()
 _DEFAULT_MODEL = os.getenv("KIMI_MODEL", "moonshotai/kimi-k2.6").strip()
-# Generous so truncation doesn't confound the model-quality signal;
-# we still COUNT any finish_reason=="length" as a fidelity miss.
+# Safety CEILING (not a floor): callers pass real per-task budgets
+# (parsers/agents from config; preflight passes ~20). We clamp to
+# this ceiling so a runaway never over-spends, but never inflate a
+# small request up to it — OpenRouter reserves max_tokens*price of
+# credit upfront, so flooring tiny calls at 8000 caused spurious 402s
+# on pricier models / low balances. Truncation is still COUNTED via
+# finish_reason=="length"; the eval controls truncation by the
+# callers' already-generous per-task budgets.
 _EVAL_MAX_TOKENS = int(os.getenv("KIMI_EVAL_MAX_TOKENS", "8000"))
 
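
The comment above says truncation is still counted via `finish_reason=="length"`, but the counting code itself is not part of this diff. As a minimal sketch, assuming responses follow the OpenAI-compatible chat-completions shape (`choices[0].finish_reason`), the bookkeeping might look like the following; `count_truncated` is a hypothetical helper, not the adapter's actual function.

```python
from typing import Iterable, Mapping


def count_truncated(responses: Iterable[Mapping]) -> int:
    """Count responses whose first choice stopped because it hit max_tokens."""
    truncated = 0
    for resp in responses:
        choices = resp.get("choices") or []
        # "length" is the finish reason reported when generation hit the token limit.
        if choices and choices[0].get("finish_reason") == "length":
            truncated += 1
    return truncated


# Hand-written sample payloads, not real API output.
sample = [
    {"choices": [{"finish_reason": "stop"}]},
    {"choices": [{"finish_reason": "length"}]},
    {"choices": []},
]
print(count_truncated(sample))
```

Because the count is derived from the response rather than from the requested budget, lowering `max_tokens` to the caller's real budget does not hide truncation; it only makes it more likely to be observed if a per-task budget is too small.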

@@ -167,7 +173,8 @@ def run_json_prompt(self, system_prompt, user_prompt, expected_keys=None,
                         reasoning_effort=None) -> dict:
         task = task_name or "unknown"
         content = self._chat(system_prompt, user_prompt, task_name=task,
-                             max_tokens=max(max_completion_tokens, _EVAL_MAX_TOKENS))
+                             max_tokens=min(max_completion_tokens or _EVAL_MAX_TOKENS,
+                                            _EVAL_MAX_TOKENS))
         try:
             payload = json.loads(content)
         except json.JSONDecodeError as exc:
@@ -194,7 +201,8 @@ def run_structured_prompt(self, system_prompt, user_prompt, *,
                               previous_response_id=None, reasoning_effort=None):
         task = task_name or "unknown"
         content = self._chat(system_prompt, user_prompt, task_name=task,
-                             max_tokens=max(max_completion_tokens, _EVAL_MAX_TOKENS))
+                             max_tokens=min(max_completion_tokens or _EVAL_MAX_TOKENS,
+                                            _EVAL_MAX_TOKENS))
         try:
             raw = json.loads(content)
         except json.JSONDecodeError as exc:
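
The behavioural difference between the old `max(...)` and the new `min(... or ..., ...)` expressions is easiest to see side by side. The sketch below lifts both expressions out of the diff into standalone functions; the names `old_budget` and `new_budget` are made up for illustration, and the constant is hard-coded to the default instead of being read from `KIMI_EVAL_MAX_TOKENS`.

```python
_EVAL_MAX_TOKENS = 8000  # default from the diff; the real module reads an env var


def old_budget(max_completion_tokens):
    """Pre-commit expression: 8000 acts as a floor."""
    return max(max_completion_tokens, _EVAL_MAX_TOKENS)


def new_budget(max_completion_tokens):
    """Post-commit expression: 8000 acts as a ceiling, and a falsy budget falls back to it."""
    return min(max_completion_tokens or _EVAL_MAX_TOKENS, _EVAL_MAX_TOKENS)


for requested in (20, 4000, 12000):
    print(requested, old_budget(requested), new_budget(requested))

# Falsy inputs (None or 0) fall back to the ceiling in the new expression.
print(new_budget(None), new_budget(0))
```

Two side effects are worth noting. The `or` fallback means a budget of `0` is treated the same as "not provided" and becomes 8000, and requests above 8000 are now capped where previously they passed through unchanged.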

tests/quality/provider_ab_runner.py

Lines changed: 3 additions & 0 deletions
@@ -211,6 +211,9 @@ def main() -> None:
                     help="alias: --suite parser --limit 3 (cheap sanity + fidelity)")
     ap.add_argument("--preflight", action="store_true",
                     help="1 tiny call/candidate to validate slug+key before the run")
+    ap.add_argument("--preflight-only", action="store_true",
+                    help="just validate every slug/key (~$0.001 total) and exit; "
+                         "no suites — use when credits are tight")
     ap.add_argument("--json", default="")
     args = ap.parse_args()
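
This hunk only registers the flag; per the commit message, the early-exit wiring is still TODO. One possible shape for that wiring is sketched below. It is an assumption about how the flag could be consumed, not the runner's actual control flow: `run_preflight`, the `candidates` parameter, and the returned list of steps are all hypothetical stand-ins so the example runs without network access.

```python
import argparse


def run_preflight(candidates):
    """Hypothetical stand-in for the tiny validation call; returns the slugs that failed."""
    return [slug for slug in candidates if not slug]


def main(argv=None, candidates=("moonshotai/kimi-k2.6",)):
    ap = argparse.ArgumentParser()
    ap.add_argument("--preflight", action="store_true")
    ap.add_argument("--preflight-only", action="store_true")
    args = ap.parse_args(argv)

    steps = []  # records what ran, purely so the sketch is easy to inspect
    if args.preflight or args.preflight_only:
        steps.append("preflight")
        failed = run_preflight(candidates)
        if failed:
            raise SystemExit(f"preflight failed for: {failed}")
        if args.preflight_only:
            return steps  # exit before any suite spends credits
    steps.append("suites")
    return steps


print(main(["--preflight-only"]))
print(main(["--preflight"]))
print(main([]))
```

Treating `--preflight-only` as implying `--preflight` keeps the two flags from needing to be passed together; whether the real runner makes the same choice is not visible in this diff.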
