Commit aa8799f

ehsan6sha and claude committed
runtime: raise max_new_tokens 768 -> 1500 for verdict completion headroom

768 was too tight: the second turn (analyzing tool result + emitting verdict) needs ~700-1200 tokens for think + structured output. Model was truncating mid-think and never reaching the <verdict> block.

1500 gives the model enough room without making turns absurdly long. At RK3588 NPU's ~5-7 tps thinking-mode rate, that's 3-5 minutes per turn.

Combined with all earlier fixes (no set_chat_template, inlined system, low temperature, GC-safe ctypes, role-based input), the model should now complete full multi-turn diagnostic flows.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

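The per-turn latency figures in the message follow from dividing the token budget by the decode rate. A minimal sketch of that arithmetic, assuming the ~5-7 tps rate holds for the whole turn and that every turn uses its full budget; `turn_seconds` and `turn_minutes_range` are hypothetical helpers for illustration, not part of rkllm_runtime.py:

```python
from typing import Tuple


def turn_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Worst-case generation time if the model uses its whole token budget."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


def turn_minutes_range(
    max_new_tokens: int,
    slow_tps: float = 5.0,  # assumed lower bound of the quoted ~5-7 tps
    fast_tps: float = 7.0,  # assumed upper bound of the quoted ~5-7 tps
) -> Tuple[float, float]:
    """Return (best_case_minutes, worst_case_minutes) for a full-budget turn."""
    best = turn_seconds(max_new_tokens, fast_tps) / 60.0
    worst = turn_seconds(max_new_tokens, slow_tps) / 60.0
    return best, worst


if __name__ == "__main__":
    # The three budgets mentioned in the commit and the comment it replaces.
    for budget in (768, 1500, 3072):
        best, worst = turn_minutes_range(budget)
        print(f"{budget:>5} tokens: {best:.1f}-{worst:.1f} min")
```

Under these assumptions the 1500-token budget works out to roughly 3.6-5.0 minutes, consistent with the "3-5 minutes per turn" estimate above.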
1 parent dbd15ff commit aa8799f

1 file changed

Lines changed: 8 additions & 11 deletions

src/runtime/rkllm_runtime.py
@@ -352,17 +352,14 @@ def init_model(
 # If a future conversion bumps the model's limit, raise this
 # value in lock-step.
 max_context_len: int = 4096,
-# max_new_tokens lowered to 768 (2026-05-27 lab test). On RK3588
-# NPU, Qwen 3 1.7B in thinking mode generates ~5-7 tokens/sec.
-# 3072 would cap a turn at ~7-10 minutes which exceeds phone-
-# app UX timeouts (and even our per-token 90s wait stretches
-# the user's perception). 768 caps at ~2-3 minutes — long but
-# bounded. The model occasionally truncates verbose verdicts
-# at this limit, which the synthetic-verdict fallback handles
-# gracefully. Raise back once we have a faster model or a real
-# streaming SSE path that emits tokens as they arrive (current
-# generate() blocks until the full turn completes).
-max_new_tokens: int = 768,
+# max_new_tokens budget per turn on RK3588 NPU. At ~5-7 tps in
+# thinking mode: 1500 tokens ≈ 3-5 minutes per turn. The
+# synthetic-verdict fallback in run_troubleshoot handles
+# truncation gracefully when the model runs out of budget
+# mid-output. Lower if phone UX timeouts demand quicker
+# responses; raise once the SSE pipeline streams tokens
+# incrementally (current generate() blocks until turn done).
+max_new_tokens: int = 1500,
 # Lower temperature + tighter top_k for STRUCTURED-OUTPUT
 # adherence. Lab observation 2026-05-27: at temp=0.6/top_k=20
 # the model produced narrative prose ("diag/summary") instead
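
The comment above relies on a synthetic-verdict fallback when generation stops before the `<verdict>` block is emitted. The actual logic in run_troubleshoot is not shown in this diff; the following is only a sketch of how such a check might look, assuming the block is closed by a `</verdict>` tag. `extract_verdict`, `verdict_or_fallback`, and the fallback string are all hypothetical names chosen for illustration:

```python
import re
from typing import Optional

# Assumption: a complete verdict is wrapped in <verdict>...</verdict>.
# Only the opening tag is confirmed by the commit message.
VERDICT_RE = re.compile(r"<verdict>(.*?)</verdict>", re.DOTALL)


def extract_verdict(output: str) -> Optional[str]:
    """Return the verdict body, or None if the block is missing or unclosed."""
    match = VERDICT_RE.search(output)
    return match.group(1).strip() if match else None


def verdict_or_fallback(
    output: str,
    fallback: str = "inconclusive: output truncated before verdict",
) -> str:
    """Use the model's verdict when complete, else a synthetic placeholder."""
    verdict = extract_verdict(output)
    return verdict if verdict is not None else fallback


if __name__ == "__main__":
    complete = "<think>checked logs</think><verdict>disk full</verdict>"
    truncated = "<think>the tool result shows that the"
    print(verdict_or_fallback(complete))
    print(verdict_or_fallback(truncated))
```

A check like this treats a verdict that was opened but cut off by the max_new_tokens limit the same as one that never started, which is the truncation case the raised budget is meant to make rarer.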
