Commit fec2975

ehsan6sha and claude committed
runtime: re-init RKLLM on rc=-1 + restore FULL EXAMPLE w/ anti-mimicry framing
Two model-quality fixes for failures observed in lab transcripts:

1. RKLLM rc=-1 recovery

The native rkllm_run returns -1 when the runtime is in a stuck state,
most commonly after an aborted prior generation leaves the KV cache in
a half-state. Lab 2026-05-28: clicking "Try again with the same
question" reliably hit `[RKLLM_GENERATE_FAILED] rkllm_run returned -1`
at T+0s, killing the session.

The bridge now catches the rc=-1 case specifically, destroys + re-inits
the runtime, and retries the same generate() call ONCE with
keep_history=0 (the cache was wiped). Detection is a substring match on
"returned -1" in the error message, so other RKLLMLoadError shapes
(timeouts, callback errors, init failures) skip the re-init and surface
with their original message. If the retry also fails, the bridge emits
the error event AND a synthetic insufficient_data verdict in the same
stream so the app's Try-Again CTA surfaces (raw error events have no
Try-Again gate; only synthetic-verdict root_causes trigger it).

2. FULL EXAMPLE restored with anti-mimicry framing

Earlier today (c919a1d) the FULL EXAMPLE three-turn flow was deleted
from the system prompt on advisor input: the model was pattern-copying
entire example structures, not just numbers. Lab transcripts since then
show the model producing unanchored prose (e.g. "# Onboarding\n\nThe
Blox app asks you to name the symptom...") that has zero relation to
the user's request. Net effect of the deletion: the model has even less
format anchoring than before.

Restored a SHAPE-only example that uses uppercase placeholders
(<TOOL_NAME>, <ONE_SENTENCE_FROM_YOUR_DATA>, etc.) and an explicit
"mimic the STRUCTURE, not the words. Do NOT mention any of the
placeholder labels in your output" framing. This counters the
pattern-copy risk by removing memorable content to copy.

Added "Do NOT use markdown headings" to the FIRST ACTION section since
lab transcripts showed the model emitting `# Onboarding`-style headers
despite the existing "no markdown" rule elsewhere. The closer the rule
is to the first-token directive, the more reliably small models honor
it.

Two new unit tests:
- rc=-1 → destroy + re-init + retry succeeds → verdict came through
- rc=-1 → re-init retry ALSO fails → error event + synthetic
  insufficient_data verdict emitted, recoverable=true, Try-Again
  button reachable

All 258 existing tests still pass.

Bigger-picture concern (out of scope for this commit, surfaced to
user): Qwen3-1.7B fine-tuned on 174 examples appears to be overfitting
to specific training phrases and falling back to those phrases on novel
inputs. The longer-term fix is either (a) audit + clean the training
set, (b) move to Qwen2.5-3B as the original plan specified, or (c) drop
the fine-tune entirely and use a more directive system prompt with the
stock model. The user picked this "quick win" path; deeper options
remain on the table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 8898b0c commit fec2975

2 files changed

Lines changed: 207 additions & 8 deletions

src/runtime/rkllm_runtime.py

Lines changed: 104 additions & 8 deletions
@@ -625,7 +625,27 @@ def destroy(self) -> None:
 
 Your VERY FIRST output token MUST start a <tool_call> block. Do NOT
 repeat these instructions back. Do NOT acknowledge. Do NOT explain
-what you are about to do. Start with `<tool_call>{{"name":"diag/`.
+what you are about to do. Do NOT use markdown headings (#, ##).
+Start with `<tool_call>{{"name":"diag/`.
+
+# FORMAT SHAPE — mimic the STRUCTURE, not the words
+
+Below is the SHAPE a healthy session takes across three model turns.
+**The words are placeholders. Do NOT copy them. Do NOT mention any
+of the placeholder labels in your output.** Use the actual user
+request + real <tool_response> data you receive to fill in your own
+content.
+
+Turn 1 (after the user's request arrives):
+<tool_call>{{"name":"diag/<TOOL_NAME>","arguments":{{}}}}</tool_call>
+
+Turn 2 (after the runtime injects a <tool_response>...</tool_response>
+into your context):
+<tool_call>{{"name":"diag/<NEXT_TOOL>","arguments":{{}}}}</tool_call>
+
+Turn 3 (when you've seen enough tool data to decide):
+<verdict>{{"summary":"<ONE_SENTENCE_FROM_YOUR_DATA>","severity":"<green|yellow|red>","root_cause":"<SHORT_TOKEN>"}}</verdict>
+<recommendation>{{"action_name":"<one_of_the_action_names_above>","args":{{}},"reasoning":"<WHY_FROM_YOUR_DATA>","confidence":<0_TO_1>,"tier":<2_OR_3>}}</recommendation>
 
 # VERDICT RULES
 
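The doubled braces in the prompt text above suggest the system prompt is rendered through `str.format` (or an f-string), where `{{` and `}}` produce literal braces. That is an inference from the diff, not something the commit states. A minimal sketch of the escaping, using an invented `{tool_names}` field and invented tool names purely for illustration:

```python
import json

# Hypothetical template fragment; {tool_names} is an invented substitution
# field that only serves to show why the literal braces must be doubled.
PROMPT_TEMPLATE = (
    "Available tools: {tool_names}\n"
    "Turn 1 (after the user's request arrives):\n"
    '<tool_call>{{"name":"diag/<TOOL_NAME>","arguments":{{}}}}</tool_call>\n'
)

# The tool names below are made up for the example.
rendered = PROMPT_TEMPLATE.format(tool_names="diag/ping, diag/wifi_scan")
tool_call_line = rendered.splitlines()[2]

# Strip the wrapper tags to confirm the rendered body is valid JSON.
body = tool_call_line[len("<tool_call>"):-len("</tool_call>")]
parsed = json.loads(body)
```

After formatting, the model sees single braces, so the shape it is asked to mimic is well-formed JSON inside the `<tool_call>` tags.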
@@ -1183,13 +1203,89 @@ async def run_troubleshoot(
                     ),
                 )
             except RKLLMLoadError as e:
-                yield {
-                    "type": "error",
-                    "code": "RKLLM_GENERATE_FAILED",
-                    "message": str(e)[:500],
-                    "recoverable": False,
-                }
-                return
+                err_msg = str(e)
+                # rc=-1 from rkllm_run typically indicates the runtime
+                # is in a stuck state (KV cache corruption after an
+                # aborted prior generation, partial template state,
+                # etc.). The only known recovery is destroy + re-init.
+                # Try ONCE per turn — if the retry also fails, surface
+                # the error AND emit a synthetic insufficient_data
+                # verdict so the app's Try-Again button still surfaces
+                # (the bare error event currently has no Try-Again gate).
+                #
+                # We only attempt re-init for the rc=-1 family ("returned -1"
+                # in the message) — other errors (timeouts, callback
+                # signalled RUN_ERROR, init failed) point at deeper
+                # issues that re-init won't fix and that should bubble
+                # to the user with the original message intact.
+                reinit_ok = False
+                if "returned -1" in err_msg:
+                    try:
+                        logger.warning(
+                            "rkllm_run returned -1; destroying + re-initializing runtime"
+                        )
+                        self._runtime.destroy()
+                        self._runtime.init_model()
+                        reinit_ok = True
+                    except Exception as reinit_e:  # noqa: BLE001
+                        logger.exception(
+                            "rkllm re-init failed after rc=-1: %s", reinit_e
+                        )
+
+                if reinit_ok:
+                    # Retry the SAME generate() call once. keep_history
+                    # is reset because destroy() wiped the KV cache.
+                    try:
+                        output = await loop.run_in_executor(
+                            None,
+                            lambda r=next_role, c=next_content, t=next_thinking: (
+                                self._runtime.generate(
+                                    c,
+                                    role=r,
+                                    enable_thinking=t,
+                                    keep_history=0,  # cache was wiped
+                                    timeout_s=PER_TURN_TIMEOUT_S,
+                                )
+                            ),
+                        )
+                    except (RKLLMLoadError, asyncio.TimeoutError) as retry_e:
+                        # Retry also failed — surface both the error
+                        # event AND a synthetic verdict so Try-Again
+                        # is reachable.
+                        yield {
+                            "type": "error",
+                            "code": "RKLLM_GENERATE_FAILED",
+                            "message": f"after re-init retry: {str(retry_e)[:400]}",
+                            "recoverable": True,
+                        }
+                        yield {
+                            "type": "verdict",
+                            "payload": {
+                                "summary": "Blox AI runtime stalled; please try again.",
+                                "severity": "yellow",
+                                "root_cause": "insufficient_data",
+                            },
+                        }
+                        return
+                else:
+                    # Initial -1 + re-init failed, OR a non--1 RKLLM
+                    # error. Surface as before + emit a synthetic
+                    # verdict so Try-Again is reachable.
+                    yield {
+                        "type": "error",
+                        "code": "RKLLM_GENERATE_FAILED",
+                        "message": err_msg[:500],
+                        "recoverable": True,
+                    }
+                    yield {
+                        "type": "verdict",
+                        "payload": {
+                            "summary": "Blox AI runtime stalled; please try again.",
+                            "severity": "yellow",
+                            "root_cause": "insufficient_data",
+                        },
+                    }
+                    return
             except asyncio.TimeoutError:
                 yield {
                     "type": "error",

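The commit message says the app only offers Try-Again when it sees a synthetic-verdict root_cause, never on a bare error event. The app code is not part of this commit, so the following is only a guess at what such a gate could look like: `SYNTHETIC_ROOT_CAUSES` and `try_again_reachable` are hypothetical names, and the assumption that `insufficient_data` is the only synthetic root cause is mine.

```python
# Assumed set of root causes the app treats as synthetic (hypothetical).
SYNTHETIC_ROOT_CAUSES = frozenset({"insufficient_data"})


def try_again_reachable(events):
    """Return True if any verdict event carries a synthetic root cause."""
    for event in events:
        if event.get("type") != "verdict":
            continue  # error events alone never open the gate
        root_cause = event.get("payload", {}).get("root_cause")
        if root_cause in SYNTHETIC_ROOT_CAUSES:
            return True
    return False


# Stream shape before this commit: a bare, non-recoverable error.
bare_error = [
    {
        "type": "error",
        "code": "RKLLM_GENERATE_FAILED",
        "message": "rkllm_run returned -1",
        "recoverable": False,
    }
]

# Stream shape after this commit: error followed by a synthetic verdict.
error_plus_verdict = [
    {
        "type": "error",
        "code": "RKLLM_GENERATE_FAILED",
        "message": "after re-init retry: rkllm_run returned -1",
        "recoverable": True,
    },
    {
        "type": "verdict",
        "payload": {
            "summary": "Blox AI runtime stalled; please try again.",
            "severity": "yellow",
            "root_cause": "insufficient_data",
        },
    },
]
```

Under these assumptions, only the second stream would make the button reachable, which is why both failure branches yield the verdict after the error.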
tests/test_rkllm_runtime.py

Lines changed: 103 additions & 0 deletions
@@ -586,6 +586,109 @@ def test_guardrail_does_not_drop_when_user_did_NOT_complain_about_connectivity()
     assert out[0]["confidence"] == 0.6  # but still capped
 
 
+def test_rkllm_minus_one_triggers_destroy_reinit_and_retry():
+    """Lab 2026-05-28: clicking Retry on a previous session sometimes
+    hit `rkllm_run returned -1` at T+0s — the native runtime gets
+    stuck after an aborted prior generation. The bridge now destroys
+    + re-inits the runtime + retries the same generate() call ONCE
+    before yielding the error. If the retry succeeds, the model's
+    output proceeds normally; if it fails, we yield the error AND
+    a synthetic insufficient_data verdict so the app's Try-Again
+    CTA surfaces (raw error events have no Try-Again gate)."""
+    import asyncio
+    from src.runtime.rkllm_runtime import RKLLMBackend, RKLLMLoadError
+
+    state = {"calls": 0, "destroyed": False, "reinited": False}
+
+    class FlakyRuntime:
+        def generate(self, prompt, role="user", enable_thinking=False, keep_history=0, timeout_s=90.0):
+            state["calls"] += 1
+            if state["calls"] == 1:
+                raise RKLLMLoadError("rkllm_run returned -1")
+            # After re-init, second call succeeds with a verdict-only
+            # output (no tool_call) so the loop terminates cleanly on
+            # the next condition check. The runtime guard fires only
+            # once per generate call, so loop-iteration ends here.
+            return (
+                '<verdict>{"summary":"ok now","severity":"green","root_cause":"recovered_after_reinit"}</verdict>'
+            )
+
+        def destroy(self):
+            state["destroyed"] = True
+
+        def init_model(self):
+            state["reinited"] = True
+
+    backend = RKLLMBackend(loaded=True, _runtime=FlakyRuntime())
+    backend.wire_runtime_deps(tool_executor=None, action_signer=lambda x: "f" * 64)
+
+    async def collect():
+        return [ev async for ev in backend.run_troubleshoot("retry")]
+
+    events = asyncio.run(collect())
+    # Runtime was destroyed + re-inited
+    assert state["destroyed"] is True
+    assert state["reinited"] is True
+    # 2 calls inside turn 0 (initial -1 + retry) is what we want; the
+    # no-evidence guardrail then overrides because no tool calls
+    # succeeded, but the runtime calls are bounded.
+    assert state["calls"] == 2
+    # A verdict came through — either the model's recovered_after_reinit
+    # OR the no-evidence guardrail's insufficient_data override (both
+    # acceptable here; the point is the runtime recovered).
+    verdicts = [e for e in events if e["type"] == "verdict"]
+    assert len(verdicts) == 1
+    assert verdicts[0]["payload"]["root_cause"] in (
+        "recovered_after_reinit",
+        "insufficient_data",
+    ), verdicts[0]
+
+
+def test_rkllm_minus_one_retry_failure_emits_error_plus_synthetic_verdict():
+    """If the re-init retry ALSO returns -1 (or any other failure),
+    the bridge emits the error event AND a synthetic insufficient_data
+    verdict in the same stream so the app's synthetic-detector
+    triggers the Try-Again button. Without the second yield the user
+    is stuck on a bare error with only Start-new-chat."""
+    import asyncio
+    from src.runtime.rkllm_runtime import RKLLMBackend, RKLLMLoadError
+
+    state = {"calls": 0}
+
+    class StuckRuntime:
+        def generate(self, prompt, role="user", enable_thinking=False, keep_history=0, timeout_s=90.0):
+            state["calls"] += 1
+            raise RKLLMLoadError("rkllm_run returned -1")
+
+        def destroy(self):
+            pass
+
+        def init_model(self):
+            pass
+
+    backend = RKLLMBackend(loaded=True, _runtime=StuckRuntime())
+    backend.wire_runtime_deps(tool_executor=None, action_signer=lambda x: "f" * 64)
+
+    async def collect():
+        return [ev async for ev in backend.run_troubleshoot("retry")]
+
+    events = asyncio.run(collect())
+
+    # Two generate attempts were made (original + retry post re-init)
+    assert state["calls"] == 2
+    types = [e["type"] for e in events]
+    # Error event present
+    assert "error" in types
+    err = next(e for e in events if e["type"] == "error")
+    assert err["code"] == "RKLLM_GENERATE_FAILED"
+    assert "after re-init retry" in err["message"]
+    assert err["recoverable"] is True
+    # Synthetic insufficient_data verdict present so Try-Again surfaces
+    assert "verdict" in types
+    verdict = next(e for e in events if e["type"] == "verdict")
+    assert verdict["payload"]["root_cause"] == "insufficient_data"
+
+
 def test_run_troubleshoot_terminates_when_model_only_emits_prose():
     """If the model produces no verdict + no tool_calls, the backend
     INJECTS a force-verdict directive + gives the model one more chance.
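Both new tests drain the async generator into a list with `asyncio.run` and then assert on the event sequence. The pattern in isolation, as a sketch: `fake_run_troubleshoot` is a hypothetical stand-in for `backend.run_troubleshoot`, and the events it yields are made up to mirror the shapes in this commit.

```python
import asyncio


async def fake_run_troubleshoot(prompt):
    """Hypothetical stand-in for backend.run_troubleshoot()."""
    yield {
        "type": "error",
        "code": "RKLLM_GENERATE_FAILED",
        "message": f"after re-init retry: could not answer {prompt!r}",
        "recoverable": True,
    }
    yield {
        "type": "verdict",
        "payload": {"root_cause": "insufficient_data"},
    }


async def collect(stream):
    # An async list comprehension drains the generator until it returns.
    return [event async for event in stream]


events = asyncio.run(collect(fake_run_troubleshoot("retry")))
types = [event["type"] for event in events]
```

Collecting everything first keeps the assertions order-aware, so a test can check that the verdict follows the error in the same stream.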
