Five distinct bugs found while debugging a healthy lab device that the
AI plugin kept (loudly) reporting as broken. Each layer compounded the
next, producing a high-confidence "restart_fula" recommendation when
the device was actually fine.
1. diag/internet false "discovery unreachable" (HTTP 403 misclassified)
_helpers.py:https_head returned ok=False for HTTP 403, treating a
"server responded — you can't HEAD this resource" answer as a
network failure. discovery.fula.network's /relays only accepts
POST; HEAD legitimately returns 403, yet the server is alive and
the network path is fine. Lab ground truth: TLS handshake OK,
ping 3ms, HTTP 403 — REACHABLE.
Fix: new https_reachable() that returns ok=True on ANY HTTP
response (2xx/3xx/4xx/5xx). Only network-level failures (DNS,
TCP, TLS, timeout) count as unreachable. diag/internet now uses
https_reachable for the discovery probe; https_head retained for
google.com (the "internet itself works" canary). Regression test
test_internet_discovery_403_is_reachable_not_captive guards
against re-introducing the bug.
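A minimal sketch of the reachability rule described in this item, assuming a urllib-based probe. The signature, the returned dict shape, and the injectable `opener` parameter are illustrative assumptions, not the actual `_helpers.py` code.

```python
import urllib.error
import urllib.request


def https_reachable(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return ok=True for ANY HTTP response; ok=False only for network-level failures."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return {"ok": True, "http_status": response.status}
    except urllib.error.HTTPError as exc:
        # The server answered (e.g. 403 on a POST-only endpoint), so the path works.
        return {"ok": True, "http_status": exc.code}
    except (urllib.error.URLError, OSError) as exc:
        # DNS, TCP, TLS, or timeout failure: no HTTP response was received.
        return {"ok": False, "http_status": None, "error": str(exc)}
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would classify a 403 as a network failure again.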
2. diag/summary's power threshold turned 1 yellow event into nuclear red
summary.py had `red if ue > 0 else green`, so a single transient
undervoltage event flipped power to red and dominated the AI's
verdict.
Fix: graduated thresholds: 0=green, 1-2=yellow (acknowledge
but don't panic), 3+=red (real PSU issue).
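The graduated thresholds can be sketched as a small pure function. The name `power_status` and its parameter are hypothetical; only the 0 / 1-2 / 3+ boundaries come from the text.

```python
def power_status(undervoltage_events: int) -> str:
    """Map an undervoltage event count to a status colour."""
    if undervoltage_events <= 0:
        return "green"   # no events observed
    if undervoltage_events <= 2:
        return "yellow"  # transient: acknowledge, but do not escalate
    return "red"         # repeated events: likely a real PSU issue
```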
3. Schema-invalid backend event killed the entire stream
tool_call_loop.py returned on first validation failure, so when
the 1.5B Qwen hallucinated a tool name (e.g. "diag/discovery"),
the user's session bombed with [SCHEMA_VIOLATION] and no
recommendations rendered.
Fix: invalid tool_call now yields a recoverable error event +
synthesizes a tool_result with ok=false + "unknown tool 'X'"
message, so the model can self-correct on its next turn AND
the UI keeps the session alive. Other invalid events yield a
non-fatal error and the bridge continues. Regression tests:
test_schema_invalid_tool_call_yields_synthetic_tool_result +
updated test_schema_invalid_backend_event_emits_synthetic_error.
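A rough sketch of the recover-instead-of-abort behaviour, under stated assumptions: the event dict shapes, the `KNOWN_TOOLS` registry, the set of valid event types, and the `bridge_events` name are all hypothetical and do not mirror the real tool_call_loop.py.

```python
KNOWN_TOOLS = {"diag/internet", "diag/summary"}            # hypothetical tool registry
VALID_TYPES = {"tool_call", "tool_result", "text", "verdict"}  # hypothetical schema


def bridge_events(backend_events, known_tools=KNOWN_TOOLS):
    """Forward backend events, recovering from invalid ones instead of returning."""
    for event in backend_events:
        kind = event.get("type") if isinstance(event, dict) else None
        if kind == "tool_call" and event.get("name") not in known_tools:
            name = event.get("name")
            message = f"unknown tool '{name}'"
            # Recoverable error keeps the UI session alive.
            yield {"type": "error", "recoverable": True, "message": message}
            # Synthetic result lets the model see the failure and self-correct.
            yield {"type": "tool_result", "name": name, "ok": False, "message": message}
            continue
        if kind not in VALID_TYPES:
            # Any other invalid event is reported as non-fatal and skipped.
            yield {"type": "error", "recoverable": True,
                   "message": "schema-invalid backend event skipped"}
            continue
        yield event
```

The key difference from the buggy version is `continue` where the old loop returned: one bad event costs one error message, not the whole stream.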
4. System prompt let the model pre-confabulate + invent action names
rkllm_runtime.py SYSTEM_PROMPT_TEMPLATE strengthened with new hard
rules (the diff adds rules 5-10). The key ones:
- Rule 5: ANY action MUST be in a <recommendation> XML block,
never markdown prose (prose has no Approve button).
- Rule 6: Read tool_response field by field — quote the actual
field name. Don't confuse internet.latency_ms_avg with a
clock offset (lab observed).
- Rule 8: If user reports a symptom but diagnostics CONTRADICT
it (e.g. user says disconnected but heartbeat.status=green
http_status=200), ASK via <user_question> before acting.
- Rule 9: NEVER emit a tier-2/3 destructive action at confidence
> 0.7 when severity != "red".
- Rule 10: relay.reservation_count=0 + wireguard.active=false
are NOT problems on their own (normal for LAN-only devices).
Plus BAD/GOOD examples showing what NOT to do.
5. Server-side guardrails for when the model ignores rules 8/9 anyway
The 1.5B Qwen still pattern-matches "user said disconnected + relay
yellow → restart_fula at 95%" even with the strengthened prompt.
Belt-and-suspenders defense:
apply_recommendation_guardrails() runs after recommendation
parsing, BEFORE emission:
- DROPS restart_fula entirely when heartbeat.status=green
http_status=200 AND user prompt mentions disconnect /
unreachable / can't see / offline. Restarting fula when the
device IS heartbeating would create the very disconnect the
user complained about (self-fulfilling bug).
- CAPS confidence to 0.6 on restart-class actions
(restart_fula, reset, wireguard.bounce, docker.restart,
systemctl.restart) when verdict.severity is yellow/green.
These actions should not be high-confidence on non-red
severity.
Six regression-guard tests in test_rkllm_runtime.py cover the
exact lab scenario + adjacent cases (red severity passes,
non-restart-class actions unaffected, non-connectivity prompts
don't trigger the drop, etc.).
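The two guardrails can be sketched as follows. This is an illustration under assumptions, not the committed function: the argument list, the recommendation dict shape, and the substring-based prompt matching are hypothetical; the action names, the 0.6 cap, and the drop condition come from the description in this item.

```python
RESTART_CLASS = {"restart_fula", "reset", "wireguard.bounce",
                 "docker.restart", "systemctl.restart"}
DISCONNECT_HINTS = ("disconnect", "unreachable", "can't see", "offline")
CONFIDENCE_CAP = 0.6


def apply_recommendation_guardrails(recommendations, diagnostics, severity, user_prompt):
    """Drop or soften restart-class recommendations that contradict the diagnostics."""
    heartbeat = diagnostics.get("heartbeat", {})
    heartbeat_ok = (heartbeat.get("status") == "green"
                    and heartbeat.get("http_status") == 200)
    prompt = user_prompt.lower()
    mentions_disconnect = any(hint in prompt for hint in DISCONNECT_HINTS)

    kept = []
    for rec in recommendations:
        action = rec["action_name"]
        if action == "restart_fula" and heartbeat_ok and mentions_disconnect:
            # Restarting a heartbeating device would cause the reported disconnect.
            continue
        if action in RESTART_CLASS and severity in ("yellow", "green"):
            # Copy rather than mutate, then cap the confidence.
            rec = {**rec, "confidence": min(rec["confidence"], CONFIDENCE_CAP)}
        kept.append(rec)
    return kept
```

Running this after parsing and before emission means the UI never sees a recommendation the guardrails would have rejected, regardless of what the model produced.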
Tests: 230/230 pass.
End-to-end verified on lab pi@192.168.2.159 via hot-patch (before
container recreation wiped the patches): diag/internet correctly
reported discovery reachable, diag/summary returned power=green, AI
emitted proper verdict + recommended_action with HMAC token, and
the model began ASKING about WiFi configuration instead of jumping
to restart_fula.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
src/runtime/rkllm_runtime.py
148 lines changed: 148 additions & 0 deletions
@@ -405,6 +405,37 @@ def destroy(self) -> None:
 2. After you have called the diagnostic tools you need (typically 1-3 calls), you MUST emit a <verdict> based on the results.
 3. NEVER call tools indefinitely. After 1-2 follow-up calls, finalize.
 4. NEVER output a turn that is prose-only with no <tool_call> AND no <verdict>. Every turn must contain at least one XML block.
+5. **ANY action you suggest MUST be emitted as a <recommendation> XML block — NEVER as a markdown numbered list, bullet, or table.** If the user can't tap "Approve" on it, it didn't happen. Prose-only suggestions are invisible to the app. Translate every fix you have in mind into one <recommendation> block per fix.
+6. Read tool_response JSON FIELD BY FIELD. Do NOT confuse `internet.latency_ms_avg` with a clock offset. Do NOT call a subsystem "red" if its `status` field says "green". Quote the actual field name you're basing your conclusion on.
+7. NEVER use markdown headings (###), markdown bold (**...**), or numbered lists like "1. ntp.resync — ...". Those render as plain text in the chat; only <recommendation> blocks produce an Approve button.
+8. **If the user reports a symptom but the diagnostic data CONTRADICTS it, ASK before acting.** Specifically:
+- User says "device disconnected" / "not reachable" / "app can't see my blox" BUT `heartbeat.status` is "green" with `http_status: 200`. → Device IS reachable from the cloud (heartbeat is the canonical "I'm alive" signal posting to discovery.fula.network). The disconnect they see is almost certainly phone-side (app cache, NetInfo wrong, WiFi switched, captive portal). DO NOT recommend restart_fula. INSTEAD emit a <user_question> like: {{"question":"Your device is currently posting heartbeats successfully (heartbeat.status=green, http_status=200). The connection issue may be phone-side. What error message do you see, and is your phone on the same WiFi as your Blox?","options":["Same WiFi","Cellular","Different WiFi","Don't know"]}}
+- User says "slow" but `containers.status: green, oom_count: 0, storage.status: green`. → ASK what specifically is slow.
+- User says "not earning" but `relay.reservation_count > 0` AND `heartbeat.status: green`. → Device IS connected. ASK if they've actually joined a pool.
+9. **NEVER emit a tier-2 or tier-3 destructive action (`restart_fula`, `docker.restart`, `systemctl.restart`, `wireguard.bounce`, `reset`) with confidence > 0.7 when severity is "yellow" or "green".** Yellow signals can be normal — relay=yellow on a LAN-only device is expected. Acting on yellow with high confidence creates self-fulfilling problems (the action briefly DISCONNECTS the device, "confirming" the false diagnosis). Confidence > 0.7 on these actions requires severity="red" AND a specific failing subsystem named in the reasoning.
+10. `relay.reservation_count: 0` is NOT a problem on its own — it only matters if the user is trying to be reached from outside their LAN. `wireguard.active: false` is NOT a problem unless the user explicitly set up WG. Mention these only as "informational" in your verdict, never as the root cause unless other evidence points to them.
+
+# BAD vs GOOD examples
+
+❌ BAD — prose recommendations get NO Approve button, user can take no action:
+
+### Tier 2 Actions:
+1. **ntp.resync** - Resync the clock.
+2. **docker.restart container=ipfs_host** - Restart the container.
+
+✅ GOOD — each recommendation is its own XML block:
+
+<recommendation>{{"action_name":"ntp.resync","args":{{}},"reasoning":"Clock is unsynced.","confidence":0.85,"tier":2}}</recommendation>
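The doubled braces in the JSON examples are consistent with a template that is rendered through Python's `str.format`, where `{{` and `}}` produce literal braces. Assuming that is how SYSTEM_PROMPT_TEMPLATE is used (the source does not show the rendering call), a small sketch of the effect; the `{device}` placeholder and the `TEMPLATE` string are hypothetical.

```python
import json

# Hypothetical template: one real placeholder plus an escaped JSON example.
TEMPLATE = (
    "Device: {device}\n"
    '<recommendation>{{"action_name":"ntp.resync","args":{{}},'
    '"confidence":0.85,"tier":2}}</recommendation>'
)

rendered = TEMPLATE.format(device="blox")

# After formatting, the example is plain single-brace JSON the model can imitate.
payload = rendered.split("<recommendation>")[1].split("</recommendation>")[0]
rec = json.loads(payload)
```

Writing the example with single braces inside such a template would make `format` treat the JSON as a placeholder and raise an error, which is why every brace in the prompt's examples is doubled.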