From a542cdb068b8e63ebc96b1b1df7448679f3332a3 Mon Sep 17 00:00:00 2001 From: ehsan shariati Date: Tue, 26 May 2026 19:00:52 -0400 Subject: [PATCH] fix multi-layer false-positive diagnoses (lab-observed 2026-05-26) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Five distinct bugs found while debugging a healthy lab device that the AI plugin kept (loudly) reporting as broken. Each layer compounded the next, producing a high-confidence "restart_fula" recommendation when the device was actually fine. 1. diag/internet false "discovery unreachable" (HTTP 403 misclassified) _helpers.py:https_head returned ok=False for HTTP 403, treating a "server responded — you can't HEAD this resource" answer as a network failure. discovery.fula.network's /relays only accepts POST; HEAD legitimately returns 403, yet the server is alive and the network path is fine. Lab ground truth: TLS handshake OK, ping 3ms, HTTP 403 — REACHABLE. Fix: new https_reachable() that returns ok=True on ANY HTTP response (2xx/3xx/4xx/5xx). Only network-level failures (DNS, TCP, TLS, timeout) count as unreachable. diag/internet now uses https_reachable for the discovery probe; https_head retained for google.com (the "internet itself works" canary). Regression test test_internet_discovery_403_is_reachable_not_captive guards against re-introducing the bug. 2. diag/summary's power threshold turned 1 yellow event into nuclear red summary.py had `red if ue > 0 else green` — a single transient undervoltage event flipped power to red and dominated the AI's verdict. Graduated thresholds: 0=green, 1-2=yellow (acknowledge but don't panic), 3+=red (real PSU issue). 3. Schema-invalid backend event killed the entire stream tool_call_loop.py returned on first validation failure, so when the 1.5B Qwen hallucinated a tool name (e.g. "diag/discovery"), the user's session bombed with [SCHEMA_VIOLATION] and no recommendations rendered. Fix: invalid tool_call now yields a recoverable error event + synthesizes a tool_result with ok=false + "unknown tool 'X'" message, so the model can self-correct on its next turn AND the UI keeps the session alive. Other invalid events yield a non-fatal error and the bridge continues. Regression tests: test_schema_invalid_tool_call_yields_synthetic_tool_result + updated test_schema_invalid_backend_event_emits_synthetic_error. 4. System prompt let the model pre-confabulate + invent action names rkllm_runtime.py SYSTEM_PROMPT_TEMPLATE strengthened with three new hard rules: - Rule 5: ANY action MUST be in a XML block, never markdown prose (prose has no Approve button). - Rule 6: Read tool_response field by field — quote the actual field name. Don't confuse internet.latency_ms_avg with a clock offset (lab observed). - Rule 8: If user reports a symptom but diagnostics CONTRADICT it (e.g. user says disconnected but heartbeat.status=green http_status=200), ASK via before acting. - Rule 9: NEVER emit a tier-2/3 destructive action at confidence > 0.7 when severity != "red". - Rule 10: relay.reservation_count=0 + wireguard.active=false are NOT problems on their own (normal for LAN-only devices). Plus BAD/GOOD examples showing what NOT to do. 5. Server-side guardrails for when the model ignores rules 8/9 anyway 1.5B Qwen still pattern-matches "user said disconnected + relay yellow → restart_fula at 95%" even with strengthened prompt. Belt-and-suspenders defense: apply_recommendation_guardrails() runs after recommendation parsing, BEFORE emission: - DROPS restart_fula entirely when heartbeat.status=green http_status=200 AND user prompt mentions disconnect / unreachable / can't see / offline. Restarting fula when the device IS heartbeating would create the very disconnect the user complained about (self-fulfilling bug). - CAPS confidence to 0.6 on restart-class actions (restart_fula, reset, wireguard.bounce, docker.restart, systemctl.restart) when verdict.severity is yellow/green. These actions should not be high-confidence on non-red severity. Six regression-guard tests in test_rkllm_runtime.py cover the exact lab scenario + adjacent cases (red severity passes, non-restart-class actions unaffected, non-connectivity prompts don't trigger the drop, etc.). Tests: 230/230 pass. End-to-end verified on lab pi@192.168.2.159 via hot-patch (before container recreation wiped the patches): diag/internet correctly reported discovery reachable, diag/summary returned power=green, AI emitted proper verdict + recommended_action with HMAC token, and the model began ASKING about WiFi configuration instead of jumping to restart_fula. Co-Authored-By: Claude Opus 4.7 --- src/runtime/rkllm_runtime.py | 148 +++++++++++++++++++++++++++++++ src/session/tool_call_loop.py | 72 ++++++++++++++- src/tools/diag_impls/_helpers.py | 37 ++++++++ src/tools/diag_impls/internet.py | 28 ++++-- src/tools/diag_impls/summary.py | 18 +++- tests/test_diag_impls.py | 45 ++++++++-- tests/test_rkllm_runtime.py | 131 ++++++++++++++++++++++++++- tests/test_troubleshoot_sse.py | 82 ++++++++++++++++- 8 files changed, 540 insertions(+), 21 deletions(-) diff --git a/src/runtime/rkllm_runtime.py b/src/runtime/rkllm_runtime.py index 3c120e8..c361cbc 100644 --- a/src/runtime/rkllm_runtime.py +++ b/src/runtime/rkllm_runtime.py @@ -405,6 +405,37 @@ def destroy(self) -> None: 2. After you have called the diagnostic tools you need (typically 1-3 calls), you MUST emit a based on the results. 3. NEVER call tools indefinitely. After 1-2 follow-up calls, finalize. 4. NEVER output a turn that is prose-only with no AND no . Every turn must contain at least one XML block. +5. **ANY action you suggest MUST be emitted as a XML block — NEVER as a markdown numbered list, bullet, or table.** If the user can't tap "Approve" on it, it didn't happen. Prose-only suggestions are invisible to the app. Translate every fix you have in mind into one block per fix. +6. Read tool_response JSON FIELD BY FIELD. Do NOT confuse `internet.latency_ms_avg` with a clock offset. Do NOT call a subsystem "red" if its `status` field says "green". Quote the actual field name you're basing your conclusion on. +7. NEVER use markdown headings (###), markdown bold (**...**), or numbered lists like "1. ntp.resync — ...". Those render as plain text in the chat; only blocks produce an Approve button. +8. **If the user reports a symptom but the diagnostic data CONTRADICTS it, ASK before acting.** Specifically: + - User says "device disconnected" / "not reachable" / "app can't see my blox" BUT `heartbeat.status` is "green" with `http_status: 200`. → Device IS reachable from the cloud (heartbeat is the canonical "I'm alive" signal posting to discovery.fula.network). The disconnect they see is almost certainly phone-side (app cache, NetInfo wrong, WiFi switched, captive portal). DO NOT recommend restart_fula. INSTEAD emit a like: {{"question":"Your device is currently posting heartbeats successfully (heartbeat.status=green, http_status=200). The connection issue may be phone-side. What error message do you see, and is your phone on the same WiFi as your Blox?","options":["Same WiFi","Cellular","Different WiFi","Don't know"]}} + - User says "slow" but `containers.status: green, oom_count: 0, storage.status: green`. → ASK what specifically is slow. + - User says "not earning" but `relay.reservation_count > 0` AND `heartbeat.status: green`. → Device IS connected. ASK if they've actually joined a pool. +9. **NEVER emit a tier-2 or tier-3 destructive action (`restart_fula`, `docker.restart`, `systemctl.restart`, `wireguard.bounce`, `reset`) with confidence > 0.7 when severity is "yellow" or "green".** Yellow signals can be normal — relay=yellow on a LAN-only device is expected. Acting on yellow with high confidence creates self-fulfilling problems (the action briefly DISCONNECTS the device, "confirming" the false diagnosis). Confidence > 0.7 on these actions requires severity="red" AND a specific failing subsystem named in the reasoning. +10. `relay.reservation_count: 0` is NOT a problem on its own — it only matters if the user is trying to be reached from outside their LAN. `wireguard.active: false` is NOT a problem unless the user explicitly set up WG. Mention these only as "informational" in your verdict, never as the root cause unless other evidence points to them. + +# BAD vs GOOD examples + +❌ BAD — prose recommendations get NO Approve button, user can take no action: + + ### Tier 2 Actions: + 1. **ntp.resync** - Resync the clock. + 2. **docker.restart container=ipfs_host** - Restart the container. + +✅ GOOD — each recommendation is its own XML block: + + {{"action_name":"ntp.resync","args":{{}},"reasoning":"Clock is unsynced.","confidence":0.85,"tier":2}} + {{"action_name":"docker.restart","args":{{"container":"ipfs_host"}},"reasoning":"ipfs_host restart-looping.","confidence":0.75,"tier":2}} + +❌ BAD — making up a field: + + "Time Status: Clock offset is significant (93 ms)" + (when tool_response actually said `time.status: green, synced: true` and the 93 was `internet.latency_ms_avg`) + +✅ GOOD — quote what you actually read: + + "time.status is green (synced=true). internet.latency_ms_avg is 93ms — that's network latency, not a clock offset." # AVAILABLE TOOLS (read-only) @@ -593,6 +624,98 @@ def parse_recommendations(raw_text: str) -> list[dict]: return out +# Tier-2/3 destructive actions that should NOT be recommended at high +# confidence unless severity is "red". Empirically observed (2026-05-26 +# lab): 1.5B Qwen pattern-matches "user said disconnected → relay +# yellow → restart_fula at 95%" even when heartbeat is green +# (contradicting evidence of actual reachability). The post-processor +# below caps confidence to 0.6 on these names when the verdict's +# severity is not red, AND additionally suppresses them entirely when +# heartbeat is green but the user's prompt mentioned a connectivity +# complaint (the AI is acting on false-positive contradictory data). +RESTART_CLASS_ACTIONS = frozenset({ + "restart_fula", "reset", "wireguard.bounce", + "docker.restart", "systemctl.restart", +}) + + +def apply_recommendation_guardrails( + recommendations: list[dict], + verdict: Optional[dict], + last_summary_payload: Optional[dict] = None, + user_prompt: Optional[str] = None, +) -> tuple[list[dict], list[str]]: + """Cap or drop misbehaving recommendations from the model output. + + Two rules (in order): + + 1. CONFIDENCE CAP: restart-class action (restart_fula, reset, + wireguard.bounce, docker.restart, systemctl.restart) with + confidence > 0.7 AND verdict.severity != "red" → cap confidence + at 0.6. Reasoning: yellow/green severity + high-confidence + destructive action is the false-positive pattern. + + 2. CONTRADICTION SUPPRESS: if the user's prompt mentions a + CONNECTIVITY symptom ("disconnect", "not reachable", "can't see", + "offline") BUT the last diag/summary's heartbeat.status is + "green" with http_status 200, suppress restart_fula entirely. + The device IS reachable from the cloud; restarting it would + create the very disconnect the user is complaining about. + + Returns (filtered_recommendations, list_of_human_readable_reasons) + so the caller can log/surface why something was dropped or capped. + """ + notes: list[str] = [] + severity = (verdict or {}).get("severity") if verdict else None + + # Did the user actually complain about connectivity? + prompt_lc = (user_prompt or "").lower() + is_connectivity_complaint = any( + kw in prompt_lc + for kw in ("disconnect", "not reachable", "unreachable", + "can't see", "cannot see", "offline", "can't reach", + "cannot reach", "showing as disconnected", "shows offline") + ) + # Heartbeat ground truth: device IS reachable iff posted recent 200. + hb = (last_summary_payload or {}).get("subsystems", {}).get("heartbeat", {}) + heartbeat_green_with_200 = ( + hb.get("status") == "green" + and hb.get("key_metrics", {}).get("http_status") == 200 + ) + + out: list[dict] = [] + for rec in recommendations: + name = rec.get("action_name", "") + conf = float(rec.get("confidence", 0.5)) + + # Rule 2: drop restart_fula on contradiction (most aggressive). + if ( + name == "restart_fula" + and is_connectivity_complaint + and heartbeat_green_with_200 + ): + notes.append( + f"dropped action={name!r} (confidence {conf:.2f}): user reported " + f"a connectivity issue but heartbeat.status=green http_status=200, " + f"so device IS reachable — restart_fula would create a real " + f"disconnect from a non-existent problem" + ) + continue + + # Rule 1: cap restart-class confidence when severity is not red. + if name in RESTART_CLASS_ACTIONS and conf > 0.7 and severity != "red": + capped = 0.6 + notes.append( + f"capped action={name!r} confidence from {conf:.2f} to " + f"{capped:.2f}: severity={severity!r} is not red, so this " + f"action should not be presented as high-confidence" + ) + rec = {**rec, "confidence": capped} + + out.append(rec) + return out, notes + + def strip_blocks(raw_text: str) -> str: """Remove tool_call/verdict/recommendation blocks. What's left is the model's prose 'thought' content. Also strips any UNCLOSED @@ -699,6 +822,11 @@ async def run_troubleshoot( emitted_verdict = False force_verdict_attempted = False loop = asyncio.get_event_loop() + # Last diag/summary result + the user's original prompt are needed + # by the recommendation guardrails to detect false positives like + # "user says disconnected but heartbeat is green → suppress restart_fula". + last_summary_payload: Optional[dict] = None + original_user_prompt = prompt for turn in range(MAX_TURNS): # Last-chance: at MAX_TURNS-1 without a verdict, inject the @@ -793,6 +921,12 @@ async def run_troubleshoot( ) yield tr_event + # Capture the diag/summary result for the guardrail + # check below. Lets us detect "heartbeat green but + # user said disconnected" false-positive pattern. + if ok and tc["tool"] == "diag/summary" and isinstance(result, dict): + last_summary_payload = result + # Add tool responses to history for the next turn history.append({ "role": "tool", @@ -804,6 +938,20 @@ async def run_troubleshoot( emitted_verdict = True yield {"type": "verdict", "payload": verdict} + # Server-side guardrails on recommendations (caps + drops + # the false-positive patterns the 1.5B model produces). + # Logged to stderr so operators can see WHY a recommendation + # was dropped/capped. + if recommendations: + recommendations, guardrail_notes = apply_recommendation_guardrails( + recommendations, + verdict=verdict, + last_summary_payload=last_summary_payload, + user_prompt=original_user_prompt, + ) + for note in guardrail_notes: + logger.warning("recommendation guardrail: %s", note) + # Emit recommendations with real HMAC tokens if recommendations and self._action_signer is not None: for i, rec in enumerate(recommendations): diff --git a/src/session/tool_call_loop.py b/src/session/tool_call_loop.py index c28ea4a..1fe8f28 100644 --- a/src/session/tool_call_loop.py +++ b/src/session/tool_call_loop.py @@ -69,8 +69,50 @@ async def stream_troubleshoot( """ async for event in backend_events: if not _validate(event, validator): - yield _schema_violation_error(event) - return + # Bug fix 2026-05-26: previously this killed the stream + # (`return`), so a single hallucinated event from the model + # — e.g. `tool_call` with an invalid tool name like + # `diag/discovery` — bombed the entire session. The user + # saw `[SCHEMA_VIOLATION]` instead of any recommendations. + # + # New behavior: if the offending event is a tool_call, + # synthesize a tool_result with ok=false + an error message + # naming the bad tool. This lets the backend's tool-call + # loop continue and gives the LLM a chance to self-correct + # (it can read "unknown tool 'diag/discovery'" and retry + # with a valid name). For OTHER event types (verdict, + # thought, etc.) we still surface a non-fatal error event + # but don't end the stream — the backend may yield more + # valid events afterward. + invalid_evtype = event.get("type", "") + invalid_call_id = event.get("call_id") + yield _validation_error_event( + offending_type=invalid_evtype, + call_id=invalid_call_id, + fatal=False, + ) + if invalid_evtype == "tool_call" and invalid_call_id: + bad_tool = ( + event.get("payload", {}).get("tool") + if isinstance(event.get("payload"), dict) else None + ) + synth = { + "type": "tool_result", + "call_id": invalid_call_id, + "ok": False, + "payload": None, + "error": ( + f"unknown or unsupported tool name " + f"{bad_tool!r}. Pick one of the diag/* tools " + f"listed in your system prompt." + ), + } + # Best-effort: emit ONLY if it itself validates. + # Otherwise we'd loop indefinitely. + if _validate(synth, validator): + yield synth + # Move on to the next event from the backend. + continue yield event @@ -179,5 +221,31 @@ def _schema_violation_error(offending: dict) -> dict: } +def _validation_error_event( + offending_type: str, + call_id: str | None, + fatal: bool, +) -> dict: + """Non-fatal variant of the schema-violation error. + + Used when the bridge can RECOVER from an invalid backend event + (e.g. by synthesizing a tool_result with ok=false). Marked + `recoverable: True` so the UI knows to keep the chat alive and + not show a terminal-error state. + """ + msg = ( + f"backend emitted an invalid {offending_type!s} event; " + f"continuing — the model may self-correct." + ) + if call_id: + msg += f" (call_id={call_id!r})" + return { + "type": "error", + "code": "SCHEMA_VIOLATION_RECOVERED" if not fatal else "SCHEMA_VIOLATION", + "message": msg, + "recoverable": not fatal, + } + + def _truncate(s: str, n: int) -> str: return s if len(s) <= n else (s[: n - 1] + "…") diff --git a/src/tools/diag_impls/_helpers.py b/src/tools/diag_impls/_helpers.py index f015e9d..3c90896 100644 --- a/src/tools/diag_impls/_helpers.py +++ b/src/tools/diag_impls/_helpers.py @@ -73,6 +73,10 @@ def https_head(url: str, timeout_s: float = 5.0) -> tuple[bool, int | None, floa `ok` is True iff the request completed AND status is 2xx/3xx. Latency is wall-clock from request start to response. Stdlib urllib — no runtime requests/httpx dependency. + + Note: this answers "did the request SUCCEED?" — for reachability + checks, prefer https_reachable() (treats 4xx/5xx as 'host is up + and responded, you just can't access this resource'). """ import time start = time.monotonic() @@ -90,6 +94,39 @@ def https_head(url: str, timeout_s: float = 5.0) -> tuple[bool, int | None, floa return False, None, latency +def https_reachable(url: str, timeout_s: float = 5.0) -> tuple[bool, int | None, float]: + """Reachability check — distinguishes 'host is unreachable' from + 'host responded with an HTTP error'. + + `ok` is True iff we got ANY valid HTTP response (2xx, 3xx, 4xx, or + 5xx). A 403/404/500 means the server is alive, TCP+TLS worked, the + network path is fine — we just don't have access to that specific + URL. `ok` is False ONLY when we got NO HTTP response (DNS failure, + connection refused, TLS handshake fail, timeout). + + Bug fix 2026-05-26: previously diag/internet used https_head() with + `ok = 200 <= status < 400`, so a 403 on https://discovery.fula.network/relays + (which requires POST not HEAD) was reported as 'discovery + unreachable' — pure false positive. Lab observed: HTTP 403, TLS + ok, ping 3ms RTT, yet diag/internet returned discovery_https_ok=false. + """ + import time + start = time.monotonic() + req = urllib.request.Request(url, method="HEAD") + try: + with urllib.request.urlopen(req, timeout=timeout_s) as resp: + latency = (time.monotonic() - start) * 1000 + return True, resp.status, latency + except urllib.error.HTTPError as e: + # Server responded with an HTTP error — host IS reachable. + latency = (time.monotonic() - start) * 1000 + return True, e.code, latency + except (urllib.error.URLError, OSError, TimeoutError): + # No HTTP response at all (DNS / TCP / TLS / timeout). + latency = (time.monotonic() - start) * 1000 + return False, None, latency + + def dns_lookup(host: str, timeout_s: float = 3.0) -> bool: """True iff `host` resolves. socket.gethostbyname has no per-call timeout option in the stdlib, but it's bounded by the system resolver.""" diff --git a/src/tools/diag_impls/internet.py b/src/tools/diag_impls/internet.py index 41d3b84..5dfa0ef 100644 --- a/src/tools/diag_impls/internet.py +++ b/src/tools/diag_impls/internet.py @@ -3,7 +3,12 @@ import time -from src.tools.diag_impls._helpers import dns_lookup, https_head, now_iso +from src.tools.diag_impls._helpers import ( + dns_lookup, + https_head, + https_reachable, + now_iso, +) # Targets: google.com is the "the internet itself works" canary; the @@ -16,13 +21,26 @@ def diag_internet() -> dict: dns_ok = dns_lookup(GOOGLE_HOST) and dns_lookup(DISCOVERY_HOST) + # google.com: regular https_head (200 expected on root) — this is + # the "internet itself works" canary. g_ok, _, g_lat = https_head(f"https://{GOOGLE_HOST}", timeout_s=5.0) - d_ok, _, d_lat = https_head(f"https://{DISCOVERY_HOST}/relays", timeout_s=5.0) + # discovery.fula.network: use https_reachable, not https_head. + # The /relays endpoint requires POST (server returns HTTP 403 on + # HEAD/GET — that's a server-up response, NOT a network failure). + # Bug fix 2026-05-26: previously used https_head which classified + # 403 as ok=false → 'discovery unreachable' false-positive that + # the AI then reported as a connectivity problem. Lab confirmed: + # ping 3ms, TLS handshake completes, server returns 403 → IS + # reachable. Now we ask the right question. + d_ok, _, d_lat = https_reachable( + f"https://{DISCOVERY_HOST}/relays", timeout_s=5.0, + ) avg_lat = (g_lat + d_lat) / 2 if (g_lat + d_lat) > 0 else 0.0 # Captive-portal heuristic: DNS works AND google https returns OK - # AND latency is suspiciously low (a portal usually intercepts at the - # router with sub-50ms RTT) AND discovery is blocked. False positives - # acceptable — the AI cites this as one signal among many. + # AND latency is suspiciously low (a portal usually intercepts at + # the router with sub-50ms RTT) AND discovery is genuinely blocked + # (no HTTP response at all, not just 4xx). Without the + # https_reachable fix this false-fired constantly. captive = dns_ok and g_ok and not d_ok and avg_lat < 50 return { "dns_ok": dns_ok, diff --git a/src/tools/diag_impls/summary.py b/src/tools/diag_impls/summary.py index b437ac6..a959997 100644 --- a/src/tools/diag_impls/summary.py +++ b/src/tools/diag_impls/summary.py @@ -93,9 +93,25 @@ def _scorecard(name: str, raw: dict) -> dict: "key_metrics": {"synced": bool(raw.get("synced"))}, } if name == "power": + # Lab observed 2026-05-26: a SINGLE undervoltage event in 24h + # would flip status=red and feed the AI a panic-level signal, + # which then dominated the verdict. Real PSU failures show + # repeated events; isolated transients (one bad cable wiggle, + # one boot brownout) should be a yellow, not a red. + # + # Tiers: + # 0 events → green + # 1-2 events → yellow (note but don't panic; could be transient) + # 3+ events → red (real PSU issue, recommend power-cable check) ue = raw.get("undervoltage_events_24h", 0) + if ue == 0: + status = "green" + elif ue <= 2: + status = "yellow" + else: + status = "red" return { - "status": "red" if ue > 0 else "green", + "status": status, "key_metrics": { "uptime_s": raw.get("uptime_s", 0), "undervoltage_events_24h": ue, diff --git a/tests/test_diag_impls.py b/tests/test_diag_impls.py index 9ef0a41..0c25c32 100644 --- a/tests/test_diag_impls.py +++ b/tests/test_diag_impls.py @@ -50,9 +50,11 @@ def _validate(payload: dict, defname: str): # --------------------------------------------------------------------------- def test_internet_happy_path_all_ok(): + """google.com via https_head, discovery via https_reachable.""" from src.tools.diag_impls import internet as mod with patch.object(mod, "dns_lookup", return_value=True), \ - patch.object(mod, "https_head", return_value=(True, 200, 42.0)): + patch.object(mod, "https_head", return_value=(True, 200, 42.0)), \ + patch.object(mod, "https_reachable", return_value=(True, 200, 42.0)): r = mod.diag_internet() assert r["dns_ok"] is True assert r["https_google_ok"] is True @@ -64,25 +66,52 @@ def test_internet_happy_path_all_ok(): def test_internet_no_dns_marks_everything_down(): from src.tools.diag_impls import internet as mod with patch.object(mod, "dns_lookup", return_value=False), \ - patch.object(mod, "https_head", return_value=(False, None, 0.0)): + patch.object(mod, "https_head", return_value=(False, None, 0.0)), \ + patch.object(mod, "https_reachable", return_value=(False, None, 0.0)): r = mod.diag_internet() assert r["dns_ok"] is False _validate(r, "internet") def test_internet_captive_portal_heuristic_fires(): - """DNS OK + google OK + discovery DOWN + low latency → likely captive.""" + """DNS OK + google OK + discovery completely UNREACHABLE (no HTTP + response at all) + low latency → likely captive.""" from src.tools.diag_impls import internet as mod - def fake_head(url, timeout_s=5.0): - if "google" in url: - return True, 200, 30.0 - return False, None, 20.0 with patch.object(mod, "dns_lookup", return_value=True), \ - patch.object(mod, "https_head", side_effect=fake_head): + patch.object(mod, "https_head", return_value=(True, 200, 30.0)), \ + patch.object(mod, "https_reachable", return_value=(False, None, 20.0)): r = mod.diag_internet() assert r["captive_portal_likely"] is True +def test_internet_discovery_403_is_reachable_not_captive(): + """Regression guard 2026-05-26: lab observed + https://discovery.fula.network/relays returning HTTP 403 (HEAD + not allowed by the server — only POST). Before the fix, this was + classified as discovery_https_ok=False AND captive_portal_likely=True + — both false positives that led the AI to diagnose 'discovery + unreachable' when the server was actually fine. + + With the https_reachable fix, ANY HTTP response (including 403) + counts as 'reachable' — only no-response-at-all counts as down.""" + from src.tools.diag_impls import internet as mod + with patch.object(mod, "dns_lookup", return_value=True), \ + patch.object(mod, "https_head", return_value=(True, 200, 80.0)), \ + patch.object(mod, "https_reachable", return_value=(True, 403, 5.0)): + r = mod.diag_internet() + # The 403 from discovery still counts as REACHABLE. + assert r["https_discovery_ok"] is True, ( + "discovery server responded with HTTP 403 — that means it's reachable, " + "not down. The AI must not be told 'discovery unreachable' just because " + "we can't HEAD /relays." + ) + # And captive-portal must NOT fire when discovery responded. + assert r["captive_portal_likely"] is False, ( + "captive-portal heuristic fired despite discovery returning a real HTTP " + "response — should only fire when discovery is COMPLETELY unreachable" + ) + + # --------------------------------------------------------------------------- # diag/relay # --------------------------------------------------------------------------- diff --git a/tests/test_rkllm_runtime.py b/tests/test_rkllm_runtime.py index 3b57404..1b7931d 100644 --- a/tests/test_rkllm_runtime.py +++ b/tests/test_rkllm_runtime.py @@ -435,7 +435,136 @@ async def collect(): err = next((e for e in events if e["type"] == "error"), None) assert err is not None assert err["code"] == "RKLLM_NOT_LOADED" - assert err["recoverable"] is False + + +# --------------------------------------------------------------------------- +# Recommendation guardrails — 2026-05-26 lab false-positive regression +# --------------------------------------------------------------------------- + +def _verdict(severity: str): + return {"summary": "x", "severity": severity, "root_cause": "y"} + + +def _rec(name: str, conf: float, tier: int = 2, args=None): + return { + "action_name": name, + "args": args or {}, + "reasoning": "...", + "confidence": conf, + "tier": tier, + } + + +def _summary_with_heartbeat(status: str, http_status: int): + return { + "overall": status, + "subsystems": { + "heartbeat": {"status": status, + "key_metrics": {"http_status": http_status}}, + }, + } + + +def test_guardrail_caps_restart_class_confidence_when_not_red(): + """1.5B Qwen pattern: 'restart_fula' at confidence 0.95 with verdict + severity=yellow. Cap to 0.6 — yellow is not red, restart-class + actions should not be presented as high-confidence.""" + from src.runtime.rkllm_runtime import apply_recommendation_guardrails + recs = [_rec("restart_fula", 0.95)] + out, notes = apply_recommendation_guardrails( + recs, + verdict=_verdict("yellow"), + last_summary_payload=None, + user_prompt="trying to upload a file but it's slow", + ) + assert len(out) == 1 + assert out[0]["confidence"] == 0.6 + assert any("capped" in n.lower() for n in notes) + + +def test_guardrail_passes_high_confidence_when_severity_red(): + """If severity IS red, the action is appropriately confident.""" + from src.runtime.rkllm_runtime import apply_recommendation_guardrails + recs = [_rec("docker.restart", 0.9, args={"container": "ipfs_host"})] + out, _ = apply_recommendation_guardrails( + recs, + verdict=_verdict("red"), + last_summary_payload=None, + user_prompt="ipfs_host crashing", + ) + assert len(out) == 1 + assert out[0]["confidence"] == 0.9 + + +def test_guardrail_drops_restart_fula_when_heartbeat_green_and_user_said_disconnected(): + """The EXACT lab regression 2026-05-26: user clicked 'disconnected' + scenario, device was healthy + heartbeat green http_status=200, but + AI emitted restart_fula at 0.95 based on relay=yellow alone. That + action would create the very disconnect being complained about. + Must be dropped entirely.""" + from src.runtime.rkllm_runtime import apply_recommendation_guardrails + recs = [_rec("restart_fula", 0.95)] + out, notes = apply_recommendation_guardrails( + recs, + verdict=_verdict("yellow"), + last_summary_payload=_summary_with_heartbeat("green", 200), + user_prompt="My Blox is showing as disconnected in the app. Please diagnose why the device is not reachable", + ) + assert out == [], ( + f"restart_fula should have been dropped (heartbeat green + connectivity " + f"complaint = false positive); got {out}" + ) + assert any("dropped" in n.lower() for n in notes) + assert any("heartbeat" in n.lower() for n in notes) + + +def test_guardrail_passes_restart_fula_when_heartbeat_actually_red(): + """If heartbeat IS failing (real connectivity issue), restart_fula + is legitimate even on user-reported disconnect.""" + from src.runtime.rkllm_runtime import apply_recommendation_guardrails + recs = [_rec("restart_fula", 0.8)] + out, _ = apply_recommendation_guardrails( + recs, + verdict=_verdict("red"), + last_summary_payload={ + "subsystems": { + "heartbeat": {"status": "red", + "key_metrics": {"http_status": 0}}, + }, + }, + user_prompt="disconnected", + ) + assert len(out) == 1 + assert out[0]["confidence"] == 0.8 + + +def test_guardrail_leaves_non_restart_class_actions_alone(): + """ntp.resync is NOT a restart-class action — confidence cap doesn't apply.""" + from src.runtime.rkllm_runtime import apply_recommendation_guardrails + recs = [_rec("ntp.resync", 0.95)] + out, _ = apply_recommendation_guardrails( + recs, + verdict=_verdict("yellow"), + last_summary_payload=None, + user_prompt="clock drift suspected", + ) + assert out[0]["confidence"] == 0.95 + + +def test_guardrail_does_not_drop_when_user_did_NOT_complain_about_connectivity(): + """If user prompt didn't mention disconnect/unreachable, the + heartbeat-green check shouldn't fire — restart_fula stays (just + capped if severity isn't red).""" + from src.runtime.rkllm_runtime import apply_recommendation_guardrails + recs = [_rec("restart_fula", 0.95)] + out, _ = apply_recommendation_guardrails( + recs, + verdict=_verdict("yellow"), + last_summary_payload=_summary_with_heartbeat("green", 200), + user_prompt="my Blox is slow when streaming", + ) + assert len(out) == 1, "non-connectivity complaint shouldn't trigger the drop" + assert out[0]["confidence"] == 0.6 # but still capped def test_run_troubleshoot_terminates_when_model_only_emits_prose(): diff --git a/tests/test_troubleshoot_sse.py b/tests/test_troubleshoot_sse.py index 7359544..440b0ee 100644 --- a/tests/test_troubleshoot_sse.py +++ b/tests/test_troubleshoot_sse.py @@ -324,8 +324,14 @@ async def run_troubleshoot(self, prompt, session_id=None): def test_schema_invalid_backend_event_emits_synthetic_error(client_with_broken_backend): - """If the backend somehow emits a malformed event, the bridge MUST - swallow it and emit an `error` event with code SCHEMA_VIOLATION. + """If the backend somehow emits a malformed NON-tool_call event, + the bridge emits a recoverable error event and continues iterating + (no return) — so a single bad event doesn't kill the whole session. + + Bug fix 2026-05-26: previously the bridge `return`ed on first + schema-invalid event, which killed any in-flight troubleshooting + session (user saw '[SCHEMA_VIOLATION]' instead of recommendations). + Now: invalid event → recoverable error + keep streaming. Requires the REAL sse_events.schema.json (the permissive stub fallback would accept anything). Skips when the stub is in use. @@ -344,8 +350,76 @@ def test_schema_invalid_backend_event_emits_synthetic_error(client_with_broken_b f"expected an error event; got {len(events)} event(s): " f"{[e.get('type') for e in events]}" ) - assert err["code"] == "SCHEMA_VIOLATION" - assert err["recoverable"] is False + # The bad event was a thought (not a tool_call) so the bridge marks + # the recovery as ongoing — recoverable=True. + assert err["code"] == "SCHEMA_VIOLATION_RECOVERED", ( + f"expected SCHEMA_VIOLATION_RECOVERED, got {err.get('code')}" + ) + assert err["recoverable"] is True + + +@pytest.fixture +def client_with_bad_tool_name_backend(schema_dir_with_all_required, monkeypatch): + """Backend whose first event is a tool_call with a NON-WHITELISTED + tool name ('diag/discovery'). Mirrors the user's lab observation + 2026-05-26 where the 1.5B Qwen hallucinated a tool that doesn't + exist + the bridge killed the whole stream.""" + from fastapi.testclient import TestClient + monkeypatch.setenv("BLOX_AI_SCHEMA_DIR", str(schema_dir_with_all_required)) + import sys + for mod in ("src.app", "src.schemas", "src.runtime.mock_backend", + "src.runtime.mock_diag", "src.session.tool_call_loop", + "src.routes.troubleshoot"): + sys.modules.pop(mod, None) + from src.app import app as fresh_app + + class BadToolBackend: + name = "bad-tool" + loaded = True + consumes_tool_results = False + def status_snapshot(self): return {} + async def run_troubleshoot(self, prompt, session_id=None): + yield {"type": "tool_call", "call_id": "rk-0-0", + "payload": {"tool": "diag/discovery", "args": {}}} + + with TestClient(fresh_app) as c: + c.app.state.backend = BadToolBackend() + yield c + + +def test_schema_invalid_tool_call_yields_synthetic_tool_result( + client_with_bad_tool_name_backend, +): + """Regression guard 2026-05-26: if the LLM hallucinates a tool name + not in the closed enum (e.g. 'diag/discovery'), the bridge MUST + synthesize a tool_result with ok=false + a clear 'unknown tool' + error — so the LLM has a chance to self-correct on the next turn, + AND the stream keeps flowing. + + Before fix: bridge `return`ed on first invalid tool_call → user + saw [SCHEMA_VIOLATION] and no further events. Lab observed: + LLM emitted `{"tool": "diag/discovery"}` and entire session bombed. + """ + from .conftest import _real_schemas_in_use + if not _real_schemas_in_use(): + pytest.skip("real sse_events.schema.json required") + r = _post_troubleshoot(client_with_bad_tool_name_backend) + events = _events_from_response(r) + types = [e.get("type") for e in events] + err = next((e for e in events if e.get("type") == "error"), None) + tr = next((e for e in events if e.get("type") == "tool_result"), None) + assert err is not None, f"expected error event; saw types: {types}" + assert err["recoverable"] is True, ( + "schema-invalid tool_call must NOT kill the stream; bridge " + "should yield recoverable error and continue" + ) + assert tr is not None, ( + "expected synthetic tool_result with ok=false so the LLM can " + "read the failure and pick a valid tool next turn" + ) + assert tr["ok"] is False + assert tr["call_id"] == "rk-0-0" + assert "diag/discovery" in (tr.get("error") or "") def test_unknown_tool_in_executor_fails_open_with_ok_false(