Skip to content

runtime: switch to Qwen 3 1.7B with thinking mode#3

Merged
ehsan6sha merged 1 commit into
mainfrom
qwen3-1.7b-thinking-mode
May 27, 2026
Merged

runtime: switch to Qwen 3 1.7B with thinking mode#3
ehsan6sha merged 1 commit into
mainfrom
qwen3-1.7b-thinking-mode

Conversation

@ehsan6sha
Copy link
Copy Markdown
Member

Summary

Switch the on-device runtime from Qwen 2.5 1.5B to Qwen 3 1.7B with thinking mode. User requested most-intelligent mode + hide <think> content so chain-of-thought doesn't bloat the KV cache across the tool-call loop's multi-turn flow.

  • DEFAULT_MODEL_FILENAMEqwen3-1.7b-rk3588-w8a8.rkllm.
  • New _is_qwen3_model(path) filename gate so devices on old cached Qwen 2.5 keep thinking OFF (rollout safety).
  • _build_chat_prompt(enable_thinking=True) injects <think>\n into the assistant prefix.
  • New _strip_think() drops chain-of-thought content from the post-</think> tail.
  • RKLLMBackend.run_troubleshoot in Qwen 3 mode:
    • Strips <think> from history before the next turn (KV cache stays bounded — Qwen 3 model-card guidance).
    • Strips <think> from the SSE thought event (user preference: hide CoT from UI).
    • Emits a synthetic "Analyzing diagnostics..." marker when the post-think prose is empty so BLE transports don't show silent stretches.
    • Pre-strips think before parsing tool_call / verdict / recommendation so stray XML mentions inside reasoning prose can't pollute the parse.
  • max_new_tokens bumped 2048 → 3072 (thinking blocks empirically run 500–1500 tokens; structured response adds another 200–500).

Test plan

  • 244/244 existing tests pass (pytest tests/).
  • 10 new tests cover the Qwen 3 swap:
    • _is_qwen3_model filename detection (canonical, hyphen variant, case-insensitive, rejects Qwen 2.5 + Deepseek).
    • _strip_think all four shapes (full block, truncated mid-think, self-wrapped pair, trailing unclosed open).
    • _build_chat_prompt(enable_thinking=True) injects prefix; default leaves legacy Qwen 2.5 path alone.
    • try_load() wires _enable_thinking from the resolved model path.
    • run_troubleshoot strips <think> from history AND from SSE thought events.
    • Synthetic marker fills empty post-think turns.
  • Lab verification (deferred until .rkllm exists):
    • Convert Qwen 3 1.7B to W8A8 RKLLM format on build host.
    • Place at /uniondrive/blox-ai/model/qwen3-1.7b-rk3588-w8a8.rkllm on lab device.
    • Restart blox-ai.service; expect thinking=True in init log.
    • Run a /troubleshoot session; confirm SSE stream has no <think> content but does have post-think reasoning + structured events.
    • Verify next turn's prompt (via debug log) doesn't contain prior-turn CoT.

Sibling work

Sibling fula-ota commit (held locally until the publisher uploads the .rkllm + provides the SHA) bumps download_model.sh URL/SHA, info.json version + model name, .env BLOX_AI_MODEL_PATH, start.sh SIZE_LIMIT, and adds Qwen 2.5 1.5B cleanup logic (mirroring the existing 3B cleanup pattern).

🤖 Generated with Claude Code

User requested most-intelligent mode (thinking ON) + hide <think>
content to avoid bloating KV cache across the tool-call loop's
multi-turn flow. Sibling fula-ota PR ships the matching model file
(qwen3-1.7b-rk3588-w8a8.rkllm via GitHub release) + download_model.sh
URL/SHA pinning.

Changes:

- DEFAULT_MODEL_FILENAME flips to qwen3-1.7b-rk3588-w8a8.rkllm.
- New _is_qwen3_model(path) filename detector (matches qwen3 / qwen-3
  case-insensitive). Used by try_load() to wire the new
  _enable_thinking flag on the backend. Rollout safety: devices that
  still have an old Qwen 2.5 cached (the new file not yet downloaded)
  keep thinking OFF so the model does not get a <think> prefix it
  cannot parse.
- _build_chat_prompt gains enable_thinking parameter. When ON, the
  assistant prefix gets `<think>\n` injected so the model starts
  inside the think block. Matches apply_chat_template(enable_thinking=True)
  from the HF tokenizer config.
- New _strip_think(text) drops the chain-of-thought portion:
    * normal case: splits on first </think>, keeps the tail
    * self-wrapped pair after main close: defensive sub
    * trailing unclosed <think>: cut to end
    * truncated mid-think (no </think> anywhere): returns empty so
      caller treats as a prose-only turn and force-verdicts
- RKLLMBackend.run_troubleshoot in Qwen 3 mode:
    * pre-strips think for the history rewrite (KV cache stays
      bounded across the tool-call loop per Qwen 3 model-card
      guidance: "historical output should not include the thinking")
    * pre-strips think before parsing tool_call / verdict / recommendation
      so stray XML mentions inside reasoning prose cannot pollute
      the parse
    * strips think from the SSE thought event payload too (user
      preference: hide CoT from UI). When post-think prose is empty,
      emits a synthetic "Analyzing diagnostics..." marker so BLE
      transports do not show silent stretches.
- max_new_tokens bumped from 2048 to 3072 in init_model() because
  thinking blocks empirically run 500-1500 tokens; structured response
  adds 200-500 more. The prior 2048 was tight enough to truncate
  mid-verdict on hard prompts, manifesting as missing </think> in the
  output and an empty post-strip result.

Tests added (10 new in tests/test_rkllm_runtime.py):
- _is_qwen3_model: canonical filename, hyphen variant, case-insensitive,
  rejects Qwen 2.5 and Deepseek (rollout-safety regression guard).
- _strip_think: full block, truncated-mid-think, self-wrapped pair,
  trailing unclosed open.
- _build_chat_prompt with enable_thinking=True injects the prefix;
  default leaves legacy Qwen 2.5 path alone.
- try_load() sets _enable_thinking based on resolved model path.
- run_troubleshoot strips <think> from history before the next turn
  (KV bloat regression guard) AND from SSE thought events (UI
  preference). When post-think prose is empty, the synthetic marker
  fills the gap.

244/244 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ehsan6sha ehsan6sha merged commit 8218617 into main May 27, 2026
2 checks passed
@ehsan6sha ehsan6sha deleted the qwen3-1.7b-thinking-mode branch May 27, 2026 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant