Commit 3cea3fb

chore(flows): switch default QA LLM from qwen36-fast (4B) to qwen36-deep (27B)
The smaller qwen36-fast was the previous default for OBOL_LLM_MODEL across release-smoke and flow-{03,04,11,13,14} plus buy-external. It's documented as flaky on the long single-shot agent-buy prompt at flow-13/14 step 46 (see the retry-wrapper rationale added in the prior commit, plus plans/inference-v1337-followup-20260514.md).

Switching the default to qwen36-deep (27B-class, also served by the same spark1 vLLM endpoint) trades a bit of latency for much more reliable tool-call behaviour. Operators can still pin the smaller model explicitly via OBOL_LLM_MODEL=qwen36-fast for fast iteration on non-agent flows.

Files changed:
- flows/lib.sh, flows/release-smoke.sh, flows/flow-{03,04,11,13,14}*.sh, flows/buy-external.sh: default value switch
- flows/lib-dual-stack.sh: WARN box in agent_buy_with_retry now recommends checking the model is qwen36-deep first; mentions qwen36-35b-heretic as the next escalation
- CLAUDE.md, .agents/skills/obol-stack-dev/{SKILL.md,references/*.md}: documentation refreshed

Not changed (intentional):
- internal/{model,hermes}/*_test.go: qwen36-fast is a test fixture for the rank parser, not a default; switching would invalidate test expectations without changing test intent
- plans/post-490-integration-20260513.md: historical record
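The default-versus-pin behaviour described in the message can be sketched with the same `${VAR:-default}` expansion the flows use. The helper name `resolve_model` is hypothetical and exists only for illustration; it is not part of the repository.

```shell
# Hypothetical helper: keep an operator-set OBOL_LLM_MODEL, otherwise fall
# back to the new default, using the same expansion as the flow scripts.
resolve_model() {
  echo "${OBOL_LLM_MODEL:-qwen36-deep}"
}

# With the variable unset, the new default applies.
unset OBOL_LLM_MODEL
default_model=$(resolve_model)

# An explicit pin takes precedence (subshell keeps the pin local).
pinned_model=$(OBOL_LLM_MODEL=qwen36-fast; resolve_model)

echo "default=$default_model pinned=$pinned_model"
```

Because `:-` also treats an empty value as unset, exporting `OBOL_LLM_MODEL=""` would still resolve to the default.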
1 parent 7a7d51b commit 3cea3fb

14 files changed

Lines changed: 40 additions & 35 deletions

.agents/skills/obol-stack-dev/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ OBOL_TOKEN_BASE_SEPOLIA=0x0a09371a8b011d5110656ceBCc70603e53FD2c78
 **Payment assertion**: don't bypass the agent buy step with a direct script exec. If the agent times out, diagnose Hermes/LiteLLM/model routing — don't relax the assertion. Required evidence: `PurchaseRequest Ready=True` + paid HTTP 200 + on-chain `Transfer` + exact balance deltas.

-**QA LLM**: full seller/buyer QA must route Alice and Bob through `OBOL_LLM_ENDPOINT` (OpenAI-compatible vLLM or llama.cpp on the QA host). Default `OBOL_LLM_MODEL=qwen36-fast`. Sequence: `obol model setup custom` → `obol model prefer` → one `obol model sync`. Local Ollama and cloud-fallback are **not** acceptable green substitutes for full-flow QA.
+**QA LLM**: full seller/buyer QA must route Alice and Bob through `OBOL_LLM_ENDPOINT` (OpenAI-compatible vLLM or llama.cpp on the QA host). Default `OBOL_LLM_MODEL=qwen36-deep` (27B-class). The smaller `qwen36-fast` (~4B) was the previous default but flakes on the long single-shot agent-buy prompt at flow-13/14 step 46 — see the retry-wrapper rationale in `flows/lib-dual-stack.sh::agent_buy_with_retry`. Sequence: `obol model setup custom` → `obol model prefer` → one `obol model sync`. Local Ollama and cloud-fallback are **not** acceptable green substitutes for full-flow QA.

 **Public vs private routes**: `/services/*`, `/.well-known/agent-registration.json`, `/skill.md`, and `/` (storefront) are public via the tunnel. **NEVER** remove `hostnames: ["obol.stack"]` from frontend or eRPC HTTPRoutes — exposing them publicly is a critical security flaw.

.agents/skills/obol-stack-dev/references/llm-routing.md

Lines changed: 2 additions & 2 deletions
@@ -47,15 +47,15 @@ Canonical user flow for vLLM / sglang / mlx-lm / a remote GPU box. **No ConfigMa
 obol stack up

 # Drop auto-detected Ollama entries — they will out-rank the new custom entry.
-# Internal/model/rank.go parses ":9b" as 90 deci-billions; "qwen36-fast" (no
+# Internal/model/rank.go parses ":9b" as 90 deci-billions; "qwen36-deep" (no
 # ":Nb" tag) ranks 0. Without removing them, the agent stays on slow host Ollama.
 obol model remove qwen3.5:9b
 obol model remove qwen3.5:4b

 obol model setup custom \
   --name spark1-vllm \
   --endpoint http://192.168.18.23:8000/v1 \
-  --model qwen36-fast
+  --model qwen36-deep
 # `setup custom` validates the endpoint, patches LiteLLM, and internally calls
 # syncAgentModels → hermes.Sync → rewrites the default agent's deployment files
 # with the new primary model. No manual restart needed.
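The ranking comment in this hunk can be illustrated with a minimal sketch. The function `rank_from_tag` is hypothetical and is not the code in `internal/model/rank.go`; it only mirrors the behaviour the comment describes, assuming a plain integer size tag such as `:9b` maps to deci-billions and a name without a `:Nb` tag ranks 0.

```shell
# Hypothetical helper: derive a rank in deci-billions from a ":Nb" size tag.
# This is an illustration of the comment above, not the real rank parser.
rank_from_tag() {
  name=$1
  case "$name" in
    *:*b)
      tag=${name##*:}    # text after the last colon, e.g. "9b"
      size=${tag%b}      # strip the trailing "b", e.g. "9"
      case "$size" in
        ''|*[!0-9]*) echo 0 ;;          # not a plain integer: treat as untagged
        *) echo $(( size * 10 )) ;;     # 9 billion -> 90 deci-billions
      esac
      ;;
    *) echo 0 ;;                        # no ":Nb" tag at all
  esac
}

rank_from_tag "qwen3.5:9b"
rank_from_tag "qwen3.5:4b"
rank_from_tag "qwen36-deep"
```

Under these assumptions both Ollama entries out-rank the untagged custom model, which is why the hunk removes them before running `obol model setup custom`.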

.agents/skills/obol-stack-dev/references/paid-flows.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ The runner has a `warn_unpaid_base_sepolia_rpc` preflight. The CLI scrubs paid-R
 - Alice ServiceOffer reaches `Ready=True`.
 - ERC-8004 registration tx published to Base Sepolia (`/.well-known/agent-registration.json` reachable via tunnel for live flows).
 - Bob `PurchaseRequest` reaches `Ready=True`.
-- LiteLLM exposes `paid/<OBOL_LLM_MODEL>` (default `qwen36-fast`).
+- LiteLLM exposes `paid/<OBOL_LLM_MODEL>` (default `qwen36-deep`).
 - Paid inference returns HTTP 200 and **final-answer** content (not reasoning metadata or tool-catalogue text).
 - On-chain `Transfer(Bob signer → Alice, <PAID_AMOUNT>)` receipt is archived.
 - Alice balance increases and Bob signer balance decreases by exactly `PAID_AMOUNT` wei (USDC for flow-11, OBOL for flow-13/14).

.agents/skills/obol-stack-dev/references/remote-qa.md

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ Set `OBOL_LLM_MODEL` to an id returned by `/models`.
 cd "$QA"
 export PATH="$QA/.workspace/bin:$FOUNDRY_BIN:$TOOL_ROOT:$PATH"
 export OBOL_LLM_ENDPOINT=${OBOL_LLM_ENDPOINT:-http://127.0.0.1:8000/v1}
-export OBOL_LLM_MODEL=${OBOL_LLM_MODEL:-qwen36-fast}
+export OBOL_LLM_MODEL=${OBOL_LLM_MODEL:-qwen36-deep}
 ts=$(date +%Y%m%d-%H%M%S)
 log="$QA/.tmp/flow-14-$ts.log"
 art="$QA/.tmp/flow-14-$ts-artifacts"
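The hunk header notes that `OBOL_LLM_MODEL` should be an id returned by `/models`. A rough preflight for that could look like the sketch below. The helper `model_listed` and the sample response are assumptions for illustration: the sample follows the usual OpenAI-compatible list shape (`data[].id`), and the grep match is deliberately loose, so a real check may prefer a proper JSON parser.

```shell
# Hypothetical helper: succeed if the model id ($1) appears as an "id" field
# in the JSON body read from stdin (e.g. the body of GET $OBOL_LLM_ENDPOINT/models).
# Note: the id is used inside a regex, so "." in an id matches any character.
model_listed() {
  grep -q "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Sample response body; the exact fields returned by a given server are an assumption.
sample='{"object":"list","data":[{"id":"qwen36-fast","object":"model"},{"id":"qwen36-deep","object":"model"}]}'

if printf '%s' "$sample" | model_listed "qwen36-deep"; then
  echo "qwen36-deep is served"
else
  echo "qwen36-deep is missing"
fi
```

Against a live endpoint, the sample would be replaced by the output of something like `curl -s "$OBOL_LLM_ENDPOINT/models"`.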

CLAUDE.md

Lines changed: 3 additions & 3 deletions
@@ -37,7 +37,7 @@ go test -tags integration -v -run TestIntegration_Tunnel_SellDiscoverBuySidecar_
 # Release-gate seller/buyer smoke (requires OBOL_LLM_ENDPOINT pointing at OpenAI-compatible vLLM/llama.cpp)
 RELEASE_SMOKE_INCLUDE_OBOL=true RELEASE_SMOKE_INCLUDE_OBOL_FORK=true \
-OBOL_LLM_ENDPOINT=http://127.0.0.1:8000/v1 OBOL_LLM_MODEL=qwen36-fast \
+OBOL_LLM_ENDPOINT=http://127.0.0.1:8000/v1 OBOL_LLM_MODEL=qwen36-deep \
 bash flows/release-smoke.sh

 just up # obol stack init + up
@@ -246,13 +246,13 @@ obol model remove qwen3.5:4b
 obol model setup custom \
   --name spark1-vllm \
   --endpoint http://192.168.18.23:8000/v1 \
-  --model qwen36-fast
+  --model qwen36-deep
 # `setup custom` validates the endpoint, patches LiteLLM, and internally calls
 # syncAgentModels → hermes.Sync → rewrites the default agent's deployment files
 # with the new primary model. No manual restart needed.

 # (b) OR keep Ollama and force-promote the custom entry to the head:
-obol model prefer qwen36-fast
+obol model prefer qwen36-deep
 obol model sync # propagate to Hermes

 obol model list # confirm head of model_list

flows/buy-external.sh

Lines changed: 2 additions & 2 deletions
@@ -60,7 +60,7 @@
 # EXTERNAL_PR_TIMEOUT_S default: 300 (5 min)
 # EXTERNAL_LOG_BLOCKS_BACK default: 30 (~6 min on Base Sepolia at 2s/blk)
 # OBOL_LLM_ENDPOINT default: http://127.0.0.1:8000/v1
-# OBOL_LLM_MODEL default: qwen36-fast
+# OBOL_LLM_MODEL default: qwen36-deep (27B-class)
 # OBOL_LLM_NAME default: external-llm
 #
 # Exit code: 0 on PASS (every step pass), 1 on any FAIL.
@@ -106,7 +106,7 @@ EXTERNAL_PR_TIMEOUT_S="${EXTERNAL_PR_TIMEOUT_S:-300}"
 EXTERNAL_LOG_BLOCKS_BACK="${EXTERNAL_LOG_BLOCKS_BACK:-30}"

 OBOL_LLM_ENDPOINT="${OBOL_LLM_ENDPOINT:-http://127.0.0.1:8000/v1}"
-OBOL_LLM_MODEL="${OBOL_LLM_MODEL:-qwen36-fast}"
+OBOL_LLM_MODEL="${OBOL_LLM_MODEL:-qwen36-deep}"
 OBOL_LLM_NAME="${OBOL_LLM_NAME:-external-llm}"

 # Resolve OBOL_ROOT before sourcing helpers — lib.sh re-derives it but

flows/flow-03-inference.sh

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ source "$(dirname "$0")/lib.sh"
 if [ -n "${OBOL_LLM_ENDPOINT:-}" ]; then
   run_step "Route LiteLLM through QA LLM endpoint" route_llm_via_obol_cli "$OBOL"
-  LITELLM_MODEL="${OBOL_LLM_MODEL:-qwen36-fast}"
+  LITELLM_MODEL="${OBOL_LLM_MODEL:-qwen36-deep}"
 else
   LITELLM_MODEL="$FLOW_MODEL"

flows/flow-04-agent.sh

Lines changed: 2 additions & 2 deletions
@@ -110,8 +110,8 @@ fi
 model_name=$("$OBOL" kubectl get cm hermes-config -n "$NS" -o jsonpath='{.data.config\.yaml}' 2>/dev/null | sed -n 's/^[[:space:]]*default: //p' | tr -d '"' | head -1)
 [ -n "$model_name" ] || model_name="qwen3.5:35b"
-if [ -n "${OBOL_LLM_ENDPOINT:-}" ] && [ "$model_name" != "${OBOL_LLM_MODEL:-qwen36-fast}" ]; then
-  fail "Hermes default model $model_name does not match QA LLM model ${OBOL_LLM_MODEL:-qwen36-fast}"
+if [ -n "${OBOL_LLM_ENDPOINT:-}" ] && [ "$model_name" != "${OBOL_LLM_MODEL:-qwen36-deep}" ]; then
+  fail "Hermes default model $model_name does not match QA LLM model ${OBOL_LLM_MODEL:-qwen36-deep}"
   cleanup_pid "$PF_PID"
   emit_metrics
   exit 0
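To make the `model_name` extraction in this hunk easier to follow, the same `sed | tr | head` pipeline can be run against an inline sample. The sample YAML is an assumption about what the `hermes-config` ConfigMap might contain, and the real layout may differ; only the pipeline and the `qwen3.5:35b` fallback are taken from the diff.

```shell
# Sample config.yaml content; the real hermes-config layout is an assumption.
sample_config='model:
  default: "qwen36-deep"
  provider: custom'

# Same pipeline as flow-04: take the first "default:" line, strip the key,
# drop the quotes, and keep only the first match.
model_name=$(printf '%s\n' "$sample_config" \
  | sed -n 's/^[[:space:]]*default: //p' | tr -d '"' | head -1)

# Same fallback as the flow when nothing was extracted.
[ -n "$model_name" ] || model_name="qwen3.5:35b"

# The mismatch check then compares against the (new) default.
expected="${OBOL_LLM_MODEL:-qwen36-deep}"
if [ "$model_name" != "$expected" ]; then
  echo "mismatch: Hermes has $model_name, QA expects $expected"
else
  echo "match: $model_name"
fi
```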

flows/flow-11-dual-stack.sh

Lines changed: 2 additions & 2 deletions
@@ -36,7 +36,7 @@
 # FLOW11_BOB_HTTP_PORT FLOW11_BOB_HTTP_ALT_PORT
 # FLOW11_BOB_HTTPS_PORT FLOW11_BOB_HTTPS_ALT_PORT
 # OBOL_LLM_ENDPOINT required vLLM/llama.cpp/OpenAI-compatible endpoint
-# OBOL_LLM_MODEL endpoint model name (default: qwen36-fast)
+# OBOL_LLM_MODEL endpoint model name (default: qwen36-deep)
 source "$(dirname "$0")/lib.sh"

 # ═════════════════════════════════════════════════════════════════
@@ -60,7 +60,7 @@ BOB_HTTP_ALT_PORT="${FLOW11_BOB_HTTP_ALT_PORT:-$(pick_free_port)}"
 BOB_HTTPS_PORT="${FLOW11_BOB_HTTPS_PORT:-$(pick_free_port)}"
 BOB_HTTPS_ALT_PORT="${FLOW11_BOB_HTTPS_ALT_PORT:-$(pick_free_port)}"
 FACILITATOR_URL="${FLOW11_FACILITATOR_URL:-https://x402.gcp.obol.tech}"
-OBOL_LLM_MODEL="${OBOL_LLM_MODEL:-qwen36-fast}"
+OBOL_LLM_MODEL="${OBOL_LLM_MODEL:-qwen36-deep}"
 export OBOL_LLM_MODEL
 FLOW11_ARTIFACT_DIR="${FLOW11_ARTIFACT_DIR:-$OBOL_ROOT/.tmp/flow-11-$(date +%Y%m%d-%H%M%S)}"
 if ! BASE_SEPOLIA_RPC="$(resolve_base_sepolia_rpc "${FLOW11_BASE_SEPOLIA_RPC:-${BASE_SEPOLIA_RPC:-}}")"; then

flows/flow-13-dual-stack-obol.sh

Lines changed: 5 additions & 5 deletions
@@ -36,7 +36,7 @@
 # FLOW13_BOB_HTTP_PORT, _ALT, _HTTPS_PORT, _HTTPS_ALT_PORT
 # FLOW13_ARTIFACT_DIR where receipts + logs land
 # OBOL_LLM_ENDPOINT required vLLM/llama.cpp/OpenAI-compatible endpoint
-# OBOL_LLM_MODEL endpoint model name (default: qwen36-fast)
+# OBOL_LLM_MODEL endpoint model name (default: qwen36-deep, 27B-class)
 #
 source "$(dirname "$0")/lib.sh"
 DUAL_STACK_FLOW_PREFIX="FLOW13"
@@ -61,7 +61,7 @@ BOB_HTTP_ALT_PORT="$(dual_stack_env_or_free_port BOB_HTTP_ALT_PORT)"
 BOB_HTTPS_PORT="$(dual_stack_env_or_free_port BOB_HTTPS_PORT)"
 BOB_HTTPS_ALT_PORT="$(dual_stack_env_or_free_port BOB_HTTPS_ALT_PORT)"

-OBOL_LLM_MODEL="${OBOL_LLM_MODEL:-qwen36-fast}"
+OBOL_LLM_MODEL="${OBOL_LLM_MODEL:-qwen36-deep}"
 export OBOL_LLM_MODEL

 ANVIL_PORT="${FLOW13_ANVIL_PORT:-$(pick_free_port)}"
@@ -923,9 +923,9 @@ buyer_status=$(buyer_sidecar_status)
 # Mirror flow-14's relaxed assertion. Two reasons to allow remaining>=5
 # rather than exact-5: (a) controller may merge into an existing auth
 # pool on rerun (remaining=10 etc.); (b) the agent prompt asks for
-# --count 5, but qwen36-fast occasionally hallucinates --count 1, which
-# is an LLM-stochasticity issue not a buy-flow correctness issue. We
-# only care that the buy step actually provisioned at least the
+# --count 5, but the LLM occasionally hallucinates a different count,
+# which is an LLM-stochasticity issue not a buy-flow correctness issue.
+# We only care that the buy step actually provisioned at least the
 # requested count.
 remaining_n=$(echo "$buyer_status" | grep -oE 'remaining=[0-9]+' | head -1 | cut -d= -f2)
 if [ -n "$remaining_n" ] && [ "$remaining_n" -ge 5 ] 2>/dev/null; then
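The relaxed `remaining>=5` assertion in this hunk can be exercised in isolation with a sample status string. The sample text is an assumption, since the real output of `buyer_sidecar_status` is not shown in the diff; the parsing and comparison lines are the ones from the hunk.

```shell
# Sample status line; the real buyer_sidecar_status format is an assumption.
buyer_status='sidecar ok remaining=10 spent=0'

# Same extraction as flow-13: first "remaining=<n>" token, then the number.
remaining_n=$(echo "$buyer_status" | grep -oE 'remaining=[0-9]+' | head -1 | cut -d= -f2)

# Same relaxed check: any count of at least 5 passes, so a rerun that merged
# into an existing auth pool (remaining=10) is still accepted.
if [ -n "$remaining_n" ] && [ "$remaining_n" -ge 5 ] 2>/dev/null; then
  verdict="pass"
else
  verdict="fail"
fi
echo "remaining=$remaining_n verdict=$verdict"
```

A status of `remaining=1` would yield `verdict=fail` with this check.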
