mios-daemon + agent-pipe router: repoint at the iGPU lane (:11435)

mios-dev · claude · mios-dev · commit e3b17a03519d · 2026-05-18T13:06:19.000-04:00
With mios-ollama-igpu standing up the micro-LLM lane in the
preceding commit, repoint the two known micro-LLM clients at it
so dGPU/CUDA traffic (qwen2.5-coder:7b polish, big Hermes
inference) stops competing with classifier + nudger requests
for queue slots.

usr/libexec/mios/mios-daemon
  MIOS_DAEMON_ENDPOINT default flips from :11434 -> :11435. The
  daemon IS the iGPU micro-LLM agent per the operator's stated
  architecture ("iGPU micro-llm(s)/mios-daemon agent collects
  Linux systems logs, journals, relevant AI files--etc-etc"); the
  prior default had it talking to the dGPU/CUDA lane instead.

usr/lib/mios/agent-pipe/server.py
  MIOS_AGENT_PIPE_ROUTER_ENDPOINT default flips from :11434 ->
  :11435. The Layer-1 router classifier (qwen3:1.7b) now lives
  exclusively on the iGPU lane; under dGPU saturation router
  latency stays bounded because it never queues behind big-model
  inference.

usr/lib/systemd/system/mios-agent-pipe.service
  Adds explicit Environment=MIOS_AGENT_PIPE_ROUTER_ENDPOINT line
  so an operator inspecting the unit (`systemctl cat
  mios-agent-pipe.service`) sees the routing decision without
  having to read the Python defaults.

Live-verified on podman-MiOS-DEV:
  /health.router.endpoint -> http://localhost:11435  ✓
  Chat fast-path end-to-end -> 7.2s first-call (cold ollama-rocm
    runner, CPU fallback because WSL kernel doesn't expose AMD
    /dev/kfd + /dev/dri) -> "Hello!" returned cleanly  ✓
  podman logs mios-ollama-igpu shows the POST landed at the new
    lane: [GIN] 200 | 7.195948393s | POST "/v1/chat/completions"  ✓
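
The /health check above can be scripted. A minimal sketch, assuming
agent-pipe listens on port 8640 (the MIOS_PORT_AGENT_PIPE value in the
unit) and that /health returns JSON shaped like
{"router": {"endpoint": ...}}; both helper names are hypothetical and
not part of this commit:

```python
import json
import urllib.request
from typing import Optional

# Default the router is expected to report after this commit.
EXPECTED_ROUTER_ENDPOINT = "http://localhost:11435"


def router_endpoint(health: dict) -> Optional[str]:
    """Extract router.endpoint from a parsed /health payload.

    The payload shape is an assumption inferred from the
    "/health.router.endpoint" notation in the verification notes.
    """
    router = health.get("router")
    if not isinstance(router, dict):
        return None
    return router.get("endpoint")


def fetch_health(base_url: str = "http://localhost:8640",
                 timeout: float = 5.0) -> dict:
    """GET <base_url>/health and parse the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)
```

Comparing router_endpoint(fetch_health()) against
EXPECTED_ROUTER_ENDPOINT on a running host reproduces the first check.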

On bare-metal MiOS-bootc, where the host kernel exposes the AMD
iGPU, the same setup should activate ROCm on the iGPU and lower
router / nudger latency further (not measured here). The code path
is identical; only the deployment-time hardware capability differs.

Operator overrides remain available -- a deployment with only an
NVIDIA dGPU (no AMD iGPU lane) can point both endpoints back at
:11434 by setting MIOS_AGENT_PIPE_ROUTER_ENDPOINT in
/etc/mios/agent-pipe.env and MIOS_DAEMON_ENDPOINT in the
mios-daemon environment.
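
How an override resolves can be sketched as follows. This mirrors the
os.environ.get(...).rstrip("/") pattern used for ROUTER_ENDPOINT in
server.py; resolve_endpoint is a hypothetical helper for illustration,
not code from this commit:

```python
import os
from typing import Mapping

# Default introduced by this commit: the iGPU micro-LLM lane.
IGPU_DEFAULT = "http://localhost:11435"


def resolve_endpoint(var: str,
                     default: str = IGPU_DEFAULT,
                     env: Mapping[str, str] = os.environ) -> str:
    """Return the endpoint named by `var`, or the built-in default.

    A trailing slash is stripped so callers can append paths safely.
    """
    return env.get(var, default).rstrip("/")


# No override set: the iGPU lane is used.
print(resolve_endpoint("MIOS_AGENT_PIPE_ROUTER_ENDPOINT", env={}))

# NVIDIA-only deployment: the operator points the router back at :11434.
print(resolve_endpoint(
    "MIOS_AGENT_PIPE_ROUTER_ENDPOINT",
    env={"MIOS_AGENT_PIPE_ROUTER_ENDPOINT": "http://localhost:11434/"},
))
```

For the agent-pipe unit, systemd applies EnvironmentFile= values over
Environment= lines, so an entry in /etc/mios/agent-pipe.env takes
precedence over the default written into the unit.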

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/usr/lib/mios/agent-pipe/server.py b/usr/lib/mios/agent-pipe/server.py
@@ -75,8 +75,13 @@
 ROUTER_ENABLED = os.environ.get("MIOS_AGENT_PIPE_ROUTER_ENABLED",
                                 "true").lower() not in {"false", "0", "no"}
 ROUTER_MODEL = os.environ.get("MIOS_AGENT_PIPE_ROUTER_MODEL", "qwen3:1.7b")
+# Router runs the micro-LLM classifier (qwen3:1.7b) on the iGPU lane
+# (mios-ollama-igpu at :11435) -- isolates micro-LLM workload from the
+# dGPU/CUDA queue so router latency stays sub-second even when big-model
+# inference is saturating :11434. Falls back to the CUDA-ollama lane
+# if the iGPU instance is down (operator override via the env).
 ROUTER_ENDPOINT = os.environ.get(
-    "MIOS_AGENT_PIPE_ROUTER_ENDPOINT", "http://localhost:11434"
+    "MIOS_AGENT_PIPE_ROUTER_ENDPOINT", "http://localhost:11435"
 ).rstrip("/")
 ROUTER_TIMEOUT_S = int(os.environ.get("MIOS_AGENT_PIPE_ROUTER_TIMEOUT_S", "12"))
 ROUTER_MAX_TOKENS = int(os.environ.get("MIOS_AGENT_PIPE_ROUTER_MAX_TOKENS", "200"))
diff --git a/usr/lib/systemd/system/mios-agent-pipe.service b/usr/lib/systemd/system/mios-agent-pipe.service
@@ -18,6 +18,12 @@ EnvironmentFile=-/etc/mios/agent-pipe.env
 Environment=MIOS_PORT_AGENT_PIPE=8640
 Environment=MIOS_AGENT_PIPE_BACKEND=http://localhost:8642/v1
 Environment=MIOS_AGENT_PIPE_BACKEND_MODEL=hermes-agent
+# Router calls qwen3:1.7b on the iGPU micro-LLM lane (:11435), not
+# the dGPU/CUDA lane (:11434). Keeps router latency sub-second under
+# dGPU load. Override via /etc/mios/agent-pipe.env if a deployment
+# doesn't have mios-ollama-igpu standing up (e.g., bare-metal with
+# only NVIDIA dGPU).
+Environment=MIOS_AGENT_PIPE_ROUTER_ENDPOINT=http://localhost:11435
 Environment=MIOS_DB_URL=http://localhost:8000
 Environment=MIOS_DB_USER=root
 Environment=MIOS_DB_PASS=root
diff --git a/usr/libexec/mios/mios-daemon b/usr/libexec/mios/mios-daemon
@@ -72,7 +72,13 @@ log = logging.getLogger("mios-daemon")
 # iGPU CDI lane (wsl2-amd.yaml / wsl2-intel.yaml) when present, freeing
 # the dGPU for big-model work. Override via MIOS_DAEMON_MODEL.
 MODEL = os.environ.get("MIOS_DAEMON_MODEL", "qwen3:1.7b")
-ENDPOINT = os.environ.get("MIOS_DAEMON_ENDPOINT", "http://127.0.0.1:11434")
+# mios-ollama-igpu (sibling at :11435) is the canonical micro-LLM lane
+# for the daemon. Falls back to the CUDA-ollama lane (:11434) if the
+# iGPU instance isn't running. Operator directive 2026-05-18: "iGPU
+# micro-llm(s)/mios-daemon agent collects Linux systems logs, journals,
+# relevant AI files" -- the daemon IS the iGPU micro-LLM agent; this
+# default points it at the right lane.
+ENDPOINT = os.environ.get("MIOS_DAEMON_ENDPOINT", "http://127.0.0.1:11435")
 STATE_DIR = Path(os.environ.get("MIOS_DAEMON_STATE_DIR", "/var/lib/mios/daemon"))
 STATE_FILE = STATE_DIR / "state.json"
 CLASSIFY_BATCH_S = float(os.environ.get("MIOS_DAEMON_CLASSIFY_S", "30"))