🌍 Spanish version: README.es.md
Qwen3-8B answering live from the Arc 140V iGPU — 94% utilization, 1950 MHz, 7 W. The whole conversation runs locally, no provider API calls.
A local AI agent stack on Intel Core Ultra (Lunar Lake) — Qwen3-8B + Qwen2.5-VL-7B in INT4, zero token cost:
- OVMS (OpenVINO Model Server) serving INT4 models on the Arc 140V iGPU.
- Open WebUI as the chat / agent frontend.
- Pipelines running a ReAct agent (Thought / Action / Observation loop) with search, fetch and arithmetic tools.
- SearXNG locally for web search without leaking queries to third parties.
- A single OVMS backend exposed to Open WebUI and to Claude Code via an OpenAI-compatible API.
Measured throughput: ~18-20 tokens/s generating with Qwen3-8B INT4 on the Arc 140V iGPU of a Core Ultra 7 258V.
- Why local? No tokens, no network, no strings
- Architecture
- Models
- Requirements
- Configuration (`.env`) and HuggingFace token
- Setup
- OVMS as a Claude Code backend
- Memory and performance
- Project structure
- Intel resources worth knowing
- Troubleshooting
- License
- Author
Almost the entire "LLM agent" ecosystem is built on burning tokens against paid APIs (OpenAI, Anthropic, Google) or on cloud infrastructure billed per use. This project shows what a laptop-class Core Ultra (Lunar Lake, Arc 140V iGPU) gives you instead:
- Zero external token cost. Every word the model generates costs exactly the same as having the laptop on. Whether you chat all day or leave it sleeping in a drawer, the marginal cost of inference is zero. No more bill shock from a runaway loop.
- Total privacy. Prompts never leave the machine. For environments with sensitive data (clinical, legal, IP) or companies with DLP policies, this changes the game: the agent can read documents you would never be allowed to send to an external API.
- Real offline. Works without network. Travel, booth demos, air-gapped environments, or when the client's wifi is a nightmare — you still have an assistant. The only time you need internet is for the initial weight download.
- Local latency. ~18-20 tokens/s on iGPU INT4 is enough for a fluid conversation and for agent tasks that don't require frontier reasoning. First token < 1 s.
- Bill-free sandbox. Iterate on prompts, tools, ReAct pipelines, function calling, RAG… without the experiment costing you a cent. Ideal for learning, prototyping, teaching, and for the projects you'd never try "in case it gets expensive".
- Hardware you already own. Lunar Lake / Arrow Lake / Meteor Lake ship with a competent NPU + iGPU that most people don't use for anything. This stack puts them to work.
What this is NOT for: tasks that demand Sonnet / Opus / GPT-5 — complex refactors over large codebases, high-level multi-step reasoning, non-trivial code. An 8B-INT4 is ~50× smaller than a frontier model; you'll feel it. But for chat, RAG, simple function calling, assisted search and agent prototypes, it's enough.
flowchart LR
User([👤 User<br/>browser])
CC([🖥️ Claude Code CLI<br/>optional])
subgraph Stack[Docker Compose stack on WSL2]
OWUI[Open WebUI<br/>:3000]
PIPE[Pipelines<br/>ReAct agent<br/>:9099]
OVMS[OVMS<br/>OpenVINO Model Server<br/>:8000 / :9000]
SX[SearXNG<br/>:8080]
end
subgraph HW[Intel Core Ultra 7 258V]
IGPU[(iGPU Arc 140V<br/>Qwen3-8B INT4<br/>~18 tok/s)]
CPU[(CPU<br/>Qwen2.5-VL-7B INT4<br/>~6 tok/s)]
end
Router[claude-code-router<br/>or LiteLLM proxy]
User -->|HTTPS| OWUI
OWUI -->|OpenAI /v3| OVMS
OWUI -->|model: react-agent| PIPE
OWUI -->|search JSON| SX
PIPE -->|OpenAI /v3| OVMS
PIPE -->|search JSON| SX
OVMS --> IGPU
OVMS --> CPU
CC -.->|Anthropic API| Router
Router -.->|OpenAI /v3| OVMS
The single OVMS instance serves both models — clients pick which one via the
"model" field in the chat-completions request. Claude Code only needs a
small proxy/router (right side, dashed) because it speaks the Anthropic
Messages format natively.
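As a minimal sketch of that routing, the snippet below builds the same chat-completions request for either model and only changes the `"model"` field. It assumes OVMS is reachable on the default REST port from this README; the helper names `build_chat_request` and `ask` are illustrative and not part of the repo.

```python
import json
import urllib.request

# Assumption: OVMS REST API on the default host port used in this README.
OVMS_URL = "http://localhost:8000/v3/chat/completions"


def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat-completions request."""
    payload = {
        "model": model,  # "qwen3-8b" runs on the iGPU, "qwen25-vl-7b" on the CPU
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        OVMS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send the request and return the assistant text (needs the stack running)."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, once `docker compose up -d` has finished loading the models:
# print(ask("qwen3-8b", "Hi, who are you?"))
```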
| Model | Type | Precision | Device | OpenAI `model` name |
|---|---|---|---|---|
| `Qwen/Qwen3-8B` | text-generation | INT4 | iGPU | `qwen3-8b` |
| `Qwen/Qwen2.5-VL-7B-Instruct` | image-text | INT4 | CPU | `qwen25-vl-7b` |
Why not both on iGPU: the Arc 140V shares RAM with the system and its USM pool can't hold two 5 GB INT4 models + their KV caches + their compile blobs at once. Hybrid placement (LLM on GPU, VLM on CPU) is the sweet spot for this class of hardware: chat is fast (18-20 tok/s) and the VLM runs at 3-8 tok/s only when you drop in an image, where the vision encoder's cost dominates latency anyway.
Qwen2.5-VL-7B on CPU describing an uploaded image of a raccoon — Task Manager shows 0 W on the iGPU and 98% CPU utilization. The iGPU stays free for Qwen3-8B; the VLM only takes over when there's an image to look at.
- Windows with WSL2 + Docker Desktop, WSL integration enabled for this distro.
- Docker Desktop → Settings → Resources → WSL Integration → enable your distro.
- iGPU exposed to WSL: `ls /dev/dri` must show `card0` / `renderD128`.
- RAM: 32 GB physical on Windows, at least 24 GB allocated to WSL (WSL defaults to ~50% of host). If `free -h` inside WSL shows less, create `%USERPROFILE%\.wslconfig` with the contents below, then restart with `wsl --shutdown`:

      [wsl2]
      memory=26GB
      swap=8GB

- ~80 GB free for the HuggingFace cache + converted IR.
- Internet access for the weight download.
Intel iGPU + WSL2: requires up-to-date host (Windows) drivers and the `openvino/model_server:latest-gpu` image (which already ships `intel-opencl-icd` and `libze-intel-gpu`). This `docker-compose.yml` already mounts `/dev/dxg` and `/usr/lib/wsl` with the correct `LD_LIBRARY_PATH` so the host driver is visible from inside the container.
⚠️ NPU (Intel AI Boost) is not accessible from WSL2 in this version. The NPU on Lunar Lake / Meteor Lake / Arrow Lake is exposed on native Linux via `/dev/accel/accel0` with the `intel_vpu` kernel module, but WSL2 does not route that device into the Linux kernel as of today. Verify inside your WSL distro with `ls /dev/accel`: it will be empty or missing. If you want to use the NPU for inference (~10-20 tok/s on small LLMs without touching the iGPU), you have two paths:
- Leave WSL2 and run OVMS directly on Windows (PowerShell + native binary), breaking this Docker stack.
- Wait for Microsoft/Intel to enable NPU passthrough in WSL2 (on roadmap, no confirmed date).
That's why this stack uses iGPU for the LLM and CPU for the VLM, leaving the NPU untouched.
Before running anything, copy .env.example to .env:
cp .env.example .env
The example values are sane defaults. For a single-user local install you may not need to change anything except `HF_TOKEN` (see below).
| Variable | Default | Purpose |
|---|---|---|
| `OVMS_IMAGE` | `openvino/model_server:latest-gpu` | OVMS Docker image. Switch to `:latest` (no `-gpu` suffix) if you don't have an Intel iGPU and only want CPU inference. |
| `OVMS_REST_PORT` | `8000` | Host port mapped to OVMS's OpenAI-compatible REST API. |
| `OVMS_GRPC_PORT` | `9000` | Host port mapped to OVMS's gRPC API (used by tritonclient, not by Open WebUI). |
| `OPENWEBUI_IMAGE` | `ghcr.io/open-webui/open-webui:main` | Open WebUI image (chat frontend). |
| `OPENWEBUI_PORT` | `3000` | Host port where you reach Open WebUI in the browser. |
| `OPENWEBUI_AUTH` | `False` | If `True`, Open WebUI requires login (first account created = admin). Keep `False` for single-user local dev. |
| `PIPELINES_API_KEY` | `0p3n-w3bu!` | Shared key between Open WebUI and the Pipelines container (ReAct agent). Doesn't touch the public internet, but rotate it if you expose the stack on a LAN. |
| `HF_TOKEN` | (empty) | HuggingFace read token. Optional for public models (the two Qwen models used here are public). Required for gated/private models. |
The conversion script (scripts/export-models.sh) pulls these two models from HuggingFace and converts them to OpenVINO IR with INT4 quantization:
- `Qwen/Qwen3-8B`: text generation, ~16 GB FP16 download → ~5 GB INT4 on disk.
- `Qwen/Qwen2.5-VL-7B-Instruct`: vision-language, ~16 GB FP16 → ~5 GB INT4.
Both are public — no token strictly required — but having an HF_TOKEN is recommended:
- Anonymous downloads are rate-limited; with a token you avoid throttling on large files.
- You'll be ready when you switch to gated models (e.g. some Llama variants, Mistral commercial models).
Steps to get an HF token:
- Create a free account at https://huggingface.co (if you don't have one).
- Open https://huggingface.co/settings/tokens.
- Click "Create new token" → give it a name (e.g. `openvino-agent-stack`) → role "Read" (read-only is enough; do NOT pick "Write" for this use case).
- Copy the `hf_...` value.
- Paste it into your `.env`: `HF_TOKEN=hf_your_token_here`
- Re-run `./scripts/export-models.sh`; the script picks up `HF_TOKEN` from `.env` automatically.
⚠️ Don't commit `.env`. The `.gitignore` already excludes it (`.env.example` is what gets published). If you ever paste the token into chats, logs, screenshots or PRs, treat it as leaked and rotate it on the HF tokens page.
License acceptance (gated models):
If you later swap to a gated model (e.g. some Llama variants, Mistral commercial), HuggingFace will block the download until you visit the model card in a browser and click "Agree to the license". Your HF_TOKEN alone does NOT bypass this — only your acknowledgement on the web UI does. After accepting, the same token works for optimum-cli to fetch the weights.
./scripts/export-models.sh
This spawns a throwaway container with `optimum-intel`, downloads the models from HuggingFace and exports them to OpenVINO IR with INT4 compression. Output lands at:
ovms/models/
├── qwen3-8b/
│ └── (openvino_model.xml, tokenizer, etc. + graph.pbtxt)
└── qwen25-vl-7b/
└── (vision + LM models + tokenizer + graph.pbtxt)
If a model is gated (none of these are, but in case it changes later), drop your token in .env → HF_TOKEN=hf_... and re-run.
docker compose up -d
docker compose logs -f ovms   # first model load can take a while
OVMS exposes:
- REST: `http://localhost:8000/v3/chat/completions` (OpenAI-compatible)
- gRPC: `localhost:9000`
- Health: `http://localhost:8000/v2/health/ready`
Open WebUI: http://localhost:3000
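Because the first model load can take a while, a script that talks to OVMS right after `docker compose up -d` may want to poll the health endpoint first. The following is a rough sketch under that assumption; `http_probe` and `wait_until_ready` are hypothetical helper names, and the retry count and delay are arbitrary defaults.

```python
import time
import urllib.request
from typing import Callable

# Health endpoint listed above; adjust if you changed OVMS_REST_PORT.
READY_URL = "http://localhost:8000/v2/health/ready"


def http_probe(url: str = READY_URL) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_ready(
    probe: Callable[[], bool] = http_probe,
    attempts: int = 60,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # do not sleep after the last failed attempt
    return False


# Usage: block for up to ~5 minutes before sending the first prompt.
# if not wait_until_ready():
#     raise SystemExit("OVMS did not become ready; check `docker compose logs ovms`")
```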
curl -s http://localhost:8000/v3/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen3-8b",
"messages": [{"role":"user","content":"Hi, who are you?"}],
"max_tokens": 128
  }'
For the VLM (Qwen2.5-VL) with an image:
curl -s http://localhost:8000/v3/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen25-vl-7b",
"messages": [{
"role":"user",
"content":[
{"type":"text","text":"What do you see?"},
{"type":"image_url","image_url":{"url":"https://upload.wikimedia.org/wikipedia/commons/4/47/PNG_transparency_demonstration_1.png"}}
]
}],
"max_tokens": 256
  }'
- Open `http://localhost:3000`.
- If `OPENWEBUI_AUTH=False`, you're in. If you flip it to `True`, the first account created becomes admin.
- Settings → Models: you should see `qwen3-8b` and `qwen25-vl-7b` (Open WebUI pulls them from OVMS via `/v3/models`).
The stack already launches SearXNG and has function calling + RAG enabled via env vars. All that's left is loading the tools.
Tools (function calling, in ./tools/)
Open WebUI loads tools from its DB, not from disk. The three .py files in tools/ (calculator.py, web_fetch.py, weather.py) are imported like this:
- Open WebUI → Workspace → Tools → "+".
- Copy/paste the contents of the `.py` and save.
- In each chat, click the tools icon and enable the ones you want.
- Make sure "Native function calling" is checked under Settings → Interface (uses OpenAI's `tools` format, not prompting). Only `qwen3-8b` works well with tools; VLMs tend to be flaky at function calling.

The tools are also mounted inside the container at `/app/backend/data/tools-staging/` (read-only) in case a future Open WebUI version adds auto-import from disk.
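For orientation, this is roughly what a tool pasted into that editor can look like. It is a sketch, not the repo's `calculator.py`: it assumes Open WebUI's usual convention of a `Tools` class whose typed, documented methods become callable functions, and it evaluates arithmetic by walking the AST with a small operator whitelist instead of calling `eval`. Exponentiation is left out on purpose, since huge powers can stall the process.

```python
import ast
import operator

# Whitelisted operators; anything else (calls, names, attributes) is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval(node: ast.AST):
    """Recursively evaluate a parsed expression made of numbers and whitelisted ops."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")


class Tools:
    def calculator(self, expression: str) -> str:
        """
        Evaluate a basic arithmetic expression and return the result as text.
        :param expression: Arithmetic such as "(2026 - 1963) * 12".
        """
        try:
            tree = ast.parse(expression, mode="eval")
            return str(_eval(tree.body))
        except (ValueError, SyntaxError, ZeroDivisionError) as exc:
            # Return the error as text so the model can read it and retry.
            return f"error: {exc}"
```

Returning errors as strings rather than raising keeps the failure visible to the model as a normal tool result.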
Web search (SearXNG)
Already wired up via SEARXNG_QUERY_URL. In any chat, flip the "Web Search" switch under the input and the model will query SearXNG → read the top-5 results → answer with citations.
Verify SearXNG directly:
curl 'http://localhost:8888/search?q=openvino&format=json' | head -c 500
(SearXNG is not exposed on the host by default. If you want host access, add `ports: ["8888:8080"]` to the service.)
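The same check can be sketched in Python, including the "keep the top 5" step. This assumes the optional `8888:8080` port mapping mentioned above and the usual SearXNG JSON shape (a top-level `results` list whose items carry `title`, `url` and `content`); the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the optional host port mapping 8888:8080 has been added.
SEARXNG_URL = "http://localhost:8888/search"


def build_search_url(query: str, base: str = SEARXNG_URL) -> str:
    """Encode the query and request the JSON output format."""
    return f"{base}?{urllib.parse.urlencode({'q': query, 'format': 'json'})}"


def top_results(response: dict, limit: int = 5) -> list:
    """Keep only the fields an agent needs from the first `limit` hits."""
    return [
        {
            "title": hit.get("title", ""),
            "url": hit.get("url", ""),
            "snippet": hit.get("content", ""),
        }
        for hit in response.get("results", [])[:limit]
    ]


def search(query: str, limit: int = 5) -> list:
    """Query SearXNG and return trimmed results (needs the service reachable)."""
    with urllib.request.urlopen(build_search_url(query), timeout=15) as resp:
        return top_results(json.load(resp), limit)


# Usage:
# for hit in search("openvino model server"):
#     print(hit["title"], hit["url"])
```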
RAG with documents
Workspace → Knowledge → create a collection → upload PDFs / MD / TXT. In chat, prefix with #collection-name or attach the document directly. Embeddings: all-MiniLM-L6-v2 (CPU, downloaded on first use).
Images with Qwen2.5-VL
Switch the model to qwen25-vl-7b, drag an image into the input, ask. The image goes as image_url in the OpenAI payload, OVMS routes it to the VLM.
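When calling the API directly with a local file instead of a public URL, one common approach is to inline the image as a base64 `data:` URL. The sketch below assumes the server accepts `data:` URLs in `image_url`, as OpenAI-compatible servers generally do; verify this against your OVMS version. The helper names are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime: str) -> str:
    """Encode raw image bytes as a data: URL."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def file_to_data_url(path: str) -> str:
    """Read a local image and guess its MIME type from the file extension."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    return bytes_to_data_url(Path(path).read_bytes(), mime)


def build_vlm_payload(image_url: str, question: str, max_tokens: int = 256) -> dict:
    """Build the chat-completions body for the VLM with one text part and one image part."""
    return {
        "model": "qwen25-vl-7b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": max_tokens,
    }


# Usage ("raccoon.jpg" is a placeholder file name):
# payload = build_vlm_payload(file_to_data_url("raccoon.jpg"), "What do you see?")
```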
Beyond Open WebUI's native tools (step 5), the stack ships a ReAct agent running as a Pipeline. It appears in Open WebUI as another model, named react-agent.
How it works:
- The pipe (pipelines/react_agent.py) receives the user's message.
- Runs a loop up to `MAX_ITERATIONS` (default 6): asks the LLM (`qwen3-8b` via OVMS) for a `Thought + Action + Action Input`, executes the tool, injects the `Observation` back, and repeats until the model emits `Final Answer:`.
- Streams the full trace to the chat (disable it by setting `SHOW_TRACE=False` in the valves).

Internal agent tools (different from step 5's; these live inside the pipeline):
- `search(query)` → SearXNG
- `fetch(url)` → GET + HTML cleanup
- `calc(expression)` → safe arithmetic
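The loop described above can be sketched in a few lines. This is not the code in `pipelines/react_agent.py`; it is a simplified illustration in which the LLM is any callable that maps the transcript so far to the next reply, tools are plain string-to-string functions, and the two regexes are one possible way to parse the format.

```python
import re
from typing import Callable, Dict

# One possible grammar for the model's reply; the real pipeline may differ.
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*\nAction Input:\s*(.+)")
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.DOTALL)


def react_loop(
    question: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_iterations: int = 6,
) -> str:
    """Run Thought / Action / Observation rounds until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_iterations):
        reply = llm(transcript)
        transcript += reply + "\n"

        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()

        action = ACTION_RE.search(reply)
        if not action:
            # Nudge the model back to the format instead of crashing.
            transcript += "Observation: could not parse an Action; follow the format.\n"
            continue

        name, arg = action.group(1), action.group(2).strip()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool '{name}'"
        transcript += f"Observation: {observation}\n"
    return "No final answer within the iteration limit."
```

In the real stack `llm` would wrap a chat-completions call to `qwen3-8b`, and `tools` would map `search`, `fetch` and `calc` to their implementations. Keeping both as parameters makes the loop testable with a scripted fake model.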
Usage:
- Open WebUI → model selector at the top → pick `react-agent`.
- Ask something that requires searching/calculating, e.g. "How old is Nvidia's CEO right now? Compute his age assuming he was born on Feb 17, 1963."
- You'll see each `Thought / Action / Observation` in the chat, with the final answer at the end.
Tuning the agent:
Open WebUI → Admin Panel → Pipelines → react-agent → Valves. You can tune:
- `MAX_ITERATIONS`, `TEMPERATURE`, `MAX_TOKENS_PER_STEP`
- `MODEL` (point at a different OVMS model)
- `SHOW_TRACE` (hide the trace if you only want the final answer)
Handling Qwen3's known quirks:
Qwen3 ships with thinking mode enabled by default — it emits free-form
<think>...</think> blocks before any structured output. That breaks the
strict Thought: / Action: / Action Input: grammar the ReAct loop relies on.
This pipeline applies four defensive patches:
| Patch | Why it's needed |
|---|---|
| System prompt starts with `/no_think` | Qwen3's official directive to skip the thinking phase. Without it the model rambles in prose and the regex parser bails on turn 1. |
| Output is run through `_strip_thinking()` (removes any `<think>...</think>` blocks) before parsing | Defensive: Qwen3 occasionally still emits a short `<think>` even with `/no_think`. We strip it instead of failing. |
| System prompt is in English with a few-shot example | Qwen3 follows English ReAct system prompts more reliably than Spanish; instruction tuning is heavily weighted toward English. |
| Default `TEMPERATURE = 0.0` (was 0.2) | Greedy decoding makes the output repeatable, which stabilizes format compliance. Raise via valves only if you want stylistic variety in answers. |
These four patches together make the loop go from "fails immediately"
to "completes in ~4 s for a calc query" on the same hardware.
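For illustration, a stripping helper along the lines of `_strip_thinking()` could look like the sketch below. It is a guess at the behavior, not the pipeline's actual code: it removes complete `<think>...</think>` blocks and, as an extra assumption, also drops an unclosed `<think>` left behind when generation is cut off by the token limit.

```python
import re

# Complete blocks, non-greedy so several blocks in one reply are handled separately.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# An opening tag with no closing tag (for example, output truncated mid-thought).
_UNCLOSED_THINK = re.compile(r"<think>.*\Z", re.DOTALL | re.IGNORECASE)


def strip_thinking(text: str) -> str:
    """Remove Qwen3 thinking blocks so only the structured ReAct text remains."""
    text = _THINK_BLOCK.sub("", text)
    text = _UNCLOSED_THINK.sub("", text)
    return text.strip()


raw = "<think>\nLet me see...\n</think>\n\nThought: I should use calc.\nAction: calc"
print(strip_thinking(raw))
# Thought: I should use calc.
# Action: calc
```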
Other Qwen3 caveats that are out of scope of this pipeline but worth knowing:
- Native OpenAI tool calling: Qwen3-8B's `tools=[...]` support is shaky via OVMS. We use a text-prompted ReAct loop instead, which is slower but far more reliable on small models.
- Context window: ~32k tokens. Long ReAct traces with many tool calls can blow past it; lower `MAX_ITERATIONS` or `MAX_TOKENS_PER_STEP` if you see truncation.
Pipelines vs native Tools — when to use which:
- Native tools (step 5): the LLM decides when to call them via `tools=[...]` in the API. Faster, less transparent.
- ReAct pipeline: explicit loop in Python, you control iterations and prompt, you see every step. More robust when the model is shaky on native function calling.
Claude Code is Anthropic's official CLI. By default it talks to api.anthropic.com using the Anthropic Messages API format. OVMS speaks OpenAI Chat Completions. They aren't the same: you need a proxy / router that translates between them.
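To make the format gap concrete, here is a deliberately simplified sketch of the request-side translation such a proxy performs. It only handles plain text: the top-level `system` field becomes a leading `system` message, and lists of text content blocks are flattened into strings. Tool-use blocks, images, streaming and the response-side conversion are ignored, and the function names are made up for this example.

```python
from typing import Any


def _flatten(content: Any) -> str:
    """Anthropic content is either a string or a list of typed blocks; keep the text."""
    if isinstance(content, str):
        return content
    return "\n".join(block.get("text", "") for block in content if block.get("type") == "text")


def anthropic_to_openai(request: dict, target_model: str = "qwen3-8b") -> dict:
    """Convert an Anthropic Messages request body into an OpenAI chat-completions body."""
    messages = []
    if request.get("system"):
        # Anthropic keeps the system prompt outside the message list.
        messages.append({"role": "system", "content": _flatten(request["system"])})
    for msg in request.get("messages", []):
        messages.append({"role": msg["role"], "content": _flatten(msg["content"])})

    converted = {"model": target_model, "messages": messages}
    for key in ("max_tokens", "temperature"):
        if key in request:
            converted[key] = request[key]
    return converted


example = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 64,
    "system": "Be brief.",
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hi"}]}],
}
print(anthropic_to_openai(example)["messages"])
# [{'role': 'system', 'content': 'Be brief.'}, {'role': 'user', 'content': 'Hi'}]
```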
Honesty first: using Qwen3-8B INT4 inside Claude Code instead of Sonnet / Opus significantly degrades quality. Claude Code is tuned and validated against real Claude models (precise tool calling, prompt caching, extended thinking, surgical file editing). A local 8B model does not replicate that. Use it for:
- Learning how Claude Code works internally.
- Iterating without burning credits on trivial tasks.
- Working offline or in air-gapped environments.
- Demos, workshops and teaching.
Don't use it to ship critical production code.
Two paths depending on how fine you want to weave it.
A router built exactly for plugging Claude Code into OpenAI-compatible backends. It replaces the claude binary with ccr code.
npm install -g @musistudio/claude-code-router
Create `~/.claude-code-router/config.json`:
{
"Providers": [
{
"name": "local-ovms",
"api_base_url": "http://localhost:8000/v3/chat/completions",
"api_key": "sk-not-checked",
"models": ["qwen3-8b"]
}
],
"Router": {
"default": "local-ovms,qwen3-8b",
"background": "local-ovms,qwen3-8b",
"think": "local-ovms,qwen3-8b",
"longContext": "local-ovms,qwen3-8b"
}
}
Start the router (leave it running in another terminal):
ccr start
And, instead of `claude`, run:
ccr code
Claude Code hits your local OVMS without realizing it isn't Anthropic.
LiteLLM is a proxy that exposes an Anthropic-compatible API on one side and speaks any backend (OpenAI, OVMS, Ollama, vLLM, etc.) on the other. More flexible if you're juggling several models / providers.
pip install 'litellm[proxy]'
`litellm.config.yaml`:
model_list:
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: openai/qwen3-8b
      api_base: http://localhost:8000/v3
      api_key: sk-not-checked
  - model_name: claude-haiku-4-5
    litellm_params:
      model: openai/qwen3-8b
      api_base: http://localhost:8000/v3
      api_key: sk-not-checked
Launch the proxy:
litellm --config litellm.config.yaml --port 4000
And point Claude Code at it:
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_AUTH_TOKEN=sk-anything
claude   # now uses your local OVMS
The `model_name` you set must match the one Claude Code tries to call (usually the alias of the current Sonnet / Haiku model). If Claude Code asks for a model LiteLLM doesn't know, the request will fail; add that name to the `model_list` with the same `litellm_params`.
- Tool calling: Qwen3-8B supports tools, but its format can drift from what Claude Code expects. The internal tools (`Bash`, `Edit`, `Read`) may need an adapted prompt template or fail silently (badly serialized params, calls the model "invents" in plain text instead of the `tools` field).
- Prompt caching: Anthropic optimizes with cache breakpoints. OVMS ignores them, so every request re-processes the whole context.
- Context window: Qwen3-8B supports ~32k tokens; Claude Code assumes up to 200k. Long conversations will truncate and the model will "forget" the beginning.
- Extended thinking: Qwen3 has its own `<think>` mode (you'll see it in responses). It's not Claude's "extended thinking", but Claude Code can get confused by the tags.
- Speed: ~18-20 tok/s on iGPU. Fine for short tasks, frustrating for `claude` doing long refactors.
Sweet spot: "explain this file", "write a test for this function", "what does this snippet do", "rename this variable across the directory". For architectural refactors, go back to Sonnet / Opus.
Default placement (see Models table). Numbers measured on Core Ultra 7 258V, 32 GB RAM, INT4:
- `qwen3-8b` → Arc 140V iGPU: ~18 tok/s sustained decoding, first-token < 1 s. Measured: 145 tokens in 7.9 s.
- `qwen25-vl-7b` → CPU: ~6 tok/s sustained decoding. Measured: 191 tokens in 31.6 s. With an image, the first token adds 3-8 s for the vision encoder.
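The rounded rates follow directly from the two measured runs. A small check of the arithmetic, using only the token counts and timings quoted above:

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average decode rate over one run."""
    return tokens / seconds


# (tokens generated, wall-clock seconds) as measured in this README
runs = {
    "qwen3-8b (iGPU)": (145, 7.9),
    "qwen25-vl-7b (CPU)": (191, 31.6),
}

for name, (tokens, seconds) in runs.items():
    print(f"{name}: {tokens_per_second(tokens, seconds):.1f} tok/s")
# qwen3-8b (iGPU): 18.4 tok/s
# qwen25-vl-7b (CPU): 6.0 tok/s
```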
Why we didn't try NPU: as explained in Requirements, Intel's NPU is not accessible from WSL2 in this version. If we had access, it would be the ideal target for the VLM (frees the iGPU without penalizing as much as CPU).
If you really want both on GPU (not recommended, runs at the edge):
- Drop `cache_size` in both `graph.pbtxt` (KV cache in GB, default 0 = dynamic).
- Reduce `max_num_seqs` to 4-8 and `max_num_batched_tokens` to 2048.
- Add `"KV_CACHE_PRECISION":"u8"` to `plugin_config`; this halves the KV cache footprint.
- Drop `enable_prefix_caching: true`.
- Even then, any long prompt can knock the iGPU out with `USM Host allocation failed`.
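A back-of-the-envelope estimate shows why the KV cache matters as much as the weights. The sketch below uses the standard formula (keys plus values, per layer, per KV head, per token). The Qwen3-8B shape used here (36 layers, 8 KV heads, head dimension 128) is an assumption taken from memory of the published model config, so check the model's `config.json` before relying on it. The u8 figure ignores the small overhead of quantization scales.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Size of the KV cache for `tokens` cached positions."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# ASSUMED architecture values for Qwen3-8B; verify against its config.json.
QWEN3_8B = {"layers": 36, "kv_heads": 8, "head_dim": 128}

full_context = 32_768  # the ~32k-token window mentioned in this README
fp16 = kv_cache_bytes(full_context, bytes_per_value=2, **QWEN3_8B)
u8 = kv_cache_bytes(full_context, bytes_per_value=1, **QWEN3_8B)

print(f"FP16 KV cache at 32k tokens: {fp16 / GIB:.2f} GiB")  # 4.50 GiB
print(f"u8   KV cache at 32k tokens: {u8 / GIB:.2f} GiB")    # 2.25 GiB
```

Under these assumptions a single full-length sequence in FP16 costs about as much memory as the INT4 weights themselves, which is consistent with the advice to cap the cache and switch it to u8.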
The first generation after docker compose up will be slow (30-90 s): OVMS compiles the model for the iGPU and stores the compiled blob in /tmp/.ov_cache/ inside the container. Subsequent ones are immediate while the container is alive. If you recreate the container, it recompiles.
.
├── docker-compose.yml
├── .env.example
├── scripts/
│   └── export-models.sh        # HF → OpenVINO INT4 conversion
├── ovms/
│   ├── config.json             # model list for OVMS
│   └── models/
│       ├── qwen3-8b/graph.pbtxt
│       └── qwen25-vl-7b/graph.pbtxt
├── searxng/
│   └── settings.yml            # SearXNG with JSON format enabled
├── tools/                      # Open WebUI native tools (step 5)
│   ├── calculator.py
│   ├── web_fetch.py
│   └── weather.py
├── pipelines/                  # ReAct agent (step 6)
│   └── react_agent.py
└── openwebui-data/             # persistent Open WebUI state (gitignored)
A short, opinionated list of upstream Intel-side resources behind this stack — useful if you want to go deeper, swap components, or contribute.
- 📖 OpenVINO documentation — official docs, the authoritative source for plugin behavior, supported ops and device hints.
- 🐙 `openvinotoolkit/openvino`: the core C++/Python runtime. Browse the GPU plugin source when GPU compile fails in non-obvious ways (the error we hit with `is_static()` was traced from there).
- 🐙 `openvinotoolkit/openvino.genai`: the LLM-specific runtime layer (continuous batching, KV cache, chat templates). What OVMS uses under the hood.
- 🐙 `openvinotoolkit/nncf`: Neural Network Compression Framework. Read the weight compression docs to understand what `--weight-format int4 --group-size 64` actually does.
- 🐙 `openvinotoolkit/model_server`: the server we run. The demos directory has canonical `graph.pbtxt` examples for every task (text generation, embeddings, rerank, image generation, VLMs). When in doubt, copy from there.
- 🛠 `optimum-intel`: the HuggingFace bridge that turns `Qwen/Qwen3-8B` into an OpenVINO IR. `scripts/export-models.sh` is essentially a wrapper around its `optimum-cli export openvino`.
- 🤗 `OpenVINO` organization on HuggingFace: official pre-converted OpenVINO IR models (Qwen, Llama, Phi, Mistral, embedding models, etc.). If you don't want to wait for the local INT4 conversion, grab one of these directly and skip `scripts/export-models.sh`.
- 💻 Intel Core Ultra processors (Series 2) — the family. Lunar Lake (Series 2) is what this stack was tuned for, but the same compose file works on Meteor Lake and Arrow Lake H/HX with the same iGPU/CPU placement logic.
- 💻 Intel Arc Graphics — both the integrated (Arc 140V here) and discrete Arc lineups speak the same OpenVINO plugin. If you have a discrete Arc A770/B580, you can reuse this exact stack with much more headroom for big models.
- ☁️ Intel Tiber AI Cloud — Intel's developer cloud (formerly Intel Developer Cloud). Useful if you want to try the same OVMS stack on a larger Xeon + GPU instance before buying hardware.
- 📰 OpenVINO blog — release notes, perf numbers and model-support announcements. Subscribe if you live in this ecosystem.
- 🎥 Intel Developer YouTube — OpenVINO/OVMS deep-dives and conference talks.
- `/dev/dri` doesn't exist in WSL: update Windows + Intel Arc drivers. Restart WSL: `wsl --shutdown`.
- OVMS doesn't detect GPU: from inside the container `clinfo` should list the iGPU. If not, check `/dev/dri` permissions (the `group_add` 992/44 covers WSL2; if your render device has a different GID, adjust it).
- Open WebUI doesn't see the models: check `docker compose logs openwebui` and that `curl http://localhost:8000/v3/models` from the host returns both models.
- OOM when loading the second model: see the "Memory and performance" section.
- Tools don't fire: enable "Native function calling" in Open WebUI → Settings → Interface. If they still don't, the model may not handle `tools` well via OVMS; try `qwen3-8b`.
- SearXNG returns 0 results: the JSON format must be enabled in `searxng/settings.yml` (it already is). If SearXNG started before it could read the config, restart it: `docker compose restart searxng`.
- `react-agent` doesn't appear: check `docker compose logs pipelines`; it should show "Loaded react-agent". If not, check the `.py` syntax. After editing `pipelines/react_agent.py`, restart: `docker compose restart pipelines`.
- ReAct stuck in a loop: the model isn't following the format. Drop `TEMPERATURE` to 0.0–0.1 in the valves; if it persists, tweak `SYSTEM_PROMPT` in the `.py` with few-shot examples.
- Inference hangs silently (request accepted, no response): the `graph.pbtxt` is missing the `input_stream_handler { SyncSetInputStreamHandler ... }` block needed by OVMS 2026.1.0+ to sync the LOOPBACK back-edge. The graphs in this repo already include it; if you write your own, copy the structure from `ovms/models/qwen3-8b/graph.pbtxt` or regenerate with `ovms --pull --task text_generation --source_model <HF_ID>`.
MIT — free for personal, commercial and private use, no warranty.
Mariano Ortega de Mues, Ph.D, Msc AI · 2026
If this project helps you, a ⭐ on GitHub keeps me motivated to open more things. PRs and issues welcome.

