
Commit d892e4a

localai-bot and mudler authored
feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758)
* test(e2e-backends): allow BACKEND_BINARY for native-built backends

  Adds an escape hatch for hardware-gated backends (e.g. ds4) where the model is too large for Docker build context. When BACKEND_BINARY points at a run.sh produced by 'make -C backend/cpp/<name> package', the suite skips docker image extraction and drives the binary directly.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(e2e-backends): validate BACKEND_BINARY basename + log actual source

  Two follow-ups from the cbcf514 code review:

  - BACKEND_BINARY now requires a path whose basename is `run.sh`. Without this check, `filepath.Dir(binary)` silently discarded the filename, so pointing the env var at an arbitrary binary failed later with a confusing assertion that named a path the user never typed.
  - The "Testing image=..." debug line printed an empty string when the binary path was used, hiding the actual source in CI logs. The line now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect as `src=...`.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): scaffold ds4 backend dir

  Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the implementation arrive in follow-up commits.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add backend Makefile

  Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then invokes CMake on our wrapper.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add CMakeLists for grpc-server

  Generates protoc stubs from backend.proto, links grpc-server.cpp + dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built ds4 engine .o files. DS4_GPU=cuda|metal|cpu selects the backend.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): grpc-server skeleton + module stubs

  The minimum that links: Backend service with Health + Free; other RPCs default to UNIMPLEMENTED. Stub headers/sources for dsml_parser, dsml_renderer, and kv_cache are in place so CMake links cleanly even before those modules ship.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement LoadModel

  Opens engine + creates session sized to ContextSize (default 32768). Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else CUDA. MTP/speculative options are accepted via ModelOptions.Options[] (mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into g_kv_cache_dir for the cache module (Task 19 wires it in).

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement TokenizeString

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Predict (plain text)

  Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement PredictStream (plain text)

  ChatDelta + reasoning/tool_calls split arrives in Task 14.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Status RPC

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add DSML streaming parser

  Classifies raw model-emitted token text into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the literal DSML strings rendered by ds4_server.c's prompt template (<|DSML|tool_calls>, <|DSML|invoke name=...>, <think>, etc.) - these are plain text the model emits, not special tokens. Partial markers split across token chunks are buffered until a full marker or a definitively-not-a-marker '<' is observed. RandomToolId() generates the API-side tool call id (call_xxx) that exact-replay would key on.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes

  C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape producing byte 0xCD, eating the 'D'. The markers were never actually matching the DSML text the model emits. Split each escape with adjacent string literal concatenation so the byte sequence is exactly EF BD 9C 44 (|D) at runtime. Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively expose std::strlen / std::snprintf via <string>). The local plan file (uncommitted) was also updated with the same fixes so Task 16's dsml_renderer.cpp does not re-introduce the bug.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta)

  Non-streaming Predict now emits one ChatDelta carrying content, reasoning_content, and tool_calls[] parsed from the model's DSML output. Reply.message still carries the raw model bytes for backends that prefer the regex fallback path.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into PredictStream

  Per-token ChatDelta writes: content/reasoning_content go incrementally, tool_calls emit TOOL_START as one delta (id + name) followed by TOOL_ARGS deltas with incremental JSON. The Go-side aggregator (pkg/functions/chat_deltas.go) reassembles them.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): chat template + reasoning_effort mapping

  UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append / assistant_prefix. PredictOptions.Metadata['enable_thinking'] and ['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default; 'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE). Tool-call rendering for assistant turns with tool_calls JSON arrives in the next commit (dsml_renderer).

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML

  Closes the round-trip: when an OpenAI client sends a multi-turn chat where prior turns contain tool_calls or role=tool messages, build_prompt serializes them back to the DSML shape the model was trained on. Mirrors ds4_server.c's prompt renderer; uses nlohmann::json for parsing the OpenAI tool_calls payload.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): disk KV cache module

  Dir-based cache keyed by SHA1(rendered prompt prefix). File format: 'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes + ds4_session_save_payload output. NOT bit-compatible with ds4-server's KVC files - that interop is a follow-up plan. LoadLongestPrefix walks the dir picking the longest stored prefix that prefixes the incoming prompt.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream

  LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for the request, tries LoadLongestPrefix to recover state, then Saves the new state after generation. ds4_session_sync handles the live-cache fast path internally, so the disk cache only matters for cold-starts and cross-session reuse.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add package.sh

  Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into package/lib so the FROM scratch image boots without a host libc. Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp

  ds4.h defines 'typedef enum {...} ds4_backend' which collides with our C++ 'namespace ds4_backend' anywhere a TU includes both.
  kv_cache.h includes ds4.h directly and surfaces the conflict immediately; other TUs would hit it once gRPC dev headers are available. Renames the C++ namespace to ds4cpp across all wrapper files and the plan, leaving the upstream ds4 typedef untouched.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): add Dockerfile.ds4

  Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu) -> FROM scratch with packaged grpc-server + bundled runtime libs. nlohmann-json3-dev is required for dsml_renderer's JSON handling.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile

  BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4 in docker-build-backends + .NOTPARALLEL guards. Also adds the backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh (landed in Task 24).

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch)

  Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13. Darwin Metal is handled outside this matrix by backend_build_darwin.yml.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add ds4 meta + image entries

  cpu + cuda13 x latest + master. Darwin Metal builds publish under ds4-darwin via the existing llama-cpp-darwin OCI pipeline.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(scripts/build): add ds4-darwin.sh

  Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh: make grpc-server -> otool -L for dylib bundling -> OCI tar that 'local-ai backends install' consumes via the backends/ds4-darwin Makefile target.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(darwin): build ds4-darwin in backend_build_darwin

  Adds a 'Build ds4 backend (Darwin Metal)' step that runs the backends/ds4-darwin Makefile target on the macOS runner.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(import): auto-detect ds4 weights via DS4Importer

  Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf repo URI and the DeepSeek-V4-Flash-*.gguf filename pattern. Registered before LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling through to llama-cpp. Also lists ds4 in /backends/known so the /import-model UI surfaces it as a manual choice for users who want to force the backend on a non-canonical URI.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add deepseek-v4-flash-q2 (ds4 backend)

  One-click install of the q2 weights with backend: ds4.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(.agents): add ds4-backend.md

  Documents the backend shape, DSML state machine, thinking-mode mapping, disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY hardware-validation path.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps

  The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to 2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported into the environment so it can pick the right cuda-keyring / cudss / nvpl debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern.
  Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13' failed at:

      /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add Metal image entries for ds4

  Adds metal-ds4 + metal-ds4-development image entries pointing at quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4 (built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the 'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and ds4-development variant. Closes a gap from the initial Task 23 landing - the Darwin Metal build script and CI workflow step were already wired (Tasks 24-25), but the gallery had no image entry for users to install the Metal variant.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries

  The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04' which clashes with install-base-deps.sh's cuda-keyring step:

      E: Conflicting values set for option Signed-By regarding source https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/

  The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain 'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA from scratch via its own keyring setup. Adopting that here.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): drop install-base-deps.sh dependency

  The .docker/install-base-deps.sh pipeline is built around the llama-cpp needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at /opt/grpc. For ds4 we don't need any of that:

  - CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda ready to go; install-base-deps's keyring step then conflicts with the pre-installed Signed-By.
  - gRPC: ds4's grpc-server.cpp only links against grpc++; system libgrpc++-dev (apt) is sufficient, no source build needed.

  Replaced the install-base-deps invocation in Dockerfile.ds4 with a direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc nlohmann-json3-dev cmake build-essential pkg-config git'. Matrix entries back to nvidia/cuda base + skip-drivers=true so install-base-deps would no-op even if some downstream tooling calls it.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus

  Two compile bugs caught by the docker build:

  1. proto::Message uses snake_case accessors. The build_prompt loop called m.toolcalls() / m.toolcallid() - the protoc-generated names are m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the wrapper.
  2. The Status RPC method shadowed the 'using grpc::Status' alias, so any later method declaration using Status as a return type failed to parse ('Status does not name a type' starting at LoadModel). Solution: alias grpc::Status as GStatus instead, with no 'using' clause that would conflict. All RPC method declarations and return-statement constructions now use GStatus.

  Pre-existing code reviewer flagged the Status-shadow concern as 'minor' in the original Task 10 commit; it turned out to be a real compile blocker under libstdc++ 13 once the surrounding methods were filled in.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush

  When the model emitted a parameter value that arrived in the same buffer as the surrounding tool_call markers (e.g. the buffered tail after a literal '</think>' opened the model output), the parser deferred all buffered bytes to Flush() because looks_like_prefix() always returns true while buf starts with '<'. Flush() then drained the buffer as plain CONTENT/REASONING regardless of parser state, so the bytes between the parameter open and close markers were classified as CONTENT instead of TOOL_ARGS.

  Symptom: the model emitted

      <|DSML|parameter name="location" string="true">Paris, France</|DSML|parameter>

  and the assembled tool_call arguments came out as {"location":""} - the opener and closer were emitted into the args stream but the "Paris, France" content went to the assistant message instead.

  Fix:

  1. Flush() now uses the same state-aware emit logic as DrainPlain: PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string), THINK bytes become REASONING, TEXT bytes become CONTENT, and INVOKE / TOOL_CALLS structural whitespace is discarded.
  2. looks_like_prefix() restricts its leading-'<' fallback to buffers that have not yet seen a '>'. Without that change, char-by-char feeds would discard the '<' of '<|DSML|invoke name="..."' once the marker prefix length was reached but the closing quote/'>' were still in flight.

  Verified with a standalone harness that runs the failing input three ways (single Feed, split-after-'>', and char-by-char) and aggregates TOOL_ARGS for tool index 0: all three now produce {"location":"Paris, France"}.

  Assisted-by: Claude:opus-4.7 [Read,Edit,Bash]
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence

  ds4_engine_generate_argmax() is a self-contained helper that doesn't take or update a ds4_session - it manages its own internal state. Our Predict and PredictStream methods created g_session via ds4_session_create() but then called ds4_engine_generate_argmax(), so g_session's KV state never advanced. ds4_session_payload_bytes(g_session) returned 0 and the disk KV cache save correctly rejected with 'session has no valid checkpoint to save'.

  Switch both RPCs to the proper session API:

      ds4_session_sync(g_session, &prompt, ...)
      loop:
        int token = ds4_session_argmax(g_session)
        if token == eos: break
        emit(token)
        ds4_session_eval(g_session, token, ...)

  After the loop the session has a real checkpoint and ds4_session_save_payload writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three .kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets kv_cache_dir, and the e2e tool-call assertion still passes.

  Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save path + payload_bytes + result) so future failures are visible instead of silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream when the cache is enabled - and skipped entirely when the option is unset.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded

  Wires MTP (Multi-Token Prediction) speculative decoding into the manual generation loop in both Predict and PredictStream. When the upstream MTP weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal, ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per verifier step. When MTP is not loaded (no option, CPU backend, or weights absent), we fall through to the simple ds4_session_argmax + ds4_session_eval path with no behavior change.

  Validated on a DGX Spark GB10 with the optional MTP GGUF (DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs 'ds4: MTP support model loaded ... (draft=2)' on stderr.

  Caveat per upstream README: 'currently provides at most a slight speedup, not a meaningful generation-speed win'. Wired now mainly to track the upstream API; bigger speedups arrive when ds4 improves the speculative path.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override

  Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI gRPC side.
  The generation loop now consults compute_sample_params() per token to pick the effective (temperature, top_k, top_p, min_p), based on:

  1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp
  2. Thinking-mode override: when enable_thinking != false, force T=1.0, top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and the trailing content)
  3. DSML structural override: when DsmlParser::IsInDsmlStructural() returns true (we are between tool-call markers but NOT in a param value payload), force T=0.0 so protocol bytes parse cleanly

  When the effective temperature is 0, we keep using ds4_session_argmax + MTP speculative path (matches ds4-server's gate that only enables MTP for greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with a per-thread RNG seeded from system_clock and fall back to single-token ds4_session_eval.

  New public method on DsmlParser: IsInDsmlStructural() encodes which states need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user sampling); TEXT and THINK are excluded (no tool-call context to protect).

  Verified on the DGX Spark GB10: the e2e suite still passes with all 5 specs including tools, and the Predict output now varies between runs (creative sampling active) while the tool-call args remain a clean '{"location":"Paris, France"}' because the parser-state check forces greedy on the structural bytes.

  UX note: thinking mode is ON by default (matching ds4-server). Users who want deterministic output should set Metadata.enable_thinking = false.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add sha256 to deepseek-v4-flash-q2 entry

  Per HF LFS metadata for antirez/deepseek-v4-gguf:

      size: 86720111200 bytes (~80.76 GiB)
      sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c

  LocalAI's downloader verifies sha256 when present, so users who install deepseek-v4-flash-q2 from the gallery get integrity-checked weights and the partial-download issue (an 81 GB file is easy to truncate) becomes recoverable instead of silently producing a broken backend.

  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
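The hex-escape fix described in the commit message can be illustrated with a short C++ sketch. The marker prefix used here (`<`, the bytes EF BD 9C, then `DSML`) is an assumption pieced together from the byte sequence quoted in the message, not a copy of the backend's marker table, and `kMarkerPrefix` / `marker_prefix_bytes_ok` are hypothetical names.

```cpp
#include <cstring>

// Buggy form, kept as a comment because the escape is out of range and some
// compilers reject it outright: "<\xEF\xBD\x9CDSML". A \x escape consumes
// every hex digit that follows, so "\x9CD" is one escape and the 'D' is lost.

// Fixed form: closing the literal right after \x9C ends the escape, and
// adjacent string literals are concatenated by the compiler.
static const char kMarkerPrefix[] = "<" "\xEF\xBD\x9C" "DSML";

// Returns true when the literal holds exactly 3C EF BD 9C 44 53 4D 4C.
bool marker_prefix_bytes_ok() {
    const unsigned char expected[] = {0x3C, 0xEF, 0xBD, 0x9C,
                                      0x44, 0x53, 0x4D, 0x4C};
    return sizeof(kMarkerPrefix) - 1 == sizeof(expected) &&
           std::memcmp(kMarkerPrefix, expected, sizeof(expected)) == 0;
}
```

Comparing against an explicit byte array, rather than against another string literal, keeps the check independent of how the compiler parses escapes.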
1 parent 5d0f732 commit d892e4a

27 files changed

Lines changed: 2361 additions & 9 deletions

.agents/ds4-backend.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# Working on the ds4 Backend

`antirez/ds4` is a single-model inference engine for DeepSeek V4 Flash.
LocalAI wraps the engine's C API (`ds4/ds4.h`) with a fresh C++ gRPC server at
`backend/cpp/ds4/` - NOT a fork of llama-cpp's grpc-server.cpp.

## Pin

`backend/cpp/ds4/prepare.sh` clones `antirez/ds4` at `DS4_VERSION`. Bump that
commit to follow upstream.

## Wire shape

| RPC | Implementation |
|---|---|
| Health, Free, Status | Trivial; no engine dependency for Health |
| LoadModel | `ds4_engine_open` + `ds4_session_create`; backend is compile-time (DS4_NO_GPU → CPU, __APPLE__ → Metal, otherwise CUDA) |
| TokenizeString | `ds4_tokenize_text` |
| Predict | `ds4_engine_generate_argmax` + `DsmlParser` → one ChatDelta with content / reasoning_content / tool_calls[] |
| PredictStream | Same, per-token ChatDelta writes |

## DSML

ds4 emits tool calls as literal text markers (`<|DSML|tool_calls>` etc.) -
NOT special tokens. `dsml_parser.{h,cpp}` is our streaming state machine that
classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END
events. `dsml_renderer.{h,cpp}` does the prompt direction: turns
OpenAI tool_calls + role=tool messages back into DSML for the next turn.
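
The buffering idea behind such a streaming classifier can be sketched in a few lines. This is a deliberately reduced illustration, not the `DsmlParser` implementation: it only recognises `<think>` and `</think>`, and the names `ThinkSplitter`, `Kind`, `Event`, and `SplitAll` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class Kind { Content, Reasoning };
using Event = std::pair<Kind, std::string>;

// Reduced stand-in for a streaming marker parser (assumption: only the two
// think markers exist; the tool-call markers are left out).
class ThinkSplitter {
public:
    // Feed one chunk; returns the events that are decidable so far.
    std::vector<Event> Feed(const std::string& chunk) {
        buf_ += chunk;
        std::vector<Event> out;
        for (;;) {
            const std::string marker = in_think_ ? "</think>" : "<think>";
            const std::size_t lt = buf_.find('<');
            if (lt == std::string::npos) {  // no possible marker start
                Emit(out, buf_);
                buf_.clear();
                break;
            }
            Emit(out, buf_.substr(0, lt));  // text before '<' is safe to emit
            buf_.erase(0, lt);
            if (buf_.compare(0, marker.size(), marker) == 0) {  // full marker
                buf_.erase(0, marker.size());
                in_think_ = !in_think_;
                continue;
            }
            if (buf_.size() < marker.size() &&
                marker.compare(0, buf_.size(), buf_) == 0) {
                break;  // may still become a marker: wait for more input
            }
            Emit(out, "<");  // definitively not a marker
            buf_.erase(0, 1);
        }
        return out;
    }

    // End of stream: buffered bytes are emitted under the current state.
    std::vector<Event> Flush() {
        std::vector<Event> out;
        Emit(out, buf_);
        buf_.clear();
        return out;
    }

private:
    void Emit(std::vector<Event>& out, const std::string& text) {
        if (text.empty()) return;
        const Kind kind = in_think_ ? Kind::Reasoning : Kind::Content;
        if (!out.empty() && out.back().first == kind) out.back().second += text;
        else out.emplace_back(kind, text);
    }

    std::string buf_;
    bool in_think_ = false;
};

// Runs all chunks through the splitter and returns {content, reasoning}.
std::pair<std::string, std::string> SplitAll(const std::vector<std::string>& chunks) {
    ThinkSplitter splitter;
    std::pair<std::string, std::string> result;
    auto collect = [&result](const std::vector<Event>& events) {
        for (const Event& e : events) {
            (e.first == Kind::Content ? result.first : result.second) += e.second;
        }
    };
    for (const std::string& chunk : chunks) collect(splitter.Feed(chunk));
    collect(splitter.Flush());
    return result;
}
```

Calling `SplitAll` with the chunks `"Hel"`, `"lo <thi"`, `"nk>why</th"`, `"ink>done"` exercises markers split across chunk boundaries; a lone `<` that cannot start a marker passes through as content.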

## Thinking modes

`PredictOptions.Metadata["enable_thinking"]` gates thinking on/off (default ON).
`["reasoning_effort"] == "max" | "xhigh"` selects `DS4_THINK_MAX`; anything else
maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_prefix`.
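
The mapping described above could look roughly like the following sketch. `ThinkMode` is a local stand-in for the engine's `DS4_THINK_*` constants, `pick_think_mode` is a hypothetical helper, and the assumption that the metadata value is the literal string `"false"` is ours; the backend may accept other spellings.

```cpp
#include <map>
#include <string>

// Stand-ins for DS4_THINK_NONE / DS4_THINK_HIGH / DS4_THINK_MAX.
enum class ThinkMode { None, High, Max };

// Thinking is on unless explicitly disabled; "max" and "xhigh" select Max,
// every other effort value (or none at all) selects High.
ThinkMode pick_think_mode(const std::map<std::string, std::string>& metadata) {
    const auto enabled = metadata.find("enable_thinking");
    if (enabled != metadata.end() && enabled->second == "false") {
        return ThinkMode::None;  // assumption: "false" is the disabling value
    }
    const auto effort = metadata.find("reasoning_effort");
    if (effort != metadata.end() &&
        (effort->second == "max" || effort->second == "xhigh")) {
        return ThinkMode::Max;
    }
    return ThinkMode::High;
}
```

Checking `enable_thinking` first means a disabled request ignores any `reasoning_effort` it also carries.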

## Disk KV cache

`kv_cache.{h,cpp}` implements an SHA1-keyed file cache using ds4's public
`ds4_session_save_payload` / `ds4_session_load_payload` API. Enable per model
at load time via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is
**our own** - NOT bit-compatible with ds4-server's KVC files (interop is a
follow-up plan).
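
The commit message describes the lookup as picking the longest stored prefix that prefixes the incoming prompt. A minimal in-memory sketch of that selection follows; it leaves out the on-disk format and the SHA1 keying, and `longest_prefix_index` is a hypothetical name rather than the cache module's API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the longest entry in `stored` that is a prefix of
// `prompt`, or -1 when no entry qualifies. Empty entries are ignored.
int longest_prefix_index(const std::vector<std::string>& stored,
                         const std::string& prompt) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < stored.size(); ++i) {
        const std::string& candidate = stored[i];
        if (candidate.empty() || candidate.size() > prompt.size()) continue;
        if (prompt.compare(0, candidate.size(), candidate) != 0) continue;
        if (candidate.size() > best_len) {
            best = static_cast<int>(i);
            best_len = candidate.size();
        }
    }
    return best;
}
```

A longer match lets the session resume from a later checkpoint, so less of the prompt has to be re-evaluated.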

## Build matrix

| Build | Where | Notes |
|---|---|---|
| `cpu-ds4` (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
| `cuda13-ds4` (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
| `ds4-darwin` (arm64) | macOS GHA runners | Metal; uses `scripts/build/ds4-darwin.sh` like llama-cpp-darwin |

cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.

## Hardware-gated validation

`tests/e2e-backends/backend_test.go` in `BACKEND_BINARY` mode:

```
BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
```

CI does not load the model; the suite is opt-in via env vars.

## Importer

`core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
matching the `antirez/deepseek-v4-gguf` repo URI or the
`DeepSeek-V4-Flash-*.gguf` filename pattern. **Registered BEFORE
`LlamaCPPImporter`** in `defaultImporters` - both match `.gguf` but ds4 is more
specific, and first-match-wins. The importer emits `backend: ds4`, uses
`ds4flash.gguf` as the local filename (matches ds4's own CLI default), and
disables the Go-side automatic tool-parsing fallback (the C++ backend emits
ChatDelta.tool_calls natively via `DsmlParser`).

ds4 is also listed in `core/http/endpoints/localai/backend.go`'s pref-only
slice so the `/import-model` UI surfaces it as a manual choice for users who
want to force the backend on a non-canonical URI.
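
Why registration order matters under first-match-wins can be shown with a small sketch. The real importers are Go code; this C++ version only mirrors the dispatch idea, and `Importer`, `is_ds4_uri`, and `pick_backend` are hypothetical names whose matching rules are simplified from the description above.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Importer {
    std::string backend;
    std::function<bool(const std::string&)> matches;
};

static bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Matches the repo name anywhere in the URI, or a file named
// DeepSeek-V4-Flash-*.gguf as the last path component.
static bool is_ds4_uri(const std::string& uri) {
    if (uri.find("antirez/deepseek-v4-gguf") != std::string::npos) return true;
    const std::size_t slash = uri.find_last_of('/');
    const std::string name =
        slash == std::string::npos ? uri : uri.substr(slash + 1);
    return name.rfind("DeepSeek-V4-Flash-", 0) == 0 && ends_with(name, ".gguf");
}

// First match wins, so the more specific ds4 rule is listed before the
// generic .gguf rule; swapping them would route ds4 weights to llama-cpp.
std::string pick_backend(const std::string& uri) {
    const std::vector<Importer> importers = {
        {"ds4", is_ds4_uri},
        {"llama-cpp", [](const std::string& u) { return ends_with(u, ".gguf"); }},
    };
    for (const Importer& importer : importers) {
        if (importer.matches(uri)) return importer.backend;
    }
    return "";
}
```

With this ordering a generic `.gguf` file still reaches the llama-cpp rule, while anything matching the ds4 pattern is claimed first.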

.github/backend-matrix.yml

Lines changed: 54 additions & 0 deletions
@@ -948,6 +948,32 @@ include:
     backend: "turboquant"
     dockerfile: "./backend/Dockerfile.turboquant"
     context: "./"
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-ds4'
+    runs-on: 'ubuntu-latest'
+    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
+    skip-drivers: 'true'
+    backend: "ds4"
+    dockerfile: "./backend/Dockerfile.ds4"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'true'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-ds4'
+    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
+    runs-on: 'ubuntu-24.04-arm'
+    ubuntu-version: '2404'
+    backend: "ds4"
+    dockerfile: "./backend/Dockerfile.ds4"
+    context: "./"
   - build-type: 'cublas'
     cuda-major-version: "13"
     cuda-minor-version: "0"
@@ -2321,6 +2347,34 @@ include:
     dockerfile: "./backend/Dockerfile.turboquant"
     context: "./"
     ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-ds4'
+    runs-on: 'ubuntu-latest'
+    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
+    skip-drivers: 'true'
+    backend: "ds4"
+    dockerfile: "./backend/Dockerfile.ds4"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-ds4'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
+    skip-drivers: 'true'
+    backend: "ds4"
+    dockerfile: "./backend/Dockerfile.ds4"
+    context: "./"
+    ubuntu-version: '2404'
   - build-type: ''
     cuda-major-version: ""
     cuda-minor-version: ""

.github/workflows/backend_build_darwin.yml

Lines changed: 6 additions & 1 deletion
@@ -211,8 +211,13 @@ jobs:
           make protogen-go
           make backends/llama-cpp-darwin
 
+      - name: Build ds4 backend (Darwin Metal)
+        if: inputs.backend == 'ds4'
+        run: |
+          make backends/ds4-darwin
+
       - name: Build ${{ inputs.backend }}-darwin
-        if: inputs.backend != 'llama-cpp'
+        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
         run: |
           make protogen-go
           BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend

AGENTS.md

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
+| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
 | [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
 | [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |

Makefile

Lines changed: 11 additions & 2 deletions
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin
 
 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -1009,6 +1009,10 @@ backends/llama-cpp-darwin: build
 	bash ./scripts/build/llama-cpp-darwin.sh
 	./local-ai backends install "ocifile://$(abspath ./backend-images/llama-cpp.tar)"

+backends/ds4-darwin: build
+	bash ./scripts/build/ds4-darwin.sh
+	./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"
+
 build-darwin-python-backend: build
 	bash ./scripts/build/python-darwin.sh

@@ -1050,6 +1054,10 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
 # turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
 # Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
 BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
+# ds4 is antirez/ds4, a DeepSeek V4 Flash-specific inference engine.
+# Single-model; hardware-only validation lives at tests/e2e-backends/
+# (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
+BACKEND_DS4 = ds4|ds4|.|false|false

 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
@@ -1135,6 +1143,7 @@ endef
 $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
+$(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -1188,7 +1197,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx

 ########################################################
 ### Mock Backend for E2E Tests

backend/Dockerfile.ds4

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+ARG BASE_IMAGE=ubuntu:24.04
+ARG APT_MIRROR=""
+ARG APT_PORTS_MIRROR=""
+
+# BASE_IMAGE is either ubuntu:24.04 (for cpu builds) or nvidia/cuda:13.0.0-devel-ubuntu24.04
+# (for cublas builds). Both ship apt + Ubuntu Noble packages; the nvidia/cuda base
+# additionally provides /usr/local/cuda. Darwin (Metal) builds bypass this Dockerfile
+# entirely via scripts/build/ds4-darwin.sh.
+FROM ${BASE_IMAGE} AS builder
+ARG BUILD_TYPE
+ARG TARGETARCH
+ARG TARGETVARIANT
+
+ENV BUILD_TYPE=${BUILD_TYPE} \
+    DEBIAN_FRONTEND=noninteractive \
+    PATH=/usr/local/cuda/bin:${PATH}
+
+WORKDIR /build
+
+# Install build-time deps via plain apt - install-base-deps.sh's full pipeline
+# (CUDA keyring + from-source gRPC) is unnecessary here:
+# - CUDA: when BASE_IMAGE=nvidia/cuda:*, /usr/local/cuda is already populated;
+#   for the cpu build we don't need CUDA at all.
+# - gRPC/Protobuf: system apt packages are sufficient; ds4's wrapper only links
+#   against them, it doesn't ship the gRPC source tree.
+# - nlohmann-json: dsml_renderer's only third-party dep.
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        git cmake build-essential pkg-config ca-certificates \
+        libgrpc++-dev libprotobuf-dev protobuf-compiler protobuf-compiler-grpc \
+        nlohmann-json3-dev && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+COPY . /LocalAI
+
+RUN --mount=type=cache,target=/root/.ccache,id=ds4-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
+    make -C /LocalAI/backend/cpp/ds4 BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
+
+FROM scratch
+COPY --from=builder /LocalAI/backend/cpp/ds4/package/. ./
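
As a rough illustration of how the build arguments in this Dockerfile fit together, the sketch below picks `BASE_IMAGE` from `BUILD_TYPE` as the header comment describes and prints a matching `docker build` command. The helper name `ds4_base_image` and the image tag `local-ai-backend:ds4` are assumptions made for the example (the tag is inferred from the `docker-save-%` rule); the Makefile's generated `docker-build-ds4` target is the actual entry point and may pass further arguments.

```shell
#!/bin/sh
# Hypothetical helper (not part of this commit): choose BASE_IMAGE from
# BUILD_TYPE, following the comment at the top of backend/Dockerfile.ds4.
ds4_base_image() {
    if [ "$1" = "cublas" ]; then
        echo "nvidia/cuda:13.0.0-devel-ubuntu24.04"
    else
        echo "ubuntu:24.04"
    fi
}

build_type="${BUILD_TYPE:-}"

# Printed rather than executed: the build needs BuildKit (for RUN --mount)
# and the repository root as context, because the Dockerfile runs
# "COPY . /LocalAI". The image tag is an assumption for illustration.
echo "docker build -f backend/Dockerfile.ds4" \
     "--build-arg BASE_IMAGE=$(ds4_base_image "$build_type")" \
     "--build-arg BUILD_TYPE=${build_type}" \
     "-t local-ai-backend:ds4 ."
```

Running it with `BUILD_TYPE=cublas` prints a command that uses the nvidia/cuda base; any other value falls back to the plain Ubuntu base for the CPU build.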

backend/cpp/ds4/.gitignore

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+ds4/
+build/
+package/
+grpc-server
+*.o
+backend.pb.cc
+backend.pb.h
+backend.grpc.pb.cc
+backend.grpc.pb.h

backend/cpp/ds4/CMakeLists.txt

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
+cmake_minimum_required(VERSION 3.15)
+project(ds4-grpc-server LANGUAGES CXX C)
+
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(TARGET grpc-server)
+
+option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
+set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
+set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")
+
+find_package(Threads REQUIRED)
+find_package(Protobuf CONFIG QUIET)
+if(NOT Protobuf_FOUND)
+    find_package(Protobuf REQUIRED)
+endif()
+find_package(gRPC CONFIG QUIET)
+if(NOT gRPC_FOUND)
+    # Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
+    find_library(GRPCPP_LIB grpc++ REQUIRED)
+    find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
+    add_library(gRPC::grpc++ INTERFACE IMPORTED)
+    set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
+    add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
+    set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
+endif()
+
+find_program(_PROTOC NAMES protoc REQUIRED)
+find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
+
+get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
+get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
+
+set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
+set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
+set(HW_GRPC_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
+set(HW_GRPC_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
+
+add_custom_command(
+    OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
+    COMMAND ${_PROTOC}
+    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
+         --cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
+         -I "${HW_PROTO_PATH}"
+         --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
+         "${HW_PROTO}"
+    DEPENDS "${HW_PROTO}")
+
+add_library(hw_grpc_proto STATIC
+    ${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
+    ${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
+target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
+
+set(DS4_OBJS "${DS4_DIR}/ds4.o")
+if(DS4_GPU STREQUAL "cuda")
+    list(APPEND DS4_OBJS "${DS4_DIR}/ds4_cuda.o")
+elseif(DS4_GPU STREQUAL "metal")
+    list(APPEND DS4_OBJS "${DS4_DIR}/ds4_metal.o")
+elseif(DS4_GPU STREQUAL "cpu")
+    set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
+endif()
+
+add_executable(${TARGET}
+    grpc-server.cpp
+    dsml_parser.cpp
+    dsml_renderer.cpp
+    kv_cache.cpp)
+
+target_include_directories(${TARGET} PRIVATE ${DS4_DIR})
+
+foreach(obj ${DS4_OBJS})
+    target_sources(${TARGET} PRIVATE ${obj})
+    set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
+endforeach()
+
+target_link_libraries(${TARGET} PRIVATE
+    hw_grpc_proto
+    gRPC::grpc++
+    gRPC::grpc++_reflection
+    protobuf::libprotobuf
+    Threads::Threads
+    m)
+
+if(DS4_GPU STREQUAL "cuda")
+    find_package(CUDAToolkit REQUIRED)
+    target_link_libraries(${TARGET} PRIVATE CUDA::cudart CUDA::cublas)
+elseif(DS4_GPU STREQUAL "metal")
+    find_library(FOUNDATION_LIB Foundation REQUIRED)
+    find_library(METAL_LIB Metal REQUIRED)
+    target_link_libraries(${TARGET} PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
+elseif(DS4_GPU STREQUAL "cpu")
+    target_compile_definitions(${TARGET} PRIVATE DS4_NO_GPU)
+endif()
+
+if(DS4_NATIVE)
+    if(APPLE)
+        target_compile_options(${TARGET} PRIVATE -mcpu=native)
+    else()
+        target_compile_options(${TARGET} PRIVATE -march=native)
+    endif()
+endif()
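
For orientation, here is a minimal sketch of how the cache variables above (`DS4_GPU`, `DS4_NATIVE`) might be set when configuring the wrapper by hand. The mapping from `BUILD_TYPE` to `DS4_GPU` follows the commit message (CUDA on Linux when `BUILD_TYPE=cublas`, Metal on Darwin, CPU otherwise). The helper name `ds4_gpu_for` and the `build` directory are illustrative assumptions, not part of this commit; the backend Makefile remains the real driver and may differ in detail.

```shell
#!/bin/sh
# Hypothetical helper (not part of this commit): pick a DS4_GPU value from
# BUILD_TYPE and the OS name, mirroring the rule stated in the commit message.
ds4_gpu_for() {
    build_type="$1"
    os_name="$2"
    if [ "$os_name" = "Darwin" ]; then
        echo metal
    elif [ "$build_type" = "cublas" ]; then
        echo cuda
    else
        echo cpu
    fi
}

gpu="$(ds4_gpu_for "${BUILD_TYPE:-}" "$(uname -s)")"

# Printed rather than executed: the pre-built ds4 engine objects must already
# exist under DS4_DIR before CMake can link them into grpc-server.
echo "cmake -S backend/cpp/ds4 -B build -DDS4_GPU=${gpu} -DDS4_NATIVE=OFF"
```

Passing `-DDS4_NATIVE=OFF` corresponds to the `NATIVE=false` setting used in the Docker build, which avoids `-march=native` in images meant to run on other machines.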
