Summary
On Kotlin/Native (verified on macosArm64, SKaiNET-transformers @ 0.21.1 against SKaiNET 0.22.2), generating tokens via the kllama runtime produces deterministic but meaningless token loops. Reproducible across:
- Both runtime paths: the deprecated LlamaRuntime + LlamaIngestion, and the new OptimizedLLMRuntime + QwenNetworkLoader.fromGguf(...).load(...).
- Multiple model families: Llama-3.2-1B-Instruct (Q8), TinyLlama-1.1B-Chat-v1.0 (Q8, Llama-2 arch), Qwen2.5-0.5B-Instruct (Q8).
- Multiple compute backends: the SKaiNET CPU backend, and an external platform-native macOS backend used through BackendRegistry.register(...). Both backends emit bit-for-bit identical output, which rules out backend-specific bugs and points to a shared step earlier in the pipeline (weight conversion, dequantization, the runtime forward pass, or the CPU ops underneath both).
The same model files load and generate sensibly on JVM (e.g. via skainet-cli with MemSegWeightConverter + NATIVE_OPTIMIZED). So this is a Kotlin/Native-specific regression somewhere along the DEQUANTIZE_TO_FP32 → OptimizedLLMRuntime/LlamaRuntime → forward path.
Repro
The shortest repro uses SKaiNET-transformers' own native CLI binary (no third-party code involved):
./gradlew :llm-runtime:kllama:linkReleaseExecutableMacosArm64
./llm-runtime/kllama/build/bin/macosArm64/releaseExecutable/kllama.kexe \
/path/to/Llama-3.2-1B-Instruct-Q8_0.gguf \
"The capital of France is" 32 0.0
Output (both with and without --backend=cpu; same on every restart):
Generating 32 tokens with temperature=0.0...
---
Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies reigning reigning reigning ...
Same kllama binary on tinyllama-1.1b-chat-v1.0.Q8_0.gguf:
witzwitzwitzwitzwitzwitzwitzwitzwitzwitzwitzwitz...
The new path via OptimizedLLMRuntime + QwenNetworkLoader on Qwen2.5-0.5B-Instruct-Q8_0.gguf:
bynDRAM ASAP */čĊčĊčĊ appréci(animation mur Offline bénéfic coppia wrzeÅĽnia{i]'; ...
All three: deterministic at temperature 0.0; tokens are valid (within vocab) but semantically meaningless.
What's been ruled out
- EOS resolution: tokenizer.eosTokenId == 128009 (<|eot_id|>) for Llama-3.2 — correct. Confirmed via diagnostic print before generation.
- Backend correctness: same prompt produces bit-for-bit identical output on the SKaiNET CPU backend and on a different platform-native execution backend wired through BackendRegistry. If a backend op were wrong, the two would diverge. They don't.
- Tool calling / agent loop: gibberish reproduces with the raw kllama.kexe baseline binary that does no tool calling at all — just runtime.generate(...) after LlamaIngestion.load(...) or QwenNetworkLoader.load(...) (see the sketch after this list).
- JVM: the same loaders + runtimes, exercised via JUnit tests (gated on TINYLLAMA_MODEL_PATH) on the JVM target, work. The issue is K/N-specific.
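For reference, a minimal sketch of that baseline path, using only the entry points named in this issue (LlamaIngestion.load, runtime.generate). Parameter names and exact signatures are illustrative assumptions, not the actual SKaiNET-transformers API:

// Minimal sketch of the no-tooling baseline. The call shapes below are
// assumptions mirroring the names in this issue, not confirmed signatures.
fun baselineRepro(modelPath: String) {
    val runtime = LlamaIngestion.load(modelPath)   // deprecated path; the new one goes through QwenNetworkLoader.fromGguf(modelPath).load()
    val out = runtime.generate(
        prompt = "The capital of France is",
        maxTokens = 32,
        temperature = 0.0,                         // greedy decoding -> deterministic output
    )
    println(out)                                   // sensible on JVM, token loops on K/N
}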
Most likely places to look
In rough order of suspicion:
- CPU TensorOps on K/N — the runtime uses DirectCpuExecutionContext(), which is Accelerate-backed ([SKaiNET] Using Accelerate-backed CPU operations (ARM NEON + AMX) is logged at startup). A miscompile or interop bug in any of matmul / RMSNorm / RoPE / softmax on K/N would produce coherent-looking but wrong logits, indistinguishable from what we see.
- GGUF dequantization on K/N — QuantPolicy.DEQUANTIZE_TO_FP32 is the only K/N-viable policy (MemSegWeightConverter is jvmMain). If the K/N Q8 dequant has a bug, everything downstream is poisoned. Worth verifying by snapshotting a few weight tensors after load and comparing JVM vs K/N; a reference dequantizer for that cross-check is sketched after this list.
- OptimizedLLMRuntime.forward() state on K/N — KV-cache positions, masking, or attention scratch buffers behaving differently on K/N. Less likely since the deprecated LlamaRuntime (totally separate code) shows the same symptom.
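To make the dequant check concrete, here is a self-contained reference Q8_0 dequantizer in pure Kotlin. The block layout (34 bytes: a little-endian fp16 scale followed by 32 signed int8 quants, value = scale × quant) comes from the public GGUF/ggml spec, not from SKaiNET. Feeding it the raw tensor bytes on both targets and comparing fingerprints would localize a dequant bug without touching the rest of the pipeline:

import kotlin.math.pow

// Decode an IEEE 754 binary16 value from its 16 raw bits (common-code safe,
// no dependency on the JVM-only Float.float16ToFloat).
fun f16ToF32(bits: Int): Float {
    val sign = if (bits and 0x8000 != 0) -1f else 1f
    val exp = (bits ushr 10) and 0x1F
    val mant = bits and 0x3FF
    return sign * when {
        exp == 0 -> mant * 5.9604645e-8f                        // subnormal: mant * 2^-24
        exp == 0x1F -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> (1f + mant / 1024f) * 2f.pow(exp - 15)
    }
}

// Reference Q8_0 dequantization per the GGUF/ggml spec: each 34-byte block is
// an fp16 scale d followed by 32 int8 quants q[i]; the i-th value is d * q[i].
fun dequantQ8_0(raw: ByteArray): FloatArray {
    require(raw.size % 34 == 0) { "Q8_0 tensor data must be whole 34-byte blocks" }
    val out = FloatArray(raw.size / 34 * 32)
    for (block in 0 until raw.size / 34) {
        val base = block * 34
        val d = f16ToF32((raw[base].toInt() and 0xFF) or ((raw[base + 1].toInt() and 0xFF) shl 8))
        for (i in 0 until 32) {
            out[block * 32 + i] = d * raw[base + 2 + i].toInt() // Byte -> Int sign-extends, as int8 requires
        }
    }
    return out
}

// Order-sensitive fingerprint: identical tensors give identical values on JVM
// and K/N, so diffing one Long per tensor is enough for the comparison.
fun fingerprint(xs: FloatArray): Long {
    var h = 1125899906842597L
    for (x in xs) h = 31L * h + x.toRawBits()
    return h
}

Equal fingerprints on both targets exonerate the dequant step; unequal ones convict it.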
Why this matters
K/N is the platform target for any consumer that wants to ship a single-binary CLI or a non-JVM mobile/desktop app on top of SKaiNET-transformers. The JVM-only golden path is fine for backend services but rules out a large slice of intended consumers. Right now, the K/N native CLI in this repo (kllama.kexe) doesn't actually generate sensible output — a strong signal that something needs attention even before any external consumer is affected.
Notes for triage
- Same machine, same JDK toolchain (21), same model files. Only the target (K/N vs JVM) differs.
- Tested on SKaiNET 0.22.2 — the fix(gguf): handle unsigned numeric metadata fields change in 0.22.2 did not affect this behaviour.
- I have logs and can produce a minimum-pair JVM-vs-K/N diff (e.g. logits at position 0 for a fixed prompt) on request; a sketch of such a harness follows.
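A hedged sketch of that minimum-pair harness. Only generate(...) is confirmed by this issue; forwardLogits and tokenizer.encode below are hypothetical stand-ins for whatever hook exposes raw logits and token ids in SKaiNET-transformers:

// Print the logits for the last position of a fixed prompt, one per line, so
// the two targets can be compared with a plain `diff`.
// NOTE: forwardLogits(...) and tokenizer.encode(...) are hypothetical names;
// substitute the real hooks when wiring this up.
fun dumpFirstLogits(runtime: LlamaRuntime, prompt: String, k: Int = 16) {
    val ids = runtime.tokenizer.encode(prompt)       // hypothetical accessor
    val logits: FloatArray = runtime.forwardLogits(ids)
    for (i in 0 until k) {
        // toRawBits() sidesteps float-formatting differences between targets.
        println("logit[$i] = ${logits[i]} (bits=${logits[i].toRawBits()})")
    }
}

Usage: run the K/N binary with output redirected to native.txt on macosArm64 and the JVM build redirected to jvm.txt, then diff the two files. The first index where the bits diverge marks how early in the pipeline the targets part ways.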