Summary
On Kotlin/Native (verified on macosArm64, SKaiNET-transformers @ 0.21.1 against SKaiNET 0.22.2), generating tokens via the kllama runtime produces deterministic but meaningless token loops. Reproducible across:
- Both runtime paths: the deprecated LlamaRuntime + LlamaIngestion, and the new OptimizedLLMRuntime + QwenNetworkLoader.fromGguf(...).load(...).
- Multiple model families: Llama-3.2-1B-Instruct (Q8), TinyLlama-1.1B-Chat-v1.0 (Q8, Llama-2 arch), Qwen2.5-0.5B-Instruct (Q8).
- Multiple compute backends: the SKaiNET CPU backend, and an external platform-native macOS backend used through BackendRegistry.register(...). Both backends emit bit-for-bit identical output, which rules out backend-specific bugs and points to a shared step earlier in the pipeline (weight conversion, dequantization, the runtime forward pass, or the CPU ops underneath both).
The same model files load and generate sensibly on JVM (e.g. via skainet-cli with MemSegWeightConverter + NATIVE_OPTIMIZED). So this is a Kotlin/Native-specific regression somewhere along the DEQUANTIZE_TO_FP32 → OptimizedLLMRuntime/LlamaRuntime → forward path.
Repro
The shortest repro uses SKaiNET-transformers' own native CLI binary (no third-party code involved):
./gradlew :llm-runtime:kllama:linkReleaseExecutableMacosArm64
./llm-runtime/kllama/build/bin/macosArm64/releaseExecutable/kllama.kexe \
/path/to/Llama-3.2-1B-Instruct-Q8_0.gguf \
"The capital of France is" 32 0.0
Output (both with and without --backend=cpu; same on every restart):
Generating 32 tokens with temperature=0.0...
---
Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies Bodies reigning reigning reigning ...
Same kllama binary on tinyllama-1.1b-chat-v1.0.Q8_0.gguf:
witzwitzwitzwitzwitzwitzwitzwitzwitzwitzwitzwitz...
The new path via OptimizedLLMRuntime + QwenNetworkLoader on Qwen2.5-0.5B-Instruct-Q8_0.gguf:
bynDRAM ASAP */čĊčĊčĊ appréci(animation mur Offline bénéfic coppia wrzeÅĽnia{i]'; ...
All three: deterministic at temperature 0.0; tokens are valid (within vocab) but semantically meaningless.
What's been ruled out
- EOS resolution: tokenizer.eosTokenId == 128009 (<|eot_id|>) for Llama-3.2 — correct. Confirmed via diagnostic print before generation.
- Backend correctness: same prompt produces bit-for-bit identical output on the SKaiNET CPU backend and on a different platform-native execution backend wired through BackendRegistry. If a backend op were wrong, the two would diverge. They don't.
- Tool calling / agent loop: gibberish reproduces with the raw kllama.kexe baseline binary that does no tool calling at all — just runtime.generate(...) after LlamaIngestion.load(...) or QwenNetworkLoader.load(...) (see the sketch after this list).
- JVM: the same loaders + runtimes, exercised via JUnit tests (gated on TINYLLAMA_MODEL_PATH) on the JVM target, work. The issue is K/N-specific.
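For reference, a minimal sketch of that baseline path, using only the entry points named in this issue (LlamaIngestion.load, runtime.generate). Parameter names and exact signatures are illustrative assumptions, not the actual SKaiNET-transformers API:

// Minimal sketch of the no-tooling baseline. The call shapes below are
// assumptions mirroring the names in this issue, not confirmed signatures.
fun baselineRepro(modelPath: String) {
    val runtime = LlamaIngestion.load(modelPath)   // deprecated path; the new one goes through QwenNetworkLoader.fromGguf(modelPath).load()
    val out = runtime.generate(
        prompt = "The capital of France is",
        maxTokens = 32,
        temperature = 0.0,                         // greedy decoding -> deterministic output
    )
    println(out)                                   // sensible on JVM, token loops on K/N
}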
Most likely places to look
In rough order of suspicion:
- CPU TensorOps on K/N — the runtime uses DirectCpuExecutionContext(), which is Accelerate-backed ([SKaiNET] Using Accelerate-backed CPU operations (ARM NEON + AMX) is logged at startup). A miscompile or interop bug in any of matmul / RMSNorm / RoPE / softmax on K/N would produce coherent-looking but wrong logits, indistinguishable from what we see.
- GGUF dequantization on K/N — QuantPolicy.DEQUANTIZE_TO_FP32 is the only K/N-viable policy (MemSegWeightConverter is jvmMain). If the K/N Q8 dequant has a bug, everything downstream is poisoned. Worth verifying by snapshotting a few weight tensors after load and comparing JVM vs K/N; a reference dequantizer for that cross-check is sketched after this list.
- OptimizedLLMRuntime.forward() state on K/N — KV-cache positions, masking, or attention scratch buffers behaving differently on K/N. Less likely since the deprecated LlamaRuntime (totally separate code) shows the same symptom.
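To make the dequant check concrete, here is a self-contained reference Q8_0 dequantizer in pure Kotlin. The block layout (34 bytes: a little-endian fp16 scale followed by 32 signed int8 quants, value = scale × quant) comes from the public GGUF/ggml spec, not from SKaiNET. Feeding it the raw tensor bytes on both targets and comparing fingerprints would localize a dequant bug without touching the rest of the pipeline:

import kotlin.math.pow

// Decode an IEEE 754 binary16 value from its 16 raw bits (common-code safe,
// no dependency on the JVM-only Float.float16ToFloat).
fun f16ToF32(bits: Int): Float {
    val sign = if (bits and 0x8000 != 0) -1f else 1f
    val exp = (bits ushr 10) and 0x1F
    val mant = bits and 0x3FF
    return sign * when {
        exp == 0 -> mant * 5.9604645e-8f                        // subnormal: mant * 2^-24
        exp == 0x1F -> if (mant == 0) Float.POSITIVE_INFINITY else Float.NaN
        else -> (1f + mant / 1024f) * 2f.pow(exp - 15)
    }
}

// Reference Q8_0 dequantization per the GGUF/ggml spec: each 34-byte block is
// an fp16 scale d followed by 32 int8 quants q[i]; the i-th value is d * q[i].
fun dequantQ8_0(raw: ByteArray): FloatArray {
    require(raw.size % 34 == 0) { "Q8_0 tensor data must be whole 34-byte blocks" }
    val out = FloatArray(raw.size / 34 * 32)
    for (block in 0 until raw.size / 34) {
        val base = block * 34
        val d = f16ToF32((raw[base].toInt() and 0xFF) or ((raw[base + 1].toInt() and 0xFF) shl 8))
        for (i in 0 until 32) {
            out[block * 32 + i] = d * raw[base + 2 + i].toInt() // Byte -> Int sign-extends, as int8 requires
        }
    }
    return out
}

// Order-sensitive fingerprint: identical tensors give identical values on JVM
// and K/N, so diffing one Long per tensor is enough for the comparison.
fun fingerprint(xs: FloatArray): Long {
    var h = 1125899906842597L
    for (x in xs) h = 31L * h + x.toRawBits()
    return h
}

Equal fingerprints on both targets exonerate the dequant step; unequal ones convict it.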
Why this matters
K/N is the platform target for any consumer that wants to ship a single-binary CLI or a non-JVM mobile/desktop app on top of SKaiNET-transformers. The JVM-only golden path is fine for backend services but rules out a large slice of intended consumers. Right now, the K/N native CLI in this repo (kllama.kexe) doesn't actually generate sensible output — a strong signal that something needs attention even before any external consumer is affected.
Notes for triage
- Same machine, same JDK toolchain (21), same model files. Only the target (K/N vs JVM) differs.
- Tested on SKaiNET 0.22.2 — the fix(gguf): handle unsigned numeric metadata fields change in 0.22.2 did not affect this behaviour.
- I have logs and can produce a minimum-pair JVM-vs-K/N diff (e.g. logits at position 0 for a fixed prompt) on request; a sketch of such a harness follows.
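A hedged sketch of that minimum-pair harness. Only generate(...) is confirmed by this issue; forwardLogits and tokenizer.encode below are hypothetical stand-ins for whatever hook exposes raw logits and token ids in SKaiNET-transformers:

// Print the logits for the last position of a fixed prompt, one per line, so
// the two targets can be compared with a plain `diff`.
// NOTE: forwardLogits(...) and tokenizer.encode(...) are hypothetical names;
// substitute the real hooks when wiring this up.
fun dumpFirstLogits(runtime: LlamaRuntime, prompt: String, k: Int = 16) {
    val ids = runtime.tokenizer.encode(prompt)       // hypothetical accessor
    val logits: FloatArray = runtime.forwardLogits(ids)
    for (i in 0 until k) {
        // toRawBits() sidesteps float-formatting differences between targets.
        println("logit[$i] = ${logits[i]} (bits=${logits[i].toRawBits()})")
    }
}

Usage: run the K/N binary with output redirected to native.txt on macosArm64 and the JVM build redirected to jvm.txt, then diff the two files. The first index where the bits diverge marks how early in the pipeline the targets part ways.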