Kotlin/Native inference produces deterministic nonsense across model families and runtime paths (JVM unaffected) #104

@michalharakal

Description

Summary

On Kotlin/Native (verified on macosArm64 against SKaiNET-transformers develop @ commit 4cd1da9, SKaiNET 0.23.0), generating tokens via the kllama runtime, and likewise via OptimizedLLMRuntime + QwenNetworkLoader, produces deterministic but meaningless token sequences. The same model files generate sensibly on the JVM target. The issue persists across:

  • Both runtime paths: deprecated LlamaRuntime + LlamaIngestion, and the new OptimizedLLMRuntime + QwenNetworkLoader.fromGguf(...).load(...).
  • Both load paths: sequential (sourceProvider, slurps to ByteArray) and streaming (randomAccessProvider, exercises the new POSIX-pread RandomAccessSource shipped in 0.23.0 / #591).
  • Multiple model families: Llama-3.2-1B-Instruct (Q8), TinyLlama-1.1B-Chat-v1.0 (Q8, Llama-2 arch), Qwen2.5-0.5B-Instruct (Q8), Qwen3-1.7B-Q8_0 (Q8).
  • Multiple compute backends: the SKaiNET CPU backend, and an external macOS-native execution backend wired in via BackendRegistry.register(...). On Qwen2.5-0.5B, the two backends emit bit-for-bit identical output — same garbage, same tokens, same order, every run. That's the cleanest signal that the problem is upstream of the compute layer: two different TensorOps implementations agree perfectly, which only happens if the bug is somewhere they share (weight load, dequantization, the runtime's forward orchestration, or some K/N stdlib / cinterop quirk).

Reproduction

The shortest repro uses SKaiNET-transformers' own native CLI binary (no third-party code involved):

./gradlew :llm-runtime:kllama:linkReleaseExecutableMacosArm64
./llm-runtime/kllama/build/bin/macosArm64/releaseExecutable/kllama.kexe \
    /path/to/Llama-3.2-1B-Instruct-Q8_0.gguf \
    "The capital of France is" 32 0.0

Output, deterministic on every restart:

Generating 32 tokens with temperature=0.0...
---
 Bodies Bodies Bodies Bodies Bodies Bodies … reigning reigning reigning

Same kllama binary on tinyllama-1.1b-chat-v1.0.Q8_0.gguf:

witzwitzwitzwitzwitzwitzwitzwitzwitzwitzwitzwitz…

OptimizedLLMRuntime + QwenNetworkLoader on Qwen2.5-0.5B-Instruct-Q8_0.gguf (in a small consumer CLI we built to verify the streaming path works):

bynDRAM ASAP */čĊčĊčĊ appréci(animation mur Offline bénéfic coppia wrzeÅĽnia…

All outputs are deterministic at temperature 0.0; the tokens are within the vocabulary but semantically nonsense.
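
Determinism itself is expected: at temperature 0.0 decoding reduces to an argmax over the logits at each step, so the signal is in the content being wrong, not in its repeatability. A minimal sketch of the effective sampler (plain Kotlin, not SKaiNET's code):

// At temperature 0.0, sampling degenerates to greedy argmax over the logits;
// identical logits therefore always yield identical token sequences.
fun greedyToken(logits: FloatArray): Int {
    var best = 0
    for (i in 1 until logits.size) {
        if (logits[i] > logits[best]) best = i
    }
    return best
}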

Cleanest single repro for triage

Two consumer CLI invocations, same prompt, same model, two different BackendProviders:

$ ./llama3-native-cli -m Qwen2.5-0.5B-Instruct-Q8_0.gguf --backend cpu              --steps 32 --temperature 0.0 "What is 17 * 23?"
Backend: CPU (SIMD)  (priority=0)
Tokenizer: BPE (model=gpt2)
Assistant: bynDRAM ASAP */čĊčĊčĊ appréci(animation mur Offline bénéfic coppia wrzeÅĽnia{i]';Ċodonå¡ijåįķä½įæĪĬ åŃĹéducationLUaniobihrÃł,GLæĭ¼éٳournemouth_endian)$_.AdapterView.nlmMurlu

$ ./llama3-native-cli -m Qwen2.5-0.5B-Instruct-Q8_0.gguf --backend platform-native  --steps 32 --temperature 0.0 "What is 17 * 23?"
Backend: Platform-native (macOS)  (priority=100)
Tokenizer: BPE (model=gpt2)
Assistant: bynDRAM ASAP */čĊčĊčĊ appréci(animation mur Offline bénéfic coppia wrzeÅĽnia{i]';Ċodonå¡ijåįķä½įæĪĬ åŃĹéducationLUaniobihrÃł,GLæĭ¼éٳournemouth_endian)$_.AdapterView.nlmMurlu

Bit-for-bit identical output across two completely different TensorOps implementations. Same prompt on the JVM target with the same loader stack produces sensible output.

What's been ruled out

  • EOS resolution: tokenizer.eosTokenId == 128009 (<|eot_id|>) for Llama-3.2, which is correct. Confirmed via a diagnostic print before runtime.generate(...); see the sketch after this list.
  • Backend correctness on small models: Qwen2.5-0.5B → bit-identical output on two different backends (proof above). Rules out compute layer for small / single-block-size cases.
  • Tool calling / agent loop: gibberish reproduces with the raw kllama.kexe baseline that does no tool calling at all — runtime.generate(...) straight after LlamaIngestion.load(...).
  • GGUF reader path: same gibberish via the legacy Source-based reader and via the new POSIX-pread StreamingGGUFReader path. Not a slurp-vs-stream issue.
  • JVM: the same loaders and runtimes, exercised via the existing JUnit test suite (gated on TINYLLAMA_MODEL_PATH), produce sensible output. The bug is K/N-specific.
  • 0.23.0 fixes: the K/N pread fix (#591) and the lazy zero-init API (#588) did not change the symptom.
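
The EOS check in the first bullet above was just a print-and-assert before generation. A minimal sketch, assuming only that tokenizer.eosTokenId is an Int (the helper name and message are mine, not SKaiNET API):

// Hypothetical diagnostic helper mirroring the EOS check described above.
// Call as assertEosResolved(tokenizer.eosTokenId) immediately before
// runtime.generate(...); 128009 is <|eot_id|> in the Llama-3 vocabulary.
fun assertEosResolved(eosTokenId: Int, expected: Int = 128009) {
    println("eosTokenId=$eosTokenId (expected $expected)")
    check(eosTokenId == expected) { "EOS mis-resolved: $eosTokenId" }
}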

Most likely places to look

Ordered by suspicion (the TensorOps-agnostic invariant pins this somewhere shared):

  1. GGUF Q8 dequantization on K/N. QuantPolicy.DEQUANTIZE_TO_FP32 is the only K/N-viable policy (MemSegWeightConverter is jvmMain-only). If the K/N Q8 dequant has a bug, every downstream tensor is silently corrupted; both backends then consume the same corrupt tensors and produce the same nonsense, which is exactly what we see. A quick verification path: snapshot a known weight tensor (e.g. token_embd.weight[0,:]) on JVM and on K/N and compare element-wise; the first divergence localises the bug. A reference dequantizer for this test is sketched after this list.
  2. OptimizedLLMRuntime / LlamaRuntime forward orchestration on K/N. Both runtimes are independent code paths and both fail identically — but they share helpers (RoPE position table, attention mask construction, KV cache layout). A K/N-specific bug in one of those helpers would surface in both.
  3. K/N stdlib / cinterop edge case. Less likely but possible: for example, a FloatArray of unusual size silently zeroing its tail, or a cinterop boundary passing wrong sizes into Accelerate calls.

(1) is by far the most likely; I'd suggest tackling it first because it is the easiest to falsify with a one-page test, sketched below.
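
To make (1) falsifiable without touching library internals, here is a dependency-free Q8_0 reference dequantizer in common Kotlin. The block layout (a little-endian fp16 scale followed by 32 int8 quants, 34 bytes per 32-element block) is the standard GGUF/ggml Q8_0 format; the code is an independent sketch, not SKaiNET's implementation. Compile it into both the JVM and macosArm64 targets, feed it the raw bytes of one tensor (e.g. token_embd.weight), and diff its output against what QuantPolicy.DEQUANTIZE_TO_FP32 produced on each target.

// Reference Q8_0 dequantizer (common Kotlin, no dependencies).
// GGUF Q8_0 block: 2-byte little-endian fp16 scale d, then 32 int8 quants q;
// element value = d * q.

private const val QK8_0 = 32

// Convert an IEEE 754 binary16 bit pattern to Float.
fun f16ToF32(h: Int): Float {
    val sign = (h and 0x8000) shl 16
    val exp = (h ushr 10) and 0x1F
    val mant = h and 0x3FF
    val bits = when {
        exp == 0x1F -> sign or 0x7F800000 or (mant shl 13)         // Inf / NaN
        exp != 0 -> sign or ((exp + 112) shl 23) or (mant shl 13)  // normal (127 - 15 = 112)
        mant == 0 -> sign                                          // signed zero
        else -> {                                                  // subnormal: renormalise
            var m = mant
            var e = -14
            while (m and 0x400 == 0) { m = m shl 1; e-- }
            sign or ((e + 127) shl 23) or ((m and 0x3FF) shl 13)
        }
    }
    return Float.fromBits(bits)
}

fun dequantQ8_0(raw: ByteArray, elementCount: Int): FloatArray {
    require(elementCount % QK8_0 == 0) { "Q8_0 tensors hold whole 32-element blocks" }
    val out = FloatArray(elementCount)
    var src = 0
    var dst = 0
    while (dst < elementCount) {
        val lo = raw[src].toInt() and 0xFF
        val hi = raw[src + 1].toInt() and 0xFF
        val d = f16ToF32((hi shl 8) or lo)           // little-endian fp16 scale
        src += 2
        for (i in 0 until QK8_0) {
            out[dst + i] = d * raw[src + i].toInt()  // Byte sign-extends to the int8 quant
        }
        src += QK8_0
        dst += QK8_0
    }
    return out
}

If this reference agrees with the JVM dequant output but not with the K/N one, suspicion (1) is confirmed; if all three agree element-wise, move on to (2).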

Related bug (being filed separately, not part of this issue)

On larger models (Qwen3-1.7B-Q8_0, ~1.8 GiB) the CPU backend and the platform-native backend diverge and produce different nonsense from each other. That divergence is a real backend-side disagreement and points at one or more compute ops that the larger, GQA-using architecture exercises and the small Qwen2.5-0.5B does not. It is a backend issue, separate from the K/N inference bug above; it is mentioned only because the contrast supports the shared-path diagnosis: genuine backend disagreement produces divergent output, whereas this issue's failure is bit-identical across backends.

Why this matters

K/N is the only target for consumers shipping single-binary CLIs or non-JVM mobile/desktop apps on top of SKaiNET-transformers. The JVM-only happy path is fine for backend services but rules out a large slice of the intended consumers. The K/N native CLI in this repo (kllama.kexe) currently doesn't generate sensible output for any tested model; that is a strong "K/N is broken end-to-end" signal worth attending to even before any external consumer is affected.

Triage helpers

  • Same machine, same JDK toolchain (21), same model files. Only the Kotlin target differs.
  • Tested against release/0.23.0 of SKaiNET (commit e0b228c2 plus the docs cleanup) and develop of SKaiNET-transformers @ 4cd1da9.
  • Happy to produce a JVM-vs-K/N logits diff (forward pass on a fixed input, position 0, first 16 logits) on request — that should localise where the K/N path diverges.
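
The diff itself only needs a trivial comparator once both targets dump their first-position logits; something along these lines in common Kotlin (the function is mine, not SKaiNET API):

// Report the first index where two logit vectors diverge beyond a tolerance.
// Run the same fixed forward pass on JVM and on K/N, dump the logits, and
// feed both arrays here; -1 means they agree within eps.
fun firstDivergence(jvm: FloatArray, native: FloatArray, eps: Float = 1e-4f): Int {
    require(jvm.size == native.size) { "logit vectors differ in length" }
    for (i in jvm.indices) {
        if (kotlin.math.abs(jvm[i] - native[i]) > eps) {
            println("first divergence at index $i: jvm=${jvm[i]} native=${native[i]}")
            return i
        }
    }
    return -1
}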
