feat(llama): NATIVE_OPTIMIZED packed weight path (mirror Gemma) by michalharakal · Pull Request #195 · SKaiNET-developers/SKaiNET-transformers

michalharakal · 2026-06-25T13:26:20Z

PR 1 of 3 — IREE/TinyLlama fix-stack (merge in order)

Base of a 3-PR stack landing the upstream fixes that skainet-tinyllama-iree depends on.

This PR (`ccbd87e`)

Adds the Llama NATIVE_OPTIMIZED packed-weight path, mirroring the existing Gemma implementation:

new LlamaPackedWeights.kt
new LlamaQuantLayout.kt
LlamaNetworkLoader.kt wiring

Purely additive (+227 lines), no behavior change to existing paths.
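
As a rough illustration of the routing this path introduces (per the commit message for this PR: the embedding table becomes FP32, K-quant matrices become packed block tensors, and other types fall back to FP32), here is a minimal Kotlin sketch. Every name in it (`SketchPolicy`, `SketchGgmlType`, `SketchTarget`, `sketchTargetFor`) is hypothetical and not taken from the PR's source. Treating every type outside the packed list as the FP32 fallback is an assumption, and the sketch only covers 2-D weight matrices.

```kotlin
// All names here are hypothetical illustrations, not the PR's actual types.
enum class SketchPolicy { NATIVE_OPTIMIZED, OTHER }
enum class SketchGgmlType { F32, F16, Q4_0, Q4_K, Q5_K, Q6_K, Q8_0 }
enum class SketchTarget { UNCHANGED, FP32_EMBEDDING, PACKED_BLOCKS, FP32_MATRIX }

// Quant types the commit message names as packed block tensors.
val sketchPackedTypes = setOf(
    SketchGgmlType.Q4_K, SketchGgmlType.Q5_K, SketchGgmlType.Q6_K, SketchGgmlType.Q8_0,
)

// Decide which in-memory form a weight matrix should take at load time.
fun sketchTargetFor(policy: SketchPolicy, name: String, type: SketchGgmlType): SketchTarget =
    when {
        // The conversion only runs under NATIVE_OPTIMIZED; other paths are untouched.
        policy != SketchPolicy.NATIVE_OPTIMIZED -> SketchTarget.UNCHANGED
        // The embedding table is gathered, so it is expanded to FP32 [vocab, dim].
        name.startsWith("token_embd") -> SketchTarget.FP32_EMBEDDING
        // Matrices used in matmuls stay quantized, in packed block form.
        type in sketchPackedTypes -> SketchTarget.PACKED_BLOCKS
        // Assumption: everything else is dequantized to FP32 [out, in].
        else -> SketchTarget.FP32_MATRIX
    }

fun main() {
    val samples = listOf(
        "token_embd.weight" to SketchGgmlType.Q4_K,
        "blk.0.attn_q.weight" to SketchGgmlType.Q4_K,
        "blk.0.ffn_down.weight" to SketchGgmlType.Q6_K,
        "blk.0.attn_v.weight" to SketchGgmlType.Q4_0,
    )
    for ((name, type) in samples) {
        val target = sketchTargetFor(SketchPolicy.NATIVE_OPTIMIZED, name, type)
        println("$name ($type) -> $target")
    }
}
```

The policy check comes first so that the sketch, like the PR, leaves every non-NATIVE_OPTIMIZED path unchanged.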

Note: branch is named fix/llama-gguf-orientation for historical reasons; the actual change is the packed-weights path above.

Stack

this PR → develop
perf/fused-decode-attention (fused decode-attention + traceable RoPE) → this branch
release/0.32.0 (version bump + API dumps) → perf branch

Please merge with a merge commit or a rebase (do NOT squash): downstream skainet-tinyllama-iree pins these commits by SHA, and squashing would change the hashes.

Llama real-GGUF inference under NATIVE_OPTIMIZED was broken ("gather: unsupported input rank 1") because raw quantized bytes were never converted into the packed/FP32 forms the DSL consumes. Mirror the working Gemma path:

- LlamaQuantLayout.kt: logicalShapeFor(name, LlamaModelMetadata) [out,in] + packLlamaKQuant (Q4_K/Q5_K/Q6_K/Q8_0 block tensors, row-major->block-major relayout) + relayout helper.
- LlamaPackedWeights.kt: convertLlamaWeightsPacked: token_embd -> FP32 [vocab,dim] (gathered), other matrices -> packed Q*BlockTensorData (matmul'd), legacy types -> FP32 [out,in].
- LlamaNetworkLoader.load(): invoke the converter when quantPolicy == NATIVE_OPTIMIZED.

Verified (TinyLlama-1.1B Q4_K_M, composite build): next-token logits now match llama.cpp (top-1 'Quant', top-10 identical); host eager 0.17 -> 1.80 tok/s, 8.1 -> 5.5 GB, coherent.

Shapes derived from metadata (uniform Llama dims); logicalShapes-during-load deferred.

Perf-Tag: perf/a1-packed-llama
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
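
To make the "shapes derived from metadata" idea concrete, the following Kotlin sketch derives a logical [out,in] shape from a GGUF tensor name and computes how many bytes the packed block form would occupy. It is not the PR's `logicalShapeFor`; `SketchLlamaDims`, `sketchLogicalShape`, `SketchBlock`, and `sketchPackedBytes` are hypothetical names. The TinyLlama dimensions in `main` are assumed from the public model configuration, and the block geometries are the standard ggml ones (for example, Q4_K stores 256 weights in 144 bytes).

```kotlin
// Hypothetical stand-in for the dimensions a Llama metadata object carries.
data class SketchLlamaDims(
    val vocab: Int,
    val dim: Int,      // embedding width
    val heads: Int,    // attention heads
    val kvHeads: Int,  // key/value heads (grouped-query attention)
    val ffnDim: Int,   // feed-forward hidden width
) {
    // Width of the K and V projections: headDim * kvHeads.
    val kvDim: Int get() = dim / heads * kvHeads
}

// Logical [out, in] shape for a GGUF tensor name, or null for non-matrix tensors.
fun sketchLogicalShape(name: String, d: SketchLlamaDims): Pair<Int, Int>? {
    // "blk.0.attn_q.weight" -> "attn_q"; "token_embd.weight" -> "token_embd"
    val leaf = name.removeSuffix(".weight").substringAfterLast('.')
    return when (leaf) {
        "token_embd", "output" -> d.vocab to d.dim
        "attn_q", "attn_output" -> d.dim to d.dim
        "attn_k", "attn_v" -> d.kvDim to d.dim
        "ffn_gate", "ffn_up" -> d.ffnDim to d.dim
        "ffn_down" -> d.dim to d.ffnDim
        else -> null // norms and other 1-D tensors are out of scope here
    }
}

// ggml block geometry: elements per block and bytes per block.
enum class SketchBlock(val elems: Int, val bytes: Int) {
    Q4_K(256, 144), Q5_K(256, 176), Q6_K(256, 210), Q8_0(32, 34)
}

// Bytes needed to store an [out, in] matrix as quantized blocks along the input axis.
fun sketchPackedBytes(shape: Pair<Int, Int>, block: SketchBlock): Long {
    val (out, inDim) = shape
    require(inDim % block.elems == 0) { "input dim $inDim is not a multiple of ${block.elems}" }
    return out.toLong() * (inDim / block.elems) * block.bytes
}

fun main() {
    // Assumed TinyLlama-1.1B dimensions (not stated in the PR itself).
    val dims = SketchLlamaDims(vocab = 32000, dim = 2048, heads = 32, kvHeads = 4, ffnDim = 5632)
    for (name in listOf("blk.0.attn_q.weight", "blk.0.attn_k.weight", "blk.0.ffn_down.weight")) {
        val shape = sketchLogicalShape(name, dims) ?: continue
        val packed = sketchPackedBytes(shape, SketchBlock.Q4_K)
        val fp32 = shape.first.toLong() * shape.second * 4
        println("$name shape=$shape packed=$packed B fp32=$fp32 B")
    }
}
```

Under these assumptions a Q4_K matrix takes 4.5 bits per weight instead of 32, which is consistent in direction with the memory drop reported in the commit message, though the sketch does not attempt to reproduce the measured figures.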

This was referenced Jun 25, 2026

perf(mha)+fix(rope): fused decode-attention & traceable interleaved RoPE #196

Merged

release: SKaiNET-transformers 0.32.0 #197

Merged

michalharakal merged commit e4a0799 into develop Jun 25, 2026
1 of 2 checks passed

michalharakal deleted the fix/llama-gguf-orientation branch June 25, 2026 13:45

michalharakal merged 1 commit into develop from fix/llama-gguf-orientation
