feat(llama): NATIVE_OPTIMIZED packed weight path (mirror Gemma) #195

Merged: michalharakal merged 1 commit into develop from fix/llama-gguf-orientation on Jun 25, 2026

@michalharakal

Contributor

PR 1 of 3 — IREE/TinyLlama fix-stack (merge in order)

Base of a 3-PR stack landing the upstream fixes that skainet-tinyllama-iree depends on.

This PR (ccbd87e)

Adds the Llama NATIVE_OPTIMIZED packed-weight path, mirroring the existing Gemma implementation:

  • new LlamaPackedWeights.kt
  • new LlamaQuantLayout.kt
  • LlamaNetworkLoader.kt wiring

Purely additive (+227 lines), no behavior change to existing paths.

Note: branch is named fix/llama-gguf-orientation for historical reasons; the actual change is the packed-weights path above.

Stack

  1. this PR → develop
  2. perf/fused-decode-attention (fused decode-attention + traceable RoPE) → this branch
  3. release/0.32.0 (version bump + API dumps) → perf branch

Please merge with a merge-commit / rebase (do NOT squash) — downstream skainet-tinyllama-iree pins these commit SHAs by hash.

Llama real-GGUF inference under NATIVE_OPTIMIZED was broken ("gather: unsupported input rank 1")
because raw quantized bytes were never converted into the packed/FP32 forms the DSL consumes.
Mirror the working Gemma path:
- LlamaQuantLayout.kt: logicalShapeFor(name, LlamaModelMetadata) [out,in] + packLlamaKQuant
  (Q4_K/Q5_K/Q6_K/Q8_0 block tensors, row-major->block-major relayout) + relayout helper.
- LlamaPackedWeights.kt: convertLlamaWeightsPacked — token_embd -> FP32 [vocab,dim] (gathered),
  other matrices -> packed Q*BlockTensorData (matmul'd), legacy types -> FP32 [out,in].
- LlamaNetworkLoader.load(): invoke the converter when quantPolicy == NATIVE_OPTIMIZED.
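
The three-way routing described above can be sketched roughly as follows. This is an illustration only, not the code in LlamaPackedWeights.kt: `GgmlType`, `Target`, `packedTypes`, and `targetFor` are hypothetical names, and the sketch assumes only rank-2 weight tensors are routed this way.

```kotlin
// Hypothetical stand-ins for illustration; not the project's real types.
enum class GgmlType { F32, F16, Q4_0, Q5_0, Q4_K, Q5_K, Q6_K, Q8_0 }

enum class Target { FP32_GATHER, PACKED_BLOCKS, FP32_MATRIX }

// Block types the commit message lists as kept packed for matmul.
val packedTypes = setOf(GgmlType.Q4_K, GgmlType.Q5_K, GgmlType.Q6_K, GgmlType.Q8_0)

// Decide how a rank-2 weight tensor is materialized under NATIVE_OPTIMIZED.
fun targetFor(name: String, type: GgmlType): Target = when {
    // The embedding table is row-gathered, so it becomes FP32 [vocab, dim].
    name.startsWith("token_embd") -> Target.FP32_GATHER
    // Supported block types stay packed and feed the matmul path.
    type in packedTypes -> Target.PACKED_BLOCKS
    // Everything else (legacy types) is dequantized to FP32 [out, in].
    else -> Target.FP32_MATRIX
}
```

For example, `targetFor("blk.0.attn_q.weight", GgmlType.Q4_K)` yields `PACKED_BLOCKS`, while a `Q4_0` matrix falls through to `FP32_MATRIX`. The name check comes first because the embedding table is gathered rather than multiplied, whatever its storage type.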

Verified (TinyLlama-1.1B Q4_K_M, composite build): next-token logits now match llama.cpp
(top-1 'Quant', top-10 identical); host eager 0.17 -> 1.80 tok/s, 8.1 -> 5.5 GB, coherent.
Shapes derived from metadata (uniform Llama dims); logicalShapes-during-load deferred.
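
A top-k comparison like the one reported above could be scripted along these lines. This is a minimal sketch, not the check actually used for this PR: `topK` and `sameTopK` are hypothetical helpers, and it assumes "identical" means the same token ids in the same rank order, which the commit message does not specify.

```kotlin
// Indices of the k largest logits, highest first. Ties keep the lower index
// first because sortedByDescending is a stable sort.
fun topK(logits: FloatArray, k: Int): List<Int> =
    logits.indices.sortedByDescending { logits[it] }.take(k)

// True when both logit vectors rank the same k token ids in the same order.
fun sameTopK(a: FloatArray, b: FloatArray, k: Int): Boolean =
    topK(a, k) == topK(b, k)
```

Comparing ranked ids rather than raw values tolerates small numeric differences between the two runtimes while still catching a transposed or mis-packed weight, which typically reshuffles the ranking.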

Perf-Tag: perf/a1-packed-llama

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit e4a0799 into develop Jun 25, 2026
1 of 2 checks passed
@michalharakal michalharakal deleted the fix/llama-gguf-orientation branch June 25, 2026 13:45