feat(llama): NATIVE_OPTIMIZED packed weight path (mirror Gemma) #195
Merged
Llama real-GGUF inference under NATIVE_OPTIMIZED was broken ("gather: unsupported input rank 1")
because raw quantized bytes were never converted into the packed/FP32 forms the DSL consumes.
Mirror the working Gemma path:
- LlamaQuantLayout.kt: logicalShapeFor(name, LlamaModelMetadata) [out,in] + packLlamaKQuant
(Q4_K/Q5_K/Q6_K/Q8_0 block tensors, row-major->block-major relayout) + relayout helper.
- LlamaPackedWeights.kt: convertLlamaWeightsPacked — token_embd -> FP32 [vocab,dim] (gathered),
other matrices -> packed Q*BlockTensorData (matmul'd), legacy types -> FP32 [out,in].
- LlamaNetworkLoader.load(): invoke the converter when quantPolicy == NATIVE_OPTIMIZED.
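The routing and relayout described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: every `sketch*` name is hypothetical, and "block-major" is assumed here to mean that block b of row r lands at block index `b * rows + r`. The per-block byte sizes are the ggml values for these formats.

```kotlin
// Hypothetical names throughout; this is not the API added by the PR.
enum class SketchTarget { FP32_GATHER, PACKED_MATMUL, FP32_LEGACY }

// Bytes per quant block in ggml: 256 weights per K-quant block, 32 per Q8_0 block.
val sketchBlockBytes = mapOf("Q4_K" to 144, "Q5_K" to 176, "Q6_K" to 210, "Q8_0" to 34)

// Decide which form a tensor is converted into, following the three cases in the commit message.
fun sketchTargetFor(tensorName: String, ggmlType: String): SketchTarget = when {
    tensorName.startsWith("token_embd") -> SketchTarget.FP32_GATHER   // gathered, so dequantized
    ggmlType in sketchBlockBytes -> SketchTarget.PACKED_MATMUL        // stays quantized, repacked
    else -> SketchTarget.FP32_LEGACY                                  // legacy types dequantized
}

// Reorder fixed-size quant blocks from row-major to the assumed block-major order.
fun sketchRelayout(src: ByteArray, rows: Int, blocksPerRow: Int, bytesPerBlock: Int): ByteArray {
    require(src.size == rows * blocksPerRow * bytesPerBlock) { "unexpected byte count" }
    val dst = ByteArray(src.size)
    for (r in 0 until rows) {
        for (b in 0 until blocksPerRow) {
            val srcOff = (r * blocksPerRow + b) * bytesPerBlock
            val dstOff = (b * rows + r) * bytesPerBlock
            // Copy one whole block; the bytes inside a block are left untouched.
            src.copyInto(dst, destinationOffset = dstOff, startIndex = srcOff, endIndex = srcOff + bytesPerBlock)
        }
    }
    return dst
}
```

Under these assumptions, a loader would call something like `sketchTargetFor` per tensor only when the quant policy is NATIVE_OPTIMIZED, and leave every other policy on its existing path.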
Verified (TinyLlama-1.1B Q4_K_M, composite build): next-token logits now match llama.cpp
(top-1 'Quant', top-10 identical); host eager 0.17 -> 1.80 tok/s, 8.1 -> 5.5 GB, coherent.
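A top-k agreement check of the kind reported above could look like the sketch below. It assumes both logit vectors are already available as `FloatArray`s over the same vocabulary, and it treats "identical" as the same token ids in the same order; the function names are hypothetical and this is not the harness used for the verification.

```kotlin
// Hypothetical helper: indices of the k largest logits, highest first.
// The sort is stable, so ties keep the lower token id first.
fun sketchTopK(logits: FloatArray, k: Int): List<Int> =
    logits.indices.sortedByDescending { logits[it] }.take(k)

// True when both vectors rank the same k token ids in the same order.
fun sketchSameTopK(ours: FloatArray, reference: FloatArray, k: Int): Boolean {
    require(ours.size == reference.size) { "vocab size mismatch" }
    return sketchTopK(ours, k) == sketchTopK(reference, k)
}
```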
Shapes derived from metadata (uniform Llama dims); logicalShapes-during-load deferred.
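As an illustration of deriving [out,in] shapes from metadata alone, here is a minimal sketch. It assumes the standard GGUF Llama tensor names and a hypothetical `SketchLlamaMetadata` holder; the real `LlamaModelMetadata` fields and the `logicalShapeFor` signature may differ.

```kotlin
// Hypothetical stand-in for the model metadata; field names are assumptions.
data class SketchLlamaMetadata(
    val vocab: Int,
    val dim: Int,
    val ffnDim: Int,
    val nHeads: Int,
    val nKvHeads: Int,
)

// Returns the logical [out, in] shape for a weight matrix, or null for tensors
// this sketch does not cover (for example 1-D norm weights).
fun sketchLogicalShapeFor(name: String, m: SketchLlamaMetadata): List<Int>? {
    val kvDim = m.dim / m.nHeads * m.nKvHeads          // grouped-query attention K/V width
    // "blk.3.attn_k.weight" -> "attn_k"; "token_embd.weight" -> "token_embd"
    val leaf = name.removeSuffix(".weight").substringAfterLast('.')
    return when (leaf) {
        "token_embd", "output" -> listOf(m.vocab, m.dim)
        "attn_q", "attn_output" -> listOf(m.dim, m.dim)
        "attn_k", "attn_v" -> listOf(kvDim, m.dim)
        "ffn_gate", "ffn_up" -> listOf(m.ffnDim, m.dim)
        "ffn_down" -> listOf(m.dim, m.ffnDim)
        else -> null
    }
}
```

This only works because every layer shares the same dimensions; a model with per-layer variation would need the shapes recorded during load instead, which is the part the commit message defers.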
Perf-Tag: perf/a1-packed-llama
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 25, 2026
PR 1 of 3 — IREE/TinyLlama fix-stack (merge in order)
Base of a 3-PR stack landing the upstream fixes that skainet-tinyllama-iree depends on.

This PR (ccbd87e)
Adds the Llama NATIVE_OPTIMIZED packed-weight path, mirroring the existing Gemma implementation:
- LlamaPackedWeights.kt
- LlamaQuantLayout.kt
- LlamaNetworkLoader.kt wiring
Purely additive (+227 lines), no behavior change to existing paths.

Stack
- develop
- perf/fused-decode-attention (fused decode-attention + traceable RoPE) → this branch
- release/0.32.0 (version bump + API dumps) → perf branch
Please merge with a merge commit or rebase (do NOT squash): downstream skainet-tinyllama-iree pins these commit SHAs by hash.