Commit f94ce6c
fix(gemma): keep tied Q8_0 lm_head packed in eager NATIVE_OPTIMIZED path (#178)
FunctionGemma's token_embd is Q8_0 and tied to the output head, so
convertGemmaWeightsPacked was dequantizing BOTH token_embd AND output to FP32
(2 × ~0.67 GB), which OOMs on the 1.9 GB SL2610. `output`/lm_head is a real
matmul weight, not an embedding:
- packGemmaKQuant: add Q8_0 (32-elem/34B blocks → Q8_0BlockTensorData);
  generalize the row-major→block-major relayout with a blockSize param.
- convertGemmaWeightsPacked: drop OUTPUT_WEIGHT from the isEmbed FP32 branch so
  it packs like the other matmul weights and runs on the (NEON) Q8_0 kernel.
  token_embd stays FP32 (it's gathered) but is now wrapped no-copy via
  DenseFloatArrayTensorData instead of ctx.fromFloatArray (which allocates a
  second ~0.67 GB buffer).
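As a rough illustration of a relayout parameterized by block size, here is a minimal sketch. It is not the code from this commit: the function name, the `blockBytes` parameter, and the exact target layout (block column 0 of every row first, then column 1, and so on) are assumptions made for the example.

```kotlin
// Hypothetical helper, not the SKaiNET API. Assumed layouts:
//   input : row-major  -> all blocks of row 0, then all blocks of row 1, ...
//   output: block-major -> block column 0 of every row, then column 1, ...
fun relayoutRowMajorToBlockMajor(
    src: ByteArray,
    rows: Int,
    blocksPerRow: Int,
    blockBytes: Int, // e.g. 34 for Q8_0 (32 elements per block)
): ByteArray {
    require(src.size == rows * blocksPerRow * blockBytes) { "unexpected buffer size" }
    val dst = ByteArray(src.size)
    for (r in 0 until rows) {
        for (b in 0 until blocksPerRow) {
            // Byte offsets of block (r, b) in the source and destination layouts.
            val from = (r * blocksPerRow + b) * blockBytes
            val to = (b * rows + r) * blockBytes
            src.copyInto(dst, destinationOffset = to, startIndex = from, endIndex = from + blockBytes)
        }
    }
    return dst
}
```

Because the block byte size is a parameter, the same loop can serve any block-quantized format under these assumptions; only the caller's `blockBytes` changes.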
Footprint for the tied embed/lm_head drops ~1.34 GB → ~0.67 GB (embed FP32) +
~0.09 GB (packed Q8_0 lm_head). Requires the engine Q8_0 case in ops.transpose
(SKaiNET fix/q8_0-lazy-transpose) so linearProject can transpose the packed
weight.
Verified: GemmaQ5KPackedParityTest (composite -PuseLocalSkainet) — eager
load(NATIVE_OPTIMIZED) decodes byte-identically to the FP32 baseline; lm_head
packed as Q8_0. (token_embd row-dequant gather to drop the last ~0.67 GB is the
remaining follow-up in #178.)
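The row-dequant gather follow-up could look roughly like the sketch below. It assumes the standard GGUF Q8_0 block layout (a 2-byte little-endian FP16 scale followed by 32 signed 8-bit values) and a plain row-major packed buffer; `gatherRowQ8_0` and `halfBitsToFloat` are hypothetical names, not SKaiNET functions.

```kotlin
const val Q8_0_BLOCK_ELEMS = 32
const val Q8_0_BLOCK_BYTES = 34 // 2-byte FP16 scale + 32 int8 values

// Convert IEEE 754 binary16 bits to Float (zero, subnormals, normals, inf/NaN).
fun halfBitsToFloat(h: Int): Float {
    val sign = (h and 0x8000) shl 16
    val exp = (h ushr 10) and 0x1F
    val mant = h and 0x3FF
    return when (exp) {
        0 -> {
            // Zero or subnormal: magnitude is mant * 2^-24.
            val mag = mant * (1.0f / 16777216.0f)
            if (sign != 0) -mag else mag
        }
        31 -> Float.fromBits(sign or 0x7F800000 or (mant shl 13))
        // Rebias the exponent from 15 (half) to 127 (float).
        else -> Float.fromBits(sign or ((exp + 112) shl 23) or (mant shl 13))
    }
}

// Dequantize a single row of a row-major Q8_0 matrix.
// Only `cols` floats are allocated; the rest of the table stays packed.
fun gatherRowQ8_0(packed: ByteArray, row: Int, cols: Int): FloatArray {
    require(cols % Q8_0_BLOCK_ELEMS == 0) { "cols must be a multiple of 32" }
    val blocksPerRow = cols / Q8_0_BLOCK_ELEMS
    val out = FloatArray(cols)
    var offset = row * blocksPerRow * Q8_0_BLOCK_BYTES
    for (b in 0 until blocksPerRow) {
        val lo = packed[offset].toInt() and 0xFF
        val hi = packed[offset + 1].toInt() and 0xFF
        val scale = halfBitsToFloat(lo or (hi shl 8))
        for (i in 0 until Q8_0_BLOCK_ELEMS) {
            // Kotlin's Byte is signed, so it maps directly onto the int8 quant value.
            out[b * Q8_0_BLOCK_ELEMS + i] = packed[offset + 2 + i] * scale
        }
        offset += Q8_0_BLOCK_BYTES
    }
    return out
}
```

With a gather like this, an embedding lookup expands only the rows for the tokens in the current batch, so the full FP32 copy of token_embd would no longer be needed.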
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent b4a1600 commit f94ce6c
2 files changed
Lines changed: 38 additions & 15 deletions
- llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma
Lines changed: 12 additions & 3 deletions
Lines changed: 26 additions & 12 deletions