fix(kvcache): trace-faithful PositionalKVCache.update (#763)#193
Merged
Conversation
PositionalKVCache.update copied the incoming K/V tensor *data* into a heap FloatArray (writeAt) and read it back via ctx.fromData (sliceView), bypassing ctx.ops. Under tracing the symbolic K/V carry no data, so toComputeGraph(embedConstants=true) baked an all-zero KV buffer as stablehlo.constant and disconnected the computed k_proj/v_proj — the exported decoder then attended over K=V=0. In eager inference the buffer holds real data, so the bug was invisible there; in export it was masked in plain decoders by the unnormalized FFN dominating the residual stream, and exposed by Gemma's sandwichNorms (post_ffw_norm normalizes the FFN, so the lost attention becomes significant — ~1.4x/block logit error). Fix: when ctx.isRecording, wire K/V functionally through ops.concat (the same history AppendKVCache already uses) instead of the raw heap buffer, so the StableHLO export carries the real projections. The eager fast-path (heap buffer) is unchanged. As a side effect the traced graph no longer surfaces dangling KV-cache buffer leaves as graph outputs. Verified end-to-end via skainet-iree-conformance gemma-decoder (real 2-block Gemma3 with qk-norm + sandwichNorms + layer-output-scale): export -> iree-compile -> iree-run-module matches the SKaiNET-CPU oracle at max_abs_err 3.8e-6 (was ~5.3). Adds KvCacheTraceFidelityTest. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PositionalKVCache.updateis not trace-faithful: it copies the incoming K/V tensor data into a heapFloatArray(writeAt) and reads it back viactx.fromData(sliceView), bypassingctx.ops. Under tracing the symbolic K/V carry no data, sotoComputeGraph(embedConstants=true)bakes an all-zero KV buffer asstablehlo.constantand disconnects the computedk_proj/v_proj— the exported decoder attends over K=V=0.Eager inference is fine (the buffer holds real data). In export the bug was masked in plain decoders (the unnormalized FFN dominates the residual stream) and exposed by Gemma's
sandwichNorms(post_ffw_normnormalizes the FFN, so the lost attention becomes significant → ~1.4×/block logit error). Tracked as SKaiNET#763.Fix
When
ctx.isRecording, wire K/V functionally throughops.concat(the same historyAppendKVCachealready uses) instead of the raw heap buffer, so the StableHLO export carries the real projections. The eager fast-path (heap buffer) is unchanged. Side effect: the traced graph no longer surfaces dangling KV-cache buffer leaves as graph outputs.Verification
End-to-end via
skainet-iree-conformancegemma-decoder— a real 2-block Gemma3 decoder with qk-norm + sandwichNorms + layer-output-scale:export → iree-compile → iree-run-modulematches the SKaiNET-CPU oracle at max_abs_err 3.8e-6 (was ~5.3). AddsKvCacheTraceFidelityTest(asserts a traced sandwich decoder bakes no zero KV-cache constants).Note: the other
PositionalKVCache-derived shared variants (SharedPositionalKVCache,PaddedSharedPositionalKVCache,OwnerReadOnlyKVCache) still use the buffersliceViewon read and would need the same treatment for kv-shared exports — out of scope here (gemma-decoder uses the non-shared cache).🤖 Generated with Claude Code