Commit aad8dba

Phase-2: apply 2 KB cap to BOTH constants and activations - CCT_2 now PASSES
Earlier 'cap-on-constants-only' (a8694f7) made CCT_2 fail 10/10 because its 7x 32 KB residual / LayerNorm-output activations hit the write-side mirror of the §10.1 codegen bug (writer layer tile-loops output to a fixed L2 base address, every tile clobbers the previous).

The §10.1 bug is symmetric: any buffer >2 KB tiled across the producer's output dimension corrupts under promotion, regardless of whether it's a static-PI_L2 ConstantBuffer (read-side) or a dynamically-allocated MEMORYARENA_L2 VariableBuffer (write-side). Apply the cap uniformly.

Cost: IC100 / miniMobileNet lose their large-activation gains (x3_tensor_split 16 KB blocked), back to const-only level. microLlama1 retains its activation gains (all activations <2 KB already).

All 5 models PASS:

| Model          |      const-only | act+cap+users>=2 | Δ vs baseline |
|----------------|----------------:|-----------------:|--------------:|
| IC100          |   2 389 072 cyc |    2 389 072 cyc |        -3.9 % |
| microLlama1    |   3 475 840 cyc |    3 095 426 cyc |       -26.8 % |
| miniMobileNet  |     134 946 cyc |      134 946 cyc |       -29.6 % |
| AnomalyDet 200 |     509 558 cyc |      509 738 cyc |       -15.3 % |
| CCT_2          | 309 888 350 cyc |  309 893 862 cyc |  ~0 % (PASS)  |

Correctness wins over peak perf for IC/mmn; recovering the large-activation gains there requires the proper Phase-3 fix (per-tile L1↔L2 source/destination advancement when the buffer is at the closure's externalMemory level).
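The clobbering described in the message can be reproduced with a toy model. The sketch below is not Deeploy code: `write_back_tiles` and its `advance_dst` flag are hypothetical names, and a Python list stands in for the L2 buffer. It only illustrates why a fixed destination address loses every tile but the last, and what per-tile destination advancement changes.

```python
def write_back_tiles(tiles, l2_size, advance_dst):
    """Toy model of a writer layer copying L1 output tiles into one L2 buffer.

    advance_dst=False mimics the behaviour described in the commit message:
    every tile is written at the same base offset.
    advance_dst=True mimics per-tile destination advancement.
    """
    l2 = [None] * l2_size          # None marks bytes that were never written
    offset = 0
    for tile in tiles:
        dst = offset if advance_dst else 0
        l2[dst:dst + len(tile)] = tile   # same-length slice assignment
        offset += len(tile)
    return l2


tiles = [[1, 2], [3, 4], [5, 6]]
clobbered = write_back_tiles(tiles, 6, advance_dst=False)
advanced = write_back_tiles(tiles, 6, advance_dst=True)

print(clobbered)  # only the last tile survives, at the base of the buffer
print(advanced)   # all tiles land at consecutive offsets
```

A buffer that fits in a single tile takes one pass through the loop, so both variants give the same result. That matches the message's observation that buffers under the cap still promote safely.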
1 parent a8694f7 commit aad8dba

1 file changed

Lines changed: 16 additions & 11 deletions

Deeploy/MemoryLevelExtension/OptimizationPasses/MemoryLevelAnnotationPasses.py

@@ -234,17 +234,22 @@ def apply(self, ctxt: NetworkContext, graph: gs.Graph) -> Tuple[NetworkContext,
                 continue
             if self.excludeNamePatterns and any(pat in name for pat in self.excludeNamePatterns):
                 continue
-            # Per-buffer-size cap: applies to ConstantBuffer ONLY.
-            # ConstantBuffers above the cap are subject to the §10.1 read-side
-            # codegen bug (per-tile static PI_L2 read with no source pointer
-            # advancement); the cap is the safe interim workaround.
-            # Activations live in MEMORYARENA_L2 (dynamically allocated, not
-            # linker-placed static PI_L2), so the §10.1 mechanism does not
-            # apply to them — large activations are free to promote provided
-            # the users >= 2 filter below is satisfied.
-            if isinstance(buf, ConstantBuffer):
-                if self.maxBufferBytes is not None and _bufferSizeBytes(buf) > self.maxBufferBytes:
-                    continue
+            # Per-buffer-size cap. Applied to BOTH ConstantBuffer (because
+            # of the §10.1 read-side codegen bug for static-PI_L2 weights
+            # tiled across an output dimension) AND VariableBuffer (because
+            # of the symmetric write-side codegen bug for promoted
+            # activations whose writer tiles across them — observed on
+            # CCT_2's 32 KB residual / LayerNorm-output activations).
+            # Both bugs leave the per-tile L1↔L2 source/destination pointer
+            # at a fixed offset across the tile loop, so any buffer larger
+            # than a single L1 tile slab corrupts under promotion. The
+            # 2 KB default is the empirically-anchored safe boundary for
+            # this codebase (see Phase-1 §10.1 bisection). Activations
+            # smaller than the cap (residual splits, attention K/V tiles
+            # that fit in one L1 tile, small per-channel intermediates)
+            # still promote and yield real cycles.
+            if self.maxBufferBytes is not None and _bufferSizeBytes(buf) > self.maxBufferBytes:
+                continue
             reuse = max(1, len(getattr(buf, "_users", [])))
             # For activations specifically (non-Constant VariableBuffer),
             # require multi-consumer reuse (users >= 2). A single-use
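The filter after this change can be sketched as a standalone predicate. This is a rough sketch under stated assumptions, not the pass itself: `Buf` and `is_promotion_candidate` are hypothetical stand-ins for the real buffer classes and the loop body, the 2 KB default is assumed to mean 2048 bytes, and the users >= 2 rule is taken from the trailing comment, because the code that implements it is cut off in the hunk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Buf:
    """Hypothetical stand-in for ConstantBuffer / VariableBuffer."""
    name: str
    size_bytes: int
    is_constant: bool
    users: List[str] = field(default_factory=list)


def is_promotion_candidate(buf: Buf, max_buffer_bytes: Optional[int] = 2048) -> bool:
    # Size cap, now applied uniformly to constants and activations.
    if max_buffer_bytes is not None and buf.size_bytes > max_buffer_bytes:
        return False
    # Assumed from the diff comment: activations need at least two consumers.
    if not buf.is_constant and len(buf.users) < 2:
        return False
    return True


big_act = Buf("residual", 32 * 1024, is_constant=False, users=["add", "norm"])
small_act = Buf("kv_tile", 1024, is_constant=False, users=["q", "k"])
single_use = Buf("tmp", 1024, is_constant=False, users=["relu"])
small_weight = Buf("bias", 512, is_constant=True, users=["conv"])

for b in (big_act, small_act, single_use, small_weight):
    print(b.name, is_promotion_candidate(b))
```

Passing `max_buffer_bytes=None` disables the cap in this sketch, mirroring the `self.maxBufferBytes is not None` guard in the diff.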
