Commit a8694f7

Phase-2: split cap (constants only) + users>=2 filter (activations)
Two refinements after the L2 reservation fix:

1. The 2 KB cap (Phase-1 §10.1) now applies ONLY to ConstantBuffers. Activations are dynamically allocated in MEMORYARENA_L2 (not linker-placed static PI_L2), so the per-tile static-read codegen bug §10.1 was protecting against does not apply to them. Removing the cap on activations lets large multi-consumer activations (residual splits, attention K/V) actually promote.

2. Activations now require users >= 2 (multi-consumer) by default. Single-use activations gain nothing structural from promotion (the next-layer cl_ram round-trip happens once either way) and triggered an L2-allocator corner case on AnomalyDetection (35/640 errors when 9 users=1 onnxGemm_* tensors entered the reservation pool). Constants bypass this filter.

Phase-2 sweep with both refinements (and reservation fix from a3fc5d1):

| Model           | const-only (P1) | act-included (P2) | Δ vs baseline |
|-----------------|----------------:|------------------:|--------------:|
| IC100           |   2 389 072 cyc |     2 277 046 cyc |        -8.4 % |
| ml1             |   3 475 840 cyc |     3 095 426 cyc |       -26.8 % |
| miniMobileNet   |     134 946 cyc |       134 946 cyc |       -29.6 % |
| AnomalyDet 200  |     509 558 cyc |       509 464 cyc |       -15.3 % |

All 4 PASS output equality. The users>=2 filter costs perf vs the no-filter run on ml1 / miniMobileNet (single-use activations were genuine wins there) but restores AD correctness.

Future Phase-2 work: figure out why AD's single-use activations specifically trip the reservation allocator and lift the filter only for safe models.
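The "Δ vs baseline" column is measured against a baseline cycle count that the message does not list, so it does not show how much of the gain comes from the activation step alone. A minimal sketch of that P1 to P2 comparison follows, using only the cycle counts from the sweep table; the names `SWEEP`, `relative_change` and `p1_to_p2_changes` are illustrative and not part of Deeploy.

```python
# Cycle counts copied from the Phase-2 sweep table in the commit message:
# (const-only P1, act-included P2). Nothing here is measured by this script.
SWEEP = {
    "IC100": (2_389_072, 2_277_046),
    "ml1": (3_475_840, 3_095_426),
    "miniMobileNet": (134_946, 134_946),
    "AnomalyDet 200": (509_558, 509_464),
}


def relative_change(before: int, after: int) -> float:
    """Signed fractional change from `before` to `after` (negative = fewer cycles)."""
    return (after - before) / before


def p1_to_p2_changes(sweep=SWEEP) -> dict:
    """Per-model change in cycles when activations are added on top of P1."""
    return {model: relative_change(p1, p2) for model, (p1, p2) in sweep.items()}


if __name__ == "__main__":
    for model, change in p1_to_p2_changes().items():
        print(f"{model:<16} {change:+.2%}")
```

On these numbers the activation step accounts for roughly 4.7 % fewer cycles on IC100 and 10.9 % on ml1, is exactly zero on miniMobileNet, and is negligible on AnomalyDet 200, which is consistent with the note that the users>=2 filter gives up single-use wins on some models.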
1 parent a3fc5d1 commit a8694f7

1 file changed

Lines changed: 21 additions & 2 deletions

File tree

Deeploy/MemoryLevelExtension/OptimizationPasses/MemoryLevelAnnotationPasses.py

@@ -234,9 +234,28 @@ def apply(self, ctxt: NetworkContext, graph: gs.Graph) -> Tuple[NetworkContext,
                 continue
             if self.excludeNamePatterns and any(pat in name for pat in self.excludeNamePatterns):
                 continue
-            if self.maxBufferBytes is not None and _bufferSizeBytes(buf) > self.maxBufferBytes:
-                continue
+            # Per-buffer-size cap: applies to ConstantBuffer ONLY.
+            # ConstantBuffers above the cap are subject to the §10.1 read-side
+            # codegen bug (per-tile static PI_L2 read with no source pointer
+            # advancement); the cap is the safe interim workaround.
+            # Activations live in MEMORYARENA_L2 (dynamically allocated, not
+            # linker-placed static PI_L2), so the §10.1 mechanism does not
+            # apply to them — large activations are free to promote provided
+            # the users >= 2 filter below is satisfied.
+            if isinstance(buf, ConstantBuffer):
+                if self.maxBufferBytes is not None and _bufferSizeBytes(buf) > self.maxBufferBytes:
+                    continue
             reuse = max(1, len(getattr(buf, "_users", [])))
+            # For activations specifically (non-Constant VariableBuffer),
+            # require multi-consumer reuse (users >= 2). A single-use
+            # activation gains nothing from promotion (the next-layer
+            # cl_ram round-trip happens once either way) and adds reservation
+            # pressure that triggered AnomalyDetection corner cases at the
+            # L2 allocator (Phase-2 §7.5 follow-up). Constants bypass this
+            # filter — small reuse=1 weights still benefit because they're
+            # loaded once at network init, not per-inference.
+            if not isinstance(buf, ConstantBuffer) and reuse < 2:
+                continue
             if reuse < self.minReuse:
                 continue
             size = _bufferSizeBytes(buf)
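
The combined effect of the two filters in this hunk can be sketched as a single predicate. This is a standalone illustration, not the Deeploy implementation: `Buf` is a hypothetical stand-in for the real buffer classes, `is_constant` replaces the `isinstance(buf, ConstantBuffer)` check, the 2048-byte default assumes "2 KB" means 2048 bytes, and the `min_reuse` default of 1 is an assumption since the diff does not show the actual default.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Buf:
    """Hypothetical stand-in for a Deeploy buffer (not the real class)."""
    name: str
    size_bytes: int
    is_constant: bool          # stands in for isinstance(buf, ConstantBuffer)
    users: List[str] = field(default_factory=list)


def is_promotion_candidate(buf: Buf,
                           max_buffer_bytes: Optional[int] = 2048,  # assumed 2 KB cap
                           min_reuse: int = 1) -> bool:             # assumed default
    # Size cap applies to constants only; activations are never size-capped.
    if buf.is_constant and max_buffer_bytes is not None and buf.size_bytes > max_buffer_bytes:
        return False
    # Same reuse estimate as the diff: at least 1, else the number of users.
    reuse = max(1, len(buf.users))
    # Activations must be multi-consumer; constants bypass this filter.
    if not buf.is_constant and reuse < 2:
        return False
    # Generic reuse threshold, applied to every buffer kind.
    return reuse >= min_reuse


if __name__ == "__main__":
    examples = [
        Buf("big_weight", 4096, True, ["conv0"]),              # constant over the cap
        Buf("small_weight", 1024, True, ["conv0"]),            # constant under the cap
        Buf("residual", 65536, False, ["add0", "conv1"]),      # large multi-use activation
        Buf("gemm_out", 512, False, ["relu0"]),                # single-use activation
    ]
    for b in examples:
        print(b.name, is_promotion_candidate(b))
```

Keeping the size cap and the users filter as separate early exits mirrors the structure of the diff, so either one can be relaxed later (for example, lifting the users filter only for models known to be safe) without touching the other.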
