Phase-2: unblock VariableBuffer promotion (IndexError fix + IO/arena filter)

runwangdl · runwangdl · commit 1a8ffa3918d4 · 2026-04-16T21:21:36.000Z
Two structural fixes that together let PromoteTensorsToL2Greedy run
without --promoteOnlyConstants:

1. TilerExtension.computeMemoryMap indexed
   tilingSolution[idx].nodeConstraints[0] for every entry of the
   combined memoryMap.  That list is inner_scheduler_entries (one per
   pattern, len == len(tilingSolution)) followed by exactly one
   outer_scheduler entry holding global/constant-style buffers, and
   the outer entry has no per-pattern node constraint.  With
   --promoteOnlyConstants the outer L2 entry was empty, so the
   'if len(...) != 0' check skipped it; promoting a VariableBuffer
   into L2 populated the outer entry and the loop crashed with
   IndexError at tilingSolution[len(tilingSolution)].nodeConstraints[0].
   Now pass None for the outer entry, the same convention the
   default-memory-level branch already uses.

2. PromoteTensorsToL2Greedy was including graph IO tensors (input_0,
   output_0) and the MEMORYARENA_L3 meta-buffer in the candidate set
   when --promoteOnlyConstants was off.  Promoting those produced a
   binary that crashed gvsoc with 'Platform returned an error
   (exitcode: 1)' (the runtime contract for IO tensors and the arena
   pointer requires a fixed memory level).  Filter them out at
   candidate-selection time.
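
The off-by-one in fix 1 can be reproduced in isolation. The sketch below is
not Deeploy code: plain lists stand in for memoryMap[memoryLevel] and
tilingSolution, and node_constraint_indices is a hypothetical helper that
only mirrors the guard added in this commit.

```python
from typing import List, Optional


def node_constraint_indices(num_patterns: int, entries: List[list]) -> List[Optional[int]]:
    """Return, for each non-empty entry, the pattern index to use or None.

    Hypothetical helper: `entries` mimics memoryMap[memoryLevel], i.e.
    `num_patterns` inner entries followed by one trailing outer entry.
    """
    picked: List[Optional[int]] = []
    for idx, entry in enumerate(entries):
        if len(entry) == 0:
            continue  # empty entries are skipped, as before
        # The trailing outer entry has no per-pattern constraint.
        picked.append(None if idx >= num_patterns else idx)
    return picked


tiling_solution = ["pattern_0", "pattern_1"]   # stand-in for tilingSolution
constants_only = [["w0"], ["w1"], []]          # outer entry empty
with_variable = [["w0"], ["w1"], ["act_0"]]    # outer entry populated

# Old behaviour: index tiling_solution for every non-empty entry.
try:
    [tiling_solution[i] for i, e in enumerate(with_variable) if len(e) != 0]
    old_crashed = False
except IndexError:
    old_crashed = True

result_constants_only = node_constraint_indices(len(tiling_solution), constants_only)
result_with_variable = node_constraint_indices(len(tiling_solution), with_variable)
print(old_crashed)             # True
print(result_constants_only)   # [0, 1]
print(result_with_variable)    # [0, 1, None]
```

With the outer entry empty, the old and new loops visit the same indices,
which is why the crash only appeared once a VariableBuffer was promoted.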

End-to-end verified on the Phase-1 sweep, 2 KB cap kept on:
* IC100:  const   2 389 072 cyc / var   2 389 051 cyc / PASS
* AD200:  const     509 738 cyc / var     509 464 cyc / PASS
* ml1:    const   3 475 925 cyc / var   3 475 840 cyc / PASS
* CCT_2:  const 309 860 361 cyc / var 309 890 036 cyc / PASS
* miniMobileNet @ 16 KB L2: var 134 946 cyc / PASS

The cycle Δ between the two modes is within noise because the Phase-1
2 KB cap also filters out VariableBuffer activations (all > 2 KB on
these models).  The structural blocker is now removed; the practical
perf benefit of VariableBuffer promotion (where reuse > 1 lets
cycle-aware actually outrank smallest) awaits the codegen-side fix
that lets the cap be lifted.
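
Why the two modes end up equivalent under the cap can be seen with a toy
candidate filter. This is a sketch under stated assumptions, not the pass
itself: Buf, select_candidates, and the buffer names and sizes are all
hypothetical, and only the filter order (constants-only, arena, IO, size cap)
follows the diff.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Buf:
    """Hypothetical stand-in for a buffer in ctxt.globalObjects."""
    name: str
    size_bytes: int
    is_constant: bool


def select_candidates(bufs: List[Buf], io_names: Set[str], only_constants: bool,
                      max_bytes: Optional[int]) -> List[str]:
    """Apply the same kinds of filters as the promotion pass (simplified)."""
    selected = []
    for buf in bufs:
        if only_constants and not buf.is_constant:
            continue
        if "MEMORYARENA" in buf.name:   # arena meta-buffer stays put
            continue
        if buf.name in io_names:        # graph IO stays put
            continue
        if max_bytes is not None and buf.size_bytes > max_bytes:
            continue
        selected.append(buf.name)
    return selected


# Made-up buffers; sizes chosen so the activation exceeds the 2 KB cap.
bufs = [
    Buf("weight_0", 1024, True),
    Buf("bias_0", 64, True),
    Buf("act_0", 8192, False),
    Buf("input_0", 4096, False),
    Buf("output_0", 40, False),
    Buf("MEMORYARENA_L3", 1 << 20, False),
]
io_names = {"input_0", "output_0"}

const_mode = select_candidates(bufs, io_names, only_constants=True, max_bytes=2048)
var_mode = select_candidates(bufs, io_names, only_constants=False, max_bytes=2048)
var_uncapped = select_candidates(bufs, io_names, only_constants=False, max_bytes=None)
print(const_mode)     # ['weight_0', 'bias_0']
print(var_mode)       # ['weight_0', 'bias_0']
print(var_uncapped)   # ['weight_0', 'bias_0', 'act_0']
```

In this toy setup output_0 is smaller than the cap, so only the IO filter
keeps it out of the candidate set; the activation becomes a candidate only
once the cap is lifted.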
diff --git a/Deeploy/MemoryLevelExtension/OptimizationPasses/MemoryLevelAnnotationPasses.py b/Deeploy/MemoryLevelExtension/OptimizationPasses/MemoryLevelAnnotationPasses.py
@@ -180,6 +180,19 @@ def apply(self, ctxt: NetworkContext, graph: gs.Graph) -> Tuple[NetworkContext,
                 used += _bufferSizeBytes(buf)
 
         from Deeploy.DeeployTypes import ConstantBuffer
+
+        # Build the set of names we must NOT promote even if they live at the
+        # source level: graph IO and the arena meta-buffer.  These have
+        # external semantics (the runtime delivers the input / collects the
+        # output / treats the arena as a malloc bump pointer); flipping their
+        # memory level is meaningless and breaks the runtime contract.
+        ioNames = set()
+        try:
+            for tensor in list(graph.inputs) + list(graph.outputs):
+                ioNames.add(tensor.name)
+        except Exception:
+            pass
+
         candidates: List[Tuple[int, int, int, str, VariableBuffer]] = []
         for name, buf in ctxt.globalObjects.items():
             if not isinstance(buf, VariableBuffer):
@@ -192,6 +205,12 @@ def apply(self, ctxt: NetworkContext, graph: gs.Graph) -> Tuple[NetworkContext,
                 continue
             if self.onlyConstants and not isinstance(buf, ConstantBuffer):
                 continue
+            # Never promote arena meta-buffers or graph IO tensors; the runtime
+            # contract requires them to stay at a fixed memory level.
+            if "MEMORYARENA" in name:
+                continue
+            if name in ioNames or buf.name in ioNames:
+                continue
             if self.excludeNamePatterns and any(pat in name for pat in self.excludeNamePatterns):
                 continue
             if self.maxBufferBytes is not None and _bufferSizeBytes(buf) > self.maxBufferBytes:
diff --git a/Deeploy/TilingExtension/TilerExtension.py b/Deeploy/TilingExtension/TilerExtension.py
@@ -528,11 +528,22 @@ def computeMemoryMap(self, ctxt: NetworkContext, tilingSolution: TilingSolution)
                         memoryMap[memoryLevel][-1], ctxt, None,
                         self.memoryHierarchy.memoryLevels[memoryLevel].size - constantTensorOffset, memoryLevel)
                 else:
+                    # memoryMap[memoryLevel] = inner_scheduler_entries (one per pattern,
+                    # length == len(tilingSolution)) followed by exactly one
+                    # outer_scheduler entry holding global/constant-style buffers.
+                    # The outer entry has no per-pattern node constraint — pass None
+                    # for it (same convention the default-memory-level branch uses).
+                    # Without this, promoting a VariableBuffer into this level
+                    # populated the outer entry and the loop crashed with IndexError
+                    # at tilingSolution[len(tilingSolution)].nodeConstraints[0].
                     for idx, memMap in enumerate(memoryMap[memoryLevel]):
-                        if len(memoryMap[memoryLevel][idx]) != 0:
-                            memoryMap[memoryLevel][idx] = self.minimalloc(
-                                memMap, ctxt, tilingSolution[idx].nodeConstraints[0],
-                                self.memoryHierarchy.memoryLevels[memoryLevel].size - constantTensorOffset, memoryLevel)
+                        if len(memoryMap[memoryLevel][idx]) == 0:
+                            continue
+                        nodeConstraint = (None
+                                          if idx >= len(tilingSolution) else tilingSolution[idx].nodeConstraints[0])
+                        memoryMap[memoryLevel][idx] = self.minimalloc(
+                            memMap, ctxt, nodeConstraint,
+                            self.memoryHierarchy.memoryLevels[memoryLevel].size - constantTensorOffset, memoryLevel)
             log.info(f" {SUCCESS_MARK} Memory allocation successful!")
 
         return memoryMap