
Commit 82323e0

runwangdl and claude committed
fix(promote): skip single-use graph I/O to avoid InitNetwork/closure mismatch
When PromoteTensorsToL2 promotes a graph input/output via the globalObjects VariableBuffer loop added by 18fb78f, codegen ends up in an inconsistent state on inference graphs:

* InitNetwork still emits cl_ram_malloc(input_0), producing a hyperram address. (cl_ram_malloc is necessary because load_file_to_ram only works for hyperram destinations -- L2 destinations mask to 24 bits and hit the 0x800000 OOB previously fixed in 750f1c9 for the training optimizer path.)
* The tiling closure sees _memoryLevel='L2' and emits a single L2->L1 transfer via mchan_transfer_1d (cluster idma) using DeeployNetwork_input_0 + 0 as the source.
* mchan cannot access hyperram, so the firmware polls the DMA STATUS bit forever -> silent hang. (Siracusa has no UART, so this looks identical to "sim is slow".)

CCT_2_32_32_128 inference reproduces this deterministically with --promoteToL2IncludeActivations + --promoteToL2MaxBufferBytes=0; CCT_1_32_32_8 happened to dodge it because graph-topology timing left _memoryLevel='L3' at codegen time, so the L3-aware closure path was emitted.

Fix: skip globalObjects VariableBuffers with len(_users) <= 1:

* Inference input_0 / output_0 are single-use -> skipped. No measurable cycle benefit is lost: each is read/written exactly once, so the L3<->L2 transfer happens regardless of promotion.
* Training weights / grads in globalObjects are multi-use (fwd + bwd + optimizer) -> still promoted; 18fb78f's L2 utilisation gain is preserved.

Side effect: the af39722 revert ("preserve _memoryLevel" lost the ResNet8 -27.8% promote win) is no longer needed in practice -- that paradox was triggered by graph I/O actually promoting and hitting this same codegen mismatch on the training path. With graph I/O skipped, 26f4539's preserve fix can be re-applied without regression. Not done in this commit; left as a follow-up.

Verified:

* CCT_2_32_32_128 inference (was hanging): 47,555,683 cycles, PASSED, 0/10 errors. -17.8% vs the no-promote baseline of 57,827,658.
* ResNet8 training: 233,682,523 cycles/step, PASSED, 0/4 errors. -27.4% vs baseline 321.7M/step -- matches the historical "working" -27.8% target that 26f4539 had paradoxically lost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
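The eligibility filter described above can be modeled as a standalone predicate. This is a sketch, not Deeploy's actual code: `VariableBuffer` here is a minimal stand-in dataclass, and `eligible_for_l2` combines the new single-use skip with the existing `maxBufferBytes` cap.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VariableBuffer:
    # Stand-in for Deeploy's VariableBuffer: only the fields the filter reads.
    name: str
    size_bytes: int
    users: List[str] = field(default_factory=list)

def eligible_for_l2(buf: VariableBuffer, max_buffer_bytes: int) -> bool:
    """Sketch of the promotion filter: skip single-use graph I/O first,
    then apply the size cap (0 disables the cap)."""
    if len(buf.users) <= 1:
        # Inference input_0 / output_0: read or written exactly once,
        # so the L3<->L2 transfer happens regardless -> no benefit.
        return False
    if max_buffer_bytes > 0 and buf.size_bytes > max_buffer_bytes:
        return False
    return True

# Inference input: single consumer -> skipped.
inp = VariableBuffer("input_0", 32 * 32 * 3, users=["Conv_0"])
# Training weight: forward, backward and optimizer all use it -> promoted.
w = VariableBuffer("conv1.weight", 8 * 3 * 3 * 3, users=["fwd", "bwd", "opt"])
print(eligible_for_l2(inp, 0), eligible_for_l2(w, 0))  # prints: False True
```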
1 parent ca32e26 commit 82323e0

1 file changed

Lines changed: 10 additions & 0 deletions

Deeploy/MemoryLevelExtension/OptimizationPasses/MemoryLevelAnnotationPasses.py
@@ -225,6 +225,16 @@ def apply(self, ctxt: NetworkContext, graph: gs.Graph) -> Tuple[NetworkContext,
                 continue
             if 'allocTemplate' in buf.__dict__:
                 continue
+            # Skip single-use graph I/O (inference input_0 / output_0).
+            # InitNetwork allocates these with cl_ram_malloc -> hyperram
+            # regardless of _memoryLevel; the closure that follows assumes
+            # the buffer is in L2 and uses mchan_transfer_1d (cluster idma)
+            # which can only access L1/L2, not hyperram -> firmware polls
+            # the DMA STATUS bit forever -> silent hang.
+            # Training graph I/O (weights/grads) is multi-use and remains
+            # eligible: len(_users) >= 2.
+            if len(buf._users) <= 1:
+                continue
             size = self._bufferSize(buf)
             if self.maxBufferBytes > 0 and size > self.maxBufferBytes:
                 continue
