
Commit 82323e0

runwangdl and claude committed
fix(promote): skip single-use graph I/O to avoid InitNetwork/closure mismatch
When PromoteTensorsToL2 promotes a graph input/output via the globalObjects VariableBuffer loop added by 18fb78f, codegen ends up in an inconsistent state on inference graphs:

* InitNetwork still emits cl_ram_malloc(input_0), producing a hyperram address. (cl_ram_malloc is necessary because load_file_to_ram only works for hyperram destinations -- L2 destinations mask to 24 bits and hit the 0x800000 OOB previously fixed in 750f1c9 for the training optimizer path.)
* The tiling closure sees _memoryLevel='L2' and emits a single L2->L1 transfer via mchan_transfer_1d (cluster idma) using DeeployNetwork_input_0 + 0 as the source.
* mchan cannot access hyperram, so the firmware polls the DMA STATUS bit forever -> silent hang. (Siracusa has no UART, so this looks identical to "sim is slow".)

CCT_2_32_32_128 inference reproduces this deterministically with --promoteToL2IncludeActivations + --promoteToL2MaxBufferBytes=0; CCT_1_32_32_8 happened to dodge it because graph-topology timing left _memoryLevel='L3' at codegen time, so the L3-aware closure path was emitted.

Fix: skip globalObjects VariableBuffers with len(_users) <= 1:

* Inference input_0 / output_0 are single-use -> skipped. No measurable cycle benefit is lost: each is read/written exactly once, so the L3<->L2 transfer happens regardless of promotion.
* Training weights / grads in globalObjects are multi-use (fwd + bwd + optimizer) -> still promoted; 18fb78f's L2 utilisation gain is preserved.

Side effect: the af39722 revert ("preserve _memoryLevel" lost the ResNet8 -27.8% promote win) is no longer needed in practice -- that paradox was triggered by graph I/O actually promoting and hitting this same codegen mismatch on the training path. With graph I/O skipped, 26f4539's preserve fix can be re-applied without regression. Not done in this commit; left as a follow-up.

Verified:

* CCT_2_32_32_128 inference (was hanging): 47,555,683 cycles, PASSED, 0/10 errors. -17.8% vs the no-promote baseline of 57,827,658.
* ResNet8 training: 233,682,523 cycles/step, PASSED, 0/4 errors. -27.4% vs baseline 321.7M/step -- matches the historical "working" -27.8% target that 26f4539 had paradoxically lost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
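The eligibility filter described above can be modeled as a standalone predicate. This is a sketch, not Deeploy's actual code: `VariableBuffer` here is a minimal stand-in dataclass, and `eligible_for_l2` combines the new single-use skip with the existing `maxBufferBytes` cap.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VariableBuffer:
    # Stand-in for Deeploy's VariableBuffer: only the fields the filter reads.
    name: str
    size_bytes: int
    users: List[str] = field(default_factory=list)

def eligible_for_l2(buf: VariableBuffer, max_buffer_bytes: int) -> bool:
    """Sketch of the promotion filter: skip single-use graph I/O first,
    then apply the size cap (0 disables the cap)."""
    if len(buf.users) <= 1:
        # Inference input_0 / output_0: read or written exactly once,
        # so the L3<->L2 transfer happens regardless -> no benefit.
        return False
    if max_buffer_bytes > 0 and buf.size_bytes > max_buffer_bytes:
        return False
    return True

# Inference input: single consumer -> skipped.
inp = VariableBuffer("input_0", 32 * 32 * 3, users=["Conv_0"])
# Training weight: forward, backward and optimizer all use it -> promoted.
w = VariableBuffer("conv1.weight", 8 * 3 * 3 * 3, users=["fwd", "bwd", "opt"])
print(eligible_for_l2(inp, 0), eligible_for_l2(w, 0))  # prints: False True
```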
1 parent ca32e26 commit 82323e0

1 file changed

Lines changed: 10 additions & 0 deletions

Deeploy/MemoryLevelExtension/OptimizationPasses/MemoryLevelAnnotationPasses.py
@@ -225,6 +225,16 @@ def apply(self, ctxt: NetworkContext, graph: gs.Graph) -> Tuple[NetworkContext,
                 continue
             if 'allocTemplate' in buf.__dict__:
                 continue
+            # Skip single-use graph I/O (inference input_0 / output_0).
+            # InitNetwork allocates these with cl_ram_malloc -> hyperram
+            # regardless of _memoryLevel; the closure that follows assumes
+            # the buffer is in L2 and uses mchan_transfer_1d (cluster idma)
+            # which can only access L1/L2, not hyperram -> firmware polls
+            # the DMA STATUS bit forever -> silent hang.
+            # Training graph I/O (weights/grads) is multi-use and remains
+            # eligible: len(_users) >= 2.
+            if len(buf._users) <= 1:
+                continue
             size = self._bufferSize(buf)
             if self.maxBufferBytes > 0 and size > self.maxBufferBytes:
                 continue
