Commit 82323e0
fix(promote): skip single-use graph I/O to avoid InitNetwork/closure mismatch
When PromoteTensorsToL2 promotes a graph input/output via the
globalObjects-VariableBuffer loop added by 18fb78f, codegen ends up in
an inconsistent state on inference graphs:
* InitNetwork still emits cl_ram_malloc(input_0), producing a hyperram
address (cl_ram_malloc is necessary because load_file_to_ram only
works for hyperram destinations -- L2 destinations mask to 24 bits
and hit the 0x800000 out-of-bounds access previously fixed in 750f1c9
for the training optimizer path)
* The tiling closure sees _memoryLevel='L2' and emits a single
L2->L1 transfer via mchan_transfer_1d (cluster iDMA), using
DeeployNetwork_input_0 + 0 as the source
* mchan cannot access hyperram, so the firmware polls the DMA STATUS
bit forever -> silent hang (Siracusa has no UART, so this looks
identical to "sim is slow"); both codegen paths are sketched below
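For context, a minimal Python sketch of the two disagreeing codegen
paths. Buf, the emit_* helpers and l3_tile_transfer are hypothetical
stand-ins for illustration; cl_ram_malloc and mchan_transfer_1d are
the emitted calls named above.

    from dataclasses import dataclass

    @dataclass
    class Buf:
        name: str          # e.g. "DeeployNetwork_input_0"
        size: int          # bytes
        _memoryLevel: str  # 'L2' or 'L3' at codegen time

    def emit_init_network(buf: Buf) -> str:
        # InitNetwork path: graph I/O is always backed by hyperram so
        # that load_file_to_ram (hyperram-only) can fill it.
        return f"{buf.name} = cl_ram_malloc({buf.size});"

    def emit_tile_transfer(buf: Buf) -> str:
        # Tiling-closure path: keys off _memoryLevel alone.
        if buf._memoryLevel == "L2":
            # Cluster DMA: the source must be L2-addressable. The
            # buffer actually lives in hyperram, so the STATUS poll
            # never clears -> the silent hang described above.
            return f"mchan_transfer_1d(l1_buf, {buf.name} + 0, {buf.size});"
        # L3-aware path: emitted while _memoryLevel is still 'L3'.
        return f"l3_tile_transfer(l1_buf, {buf.name}, {buf.size});"

    buf = Buf("DeeployNetwork_input_0", 4096, _memoryLevel="L2")
    print(emit_init_network(buf))   # hyperram address ...
    print(emit_tile_transfer(buf))  # ... handed to an L2-only DMA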
CCT_2_32_32_128 inference reproduces this deterministically with
--promoteToL2IncludeActivations + --promoteToL2MaxBufferBytes=0;
CCT_1_32_32_8 happened to dodge it only because its graph topology
left _memoryLevel='L3' at codegen time, so the L3-aware closure path
was emitted.
Skip globalObjects VariableBuffers with len(_users) <= 1 (see the
sketch after this list):
* Inference input_0 / output_0 are single-use -> skipped. No
measurable cycle benefit is lost: each is read/written exactly once,
so the L3<->L2 transfer happens regardless of promotion.
* Training weights / grads in globalObjects are multi-use
(fwd + bwd + optimizer) -> still promoted; 18fb78f's L2
utilisation gain is preserved.
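A hedged sketch of the guard, in Python. Only globalObjects,
VariableBuffer, _users and _memoryLevel are names from this commit;
the stub class and function shape are illustrative.

    class VariableBuffer:
        # Stand-in for Deeploy's VariableBuffer: _users holds the ops
        # that read/write the tensor, _memoryLevel its placement.
        def __init__(self, name, users, memoryLevel="L3"):
            self.name = name
            self._users = users
            self._memoryLevel = memoryLevel

    def promote_globals_to_l2(globalObjects):
        for buf in globalObjects.values():
            if not isinstance(buf, VariableBuffer):
                continue
            # This commit's fix: single-use buffers are graph I/O
            # (inference input_0/output_0); promoting them saves no
            # cycles and trips the InitNetwork/closure mismatch.
            if len(buf._users) <= 1:
                continue
            # Multi-use buffers (training weights/grads: fwd + bwd +
            # optimizer) keep the L2 promotion from 18fb78f.
            buf._memoryLevel = "L2"

    objs = {
        "input_0": VariableBuffer("input_0", users=["node_0"]),
        "weight_3": VariableBuffer("weight_3", users=["fwd", "bwd", "opt"]),
    }
    promote_globals_to_l2(objs)
    assert objs["input_0"]._memoryLevel == "L3"   # skipped
    assert objs["weight_3"]._memoryLevel == "L2"  # promoted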
Side effect: the af39722 revert ("preserve _memoryLevel" lost the
ResNet8 -27.8% promote win) is no longer needed in practice -- the
paradox was triggered by graph I/O actually promoting and hitting this
same codegen mismatch on the training path. With graph I/O skipped,
26f4539's preserve fix can be re-applied without regression. Not done
in this commit; left as follow-up.
Verified:
* CCT_2_32_32_128 inference (was hanging): 47,555,683 cycles, PASSED,
0/10 errors. -17.8% vs no-promote baseline 57,827,658.
* ResNet8 training: 233,682,523 cycles/step, PASSED, 0/4 errors.
-27.4% vs baseline 321.7M/step (both deltas re-checked below) --
matches the historical "working" -27.8% target that 26f4539 had
paradoxically lost.
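A quick sanity check of those deltas in Python (cycle counts copied
from above; the 321.7M baseline is quoted rounded, so the training
figure is approximate):

    inference_delta = 1 - 47_555_683 / 57_827_658
    training_delta = 1 - 233_682_523 / 321_700_000
    print(f"inference: -{inference_delta:.1%}")  # -17.8%
    print(f"training:  -{training_delta:.1%}")   # -27.4%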
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 file changed: 10 additions & 0 deletions
[diff table not captured; the 10 added lines span new lines 228-237
of the changed file]