@@ -588,3 +588,80 @@ follow-ups:
588588 end-to-end compile-time effect. The dominant open costs are still
589589 ` optimize_kernels ` /cluster specialization and, for ` heavy_IO ` , the IET
590590 async/definitions path.
591+
592+ May 4, 2026 IET ` reuse_efuncs ` drill-down:
593+
594+ - The expensive IET buckets in ` heavy_IO ` (` make_parallel ` ,
595+ ` place_definitions ` , ` _place_transfers ` , ` lower_async_objs ` , and ` process ` )
596+ spend most of their time in the shared ` Graph.apply ` post-processing, not in
597+ the pass bodies themselves. A temporary graph-phase profile showed:
598+ - ` Graph.apply ` total: about ` 7.33 s ` across ` 25 ` calls;
599+ - ` reuse_efuncs ` : about ` 3.93 s ` across ` 5 ` calls;
600+ - pass bodies: about ` 2.17 s ` ;
601+ - ` update_args ` : about ` 0.85 s ` .
602+
603+ - Inside ` reuse_efuncs ` , the hot path is abstraction/signature generation:
604+ - before the new signature cache: ` reuse_efuncs ~3.93 s ` ,
605+ ` abstract_efunc ~1.91 s ` , ` _signature ~1.75 s ` ;
606+ - with IET ` Node._signature() ` memoized per node: ` reuse_efuncs ` drops to
607+ about ` 3.62-3.69 s ` , and ` _signature ` drops to about ` 1.41-1.44 s ` .
608+
609+ - The tested signature-cache patch was deliberately narrow:
610+ IET ` Node ` overrode ` _signature() ` with ` @memoized_meth ` and delegated to
611+ ` Signer._signature() ` , caching the SHA1 signature on the immutable-ish IET
612+ node instance without caching the full CIR string.
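
The shape of the tested patch can be sketched in isolation. This is a minimal illustration, not the Devito code: `ToySigner`, `ToyNode`, and `_cir()` are hypothetical stand-ins, and a plain attribute cache takes the place of `@memoized_meth`. It only shows the idea of caching the SHA1 digest on the node instance while letting the full string be discarded.

```python
import hashlib


class ToySigner:
    """Hypothetical stand-in for a signer base class."""

    def _cir(self):
        raise NotImplementedError

    def _signature(self):
        # Hash the canonical string; only the 40-character digest is returned.
        return hashlib.sha1(self._cir().encode()).hexdigest()


class ToyNode(ToySigner):
    """Hypothetical immutable-ish node; `text` plays the role of the CIR."""

    def __init__(self, text):
        self._text = text
        self.cir_builds = 0  # counts how often the expensive string is built

    def _cir(self):
        self.cir_builds += 1
        return self._text

    def _signature(self):
        # Per-instance cache: keep the digest, never the full string.
        try:
            return self._cached_signature
        except AttributeError:
            self._cached_signature = super()._signature()
            return self._cached_signature


node = ToyNode("for i: a[i] = b[i] + 1")
first = node._signature()
second = node._signature()  # served from the per-instance cache
```

The cache only pays off when `_signature()` is called more than once on the same instance, which is exactly what the multiplicity check measures.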
613+
614+ - Direct multiplicity check on ` heavy_IO ` showed why the patch is not a
615+ meaningful end-to-end win:
616+ - ` _signature() ` calls: ` 180 ` ;
617+ - unique IET nodes: ` 150 ` ;
618+ - repeated calls on the same node: only ` 30 ` ;
619+ - call histogram: ` 121 ` nodes called once, ` 28 ` nodes called twice, ` 1 ` node
620+ called three times.
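
The histogram above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the totals from the reported numbers; `histogram_from_calls` is a hypothetical helper showing one way such a histogram could be tallied from a log of per-call node identifiers, not the instrumentation actually used.

```python
from collections import Counter

# Reported call histogram: {calls on one node: number of such nodes}.
reported = {1: 121, 2: 28, 3: 1}


def cache_surface(histogram):
    """Return (total calls, unique nodes, repeated calls) for a histogram."""
    unique = sum(histogram.values())
    total = sum(calls * nodes for calls, nodes in histogram.items())
    # Every call after the first one on a node is a potential cache hit.
    repeated = total - unique
    return total, unique, repeated


def histogram_from_calls(node_ids):
    """Hypothetical helper: tally a histogram from a log of node ids."""
    per_node = Counter(node_ids)
    return dict(Counter(per_node.values()))


total, unique, repeated = cache_surface(reported)
best_case_hit_rate = repeated / total  # upper bound on the cache hit rate

toy_log = ["a", "b", "a", "c", "a"]
toy_histogram = histogram_from_calls(toy_log)
```

With the reported numbers the best-case hit rate is `30/180`, so at most about one call in six can be served from a per-node cache.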
621+
622+ - The remaining ` abstract_efunc ` body cost is still substantial. A temporary
623+ body-level profile of ` heavy_IO ` showed about ` 150 ` misses and ` 30 ` hits
624+ across the five ` reuse_efuncs ` calls. Miss cost split roughly as:
625+ - ` Uxreplace ` : ` 0.63 s ` ;
626+ - ` abstract_objects ` : ` 0.63 s ` ;
627+ - ` FindSymbols('basics|symbolics|dimensions') ` : ` 0.23 s ` .
628+
629+ - Dropped variants:
630+ - IET ` Node._signature() ` memoization was dropped after the multiplicity
631+ check. There are not enough repeated calls on the same node to justify even
632+ this small cache as a production change;
633+ - filtering identity mappings out of ` abstract_objects ` was slower in
634+ practice; ` abstract_objects ` increased from about ` 0.63 s ` to about
635+ ` 1.62 s ` in the instrumented run, because rebuilding the mapper dominated;
636+ - returning raw CIR from IET ` Node._signature() ` instead of the SHA1 digest
637+ was also rejected. It retains large strings and made the instrumented
638+ profile noisier/worse, without a clear wall-time win.
639+
640+ - Validation and benchmark signal from the rejected signature-cache patch:
641+ - targeted OSS IET/visitor tests still pass:
642+ ` /app/devitopro/submodules/devito/tests/test_iet.py ` and
643+ ` /app/devitopro/submodules/devito/tests/test_visitors.py `
644+ (` 42 passed ` );
645+ - the earlier ` heavy 22.25 s ` combined-run sample was confirmed noisy and
646+ should be ignored.
647+
648+ - May 4 rerun, three combined invocations before and after the signature-cache
649+ patch, same setup (` devitopro-cuda:latest ` , GPU ` 3 ` , ` taskset 0-15 ` ):
650+ - without signature cache:
651+ ` stress-only 10.02/10.03/10.00 s ` (avg ` 10.02 s ` ),
652+ ` heavy 21.29/21.27/21.27 s ` (avg ` 21.28 s ` ),
653+ ` heavy_IO 24.80/24.66/24.63 s ` (avg ` 24.70 s ` );
654+ - with signature cache:
655+ ` stress-only 10.02/10.03/9.98 s ` (avg ` 10.01 s ` ),
656+ ` heavy 21.36/21.40/21.29 s ` (avg ` 21.35 s ` ),
657+ ` heavy_IO 24.55/24.49/24.37 s ` (avg ` 24.47 s ` ).
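
The averages and the before/after deltas can be reproduced from the raw samples. The sketch below uses only the timings listed above; the `summarize` helper and the dictionary layout are illustrative choices.

```python
from statistics import mean

# Wall times in seconds, copied from the May 4 rerun.
without_cache = {
    "stress-only": [10.02, 10.03, 10.00],
    "heavy": [21.29, 21.27, 21.27],
    "heavy_IO": [24.80, 24.66, 24.63],
}
with_cache = {
    "stress-only": [10.02, 10.03, 9.98],
    "heavy": [21.36, 21.40, 21.29],
    "heavy_IO": [24.55, 24.49, 24.37],
}


def summarize(before, after):
    """Map each benchmark to (mean before, mean after, delta, delta in %)."""
    rows = {}
    for name in before:
        b = mean(before[name])
        a = mean(after[name])
        rows[name] = (b, a, a - b, 100.0 * (a - b) / b)
    return rows


summary = summarize(without_cache, with_cache)
for name, (b, a, delta, pct) in summary.items():
    # Negative delta means the run with the signature cache was faster.
    print(f"{name:12s} {b:6.2f} -> {a:6.2f}  delta {delta:+.2f} s ({pct:+.1f}%)")
```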
658+
659+ - Interpretation:
660+ memoizing IET node signatures is not worth keeping. The end-to-end signal is
661+ neutral for ` stress-only ` (about ` 0.01 s ` difference), slightly negative for
662+ ` heavy ` (about ` 0.07 s ` slower), and only mildly positive for ` heavy_IO ` (about ` 0.23 s ` faster, under 1%). The direct multiplicity check
663+ shows the cache surface is tiny: only ` 30/180 ` calls are repeated on the same
664+ node. The next meaningful IET win is unlikely to come from the individual
665+ pass bodies. It would need to reduce repeated ` abstract_efunc ` misses, likely
666+ by making ` reuse_efuncs ` more incremental/cache-aware across successive
667+ ` Graph.apply ` calls.
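
One possible shape for a cache-aware approach is sketched below. It is an assumption-heavy toy, not a proposal for the actual `reuse_efuncs` code: `IncrementalAbstractor`, `ToyEfunc`, and `toy_abstract` are hypothetical, and the sketch assumes that an unchanged efunc is the same Python object across successive calls while a transformed one is rebuilt as a new object.

```python
import weakref


class IncrementalAbstractor:
    """Hypothetical wrapper that remembers abstraction results across calls."""

    def __init__(self, abstract):
        self._abstract = abstract
        # Keyed by object identity; entries vanish when the efunc is collected.
        # The cached value must not hold a strong reference back to its key.
        self._cache = weakref.WeakKeyDictionary()
        self.misses = 0

    def __call__(self, efunc):
        try:
            return self._cache[efunc]
        except KeyError:
            self.misses += 1
            result = self._cache[efunc] = self._abstract(efunc)
            return result


class ToyEfunc:
    """Hypothetical stand-in for an immutable-ish efunc."""

    def __init__(self, body):
        self.body = body


def toy_abstract(efunc):
    # Placeholder for the expensive abstraction step.
    return efunc.body.upper()


abstractor = IncrementalAbstractor(toy_abstract)
efuncs = [ToyEfunc("f0"), ToyEfunc("f1"), ToyEfunc("f2")]

first = [abstractor(e) for e in efuncs]   # three misses
efuncs[1] = ToyEfunc("f1_rebuilt")        # one efunc rebuilt by a later pass
second = [abstractor(e) for e in efuncs]  # only the rebuilt efunc misses
```

Whether this helps in practice depends on how many efuncs survive unchanged between successive `Graph.apply` calls, which the profiles above do not measure.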