Commit acfbcba

Update plan
1 parent 93ff555 commit acfbcba

1 file changed

Lines changed: 77 additions & 0 deletions

compier_optimization_plan.md

@@ -588,3 +588,80 @@ follow-ups:
end-to-end compile-time effect. The dominant open costs are still
`optimize_kernels`/cluster specialization and, for `heavy_IO`, the IET
async/definitions path.

May 4, 2026 IET `reuse_efuncs` drill-down:

- The expensive IET buckets in `heavy_IO` (`make_parallel`,
  `place_definitions`, `_place_transfers`, `lower_async_objs`, and `process`)
  are mostly paying common `Graph.apply` post-processing cost rather than pass
  body cost. A temporary graph-phase profile showed:
  - `Graph.apply` total: about `7.33 s` across `25` calls;
  - `reuse_efuncs`: about `3.93 s` across `5` calls;
  - pass bodies: about `2.17 s`;
  - `update_args`: about `0.85 s`.
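
The kind of temporary graph-phase profile described above could be gathered with a small timing accumulator. The sketch below is only an illustration: `PhaseProfile` and the phase names passed to it are hypothetical, it does not touch Devito, and the nested `sum(...)` calls merely stand in for real pass bodies and post-processing.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseProfile:
    """Hypothetical helper (not part of Devito): wall time and call count per phase."""

    def __init__(self):
        self.seconds = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate even if the wrapped block raises
            self.seconds[name] += time.perf_counter() - start
            self.calls[name] += 1

    def share(self, name, total):
        """Fraction of the time in phase `total` that was spent in phase `name`."""
        return self.seconds[name] / self.seconds[total]


profile = PhaseProfile()

# Stand-in workload: three outer calls, each with a pass body and a
# post-processing step nested inside it
for _ in range(3):
    with profile.phase("Graph.apply"):
        with profile.phase("pass_body"):
            sum(range(1000))
        with profile.phase("reuse_efuncs"):
            sum(range(2000))

for name in ("pass_body", "reuse_efuncs"):
    print(f"{name}: {profile.share(name, 'Graph.apply'):.0%} of Graph.apply")
```

Applied to the figures reported above, `reuse_efuncs` accounts for roughly 54% of `Graph.apply` time (`3.93 / 7.33`), pass bodies for about 30%, and `update_args` for about 12%.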

- Inside `reuse_efuncs`, the hot path is abstraction/signature generation:
  - before the new signature cache: `reuse_efuncs ~3.93 s`,
    `abstract_efunc ~1.91 s`, `_signature ~1.75 s`;
  - with IET `Node._signature()` memoized per node: `reuse_efuncs` drops to
    about `3.62-3.69 s`, and `_signature` drops to about `1.41-1.44 s`.

- The tested signature-cache patch was deliberately narrow:
  IET `Node` overrode `_signature()` with `@memoized_meth` and delegated to
  `Signer._signature()`, caching the SHA1 signature on the immutable-ish IET
  node instance without caching the full CIR string.
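
The shape of that patch can be sketched in isolation. Everything below is a simplified stand-in rather than Devito code: this `memoized_meth` is a toy per-instance memoizer for argument-less methods, and the `Signer`, `Node`, and `_signature_items` shown here are assumptions made only to keep the example self-contained. The point it illustrates is that only the 40-character SHA1 digest is stored on the instance, never the full text that was hashed.

```python
import hashlib
from functools import wraps


def memoized_meth(func):
    """Toy stand-in: cache the result of an argument-less method on the instance."""
    attr = "_memo_" + func.__name__

    @wraps(func)
    def wrapper(self):
        try:
            return self.__dict__[attr]
        except KeyError:
            value = self.__dict__[attr] = func(self)
            return value

    return wrapper


class Signer:
    """Assumed base class: hashes a textual description of the object."""

    def _signature_items(self):
        raise NotImplementedError

    def _signature(self):
        text = "".join(str(i) for i in self._signature_items())
        # Only the digest is returned; the (possibly large) text is discarded
        return hashlib.sha1(text.encode()).hexdigest()


class Node(Signer):
    def __init__(self, body):
        self.body = body
        self.computed = 0  # counts how often the expensive path actually runs

    def _signature_items(self):
        self.computed += 1
        return (type(self).__name__, self.body)

    @memoized_meth
    def _signature(self):
        # Delegate to Signer._signature(); the decorator caches the digest
        return super()._signature()


node = Node("for i: u[i] = v[i] + 1")
first = node._signature()
second = node._signature()
print(first == second, node.computed)
```

A cache of this form only pays off when `_signature()` is called more than once on the same node instance.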

- Direct multiplicity check on `heavy_IO` showed why the patch is not a
  meaningful end-to-end win:
  - `_signature()` calls: `180`;
  - unique IET nodes: `150`;
  - repeated calls on the same node: only `30`;
  - call histogram: `121` nodes called once, `28` nodes called twice, `1` node
    called three times.
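
A multiplicity check of this kind can be done by counting calls per object identity. The sketch below is illustrative only: `record_signature_call` is a hypothetical instrumentation hook, and the plain `object()` instances replay the reported histogram instead of real IET nodes.

```python
from collections import Counter

calls = Counter()


def record_signature_call(node):
    # Hypothetical hook: would be invoked at the top of `_signature()`
    calls[id(node)] += 1


# Replay the reported histogram with placeholder objects; the list keeps
# them alive so that their ids stay unique
nodes = [object() for _ in range(150)]
multiplicities = [1] * 121 + [2] * 28 + [3] * 1
for node, n in zip(nodes, multiplicities):
    for _ in range(n):
        record_signature_call(node)

total_calls = sum(calls.values())             # 180
unique_nodes = len(calls)                     # 150
repeated_calls = total_calls - unique_nodes   # 30
histogram = Counter(calls.values())           # {1: 121, 2: 28, 3: 1}
max_hit_rate = repeated_calls / total_calls   # upper bound for a per-node cache

print(total_calls, unique_nodes, repeated_calls, f"{max_hit_rate:.1%}")
```

The three figures are mutually consistent: `121 + 2*28 + 3*1 = 180` calls on `150` nodes, so a per-node cache can serve at most `30` of `180` calls, about 17%.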

- The remaining `abstract_efunc` body cost is still substantial. A temporary
  body-level profile of `heavy_IO` showed about `150` misses and `30` hits
  across the five `reuse_efuncs` calls. Miss cost split roughly as:
  - `Uxreplace`: `0.63 s`;
  - `abstract_objects`: `0.63 s`;
  - `FindSymbols('basics|symbolics|dimensions')`: `0.23 s`.

- Dropped variants:
  - IET `Node._signature()` memoization was dropped after the multiplicity
    check. There are not enough repeated calls on the same node to justify even
    this small cache as a production change;
  - filtering identity mappings out of `abstract_objects` was slower in
    practice; `abstract_objects` increased from about `0.63 s` to about
    `1.62 s` in the instrumented run, because rebuilding the mapper dominated;
  - returning raw CIR from IET `Node._signature()` instead of the SHA1 digest
    was also rejected. It retains large strings and made the instrumented
    profile noisier and slower, without a clear wall-time win.

- Validation and benchmark signal from the rejected signature-cache patch:
  - targeted OSS IET/visitor tests still pass:
    `/app/devitopro/submodules/devito/tests/test_iet.py` and
    `/app/devitopro/submodules/devito/tests/test_visitors.py`
    (`42 passed`);
  - the earlier `heavy 22.25 s` combined-run sample was confirmed noisy and
    should be ignored.

- May 4 rerun, three combined invocations before and after the signature-cache
  patch, same setup (`devitopro-cuda:latest`, GPU `3`, `taskset 0-15`):
  - without signature cache:
    `stress-only 10.02/10.03/10.00 s` (avg `10.02 s`),
    `heavy 21.29/21.27/21.27 s` (avg `21.28 s`),
    `heavy_IO 24.80/24.66/24.63 s` (avg `24.70 s`);
  - with signature cache:
    `stress-only 10.02/10.03/9.98 s` (avg `10.01 s`),
    `heavy 21.36/21.40/21.29 s` (avg `21.35 s`),
    `heavy_IO 24.55/24.49/24.37 s` (avg `24.47 s`).
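
The averages and before/after deltas can be rechecked directly from the samples listed above. This is a small arithmetic sketch; the dictionary layout and variable names are just for illustration.

```python
from statistics import mean

# Wall times in seconds, copied from the three combined invocations above
without_cache = {
    "stress-only": [10.02, 10.03, 10.00],
    "heavy": [21.29, 21.27, 21.27],
    "heavy_IO": [24.80, 24.66, 24.63],
}
with_cache = {
    "stress-only": [10.02, 10.03, 9.98],
    "heavy": [21.36, 21.40, 21.29],
    "heavy_IO": [24.55, 24.49, 24.37],
}

# Negative delta means the signature cache made the workload faster
delta = {
    name: mean(with_cache[name]) - mean(without_cache[name])
    for name in without_cache
}

for name, d in delta.items():
    print(f"{name}: {mean(without_cache[name]):.2f} s -> "
          f"{mean(with_cache[name]):.2f} s ({d:+.2f} s)")
```

The deltas come out at about `-0.01 s` for `stress-only`, `+0.07 s` for `heavy`, and `-0.23 s` for `heavy_IO`.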

- Interpretation:
  memoizing IET node signatures is not worth keeping. The end-to-end signal is
  neutral for `stress-only`, neutral/slightly negative for `heavy`, and only
  mildly positive for `heavy_IO` (`~0.23 s`). The direct multiplicity check
  shows the cache surface is tiny: only `30/180` calls are repeated on the same
  node. The next meaningful IET win is unlikely to come from the individual
  pass bodies. It would need to reduce repeated `abstract_efunc` misses, likely
  by making `reuse_efuncs` more incremental/cache-aware across successive
  `Graph.apply` calls.
