Benchmark nomenclature used below:
- light: test_profile_eiso_stress_like_schedule_infos
- heavy: test_profile_etti_velocity_then_stress_like_schedule_infos
- heavy_IO: test_profile_etti_velocity_then_stress_like_bitcomp_serial_schedule_infos (the heavy operator plus bitcomp compression and serialization)
The latest two OSS commits moved the paired PRO/OSS branch beyond the previous
April 21/22 ~26 s heavy-compile checkpoint:
- fc755479d (compiler: Split into EqBlock and Cluster)
  - split the structural Cluster payload into cached EqBlock objects while preserving Cluster identity semantics
  - made IREq hashing/equality include IR metadata via _hashable_content, with the reusable as_hashable helper, so EqBlock caching does not merge equations that only differ in ispace, conditionals, implicit_dims, or operation
  - measured impact after correctness fixes:
    - stress-only: about 11.0 s
    - heavy velocity+stress: about 24.4 s
    - heavy optimize_kernels: about 11.3-11.4 s
  - validation at this point included the targeted EqBlock repros, nearby equation/visitor/CSE tests, and a full OSS suite: 3549 passed, 5 skipped, 4 xfailed, 1 xpassed
- 771f807ab (compiler: Stash hash were essential for compilation performance)
  - added cached_hash, which stashes immutable-object __hash__ results in _mhash
  - applied it to the hottest hash sites from the profiling investigation: support-space objects (Interval, IntervalGroup, IterationInterval, IterationSpace, DataSpace, and IterationDirection) plus Cluster queue keys (Prefix) and ClusterGroup
  - removed the generic Space.__hash__ path and made subclasses hash their concrete payloads directly, avoiding shared base/subclass hash-cache ambiguity
  - measured impact after this commit:
    - stress-only: 10.45-10.57 s, with optimize_kernels 3.30-3.43 s
    - heavy velocity+stress: 22.82-22.89 s, with optimize_kernels 10.43-10.50 s
  - validation at this point included test_lower_clusters.py + test_ir.py, the four EqBlock/equation repros, and the two benchmark probes above; a full OSS suite has not yet been rerun after this second commit
Current practical checkpoint for the heavy benchmark is therefore now about
22.8-22.9 s, not the older 25.8-26.0 s plateau. Relative to the previous
April 22 plateau, the latest two commits are worth roughly 3 s on the heavy
compile, with the larger second-step gain coming from cached support-space and
Cluster queue hashing.
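The hash-stashing idea behind the second commit can be sketched as follows. This is a minimal illustration, not the code from 771f807ab: only the names cached_hash and _mhash come from the note, while the decorator body, the ToyInterval class, and its fields are assumptions made for the example.

```python
import functools


def cached_hash(func):
    """Compute __hash__ once per object and stash it in `_mhash`.

    Hypothetical sketch: only the names `cached_hash` and `_mhash` come
    from the note; the body is an assumption.
    """
    @functools.wraps(func)
    def wrapper(self):
        try:
            return self._mhash
        except AttributeError:
            # object.__setattr__ bypasses any custom __setattr__ that an
            # immutable-style class might define
            object.__setattr__(self, '_mhash', func(self))
            return self._mhash
    return wrapper


class ToyInterval:
    """Stand-in for an immutable support-space object (not Devito's Interval)."""

    hash_computations = 0  # how many times the real hash body ran

    def __init__(self, dim, lower, upper):
        self.dim, self.lower, self.upper = dim, lower, upper

    def __eq__(self, other):
        return (isinstance(other, ToyInterval) and
                (self.dim, self.lower, self.upper) ==
                (other.dim, other.lower, other.upper))

    @cached_hash
    def __hash__(self):
        ToyInterval.hash_computations += 1
        return hash((self.dim, self.lower, self.upper))


interval = ToyInterval('x', -2, 2)
seen = {hash(interval) for _ in range(1000)}
print(ToyInterval.hash_computations, len(seen))
```

The cache lives on the instance, so it is only safe when the hashed payload never changes after construction. Defining the decorated __hash__ on each concrete class, rather than on a shared base, mirrors the note's point about avoiding base/subclass hash-cache ambiguity.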
More recent paired PRO/OSS reruns on April 21, 2026 with PRO
faster-python-1 at b770aaee, OSS /home/fl1612/devito-faster-python-1 at
30715c026, devitopro-cuda:latest, --taskset 0-15, --deviceid 3, and
the two schedule_infos probes in
devitopro/tests/test_kernelopt_nogil_tmp.py reproduced the current practical
checkpoint as:
- test_profile_eiso_stress_like_schedule_infos: 5.81-5.83 s
- test_profile_etti_velocity_then_stress_like_schedule_infos: 25.81-26.21 s
At that point, the paired velocity+stress checkpoint for this branch family
was still about 26 s; the older 32.00 s value below is only a historical
April 10 milestone, and the April 28 section above supersedes both numbers for
the current branch.
Fresh clean-HEAD reruns on April 22, 2026 with PRO faster-python-1 at
b770aaee, OSS /home/fl1612/devito-faster-python-1 at c44c30339,
devitopro-cuda:latest, --taskset 0-15, --deviceid 3, and the same
temporary PRO harness confirm that the practical checkpoint for the current
pair is still:
- light probe: still about 5.8 s
- heavy test_profile_etti_velocity_then_stress_like_schedule_infos: still about 25.8-26.0 s (25.80 s, 25.99 s in fresh reruns)
Latest findings from the PRO-side deep profiling on this same pair:
- isolated search_rotation and search_shm postponement tweaks were both flat/noisy when rerun independently and were reverted
- more invasive scheduler and node-caching refactors were also dropped after either code-complexity growth or measurable regressions
- explicit per-kernel accounting on the heavy benchmark shows the dominant remaining cost is the initial stress kernel lineage, not the velocity kernel:
  - helper lineage: 0.402522 s
  - velocity lineage: 0.585283 s
  - stress lineage: 11.393894 s
- splitting the first optimize() pass by initial kernel yields:
  - helper (r0..r5): 0.315156 s
  - velocity (v_x, v_y, v_z): 0.426755 s
  - stress (tau_*): 10.524947 s
- the stress-kernel optimize() time is currently dominated by:
  - action_apply: 5.570871 s
  - minimize_barrier_likelihood: 2.337799 s
  - reschedule: 2.169853 s
  - with 68 action steps, 69 schedule calls, and an average per-schedule state of about 229.7 IDG nodes and 30.4 queued actions
- current IET-side coarse buckets on the heavy probe are still:
  - make_parallel ~= 1.99 s
  - place_definitions ~= 0.87 s
  - _place_transfers ~= 0.86 s
  - linearization ~= 0.37 s
Current interpretation:
the current pair looks close to a local plateau on this benchmark family. The
next step should therefore be to move up in benchmark complexity rather than
keep forcing increasingly marginal scheduler micro-optimizations on the current
velocity+stress case.
First result after moving up in benchmark complexity on the PRO scratch harness:
- new heavy_IO benchmark: test_profile_etti_velocity_then_stress_like_bitcomp_serial_schedule_infos
- construction: extend the current heavy velocity+stress operator with one compressed, serialized saved TimeFunction per tau_* component and an extra Eq(tau_*save, tau_*.forward) per component
- first pinned compile result:
  - total compile: 30.44 s
  - lowering.Clusters: 18.74 s
  - optimize_kernels: 12.55 s
  - lowering.IET: 9.39 s
  - specializing.IET: 8.49 s
  - main new IET-side buckets: lower_async_objs ~= 1.29 s, place_definitions ~= 1.27 s, linearization ~= 1.12 s, _place_transfers ~= 0.92 s
Interpretation:
the first heavier benchmark does become meaningfully slower, but the extra cost
lands primarily in IET / serialization-lowering rather than in the already
stress-dominated kernelopt slice.
Older measured compile times on April 10, 2026 with PRO faster-python-1,
OSS faster-python-1, devitopro-cuda:latest, --taskset 0-15, and
--deviceid 0:
- test_profile_etti_stress_like_schedule_infos: 13.63 s
- test_profile_etti_velocity_then_stress_like_schedule_infos: 32.00 s
Compared with the March 30, 2026 no-cache baseline on the same probe family:
- stress-like: 29.99 s -> 13.63 s (-16.36 s, about 54.5%)
- velocity+stress: 98.49 s -> 32.00 s (-66.49 s, about 67.5%)
Compared with the later retrieve-accesses-only replay on the same branch family:
- stress-like: 29.32 s -> 13.63 s (-15.69 s)
- velocity+stress: 93.16 s -> 32.00 s (-61.16 s)
Compared with the pre-TimedAccess landed branch state:
- stress-like: 24.65 s -> 13.63 s (-11.02 s)
- velocity+stress: 79.68 s -> 32.00 s (-47.68 s)
Compared with the pre-space/rebuild branch (3f3017e46,
compiler: Augment caching and memoization):
- stress-like: 27.45 s -> 13.63 s (-13.82 s)
- velocity+stress: 84.56 s -> 32.00 s (-52.56 s)
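The deltas in the four April 10 comparisons above can be recomputed with a few lines of Python. The timings are copied from this note; the dictionary layout and the improvement helper are illustrative names, not part of any harness.

```python
# April 10 compile times (seconds) for the two probes
FINAL = {'stress-like': 13.63, 'velocity+stress': 32.00}

# Earlier reference points quoted in this note
BASELINES = {
    'no-cache (March 30)': {'stress-like': 29.99, 'velocity+stress': 98.49},
    'retrieve-accesses-only replay': {'stress-like': 29.32, 'velocity+stress': 93.16},
    'pre-TimedAccess': {'stress-like': 24.65, 'velocity+stress': 79.68},
    'pre-space/rebuild (3f3017e46)': {'stress-like': 27.45, 'velocity+stress': 84.56},
}


def improvement(before, after):
    """Return (seconds saved, percent saved relative to `before`)."""
    saved = before - after
    return round(saved, 2), round(100 * saved / before, 1)


for label, probes in BASELINES.items():
    for probe, before in probes.items():
        saved, pct = improvement(before, FINAL[probe])
        print(f"{label:32s} {probe:16s} "
              f"{before:6.2f} s -> {FINAL[probe]:5.2f} s (-{saved:.2f} s, {pct:.1f}%)")
```

Keeping the raw timings in one table makes it easy to extend the comparison when a new checkpoint lands, instead of recomputing the deltas by hand.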
The current landed branch now covers:
- narrow helper memoization and finite-difference evaluation caching (3f3017e46)
- Scope/access-inventory caching and lazy function-view reuse (3f3017e46)
- conservative space-object caching and no-op Cluster.rebuild() reuse (e16f222e1)
- cached TimedAccess construction and reuse of its per-instance distance cache across repeated Scope builds (fd850927a)
- synthetic Scope.from_scopes(...) construction from cached access summaries, plus fusion-hazard analysis over those synthetic scopes rather than fresh Scope(exprs0 + exprs1) rescans
- bounded derivative-driven topofusion in lower_index_derivatives, using the maximum nested IndexDerivative.depth as an upper bound on the number of productive toposort='nofuse' rounds before the final plain fuse(False)
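The no-op Cluster.rebuild() reuse listed above amounts to returning the existing object when nothing would change. A minimal sketch, assuming a toy class with two payload fields; ToyCluster and its attributes are illustrative and do not reproduce the real Cluster API:

```python
class ToyCluster:
    """Toy stand-in for a Cluster; field names are illustrative only."""

    def __init__(self, exprs, ispace):
        self.exprs = exprs
        self.ispace = ispace

    def rebuild(self, exprs=None, ispace=None):
        # Fall back to the current payload for anything not supplied
        exprs = self.exprs if exprs is None else exprs
        ispace = self.ispace if ispace is None else ispace
        # Identity (not equality) check: cheap, and it keeps downstream
        # caches keyed on the same object valid
        if exprs is self.exprs and ispace is self.ispace:
            return self
        return ToyCluster(exprs, ispace)


c0 = ToyCluster(exprs=('eq0', 'eq1'), ispace='ispace0')
c1 = c0.rebuild()                      # nothing changes: same object back
c2 = c0.rebuild(exprs=('eq0',))        # payload changes: new object
print(c1 is c0, c2 is c0)
```

Using identity rather than equality keeps the fast path cheap, at the price of missing cases where an equal but distinct payload is passed in.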
Current profiled bottlenecks on the landed branch:
- stress-like: lowering.Clusters ~= 8.45 s, lowering.IET ~= 3.64 s, optimize_kernels ~= 6.12 s, fuse ~= 0.53 s
- velocity+stress: lowering.Clusters ~= 23.82 s, lowering.IET ~= 5.64 s, specializing.Clusters ~= 19.24 s, fuse ~= 2.71 s
- hottest Cluster-side buckets on velocity+stress: optimize_kernels ~= 18.17 s, with the remaining clean OSS cluster-side work still dominated by fusion/topofusion rather than the derivative lowering wrapper itself
- hottest IET-side buckets on velocity+stress: make_parallel ~= 1.59 s, place_definitions ~= 1.33 s, _place_transfers ~= 0.86 s, linearization ~= 0.37 s, _generate_macros ~= 0.24 s, minimize_symbols ~= 0.25 s, optimize_halospots ~= 0.22 s
Validation status of the latest derivative-topofuse heuristic:
- targeted OSS sensitivity checks around test_unexpansion.py::{test_v3, test_v4, test_v5} passed
- the previously failing PRO CUDA regression in compressed layered MPI serialization turned out to be unrelated to compilation changes; it was an NVIDIA_VISIBLE_DEVICES / implicit deviceid correctness bug and is now fixed
- current PRO tests/test_gpu_lang.py::TestKernelOptDefault::test_flip_for_canonical_ordering is failing on the faster-python-1 PRO/OSS pair, but the failure hits the baseline op0.apply(...) path with an undefined npthreads0 symbol in generated CUDA, so it currently looks like an OSS-side issue unrelated to the derivative-topofuse / dsequences() changes
- a full fresh OSS + PRO sweep has not yet been rerun after the current derivative-topofuse heuristic
This temporary note captures the main OSS-side compilation optimizations explored in iteration 0. The list below is intentionally in ascending order of complexity: smaller and safer caching/micro-optimization ideas come first, while broader algorithmic and threading changes come later.
1. Cache tiny pure helper results and other stable scalar metadata first.
   Completed in a narrow form in 3f3017e46 (compiler: Augment caching and memoization) via cached IndexDerivative.pivot, memoized Derivative._eval_fd, and shared numeric-weight reuse.
   Performance: this helper bucket was not isolated cleanly from point 5 in the squashed branch, but it is part of the landed 29.32 s -> 27.45 s and 93.16 s -> 84.56 s move.
   Rationale: these changes are local, easy to reason about, and usually do not alter the structure of the compiler pipeline.
2. Preserve identity on no-op symbolic and visitor rewrites.
   Completed in a narrow compiler-local form in e16f222e1 (compiler: Augment caching and tweak memoization heuristics) via Cluster.rebuild() returning self when all effective rebuild inputs are already identical objects.
   Performance: not isolated cleanly from point 9 below; the combined landed diff moved the probes from 27.45 s -> 24.65 s and 84.56 s -> 79.68 s.
   Rationale: this is still fairly contained work, but it starts touching generic traversal machinery that is used in many places.
3. Specialize traversal-heavy symbol discovery before changing higher-level algorithms.
   Relevant iteration-0 commits: e450f0546 (ir: trim findsymbols stack overhead), ceeb42689 (ir: specialize findsymbols traversal), ff8d9efcc (symbolics: trim IET traversal overhead).
   Rationale: these changes stay within existing traversal semantics, but they attack some of the hottest generic walks in lowering.
4. Reuse already computed inventories in IET cleanup and callable deduplication.
   Relevant iteration-0 commits: f91d9256f (CODEX: ITER 6, better caller tracking and cheaper param drops), b05dd2084 (iet: reuse symbol inventory in parameter updates), c26dbc3e6 (WIP, shared DataManager inventory collection and reuse_efuncs caches), 96ff77a94 (iet: Prune reuse_efuncs by name family).
   Current replay status: still WIP and intentionally not landed.
   April 7, 2026 replay findings on the current iteration-1 branch: rebuilding the non-WIP subset (b05dd2084 + f91d9256f + 96ff77a94) was correct on targeted test_iet.py / DSE checks, but the payoff was small relative to the extra engine/utils complexity.
   Performance: b05dd2084 alone was flat-to-worse on the probes (23.16 s -> 23.16 s and 72.34 s -> 73.66 s). Adding the two non-WIP engine.py follow-ups improved that to 23.16 s -> 22.01 s and 72.34 s -> 72.03 s. The subset was therefore dropped rather than landed: the light probe moved nicely, but the heavy probe improved by only about 0.31 s.
   Rationale: this is the first bucket that spans multiple IET passes and shared helper caches, so it is more invasive than the previous purely local fast paths.
5. Cheapen Scope construction and pairwise dependence pre-checks used by fusion/topofusion.
   Completed in 3f3017e46 via memoized retrieve_accesses, lazy cached IREq read/write inventories, and reuse of cached function views in Scope, Cluster.traffic, Expression, and Operator.
   Performance: the landed cache/memoization batch moved the probes from 29.32 s -> 27.45 s and 93.16 s -> 84.56 s. A narrower mid-iteration replay of the Scope/access portion alone had already reached roughly 27.65 s and 86.21 s. A later TimedAccess follow-up in fd850927a moved the landed branch further from 24.65 s -> 23.16 s and 79.68 s -> 72.34 s, for a total point-5-aligned move of roughly 29.32 s -> 23.16 s and 93.16 s -> 72.34 s.
   Rationale: these changes keep the same broad fusion algorithm, but they start replacing repeated rescans with cached summaries and synthetic scopes.
6. Replace repeated generic fusion-hazard walks with focused hazard summaries, and tighten derivative-driven rescans.
   Relevant iteration-0 commits: 8c2e76a99 (CODEX: ITER 5, fusion_hazards summary), 024de93a2 (clusters: Cheapen derivative topofusion hazards), 0abbe2cb9 (clusters: Restrict derivative nofuse rescans).
   Completed on the current branch in a simpler form than the original iteration-0 patches: fusion hazard analysis now reuses the already-cached per-ClusterGroup Scope inventories and synthesizes cross-scope dependences via Scope.from_scopes(...), instead of repeatedly constructing fresh Scope(exprs0 + exprs1) objects from raw expressions. The derivative side is also now bounded: lower_index_derivatives runs at most max_depth toposort='nofuse' rounds, where max_depth is the maximum nested IndexDerivative.depth across the input clusters, and then finishes with the usual plain fuse(False).
   Performance: compared with the pre-fusion landed state, the probes moved from 23.16 s -> 13.63 s and 72.34 s -> 32.00 s.
   Rationale: this turned out to be the dominant remaining algorithmic win after the earlier caching groundwork was in place. The essential gain is sparing repeated expression rescans during fusion/topofusion legality checks.
   Deferred April 17, 2026 follow-up: while profiling the PRO heavy velocity_then_stress compile on the paired OSS/PRO faster-python-1 worktrees, minimize_barrier_likelihood consistently spent about 2.5-2.6 s inside fuse(toposort=True), with _build_dag and _fusion_hazards dominating that cost. A trial OSS patch in Fusion._build_dag skipped _fusion_hazards for unfenced ClusterGroup pairs whose scopes cannot possibly interact (cg0.scope.writes.keys().isdisjoint(cg1.scope.functions) and vice versa).
   Measured effect: _fusion_hazards calls dropped from about 47k to about 5.7k, and the barrier-minimization slice improved by about 0.12-0.17 s, but the end-to-end heavy compile-time win was noisy and marginal. Focused OSS topofusion/barrier tests passed, but the change was still deferred rather than landed.
   Why deferred: this is exactly the kind of fast path that is easy to justify locally but hard to value globally. The measured win is real but small, and fusion/toposort is regression-prone enough that carrying extra control-flow in this area should require a clearer compile-time payoff.
   If revisited later: keep the prefilter in _build_dag, not inside _fusion_hazards. Moving it into _fusion_hazards would still pay the function-call and memoization overhead that the experiment was specifically avoiding, while fenced is a _build_dag scheduling concern rather than a property of the pairwise hazard relation itself.
7. Add concurrency inside expression lowering only after the single-threaded fast paths are understood.
   Relevant iteration-0 commits: cd8bbec49 (equations: Thread per-expression lowering), 0f8d775c3 (operator: Thread expression evaluation).
   Rationale: threading can move the needle, but it also introduces option plumbing, scheduling questions, and failure modes that are harder to debug than the earlier single-threaded wins.
8. Add concurrency inside fusion/toposort last.
   Relevant iteration-0 commits: e94ee8b52 (CODEX: ITER 7, fuse-workers and threaded DAG row building).
   Rationale: this depends on the earlier Scope and hazard-summary work, and it sits in a particularly regression-prone area of the compiler.
9. Treat aggressive object/space caching as a late experiment, not an initial iteration-1 target.
   Completed in a conservative form in e16f222e1 via cached Interval/IterationSpace-family objects, immutable/hashable Properties, Prefix._preprocess_args, and the no-op Cluster.rebuild() fast path above.
   Performance: compared with the pre-space/rebuild branch, the landed diff moved the probes from 27.45 s -> 24.65 s and 84.56 s -> 79.68 s.
   Rationale: iteration 0 showed that this class of optimization can improve compile-time behavior, but it also showed that the semantic risk is high enough that it should not be part of the first iteration-1 subset.
   April 28, 2026 landed follow-up: the current branch now extends this bucket with the EqBlock/Cluster split in fc755479d and cached immutable-object hashes in 771f807ab. This is the first object-caching follow-up in a while that clearly moved the main heavy benchmark rather than only shaving noise: the heavy velocity+stress probe moved from the previous 25.8-26.0 s plateau through about 24.4 s after EqBlock caching, then to 22.8-22.9 s after cached support-space and Cluster queue hashes. The cached_hash result also confirms that repeated hashing was a real compile-time cost, not just profiling noise.
Regression-fix commits such as cc6ee524a, 6bc7ea1fd, 9014e0ad0, and
d8981b0de are intentionally not part of the ordered list above. They matter
for keeping iteration 0 green, but they are correctness follow-ups rather than
the primary optimization ideas to replay in iteration 1.
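For reference, the deferred April 17 prefilter described under point 6 of the list above boils down to a symmetric disjointness test on cached scope inventories. The sketch below uses a toy scope type; ToyScope, may_interact, and the sample function names are hypothetical, and only the writes/functions disjointness check itself is taken from the note.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ToyScope:
    """Toy access summary: function name -> list of accesses (unused here)."""
    writes: dict
    reads: dict

    @property
    def functions(self):
        # Every function touched by this scope, read or written
        return set(self.writes) | set(self.reads)


def may_interact(scope0, scope1):
    """True if either scope writes a function that the other one touches."""
    return not (scope0.writes.keys().isdisjoint(scope1.functions) and
                scope1.writes.keys().isdisjoint(scope0.functions))


scopes = {
    'velocity': ToyScope(writes={'v_x': []}, reads={'tau_xx': []}),
    'stress': ToyScope(writes={'tau_xx': []}, reads={'v_x': []}),
    'helper': ToyScope(writes={'r0': []}, reads={'m': []}),
}

# Only pairs that pass the prefilter would go on to full hazard analysis
candidates = [(a, b) for a, b in combinations(scopes, 2)
              if may_interact(scopes[a], scopes[b])]
print(candidates)
```

In this toy setup only the velocity/stress pair survives; read-only sharing between two scopes does not count as an interaction, which is why the test looks at writes on each side.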
April 22, 2026 IET / bitcomp+serialization (heavy_IO) follow-up:
- New heavy_IO PRO scratch benchmark: start from the current heavy velocity_then_stress case and add one bitcomp+serialized saved TimeFunction per tau_* component.
- Paired clean baseline: about 30.3-31.2 s total compile, with optimize_kernels still around 12.5-12.7 s and the extra cost landing primarily in lowering.IET (~9.4-9.7 s).
- Profiling conclusions: lower_async_objs scanning is not the dominant new cost; the more relevant IET-side work is in update_args and in the second place_definitions pass triggered after pthreadify.
- Reverted experiment 1: simplify engine.py:update_args by collapsing the separate FindSymbols('basics') / FindSymbols('symbolics') scans and computing drop_params directly by index.
  Result: the compile-time probe looked mildly positive/noisy, but narrow compressed-layer runtime tests failed with the same nbytes_avail_mapper / deviceid=-1 breakage, so this is not safe as-is.
- Reverted experiment 2: after pthreadify, rerun place_definitions only on callables touched by async lowering rather than across the whole graph.
  Result: this was the strongest local compile-time signal in the new heavy_IO benchmark: the second-epoch place_definitions visits dropped from 31 to 5, and the heavy compile moved into roughly the 30.2-30.4 s band. However, the same compressed-layer runtime tests failed, so the idea was reverted as well.
- Current recommendation: treat the async/definitions area as the right place to look for the new heavier benchmark, but do not carry either of the above optimizations without a stronger correctness story. The paired OSS worktree should stay at clean HEAD.
- April 22 late follow-up: a narrower post-pthreadify rerun of place_definitions does look viable after all. Instead of revisiting the whole graph, the current worktree now reruns the pass only on async-owned callables: the transformed ThreadCallables, the helper callables (activate*, init_sdata*, shutdown*), and callers that reference those helpers. This is implemented by allowing Graph.apply(..., targets=...) and passing the selected names from pthreadify.
  Validation: the CPU layered async cases tests/test_layered_funcs.py::TestSerialization::test_diskhost[...] with buf-async-degree=1 still pass on the paired worktrees. The higher-degree buf-async-degree=4 variant remains baseline-red because of the pre-existing npthreads0 codegen issue, so it is not a useful gate.
  Performance: on the heavy_IO bitcomp+serialization benchmark, the second-epoch place_definitions visits shrink from 31 down to 5, and the local IET bucket improves from about 1.80 s to about 1.57-1.66 s. End-to-end compile time is a small but repeatable win on the latest paired reruns, moving from roughly 30.54 s to about 30.24-30.49 s.
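The targeted rerun from the late follow-up can be pictured with a toy graph that applies a pass either to every callable or only to a named subset. ToyGraph, its visits counter, and the callable names are all hypothetical; the real Graph.apply(..., targets=...) signature and semantics are not reproduced here.

```python
class ToyGraph:
    """Toy call graph: callable name -> body. Not Devito's Graph."""

    def __init__(self, efuncs):
        self.efuncs = dict(efuncs)
        self.visits = 0  # how many callables any pass has visited so far

    def apply(self, func, targets=None):
        # With no targets, visit everything; otherwise only the named subset
        names = [n for n in self.efuncs if targets is None or n in targets]
        for name in names:
            self.efuncs[name] = func(self.efuncs[name])
            self.visits += 1


def place_definitions(body):
    # Stand-in pass: tag the body so the effect is observable
    return body + ['definitions placed']


graph = ToyGraph({
    'Kernel': [], 'tau_kernel': [], 'v_kernel': [],
    'thread_func0': [], 'activate0': [], 'init_sdata0': [], 'shutdown0': [],
})

graph.apply(place_definitions)                   # first epoch: whole graph
async_owned = {'thread_func0', 'activate0', 'init_sdata0', 'shutdown0'}
graph.apply(place_definitions, targets=async_owned)   # second epoch: subset
print(graph.visits)
```

The correctness burden of such a scheme sits in choosing the target set: a callable left out of it silently keeps stale definitions, which is consistent with the note's caution about the earlier reverted attempt.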
April 30, 2026 IET memoization / no-op rebuild follow-up:
- Baseline before this IET-focused patch series:
  - heavy: 21.99 s, 21.86 s, 22.26 s; average 22.04 s. lowering.IET: 5.55 s, 5.49 s, 5.89 s; average 5.64 s.
  - heavy_IO: 26.38 s, 26.42 s, 26.51 s; average 26.44 s. lowering.IET: 8.98 s, 8.96 s, 9.54 s; average 9.16 s.
- Current simplified patch:
  - memoize public create_call_graph, with callers passing as_hashable(self.efuncs) / as_hashable(efuncs) rather than using a private cached helper;
  - memoize public abstract_efunc;
  - memoize public abstract_objects directly. An rg search across OSS and PRO shows no caller passes an explicit sregistry, so the old optional parameter was removed and the function now always uses its local SymbolRegistry;
  - simplify IET reuse_if_unchanged by using Node._same_arg instead of a duplicate local kwarg comparison helper.
- Dropped follow-up: a generic memoized_func key-path optimization was tested but left out of the patch. It appeared mildly positive in one set of runs, but was not necessary for the main IET win and is too broad for this focused change.
- Current measured performance with the simplified patch and unchanged memoized_func:
  - heavy: 21.63 s, 21.53 s, 21.57 s; average 21.58 s.
  - heavy_IO: 25.32 s, 25.40 s, 25.35 s; average 25.36 s.
  - net improvement versus the pre-patch reference is about 0.46 s on heavy and about 1.08 s on heavy_IO.
- Validation: targeted OSS IET/tool tests passed: /app/devitopro/submodules/devito/tests/test_iet.py, /app/devitopro/submodules/devito/tests/test_visitors.py, /app/devitopro/submodules/devito/tests/test_tools.py (72 passed).
- Interpretation: the durable win is in the IET callable-deduplication/reuse path, especially repeated call-graph creation and repeated abstraction of structurally stable callables. Dropping abstract_objects caching regressed heavy_IO back to roughly 25.5 s, so that cache is worth keeping now that the unused sregistry parameter has been removed.
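The pattern of memoizing a public function by first converting its unhashable inputs into a hashable key can be sketched with the standard library. The as_hashable below is a toy re-implementation written for this example, toy_call_graph is a hypothetical stand-in for the call-graph builder, and functools.lru_cache is swapped in for the project's own memoization decorator.

```python
import functools


def as_hashable(obj):
    """Toy converter: turn dicts/lists/sets into nested hashable tuples."""
    if isinstance(obj, dict):
        return tuple(sorted((k, as_hashable(v)) for k, v in obj.items()))
    if isinstance(obj, (list, tuple)):
        return tuple(as_hashable(i) for i in obj)
    if isinstance(obj, set):
        return frozenset(as_hashable(i) for i in obj)
    return obj


BUILDS = {'count': 0}  # counts real (uncached) executions


@functools.lru_cache(maxsize=None)
def toy_call_graph(root, efuncs):
    """Return the names reachable from `root`.

    `efuncs` must already be hashable: a tuple of (name, callees) pairs.
    """
    BUILDS['count'] += 1
    callees = dict(efuncs)
    seen, stack = [], [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.append(name)
        stack.extend(callees.get(name, ()))
    return tuple(seen)


efuncs = {
    'Kernel': ['v_kernel', 'tau_kernel'],
    'tau_kernel': ['helper'],
    'v_kernel': ['helper'],
    'helper': [],
}

first = toy_call_graph('Kernel', as_hashable(efuncs))
second = toy_call_graph('Kernel', as_hashable(efuncs))
print(BUILDS['count'], sorted(first))
```

The cache key is built from the structure of the input, so two structurally identical call tables share one result. This only pays off when building the key is cheaper than recomputing the function body.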
May 4, 2026 benchmark refresh after the no-op IET transform and visitor-cache follow-ups:
- Setup: PRO faster-python-1 worktree with paired OSS faster-python-1, CUDA docker image devitopro-cuda:latest, GPU device 3, launcher pinned with taskset 0-15. The three schedule-info probes were run in one pytest-docker invocation.
- stress-only (test_profile_etti_stress_like_schedule_infos):
  - total compile: 10.06 s;
  - lowering.Clusters: 5.52 s;
  - specializing.Clusters: 4.18 s;
  - optimize_kernels: 3.39 s;
  - lowering.IET: 3.23 s;
  - specializing.IET: 3.00 s;
  - IET notable buckets: make_parallel 1.59 s, _place_transfers 0.70 s, place_definitions 0.29 s;
  - kernelopt fuse: 0.60 s.
- heavy velocity+stress (test_profile_etti_velocity_then_stress_like_schedule_infos):
  - total compile: 21.63 s;
  - lowering.Clusters: 14.69 s;
  - specializing.Clusters: 11.17 s;
  - optimize_kernels: 10.08 s;
  - lowering.IET: 5.02 s;
  - specializing.IET: 4.43 s;
  - IET notable buckets: make_parallel 1.47 s, _place_transfers 1.37 s, place_definitions 0.65 s, linearization 0.28 s;
  - kernelopt fuse: 1.82 s.
- heavy_IO velocity+stress plus bitcomp+serialization (test_profile_etti_velocity_then_stress_like_bitcomp_serial_schedule_infos):
  - total compile: 25.26 s;
  - lowering.Clusters: 15.58 s;
  - specializing.Clusters: 11.63 s;
  - optimize_kernels: 10.53 s;
  - lowering.IET: 7.59 s;
  - specializing.IET: 6.85 s;
  - IET notable buckets: make_parallel 1.59 s, place_definitions 1.52 s, lower_async_objs 1.16 s, process 0.73 s, _place_transfers 0.54 s, linearization 0.47 s;
  - kernelopt fuse: 1.95 s.
- Interpretation: the three current probes are still in the expected post-IET-cache band: about 10.1 s for stress-only, 21.5-21.7 s for heavy, and 25.0-25.3 s for heavy_IO. The FindNodes visitor cache reduced direct repeated visitor cost in profiling, but it remains a small/noise-level end-to-end compile-time effect. The dominant open costs are still optimize_kernels / cluster specialization and, for heavy_IO, the IET async/definitions path.
May 4, 2026 IET reuse_efuncs drill-down:
- The expensive IET buckets in heavy_IO (make_parallel, place_definitions, _place_transfers, lower_async_objs, and process) are mostly paying common Graph.apply post-processing cost rather than pass body cost. A temporary graph-phase profile showed:
  - Graph.apply total: about 7.33 s across 25 calls;
  - reuse_efuncs: about 3.93 s across 5 calls;
  - pass bodies: about 2.17 s;
  - update_args: about 0.85 s.
- Inside reuse_efuncs, the hot path is abstraction/signature generation:
  - before the new signature cache: reuse_efuncs ~3.93 s, abstract_efunc ~1.91 s, _signature ~1.75 s;
  - with IET Node._signature() memoized per node: reuse_efuncs drops to about 3.62-3.69 s, and _signature drops to about 1.41-1.44 s.
- The tested signature-cache patch was deliberately narrow: IET Node overrode _signature() with @memoized_meth and delegated to Signer._signature(), caching the SHA1 signature on the immutable-ish IET node instance without caching the full CIR string.
- Direct multiplicity check on heavy_IO showed why the patch is not a meaningful end-to-end win:
  - _signature() calls: 180;
  - unique IET nodes: 150;
  - repeated calls on the same node: only 30;
  - call histogram: 121 nodes called once, 28 nodes called twice, 1 node called three times.
- The remaining abstract_efunc body cost is still substantial. A temporary body-level profile of heavy_IO showed about 150 misses and 30 hits across the five reuse_efuncs calls. Miss cost split roughly as:
  - Uxreplace: 0.63 s;
  - abstract_objects: 0.63 s;
  - FindSymbols('basics|symbolics|dimensions'): 0.23 s.
- Dropped variants:
  - IET Node._signature() memoization was dropped after the multiplicity check. There are not enough repeated calls on the same node to justify even this small cache as a production change;
  - filtering identity mappings out of abstract_objects was slower in practice; abstract_objects increased from about 0.63 s to about 1.62 s in the instrumented run, because rebuilding the mapper dominated;
  - returning raw CIR from IET Node._signature() instead of the SHA1 digest was also rejected. It retains large strings and made the instrumented profile noisier/worse, without a clear wall-time win.
- Validation and benchmark signal from the rejected signature-cache patch:
  - targeted OSS IET/visitor tests still pass: /app/devitopro/submodules/devito/tests/test_iet.py and /app/devitopro/submodules/devito/tests/test_visitors.py (42 passed);
  - the earlier heavy 22.25 s combined-run sample was confirmed noisy and should be ignored.
- May 4 rerun, three combined invocations before and after the signature-cache patch, same setup (devitopro-cuda:latest, GPU 3, taskset 0-15):
  - without signature cache: stress-only 10.02/10.03/10.00 s (avg 10.02 s), heavy 21.29/21.27/21.27 s (avg 21.28 s), heavy_IO 24.80/24.66/24.63 s (avg 24.70 s);
  - with signature cache: stress-only 10.02/10.03/9.98 s (avg 10.01 s), heavy 21.36/21.40/21.29 s (avg 21.35 s), heavy_IO 24.55/24.49/24.37 s (avg 24.47 s).
- Interpretation: memoizing IET node signatures is not worth keeping. The end-to-end signal is neutral for stress-only, neutral/slightly negative for heavy, and only mildly positive for heavy_IO (~0.23 s). The direct multiplicity check shows the cache surface is tiny: only 30/180 calls are repeated on the same node. The next meaningful IET win is unlikely to come from the individual pass bodies. It would need to reduce repeated abstract_efunc misses, likely by making reuse_efuncs more incremental/cache-aware across successive Graph.apply calls.
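The multiplicity argument can be checked directly from the reported call histogram. The numbers are the ones quoted in this note; the variable names are illustrative.

```python
# Reported histogram for _signature() on heavy_IO:
# calls per node -> number of nodes with that many calls
histogram = {1: 121, 2: 28, 3: 1}

total_calls = sum(calls * nodes for calls, nodes in histogram.items())
unique_nodes = sum(histogram.values())

# A per-node cache can only save the calls beyond the first on each node
max_cache_hits = total_calls - unique_nodes
hit_rate = max_cache_hits / total_calls

print(f"calls={total_calls}, unique nodes={unique_nodes}, "
      f"best-case hits={max_cache_hits} ({100 * hit_rate:.1f}%)")
```

This reproduces the 180 calls, 150 unique nodes, and 30 repeated calls stated above, so the best-case hit rate of a per-node signature cache is about one call in six.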