perf(#1392): remove O(N²) ordering in NEAT fitness + cache topology sort (#1419)

ooples · franklinic · claude · web-flow · commit a801d11c5161 · 2026-05-22T22:28:58.000-04:00
* fix(#1309): cluster-1 DCGAN — restore deferred-shape guard + lazy-conv deserialize fallback PR #1290 CI Cluster 1: 25 of 25 DCGANTests failing post-master with one of two errors: 1. Most (23 tests): "Invalid layer configuration: The last layer's output shape [3, -1, -1] must match the architecture output size (12288)." 2. Clone tests (2): "Input spatial dims after padding (1+2*1, 1+2*1) must be >= kernelSize (4)" raised inside DeserializationHelper's pre-resolve of the discriminator's first conv layer. Plus 1 SparseNN test (intermittent mode-collapse) that re-runs pass without code change — flaky, not a regression target. ## Root causes (1) NeuralNetworkBase.IsLastLayerShapeCompatible: PR #1329 (commit 969977d) added a `outputShape.Any(d => d < 0)` early-return so the validator defers the flat-OutputSize check when any output-shape dim is deferred — DCGAN's last transposed-conv emits [3, -1, -1] until its first Forward resolves H/W. That guard was inadvertently deleted by the grafprint PR (c8cac23, May 16) one day later. Restoring it unblocks all 23 validator-rejection cases at once. (2) DeserializationHelper conv path: when the saved layer record's inputShape carries -1 sentinels (a lazy conv layer serialized before its first Forward — DCGAN's discriminator on a Predict-only probe sees only the generator), the pre-existing code coerced all -1 dims to 1 and called conv.ResolveShapesOnly(...). For DCGAN's first conv (kernel=4, padding=1) this fails OnFirstForward's kernel-size check (1 + 2 < 4). Coercing to Math.Max(1, KernelSize) fixes that specific check, but locks InputDepth at 1 — then the real Forward with the [3, 64, 64] RGB image throws "Expected input depth 1, but got 3". The correct fix is to skip pre-resolve entirely when InputDepth is deferred — ConvolutionalLayer.SetParameters has its own auto-resolve fallback at line ~1598 that derives InputDepth from the saved parameter vector's length, and uses KernelSize as the spatial placeholder. Pre-resolve still runs (and uses Math.Max(1, KernelSize) for any deferred spatial dim) when InputDepth is concrete — that's the original PR #1329 contract for the auto-resolve-disambiguation case. ## Verification $ dotnet test --framework net10.0 --filter "FullyQualifiedName~DCGANTests|FullyQualifiedName~SparseNeuralNetworkTests" Failed! - Failed: 2, Passed: 44, Skipped: 0, Total: 46 26 → 2 failures. The remaining two are NOT cluster-1 shape-contract issues: - DCGANTests.MoreData_ShouldNotDegrade — `Test execution timed out after 120000 milliseconds`. Pre-existing GAN training-path perf gap; the deep deconv+conv chain in tape mode is ~5-10× slower than PyTorch CPU baseline. Substep profile (Release): Generator.Predict 19 ms, Discriminator.Train 187 ms, Generator adversarial 313 ms — 519 ms/step × 250 iters = 130 s vs 120 s timeout. Filed separately so this PR ships the actual cluster-1 root causes (validator + conv-deserialize) without bundling a multi-week perf project. - SparseNeuralNetworkTests.DifferentInputs_AfterTraining_ShouldProduceDifferentOutputs — intermittent mode-collapse, passes on re-runs. Separate flaky-test issue, not a shape-contract bug. Closes #1309 partially (cluster-1 shape-contract root causes). The MoreData_ShouldNotDegrade timeout + SparseNN mode-collapse flakiness are tracked separately. 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(PR #1389 review): document zero-dim wildcard semantics + reject malformed Conv inputShape rank * fix(PR #1389 follow-up): widen rank check to reject rank-1/2 Conv inputShape too * perf(#1390): eliminate duplicate generator forward in GAN.Train — closes DCGAN MoreData timeout Previously GenerativeAdversarialNetwork.Train ran the generator forward TWICE per training step: 1. Generator.Predict(input) (eval mode, NoGradScope) → detached fake images for the combined real+fake discriminator step. 2. ForwardForTraining(input) (train mode, on tape) inside TrainWithCustomLoss — duplicate of the same forward, just for the gen-adversarial backward. On the DCGAN MoreData fixture (250 iters, double-precision, batch=2, 64×64 RGB) this duplicate forward contributed ~19 ms of the 519 ms / step profiled in #1390 — pushing the test 10 s over its 120 s budget. Refactor: - Open a single GradientTape at the start of the step. - Run ForwardForTraining(input) ONCE on that tape → fakeTapeTracked. - Take a value-copy detached snapshot (fakeImages) for the disc step; fresh Tensor<T> with no GradNode chain so disc.Train (which opens its own nested tape) can not leak gradients back into the generator. - Walk the discriminator layer-by-layer on the existing gen tape for the adversarial loss (unchanged from the prior closure semantics). - Drive the gen optimizer step via the new NeuralNetworkBase.BackwardAndStepOnPrecomputedLoss helper, which reuses the open tape instead of TrainWithCustomLoss opening a fresh one + re-running ForwardForTraining. Behavior note: the disc step now sees train-mode generator output (batch BN stats) instead of eval-mode (running BN stats). This matches PyTorch's standard DCGAN training pattern (fake = G(z); fake_detached = fake.detach()) and the existing gen step's own train-mode forward. DCGAN has no Dropout, so the only distribution shift is BN stats, which is the conventional adversarial behavior. Verified locally with the canonical Tensors 0.81.3 dependency: - DCGANTests.MoreData_ShouldNotDegrade: 1 m 47 s (was timing out at > 120 s) — closes the test's perf gap. - Full DCGANTests class: 25 / 25 passing. - ConditionalGANTests + InfoGANTests (other GAN.Train consumers): 50 / 50 passing. - Full SparseNeuralNetworkTests: 21 / 21 passing (previously "intermittent mode-collapse" in PR #1389 description — appears stable now, may have been transient). Closes #1390. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(pr1389-review): narrow visibility + reentrancy + extra trainables addresses three coderabbit comments on backwardandsteponprecomputed loss in pr #1389: 1. visibility narrowed public -> internal. the codebase contract is "users should only interact with aimodelbuilder / aimodelresult" and this helper is training plumbing for in-assembly callers (currently generativeadversarialnetwork.train); no reason for it to live on the public surface. only caller is in same assembly. 2. added using var __reentrancyguard = acquiretrainsentinel() at the top, mirroring trainwithtape's sentinel discipline. without it, concurrent callers on the same model race on lastloss + optimizer internal state. 3. trainableparams now concats getextratrainabletensors() with the layer params, matching trainwithtape's parameter set. without this models that expose raw tensors via getextratrainabletensors (rather than layer-resident params) silently skipped updates on the precomputed-loss path -- divergent semantics between the two training entry points. build passes. * fix(#1395): bump aidotnet.tensors 0.81.3 → 0.81.9 — pulls in the cache-key determinism fix issue #1395 root cause is `compiledmodelcache` reusing a cached plan across calls that differ only in `isdeterministicmode` — when step 2 flipped the flag, the plan-embedded adam state was incompatible with the (now cache-hit) plan from step 1, and the fused path threw `"plan-embedded adam/adamw/sgd state cannot be transferred"`. ooples/AiDotNet.Tensors#416 (tag v0.81.7) drops `isdeterministicmode` from the cache shape-key. that release plus the four intermediate versions (0.81.4 → 0.81.9) all sat un-consumed because the comment in this file was blocking the bump until #424 published a nuget — #424 has since merged + published as v0.81.9. contents of the bump (oldest first): - 0.81.4 #367 graphmode recording on 5 ops with silent gradient-drop bugs (closes #365) - 0.81.5 #410 three fp64 sd unet cliffs + compile-mode forward 2× faster than eager (#1305) - 0.81.6 #412 shape catalog + dcgan step probe + alloc profile (#403) - 0.81.7 #416 drop isdeterministicmode from compiledmodelcache shape-key (closes #1395 root) - 0.81.8 #418 serialise clcreatecommandqueue across threads — dodges amd driver race (#414) - 0.81.9 #424 long-arithmetic dim-product in tensorallocator — replaces silent checked(int * int) overflow that broke timemachine / dqn / owlvit / dgcnn / tabtransformer / tabdpt / slimsam / triaffinener on sonarcloud run 26241806890 verified locally: - restore + build src + tests all green - 34/34 vggnetworktests.UnitTests pass - vggnetworktests.modelfamily.training_shouldreduceloss no longer throws the #1395 exception (now times out at 120s — that's #1394 perf, distinct) * perf(#1392): remove o(n²) ordering in neat fitness + cache topology sort issue #1392 reported neattests.training_shouldreduceloss timing out at the 120 s ci budget. profiled it down to two hot-path issues inside neat.train's 50-generation fitness loop: 1. inline lastloss assignment in the fitness function called _population.orderbydescending(g => g.fitness).firstordefault() on every genome eval -- o(n) sort per genome × 150 genomes per generation × 50 generations × 30 train calls in the test = ~34 million ordering operations, all heap-allocating linq enumerables. the reference-equality probe against the pre-generation best was also semantically broken (the comment in the existing post-evolution recompute block already called it out), so the inline lastloss was both expensive AND wrong. fix: delete the branch. the post-evolution recompute at neat.cs:1230+ does the work correctly using the actual post-generation best. 2. activategenome rebuilt sortconnectionstopologically (o(e²)) from scratch on every call -- 225,000 sorts across the test run, most redundant because only weight-mutation occurred between successive generations. fix: cache the sort + the non-input-node id list on the genome, keyed on (connections.count, ulong bitmask of isenabled). topology mutations (addconnection / disableconnection) change one or both halves of the key, invalidating the cache; weight mutations leave it valid (the dominant case). local timing on the failing test (single-test run, 32-core host): baseline: 54.0 s post-fix: 46.0 s (~15 % faster) the issue itself scopes the perf gap as "multi-week" -- this is a first pass closing the easy 15-20 % win without changing model semantics. real residual is in activategenome's dictionary<int, t> allocator pressure and the per-mutation clone path, both of which will need follow-up work. added genome cache fields: internal int cachedtopologysignaturecount internal ulong cachedtopologysignaturemask internal list<connection<t>>? cachedsortedconnections internal list<int>? cachednoninputnodeids verified: all 12 neattests pass in isolation post-fix; no behavior change vs baseline (lastloss values, mutation outputs, evolution trajectory unchanged because the deleted branch was already dead code per the post-evolution recompute comment). * perf(#1392): activate genome on flat array + bulk clone ActivateGenome was allocating a Dictionary<int, T> per call and indexing through it for every connection in the topologically sorted edge list. Under EvolvePopulation that's one Dictionary per genome per fitness call — 150 pop x 50 gen x ~30 Train calls per test = ~225k Dictionary allocs per test invocation. Swap the Dictionary for a flat T[] sized to max(referenced node id, biasNodeId) + 1. Connection traversal becomes pure array indexing; the non-input-node sigmoid sweep walks a cached List<int> instead of Dictionary.Keys. Three new genome caches piggyback on the existing topology-sort caches added in 09534a4: - CachedMaxNodeId — max(referenced node id, biasNodeId) - CachedReferencedNonInputNodeIds — distinct non-input node ids - (existing CachedSortedConnections invalidates these too) Clone() now pre-sizes the child genome's Connections list to the parent count instead of letting List<Connection<T>> grow through the 0->4->8->16 capacity-doubling chain (each step memcpys the buffer). Connection<T> objects are still freshly allocated per child so parent mutations don't leak across the clone boundary. GetNamedLayerActivations dropped its Dictionary-specific ContainsKey checks and Keys.Where(...) lookup in favor of straight array indexing plus a walk over the cached non-input-node id list. Net wall time on NEATTests.Training_ShouldReduceLoss (isolated, net10): - pre-#1419 baseline: ~54 s - #1419 first pass (this PR): ~46 s (~15%) - + this commit: ~41 s (~24% cumulative) Build verified on net10.0 + net471. Test passes in isolation. Pre-existing parallel-suite timeout on Training_ShouldReduceLoss is unaffected by this change (confirmed by re-running against the stashed baseline) — that flake's root cause is xunit parallel CPU contention against the 120 s test budget, not a regression introduced here. Issue: #1392 * perf(#1392): zero-alloc tournament + linq-free crossover + mutate Three remaining hot paths in EvolvePopulation that were paying per-call allocations on every offspring: - SelectParent: built a fresh List<Genome<T>> + ran OrderByDescending over it on every invocation. Tournament size is fixed at 3; ~447 k calls per Training_ShouldReduceLoss run (149 children x 50 gens x ~30 Train calls x 2 parents per crossover). Rewritten as an inline 3-way argmax with no allocations and no LINQ. - Crossover: Enumerable.Concat (enumerator alloc) and a fresh HashSet<int> per call. Switched to a per-NEAT-instance scratch HashSet that .Clear()s at the top of each call (single-threaded Evolve loop, so reuse is safe), plus pre-sized the child Connections list to (parent1.Count + parent2.Count) to skip the capacity-doubling chain. Both parent lists are now walked by index. - Mutate: LINQ Max + Any allocated a Func<,> delegate + enumerator per call. Both replaced with manual index loops. Weight-mutation foreach also replaced with an index loop so JIT can elide the List<T>.Enumerator bounds check on each step. - EvolvePopulation: pre-size newPopulation to _populationSize so its backing array doesn't walk 0->4->8->16->...->150 on every generation. Connection<T> object pooling was considered and skipped — Connection's FromNode/ToNode/Innovation are init-only via the public constructor, so pooling would require either a breaking API change or a fragile internal reset path that's bug-prone. Per-genome allocation churn for connections is bounded by genome.Connections.Count (small) and is already paid under JIT-friendly Add() calls into the pre-sized child list, so the remaining marginal win does not justify the API risk. Test pass on all 6 NEAT tests run individually (net10.0). Wall time on Training_ShouldReduceLoss (isolated, 3-run min): ~41 s (unchanged vs the prior commit — ActivateGenome dominates the inner loop, this commit trims the outer loop's overhead and reduces GC pressure under parallel load). Issue: #1392 * fix(NEAT): FNV-1a topology signature catches same-count rewires + >64-conn aliasing PR #1419 review: the previous ComputeEnabledBitmask cache key keyed only on (Connections.Count, XOR of enabled-bits). That signature aliased two real edit patterns and let stale cached sorts / non-input-node sets / max-node-id leak back to ActivateGenome: - Same-count rewires — swapping a connection's FromNode/ToNode for a different node without flipping any IsEnabled bit preserved both count and the bitmask → cache hit on the WRONG topology. - >64-connection aliasing — the bitmask's `(i & 63)` wrap collapsed slots 0/64/128/… onto the same bit, so a flip at slot 64 could XOR-cancel an earlier flip at slot 0 and leave the mask unchanged. Replace with FNV-1a 64-bit hash over (FromNode, ToNode, IsEnabled) per slot in iteration order. Connection.Weight is deliberately excluded — weight-only mutations are the dominant case across the 50 internal generations per public Train call, and we WANT the cached topological sort to survive them. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(NEAT): include biasNodeId in newNodeId floor (PR #1419) PR #1419 review (CodeRabbit minor): the add-node mutation's first-free-node-id scan started at 0 and used max(FromNode, ToNode) across connections. For an initial genome (connections only reference InputSize + OutputSize node ids), that max is InputSize + OutputSize − 1, producing newNodeId = InputSize + OutputSize = biasNodeId. ActivateGenome writes activations[biasNodeId] = NumOps.One BEFORE the connection sweep, so any connection accumulating into this hidden slot would corrupt the bias signal — and every connection targeting the new hidden node would also read a polluted pre-activation from the same slot. The collision only affected the FIRST add-node mutation on a fresh genome (subsequent mutations push maxNodeId past biasNodeId), but that's the most common path and the corruption was silent. Initialise maxNodeId at biasNodeId instead of 0 so newNodeId is guaranteed > biasNodeId regardless of starting topology. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: franklinic <franklin@ivorycloud.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/Directory.Packages.props b/Directory.Packages.props
@@ -5,22 +5,17 @@
   <ItemGroup>
     <!-- AiDotNet ecosystem -->
     <PackageVersion Include="AiDotNet" Version="0.113.0" />
-    <!-- AiDotNet.Tensors needs a version bump after ooples/AiDotNet.Tensors#424
-         publishes a new NuGet. That PR replaces the silent `checked(int * int)`
-         dim-product overflow in TensorAllocator.Rent / RentPinned with a
-         `long` accumulator that names the requested shape when the element
-         count exceeds Array.MaxLength, plus an ArgumentOutOfRangeException
-         naming the index and value for negative dims (lazy-layer `-1`
-         sentinel propagation). Diagnoses the otherwise-opaque
-         `OverflowException` failures on TimeMachine / DQN / OWLViT /
-         DGCNN / TabTransformer / TabDPT / SlimSAM / TriaffineNER tests on
-         SonarCloud run 26241806890. (Previous PR ooples/AiDotNet.Tensors#359
-         — GraFPrint perf-overhaul + scheduler-fused + determinism — is
-         already in 0.81.3.) -->
-    <PackageVersion Include="AiDotNet.Tensors" Version="0.81.3" />
-    <PackageVersion Include="AiDotNet.Native.OneDNN" Version="0.81.3" />
-    <PackageVersion Include="AiDotNet.Native.OpenBLAS" Version="0.81.3" />
-    <PackageVersion Include="AiDotNet.Native.CLBlast" Version="0.81.3" />
+    <!-- AiDotNet.Tensors 0.81.9 brings in six fixes accumulated since 0.81.3:
+         - 0.81.4 #367  GraphMode recording on 5 ops with silent gradient-drop bugs (closes #365)
+         - 0.81.5 #410  three FP64 SD UNet cliffs + compile-mode forward 2× faster than eager (#1305)
+         - 0.81.6 #412  shape catalog + DCGAN step probe + alloc profile (#403)
+         - 0.81.7 #416  drop IsDeterministicMode from CompiledModelCache shape-key (closes AiDotNet#1395 root cause — VGG fused-compiled training now engages on step 2)
+         - 0.81.8 #418  serialise clCreateCommandQueue across threads (dodges AMD driver race, #414)
+         - 0.81.9 #424  long-arithmetic dim-product in TensorAllocator (replaces silent `checked(int * int)` overflow — unblocks TimeMachine / DQN / OWLViT / DGCNN / TabTransformer / TabDPT / SlimSAM / TriaffineNER on SonarCloud run 26241806890) -->
+    <PackageVersion Include="AiDotNet.Tensors" Version="0.81.9" />
+    <PackageVersion Include="AiDotNet.Native.OneDNN" Version="0.81.9" />
+    <PackageVersion Include="AiDotNet.Native.OpenBLAS" Version="0.81.9" />
+    <PackageVersion Include="AiDotNet.Native.CLBlast" Version="0.81.9" />
     <!-- Microsoft / .NET -->
     <PackageVersion Include="Microsoft.CSharp" Version="4.7.0" />
     <PackageVersion Include="Microsoft.Data.Sqlite" Version="10.0.8" />
diff --git a/src/NeuralNetworks/Genome.cs b/src/NeuralNetworks/Genome.cs
@@ -177,6 +177,37 @@ public Genome(int inputSize, int outputSize)
         Fitness = NumOps.Zero;
     }
 
+    /// <summary>
+    /// Issue #1392 perf: per-genome cache for <c>NEAT.SortConnectionsTopologically</c> +
+    /// the "non-input node" set <c>NEAT.ActivateGenome</c> walks for activation. The
+    /// sort is O(E²) on the genome's enabled connections and was being rebuilt on
+    /// every <c>NEAT.ActivateGenome</c> call (population × generations × Train calls
+    /// = ~225 k calls for a Training_ShouldReduceLoss run). Caching pays back on
+    /// every call where the topology hasn't mutated since the prior activation.
+    ///
+    /// <para>Invalidation key is the cheap-to-compute <c>(Connections.Count, IsEnabled
+    /// bitmask)</c> snapshot taken at the start of activation. NEAT's mutation paths
+    /// either (a) append to <c>Connections</c> via <see cref="AddConnection"/> (Count
+    /// changes) or (b) flip <c>IsEnabled</c> via <see cref="DisableConnection"/> /
+    /// crossover (mask changes). Weight-only mutations don't change topology and
+    /// keep the cache valid — which is the dominant case across the 50 generations
+    /// per Train call. Snapshot computation is O(N) and reads cleanly off
+    /// <c>Connections</c>; cheap relative to the O(E²) sort it bypasses.</para>
+    /// </summary>
+    internal int CachedTopologySignatureCount;
+    internal ulong CachedTopologySignatureMask;
+    internal List<Connection<T>>? CachedSortedConnections;
+    internal List<int>? CachedNonInputNodeIds;
+
+    /// <summary>
+    /// Issue #1392 perf: max node id referenced by any (FromNode, ToNode) in
+    /// <see cref="Connections"/> plus the bias-node id, cached alongside the
+    /// topology signature so <see cref="NEAT{T}.ActivateGenome"/> can size its
+    /// flat-array activations buffer in one shot instead of growing a
+    /// Dictionary&lt;int, T&gt; one entry at a time. -1 means uninitialized.
+    /// </summary>
+    internal int CachedMaxNodeId = -1;
+
     /// <summary>
     /// Adds a new connection to this genome.
     /// </summary>
@@ -415,10 +446,24 @@ void Visit(int node)
     /// </remarks>
     public Genome<T> Clone()
     {
+        // Issue #1392 perf: under EvolvePopulation we Clone() the parent for
+        // every child every generation (150 pop x 50 gen per Train call x
+        // dozens of test calls). The previous foreach + AddConnection path
+        // drove List<Connection<T>>'s internal array through the 0->4->8->16
+        // capacity grow chain, copying the buffer at every doubling. Pre-size
+        // the list to the exact final capacity so the alloc is one-shot.
+        int count = Connections.Count;
         var clone = new Genome<T>(InputSize, OutputSize);
-        foreach (var conn in Connections)
+        if (count > 0)
         {
-            clone.AddConnection(conn.FromNode, conn.ToNode, conn.Weight, conn.IsEnabled, conn.Innovation);
+            clone.Connections.Capacity = count;
+            var src = Connections;
+            var dst = clone.Connections;
+            for (int i = 0; i < count; i++)
+            {
+                var c = src[i];
+                dst.Add(new Connection<T>(c.FromNode, c.ToNode, c.Weight, c.IsEnabled, c.Innovation));
+            }
         }
 
         return clone;
diff --git a/src/NeuralNetworks/NEAT.cs b/src/NeuralNetworks/NEAT.cs