Commit a801d11
* fix(#1309): cluster-1 DCGAN — restore deferred-shape guard + lazy-conv deserialize fallback
PR #1290 CI Cluster 1: 25 of 25 DCGANTests failing post-master with one
of two errors:
1. Most (23 tests): "Invalid layer configuration: The last layer's
output shape [3, -1, -1] must match the architecture output size
(12288)."
2. Clone tests (2): "Input spatial dims after padding (1+2*1, 1+2*1)
must be >= kernelSize (4)" raised inside DeserializationHelper's
pre-resolve of the discriminator's first conv layer.
Plus 1 SparseNN test (intermittent mode-collapse) that passes on re-run
without code changes; flaky, not a regression target.
## Root causes
(1) NeuralNetworkBase.IsLastLayerShapeCompatible: PR #1329 (commit
969977d) added an `outputShape.Any(d => d < 0)` early-return so the
validator defers the flat-OutputSize check when any output-shape dim
is deferred — DCGAN's last transposed-conv emits [3, -1, -1] until
its first Forward resolves H/W. That guard was inadvertently deleted
by the grafprint PR (c8cac23, May 16) one day later. Restoring it
unblocks all 23 validator-rejection cases at once.
(2) DeserializationHelper conv path: when the saved layer record's
inputShape carries -1 sentinels (a lazy conv layer serialized before
its first Forward — DCGAN's discriminator on a Predict-only probe
sees only the generator), the pre-existing code coerced all -1 dims
to 1 and called conv.ResolveShapesOnly(...). For DCGAN's first conv
(kernel=4, padding=1) this fails OnFirstForward's kernel-size check
(1 + 2 < 4). Coercing to Math.Max(1, KernelSize) fixes that
specific check, but locks InputDepth at 1 — then the real Forward
with the [3, 64, 64] RGB image throws "Expected input depth 1, but
got 3". The correct fix is to skip pre-resolve entirely when
InputDepth is deferred — ConvolutionalLayer.SetParameters has its
own auto-resolve fallback at line ~1598 that derives InputDepth from
the saved parameter vector's length, and uses KernelSize as the
spatial placeholder. Pre-resolve still runs (and uses
Math.Max(1, KernelSize) for any deferred spatial dim) when
InputDepth is concrete — that's the original PR #1329 contract for
the auto-resolve-disambiguation case.
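The two decisions above can be sketched in a few lines. This is an illustrative Python model, not the project's C# code: the function names `is_last_layer_shape_compatible` and `pre_resolve_shape` are hypothetical, and the channel-first `[depth, height, width]` layout is an assumption taken from the `[3, -1, -1]` and `[3, 64, 64]` shapes quoted in the message.

```python
from math import prod


def is_last_layer_shape_compatible(output_shape, expected_size):
    """Validator guard: defer the flat-size check while any dim is deferred."""
    # A negative dim is a deferred (-1) sentinel. The real size is unknown
    # until the first forward pass, so the check is postponed, not failed.
    if any(d < 0 for d in output_shape):
        return True
    return prod(output_shape) == expected_size


def pre_resolve_shape(input_shape, kernel_size):
    """Deserialize path: return a placeholder shape, or None to skip pre-resolve."""
    depth, height, width = input_shape
    if depth < 0:
        # Depth is deferred: skip pre-resolve so the depth can later be
        # derived from the saved parameter vector instead of locked at 1.
        return None
    # Depth is concrete: fill deferred spatial dims with a placeholder that
    # is large enough to pass the kernel-size check.
    placeholder = max(1, kernel_size)
    return [depth] + [placeholder if d < 0 else d for d in (height, width)]


def padded_dim_ok(dim, padding, kernel_size):
    """The kernel-size check quoted in the error message."""
    return dim + 2 * padding >= kernel_size
```

With kernel=4 and padding=1, `padded_dim_ok(1, 1, 4)` is false (1 + 2 < 4), which is the failure the clone tests hit, while the `Math.Max(1, KernelSize)` placeholder of 4 passes the same check.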
## Verification
$ dotnet test --framework net10.0 --filter "FullyQualifiedName~DCGANTests|FullyQualifiedName~SparseNeuralNetworkTests"
Failed! - Failed: 2, Passed: 44, Skipped: 0, Total: 46
26 → 2 failures. The remaining two are NOT cluster-1 shape-contract
issues:
- DCGANTests.MoreData_ShouldNotDegrade — `Test execution timed
out after 120000 milliseconds`. Pre-existing GAN training-path
perf gap; the deep deconv+conv chain in tape mode is ~5-10×
slower than PyTorch CPU baseline. Substep profile (Release):
Generator.Predict 19 ms, Discriminator.Train 187 ms, Generator
adversarial 313 ms — 519 ms/step × 250 iters = 130 s vs 120 s
timeout. Filed separately so this PR ships the actual
cluster-1 root causes (validator + conv-deserialize) without
bundling a multi-week perf project.
- SparseNeuralNetworkTests.DifferentInputs_AfterTraining_ShouldProduceDifferentOutputs
— intermittent mode-collapse, passes on re-runs. Separate
flaky-test issue, not a shape-contract bug.
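The timeout arithmetic for MoreData_ShouldNotDegrade can be checked directly from the substep profile quoted above. The snippet below is only a worked calculation in Python; the dictionary keys are labels copied from the profile, not API names.

```python
# Substep timings (ms per training step) from the Release profile above.
substeps_ms = {
    "Generator.Predict": 19,
    "Discriminator.Train": 187,
    "Generator adversarial": 313,
}

step_ms = sum(substeps_ms.values())   # 519 ms per step
iterations = 250
total_s = step_ms * iterations / 1000  # 129.75 s, rounded to 130 s above
timeout_s = 120

over_budget_s = total_s - timeout_s    # 9.75 s over the test budget
```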
Closes #1309 partially (cluster-1 shape-contract root causes).
The MoreData_ShouldNotDegrade timeout + SparseNN mode-collapse
flakiness are tracked separately.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
* fix(PR #1389 review): document zero-dim wildcard semantics + reject malformed Conv inputShape rank
* fix(PR #1389 follow-up): widen rank check to reject rank-1/2 Conv inputShape too
* perf(#1390): eliminate duplicate generator forward in GAN.Train — closes DCGAN MoreData timeout
Previously GenerativeAdversarialNetwork.Train ran the generator forward TWICE
per training step:
1. Generator.Predict(input) (eval mode, NoGradScope) → detached fake images
for the combined real+fake discriminator step.
2. ForwardForTraining(input) (train mode, on tape) inside
TrainWithCustomLoss — duplicate of the same forward, just for the
gen-adversarial backward.
On the DCGAN MoreData fixture (250 iters, double-precision, batch=2, 64×64
RGB) this duplicate forward contributed ~19 ms of the 519 ms / step
profiled in #1390 — pushing the test 10 s over its 120 s budget.
Refactor:
- Open a single GradientTape at the start of the step.
- Run ForwardForTraining(input) ONCE on that tape → fakeTapeTracked.
- Take a value-copy detached snapshot (fakeImages) for the disc step;
fresh Tensor<T> with no GradNode chain so disc.Train (which opens
its own nested tape) cannot leak gradients back into the generator.
- Walk the discriminator layer-by-layer on the existing gen tape for
the adversarial loss (unchanged from the prior closure semantics).
- Drive the gen optimizer step via the new
NeuralNetworkBase.BackwardAndStepOnPrecomputedLoss helper, which
reuses the open tape instead of TrainWithCustomLoss opening a fresh
one + re-running ForwardForTraining.
Behavior note: the disc step now sees train-mode generator output
(batch BN stats) instead of eval-mode (running BN stats). This matches
PyTorch's standard DCGAN training pattern (fake = G(z); fake_detached =
fake.detach()) and the existing gen step's own train-mode forward.
DCGAN has no Dropout, so the only distribution shift is BN stats, which
is the conventional adversarial behavior.
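The shape of the refactor can be illustrated with a toy model that only counts forward passes. This is a Python sketch under stated assumptions: `CountingGenerator`, `step_before`, and `step_after` are hypothetical stand-ins, there is no real tape or autograd here, and a plain list copy plays the role of the detached value-copy snapshot.

```python
class CountingGenerator:
    """Toy generator that records how many forward passes it ran."""

    def __init__(self):
        self.forward_calls = 0

    def forward(self, z):
        self.forward_calls += 1
        return [2.0 * v for v in z]


def step_before(gen, z):
    # Old flow: one forward for the discriminator step, a second identical
    # forward for the generator's adversarial backward.
    fake_for_disc = gen.forward(z)
    fake_on_tape = gen.forward(z)
    return fake_for_disc, fake_on_tape


def step_after(gen, z):
    # New flow: a single forward; the discriminator step receives a value
    # copy, so nothing it does can reach back into the tracked output.
    fake_on_tape = gen.forward(z)
    fake_for_disc = list(fake_on_tape)
    return fake_for_disc, fake_on_tape
```

The snapshot being a separate object is the point of the value copy: mutating `fake_for_disc` leaves `fake_on_tape` untouched.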
Verified locally with the canonical Tensors 0.81.3 dependency:
- DCGANTests.MoreData_ShouldNotDegrade: 1 m 47 s (was timing out at
> 120 s) — closes the test's perf gap.
- Full DCGANTests class: 25 / 25 passing.
- ConditionalGANTests + InfoGANTests (other GAN.Train consumers):
50 / 50 passing.
- Full SparseNeuralNetworkTests: 21 / 21 passing (previously
"intermittent mode-collapse" in PR #1389 description — appears
stable now, may have been transient).
Closes #1390.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(pr1389-review): narrow visibility + reentrancy + extra trainables
addresses three coderabbit comments on backwardandsteponprecomputedloss
in pr #1389:
1. visibility narrowed public -> internal. the codebase contract is
"users should only interact with aimodelbuilder / aimodelresult"
and this helper is training plumbing for in-assembly callers
(currently generativeadversarialnetwork.train); no reason for it
to live on the public surface. only caller is in same assembly.
2. added using var __reentrancyguard = acquiretrainsentinel() at the
top, mirroring trainwithtape's sentinel discipline. without it,
concurrent callers on the same model race on lastloss + optimizer
internal state.
3. trainableparams now concats getextratrainabletensors() with the
layer params, matching trainwithtape's parameter set. without this
models that expose raw tensors via getextratrainabletensors (rather
than layer-resident params) silently skipped updates on the
precomputed-loss path -- divergent semantics between the two
training entry points.
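Points 2 and 3 can be sketched together. The Python below is a hypothetical model, not the C# helper: whether the real sentinel throws or blocks on re-entry is not stated in this message, so raising `RuntimeError` is an assumption, as are the `Trainer` class and its method names.

```python
import threading
from contextlib import contextmanager


class Trainer:
    def __init__(self):
        self._lock = threading.Lock()

    @contextmanager
    def acquire_train_sentinel(self):
        # Non-blocking acquire: a second caller fails fast instead of
        # racing on shared loss and optimizer state. (Assumed behavior.)
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("training already in progress on this model")
        try:
            yield
        finally:
            self._lock.release()

    def backward_and_step(self, layer_params, extra_params):
        with self.acquire_train_sentinel():
            # Same parameter set as the main training path: layer-resident
            # parameters plus any extra trainable tensors.
            return list(layer_params) + list(extra_params)
```

Returning the combined parameter list stands in for the optimizer step; the sketch only shows which tensors would be updated and that overlapping calls are rejected.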
build passes.
* fix(#1395): bump aidotnet.tensors 0.81.3 → 0.81.9 — pulls in the cache-key determinism fix
issue #1395 root cause is `compiledmodelcache` reusing a cached plan
across calls that differ only in `isdeterministicmode` — when step 2
flipped the flag, the plan-embedded adam state was incompatible with
the (now cache-hit) plan from step 1, and the fused path threw
`"plan-embedded adam/adamw/sgd state cannot be transferred"`.
ooples/AiDotNet.Tensors#416 (tag v0.81.7) drops `isdeterministicmode`
from the cache shape-key. that release plus the four intermediate
versions (0.81.4 → 0.81.9) all sat un-consumed because the comment in
this file was blocking the bump until #424 published a nuget — #424
has since merged + published as v0.81.9.
contents of the bump (oldest first):
- 0.81.4 #367 graphmode recording on 5 ops with silent gradient-drop bugs (closes #365)
- 0.81.5 #410 three fp64 sd unet cliffs + compile-mode forward 2× faster than eager (#1305)
- 0.81.6 #412 shape catalog + dcgan step probe + alloc profile (#403)
- 0.81.7 #416 drop isdeterministicmode from compiledmodelcache shape-key (closes #1395 root)
- 0.81.8 #418 serialise clcreatecommandqueue across threads — dodges amd driver race (#414)
- 0.81.9 #424 long-arithmetic dim-product in tensorallocator — replaces silent
checked(int * int) overflow that broke timemachine / dqn / owlvit /
dgcnn / tabtransformer / tabdpt / slimsam / triaffinener on
sonarcloud run 26241806890
verified locally:
- restore + build src + tests all green
- 34/34 vggnetworktests.UnitTests pass
- vggnetworktests.modelfamily.training_shouldreduceloss no longer throws
the #1395 exception (now times out at 120s — that's #1394 perf, distinct)
* perf(#1392): remove o(n²) ordering in neat fitness + cache topology sort
issue #1392 reported neattests.training_shouldreduceloss timing out at
the 120 s ci budget. profiled it down to two hot-path issues inside
neat.train's 50-generation fitness loop:
1. inline lastloss assignment in the fitness function called
_population.orderbydescending(g => g.fitness).firstordefault() on
every genome eval -- o(n) sort per genome × 150 genomes per
generation × 50 generations × 30 train calls in the test =
~34 million ordering operations, all heap-allocating linq
enumerables. the reference-equality probe against the
pre-generation best was also semantically broken (the comment in
the existing post-evolution recompute block already called it
out), so the inline lastloss was both expensive AND wrong. fix:
delete the branch. the post-evolution recompute at neat.cs:1230+
does the work correctly using the actual post-generation best.
2. activategenome rebuilt sortconnectionstopologically (o(e²)) from
scratch on every call -- 225,000 sorts across the test run, most
redundant because only weight-mutation occurred between successive
generations. fix: cache the sort + the non-input-node id list on
the genome, keyed on (connections.count, ulong bitmask of
isenabled). topology mutations (addconnection / disableconnection)
change one or both halves of the key, invalidating the cache;
weight mutations leave it valid (the dominant case).
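a minimal python sketch of the cache in item 2, assuming the key is exactly (connection count, xor-ed enabled bits with a 64-slot wrap). the `Genome` and `Connection` classes here are hypothetical, and `graphlib` stands in for sortconnectionstopologically; the real c# code is not shown in this message.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass
class Connection:
    from_node: int
    to_node: int
    weight: float
    enabled: bool = True


@dataclass
class Genome:
    connections: list
    sort_count: int = 0      # how many times the sort actually ran
    _key: tuple = None       # signature the cached sort was built for
    _sorted: list = None     # cached topologically sorted connections

    def signature(self):
        mask = 0
        for i, c in enumerate(self.connections):
            if c.enabled:
                mask ^= 1 << (i & 63)
        return (len(self.connections), mask)

    def _topological_sort(self):
        live = [c for c in self.connections if c.enabled]
        ts = TopologicalSorter()
        for c in live:
            ts.add(c.to_node, c.from_node)  # to_node depends on from_node
        order = {node: i for i, node in enumerate(ts.static_order())}
        # A connection is ready once its source node is ready.
        return sorted(live, key=lambda c: order[c.from_node])

    def sorted_connections(self):
        key = self.signature()
        if key != self._key:
            self._sorted = self._topological_sort()
            self._key = key
            self.sort_count += 1
        return self._sorted
```

a weight-only mutation leaves `signature()` unchanged, so the second call returns the cached list; disabling a connection changes the mask and forces one re-sort.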
local timing on the failing test (single-test run, 32-core host):
baseline: 54.0 s
post-fix: 46.0 s (~15 % faster)
the issue itself scopes the perf gap as "multi-week" -- this is a
first pass closing the easy 15-20 % win without changing model
semantics. real residual is in activategenome's dictionary<int, t>
allocator pressure and the per-mutation clone path, both of which
will need follow-up work.
added genome cache fields:
internal int cachedtopologysignaturecount
internal ulong cachedtopologysignaturemask
internal list<connection<t>>? cachedsortedconnections
internal list<int>? cachednoninputnodeids
verified: all 12 neattests pass in isolation post-fix; no behavior
change vs baseline (lastloss values, mutation outputs, evolution
trajectory unchanged because the deleted branch was already dead
code per the post-evolution recompute comment).
* perf(#1392): activate genome on flat array + bulk clone
ActivateGenome was allocating a Dictionary<int, T> per call and indexing
through it for every connection in the topologically sorted edge list.
Under EvolvePopulation that's one Dictionary per genome per fitness call
— 150 pop x 50 gen x ~30 Train calls per test = ~225k Dictionary allocs
per test invocation.
Swap the Dictionary for a flat T[] sized to max(referenced node id,
biasNodeId) + 1. Connection traversal becomes pure array indexing; the
non-input-node sigmoid sweep walks a cached List<int> instead of
Dictionary.Keys. Three new genome caches piggyback on the existing
topology-sort caches added in 09534a4:
- CachedMaxNodeId — max(referenced node id, biasNodeId)
- CachedReferencedNonInputNodeIds — distinct non-input node ids
- (existing CachedSortedConnections invalidates these too)
Clone() now pre-sizes the child genome's Connections list to the parent
count instead of letting List<Connection<T>> grow through the
0->4->8->16 capacity-doubling chain (each step memcpys the buffer).
Connection<T> objects are still freshly allocated per child so parent
mutations don't leak across the clone boundary.
GetNamedLayerActivations dropped its Dictionary-specific ContainsKey
checks and Keys.Where(...) lookup in favor of straight array indexing
plus a walk over the cached non-input-node id list.
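The flat-array activation can be sketched as follows. This is an illustrative Python version under assumptions: the node-id layout (inputs first, then outputs, then the bias node), the sigmoid-before-first-read ordering, and the `activate` signature are all hypothetical, since the C# ActivateGenome body is not part of this message.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def activate(sorted_conns, inputs, bias_id, non_input_ids):
    """sorted_conns: topologically sorted (from_node, to_node, weight) tuples."""
    # One flat buffer sized to max(referenced node id, bias id) + 1,
    # replacing a per-call dictionary keyed by node id.
    size = max([bias_id] + [max(f, t) for f, t, _ in sorted_conns]) + 1
    values = [0.0] * size
    fired = [False] * size  # True once a node holds its final activation

    for i, x in enumerate(inputs):
        values[i] = x
        fired[i] = True
    values[bias_id] = 1.0
    fired[bias_id] = True

    for f, t, w in sorted_conns:
        if not fired[f]:
            # Topological order guarantees all inputs to f are accumulated.
            values[f] = sigmoid(values[f])
            fired[f] = True
        values[t] += w * values[f]

    # Final sweep over the cached non-input node ids (outputs and any
    # hidden node that never acted as a source).
    for n in non_input_ids:
        if not fired[n]:
            values[n] = sigmoid(values[n])
            fired[n] = True
    return values
```

Every lookup is a list index, so the inner loop does no hashing and allocates nothing beyond the two buffers created at the top of the call.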
Net wall time on NEATTests.Training_ShouldReduceLoss (isolated, net10):
- pre-#1419 baseline: ~54 s
- #1419 first pass (this PR): ~46 s (~15%)
- + this commit: ~41 s (~24% cumulative)
Build verified on net10.0 + net471. Test passes in isolation. Pre-existing
parallel-suite timeout on Training_ShouldReduceLoss is unaffected by this
change (confirmed by re-running against the stashed baseline) — that
flake's root cause is xunit parallel CPU contention against the 120 s
test budget, not a regression introduced here.
Issue: #1392
* perf(#1392): zero-alloc tournament + linq-free crossover + mutate
Three remaining hot paths in EvolvePopulation that were paying per-call
allocations on every offspring:
- SelectParent: built a fresh List<Genome<T>> + ran OrderByDescending
over it on every invocation. Tournament size is fixed at 3; ~447 k
calls per Training_ShouldReduceLoss run (149 children x 50 gens x
~30 Train calls x 2 parents per crossover). Rewritten as an inline
3-way argmax with no allocations and no LINQ.
- Crossover: Enumerable.Concat (enumerator alloc) and a fresh
HashSet<int> per call. Switched to a per-NEAT-instance scratch
HashSet that .Clear()s at the top of each call (single-threaded
Evolve loop, so reuse is safe), plus pre-sized the child
Connections list to (parent1.Count + parent2.Count) to skip the
capacity-doubling chain. Both parent lists are now walked by index.
- Mutate: LINQ Max + Any allocated a Func<,> delegate + enumerator
per call. Both replaced with manual index loops. Weight-mutation
foreach also replaced with an index loop so JIT can elide the
List<T>.Enumerator bounds check on each step.
- EvolvePopulation: pre-size newPopulation to _populationSize so its
backing array doesn't walk 0->4->8->16->...->150 on every
generation.
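The SelectParent rewrite amounts to a running argmax over three random draws. The Python sketch below assumes sampling with replacement and returns an index rather than a genome; both choices, and the `select_parent` name, are illustrative assumptions rather than details confirmed by the commit.

```python
import random


def select_parent(fitness, rng, tournament_size=3):
    """Return the index of the fittest of `tournament_size` random picks."""
    # Running argmax: no candidate list is built and nothing is sorted.
    best = rng.randrange(len(fitness))
    for _ in range(tournament_size - 1):
        candidate = rng.randrange(len(fitness))
        if fitness[candidate] > fitness[best]:
            best = candidate
    return best


# Example call with a seeded generator; the result is a valid index.
winner = select_parent([0.1, 0.9, 0.5], random.Random(0))
```

The call count quoted above follows from 149 children x 50 generations x 30 Train calls x 2 parents = 447,000 selections per test run.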
Connection<T> object pooling was considered and skipped — Connection's
FromNode/ToNode/Innovation are init-only via the public constructor, so
pooling would require either a breaking API change or a fragile internal
reset path that's bug-prone. Per-genome allocation churn for connections
is bounded by genome.Connections.Count (small) and is already paid
under JIT-friendly Add() calls into the pre-sized child list, so the
remaining marginal win does not justify the API risk.
All 6 NEAT tests pass when run individually (net10.0). Wall time on
Training_ShouldReduceLoss (isolated, 3-run min): ~41 s (unchanged vs
the prior commit — ActivateGenome dominates the inner loop, this commit
trims the outer loop's overhead and reduces GC pressure under parallel
load).
Issue: #1392
* fix(NEAT): FNV-1a topology signature catches same-count rewires + >64-conn aliasing
PR #1419 review: the previous ComputeEnabledBitmask cache key was built
only from (Connections.Count, XOR of enabled-bits). That signature
aliased two real edit patterns, so stale cached sorts, non-input-node
sets, and max-node-ids leaked back into ActivateGenome:
- Same-count rewires — swapping a connection's FromNode/ToNode for a
different node without flipping any IsEnabled bit preserved both
count and the bitmask → cache hit on the WRONG topology.
- >64-connection aliasing — the bitmask's `(i & 63)` wrap collapsed
slots 0/64/128/… onto the same bit, so a flip at slot 64 could
XOR-cancel an earlier flip at slot 0 and leave the mask unchanged.
Replace with FNV-1a 64-bit hash over (FromNode, ToNode, IsEnabled)
per slot in iteration order. Connection.Weight is deliberately
excluded — weight-only mutations are the dominant case across the 50
internal generations per public Train call, and we WANT the cached
topological sort to survive them.
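Both aliasing cases and the replacement hash can be reproduced in a short Python sketch. The FNV-1a 64-bit offset basis and prime are the standard constants; everything else is an assumption for illustration: the 4-byte little-endian encoding of each field, the tuple layout `(from_node, to_node, enabled, weight)`, and the `old_signature` reconstruction of the previous key.

```python
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def _mix(h, value):
    # FNV-1a: xor each byte into the hash, then multiply by the prime.
    for byte in value.to_bytes(4, "little", signed=True):
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def topology_signature(conns):
    """Hash (from_node, to_node, enabled) per slot; weight is ignored."""
    h = FNV_OFFSET
    for from_node, to_node, enabled, _weight in conns:
        h = _mix(h, from_node)
        h = _mix(h, to_node)
        h = _mix(h, int(enabled))
    return h


def old_signature(conns):
    """Assumed shape of the previous key: count plus wrapped XOR bitmask."""
    mask = 0
    for i, (_f, _t, enabled, _w) in enumerate(conns):
        if enabled:
            mask ^= 1 << (i & 63)
    return (len(conns), mask)


# Same-count rewire: only the source node of one connection changes.
before = [(0, 2, True, 0.5)]
rewired = [(1, 2, True, 0.5)]

# >64-connection aliasing: slots 0 and 64 share bit 0 of the mask.
all_on = [(0, 1, True, 0.0)] * 65
two_off = [(0, 1, False, 0.0)] + [(0, 1, True, 0.0)] * 63 + [(0, 1, False, 0.0)]
```

`old_signature` returns the same key for `before` and `rewired`, and for `all_on` and `two_off`, while `topology_signature` separates the rewire and still ignores weight-only changes.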
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(NEAT): include biasNodeId in newNodeId floor (PR #1419)
PR #1419 review (CodeRabbit minor): the add-node mutation's
first-free-node-id scan started at 0 and used
max(FromNode, ToNode) across connections. For an initial genome
(connections only reference InputSize + OutputSize node ids), that
max is InputSize + OutputSize − 1, producing newNodeId =
InputSize + OutputSize = biasNodeId.
ActivateGenome writes activations[biasNodeId] = NumOps.One BEFORE
the connection sweep, so any connection accumulating into this
hidden slot would corrupt the bias signal — and every connection
targeting the new hidden node would also read a polluted
pre-activation from the same slot. The collision only affected the
FIRST add-node mutation on a fresh genome (subsequent mutations
push maxNodeId past biasNodeId), but that's the most common path
and the corruption was silent.
Initialise maxNodeId at biasNodeId instead of 0 so newNodeId is
guaranteed > biasNodeId regardless of starting topology.
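The collision and the fix are easy to reproduce with small numbers. The Python sketch below assumes the id layout described above (inputs and outputs occupy ids 0 to InputSize + OutputSize − 1, and biasNodeId = InputSize + OutputSize); the function names are hypothetical.

```python
def new_node_id_old(connections):
    """Previous behavior: the scan starts at 0."""
    max_node_id = 0
    for from_node, to_node in connections:
        max_node_id = max(max_node_id, from_node, to_node)
    return max_node_id + 1


def new_node_id_fixed(connections, bias_node_id):
    """Fixed behavior: the scan starts at the bias node id."""
    max_node_id = bias_node_id
    for from_node, to_node in connections:
        max_node_id = max(max_node_id, from_node, to_node)
    return max_node_id + 1


# Fresh genome with 2 inputs (ids 0, 1) and 1 output (id 2); bias id is 3.
input_size, output_size = 2, 1
bias_node_id = input_size + output_size
fresh = [(0, 2), (1, 2)]

collides = new_node_id_old(fresh) == bias_node_id   # old scan lands on the bias slot
safe_id = new_node_id_fixed(fresh, bias_node_id)    # first id past the bias node
```

Once a hidden node exists above the bias id, both versions agree, which matches the note that only the first add-node mutation on a fresh genome was affected.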
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent e443bd8 commit a801d11
3 files changed
Lines changed: 411 additions & 84 deletions