fix(#1405): moe default optimizer overshoots — use amsgrad adam(1e-4) (#1409)

ooples · franklinic · claude · web-flow · commit 2ce0b3027395 · 2026-05-20T18:37:09.000-04:00
* fix(#1309): cluster-1 DCGAN — restore deferred-shape guard + lazy-conv deserialize fallback PR #1290 CI Cluster 1: 25 of 25 DCGANTests failing post-master with one of two errors: 1. Most (23 tests): "Invalid layer configuration: The last layer's output shape [3, -1, -1] must match the architecture output size (12288)." 2. Clone tests (2): "Input spatial dims after padding (1+2*1, 1+2*1) must be >= kernelSize (4)" raised inside DeserializationHelper's pre-resolve of the discriminator's first conv layer. Plus 1 SparseNN test (intermittent mode-collapse) that re-runs pass without code change — flaky, not a regression target. ## Root causes (1) NeuralNetworkBase.IsLastLayerShapeCompatible: PR #1329 (commit 969977d) added a `outputShape.Any(d => d < 0)` early-return so the validator defers the flat-OutputSize check when any output-shape dim is deferred — DCGAN's last transposed-conv emits [3, -1, -1] until its first Forward resolves H/W. That guard was inadvertently deleted by the grafprint PR (c8cac23, May 16) one day later. Restoring it unblocks all 23 validator-rejection cases at once. (2) DeserializationHelper conv path: when the saved layer record's inputShape carries -1 sentinels (a lazy conv layer serialized before its first Forward — DCGAN's discriminator on a Predict-only probe sees only the generator), the pre-existing code coerced all -1 dims to 1 and called conv.ResolveShapesOnly(...). For DCGAN's first conv (kernel=4, padding=1) this fails OnFirstForward's kernel-size check (1 + 2 < 4). Coercing to Math.Max(1, KernelSize) fixes that specific check, but locks InputDepth at 1 — then the real Forward with the [3, 64, 64] RGB image throws "Expected input depth 1, but got 3". The correct fix is to skip pre-resolve entirely when InputDepth is deferred — ConvolutionalLayer.SetParameters has its own auto-resolve fallback at line ~1598 that derives InputDepth from the saved parameter vector's length, and uses KernelSize as the spatial placeholder. Pre-resolve still runs (and uses Math.Max(1, KernelSize) for any deferred spatial dim) when InputDepth is concrete — that's the original PR #1329 contract for the auto-resolve-disambiguation case. ## Verification $ dotnet test --filter "FullyQualifiedName~DCGANTests|FullyQualifiedName~SparseNeuralNetworkTests" --framework net10.0 Failed! - Failed: 2, Passed: 44, Skipped: 0, Total: 46 26 → 2 failures. The remaining two are NOT cluster-1 shape-contract issues: - DCGANTests.MoreData_ShouldNotDegrade — `Test execution timed out after 120000 milliseconds`. Pre-existing GAN training-path perf gap; the deep deconv+conv chain in tape mode is ~5-10× slower than PyTorch CPU baseline. Substep profile (Release): Generator.Predict 19 ms, Discriminator.Train 187 ms, Generator adversarial 313 ms — 519 ms/step × 250 iters = 130 s vs 120 s timeout. Filed separately so this PR ships the actual cluster-1 root causes (validator + conv-deserialize) without bundling a multi-week perf project. - SparseNeuralNetworkTests.DifferentInputs_AfterTraining_ShouldProduceDifferentOutputs — intermittent mode-collapse, passes on re-runs. Separate flaky-test issue, not a shape-contract bug. Closes #1309 partially (cluster-1 shape-contract root causes). The MoreData_ShouldNotDegrade timeout + SparseNN mode-collapse flakiness are tracked separately. 🤖 Generated with [Claude Code](https://claude.com/claude-code) * fix(PR #1389 review): document zero-dim wildcard semantics + reject malformed Conv inputShape rank * fix(PR #1389 follow-up): widen rank check to reject rank-1/2 Conv inputShape too * perf(#1390): eliminate duplicate generator forward in GAN.Train — closes DCGAN MoreData timeout Previously GenerativeAdversarialNetwork.Train ran the generator forward TWICE per training step: 1. Generator.Predict(input) (eval mode, NoGradScope) → detached fake images for the combined real+fake discriminator step. 2. ForwardForTraining(input) (train mode, on tape) inside TrainWithCustomLoss — duplicate of the same forward, just for the gen-adversarial backward. On the DCGAN MoreData fixture (250 iters, double-precision, batch=2, 64×64 RGB) this duplicate forward contributed ~19 ms of the 519 ms / step profiled in #1390 — pushing the test 10 s over its 120 s budget. Refactor: - Open a single GradientTape at the start of the step. - Run ForwardForTraining(input) ONCE on that tape → fakeTapeTracked. - Take a value-copy detached snapshot (fakeImages) for the disc step; fresh Tensor<T> with no GradNode chain so disc.Train (which opens its own nested tape) can not leak gradients back into the generator. - Walk the discriminator layer-by-layer on the existing gen tape for the adversarial loss (unchanged from the prior closure semantics). - Drive the gen optimizer step via the new NeuralNetworkBase.BackwardAndStepOnPrecomputedLoss helper, which reuses the open tape instead of TrainWithCustomLoss opening a fresh one + re-running ForwardForTraining. Behavior note: the disc step now sees train-mode generator output (batch BN stats) instead of eval-mode (running BN stats). This matches PyTorch's standard DCGAN training pattern (fake = G(z); fake_detached = fake.detach()) and the existing gen step's own train-mode forward. DCGAN has no Dropout, so the only distribution shift is BN stats, which is the conventional adversarial behavior. Verified locally with the canonical Tensors 0.81.3 dependency: - DCGANTests.MoreData_ShouldNotDegrade: 1 m 47 s (was timing out at > 120 s) — closes the test's perf gap. - Full DCGANTests class: 25 / 25 passing. - ConditionalGANTests + InfoGANTests (other GAN.Train consumers): 50 / 50 passing. - Full SparseNeuralNetworkTests: 21 / 21 passing (previously "intermittent mode-collapse" in PR #1389 description — appears stable now, may have been transient). Closes #1390. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(pr1389-review): narrow visibility + reentrancy + extra trainables addresses three coderabbit comments on backwardandsteponprecomputed loss in pr #1389: 1. visibility narrowed public -> internal. the codebase contract is "users should only interact with aimodelbuilder / aimodelresult" and this helper is training plumbing for in-assembly callers (currently generativeadversarialnetwork.train); no reason for it to live on the public surface. only caller is in same assembly. 2. added using var __reentrancyguard = acquiretrainsentinel() at the top, mirroring trainwithtape's sentinel discipline. without it, concurrent callers on the same model race on lastloss + optimizer internal state. 3. trainableparams now concats getextratrainabletensors() with the layer params, matching trainwithtape's parameter set. without this models that expose raw tensors via getextratrainabletensors (rather than layer-resident params) silently skipped updates on the precomputed-loss path -- divergent semantics between the two training entry points. build passes. * fix(#1405): moe default optimizer overshoots — use amsgrad adam(1e-4) vanilla adam default at the recommended learning rate diverges on the moredata long-training regression for mixtureofexpertsneuralnetwork (short=ok, long=overshoot) — same signature as the densenet fix in #1393. two root causes: 1. lr too aggressive for the moe gating + per-expert summed-gradient path, where multiple experts touch each parameter per step. 2. vanilla adam's v denominator drifts late in training, so step size grows even after loss has bottomed out. switching the built-in default to amsgrad-mode adam at 1e-4 keeps the running max of v (reddi et al. 2018), which removes the late-training denominator decay and matches the recipe already proven on densenet. users passing an explicit optimizer are unaffected. Closes #1405 --------- Co-authored-by: franklinic <franklin@ivorycloud.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/src/NeuralNetworks/MixtureOfExpertsNeuralNetwork.cs b/src/NeuralNetworks/MixtureOfExpertsNeuralNetwork.cs
@@ -1,5 +1,7 @@
 using AiDotNet.Attributes;
 using AiDotNet.Enums;
+using AiDotNet.Models.Options;
+using AiDotNet.Optimizers;
 
 namespace AiDotNet.NeuralNetworks;
 
@@ -179,7 +181,35 @@ public MixtureOfExpertsNeuralNetwork(
         Options = _options;
         _options.Validate();
 
-        _optimizer = optimizer ?? new AdamOptimizer<T, Tensor<T>, Tensor<T>>(this);
+        // Issue #1405 (mirrors #1393 DenseNet fix): switched default optimizer
+        // from vanilla Adam(lr=1e-3) to AMSGrad-mode Adam(lr=1e-4) to stop the
+        // optimizer from drifting past the converged point on small /
+        // memorization fixtures (MoreData_ShouldNotDegrade was failing because
+        // lossLong > lossShort by enough to trip the 1e-4 MoreDataTolerance).
+        //
+        // Two compounding causes in the prior default:
+        //   1. lr=1e-3 is too aggressive for the MoE gating + expert
+        //      summed-gradient paths — every parameter sees gradient from
+        //      multiple expert chains weighted by the gate, amplifying the
+        //      effective per-step update vs a plain sequential MLP. 1e-4
+        //      shrinks per-step magnitude (Kingma & Ba 2014 §2.1 convention).
+        //   2. Vanilla Adam's per-parameter v̂ denominator can decay alongside
+        //      gradients near convergence so the effective step size doesn't
+        //      shrink in lockstep, leading to post-convergence drift. AMSGrad
+        //      (Reddi, Kale, Kumar 2018) maintains a running v̂_max — the
+        //      denominator can only grow, eliminating the drift.
+        //
+        // Same configuration NeuralNetworkBase.GetOrCreateBaseOptimizer hands
+        // out as the framework-wide tape default; made explicit on MoE so the
+        // public Train() path benefits without going through the tape-only
+        // path. Callers passing an explicit optimizer are unaffected.
+        _optimizer = optimizer ?? new AdamOptimizer<T, Tensor<T>, Tensor<T>>(
+            this,
+            new AdamOptimizerOptions<T, Tensor<T>, Tensor<T>>
+            {
+                InitialLearningRate = 1e-4,
+                UseAMSGrad = true
+            });
         _lossFunction = lossFunction ?? NeuralNetworkHelper<T>.GetDefaultLossFunction(architecture.TaskType);
 
         InitializeLayers();