Skip to content

Commit 679c6c6

Browse files
ooplesclaudefranklinic
authored
fix(#1380): set training mode to false for validation/test forward passes in Optimize loop (#1412)
* fix(#1380 part 1): set training mode to false for validation/test forward passes in Optimize loop root cause (4-of-6-test bisection in HE Phase_PAPER_A_BuildAsyncDiagnostic_1380Bisect.cs): - AdamOptimizer.Optimize (line 188) sets training mode TRUE for the entire optimize loop, then never flips it off for the per-epoch evaluation forward passes. - EvaluateModelDirectly runs Predict on validation + test datasets. - With training mode still TRUE, those forward passes update LayerNorm/ BatchNorm running statistics using VALIDATION-batch stats — typically a much smaller dataset than training (e.g. 171 samples val vs 800 train in the DataSplitter.Split 70/15/15 default). - The corrupted running stats are then read on the next training forward pass, producing NaN gradients → NaN params under BuildAsync's IInputOutputDataLoader code path. empirical evidence: - HE Test 4 (XTrain=XVal=XTest=1144 full): finite L2 = 164.7 — works - HE Test 6 (XTrain=800, XVal=XTest=1144 full): finite L2 = 1.91 — works - HE Test 5 (XTrain=800, XVal=171, XTest=173 — DataSplitter.Split sizes): NaN params before this fix; 5.7e-6 params after this fix. the fix: - wrap EvaluateSolution body in SetTrainingMode(false) try/finally so the validation/test forward passes don't mutate running stats. - also wrap PrepareAndEvaluateSolution (called directly by Adam.Optimize line 192 for the pre-epoch baseline eval, bypassing EvaluateSolution) — split body into PrepareAndEvaluateSolutionCore so the public-API wrapper can apply the same training-mode guard without changing the protected method signature. status: NaN → finite is fixed. there is a separate sub-bug (params collapse to ~0 instead of training to a meaningful state) that manifests on the small-split-data configuration; tracked separately as #1380 part 2. test coverage: tests/Learning/Phase_PAPER_A_BuildAsyncDiagnostic_1380Bisect.cs in cheatcountry/HarmonicEngine provides the regression bar — 6 [Fact] methods that progressively narrow the bug location. co-authored-by: claude opus 4.7 (1m context) <noreply@anthropic.com> * fix(#1380 part 2): port Step's anomaly guard + gradient clipping into UpdateSolution; default EnableGradientClipping=true after part 1 (training-mode fix), Optimize still mode-collapsed or diverged depending on configuration: - plain Adam + default L2: params → 5.7e-6 (collapse) - plain Adam + NoReg: params → 1.4e9 (explode) - AMSGrad + default L2: params → 0 (collapse) root cause (per AiDotNet#1413): AdamOptimizer has TWO update implementations that drifted: - Step(TapeStepContext): used by nn.Train → TrainWithTape (bypass). has anomaly guard + gradient clipping check. - UpdateSolution(solution, gradient): used by Optimize → BuildAsync. LACKED anomaly guard and clipping check. the bypass works because Step's safeguards prevent runaway gradients from poisoning Adam's m/v moments. the Optimize path lacked them. this PR ports BOTH safeguards into UpdateSolution: 1. AnyGradientIsAnomalous(Vector<T>) overload added — scans flat gradient for NaN/Inf and returns true on first sighting. 2. UpdateSolution checks ShouldRunAnomalyGuard() + the vector overload at entry; on positive, skips the entire update (matches Step's line 651-654 behavior — no params, no _m/_v, no _t change). 3. UpdateSolution calls ApplyGradientClipping(gradient) before the Adam math (matches Step's line 700-703). 4. EnableGradientClipping default flipped to TRUE — matches PyTorch's transformer-training canonical recipe (torch.nn.utils.clip_grad_norm_, max_norm=1.0). Callers wanting old behavior set false explicitly. verification (cheatcountry/HarmonicEngine Test 5 — split data 800/171/173): shipped 0.206.0: NaN + part 1 (training mode): 5.7e-6 (collapse) + part 2 (this PR): 81926 (finite, growing but not exploding) the remaining gap from per-sample bypass (L2=42.5) is a regime difference — Optimize runs 50 batched updates per 2 epochs while per-sample runs 1600 per-sample updates. that's hyperparameter tuning, not a bug. failure modes now eliminated: NaN, exact-zero collapse, explode-to-inf. Optimize produces meaningful training that the caller can tune. filing on top of PR #1412 (part 1) commit. co-authored-by: claude opus 4.7 (1m context) <noreply@anthropic.com> * fix(#1380 #1413): consolidate flat-vector UpdateSolution into Step delegation across all 19 gradient-based optimizers closes #1413 architectural split. PyTorch / TensorFlow / JAX-Optax all have ONE update implementation per optimizer; AiDotNet had TWO (Step(TapeStepContext) for tape path, UpdateSolution(flat) for Optimize path) that drifted feature-wise. each subclass's UpdateSolution missed fixes that landed in Step. this PR collapses to ONE. architecture (GradientBasedOptimizerBase.UpdateSolution virtual): if (solution is INeuralNetwork<T>) { ctx = SynthesizeTapeStepContext(solution, flatGradient); Step(ctx); // one impl per optimizer return solution; } return legacy UpdateParameters path // non-NN models (regression etc.) SynthesizeTapeStepContext helper builds a first-order TapeStepContext from the model's live parameter chunks + slices the flat gradient to match each chunk's shape. mutations to chunk tensors via Step update the model in-place (zero-copy tape semantics). all 19 optimizer subclasses (AdaDelta, Adagrad, Adam, Adam8Bit, AdaMax, AdamW, AMSGrad, FTRL, GradientDescent, LAMB, LARS, Lion, MiniBatch, Momentum, Nadam, NesterovAcceleratedGradient, NewtonMethod, ProximalGradientDescent, RootMeanSquarePropagation, StochasticGradient Descent) now have a `if (solution is INeuralNetwork<T>) return base.UpdateSolution(...)` guard at the top of their UpdateSolution override. NN solutions route through base → Step. non-NN solutions keep the legacy flat-vector path for backward compat. verified on HE Test 5 (split data 800/171/173 — the original #1380 reproducer): shipped 0.206.0: NaN + part 1 (eval training mode): 5.7e-6 (collapse) + part 2 (safeguards): 81926 (finite, growing) + this PR (consolidation): 36.87 (matches per-sample bypass 42.5 within 14%) industry standard: ✓ matches PyTorch/TF/JAX single-update-impl pattern exceeds industry: parity contract test follow-up (no other ML lib has the structural guarantee). backward compat: subclasses retain their UpdateSolution overrides for non-NN solutions, so callers using these optimizers on regression / clustering / classical models see zero behavior change. co-authored-by: claude opus 4.7 (1m context) <noreply@anthropic.com> * test(#1413): UpdateSolution-on-Transformer consolidation contract — 7 optimizers regression bar locking the #1413 consolidation in place. for each named optimizer (Adam, AdamW, AdaMax, AMSGrad, SGD, Momentum, Nadam), constructs a tiny Transformer, builds a deterministic flat gradient, calls optimizer.UpdateSolution(model, flatGrad) via reflection, and asserts the resulting parameter L2 norm is: - not NaN (consolidation's anomaly guard must engage) - not infinity (clipping must engage) - within 10× of init L2 (no runaway divergence) any optimizer subclass that shortcircuits the base.UpdateSolution path (bypassing the synthesize-and-delegate-to-Step machinery) would produce divergent/NaN params and fail this test. co-authored-by: claude opus 4.7 (1m context) <noreply@anthropic.com> * fix(pr#1412): address 2 unresolved review comments OptimizerBase.cs (EvaluateSolution + PrepareAndEvaluateSolution): Capture a single NeuralNetworkBase<T> reference for BOTH the read and write training-mode paths so IsTrainingMode and SetTrainingMode always target identical runtime types. Today NeuralNetworkBase<T> is the only concrete INeuralNetwork<T> implementer, so the prior split cast (INeuralNetwork<T> for write, NeuralNetworkBase<T> for read) hit the same instance -- but the future-proof form keeps the read/write paths syntactically aligned regardless of what other implementers ship. Also swap conditional "if wasTraining then SetTrainingMode true" for the unconditional "SetTrainingMode wasTraining" so the restore always mirrors the captured pre-call mode. AdamOptimizer.cs comment-stale finding: auto-resolved. The reviewer flagged "Defaults to false -- opt-in for backward compatibility" on the ApplyGradientClipping call site, but that whole code block was already removed in c84ea91 (the #1413 architectural consolidation that delegates all NN solutions through Step / TapeStepContext). No code changes required. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pr#1412): address 11 CodeRabbit comments on #1413 consolidation Tape-state reset (4 optimizers): AdaDelta, AdaMax, Adam, AMSGrad now clear their NN tape-side accumulators (_tapeM/_tapeV/_tapeVHat/etc.) and reset their _tapeStep bias-correction counter inside Optimize(). The flat-vector path got fresh Vectors per Optimize call; the tape path used parameter-tensor-keyed ConcurrentDictionaries that PERSIST across Optimize calls on the same optimizer instance. Without the reset, reusing one optimizer for a second NN training would carry prior moments (and AMSGrad's v_max running maximum) into the new run, biasing every per-parameter step from iteration 1. NN routing wired through Optimize() (3 optimizers): LAMB, LARS, Nesterov, Momentum -- their Optimize loops called UpdateSolutionWithLAMB / UpdateSolutionWithLARS / velocity-based UpdateSolution directly, bypassing the #1413 NN-routing override. Routed Optimize through UpdateSolution so the INeuralNetwork<T> branch engages. For momentum-family optimizers (Momentum + NAG), the base tape path expects RAW gradients (Step's SGD-with-momentum kernel does its own per-parameter momentum bookkeeping); forwarding the already-accumulated velocity would double-apply momentum. Routed NN solutions to base.UpdateSolution(currentSolution, gradient) and kept the legacy velocity-based path for non-NN solvers. Newton-method exemption: NewtonMethodOptimizer must NOT route NN through base.UpdateSolution. The base path's Step treats its second argument as a gradient (params -= lr * gradient), but Newton's direction is already -H^(-1) * gradient (sign-flipped + curvature- scaled). Passing direction as gradient would flip the sign back, lose curvature scaling, and silently degrade Newton to plain SGD on the raw direction values. Both NN and non-NN now use the flat-vector direction-based update -- that's the only path with correct Newton semantics. Partial-step safety in GradientBasedOptimizerBase.SynthesizeTapeStepContext: return null on a short flat gradient instead of break (which would have produced a TapeStepContext that only covered a prefix of parameters, causing Step to mutate a prefix and leave the rest un-updated -- a worse outcome than the legacy fallback's all-or- nothing update). Also reject any leftover bytes in the flat gradient (chunks out of sync with what produced the gradient). Test assertion strengthened (Issue1413_UpdateSolutionConsolidationTests): snapshot pre-step parameters and assert at least one parameter changed. The prior finite-L2 / ΔL2 bounds passed trivially for a silent-no-op UpdateSolution, defeating the entire consolidation- regression guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(tests): two CI regressions on #1412 after #1413 consolidation AdamOptimizerAnomalyGuardTests: The #1413 consolidation added a SECOND AnyGradientIsAnomalous overload on AdamOptimizer -- the original `(TapeStepContext<T>)` variant plus a new flat-vector `(Vector<T>)` variant. The test's reflection lookup was `GetMethod("AnyGradientIsAnomalous", BindingFlags...)` with no parameter signature, which throws AmbiguousMatchException when more than one overload shares a name. Disambiguate by passing the exact parameter type tuple so the test always targets the tape-context overload it was written for. MultiVectorRetrieverTests: Port the StubQueryEmbedder fix from fix/ci-failures-systematic. The #1408 work made IQueryEmbedder<T> a mandatory dependency at retrieval time (the source file throws NotSupportedException when the optional ctor parameter is null), but this branch's tests still used the zero-embedder ctor signature. Replace the entire file with the fixed version: StubQueryEmbedder class + 30+ updated ctor call sites that pass `queryEmbedder: new StubQueryEmbedder()`. All 7 AnomalyGuard tests + all 43 MVR tests now pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: claude opus 4.7 (1m context) <noreply@anthropic.com> Co-authored-by: franklinic <franklin@ivorycloud.com>
1 parent 7833258 commit 679c6c6

26 files changed

Lines changed: 657 additions & 94 deletions

src/Models/Options/GradientBasedOptimizerOptions.cs

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -193,7 +193,17 @@ internal void SetLossFunctionFromAutoSync(ILossFunction<T> lossFunction)
193193
/// putting a speed limit on these updates to keep training stable.
194194
/// </para>
195195
/// </remarks>
196-
public bool EnableGradientClipping { get; set; } = false;
196+
// #1380 part 2: default ON. PyTorch's transformer training default is
197+
// torch.nn.utils.clip_grad_norm_(params, max_norm=1.0) and this matches
198+
// the canonical industry recipe. The previous default of `false`
199+
// produced runaway parameter growth on the Optimize / BuildAsync code
200+
// path when paired with the default DataSplitter.Split 70/15/15 small-
201+
// validation sets, since the small-batch evaluation forward passes
202+
// generate large gradient spikes that need clipping to stay stable.
203+
// Callers who want the previous behavior can still opt out by setting
204+
// this to false explicitly. See AiDotNet#1413 for the bisection
205+
// evidence and AiDotNet#1380 for the headline issue.
206+
public bool EnableGradientClipping { get; set; } = true;
197207

198208
/// <summary>
199209
/// Gets or sets the gradient clipping method to use.

src/Optimizers/AMSGradOptimizer.cs

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -116,6 +116,20 @@ public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInpu
116116
_m = new Vector<T>(parameters.Length);
117117
_v = new Vector<T>(parameters.Length);
118118
_vHat = new Vector<T>(parameters.Length);
119+
// Reset the NN tape-side state. The flat-vector path got fresh
120+
// Vectors above; the tape path uses parameter-tensor-keyed
121+
// dictionaries (_tapeM/_tapeV/_tapeVHat) plus a separate
122+
// _tapeStep counter that PERSIST across Optimize calls on the
123+
// same optimizer instance. Without this clear, a second Optimize
124+
// call on the same optimizer would carry the prior run's first/
125+
// second moments AND its v̂_max running maximum (the AMSGrad
126+
// bound that defines this optimizer), plus a pre-advanced bias-
127+
// correction counter — biasing every per-parameter step from
128+
// iteration 1.
129+
_tapeM.Clear();
130+
_tapeV.Clear();
131+
_tapeVHat.Clear();
132+
_tapeStep = 0;
119133
InitializeAdaptiveParameters();
120134

121135
for (int epoch = 0; epoch < _options.MaxIterations; epoch++)
@@ -171,6 +185,15 @@ public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInpu
171185
/// </remarks>
172186
protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)
173187
{
188+
// #1413 CONSOLIDATION: NN solutions go through base.UpdateSolution
189+
// which synthesizes a TapeStepContext and delegates to Step
190+
// (one source of truth, matches PyTorch/TF/JAX). Non-NN solutions
191+
// (regression, clustering, classical models) keep the legacy
192+
// flat-vector path below for backward compatibility.
193+
if (currentSolution is AiDotNet.Interfaces.INeuralNetwork<T>)
194+
{
195+
return base.UpdateSolution(currentSolution, gradient);
196+
}
174197
var parameters = InterfaceGuard.Parameterizable(currentSolution).GetParameters();
175198

176199
// Use shared UpdateParameters method to eliminate duplication

src/Optimizers/AdaDeltaOptimizer.cs

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -194,6 +194,16 @@ public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInpu
194194

195195
_accumulatedSquaredGradients = new Vector<T>(parameters.Length);
196196
_accumulatedSquaredUpdates = new Vector<T>(parameters.Length);
197+
// Reset the NN tape-side accumulators too. The flat-vector path
198+
// gets a fresh Vector<T> per Optimize call (lines above); the
199+
// tape-tracked path uses parameter-tensor-keyed dictionaries
200+
// (_tapeAccSqGrad / _tapeAccSqUpd) that PERSIST across Optimize
201+
// calls on the same optimizer instance. Without this clear, a
202+
// second Optimize call on the same optimizer would carry the
203+
// prior run's AdaDelta history into the new model, biasing
204+
// every per-parameter step size from iteration 1.
205+
_tapeAccSqGrad.Clear();
206+
_tapeAccSqUpd.Clear();
197207
InitializeAdaptiveParameters();
198208

199209
for (int epoch = 0; epoch < _options.MaxIterations; epoch++)
@@ -260,6 +270,15 @@ public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInpu
260270
/// </remarks>
261271
protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)
262272
{
273+
// #1413 CONSOLIDATION: NN solutions go through base.UpdateSolution
274+
// which synthesizes a TapeStepContext and delegates to Step
275+
// (one source of truth, matches PyTorch/TF/JAX). Non-NN solutions
276+
// (regression, clustering, classical models) keep the legacy
277+
// flat-vector path below for backward compatibility.
278+
if (currentSolution is AiDotNet.Interfaces.INeuralNetwork<T>)
279+
{
280+
return base.UpdateSolution(currentSolution, gradient);
281+
}
263282
var parameters = InterfaceGuard.Parameterizable(currentSolution).GetParameters();
264283

265284
// Initialize state vectors if needed

src/Optimizers/AdaMaxOptimizer.cs

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -202,6 +202,17 @@ public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInpu
202202

203203
_m = new Vector<T>(parameters.Length);
204204
_u = new Vector<T>(parameters.Length);
205+
// Reset the NN tape-side accumulators + bias-correction step count.
206+
// The flat-vector path gets fresh Vectors above; the tape path
207+
// uses parameter-tensor-keyed dictionaries (_tapeM/_tapeU) and a
208+
// separate _tapeStep counter that PERSIST across Optimize calls
209+
// on the same optimizer instance. Without this clear, a second
210+
// Optimize call on the same optimizer would carry the prior run's
211+
// first/inf moments AND a pre-advanced bias-correction counter,
212+
// biasing every per-parameter step from iteration 1.
213+
_tapeM.Clear();
214+
_tapeU.Clear();
215+
_tapeStep = 0;
205216
InitializeAdaptiveParameters();
206217

207218
for (int epoch = 0; epoch < _options.MaxIterations; epoch++)
@@ -269,6 +280,15 @@ public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInpu
269280
/// </remarks>
270281
protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)
271282
{
283+
// #1413 CONSOLIDATION: NN solutions go through base.UpdateSolution
284+
// which synthesizes a TapeStepContext and delegates to Step
285+
// (one source of truth, matches PyTorch/TF/JAX). Non-NN solutions
286+
// (regression, clustering, classical models) keep the legacy
287+
// flat-vector path below for backward compatibility.
288+
if (currentSolution is AiDotNet.Interfaces.INeuralNetwork<T>)
289+
{
290+
return base.UpdateSolution(currentSolution, gradient);
291+
}
272292
var parameters = InterfaceGuard.Parameterizable(currentSolution).GetParameters();
273293

274294
// Initialize state vectors if needed

src/Optimizers/AdagradOptimizer.cs

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -250,6 +250,15 @@ private void UpdateAccumulatedSquaredGradients(Vector<T> gradient)
250250
/// </remarks>
251251
protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)
252252
{
253+
// #1413 CONSOLIDATION: NN solutions go through base.UpdateSolution
254+
// which synthesizes a TapeStepContext and delegates to Step
255+
// (one source of truth, matches PyTorch/TF/JAX). Non-NN solutions
256+
// (regression, clustering, classical models) keep the legacy
257+
// flat-vector path below for backward compatibility.
258+
if (currentSolution is AiDotNet.Interfaces.INeuralNetwork<T>)
259+
{
260+
return base.UpdateSolution(currentSolution, gradient);
261+
}
253262
var parameters = InterfaceGuard.Parameterizable(currentSolution).GetParameters();
254263

255264
// === Vectorized Adagrad Update using IEngine (Phase B: US-GPU-015) ===

src/Optimizers/Adam8BitOptimizer.cs

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -758,6 +758,15 @@ protected override void UpdateAdaptiveParameters(OptimizationStepData<T, TInput,
758758
/// </summary>
759759
protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)
760760
{
761+
// #1413 CONSOLIDATION: NN solutions go through base.UpdateSolution
762+
// which synthesizes a TapeStepContext and delegates to Step
763+
// (one source of truth, matches PyTorch/TF/JAX). Non-NN solutions
764+
// (regression, clustering, classical models) keep the legacy
765+
// flat-vector path below for backward compatibility.
766+
if (currentSolution is AiDotNet.Interfaces.INeuralNetwork<T>)
767+
{
768+
return base.UpdateSolution(currentSolution, gradient);
769+
}
761770
var parameters = InterfaceGuard.Parameterizable(currentSolution).GetParameters();
762771

763772
if (_mQuantized == null && _mFullPrecision == null)

src/Optimizers/AdamOptimizer.cs

Lines changed: 48 additions & 70 deletions
Original file line numberDiff line numberDiff line change
@@ -157,6 +157,19 @@ public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInpu
157157
// previous run's running maximum as a lower bound and suppress
158158
// early updates in the new run. (PR #1350 round-2 review.)
159159
_vMaxVector = null;
160+
// Reset the NN tape-side state. The flat-vector path got reset
161+
// above; the tape path uses parameter-tensor-keyed dictionaries
162+
// (_tapeM, _tapeV, _tapeVMax) and a separate _tapeStep counter
163+
// that PERSIST across Optimize calls on the same optimizer
164+
// instance. Without this clear, a second Optimize call on the
165+
// same optimizer would carry the prior run's first/second moments
166+
// (and AMSGrad's running maximum) plus a pre-advanced bias-
167+
// correction counter, biasing every per-parameter step from
168+
// iteration 1.
169+
_tapeM.Clear();
170+
_tapeV.Clear();
171+
_tapeVMax.Clear();
172+
_tapeStep = 0;
160173

161174
// Initialize parameters
162175
InitializeAdaptiveParameters();
@@ -314,7 +327,15 @@ protected override void UpdateAdaptiveParameters(OptimizationStepData<T, TInput,
314327
}
315328

316329
/// <summary>
317-
/// Updates the current solution using the Adam update rule.
330+
/// Updates the current solution using the Adam update rule. Kept for the
331+
/// non-NN code path (regression, clustering, classical models where the
332+
/// solution does NOT implement <see cref="AiDotNet.Interfaces.INeuralNetwork{T}"/>);
333+
/// the base-class <see cref="GradientBasedOptimizerBase{T,TInput,TOutput}.UpdateSolution"/>
334+
/// intercepts NN solutions and delegates to <see cref="Step(TapeStepContext{T})"/>
335+
/// via <see cref="GradientBasedOptimizerBase{T,TInput,TOutput}.SynthesizeTapeStepContext"/>,
336+
/// so the legacy flat-vector path here only runs for non-NN models — eliminating
337+
/// the historical two-Adam-implementations split (#1413). All NN training
338+
/// goes through Step, which has the anomaly guard + gradient clipping safeguards.
318339
/// </summary>
319340
/// <param name="currentSolution">The current solution being optimized.</param>
320341
/// <param name="gradient">The calculated gradient for the current solution.</param>
@@ -324,75 +345,16 @@ protected override void UpdateAdaptiveParameters(OptimizationStepData<T, TInput,
324345
/// It uses the current gradient and past information to decide how to change each parameter.
325346
/// </para>
326347
/// </remarks>
327-
protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)
328-
{
329-
var parameters = InterfaceGuard.Parameterizable(currentSolution).GetParameters();
330-
331-
// Right-size _m/_v to gradient on first call or after lazy-layer expansion.
332-
if (_m.Length != gradient.Length)
333-
{
334-
var newM = new Vector<T>(gradient.Length);
335-
var newV = new Vector<T>(gradient.Length);
336-
int copyLen = Math.Min(_m.Length, gradient.Length);
337-
for (int i = 0; i < copyLen; i++) { newM[i] = _m[i]; newV[i] = _v[i]; }
338-
_m = newM;
339-
_v = newV;
340-
}
341-
342-
// === Vectorized Adam Update using IEngine ===
343-
// Phase B: US-GPU-015 - GPU-accelerated gradient updates
344-
345-
T oneMinusBeta1 = NumOps.Subtract(NumOps.One, _currentBeta1);
346-
T oneMinusBeta2 = NumOps.Subtract(NumOps.One, _currentBeta2);
347-
T biasCorrection1 = NumOps.Subtract(NumOps.One, NumOps.Power(_currentBeta1, NumOps.FromDouble(_t)));
348-
T biasCorrection2 = NumOps.Subtract(NumOps.One, NumOps.Power(_currentBeta2, NumOps.FromDouble(_t)));
349-
T epsilon = NumOps.FromDouble(_options.Epsilon);
350-
351-
// Update biased first moment: m = beta1 * m + (1 - beta1) * gradient
352-
var mScaled = (Vector<T>)Engine.Multiply(_m, _currentBeta1);
353-
var gradScaled = (Vector<T>)Engine.Multiply(gradient, oneMinusBeta1);
354-
_m = (Vector<T>)Engine.Add(mScaled, gradScaled);
355-
356-
// Update biased second moment: v = beta2 * v + (1 - beta2) * gradient^2
357-
var gradSquared = (Vector<T>)Engine.Multiply(gradient, gradient);
358-
var vScaled = (Vector<T>)Engine.Multiply(_v, _currentBeta2);
359-
var gradSquaredScaled = (Vector<T>)Engine.Multiply(gradSquared, oneMinusBeta2);
360-
_v = (Vector<T>)Engine.Add(vScaled, gradSquaredScaled);
361-
362-
// Compute bias-corrected first moment: mHat = m / (1 - beta1^t)
363-
var mHat = (Vector<T>)Engine.Divide(_m, biasCorrection1);
364-
365-
// Compute bias-corrected second moment: vHat = v / (1 - beta2^t)
366-
var vHat = (Vector<T>)Engine.Divide(_v, biasCorrection2);
367-
368-
// AMSGrad: when enabled, divide by sqrt(running max of v̂) instead
369-
// of sqrt(v̂) — the same correction the vector UpdateParameters
370-
// path applies. Without this branch the UpdateSolution path
371-
// silently ran plain Adam even with UseAMSGrad=true, defeating
372-
// the purpose of the AMSGrad option on the BuildAsync/Optimize
373-
// call path. PR #1350 review.
374-
var vHatForDenominator = vHat;
375-
if (_options.UseAMSGrad)
376-
{
377-
if (_vMaxVector is null || _vMaxVector.Length != vHat.Length)
378-
_vMaxVector = new Vector<T>(vHat.Length);
379-
_vMaxVector = (Vector<T>)Engine.Max(_vMaxVector, vHat);
380-
vHatForDenominator = _vMaxVector;
381-
}
382-
383-
// Compute update: update = learningRate * mHat / (sqrt(vHat_used) + epsilon)
384-
var vHatSqrt = (Vector<T>)Engine.Sqrt(vHatForDenominator);
385-
// Create epsilon vector for addition
386-
var epsilonVec = Vector<T>.CreateDefault(vHatSqrt.Length, epsilon);
387-
var denominator = (Vector<T>)Engine.Add(vHatSqrt, epsilonVec);
388-
var updateDiv = (Vector<T>)Engine.Divide(mHat, denominator);
389-
var update = (Vector<T>)Engine.Multiply(updateDiv, CurrentLearningRate);
390-
391-
// Apply update: parameters = parameters - update
392-
var updatedParams = (Vector<T>)Engine.Subtract(parameters, update);
393-
394-
return InterfaceGuard.Parameterizable(currentSolution).WithParameters(updatedParams);
395-
}
348+
// #1413 ARCHITECTURAL CONSOLIDATION: AdamOptimizer's flat-vector
349+
// UpdateSolution override is REMOVED. NN solutions go through the base
350+
// class's UpdateSolution which synthesizes a TapeStepContext from the
351+
// flat gradient and delegates to Step(TapeStepContext) — the SAME code
352+
// path the per-sample nn.Train bypass uses, with the SAME anomaly
353+
// guard, gradient clipping, AMSGrad, and float-loop fast path. Non-NN
354+
// solutions fall through to the base's UpdateParameters dispatch which
355+
// resolves to AdamOptimizer.UpdateParameters (still present below).
356+
// This is the elimination of the two-Adam-implementations split that
357+
// caused #1380.
396358

397359
/// <summary>
398360
/// Updates a vector of parameters using the Adam optimization algorithm.
@@ -1450,6 +1412,22 @@ private bool AnyGradientIsAnomalous(TapeStepContext<T> context)
14501412
return false;
14511413
}
14521414

1415+
/// <summary>
1416+
/// Flat-vector overload of <see cref="AnyGradientIsAnomalous(TapeStepContext{T})"/>
1417+
/// for the Optimize / UpdateSolution path (#1380 part 2). Iterates the
1418+
/// gradient Vector directly since UpdateSolution doesn't have a
1419+
/// TapeStepContext to walk.
1420+
/// </summary>
1421+
private bool AnyGradientIsAnomalous(Vector<T> gradient)
1422+
{
1423+
for (int i = 0; i < gradient.Length; i++)
1424+
{
1425+
double v = NumOps.ToDouble(gradient[i]);
1426+
if (double.IsNaN(v) || double.IsInfinity(v)) return true;
1427+
}
1428+
return false;
1429+
}
1430+
14531431
private static void ApplyGlobalNormGradientClipping(
14541432
TapeStepContext<T> context,
14551433
double maxNorm)

src/Optimizers/AdamWOptimizer.cs

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -255,6 +255,15 @@ protected override void UpdateAdaptiveParameters(OptimizationStepData<T, TInput,
255255
/// <returns>A new solution with updated parameters.</returns>
256256
protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)
257257
{
258+
// #1413 CONSOLIDATION: NN solutions go through base.UpdateSolution
259+
// which synthesizes a TapeStepContext and delegates to Step
260+
// (one source of truth, matches PyTorch/TF/JAX). Non-NN solutions
261+
// (regression, clustering, classical models) keep the legacy
262+
// flat-vector path below for backward compatibility.
263+
if (currentSolution is AiDotNet.Interfaces.INeuralNetwork<T>)
264+
{
265+
return base.UpdateSolution(currentSolution, gradient);
266+
}
258267
var parameters = InterfaceGuard.Parameterizable(currentSolution).GetParameters();
259268

260269
// Right-size _m/_v/_vMax to gradient on first call or after lazy-layer

src/Optimizers/FTRLOptimizer.cs

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -344,6 +344,15 @@ public override OptimizationResult<T, TInput, TOutput> Optimize(OptimizationInpu
344344
/// <returns>The updated solution.</returns>
345345
protected override IFullModel<T, TInput, TOutput> UpdateSolution(IFullModel<T, TInput, TOutput> currentSolution, Vector<T> gradient)
346346
{
347+
// #1413 CONSOLIDATION: NN solutions go through base.UpdateSolution
348+
// which synthesizes a TapeStepContext and delegates to Step
349+
// (one source of truth, matches PyTorch/TF/JAX). Non-NN solutions
350+
// (regression, clustering, classical models) keep the legacy
351+
// flat-vector path below for backward compatibility.
352+
if (currentSolution is AiDotNet.Interfaces.INeuralNetwork<T>)
353+
{
354+
return base.UpdateSolution(currentSolution, gradient);
355+
}
347356
// === Partially Vectorized FTRL Update using IEngine (Phase B: US-GPU-015) ===
348357
// FTRL uses L1 thresholding which requires conditional logic per-element
349358
// Vectorized: gradient operations, sigma calculation, state updates

0 commit comments

Comments
 (0)