Commit 900134c

ooples and franklinic authored
fix(#1395): surface caught exception in CompiledTapeTrainingStep fallback (#1402)
When CompiledTapeTrainingStep.TryStepWithFusedOptimizer caught an exception (anything thrown by plan.Step or ConfigureOptimizer), it logged to Trace and returned false. The caller (NeuralNetworkBase) saw ran=false plus _fusedTrainingCommitted=true and threw a generic InvalidOperationException listing 'common causes' but NOT the actual exception text. Tests failing on this path got an opaque error; debugging required reproducing the failure locally just to see the Trace output.

## Fix

CompiledTapeTrainingStep now stashes the caught exception in a [ThreadStatic] field on the catch path, cleared on entry to each call. NeuralNetworkBase's fused-committed throw at line 6240 reads it back via GetLastFallbackException() and quotes the type+message inline, plus attaches the underlying exception as innerException so callers that introspect ex.InnerException get the full stack.

Three concrete exception types observed under parallel test load (traced during AiDotNet#1395 investigation) that now surface inline instead of being hidden:

- InvalidOperationException 'Parameter N has a layout that does not expose a live CPU backing array' (ConfigureOptimizerDouble)
- ArgumentException 'gradOutput shape [...] must be [...]' (kernel backward shape mismatch)
- InvalidOperationException 'Lazy tensor produced NaN at index 0' (DifferentiableNeuralComputer numerical guard)

Paired with the AiDotNet.Tensors-side fix(AiDotNet#1395): drop IsDeterministicMode from CompiledModelCache shape-key, which removes the cross-test cache-key drift that caused the throw in the first place.

Build clean on net10.0 + net471.

Co-authored-by: franklinic <franklin@ivorycloud.com>
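The stash-and-surface pattern described in this message can be sketched in isolation. The following is a minimal illustration and not the AiDotNet implementation: `FusedStepSketch`, `CallerSketch`, `TryStep` and `StepOrThrow` are hypothetical names invented for the example. Only the `[ThreadStatic]` slot, the clear-on-entry rule, and attaching the cause as the inner exception are taken from the commit message.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for CompiledTapeTrainingStep<T>; not the AiDotNet type.
public static class FusedStepSketch
{
    // One slot per thread, so parallel test workers never read each other's failure.
    [ThreadStatic]
    private static Exception? _lastFallbackException;

    public static Exception? GetLastFallbackException() => _lastFallbackException;

    // Reports failure by returning false instead of throwing.
    public static bool TryStep(Action fusedStep)
    {
        // Cleared on entry so the slot only ever describes this call.
        _lastFallbackException = null;
        try
        {
            fusedStep();
            return true;
        }
        catch (Exception ex)
        {
            // Stash the cause for the caller before falling back.
            _lastFallbackException = ex;
            return false;
        }
    }
}

// Hypothetical stand-in for the caller that refuses a silent fallback.
public static class CallerSketch
{
    public static void StepOrThrow(Action fusedStep)
    {
        if (FusedStepSketch.TryStep(fusedStep)) return;

        Exception? cause = FusedStepSketch.GetLastFallbackException();
        string suffix = cause is not null
            ? $" Root-cause exception: {cause.GetType().FullName}: {cause.Message}"
            : " (No exception was caught.)";

        // Quote type + message inline and keep the full stack via InnerException.
        throw new InvalidOperationException("Fused step could not engage." + suffix, cause);
    }
}
```

With this sketch, a test calling `CallerSketch.StepOrThrow(() => throw new ArgumentException("bad shape"))` catches an `InvalidOperationException` whose message names `System.ArgumentException` and whose `InnerException` is the original exception. Because the slot is thread-static, the caller has to read it on the same thread that ran the step.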
1 parent befe892 commit 900134c

2 files changed

Lines changed: 57 additions & 2 deletions

src/NeuralNetworks/NeuralNetworkBase.cs

Lines changed: 21 additions & 2 deletions
@@ -6262,20 +6262,39 @@ private bool TryTrainWithFusedOptimizer(
 // plan (strict single-plan policy refused to configure it)
 // - mutated optimizer hyperparameters between steps
 // (attached LR scheduler, changed betas, etc.)
+// - a kernel-level exception caught and swallowed
+// (CompiledTrainingPlan.Step / ConfigureOptimizer threw)
 // Resolution: call ResetState() or InvalidateParameterCountCache()
 // to fully reset training state, then retrain with stable
 // shapes + fixed hyperparameters. Or disable compilation via
 // AllowNondeterminism / Configure(JitCompilationConfig.Disabled)
 // so training runs entirely on the eager path from the start.
+//
+// AiDotNet#1395: quote the swallowed exception (if any) inline so
+// failing tests don't have to chase Trace output to learn the
+// root cause (Parameter N non-contiguous CPU layout, shape mismatch
+// in a backward kernel, NaN guard trip, etc.). The inner exception
+// is also attached for catch (InvalidOperationException ex) callers
+// that introspect ex.InnerException.
+var fallbackEx = Training.CompiledTapeTrainingStep<T>.GetLastFallbackException();
+var rootCauseSuffix = fallbackEx is not null
+    ? $" Root-cause exception (caught in CompiledTapeTrainingStep): " +
+      $"{fallbackEx.GetType().FullName}: {fallbackEx.Message}"
+    : " (No exception was caught — fused path returned false from one of the explicit " +
+      "refuse paths: plan reference changed, optimizer hyperparameters drifted, " +
+      "TensorCodecOptions.EnableCompilation=false, or numeric/optimizer type unsupported.)";
 throw new InvalidOperationException(
     "Fused compiled training has already run successfully, but the current step cannot " +
     "engage the fused path. The plan-embedded Adam/AdamW/SGD state cannot be transferred " +
     "to the eager optimizer, so falling back silently would produce a trajectory that " +
     "diverges from the previous fused steps. Common causes: variable input/target shape " +
-    "(new compiled plan), LR scheduler or adaptive-rate changes, attached AMSGrad. " +
+    "(new compiled plan), LR scheduler or adaptive-rate changes, attached AMSGrad, or a " +
+    "kernel-level exception in plan.Step/ConfigureOptimizer that was caught and swallowed. " +
     "Resolution: keep shapes and optimizer hyperparameters stable across steps, OR call " +
     "ResetState() / InvalidateParameterCountCache() to explicitly reset training state, " +
-    "OR disable compilation (AiModelBuilder.ConfigureJitCompilation(JitCompilationConfig.Disabled)).");
+    "OR disable compilation (AiModelBuilder.ConfigureJitCompilation(JitCompilationConfig.Disabled))." +
+    rootCauseSuffix,
+    innerException: fallbackEx);
 }
 else
 {

src/Training/CompiledTapeTrainingStep.cs

Lines changed: 36 additions & 0 deletions
@@ -102,6 +102,30 @@ public static class CompiledTapeTrainingStep<T>
 /// <summary>Resets the fused-step counter on the calling thread to zero.</summary>
 public static void ResetFusedStepCount() { _fusedStepCount = 0; }
 
+/// <summary>
+/// AiDotNet#1395: when <see cref="TryStepWithFusedOptimizer"/> falls back via
+/// the catch path, the underlying exception is stored here so the caller
+/// (NeuralNetworkBase) can surface it in the "fused has committed but step N
+/// can't engage" InvalidOperationException. Previously the catch's exception
+/// was logged to Trace only — users debugging from a failing test never saw
+/// the actual root cause (e.g. "Parameter N non-contiguous CPU layout" from
+/// <see cref="Tensors.Engines.Compilation.CompiledTrainingPlan{T}.ConfigureOptimizer"/>,
+/// a shape mismatch from a backward kernel, a NaN guard trip). Now the
+/// caller can quote the original exception's type + message + stack so the
+/// error is self-diagnosing.
+/// </summary>
+[ThreadStatic]
+private static System.Exception? _lastFallbackException;
+
+/// <summary>
+/// AiDotNet#1395: read the last exception that caused
+/// <see cref="TryStepWithFusedOptimizer"/> to fall back, or <c>null</c> if
+/// the most recent fallback was due to one of the explicit return-false
+/// paths (plan switch, config drift, EnableCompilation=false, etc.) rather
+/// than a swallowed exception.
+/// </summary>
+public static System.Exception? GetLastFallbackException() => _lastFallbackException;
+
 // Reflection-cached lookup of ICompiledTrainingPlan<T>.SetMaxGradNorm(double).
 // Populated lazily on first call per process and reused on every subsequent
 // step. Returns null when the underlying Tensors assembly pre-dates the
@@ -282,6 +306,11 @@ internal static bool TryStepWithFusedOptimizer(
     AiDotNet.Tensors.Engines.Compilation.LrSchedule? lrSchedule = null)
 {
 lossValue = MathHelper.GetNumericOperations<T>().Zero;
+// AiDotNet#1395: clear the previous-call's exception buffer so the
+// caller's GetLastFallbackException reflects only the outcome of THIS
+// call. (Cleared on entry, not on success, so a successful step doesn't
+// leak a stale exception from earlier.)
+_lastFallbackException = null;
 
 if (!TensorCodecOptions.Current.EnableCompilation) return false;
 // Fused optimizer kernels support float and double on the Tensors
@@ -505,6 +534,13 @@ or AiDotNet.Tensors.Engines.Compilation.OptimizerType.Adam
 // a fused-path regression from logs requires reproducing the
 // failure locally. Clear the single-slot config state so any
 // next attempt reconfigures fresh.
+//
+// AiDotNet#1395: also stash the exception so the caller's
+// "fused has committed but step cannot engage" InvalidOperationException
+// can quote the underlying cause (Parameter N non-contiguous CPU
+// layout, shape mismatch, NaN guard, etc.). Trace alone wasn't
+// enough — failing tests don't surface Trace output by default.
+_lastFallbackException = ex;
 System.Diagnostics.Trace.TraceWarning(
     $"CompiledTapeTrainingStep.TryStepWithFusedOptimizer failed, falling back to eager: " +
     $"{ex}");
