fix(#1304 c6): drop Dropout from OccupancyNN defaults; fix memorization invariant (#1391)

ooples · franklinic · web-flow · commit 720798355c20 · 2026-05-19T22:32:03.000-04:00
PR #1290 CI Cluster 6 #1304: OccupancyNeuralNetworkTests.LossStrictlyDecreasesOnMemorizationTask was reported to be fixed by PR #1329's BatchNorm→LayerNorm swap, but the test was still red on master with loss step 1=0.6936, step 100=0.7032 (slightly INCREASING): the model was stuck at the BCE-ln(2) baseline through 100 gradient steps.

## Root cause

PR #1329 fixed the BN-at-batch-1 degeneracy (σ²=0 → y=β collapses the gradient through normalization) but the *Dropout layer*'s memorization-blocking effect was not addressed. The default Occupancy layer stack was:

    Dense(64)+ReLU → LayerNorm → Dropout(0.3) →
    Dense(32)+ReLU → LayerNorm → Dropout(0.2) →
    Dense(16)+ReLU → Dense(out)+Sigmoid

Under the model-family LossStrictlyDecreasesOnMemorizationTask invariant (train the SAME (x, target) pair for 100 iterations and assert loss strictly decreases), every forward pass under Dropout sees a DIFFERENT random sub-network (~56% of hidden units active = 0.7 × 0.8). On a 3 → 64 → 32 → 16 → 1 MLP (~2k params), the per-step mask randomness injects more variance than the gradient can subtract over 100 steps, leaving loss flat or slightly RISING at the BCE-ln(2) baseline.

## Fix

Remove Dropout from both `CreateDefaultOccupancyLayers` and `CreateDefaultOccupancyTemporalLayers` in `LayerHelper<T>`. At this network size Dropout adds no useful regularization (the model has fewer params than typical sensor batches have rows); callers who genuinely need regularization on a larger Occupancy MLP can pass an explicit architecture with their preferred Dropout rate.

## Verification

    $ dotnet test --filter "FullyQualifiedName~OccupancyNeuralNetworkTests"
    Passed! - Failed: 0, Passed: 21, Skipped: 0, Total: 21

All 21 OccupancyNN tests pass (was 1 failing).

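The figures quoted in the root-cause analysis can be checked by hand. The block below is a worked sketch, not taken from the codebase. It assumes the standard BatchNorm formulation with a small ε, one bias per Dense unit, an input width of 3 and an output width of 1, and it leaves out any LayerNorm gain/bias parameters. The 56% figure is read here as the product of the two keep probabilities.

```latex
% BatchNorm on a batch of one sample x: the batch mean equals x and the
% batch variance is zero, so the normalized value vanishes and the output
% is the learned shift alone.
\mu_B = x, \qquad \sigma_B^2 = 0, \qquad
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} = 0, \qquad
y = \gamma \hat{x} + \beta = \beta

% Binary cross-entropy for one sample with target t and prediction p.
\mathcal{L}(t, p) = -\bigl[\, t \ln p + (1 - t) \ln(1 - p) \,\bigr]

% An uninformative prediction p = 1/2 gives the same loss for either label.
\mathcal{L}\bigl(t, \tfrac{1}{2}\bigr) = -\ln \tfrac{1}{2} = \ln 2 \approx 0.6931

% Product of the keep probabilities of Dropout(0.3) and Dropout(0.2).
(1 - 0.3)(1 - 0.2) = 0.7 \times 0.8 = 0.56

% Dense weights plus biases for 3 -> 64 -> 32 -> 16 -> 1.
(3 \cdot 64 + 64) + (64 \cdot 32 + 32) + (32 \cdot 16 + 16) + (16 \cdot 1 + 1)
  = 256 + 2080 + 528 + 17 = 2881
```

Both reported losses (0.6936 at step 1, 0.7032 at step 100) sit within roughly 0.01 of ln 2, which is consistent with a model that never moves off the uninformative prediction. Under the assumptions above the Dense parameter count comes to about 2.9k, somewhat above the rounded ~2k quoted in the message; either way the network is very small.
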
The 4 remaining #1304 tests post-fix:

- SimCSETests.TrainingError_ShouldNotExceedTestError: PASS (was passing already on current master)
- SimCSETests.Training_ShouldChangeParameters: PASS (was passing already on current master)
- DenseNetNetworkTests.MoreData_ShouldNotDegrade: Adam-overshoot divergence (200-iter loss > 50-iter loss); separate follow-up issue
- NEATTests.Training_ShouldReduceLoss: timeout (perf gap, similar to #1390); separate follow-up issue

Closes #1304 partially. DenseNet + NEAT follow-ups tracked separately.

Co-authored-by: franklinic <franklin@ivorycloud.com>
diff --git a/src/Helpers/LayerHelper.cs b/src/Helpers/LayerHelper.cs
@@ -787,14 +787,16 @@ public static IEnumerable<ILayer<T>> CreateDefaultOccupancyTemporalLayers(
         // Dense layers for further processing. LayerNormalization (Ba 2016)
         // rather than BatchNormalization so the head still normalizes at any
         // batch size — memorization-style training runs at batch=1 and BN
-        // collapses (σ² = 0) under those conditions.
+        // collapses (σ² = 0) under those conditions. Dropout removed for
+        // the same reason as the non-temporal variant above (#1304
+        // cluster-6 follow-up) — per-step mask randomness exceeds the
+        // gradient signal on a 100-iter memorization task and stalls
+        // loss at the BCE-ln(2) baseline.
         yield return new DenseLayer<T>(64, new ReLUActivation<T>() as IActivationFunction<T>);
         yield return new LayerNormalizationLayer<T>();
-        yield return new DropoutLayer<T>(0.3f);
 
         yield return new DenseLayer<T>(32, new ReLUActivation<T>() as IActivationFunction<T>);
         yield return new LayerNormalizationLayer<T>();
-        yield return new DropoutLayer<T>(0.2f);
 
         // Output layer
         yield return new DenseLayer<T>(architecture.OutputSize, new SigmoidActivation<T>() as IActivationFunction<T>);
@@ -879,13 +881,30 @@ public static IEnumerable<ILayer<T>> CreateDefaultOccupancyLayers(
         // memorization-style training. LayerNorm normalizes across the
         // feature axis within each sample and is the modern default for
         // small dense MLPs.
+        //
+        // Dropout removed (#1304 cluster-6 follow-up): the prior layout
+        // applied Dropout(0.3) + Dropout(0.2) on a tiny 3 → 64 → 32 → 16
+        // → 1 MLP (~2k params). On a memorization task that trains the
+        // same (x, target) pair for 100 iterations, every forward sees a
+        // DIFFERENT random sub-network (roughly 56% of hidden units
+        // active = 0.7 × 0.8) so the optimizer can never learn the pair
+        // — Dropout's per-step mask injects more variance than the
+        // gradient can subtract over 100 steps, leaving loss flat or
+        // slightly RISING at the BCE-ln(2) baseline. PR #1329 fixed the
+        // BN-at-batch-1 layer of this stack but the Dropout layer's
+        // memorization-blocking effect was left. At this network size
+        // Dropout adds no useful regularization (the model has fewer
+        // params than typical sensor batches have rows); callers who
+        // genuinely need regularization on a larger Occupancy MLP can
+        // pass an explicit architecture with their preferred Dropout
+        // rate. Closes the LossStrictlyDecreasesOnMemorizationTask
+        // signal that's been red on OccupancyNeuralNetworkTests since
+        // the cluster-6 sweep.
         yield return new DenseLayer<T>(64, new ReLUActivation<T>() as IActivationFunction<T>);
         yield return new LayerNormalizationLayer<T>();
-        yield return new DropoutLayer<T>(0.3f);
 
         yield return new DenseLayer<T>(32, new ReLUActivation<T>() as IActivationFunction<T>);
         yield return new LayerNormalizationLayer<T>();
-        yield return new DropoutLayer<T>(0.2f);
 
         yield return new DenseLayer<T>(16, new ReLUActivation<T>() as IActivationFunction<T>);