fix(#1311 cluster-3): snap VLM vision-encoder head count to divide visionDim cleanly (#1397)

ooples · franklinic · web-flow · commit dcc0aef5001a · 2026-05-19T22:33:46.000-04:00
PR #1290 CI Cluster 3 #1311: 23 SmolVLM tests failing on master with the cluster's signature shape mismatch:

    System.ArgumentException : Input embedding dimension (384) does not match weight dimension (378).
    Query shape: [1, 256, 384], Weights shape: [378, 378]

## Root cause

SmolVLM defaults: VisionDim=384, NumHeads=9. At the vision-encoder MHA construction in `CreateDefaultPixelShuffleProjectorLayers` (and 9 other VLM factories):

    new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads,
        (visionDim) / (numHeads > 16 ? 16 : numHeads))

C# integer division: `384 / 9 = 42`. Then `MultiHeadAttentionLayer._embeddingDimension = 9 * 42 = 378` (NOT 384). The QKV weight matrices end up sized `[378, 378]`, but `PatchEmbeddingLayer` upstream emits patch tokens at visionDim=384, so `ForwardInternal` throws at the very first vision MHA call.

The 9-heads / 384-vision-dim mismatch is paper-faithful (SmolVLM uses SmolLM's 9-head decoder config) but the vision encoder is SigLIP-Large @ 16 heads × 64 head-dim = 1024 vision-dim: different counts per subsystem. AiDotNet's `SmolVLMOptions` collapses both to a single `NumHeads` knob, so the factory reuses the decoder's 9 for the vision MHA, where it doesn't divide visionDim.

Per-subsystem head counts on the options class (`NumVisionHeads` vs `NumDecoderHeads`) are the paper-faithful long-term fix, but that is an API-surface change. The minimal, no-surface-change fix is to snap the vision MHA's head count downward to the largest divisor of visionDim that's ≤ numHeads.

## Fix

Add a `ChooseDivisibleHeadConfig(embedDim, requestedHeads, maxHeads = 16)` helper in `LayerHelper<T>` that returns `(heads, headDim)` with `heads * headDim == embedDim` exactly. It finds the largest `h ≤ min(requestedHeads, maxHeads)` such that `embedDim % h == 0`.

For SmolVLM (visionDim=384, numHeads=9): start at 9, 384%9=6≠0, drop to 8, 384%8=0 ✓ → `(8, 48)`. MHA gets [384, 384] weights matching the 384-dim input.

Add a `CreateVisionMha(visionDim, numHeads, initializationStrategy?)` shim that applies the helper and returns the configured `MultiHeadAttentionLayer<T>`. Replace all 10 inline `new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, ...)` call sites across the VLM factories.

Snapping heads downward (vs upward / padding embedDim) keeps every other shape in the chain unchanged: FFN, LayerNorm, and downstream Dense all keep their visionDim-wide view. The trade-off is that the attention pattern uses slightly fewer heads than the upstream model card; that is a strictly more local change than reshaping the entire residual stream.

## Verification

Pre-fix (current master):

    $ dotnet test --filter "FullyQualifiedName~EmotiVoiceTests|FullyQualifiedName~Phi3VisionTests|FullyQualifiedName~SmolVLMTests|FullyQualifiedName~RainbowDQNAgentTests"
    Failed: 47, Passed: 37
      EmotiVoiceTests:      pass=26, fail=1  (timeout)
      Phi3VisionTests:      pass=2,  fail=23 (all OOM/timeout, foundation-scale)
      RainbowDQNAgentTests: pass=7,  fail=0
      SmolVLMTests:         pass=2,  fail=23 (all shape-mismatch, THIS PR)

Post-fix:

    $ dotnet test --filter "FullyQualifiedName~SmolVLMTests"
    Failed: 14, Passed: 11
      Remaining 14 failures: 7 OutOfMemoryException + 6 timeout 120s + 1 timeout 180s
      NO MORE shape mismatch.

So this PR closes **23 of 23 SmolVLM shape-contract failures**. The remaining 14 SmolVLM failures (plus Phi3Vision's 23) are foundation-scale resource issues, the same class as #1394 (ResNet/VGG ImageNet-scale perf). Different root cause, separate follow-up.

## Affected paths (10 sites)

All `(visionDim) / (numHeads > 16 ? 16 : numHeads)` patterns in VLM factories:

- CreateDefaultEncoderDecoderVLMLayers
- CreateDefaultVisualExpertVLMLayers
- CreateDefaultCrossAttentionResamplerVLMLayers
- CreateDefaultPixelShuffleProjectorLayers (SmolVLM, direct fix here)
- CreateDefaultVisionAdapterLayers (Phi3Vision)
- CreateDefaultTokenReductionVLMLayers (DeepSeek-VL)
- + 4 more

Closes #1311 partially (shape-contract root cause for SmolVLM; defensive fix applied to all 10 vision-encoder MHA sites). Foundation-scale resource residue tracked elsewhere.

Co-authored-by: franklinic <franklin@ivorycloud.com>
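The arithmetic behind the root cause and the snap-to-divisor fix can be checked in isolation. The sketch below is a standalone C# top-level program, not the library code: `OldEmbeddingDim` is a hypothetical name that reproduces only the arithmetic of the old inline pattern, and `ChooseDivisibleHeadConfig` is re-declared here as a local function mirroring the helper described in the Fix section, so it runs without AiDotNet.

```csharp
using System;

// Hypothetical stand-in for the old inline pattern: cap heads at 16, then
// integer-divide. Returns the embedding dimension the MHA would be built with.
static int OldEmbeddingDim(int visionDim, int numHeads)
{
    int heads = numHeads > 16 ? 16 : numHeads;
    int headDim = visionDim / heads;   // integer division truncates
    return heads * headDim;            // can be smaller than visionDim
}

// Standalone copy of the snap-to-divisor logic, for illustration only:
// largest h <= min(requestedHeads, maxHeads) with embedDim % h == 0.
static (int heads, int headDim) ChooseDivisibleHeadConfig(
    int embedDim, int requestedHeads, int maxHeads = 16)
{
    if (embedDim <= 0) throw new ArgumentOutOfRangeException(nameof(embedDim));
    if (requestedHeads <= 0) throw new ArgumentOutOfRangeException(nameof(requestedHeads));
    if (maxHeads <= 0) throw new ArgumentOutOfRangeException(nameof(maxHeads));

    int heads = Math.Min(requestedHeads, maxHeads);
    while (heads > 1 && embedDim % heads != 0) heads--;
    return (heads, embedDim / heads);  // heads == 1 always divides
}

Console.WriteLine(OldEmbeddingDim(384, 9));              // 378 (not 384)
Console.WriteLine(ChooseDivisibleHeadConfig(384, 9));    // (8, 48)
Console.WriteLine(ChooseDivisibleHeadConfig(1024, 16));  // (16, 64)
Console.WriteLine(ChooseDivisibleHeadConfig(389, 9));    // (1, 389), 389 is prime
```

When the requested head count already divides the dimension (the 1024 / 16 case), the helper returns it unchanged, so call sites with clean configurations see no behavioural difference.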
diff --git a/src/Helpers/LayerHelper.cs b/src/Helpers/LayerHelper.cs
@@ -28,6 +28,73 @@ private static void ValidatePatchSize(int patchSize)
             throw new ArgumentOutOfRangeException(nameof(patchSize), patchSize, "patchSize must be greater than 0.");
     }
 
+    /// <summary>
+    /// Picks a head-count + head-dim pair for a MultiHeadAttention layer such
+    /// that <c>heads × headDim == embedDim</c> exactly. The requested head
+    /// count is honoured when it divides <paramref name="embedDim"/> cleanly;
+    /// otherwise it's reduced to the largest divisor of
+    /// <paramref name="embedDim"/> that is ≤ <paramref name="requestedHeads"/>
+    /// (and ≤ <paramref name="maxHeads"/>, default 16, which preserves the
+    /// existing VLM cap most call sites use). Used to fix
+    /// <c>cluster-3 #1311</c> SmolVLM-style failures where, e.g.,
+    /// <c>visionDim=384 / numHeads=9 = 42</c> (integer division) and
+    /// <c>9 × 42 = 378 ≠ 384</c> — the MHA's QKV weight matrices end up sized
+    /// <c>[378, 378]</c> while the input it sees is <c>[..., 384]</c>, and
+    /// <c>MultiHeadAttentionLayer.ForwardInternal</c> throws
+    /// <c>"Input embedding dimension (384) does not match weight dimension (378)"</c>
+    /// the moment the test forwards the patch-embedded tokens.
+    ///
+    /// <para>
+    /// Snapping heads downward (vs upward / padding the embed dim) keeps every
+    /// other shape in the chain unchanged — FFN, LayerNorm, downstream
+    /// DenseLayer outputs all keep their <paramref name="embedDim"/>-wide
+    /// view. The trade-off is the attention pattern uses slightly fewer
+    /// heads than the upstream model card; that's strictly more local than
+    /// reshaping the entire residual stream. Paper-faithful head counts can
+    /// be re-introduced once the per-subsystem head-count knob exists on the
+    /// options class (NumVisionHeads vs NumDecoderHeads), filed separately.
+    /// </para>
+    /// </summary>
+    private static (int heads, int headDim) ChooseDivisibleHeadConfig(
+        int embedDim, int requestedHeads, int maxHeads = 16)
+    {
+        if (embedDim <= 0)
+            throw new ArgumentOutOfRangeException(nameof(embedDim), embedDim, "embedDim must be positive.");
+        if (requestedHeads <= 0)
+            throw new ArgumentOutOfRangeException(nameof(requestedHeads), requestedHeads, "requestedHeads must be positive.");
+        if (maxHeads <= 0)
+            throw new ArgumentOutOfRangeException(nameof(maxHeads), maxHeads, "maxHeads must be positive.");
+
+        int heads = System.Math.Min(requestedHeads, maxHeads);
+        while (heads > 1 && embedDim % heads != 0) heads--;
+        // Last-ditch fallback: a 1-head attention always divides. Useful for
+        // pathological embed dims (primes > maxHeads or arbitrary user-set
+        // values) so the test doesn't crash before its assertion runs.
+        return (heads, embedDim / heads);
+    }
+
+    /// <summary>
+    /// Builds the vision-encoder <see cref="NeuralNetworks.Layers.MultiHeadAttentionLayer{T}"/>
+    /// for a VLM factory, with head count adjusted to divide
+    /// <paramref name="visionDim"/> exactly. See
+    /// <see cref="ChooseDivisibleHeadConfig"/> for the snap-to-divisor
+    /// rationale; this method is the call-site shim that replaces the
+    /// inline <c>new MultiHeadAttentionLayer&lt;T&gt;(numHeads &gt; 16 ? 16 : numHeads,
+    /// (visionDim) / (numHeads &gt; 16 ? 16 : numHeads))</c> pattern in the 10 VLM factories.
+    /// </summary>
+    private static NeuralNetworks.Layers.MultiHeadAttentionLayer<T> CreateVisionMha(
+        int visionDim,
+        int numHeads,
+        Initialization.IInitializationStrategy<T>? initializationStrategy = null)
+    {
+        var (heads, headDim) = ChooseDivisibleHeadConfig(visionDim, numHeads);
+        return new NeuralNetworks.Layers.MultiHeadAttentionLayer<T>(
+            heads,
+            headDim,
+            activationFunction: null,
+            initializationStrategy: initializationStrategy);
+    }
+
     /// <summary>
     /// Creates the default layer configuration for a Deep Portfolio Management model.
     /// </summary>
@@ -23623,7 +23690,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultLLaVAMLPProjectorLayers(
 
         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -23680,7 +23747,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultPixelShuffleProjectorLayers(
 
         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -23742,7 +23809,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultCrossAttentionResamplerVLMLaye
 
         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -23857,7 +23924,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultVisionAdapterLayers(
 
         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -23918,7 +23985,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultTokenReductionVLMLayers(
 
         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -24013,7 +24080,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultVideoTemporalVLMLayers(
 
         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads), initializationStrategy: lazy);
+            yield return CreateVisionMha(visionDim, numHeads, initializationStrategy: lazy);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation, lazy);
             yield return new DenseLayer<T>(visionDim, identityActivation, lazy);
@@ -24077,7 +24144,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultRoboticsActionLayers(
 
         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -24182,7 +24249,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultGroundingDetectionLayers(
 
         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -24326,7 +24393,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultUnifiedBidirectionalLayers(
 
         for (int i = 0; i < numEncoderLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -24391,7 +24458,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultEditingInstructionLayers(
 
         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);