Commit dcc0aef

ooples and franklinic authored
fix(#1311 cluster-3): snap VLM vision-encoder head count to divide visionDim cleanly (#1397)
PR #1290 CI Cluster 3 #1311: 23 SmolVLM tests failing on master with the cluster's signature shape mismatch:

    System.ArgumentException : Input embedding dimension (384) does not match weight dimension (378). Query shape: [1, 256, 384], Weights shape: [378, 378]

## Root cause

SmolVLM defaults: VisionDim=384, NumHeads=9. At the vision-encoder MHA construction in `CreateDefaultPixelShuffleProjectorLayers` (and 9 other VLM factories):

    new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads,
        (visionDim) / (numHeads > 16 ? 16 : numHeads))

C# integer division: `384 / 9 = 42`. Then `MultiHeadAttentionLayer._embeddingDimension = 9 * 42 = 378` (NOT 384). The QKV weight matrices end up sized `[378, 378]`, but `PatchEmbeddingLayer` upstream emits patch tokens at visionDim=384, so `ForwardInternal` throws at the very first vision MHA call.

The 9-heads / 384-vision-dim mismatch is paper-faithful (SmolVLM uses SmolLM's 9-head decoder config), but the vision encoder is SigLIP-Large @ 16 heads × 64 head-dim = 1024 vision-dim: the two subsystems use different head counts. AiDotNet's `SmolVLMOptions` collapses both into a single `NumHeads` knob, so the factory reuses the decoder's 9 for the vision MHA, where it doesn't divide visionDim.

Per-subsystem head counts on the options class (`NumVisionHeads` vs `NumDecoderHeads`) are the paper-faithful long-term fix, but that is an API-surface change. The minimal, no-surface-change fix is to snap the vision MHA's head count downward to the largest divisor of visionDim that's ≤ numHeads.

## Fix

Add a `ChooseDivisibleHeadConfig(embedDim, requestedHeads, maxHeads = 16)` helper in `LayerHelper<T>` that returns `(heads, headDim)` with `heads * headDim == embedDim` exactly. It finds the largest `h ≤ min(requestedHeads, maxHeads)` such that `embedDim % h == 0`.

For SmolVLM (visionDim=384, numHeads=9): start at 9, 384%9=6≠0, drop to 8, 384%8=0 ✓ → `(8, 48)`. MHA gets [384, 384] weights matching the 384-dim input.

Add a `CreateVisionMha(visionDim, numHeads, initializationStrategy?)` shim that applies the helper and returns the configured `MultiHeadAttentionLayer<T>`. Replace all 10 inline `new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, ...)` call sites across the VLM factories.

Snapping heads downward (vs upward / padding embedDim) keeps every other shape in the chain unchanged: FFN, LayerNorm, and downstream Dense layers all keep their visionDim-wide view. The trade-off is that the attention pattern uses slightly fewer heads than the upstream model card; that's strictly more local than reshaping the entire residual stream.

## Verification

Pre-fix (current master):

    $ dotnet test --filter "FullyQualifiedName~EmotiVoiceTests|FullyQualifiedName~Phi3VisionTests|FullyQualifiedName~SmolVLMTests|FullyQualifiedName~RainbowDQNAgentTests"
    Failed: 47, Passed: 37
      EmotiVoiceTests:       pass=26, fail=1  (timeout)
      Phi3VisionTests:       pass=2,  fail=23 (all OOM/timeout, foundation-scale)
      RainbowDQNAgentTests:  pass=7,  fail=0
      SmolVLMTests:          pass=2,  fail=23 (all shape-mismatch, THIS PR)

Post-fix:

    $ dotnet test --filter "FullyQualifiedName~SmolVLMTests"
    Failed: 14, Passed: 11

Remaining 14 failures: 7 OutOfMemoryException + 6 timeout 120s + 1 timeout 180s. NO MORE shape mismatch.

So this PR closes **23 of 23 SmolVLM shape-contract failures**. The remaining 14 SmolVLM failures (plus Phi3Vision's 23) are foundation-scale resource issues, the same class as #1394 (ResNet/VGG ImageNet-scale perf). Different root cause, separate follow-up.

## Affected paths (10 sites)

All `(visionDim) / (numHeads > 16 ? 16 : numHeads)` patterns in VLM factories:

- CreateDefaultEncoderDecoderVLMLayers
- CreateDefaultVisualExpertVLMLayers
- CreateDefaultCrossAttentionResamplerVLMLayers
- CreateDefaultPixelShuffleProjectorLayers (SmolVLM, direct fix here)
- CreateDefaultVisionAdapterLayers (Phi3Vision)
- CreateDefaultTokenReductionVLMLayers (DeepSeek-VL)
- + 4 more

Closes #1311 partially (shape-contract root cause for SmolVLM; defensive fix applied to all 10 vision-encoder MHA sites). Foundation-scale resource residue tracked elsewhere.

Co-authored-by: franklinic <franklin@ivorycloud.com>
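The arithmetic in the root-cause and fix sections can be checked in isolation. The following is a minimal sketch, not the code from this commit: `HeadSnapDemo` and `SnapHeads` are hypothetical names, and the method only mirrors the snap-to-divisor rule described above without the argument validation.

```csharp
using System;

public static class HeadSnapDemo
{
    // Hypothetical stand-in for the helper described in the commit message:
    // largest h <= min(requestedHeads, maxHeads) with embedDim % h == 0.
    public static (int heads, int headDim) SnapHeads(
        int embedDim, int requestedHeads, int maxHeads = 16)
    {
        int heads = Math.Min(requestedHeads, maxHeads);
        while (heads > 1 && embedDim % heads != 0) heads--;
        return (heads, embedDim / heads); // heads == 1 always divides
    }

    public static void Main()
    {
        int visionDim = 384, numHeads = 9;

        // Old inline pattern: integer division truncates 384 / 9 to 42.
        int oldHeadDim = visionDim / numHeads;
        Console.WriteLine($"old: {numHeads} x {oldHeadDim} = {numHeads * oldHeadDim}");

        // Snapped configuration: 9 does not divide 384, 8 does.
        var (heads, headDim) = SnapHeads(visionDim, numHeads);
        Console.WriteLine($"new: {heads} x {headDim} = {heads * headDim}");
    }
}
```

With the SmolVLM defaults this prints `old: 9 x 42 = 378` and `new: 8 x 48 = 384`. A request above the cap, such as `SnapHeads(1024, 32)`, is first clamped to 16 heads, and a prime dimension such as 383 falls all the way back to a single head.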
1 parent 7207983 commit dcc0aef

1 file changed

Lines changed: 77 additions & 10 deletions

src/Helpers/LayerHelper.cs

@@ -28,6 +28,73 @@ private static void ValidatePatchSize(int patchSize)
             throw new ArgumentOutOfRangeException(nameof(patchSize), patchSize, "patchSize must be greater than 0.");
     }

+    /// <summary>
+    /// Picks a head-count + head-dim pair for a MultiHeadAttention layer such
+    /// that <c>heads × headDim == embedDim</c> exactly. The requested head
+    /// count is honoured when it divides <paramref name="embedDim"/> cleanly;
+    /// otherwise it's reduced to the largest divisor of
+    /// <paramref name="embedDim"/> that is ≤ <paramref name="requestedHeads"/>
+    /// (and ≤ <paramref name="maxHeads"/>, default 16, which preserves the
+    /// existing VLM cap most call sites use). Used to fix
+    /// <c>cluster-3 #1311</c> SmolVLM-style failures where, e.g.,
+    /// <c>visionDim=384 / numHeads=9 = 42</c> (integer division) and
+    /// <c>9 × 42 = 378 ≠ 384</c>: the MHA's QKV weight matrices end up sized
+    /// <c>[378, 378]</c> while the input it sees is <c>[..., 384]</c>, and
+    /// <c>MultiHeadAttentionLayer.ForwardInternal</c> throws
+    /// <c>"Input embedding dimension (384) does not match weight dimension (378)"</c>
+    /// the moment the test forwards the patch-embedded tokens.
+    ///
+    /// <para>
+    /// Snapping heads downward (vs upward / padding the embed dim) keeps every
+    /// other shape in the chain unchanged: FFN, LayerNorm, downstream
+    /// DenseLayer outputs all keep their <paramref name="embedDim"/>-wide
+    /// view. The trade-off is the attention pattern uses slightly fewer
+    /// heads than the upstream model card; that's strictly more local than
+    /// reshaping the entire residual stream. Paper-faithful head counts can
+    /// be re-introduced once the per-subsystem head-count knob exists on the
+    /// options class (NumVisionHeads vs NumDecoderHeads), filed separately.
+    /// </para>
+    /// </summary>
+    private static (int heads, int headDim) ChooseDivisibleHeadConfig(
+        int embedDim, int requestedHeads, int maxHeads = 16)
+    {
+        if (embedDim <= 0)
+            throw new ArgumentOutOfRangeException(nameof(embedDim), embedDim, "embedDim must be positive.");
+        if (requestedHeads <= 0)
+            throw new ArgumentOutOfRangeException(nameof(requestedHeads), requestedHeads, "requestedHeads must be positive.");
+        if (maxHeads <= 0)
+            throw new ArgumentOutOfRangeException(nameof(maxHeads), maxHeads, "maxHeads must be positive.");
+
+        int heads = System.Math.Min(requestedHeads, maxHeads);
+        while (heads > 1 && embedDim % heads != 0) heads--;
+        // Last-ditch fallback: a 1-head attention always divides. Useful for
+        // pathological embed dims (primes > maxHeads or arbitrary user-set
+        // values) so the test doesn't crash before its assertion runs.
+        return (heads, embedDim / heads);
+    }
+
+    /// <summary>
+    /// Builds the vision-encoder <see cref="NeuralNetworks.Layers.MultiHeadAttentionLayer{T}"/>
+    /// for a VLM factory, with head count adjusted to divide
+    /// <paramref name="visionDim"/> exactly. See
+    /// <see cref="ChooseDivisibleHeadConfig"/> for the snap-to-divisor
+    /// rationale; this method is the call-site shim that replaces the
+    /// inline <c>new MultiHeadAttentionLayer&lt;T&gt;(numHeads &gt; 16 ? 16 : numHeads,
+    /// (visionDim) / (numHeads &gt; 16 ? 16 : numHeads))</c> pattern in 10+ VLM factories.
+    /// </summary>
+    private static NeuralNetworks.Layers.MultiHeadAttentionLayer<T> CreateVisionMha(
+        int visionDim,
+        int numHeads,
+        Initialization.IInitializationStrategy<T>? initializationStrategy = null)
+    {
+        var (heads, headDim) = ChooseDivisibleHeadConfig(visionDim, numHeads);
+        return new NeuralNetworks.Layers.MultiHeadAttentionLayer<T>(
+            heads,
+            headDim,
+            activationFunction: null,
+            initializationStrategy: initializationStrategy);
+    }
+
     /// <summary>
     /// Creates the default layer configuration for a Deep Portfolio Management model.
     /// </summary>
@@ -23623,7 +23690,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultLLaVAMLPProjectorLayers(

         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -23680,7 +23747,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultPixelShuffleProjectorLayers(

         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -23742,7 +23809,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultCrossAttentionResamplerVLMLaye

         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -23857,7 +23924,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultVisionAdapterLayers(

         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -23918,7 +23985,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultTokenReductionVLMLayers(

         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -24013,7 +24080,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultVideoTemporalVLMLayers(

         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads), initializationStrategy: lazy);
+            yield return CreateVisionMha(visionDim, numHeads, initializationStrategy: lazy);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation, lazy);
             yield return new DenseLayer<T>(visionDim, identityActivation, lazy);
@@ -24077,7 +24144,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultRoboticsActionLayers(

         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -24182,7 +24249,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultGroundingDetectionLayers(

         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -24326,7 +24393,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultUnifiedBidirectionalLayers(

         for (int i = 0; i < numEncoderLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
@@ -24391,7 +24458,7 @@ public static IEnumerable<ILayer<T>> CreateDefaultEditingInstructionLayers(

         for (int i = 0; i < numVisionLayers; i++)
         {
-            yield return new MultiHeadAttentionLayer<T>(numHeads > 16 ? 16 : numHeads, (visionDim) / (numHeads > 16 ? 16 : numHeads));
+            yield return CreateVisionMha(visionDim, numHeads);
             yield return new LayerNormalizationLayer<T>();
             yield return new DenseLayer<T>(visionFfnDim, geluActivation);
             yield return new DenseLayer<T>(visionDim, identityActivation);
