fix(scaffold): add Gemma3 + DeepSeekVL/InternVL family to patch-vision list (#1420)

ooples · franklinic · web-flow · commit 390d7ede8a57 · 2026-05-22T22:29:44.000-04:00
* fix(scaffold): add Gemma3 + DeepSeekVL/InternVL family to patch-vision list

PR #1408 Generated Layers shard (run 26254401589 job 77275610156) had 23 Gemma3 tests all failing at the same boundary:

    System.ArgumentException : Image H/W (128/128) must be divisible by patchSize (14)
      at PatchEmbeddingLayer.OnFirstForward
      at Gemma3.Train / Gemma3.Predict

Gemma3 (Google 2025) uses SigLIP-SO 14×14 patches per its paper (ImageSize=896 / sqrt(MaxVisualTokens=4096) = 14). The auto-scaffold's generic vision-model branch emitted a [3, 128, 128] input that's not divisible by 14, so every test that calls Train or Predict hard-rejected at the very first layer.

Add the missing prefixes to s_patchVisionFamilies so the helper returns the patch-divisible 112 (= lcm(14, 16)) spatial size: Gemma, DeepSeekVL, InternVL, Llama32Vision, Phi3Vision, Phi4Multimodal. All six use ComputeVisualPatchSize → patchSize=14 via the LLaVA-MLP / SigLIP-SO vision adapter path. The existing 112 helper survives every patch-14 and patch-16 division, so no other vision model regresses.

Also mark Gemma3 as paper-scale (vision dim 1152, 27 vision layers, 3584 decoder dim, 36 decoder layers; true 3B-foundation scale) so its iteration-count overrides for paper-scale models engage. The warm-up Predict still OOMs on a standard CI runner because Gemma3's default config materializes too many lazy DenseLayer weights at construction time; the OOM constraint is independent of patch divisibility and remains as follow-up (likely needs streaming-pool engagement or a per-class scaffold override that constructs Gemma3 with reduced dims for testing).

* docs(#1420): document streaming-engagement blocker for Gemma3 OOM

The scaffold patch-divisibility fix lets Gemma3 reach the warm-up Predict's first lazy weight materialization, where it then OOMs the runner because Gemma3's paper-scale defaults (3B+ params via VisionDim=1152, 27 vision + 36 decoder layers) overflow the GC heap before any streaming-pool engagement.

The natural fix is to call ConfigureWeightLifetime(new GpuOffloadOptions()) in InitializeLayers after the layer list is populated but before any weight materializes, which is exactly what the LayerBase.UseStreamingAllocator flag was built for. I prototyped this locally and confirmed the lazy DenseLayer's AllocateLazyWeight DID route through the streaming pool (PredictEagerStreaming kicked in immediately). But the streaming path then trips a deeper engine bug:

    System.InvalidOperationException : Streaming drop requires sole storage ownership; storage refcount is 2. Register the weight via WeightRegistry before any RebindStorageFrom / view operation that shares its storage.
      at TensorBase.DropStorageForStreaming
      at WeightRegistry.RegisterWeight
      at PredictEagerStreaming:3775 (RegisterLayerTrainableTensorsWithWeightRegistry)
      at Gemma3.Train

The lazy weight tensor that PatchEmbeddingLayer.OnFirstForward materializes ends up with refcount=2 on its storage by the time PredictEagerStreaming's post-forward re-registration runs. Some view / init op (Xavier init, RegisterTrainableParameter, or ResolveShapes) is producing a second reference to the underlying storage that DropStorageForStreaming refuses to silently drop.

Deferring the streaming pre-engagement until the engine-side refcount issue is fixed. The patch-divisibility scaffold portion of this PR stays: it's the necessary first half (unblocks the same-shape failures across DeepSeekVL / InternVL / Llama32Vision / Phi3 / Phi4 even on smaller configs where OOM isn't an issue).

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
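
The patch-size arithmetic quoted in the commit message can be sanity-checked with a small standalone sketch. It is illustrative only and not part of this change: `Gcd` and `Lcm` are hypothetical local helpers written for this check, not AiDotNet APIs, and the numbers (896, 4096, 128, 14, 16) are taken directly from the message above.

```csharp
using System;

// Hypothetical helpers for this check only; not part of AiDotNet.
// Greatest common divisor via Euclid's algorithm.
static int Gcd(int a, int b) => b == 0 ? a : Gcd(b, a % b);

// Least common multiple of two positive integers.
static int Lcm(int a, int b) => a / Gcd(a, b) * b;

// Gemma3: ImageSize / sqrt(MaxVisualTokens) = 896 / 64.
int patchSize = 896 / (int)Math.Sqrt(4096);
Console.WriteLine($"patchSize = {patchSize}");

// The scaffold's generic 128x128 input leaves a remainder, hence the exception.
Console.WriteLine($"128 % {patchSize} = {128 % patchSize}");

// The smallest spatial size divisible by both patch-14 and patch-16 encoders.
int spatial = Lcm(14, 16);
Console.WriteLine($"lcm(14, 16) = {spatial}");
Console.WriteLine($"{spatial} % 14 = {spatial % 14}, {spatial} % 16 = {spatial % 16}");
```

Under these assumptions the sketch reports a patch size of 14, a non-zero remainder for 128, and 112 as the common multiple, which matches the spatial size the scaffold helper is described as returning.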
diff --git a/src/AiDotNet.Generators/TestScaffoldGenerator.cs b/src/AiDotNet.Generators/TestScaffoldGenerator.cs
@@ -1239,6 +1239,16 @@ private static void CollectModelsFromNamespace(
         "Cambrian", "Dragonfly", "Eagle", "Mantis", "Maya", "MiniCPM",
         "Molmo", "Monkey", "Moondream", "NVLM", "Ovis", "VILA",
         "PathVLM", "RadFM", "QVQ", "SkyworkR1V", "GeoChat", "RSGPT", "SkyEyeGPT",
+        // InstructionTuned VLMs that also resolve a 14-patch SigLIP / ViT-L
+        // encoder via ComputeVisualPatchSize (Gemma3: 896/sqrt(4096)=14,
+        // DeepSeekVL/2: ViT-L/14, InternVL family: ViT-L/14, Llama32Vision:
+        // ViT-L/14, Phi3Vision/Phi4Multimodal: CLIP ViT-L/14). PatchEmbedding
+        // throws "Image H/W (128/128) must be divisible by patchSize (14)"
+        // when the scaffold's default 128 isn't divisible by 14 — surfaced
+        // in PR #1408 Generated Layers shard run 26254401589 as 23 Gemma3
+        // tests all failing at the same Forward boundary.
+        "Gemma", "DeepSeekVL", "InternVL", "Llama32Vision",
+        "Phi3Vision", "Phi4Multimodal",
     };
 
     /// <summary>
@@ -4350,6 +4360,12 @@ private static bool IsPaperScaleVisionLanguageModel(string className)
         {
             "BiomedCLIP" => true,
             "DFNCLIP" => true,
+            // Gemma3 (Google 2025): VisionDim=1152, DecoderDim=3584, 27 vision
+            // layers, 36 decoder layers, ImageSize=896 SigLIP-SO. Default Adam
+            // step OOMs the test runner before even completing the warm-up
+            // Predict — surfaced in PR #1408 Generated Layers shard as 23
+            // Gemma3 tests all failing.
+            "Gemma3" => true,
             _ => false,
         };
     }
diff --git a/src/VisionLanguage/InstructionTuned/Gemma3.cs b/src/VisionLanguage/InstructionTuned/Gemma3.cs
@@ -6,6 +6,7 @@
 using AiDotNet.NeuralNetworks;
 using AiDotNet.Onnx;
 using AiDotNet.Optimizers;
+using AiDotNet.Tensors.Engines.DirectGpu;
 using AiDotNet.Tokenization;
 using AiDotNet.Tokenization.Interfaces;
 using AiDotNet.VisionLanguage.Interfaces;
@@ -137,6 +138,24 @@ protected override void InitializeLayers()
             ComputeEncoderDecoderBoundary();
         }
         ValidateEncoderDecoderBoundary(_encoderLayerEnd);
+
+        // NOTE: Gemma3 is paper-scale by default (VisionDim=1152, 27 vision
+        // layers, DecoderDim=3584, 36 decoder layers — Google's 4B/12B/27B
+        // SigLIP-SO family). The auto-detect streaming gate in
+        // NeuralNetworkBase reads ParameterCount, which returns 0 PRE-first-
+        // forward for lazy DenseLayers; the first warm-up Predict
+        // materializes the full lazy weight matrix on the GC heap and OOMs
+        // the runner before auto-detect gets a chance to engage.
+        //
+        // Calling ConfigureWeightLifetime(new GpuOffloadOptions()) here to
+        // pre-engage streaming was the natural fix but surfaces a downstream
+        // engine bug: when PredictEagerStreaming later calls
+        // RegisterLayerTrainableTensorsWithWeightRegistry on a lazy layer's
+        // freshly-materialized weights, DropStorageForStreaming throws
+        // "storage refcount is 2" — some view/rebind operation in
+        // PatchEmbeddingLayer's OnFirstForward leaves the weight tensor's
+        // storage shared. Deferring streaming pre-engagement until the
+        // engine-side refcount issue is fixed.
     }
 
     // Gemma-3 (Google 2025): 896 / sqrt(4096) = 14 — SigLIP 14x14 patches.