Commit 390d7ed

ooples and franklinic authored
fix(scaffold): add Gemma3 + DeepSeekVL/InternVL family to patch-vision list (#1420)
* fix(scaffold): add Gemma3 + DeepSeekVL/InternVL family to patch-vision list

PR #1408 Generated Layers shard (run 26254401589 job 77275610156) had 23 Gemma3 tests all failing at the same boundary:

    System.ArgumentException : Image H/W (128/128) must be divisible by patchSize (14)
       at PatchEmbeddingLayer.OnFirstForward
       at Gemma3.Train / Gemma3.Predict

Gemma3 (Google 2025) uses SigLIP-SO 14×14 patches per its paper (ImageSize=896 / sqrt(MaxVisualTokens=4096) = 14). The auto-scaffold's generic vision-model branch emitted a [3, 128, 128] input that's not divisible by 14, so every test that calls Train or Predict hard-rejected at the very first layer.

Add the missing prefixes to s_patchVisionFamilies so the helper returns the patch-divisible 112 (= lcm(14, 16)) spatial size: Gemma, DeepSeekVL, InternVL, Llama32Vision, Phi3Vision, Phi4Multimodal. All six use ComputeVisualPatchSize → patchSize=14 via the LLaVA-MLP / SigLIP-SO vision adapter path. The existing 112 helper survives every patch-14 and patch-16 division, so no other vision model regresses.

Also mark Gemma3 as paper-scale (vision dim 1152, 27 vision layers, 3584 decoder dim, 36 decoder layers; true 3B-foundation scale) so its iteration-count overrides for paper-scale models engage. The warm-up Predict still OOMs on a standard CI runner because Gemma3's default config materializes too many lazy DenseLayer weights at construction time; the OOM constraint is independent of patch divisibility and remains as follow-up (likely needs streaming-pool engagement or a per-class scaffold override that constructs Gemma3 with reduced dims for testing).

* docs(#1420): document streaming-engagement blocker for Gemma3 OOM

The scaffold patch-divisibility fix lets Gemma3 reach the warm-up Predict's first lazy weight materialization, where it then OOMs the runner because Gemma3's paper-scale defaults (3B+ params via VisionDim=1152, 27 vision + 36 decoder layers) overflow the GC heap before any streaming-pool engagement.

The natural fix is to call ConfigureWeightLifetime(new GpuOffloadOptions()) in InitializeLayers after the layer list is populated but before any weight materializes, which is exactly what the LayerBase.UseStreamingAllocator flag was built for. I prototyped this locally and confirmed the lazy DenseLayer's AllocateLazyWeight DID route through the streaming pool (PredictEagerStreaming kicked in immediately). But the streaming path then trips a deeper engine bug:

    System.InvalidOperationException : Streaming drop requires sole storage ownership; storage refcount is 2. Register the weight via WeightRegistry before any RebindStorageFrom / view operation that shares its storage.
       at TensorBase.DropStorageForStreaming
       at WeightRegistry.RegisterWeight
       at PredictEagerStreaming:3775 (RegisterLayerTrainableTensorsWithWeightRegistry)
       at Gemma3.Train

The lazy weight tensor that PatchEmbeddingLayer.OnFirstForward materializes ends up with refcount=2 on its storage by the time PredictEagerStreaming's post-forward re-registration runs. Some view / init op (Xavier init, RegisterTrainableParameter, or ResolveShapes) is producing a second reference to the underlying storage that DropStorageForStreaming refuses to silently drop.

Deferring the streaming pre-engagement until the engine-side refcount issue is fixed. The patch-divisibility scaffold portion of this PR stays: it's the necessary first half (unblocks the same-shape failures across DeepSeekVL / InternVL / Llama32Vision / Phi3 / Phi4 even on smaller configs where OOM isn't an issue).

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
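
The patch-size arithmetic in the first commit message can be checked with a short, self-contained sketch. The class and method names below (PatchSizeSketch, PatchSizeFor, IsPatchDivisible) are hypothetical and exist only for illustration; they are not the scaffold's actual helpers, and only the numbers come from the message above.

```csharp
using System;

// Illustrative only: hypothetical helper names, not part of AiDotNet.
public static class PatchSizeSketch
{
    // Greatest common divisor via Euclid's algorithm.
    public static int Gcd(int a, int b)
    {
        while (b != 0)
        {
            (a, b) = (b, a % b);
        }
        return a;
    }

    // Least common multiple of two positive integers.
    public static int Lcm(int a, int b) => a / Gcd(a, b) * b;

    // Patch size implied by a square image and a square grid of visual tokens.
    public static int PatchSizeFor(int imageSize, int maxVisualTokens)
    {
        int tokensPerSide = (int)Math.Round(Math.Sqrt(maxVisualTokens));
        return imageSize / tokensPerSide;
    }

    // True when the spatial size splits evenly into patches.
    public static bool IsPatchDivisible(int spatialSize, int patchSize) =>
        spatialSize % patchSize == 0;

    public static void Main()
    {
        int patch = PatchSizeFor(896, 4096);
        Console.WriteLine(patch);                         // 14
        Console.WriteLine(IsPatchDivisible(128, patch));  // False (128 = 9 * 14 + 2)
        int size = Lcm(14, 16);
        Console.WriteLine(size);                          // 112
        Console.WriteLine(IsPatchDivisible(size, 14));    // True
        Console.WriteLine(IsPatchDivisible(size, 16));    // True
    }
}
```

Under these assumptions, 112 is the smallest spatial size that both patch-14 and patch-16 encoders accept, which matches the message's claim that no existing patch-16 model regresses.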
1 parent a801d11 commit 390d7ed

2 files changed

Lines changed: 35 additions & 0 deletions

src/AiDotNet.Generators/TestScaffoldGenerator.cs

Lines changed: 16 additions & 0 deletions
@@ -1239,6 +1239,16 @@ private static void CollectModelsFromNamespace(
     "Cambrian", "Dragonfly", "Eagle", "Mantis", "Maya", "MiniCPM",
     "Molmo", "Monkey", "Moondream", "NVLM", "Ovis", "VILA",
     "PathVLM", "RadFM", "QVQ", "SkyworkR1V", "GeoChat", "RSGPT", "SkyEyeGPT",
+    // InstructionTuned VLMs that also resolve a 14-patch SigLIP / ViT-L
+    // encoder via ComputeVisualPatchSize (Gemma3: 896/sqrt(4096)=14,
+    // DeepSeekVL/2: ViT-L/14, InternVL family: ViT-L/14, Llama32Vision:
+    // ViT-L/14, Phi3Vision/Phi4Multimodal: CLIP ViT-L/14). PatchEmbedding
+    // throws "Image H/W (128/128) must be divisible by patchSize (14)"
+    // when the scaffold's default 128 isn't divisible by 14 — surfaced
+    // in PR #1408 Generated Layers shard run 26254401589 as 23 Gemma3
+    // tests all failing at the same Forward boundary.
+    "Gemma", "DeepSeekVL", "InternVL", "Llama32Vision",
+    "Phi3Vision", "Phi4Multimodal",
 };

 /// <summary>
@@ -4350,6 +4360,12 @@ private static bool IsPaperScaleVisionLanguageModel(string className)
 {
     "BiomedCLIP" => true,
     "DFNCLIP" => true,
+    // Gemma3 (Google 2025): VisionDim=1152, DecoderDim=3584, 27 vision
+    // layers, 36 decoder layers, ImageSize=896 SigLIP-SO. Default Adam
+    // step OOMs the test runner before even completing the warm-up
+    // Predict — surfaced in PR #1408 Generated Layers shard as 23
+    // Gemma3 tests all failing.
+    "Gemma3" => true,
     _ => false,
 };
 }

src/VisionLanguage/InstructionTuned/Gemma3.cs

Lines changed: 19 additions & 0 deletions
@@ -6,6 +6,7 @@
 using AiDotNet.NeuralNetworks;
 using AiDotNet.Onnx;
 using AiDotNet.Optimizers;
+using AiDotNet.Tensors.Engines.DirectGpu;
 using AiDotNet.Tokenization;
 using AiDotNet.Tokenization.Interfaces;
 using AiDotNet.VisionLanguage.Interfaces;
@@ -137,6 +138,24 @@ protected override void InitializeLayers()
         ComputeEncoderDecoderBoundary();
     }
     ValidateEncoderDecoderBoundary(_encoderLayerEnd);
+
+    // NOTE: Gemma3 is paper-scale by default (VisionDim=1152, 27 vision
+    // layers, DecoderDim=3584, 36 decoder layers — Google's 4B/12B/27B
+    // SigLIP-SO family). The auto-detect streaming gate in
+    // NeuralNetworkBase reads ParameterCount, which returns 0 PRE-first-
+    // forward for lazy DenseLayers; the first warm-up Predict
+    // materializes the full lazy weight matrix on the GC heap and OOMs
+    // the runner before auto-detect gets a chance to engage.
+    //
+    // Calling ConfigureWeightLifetime(new GpuOffloadOptions()) here to
+    // pre-engage streaming was the natural fix but surfaces a downstream
+    // engine bug: when PredictEagerStreaming later calls
+    // RegisterLayerTrainableTensorsWithWeightRegistry on a lazy layer's
+    // freshly-materialized weights, DropStorageForStreaming throws
+    // "storage refcount is 2" — some view/rebind operation in
+    // PatchEmbeddingLayer's OnFirstForward leaves the weight tensor's
+    // storage shared. Deferring streaming pre-engagement until the
+    // engine-side refcount issue is fixed.
 }

 // Gemma-3 (Google 2025): 896 / sqrt(4096) = 14 — SigLIP 14x14 patches.
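
The NOTE in the Gemma3.cs hunk says a size-based streaming gate cannot fire while lazy layers still report zero parameters. The toy sketch below illustrates that ordering problem only. LazyDenseSketch, StreamingGateSketch, and the threshold are hypothetical stand-ins, not AiDotNet's DenseLayer or NeuralNetworkBase, and the real gate may work differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a lazily initialized dense layer (not AiDotNet's DenseLayer).
public sealed class LazyDenseSketch
{
    private readonly int _outputDim;
    private float[] _weights = Array.Empty<float>(); // nothing allocated until the first forward

    public LazyDenseSketch(int outputDim)
    {
        _outputDim = outputDim;
    }

    // Reports 0 until the weights have been materialized.
    public long ParameterCount => _weights.Length;

    // The first call allocates the full weight matrix; the input width is only known here.
    public void Forward(float[] input)
    {
        if (_weights.Length == 0)
        {
            _weights = new float[input.Length * _outputDim];
        }
    }
}

// Hypothetical size-based gate: stream only when the model looks large enough.
public static class StreamingGateSketch
{
    public static bool ShouldStream(IEnumerable<LazyDenseSketch> layers, long threshold) =>
        layers.Sum(layer => layer.ParameterCount) >= threshold;
}

public static class LazyGateDemo
{
    public static void Main()
    {
        var layers = new List<LazyDenseSketch> { new LazyDenseSketch(64), new LazyDenseSketch(64) };

        // Before any forward pass every layer reports 0, so the gate stays closed.
        Console.WriteLine(StreamingGateSketch.ShouldStream(layers, 1000)); // False

        // The forward pass allocates the weights, which is the step that OOMs at paper scale.
        foreach (var layer in layers)
        {
            layer.Forward(new float[32]);
        }

        // Only now does the count (2 * 32 * 64 = 4096) cross the threshold.
        Console.WriteLine(StreamingGateSketch.ShouldStream(layers, 1000)); // True
    }
}
```

In this sketch the gate opens only after the allocation it was meant to prevent, which is why the commit considered engaging streaming explicitly in InitializeLayers instead of relying on auto-detection.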
