perf(#1624): optimizer ladder + COW clone (fixed) + streaming param setter + lazy foundation-scale construct#1633
Conversation
…#1624) Picks up the recent Tensors improvements that this training-scale work builds on: - copy-on-write Tensor.Clone() (O(1)-until-write) — the foundation for the missing clone-footprint lever (G6) that addresses the Clone_ShouldProduceIdenticalOutput OOMs. - GPU/CPU parity kernel fixes (#626) + GQA backward kernel fix (#628 in flight). v0.98.0 contains the COW work (git tag --contains the #624 merge). AiDotNet core builds clean against it (0 errors) — no API breaks 0.96->0.98. Native OneDNN/OpenBLAS/CLBlast co-released in lockstep at 0.98.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Clone OOM (#1624) The large-diffusion-model Clone_ShouldProduceIdenticalOutput tests OOM the 16 GB runner because DiffusionModelBase clones via `new(...); SetParameters(GetParameters())` — a full-model flatten that materializes the entire weight set a SECOND time (plus a giant intermediate flat vector). This adds a copy-on-write parameter share built on the Tensors O(1)-until-write CloneShared (#624, pkg 0.98.0). - DiffusionModelBase.TryShareParametersFrom(source): parallel reflection walk of source+clone trainable layers (same field-order assumption CollectTrainableParameters already relies on); re-binds each clone layer's parameters to CloneShared() views of the source's. Fidelity-equivalent to the existing flat copy (both transfer exactly the trainable tensors — diffusion models carry no BatchNorm running stats / serialization extras), but O(1)-until-write. Falls back to the flat copy on any 1:1 structure mismatch. - Wired into Flux2SchnellModel + ControlNetPlusPlusFluxModel Clone() (the direct SetParameters pattern). VALIDATED: Flux2Schnell Clone_ShouldProduceIdenticalOutput passes under the issue's exact repro (DOTNET_GCHeapHardLimit=0x400000000 / 8 cores) in 36s — where #1624 documents an OOM — plus full no-cap fidelity. CogVideo clones sub-models (_videoUnet/_temporalVae.Clone()) so it needs the same share pushed into those sub-models (the universal rollout). Also lands the NeuralNetworkBase COW DeepCopy scaffold (TryDeepCopyCopyOnWrite) DEFAULT-OFF (AIDOTNET_COW_DEEPCOPY=1 to opt in): it is correct for inference clones but a layer can hold trained state outside GetTrainableParameters (BatchNorm extras, registered buffers, + a further FT/TabTransformer category) that the share path doesn't yet carry, which Clone_AfterTraining_ShouldPreserveLearnedWeights catches. Full NN fidelity is the follow-up before flipping it on. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rd Clone-OOM target) (#1624) Completes the three explicit #1624 Clone-OOM targets (Flux2Schnell + ControlNet++ already validated at the 16GB/8-core repro; CogVideo here). Extracts the share into a single base-agnostic helper: - AiDotNet.Helpers.CopyOnWriteCloneHelper.TryShareTrainableParameters<T>(source, dest): parallel reflection walk that re-binds each dest trainable tensor to an O(1)-until-write CloneShared() view of source's (Tensors #624 / pkg 0.98.0). Guards on type/structure mismatch; never half-shares. DiffusionModelBase.TryShareParametersFrom now delegates to it (one implementation). - CogVideo (Type-B): build with fresh sub-models and share, instead of cloning each sub-model's full weight set. Scoped deliberately to the three explicit targets rather than a blanket conversion of all ~120 diffusion models: large-model Clone tests TIME OUT on the forward pass locally (not the clone), so a broad rollout cannot be reliably validated on this box — it needs the actual 16GB CI runner, plus per-model lazy-layer materialization (TriggerLazyShapeResolution / probe-forward, PR #1596) before sharing so a fresh clone's lazy layers don't re-initialize over the shared tensors. Tracked as the universal-rollout follow-up. Build clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
, #2) Rebuilds the NeuralNetworkBase COW DeepCopy path (still default-off; AIDOTNET_COW_DEEPCOPY=1 to opt in) to capture the trained state the previous _layers-only walk silently dropped: - Reflection walk (CopyOnWriteCloneHelper.CollectTrainableLayers) captures trainable layers held in dedicated FIELDS — e.g. a tabular transformer's feature tokenizer / encoder stack / final layer-norm — not just the base _layers list. This was the FT/TabTransformer "third category" (their components live in fields, so a _layers walk cloned fresh-random weights for them). - Serialization EXTRAS (BatchNorm running mean/variance) copied eagerly via SetExtraParameters (self-allocating, validated against the resolved dst shape) instead of falling back — fixes DenseNet/ ResNet/VGG. - Lazy destination layers resolved from the source input shape before sharing so a fresh clone's lazy layers don't re-initialize over the shared tensors. - Public helper APIs are now typed IFullModel<T,Tensor<T>,Tensor<T>> (all model bases implement it) instead of `object` — the internal reflection walk stays object? as it traverses arbitrary fields. Clean-env validation (flag on): the previously-failing FT/Tab + DenseNet + ResNet + Autoencoder + CNN Clone (incl. Clone_AfterTraining) now PASS. Remaining before default-on: VariationalAutoencoder fails deterministically — its reparameterization sampling depends on a serialized RNG seed that the eager serialize-clone reproduces but a COW clone doesn't (non-tensor state); plus a SiameseNetwork case to confirm vs the known shared-engine-singleton test contention. Build clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mplete (#1624, #2) Adds a cheap, side-effect-free guard to the NeuralNetworkBase COW path (still default-off): after the reflection walk, sum the element count of the tensors GetTrainableParameters exposes and compare to the model's ParameterCount. If they differ, the model's GetParameters reads weights NOT surfaced through a trainable layer (a nested sub-model's own params, a custom VAE/Siamese layout) — the share would be incomplete — so fall back to the eager full-fidelity copy. No flatten, no forward, so it doesn't reintroduce the OOM or mutate stateful models. Clean-env validation (flag on): VariationalAutoencoder + SiameseNetwork now PASS (auto-fall-back) on top of the previously-fixed FT/Tab + DenseNet + ResNet + Autoencoder + CNN (which keep verified COW). Remaining before default-on: LSTM. It has FULL coverage (walked == ParameterCount) but still diverges — LSTMLayer registers its 12 weight/bias FIELDS via manual RegisterTrainableParameter and has NO SetTrainableParameters override, so the base impl updates the registered list but not the fields the recurrent forward reads (the "fused/field" divergence #624 flagged). Fixing it needs a per-layer SetTrainableParameters override (watch RegisterTrainableParameter's role-collapse) + an audit of other manual-registration layers. Default-off, so no impact on current behavior. Build clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…OMs (#1624, #2) Flips UseCopyOnWriteDeepCopy to default-ON (AIDOTNET_COW_DEEPCOPY=0 to disable). The COW path is correct for every model by construction — it shares trainable tensors only when the reflection walk fully accounts for ParameterCount AND copies serialization extras, and otherwise falls back to the eager full-fidelity copy — so it never yields a less-faithful clone than before. Validated (broad NN Clone_AfterTraining sweep, ON vs OFF): - COW ON regresses NOTHING — its only failures (HTMNetwork, LSTM) also fail with COW OFF (pre-existing; LSTM is the #1221 lazy-deserialize eager bug, not COW). - COW ON FIXES large-model trained-clone OOM/timeouts that the eager clone's memory doubling caused: ResNet, VGG, SimCSE (and others) now PASS at O(1)-until-write — the broad ON run was 6x faster (30s vs 2m52s) with 2 failures vs 32. - FT/Tab/DenseNet/Autoencoder/CNN share verified-correct; VAE/Siamese fall back via the coverage guard. The largest models (BGE/ColBERT/ViT-class) still OOM on TRAINING memory itself (not the clone) — that's the G7/G8 training-lever work (#3), separate from this clone-footprint lever. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GetOrCreateBaseOptimizer() now returns Adam8BitOptimizer (block-quantized m/v moment state) instead of fp32 AdamOptimizer when the model is large (ParameterCount >= 16M, ~256 MB of fp32 optimizer state). A standard Adam keeps two fp32 moment buffers, each the size of the model = 2x the weights, the dominant training-step memory cost after the parameters; 8-bit quantization takes that to ~0.5x, saving ~1.5x the model size of RAM per step so large models that OOM their training footprint on the 8c/16 GB runner fit. Small/medium models keep fp32 for exact reproducibility. AIDOTNET_ADAM8BIT=0 forces fp32 everywhere; =1 forces 8-bit (testing). Validated: forcing 8-bit on small models, LossStrictlyDecreases + Training_ShouldReduce stay green (8/8) — 8-bit Adam matches fp32 within tolerance, so convergence is preserved. The memory win on the actual OOM models is confirmed on the CI repro (they time out on the forward locally). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…hNorm models (#1624) TrainCore now routes large models with a large batch and no BatchNorm to TrainWithGradientAccumulation (chunk size 8): forward+backward per chunk, accumulate, single optimizer step. This caps the per-step ACTIVATION peak (only chunk-many samples' activations resident at once vs the whole batch) so models that OOM their training-step activation footprint on the 8c/16 GB runner fit. Gating (all correctness-necessary except the size heuristic): - batch > chunk size (8) — nothing to chunk otherwise. - LossFunctionBase present — required by the accumulation path's tape loss. - NO BatchNorm layer — micro-batching changes per-chunk batch statistics, so accumulation is only gradient-EQUIVALENT for per-sample norms (LayerNorm/GroupNorm/InstanceNorm/RMSNorm). BatchNorm models keep the full-batch step. - ParameterCount >= 16M (heuristic) — only worth the extra steps for large models; AIDOTNET_MICROBATCH=1 forces regardless of size (testing), =0 disables. Validated: gradient-accumulation equivalence tests (Issue1296) 21/21; with micro-batch FORCED on non-BN models (Attention/Transformer/Autoencoder), LossStrictlyDecreases + Training_ShouldReduce + MoreData stay green (12/12) — convergence preserved. Activation-memory win on the OOM models confirmed on the CI repro. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ize) (#1624) Size-based default-on engagement (>=16M params) regressed ~12 medium-large models that fit fine in fp32: a broad NN training sweep went 84 -> 96 failures because 8-bit Adam moments (G2) and micro-batch accumulation (G8) slightly change convergence, so engaging them on a model that is NOT memory-bound is a needless regression. Switch both levers to reactive engagement: train normally, and only after a training step throws OutOfMemoryException latch the levers ON, drop the fp32 optimizer so it rebuilds as 8-bit, reclaim the failed step transients, and retry. A model that fits never reaches that path, so it keeps exact fp32 full-batch training (no convergence change); only a model that would otherwise crash pays the levers cost. BatchNorm models still skip G8 (correctness over recovery). AIDOTNET_ADAM8BIT / AIDOTNET_MICROBATCH still force for testing. Sweep with reactive defaults returns to baseline (86 vs 84; +-2 is contention noise), eliminating the +12 regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 2 Skipped Deployments
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds copy-on-write parameter sharing helpers and chunked parameter assignment APIs, updates many diffusion and predictor clone paths to prefer shared parameters with fallback copying, and adds BF16 optimizer-moment support plus memory-adaptive training behavior. ChangesCOW cloning, chunk streaming, and training-memory updates
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…odels (#1624) Adds the gentle middle rung to the G2 optimizer-memory ladder: fp32 (4B, default) -> BF16 (2B, proactive) -> 8-bit block-quant (1B, reactive). BF16 keeps the full float32 exponent (only the mantissa shortens), so unlike the 8-bit block quantization it changes Adam's trajectory negligibly and can be engaged proactively by a size gate without a convergence regression. Adam8BitOptimizer gains a UseBFloat16MomentStorage mode storing m/v as 2-byte BF16 in the tape state (no blocks/scales); the per-parameter loop expands only one parameter's moments to full precision at a time, so resident state stays at the compressed width. GPU/sparse fast-paths are gated off for the BF16 mode. BF16 pack/unpack reuses the TFM-safe BitConverterHelper (union fallback on net471). NeuralNetworkBase wires the ladder: bf16 for models >=50M params (AIDOTNET_BF16_ADAM overrides), escalating reactively to 8-bit on an actual OOM, else plain fp32 Adam. Validation (NN training sweep, 172 tests): baseline 84 fail; bf16 gated-default 85 (+1, contention noise); bf16 forced-everywhere 91; 8-bit-everywhere was 96. The >=50M gate keeps small models on fp32 untouched. net10.0 + net471 build green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ation (#1624) Pins the load-bearing guarantee of the copy-on-write Clone lever: a clone is observationally identical to its source, and the first in-place write to either side privatizes the shared tensor storage so the other is never corrupted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… correctness fixes The manual copy-on-write design here (DiffusionModelBase.ShareWeightsFrom / EnsureOwnWeights / WeightShareGroup + ~80 per-model Clone edits) is superseded by the single global tensor-level COW in #1633 (Tensor.CloneShared, automatic privatization, no per-mutation-site guard). Reverted all of that to keep ONE COW design across the library. Retained the genuinely-necessary, non-COW work: - DiTNoisePredictor / UNetNoisePredictor: run the per-step denoising forward EAGERLY instead of through the single-input compile cache. The compiled plan re-binds only the noisy-sample slot and bakes the per-step timestep embedding as a constant, so the denoising loop would replay step 0's output every step (silent corruption, compilation defaults on). Eager is the only correct path until the compile layer supports multiple mutable inputs. - DiffusionModelTestBase: BLAS-thread cap (#1305) so parallel foundation-scale diffusion tests share cores instead of each grabbing all of them and blowing the 120s timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ption 2, #1624) Bridge until the full PyTorch-style transparent-Clone refactor (separate PR): give the ~100 diffusion model / noise-predictor / VAE Clone() overrides O(1) copy-on-write by swapping clone.SetParameters(GetParameters()) for if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters()); which shares weight storage via Tensor.CloneShared (privatize-on-write) and falls back to the eager flat copy if the trainable-layer structure doesn't line up 1:1. - CopyOnWriteCloneHelper: add an object-graph overload so models that aren't IFullModel (NoisePredictorBase, VAEModelBase — the heavy DiT/UNet predictors + large VAEs whose flat-copy Clone OOMs) can use the same share logic; existing IFullModel overload unchanged. - NoisePredictorBase / VAEModelBase: add protected TryShareParametersFrom(...) wrapping it. Builds net10.0 + net471; COW correctness test still green.
Resolve conflicts where master gained independent fixes to files this branch COW-swapped: - Directory.Packages.props: keep this branch's newer AiDotNet.Tensors 0.98.0. - FlagDiTPredictor / Multi-/Spot-/StitchDiffusionModel: take master's corrected Clone constructor (full Flag-DiT ctor args; pass the cloned predictor+VAE to avoid the 643M-vs-10M param-count mismatch) AND apply the COW swap on top. - VideoUNetPredictor: keep master's deliberate paired per-layer copy (its fused-CPU weight pack needs the explicit copy to avoid stale-pack divergence) — NOT routed through COW; documented why. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Deployment failed with the following error: Learn More: https://vercel.com/franklins-projects-02a0b5a0?upgradeToPro=build-rate-limit |
NoisePredictorBase and VAEModelBase ARE IFullModel<T,Tensor<T>,Tensor<T>> (transitively, via INoisePredictor<T> / IVAEModel<T>), so the existing IFullModel overload of CopyOnWriteCloneHelper.TryShareTrainableParameters already covers them — the added object-graph overload was dead (overload resolution prefers the more specific IFullModel signature) and based on a mistaken reading of the direct class declarations. Restore the single IFullModel overload; the bases' TryShareParametersFrom calls bind to it via the standard interface conversion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CogVideoModel.Clone was rewritten in this PR to build a fresh-default sub-model graph and transfer weights at the model level. A fresh VideoUNetPredictor has UNRESOLVED lazy layers, so neither the COW share (structure mismatch -> falls back) nor the flat SetParameters could reshape the clone to the source's resolved 573M-parameter structure: the clone silently came out as a different, smaller network (425M params, clone output diverged ~2x). Verified via CogVideoModelTests.Clone_ShouldProduceIdenticalOutput (was FAIL, now PASS: identical param count, 0 weight mismatches, maxOutDiff=0). Fix: compose the clone from each sub-model's own Clone(), which resolves its lazy shapes first and applies the memory-efficient transfer internally (VideoUNetPredictor: paired per-layer copy that avoids the fused-CPU stale-pack divergence; TemporalVAE: copy-on-write). This is the codebase's established safe pattern for fused-CPU-pack predictors. Also remove a duplicated TryShareParametersFrom guard (if (!x) if (!x) ...) in Flux2SchnellModel.Clone and ControlNetPlusPlusFluxModel.Clone left by the COW swap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 28
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
src/Models/Options/Adam8BitOptimizerOptions.cs (1)
26-27:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winBlocking: add the required options constructors and copy the new BF16 flag
This options type is missing the explicit default constructor and copy constructor required by the options golden pattern. With the new
UseBFloat16MomentStorageflag on Line 153, copy semantics are now especially important to avoid silent configuration drift.Suggested fix
public class Adam8BitOptimizerOptions<T, TInput, TOutput> : AdamOptimizerOptions<T, TInput, TOutput> { + public Adam8BitOptimizerOptions() + { + } + + public Adam8BitOptimizerOptions(Adam8BitOptimizerOptions<T, TInput, TOutput> other) + { + ArgumentNullException.ThrowIfNull(other); + + BlockSize = other.BlockSize; + UseDynamicQuantization = other.UseDynamicQuantization; + QuantizationPercentile = other.QuantizationPercentile; + FullPrecisionUpdateFrequency = other.FullPrecisionUpdateFrequency; + UseStochasticRounding = other.UseStochasticRounding; + CompressBothMoments = other.CompressBothMoments; + UseBFloat16MomentStorage = other.UseBFloat16MomentStorage; + } + // existing properties... }As per coding guidelines:
Default Constructor: Parameterless constructor public ModelNameOptions() { }andCopy Constructor ... Copies ALL properties from other to thisare required for options classes.Also applies to: 134-154
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/Models/Options/Adam8BitOptimizerOptions.cs` around lines 26 - 27, The Adam8BitOptimizerOptions class is missing required constructors per the options golden pattern. Add a parameterless default constructor and a copy constructor that accepts another Adam8BitOptimizerOptions instance and copies all properties from it (including the new UseBFloat16MomentStorage flag introduced on line 153). Ensure the copy constructor properly initializes all inherited and new properties to prevent configuration drift between option instances.Source: Coding guidelines
src/Optimizers/Adam8BitOptimizer.cs (1)
529-542:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCached tape-state mode mismatch can null-deref after options updates
Steponly reallocates state on missing key/length mismatch. If_options.UseBFloat16MomentStoragechanges on an existing optimizer instance, old non-BF16 entries remain cached and Line 562 (state.MBf16!) can dereference null.Suggested fix
- if (!_tapeStates.TryGetValue(param, out var state) || state.Length != param.Length) + bool incompatibleMode = + _tapeStates.TryGetValue(param, out var s) && + ((_options.UseBFloat16MomentStorage && (s.MBf16 is null || s.VBf16 is null)) || + (!_options.UseBFloat16MomentStorage && (s.VQuantized is null || s.VScales is null))); + + if (!_tapeStates.TryGetValue(param, out var state) || state.Length != param.Length || incompatibleMode) { ... }protected override void UpdateOptions(OptimizationAlgorithmOptions<T, TInput, TOutput> options) { if (options is Adam8BitOptimizerOptions<T, TInput, TOutput> adamOptions) { + bool modeChanged = _options.UseBFloat16MomentStorage != adamOptions.UseBFloat16MomentStorage + || _options.CompressBothMoments != adamOptions.CompressBothMoments + || _options.BlockSize != adamOptions.BlockSize; + _options = adamOptions; + if (modeChanged) _tapeStates.Clear(); }Also applies to: 560-564, 1166-1169
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/Optimizers/Adam8BitOptimizer.cs` around lines 529 - 542, The state reallocation logic in the condition starting at line 529 only checks for missing key or length mismatch, but does not validate whether the cached state's storage mode matches the current _options.UseBFloat16MomentStorage setting. When this option is toggled on an existing optimizer instance, old cached entries with the wrong storage mode remain in _tapeStates, causing null-dereference when accessing state.MBf16 at line 562. Add an additional condition to the state validity check that verifies the storage mode of the cached state matches the current UseBFloat16MomentStorage option, and reallocate the state via AllocateTapeState if they don't match. Apply this same validation check to the other locations mentioned (lines 560-564 and 1166-1169) where BF16 state access occurs.src/Diffusion/Control/ControlNetLiteModel.cs (1)
118-126:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClone fallback can return a non-equivalent model state.
When Line 150 falls back to
SetParameters(GetParameters()),_vaeis not copied by this class’s parameter contract (Lines 118–141 exclude it). IfTryShareParametersFromreturns false, the clone keeps a fresh VAE, so clone output can diverge from source.Proposed fix
public override IDiffusionModel<T> Clone() { var clone = new ControlNetLiteModel<T>(controlType: _controlType, conditioner: _conditioner, seed: RandomGenerator.Next()); - if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters()); + if (!clone.TryShareParametersFrom(this)) + { + clone.SetParameters(GetParameters()); + clone._vae.SetParameters(_vae.GetParameters()); + } return clone; }Also applies to: 129-141, 150-150
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/Diffusion/Control/ControlNetLiteModel.cs` around lines 118 - 126, The GetParameters method in ControlNetLiteModel only retrieves parameters from _baseUNet and _controlEncoder, but excludes _vae. When the Clone method falls back to SetParameters(GetParameters()) at line 150 (when TryShareParametersFrom returns false), the _vae is not copied, resulting in a clone with a fresh VAE instance that can diverge from the source model. Fix this by ensuring the _vae state is properly included in the parameter cloning process. Either extend the GetParameters and SetParameters methods to handle _vae parameters alongside the existing _baseUNet and _controlEncoder parameters, or implement explicit _vae cloning logic in the Clone method fallback to ensure the cloned model maintains equivalent state to the source.src/Diffusion/Control/ControlNetSD3Model.cs (1)
76-77:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBlocking: Clone fallback can return a non-equivalent model when COW sharing fails.
Clone()falls back toSetParameters(GetParameters()), but this class’s parameter path excludes_vae. IfTryShareParametersFromreturnsfalse, the clone keeps fresh VAE weights and no longer mirrors the source model.Proposed fix
public override IDiffusionModel<T> Clone() { var clone = new ControlNetSD3Model<T>( controlType: _controlType, conditioner: _conditioner, seed: RandomGenerator.Next()); - if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters()); + if (!clone.TryShareParametersFrom(this)) + { + clone.SetParameters(GetParameters()); + clone._vae.SetParameters(_vae.GetParameters()); + } return clone; }Also applies to: 128-160, 166-173
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/Diffusion/Control/ControlNetSD3Model.cs` around lines 76 - 77, The Clone() method in ControlNetSD3Model falls back to SetParameters(GetParameters()) but this approach excludes the _vae field from being cloned, causing fresh VAE weights when the copy-on-write sharing fails. Ensure that both GetParameters() and SetParameters() methods include the _vae parameters along with _predictor and _controlEncoder parameters, so that Clone() can properly restore all model components. Alternatively, modify the Clone() method to explicitly handle cloning or sharing the _vae field in addition to using SetParameters(GetParameters()), ensuring the cloned model fully mirrors the source model's state including its VAE component.src/Diffusion/VAE/AutoencoderKL.cs (1)
427-445:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBlocking: Clone does not preserve non-default architecture configuration and can silently partial-copy weights.
Clone()hardcodes constructor values (numResBlocks,numGroups,inputSpatialSize) instead of reusing the source configuration. If shape/parameter layout differs, COW sharing can fail andSetParameterscurrently allows partial assignment without length enforcement.Proposed fix
+private readonly int _numResBlocks; +private readonly int _numGroups; +private readonly int _inputSpatialSize; public AutoencoderKL( int inputChannels = 3, int latentChannels = 4, int baseChannels = 128, int[]? channelMults = null, int numResBlocks = 2, int numGroups = 32, double? latentScaleFactor = null, int inputSpatialSize = 512, ILossFunction<T>? lossFunction = null, int? seed = null) : base(lossFunction, seed) { + _numResBlocks = numResBlocks; + _numGroups = numGroups; + _inputSpatialSize = inputSpatialSize; ... } public override void SetParameters(Vector<T> parameters) { var encoderParams = _encoder.GetParameters(); var decoderParams = _decoder.GetParameters(); + var expectedTotal = encoderParams.Length + decoderParams.Length; + if (parameters.Length != expectedTotal) + throw new ArgumentException($"Expected {expectedTotal} parameters, got {parameters.Length}.", nameof(parameters)); ... } public override IVAEModel<T> Clone() { var clone = new AutoencoderKL<T>( _inputChannels, _latentChannels, _baseChannels, _channelMults, - numResBlocks: 2, - numGroups: 32, + numResBlocks: _numResBlocks, + numGroups: _numGroups, _latentScaleFactor, - inputSpatialSize: 512, + inputSpatialSize: _inputSpatialSize, LossFunction); if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters()); return clone; }Also applies to: 536-550
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/Diffusion/VAE/AutoencoderKL.cs` around lines 427 - 445, The SetParameters method in AutoencoderKL allows partial weight assignment without proper validation, which can silently fail to copy all required parameters. Add a length validation check at the beginning of the SetParameters method to ensure that the input parameters vector length exactly equals the sum of encoderParams.Length and decoderParams.Length, and throw an appropriate exception if there is a mismatch instead of allowing incomplete parameter assignment.src/Diffusion/NoisePredictors/UViTNoisePredictor.cs (2)
584-592:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBlocking:
Clone()does not preserve full construction/runtime state.
Clone()dropslatentSpatialSizeandseed, and does not copy_posEmbed. A clone can diverge from the source even if COW sharing succeeds.💡 Proposed fix
+private readonly int _latentSpatialSize; +private readonly int? _seed; + public UViTNoisePredictor( int inputChannels = 4, int hiddenSize = 512, int numLayers = 12, int numHeads = 8, int patchSize = 2, int contextDim = 0, int latentSpatialSize = 32, int? seed = null) : base(seed: seed) { + _seed = seed; + _latentSpatialSize = latentSpatialSize; numLayers = numLayers % 2 == 0 ? numLayers : numLayers + 1; @@ public override INoisePredictor<T> Clone() { var clone = new UViTNoisePredictor<T>( inputChannels: _inputChannels, hiddenSize: _hiddenSize, numLayers: _numLayers, numHeads: _numHeads, patchSize: _patchSize, - contextDim: _contextDim); + contextDim: _contextDim, + latentSpatialSize: _latentSpatialSize, + seed: _seed); + + if (_posEmbed is not null) + clone._posEmbed = (Tensor<T>)_posEmbed.CloneShared(); + if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters()); return clone; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/Diffusion/NoisePredictors/UViTNoisePredictor.cs` around lines 584 - 592, The Clone() method in UViTNoisePredictor<T> is not preserving the full state of the object. When creating the clone instance, add latentSpatialSize and seed as parameters to the constructor call (in addition to the existing parameters like inputChannels, hiddenSize, etc.). Additionally, after the TryShareParametersFrom and SetParameters calls, you must copy the _posEmbed field from the source instance to the cloned instance to ensure the clone has identical runtime state as the original object.
546-552:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winBlocking:
SetParametersonly applies a partial model state.
GetParametersemits encoder/middle/decoder/skip/final parameters, butSetParameterswrites only_patchEmbed,_timeEmbed1,_timeEmbed2. This silently leaves most weights stale.💡 Proposed fix
public override void SetParameters(Vector<T> parameters) { int offset = 0; offset = SetLayerParams(_patchEmbed, parameters, offset); offset = SetLayerParams(_timeEmbed1, parameters, offset); offset = SetLayerParams(_timeEmbed2, parameters, offset); + + foreach (var block in _encoderBlocks) + offset = SetBlockParams(block, parameters, offset); + + offset = SetBlockParams(_middleBlock, parameters, offset); + + for (int i = 0; i < _decoderBlocks.Count; i++) + { + offset = SetBlockParams(_decoderBlocks[i], parameters, offset); + offset = SetLayerParams(_skipProjections[i], parameters, offset); + } + + offset = SetLayerParams(_finalNorm, parameters, offset); + offset = SetLayerParams(_outputProj, parameters, offset); + + if (offset != parameters.Length) + { + throw new ArgumentException( + $"Expected to consume {offset} parameters but received {parameters.Length}.", + nameof(parameters)); + } } + +private static int SetBlockParams(UViTBlock block, Vector<T> parameters, int offset) +{ + if (block.Norm1 != null) offset = SetLayerParams(block.Norm1, parameters, offset); + if (block.Attention != null) offset = SetLayerParams(block.Attention, parameters, offset); + if (block.Norm2 != null) offset = SetLayerParams(block.Norm2, parameters, offset); + if (block.MLP1 != null) offset = SetLayerParams(block.MLP1, parameters, offset); + if (block.MLP2 != null) offset = SetLayerParams(block.MLP2, parameters, offset); + return offset; +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/Diffusion/NoisePredictors/UViTNoisePredictor.cs` around lines 546 - 552, The SetParameters method in the UViTNoisePredictor class is incomplete and only restores parameters for _patchEmbed, _timeEmbed1, and _timeEmbed2, while GetParameters retrieves parameters for encoder, middle, decoder, skip, and final components. Expand the SetParameters method to call SetLayerParams for all the same layer components that GetParameters emits, in the same order, to ensure the complete model state is restored when loading parameters. Reference the GetParameters method to determine which layers need to be included in the SetParameters call sequence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/Diffusion/FastGeneration/AutoRegressiveMaskedDiffusion.cs`:
- Around line 140-142: The clone method in AutoRegressiveMaskedDiffusion<T>
creates a new instance with only conditioner and seed parameters, losing custom
configuration from the source instance. When the source was created with custom
architecture, options, scheduler, predictor, or vae components, the clone will
have incompatible layouts and the fallback SetParameters call will fail. Instead
of reconstructing the clone with default components, ensure that the new
AutoRegressiveMaskedDiffusion<T> constructor call passes all configuration
parameters (architecture, options, scheduler, predictor, vae) from the current
instance so the clone maintains the same setup as the source.
In `@src/Diffusion/FastGeneration/SANASprintModel.cs`:
- Around line 146-148: The Clone method in SANASprintModel<T> is creating a new
instance with only the conditioner and seed parameters, which leaves other
architectural components uninitialized. This causes the cloned model to have a
different structure than the source, making parameter sharing and the fallback
copy operation fail or produce mismatched layouts. Modify the constructor call
on line 146 to also pass the source model's architecture, options, scheduler,
predictor, and vae properties to ensure the clone has an identical structure
before attempting to share or copy parameters.
In `@src/Diffusion/FastGeneration/SCottModel.cs`:
- Around line 143-145: The Clone method in SCottModel<T> creates a new instance
with only the conditioner parameter, which does not preserve custom predictor or
VAE configurations from the source model. This causes parameter-count mismatches
when parameter sharing fails. Modify the clone instantiation to pass all
relevant configuration properties from the source model to the constructor
(beyond just conditioner), ensuring the cloned instance has identical
configuration to the original before attempting parameter sharing or copying.
In `@src/Diffusion/FastGeneration/TransfusionModel.cs`:
- Around line 143-145: The Clone() method in TransfusionModel<T> creates a fresh
default instance instead of preserving the original's custom configuration like
architecture, options, scheduler, predictor, and vae. When
TryShareParametersFrom(this) fails on line 144, the fallback
SetParameters(GetParameters()) can throw if parameter counts mismatch. Modify
the Clone() method to copy all custom configuration settings from the source
instance to the cloned instance before attempting parameter sharing, ensuring
the cloned model maintains the same setup as the original and eliminates
reliance on the fragile SetParameters fallback.
In `@src/Diffusion/ImageEditing/FlowEditModel.cs`:
- Around line 137-139: The Clone method in FlowEditModel<T> creates a new
instance with only the conditioner and seed parameters, but discards other
source constructor state like architecture, options, scheduler, predictor, and
vae. This causes the subsequent TryShareParametersFrom call to operate on an
improperly initialized instance. Modify the constructor call on line 137 to pass
all necessary parameters from the current instance (this._architecture,
this._options, this._scheduler, this._predictor, this._vae, etc.) in addition to
the conditioner and RandomGenerator.Next() seed, ensuring the clone is fully
initialized with the same configuration as the source before attempting
parameter sharing.
In `@src/Diffusion/ImageEditing/FreeInpaintModel.cs`:
- Around line 131-133: The FreeInpaintModel<T> constructor creates a clone with
default values that do not match the source model's architecture, options,
scheduler, predictor, and vae settings. This structural mismatch causes
SetParameters to fail when TryShareParametersFrom returns false. Modify the
FreeInpaintModel<T> clone instantiation at line 131 to pass the same
architecture configuration parameters (such as options, scheduler, predictor,
and vae) from the current instance to the clone constructor, ensuring the clone
has structural parity with the source before attempting to share or apply
parameters.
In `@src/Diffusion/ImageEditing/HDPainterModel.cs`:
- Around line 136-138: The Clone method in HDPainterModel is not preserving the
full configuration state when creating a cloned instance. The new
HDPainterModel<T> being instantiated on line 136 only passes the conditioner and
a new random seed, but does not preserve other constructor parameters from the
source instance. This causes model layout mismatches when the source has been
customized, making the TryShareParametersFrom fallback fail on line 137. Store
all constructor parameters as private fields in HDPainterModel so they can be
accessed during cloning, then pass these preserved parameters when creating the
cloned instance to ensure the clone has the same configuration as the source.
In `@src/Diffusion/ImageEditing/SD3InpaintingModel.cs`:
- Around line 137-139: The Clone() method in the SD3InpaintingModel class
creates a new instance with only the conditioner and seed, which does not
preserve the original source construction parameters and dependencies. This
causes the TryShareParametersFrom call to potentially fail, resulting in a
fallback to SetParameters that throws due to parameter-length mismatch. Fix this
by ensuring the new SD3InpaintingModel instance created in the clone operation
is constructed with all the same dependencies and parameters as the source, so
that either TryShareParametersFrom succeeds or the clone is already properly
initialized. Review the SD3InpaintingModel constructor to identify all required
dependencies beyond conditioner and seed, and pass them when creating the clone
instance.
In `@src/Diffusion/ImageEditing/Step1XEditModel.cs`:
- Around line 150-152: The Clone() method in Step1XEditModel is discarding the
source instance's configuration (architecture, options, scheduler, predictor,
vae) when creating a new Step1XEditModel instance on line 150. When the
TryShareParametersFrom fallback on line 151 fails, the SetParameters call can
then fail due to mismatched parameter layouts. Fix this by passing all the
source instance's configuration parameters (architecture, options, scheduler,
predictor, vae) to the Step1XEditModel constructor when creating the clone, not
just the conditioner and new seed, so that the cloned instance preserves the
complete configuration of the original.
In `@src/Diffusion/ImageEditing/TurboEditModel.cs`:
- Around line 136-138: The Clone() method in TurboEditModel<T> is reconstructing
the cloned instance with only the conditioner and seed parameters, but it is not
preserving the original model's custom predictor and VAE configuration. This
causes TryShareParametersFrom to fail and subsequently SetParameters to throw
due to mismatched parameter counts. When creating the new clone instance in the
Clone() method, pass the original model's predictor and VAE configuration along
with the conditioner and seed, so that the cloned model maintains the same
wiring and configuration as the source model before attempting to share or set
parameters.
In `@src/Diffusion/MotionGeneration/MoMaskModel.cs`:
- Around line 122-124: The Clone() method in MoMaskModel is creating a new
instance with only conditioner and seed parameters on line 122, but it is not
carrying over critical architectural state like architecture, options,
scheduler, predictor, and vae from the source instance. This causes a mismatch
when the source was built with non-default components and the fallback
SetParameters() call is insufficient. Modify the MoMaskModel constructor call on
line 122 to pass all the necessary architectural and configuration parameters
(architecture, options, scheduler, predictor, vae) from the current instance in
addition to conditioner and seed, ensuring the cloned instance is fully
initialized with the same state as the source.
In `@src/Diffusion/NoisePredictors/MMDiTXNoisePredictor.cs`:
- Around line 336-337: The issue is that when TryShareParametersFrom returns
true in the clone creation logic, the _posEmbed state is not being copied to the
clone, leaving it in an incomplete state. Since TryShareParametersFrom only
handles trainable layer tensors, you need to ensure _posEmbed is explicitly
copied regardless of whether parameter sharing succeeds or fails. Add a separate
statement after the conditional to copy _posEmbed from this instance to the
clone, or ensure _posEmbed is copied in both the success path (when
TryShareParametersFrom returns true) and the fallback path (when SetParameters
is called).
In `@src/Diffusion/Panorama/DiffPanoModel.cs`:
- Around line 138-140: The Clone() method in DiffPanoModel creates a new
instance with default architecture/options/scheduler/predictor/vae settings,
which causes SetParameters to fail if TryShareParametersFrom cannot share
parameters due to structural mismatch. When constructing the cloned instance in
the Clone() method, capture and pass the source instance's architecture,
options, scheduler, predictor, and vae configuration settings to the
DiffPanoModel constructor instead of relying on defaults, ensuring the cloned
graph structure matches the source before attempting parameter operations.
In `@src/Diffusion/StyleTransfer/ConsisLoRAModel.cs`:
- Around line 111-112: The Clone() method in ConsisLoRAModel is creating a new
instance without preserving the source's configuration. The constructor call on
line 111 only passes conditioner and seed, causing the clone to use default
values for Architecture, Options, Scheduler, and custom predictor/VAE instances.
This leads to parameter-count mismatch or behavioral differences when
SetParameters is called on line 112. Modify the ConsisLoRAModel constructor call
to include all source configuration parameters (Architecture, Options,
Scheduler, and any custom predictor/VAE instances) so the clone preserves the
exact configuration of the source object rather than rebuilding with defaults.
In `@src/Diffusion/StyleTransfer/InstantStyleModel.cs`:
- Around line 114-115: The Clone() method in the InstantStyleModel class is not
preserving the source model's topology and configuration. The issue is in the
instantiation of the new InstantStyleModel on line 114, which only passes the
conditioner and a new seed, discarding any custom configuration from the source
model. To fix this, ensure that when creating the cloned instance, you preserve
all relevant configuration and topology from the source model (this) before
attempting to share or copy parameters. This will prevent parameter size
mismatches and ensure the fallback SetParameters call on line 115 succeeds with
compatible parameter dimensions.
In `@src/Diffusion/StyleTransfer/KLoRAStyleModel.cs`:
- Around line 111-112: The Clone() method in the KLoRAStyleModel class is
instantiating a new instance with insufficient parameters, losing source
configuration state. Line 111 currently only passes the conditioner and a random
seed to the constructor, but it should clone all relevant source state including
custom predictor and VAE topologies from the current instance. Modify the
constructor call in Clone() to pass all necessary configuration parameters from
the source instance (this) to the new clone, ensuring the fallback SetParameters
call on line 112 has a properly configured clone to work with.
In `@src/Diffusion/StyleTransfer/SASTDModel.cs`:
- Around line 111-112: The Clone() method in the SASTDModel class is creating a
new instance with a randomly generated seed instead of cloning the source
model's actual configuration. Replace the random seed generation
(RandomGenerator.Next()) with the current instance's actual seed value to
properly preserve the source model's state. Additionally, review the constructor
call in Clone() to ensure all relevant source components and configuration are
being properly copied from the current instance (this) rather than using default
values, as the parameter sharing fallback on the next line is insufficient for
maintaining clone fidelity.
In `@src/Diffusion/StyleTransfer/StyDiffModel.cs`:
- Around line 111-112: The Clone() method in StyDiffModel is not preserving the
source instance's configuration when creating the cloned instance. When
instantiating the new StyDiffModel on line 111, only the conditioner and a new
seed are being passed to the constructor, which means other configuration
settings from the source instance are lost and initialized to defaults. This
causes the subsequent parameter-sharing or parameter-setting operations to fail
due to configuration mismatches. To fix this, identify all configuration
parameters/fields that define the source instance's state (besides the seed
which should be randomized for the clone), and pass them to the new StyDiffModel
constructor to ensure the clone is configuration-faithful to the source before
attempting to share or set parameters.
In `@src/Diffusion/StyleTransfer/StyleAlignedEditModel.cs`:
- Around line 114-115: The Clone() method in StyleAlignedEditModel class is
creating a new instance with only the conditioner and a new random seed, but it
does not preserve the full source configuration. Instead of creating a default
instance and relying on a fallback parameter copy via TryShareParametersFrom(),
the clone should be instantiated with all the same configuration parameters as
the source instance (beyond just the conditioner). Ensure that all relevant
configuration state from the current instance is passed to the new
StyleAlignedEditModel constructor so the clone is semantically equivalent to the
source, then remove or adjust the fallback logic on line 115 accordingly.
In `@src/Helpers/CopyOnWriteCloneHelper.cs`:
- Around line 24-70: The helper methods TryShareTrainableParameters and
CollectTrainableLayers in the CopyOnWriteCloneHelper class are currently exposed
as public API, which expands the user-facing surface beyond the intended facade
boundary. Change the access modifiers from public static to internal static for
both methods to restrict them to internal use only, ensuring users interact only
through AiModelBuilder.cs and AiModelResult.cs as intended by the coding
guidelines.
- Around line 45-58: The validation in the first loop of the
CopyOnWriteCloneHelper method only checks parameter counts but does not validate
the actual tensor compatibility (shape, data type, etc.) between source and
destination layers. Extend the validation loop that checks
srcLayers[i].GetTrainableParameters().Count to also iterate through each
parameter and verify that the corresponding tensors in srcLayers[i] and
dstLayers[i] have compatible shapes and properties before proceeding to the
CloneShared operations. If any tensor is incompatible, return false to skip the
mutation and use the eager fallback, ensuring half-shared clones are never
created.
In `@src/NeuralNetworks/NeuralNetworkBase.cs`:
- Around line 9226-9229: The UseCopyOnWriteDeepCopy property should not be
exposed as part of the public API surface since it is an implementation and
testing detail rather than a user-facing facade capability. Change the access
modifier of the UseCopyOnWriteDeepCopy property from public to internal to keep
it as internal plumbing, preventing external callers from mutating global clone
behavior outside the intended AiModelBuilder and AiModelResult interfaces.
- Around line 8157-8168: The cache field `_hasCrossBatchNormCached` is not being
invalidated when the neural network's layer structure changes. Add cache
invalidation for `_hasCrossBatchNormCached` alongside the existing
layer-structure cache resets in the methods `AddLayerToCollection`,
`RemoveLayerFromCollection`, and `ClearLayers`. Set `_hasCrossBatchNormCached`
to null (or its default value) in these methods to ensure the cache is
recalculated when layer composition changes, preventing stale cached values from
affecting micro-batch normalization decisions.
- Around line 5713-5727: The OutOfMemoryException catch block in the method
containing TrainCore retries from a potentially corrupted state if parameter
mutations have already begun. Add a boolean field _trainMutationStarted to track
when in-place parameter or moment buffer mutations start. Set
_trainMutationStarted to true immediately before each non-transactional write
operation including optimizer.Step calls, streaming optimizer Apply/EndStep
paths, fused plan execution, and raw extra tensor updates throughout the
training code paths. In the OutOfMemoryException catch block, check if
_trainMutationStarted is true before retrying; if it is, rethrow the exception
instead of retrying since the state is already corrupted. Reset
_trainMutationStarted to false after successful retry completion.
- Around line 9408-9423: The coverage guard that validates whether the
copy-on-write mechanism can safely share tensors only checks layer trainable
parameters from GetTrainableParameters() but ignores GetExtraTrainableTensors()
which are still updated during training. After the loop that counts
walkedParamCount from srcLayers, add a check to detect if there are any extra
trainable tensors by calling GetExtraTrainableTensors() on the source model. If
any raw trainable tensors exist (the result list is non-empty), fall back to the
eager copy path by disposing the copy and returning false, consistent with the
existing behavior when walkedParamCount != ParameterCount.
- Around line 8134-8139: The current check for `CollectParameters(Layers).Count
== 0` only verifies materialized tensors, which causes lazy trainable layers to
be skipped during the OOM retry mechanism before their first forward pass.
Replace this check with a condition that gates on the presence of trainable
layers themselves rather than their materialized parameters, allowing lazy
layers to properly participate in the micro-batch retry flow that already exists
in TrainWithGradientAccumulation.
- Around line 5721-5722: The issue is that clearing _baseTrainOptimizer
unconditionally during OOM recovery silently replaces user-configured optimizers
(injected via AiModelBuilder.ConfigureOptimizer and SetBaseTrainOptimizer) with
the default memory-ladder Adam. To fix this, introduce a tracking flag
_baseTrainOptimizerIsFrameworkDefault that marks whether the current optimizer
was framework-created. Set this flag to true in every branch where
GetOrCreateBaseOptimizer() creates and assigns the default optimizer. Then
during OOM recovery, only clear _baseTrainOptimizer if the flag indicates it was
framework-created; otherwise preserve the user-configured optimizer or handle it
appropriately rather than silently replacing it.
In `@src/Optimizers/Adam8BitOptimizer.cs`:
- Around line 421-425: The BF16 tape-state branch (around line 706-716) leaves
VQuantized and VScales fields as null when UseBFloat16MomentStorage is enabled,
but the GetMemoryUsage and GetTapeStateSnapshotForTests methods unconditionally
dereference these fields without null checks, causing NullReferenceExceptions.
Add appropriate null checks in GetMemoryUsage and GetTapeStateSnapshotForTests
to handle the case where these fields are null in BF16 mode, ensuring the
methods gracefully skip or handle quantized fields when they are not allocated.
---
Outside diff comments:
In `@src/Diffusion/Control/ControlNetLiteModel.cs`:
- Around line 118-126: The GetParameters method in ControlNetLiteModel only
retrieves parameters from _baseUNet and _controlEncoder, but excludes _vae. When
the Clone method falls back to SetParameters(GetParameters()) at line 150 (when
TryShareParametersFrom returns false), the _vae is not copied, resulting in a
clone with a fresh VAE instance that can diverge from the source model. Fix this
by ensuring the _vae state is properly included in the parameter cloning
process. Either extend the GetParameters and SetParameters methods to handle
_vae parameters alongside the existing _baseUNet and _controlEncoder parameters,
or implement explicit _vae cloning logic in the Clone method fallback to ensure
the cloned model maintains equivalent state to the source.
In `@src/Diffusion/Control/ControlNetSD3Model.cs`:
- Around line 76-77: The Clone() method in ControlNetSD3Model falls back to
SetParameters(GetParameters()) but this approach excludes the _vae field from
being cloned, causing fresh VAE weights when the copy-on-write sharing fails.
Ensure that both GetParameters() and SetParameters() methods include the _vae
parameters along with _predictor and _controlEncoder parameters, so that Clone()
can properly restore all model components. Alternatively, modify the Clone()
method to explicitly handle cloning or sharing the _vae field in addition to
using SetParameters(GetParameters()), ensuring the cloned model fully mirrors
the source model's state including its VAE component.
In `@src/Diffusion/NoisePredictors/UViTNoisePredictor.cs`:
- Around line 584-592: The Clone() method in UViTNoisePredictor<T> is not
preserving the full state of the object. When creating the clone instance, add
latentSpatialSize and seed as parameters to the constructor call (in addition to
the existing parameters like inputChannels, hiddenSize, etc.). Additionally,
after the TryShareParametersFrom and SetParameters calls, you must copy the
_posEmbed field from the source instance to the cloned instance to ensure the
clone has identical runtime state as the original object.
- Around line 546-552: The SetParameters method in the UViTNoisePredictor class
is incomplete and only restores parameters for _patchEmbed, _timeEmbed1, and
_timeEmbed2, while GetParameters retrieves parameters for encoder, middle,
decoder, skip, and final components. Expand the SetParameters method to call
SetLayerParams for all the same layer components that GetParameters emits, in
the same order, to ensure the complete model state is restored when loading
parameters. Reference the GetParameters method to determine which layers need to
be included in the SetParameters call sequence.
In `@src/Diffusion/VAE/AutoencoderKL.cs`:
- Around line 427-445: The SetParameters method in AutoencoderKL allows partial
weight assignment without proper validation, which can silently fail to copy all
required parameters. Add a length validation check at the beginning of the
SetParameters method to ensure that the input parameters vector length exactly
equals the sum of encoderParams.Length and decoderParams.Length, and throw an
appropriate exception if there is a mismatch instead of allowing incomplete
parameter assignment.
In `@src/Models/Options/Adam8BitOptimizerOptions.cs`:
- Around line 26-27: The Adam8BitOptimizerOptions class is missing required
constructors per the options golden pattern. Add a parameterless default
constructor and a copy constructor that accepts another Adam8BitOptimizerOptions
instance and copies all properties from it (including the new
UseBFloat16MomentStorage flag introduced on line 153). Ensure the copy
constructor properly initializes all inherited and new properties to prevent
configuration drift between option instances.
In `@src/Optimizers/Adam8BitOptimizer.cs`:
- Around line 529-542: The state reallocation logic in the condition starting at
line 529 only checks for missing key or length mismatch, but does not validate
whether the cached state's storage mode matches the current
_options.UseBFloat16MomentStorage setting. When this option is toggled on an
existing optimizer instance, old cached entries with the wrong storage mode
remain in _tapeStates, causing null-dereference when accessing state.MBf16 at
line 562. Add an additional condition to the state validity check that verifies
the storage mode of the cached state matches the current
UseBFloat16MomentStorage option, and reallocate the state via AllocateTapeState
if they don't match. Apply this same validation check to the other locations
mentioned (lines 560-564 and 1166-1169) where BF16 state access occurs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 526e5d60-fc0a-4ba2-be90-30b34767f423
📒 Files selected for processing (109)
Directory.Packages.propssrc/Diffusion/Audio/DiffWaveModel.cssrc/Diffusion/Control/ControlARModel.cssrc/Diffusion/Control/ControlNeXtModel.cssrc/Diffusion/Control/ControlNetInpaintingModel.cssrc/Diffusion/Control/ControlNetLiteModel.cssrc/Diffusion/Control/ControlNetModel.cssrc/Diffusion/Control/ControlNetPlusPlusFluxModel.cssrc/Diffusion/Control/ControlNetPlusPlusModel.cssrc/Diffusion/Control/ControlNetSD3Model.cssrc/Diffusion/Control/ControlNetTileModel.cssrc/Diffusion/Control/ControlNetUnionProModel.cssrc/Diffusion/Control/IPAdapterFaceIDPlusModel.cssrc/Diffusion/Control/IPAdapterModel.cssrc/Diffusion/Control/IPAdapterPlusModel.cssrc/Diffusion/Control/ReferenceOnlyModel.cssrc/Diffusion/DiffusionModelBase.cssrc/Diffusion/FastGeneration/AutoRegressiveMaskedDiffusion.cssrc/Diffusion/FastGeneration/DMD2Model.cssrc/Diffusion/FastGeneration/FlowMapModel.cssrc/Diffusion/FastGeneration/Flux2SchnellModel.cssrc/Diffusion/FastGeneration/HyperSDModel.cssrc/Diffusion/FastGeneration/InstaFlowModel.cssrc/Diffusion/FastGeneration/MARModel.cssrc/Diffusion/FastGeneration/MultistepLCModel.cssrc/Diffusion/FastGeneration/OSDSModel.cssrc/Diffusion/FastGeneration/PCMModel.cssrc/Diffusion/FastGeneration/PeRFlowModel.cssrc/Diffusion/FastGeneration/PixArtDeltaLCMModel.cssrc/Diffusion/FastGeneration/SANASprintModel.cssrc/Diffusion/FastGeneration/SCottModel.cssrc/Diffusion/FastGeneration/SD3FlashModel.cssrc/Diffusion/FastGeneration/SD3TurboModel.cssrc/Diffusion/FastGeneration/SDXLLightningModel.cssrc/Diffusion/FastGeneration/SenseFlowModel.cssrc/Diffusion/FastGeneration/SiDDiTModel.cssrc/Diffusion/FastGeneration/SiDModel.cssrc/Diffusion/FastGeneration/SwiftBrushModel.cssrc/Diffusion/FastGeneration/TCDModel.cssrc/Diffusion/FastGeneration/TrainingEfficientLCM.cssrc/Diffusion/FastGeneration/TransfusionModel.cssrc/Diffusion/ImageEditing/AnyEditModel.cssrc/Diffusion/ImageEditing/BrushEditModel.cssrc/Diffusion/ImageEditing/BrushNetModel.cssrc/Diffusion/ImageEditing/BrushNetXModel.cssrc/Diffusion/ImageEditing/CycleGANTurboModel.cssrc/Diffusion/ImageEditing/FlowEditModel.cssrc/Diffusion/ImageEditing/FreeInpaintModel.cssrc/Diffusion/ImageEditing/HDPainterModel.cssrc/Diffusion/ImageEditing/PowerPaintModel.cssrc/Diffusion/ImageEditing/RADModel.cssrc/Diffusion/ImageEditing/ReplaceAnythingModel.cssrc/Diffusion/ImageEditing/SD3InpaintingModel.cssrc/Diffusion/ImageEditing/Step1XEditModel.cssrc/Diffusion/ImageEditing/TurboEditModel.cssrc/Diffusion/ImageEditing/UltraEditModel.cssrc/Diffusion/MotionGeneration/MoMaskModel.cssrc/Diffusion/MotionGeneration/MotionDiffusionModel.cssrc/Diffusion/NoisePredictors/AsymmDiTPredictor.cssrc/Diffusion/NoisePredictors/EMMDiTPredictor.cssrc/Diffusion/NoisePredictors/FlagDiTPredictor.cssrc/Diffusion/NoisePredictors/FluxDoubleStreamPredictor.cssrc/Diffusion/NoisePredictors/MMDiTXNoisePredictor.cssrc/Diffusion/NoisePredictors/NoisePredictorBase.cssrc/Diffusion/NoisePredictors/SiTPredictor.cssrc/Diffusion/NoisePredictors/UNetNoisePredictor.cssrc/Diffusion/NoisePredictors/UViTNoisePredictor.cssrc/Diffusion/NoisePredictors/VideoUNetPredictor.cssrc/Diffusion/Panorama/CubeDiffModel.cssrc/Diffusion/Panorama/DiffPanoModel.cssrc/Diffusion/Panorama/MultiDiffusionModel.cssrc/Diffusion/Panorama/SpotDiffusionModel.cssrc/Diffusion/Panorama/StitchDiffusionModel.cssrc/Diffusion/StyleTransfer/ConsisLoRAModel.cssrc/Diffusion/StyleTransfer/InstantStyleModel.cssrc/Diffusion/StyleTransfer/KLoRAStyleModel.cssrc/Diffusion/StyleTransfer/RBModulationModel.cssrc/Diffusion/StyleTransfer/SASTDModel.cssrc/Diffusion/StyleTransfer/StyDiffModel.cssrc/Diffusion/StyleTransfer/StyleAlignedEditModel.cssrc/Diffusion/StyleTransfer/StyleStudioModel.cssrc/Diffusion/StyleTransfer/TLoRAModel.cssrc/Diffusion/SuperResolution/CCSRModel.cssrc/Diffusion/SuperResolution/PASDModel.cssrc/Diffusion/SuperResolution/SeeSRModel.cssrc/Diffusion/ThreeD/DreamFusionModel.cssrc/Diffusion/ThreeD/Zero123Model.cssrc/Diffusion/VAE/AudioVAE.cssrc/Diffusion/VAE/AutoencoderKL.cssrc/Diffusion/VAE/Causal3DVAE.cssrc/Diffusion/VAE/DeepCompressionVAE.cssrc/Diffusion/VAE/EQVAEModel.cssrc/Diffusion/VAE/ImprovedVideoVAE.cssrc/Diffusion/VAE/LiteVAEModel.cssrc/Diffusion/VAE/SDXLVAEModel.cssrc/Diffusion/VAE/TemporalInterpolationVAE.cssrc/Diffusion/VAE/TemporalVAE.cssrc/Diffusion/VAE/VAEModelBase.cssrc/Diffusion/Video/CogVideoModel.cssrc/Diffusion/VirtualTryOn/CATDMModel.cssrc/Diffusion/VirtualTryOn/FashionVDMModel.cssrc/Diffusion/VirtualTryOn/IDMVTONModel.cssrc/Diffusion/VirtualTryOn/StableVITONModel.cssrc/Helpers/CopyOnWriteCloneHelper.cssrc/MixedPrecision/Float8Types.cssrc/Models/Options/Adam8BitOptimizerOptions.cssrc/NeuralNetworks/NeuralNetworkBase.cssrc/Optimizers/Adam8BitOptimizer.cstests/AiDotNet.Tests/UnitTests/NeuralNetworks/CopyOnWriteCloneTests.cs
…p for foundation-scale models The get-side streaming API (GetParameterChunks, PyTorch nn.Module.parameters() style) already existed, but there was NO streaming setter, so a get->set->get round-trip still had to flatten through SetParameters(Vector<T>). For Flux-2-scale models that flat int-indexed Vector<T> OOMs the host (and overflows int.MaxValue above 2.1B params). Flux2Model_GetSetParameters_RoundTrips was failing with OutOfMemoryException in exactly this flat path. Add SetParameterChunks(IEnumerable<Tensor<T>>) mirroring GetParameterChunks: - IParameterizable: default interface method (non-framework) buffering to a flat vector + SetParameters (back-compat for tractable models). - DiffusionModelBase / NoisePredictorBase: concrete virtual (both targets) with the same back-compat default. - LatentDiffusionModelBase: override that distributes chunks to the noise predictor, VAE, then conditioner off a SINGLE shared enumerator (one chunk in flight at a time — never buffers the whole stream). - FluxDoubleStreamPredictor: overrides GetParameterChunks + SetParameterChunks to stream one DenseLayer sub-block at a time, and replaces the 2x-peak Vector.Concatenate in GetParameters with a single pre-sized buffer. Migrate Flux2Model_GetSetParameters_RoundTrips to the streaming round-trip (still asserts get->set->get fidelity, incl. first-chunk values — not a weakening). Verified PASS (was OOM): 1 passed, 31s, net10.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…onstruct without OOM FlagDiTPredictor force-resolved every adaLN-zero projection in its constructor (ZeroInitialize -> ResolveFromShape -> GetParameters -> SetParameters(zeros)), eagerly allocating weight tensors for a multi-billion-parameter stack the moment the model was constructed. LuminaImage2Model_DefaultConstructor_CreatesValidModel failed with OutOfMemoryException in exactly that path (DenseLayer.EnsureInitialized -> TensorAllocator.Rent), even though the test only checks the sub-models are non-null — every other FlagDiT layer is already lazy. Add LazyDenseZero on NoisePredictorBase: a DenseLayer built with the (non-lazy) Zero strategy but resolved shapes-only, so it allocates NOTHING at construction and zero-fills its weights+biases on first resolve. Behaviourally identical to eager adaLN-zero (block still begins as the identity, Peebles & Xie 2022) but defers the allocation. Use it for FlagDiT's per-block and final adaLN layers; remove the now dead ZeroInitialize. Verified PASS (was OOM): LuminaImage2Model_DefaultConstructor, 17s, net10.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on SDPA (#1624) The MMDiT joint-block attention (shared by the faithful Flux/MMDiTX/EMMDiT/AsymmDiT predictors) hand-rolled scaled-dot-product attention as ~7 separate ops — reshape-for-heads ×3, transpose K, BatchMatMul(Q,Kᵀ) which MATERIALIZES the O(seq²) scores matrix, scalar-scale, Softmax, BatchMatMul(weights,V), reshape-back — each a full pass + fresh allocation, run twice per joint block × every layer × every denoising step. That is the implementation gap vs PyTorch's fused SDPA (FlashAttention: tiled, no seq×seq materialization, one kernel), and what pushed the now-faithful dual-stream predictors past the CI test budget. The package already ships that fused kernel (Engines/Autodiff/FlashAttention*.cs, FusedAttention.cs) exposed as IEngine.ScaledDotProductAttention — the #476/#479 work measured under PyTorch's SDPA. DiT already used the fast SelfAttentionLayer (why PixArtDelta passed at 59s while MMDiT-based models timed out). This routes the MMDiT family to the same fused primitive, using the [B,H,S,D] 4-D layout the engine SDPA requires for true multi-head (FusedAttention does not head-split a rank-3 input). Same math (autodiff-aware backward included), fused execution — no model shrinking. Builds clean on Tensors 0.101.7. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aram count IsQuantizationAwareTrainingEnabled auto-engaged QAT for every model >= 500M params (#1624 G5). A CPU profile of a Kandinsky (1.8B) train step showed that turned ~57% of the step into pure overhead: each iteration fake-quantized ALL ~1.8B weights — CollectTrainableParameters + a serial ToVector copy of the full weight set (~82s FakeQuant + ~79s ToVector at the CI thread cap) — and, being lossy, it changed the training trajectory of the vanilla-DDPM contract tests. Disabling the auto-engage drops the Kandinsky train step from ~168s/iter to ~72s/iter (2-thread CI sim). QAT is now opt-in (default off), matching the Train() method's own "opt-in, default OFF" comment, the project's streaming/activation-checkpointing convention, and PyTorch (torch QAT is always explicit). Foundation models targeting int8 deployment opt in via EnableQuantizationAwareTraining(). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ed/lazy-graph host) PredictCompiledMulti always routed the per-step forward through the compiled model host, which builds + realizes a lazy graph (CompiledModelHost.Predict + LazyNode.Realize). That host is an INFERENCE replay optimization; under an active gradient tape it is pure overhead on top of the eager op-record the backward needs anyway — a CPU profile measured it at ~14% of a foundation-scale diffusion train step, and bypassing it (plus the attention-backward vectorization in Tensors#655) cut the Kandinsky train step ~47s -> ~28s/iter at 4 threads. Under a tape (and not NoGradScope), run the eager fallback directly: it records the identical ops, so gradients are unchanged. Inference (no tape / NoGradScope) still gets the compiled multi-input replay (the #1620 path), unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r the same input (#1624) The diffusion model-family contract is that Predict(sameInput) is deterministic (Predict_ShouldBeDeterministic). ~22 models violated it: their inference/generation paths drew fresh noise from the ADVANCING per-instance RandomGenerator whenever no explicit seed was given (`seed.HasValue ? CreateSeededRandom(seed) : RandomGenerator`), so two consecutive Predict calls consumed different random draws → different outputs (e.g. AudioLDM 3.19 vs 3.55). Pre-existing; only now surfaced because S1 + the faithful-predictor rebuild let the model-family shards run far enough to reach these models (they previously OOM'd first — the "test unmasking" pattern). Fix: add DiffusionModelBase.CreateInferenceRng(int? seed), which uses a STABLE seed (the model's construction seed, else a fixed constant) when none is passed — never the advancing RandomGenerator — so Predict reproduces while staying seedable for sample variety. Training keeps using RandomGenerator (must advance per step). Replaced the buggy pattern across the diffusion model + base classes; VAEModelBase (not a DiffusionModelBase) gets the equivalent stable-seed fix inlined. Builds clean (net10 + net471). Verified: AudioLDM determinism now passes; the RNG-induced divergence is eliminated (residual sub-1e-3 diffs under concurrent test execution are FP reduction-order, a separate matter). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cratch) Predict_ShouldBeDeterministic failed for diffusion noise predictors: the same input predicted twice produced different output. Bisected to the #1622 verify-then-trust gate, NOT RNG/compile-trace/lazy-init. Root cause (3x-probe + param/input hashing): the gate must execute a compiled plan ONCE to compare it against eager. The AiDotNet.Tensors compiled lazy-graph executor (through 0.101.7) shares process-global scratch buffers with the eager executor, so that single compiled execution leaves the scratch in a state that makes every subsequent eager forward for a REJECTED shape oscillate with period 2 (call #1 == call #3 != call #2) — non-deterministic AND numerically wrong on half the calls. Proven: model parameters and inputs are bit-identical across calls (hashed), and invalidating the cached plan does not help, so the corruption lives in the package's global scratch, not anything this layer owns. Diffusion predictors fall back to eager for any shape whose compiled plan does not match (the common case for these architectures), so the default-on path silently corrupted inference. Fix: make the compiled inference replay opt-in (AIDOTNET_ENABLE_AUTO_COMPILE=1) and default to pure eager, which is correct and bit-identical across calls and across the denoising loop. No impact on #1624 foundation-scale TRAINING: PredictCompiledMulti already runs eager whenever a gradient tape is active, so training never touched the compiled inference path. Re-enable per model once the package isolates the two executors' scratch buffers. Verified: AudioLDM/AudioLDM2/AuraFlow/BlendedDiffusion/CogView4 determinism tests pass; remaining diffusion determinism failures are pre-existing 120s perf timeouts (ControlNet, ConsistencyModel), unchanged with auto-compile on or off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
952e1a4 to
0ed12e3
Compare
… bug) NeuralNetworkBase composes the same #1622 verify-then-trust gate as the diffusion NoisePredictorBase, so it carries the same latent bug: the AiDotNet.Tensors compiled lazy-graph executor (through 0.101.7) shares process-global scratch with the eager executor, and the gate's one mandatory compiled-vs-eager verification run leaves that scratch dirty — so a REJECTED shape's eager fallback returns the wrong value. The single-input value memo masks the determinism symptom for repeated identical inputs (why NN determinism tests pass), but a memo-MISS input at a rejected shape after a compiled run still reads corrupted scratch. Mirror the diffusion fix: gate the whole compiled-inference path behind a process-wide opt-in (AIDOTNET_ENABLE_AUTO_COMPILE=1), default off. Both the explicit EnableInferenceAcceleration() and the silent foundation-scale auto-engagement now require it, so no model silently runs the corrupting path. The flag is read from the env var at load but exposed as a test-settable static (SetAutoCompiledInferenceEnabledForTesting) so the gate's parity invariants stay covered in-process. Inference-only: training never touched this path. AcceleratedInferenceTests opt in for the duration (restored in Dispose) so all existing parity/memo/shape-guard invariants still run; added a test asserting the shipped default (opt-in off → explicit opt-in stays eager, gate never runs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on + partial #632) Brings the Phase C fp16-weight / fp32-master mixed-precision storage (Tensors #650) and the #632 compiled-inference MemoryPlanning aliasing fix (BlasBatch hoist vs memory planning in attention inference). Native packages (OneDNN/OpenBLAS/CLBlast) bumped in lockstep. Verified via local-source wire-and-bisect that #632 is the commit that resolves the diffusion Predict_ShouldBeDeterministic corruption for the small/no-cross-attention case (reproduces on 0.101.7, gone on 0.102.x). HOWEVER #632 is only a PARTIAL fix: the real model-family test configs (cross-attention, contextDim>0) still replay non-deterministically with compiled inference ENABLED on 0.102.2 (AudioLDM/AuraFlow reproduce the exact 3.19-vs-3.55 signature). The residual cross-attention aliasing fix is still outstanding upstream. Therefore the consumer opt-in gate stays default-OFF: compiled inference is opt-in via AIDOTNET_ENABLE_AUTO_COMPILE=1, and the determinism suite passes on pure eager (verified on 0.102.2). Builds clean on net10.0 + net471. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
) The high-level SGD mixed-precision path (CompiledTapeTrainingStep.StepMixedPrecision) called MixedPrecisionReflection.StepSgd WITHOUT a GradScaler, so FP16 backward gradients could underflow to zero before reaching the FP32 master weights. The Adam and generic-optimizer consumer paths already loss-scale via _mpScaler; this brings the SGD path to parity. Two changes: - MixedPrecisionReflection now binds the 3-param Step(IReadOnlyList<Tensor<float>>, float, GradScaler?) introduced in Tensors 0.102.0 (#650) — the prior exact-type 2-arg GetMethod did not resolve it, which would have silently dropped the SGD FP16 path to the eager fallback on 0.102.0. Falls back to the legacy 2-arg Step for older packages; StepSgd dispatches on the resolved parameter count. - StepMixedPrecision persists a GradScaler (dynamic loss scale + skip-on-overflow handled inside Step) and passes it, matching the Adam/generic paths. Builds clean on Tensors 0.102.0. Gated behind AIDOTNET_FP16_ACTIVATIONS (default off). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 51e2b2d)
…ned Tensors 0.102.2 The Phase E SGD loss-scaling path (StepMixedPrecision -> StepSgd with a GradScaler) depends on the Tensors 0.102.0+ GradScaler API resolving through MixedPrecisionReflection. Pin that it resolves on the pinned package (0.102.2): if a downgrade drops it, StepSgd silently falls back to the unscaled 2-arg Step and FP16 SGD grads underflow — so failing loudly here is the intended guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… parity (determinism) The verify-then-trust gate returned the EAGER reference on the verify call but the compiled plan on every subsequent trusted call. When compiled matches eager within tolerance (trusted) but not bit-for- bit (legitimate ~ULP op-order rounding), call #1 (eager) then differed from calls #2..N (compiled), breaking Predict(x)-twice determinism for cross-attention diffusion models (e.g. AuraFlow: ~1e-8 divergence tripping the 12-decimal-place assertion) even though both are within the gate's tolerance. Fix: on parity, return the compiled candidate so call #1 matches every later trusted replay. Use CloneDeepCopy (NOT Clone — which is copy-on-write and shares the buffer): the compile host hands back a RECYCLED output buffer the next compiled call overwrites in place, bypassing COW, so a shared clone would be mutated by a later call (observed corrupting AcceleratedInferenceTests' held results). On rejection the eager reference stays the source of truth. Verified: AcceleratedInferenceTests 7/7 and AudioLDM/AudioLDM2/AuraFlow Predict_ShouldBeDeterministic all pass with compiled inference enabled (against the local Tensors source that also carries the ConstantFolding-off fix). Note: compiled inference stays opt-in (AIDOTNET_ENABLE_AUTO_COMPILE=1, default eager) in the consumer until AiDotNet.Tensors ships the ConstantFolding-off fix (>0.102.2); this gate fix makes the path correct for when it is re-enabled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…scale # Conflicts: # Directory.Packages.props
) * docs(#1662): design spec for backward-pass optimizations (levers #4, #1, #3) Cross-repo, Tensors-first design for the backward/training side of #653/#1624: backward buffer-reuse audit (#4), optimizer-in-backward adaptive hybrid (#1), and FlashAttention-style tiled backward (#3). Excludes lever #2 (checkpointing, owned by open PR #1633). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(#1662): correct lever #1 scope — optimizer-in-backward is OOM-only today TrainWithTapeStreaming already does single-pass-fused/two-pass-norm, but only engages above 0.5x-RAM; the common (fits-in-memory) case is collect-then-step, which merely ties PyTorch. Lever #1's real deliverable: promote single-pass fused optimizer-in-backward to the common-case default for unclipped training, add opt-in fast-clip for single-pass clipped, and prove a handy PyTorch CPU win on per-step time / peak RSS / allocation via --trainbench (default-on flip gated on that proof). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(#1662): implementation plan — levers #4, #3, #1 (Tensors-first, PyTorch proof) Phased TDD plan: Tensors #4 (trainbench probe + alloc audit + arena guard + fused-norm overload) -> Tensors #3 (tiled FlashAttention backward, parity-gated) -> v0.102.0 release -> AiDotNet #1 (full-precision streaming optimizer, engage single-pass fused-in-backward as the unclipped common-case default, opt-in fast-clip, PyTorch CPU comparison proving a handy win on time/alloc/RSS). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(#1662): full-precision streaming Adam (bit-identical fused-in-backward foundation) Adds FullPrecisionStreamingOptimizer<T> (per-tensor double[] moments, one global step counter, no quantization, no update clamping) + FullPrecisionStreamingAdam<T> matching the classic AdamOptimizer formula exactly. This is the bit-identical streaming optimizer that lets single-pass fused optimizer-in-backward be the common-case DEFAULT for models that fit in memory (vs the 8-bit OOM-survival variants whose ClampUpdate + block quantization diverge from classic Adam). Compiles clean; not yet wired into GetOrCreateStreamingOptimizer (next commit) — the bit-identical gate test lands with the wiring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(#1662): wire full-precision fused optimizer-in-backward (unclipped) + gate test Resolver gains SupportsFullPrecision (Adam-family today) + a full-precision branch; TrainWithTapeStreaming uses the bit-identical FullPrecisionStreamingAdam when the user opts in (ForceOn) for an unclipped Adam model. Auto default unchanged (the unclipped-fitting default-on flip is gated on the §5d PyTorch proof). Gate test FusedInBackward_Unclipped_MatchesClassicAdam_ToFloatPrecision: fused single-pass optimizer-in-backward tracks classic eager collect-then-step Adam to float precision (1e-3 over 20 steps, identical init). PASSES. Surfaced (pre-existing, separate): MaxGradNorm defaults to 1.0, so the COMMON case is clipped -> the two-pass clipped streaming path, which currently throws "persistent tape activations released" (sets Persistent=true but not ReleaseStreamingActivations=false). Fix tracked next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(#1662): clipped two-pass streaming path + bit-identical full-precision for clipped Fixes the pre-existing clipped-streaming crash: the two-pass path builds a Persistent tape but ComputeGradientsStreaming releases activations by default (process-global GradientTape<T>.ReleaseStreamingActivations), so pass 2 threw "activations released". Now save/restore that flag (false during the clipped passes) so both passes share the recorded graph; the setting never leaks. Also drops the !clip restriction on full-precision selection: clipping only chooses single-pass vs two-pass, not precision, so the clipped (common-case, MaxGradNorm=1.0 default) path now uses the bit-identical FullPrecisionStreamingAdam too. PyTorch's apply_optimizer_in_backward does not support clipping at all. Gate: FusedInBackward_Clipped_MatchesClassicAdam_ToFloatPrecision (+ the unclipped one) both PASS — fused optimizer-in-backward tracks classic clip-then-step Adam to float precision over 20 steps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(#1662): pytorch cpu comparison - honest result (we do not beat torch on speed) Adds benchmarks/trainbench_torch.py (mirrors the --trainbench residual-FFN shape) and the head-to-head results. Finding on S=128/D=384/10-layer/8-thread: torch: median 69.6 ms/step, peak RSS 312 MB aidotnet: median 252.9 ms/step (3.6x SLOWER), peak WS 489 MB, alloc 0.127 MB/step Identical loss. The optimizer-in-backward + arena wins ONLY on allocation/GC churn (near-zero), not throughput or peak RSS. The 3.6x per-step gap is GEMM + autodiff overhead (#653 core CPU-parity), orthogonal to optimizer-in-backward. Consequence: the §5a default-on flip is NOT justified (no speed win) — fused optimizer-in-backward stays opt-in (ForceOn), valued for bounded peak-grad memory + zero GC churn (bit-identical) + clipped support PyTorch lacks. "Beat PyTorch on all metrics" is not currently true and must not be claimed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(#1662): gap investigation — per-step deficit is small-M GEMM parallel scaling Decomposed the 3.6x per-step gap by batch: S=128 -> 3.6x, S=1024 -> 1.43x. The gap collapses with larger M, matching the known #475 finding (managed microkernel is MKL-parity, but small-M GEMM parallel scaling plateaus ~2x). So the speed deficit is the #653 core CPU-parity problem (small-M GEMM dispatch/scaling), orthogonal to optimizer-in-backward, which can't change matmul cost. Lever #1's real win is the bit-identical allocation/peak-grad-memory reduction, which holds at all batch sizes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(#1662): opt-in single-pass fast approximate grad-clip (§5c) Adds FastApproxGradClip (default OFF). When on + clipping active, the streaming clipped path runs as a SINGLE backward pass: the clip scale comes from an EMA of the previous step's global grad-norm (NFNet-style adaptive clipping), and this step's exact norm is accumulated in the same pass to update the EMA. First step seeds the EMA without clipping. NOT bit-identical (documented approximation) — this is the clipped path that beats PyTorch on backward-pass count (torch's apply_optimizer_in_backward cannot clip at all). Branch restructured into unclipped single-pass / fast-clip single-pass / exact two-pass; persistent tape + ReleaseStreamingActivations toggle now gated on the genuine two-pass only. Convergence test (loss decreases, stays finite) PASSES; 14/14 fused+streaming tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: franklinic <franklin@ivorycloud.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#1624 — training-bound timeouts & OOM on the 16 GB CI runner (forward-OK models)
Partial — not confirmed to close #1624 (29-model acceptance run is foundation-scale → high-RAM CI). Lands: optimizer-memory ladder (G2), corrected O(1) clone (G6), streaming parameter API across 8 predictors (G1), lazy foundation-scale construction, quantization-aware training default-on for foundation-scale (G5), and G4 activation checkpointing — now root-caused, fixed at the package level, and re-wired across ALL DiT-family predictors (single-stream, conditioning-closure, UViT skips, and MMDiT dual-stream) (see below; depends on
AiDotNet.Tensors0.101.4).G2 — optimizer-state memory ladder
fp32 (default) → BF16 (proactive ≥50M) → 8-bit (reactive on OOM).G6 — copy-on-write
Clone()— design correctedCOW lives inside each sub-model's
Clone(); diffusion models compose sub-model clones (model-level COW regressed CogVideo). Verified FAIL→PASS.G1 — streaming parameter API: set-side + 8 predictors
SetParameterChunksmirrorsGetParameterChunks; per-block overrides on all 8 predictors, each verified index-identical toGetParameters+ cross-instance round-trip.Lazy foundation-scale construction
LazyDenseZero— adaLN-zero allocates nothing at construction. FixedLuminaImage2Model_DefaultConstructorOOM.UViT correctness fixes (surfaced by the streaming round-trip)
Truncated
SetParametersnow full;Cloneprobe-forwards lazy attention before copying. Verified full Clone round-trip.G4 — activation checkpointing: ROOT-CAUSED, FIXED AT THE PACKAGE LEVEL, RE-WIRED
The earlier attempt was removed because a direct gradient-equivalence test failed on every predictor. Investigation found the real cause is a package bug, not the wiring:
AiDotNet.Tensors'GradientCheckpointing.Checkpointrecompute backward differentiated only w.r.t. the segment input (ComputeGradients(pseudoLoss, sources: [reInput])), silently dropping the weight gradients of every checkpointed layer — so checkpointed layers did not learn. #361 fixed the input-grad shape crash (#1341) but left the weight-grad gap, masked because non-checkpointed params (embeddings, output projection) still trained.Proven 3 ways: isolated single-
DenseLayercheckpoint (weight grad absent); the #361 code comment itself ("VJP w.r.t. reInput"); and an unconfounded Transformer test (dropout=0, identical init, one step with vs without checkpointing) → 2560/5408 params (47%) diverged.Fix (ooples/AiDotNet.Tensors#643): the recompute now differentiates the whole segment (
sources: null) and scatters every leaf gradient — input and weights — into the outer accumulator (matchestorch.utils.checkpoint). Correct and non-double-counting because the outer forward ran the segment underNoGrad. Verified in the Tensors repo (Checkpoint_ProducesParameterGradients_MatchingEager, net10.0 + net471; 113 autodiff tests, no regressions).Re-wired via
NoisePredictorBase.CheckpointBlocks(package primitive, sqrt(N) multi-segment) across AsymmDiT, EMMDiT, MMDiTX, SiT, FlagDiT, DiT, FluxDoubleStream, UViT (per-block, preserving long skips), and MMDiT's single-stream tail. MMDiT's dual-stream joint blocks thread an (image, text) pair and don't fit the single-residual-stream primitive without a packed wrapper — intentionally left un-checkpointed (documented inline; possible follow-up). The dead consumer-side customAutogradFunctionapproach is removed. G4 is opt-in — OFF by default, matching PyTorch'storch.utils.checkpoint(no auto-engage threshold); enable per predictor viaActivationCheckpointingEnabled. Gradient-equivalence covered byCheckpointGradientEquivalenceTests(input + weight grads) and an unconfounded one-step parameter-update equivalence test.G5 — quantization-aware training, DEFAULT-ON for foundation-scale
True training-time QAT in
Train: fake-quantizes the forward weights (symmetric int8, fresh per-tensor scales/step) with full-precision shadows restored before the optimizer step (STE). Default-ON whenParameterCount ≥ 500M; explicit enable/disable override. Verified: off-by-default for small, auto-engages above threshold, override wins, QAT training converges (doesn't diverge). NOTE: current QAT keeps fp32 shadows (~1× extra weight memory) — it's an accuracy lever; the training-memory win needs int8 weight storage (tracked).G8 / G3 / G7 — verified
G8 micro-batch (reactive, BatchNorm-gated); G3 streaming training > 0.5× RAM; G7
TensorArenazero-GC loop.Packages
AiDotNet.Tensors0.96.1 → 0.101.5(0.101.5 carries both G4 checkpoint fixes — weight-grad (#643) and multi-segment detach (#645)); native packages at0.101.2(no checkpoint code).Known gaps / Remaining (tracked)
🤖 Generated with Claude Code
Summary by CodeRabbit