Skip to content

perf(#1624): optimizer ladder + COW clone (fixed) + streaming param setter + lazy foundation-scale construct#1633

Merged
ooples merged 68 commits into
masterfrom
perf/1624-training-scale
Jun 22, 2026
Merged

perf(#1624): optimizer ladder + COW clone (fixed) + streaming param setter + lazy foundation-scale construct#1633
ooples merged 68 commits into
masterfrom
perf/1624-training-scale

Conversation

@ooples

@ooples ooples commented Jun 16, 2026

Copy link
Copy Markdown
Owner

#1624 — training-bound timeouts & OOM on the 16 GB CI runner (forward-OK models)

Partial — not confirmed to close #1624 (29-model acceptance run is foundation-scale → high-RAM CI). Lands: optimizer-memory ladder (G2), corrected O(1) clone (G6), streaming parameter API across 8 predictors (G1), lazy foundation-scale construction, quantization-aware training default-on for foundation-scale (G5), and G4 activation checkpointing — now root-caused, fixed at the package level, and re-wired across ALL DiT-family predictors (single-stream, conditioning-closure, UViT skips, and MMDiT dual-stream) (see below; depends on AiDotNet.Tensors 0.101.4).

G2 — optimizer-state memory ladder

fp32 (default) → BF16 (proactive ≥50M) → 8-bit (reactive on OOM).

G6 — copy-on-write Clone() — design corrected

COW lives inside each sub-model's Clone(); diffusion models compose sub-model clones (model-level COW regressed CogVideo). Verified FAIL→PASS.

G1 — streaming parameter API: set-side + 8 predictors

SetParameterChunks mirrors GetParameterChunks; per-block overrides on all 8 predictors, each verified index-identical to GetParameters + cross-instance round-trip.

Lazy foundation-scale construction

LazyDenseZero — adaLN-zero allocates nothing at construction. Fixed LuminaImage2Model_DefaultConstructor OOM.

UViT correctness fixes (surfaced by the streaming round-trip)

Truncated SetParameters now full; Clone probe-forwards lazy attention before copying. Verified full Clone round-trip.

G4 — activation checkpointing: ROOT-CAUSED, FIXED AT THE PACKAGE LEVEL, RE-WIRED

The earlier attempt was removed because a direct gradient-equivalence test failed on every predictor. Investigation found the real cause is a package bug, not the wiring: AiDotNet.Tensors' GradientCheckpointing.Checkpoint recompute backward differentiated only w.r.t. the segment input (ComputeGradients(pseudoLoss, sources: [reInput])), silently dropping the weight gradients of every checkpointed layer — so checkpointed layers did not learn. #361 fixed the input-grad shape crash (#1341) but left the weight-grad gap, masked because non-checkpointed params (embeddings, output projection) still trained.

Proven 3 ways: isolated single-DenseLayer checkpoint (weight grad absent); the #361 code comment itself ("VJP w.r.t. reInput"); and an unconfounded Transformer test (dropout=0, identical init, one step with vs without checkpointing) → 2560/5408 params (47%) diverged.

Fix (ooples/AiDotNet.Tensors#643): the recompute now differentiates the whole segment (sources: null) and scatters every leaf gradient — input and weights — into the outer accumulator (matches torch.utils.checkpoint). Correct and non-double-counting because the outer forward ran the segment under NoGrad. Verified in the Tensors repo (Checkpoint_ProducesParameterGradients_MatchingEager, net10.0 + net471; 113 autodiff tests, no regressions).

Re-wired via NoisePredictorBase.CheckpointBlocks (package primitive, sqrt(N) multi-segment) across AsymmDiT, EMMDiT, MMDiTX, SiT, FlagDiT, DiT, FluxDoubleStream, UViT (per-block, preserving long skips), and MMDiT's single-stream tail. MMDiT's dual-stream joint blocks thread an (image, text) pair and don't fit the single-residual-stream primitive without a packed wrapper — intentionally left un-checkpointed (documented inline; possible follow-up). The dead consumer-side custom AutogradFunction approach is removed. G4 is opt-in — OFF by default, matching PyTorch's torch.utils.checkpoint (no auto-engage threshold); enable per predictor via ActivationCheckpointingEnabled. Gradient-equivalence covered by CheckpointGradientEquivalenceTests (input + weight grads) and an unconfounded one-step parameter-update equivalence test.

Cross-repo dependency: this PR bumps AiDotNet.Tensors to 0.101.5, which must be published (from PR #643 + PR #645) before CI can restore.

Multi-segment (sqrt(N)) checkpointing. G4 checkpoints each predictor's block stack with a sqrt(N) segment size — the backward recomputes one segment at a time, bounding PEAK activation memory to O(sqrt(N)), which is what relieves the #1624 OOM (a single segment would recompute the whole stack at once and spike peak back to a full forward, so it is NOT a valid memory fix). Multi-segment correctness required a SECOND package fix — ooples/AiDotNet.Tensors#645 (0.101.5): each segment's input is detached in the recompute so a later segment cannot re-enter and double-count an earlier one. Verified gradient-exact (input + weights) across single-/multi-segment, 4-segment scaling, and residual (attention-shape) blocks in the Tensors suite, and end-to-end for GELU DenseLayer blocks in the diffusion checkpoint tests.

G5 — quantization-aware training, DEFAULT-ON for foundation-scale

True training-time QAT in Train: fake-quantizes the forward weights (symmetric int8, fresh per-tensor scales/step) with full-precision shadows restored before the optimizer step (STE). Default-ON when ParameterCount ≥ 500M; explicit enable/disable override. Verified: off-by-default for small, auto-engages above threshold, override wins, QAT training converges (doesn't diverge). NOTE: current QAT keeps fp32 shadows (~1× extra weight memory) — it's an accuracy lever; the training-memory win needs int8 weight storage (tracked).

G8 / G3 / G7 — verified

G8 micro-batch (reactive, BatchNorm-gated); G3 streaming training > 0.5× RAM; G7 TensorArena zero-GC loop.

Packages

AiDotNet.Tensors 0.96.1 → 0.101.5 (0.101.5 carries both G4 checkpoint fixes — weight-grad (#643) and multi-segment detach (#645)); native packages at 0.101.2 (no checkpoint code).

Known gaps / Remaining (tracked)

  • Verify the 29 target models under the repro on the high-RAM CI runner — the actual acceptance criterion.
  • G5 real memory win: int8 weight storage (no fp32 shadow) — needs CI convergence/memory validation.
  • Compiled-training path — baked-per-step-timestep bug + multi-input compiled inference + stripped-GEMM forward.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added BF16-backed moment storage option for the Adam 8-bit optimizer.
    • Introduced streaming parameter “chunk” APIs for models and diffusion predictors.
    • Added activation checkpointing controls and improved training robustness (automatic OOM recovery, optional micro-batching).
    • Improved clone behavior to prefer memory-efficient copy-on-write when possible.
  • Bug Fixes
    • Fixed parameter serialization/loading for U-ViT so all layers restore correctly.
  • Tests
    • Added coverage for copy-on-write cloning, parameter chunk streaming, and activation checkpointing.
  • Chores
    • Updated central package version pins to 0.98.0.

ooples and others added 10 commits June 16, 2026 10:31
…#1624)

Picks up the recent Tensors improvements that this training-scale work builds on:
- copy-on-write Tensor.Clone() (O(1)-until-write) — the foundation for the missing clone-footprint
  lever (G6) that addresses the Clone_ShouldProduceIdenticalOutput OOMs.
- GPU/CPU parity kernel fixes (#626) + GQA backward kernel fix (#628 in flight).

v0.98.0 contains the COW work (git tag --contains the #624 merge). AiDotNet core builds clean against
it (0 errors) — no API breaks 0.96->0.98. Native OneDNN/OpenBLAS/CLBlast co-released in lockstep at 0.98.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Clone OOM (#1624)

The large-diffusion-model Clone_ShouldProduceIdenticalOutput tests OOM the 16 GB runner because
DiffusionModelBase clones via `new(...); SetParameters(GetParameters())` — a full-model flatten that
materializes the entire weight set a SECOND time (plus a giant intermediate flat vector). This adds a
copy-on-write parameter share built on the Tensors O(1)-until-write CloneShared (#624, pkg 0.98.0).

- DiffusionModelBase.TryShareParametersFrom(source): parallel reflection walk of source+clone trainable
  layers (same field-order assumption CollectTrainableParameters already relies on); re-binds each clone
  layer's parameters to CloneShared() views of the source's. Fidelity-equivalent to the existing flat
  copy (both transfer exactly the trainable tensors — diffusion models carry no BatchNorm running stats /
  serialization extras), but O(1)-until-write. Falls back to the flat copy on any 1:1 structure mismatch.
- Wired into Flux2SchnellModel + ControlNetPlusPlusFluxModel Clone() (the direct SetParameters pattern).
  VALIDATED: Flux2Schnell Clone_ShouldProduceIdenticalOutput passes under the issue's exact repro
  (DOTNET_GCHeapHardLimit=0x400000000 / 8 cores) in 36s — where #1624 documents an OOM — plus full
  no-cap fidelity. CogVideo clones sub-models (_videoUnet/_temporalVae.Clone()) so it needs the same
  share pushed into those sub-models (the universal rollout).

Also lands the NeuralNetworkBase COW DeepCopy scaffold (TryDeepCopyCopyOnWrite) DEFAULT-OFF
(AIDOTNET_COW_DEEPCOPY=1 to opt in): it is correct for inference clones but a layer can hold trained
state outside GetTrainableParameters (BatchNorm extras, registered buffers, + a further FT/TabTransformer
category) that the share path doesn't yet carry, which Clone_AfterTraining_ShouldPreserveLearnedWeights
catches. Full NN fidelity is the follow-up before flipping it on.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rd Clone-OOM target) (#1624)

Completes the three explicit #1624 Clone-OOM targets (Flux2Schnell + ControlNet++ already validated at
the 16GB/8-core repro; CogVideo here). Extracts the share into a single base-agnostic helper:

- AiDotNet.Helpers.CopyOnWriteCloneHelper.TryShareTrainableParameters<T>(source, dest): parallel
  reflection walk that re-binds each dest trainable tensor to an O(1)-until-write CloneShared() view
  of source's (Tensors #624 / pkg 0.98.0). Guards on type/structure mismatch; never half-shares.
  DiffusionModelBase.TryShareParametersFrom now delegates to it (one implementation).
- CogVideo (Type-B): build with fresh sub-models and share, instead of cloning each sub-model's full
  weight set.

Scoped deliberately to the three explicit targets rather than a blanket conversion of all ~120 diffusion
models: large-model Clone tests TIME OUT on the forward pass locally (not the clone), so a broad rollout
cannot be reliably validated on this box — it needs the actual 16GB CI runner, plus per-model lazy-layer
materialization (TriggerLazyShapeResolution / probe-forward, PR #1596) before sharing so a fresh clone's
lazy layers don't re-initialize over the shared tensors. Tracked as the universal-rollout follow-up.
Build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
, #2)

Rebuilds the NeuralNetworkBase COW DeepCopy path (still default-off; AIDOTNET_COW_DEEPCOPY=1 to opt in)
to capture the trained state the previous _layers-only walk silently dropped:

- Reflection walk (CopyOnWriteCloneHelper.CollectTrainableLayers) captures trainable layers held in
  dedicated FIELDS — e.g. a tabular transformer's feature tokenizer / encoder stack / final layer-norm —
  not just the base _layers list. This was the FT/TabTransformer "third category" (their components live
  in fields, so a _layers walk cloned fresh-random weights for them).
- Serialization EXTRAS (BatchNorm running mean/variance) copied eagerly via SetExtraParameters
  (self-allocating, validated against the resolved dst shape) instead of falling back — fixes DenseNet/
  ResNet/VGG.
- Lazy destination layers resolved from the source input shape before sharing so a fresh clone's lazy
  layers don't re-initialize over the shared tensors.
- Public helper APIs are now typed IFullModel<T,Tensor<T>,Tensor<T>> (all model bases implement it)
  instead of `object` — the internal reflection walk stays object? as it traverses arbitrary fields.

Clean-env validation (flag on): the previously-failing FT/Tab + DenseNet + ResNet + Autoencoder + CNN
Clone (incl. Clone_AfterTraining) now PASS. Remaining before default-on: VariationalAutoencoder fails
deterministically — its reparameterization sampling depends on a serialized RNG seed that the eager
serialize-clone reproduces but a COW clone doesn't (non-tensor state); plus a SiameseNetwork case to
confirm vs the known shared-engine-singleton test contention. Build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mplete (#1624, #2)

Adds a cheap, side-effect-free guard to the NeuralNetworkBase COW path (still default-off): after the
reflection walk, sum the element count of the tensors GetTrainableParameters exposes and compare to the
model's ParameterCount. If they differ, the model's GetParameters reads weights NOT surfaced through a
trainable layer (a nested sub-model's own params, a custom VAE/Siamese layout) — the share would be
incomplete — so fall back to the eager full-fidelity copy. No flatten, no forward, so it doesn't
reintroduce the OOM or mutate stateful models.

Clean-env validation (flag on): VariationalAutoencoder + SiameseNetwork now PASS (auto-fall-back) on top
of the previously-fixed FT/Tab + DenseNet + ResNet + Autoencoder + CNN (which keep verified COW).

Remaining before default-on: LSTM. It has FULL coverage (walked == ParameterCount) but still diverges —
LSTMLayer registers its 12 weight/bias FIELDS via manual RegisterTrainableParameter and has NO
SetTrainableParameters override, so the base impl updates the registered list but not the fields the
recurrent forward reads (the "fused/field" divergence #624 flagged). Fixing it needs a per-layer
SetTrainableParameters override (watch RegisterTrainableParameter's role-collapse) + an audit of other
manual-registration layers. Default-off, so no impact on current behavior. Build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…OMs (#1624, #2)

Flips UseCopyOnWriteDeepCopy to default-ON (AIDOTNET_COW_DEEPCOPY=0 to disable). The COW path is correct
for every model by construction — it shares trainable tensors only when the reflection walk fully accounts
for ParameterCount AND copies serialization extras, and otherwise falls back to the eager full-fidelity
copy — so it never yields a less-faithful clone than before.

Validated (broad NN Clone_AfterTraining sweep, ON vs OFF):
- COW ON regresses NOTHING — its only failures (HTMNetwork, LSTM) also fail with COW OFF (pre-existing;
  LSTM is the #1221 lazy-deserialize eager bug, not COW).
- COW ON FIXES large-model trained-clone OOM/timeouts that the eager clone's memory doubling caused:
  ResNet, VGG, SimCSE (and others) now PASS at O(1)-until-write — the broad ON run was 6x faster (30s vs
  2m52s) with 2 failures vs 32.
- FT/Tab/DenseNet/Autoencoder/CNN share verified-correct; VAE/Siamese fall back via the coverage guard.

The largest models (BGE/ColBERT/ViT-class) still OOM on TRAINING memory itself (not the clone) — that's
the G7/G8 training-lever work (#3), separate from this clone-footprint lever.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GetOrCreateBaseOptimizer() now returns Adam8BitOptimizer (block-quantized m/v moment state) instead of
fp32 AdamOptimizer when the model is large (ParameterCount >= 16M, ~256 MB of fp32 optimizer state).
A standard Adam keeps two fp32 moment buffers, each the size of the model = 2x the weights, the dominant
training-step memory cost after the parameters; 8-bit quantization takes that to ~0.5x, saving ~1.5x the
model size of RAM per step so large models that OOM their training footprint on the 8c/16 GB runner fit.

Small/medium models keep fp32 for exact reproducibility. AIDOTNET_ADAM8BIT=0 forces fp32 everywhere; =1
forces 8-bit (testing). Validated: forcing 8-bit on small models, LossStrictlyDecreases + Training_ShouldReduce
stay green (8/8) — 8-bit Adam matches fp32 within tolerance, so convergence is preserved. The memory win
on the actual OOM models is confirmed on the CI repro (they time out on the forward locally).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…hNorm models (#1624)

TrainCore now routes large models with a large batch and no BatchNorm to TrainWithGradientAccumulation
(chunk size 8): forward+backward per chunk, accumulate, single optimizer step. This caps the per-step
ACTIVATION peak (only chunk-many samples' activations resident at once vs the whole batch) so models that
OOM their training-step activation footprint on the 8c/16 GB runner fit.

Gating (all correctness-necessary except the size heuristic):
- batch > chunk size (8) — nothing to chunk otherwise.
- LossFunctionBase present — required by the accumulation path's tape loss.
- NO BatchNorm layer — micro-batching changes per-chunk batch statistics, so accumulation is only
  gradient-EQUIVALENT for per-sample norms (LayerNorm/GroupNorm/InstanceNorm/RMSNorm). BatchNorm models
  keep the full-batch step.
- ParameterCount >= 16M (heuristic) — only worth the extra steps for large models; AIDOTNET_MICROBATCH=1
  forces regardless of size (testing), =0 disables.

Validated: gradient-accumulation equivalence tests (Issue1296) 21/21; with micro-batch FORCED on non-BN
models (Attention/Transformer/Autoencoder), LossStrictlyDecreases + Training_ShouldReduce + MoreData stay
green (12/12) — convergence preserved. Activation-memory win on the OOM models confirmed on the CI repro.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ize) (#1624)

Size-based default-on engagement (>=16M params) regressed ~12 medium-large
models that fit fine in fp32: a broad NN training sweep went 84 -> 96 failures
because 8-bit Adam moments (G2) and micro-batch accumulation (G8) slightly
change convergence, so engaging them on a model that is NOT memory-bound is a
needless regression.

Switch both levers to reactive engagement: train normally, and only after a
training step throws OutOfMemoryException latch the levers ON, drop the fp32
optimizer so it rebuilds as 8-bit, reclaim the failed step transients, and
retry. A model that fits never reaches that path, so it keeps exact fp32
full-batch training (no convergence change); only a model that would otherwise
crash pays the levers cost. BatchNorm models still skip G8 (correctness over
recovery). AIDOTNET_ADAM8BIT / AIDOTNET_MICROBATCH still force for testing.

Sweep with reactive defaults returns to baseline (86 vs 84; +-2 is contention
noise), eliminating the +12 regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 16, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments
Project Deployment Actions Updated (UTC)
aidotnet_website Ignored Ignored Preview Jun 22, 2026 9:44pm
aidotnet-playground-api Ignored Ignored Preview Jun 22, 2026 9:44pm

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds copy-on-write parameter sharing helpers and chunked parameter assignment APIs, updates many diffusion and predictor clone paths to prefer shared parameters with fallback copying, and adds BF16 optimizer-moment support plus memory-adaptive training behavior.

Changes

COW cloning, chunk streaming, and training-memory updates

Layer / File(s) Summary
COW helper, chunk contracts, and base hooks
src/Helpers/CopyOnWriteCloneHelper.cs, src/Interfaces/IParameterizable.cs, src/Diffusion/DiffusionModelBase.cs, src/Diffusion/NoisePredictors/NoisePredictorBase.cs, src/Diffusion/VAE/VAEModelBase.cs, Directory.Packages.props
Adds CopyOnWriteCloneHelper, adds SetParameterChunks defaults, adds shared-parameter hooks on model bases, and updates central package pins to 0.98.0.
Chunk streaming, checkpointing, and contract tests
src/Diffusion/LatentDiffusionModelBase.cs, src/Diffusion/NoisePredictors/FluxDoubleStreamPredictor.cs, src/Diffusion/NoisePredictors/AsymmDiTPredictor.cs, src/Diffusion/NoisePredictors/EMMDiTPredictor.cs, src/Diffusion/NoisePredictors/FlagDiTPredictor.cs, src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs, src/Diffusion/NoisePredictors/MMDiTXNoisePredictor.cs, src/Diffusion/NoisePredictors/SiTPredictor.cs, src/Diffusion/NoisePredictors/UViTNoisePredictor.cs, tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs, tests/AiDotNet.Tests/UnitTests/Diffusion/PredictorParameterStreamingTests.cs
Implements ordered chunk distribution in LatentDiffusionModelBase, chunked parameter serialization and validation in multiple noise predictors, and updates diffusion contract tests to round-trip with the chunk APIs.
Diffusion model clone migration
src/Diffusion/Audio/*, src/Diffusion/Control/*, src/Diffusion/FastGeneration/*, src/Diffusion/ImageEditing/*, src/Diffusion/MotionGeneration/*, src/Diffusion/Panorama/*, src/Diffusion/StyleTransfer/*, src/Diffusion/SuperResolution/*, src/Diffusion/ThreeD/*, src/Diffusion/VirtualTryOn/*, src/Diffusion/Video/CogVideoModel.cs
Applies TryShareParametersFrom(this) with SetParameters(GetParameters()) fallback across diffusion model Clone() methods and keeps CogVideoModel composed from sub-model Clone() calls with explanatory comments.
Noise predictor and VAE clone migration
src/Diffusion/NoisePredictors/*, src/Diffusion/VAE/*
Applies the same share-then-copy clone pattern to noise predictor and VAE subclasses, including DeepCopy() where present, and notes the explicit non-use of the global COW helper in VideoUNetPredictor.
NeuralNetworkBase memory ladder and BF16 deep copy
src/NeuralNetworks/NeuralNetworkBase.cs, src/MixedPrecision/Float8Types.cs, src/Models/Options/Adam8BitOptimizerOptions.cs, src/Optimizers/Adam8BitOptimizer.cs, tests/AiDotNet.Tests/UnitTests/NeuralNetworks/CopyOnWriteCloneTests.cs
Adds OOM retry behavior, micro-batch gating, optimizer selection ladder, copy-on-write DeepCopy, BF16 moment storage and conversion support, and clone-independence coverage.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Poem

🧬 Shared tensors wake, and the old copies sleep
Chunked layers march in orderly streams so deep
BF16 moments hum where int8 used to dwell
And OOM retries whisper, “all is well” 🌙

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning Directory.Packages.props changes NuGet package versions (AiDotNet from 0.97.2 to 0.98.0) which are unrelated to the primary optimization objectives (G1-G6) and introduce external dependency shifts without documented justification or compatibility verification. Either justify package upgrade as prerequisite with compatibility matrix, or defer to separate dependency-management PR; document why 0.98.0 is required for the optimization work.
Docstring Coverage ⚠️ Warning Docstring coverage is 70.19% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Linked Issues check ❓ Inconclusive The PR implements G1, G2, G4 (foundation), and G6 from issue #1624, but the issue explicitly marks several sub-tasks incomplete and requires verification across 29 models under specific CI conditions (8-core/16GB with heap limits). Run full ModelFamily test classes for all 29 listed models under DOTNET_GCHeapHardLimit=0x400000000 + DOTNET_PROCESSOR_COUNT=8 to verify training test OOM/timeout elimination before merge.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly summarizes the main changes: optimizer ladder, copy-on-write clone fix, streaming parameter setter, and lazy foundation-scale construction (#1624).
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/1624-training-scale

Comment @coderabbitai help to get the list of available commands.

…odels (#1624)

Adds the gentle middle rung to the G2 optimizer-memory ladder:
fp32 (4B, default) -> BF16 (2B, proactive) -> 8-bit block-quant (1B, reactive).

BF16 keeps the full float32 exponent (only the mantissa shortens), so unlike the
8-bit block quantization it changes Adam's trajectory negligibly and can be engaged
proactively by a size gate without a convergence regression. Adam8BitOptimizer gains
a UseBFloat16MomentStorage mode storing m/v as 2-byte BF16 in the tape state (no
blocks/scales); the per-parameter loop expands only one parameter's moments to full
precision at a time, so resident state stays at the compressed width. GPU/sparse
fast-paths are gated off for the BF16 mode. BF16 pack/unpack reuses the TFM-safe
BitConverterHelper (union fallback on net471).

NeuralNetworkBase wires the ladder: bf16 for models >=50M params (AIDOTNET_BF16_ADAM
overrides), escalating reactively to 8-bit on an actual OOM, else plain fp32 Adam.

Validation (NN training sweep, 172 tests): baseline 84 fail; bf16 gated-default 85
(+1, contention noise); bf16 forced-everywhere 91; 8-bit-everywhere was 96. The >=50M
gate keeps small models on fp32 untouched. net10.0 + net471 build green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ooples ooples changed the title perf(#1624): training-scale memory levers — universal COW clone + reactive 8-bit Adam & micro-batch perf(#1624): optimizer-memory ladder (fp32→bf16→8-bit) + COW clone foundation Jun 16, 2026
…ation (#1624)

Pins the load-bearing guarantee of the copy-on-write Clone lever: a clone is
observationally identical to its source, and the first in-place write to either
side privatizes the shared tensor storage so the other is never corrupted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ooples ooples marked this pull request as ready for review June 17, 2026 18:02
ooples pushed a commit that referenced this pull request Jun 17, 2026
… correctness fixes

The manual copy-on-write design here (DiffusionModelBase.ShareWeightsFrom /
EnsureOwnWeights / WeightShareGroup + ~80 per-model Clone edits) is superseded by the
single global tensor-level COW in #1633 (Tensor.CloneShared, automatic privatization,
no per-mutation-site guard). Reverted all of that to keep ONE COW design across the
library.

Retained the genuinely-necessary, non-COW work:
- DiTNoisePredictor / UNetNoisePredictor: run the per-step denoising forward EAGERLY
  instead of through the single-input compile cache. The compiled plan re-binds only the
  noisy-sample slot and bakes the per-step timestep embedding as a constant, so the
  denoising loop would replay step 0's output every step (silent corruption, compilation
  defaults on). Eager is the only correct path until the compile layer supports multiple
  mutable inputs.
- DiffusionModelTestBase: BLAS-thread cap (#1305) so parallel foundation-scale diffusion
  tests share cores instead of each grabbing all of them and blowing the 120s timeout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ption 2, #1624)

Bridge until the full PyTorch-style transparent-Clone refactor (separate PR): give the
~100 diffusion model / noise-predictor / VAE Clone() overrides O(1) copy-on-write by
swapping clone.SetParameters(GetParameters()) for
  if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters());
which shares weight storage via Tensor.CloneShared (privatize-on-write) and falls back to
the eager flat copy if the trainable-layer structure doesn't line up 1:1.

- CopyOnWriteCloneHelper: add an object-graph overload so models that aren't IFullModel
  (NoisePredictorBase, VAEModelBase — the heavy DiT/UNet predictors + large VAEs whose
  flat-copy Clone OOMs) can use the same share logic; existing IFullModel overload unchanged.
- NoisePredictorBase / VAEModelBase: add protected TryShareParametersFrom(...) wrapping it.

Builds net10.0 + net471; COW correctness test still green.
Resolve conflicts where master gained independent fixes to files this branch COW-swapped:
- Directory.Packages.props: keep this branch's newer AiDotNet.Tensors 0.98.0.
- FlagDiTPredictor / Multi-/Spot-/StitchDiffusionModel: take master's corrected Clone
  constructor (full Flag-DiT ctor args; pass the cloned predictor+VAE to avoid the
  643M-vs-10M param-count mismatch) AND apply the COW swap on top.
- VideoUNetPredictor: keep master's deliberate paired per-layer copy (its fused-CPU weight
  pack needs the explicit copy to avoid stale-pack divergence) — NOT routed through COW;
  documented why.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 17, 2026

Copy link
Copy Markdown

Deployment failed with the following error:

Resource is limited - try again in 24 hours (more than 100, code: "api-deployments-free-per-day").

Learn More: https://vercel.com/franklins-projects-02a0b5a0?upgradeToPro=build-rate-limit

franklinic and others added 3 commits June 17, 2026 15:35
NoisePredictorBase and VAEModelBase ARE IFullModel<T,Tensor<T>,Tensor<T>> (transitively, via
INoisePredictor<T> / IVAEModel<T>), so the existing IFullModel overload of
CopyOnWriteCloneHelper.TryShareTrainableParameters already covers them — the added object-graph
overload was dead (overload resolution prefers the more specific IFullModel signature) and based
on a mistaken reading of the direct class declarations. Restore the single IFullModel overload;
the bases' TryShareParametersFrom calls bind to it via the standard interface conversion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CogVideoModel.Clone was rewritten in this PR to build a fresh-default
sub-model graph and transfer weights at the model level. A fresh
VideoUNetPredictor has UNRESOLVED lazy layers, so neither the COW share
(structure mismatch -> falls back) nor the flat SetParameters could
reshape the clone to the source's resolved 573M-parameter structure:
the clone silently came out as a different, smaller network
(425M params, clone output diverged ~2x). Verified via
CogVideoModelTests.Clone_ShouldProduceIdenticalOutput (was FAIL,
now PASS: identical param count, 0 weight mismatches, maxOutDiff=0).

Fix: compose the clone from each sub-model's own Clone(), which resolves
its lazy shapes first and applies the memory-efficient transfer
internally (VideoUNetPredictor: paired per-layer copy that avoids the
fused-CPU stale-pack divergence; TemporalVAE: copy-on-write). This is the
codebase's established safe pattern for fused-CPU-pack predictors.

Also remove a duplicated TryShareParametersFrom guard
(if (!x) if (!x) ...) in Flux2SchnellModel.Clone and
ControlNetPlusPlusFluxModel.Clone left by the COW swap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 28

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
src/Models/Options/Adam8BitOptimizerOptions.cs (1)

26-27: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Blocking: add the required options constructors and copy the new BF16 flag

This options type is missing the explicit default constructor and copy constructor required by the options golden pattern. With the new UseBFloat16MomentStorage flag on Line 153, copy semantics are now especially important to avoid silent configuration drift.

Suggested fix
 public class Adam8BitOptimizerOptions<T, TInput, TOutput> : AdamOptimizerOptions<T, TInput, TOutput>
 {
+    public Adam8BitOptimizerOptions()
+    {
+    }
+
+    public Adam8BitOptimizerOptions(Adam8BitOptimizerOptions<T, TInput, TOutput> other)
+    {
+        ArgumentNullException.ThrowIfNull(other);
+
+        BlockSize = other.BlockSize;
+        UseDynamicQuantization = other.UseDynamicQuantization;
+        QuantizationPercentile = other.QuantizationPercentile;
+        FullPrecisionUpdateFrequency = other.FullPrecisionUpdateFrequency;
+        UseStochasticRounding = other.UseStochasticRounding;
+        CompressBothMoments = other.CompressBothMoments;
+        UseBFloat16MomentStorage = other.UseBFloat16MomentStorage;
+    }
+
     // existing properties...
 }

As per coding guidelines: Default Constructor: Parameterless constructor public ModelNameOptions() { } and Copy Constructor ... Copies ALL properties from other to this are required for options classes.

Also applies to: 134-154

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Models/Options/Adam8BitOptimizerOptions.cs` around lines 26 - 27, The
Adam8BitOptimizerOptions class is missing required constructors per the options
golden pattern. Add a parameterless default constructor and a copy constructor
that accepts another Adam8BitOptimizerOptions instance and copies all properties
from it (including the new UseBFloat16MomentStorage flag introduced on line
153). Ensure the copy constructor properly initializes all inherited and new
properties to prevent configuration drift between option instances.

Source: Coding guidelines

src/Optimizers/Adam8BitOptimizer.cs (1)

529-542: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cached tape-state mode mismatch can null-deref after options updates

Step only reallocates state on missing key/length mismatch. If _options.UseBFloat16MomentStorage changes on an existing optimizer instance, old non-BF16 entries remain cached and Line 562 (state.MBf16!) can dereference null.

Suggested fix
- if (!_tapeStates.TryGetValue(param, out var state) || state.Length != param.Length)
+ bool incompatibleMode =
+     _tapeStates.TryGetValue(param, out var s) &&
+     ((_options.UseBFloat16MomentStorage && (s.MBf16 is null || s.VBf16 is null)) ||
+      (!_options.UseBFloat16MomentStorage && (s.VQuantized is null || s.VScales is null)));
+
+ if (!_tapeStates.TryGetValue(param, out var state) || state.Length != param.Length || incompatibleMode)
 {
     ...
 }
 protected override void UpdateOptions(OptimizationAlgorithmOptions<T, TInput, TOutput> options)
 {
     if (options is Adam8BitOptimizerOptions<T, TInput, TOutput> adamOptions)
     {
+        bool modeChanged = _options.UseBFloat16MomentStorage != adamOptions.UseBFloat16MomentStorage
+            || _options.CompressBothMoments != adamOptions.CompressBothMoments
+            || _options.BlockSize != adamOptions.BlockSize;
+
         _options = adamOptions;
+        if (modeChanged) _tapeStates.Clear();
     }

Also applies to: 560-564, 1166-1169

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Optimizers/Adam8BitOptimizer.cs` around lines 529 - 542, The state
reallocation logic in the condition starting at line 529 only checks for missing
key or length mismatch, but does not validate whether the cached state's storage
mode matches the current _options.UseBFloat16MomentStorage setting. When this
option is toggled on an existing optimizer instance, old cached entries with the
wrong storage mode remain in _tapeStates, causing null-dereference when
accessing state.MBf16 at line 562. Add an additional condition to the state
validity check that verifies the storage mode of the cached state matches the
current UseBFloat16MomentStorage option, and reallocate the state via
AllocateTapeState if they don't match. Apply this same validation check to the
other locations mentioned (lines 560-564 and 1166-1169) where BF16 state access
occurs.
src/Diffusion/Control/ControlNetLiteModel.cs (1)

118-126: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clone fallback can return a non-equivalent model state.

When Line 150 falls back to SetParameters(GetParameters()), _vae is not copied by this class’s parameter contract (Lines 118–141 exclude it). If TryShareParametersFrom returns false, the clone keeps a fresh VAE, so clone output can diverge from source.

Proposed fix
 public override IDiffusionModel<T> Clone()
 {
     var clone = new ControlNetLiteModel<T>(controlType: _controlType, conditioner: _conditioner, seed: RandomGenerator.Next());
-    if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters());
+    if (!clone.TryShareParametersFrom(this))
+    {
+        clone.SetParameters(GetParameters());
+        clone._vae.SetParameters(_vae.GetParameters());
+    }
     return clone;
 }

Also applies to: 129-141, 150-150

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Diffusion/Control/ControlNetLiteModel.cs` around lines 118 - 126, The
GetParameters method in ControlNetLiteModel only retrieves parameters from
_baseUNet and _controlEncoder, but excludes _vae. When the Clone method falls
back to SetParameters(GetParameters()) at line 150 (when TryShareParametersFrom
returns false), the _vae is not copied, resulting in a clone with a fresh VAE
instance that can diverge from the source model. Fix this by ensuring the _vae
state is properly included in the parameter cloning process. Either extend the
GetParameters and SetParameters methods to handle _vae parameters alongside the
existing _baseUNet and _controlEncoder parameters, or implement explicit _vae
cloning logic in the Clone method fallback to ensure the cloned model maintains
equivalent state to the source.
src/Diffusion/Control/ControlNetSD3Model.cs (1)

76-77: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Blocking: Clone fallback can return a non-equivalent model when COW sharing fails.

Clone() falls back to SetParameters(GetParameters()), but this class’s parameter path excludes _vae. If TryShareParametersFrom returns false, the clone keeps fresh VAE weights and no longer mirrors the source model.

Proposed fix
 public override IDiffusionModel<T> Clone()
 {
     var clone = new ControlNetSD3Model<T>(
         controlType: _controlType,
         conditioner: _conditioner,
         seed: RandomGenerator.Next());
-    if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters());
+    if (!clone.TryShareParametersFrom(this))
+    {
+        clone.SetParameters(GetParameters());
+        clone._vae.SetParameters(_vae.GetParameters());
+    }
     return clone;
 }

Also applies to: 128-160, 166-173

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Diffusion/Control/ControlNetSD3Model.cs` around lines 76 - 77, The
Clone() method in ControlNetSD3Model falls back to
SetParameters(GetParameters()) but this approach excludes the _vae field from
being cloned, causing fresh VAE weights when the copy-on-write sharing fails.
Ensure that both GetParameters() and SetParameters() methods include the _vae
parameters along with _predictor and _controlEncoder parameters, so that Clone()
can properly restore all model components. Alternatively, modify the Clone()
method to explicitly handle cloning or sharing the _vae field in addition to
using SetParameters(GetParameters()), ensuring the cloned model fully mirrors
the source model's state including its VAE component.
src/Diffusion/VAE/AutoencoderKL.cs (1)

427-445: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Blocking: Clone does not preserve non-default architecture configuration and can silently partial-copy weights.

Clone() hardcodes constructor values (numResBlocks, numGroups, inputSpatialSize) instead of reusing the source configuration. If shape/parameter layout differs, COW sharing can fail and SetParameters currently allows partial assignment without length enforcement.

Proposed fix
+private readonly int _numResBlocks;
+private readonly int _numGroups;
+private readonly int _inputSpatialSize;

 public AutoencoderKL(
     int inputChannels = 3,
     int latentChannels = 4,
     int baseChannels = 128,
     int[]? channelMults = null,
     int numResBlocks = 2,
     int numGroups = 32,
     double? latentScaleFactor = null,
     int inputSpatialSize = 512,
     ILossFunction<T>? lossFunction = null,
     int? seed = null)
     : base(lossFunction, seed)
 {
+    _numResBlocks = numResBlocks;
+    _numGroups = numGroups;
+    _inputSpatialSize = inputSpatialSize;
     ...
 }

 public override void SetParameters(Vector<T> parameters)
 {
     var encoderParams = _encoder.GetParameters();
     var decoderParams = _decoder.GetParameters();
+    var expectedTotal = encoderParams.Length + decoderParams.Length;
+    if (parameters.Length != expectedTotal)
+        throw new ArgumentException($"Expected {expectedTotal} parameters, got {parameters.Length}.", nameof(parameters));
     ...
 }

 public override IVAEModel<T> Clone()
 {
     var clone = new AutoencoderKL<T>(
         _inputChannels,
         _latentChannels,
         _baseChannels,
         _channelMults,
-        numResBlocks: 2,
-        numGroups: 32,
+        numResBlocks: _numResBlocks,
+        numGroups: _numGroups,
         _latentScaleFactor,
-        inputSpatialSize: 512,
+        inputSpatialSize: _inputSpatialSize,
         LossFunction);

     if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters());
     return clone;
 }

Also applies to: 536-550

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Diffusion/VAE/AutoencoderKL.cs` around lines 427 - 445, The SetParameters
method in AutoencoderKL allows partial weight assignment without proper
validation, which can silently fail to copy all required parameters. Add a
length validation check at the beginning of the SetParameters method to ensure
that the input parameters vector length exactly equals the sum of
encoderParams.Length and decoderParams.Length, and throw an appropriate
exception if there is a mismatch instead of allowing incomplete parameter
assignment.
src/Diffusion/NoisePredictors/UViTNoisePredictor.cs (2)

584-592: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Blocking: Clone() does not preserve full construction/runtime state.

Clone() drops latentSpatialSize and seed, and does not copy _posEmbed. A clone can diverge from the source even if COW sharing succeeds.

💡 Proposed fix
+private readonly int _latentSpatialSize;
+private readonly int? _seed;
+
 public UViTNoisePredictor(
     int inputChannels = 4,
     int hiddenSize = 512,
     int numLayers = 12,
     int numHeads = 8,
     int patchSize = 2,
     int contextDim = 0,
     int latentSpatialSize = 32,
     int? seed = null)
     : base(seed: seed)
 {
+    _seed = seed;
+    _latentSpatialSize = latentSpatialSize;
     numLayers = numLayers % 2 == 0 ? numLayers : numLayers + 1;
@@
 public override INoisePredictor<T> Clone()
 {
     var clone = new UViTNoisePredictor<T>(
         inputChannels: _inputChannels,
         hiddenSize: _hiddenSize,
         numLayers: _numLayers,
         numHeads: _numHeads,
         patchSize: _patchSize,
-        contextDim: _contextDim);
+        contextDim: _contextDim,
+        latentSpatialSize: _latentSpatialSize,
+        seed: _seed);
+
+    if (_posEmbed is not null)
+        clone._posEmbed = (Tensor<T>)_posEmbed.CloneShared();
+
     if (!clone.TryShareParametersFrom(this)) clone.SetParameters(GetParameters());
     return clone;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Diffusion/NoisePredictors/UViTNoisePredictor.cs` around lines 584 - 592,
The Clone() method in UViTNoisePredictor<T> is not preserving the full state of
the object. When creating the clone instance, add latentSpatialSize and seed as
parameters to the constructor call (in addition to the existing parameters like
inputChannels, hiddenSize, etc.). Additionally, after the TryShareParametersFrom
and SetParameters calls, you must copy the _posEmbed field from the source
instance to the cloned instance to ensure the clone has identical runtime state
as the original object.

546-552: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Blocking: SetParameters only applies a partial model state.

GetParameters emits encoder/middle/decoder/skip/final parameters, but SetParameters writes only _patchEmbed, _timeEmbed1, _timeEmbed2. This silently leaves most weights stale.

💡 Proposed fix
 public override void SetParameters(Vector<T> parameters)
 {
     int offset = 0;
     offset = SetLayerParams(_patchEmbed, parameters, offset);
     offset = SetLayerParams(_timeEmbed1, parameters, offset);
     offset = SetLayerParams(_timeEmbed2, parameters, offset);
+
+    foreach (var block in _encoderBlocks)
+        offset = SetBlockParams(block, parameters, offset);
+
+    offset = SetBlockParams(_middleBlock, parameters, offset);
+
+    for (int i = 0; i < _decoderBlocks.Count; i++)
+    {
+        offset = SetBlockParams(_decoderBlocks[i], parameters, offset);
+        offset = SetLayerParams(_skipProjections[i], parameters, offset);
+    }
+
+    offset = SetLayerParams(_finalNorm, parameters, offset);
+    offset = SetLayerParams(_outputProj, parameters, offset);
+
+    if (offset != parameters.Length)
+    {
+        throw new ArgumentException(
+            $"Expected to consume {offset} parameters but received {parameters.Length}.",
+            nameof(parameters));
+    }
 }
+
+private static int SetBlockParams(UViTBlock block, Vector<T> parameters, int offset)
+{
+    if (block.Norm1 != null) offset = SetLayerParams(block.Norm1, parameters, offset);
+    if (block.Attention != null) offset = SetLayerParams(block.Attention, parameters, offset);
+    if (block.Norm2 != null) offset = SetLayerParams(block.Norm2, parameters, offset);
+    if (block.MLP1 != null) offset = SetLayerParams(block.MLP1, parameters, offset);
+    if (block.MLP2 != null) offset = SetLayerParams(block.MLP2, parameters, offset);
+    return offset;
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Diffusion/NoisePredictors/UViTNoisePredictor.cs` around lines 546 - 552,
The SetParameters method in the UViTNoisePredictor class is incomplete and only
restores parameters for _patchEmbed, _timeEmbed1, and _timeEmbed2, while
GetParameters retrieves parameters for encoder, middle, decoder, skip, and final
components. Expand the SetParameters method to call SetLayerParams for all the
same layer components that GetParameters emits, in the same order, to ensure the
complete model state is restored when loading parameters. Reference the
GetParameters method to determine which layers need to be included in the
SetParameters call sequence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Diffusion/FastGeneration/AutoRegressiveMaskedDiffusion.cs`:
- Around line 140-142: The clone method in AutoRegressiveMaskedDiffusion<T>
creates a new instance with only conditioner and seed parameters, losing custom
configuration from the source instance. When the source was created with custom
architecture, options, scheduler, predictor, or vae components, the clone will
have incompatible layouts and the fallback SetParameters call will fail. Instead
of reconstructing the clone with default components, ensure that the new
AutoRegressiveMaskedDiffusion<T> constructor call passes all configuration
parameters (architecture, options, scheduler, predictor, vae) from the current
instance so the clone maintains the same setup as the source.

In `@src/Diffusion/FastGeneration/SANASprintModel.cs`:
- Around line 146-148: The Clone method in SANASprintModel<T> is creating a new
instance with only the conditioner and seed parameters, which leaves other
architectural components uninitialized. This causes the cloned model to have a
different structure than the source, making parameter sharing and the fallback
copy operation fail or produce mismatched layouts. Modify the constructor call
on line 146 to also pass the source model's architecture, options, scheduler,
predictor, and vae properties to ensure the clone has an identical structure
before attempting to share or copy parameters.

In `@src/Diffusion/FastGeneration/SCottModel.cs`:
- Around line 143-145: The Clone method in SCottModel<T> creates a new instance
with only the conditioner parameter, which does not preserve custom predictor or
VAE configurations from the source model. This causes parameter-count mismatches
when parameter sharing fails. Modify the clone instantiation to pass all
relevant configuration properties from the source model to the constructor
(beyond just conditioner), ensuring the cloned instance has identical
configuration to the original before attempting parameter sharing or copying.

In `@src/Diffusion/FastGeneration/TransfusionModel.cs`:
- Around line 143-145: The Clone() method in TransfusionModel<T> creates a fresh
default instance instead of preserving the original's custom configuration like
architecture, options, scheduler, predictor, and vae. When
TryShareParametersFrom(this) fails on line 144, the fallback
SetParameters(GetParameters()) can throw if parameter counts mismatch. Modify
the Clone() method to copy all custom configuration settings from the source
instance to the cloned instance before attempting parameter sharing, ensuring
the cloned model maintains the same setup as the original and eliminates
reliance on the fragile SetParameters fallback.

In `@src/Diffusion/ImageEditing/FlowEditModel.cs`:
- Around line 137-139: The Clone method in FlowEditModel<T> creates a new
instance with only the conditioner and seed parameters, but discards other
source constructor state like architecture, options, scheduler, predictor, and
vae. This causes the subsequent TryShareParametersFrom call to operate on an
improperly initialized instance. Modify the constructor call on line 137 to pass
all necessary parameters from the current instance (this._architecture,
this._options, this._scheduler, this._predictor, this._vae, etc.) in addition to
the conditioner and RandomGenerator.Next() seed, ensuring the clone is fully
initialized with the same configuration as the source before attempting
parameter sharing.

In `@src/Diffusion/ImageEditing/FreeInpaintModel.cs`:
- Around line 131-133: The FreeInpaintModel<T> constructor creates a clone with
default values that do not match the source model's architecture, options,
scheduler, predictor, and vae settings. This structural mismatch causes
SetParameters to fail when TryShareParametersFrom returns false. Modify the
FreeInpaintModel<T> clone instantiation at line 131 to pass the same
architecture configuration parameters (such as options, scheduler, predictor,
and vae) from the current instance to the clone constructor, ensuring the clone
has structural parity with the source before attempting to share or apply
parameters.

In `@src/Diffusion/ImageEditing/HDPainterModel.cs`:
- Around line 136-138: The Clone method in HDPainterModel is not preserving the
full configuration state when creating a cloned instance. The new
HDPainterModel<T> being instantiated on line 136 only passes the conditioner and
a new random seed, but does not preserve other constructor parameters from the
source instance. This causes model layout mismatches when the source has been
customized, making the TryShareParametersFrom fallback fail on line 137. Store
all constructor parameters as private fields in HDPainterModel so they can be
accessed during cloning, then pass these preserved parameters when creating the
cloned instance to ensure the clone has the same configuration as the source.

In `@src/Diffusion/ImageEditing/SD3InpaintingModel.cs`:
- Around line 137-139: The Clone() method in the SD3InpaintingModel class
creates a new instance with only the conditioner and seed, which does not
preserve the original source construction parameters and dependencies. This
causes the TryShareParametersFrom call to potentially fail, resulting in a
fallback to SetParameters that throws due to parameter-length mismatch. Fix this
by ensuring the new SD3InpaintingModel instance created in the clone operation
is constructed with all the same dependencies and parameters as the source, so
that either TryShareParametersFrom succeeds or the clone is already properly
initialized. Review the SD3InpaintingModel constructor to identify all required
dependencies beyond conditioner and seed, and pass them when creating the clone
instance.

In `@src/Diffusion/ImageEditing/Step1XEditModel.cs`:
- Around line 150-152: The Clone() method in Step1XEditModel is discarding the
source instance's configuration (architecture, options, scheduler, predictor,
vae) when creating a new Step1XEditModel instance on line 150. When the
TryShareParametersFrom fallback on line 151 fails, the SetParameters call can
then fail due to mismatched parameter layouts. Fix this by passing all the
source instance's configuration parameters (architecture, options, scheduler,
predictor, vae) to the Step1XEditModel constructor when creating the clone, not
just the conditioner and new seed, so that the cloned instance preserves the
complete configuration of the original.

In `@src/Diffusion/ImageEditing/TurboEditModel.cs`:
- Around line 136-138: The Clone() method in TurboEditModel<T> is reconstructing
the cloned instance with only the conditioner and seed parameters, but it is not
preserving the original model's custom predictor and VAE configuration. This
causes TryShareParametersFrom to fail and subsequently SetParameters to throw
due to mismatched parameter counts. When creating the new clone instance in the
Clone() method, pass the original model's predictor and VAE configuration along
with the conditioner and seed, so that the cloned model maintains the same
wiring and configuration as the source model before attempting to share or set
parameters.

In `@src/Diffusion/MotionGeneration/MoMaskModel.cs`:
- Around line 122-124: The Clone() method in MoMaskModel is creating a new
instance with only conditioner and seed parameters on line 122, but it is not
carrying over critical architectural state like architecture, options,
scheduler, predictor, and vae from the source instance. This causes a mismatch
when the source was built with non-default components and the fallback
SetParameters() call is insufficient. Modify the MoMaskModel constructor call on
line 122 to pass all the necessary architectural and configuration parameters
(architecture, options, scheduler, predictor, vae) from the current instance in
addition to conditioner and seed, ensuring the cloned instance is fully
initialized with the same state as the source.

In `@src/Diffusion/NoisePredictors/MMDiTXNoisePredictor.cs`:
- Around line 336-337: The issue is that when TryShareParametersFrom returns
true in the clone creation logic, the _posEmbed state is not being copied to the
clone, leaving it in an incomplete state. Since TryShareParametersFrom only
handles trainable layer tensors, you need to ensure _posEmbed is explicitly
copied regardless of whether parameter sharing succeeds or fails. Add a separate
statement after the conditional to copy _posEmbed from this instance to the
clone, or ensure _posEmbed is copied in both the success path (when
TryShareParametersFrom returns true) and the fallback path (when SetParameters
is called).

In `@src/Diffusion/Panorama/DiffPanoModel.cs`:
- Around line 138-140: The Clone() method in DiffPanoModel creates a new
instance with default architecture/options/scheduler/predictor/vae settings,
which causes SetParameters to fail if TryShareParametersFrom cannot share
parameters due to structural mismatch. When constructing the cloned instance in
the Clone() method, capture and pass the source instance's architecture,
options, scheduler, predictor, and vae configuration settings to the
DiffPanoModel constructor instead of relying on defaults, ensuring the cloned
graph structure matches the source before attempting parameter operations.

In `@src/Diffusion/StyleTransfer/ConsisLoRAModel.cs`:
- Around line 111-112: The Clone() method in ConsisLoRAModel is creating a new
instance without preserving the source's configuration. The constructor call on
line 111 only passes conditioner and seed, causing the clone to use default
values for Architecture, Options, Scheduler, and custom predictor/VAE instances.
This leads to parameter-count mismatch or behavioral differences when
SetParameters is called on line 112. Modify the ConsisLoRAModel constructor call
to include all source configuration parameters (Architecture, Options,
Scheduler, and any custom predictor/VAE instances) so the clone preserves the
exact configuration of the source object rather than rebuilding with defaults.

In `@src/Diffusion/StyleTransfer/InstantStyleModel.cs`:
- Around line 114-115: The Clone() method in the InstantStyleModel class is not
preserving the source model's topology and configuration. The issue is in the
instantiation of the new InstantStyleModel on line 114, which only passes the
conditioner and a new seed, discarding any custom configuration from the source
model. To fix this, ensure that when creating the cloned instance, you preserve
all relevant configuration and topology from the source model (this) before
attempting to share or copy parameters. This will prevent parameter size
mismatches and ensure the fallback SetParameters call on line 115 succeeds with
compatible parameter dimensions.

In `@src/Diffusion/StyleTransfer/KLoRAStyleModel.cs`:
- Around line 111-112: The Clone() method in the KLoRAStyleModel class is
instantiating a new instance with insufficient parameters, losing source
configuration state. Line 111 currently only passes the conditioner and a random
seed to the constructor, but it should clone all relevant source state including
custom predictor and VAE topologies from the current instance. Modify the
constructor call in Clone() to pass all necessary configuration parameters from
the source instance (this) to the new clone, ensuring the fallback SetParameters
call on line 112 has a properly configured clone to work with.

In `@src/Diffusion/StyleTransfer/SASTDModel.cs`:
- Around line 111-112: The Clone() method in the SASTDModel class is creating a
new instance with a randomly generated seed instead of cloning the source
model's actual configuration. Replace the random seed generation
(RandomGenerator.Next()) with the current instance's actual seed value to
properly preserve the source model's state. Additionally, review the constructor
call in Clone() to ensure all relevant source components and configuration are
being properly copied from the current instance (this) rather than using default
values, as the parameter sharing fallback on the next line is insufficient for
maintaining clone fidelity.

In `@src/Diffusion/StyleTransfer/StyDiffModel.cs`:
- Around line 111-112: The Clone() method in StyDiffModel is not preserving the
source instance's configuration when creating the cloned instance. When
instantiating the new StyDiffModel on line 111, only the conditioner and a new
seed are being passed to the constructor, which means other configuration
settings from the source instance are lost and initialized to defaults. This
causes the subsequent parameter-sharing or parameter-setting operations to fail
due to configuration mismatches. To fix this, identify all configuration
parameters/fields that define the source instance's state (besides the seed
which should be randomized for the clone), and pass them to the new StyDiffModel
constructor to ensure the clone is configuration-faithful to the source before
attempting to share or set parameters.

In `@src/Diffusion/StyleTransfer/StyleAlignedEditModel.cs`:
- Around line 114-115: The Clone() method in StyleAlignedEditModel class is
creating a new instance with only the conditioner and a new random seed, but it
does not preserve the full source configuration. Instead of creating a default
instance and relying on a fallback parameter copy via TryShareParametersFrom(),
the clone should be instantiated with all the same configuration parameters as
the source instance (beyond just the conditioner). Ensure that all relevant
configuration state from the current instance is passed to the new
StyleAlignedEditModel constructor so the clone is semantically equivalent to the
source, then remove or adjust the fallback logic on line 115 accordingly.

In `@src/Helpers/CopyOnWriteCloneHelper.cs`:
- Around line 24-70: The helper methods TryShareTrainableParameters and
CollectTrainableLayers in the CopyOnWriteCloneHelper class are currently exposed
as public API, which expands the user-facing surface beyond the intended facade
boundary. Change the access modifiers from public static to internal static for
both methods to restrict them to internal use only, ensuring users interact only
through AiModelBuilder.cs and AiModelResult.cs as intended by the coding
guidelines.
- Around line 45-58: The validation in the first loop of the
CopyOnWriteCloneHelper method only checks parameter counts but does not validate
the actual tensor compatibility (shape, data type, etc.) between source and
destination layers. Extend the validation loop that checks
srcLayers[i].GetTrainableParameters().Count to also iterate through each
parameter and verify that the corresponding tensors in srcLayers[i] and
dstLayers[i] have compatible shapes and properties before proceeding to the
CloneShared operations. If any tensor is incompatible, return false to skip the
mutation and use the eager fallback, ensuring half-shared clones are never
created.

In `@src/NeuralNetworks/NeuralNetworkBase.cs`:
- Around line 9226-9229: The UseCopyOnWriteDeepCopy property should not be
exposed as part of the public API surface since it is an implementation and
testing detail rather than a user-facing facade capability. Change the access
modifier of the UseCopyOnWriteDeepCopy property from public to internal to keep
it as internal plumbing, preventing external callers from mutating global clone
behavior outside the intended AiModelBuilder and AiModelResult interfaces.
- Around line 8157-8168: The cache field `_hasCrossBatchNormCached` is not being
invalidated when the neural network's layer structure changes. Add cache
invalidation for `_hasCrossBatchNormCached` alongside the existing
layer-structure cache resets in the methods `AddLayerToCollection`,
`RemoveLayerFromCollection`, and `ClearLayers`. Set `_hasCrossBatchNormCached`
to null (or its default value) in these methods to ensure the cache is
recalculated when layer composition changes, preventing stale cached values from
affecting micro-batch normalization decisions.
- Around line 5713-5727: The OutOfMemoryException catch block in the method
containing TrainCore retries from a potentially corrupted state if parameter
mutations have already begun. Add a boolean field _trainMutationStarted to track
when in-place parameter or moment buffer mutations start. Set
_trainMutationStarted to true immediately before each non-transactional write
operation including optimizer.Step calls, streaming optimizer Apply/EndStep
paths, fused plan execution, and raw extra tensor updates throughout the
training code paths. In the OutOfMemoryException catch block, check if
_trainMutationStarted is true before retrying; if it is, rethrow the exception
instead of retrying since the state is already corrupted. Reset
_trainMutationStarted to false after successful retry completion.
- Around line 9408-9423: The coverage guard that validates whether the
copy-on-write mechanism can safely share tensors only checks layer trainable
parameters from GetTrainableParameters() but ignores GetExtraTrainableTensors()
which are still updated during training. After the loop that counts
walkedParamCount from srcLayers, add a check to detect if there are any extra
trainable tensors by calling GetExtraTrainableTensors() on the source model. If
any raw trainable tensors exist (the result list is non-empty), fall back to the
eager copy path by disposing the copy and returning false, consistent with the
existing behavior when walkedParamCount != ParameterCount.
- Around line 8134-8139: The current check for `CollectParameters(Layers).Count
== 0` only verifies materialized tensors, which causes lazy trainable layers to
be skipped during the OOM retry mechanism before their first forward pass.
Replace this check with a condition that gates on the presence of trainable
layers themselves rather than their materialized parameters, allowing lazy
layers to properly participate in the micro-batch retry flow that already exists
in TrainWithGradientAccumulation.
- Around line 5721-5722: The issue is that clearing _baseTrainOptimizer
unconditionally during OOM recovery silently replaces user-configured optimizers
(injected via AiModelBuilder.ConfigureOptimizer and SetBaseTrainOptimizer) with
the default memory-ladder Adam. To fix this, introduce a tracking flag
_baseTrainOptimizerIsFrameworkDefault that marks whether the current optimizer
was framework-created. Set this flag to true in every branch where
GetOrCreateBaseOptimizer() creates and assigns the default optimizer. Then
during OOM recovery, only clear _baseTrainOptimizer if the flag indicates it was
framework-created; otherwise preserve the user-configured optimizer or handle it
appropriately rather than silently replacing it.

In `@src/Optimizers/Adam8BitOptimizer.cs`:
- Around line 421-425: The BF16 tape-state branch (around line 706-716) leaves
VQuantized and VScales fields as null when UseBFloat16MomentStorage is enabled,
but the GetMemoryUsage and GetTapeStateSnapshotForTests methods unconditionally
dereference these fields without null checks, causing NullReferenceExceptions.
Add appropriate null checks in GetMemoryUsage and GetTapeStateSnapshotForTests
to handle the case where these fields are null in BF16 mode, ensuring the
methods gracefully skip or handle quantized fields when they are not allocated.

---

Outside diff comments:
In `@src/Diffusion/Control/ControlNetLiteModel.cs`:
- Around line 118-126: The GetParameters method in ControlNetLiteModel only
retrieves parameters from _baseUNet and _controlEncoder, but excludes _vae. When
the Clone method falls back to SetParameters(GetParameters()) at line 150 (when
TryShareParametersFrom returns false), the _vae is not copied, resulting in a
clone with a fresh VAE instance that can diverge from the source model. Fix this
by ensuring the _vae state is properly included in the parameter cloning
process. Either extend the GetParameters and SetParameters methods to handle
_vae parameters alongside the existing _baseUNet and _controlEncoder parameters,
or implement explicit _vae cloning logic in the Clone method fallback to ensure
the cloned model maintains equivalent state to the source.

In `@src/Diffusion/Control/ControlNetSD3Model.cs`:
- Around line 76-77: The Clone() method in ControlNetSD3Model falls back to
SetParameters(GetParameters()) but this approach excludes the _vae field from
being cloned, causing fresh VAE weights when the copy-on-write sharing fails.
Ensure that both GetParameters() and SetParameters() methods include the _vae
parameters along with _predictor and _controlEncoder parameters, so that Clone()
can properly restore all model components. Alternatively, modify the Clone()
method to explicitly handle cloning or sharing the _vae field in addition to
using SetParameters(GetParameters()), ensuring the cloned model fully mirrors
the source model's state including its VAE component.

In `@src/Diffusion/NoisePredictors/UViTNoisePredictor.cs`:
- Around line 584-592: The Clone() method in UViTNoisePredictor<T> is not
preserving the full state of the object. When creating the clone instance, add
latentSpatialSize and seed as parameters to the constructor call (in addition to
the existing parameters like inputChannels, hiddenSize, etc.). Additionally,
after the TryShareParametersFrom and SetParameters calls, you must copy the
_posEmbed field from the source instance to the cloned instance to ensure the
clone has identical runtime state as the original object.
- Around line 546-552: The SetParameters method in the UViTNoisePredictor class
is incomplete and only restores parameters for _patchEmbed, _timeEmbed1, and
_timeEmbed2, while GetParameters retrieves parameters for encoder, middle,
decoder, skip, and final components. Expand the SetParameters method to call
SetLayerParams for all the same layer components that GetParameters emits, in
the same order, to ensure the complete model state is restored when loading
parameters. Reference the GetParameters method to determine which layers need to
be included in the SetParameters call sequence.

In `@src/Diffusion/VAE/AutoencoderKL.cs`:
- Around line 427-445: The SetParameters method in AutoencoderKL allows partial
weight assignment without proper validation, which can silently fail to copy all
required parameters. Add a length validation check at the beginning of the
SetParameters method to ensure that the input parameters vector length exactly
equals the sum of encoderParams.Length and decoderParams.Length, and throw an
appropriate exception if there is a mismatch instead of allowing incomplete
parameter assignment.

In `@src/Models/Options/Adam8BitOptimizerOptions.cs`:
- Around line 26-27: The Adam8BitOptimizerOptions class is missing required
constructors per the options golden pattern. Add a parameterless default
constructor and a copy constructor that accepts another Adam8BitOptimizerOptions
instance and copies all properties from it (including the new
UseBFloat16MomentStorage flag introduced on line 153). Ensure the copy
constructor properly initializes all inherited and new properties to prevent
configuration drift between option instances.

In `@src/Optimizers/Adam8BitOptimizer.cs`:
- Around line 529-542: The state reallocation logic in the condition starting at
line 529 only checks for missing key or length mismatch, but does not validate
whether the cached state's storage mode matches the current
_options.UseBFloat16MomentStorage setting. When this option is toggled on an
existing optimizer instance, old cached entries with the wrong storage mode
remain in _tapeStates, causing null-dereference when accessing state.MBf16 at
line 562. Add an additional condition to the state validity check that verifies
the storage mode of the cached state matches the current
UseBFloat16MomentStorage option, and reallocate the state via AllocateTapeState
if they don't match. Apply this same validation check to the other locations
mentioned (lines 560-564 and 1166-1169) where BF16 state access occurs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 526e5d60-fc0a-4ba2-be90-30b34767f423

📥 Commits

Reviewing files that changed from the base of the PR and between 6588159 and 374987f.

📒 Files selected for processing (109)
  • Directory.Packages.props
  • src/Diffusion/Audio/DiffWaveModel.cs
  • src/Diffusion/Control/ControlARModel.cs
  • src/Diffusion/Control/ControlNeXtModel.cs
  • src/Diffusion/Control/ControlNetInpaintingModel.cs
  • src/Diffusion/Control/ControlNetLiteModel.cs
  • src/Diffusion/Control/ControlNetModel.cs
  • src/Diffusion/Control/ControlNetPlusPlusFluxModel.cs
  • src/Diffusion/Control/ControlNetPlusPlusModel.cs
  • src/Diffusion/Control/ControlNetSD3Model.cs
  • src/Diffusion/Control/ControlNetTileModel.cs
  • src/Diffusion/Control/ControlNetUnionProModel.cs
  • src/Diffusion/Control/IPAdapterFaceIDPlusModel.cs
  • src/Diffusion/Control/IPAdapterModel.cs
  • src/Diffusion/Control/IPAdapterPlusModel.cs
  • src/Diffusion/Control/ReferenceOnlyModel.cs
  • src/Diffusion/DiffusionModelBase.cs
  • src/Diffusion/FastGeneration/AutoRegressiveMaskedDiffusion.cs
  • src/Diffusion/FastGeneration/DMD2Model.cs
  • src/Diffusion/FastGeneration/FlowMapModel.cs
  • src/Diffusion/FastGeneration/Flux2SchnellModel.cs
  • src/Diffusion/FastGeneration/HyperSDModel.cs
  • src/Diffusion/FastGeneration/InstaFlowModel.cs
  • src/Diffusion/FastGeneration/MARModel.cs
  • src/Diffusion/FastGeneration/MultistepLCModel.cs
  • src/Diffusion/FastGeneration/OSDSModel.cs
  • src/Diffusion/FastGeneration/PCMModel.cs
  • src/Diffusion/FastGeneration/PeRFlowModel.cs
  • src/Diffusion/FastGeneration/PixArtDeltaLCMModel.cs
  • src/Diffusion/FastGeneration/SANASprintModel.cs
  • src/Diffusion/FastGeneration/SCottModel.cs
  • src/Diffusion/FastGeneration/SD3FlashModel.cs
  • src/Diffusion/FastGeneration/SD3TurboModel.cs
  • src/Diffusion/FastGeneration/SDXLLightningModel.cs
  • src/Diffusion/FastGeneration/SenseFlowModel.cs
  • src/Diffusion/FastGeneration/SiDDiTModel.cs
  • src/Diffusion/FastGeneration/SiDModel.cs
  • src/Diffusion/FastGeneration/SwiftBrushModel.cs
  • src/Diffusion/FastGeneration/TCDModel.cs
  • src/Diffusion/FastGeneration/TrainingEfficientLCM.cs
  • src/Diffusion/FastGeneration/TransfusionModel.cs
  • src/Diffusion/ImageEditing/AnyEditModel.cs
  • src/Diffusion/ImageEditing/BrushEditModel.cs
  • src/Diffusion/ImageEditing/BrushNetModel.cs
  • src/Diffusion/ImageEditing/BrushNetXModel.cs
  • src/Diffusion/ImageEditing/CycleGANTurboModel.cs
  • src/Diffusion/ImageEditing/FlowEditModel.cs
  • src/Diffusion/ImageEditing/FreeInpaintModel.cs
  • src/Diffusion/ImageEditing/HDPainterModel.cs
  • src/Diffusion/ImageEditing/PowerPaintModel.cs
  • src/Diffusion/ImageEditing/RADModel.cs
  • src/Diffusion/ImageEditing/ReplaceAnythingModel.cs
  • src/Diffusion/ImageEditing/SD3InpaintingModel.cs
  • src/Diffusion/ImageEditing/Step1XEditModel.cs
  • src/Diffusion/ImageEditing/TurboEditModel.cs
  • src/Diffusion/ImageEditing/UltraEditModel.cs
  • src/Diffusion/MotionGeneration/MoMaskModel.cs
  • src/Diffusion/MotionGeneration/MotionDiffusionModel.cs
  • src/Diffusion/NoisePredictors/AsymmDiTPredictor.cs
  • src/Diffusion/NoisePredictors/EMMDiTPredictor.cs
  • src/Diffusion/NoisePredictors/FlagDiTPredictor.cs
  • src/Diffusion/NoisePredictors/FluxDoubleStreamPredictor.cs
  • src/Diffusion/NoisePredictors/MMDiTXNoisePredictor.cs
  • src/Diffusion/NoisePredictors/NoisePredictorBase.cs
  • src/Diffusion/NoisePredictors/SiTPredictor.cs
  • src/Diffusion/NoisePredictors/UNetNoisePredictor.cs
  • src/Diffusion/NoisePredictors/UViTNoisePredictor.cs
  • src/Diffusion/NoisePredictors/VideoUNetPredictor.cs
  • src/Diffusion/Panorama/CubeDiffModel.cs
  • src/Diffusion/Panorama/DiffPanoModel.cs
  • src/Diffusion/Panorama/MultiDiffusionModel.cs
  • src/Diffusion/Panorama/SpotDiffusionModel.cs
  • src/Diffusion/Panorama/StitchDiffusionModel.cs
  • src/Diffusion/StyleTransfer/ConsisLoRAModel.cs
  • src/Diffusion/StyleTransfer/InstantStyleModel.cs
  • src/Diffusion/StyleTransfer/KLoRAStyleModel.cs
  • src/Diffusion/StyleTransfer/RBModulationModel.cs
  • src/Diffusion/StyleTransfer/SASTDModel.cs
  • src/Diffusion/StyleTransfer/StyDiffModel.cs
  • src/Diffusion/StyleTransfer/StyleAlignedEditModel.cs
  • src/Diffusion/StyleTransfer/StyleStudioModel.cs
  • src/Diffusion/StyleTransfer/TLoRAModel.cs
  • src/Diffusion/SuperResolution/CCSRModel.cs
  • src/Diffusion/SuperResolution/PASDModel.cs
  • src/Diffusion/SuperResolution/SeeSRModel.cs
  • src/Diffusion/ThreeD/DreamFusionModel.cs
  • src/Diffusion/ThreeD/Zero123Model.cs
  • src/Diffusion/VAE/AudioVAE.cs
  • src/Diffusion/VAE/AutoencoderKL.cs
  • src/Diffusion/VAE/Causal3DVAE.cs
  • src/Diffusion/VAE/DeepCompressionVAE.cs
  • src/Diffusion/VAE/EQVAEModel.cs
  • src/Diffusion/VAE/ImprovedVideoVAE.cs
  • src/Diffusion/VAE/LiteVAEModel.cs
  • src/Diffusion/VAE/SDXLVAEModel.cs
  • src/Diffusion/VAE/TemporalInterpolationVAE.cs
  • src/Diffusion/VAE/TemporalVAE.cs
  • src/Diffusion/VAE/VAEModelBase.cs
  • src/Diffusion/Video/CogVideoModel.cs
  • src/Diffusion/VirtualTryOn/CATDMModel.cs
  • src/Diffusion/VirtualTryOn/FashionVDMModel.cs
  • src/Diffusion/VirtualTryOn/IDMVTONModel.cs
  • src/Diffusion/VirtualTryOn/StableVITONModel.cs
  • src/Helpers/CopyOnWriteCloneHelper.cs
  • src/MixedPrecision/Float8Types.cs
  • src/Models/Options/Adam8BitOptimizerOptions.cs
  • src/NeuralNetworks/NeuralNetworkBase.cs
  • src/Optimizers/Adam8BitOptimizer.cs
  • tests/AiDotNet.Tests/UnitTests/NeuralNetworks/CopyOnWriteCloneTests.cs

Comment thread src/Diffusion/FastGeneration/AutoRegressiveMaskedDiffusion.cs Outdated
Comment thread src/Diffusion/FastGeneration/SANASprintModel.cs Outdated
Comment thread src/Diffusion/FastGeneration/SCottModel.cs Outdated
Comment thread src/Diffusion/FastGeneration/TransfusionModel.cs Outdated
Comment thread src/Diffusion/ImageEditing/FlowEditModel.cs Outdated
Comment thread src/NeuralNetworks/NeuralNetworkBase.cs Outdated
Comment thread src/NeuralNetworks/NeuralNetworkBase.cs
Comment thread src/NeuralNetworks/NeuralNetworkBase.cs Outdated
Comment thread src/NeuralNetworks/NeuralNetworkBase.cs
Comment thread src/Optimizers/Adam8BitOptimizer.cs
franklinic and others added 2 commits June 17, 2026 21:48
…p for foundation-scale models

The get-side streaming API (GetParameterChunks, PyTorch nn.Module.parameters()
style) already existed, but there was NO streaming setter, so a get->set->get
round-trip still had to flatten through SetParameters(Vector<T>). For Flux-2-scale
models that flat int-indexed Vector<T> OOMs the host (and overflows int.MaxValue
above 2.1B params). Flux2Model_GetSetParameters_RoundTrips was failing with
OutOfMemoryException in exactly this flat path.

Add SetParameterChunks(IEnumerable<Tensor<T>>) mirroring GetParameterChunks:
- IParameterizable: default interface method (non-framework) buffering to a flat
  vector + SetParameters (back-compat for tractable models).
- DiffusionModelBase / NoisePredictorBase: concrete virtual (both targets) with
  the same back-compat default.
- LatentDiffusionModelBase: override that distributes chunks to the noise
  predictor, VAE, then conditioner off a SINGLE shared enumerator (one chunk in
  flight at a time — never buffers the whole stream).
- FluxDoubleStreamPredictor: overrides GetParameterChunks + SetParameterChunks to
  stream one DenseLayer sub-block at a time, and replaces the 2x-peak
  Vector.Concatenate in GetParameters with a single pre-sized buffer.

Migrate Flux2Model_GetSetParameters_RoundTrips to the streaming round-trip (still
asserts get->set->get fidelity, incl. first-chunk values — not a weakening).
Verified PASS (was OOM): 1 passed, 31s, net10.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…onstruct without OOM

FlagDiTPredictor force-resolved every adaLN-zero projection in its constructor
(ZeroInitialize -> ResolveFromShape -> GetParameters -> SetParameters(zeros)),
eagerly allocating weight tensors for a multi-billion-parameter stack the moment
the model was constructed. LuminaImage2Model_DefaultConstructor_CreatesValidModel
failed with OutOfMemoryException in exactly that path (DenseLayer.EnsureInitialized
-> TensorAllocator.Rent), even though the test only checks the sub-models are
non-null — every other FlagDiT layer is already lazy.

Add LazyDenseZero on NoisePredictorBase: a DenseLayer built with the (non-lazy)
Zero strategy but resolved shapes-only, so it allocates NOTHING at construction and
zero-fills its weights+biases on first resolve. Behaviourally identical to eager
adaLN-zero (block still begins as the identity, Peebles & Xie 2022) but defers the
allocation. Use it for FlagDiT's per-block and final adaLN layers; remove the now
dead ZeroInitialize.

Verified PASS (was OOM): LuminaImage2Model_DefaultConstructor, 17s, net10.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on SDPA (#1624)

The MMDiT joint-block attention (shared by the faithful Flux/MMDiTX/EMMDiT/AsymmDiT
predictors) hand-rolled scaled-dot-product attention as ~7 separate ops —
reshape-for-heads ×3, transpose K, BatchMatMul(Q,Kᵀ) which MATERIALIZES the
O(seq²) scores matrix, scalar-scale, Softmax, BatchMatMul(weights,V), reshape-back —
each a full pass + fresh allocation, run twice per joint block × every layer ×
every denoising step. That is the implementation gap vs PyTorch's fused SDPA
(FlashAttention: tiled, no seq×seq materialization, one kernel), and what pushed
the now-faithful dual-stream predictors past the CI test budget.

The package already ships that fused kernel (Engines/Autodiff/FlashAttention*.cs,
FusedAttention.cs) exposed as IEngine.ScaledDotProductAttention — the #476/#479
work measured under PyTorch's SDPA. DiT already used the fast SelfAttentionLayer
(why PixArtDelta passed at 59s while MMDiT-based models timed out). This routes the
MMDiT family to the same fused primitive, using the [B,H,S,D] 4-D layout the engine
SDPA requires for true multi-head (FusedAttention does not head-split a rank-3
input). Same math (autodiff-aware backward included), fused execution — no model
shrinking. Builds clean on Tensors 0.101.7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ooples and others added 3 commits June 20, 2026 19:53
…aram count

IsQuantizationAwareTrainingEnabled auto-engaged QAT for every model >= 500M
params (#1624 G5). A CPU profile of a Kandinsky (1.8B) train step showed that
turned ~57% of the step into pure overhead: each iteration fake-quantized ALL
~1.8B weights — CollectTrainableParameters + a serial ToVector copy of the full
weight set (~82s FakeQuant + ~79s ToVector at the CI thread cap) — and, being
lossy, it changed the training trajectory of the vanilla-DDPM contract tests.

Disabling the auto-engage drops the Kandinsky train step from ~168s/iter to
~72s/iter (2-thread CI sim). QAT is now opt-in (default off), matching the
Train() method's own "opt-in, default OFF" comment, the project's
streaming/activation-checkpointing convention, and PyTorch (torch QAT is always
explicit). Foundation models targeting int8 deployment opt in via
EnableQuantizationAwareTraining().

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ed/lazy-graph host)

PredictCompiledMulti always routed the per-step forward through the compiled
model host, which builds + realizes a lazy graph (CompiledModelHost.Predict +
LazyNode.Realize). That host is an INFERENCE replay optimization; under an active
gradient tape it is pure overhead on top of the eager op-record the backward
needs anyway — a CPU profile measured it at ~14% of a foundation-scale diffusion
train step, and bypassing it (plus the attention-backward vectorization in
Tensors#655) cut the Kandinsky train step ~47s -> ~28s/iter at 4 threads.

Under a tape (and not NoGradScope), run the eager fallback directly: it records
the identical ops, so gradients are unchanged. Inference (no tape / NoGradScope)
still gets the compiled multi-input replay (the #1620 path), unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
franklinic and others added 2 commits June 20, 2026 22:30
…r the same input (#1624)

The diffusion model-family contract is that Predict(sameInput) is deterministic
(Predict_ShouldBeDeterministic). ~22 models violated it: their inference/generation
paths drew fresh noise from the ADVANCING per-instance RandomGenerator whenever no
explicit seed was given (`seed.HasValue ? CreateSeededRandom(seed) : RandomGenerator`),
so two consecutive Predict calls consumed different random draws → different outputs
(e.g. AudioLDM 3.19 vs 3.55). Pre-existing; only now surfaced because S1 +
the faithful-predictor rebuild let the model-family shards run far enough to reach
these models (they previously OOM'd first — the "test unmasking" pattern).

Fix: add DiffusionModelBase.CreateInferenceRng(int? seed), which uses a STABLE seed
(the model's construction seed, else a fixed constant) when none is passed — never
the advancing RandomGenerator — so Predict reproduces while staying seedable for
sample variety. Training keeps using RandomGenerator (must advance per step).
Replaced the buggy pattern across the diffusion model + base classes; VAEModelBase
(not a DiffusionModelBase) gets the equivalent stable-seed fix inlined.

Builds clean (net10 + net471). Verified: AudioLDM determinism now passes; the
RNG-induced divergence is eliminated (residual sub-1e-3 diffs under concurrent test
execution are FP reduction-order, a separate matter).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cratch)

Predict_ShouldBeDeterministic failed for diffusion noise predictors: the
same input predicted twice produced different output. Bisected to the #1622
verify-then-trust gate, NOT RNG/compile-trace/lazy-init.

Root cause (3x-probe + param/input hashing): the gate must execute a compiled
plan ONCE to compare it against eager. The AiDotNet.Tensors compiled lazy-graph
executor (through 0.101.7) shares process-global scratch buffers with the eager
executor, so that single compiled execution leaves the scratch in a state that
makes every subsequent eager forward for a REJECTED shape oscillate with
period 2 (call #1 == call #3 != call #2) — non-deterministic AND numerically
wrong on half the calls. Proven: model parameters and inputs are bit-identical
across calls (hashed), and invalidating the cached plan does not help, so the
corruption lives in the package's global scratch, not anything this layer owns.
Diffusion predictors fall back to eager for any shape whose compiled plan does
not match (the common case for these architectures), so the default-on path
silently corrupted inference.

Fix: make the compiled inference replay opt-in (AIDOTNET_ENABLE_AUTO_COMPILE=1)
and default to pure eager, which is correct and bit-identical across calls and
across the denoising loop. No impact on #1624 foundation-scale TRAINING:
PredictCompiledMulti already runs eager whenever a gradient tape is active, so
training never touched the compiled inference path. Re-enable per model once the
package isolates the two executors' scratch buffers.

Verified: AudioLDM/AudioLDM2/AuraFlow/BlendedDiffusion/CogView4 determinism
tests pass; remaining diffusion determinism failures are pre-existing 120s perf
timeouts (ControlNet, ConsistencyModel), unchanged with auto-compile on or off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ooples ooples force-pushed the perf/1624-training-scale branch from 952e1a4 to 0ed12e3 Compare June 21, 2026 05:20
franklinic and others added 7 commits June 21, 2026 10:41
… bug)

NeuralNetworkBase composes the same #1622 verify-then-trust gate as the
diffusion NoisePredictorBase, so it carries the same latent bug: the
AiDotNet.Tensors compiled lazy-graph executor (through 0.101.7) shares
process-global scratch with the eager executor, and the gate's one mandatory
compiled-vs-eager verification run leaves that scratch dirty — so a REJECTED
shape's eager fallback returns the wrong value. The single-input value memo
masks the determinism symptom for repeated identical inputs (why NN
determinism tests pass), but a memo-MISS input at a rejected shape after a
compiled run still reads corrupted scratch.

Mirror the diffusion fix: gate the whole compiled-inference path behind a
process-wide opt-in (AIDOTNET_ENABLE_AUTO_COMPILE=1), default off. Both the
explicit EnableInferenceAcceleration() and the silent foundation-scale
auto-engagement now require it, so no model silently runs the corrupting path.
The flag is read from the env var at load but exposed as a test-settable static
(SetAutoCompiledInferenceEnabledForTesting) so the gate's parity invariants stay
covered in-process. Inference-only: training never touched this path.

AcceleratedInferenceTests opt in for the duration (restored in Dispose) so all
existing parity/memo/shape-guard invariants still run; added a test asserting
the shipped default (opt-in off → explicit opt-in stays eager, gate never runs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on + partial #632)

Brings the Phase C fp16-weight / fp32-master mixed-precision storage (Tensors #650)
and the #632 compiled-inference MemoryPlanning aliasing fix (BlasBatch hoist vs memory
planning in attention inference). Native packages (OneDNN/OpenBLAS/CLBlast) bumped in
lockstep.

Verified via local-source wire-and-bisect that #632 is the commit that resolves the
diffusion Predict_ShouldBeDeterministic corruption for the small/no-cross-attention
case (reproduces on 0.101.7, gone on 0.102.x). HOWEVER #632 is only a PARTIAL fix: the
real model-family test configs (cross-attention, contextDim>0) still replay
non-deterministically with compiled inference ENABLED on 0.102.2 (AudioLDM/AuraFlow
reproduce the exact 3.19-vs-3.55 signature). The residual cross-attention aliasing fix
is still outstanding upstream.

Therefore the consumer opt-in gate stays default-OFF: compiled inference is opt-in via
AIDOTNET_ENABLE_AUTO_COMPILE=1, and the determinism suite passes on pure eager (verified
on 0.102.2). Builds clean on net10.0 + net471.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)

The high-level SGD mixed-precision path (CompiledTapeTrainingStep.StepMixedPrecision)
called MixedPrecisionReflection.StepSgd WITHOUT a GradScaler, so FP16 backward
gradients could underflow to zero before reaching the FP32 master weights. The
Adam and generic-optimizer consumer paths already loss-scale via _mpScaler; this
brings the SGD path to parity.

Two changes:
- MixedPrecisionReflection now binds the 3-param Step(IReadOnlyList<Tensor<float>>,
  float, GradScaler?) introduced in Tensors 0.102.0 (#650) — the prior exact-type
  2-arg GetMethod did not resolve it, which would have silently dropped the SGD
  FP16 path to the eager fallback on 0.102.0. Falls back to the legacy 2-arg Step
  for older packages; StepSgd dispatches on the resolved parameter count.
- StepMixedPrecision persists a GradScaler (dynamic loss scale + skip-on-overflow
  handled inside Step) and passes it, matching the Adam/generic paths.

Builds clean on Tensors 0.102.0. Gated behind AIDOTNET_FP16_ACTIVATIONS (default off).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 51e2b2d)
…ned Tensors 0.102.2

The Phase E SGD loss-scaling path (StepMixedPrecision -> StepSgd with a GradScaler) depends on the
Tensors 0.102.0+ GradScaler API resolving through MixedPrecisionReflection. Pin that it resolves on
the pinned package (0.102.2): if a downgrade drops it, StepSgd silently falls back to the unscaled
2-arg Step and FP16 SGD grads underflow — so failing loudly here is the intended guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… parity (determinism)

The verify-then-trust gate returned the EAGER reference on the verify call but the compiled plan on
every subsequent trusted call. When compiled matches eager within tolerance (trusted) but not bit-for-
bit (legitimate ~ULP op-order rounding), call #1 (eager) then differed from calls #2..N (compiled),
breaking Predict(x)-twice determinism for cross-attention diffusion models (e.g. AuraFlow: ~1e-8
divergence tripping the 12-decimal-place assertion) even though both are within the gate's tolerance.

Fix: on parity, return the compiled candidate so call #1 matches every later trusted replay. Use
CloneDeepCopy (NOT Clone — which is copy-on-write and shares the buffer): the compile host hands back a
RECYCLED output buffer the next compiled call overwrites in place, bypassing COW, so a shared clone
would be mutated by a later call (observed corrupting AcceleratedInferenceTests' held results). On
rejection the eager reference stays the source of truth. Verified: AcceleratedInferenceTests 7/7 and
AudioLDM/AudioLDM2/AuraFlow Predict_ShouldBeDeterministic all pass with compiled inference enabled
(against the local Tensors source that also carries the ConstantFolding-off fix).

Note: compiled inference stays opt-in (AIDOTNET_ENABLE_AUTO_COMPILE=1, default eager) in the consumer
until AiDotNet.Tensors ships the ConstantFolding-off fix (>0.102.2); this gate fix makes the path
correct for when it is re-enabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…scale

# Conflicts:
#	Directory.Packages.props
ooples added a commit that referenced this pull request Jun 22, 2026
)

* docs(#1662): design spec for backward-pass optimizations (levers #4, #1, #3)

Cross-repo, Tensors-first design for the backward/training side of #653/#1624:
backward buffer-reuse audit (#4), optimizer-in-backward adaptive hybrid (#1),
and FlashAttention-style tiled backward (#3). Excludes lever #2 (checkpointing,
owned by open PR #1633).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(#1662): correct lever #1 scope — optimizer-in-backward is OOM-only today

TrainWithTapeStreaming already does single-pass-fused/two-pass-norm, but only
engages above 0.5x-RAM; the common (fits-in-memory) case is collect-then-step,
which merely ties PyTorch. Lever #1's real deliverable: promote single-pass
fused optimizer-in-backward to the common-case default for unclipped training,
add opt-in fast-clip for single-pass clipped, and prove a handy PyTorch CPU win
on per-step time / peak RSS / allocation via --trainbench (default-on flip gated
on that proof).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(#1662): implementation plan — levers #4, #3, #1 (Tensors-first, PyTorch proof)

Phased TDD plan: Tensors #4 (trainbench probe + alloc audit + arena guard +
fused-norm overload) -> Tensors #3 (tiled FlashAttention backward, parity-gated)
-> v0.102.0 release -> AiDotNet #1 (full-precision streaming optimizer, engage
single-pass fused-in-backward as the unclipped common-case default, opt-in
fast-clip, PyTorch CPU comparison proving a handy win on time/alloc/RSS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#1662): full-precision streaming Adam (bit-identical fused-in-backward foundation)

Adds FullPrecisionStreamingOptimizer<T> (per-tensor double[] moments, one global
step counter, no quantization, no update clamping) + FullPrecisionStreamingAdam<T>
matching the classic AdamOptimizer formula exactly. This is the bit-identical
streaming optimizer that lets single-pass fused optimizer-in-backward be the
common-case DEFAULT for models that fit in memory (vs the 8-bit OOM-survival
variants whose ClampUpdate + block quantization diverge from classic Adam).

Compiles clean; not yet wired into GetOrCreateStreamingOptimizer (next commit) —
the bit-identical gate test lands with the wiring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#1662): wire full-precision fused optimizer-in-backward (unclipped) + gate test

Resolver gains SupportsFullPrecision (Adam-family today) + a full-precision branch;
TrainWithTapeStreaming uses the bit-identical FullPrecisionStreamingAdam when the
user opts in (ForceOn) for an unclipped Adam model. Auto default unchanged (the
unclipped-fitting default-on flip is gated on the §5d PyTorch proof).

Gate test FusedInBackward_Unclipped_MatchesClassicAdam_ToFloatPrecision: fused
single-pass optimizer-in-backward tracks classic eager collect-then-step Adam to
float precision (1e-3 over 20 steps, identical init). PASSES.

Surfaced (pre-existing, separate): MaxGradNorm defaults to 1.0, so the COMMON case
is clipped -> the two-pass clipped streaming path, which currently throws
"persistent tape activations released" (sets Persistent=true but not
ReleaseStreamingActivations=false). Fix tracked next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#1662): clipped two-pass streaming path + bit-identical full-precision for clipped

Fixes the pre-existing clipped-streaming crash: the two-pass path builds a
Persistent tape but ComputeGradientsStreaming releases activations by default
(process-global GradientTape<T>.ReleaseStreamingActivations), so pass 2 threw
"activations released". Now save/restore that flag (false during the clipped
passes) so both passes share the recorded graph; the setting never leaks.

Also drops the !clip restriction on full-precision selection: clipping only
chooses single-pass vs two-pass, not precision, so the clipped (common-case,
MaxGradNorm=1.0 default) path now uses the bit-identical FullPrecisionStreamingAdam
too. PyTorch's apply_optimizer_in_backward does not support clipping at all.

Gate: FusedInBackward_Clipped_MatchesClassicAdam_ToFloatPrecision (+ the unclipped
one) both PASS — fused optimizer-in-backward tracks classic clip-then-step Adam to
float precision over 20 steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(#1662): pytorch cpu comparison - honest result (we do not beat torch on speed)

Adds benchmarks/trainbench_torch.py (mirrors the --trainbench residual-FFN shape)
and the head-to-head results. Finding on S=128/D=384/10-layer/8-thread:

  torch:    median 69.6 ms/step, peak RSS 312 MB
  aidotnet: median 252.9 ms/step (3.6x SLOWER), peak WS 489 MB, alloc 0.127 MB/step

Identical loss. The optimizer-in-backward + arena wins ONLY on allocation/GC churn
(near-zero), not throughput or peak RSS. The 3.6x per-step gap is GEMM + autodiff
overhead (#653 core CPU-parity), orthogonal to optimizer-in-backward.

Consequence: the §5a default-on flip is NOT justified (no speed win) — fused
optimizer-in-backward stays opt-in (ForceOn), valued for bounded peak-grad memory +
zero GC churn (bit-identical) + clipped support PyTorch lacks. "Beat PyTorch on all
metrics" is not currently true and must not be claimed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(#1662): gap investigation — per-step deficit is small-M GEMM parallel scaling

Decomposed the 3.6x per-step gap by batch: S=128 -> 3.6x, S=1024 -> 1.43x. The gap
collapses with larger M, matching the known #475 finding (managed microkernel is
MKL-parity, but small-M GEMM parallel scaling plateaus ~2x). So the speed deficit is
the #653 core CPU-parity problem (small-M GEMM dispatch/scaling), orthogonal to
optimizer-in-backward, which can't change matmul cost. Lever #1's real win is the
bit-identical allocation/peak-grad-memory reduction, which holds at all batch sizes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(#1662): opt-in single-pass fast approximate grad-clip (§5c)

Adds FastApproxGradClip (default OFF). When on + clipping active, the streaming
clipped path runs as a SINGLE backward pass: the clip scale comes from an EMA of
the previous step's global grad-norm (NFNet-style adaptive clipping), and this
step's exact norm is accumulated in the same pass to update the EMA. First step
seeds the EMA without clipping. NOT bit-identical (documented approximation) — this
is the clipped path that beats PyTorch on backward-pass count (torch's
apply_optimizer_in_backward cannot clip at all).

Branch restructured into unclipped single-pass / fast-clip single-pass / exact
two-pass; persistent tape + ReleaseStreamingActivations toggle now gated on the
genuine two-pass only. Convergence test (loss decreases, stays finite) PASSES;
14/14 fused+streaming tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Training/perf-scale: eliminate training-bound timeouts & OOM on 16GB CI runner (forward-OK models) — tracker

2 participants