Skip to content

fix(diffusion): resolve #1671 TCD clone + default-construction timeouts#1677

Merged
ooples merged 5 commits into
masterfrom
fix/1671-tcd-clone-default-ctor
Jun 24, 2026
Merged

fix(diffusion): resolve #1671 TCD clone + default-construction timeouts#1677
ooples merged 5 commits into
masterfrom
fix/1671-tcd-clone-default-ctor

Conversation

@ooples

@ooples ooples commented Jun 23, 2026

Copy link
Copy Markdown
Owner

Fixes the two failures called out in #1671 (clean run 28023414937, master @ 01a4ddb4). Both turned out to be 120 s test timeouts, not the clone-equivalence divergence the issue hypothesized for bug 1 — confirmed by reading the actual CI logs (the failing test's error was Test execution timed out after 120000 milliseconds, and the adjacent Predict_ShouldBeDeterministic passed at 1 m 28 s).

Bug 1 — TCDModelTests.Clone_ShouldProduceIdenticalOutput (ModelFamily Diffusion T-Z)

TCD is a trajectory-consistency distillation model built for few-step sampling — its own metadata already declared optimal_steps = 4, and the paper reports 4 as the sweet spot. But the ctor never set DefaultInferenceSteps, so Predict ran the generic 10-step DDIM default (~2.5× the intended compute). The Clone test runs two SD-1.5-scale UNet+VAE predicts (original + clone), which blew past the 120 s budget under CI load.

Fix: pin DefaultInferenceSteps = 4 (the model's own optimal), matching every other few-step model in FastGeneration/ (ConsistencyModel = 2, FluxSchnell = 4, …). This is the paper-faithful default, not a test accommodation; clone equivalence is preserved (original and clone use the same step count).

Bug 2 — DefaultConstructionTests.AllDefaultConstructableModels_ShouldConstructWithoutException (Integration D)

Two models exceeded the 10 s/ctor budget: LuminaImage2Model and FlagDiTPredictor. FlagDiTPredictor already builds its FFN/adaLN via the lazy DenseLayer path (NoisePredictorBase.LazyDense, zero-sized until first forward) so a foundation-scale stack constructs cheaply and streams weights at forward — but its 32 GroupedQueryAttentionLayer blocks were still eagerly allocated (~1.3 B attention weights, gigabytes, before a single forward), defeating that design.

Fix: add an opt-in deferAllocation path to GroupedQueryAttentionLayer that mirrors the lazy DenseLayer contract — projection weights stay zero-sized at construction and materialize (allocate + initialize) on first use, guarded at every weight-access entry point (Forward / GetParameters / SetParameters / GetParameterGradients / UpdateParameters). ParameterCount is derived exactly from the dimension fields while deferred, so a parent predictor's CalculateParameterCount stays correct without forcing allocation. FlagDiTPredictor opts in; eager callers (the default) are completely unchanged.

Verification (local, AIDOTNET_INFERENCE_ARENA=0, CPU)

Test Before After
TCDModelTests.Clone_ShouldProduceIdenticalOutput timeout (>120 s) ✅ ~56 s
TCDModelTests.Predict_ShouldBeDeterministic ~88 s (CI) ✅ ~21 s
DefaultConstructionTests…ShouldConstructWithoutException 2 TIMEOUT ✅ 824 OK / 0 TIMEOUT (~14 s)
GroupedQueryAttentionLayerTests + GQAInferenceIntegrationTests (eager path) ✅ 42/42
LuminaImage2ModelTests Scheduler + Parameters_ShouldBeNonEmpty ✅ pass

Scope note

The TCD class also has Training_ShouldReducePredictionError and ForwardPass_ShouldBeFinite_AfterTraining failures, and LuminaImage2ModelTests.Metadata_ShouldExist — these are training-scale timeouts (each trains a full SD/foundation-scale model) that are not affected by either fix here and are out of #1671's scope (they belong to the broader training/perf inventory, #1624). This PR resolves exactly the two tests #1671 named.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Performance Improvements
    • Improved startup and responsiveness by deferring allocation of large attention weights until they’re required.
    • Reduced memory and allocation overhead during parameter handling by switching to streaming, chunk-based workflows for large diffusion models.
  • Bug Fixes
    • Fixed clone behavior for lazily-initialized predictor weights to avoid unsafe parameter round-trips.
    • Kept “optimal_steps” metadata aligned with the configured few-step inference guidance.
  • Tests
    • Strengthened large-model parameter round-trip tests to better validate chunk-based serialization and avoid CI out-of-memory issues.

ooples and others added 2 commits June 23, 2026 15:35
TCD is a trajectory-consistency distillation model whose whole purpose is
high-quality few-step generation (the paper reports 4 steps as optimal, 8 max,
and the model's own metadata already declared optimal_steps=4). The ctor never
set DefaultInferenceSteps, so Predict ran the generic 10-step DDIM default —
~2.5x the intended compute. Two SD-1.5-scale UNet+VAE predicts (original +
clone) then blew past the 120s test budget, timing out
TCDModelTests.Clone_ShouldProduceIdenticalOutput on the ModelFamily Diffusion
T-Z shard.

Pin DefaultInferenceSteps to the model's optimal_steps (4), matching every
other few-step model in this folder (ConsistencyModel=2, FluxSchnell=4, ...).
This is the paper-faithful default, not a test accommodation. Clone now passes
(~56s local) and Predict_ShouldBeDeterministic drops from ~88s to ~21s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ion timeout (#1671)

DefaultConstructionTests.AllDefaultConstructableModels_ShouldConstructWithoutException
timed out (>10s/ctor) on LuminaImage2Model and FlagDiTPredictor. FlagDiTPredictor
already builds its FFN/adaLN via the lazy DenseLayer path (NoisePredictorBase.LazyDense,
zero-sized until first forward) precisely so a foundation-scale stack constructs cheaply
and streams weights at forward — but the 32 GroupedQueryAttentionLayer blocks were still
allocated eagerly, ~1.3 B attention weights (gigabytes, >10s) before a single forward,
defeating that design.

Add an opt-in deferAllocation path to GroupedQueryAttentionLayer mirroring the lazy Dense
contract: projection weights stay zero-sized at construction and materialize (allocate +
initialize) on first use via EnsureWeightsMaterialized, guarded at every weight-access
entry point (Forward/GetParameters/SetParameters/GetParameterGradients/UpdateParameters).
ParameterCount is derived exactly from the dimension fields while deferred, so a parent
predictor's CalculateParameterCount is correct without forcing allocation. Eager callers
(the default, deferAllocation=false) are completely unchanged.

FlagDiTPredictor opts in. DefaultConstructionTests now reports 824 OK / 0 TIMEOUT (~14s);
the GroupedQueryAttentionLayer unit + GQA inference suites stay green (42/42); LuminaImage2
construction and ParameterCount pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 23, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments
Project Deployment Actions Updated (UTC)
aidotnet_website Ignored Ignored Preview Jun 23, 2026 11:21pm
aidotnet-playground-api Ignored Ignored Preview Jun 23, 2026 11:21pm

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@ooples, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 2 hours, 51 minutes, and 9 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 331e38b1-f986-47c9-9db5-208462577449

📥 Commits

Reviewing files that changed from the base of the PR and between 626c8d8 and 35074b2.

📒 Files selected for processing (6)
  • src/Diffusion/FastGeneration/FluxSchnellModel.cs
  • src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs
  • src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs
  • src/NeuralNetworks/Layers/LayerBase.cs
  • tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs
  • tests/AiDotNet.Tests/UnitTests/Diffusion/Models/NewConditionerContractTests.cs

Walkthrough

GroupedQueryAttentionLayer<T> gains a deferAllocation constructor parameter enabling zero-sized placeholder tensors at construction, with one-shot materialization on first use via EnsureWeightsMaterialized(). LayerBase adds MaterializeParameters() and CopyTrainableParametersFrom() to support deferred patterns. FlagDiTPredictor opts all 32 attention blocks into deferred mode. MMDiTNoisePredictor rewrites parameter chunking to stream per-layer trainable tensors by reference instead of flattening, exposes WasForwarded state, and FluxSchnellModel clone uses safe chunking path only after forward pass. Tests use allocation-free chunk round-trip validation, serialize foundation-scale execution via shared xUnit collection, and switch to float + chunk assertions.

Changes

Deferred weight allocation infrastructure

Layer / File(s) Summary
GroupedQueryAttentionLayer deferred state and constructor
src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs
Adds _weightsDeferred volatile flag and _materializeLock for thread-safe one-shot initialization. Extends constructor with bool deferAllocation = false parameter. Reworks initialization to create zero-sized placeholder tensors (skipping InitializeParameters) when deferred, or full-shaped tensors (calling InitializeParameters) when eager. Adds private EnsureWeightsMaterialized() to perform one-time allocation, initialization, and state flip under lock.
GroupedQueryAttentionLayer ParameterCount and materialization guards
src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs
Updates ParameterCount override to compute exact parameter total from dimensional fields when weights are deferred, rather than reading uninitialized tensor lengths. Guards Forward, UpdateParameters, GetParameters, GetParameterGradients, and SetParameters with EnsureWeightsMaterialized() calls to ensure weights are always materialized before any tensor-dependent operation.
LayerBase materialization and in-place copy helpers
src/NeuralNetworks/Layers/LayerBase.cs
Adds public MaterializeParameters() to force lazy weight allocation when !IsInitialized and IsShapeResolved. Adds public virtual CopyTrainableParametersFrom(IReadOnlyList<Tensor<T>>) to copy values into existing trainable tensors by reference (no rebinding), validating tensor count/length, copying data via spans, and invalidating destination persistent tensors so the GPU engine re-uploads updated values.
FlagDiTPredictor wires deferAllocation: true
src/Diffusion/NoisePredictors/FlagDiTPredictor.cs
InitializeLayers passes deferAllocation: true to all 32 GroupedQueryAttentionLayer<T> constructor calls, deferring large attention weight allocation from model construction time to first forward pass, preventing constructor timeout and excessive eager memory consumption.

Foundation-scale parameter serialization and cloning

Layer / File(s) Summary
MMDiTNoisePredictor WasForwarded property and chunk-based parameter transfer
src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs
Adds public WasForwarded property backed by _patchEmbed.IsInitialized to indicate lazy weight materialization. Rewrites GetParameterChunks() to materialize per-layer lazy weights (for LayerBase<T> only), then yield each trainable tensor from GetTrainableParameters() by reference, skipping empty tensors—replacing old per-layer GetParameters() → flat-vector-wrapping approach. Rewrites SetParameterChunks() to materialize each LayerBase<T> layer, then copy incoming chunks into trainable tensors in place via CopyTrainableParametersFrom, preserving alignment for empty trainable slots—replacing old ParameterCount-based single-chunk-per-layer SetParameters(toVector()) flow.
FluxSchnellModel clone safe lazy-preserving copy
src/Diffusion/FastGeneration/FluxSchnellModel.cs
Changes Clone() to conditionally copy predictor weights only when _predictor.WasForwarded is true, using safer chunking path via SetParameterChunks(_predictor.GetParameterChunks()) instead of unconditional SetParameters(GetParameters()) round-trip. Avoids int-overflow "Array dimensions exceeded supported range" by not forcing materialization of un-forwarded weights into a single flat vector.

Test infrastructure, collection organization, and contract validation

Layer / File(s) Summary
DiffusionUnitTestBase AssertParameterChunksRoundTrip validation
tests/AiDotNet.Tests/UnitTests/Diffusion/DiffusionUnitTestBase.cs
Updates using directives to import System.Collections.Generic and AiDotNet.Tensors.LinearAlgebra. Implements AssertParameterChunksRoundTrip(getChunks, setChunks) as deterministic allocation-free verification: writes global-index ramp directly into chunks via chunk.Data.Span, applies setChunks(getChunks()), reads back through getChunks, and per-element compares against deterministic mapping. Asserts non-empty chunk exposure and verified element count equals originally written total, supporting validation of foundation-scale models without large intermediate allocations.
Foundation-scale test collection and model parameter updates
tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs, tests/AiDotNet.Tests/UnitTests/Diffusion/Models/NewConditionerContractTests.cs
Both test classes add [Collection("DiffusionFoundationScaleSerial")] attribute to force sequential execution within a shared collection, preventing concurrent OOM during foundation-scale model materialization. FastGenContractTests updates FluxSchnellModel_Clone and Flux2Model_GetSetParameters_RoundTrips: switches models from double to float, increases timeouts to 10 minutes, replaces manual chunk generation/assertions with AssertParameterChunksRoundTrip. NewConditionerContractTests updates MMDiTXNoisePredictor and FluxDoubleStreamPredictor parameter tests: switches from double GetParameters/SetParameters to float GetParameterChunks/SetParameterChunks, increases timeout to 10 minutes, adds Task.Yield(), delegates assertions to AssertParameterChunksRoundTrip.

TCDModel step count constant

Layer / File(s) Summary
OPTIMAL_INFERENCE_STEPS constant
src/Diffusion/FastGeneration/TCDModel.cs
Introduces private constant OPTIMAL_INFERENCE_STEPS = 4 with expanded doc comment clarifying paper-optimal few-step sampling count for the TCD distilled model. Assigns constant to DefaultInferenceSteps in constructor instead of hardcoded value. Replaces literal 4 in "optimal_steps" metadata property with named constant, establishing single source of truth for configuration.

Sequence Diagram(s)

sequenceDiagram
    participant Model as FlagDiTPredictor
    participant GQA as GroupedQueryAttentionLayer[i]
    participant LayerMem as Tensor Storage
    Model->>GQA: new(..., deferAllocation: true)
    GQA->>LayerMem: allocate zero-sized placeholders
    GQA-->>Model: ready (weights deferred)
    Model->>Model: forward pass
    Model->>GQA: Forward()
    GQA->>GQA: EnsureWeightsMaterialized()
    GQA->>LayerMem: allocate real-sized tensors
    LayerMem-->>GQA: weights materialized
    GQA->>LayerMem: compute attention
    LayerMem-->>GQA: output
    GQA-->>Model: forward result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • Foundation-scale model CI: eliminate all timeouts/OOM via stacked acceleration (tracker) #1622: The deferAllocation parameter added to GroupedQueryAttentionLayer, materialization helpers in LayerBase, and the chunk-based parameter serialization in MMDiTNoisePredictor directly realize the deferred-allocation and per-tensor chunking acceleration levers described in that tracker issue for eliminating foundation-scale model CI timeouts and OOM; serial test collection organization also prevents concurrent large-model materialization as discussed.

Possibly related PRs

  • ooples/AiDotNet#1633: Both the main PR and retrieved PR modify the per-layer chunk-based parameter transfer logic for DiT-style noise predictors (notably MMDiTNoisePredictor's GetParameterChunks/SetParameterChunks), extending the same streaming/chunking infrastructure to avoid large vector allocations.

Poem

🦥 No more weights born at zero hour,
The tensor shapes bloom only with power.
Thirty-two layers stay lean on their birth,
Till Forward arrives to awaken their worth.
Chunks flow like streams—no vector overflow,
Production-ready now—the tests tell us so. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and concisely captures the main objective: resolving timeout issues (Issue #1671) in TCD clone and default construction by optimizing step counts and lazy allocation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/1671-tcd-clone-default-ctor

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`:
- Around line 234-246: The EnsureWeightsMaterialized method has a race condition
where the _weightsDeferred flag is cleared before the weight tensors are
actually allocated, creating a window where concurrent calls could bypass
initialization. Either add thread-safety documentation to the
GroupedQueryAttentionLayer class (similar to MultiHeadAttentionLayer and
EmbeddingLayer) explicitly stating the layer is not thread-safe and assumes
single-threaded usage within a FlagDiTPredictor instance, or protect the entire
materialization logic in EnsureWeightsMaterialized with a lock to ensure
atomicity between clearing the flag and allocating all tensors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 90bf1b8a-3b08-48d5-adb1-e383c544717e

📥 Commits

Reviewing files that changed from the base of the PR and between 264e75c and 2bf339d.

📒 Files selected for processing (3)
  • src/Diffusion/FastGeneration/TCDModel.cs
  • src/Diffusion/NoisePredictors/FlagDiTPredictor.cs
  • src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs

Comment thread src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs
…review)

CodeRabbit flagged a latent race in EnsureWeightsMaterialized: the _weightsDeferred
flag was cleared before the weight tensors were allocated, so a concurrent Forward on
the same instance could observe zero-sized weights. Not exploited today (CheckpointBlocks,
weight streaming, and diffusion sampling are all sequential), but it should be correct by
construction rather than rely on callers staying single-threaded.

Make _weightsDeferred volatile and guard the one-time materialization with double-checked
locking, flipping the flag to false LAST (release) only after every tensor is allocated and
initialized. The fast path stays lock-free (volatile acquire read), so already-materialized
forwards pay nothing; the lock is only ever taken on the first touch and is uncontended in
the sequential paths that exist today.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs (1)

643-661: 🩺 Stability & Availability | 🟠 Major

Public weight getters bypass EnsureWeightsMaterialized() and deliver zero-sized placeholders to production code.

Forward, GetParameters, SetParameters, GetParameterGradients, and UpdateParameters are all guarded by EnsureWeightsMaterialized(), but the public accessors GetQueryWeights, GetKeyWeights, GetValueWeights, and GetOutputWeights are not. For layers constructed with deferAllocation: true (all 32 attention blocks in FlagDiTPredictor), these getters return the [0]-shaped placeholder tensors until something else triggers materialization.

This matters because these getters are called in real production paths:

  • QuantizedAttentionLayer.cs (lines 99–102, 135–138) directly feeds these tensors into quantization logic with no shape validation — zero-sized tensors silently produce incorrect quantized weights.
  • TransformerDecoderLayer.cs (1272–1275) and TransformerEncoderLayer.cs (1121–1124) call these getters for computation graph construction; null checks cannot catch zero-shaped tensors.

Guard each getter through EnsureWeightsMaterialized() to prevent external consumers from observing uninitialized placeholder tensors:

Proposed fix
-    public Tensor<T> GetQueryWeights() => _queryWeights;
+    public Tensor<T> GetQueryWeights() { EnsureWeightsMaterialized(); return _queryWeights; }

-    public Tensor<T> GetKeyWeights() => _keyWeights;
+    public Tensor<T> GetKeyWeights() { EnsureWeightsMaterialized(); return _keyWeights; }

-    public Tensor<T> GetValueWeights() => _valueWeights;
+    public Tensor<T> GetValueWeights() { EnsureWeightsMaterialized(); return _valueWeights; }

-    public Tensor<T> GetOutputWeights() => _outputWeights;
+    public Tensor<T> GetOutputWeights() { EnsureWeightsMaterialized(); return _outputWeights; }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs` around lines 643 -
661, The public getter methods GetQueryWeights, GetKeyWeights, GetValueWeights,
and GetOutputWeights do not call EnsureWeightsMaterialized() before returning
the weight tensors, which means they can return uninitialized zero-sized
placeholder tensors when the layer is constructed with deferred allocation. Add
a call to EnsureWeightsMaterialized() at the start of each getter method to
ensure the weights are materialized before they are returned, matching the
pattern used in other methods like Forward and GetParameters. This will prevent
external consumers in QuantizedAttentionLayer and the Transformer layers from
receiving invalid placeholder tensors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`:
- Around line 643-661: The public getter methods GetQueryWeights, GetKeyWeights,
GetValueWeights, and GetOutputWeights do not call EnsureWeightsMaterialized()
before returning the weight tensors, which means they can return uninitialized
zero-sized placeholder tensors when the layer is constructed with deferred
allocation. Add a call to EnsureWeightsMaterialized() at the start of each
getter method to ensure the weights are materialized before they are returned,
matching the pattern used in other methods like Forward and GetParameters. This
will prevent external consumers in QuantizedAttentionLayer and the Transformer
layers from receiving invalid placeholder tensors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 84836efa-e293-4e61-ad35-efc51c7f8af4

📥 Commits

Reviewing files that changed from the base of the PR and between 2bf339d and 3f73d93.

📒 Files selected for processing (1)
  • src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs

…MDiT models

The FastGen/Conditioner unit shard's foundation-scale parameter round-trip + clone
contract tests OOM'd or timed out: FLUX/MMDiT predictors have >2.1B parameters, so a
flat GetParameters() builds a List<T>/Vector<T> that overflows the max array element
count ("Array dimensions exceeded supported range"), and the streaming chunk path
copied each layer's params into a transient multi-GB Vector<T> per pass — GC-thrashing
the heap to >10min and an OutOfMemoryException.

Make GetParameterChunks/SetParameterChunks allocation-free:
- LayerBase.MaterializeParameters() forces lazy weight allocation (the same EnsureInitialized
  the first forward runs) so callers can stream resident tensors by reference.
- LayerBase.CopyTrainableParametersFrom() span-copies values INTO the existing tensors and
  invalidates each persistent-tensor cache — the aliasing-free, allocation-free counterpart
  to SetTrainableParameters (rebinding would alias a clone's weights onto the source's).
- MMDiTNoisePredictor.GetParameterChunks now yields each layer's resident trainable tensors
  by reference (one chunk per tensor) after materializing; SetParameterChunks copies in place.
  Consumers (Clone, LatentDiffusionModelBase) pair get/set and count chunks dynamically, so the
  per-tensor framing stays consistent. No serialization path uses the chunk API, so the flat
  GetParameters/Serialize path is unchanged.
- FluxSchnellModel.Clone is now lazy-preserving (MMDiTNoisePredictor.WasForwarded): a
  never-forwarded foundation model copies nothing, so the clone stays lazy instead of
  materializing the predictor twice and OOMing.

Tests: the FLUX/MMDiT round-trip + clone contract tests use FP32 (a materialized FP64 model
is ~34GB and OOMs the 16GB runner) and share one serial collection (one multi-GB model resident
at a time); the round-trip helper writes/reads the ramp in place via the zero-copy chunks, never
allocating a replacement aggregate. Full FastGen/Conditioner shard: 80/80 green (was 4 OOM/timeout)
— Flux2 4m22s, FluxDoubleStream 4m33s, MMDiTX 38s, FluxSchnell clone 4s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Diffusion/FastGeneration/FluxSchnellModel.cs`:
- Around line 142-150: The clone created on line 142 receives a fresh
RandomGenerator.Next() seed, but when the weight copy is skipped in the if
(_predictor.WasForwarded) branch, the clone retains this different seed and will
materialize different lazy weights later, breaking Clone() equivalence. In the
skipped-copy branch (when WasForwarded is false), copy the predictor's lazy
initialization state/configuration or seed from the source predictor to the
clone instead of leaving it with the newly generated seed, ensuring the clone
will materialize identically lazy weights when needed.

In `@src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs`:
- Around line 1198-1203: The WasForwarded property is implementation detail
(materialization state) and should not be exposed as public API. Change the
visibility of WasForwarded from public to internal, rename it to better reflect
that it exposes materialization state rather than forwarded behavior (consider
something like IsMaterialized or HasMaterializedWeights based on the underlying
_patchEmbed.IsInitialized check), and update all internal callers within the
assembly to use the new internal property name.

In `@src/NeuralNetworks/Layers/LayerBase.cs`:
- Around line 477-483: The MaterializeParameters() method in LayerBase should be
marked as internal rather than public since it is a parameter-materialization
helper method intended for internal streaming/clone operations and not for
user-facing APIs. Change the access modifier of the MaterializeParameters()
method from public to internal. Additionally, review the code at lines 3548-3562
to identify any similar helper or plumbing methods that should also be changed
from public to internal visibility to maintain the proper API surface where only
AiModelBuilder and AiModelResult are user-facing entry points.
- Line 481: The GroupedQueryAttentionLayer class does not override the
IsInitialized property, causing it to inherit the default true value from
LayerBase, which prevents MaterializeParameters() from invoking
EnsureInitialized() and skipping weight materialization. To fix this, override
the IsInitialized property in GroupedQueryAttentionLayer to return
!_weightsDeferred, and override the EnsureInitialized() method to call
EnsureWeightsMaterialized(), ensuring that MaterializeParameters() correctly
triggers the deferred weight materialization path. Alternatively, add
EnsureWeightsMaterialized() calls at the beginning of all public getter methods
such as GetQueryWeights(), GetKeyWeights(), GetValueWeights(), and
GetOutputWeights() to ensure weights are always materialized before being
returned to callers.

In `@tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs`:
- Around line 576-585: The method FluxSchnellModel_Clone_CreatesIndependentCopy
is declared as async Task with a Timeout attribute but contains no await
statements, making the timeout enforcement ineffective since xUnit requires an
await point to enforce the cancellation token. Add await Task.Yield(); as the
first statement in the FluxSchnellModel_Clone_CreatesIndependentCopy method body
to establish the necessary async suspension point that allows the timeout
mechanism to work properly.
- Around line 14-18: The Collection attribute name in FastGenContractTests.cs
does not match the defined collection name. Change the collection attribute from
[Collection("DiffusionFoundationScaleSerial")] to
[Collection("FoundationScaleSerial")] to match the actual CollectionDefinition
with DisableParallelization set to true in FoundationScaleSerialCollection.cs.
Apply the same correction to NewConditionerContractTests.cs to ensure both test
classes use the correct defined serial collection, preventing tests from running
in parallel.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8e7d6bc2-1f82-415d-a2d5-fe74fc1ac960

📥 Commits

Reviewing files that changed from the base of the PR and between 3f73d93 and 626c8d8.

📒 Files selected for processing (6)
  • src/Diffusion/FastGeneration/FluxSchnellModel.cs
  • src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs
  • src/NeuralNetworks/Layers/LayerBase.cs
  • tests/AiDotNet.Tests/UnitTests/Diffusion/DiffusionUnitTestBase.cs
  • tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs
  • tests/AiDotNet.Tests/UnitTests/Diffusion/Models/NewConditionerContractTests.cs

Comment thread src/Diffusion/FastGeneration/FluxSchnellModel.cs Outdated
Comment thread src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs Outdated
Comment thread src/NeuralNetworks/Layers/LayerBase.cs Outdated
Comment thread src/NeuralNetworks/Layers/LayerBase.cs
Comment thread tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs Outdated
…tion

CodeRabbit review on PR #1677.

- FluxSchnellModel.Clone: reuse the model's resolved construction seed (not a
  fresh one) so a never-forwarded — still lazy — predictor materializes
  identical weights in the clone, keeping Clone() equivalent.
- MMDiTNoisePredictor: WasForwarded -> internal WeightsMaterialized (it is
  clone/streaming materialization plumbing, not public model behavior).
- LayerBase: MaterializeParameters / CopyTrainableParametersFrom -> internal
  so this PR does not permanently expand the public layer contract.
- GroupedQueryAttentionLayer: route deferred-weight materialization through a
  new EnsureParametersMaterialized hook (the [TrainableParameter] generator
  owns EnsureInitialized, so it cannot be overridden), override IsInitialized,
  and materialize in the public weight getters — so MaterializeParameters() and
  the chunk-streaming path read real weights, not zero-length placeholders.
- Tests: use the existing [Collection("FoundationScaleSerial")] (the previous
  name had no matching definition, so the foundation-scale tests ran in
  parallel and risked OOM on the 16 GB runner); add await Task.Yield() so the
  FluxSchnell clone test's [Fact(Timeout)] is actually enforced.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant