fix(diffusion): resolve #1671 TCD clone + default-construction timeouts by ooples · Pull Request #1677 · ooples/AiDotNet

ooples · 2026-06-23T19:41:02Z

Fixes the two failures called out in #1671 (clean run 28023414937, master @ 01a4ddb4). Both turned out to be 120 s test timeouts, not the clone-equivalence divergence the issue hypothesized for bug 1 — confirmed by reading the actual CI logs (the failing test's error was Test execution timed out after 120000 milliseconds, and the adjacent Predict_ShouldBeDeterministic passed at 1 m 28 s).

Bug 1 — `TCDModelTests.Clone_ShouldProduceIdenticalOutput` (ModelFamily Diffusion T-Z)

TCD is a trajectory-consistency distillation model built for few-step sampling — its own metadata already declared optimal_steps = 4, and the paper reports 4 as the sweet spot. But the ctor never set DefaultInferenceSteps, so Predict ran the generic 10-step DDIM default (~2.5× the intended compute). The Clone test runs two SD-1.5-scale UNet+VAE predicts (original + clone), which blew past the 120 s budget under CI load.

Fix: pin DefaultInferenceSteps = 4 (the model's own optimal), matching every other few-step model in FastGeneration/ (ConsistencyModel = 2, FluxSchnell = 4, …). This is the paper-faithful default, not a test accommodation; clone equivalence is preserved (original and clone use the same step count).

Bug 2 — `DefaultConstructionTests.AllDefaultConstructableModels_ShouldConstructWithoutException` (Integration D)

Two models exceeded the 10 s/ctor budget: LuminaImage2Model and FlagDiTPredictor. FlagDiTPredictor already builds its FFN/adaLN via the lazy DenseLayer path (NoisePredictorBase.LazyDense, zero-sized until first forward) so a foundation-scale stack constructs cheaply and streams weights at forward — but its 32 GroupedQueryAttentionLayer blocks were still eagerly allocated (~1.3 B attention weights, gigabytes, before a single forward), defeating that design.

Fix: add an opt-in deferAllocation path to GroupedQueryAttentionLayer that mirrors the lazy DenseLayer contract — projection weights stay zero-sized at construction and materialize (allocate + initialize) on first use, guarded at every weight-access entry point (Forward / GetParameters / SetParameters / GetParameterGradients / UpdateParameters). ParameterCount is derived exactly from the dimension fields while deferred, so a parent predictor's CalculateParameterCount stays correct without forcing allocation. FlagDiTPredictor opts in; eager callers (the default) are completely unchanged.

Verification (local, `AIDOTNET_INFERENCE_ARENA=0`, CPU)

Test	Before	After
`TCDModelTests.Clone_ShouldProduceIdenticalOutput`	timeout (>120 s)	✅ ~56 s
`TCDModelTests.Predict_ShouldBeDeterministic`	~88 s (CI)	✅ ~21 s
`DefaultConstructionTests…ShouldConstructWithoutException`	2 TIMEOUT	✅ 824 OK / 0 TIMEOUT (~14 s)
`GroupedQueryAttentionLayerTests` + `GQAInferenceIntegrationTests` (eager path)	—	✅ 42/42
`LuminaImage2ModelTests` `Scheduler` + `Parameters_ShouldBeNonEmpty`	—	✅ pass

Scope note

The TCD class also has Training_ShouldReducePredictionError and ForwardPass_ShouldBeFinite_AfterTraining failures, and LuminaImage2ModelTests.Metadata_ShouldExist — these are training-scale timeouts (each trains a full SD/foundation-scale model) that are not affected by either fix here and are out of #1671's scope (they belong to the broader training/perf inventory, #1624). This PR resolves exactly the two tests #1671 named.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

Performance Improvements
- Improved startup and responsiveness by deferring allocation of large attention weights until they’re required.
- Reduced memory and allocation overhead during parameter handling by switching to streaming, chunk-based workflows for large diffusion models.
Bug Fixes
- Fixed clone behavior for lazily-initialized predictor weights to avoid unsafe parameter round-trips.
- Kept “optimal_steps” metadata aligned with the configured few-step inference guidance.
Tests
- Strengthened large-model parameter round-trip tests to better validate chunk-based serialization and avoid CI out-of-memory issues.

TCD is a trajectory-consistency distillation model whose whole purpose is high-quality few-step generation (the paper reports 4 steps as optimal, 8 max, and the model's own metadata already declared optimal_steps=4). The ctor never set DefaultInferenceSteps, so Predict ran the generic 10-step DDIM default — ~2.5x the intended compute. Two SD-1.5-scale UNet+VAE predicts (original + clone) then blew past the 120s test budget, timing out TCDModelTests.Clone_ShouldProduceIdenticalOutput on the ModelFamily Diffusion T-Z shard. Pin DefaultInferenceSteps to the model's optimal_steps (4), matching every other few-step model in this folder (ConsistencyModel=2, FluxSchnell=4, ...). This is the paper-faithful default, not a test accommodation. Clone now passes (~56s local) and Predict_ShouldBeDeterministic drops from ~88s to ~21s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ion timeout (#1671) DefaultConstructionTests.AllDefaultConstructableModels_ShouldConstructWithoutException timed out (>10s/ctor) on LuminaImage2Model and FlagDiTPredictor. FlagDiTPredictor already builds its FFN/adaLN via the lazy DenseLayer path (NoisePredictorBase.LazyDense, zero-sized until first forward) precisely so a foundation-scale stack constructs cheaply and streams weights at forward — but the 32 GroupedQueryAttentionLayer blocks were still allocated eagerly, ~1.3 B attention weights (gigabytes, >10s) before a single forward, defeating that design. Add an opt-in deferAllocation path to GroupedQueryAttentionLayer mirroring the lazy Dense contract: projection weights stay zero-sized at construction and materialize (allocate + initialize) on first use via EnsureWeightsMaterialized, guarded at every weight-access entry point (Forward/GetParameters/SetParameters/GetParameterGradients/UpdateParameters). ParameterCount is derived exactly from the dimension fields while deferred, so a parent predictor's CalculateParameterCount is correct without forcing allocation. Eager callers (the default, deferAllocation=false) are completely unchanged. FlagDiTPredictor opts in. DefaultConstructionTests now reports 824 OK / 0 TIMEOUT (~14s); the GroupedQueryAttentionLayer unit + GQA inference suites stay green (42/42); LuminaImage2 construction and ParameterCount pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel · 2026-06-23T19:41:07Z

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments

Project	Deployment	Actions	Updated (UTC)
aidotnet_website	Ignored	Preview	Jun 23, 2026 11:21pm
aidotnet-playground-api	Ignored	Preview	Jun 23, 2026 11:21pm

coderabbitai · 2026-06-23T19:41:27Z

Warning

Review limit reached

@ooples, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 2 hours, 51 minutes, and 9 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 331e38b1-f986-47c9-9db5-208462577449

📥 Commits

Reviewing files that changed from the base of the PR and between 626c8d8 and 35074b2.

📒 Files selected for processing (6)

src/Diffusion/FastGeneration/FluxSchnellModel.cs
src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs
src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs
src/NeuralNetworks/Layers/LayerBase.cs
tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs
tests/AiDotNet.Tests/UnitTests/Diffusion/Models/NewConditionerContractTests.cs

Walkthrough

GroupedQueryAttentionLayer<T> gains a deferAllocation constructor parameter enabling zero-sized placeholder tensors at construction, with one-shot materialization on first use via EnsureWeightsMaterialized(). LayerBase adds MaterializeParameters() and CopyTrainableParametersFrom() to support deferred patterns. FlagDiTPredictor opts all 32 attention blocks into deferred mode. MMDiTNoisePredictor rewrites parameter chunking to stream per-layer trainable tensors by reference instead of flattening, exposes WasForwarded state, and FluxSchnellModel clone uses safe chunking path only after forward pass. Tests use allocation-free chunk round-trip validation, serialize foundation-scale execution via shared xUnit collection, and switch to float + chunk assertions.

Changes

Deferred weight allocation infrastructure

Layer / File(s)	Summary
GroupedQueryAttentionLayer deferred state and constructor `src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`	Adds `_weightsDeferred` volatile flag and `_materializeLock` for thread-safe one-shot initialization. Extends constructor with `bool deferAllocation = false` parameter. Reworks initialization to create zero-sized placeholder tensors (skipping `InitializeParameters`) when deferred, or full-shaped tensors (calling `InitializeParameters`) when eager. Adds private `EnsureWeightsMaterialized()` to perform one-time allocation, initialization, and state flip under lock.
GroupedQueryAttentionLayer ParameterCount and materialization guards `src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`	Updates `ParameterCount` override to compute exact parameter total from dimensional fields when weights are deferred, rather than reading uninitialized tensor lengths. Guards `Forward`, `UpdateParameters`, `GetParameters`, `GetParameterGradients`, and `SetParameters` with `EnsureWeightsMaterialized()` calls to ensure weights are always materialized before any tensor-dependent operation.
LayerBase materialization and in-place copy helpers `src/NeuralNetworks/Layers/LayerBase.cs`	Adds public `MaterializeParameters()` to force lazy weight allocation when `!IsInitialized` and `IsShapeResolved`. Adds public virtual `CopyTrainableParametersFrom(IReadOnlyList<Tensor<T>>)` to copy values into existing trainable tensors by reference (no rebinding), validating tensor count/length, copying data via spans, and invalidating destination persistent tensors so the GPU engine re-uploads updated values.
FlagDiTPredictor wires deferAllocation: true `src/Diffusion/NoisePredictors/FlagDiTPredictor.cs`	`InitializeLayers` passes `deferAllocation: true` to all 32 `GroupedQueryAttentionLayer<T>` constructor calls, deferring large attention weight allocation from model construction time to first forward pass, preventing constructor timeout and excessive eager memory consumption.

Foundation-scale parameter serialization and cloning

Layer / File(s)	Summary
MMDiTNoisePredictor WasForwarded property and chunk-based parameter transfer `src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs`	Adds public `WasForwarded` property backed by `_patchEmbed.IsInitialized` to indicate lazy weight materialization. Rewrites `GetParameterChunks()` to materialize per-layer lazy weights (for `LayerBase<T>` only), then yield each trainable tensor from `GetTrainableParameters()` by reference, skipping empty tensors—replacing old per-layer `GetParameters()` → flat-vector-wrapping approach. Rewrites `SetParameterChunks()` to materialize each `LayerBase<T>` layer, then copy incoming chunks into trainable tensors in place via `CopyTrainableParametersFrom`, preserving alignment for empty trainable slots—replacing old `ParameterCount`-based single-chunk-per-layer `SetParameters(toVector())` flow.
FluxSchnellModel clone safe lazy-preserving copy `src/Diffusion/FastGeneration/FluxSchnellModel.cs`	Changes `Clone()` to conditionally copy predictor weights only when `_predictor.WasForwarded` is true, using safer chunking path via `SetParameterChunks(_predictor.GetParameterChunks())` instead of unconditional `SetParameters(GetParameters())` round-trip. Avoids int-overflow "Array dimensions exceeded supported range" by not forcing materialization of un-forwarded weights into a single flat vector.

Test infrastructure, collection organization, and contract validation

Layer / File(s)	Summary
DiffusionUnitTestBase AssertParameterChunksRoundTrip validation `tests/AiDotNet.Tests/UnitTests/Diffusion/DiffusionUnitTestBase.cs`	Updates using directives to import `System.Collections.Generic` and `AiDotNet.Tensors.LinearAlgebra`. Implements `AssertParameterChunksRoundTrip(getChunks, setChunks)` as deterministic allocation-free verification: writes global-index ramp directly into chunks via `chunk.Data.Span`, applies `setChunks(getChunks())`, reads back through `getChunks`, and per-element compares against deterministic mapping. Asserts non-empty chunk exposure and verified element count equals originally written total, supporting validation of foundation-scale models without large intermediate allocations.
Foundation-scale test collection and model parameter updates `tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs`, `tests/AiDotNet.Tests/UnitTests/Diffusion/Models/NewConditionerContractTests.cs`	Both test classes add `[Collection("DiffusionFoundationScaleSerial")]` attribute to force sequential execution within a shared collection, preventing concurrent OOM during foundation-scale model materialization. `FastGenContractTests` updates `FluxSchnellModel_Clone` and `Flux2Model_GetSetParameters_RoundTrips`: switches models from `double` to `float`, increases timeouts to 10 minutes, replaces manual chunk generation/assertions with `AssertParameterChunksRoundTrip`. `NewConditionerContractTests` updates `MMDiTXNoisePredictor` and `FluxDoubleStreamPredictor` parameter tests: switches from `double` `GetParameters`/`SetParameters` to `float` `GetParameterChunks`/`SetParameterChunks`, increases timeout to 10 minutes, adds `Task.Yield()`, delegates assertions to `AssertParameterChunksRoundTrip`.

TCDModel step count constant

Layer / File(s)	Summary
OPTIMAL_INFERENCE_STEPS constant `src/Diffusion/FastGeneration/TCDModel.cs`	Introduces private constant `OPTIMAL_INFERENCE_STEPS = 4` with expanded doc comment clarifying paper-optimal few-step sampling count for the TCD distilled model. Assigns constant to `DefaultInferenceSteps` in constructor instead of hardcoded value. Replaces literal `4` in `"optimal_steps"` metadata property with named constant, establishing single source of truth for configuration.

Sequence Diagram(s)

sequenceDiagram
    participant Model as FlagDiTPredictor
    participant GQA as GroupedQueryAttentionLayer[i]
    participant LayerMem as Tensor Storage
    Model->>GQA: new(..., deferAllocation: true)
    GQA->>LayerMem: allocate zero-sized placeholders
    GQA-->>Model: ready (weights deferred)
    Model->>Model: forward pass
    Model->>GQA: Forward()
    GQA->>GQA: EnsureWeightsMaterialized()
    GQA->>LayerMem: allocate real-sized tensors
    LayerMem-->>GQA: weights materialized
    GQA->>LayerMem: compute attention
    LayerMem-->>GQA: output
    GQA-->>Model: forward result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Foundation-scale model CI: eliminate all timeouts/OOM via stacked acceleration (tracker) #1622: The deferAllocation parameter added to GroupedQueryAttentionLayer, materialization helpers in LayerBase, and the chunk-based parameter serialization in MMDiTNoisePredictor directly realize the deferred-allocation and per-tensor chunking acceleration levers described in that tracker issue for eliminating foundation-scale model CI timeouts and OOM; serial test collection organization also prevents concurrent large-model materialization as discussed.

Possibly related PRs

ooples/AiDotNet#1633: Both the main PR and retrieved PR modify the per-layer chunk-based parameter transfer logic for DiT-style noise predictors (notably MMDiTNoisePredictor's GetParameterChunks/SetParameterChunks), extending the same streaming/chunking infrastructure to avoid large vector allocations.

Poem

🦥 No more weights born at zero hour,
The tensor shapes bloom only with power.
Thirty-two layers stay lean on their birth,
Till Forward arrives to awaken their worth.
Chunks flow like streams—no vector overflow,
Production-ready now—the tests tell us so. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly and concisely captures the main objective: resolving timeout issues (Issue `#1671`) in TCD clone and default construction by optimizing step counts and lazy allocation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/1671-tcd-clone-default-ctor

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`:
- Around line 234-246: The EnsureWeightsMaterialized method has a race condition
where the _weightsDeferred flag is cleared before the weight tensors are
actually allocated, creating a window where concurrent calls could bypass
initialization. Either add thread-safety documentation to the
GroupedQueryAttentionLayer class (similar to MultiHeadAttentionLayer and
EmbeddingLayer) explicitly stating the layer is not thread-safe and assumes
single-threaded usage within a FlagDiTPredictor instance, or protect the entire
materialization logic in EnsureWeightsMaterialized with a lock to ensure
atomicity between clearing the flag and allocating all tensors.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 90bf1b8a-3b08-48d5-adb1-e383c544717e

📥 Commits

Reviewing files that changed from the base of the PR and between 264e75c and 2bf339d.

📒 Files selected for processing (3)

src/Diffusion/FastGeneration/TCDModel.cs
src/Diffusion/NoisePredictors/FlagDiTPredictor.cs
src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs

…review) CodeRabbit flagged a latent race in EnsureWeightsMaterialized: the _weightsDeferred flag was cleared before the weight tensors were allocated, so a concurrent Forward on the same instance could observe zero-sized weights. Not exploited today (CheckpointBlocks, weight streaming, and diffusion sampling are all sequential), but it should be correct by construction rather than rely on callers staying single-threaded. Make _weightsDeferred volatile and guard the one-time materialization with double-checked locking, flipping the flag to false LAST (release) only after every tensor is allocated and initialized. The fast path stays lock-free (volatile acquire read), so already-materialized forwards pay nothing; the lock is only ever taken on the first touch and is uncontended in the sequential paths that exist today. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs (1)
643-661: 🩺 Stability & Availability | 🟠 Major

Public weight getters bypass EnsureWeightsMaterialized() and deliver zero-sized placeholders to production code.

Forward, GetParameters, SetParameters, GetParameterGradients, and UpdateParameters are all guarded by EnsureWeightsMaterialized(), but the public accessors GetQueryWeights, GetKeyWeights, GetValueWeights, and GetOutputWeights are not. For layers constructed with deferAllocation: true (all 32 attention blocks in FlagDiTPredictor), these getters return the [0]-shaped placeholder tensors until something else triggers materialization.

This matters because these getters are called in real production paths:

QuantizedAttentionLayer.cs (lines 99–102, 135–138) directly feeds these tensors into quantization logic with no shape validation — zero-sized tensors silently produce incorrect quantized weights.

TransformerDecoderLayer.cs (1272–1275) and TransformerEncoderLayer.cs (1121–1124) call these getters for computation graph construction; null checks cannot catch zero-shaped tensors.

Guard each getter through EnsureWeightsMaterialized() to prevent external consumers from observing uninitialized placeholder tensors:
Proposed fix
-    public Tensor<T> GetQueryWeights() => _queryWeights;
+    public Tensor<T> GetQueryWeights() { EnsureWeightsMaterialized(); return _queryWeights; }

-    public Tensor<T> GetKeyWeights() => _keyWeights;
+    public Tensor<T> GetKeyWeights() { EnsureWeightsMaterialized(); return _keyWeights; }

-    public Tensor<T> GetValueWeights() => _valueWeights;
+    public Tensor<T> GetValueWeights() { EnsureWeightsMaterialized(); return _valueWeights; }

-    public Tensor<T> GetOutputWeights() => _outputWeights;
+    public Tensor<T> GetOutputWeights() { EnsureWeightsMaterialized(); return _outputWeights; }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs` around lines 643 -
661, The public getter methods GetQueryWeights, GetKeyWeights, GetValueWeights,
and GetOutputWeights do not call EnsureWeightsMaterialized() before returning
the weight tensors, which means they can return uninitialized zero-sized
placeholder tensors when the layer is constructed with deferred allocation. Add
a call to EnsureWeightsMaterialized() at the start of each getter method to
ensure the weights are materialized before they are returned, matching the
pattern used in other methods like Forward and GetParameters. This will prevent
external consumers in QuantizedAttentionLayer and the Transformer layers from
receiving invalid placeholder tensors.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`:
- Around line 643-661: The public getter methods GetQueryWeights, GetKeyWeights,
GetValueWeights, and GetOutputWeights do not call EnsureWeightsMaterialized()
before returning the weight tensors, which means they can return uninitialized
zero-sized placeholder tensors when the layer is constructed with deferred
allocation. Add a call to EnsureWeightsMaterialized() at the start of each
getter method to ensure the weights are materialized before they are returned,
matching the pattern used in other methods like Forward and GetParameters. This
will prevent external consumers in QuantizedAttentionLayer and the Transformer
layers from receiving invalid placeholder tensors.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 84836efa-e293-4e61-ad35-efc51c7f8af4

📥 Commits

Reviewing files that changed from the base of the PR and between 2bf339d and 3f73d93.

📒 Files selected for processing (1)

src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs

…MDiT models The FastGen/Conditioner unit shard's foundation-scale parameter round-trip + clone contract tests OOM'd or timed out: FLUX/MMDiT predictors have >2.1B parameters, so a flat GetParameters() builds a List<T>/Vector<T> that overflows the max array element count ("Array dimensions exceeded supported range"), and the streaming chunk path copied each layer's params into a transient multi-GB Vector<T> per pass — GC-thrashing the heap to >10min and an OutOfMemoryException. Make GetParameterChunks/SetParameterChunks allocation-free: - LayerBase.MaterializeParameters() forces lazy weight allocation (the same EnsureInitialized the first forward runs) so callers can stream resident tensors by reference. - LayerBase.CopyTrainableParametersFrom() span-copies values INTO the existing tensors and invalidates each persistent-tensor cache — the aliasing-free, allocation-free counterpart to SetTrainableParameters (rebinding would alias a clone's weights onto the source's). - MMDiTNoisePredictor.GetParameterChunks now yields each layer's resident trainable tensors by reference (one chunk per tensor) after materializing; SetParameterChunks copies in place. Consumers (Clone, LatentDiffusionModelBase) pair get/set and count chunks dynamically, so the per-tensor framing stays consistent. No serialization path uses the chunk API, so the flat GetParameters/Serialize path is unchanged. - FluxSchnellModel.Clone is now lazy-preserving (MMDiTNoisePredictor.WasForwarded): a never-forwarded foundation model copies nothing, so the clone stays lazy instead of materializing the predictor twice and OOMing. Tests: the FLUX/MMDiT round-trip + clone contract tests use FP32 (a materialized FP64 model is ~34GB and OOMs the 16GB runner) and share one serial collection (one multi-GB model resident at a time); the round-trip helper writes/reads the ramp in place via the zero-copy chunks, never allocating a replacement aggregate. Full FastGen/Conditioner shard: 80/80 green (was 4 OOM/timeout) — Flux2 4m22s, FluxDoubleStream 4m33s, MMDiTX 38s, FluxSchnell clone 4s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Diffusion/FastGeneration/FluxSchnellModel.cs`:
- Around line 142-150: The clone created on line 142 receives a fresh
RandomGenerator.Next() seed, but when the weight copy is skipped in the if
(_predictor.WasForwarded) branch, the clone retains this different seed and will
materialize different lazy weights later, breaking Clone() equivalence. In the
skipped-copy branch (when WasForwarded is false), copy the predictor's lazy
initialization state/configuration or seed from the source predictor to the
clone instead of leaving it with the newly generated seed, ensuring the clone
will materialize identically lazy weights when needed.

In `@src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs`:
- Around line 1198-1203: The WasForwarded property is implementation detail
(materialization state) and should not be exposed as public API. Change the
visibility of WasForwarded from public to internal, rename it to better reflect
that it exposes materialization state rather than forwarded behavior (consider
something like IsMaterialized or HasMaterializedWeights based on the underlying
_patchEmbed.IsInitialized check), and update all internal callers within the
assembly to use the new internal property name.

In `@src/NeuralNetworks/Layers/LayerBase.cs`:
- Around line 477-483: The MaterializeParameters() method in LayerBase should be
marked as internal rather than public since it is a parameter-materialization
helper method intended for internal streaming/clone operations and not for
user-facing APIs. Change the access modifier of the MaterializeParameters()
method from public to internal. Additionally, review the code at lines 3548-3562
to identify any similar helper or plumbing methods that should also be changed
from public to internal visibility to maintain the proper API surface where only
AiModelBuilder and AiModelResult are user-facing entry points.
- Line 481: The GroupedQueryAttentionLayer class does not override the
IsInitialized property, causing it to inherit the default true value from
LayerBase, which prevents MaterializeParameters() from invoking
EnsureInitialized() and skipping weight materialization. To fix this, override
the IsInitialized property in GroupedQueryAttentionLayer to return
!_weightsDeferred, and override the EnsureInitialized() method to call
EnsureWeightsMaterialized(), ensuring that MaterializeParameters() correctly
triggers the deferred weight materialization path. Alternatively, add
EnsureWeightsMaterialized() calls at the beginning of all public getter methods
such as GetQueryWeights(), GetKeyWeights(), GetValueWeights(), and
GetOutputWeights() to ensure weights are always materialized before being
returned to callers.

In `@tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs`:
- Around line 576-585: The method FluxSchnellModel_Clone_CreatesIndependentCopy
is declared as async Task with a Timeout attribute but contains no await
statements, making the timeout enforcement ineffective since xUnit requires an
await point to enforce the cancellation token. Add await Task.Yield(); as the
first statement in the FluxSchnellModel_Clone_CreatesIndependentCopy method body
to establish the necessary async suspension point that allows the timeout
mechanism to work properly.
- Around line 14-18: The Collection attribute name in FastGenContractTests.cs
does not match the defined collection name. Change the collection attribute from
[Collection("DiffusionFoundationScaleSerial")] to
[Collection("FoundationScaleSerial")] to match the actual CollectionDefinition
with DisableParallelization set to true in FoundationScaleSerialCollection.cs.
Apply the same correction to NewConditionerContractTests.cs to ensure both test
classes use the correct defined serial collection, preventing tests from running
in parallel.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8e7d6bc2-1f82-415d-a2d5-fe74fc1ac960

📥 Commits

Reviewing files that changed from the base of the PR and between 3f73d93 and 626c8d8.

📒 Files selected for processing (6)

src/Diffusion/FastGeneration/FluxSchnellModel.cs
src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs
src/NeuralNetworks/Layers/LayerBase.cs
tests/AiDotNet.Tests/UnitTests/Diffusion/DiffusionUnitTestBase.cs
tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs
tests/AiDotNet.Tests/UnitTests/Diffusion/Models/NewConditionerContractTests.cs

…tion CodeRabbit review on PR #1677. - FluxSchnellModel.Clone: reuse the model's resolved construction seed (not a fresh one) so a never-forwarded — still lazy — predictor materializes identical weights in the clone, keeping Clone() equivalent. - MMDiTNoisePredictor: WasForwarded -> internal WeightsMaterialized (it is clone/streaming materialization plumbing, not public model behavior). - LayerBase: MaterializeParameters / CopyTrainableParametersFrom -> internal so this PR does not permanently expand the public layer contract. - GroupedQueryAttentionLayer: route deferred-weight materialization through a new EnsureParametersMaterialized hook (the [TrainableParameter] generator owns EnsureInitialized, so it cannot be overridden), override IsInitialized, and materialize in the public weight getters — so MaterializeParameters() and the chunk-streaming path read real weights, not zero-length placeholders. - Tests: use the existing [Collection("FoundationScaleSerial")] (the previous name had no matching definition, so the foundation-scale tests ran in parallel and risked OOM on the 16 GB runner); add await Task.Yield() so the FluxSchnell clone test's [Fact(Timeout)] is actually enforced. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ooples and others added 2 commits June 23, 2026 15:35

vercel Bot deployed to Preview – aidotnet-playground-api June 23, 2026 19:41 View deployment

vercel Bot deployed to Preview – aidotnet_website June 23, 2026 19:41 View deployment

coderabbitai Bot requested changes Jun 23, 2026

View reviewed changes

Comment thread src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

coderabbitai Bot approved these changes Jun 23, 2026

View reviewed changes

coderabbitai Bot requested changes Jun 23, 2026

View reviewed changes

coderabbitai Bot approved these changes Jun 24, 2026

View reviewed changes

ooples merged commit 8f24939 into master Jun 24, 2026
55 of 76 checks passed

ooples deleted the fix/1671-tcd-clone-default-ctor branch June 24, 2026 12:01

ooples mentioned this pull request Jun 24, 2026

Playbook: clear remaining #1677 OOM/timeout CI shards — memory-architecture fixes (caching arenas, fused optimizer-in-backward, COW clone, weight streaming) + float as the LAST lever #1684

Open

coderabbitai Bot mentioned this pull request Jun 30, 2026

fix(modelfamily): Generated A-M residual model bugs — BornRule loss + CSPDarknet activations #1719

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(diffusion): resolve #1671 TCD clone + default-construction timeouts#1677

fix(diffusion): resolve #1671 TCD clone + default-construction timeouts#1677
ooples merged 5 commits into
masterfrom
fix/1671-tcd-clone-default-ctor

ooples commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

vercel Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

Review limit reached

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ooples commented Jun 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bug 1 — TCDModelTests.Clone_ShouldProduceIdenticalOutput (ModelFamily Diffusion T-Z)

Bug 2 — DefaultConstructionTests.AllDefaultConstructableModels_ShouldConstructWithoutException (Integration D)

Verification (local, AIDOTNET_INFERENCE_ARENA=0, CPU)

Scope note

Summary by CodeRabbit

Release Notes

Uh oh!

vercel Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ooples commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

Bug 1 — `TCDModelTests.Clone_ShouldProduceIdenticalOutput` (ModelFamily Diffusion T-Z)

Bug 2 — `DefaultConstructionTests.AllDefaultConstructableModels_ShouldConstructWithoutException` (Integration D)

Verification (local, `AIDOTNET_INFERENCE_ARENA=0`, CPU)

vercel Bot commented Jun 23, 2026 •

edited

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading