fix(diffusion): resolve #1671 TCD clone + default-construction timeouts#1677
Conversation
TCD is a trajectory-consistency distillation model whose whole purpose is high-quality few-step generation (the paper reports 4 steps as optimal, 8 max, and the model's own metadata already declared optimal_steps=4). The ctor never set DefaultInferenceSteps, so Predict ran the generic 10-step DDIM default — ~2.5x the intended compute. Two SD-1.5-scale UNet+VAE predicts (original + clone) then blew past the 120s test budget, timing out TCDModelTests.Clone_ShouldProduceIdenticalOutput on the ModelFamily Diffusion T-Z shard. Pin DefaultInferenceSteps to the model's optimal_steps (4), matching every other few-step model in this folder (ConsistencyModel=2, FluxSchnell=4, ...). This is the paper-faithful default, not a test accommodation. Clone now passes (~56s local) and Predict_ShouldBeDeterministic drops from ~88s to ~21s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ion timeout (#1671) DefaultConstructionTests.AllDefaultConstructableModels_ShouldConstructWithoutException timed out (>10s/ctor) on LuminaImage2Model and FlagDiTPredictor. FlagDiTPredictor already builds its FFN/adaLN via the lazy DenseLayer path (NoisePredictorBase.LazyDense, zero-sized until first forward) precisely so a foundation-scale stack constructs cheaply and streams weights at forward — but the 32 GroupedQueryAttentionLayer blocks were still allocated eagerly, ~1.3 B attention weights (gigabytes, >10s) before a single forward, defeating that design. Add an opt-in deferAllocation path to GroupedQueryAttentionLayer mirroring the lazy Dense contract: projection weights stay zero-sized at construction and materialize (allocate + initialize) on first use via EnsureWeightsMaterialized, guarded at every weight-access entry point (Forward/GetParameters/SetParameters/GetParameterGradients/UpdateParameters). ParameterCount is derived exactly from the dimension fields while deferred, so a parent predictor's CalculateParameterCount is correct without forcing allocation. Eager callers (the default, deferAllocation=false) are completely unchanged. FlagDiTPredictor opts in. DefaultConstructionTests now reports 824 OK / 0 TIMEOUT (~14s); the GroupedQueryAttentionLayer unit + GQA inference suites stay green (42/42); LuminaImage2 construction and ParameterCount pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 2 Skipped Deployments
|
|
Warning Review limit reached
More reviews will be available in 2 hours, 51 minutes, and 9 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (6)
Walkthrough
ChangesDeferred weight allocation infrastructure
Foundation-scale parameter serialization and cloning
Test infrastructure, collection organization, and contract validation
TCDModel step count constant
Sequence Diagram(s)sequenceDiagram
participant Model as FlagDiTPredictor
participant GQA as GroupedQueryAttentionLayer[i]
participant LayerMem as Tensor Storage
Model->>GQA: new(..., deferAllocation: true)
GQA->>LayerMem: allocate zero-sized placeholders
GQA-->>Model: ready (weights deferred)
Model->>Model: forward pass
Model->>GQA: Forward()
GQA->>GQA: EnsureWeightsMaterialized()
GQA->>LayerMem: allocate real-sized tensors
LayerMem-->>GQA: weights materialized
GQA->>LayerMem: compute attention
LayerMem-->>GQA: output
GQA-->>Model: forward result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`:
- Around line 234-246: The EnsureWeightsMaterialized method has a race condition
where the _weightsDeferred flag is cleared before the weight tensors are
actually allocated, creating a window where concurrent calls could bypass
initialization. Either add thread-safety documentation to the
GroupedQueryAttentionLayer class (similar to MultiHeadAttentionLayer and
EmbeddingLayer) explicitly stating the layer is not thread-safe and assumes
single-threaded usage within a FlagDiTPredictor instance, or protect the entire
materialization logic in EnsureWeightsMaterialized with a lock to ensure
atomicity between clearing the flag and allocating all tensors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 90bf1b8a-3b08-48d5-adb1-e383c544717e
📒 Files selected for processing (3)
src/Diffusion/FastGeneration/TCDModel.cssrc/Diffusion/NoisePredictors/FlagDiTPredictor.cssrc/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs
…review) CodeRabbit flagged a latent race in EnsureWeightsMaterialized: the _weightsDeferred flag was cleared before the weight tensors were allocated, so a concurrent Forward on the same instance could observe zero-sized weights. Not exploited today (CheckpointBlocks, weight streaming, and diffusion sampling are all sequential), but it should be correct by construction rather than rely on callers staying single-threaded. Make _weightsDeferred volatile and guard the one-time materialization with double-checked locking, flipping the flag to false LAST (release) only after every tensor is allocated and initialized. The fast path stays lock-free (volatile acquire read), so already-materialized forwards pay nothing; the lock is only ever taken on the first touch and is uncontended in the sequential paths that exist today. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs (1)
643-661: 🩺 Stability & Availability | 🟠 MajorPublic weight getters bypass
EnsureWeightsMaterialized()and deliver zero-sized placeholders to production code.
Forward,GetParameters,SetParameters,GetParameterGradients, andUpdateParametersare all guarded byEnsureWeightsMaterialized(), but the public accessorsGetQueryWeights,GetKeyWeights,GetValueWeights, andGetOutputWeightsare not. For layers constructed withdeferAllocation: true(all 32 attention blocks inFlagDiTPredictor), these getters return the[0]-shaped placeholder tensors until something else triggers materialization.This matters because these getters are called in real production paths:
QuantizedAttentionLayer.cs(lines 99–102, 135–138) directly feeds these tensors into quantization logic with no shape validation — zero-sized tensors silently produce incorrect quantized weights.TransformerDecoderLayer.cs(1272–1275) andTransformerEncoderLayer.cs(1121–1124) call these getters for computation graph construction; null checks cannot catch zero-shaped tensors.Guard each getter through
EnsureWeightsMaterialized()to prevent external consumers from observing uninitialized placeholder tensors:Proposed fix
- public Tensor<T> GetQueryWeights() => _queryWeights; + public Tensor<T> GetQueryWeights() { EnsureWeightsMaterialized(); return _queryWeights; } - public Tensor<T> GetKeyWeights() => _keyWeights; + public Tensor<T> GetKeyWeights() { EnsureWeightsMaterialized(); return _keyWeights; } - public Tensor<T> GetValueWeights() => _valueWeights; + public Tensor<T> GetValueWeights() { EnsureWeightsMaterialized(); return _valueWeights; } - public Tensor<T> GetOutputWeights() => _outputWeights; + public Tensor<T> GetOutputWeights() { EnsureWeightsMaterialized(); return _outputWeights; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs` around lines 643 - 661, The public getter methods GetQueryWeights, GetKeyWeights, GetValueWeights, and GetOutputWeights do not call EnsureWeightsMaterialized() before returning the weight tensors, which means they can return uninitialized zero-sized placeholder tensors when the layer is constructed with deferred allocation. Add a call to EnsureWeightsMaterialized() at the start of each getter method to ensure the weights are materialized before they are returned, matching the pattern used in other methods like Forward and GetParameters. This will prevent external consumers in QuantizedAttentionLayer and the Transformer layers from receiving invalid placeholder tensors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`:
- Around line 643-661: The public getter methods GetQueryWeights, GetKeyWeights,
GetValueWeights, and GetOutputWeights do not call EnsureWeightsMaterialized()
before returning the weight tensors, which means they can return uninitialized
zero-sized placeholder tensors when the layer is constructed with deferred
allocation. Add a call to EnsureWeightsMaterialized() at the start of each
getter method to ensure the weights are materialized before they are returned,
matching the pattern used in other methods like Forward and GetParameters. This
will prevent external consumers in QuantizedAttentionLayer and the Transformer
layers from receiving invalid placeholder tensors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 84836efa-e293-4e61-ad35-efc51c7f8af4
📒 Files selected for processing (1)
src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs
…MDiT models
The FastGen/Conditioner unit shard's foundation-scale parameter round-trip + clone
contract tests OOM'd or timed out: FLUX/MMDiT predictors have >2.1B parameters, so a
flat GetParameters() builds a List<T>/Vector<T> that overflows the max array element
count ("Array dimensions exceeded supported range"), and the streaming chunk path
copied each layer's params into a transient multi-GB Vector<T> per pass — GC-thrashing
the heap to >10min and an OutOfMemoryException.
Make GetParameterChunks/SetParameterChunks allocation-free:
- LayerBase.MaterializeParameters() forces lazy weight allocation (the same EnsureInitialized
the first forward runs) so callers can stream resident tensors by reference.
- LayerBase.CopyTrainableParametersFrom() span-copies values INTO the existing tensors and
invalidates each persistent-tensor cache — the aliasing-free, allocation-free counterpart
to SetTrainableParameters (rebinding would alias a clone's weights onto the source's).
- MMDiTNoisePredictor.GetParameterChunks now yields each layer's resident trainable tensors
by reference (one chunk per tensor) after materializing; SetParameterChunks copies in place.
Consumers (Clone, LatentDiffusionModelBase) pair get/set and count chunks dynamically, so the
per-tensor framing stays consistent. No serialization path uses the chunk API, so the flat
GetParameters/Serialize path is unchanged.
- FluxSchnellModel.Clone is now lazy-preserving (MMDiTNoisePredictor.WasForwarded): a
never-forwarded foundation model copies nothing, so the clone stays lazy instead of
materializing the predictor twice and OOMing.
Tests: the FLUX/MMDiT round-trip + clone contract tests use FP32 (a materialized FP64 model
is ~34GB and OOMs the 16GB runner) and share one serial collection (one multi-GB model resident
at a time); the round-trip helper writes/reads the ramp in place via the zero-copy chunks, never
allocating a replacement aggregate. Full FastGen/Conditioner shard: 80/80 green (was 4 OOM/timeout)
— Flux2 4m22s, FluxDoubleStream 4m33s, MMDiTX 38s, FluxSchnell clone 4s.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/Diffusion/FastGeneration/FluxSchnellModel.cs`:
- Around line 142-150: The clone created on line 142 receives a fresh
RandomGenerator.Next() seed, but when the weight copy is skipped in the if
(_predictor.WasForwarded) branch, the clone retains this different seed and will
materialize different lazy weights later, breaking Clone() equivalence. In the
skipped-copy branch (when WasForwarded is false), copy the predictor's lazy
initialization state/configuration or seed from the source predictor to the
clone instead of leaving it with the newly generated seed, ensuring the clone
will materialize identically lazy weights when needed.
In `@src/Diffusion/NoisePredictors/MMDiTNoisePredictor.cs`:
- Around line 1198-1203: The WasForwarded property is implementation detail
(materialization state) and should not be exposed as public API. Change the
visibility of WasForwarded from public to internal, rename it to better reflect
that it exposes materialization state rather than forwarded behavior (consider
something like IsMaterialized or HasMaterializedWeights based on the underlying
_patchEmbed.IsInitialized check), and update all internal callers within the
assembly to use the new internal property name.
In `@src/NeuralNetworks/Layers/LayerBase.cs`:
- Around line 477-483: The MaterializeParameters() method in LayerBase should be
marked as internal rather than public since it is a parameter-materialization
helper method intended for internal streaming/clone operations and not for
user-facing APIs. Change the access modifier of the MaterializeParameters()
method from public to internal. Additionally, review the code at lines 3548-3562
to identify any similar helper or plumbing methods that should also be changed
from public to internal visibility to maintain the proper API surface where only
AiModelBuilder and AiModelResult are user-facing entry points.
- Line 481: The GroupedQueryAttentionLayer class does not override the
IsInitialized property, causing it to inherit the default true value from
LayerBase, which prevents MaterializeParameters() from invoking
EnsureInitialized() and skipping weight materialization. To fix this, override
the IsInitialized property in GroupedQueryAttentionLayer to return
!_weightsDeferred, and override the EnsureInitialized() method to call
EnsureWeightsMaterialized(), ensuring that MaterializeParameters() correctly
triggers the deferred weight materialization path. Alternatively, add
EnsureWeightsMaterialized() calls at the beginning of all public getter methods
such as GetQueryWeights(), GetKeyWeights(), GetValueWeights(), and
GetOutputWeights() to ensure weights are always materialized before being
returned to callers.
In `@tests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cs`:
- Around line 576-585: The method FluxSchnellModel_Clone_CreatesIndependentCopy
is declared as async Task with a Timeout attribute but contains no await
statements, making the timeout enforcement ineffective since xUnit requires an
await point to enforce the cancellation token. Add await Task.Yield(); as the
first statement in the FluxSchnellModel_Clone_CreatesIndependentCopy method body
to establish the necessary async suspension point that allows the timeout
mechanism to work properly.
- Around line 14-18: The Collection attribute name in FastGenContractTests.cs
does not match the defined collection name. Change the collection attribute from
[Collection("DiffusionFoundationScaleSerial")] to
[Collection("FoundationScaleSerial")] to match the actual CollectionDefinition
with DisableParallelization set to true in FoundationScaleSerialCollection.cs.
Apply the same correction to NewConditionerContractTests.cs to ensure both test
classes use the correct defined serial collection, preventing tests from running
in parallel.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8e7d6bc2-1f82-415d-a2d5-fe74fc1ac960
📒 Files selected for processing (6)
src/Diffusion/FastGeneration/FluxSchnellModel.cssrc/Diffusion/NoisePredictors/MMDiTNoisePredictor.cssrc/NeuralNetworks/Layers/LayerBase.cstests/AiDotNet.Tests/UnitTests/Diffusion/DiffusionUnitTestBase.cstests/AiDotNet.Tests/UnitTests/Diffusion/Models/FastGenContractTests.cstests/AiDotNet.Tests/UnitTests/Diffusion/Models/NewConditionerContractTests.cs
…tion CodeRabbit review on PR #1677. - FluxSchnellModel.Clone: reuse the model's resolved construction seed (not a fresh one) so a never-forwarded — still lazy — predictor materializes identical weights in the clone, keeping Clone() equivalent. - MMDiTNoisePredictor: WasForwarded -> internal WeightsMaterialized (it is clone/streaming materialization plumbing, not public model behavior). - LayerBase: MaterializeParameters / CopyTrainableParametersFrom -> internal so this PR does not permanently expand the public layer contract. - GroupedQueryAttentionLayer: route deferred-weight materialization through a new EnsureParametersMaterialized hook (the [TrainableParameter] generator owns EnsureInitialized, so it cannot be overridden), override IsInitialized, and materialize in the public weight getters — so MaterializeParameters() and the chunk-streaming path read real weights, not zero-length placeholders. - Tests: use the existing [Collection("FoundationScaleSerial")] (the previous name had no matching definition, so the foundation-scale tests ran in parallel and risked OOM on the 16 GB runner); add await Task.Yield() so the FluxSchnell clone test's [Fact(Timeout)] is actually enforced. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fixes the two failures called out in #1671 (clean run
28023414937, master @01a4ddb4). Both turned out to be 120 s test timeouts, not the clone-equivalence divergence the issue hypothesized for bug 1 — confirmed by reading the actual CI logs (the failing test's error wasTest execution timed out after 120000 milliseconds, and the adjacentPredict_ShouldBeDeterministicpassed at 1 m 28 s).Bug 1 —
TCDModelTests.Clone_ShouldProduceIdenticalOutput(ModelFamily Diffusion T-Z)TCD is a trajectory-consistency distillation model built for few-step sampling — its own metadata already declared
optimal_steps = 4, and the paper reports 4 as the sweet spot. But the ctor never setDefaultInferenceSteps, soPredictran the generic 10-step DDIM default (~2.5× the intended compute). The Clone test runs two SD-1.5-scale UNet+VAE predicts (original + clone), which blew past the 120 s budget under CI load.Fix: pin
DefaultInferenceSteps = 4(the model's own optimal), matching every other few-step model inFastGeneration/(ConsistencyModel = 2,FluxSchnell = 4, …). This is the paper-faithful default, not a test accommodation; clone equivalence is preserved (original and clone use the same step count).Bug 2 —
DefaultConstructionTests.AllDefaultConstructableModels_ShouldConstructWithoutException(Integration D)Two models exceeded the 10 s/ctor budget:
LuminaImage2ModelandFlagDiTPredictor.FlagDiTPredictoralready builds its FFN/adaLN via the lazyDenseLayerpath (NoisePredictorBase.LazyDense, zero-sized until first forward) so a foundation-scale stack constructs cheaply and streams weights at forward — but its 32GroupedQueryAttentionLayerblocks were still eagerly allocated (~1.3 B attention weights, gigabytes, before a single forward), defeating that design.Fix: add an opt-in
deferAllocationpath toGroupedQueryAttentionLayerthat mirrors the lazyDenseLayercontract — projection weights stay zero-sized at construction and materialize (allocate + initialize) on first use, guarded at every weight-access entry point (Forward/GetParameters/SetParameters/GetParameterGradients/UpdateParameters).ParameterCountis derived exactly from the dimension fields while deferred, so a parent predictor'sCalculateParameterCountstays correct without forcing allocation.FlagDiTPredictoropts in; eager callers (the default) are completely unchanged.Verification (local,
AIDOTNET_INFERENCE_ARENA=0, CPU)TCDModelTests.Clone_ShouldProduceIdenticalOutputTCDModelTests.Predict_ShouldBeDeterministicDefaultConstructionTests…ShouldConstructWithoutExceptionGroupedQueryAttentionLayerTests+GQAInferenceIntegrationTests(eager path)LuminaImage2ModelTestsScheduler+Parameters_ShouldBeNonEmptyScope note
The TCD class also has
Training_ShouldReducePredictionErrorandForwardPass_ShouldBeFinite_AfterTrainingfailures, andLuminaImage2ModelTests.Metadata_ShouldExist— these are training-scale timeouts (each trains a full SD/foundation-scale model) that are not affected by either fix here and are out of #1671's scope (they belong to the broader training/perf inventory, #1624). This PR resolves exactly the two tests #1671 named.🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes