perf(#1662): bit-identical fused optimizer-in-backward (lever #1) by ooples · Pull Request #1664 · ooples/AiDotNet

ooples · 2026-06-22T18:32:47Z

Lever #1 of #1662 (backward/training, part of #653). Draft. Lever #2 (checkpointing) is owned by #1633; levers #3/#4 are in AiDotNet.Tensors PR #665.

What this delivers

FullPrecisionStreamingAdam / FullPrecisionStreamingOptimizer — per-tensor fp32 moments, no quantization, no clamp → bit-identical to classic Adam.
Fused optimizer-in-backward engaged via StreamingTraining = ForceOn (opt-in):
- Unclipped → single-pass (each grad stepped + freed at its topological last-use).
- Clipped (the default, MaxGradNorm=1.0) → exact two-pass-norm, bit-identical.
- Opt-in FastApproxGradClip (OFF by default) → single-pass clipped via an EMA grad-norm (NFNet-style approximation; clipped-in-backward is something PyTorch's apply_optimizer_in_backward cannot do at all).
Fixed a pre-existing crash: the clipped two-pass streaming path set Persistent=true but not ReleaseStreamingActivations=false, so pass 2 threw "activations released". The flag is now saved and restored around the two-pass region.
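The difference between collect-then-step and single-pass fused optimizer-in-backward can be sketched in a few lines. This is an illustrative Python toy, not the AiDotNet C# implementation: `toy_grads`, `collect_then_step`, `fused_in_backward` and the sum-of-squares loss are all hypothetical names and assumptions. It only shows that stepping each gradient as soon as it is produced performs the same arithmetic as classic Adam while keeping one gradient alive at a time.

```python
import math

def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Classic Adam update for one tensor (flat list of floats), in place."""
    for i in range(len(p)):
        m[i] = b1 * m[i] + (1 - b1) * g[i]
        v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i]
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

def toy_grads(params):
    """Hypothetical backward pass: yields (name, grad) in reverse layer order.
    Assumed loss is sum(p^2), so each grad is 2*p and depends only on its own tensor."""
    for name in reversed(list(params)):
        yield name, [2.0 * x for x in params[name]]

def collect_then_step(params, state, t):
    grads = dict(toy_grads(params))          # every gradient is alive at once
    for name, g in grads.items():
        adam_step(params[name], g, *state[name], t)
    return len(grads)                        # peak number of live gradients

def fused_in_backward(params, state, t):
    peak = 0
    for name, g in toy_grads(params):        # step at the gradient's last use
        adam_step(params[name], g, *state[name], t)
        peak = max(peak, 1)                  # only the current gradient is alive
        del g                                # "freed" before the next one is produced
    return peak

def run(step_fn, steps=20):
    params = {"w1": [0.5, -1.0], "w2": [2.0], "w3": [0.1, 0.2, 0.3]}
    state = {k: ([0.0] * len(p), [0.0] * len(p)) for k, p in params.items()}
    peak = 0
    for t in range(1, steps + 1):            # one global step counter for all tensors
        peak = step_fn(params, state, t)
    return params, peak

classic_params, classic_peak = run(collect_then_step)
fused_params, fused_peak = run(fused_in_backward)
```

In this toy the two runs end with identical parameters because every element sees the same sequence of floating-point operations; only the number of gradients held simultaneously differs.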

Gates (all green; 14/14 fused+streaming tests pass, no regressions)

FusedInBackward_Unclipped_MatchesClassicAdam_ToFloatPrecision
FusedInBackward_Clipped_MatchesClassicAdam_ToFloatPrecision
FastClip_SinglePass_ReducesLoss_AndStaysFinite

⚠️ Honest performance verdict (does NOT beat PyTorch on speed)

Measured head-to-head (benchmarks/trainbench_torch.py, residual-FFN, S=128/D=384/10 layers, 8 threads):

Metric	PyTorch CPU	AiDotNet
median ms/step	69.6	252.9 (3.6× slower)
peak RSS	312 MB	489 MB (working set)
alloc/step	n/a (C++)	0.127 MB (near-zero GC)

Optimizer-in-backward wins only on allocation/GC churn and bounded peak-grad memory (bit-identical), not on throughput or peak RSS. The 3.6× gap comes from small-M GEMM parallel scaling (#653; it collapses to 1.43× at S=1024), which is orthogonal to optimizer-in-backward. "Beats PyTorch" is NOT claimed. See docs/superpowers/specs/2026-06-22-1662-pytorch-comparison.md. The §5a default-on flip is intentionally NOT done (no speed win to justify it).
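For readers who want to reproduce the ratios, here is a minimal sketch of how a median ms/step figure and the slowdown factor can be computed. It is not the contents of benchmarks/trainbench_torch.py; `median_ms_per_step`, `slowdown` and the warmup/iteration counts are hypothetical. Only the four input numbers come from the table above.

```python
import statistics
import time

def median_ms_per_step(step_fn, warmup=3, iters=10):
    """Time step_fn() and return the median wall-clock milliseconds per call.
    Warmup iterations are discarded so one-time setup cost does not skew the median."""
    for _ in range(warmup):
        step_fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step_fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def slowdown(candidate, baseline):
    """How many times larger the candidate figure is than the baseline."""
    return candidate / baseline

# Ratios implied by the reported figures.
step_ratio = slowdown(252.9, 69.6)   # median ms/step, AiDotNet vs PyTorch CPU
rss_ratio = slowdown(489.0, 312.0)   # peak memory, AiDotNet vs PyTorch CPU

# Example use on a throwaway workload (an assumption, not the benchmark model).
example_ms = median_ms_per_step(lambda: sum(i * i for i in range(10_000)))
```

The median is used rather than the mean so that an occasional slow step (for example a GC pause) does not dominate the reported figure.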

Notes

Built against the current AiDotNet.Tensors 0.94.2 pin (lever #1 needs no new Tensors API). Bump to 0.102.0 (PR #665) is a trivial follow-up for the #3/#4 wins.
Real path to a speed win = the #653 small-M GEMM sprint, not this PR.

🤖 Generated with Claude Code

…, #3) Cross-repo, Tensors-first design for the backward/training side of #653/#1624: backward buffer-reuse audit (#4), optimizer-in-backward adaptive hybrid (#1), and FlashAttention-style tiled backward (#3). Excludes lever #2 (checkpointing, owned by open PR #1633). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ly today TrainWithTapeStreaming already does single-pass-fused/two-pass-norm, but only engages above 0.5x-RAM; the common (fits-in-memory) case is collect-then-step, which merely ties PyTorch. Lever #1's real deliverable: promote single-pass fused optimizer-in-backward to the common-case default for unclipped training, add opt-in fast-clip for single-pass clipped, and prove a clear PyTorch CPU win on per-step time / peak RSS / allocation via --trainbench (default-on flip gated on that proof). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…PyTorch proof) Phased TDD plan: Tensors #4 (trainbench probe + alloc audit + arena guard + fused-norm overload) -> Tensors #3 (tiled FlashAttention backward, parity-gated) -> v0.102.0 release -> AiDotNet #1 (full-precision streaming optimizer, engage single-pass fused-in-backward as the unclipped common-case default, opt-in fast-clip, PyTorch CPU comparison proving a clear win on time/alloc/RSS). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ckward foundation) Adds FullPrecisionStreamingOptimizer<T> (per-tensor double[] moments, one global step counter, no quantization, no update clamping) + FullPrecisionStreamingAdam<T> matching the classic AdamOptimizer formula exactly. This is the bit-identical streaming optimizer that lets single-pass fused optimizer-in-backward be the common-case DEFAULT for models that fit in memory (vs the 8-bit OOM-survival variants whose ClampUpdate + block quantization diverge from classic Adam). Compiles clean; not yet wired into GetOrCreateStreamingOptimizer (next commit) — the bit-identical gate test lands with the wiring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ed) + gate test Resolver gains SupportsFullPrecision (Adam-family today) + a full-precision branch; TrainWithTapeStreaming uses the bit-identical FullPrecisionStreamingAdam when the user opts in (ForceOn) for an unclipped Adam model. Auto default unchanged (the unclipped-fitting default-on flip is gated on the §5d PyTorch proof). Gate test FusedInBackward_Unclipped_MatchesClassicAdam_ToFloatPrecision: fused single-pass optimizer-in-backward tracks classic eager collect-then-step Adam to float precision (1e-3 over 20 steps, identical init). PASSES. Surfaced (pre-existing, separate): MaxGradNorm defaults to 1.0, so the COMMON case is clipped -> the two-pass clipped streaming path, which currently throws "persistent tape activations released" (sets Persistent=true but not ReleaseStreamingActivations=false). Fix tracked next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ision for clipped Fixes the pre-existing clipped-streaming crash: the two-pass path builds a Persistent tape but ComputeGradientsStreaming releases activations by default (process-global GradientTape<T>.ReleaseStreamingActivations), so pass 2 threw "activations released". Now save/restore that flag (false during the clipped passes) so both passes share the recorded graph; the setting never leaks. Also drops the !clip restriction on full-precision selection: clipping only chooses single-pass vs two-pass, not precision, so the clipped (common-case, MaxGradNorm=1.0 default) path now uses the bit-identical FullPrecisionStreamingAdam too. PyTorch's apply_optimizer_in_backward does not support clipping at all. Gate: FusedInBackward_Clipped_MatchesClassicAdam_ToFloatPrecision (+ the unclipped one) both PASS — fused optimizer-in-backward tracks classic clip-then-step Adam to float precision over 20 steps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
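The exact two-pass clipped path and the flag save/restore described in this commit can be sketched as follows. This is a Python illustration under stated assumptions, not the C# code: `Tape`, `RecordedGraph`, `keep_activations` and `clipped_two_pass` are hypothetical stand-ins (the real flag is the process-global GradientTape<T>.ReleaseStreamingActivations), and the `max_norm / (norm + 1e-6)` scale is the common clip-by-global-norm formula assumed here rather than confirmed from the source.

```python
import math
from contextlib import contextmanager

class Tape:
    """Hypothetical stand-in for the process-global tape setting."""
    release_streaming_activations = True

@contextmanager
def keep_activations():
    """Force the flag to False for the two-pass region, then restore the caller's value."""
    saved = Tape.release_streaming_activations
    Tape.release_streaming_activations = False
    try:
        yield
    finally:
        Tape.release_streaming_activations = saved   # never leaks, even on error

class RecordedGraph:
    """Toy persistent tape: streams grads of sum(p^2) in reverse layer order."""
    def __init__(self, params):
        self.params = params
        self.activations_live = True

    def stream_grads(self):
        if not self.activations_live:
            raise RuntimeError("activations released")
        for name in reversed(list(self.params)):
            yield name, [2.0 * x for x in self.params[name]]
        if Tape.release_streaming_activations:
            self.activations_live = False            # default: free after one pass

def clipped_two_pass(graph, apply_update, max_norm=1.0):
    with keep_activations():
        # Pass 1: accumulate the exact global gradient norm, keep no gradients.
        sq = 0.0
        for _, g in graph.stream_grads():
            sq += sum(x * x for x in g)
        norm = math.sqrt(sq)
        scale = min(1.0, max_norm / (norm + 1e-6))   # assumed clip formula
        # Pass 2: stream the gradients again and step each one with the exact scale.
        for name, g in graph.stream_grads():
            apply_update(name, [scale * x for x in g])
    return norm, scale

params = {"a": [3.0], "b": [4.0]}

def sgd_update(name, grad, lr=0.1):
    """Plain SGD stands in for the optimizer step to keep the sketch short."""
    params[name] = [p - lr * g for p, g in zip(params[name], grad)]

norm, scale = clipped_two_pass(RecordedGraph(params), sgd_update)

# Without the guard, a second pass over the same graph fails as the commit describes.
unguarded = RecordedGraph({"a": [1.0]})
list(unguarded.stream_grads())
try:
    list(unguarded.stream_grads())
    second_pass_failed = False
except RuntimeError:
    second_pass_failed = True
```

Restoring the flag in a `finally` block is what keeps the setting from leaking to other training runs in the same process.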

vercel · 2026-06-22T18:32:53Z

Project	Deployment	Actions	Updated (UTC)
aidotnet_website	Ready	Preview, Comment	Jun 22, 2026 9:40pm
aidotnet-playground-api	Ready	Preview, Comment	Jun 22, 2026 9:40pm

coderabbitai · 2026-06-22T18:32:57Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2cf707e5-63c2-46b7-b6c3-b1c74c253815

📥 Commits

Reviewing files that changed from the base of the PR and between de9a34d and 6b60270.

📒 Files selected for processing (9)

benchmarks/trainbench_torch.py
docs/superpowers/plans/2026-06-22-1662-backward-pass-optimizations.md
docs/superpowers/specs/2026-06-22-1662-backward-pass-optimizations-design.md
docs/superpowers/specs/2026-06-22-1662-pytorch-comparison.md
src/NeuralNetworks/NeuralNetworkBase.cs
src/Training/FullPrecisionStreamingAdam.cs
src/Training/FullPrecisionStreamingOptimizer.cs
src/Training/StreamingOptimizerResolver.cs
tests/AiDotNet.Tests/IntegrationTests/NeuralNetworks/FusedOptimizerIntegrationTests.cs

…orch on speed) Adds benchmarks/trainbench_torch.py (mirrors the --trainbench residual-FFN shape) and the head-to-head results. Finding on S=128/D=384/10-layer/8-thread:
torch: median 69.6 ms/step, peak RSS 312 MB
aidotnet: median 252.9 ms/step (3.6x SLOWER), peak WS 489 MB, alloc 0.127 MB/step
Identical loss. The optimizer-in-backward + arena wins ONLY on allocation/GC churn (near-zero), not throughput or peak RSS. The 3.6x per-step gap is GEMM + autodiff overhead (#653 core CPU-parity), orthogonal to optimizer-in-backward. Consequence: the §5a default-on flip is NOT justified (no speed win); fused optimizer-in-backward stays opt-in (ForceOn), valued for bounded peak-grad memory + zero GC churn (bit-identical) + clipped support PyTorch lacks. "Beat PyTorch on all metrics" is not currently true and must not be claimed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…allel scaling Decomposed the 3.6x per-step gap by batch: S=128 -> 3.6x, S=1024 -> 1.43x. The gap collapses with larger M, matching the known #475 finding (managed microkernel is MKL-parity, but small-M GEMM parallel scaling plateaus ~2x). So the speed deficit is the #653 core CPU-parity problem (small-M GEMM dispatch/scaling), orthogonal to optimizer-in-backward, which can't change matmul cost. Lever #1's real win is the bit-identical allocation/peak-grad-memory reduction, which holds at all batch sizes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds FastApproxGradClip (default OFF). When on + clipping active, the streaming clipped path runs as a SINGLE backward pass: the clip scale comes from an EMA of the previous step's global grad-norm (NFNet-style adaptive clipping), and this step's exact norm is accumulated in the same pass to update the EMA. First step seeds the EMA without clipping. NOT bit-identical (documented approximation) — this is the clipped path that beats PyTorch on backward-pass count (torch's apply_optimizer_in_backward cannot clip at all). Branch restructured into unclipped single-pass / fast-clip single-pass / exact two-pass; persistent tape + ReleaseStreamingActivations toggle now gated on the genuine two-pass only. Convergence test (loss decreases, stays finite) PASSES; 14/14 fused+streaming tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
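The single-pass approximate clip described above can be sketched like this. It is a Python illustration, not the AiDotNet code: `EmaGradClipper`, `fast_clip_step`, the 0.9 decay, the plain-SGD update and the sum-of-squares toy loss are all assumptions. What is taken from the commit message is the shape of the idea: the scale comes from an EMA of earlier global norms, this step's exact norm is accumulated in the same pass, and the first step seeds the EMA without clipping.

```python
import math

class EmaGradClipper:
    """Clip scale from an EMA of past global grad norms (approximate, not bit-identical)."""
    def __init__(self, max_norm=1.0, decay=0.9):
        self.max_norm = max_norm
        self.decay = decay        # assumed smoothing factor
        self.ema = None           # unset until the first step has been observed
        self._sq = 0.0

    def begin_step(self):
        """Return the scale to use for the whole step, known before backward starts."""
        self._sq = 0.0
        if self.ema is None:
            return 1.0            # first step: seed only, no clipping
        return min(1.0, self.max_norm / (self.ema + 1e-6))

    def observe(self, grad):
        """Accumulate this step's exact squared norm from the unscaled gradient."""
        self._sq += sum(x * x for x in grad)

    def end_step(self):
        """Fold this step's exact norm into the EMA and return it."""
        norm = math.sqrt(self._sq)
        if self.ema is None:
            self.ema = norm
        else:
            self.ema = self.decay * self.ema + (1 - self.decay) * norm
        return norm

def fast_clip_step(params, clipper, lr=0.1):
    """One backward pass: each gradient is measured, scaled, applied, then dropped."""
    scale = clipper.begin_step()
    for name in reversed(list(params)):
        g = [2.0 * x for x in params[name]]          # toy grad of sum(p^2)
        clipper.observe(g)
        params[name] = [p - lr * scale * gi for p, gi in zip(params[name], g)]
    return scale, clipper.end_step()

params = {"a": [3.0], "b": [4.0]}
clipper = EmaGradClipper()
scale1, norm1 = fast_clip_step(params, clipper)      # seeds the EMA, unclipped
scale2, norm2 = fast_clip_step(params, clipper)      # clipped using the seeded EMA
```

Because the scale is fixed before the pass begins, no second traversal of the graph is needed; the cost is that the scale lags the true norm by the EMA's memory.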

franklinic and others added 6 commits June 22, 2026 10:58

franklinic and others added 3 commits June 22, 2026 17:39

ooples force-pushed the perf/1662-optimizer-in-backward branch from 01c6e73 to 6b60270 Compare June 22, 2026 21:39

vercel Bot deployed to Preview – aidotnet_website June 22, 2026 21:39 View deployment

vercel Bot deployed to Preview – aidotnet-playground-api June 22, 2026 21:40 View deployment

ooples marked this pull request as ready for review June 22, 2026 21:40

ooples merged commit 972a8eb into master Jun 22, 2026
15 checks passed

ooples deleted the perf/1662-optimizer-in-backward branch June 22, 2026 21:43

ooples mentioned this pull request Jun 24, 2026

Playbook: clear remaining #1677 OOM/timeout CI shards — memory-architecture fixes (caching arenas, fused optimizer-in-backward, COW clone, weight streaming) + float as the LAST lever #1684

Open

ooples merged 9 commits into master from perf/1662-optimizer-in-backward
