perf(#1662): bit-identical fused optimizer-in-backward (lever #1)#1664
Conversation
…, #3) Cross-repo, Tensors-first design for the backward/training side of #653/#1624: backward buffer-reuse audit (#4), optimizer-in-backward adaptive hybrid (#1), and FlashAttention-style tiled backward (#3). Excludes lever #2 (checkpointing, owned by open PR #1633). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ly today TrainWithTapeStreaming already does single-pass-fused/two-pass-norm, but only engages above 0.5x-RAM; the common (fits-in-memory) case is collect-then-step, which merely ties PyTorch. Lever #1's real deliverable: promote single-pass fused optimizer-in-backward to the common-case default for unclipped training, add opt-in fast-clip for single-pass clipped, and prove a handy PyTorch CPU win on per-step time / peak RSS / allocation via --trainbench (default-on flip gated on that proof). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…PyTorch proof) Phased TDD plan: Tensors #4 (trainbench probe + alloc audit + arena guard + fused-norm overload) -> Tensors #3 (tiled FlashAttention backward, parity-gated) -> v0.102.0 release -> AiDotNet #1 (full-precision streaming optimizer, engage single-pass fused-in-backward as the unclipped common-case default, opt-in fast-clip, PyTorch CPU comparison proving a handy win on time/alloc/RSS). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ckward foundation) Adds FullPrecisionStreamingOptimizer<T> (per-tensor double[] moments, one global step counter, no quantization, no update clamping) + FullPrecisionStreamingAdam<T> matching the classic AdamOptimizer formula exactly. This is the bit-identical streaming optimizer that lets single-pass fused optimizer-in-backward be the common-case DEFAULT for models that fit in memory (vs the 8-bit OOM-survival variants whose ClampUpdate + block quantization diverge from classic Adam). Compiles clean; not yet wired into GetOrCreateStreamingOptimizer (next commit) — the bit-identical gate test lands with the wiring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed) + gate test Resolver gains SupportsFullPrecision (Adam-family today) + a full-precision branch; TrainWithTapeStreaming uses the bit-identical FullPrecisionStreamingAdam when the user opts in (ForceOn) for an unclipped Adam model. Auto default unchanged (the unclipped-fitting default-on flip is gated on the §5d PyTorch proof). Gate test FusedInBackward_Unclipped_MatchesClassicAdam_ToFloatPrecision: fused single-pass optimizer-in-backward tracks classic eager collect-then-step Adam to float precision (1e-3 over 20 steps, identical init). PASSES. Surfaced (pre-existing, separate): MaxGradNorm defaults to 1.0, so the COMMON case is clipped -> the two-pass clipped streaming path, which currently throws "persistent tape activations released" (sets Persistent=true but not ReleaseStreamingActivations=false). Fix tracked next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ision for clipped Fixes the pre-existing clipped-streaming crash: the two-pass path builds a Persistent tape but ComputeGradientsStreaming releases activations by default (process-global GradientTape<T>.ReleaseStreamingActivations), so pass 2 threw "activations released". Now save/restore that flag (false during the clipped passes) so both passes share the recorded graph; the setting never leaks. Also drops the !clip restriction on full-precision selection: clipping only chooses single-pass vs two-pass, not precision, so the clipped (common-case, MaxGradNorm=1.0 default) path now uses the bit-identical FullPrecisionStreamingAdam too. PyTorch's apply_optimizer_in_backward does not support clipping at all. Gate: FusedInBackward_Clipped_MatchesClassicAdam_ToFloatPrecision (+ the unclipped one) both PASS — fused optimizer-in-backward tracks classic clip-then-step Adam to float precision over 20 steps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Warning Review limit reached
More reviews will be available in 14 minutes and 5 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (9)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…orch on speed) Adds benchmarks/trainbench_torch.py (mirrors the --trainbench residual-FFN shape) and the head-to-head results. Finding on S=128/D=384/10-layer/8-thread: torch: median 69.6 ms/step, peak RSS 312 MB aidotnet: median 252.9 ms/step (3.6x SLOWER), peak WS 489 MB, alloc 0.127 MB/step Identical loss. The optimizer-in-backward + arena wins ONLY on allocation/GC churn (near-zero), not throughput or peak RSS. The 3.6x per-step gap is GEMM + autodiff overhead (#653 core CPU-parity), orthogonal to optimizer-in-backward. Consequence: the §5a default-on flip is NOT justified (no speed win) — fused optimizer-in-backward stays opt-in (ForceOn), valued for bounded peak-grad memory + zero GC churn (bit-identical) + clipped support PyTorch lacks. "Beat PyTorch on all metrics" is not currently true and must not be claimed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…allel scaling Decomposed the 3.6x per-step gap by batch: S=128 -> 3.6x, S=1024 -> 1.43x. The gap collapses with larger M, matching the known #475 finding (managed microkernel is MKL-parity, but small-M GEMM parallel scaling plateaus ~2x). So the speed deficit is the #653 core CPU-parity problem (small-M GEMM dispatch/scaling), orthogonal to optimizer-in-backward, which can't change matmul cost. Lever #1's real win is the bit-identical allocation/peak-grad-memory reduction, which holds at all batch sizes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds FastApproxGradClip (default OFF). When on + clipping active, the streaming clipped path runs as a SINGLE backward pass: the clip scale comes from an EMA of the previous step's global grad-norm (NFNet-style adaptive clipping), and this step's exact norm is accumulated in the same pass to update the EMA. First step seeds the EMA without clipping. NOT bit-identical (documented approximation) — this is the clipped path that beats PyTorch on backward-pass count (torch's apply_optimizer_in_backward cannot clip at all). Branch restructured into unclipped single-pass / fast-clip single-pass / exact two-pass; persistent tape + ReleaseStreamingActivations toggle now gated on the genuine two-pass only. Convergence test (loss decreases, stays finite) PASSES; 14/14 fused+streaming tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
01c6e73 to
6b60270
Compare
Lever #1 of #1662 (backward/training, part of #653). Draft. Lever #2 (checkpointing) is owned by #1633; levers #3/#4 are in AiDotNet.Tensors PR #665.
What this delivers
FullPrecisionStreamingAdam/FullPrecisionStreamingOptimizer— per-tensor fp32 moments, no quantization, no clamp → bit-identical to classic Adam.StreamingTraining = ForceOn(opt-in):MaxGradNorm=1.0) → exact two-pass-norm, bit-identical.FastApproxGradClip(OFF by default) → single-pass clipped via an EMA grad-norm (NFNet-style approximation; clipped-in-backward is something PyTorch'sapply_optimizer_in_backwardcannot do at all).Persistent=truebut notReleaseStreamingActivations=false, so pass 2 threw "activations released". Now save/restored around the two-pass region.Gates (all green; 14/14 fused+streaming tests pass, no regressions)
FusedInBackward_Unclipped_MatchesClassicAdam_ToFloatPrecisionFusedInBackward_Clipped_MatchesClassicAdam_ToFloatPrecisionFastClip_SinglePass_ReducesLoss_AndStaysFiniteMeasured head-to-head (
benchmarks/trainbench_torch.py, residual-FFN, S=128/D=384/10 layers, 8 threads):Optimizer-in-backward wins only on allocation/GC churn + bounded peak-grad memory (bit-identical), not throughput or peak RSS. The 3.6× is small-M GEMM parallel-scaling (#653; collapses to 1.43× at S=1024) — orthogonal to optimizer-in-backward. "Beats PyTorch" is NOT claimed. See
docs/superpowers/specs/2026-06-22-1662-pytorch-comparison.md. The §5a default-on flip is intentionally NOT done (no speed win to justify it).Notes
AiDotNet.Tensors 0.94.2pin (lever Updated language version to use latest Updated target frameworks to widen the range Added a bunch of new metrics such as R2, Std Deviation, Std Error, etc #1 needs no new Tensors API). Bump to 0.102.0 (PR test: Add integration tests for ProgramSynthesis module [P3] #665) is a trivial follow-up for the Added new normalization options #3/Added new regression and normalization options #4 wins.🤖 Generated with Claude Code