Commit 972a8eb
* docs(#1662): design spec for backward-pass optimizations (levers #4, #1, #3)
Cross-repo, Tensors-first design for the backward/training side of #653/#1624:
backward buffer-reuse audit (#4), optimizer-in-backward adaptive hybrid (#1),
and FlashAttention-style tiled backward (#3). Excludes lever #2 (checkpointing,
owned by open PR #1633).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(#1662): correct lever #1 scope — optimizer-in-backward is OOM-only today
TrainWithTapeStreaming already does single-pass-fused/two-pass-norm, but only
engages above 0.5x-RAM; the common (fits-in-memory) case is collect-then-step,
which merely ties PyTorch. Lever #1's real deliverable: promote single-pass
fused optimizer-in-backward to the common-case default for unclipped training,
add opt-in fast-clip for single-pass clipped, and prove a handy PyTorch CPU win
on per-step time / peak RSS / allocation via --trainbench (default-on flip gated
on that proof).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(#1662): implementation plan — levers #4, #3, #1 (Tensors-first, PyTorch proof)
Phased TDD plan: Tensors #4 (trainbench probe + alloc audit + arena guard +
fused-norm overload) -> Tensors #3 (tiled FlashAttention backward, parity-gated)
-> v0.102.0 release -> AiDotNet #1 (full-precision streaming optimizer, engage
single-pass fused-in-backward as the unclipped common-case default, opt-in
fast-clip, PyTorch CPU comparison proving a handy win on time/alloc/RSS).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(#1662): full-precision streaming Adam (bit-identical fused-in-backward foundation)
Adds FullPrecisionStreamingOptimizer<T> (per-tensor double[] moments, one global
step counter, no quantization, no update clamping) + FullPrecisionStreamingAdam<T>
matching the classic AdamOptimizer formula exactly. This is the bit-identical
streaming optimizer that lets single-pass fused optimizer-in-backward be the
common-case DEFAULT for models that fit in memory (vs the 8-bit OOM-survival
variants whose ClampUpdate + block quantization diverge from classic Adam).
Compiles clean; not yet wired into GetOrCreateStreamingOptimizer (next commit) —
the bit-identical gate test lands with the wiring.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(#1662): wire full-precision fused optimizer-in-backward (unclipped) + gate test
Resolver gains SupportsFullPrecision (Adam-family today) + a full-precision branch;
TrainWithTapeStreaming uses the bit-identical FullPrecisionStreamingAdam when the
user opts in (ForceOn) for an unclipped Adam model. Auto default unchanged (the
unclipped-fitting default-on flip is gated on the §5d PyTorch proof).
Gate test FusedInBackward_Unclipped_MatchesClassicAdam_ToFloatPrecision: fused
single-pass optimizer-in-backward tracks classic eager collect-then-step Adam to
float precision (1e-3 over 20 steps, identical init). PASSES.
Surfaced (pre-existing, separate): MaxGradNorm defaults to 1.0, so the COMMON case
is clipped -> the two-pass clipped streaming path, which currently throws
"persistent tape activations released" (sets Persistent=true but not
ReleaseStreamingActivations=false). Fix tracked next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(#1662): clipped two-pass streaming path + bit-identical full-precision for clipped
Fixes the pre-existing clipped-streaming crash: the two-pass path builds a
Persistent tape but ComputeGradientsStreaming releases activations by default
(process-global GradientTape<T>.ReleaseStreamingActivations), so pass 2 threw
"activations released". Now save/restore that flag (false during the clipped
passes) so both passes share the recorded graph; the setting never leaks.
Also drops the !clip restriction on full-precision selection: clipping only
chooses single-pass vs two-pass, not precision, so the clipped (common-case,
MaxGradNorm=1.0 default) path now uses the bit-identical FullPrecisionStreamingAdam
too. PyTorch's apply_optimizer_in_backward does not support clipping at all.
Gate: FusedInBackward_Clipped_MatchesClassicAdam_ToFloatPrecision (+ the unclipped
one) both PASS — fused optimizer-in-backward tracks classic clip-then-step Adam to
float precision over 20 steps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(#1662): pytorch cpu comparison - honest result (we do not beat torch on speed)
Adds benchmarks/trainbench_torch.py (mirrors the --trainbench residual-FFN shape)
and the head-to-head results. Finding on S=128/D=384/10-layer/8-thread:
torch: median 69.6 ms/step, peak RSS 312 MB
aidotnet: median 252.9 ms/step (3.6x SLOWER), peak WS 489 MB, alloc 0.127 MB/step
Identical loss. The optimizer-in-backward + arena wins ONLY on allocation/GC churn
(near-zero), not throughput or peak RSS. The 3.6x per-step gap is GEMM + autodiff
overhead (#653 core CPU-parity), orthogonal to optimizer-in-backward.
Consequence: the §5a default-on flip is NOT justified (no speed win) — fused
optimizer-in-backward stays opt-in (ForceOn), valued for bounded peak-grad memory +
zero GC churn (bit-identical) + clipped support PyTorch lacks. "Beat PyTorch on all
metrics" is not currently true and must not be claimed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(#1662): gap investigation — per-step deficit is small-M GEMM parallel scaling
Decomposed the 3.6x per-step gap by batch: S=128 -> 3.6x, S=1024 -> 1.43x. The gap
collapses with larger M, matching the known #475 finding (managed microkernel is
MKL-parity, but small-M GEMM parallel scaling plateaus ~2x). So the speed deficit is
the #653 core CPU-parity problem (small-M GEMM dispatch/scaling), orthogonal to
optimizer-in-backward, which can't change matmul cost. Lever #1's real win is the
bit-identical allocation/peak-grad-memory reduction, which holds at all batch sizes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(#1662): opt-in single-pass fast approximate grad-clip (§5c)
Adds FastApproxGradClip (default OFF). When on + clipping active, the streaming
clipped path runs as a SINGLE backward pass: the clip scale comes from an EMA of
the previous step's global grad-norm (NFNet-style adaptive clipping), and this
step's exact norm is accumulated in the same pass to update the EMA. First step
seeds the EMA without clipping. NOT bit-identical (documented approximation) — this
is the clipped path that beats PyTorch on backward-pass count (torch's
apply_optimizer_in_backward cannot clip at all).
Branch restructured into unclipped single-pass / fast-clip single-pass / exact
two-pass; persistent tape + ReleaseStreamingActivations toggle now gated on the
genuine two-pass only. Convergence test (loss decreases, stays finite) PASSES;
14/14 fused+streaming tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: franklinic <franklin@ivorycloud.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent de9a34d commit 972a8eb
9 files changed
Lines changed: 1451 additions & 11 deletions
File tree
- benchmarks
- docs/superpowers
- plans
- specs
- src
- NeuralNetworks
- Training
- tests/AiDotNet.Tests/IntegrationTests/NeuralNetworks
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
0 commit comments