perf(init): batched parallel Xavier normal weight initialization
Replaces the per-element SampleGaussian loop, which ran a
virtual-dispatch Box-Muller transform plus rejection test for every
element, with a tight specialized fill routine for double and float.
Each paired Box-Muller transform yields two samples per pair of uniform
draws, halving the log/sqrt/sin/cos call count. Large layers (≥ 256K
elements) are partitioned across the thread pool, so the ~29s init cost
of a DiT-XL-sized Dense layer (hidden 8192 × out 12288 ≈ 100M doubles
per AdaLN modulation layer) is spread across threads instead of running
single-threaded.
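For illustration only, a minimal Python sketch of a paired Box-Muller
fill. It is not the code in this commit: the function name
fill_xavier_normal and its parameters are hypothetical, and the
standard deviation sqrt(2 / (fan_in + fan_out)) is the usual Xavier
normal definition, assumed here to match XavierNormalInitialize.

```python
import math
import random


def fill_xavier_normal(out, fan_in, fan_out, rng):
    """Fill `out` in place with N(0, std^2) samples, two per transform.

    Hypothetical helper: std follows the standard Xavier/Glorot normal
    definition, sqrt(2 / (fan_in + fan_out)).
    """
    std = math.sqrt(2.0 / (fan_in + fan_out))
    n = len(out)
    i = 0
    while i < n:
        # rng.random() is in [0, 1), so 1 - rng.random() is in (0, 1]
        # and log(u1) is always finite.
        u1 = 1.0 - rng.random()
        u2 = rng.random()
        # One log and one sqrt are shared by both samples of the pair.
        r = std * math.sqrt(-2.0 * math.log(u1))
        theta = 2.0 * math.pi * u2
        out[i] = r * math.cos(theta)
        if i + 1 < n:
            # The sine branch is the second, independent sample.
            out[i + 1] = r * math.sin(theta)
        i += 2


weights = [0.0] * 8
fill_xavier_normal(weights, fan_in=4, fan_out=4, rng=random.Random(0))
```

The cosine and sine branches share the same radius, so each pair of
uniform draws pays for one log and one sqrt rather than two of each.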
Context: even after the Tensors-side SIMD fixes on the forward matmul
path, the first Pika21 Predict paid ~150s of lazy-init overhead across
the 24 block layers, because each first-call XavierNormalInitialize ran
a scalar loop making ~100M virtual calls. The cost is one-time per
layer, but it dominated the first forward pass and pushed the
Training_Should* tests that exercise a fresh model over the per-test
xUnit budget.
Preserves reproducibility: per-chunk RNGs are seeded deterministically
from the master Random instance, so for a given parent seed the output
is stable across thread counts. The generic-T fallback stays on the
old path, since only float and double are expected to be perf-critical.
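A rough Python sketch of one way to get that thread-count independence,
under assumptions the message does not confirm: chunk boundaries come
from a fixed chunk size (the hypothetical CHUNK constant) rather than
from the worker count, and all chunk seeds are drawn serially from the
master RNG before any work is dispatched. The names parallel_fill and
_fill_chunk are illustrative, and rng.gauss stands in for the
specialized fill routine.

```python
import random
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4096  # hypothetical fixed chunk size, independent of thread count


def _fill_chunk(out, start, stop, seed, std):
    # Each chunk owns a private RNG, so no generator state is shared.
    rng = random.Random(seed)
    for i in range(start, stop):
        out[i] = rng.gauss(0.0, std)


def parallel_fill(n, std, master, workers):
    """Fill n samples; the result depends on the master seed only."""
    out = [0.0] * n
    # Seeds are drawn on the calling thread in chunk order, so the
    # chunk-to-seed mapping never depends on scheduling.
    jobs = [(s, min(s + CHUNK, n), master.getrandbits(64))
            for s in range(0, n, CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(_fill_chunk, out, a, b, seed, std)
                   for a, b, seed in jobs]
        for f in futures:
            f.result()  # re-raise any worker exception
    return out


one = parallel_fill(10000, 0.5, random.Random(42), workers=1)
four = parallel_fill(10000, 0.5, random.Random(42), workers=4)
same = one == four  # identical output regardless of worker count
```

The sketch shows only the seeding scheme. CPython threads will not
speed up this loop, and deriving the chunk layout from the thread count
instead would break the stability property.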
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>