dot: size kc for L2 stripe reuse, not L1 per-call stripe
`schedule_dot`'s k-block formula previously used `block_n` in the denominator,
which sized kc so that a per-kernel-call stripe (kc × block_n) fit in the
given cache. The real reuse pattern is different: inside an outer k
iteration the scheduler sweeps all (m, n) tiles, so the cross-m-iteration
stripe (kc × n) is what needs to stay resident in cache.
Changes here:
* schedule.cc — switch the k-block formula from `block_n` to the current
`n`. The `cache_size` argument now has its intended meaning (L2/SLC
budget, not L1 per-call footprint).
* schedule.cc — when k overflows the cache-sized kc_max, split into
`niter = ceil(k/kc_max)` near-equal iterations instead of full kc_max
iterations plus a short tail. Two reasons: small tails amortise kernel-call
overhead poorly, and a kc that nearly fills the cache with no headroom
causes the B stripe to spill into the outer cache, which has much
higher variance than a slightly smaller stripe that fits cleanly.
* schedule.cc — reserve 1/16 headroom from the cache budget before
computing kc_max. At the boundary where the stripe exactly fills the
budget, A/C tiles, TLB traffic, and prefetch interference evict parts
of the resident B stripe back into outer caches, hurting both mean and
run-to-run variance.
* base/arch.{h,cc} — new `get_l2_cache_size()` returning a GEMM cache
budget. When cpuinfo is available, iterates all reported L2 caches and
picks the one with the largest per-thread share (size /
processor_count), so on asymmetric systems (Apple M-series P+E, Arm
big.LITTLE) it deterministically selects the performance cluster. The
budget is max(per_thread * 2, total * 3/4):
- per_thread * 2 absorbs graceful spillover into outer SLC/L3.
- total * 3/4 recognises that on physically-shared L2 clusters
(M-series P-cluster, big.LITTLE P-cluster) a single-threaded GEMM
gets near-full L2 — the per-thread model under-counts. The 1/4
headroom covers A/C tiles, TLB, and prefetch interference near
capacity.
On per-core L2 systems (sharers=1) the first branch always dominates,
so behavior there is unchanged. Tuned for single-threaded latency; see
implementation comment for the multi-thread caveat. Fallback: 1 MiB.
* subgraph/dot.cc — replace the hardcoded `cache_size_l2 = 128 KiB` with
`get_l2_cache_size()`, addressing the TODO in the original code.
* schedule_bench.cc — add an `auto:<cache1>[,<cache2>...]` flag that
drives `schedule_dot` directly with user-specified cache sizes, so this
(and the previous) commit can be exercised from the CLI without
hand-crafting loops.
* schedule_test.cc — targeted tests covering the no-blocking case, kc
sizing from current n, the even-split policy at various k/kc_max
ratios, and the fits-exactly boundary. Expectations account for the
1/16 safety headroom.
Benchmarked on Apple M4 Pro, dot_fp32_sme2 square GEMMs, N=12 samples per
cell with randomised order and cooldowns: geomean +8.2% across n ∈
{64..8192} against master's 128 KiB L1-style schedule, with wins of
+5-20% across n ∈ {1280..3072} and +21%/+51% at n=7168/8192 where the
old schedule's 512-kc stripe no longer fits L2. No regressions outside
the noise floor.
All tests in //ynnpack/kernels/dot/... pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>