
Commit 5439074

kasper0406 and claude committed
dot: size kc for L2 stripe reuse, not L1 per-call stripe
schedule_dot's k-block formula previously used block_n in the denominator, which sized kc so that a per-kernel-call stripe (kc × block_n) fit in the given cache. The real reuse pattern is different: inside an outer k iteration the scheduler sweeps all (m, n) tiles, so the cross-m-iteration stripe (kc × n) is what needs to stay resident in cache.

Changes here:

* schedule.cc — switch the k-block formula from `block_n` to the current `n`. The cache_size argument now has its intended meaning (L2/SLC budget, not L1 per-call footprint).

* schedule.cc — when k overflows the cache-sized kc_max, split into `niter = ceil(k/kc_max)` near-equal iterations instead of one kc_max iter plus a tail. Two reasons: small tails amortise kernel-call overhead poorly, and a kc that nearly fills the cache with no headroom causes the B stripe to spill into the outer cache, which has much higher variance than a slightly smaller stripe that fits cleanly.

* schedule.cc — reserve 1/16 headroom from the cache budget before computing kc_max. At the boundary where the stripe exactly fills the budget, A/C tiles, TLB traffic, and prefetch interference evict parts of the resident B stripe back into outer caches, hurting both mean and run-to-run variance.

* base/arch.{h,cc} — new `get_l2_cache_size()` returning a GEMM cache budget. When cpuinfo is available, iterates all reported L2 caches and picks the one with the largest per-thread share (size / processor_count), so on asymmetric systems (Apple M-series P+E, Arm big.LITTLE) it deterministically selects the performance cluster. The budget is max(per_thread * 2, total * 3/4):
  - per_thread * 2 absorbs graceful spillover into outer SLC/L3.
  - total * 3/4 recognises that on physically-shared L2 clusters (M-series P-cluster, big.LITTLE P-cluster) a single-threaded GEMM gets near-full L2 — the per-thread model under-counts. The 1/4 headroom covers A/C tiles, TLB, and prefetch interference near capacity.

  On per-core L2 systems (sharers=1) the first branch always dominates, so behavior there is unchanged. Tuned for single-threaded latency; see the implementation comment for the multi-thread caveat. Fallback: 1 MiB.

* subgraph/dot.cc — replace the hardcoded `cache_size_l2 = 128 KiB` with `get_l2_cache_size()`, resolving the TODO in the original code.

* schedule_bench.cc — add an `auto:<cache1>[,<cache2>...]` flag that drives `schedule_dot` directly with user-specified cache sizes, so this (and the previous) commit can be exercised from the CLI without hand-crafting loops.

* schedule_test.cc — targeted tests covering the no-blocking case, kc sizing from the current n, the even-split policy at various k/kc_max ratios, and the fits-exactly boundary. Expectations account for the 1/16 safety headroom.

Benchmarked on Apple M4 Pro, dot_fp32_sme2 square GEMMs, N=12 samples per cell with randomised order and cooldowns: geomean +8.2% across n ∈ {64..8192} against master's 128 KiB L1-style schedule, with wins of +5-20% across n ∈ {1280..3072} and +21%/+51% at n=7168/8192, where the old schedule's 512-kc stripe no longer fits L2. No regressions outside the noise floor.

All tests in //ynnpack/kernels/dot/... pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
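The kc-sizing arithmetic described above can be sketched in a few lines. This is an illustrative model of the formula, not the library's code; the function name `kc_schedule` and its defaults (f32 B elements, block_k = 1) are assumptions for the example:

```python
import math

def kc_schedule(cache_size, n, k, b_elem_size=4, block_k=1):
    """Model of the kc-blocking arithmetic: size kc so a (kc x n) stripe
    of B fits in the cache budget, keep 1/16 headroom, and split k into
    near-equal iterations when it overflows kc_max."""
    budget = cache_size - cache_size // 16          # reserve 1/16 headroom
    kc_max = budget // (n * b_elem_size * block_k)  # stripe rows that fit
    if kc_max == 0 or k <= kc_max:
        return k                                    # one iteration, no k-loop
    niter = math.ceil(k / kc_max)                   # near-equal split
    return math.ceil(k / niter)

# 16 MiB budget, n = 4096, f32: kc_max = 15 MiB / 16 KiB = 960.
print(kc_schedule(16 << 20, 4096, 1200))  # k slightly over kc_max -> 600
print(kc_schedule(16 << 20, 4096, 2880))  # clean 3x multiple -> 960
```

With these numbers, k = 1200 splits into two iterations of 600 rather than 960 + 240, matching the even-split rationale above.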
1 parent ad4dea5 commit 5439074

6 files changed

Lines changed: 274 additions & 21 deletions


ynnpack/base/arch.cc

Lines changed: 63 additions & 0 deletions
@@ -97,4 +97,67 @@ uint64_t get_supported_arch_flags() {
   return flags;
 }
 
+size_t get_l2_cache_size() {
+  static const size_t size = []() -> size_t {
+    // Conservative default when cpuinfo isn't available: 1 MiB. This is
+    // within a small factor of what typical Cortex-A7xx / Neoverse cores
+    // have per core, and large enough that kc stays usefully big for
+    // typical GEMM shapes (for N <= 4096, f32, it keeps kc >= 64).
+    constexpr size_t kFallback = 1 * 1024 * 1024;
+#ifdef YNN_ENABLE_CPUINFO
+    if (!cpuinfo_initialize()) {
+      return kFallback;
+    }
+    const uint32_t count = cpuinfo_get_l2_caches_count();
+    if (count == 0) {
+      return kFallback;
+    }
+    // Pick the L2 with the largest per-thread share. On asymmetric systems
+    // (Apple M-series P+E, Arm big.LITTLE) this deterministically selects
+    // the performance cluster, which is where latency-critical GEMM work
+    // lands. On homogeneous systems every L2 yields the same answer. Also
+    // track the total size of the selected L2 for the second branch below.
+    size_t best_per_thread = 0;
+    size_t best_total = 0;
+    for (uint32_t i = 0; i < count; ++i) {
+      const struct cpuinfo_cache* l2 = cpuinfo_get_l2_cache(i);
+      if (l2 == nullptr || l2->size == 0) continue;
+      const uint32_t sharers =
+          l2->processor_count > 0 ? l2->processor_count : 1;
+      const size_t per_thread = static_cast<size_t>(l2->size) / sharers;
+      if (per_thread > best_per_thread) {
+        best_per_thread = per_thread;
+        best_total = static_cast<size_t>(l2->size);
+      }
+    }
+    if (best_per_thread == 0) return kFallback;
+    // Two bounds, take the larger:
+    //
+    // per_thread * 2: assumes all sharers run GEMM concurrently; the 2x
+    //   absorbs graceful spillover into outer SLC/L3 on Apple M-series
+    //   and Neoverse cores.
+    //
+    // total * 3/4: on a physically-shared L2 cluster (M-series P-cluster,
+    //   big.LITTLE P-cluster) a single-threaded GEMM gets near-full L2 —
+    //   the per-thread model under-counts. The 1/4 headroom covers A/C
+    //   tiles, TLB, and prefetch interference near capacity.
+    //
+    // On per-core L2 systems (sharers=1) the first always dominates, so
+    // behavior there is unchanged. The second fires only on shared clusters,
+    // which is exactly the topology the per-thread model under-counts.
+    //
+    // Both are tuned for single-threaded latency. Under MT on clustered L2,
+    // threads each using 3/4 of total oversubscribe the cache — the
+    // long-term fix is plumbing active-thread count from pthreadpool.
+    const size_t per_thread_budget = best_per_thread * 2;
+    const size_t shared_cluster_budget = best_total - best_total / 4;
+    return per_thread_budget > shared_cluster_budget ? per_thread_budget
+                                                     : shared_cluster_budget;
+#else
+    return kFallback;
+#endif
+  }();
+  return size;
+}
+
 }  // namespace ynn
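The budget rule above (largest per-thread share, then max of the two bounds) can be modeled compactly. This is a sketch of the selection logic, not the C++ code; the cache topologies below are illustrative, not measured values for any particular chip:

```python
def gemm_cache_budget(l2_caches):
    """Model of the get_l2_cache_size() budget rule. Each entry is
    (total_size_bytes, processor_count)."""
    best_per_thread, best_total = 0, 0
    for size, sharers in l2_caches:
        per_thread = size // max(sharers, 1)
        if per_thread > best_per_thread:
            best_per_thread, best_total = per_thread, size
    if best_per_thread == 0:
        return 1 << 20  # 1 MiB fallback
    # max(per_thread * 2, total * 3/4), as in the comment block above.
    return max(best_per_thread * 2, best_total - best_total // 4)

# Per-core 1 MiB L2 (sharers=1): per-thread branch wins -> 2 MiB.
print(gemm_cache_budget([(1 << 20, 1)]))
# Hypothetical shared 16 MiB P-cluster L2 (8 cores) next to a smaller
# E-cluster L2: P-cluster is picked, shared-cluster branch wins -> 12 MiB.
print(gemm_cache_budget([(16 << 20, 8), (4 << 20, 4)]))
```

The second call shows why the `total * 3/4` bound exists: with 8 sharers the per-thread model would cap the budget at 4 MiB even though a single-threaded GEMM can use most of the shared 16 MiB.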

ynnpack/base/arch.h

Lines changed: 15 additions & 0 deletions
@@ -6,6 +6,7 @@
 #ifndef XNNPACK_YNNPACK_BASE_ARCH_H_
 #define XNNPACK_YNNPACK_BASE_ARCH_H_
 
+#include <cstddef>
 #include <cstdint>
 
 namespace ynn {
@@ -67,6 +68,20 @@ inline bool is_arch_supported(
   return (arch_flags & supported_arch_flags) == arch_flags;
 }
 
+// Returns the L2-class cache budget (in bytes) to use for GEMM kc-blocking
+// heuristics — see schedule_dot in kernels/dot/schedule.cc for the formula.
+//
+// When cpuinfo is available, picks the L2 with the largest per-thread share
+// (size / processor_count) across all reported L2 caches — on asymmetric
+// systems (Apple M-series P+E, Arm big.LITTLE) this selects the performance
+// cluster; on homogeneous systems every L2 yields the same answer. The
+// selected share is doubled to account for the stripe gracefully spilling
+// into the outer SLC/L3 present on Apple M-series and Neoverse cores. This
+// heuristic is tuned for single-threaded latency — see the implementation
+// for the multi-thread caveat. Falls back to a conservative 1 MiB when
+// cpuinfo is not available.
+size_t get_l2_cache_size();
+
 }  // namespace ynn
 
 #endif  // XNNPACK_YNNPACK_BASE_ARCH_H_

ynnpack/kernels/dot/schedule.cc

Lines changed: 22 additions & 5 deletions
@@ -38,16 +38,33 @@ span<dot_loop> schedule_dot(span<const size_t> cache_sizes, size_t m, size_t n,
     *loop++ = dot_loop{dot_loop::n, blocks};
     n = block_n * blocks;
   };
-  auto make_k_loop = [&](size_t blocks) {
-    if (blocks == 0 || k1 <= block_k * blocks) return;
+  auto make_k_loop = [&](size_t blocks_max) {
+    if (blocks_max == 0) return;
+    const size_t k_blocks = ceil_div(k1, block_k);
+    // Fits in a single iteration — no loop needed.
+    if (k_blocks <= blocks_max) return;
+    // Split into N near-equal iterations rather than one cache-max iter plus
+    // a tail. Two reasons: (1) a small tail amortises kernel-call overhead
+    // poorly, and (2) a kc that almost-fills the cache with no headroom
+    // causes the B stripe to spill into the outer cache, which has much
+    // higher variance than a slightly smaller stripe that fits cleanly.
+    const size_t niter = ceil_div(k_blocks, blocks_max);
+    const size_t blocks = ceil_div(k_blocks, niter);
     *loop++ = dot_loop{dot_loop::k, blocks};
     k1 = block_k * blocks;
   };
 
   for (size_t cache_size : cache_sizes) {
-    // TODO(b/447988052): We can be way smarter about this than we are now.
-    make_k_loop(
-        floor_div(cache_size, k2 * block_n * b_elem_size * block_k));
+    // Size kc so that a (kc × n) stripe of B fits in this cache. Inside each
+    // outer k-iteration we sweep all (m, n) tiles; the B stripe is loaded
+    // once from memory on the first m-iteration and reused from cache on
+    // subsequent m-iterations, so what matters is kc×n×b_elem_size, not the
+    // per-kernel-call kc×block_n stripe.
+    // ~6% headroom so the B stripe doesn't exactly fill the budget — at
+    // that boundary, concurrent A/C/TLB traffic evicts B into the outer
+    // cache, hurting both mean and run-to-run variance.
+    const size_t kc_budget = cache_size - cache_size / 16;
+    make_k_loop(floor_div(kc_budget, k2 * n * b_elem_size * block_k));
     if (n * b_elem_size <= m * a_elem_size) {
       // Tiles of B are smaller than tiles of A, we should assume B fits in
       // cache.
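The even-split policy in `make_k_loop` can be contrasted with the old cache-max-plus-tail behaviour it replaces. A minimal sketch (function names are hypothetical, assuming block_k = 1 so blocks equal rows):

```python
import math

def split_with_tail(k, kc_max):
    """Old policy (sketch): cache-max iterations plus a remainder tail."""
    out = [kc_max] * (k // kc_max)
    if k % kc_max:
        out.append(k % kc_max)
    return out

def split_evenly(k, kc_max):
    """New policy (sketch): niter = ceil(k/kc_max) near-equal iterations."""
    niter = math.ceil(k / kc_max)
    blocks = math.ceil(k / niter)
    return [min(blocks, k - i * blocks) for i in range(niter)]

print(split_with_tail(1200, 960))  # [960, 240]: small tail, near-full stripe
print(split_evenly(1200, 960))     # [600, 600]: both fit comfortably
```

The old split pairs a stripe that almost fills the cache with a tail too small to amortise kernel-call overhead; the even split keeps every iteration the same size and well inside the budget.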

ynnpack/kernels/dot/schedule_bench.cc

Lines changed: 51 additions & 11 deletions
@@ -198,7 +198,8 @@ double run_benchmark(TA, TB, TC, const kernel_info& kernel, size_t m, size_t n,
 int main(int argc, char** argv) {
   if (argc < 3) {
     std::cerr << "Usage: " << argv[0]
-              << " <kernel_name> <MxNxK> [<loop1> <loop2> ...]" << std::endl;
+              << " <kernel_name> <MxNxK> [<loop1>|auto:<cache1>[,<cache2>...]]"
+              << std::endl;
     return 1;
   }
 
@@ -209,23 +210,35 @@ int main(int argc, char** argv) {
     return 1;
   }
 
+  // Find the kernel
+  auto kernel = ynn::get_kernel(kernel_name);
+  if (!kernel.kernel) {
+    std::cerr << "Unknown kernel: " << kernel_name << std::endl;
+    return 1;
+  }
+
   std::vector<ynn::dot_loop> loops;
+  std::vector<size_t> auto_cache_sizes;
   for (int i = 3; i < argc; ++i) {
-    ynn::dot_loop loop = ynn::parse_dot_loop(argv[i]);
+    std::string arg = argv[i];
+    if (arg.rfind("auto:", 0) == 0) {
+      std::stringstream ss(arg.substr(5));
+      std::string token;
+      while (std::getline(ss, token, ',')) {
+        if (token.empty()) continue;
+        size_t cs = std::stoul(token);
+        auto_cache_sizes.push_back(cs);
+      }
+      continue;
+    }
+    ynn::dot_loop loop = ynn::parse_dot_loop(arg);
     if (loop.dim < 0 || loop.blocks == 0) {
-      std::cerr << "Error parsing loop specifier: " << argv[i] << std::endl;
+      std::cerr << "Error parsing loop specifier: " << arg << std::endl;
       return 1;
     }
     loops.push_back(loop);
   }
 
-  // Find the kernel
-  auto kernel = ynn::get_kernel(kernel_name);
-  if (!kernel.kernel) {
-    std::cerr << "Unknown kernel: " << kernel_name << std::endl;
-    return 1;
-  }
-
   // Kernels require an outer loop for m, make sure we have one.
   size_t min_block_m = -1;
   // The dot loops are interpreted as a multiple of blocks. To make this CLI
@@ -244,9 +257,36 @@ int main(int argc, char** argv) {
       break;
     }
   }
-  if (min_block_m > 1) loops.push_back({ynn::dot_loop::m, 1});
 
   double t = ynn::SwitchThreeTypes(kernel.type, [&](auto a, auto b, auto c) {
+    using TA = decltype(a);
+    using TB = decltype(b);
+    std::vector<ynn::dot_loop> auto_storage;
+    if (!auto_cache_sizes.empty()) {
+      auto_storage.resize(auto_cache_sizes.size() * 3);
+      size_t ks[] = {static_cast<size_t>(shape.k), 1, 1};
+      ynn::span<const size_t> cs(auto_cache_sizes);
+      auto auto_loops = ynn::schedule_dot(
+          cs, static_cast<size_t>(shape.m), static_cast<size_t>(shape.n),
+          ynn::span<const size_t>(ks), kernel.block_m, kernel.block_n,
+          kernel.block_k, sizeof(TA), sizeof(TB), auto_storage.data());
+      std::cerr << "[auto schedule] ";
+      for (const auto& l : auto_loops) {
+        char d = l.dim == ynn::dot_loop::m ? 'm'
                 : l.dim == ynn::dot_loop::n ? 'n' : 'k';
+        size_t bs = l.dim == ynn::dot_loop::m ? kernel.block_m
                    : l.dim == ynn::dot_loop::n ? kernel.block_n
                                                : kernel.block_k;
+        std::cerr << d << (l.blocks * bs) << " ";
+      }
+      std::cerr << std::endl;
+      for (const auto& l : auto_loops) loops.push_back(l);
+      for (const auto& l : auto_loops) {
+        if (l.dim == ynn::dot_loop::m)
+          min_block_m = std::min(min_block_m, l.blocks);
+      }
+    }
+    if (min_block_m > 1) loops.push_back({ynn::dot_loop::m, 1});
     return ynn::run_benchmark(a, b, c, kernel, shape.m, shape.n, shape.k,
                               loops);
   });

ynnpack/kernels/dot/schedule_test.cc

Lines changed: 112 additions & 0 deletions
@@ -163,4 +163,116 @@ TEST(run_dot, loop_k) {
                          dot_call_at(m, n, block_k, 0, 0, 3 * block_k)));
 }
 
+// -- Targeted tests for schedule_dot itself --
+
+bool operator==(const dot_loop& a, const dot_loop& b) {
+  return a.dim == b.dim && a.blocks == b.blocks;
+}
+
+std::ostream& operator<<(std::ostream& os, const dot_loop& l) {
+  const char* d = l.dim == dot_loop::m ? "m"
                  : l.dim == dot_loop::n ? "n"
                  : l.dim == dot_loop::k ? "k"
                                         : "?";
+  return os << d << "x" << l.blocks;
+}
+
+// A cache budget much larger than the working set yields no blocking — the
+// default {m, 1} safety loop is emitted so run_dot always has at least one
+// loop to walk.
+TEST(schedule_dot, no_blocking_when_everything_fits) {
+  dot_loop storage[3];
+  const size_t cache_sizes[] = {8 * 1024 * 1024};  // 8 MiB
+  const size_t ks[] = {64};
+  auto loops = schedule_dot(cache_sizes, /*m=*/16, /*n=*/64, ks,
+                            /*block_m=*/16, /*block_n=*/64, /*block_k=*/1,
+                            /*a_elem_size=*/4, /*b_elem_size=*/4, storage);
+  EXPECT_THAT(loops, ElementsAre(dot_loop{dot_loop::m, 1}));
+}
+
+// Large shape vs a 128 KiB cache: kc_max = 15/16 * 128 KiB /
+// (n * b_elem * block_k) = 120 KiB / (2048 * 4 * 1) = 15 block_k units.
+// The 15/16 factor is the safety headroom applied in schedule.cc.
+TEST(schedule_dot, k_loop_sized_from_current_n) {
+  dot_loop storage[3];
+  const size_t cache_sizes[] = {128 * 1024};
+  const size_t ks[] = {2048};
+  auto loops = schedule_dot(cache_sizes, /*m=*/2048, /*n=*/2048, ks,
+                            /*block_m=*/16, /*block_n=*/64, /*block_k=*/1,
+                            /*a_elem_size=*/4, /*b_elem_size=*/4, storage);
+  EXPECT_THAT(loops, ElementsAre(dot_loop{dot_loop::k, 15},
+                                 dot_loop{dot_loop::m, 1},
+                                 dot_loop{dot_loop::n, 1}));
+}
+
+// Even-split: when k slightly overflows the natural kc, we split into two
+// near-equal iterations rather than one cache-max iter plus a small tail.
+// With a 16 MiB cache, n = 4096, and the 15/16 safety headroom, kc_max =
+// 15 MiB / (4096 * 4) = 960. For k = 1200, niter = 2, blocks = 600.
+TEST(schedule_dot, k_loop_splits_evenly_when_k_slightly_over_kc_max) {
+  dot_loop storage[3];
+  const size_t cache_sizes[] = {16ULL * 1024 * 1024};
+  const size_t ks[] = {1200};
+  auto loops = schedule_dot(cache_sizes, /*m=*/4096, /*n=*/4096, ks,
+                            /*block_m=*/16, /*block_n=*/64, /*block_k=*/1,
+                            /*a_elem_size=*/4, /*b_elem_size=*/4, storage);
+  EXPECT_THAT(loops, ElementsAre(dot_loop{dot_loop::k, 600},
+                                 dot_loop{dot_loop::m, 1},
+                                 dot_loop{dot_loop::n, 1}));
+}
+
+// Even-split at ~1.5x: k = 1536 is ~1.6 * kc_max (960). Old policy would
+// have run one cache-max iter plus a small tail; the new even-split gives
+// two near-equal iters of 768, each comfortably inside cache.
+TEST(schedule_dot, k_loop_splits_evenly_at_1p5x_boundary) {
+  dot_loop storage[3];
+  const size_t cache_sizes[] = {16ULL * 1024 * 1024};
+  const size_t ks[] = {1024 + 1024 / 2};  // k1 = 1536
+  auto loops = schedule_dot(cache_sizes, /*m=*/4096, /*n=*/4096, ks,
+                            /*block_m=*/16, /*block_n=*/64, /*block_k=*/1,
+                            /*a_elem_size=*/4, /*b_elem_size=*/4, storage);
+  EXPECT_THAT(loops, ElementsAre(dot_loop{dot_loop::k, 768},
+                                 dot_loop{dot_loop::m, 1},
+                                 dot_loop{dot_loop::n, 1}));
+}
+
+// Larger overflow: k = 1600 gives niter = 2, blocks = ceil(1600/2) = 800.
+TEST(schedule_dot, k_loop_splits_evenly_into_two_when_below_2x_kc_max) {
+  dot_loop storage[3];
+  const size_t cache_sizes[] = {16ULL * 1024 * 1024};
+  const size_t ks[] = {1600};
+  auto loops = schedule_dot(cache_sizes, /*m=*/4096, /*n=*/4096, ks,
+                            /*block_m=*/16, /*block_n=*/64, /*block_k=*/1,
+                            /*a_elem_size=*/4, /*b_elem_size=*/4, storage);
+  EXPECT_THAT(loops, ElementsAre(dot_loop{dot_loop::k, 800},
+                                 dot_loop{dot_loop::m, 1},
+                                 dot_loop{dot_loop::n, 1}));
+}
+
+// Many iterations: k = 2880 = 3 * kc_max (960 after the 15/16 headroom).
+// The resulting blocks equals kc_max exactly when k is a clean multiple.
+TEST(schedule_dot, k_loop_uses_kc_max_when_k_is_multiple_of_kc_max) {
+  dot_loop storage[3];
+  const size_t cache_sizes[] = {16ULL * 1024 * 1024};
+  const size_t ks[] = {2880};
+  auto loops = schedule_dot(cache_sizes, /*m=*/4096, /*n=*/4096, ks,
+                            /*block_m=*/16, /*block_n=*/64, /*block_k=*/1,
+                            /*a_elem_size=*/4, /*b_elem_size=*/4, storage);
+  EXPECT_THAT(loops, ElementsAre(dot_loop{dot_loop::k, 960},
+                                 dot_loop{dot_loop::m, 1},
+                                 dot_loop{dot_loop::n, 1}));
+}
+
+// Boundary: k = kc_max exactly -> fits in one iteration, no k-loop emitted.
+TEST(schedule_dot, k_loop_skipped_when_k_equals_kc_max) {
+  dot_loop storage[3];
+  const size_t cache_sizes[] = {16ULL * 1024 * 1024};
+  const size_t ks[] = {960};  // kc_max = 960 after the 15/16 headroom
+  auto loops = schedule_dot(cache_sizes, /*m=*/4096, /*n=*/4096, ks,
+                            /*block_m=*/16, /*block_n=*/64, /*block_k=*/1,
+                            /*a_elem_size=*/4, /*b_elem_size=*/4, storage);
+  EXPECT_THAT(loops, ElementsAre(dot_loop{dot_loop::m, 1},
+                                 dot_loop{dot_loop::n, 1}));
+}
+
 }  // namespace ynn

ynnpack/subgraph/dot.cc

Lines changed: 11 additions & 5 deletions
@@ -18,6 +18,7 @@
 #include <variant>
 #include <vector>
 
+#include "ynnpack/base/arch.h"
 #include "ynnpack/base/arithmetic.h"
 #include "ynnpack/base/base.h"
 #include "ynnpack/base/log.h"
@@ -44,9 +45,14 @@ namespace ynn {
 
 namespace {
 
-// TODO(dsharlet): This should probably be a parameter we learn based on cpuinfo
-// or other source of CPU metadata. This was determined experimentally.
-constexpr index_t cache_size_l2 = 128 * 1024;
+// Effective L2 cache budget for kc-blocking in schedule_dot. Sized so that a
+// (kc × N) stripe of B fits in this many bytes — see the formula in
+// kernels/dot/schedule.cc. When cpuinfo is available this comes from the
+// running CPU's reported L2 (per-thread share, with a 2x factor for gradual
+// spill into outer caches); otherwise a conservative 1 MiB is used.
+inline index_t cache_size_l2() {
+  return static_cast<index_t>(get_l2_cache_size());
+}
 
 // When we want arithmetic to be consistent, we need to make all tiling
 // decisions independently of any hardware dependent parameters (cache sizes,
@@ -239,7 +245,7 @@ auto make_dot_impl(dot_type type, bool consistent_arithmetic, bool transposed_a,
                    c_stride_m, c);
   };
 
-  const size_t cache_sizes[] = {cache_size_l2};
+  const size_t cache_sizes[] = {static_cast<size_t>(cache_size_l2())};
 
   // We need up to 3 loops per cache level.
   dot_loop loops_storage[std::size(cache_sizes) * 3];
@@ -405,7 +411,7 @@ uint32_t define_pack_b(ynn_subgraph_t subgraph, const dot_type& type,
   slinky::expr k3 = num_k_dims >= 3 ? b.extent(3) : 1;
 
   const index_t elem_size_bits = type_size_bytes(b.type) * 8 / element_count;
-  const index_t cache_elements = cache_size_l2 * 8 / elem_size_bits;
+  const index_t cache_elements = cache_size_l2() * 8 / elem_size_bits;
 
   // When choosing block_n, we have the following concerns:
   // - We want to make the block bigger than the kernel's `block_n`
