Summary
Evaluate (and possibly port) the 128-byte chunked memory access pattern from the C++ `BoxFilterDecimate`, but only if benchmarks show that cache misses dominate on real-world image sizes.
Context
Follow-up to #130 (M8 step 1). The C++ BoxFilterDecimate in WebARKitLib/lib/SRC/KPM/FreakMatcher/detectors/pyramid.cpp processes images in 128-byte column chunks with separate remainder handling:
```cpp
int num_c = src_width/128;
int num_r = src_width%128;
for (int i = 0; i < src_height_minus_1; i += 2) {
    // ... 128-byte column loop
    for (int j = 0; j < num_c; j++, src_ptr += 128, dst_ptr += 64) {
        HorizontalBoxFilterDecimate128(tmp1, src_ptr);
        HorizontalBoxFilterDecimate128(tmp2, src_ptr + src_step);
        VerticalBoxFilter64(dst_ptr, tmp1, tmp2);
    }
    // ... remainder
}
```
This is a 2013-era L1 cache optimization. PR #130 intentionally dropped it because:
- The output bits are identical with or without chunking — it's purely a memory access pattern.
- Modern LLVM auto-vectorizes the simple row-major loop well.
- Adding it back requires ~50 lines of split + remainder handling.
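To make the first bullet concrete, here is a minimal Rust sketch of both access patterns. It assumes the filter is a 2x2 average with round-to-nearest; the exact arithmetic of the C++ code is not shown in this issue, so treat `avg4` as a stand-in. The names `decimate_row_major` and `decimate_chunked` are hypothetical and are not the functions from PR #130.

```rust
/// Chunk width in source bytes, mirroring the 128-byte chunks of the C++ code.
const CHUNK: usize = 128;

/// Average of one 2x2 block, rounded to nearest.
/// ASSUMPTION: the rounding used by the real filter is not confirmed here.
#[inline]
fn avg4(a: u8, b: u8, c: u8, d: u8) -> u8 {
    ((a as u16 + b as u16 + c as u16 + d as u16 + 2) >> 2) as u8
}

/// Simple row-major 2x decimation: each output row is produced left to right.
fn decimate_row_major(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    assert_eq!(src.len(), width * height);
    let (dw, dh) = (width / 2, height / 2);
    let mut dst = vec![0u8; dw * dh];
    for y in 0..dh {
        let top = &src[2 * y * width..][..width];
        let bot = &src[(2 * y + 1) * width..][..width];
        let out = &mut dst[y * dw..][..dw];
        for x in 0..dw {
            out[x] = avg4(top[2 * x], top[2 * x + 1], bot[2 * x], bot[2 * x + 1]);
        }
    }
    dst
}

/// Chunked variant: same arithmetic, but each row pair is processed in
/// 128-byte column chunks (64 output bytes each), then a remainder pass.
fn decimate_chunked(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    assert_eq!(src.len(), width * height);
    let (dw, dh) = (width / 2, height / 2);
    let mut dst = vec![0u8; dw * dh];
    let num_chunks = width / CHUNK; // full chunks per row
    let half = CHUNK / 2;
    let mut tmp1 = [0u16; CHUNK / 2];
    let mut tmp2 = [0u16; CHUNK / 2];
    for y in 0..dh {
        let top = &src[2 * y * width..][..width];
        let bot = &src[(2 * y + 1) * width..][..width];
        let out = &mut dst[y * dw..][..dw];
        for c in 0..num_chunks {
            let t = &top[c * CHUNK..][..CHUNK];
            let b = &bot[c * CHUNK..][..CHUNK];
            // Horizontal pass: pairwise sums for the two source rows.
            for k in 0..half {
                tmp1[k] = t[2 * k] as u16 + t[2 * k + 1] as u16;
                tmp2[k] = b[2 * k] as u16 + b[2 * k + 1] as u16;
            }
            // Vertical pass: combine the two rows and round.
            let o = &mut out[c * half..][..half];
            for k in 0..half {
                o[k] = ((tmp1[k] + tmp2[k] + 2) >> 2) as u8;
            }
        }
        // Remainder columns that do not fill a whole chunk.
        for x in num_chunks * half..dw {
            out[x] = avg4(top[2 * x], top[2 * x + 1], bot[2 * x], bot[2 * x + 1]);
        }
    }
    dst
}
```

Under these assumptions the two functions return byte-identical buffers for any input, including widths that are not a multiple of 128, so any difference between them is in memory traffic and code size only. The chunked version also shows where the extra split and remainder code comes from.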
This issue tracks whether to add it back, gated on benchmark evidence.
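One way to gather throughput numbers is a small timing helper like the sketch below. It is only an illustration: `ns_per_pixel` is a hypothetical name, the signature `Fn(&[u8], usize, usize) -> Vec<u8>` is an assumed shape for the decimation function, and the synthetic input is a placeholder for real frames. A proper benchmark harness would likely give more stable results.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Time `decimate` on a synthetic `width` x `height` 8-bit image and return
/// the best-of-`iters` cost in nanoseconds per source pixel.
/// Returns infinity when `iters` is 0.
fn ns_per_pixel<F>(decimate: F, width: usize, height: usize, iters: usize) -> f64
where
    F: Fn(&[u8], usize, usize) -> Vec<u8>,
{
    // Synthetic input (ASSUMPTION: real measurements should use real frames).
    let src: Vec<u8> = (0..width * height).map(|i| (i % 251) as u8).collect();
    let mut best = f64::INFINITY;
    for _ in 0..iters {
        let start = Instant::now();
        // black_box keeps the optimizer from discarding the call or its result.
        let out = decimate(black_box(src.as_slice()), width, height);
        black_box(out);
        let ns = start.elapsed().as_nanos() as f64;
        best = best.min(ns);
    }
    best / (width * height) as f64
}
```

Wall-clock time alone does not show whether cache misses are the cause of any slowdown, so a binary built around this helper would still need to be run under a hardware-counter tool such as `perf stat -e cache-misses,LLC-misses` on Linux.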
Depends on
Scope
- If profiling (`perf stat -e cache-misses,LLC-misses` on Linux; equivalent on Windows) shows that cache misses dominate, prototype the chunked layout and measure.
Acceptance criteria
- Profile data attached to the PR/issue.
- Either:
  - A PR adding the chunked path with benchmark proof, or
  - This issue is closed as `wontfix` with a one-line note: "modern Rust/LLVM handles the access pattern well enough; chunking not worth the complexity."
References
- `WebARKitLib/lib/SRC/KPM/FreakMatcher/detectors/pyramid.cpp` lines 369–409 (`BoxFilterDecimate`)