perf(kpm): evaluate 128-byte chunked memory layout for Pyramid downsample

## Summary

Evaluate, and possibly port, the 128-byte chunked memory access pattern from the C++ `BoxFilterDecimate`, but **only if benchmarks show that cache misses dominate** on real-world image sizes.

## Context

Follow-up to #130 (M8 step 1). The C++ `BoxFilterDecimate` in `WebARKitLib/lib/SRC/KPM/FreakMatcher/detectors/pyramid.cpp` processes images in 128-byte column chunks with separate remainder handling:

```cpp
int num_c = src_width/128;
int num_r = src_width%128;

for (int i = 0; i < src_height_minus_1; i += 2) {
    // ... 128-byte column loop
    for (int j = 0; j < num_c; j++, src_ptr += 128, dst_ptr += 64) {
        HorizontalBoxFilterDecimate128(tmp1, src_ptr);
        HorizontalBoxFilterDecimate128(tmp2, src_ptr + src_step);
        VerticalBoxFilter64(dst_ptr, tmp1, tmp2);
    }
    // ... remainder
}
```

This is a **2013-era L1 cache optimization**. PR #130 intentionally dropped it because:

1. The output bits are **identical** with or without chunking — it's purely a memory access pattern.
2. Modern LLVM auto-vectorizes the simple row-major loop well.
3. Adding it back requires ~50 lines of split + remainder handling.

This issue tracks **whether** to add it back, gated on benchmark evidence.
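To make point 1 concrete, here is a minimal sketch of both access patterns side by side. It is not the code from PR #130 or a line-for-line port of `pyramid.cpp`: the function names are hypothetical, and the rounding rule `(sum + 2) >> 2` is an assumption chosen for illustration. The point is only that the chunked variant performs the same per-pixel arithmetic in a different traversal order, so the outputs match.

```rust
/// Row-major 2x2 box-filter decimation over an 8-bit grayscale image.
/// Assumed rounding: (a + b + c + d + 2) >> 2. Odd trailing rows/columns are dropped.
pub fn downsample_row_major(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    let (dw, dh) = (width / 2, height / 2);
    let mut dst = vec![0u8; dw * dh];
    for y in 0..dh {
        let top = &src[2 * y * width..2 * y * width + width];
        let bot = &src[(2 * y + 1) * width..(2 * y + 1) * width + width];
        let out = &mut dst[y * dw..(y + 1) * dw];
        for x in 0..dw {
            let sum = top[2 * x] as u16
                + top[2 * x + 1] as u16
                + bot[2 * x] as u16
                + bot[2 * x + 1] as u16;
            out[x] = ((sum + 2) >> 2) as u8;
        }
    }
    dst
}

/// Same arithmetic, but each row pair is walked in 128-byte source chunks
/// (64 output bytes) followed by a remainder loop, mirroring the C++ structure.
pub fn downsample_chunked(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    const CHUNK: usize = 128;
    const HALF: usize = CHUNK / 2;
    let (dw, dh) = (width / 2, height / 2);
    let num_c = width / CHUNK; // full 128-byte chunks per row
    let mut dst = vec![0u8; dw * dh];
    let mut tmp1 = [0u16; HALF]; // horizontal pair sums, top row
    let mut tmp2 = [0u16; HALF]; // horizontal pair sums, bottom row
    for y in 0..dh {
        let top = &src[2 * y * width..2 * y * width + width];
        let bot = &src[(2 * y + 1) * width..(2 * y + 1) * width + width];
        let out = &mut dst[y * dw..(y + 1) * dw];
        for c in 0..num_c {
            let s = c * CHUNK;
            // Horizontal pass: 128 source bytes -> 64 pair sums per row.
            for i in 0..HALF {
                tmp1[i] = top[s + 2 * i] as u16 + top[s + 2 * i + 1] as u16;
                tmp2[i] = bot[s + 2 * i] as u16 + bot[s + 2 * i + 1] as u16;
            }
            // Vertical pass: combine the two rows and round.
            for i in 0..HALF {
                out[c * HALF + i] = ((tmp1[i] + tmp2[i] + 2) >> 2) as u8;
            }
        }
        // Remainder: columns that do not fill a whole 128-byte chunk.
        for x in num_c * HALF..dw {
            let sum = top[2 * x] as u16
                + top[2 * x + 1] as u16
                + bot[2 * x] as u16
                + bot[2 * x + 1] as u16;
            out[x] = ((sum + 2) >> 2) as u8;
        }
    }
    dst
}
```

An equality check between the two functions on widths that are not a multiple of 128 would be a cheap regression test to keep if the chunked path is ever added.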

## Depends on

- #131 (need baseline ns/iter numbers per size)
- Likely also #132 (SIMD will probably hide any cache cost the chunked layout was working around)

## Scope

- After #131 and #132 land, profile the SIMD downsample on cache-relevant sizes (1080p, 4K).
- If L1d misses are a measurable fraction of cycles (e.g. `perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-load-misses` on Linux, where the CPU exposes those events; an equivalent profiler on Windows), prototype the chunked layout and measure.
- Decision criterion: **port only if it shows ≥10% speedup over non-chunked SIMD** on at least one supported target.
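
The decision rule can be written down as a small check against the ns/iter numbers from #131. This is a sketch under one assumption: "≥10% speedup" is read as `baseline_ns / chunked_ns >= 1.10`. The function name is hypothetical; if the intended meaning is a 10% reduction in time instead, the threshold changes slightly.

```rust
/// Returns true when the chunked path meets the assumed port threshold.
/// Speedup is defined here as baseline time divided by chunked time.
pub fn meets_port_threshold(baseline_ns: f64, chunked_ns: f64) -> bool {
    const MIN_SPEEDUP: f64 = 1.10;
    // Guard against a zero or negative measurement before dividing.
    chunked_ns > 0.0 && baseline_ns / chunked_ns >= MIN_SPEEDUP
}
```

For example, 100 ns/iter non-chunked against 90 ns/iter chunked is a speedup of about 1.11 and passes, while 100 against 95 (about 1.05) does not.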

## Acceptance criteria

- Profile data attached to the PR/issue.
- Either:
  - A PR adding the chunked path with benchmark proof, **or**
  - This issue is closed `wontfix` with a one-line note: "modern Rust/LLVM handles the access pattern well enough; chunking not worth the complexity."

## References

- PR #130 — original port (scalar, no chunking)
- #131 — benchmark baseline
- #132 — SIMD path
- C++ reference: `WebARKitLib/lib/SRC/KPM/FreakMatcher/detectors/pyramid.cpp` lines 369–409 (`BoxFilterDecimate`)
