perf(kpm): evaluate 128-byte chunked memory layout for Pyramid downsample #133

@kalwalt


Summary

Evaluate the 128-byte chunked memory access pattern from the C++ BoxFilterDecimate, and port it only if benchmarks show that cache misses dominate on real-world image sizes.

Context

Follow-up to #130 (M8 step 1). The C++ BoxFilterDecimate in WebARKitLib/lib/SRC/KPM/FreakMatcher/detectors/pyramid.cpp processes images in 128-byte column chunks with separate remainder handling:

int num_c = src_width/128;
int num_r = src_width%128;

for (int i = 0; i < src_height_minus_1; i += 2) {
    // ... 128-byte column loop
    for (int j = 0; j < num_c; j++, src_ptr += 128, dst_ptr += 64) {
        HorizontalBoxFilterDecimate128(tmp1, src_ptr);
        HorizontalBoxFilterDecimate128(tmp2, src_ptr + src_step);
        VerticalBoxFilter64(dst_ptr, tmp1, tmp2);
    }
    // ... remainder
}

This is a 2013-era L1 cache optimization. PR #130 intentionally dropped it because:

  1. The output is bit-identical with or without chunking; only the memory access pattern changes.
  2. Modern LLVM auto-vectorizes the simple row-major loop well.
  3. Adding it back requires ~50 lines of split + remainder handling.
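
To make points 1 and 3 concrete, here is a minimal Rust sketch of both shapes. It is not the code from PR #130 or from pyramid.cpp: every name (`decimate_simple`, `decimate_chunked`, `hsum`, `vcombine`, `avg4`) is hypothetical, and the rounding rule `(sum + 2) >> 2` is an assumption, since the C++ helper bodies are not shown above. The chunked version mirrors the tmp1/tmp2 structure of the C++ loop: 128 source bytes in, 64 output bytes out, followed by a remainder pass.

```rust
// Hypothetical sketch; names and rounding rule are assumptions, not the real API.
const CHUNK: usize = 128; // source bytes per chunk, as in the C++ loop

/// Average one 2x2 block. Assumed rounding: (sum + 2) >> 2.
fn avg4(a: u8, b: u8, c: u8, d: u8) -> u8 {
    ((a as u16 + b as u16 + c as u16 + d as u16 + 2) >> 2) as u8
}

/// Simple row-major 2x decimation; an odd trailing row or column is ignored.
fn decimate_simple(src: &[u8], w: usize, h: usize) -> Vec<u8> {
    let (dw, dh) = (w / 2, h / 2);
    let mut dst = vec![0u8; dw * dh];
    for y in 0..dh {
        let r0 = &src[2 * y * w..][..w];
        let r1 = &src[(2 * y + 1) * w..][..w];
        for x in 0..dw {
            dst[y * dw + x] = avg4(r0[2 * x], r0[2 * x + 1], r1[2 * x], r1[2 * x + 1]);
        }
    }
    dst
}

/// Horizontal stage: sum each adjacent pixel pair of one row into `tmp`.
fn hsum(tmp: &mut [u16], row: &[u8]) {
    for (t, p) in tmp.iter_mut().zip(row.chunks_exact(2)) {
        *t = p[0] as u16 + p[1] as u16;
    }
}

/// Vertical stage: combine two rows of pair sums into rounded averages.
fn vcombine(dst: &mut [u8], t1: &[u16], t2: &[u16]) {
    for ((d, a), b) in dst.iter_mut().zip(t1).zip(t2) {
        *d = ((a + b + 2) >> 2) as u8;
    }
}

/// Chunked variant: full 128-byte chunks first, then the remainder columns.
fn decimate_chunked(src: &[u8], w: usize, h: usize) -> Vec<u8> {
    let (dw, dh) = (w / 2, h / 2);
    let mut dst = vec![0u8; dw * dh];
    let (num_c, num_r) = (w / CHUNK, w % CHUNK);
    let mut tmp1 = [0u16; CHUNK / 2];
    let mut tmp2 = [0u16; CHUNK / 2];
    for y in 0..dh {
        let r0 = &src[2 * y * w..][..w];
        let r1 = &src[(2 * y + 1) * w..][..w];
        let out = &mut dst[y * dw..(y + 1) * dw];
        for j in 0..num_c {
            let (s, o) = (j * CHUNK, j * CHUNK / 2);
            hsum(&mut tmp1, &r0[s..s + CHUNK]);
            hsum(&mut tmp2, &r1[s..s + CHUNK]);
            vcombine(&mut out[o..o + CHUNK / 2], &tmp1, &tmp2);
        }
        // Remainder: num_r / 2 output pixels past the last full chunk.
        let (s, o, n) = (num_c * CHUNK, num_c * CHUNK / 2, num_r / 2);
        hsum(&mut tmp1[..n], &r0[s..s + 2 * n]);
        hsum(&mut tmp2[..n], &r1[s..s + 2 * n]);
        vcombine(&mut out[o..o + n], &tmp1[..n], &tmp2[..n]);
    }
    dst
}
```

Under these assumptions both functions return the same bytes for any input, because the two-stage sum is the same integer arithmetic as `avg4` in a different order. The chunked path roughly doubles the code for no change in output, which is the complexity cost the list refers to.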

This issue tracks whether to add it back, gated on benchmark evidence.

Depends on

Scope

Acceptance criteria

  • Profile data attached to the PR/issue.
  • Either:
    • A PR adding chunked path with benchmark proof, or
    • This issue is closed wontfix with a one-line note: "modern Rust/LLVM handles the access pattern well enough; chunking not worth the complexity."
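
As a starting point for the benchmark evidence, a wall-clock harness could look like the sketch below. It is only a sketch: `decimate`, `mean_time`, and `run_sizes` are hypothetical names, `decimate` is a stand-in for whichever function is under test (with the same assumed `(sum + 2) >> 2` rounding), and the image sizes passed in are the caller's choice. Wall-clock time alone does not show whether cache misses dominate; that part of the profile would have to come from a hardware-counter profiler.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Plain row-major 2x2 box decimation; stand-in for the function under test.
/// Assumed rounding: (sum + 2) >> 2.
fn decimate(src: &[u8], w: usize, h: usize) -> Vec<u8> {
    let (dw, dh) = (w / 2, h / 2);
    let mut dst = Vec::with_capacity(dw * dh);
    for y in 0..dh {
        let (r0, r1) = (&src[2 * y * w..][..w], &src[(2 * y + 1) * w..][..w]);
        for x in 0..dw {
            let sum = r0[2 * x] as u16 + r0[2 * x + 1] as u16
                + r1[2 * x] as u16 + r1[2 * x + 1] as u16;
            dst.push(((sum + 2) >> 2) as u8);
        }
    }
    dst
}

/// Mean wall-clock time per call over `iters` timed calls (`iters` must be > 0).
fn mean_time(iters: u32, mut f: impl FnMut() -> Vec<u8>) -> Duration {
    black_box(f()); // warm-up run, not timed
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f()); // black_box keeps the optimizer from discarding the call
    }
    start.elapsed() / iters
}

/// Returns (width, height, mean time per call) for each requested size.
fn run_sizes(sizes: &[(usize, usize)], iters: u32) -> Vec<(usize, usize, Duration)> {
    sizes
        .iter()
        .map(|&(w, h)| {
            // Deterministic synthetic image; real camera frames would be better.
            let src: Vec<u8> = (0..w * h).map(|i| (i * 31 % 251) as u8).collect();
            let t = mean_time(iters, || decimate(black_box(&src), w, h));
            (w, h, t)
        })
        .collect()
}
```

Calling `run_sizes(&[(640, 480), (1920, 1080)], 200)` once per candidate implementation, in a release build, gives a per-size comparison that can be pasted into the issue. A dedicated benchmarking crate would give tighter statistics; the standard-library version is used here only to keep the sketch dependency-free.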

References
