perf(kpm): SIMD path for Pyramid downsample (AVX2/SSE4.1/wasm32) #132

## Summary

Add SIMD implementations of `kpm::freak::pyramid::downsample` gated behind the project's existing feature flags, with `criterion` benchmarks proving the speedup over the scalar baseline.

## Context

Follow-up to #130 (M8 step 1), which landed a clean scalar baseline. The downsample inner loop is a textbook SIMD target: for each output pixel it reads 4 source pixels and computes `(sum + 2) >> 2`.
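For reference, the scalar loop described above could look roughly like the sketch below. It is not the code from #130. It assumes the 4 source pixels form a 2x2 block, that the image is row-major 8-bit grayscale, and that a trailing odd row or column is dropped. The name `downsample_scalar` and its signature are hypothetical.

```rust
/// Hypothetical scalar baseline: halves an 8-bit grayscale image by taking
/// the rounded mean of each 2x2 block. Assumes `src` is row-major and holds
/// exactly `width * height` bytes; a trailing odd row/column is ignored.
pub fn downsample_scalar(src: &[u8], width: usize, height: usize) -> Vec<u8> {
    assert_eq!(src.len(), width * height);
    let (out_w, out_h) = (width / 2, height / 2);
    let mut dst = Vec::with_capacity(out_w * out_h);
    for y in 0..out_h {
        // The two source rows that feed output row `y`.
        let top = &src[(2 * y) * width..][..width];
        let bot = &src[(2 * y + 1) * width..][..width];
        for x in 0..out_w {
            // Widen to u16: four bytes sum to at most 1020.
            let sum = top[2 * x] as u16
                + top[2 * x + 1] as u16
                + bot[2 * x] as u16
                + bot[2 * x + 1] as u16;
            // `+ 2` rounds to nearest before dividing by 4.
            dst.push(((sum + 2) >> 2) as u8);
        }
    }
    dst
}
```

The widening to `u16` is the part a SIMD version has to reproduce exactly, since averaging in 8-bit lanes would overflow or round differently.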

The project already has feature flags ready for this:

- `simd-x86-avx2`
- `simd-x86-sse41`
- `simd-wasm32`

Canonical pattern from [CLAUDE.md §3](../blob/main/CLAUDE.md#simd-and-multithreading):

```rust
#[cfg(all(target_arch = "x86_64", feature = "simd-x86-avx2"))]
if is_x86_feature_detected!("avx2") {
    return unsafe { avx2_impl(input) };
}
// scalar fallback
scalar_impl(input)
```
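A compilable variant of the pattern, as a minimal sketch: the `cfg` attribute is attached to a block rather than directly to the `if`, because as far as I know rustc rejects attributes on `if` expressions. `process` is a hypothetical wrapper name, and `avx2_impl` here is only a placeholder that reuses the scalar body, not a real vector kernel.

```rust
/// Scalar fallback, always available. Summing bytes is a stand-in workload.
fn scalar_impl(input: &[u8]) -> u32 {
    input.iter().map(|&b| u32::from(b)).sum()
}

/// Placeholder AVX2 path (hypothetical). It reuses the scalar body and only
/// lets the compiler assume AVX2 inside this function; a real kernel would
/// use `core::arch::x86_64` intrinsics here.
#[cfg(all(target_arch = "x86_64", feature = "simd-x86-avx2"))]
#[target_feature(enable = "avx2")]
unsafe fn avx2_impl(input: &[u8]) -> u32 {
    scalar_impl(input)
}

/// Dispatcher: tries the SIMD path when it is compiled in and the CPU
/// supports it, otherwise falls through to the scalar implementation.
pub fn process(input: &[u8]) -> u32 {
    // The attribute sits on a block, so the whole branch disappears when
    // the feature flag or the target architecture does not match.
    #[cfg(all(target_arch = "x86_64", feature = "simd-x86-avx2"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed at runtime just above.
            return unsafe { avx2_impl(input) };
        }
    }
    // scalar fallback
    scalar_impl(input)
}
```

Built without the `simd-x86-avx2` feature, `process` compiles down to a plain call to `scalar_impl`.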

## Depends on

- #131 (benchmark must land first to establish the baseline this PR will beat)

## Scope

- Implement `downsample_avx2` (process 32 output bytes per iteration via 256-bit lanes).
- Implement `downsample_sse41` (16 output bytes per iteration via 128-bit lanes).
- Implement `downsample_wasm32` using `core::arch::wasm32` SIMD intrinsics.
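- Each path behind `unsafe` + `cfg(target_arch)` + `cfg(feature = ...)`. The x86 paths also need runtime detection (`is_x86_feature_detected!`); wasm32 has no runtime equivalent, so that path is selected at compile time (e.g. `cfg(target_feature = "simd128")`).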
- Each `unsafe` block prefixed with a `// SAFETY:` comment explaining the invariant.
- **Output must be byte-identical to scalar** — add a property test that compares scalar vs SIMD on random inputs.
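To make the 128-bit item concrete, here is a rough sketch of one output row processed 16 bytes at a time. It is one possible kernel, not a prescribed design: it assumes the 2x2 rounded-mean layout, uses `_mm_maddubs_epi16` against a vector of ones to form horizontal pair sums, and finishes the tail with scalar code. All function names are hypothetical, and the project feature gate is left off the dispatcher so the sketch compiles standalone.

```rust
/// Scalar reference for one output row (rounded mean of each 2x2 block).
fn downsample_row_scalar(top: &[u8], bot: &[u8], out: &mut [u8]) {
    for (x, o) in out.iter_mut().enumerate() {
        let sum = top[2 * x] as u16 + top[2 * x + 1] as u16
            + bot[2 * x] as u16 + bot[2 * x + 1] as u16;
        *o = ((sum + 2) >> 2) as u8;
    }
}

/// Hypothetical 128-bit kernel: 16 output bytes per iteration.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn downsample_row_sse41(top: &[u8], bot: &[u8], out: &mut [u8]) {
    use core::arch::x86_64::*;
    assert!(top.len() >= 2 * out.len() && bot.len() >= 2 * out.len());
    let mut x = 0;
    while x + 16 <= out.len() {
        // SAFETY: x + 16 <= out.len() and both rows hold at least
        // 2 * out.len() bytes, so every 16-byte load and store is in bounds.
        unsafe {
            let ones = _mm_set1_epi8(1);
            let two = _mm_set1_epi16(2);
            let t0 = _mm_loadu_si128(top.as_ptr().add(2 * x) as *const __m128i);
            let t1 = _mm_loadu_si128(top.as_ptr().add(2 * x + 16) as *const __m128i);
            let b0 = _mm_loadu_si128(bot.as_ptr().add(2 * x) as *const __m128i);
            let b1 = _mm_loadu_si128(bot.as_ptr().add(2 * x + 16) as *const __m128i);
            // Pair sums as 16-bit lanes: each is at most 510, so the
            // saturating add inside maddubs never saturates.
            let lo = _mm_add_epi16(_mm_maddubs_epi16(t0, ones), _mm_maddubs_epi16(b0, ones));
            let hi = _mm_add_epi16(_mm_maddubs_epi16(t1, ones), _mm_maddubs_epi16(b1, ones));
            // (sum + 2) >> 2, still in 16-bit lanes (max 1022 >> 2 = 255).
            let lo = _mm_srli_epi16::<2>(_mm_add_epi16(lo, two));
            let hi = _mm_srli_epi16::<2>(_mm_add_epi16(hi, two));
            // Narrow back to bytes; values already fit in 0..=255.
            let packed = _mm_packus_epi16(lo, hi);
            _mm_storeu_si128(out.as_mut_ptr().add(x) as *mut __m128i, packed);
        }
        x += 16;
    }
    // Scalar tail for the remaining (< 16) output bytes.
    downsample_row_scalar(&top[2 * x..], &bot[2 * x..], &mut out[x..]);
}

/// Safe entry point with runtime detection and scalar fallback.
pub fn downsample_row(top: &[u8], bot: &[u8], out: &mut [u8]) {
    assert!(top.len() >= 2 * out.len() && bot.len() >= 2 * out.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            // SAFETY: SSE4.1 was detected at runtime; slice lengths were
            // checked by the assertion above.
            unsafe { downsample_row_sse41(top, bot, out) };
            return;
        }
    }
    downsample_row_scalar(top, bot, out);
}
```

Keeping all arithmetic in 16-bit lanes until the final pack is what makes the result match the scalar `(sum + 2) >> 2` bit for bit. An AVX2 version would follow the same shape with 256-bit loads, but `_mm256_packus_epi16` packs per 128-bit lane, so it would likely need an extra permute to restore byte order.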

## Acceptance criteria

- Output of each SIMD path is byte-identical to the scalar baseline for all tested input sizes (including odd dimensions and small images).
- Each SIMD path shows a measurable speedup over scalar on its target architecture, reported in the PR description with concrete ns/iter numbers from #131's benchmark.
- All existing `kpm::freak::pyramid` tests still pass.
- `cargo clippy --all-targets --all-features` clean.
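The byte-identity criterion could be checked with a small dependency-free harness along these lines. This is a sketch only: the `Downsample` signature, `first_mismatch`, and the tiny LCG are assumptions made for illustration, and a real test might use `proptest` or `quickcheck` instead.

```rust
/// Assumed common signature for every downsample variant under test:
/// row-major 8-bit input of `width * height` bytes, owned output.
pub type Downsample = fn(&[u8], usize, usize) -> Vec<u8>;

/// Minimal linear congruential generator, used only to fill test images
/// deterministically without pulling in an external crate.
struct Lcg(u32);

impl Lcg {
    fn next_u8(&mut self) -> u8 {
        self.0 = self.0.wrapping_mul(1664525).wrapping_add(1013904223);
        (self.0 >> 24) as u8
    }
}

/// Runs `reference` and `candidate` on pseudo-random images of every size
/// from 0x0 up to `max_dim` x `max_dim`, which covers odd dimensions and
/// images smaller than one SIMD vector. Returns the first `(width, height)`
/// whose outputs differ, or `None` if all outputs are byte-identical.
pub fn first_mismatch(
    reference: Downsample,
    candidate: Downsample,
    max_dim: usize,
    seed: u32,
) -> Option<(usize, usize)> {
    let mut rng = Lcg(seed);
    for height in 0..=max_dim {
        for width in 0..=max_dim {
            let src: Vec<u8> = (0..width * height).map(|_| rng.next_u8()).collect();
            if reference(&src, width, height) != candidate(&src, width, height) {
                return Some((width, height));
            }
        }
    }
    None
}
```

A test would then assert that `first_mismatch(scalar, simd_variant, max_dim, seed)` is `None` for each SIMD path, with `max_dim` chosen larger than twice the widest vector so both the vector loop and the scalar tail are exercised.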

## Out of scope

- `rayon` parallelization (could be a separate issue or bundled here — open question).
- Chunked memory layout (#132).

## References

- PR #130 — original port
- #131 — benchmark baseline
- refs #125 #126 — M8 parent issues
- CLAUDE.md §3 — SIMD conventions
