|
| 1 | +# ndarray::simd Trojan Horse — Claude Code Flex Prompt |
| 2 | + |
| 3 | +Inject `ndarray::simd` into the most CPU-hungry layers of the legacy Bardioc |
| 4 | +stack (ClickHouse + Tantivy/Quickwit), measured against stock, in a single |
| 5 | +weekend. Goals: real-world validation of ndarray::simd against the gold-standard |
| 6 | +SIMD code on earth + strategic dependency injection that softens the eventual |
| 7 | +HHTL migration. |
| 8 | + |
| 9 | +Copy the block below into a fresh Claude Code session. Authorize `--allowed-tools '*'` |
| 10 | +and have Docker, Rust 1.94, CMake ≥ 3.27, Clang ≥ 17, and GCC ≥ 13 installed. |
| 11 | + |
| 12 | +Budget: 48 hours wall-clock. |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +```text |
| 17 | +You are injecting `ndarray::simd` (from adaworldapi/ndarray, AVX-512 default |
| 18 | +target via .cargo/config.toml `target-cpu=x86-64-v4`) into the hot SIMD paths |
| 19 | +of two CPU-hungry data systems: ClickHouse and Tantivy (which Quickwit |
| 20 | +inherits). The deliverable is a measurable speed/parity report against stock, |
| 21 | +not production patches. |
| 22 | +
|
| 23 | +Spawn 12 parallel workers + 1 coordinator (you). Use git worktrees per worker. |
| 24 | +Branch per worker: `simd-trojan/{role}-{id}`. Cargo gates skipped per worker; |
| 25 | +integration is the gate. Coordinator cherry-picks to main when smoke tests pass. |
| 26 | +
|
| 27 | +## Why this matters |
| 28 | +
|
| 29 | +ndarray::simd has been validated against synthetic micro-benchmarks but not |
| 30 | +against real-world OLAP/FTS workloads. ClickHouse C++ SIMD and Tantivy's |
| 31 | +packed-integer codecs are the gold standard. If ndarray::simd matches or |
| 32 | +beats them via direct integration, the result is: |
| 33 | +
|
| 34 | +1. Real-world validation against the most demanding SIMD workloads on earth. |
| 35 | +2. Upstream contribution opportunities (ClickHouse `rust/` workspace, Tantivy |
| 36 | + crate ecosystem). |
| 37 | +3. Strategic dependency injection: the legacy Bardioc stack now depends on |
| 38 | + ndarray::simd, so the eventual HHTL migration is "completing a dependency |
| 39 | + you've already accepted" rather than "rip-and-replace". |
| 40 | +4. A Trojan horse: ndarray::simd embedded in OSS infrastructure earns |
| 41 | + ecosystem trust independent of the AdaWorldAPI cognitive stack. |
| 42 | +
|
| 43 | +## Targets in scope (and explicitly OUT) |
| 44 | +
|
| 45 | +IN: |
| 46 | +- ClickHouse (C++ via existing `rust/` cargo workspace integration) |
| 47 | +- Tantivy (direct Rust dependency injection) |
| 48 | +- Quickwit (gets it for free via Tantivy) |
| 49 | +
|
| 50 | +OUT (do not waste worker-hours here): |
| 51 | +- Elasticsearch / Lucene: JNI overhead too high; bypass via Quickwit instead. |
| 52 | +- TinkerPop / Gremlin: mostly scalar traversal; JNI kills SIMD gains. |
| 53 | +- ScyllaDB: noted as a follow-up; not this weekend. |
| 54 | +
|
| 55 | +## ClickHouse injection plan |
| 56 | +
|
| 57 | +ClickHouse already has a Rust workspace at `rust/` (prql, skim_str, parquet, |
| 58 | +blake3, etc.) — use it. Add a new crate `rust/ndarray_simd_kernels/` that |
| 59 | +exposes C-ABI wrappers around ndarray::simd primitives. Wire into ClickHouse's |
| 60 | +function registration via the same pattern existing Rust crates use. |
| 61 | +
|
| 62 | +Target kernels (priority order — pick the ones with the most existing C++ |
| 63 | +SIMD hand-tuning to make the comparison fair): |
| 64 | +
|
| 65 | +1. `src/AggregateFunctions/AggregateFunctionSum.cpp` → `ndarray::simd::reduce_sum_f32/f64/i32/i64`
| 66 | +2. `src/AggregateFunctions/AggregateFunctionAvg.cpp` → sum + count combined |
| 67 | +3. `src/Functions/array/arrayMin.cpp` + `arrayMax.cpp` → |
| 68 | + `ndarray::simd::reduce_{min,max}` |
| 69 | +4. `src/Functions/like.cpp` (substring match) → `ndarray::simd::substring_find` |
| 70 | + (W1a closure-parameterized batch primitive) |
| 71 | +5. `src/Common/HashTable/Hash.h` (hash function batching) → |
| 72 | + `ndarray::simd::hash_xxh3_batch` |
| 73 | +6. `src/Functions/comparison.cpp` (`==`, `<`, `>` on numeric columns) → |
| 74 | + `ndarray::simd::compare_lt/eq/gt` |
| 75 | +
|
| 76 | +Function-call FFI overhead amortizes over a full block (`DEFAULT_BLOCK_SIZE`, roughly 65k rows).
| 77 | +Per-block FFI is fine; per-row FFI is not. Design the C ABI accordingly: |
| 78 | +batch-in, batch-out, no per-element callbacks across the FFI boundary. |
| 79 | +
|
| 80 | +Build setup: |
| 81 | +- ClickHouse build via `cmake -DENABLE_RUST=1 -DENABLE_TESTS=1` |
| 82 | +- `rust/ndarray_simd_kernels/Cargo.toml` depends on `ndarray = { git = "..." }` |
| 83 | +- `cbindgen` to generate the C header |
| 84 | +- Static link into ClickHouse server binary |
| 85 | +- Runtime dispatch: ClickHouse's `CpuFlags` decides which backend; ndarray |
| 86 | + exposes separate AVX-512 / AVX2 / NEON / scalar entry points |
| 87 | +
|
| 88 | +## Tantivy injection plan |
| 89 | +
|
| 90 | +Tantivy is already Rust — direct dependency. Fork Tantivy at the current tag, |
| 91 | +add `ndarray = { git = "..." }` to its `Cargo.toml`, replace its SIMD code |
| 92 | +paths with ndarray::simd calls. |
| 93 | +
|
| 94 | +Target paths (look for `#[cfg(target_feature)]` and packed-int decoding): |
| 95 | +
|
| 96 | +1. `src/postings/compression/` — bitpacked posting list decode → |
| 97 | + `ndarray::simd::bitpack_decode_u32` |
| 98 | +2. `src/aggregation/bucket/range.rs` — range bucketing → |
| 99 | + `ndarray::simd::bucketize_f64` |
| 100 | +3. `src/query/term_query.rs` — term frequency scoring (BM25) → |
| 101 | + `ndarray::simd::bm25_score_batch` |
| 102 | +4. `src/postings/skip.rs` — skip list intersection → |
| 103 | + `ndarray::simd::intersect_sorted_u32` |
| 104 | +5. `src/columnar/` — columnar reads → `ndarray::simd::gather_f32/u32` |
| 105 | +
|
| 106 | +Tantivy has a comprehensive test suite. The bar is: all Tantivy tests pass |
| 107 | +with the ndarray::simd backend, AND the bench suite shows parity or better. |
| 108 | +
|
| 109 | +## Worker split (12 + coordinator) |
| 110 | +
|
| 111 | +| Worker | Target | Role | |
| 112 | +|---|---|---| |
| 113 | +| W1 | ClickHouse | Build setup + `rust/ndarray_simd_kernels/` crate skeleton + cbindgen | |
| 114 | +| W2 | ClickHouse | Kernels 1+2 (sum, avg) + benches | |
| 115 | +| W3 | ClickHouse | Kernels 3+6 (min/max + comparison) + benches | |
| 116 | +| W4 | ClickHouse | Kernel 4 (substring match) + benches | |
| 117 | +| W5 | ClickHouse | Kernel 5 (hash batching) + benches | |
| 118 | +| W6 | Tantivy | Fork setup + ndarray dep wiring + paths 1+5 (bitpack + gather) | |
| 119 | +| W7 | Tantivy | Path 2 (range bucketing) + benches | |
| 120 | +| W8 | Tantivy | Path 3 (BM25 scoring) + benches | |
| 121 | +| W9 | Tantivy | Path 4 (skip list intersection) + benches | |
| 122 | +| W10 | ClickHouse + Tantivy | C ABI parity tests (same kernel from both sides returns same result) | |
| 123 | +| W11 | Both | Combined benchmark harness (docker-compose: ClickHouse + Quickwit + workload) | |
| 124 | +| W12 | Both | Report generator: stock-vs-ndarray-simd latency tables in markdown + plots | |
| 125 | +
|
| 126 | +Coordinator: integration testing, cherry-pick, docker-compose orchestration, |
| 127 | +final REPORT.md. |
| 128 | +
|
| 129 | +## Benchmarks (deliverable) |
| 130 | +
|
| 131 | +Run BOTH stock and ndarray::simd-injected versions against the SAME workloads: |
| 132 | +
|
| 133 | +ClickHouse workload: |
| 134 | +- TPC-H scale factor 10 (~10 GB) |
| 135 | +- Q1, Q3, Q6, Q14 (these stress the kernels we replaced) |
| 136 | +- Report: p50/p95/p99 query latency, CPU instructions retired (`perf stat`), |
| 137 | + IPC, cache miss rate |
| 138 | +
|
| 139 | +Tantivy/Quickwit workload: |
| 140 | +- StackOverflow dataset (~10M docs, full body) |
| 141 | +- 1000 queries from a realistic mix: term, phrase, range, aggregation |
| 142 | +- Report: p50/p95/p99 query latency, indexing throughput, cold-cache vs |
| 143 | + warm-cache latency |
| 144 | +
|
| 145 | +Output: `./benchmarks/REPORT.md` with side-by-side tables. |
| 146 | +
|
| 147 | +## Acceptance criteria |
| 148 | +
|
| 149 | +Per kernel: |
| 150 | +1. Correctness: parity with stock (bit-exact for integer, ULP-bounded for |
| 151 | + float) |
| 152 | +2. Performance: within 5% of stock OR faster. If slower by >5%, document |
| 153 | + why (function-call overhead per call, larger batch needed, missing AVX-512 |
| 154 | + primitive, etc.) |
| 155 | +3. Test coverage: existing test suites pass unchanged. |
| 156 | +
|
| 157 | +Per system: |
| 158 | +- ClickHouse: full `ctest` passes with `ENABLE_RUST=1` |
| 159 | +- Tantivy: `cargo test --all-features` passes |
| 160 | +- Benchmarks reproduce in a clean docker-compose stand-up |
| 161 | +
|
| 162 | +## Anti-goals |
| 163 | +
|
| 164 | +- Do NOT optimize ndarray::simd to win these specific benchmarks. The point |
| 165 | + is to measure what ndarray::simd is TODAY against the gold standard, not |
| 166 | + to fake a win. |
| 167 | +- Do NOT introduce nightly-only code paths (Rust 1.94 stable; portable_simd |
| 168 | + is gated, intrinsics via `core::arch::*` only). |
| 169 | +- Do NOT upstream patches this weekend. The deliverable is the validated fork |
| 170 | + + benchmark report, NOT merged PRs. Upstream contribution is a separate |
| 171 | + follow-on after the numbers are clean. |
| 172 | +- Do NOT touch HHTL substrate (PR-X4, PR-X9, etc.). This is independent |
| 173 | + validation of ndarray::simd, not HHTL development. |
| 174 | +- Do NOT add new SIMD primitives to ndarray::simd to plug gaps. If a kernel |
| 175 | + needs a primitive that doesn't exist, document the gap and skip that |
| 176 | + kernel — the gap becomes a follow-on ndarray::simd PR with W1a consumer |
| 177 | + contract. |
| 178 | +
|
| 179 | +## Time budget |
| 180 | +
|
| 181 | +| Hour 0-4 | Build setups (W1, W6) + worker bootstrap | |
| 182 | +| Hour 4-16 | Per-kernel implementation (W2-W5, W7-W9) in parallel | |
| 183 | +| Hour 16-24 | C ABI parity testing (W10) + first benchmark pass (W11) | |
| 184 | +| Hour 24-36 | Tune the worst regressions; iterate kernel-by-kernel | |
| 185 | +| Hour 36-44 | Final benchmark pass + report generation (W12) | |
| 186 | +| Hour 44-48 | REPORT.md write-up + identified upstream-PR opportunities + handoff | |
| 187 | +
|
| 188 | +If a kernel doesn't reach parity in its allotted window, document the gap |
| 189 | +(missing ndarray primitive, FFI overhead too high, layout mismatch) and |
| 190 | +move on. Honest negatives are also data. |
| 191 | +
|
| 192 | +## Strategic outcomes (what the report unlocks) |
| 193 | +
|
| 194 | +1. **Validation**: ndarray::simd benchmarked against ClickHouse C++ SIMD |
| 195 | + (over a decade of hand-tuning) and Tantivy's bitpacked codecs (Lucene-class
| 196 | + FTS). This is the bar. |
| 197 | +
|
| 198 | +2. **Upstream PR pipeline**: each kernel that hits parity-or-better becomes |
| 199 | + a candidate upstream contribution. ClickHouse `rust/` workspace is the |
| 200 | + natural channel; Tantivy crate ecosystem the other. Earns ecosystem |
| 201 | + credibility independent of AdaWorldAPI. |
| 202 | +
|
| 203 | +3. **Migration pressure relief**: if the Bardioc stack itself gets faster |
| 204 | + via ndarray::simd injection, the cutover urgency decreases. That's |
| 205 | + actually GOOD — it lets HHTL ship on its own merits rather than under |
| 206 | + "we have to migrate, Bardioc is too slow" pressure. Honest migration |
| 207 | + conversation. |
| 208 | +
|
| 209 | +4. **Dependency Trojan horse**: when HHTL is ready, ClickHouse and Tantivy |
| 210 | + already depend on ndarray::simd. The migration is "completing a |
| 211 | + dependency you've already accepted" rather than "abandoning everything". |
| 212 | + Softer organizational change. |
| 213 | +
|
| 214 | +5. **Cross-team signal**: this weekend ships a benchmark report that any |
| 215 | + ClickHouse / Tantivy team can read and respond to. Opens conversations |
| 216 | + that pure cognitive-stack work doesn't. |
| 217 | +
|
| 218 | +Begin. Report progress every 4 hours with a status table per worker (kernel |
| 219 | +done / in-progress / blocked + correctness pass-fail + perf delta vs stock). |
| 220 | +``` |
| 221 | + |
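The prompt asks for a C ABI that is batch-in, batch-out with no per-element callbacks. A minimal sketch of what that shape could look like in `rust/ndarray_simd_kernels/` follows. The `ndsk_*` names are hypothetical, and the scalar loop bodies are placeholders: the prompt does not give the signatures of the `ndarray::simd` primitives, so the real kernel calls are left out rather than guessed.

```rust
use std::slice;

/// Hypothetical entry point: sum one whole column block per call.
/// Returns 0.0 for a null pointer or an empty batch.
///
/// # Safety
/// `data` must point to `len` readable, aligned `f32` values.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn ndsk_reduce_sum_f32(data: *const f32, len: usize) -> f32 {
    if data.is_null() || len == 0 {
        return 0.0;
    }
    let xs = unsafe { slice::from_raw_parts(data, len) };
    // Placeholder: a scalar sum standing in for the real SIMD reduction.
    xs.iter().sum()
}

/// Hypothetical entry point: element-wise `a < b`, written as a 0/1 byte mask.
/// Returns the number of rows written (0 if any pointer is null).
///
/// # Safety
/// `a` and `b` must point to `len` readable `f32` values and `out` to `len`
/// writable bytes; `out` must not overlap `a` or `b`.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn ndsk_compare_lt_f32(
    a: *const f32,
    b: *const f32,
    out: *mut u8,
    len: usize,
) -> usize {
    if a.is_null() || b.is_null() || out.is_null() {
        return 0;
    }
    let a = unsafe { slice::from_raw_parts(a, len) };
    let b = unsafe { slice::from_raw_parts(b, len) };
    let out = unsafe { slice::from_raw_parts_mut(out, len) };
    // Placeholder: a scalar loop standing in for the real SIMD comparison.
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = (x < y) as u8;
    }
    len
}
```

Both functions cross the FFI boundary once per block, so the call overhead is paid per batch rather than per row. The `#[unsafe(no_mangle)]` spelling assumes a recent stable toolchain such as the Rust 1.94 named above.
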
| 222 | +--- |
| 223 | + |
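The acceptance criteria call for bit-exact integer parity and ULP-bounded float parity. Float results cannot be expected to match bit for bit, because a SIMD reduction adds elements in a different order than a scalar loop. One way a parity test (W10) could measure the gap is sketched below. The function names and the choice of tolerance are assumptions; the prompt does not fix a ULP bound.

```rust
/// Map an f32 onto an integer line so that integer order matches float order.
/// +0.0 and -0.0 both map to 0.
fn ordered_bits(x: f32) -> i64 {
    let bits = x.to_bits() as i32;
    // Negative floats have the sign bit set; mirror them below zero.
    let ordered = if bits < 0 { i32::MIN.wrapping_sub(bits) } else { bits };
    i64::from(ordered)
}

/// Number of representable f32 values between `a` and `b`.
/// Returns `None` if either input is NaN.
pub fn ulp_distance(a: f32, b: f32) -> Option<u64> {
    if a.is_nan() || b.is_nan() {
        return None;
    }
    Some((ordered_bits(a) - ordered_bits(b)).unsigned_abs())
}

/// Parity check for one value: NaN only matches NaN.
pub fn within_ulps(stock: f32, candidate: f32, max_ulps: u64) -> bool {
    match ulp_distance(stock, candidate) {
        Some(d) => d <= max_ulps,
        None => stock.is_nan() && candidate.is_nan(),
    }
}

/// Index of the first element that breaks parity, or `None` if all match.
/// A length mismatch is reported at the end of the shorter slice.
pub fn first_mismatch(stock: &[f32], candidate: &[f32], max_ulps: u64) -> Option<usize> {
    if stock.len() != candidate.len() {
        return Some(stock.len().min(candidate.len()));
    }
    stock
        .iter()
        .zip(candidate)
        .position(|(s, c)| !within_ulps(*s, *c, max_ulps))
}
```

Calling `first_mismatch` with `max_ulps = 0` gives a strict comparison that still treats `+0.0` and `-0.0` as equal. Reporting the first failing index rather than a bare boolean makes it easier to document a gap per kernel.
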
| 224 | +## Notes for using this prompt |
| 225 | + |
| 226 | +- Drop into a fresh Claude Code session on a build machine with Rust 1.94 + |
| 227 | + Clang ≥ 17 + GCC ≥ 13 + CMake ≥ 3.27 + Docker. |
| 228 | +- ClickHouse build is heavy: ~40 GB disk, ~30 min full build. Plan accordingly. |
| 229 | +- Tantivy build is light: ~5 min. |
| 230 | +- Quickwit is the operational shell for Tantivy benchmarks — easier than |
| 231 | + running raw Tantivy bench harness. |
| 232 | +- The 12-worker pattern matches the master-consolidation protocol: use brainstorm |
| 233 | + (Opus) for kernel design, scaffolding (Sonnet) for FFI wrappers, and review |
| 234 | + (Opus) for correctness gates. |
| 235 | +- If you only have 24 hours (half-flex), cut Tantivy entirely and focus on |
| 236 | + ClickHouse kernels 1+2+4 (sum, avg, substring) — these are the most |
| 237 | + ClickHouse-celebrated SIMD paths and the most impressive parity comparison. |
| 238 | +- The REPORT.md should be writable as a blog post — that's the strategic |
| 239 | + amplification angle that turns a benchmark exercise into ecosystem signal. |
| 240 | + |
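The REPORT.md tables need p50/p95/p99 latencies and a perf delta against stock, checked against the 5% bar from the acceptance criteria. A small sketch of that arithmetic for the W12 report generator is shown here. It uses the nearest-rank percentile definition, which is an assumption (the prompt does not say which definition to use), and all function names and the row layout are hypothetical.

```rust
/// Nearest-rank percentile over an ascending-sorted slice.
/// Returns `None` for an empty slice or a `p` outside 0..=100.
pub fn percentile(sorted: &[f64], p: f64) -> Option<f64> {
    if sorted.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let rank = (p / 100.0 * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Sort the samples in place and return [p50, p95, p99] in milliseconds.
pub fn summarize(samples_ms: &mut [f64]) -> Option<[f64; 3]> {
    samples_ms.sort_by(|a, b| a.total_cmp(b));
    Some([
        percentile(samples_ms, 50.0)?,
        percentile(samples_ms, 95.0)?,
        percentile(samples_ms, 99.0)?,
    ])
}

/// Relative latency change in percent; negative means the candidate is faster.
pub fn perf_delta_pct(stock_ms: f64, candidate_ms: f64) -> f64 {
    (candidate_ms - stock_ms) / stock_ms * 100.0
}

/// One markdown table row; anything slower than +5% is flagged for a write-up.
pub fn report_row(kernel: &str, stock_p50: f64, candidate_p50: f64) -> String {
    let delta = perf_delta_pct(stock_p50, candidate_p50);
    let verdict = if delta <= 5.0 { "ok" } else { "document why" };
    format!("| {kernel} | {stock_p50:.2} | {candidate_p50:.2} | {delta:+.1}% | {verdict} |")
}
```

With few samples the p95 and p99 values collapse onto the maximum, so the 1000-query Tantivy mix gives more meaningful tails than a handful of TPC-H runs would.
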
| 241 | +## Follow-on opportunities (NOT this weekend) |
| 242 | + |
| 243 | +- **Upstream PR cadence**: 1 ClickHouse PR per 2 weeks for each parity-or-better |
| 244 | + kernel. Tantivy PRs can move faster (pure Rust, ~5 min build). |
| 245 | +- **ScyllaDB Rust driver SIMD**: hash-function family swap. Similar shape. |
| 246 | +- **Polars** (Rust DataFrame; cudf is CUDA-based, so out of scope for CPU SIMD): |
| 247 | + Polars already has vectorized kernels; check if ndarray::simd primitives |
| 248 | + can replace its hand-rolled ones. |
| 249 | +- **Apache Arrow Rust**: arrow-rs has SIMD for filter/take/aggregate; |
| 250 | + ndarray::simd could plug in there too. |
| 251 | +- **DataFusion** (Rust SQL engine): similar to Arrow path; the SIMD layer |
| 252 | + is generic enough to swap. |
| 253 | + |
| 254 | +The pattern generalizes: any Rust-or-Rust-FFI-able data system with hot |
| 255 | +SIMD paths is a candidate. ndarray::simd as the canonical Rust SIMD |
| 256 | +substrate is a multi-year strategic position; this weekend is the proof |
| 257 | +of concept. |