Commit ce66269

docs(simd): ndarray::simd Trojan horse — inject into ClickHouse + Tantivy
48-hour Claude Code flex prompt for direct ndarray::simd integration into the most CPU-hungry layers of the legacy Bardioc stack:

- ClickHouse via its existing rust/ cargo workspace (sum, avg, min/max, substring, hash, comparison kernels)
- Tantivy via direct dependency injection (bitpack decode, range bucketing, BM25, skip list intersection, columnar gather)
- Quickwit inherits the Tantivy work for free

Explicit non-targets: Elasticsearch/Lucene (JNI overhead too high; bypass via Quickwit), TinkerPop (mostly scalar traversal), ScyllaDB (follow-on).

Strategic frame: this is a Trojan horse, not just a benchmark. Once the legacy stack depends on ndarray::simd, the HHTL migration becomes "completing a dependency you've already accepted" rather than rip-and-replace. Also: there is no better validation of ndarray::simd than against ClickHouse C++ SIMD (decades of hand-tuning) and Tantivy bitpacked codecs (Lucene-class FTS).

Anti-goals: do NOT add new ndarray primitives to fake parity; do NOT upstream patches this weekend (separate follow-on); do NOT touch HHTL.

Companion to bardioc-weekend-rebuild-prompt.md and stack-consolidation-bardioc-to-hhtl.md.
1 parent 19f6876 commit ce66269

1 file changed

Lines changed: 257 additions & 0 deletions

@@ -0,0 +1,257 @@
# ndarray::simd Trojan Horse: Claude Code Flex Prompt

Inject `ndarray::simd` into the most CPU-hungry layers of the legacy Bardioc
stack (ClickHouse + Tantivy/Quickwit), measured against stock, in a single
weekend. Goals: real-world validation of ndarray::simd against the gold-standard
SIMD code on earth + strategic dependency injection that softens the eventual
HHTL migration.

Copy the block below into a fresh Claude Code session. Authorize `--allowed-tools '*'`,
Docker, Rust 1.94, CMake ≥ 3.27, Clang ≥ 17, GCC ≥ 13.

Budget: 48 hours wall-clock.

---

```text
You are injecting `ndarray::simd` (from adaworldapi/ndarray, AVX-512 default
target via .cargo/config.toml `target-cpu=x86-64-v4`) into the hot SIMD paths
of two CPU-hungry data systems: ClickHouse and Tantivy (which Quickwit
inherits). The deliverable is a measurable speed/parity report against stock,
not production patches.

Spawn 12 parallel workers + 1 coordinator (you). Use git worktrees per worker.
Branch per worker: `simd-trojan/{role}-{id}`. Cargo gates skipped per worker;
integration is the gate. Coordinator cherry-picks to main when smoke tests pass.

## Why this matters

ndarray::simd has been validated against synthetic micro-benchmarks but not
against real-world OLAP/FTS workloads. ClickHouse C++ SIMD and Tantivy's
packed-integer codecs are the gold standard. If ndarray::simd matches or
beats them via direct integration, the result is:

1. Real-world validation against the most demanding SIMD workloads on earth.
2. Upstream contribution opportunities (ClickHouse `rust/` workspace, Tantivy
   crate ecosystem).
3. Strategic dependency injection: the legacy Bardioc stack now depends on
   ndarray::simd, so the eventual HHTL migration is "completing a dependency
   you've already accepted" rather than "rip-and-replace".
4. A Trojan horse: ndarray::simd embedded in OSS infrastructure earns
   ecosystem trust independent of the AdaWorldAPI cognitive stack.

## Targets in scope (and explicitly OUT)

IN:
- ClickHouse (C++ via existing `rust/` cargo workspace integration)
- Tantivy (direct Rust dependency injection)
- Quickwit (gets it for free via Tantivy)

OUT (do not waste worker-hours here):
- Elasticsearch / Lucene: JNI overhead too high; bypass via Quickwit instead.
- TinkerPop / Gremlin: mostly scalar traversal; JNI kills SIMD gains.
- ScyllaDB: noted as a follow-up; not this weekend.

## ClickHouse injection plan

ClickHouse already has a Rust workspace at `rust/` (prql, skim_str, parquet,
blake3, etc.); use it. Add a new crate `rust/ndarray_simd_kernels/` that
exposes C-ABI wrappers around ndarray::simd primitives. Wire into ClickHouse's
function registration via the same pattern existing Rust crates use.

Target kernels (priority order; pick the ones with the most existing C++
SIMD hand-tuning to make the comparison fair):

1. `src/AggregateFunctions/AggregateFunctionSum.h` (sum) →
   `ndarray::simd::reduce_sum_f32/f64/i32/i64`
2. `src/AggregateFunctions/AggregateFunctionAvg.cpp` → sum + count combined
3. `src/Functions/array/arrayAggregation.cpp` (`arrayMin`, `arrayMax`) →
   `ndarray::simd::reduce_{min,max}`
4. `src/Functions/like.cpp` (substring match) → `ndarray::simd::substring_find`
   (W1a closure-parameterized batch primitive)
5. `src/Common/HashTable/Hash.h` (hash function batching) →
   `ndarray::simd::hash_xxh3_batch`
6. `src/Functions/FunctionsComparison.h` (`==`, `<`, `>` on numeric columns) →
   `ndarray::simd::compare_lt/eq/gt`

File locations shift between ClickHouse releases; confirm each path against
the checked-out tree before assigning it to a worker.

Function-call FFI overhead amortizes over a `DEFAULT_BLOCK_SIZE` block (on
the order of 65k rows; the exact constant differs between ClickHouse versions).
Per-block FFI is fine; per-row FFI is not. Design the C ABI accordingly:
batch-in, batch-out, no per-element callbacks across the FFI boundary.

Build setup:
- ClickHouse build via `cmake -DENABLE_RUST=1 -DENABLE_TESTS=1`
- `rust/ndarray_simd_kernels/Cargo.toml` depends on `ndarray = { git = "..." }`
- `cbindgen` to generate the C header
- Static link into ClickHouse server binary
- Runtime dispatch: ClickHouse's `CpuFlags` decides which backend; ndarray
  exposes separate AVX-512 / AVX2 / NEON / scalar entry points

## Tantivy injection plan

Tantivy is already Rust, so this is a direct dependency. Fork Tantivy at the
current tag, add `ndarray = { git = "..." }` to its `Cargo.toml`, and replace
its SIMD code paths with ndarray::simd calls.

Target paths (look for `#[cfg(target_feature = "...")]` and packed-int decoding):

1. `src/postings/compression/`: bitpacked posting list decode →
   `ndarray::simd::bitpack_decode_u32`
2. `src/aggregation/bucket/range.rs`: range bucketing →
   `ndarray::simd::bucketize_f64`
3. `src/query/bm25.rs` + `src/query/term_query/`: term frequency scoring (BM25) →
   `ndarray::simd::bm25_score_batch`
4. `src/postings/skip.rs`: skip list intersection →
   `ndarray::simd::intersect_sorted_u32`
5. `columnar/` (the `tantivy-columnar` workspace crate): columnar reads →
   `ndarray::simd::gather_f32/u32`

Tantivy has a comprehensive test suite. The bar is: all Tantivy tests pass
with the ndarray::simd backend, AND the bench suite shows parity or better.

## Worker split (12 + coordinator)

| Worker | Target | Role |
|---|---|---|
| W1 | ClickHouse | Build setup + `rust/ndarray_simd_kernels/` crate skeleton + cbindgen |
| W2 | ClickHouse | Kernels 1+2 (sum, avg) + benches |
| W3 | ClickHouse | Kernels 3+6 (min/max + comparison) + benches |
| W4 | ClickHouse | Kernel 4 (substring match) + benches |
| W5 | ClickHouse | Kernel 5 (hash batching) + benches |
| W6 | Tantivy | Fork setup + ndarray dep wiring + paths 1+5 (bitpack + gather) |
| W7 | Tantivy | Path 2 (range bucketing) + benches |
| W8 | Tantivy | Path 3 (BM25 scoring) + benches |
| W9 | Tantivy | Path 4 (skip list intersection) + benches |
| W10 | ClickHouse + Tantivy | C ABI parity tests (same kernel from both sides returns same result) |
| W11 | Both | Combined benchmark harness (docker-compose: ClickHouse + Quickwit + workload) |
| W12 | Both | Report generator: stock-vs-ndarray-simd latency tables in markdown + plots |

Coordinator: integration testing, cherry-pick, docker-compose orchestration,
final REPORT.md.

## Benchmarks (deliverable)

Run BOTH stock and ndarray::simd-injected versions against the SAME workloads:

ClickHouse workload:
- TPC-H scale factor 10 (~10 GB)
- Q1, Q3, Q6, Q14 (these stress the kernels we replaced)
- Report: p50/p95/p99 query latency, CPU instructions retired (`perf stat`),
  IPC, cache miss rate

Tantivy/Quickwit workload:
- StackOverflow dataset (~10M docs, full body)
- 1000 queries from a realistic mix: term, phrase, range, aggregation
- Report: p50/p95/p99 query latency, indexing throughput, cold-cache vs
  warm-cache latency

Output: `./benchmarks/REPORT.md` with side-by-side tables.

## Acceptance criteria

Per kernel:
1. Correctness: parity with stock (bit-exact for integer, ULP-bounded for
   float)
2. Performance: within 5% of stock OR faster. If slower by >5%, document
   why (function-call overhead per call, larger batch needed, missing AVX-512
   primitive, etc.)
3. Test coverage: existing test suites pass unchanged.

Per system:
- ClickHouse: full `ctest` passes with `ENABLE_RUST=1`
- Tantivy: `cargo test --all-features` passes
- Benchmarks reproduce in a clean docker-compose stand-up

## Anti-goals

- Do NOT optimize ndarray::simd to win these specific benchmarks. The point
  is to measure what ndarray::simd is TODAY against the gold standard, not
  to fake a win.
- Do NOT introduce nightly-only code paths (Rust 1.94 stable; `portable_simd`
  is nightly-gated, so use intrinsics via `core::arch::*` only).
- Do NOT upstream patches this weekend. The deliverable is the validated fork
  + benchmark report, NOT merged PRs. Upstream contribution is a separate
  follow-on after the numbers are clean.
- Do NOT touch HHTL substrate (PR-X4, PR-X9, etc.). This is independent
  validation of ndarray::simd, not HHTL development.
- Do NOT add new SIMD primitives to ndarray::simd to plug gaps. If a kernel
  needs a primitive that doesn't exist, document the gap and skip that
  kernel; the gap becomes a follow-on ndarray::simd PR with W1a consumer
  contract.

## Time budget

| Window | Activity |
|---|---|
| Hour 0-4 | Build setups (W1, W6) + worker bootstrap |
| Hour 4-16 | Per-kernel implementation (W2-W5, W7-W9) in parallel |
| Hour 16-24 | C ABI parity testing (W10) + first benchmark pass (W11) |
| Hour 24-36 | Tune the worst regressions on the integration side (batch size, FFI layout, dispatch), not inside ndarray::simd; iterate kernel-by-kernel |
| Hour 36-44 | Final benchmark pass + report generation (W12) |
| Hour 44-48 | REPORT.md write-up + identified upstream-PR opportunities + handoff |

If a kernel doesn't reach parity in its allotted window, document the gap
(missing ndarray primitive, FFI overhead too high, layout mismatch) and
move on. Honest negatives are also data.

## Strategic outcomes (what the report unlocks)

1. **Validation**: ndarray::simd benchmarked against ClickHouse C++ SIMD
   (decades of hand-tuning) and Tantivy's bitpacked codecs (Lucene-class
   FTS). This is the bar.

2. **Upstream PR pipeline**: each kernel that hits parity-or-better becomes
   a candidate upstream contribution. ClickHouse `rust/` workspace is the
   natural channel; Tantivy crate ecosystem the other. Earns ecosystem
   credibility independent of AdaWorldAPI.

3. **Migration pressure relief**: if the Bardioc stack itself gets faster
   via ndarray::simd injection, the cutover urgency decreases. That's
   actually GOOD: it lets HHTL ship on its own merits rather than under
   "we have to migrate, Bardioc is too slow" pressure. Honest migration
   conversation.

4. **Dependency Trojan horse**: when HHTL is ready, ClickHouse and Tantivy
   already depend on ndarray::simd. The migration is "completing a
   dependency you've already accepted" rather than "abandoning everything".
   Softer organizational change.

5. **Cross-team signal**: this weekend ships a benchmark report that any
   ClickHouse / Tantivy team can read and respond to. Opens conversations
   that pure cognitive-stack work doesn't.

Begin. Report progress every 4 hours with a status table per worker (kernel
done / in-progress / blocked + correctness pass-fail + perf delta vs stock).
```
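The batch-in, batch-out C ABI that the prompt asks for in the ClickHouse injection plan might look roughly like the sketch below. The symbol name `ndsimd_reduce_sum_f32`, the status codes, and the scalar stand-in `reduce_sum_f32` are all hypothetical; the real ndarray::simd reduction API is not shown in this document, so a plain iterator sum takes its place. The point is only the shape: one call per column block, a pointer plus length in, a single result out, and no callbacks across the boundary. A real export would also carry a `no_mangle` attribute so that cbindgen and the linker see a stable symbol name.

```rust
use std::slice;

/// Status codes returned across the FFI boundary (hypothetical convention).
pub const NDSIMD_OK: i32 = 0;
pub const NDSIMD_NULL_PTR: i32 = 1;

/// Scalar stand-in for the SIMD reduction; an assumption for illustration,
/// since the actual ndarray::simd call is not documented here.
fn reduce_sum_f32(values: &[f32]) -> f32 {
    values.iter().copied().sum()
}

/// Hypothetical batch entry point: the C++ side calls this once per block,
/// never once per row.
///
/// # Safety
/// `data` must point to `len` readable `f32` values (it may be null only
/// when `len == 0`), and `out` must point to one writable `f32`.
pub unsafe extern "C" fn ndsimd_reduce_sum_f32(
    data: *const f32,
    len: usize,
    out: *mut f32,
) -> i32 {
    if out.is_null() || (data.is_null() && len != 0) {
        return NDSIMD_NULL_PTR;
    }
    // Avoid building a slice from a null pointer for empty blocks.
    let values: &[f32] = if len == 0 {
        &[]
    } else {
        unsafe { slice::from_raw_parts(data, len) }
    };
    unsafe { *out = reduce_sum_f32(values) };
    NDSIMD_OK
}
```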
---
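The "ULP-bounded for float" correctness criterion in the prompt leaves the comparison itself unspecified. One common way to measure it, sketched here as an assumption rather than as the project's actual parity harness, is to map each `f32` onto an integer key that preserves ordering and take the absolute difference of the keys. The helper names are hypothetical, and the tolerance would need to be chosen per kernel: reductions that reorder additions generally need a looser bound than element-wise operations.

```rust
/// Map an f32 onto an integer line so that adjacent representable floats
/// differ by exactly 1 and +0.0 / -0.0 share the same key.
fn ordered_key(x: f32) -> i64 {
    let bits = x.to_bits() as i32; // reinterpret the IEEE-754 bit pattern
    if bits < 0 {
        // Negative floats: larger magnitude means larger raw bits, so mirror them.
        i64::from(i32::MIN) - i64::from(bits)
    } else {
        i64::from(bits)
    }
}

/// Distance in units in the last place, or None if either input is NaN.
pub fn ulp_distance(a: f32, b: f32) -> Option<u64> {
    if a.is_nan() || b.is_nan() {
        return None;
    }
    Some((ordered_key(a) - ordered_key(b)).unsigned_abs())
}

/// Element-wise parity check between a stock result and a candidate result.
/// Fails on length mismatch, on any NaN, or on any pair further apart than `max_ulps`.
pub fn within_ulps(stock: &[f32], candidate: &[f32], max_ulps: u64) -> bool {
    stock.len() == candidate.len()
        && stock
            .iter()
            .zip(candidate)
            .all(|(&s, &c)| matches!(ulp_distance(s, c), Some(d) if d <= max_ulps))
}
```

Treating NaN as a failure is a design choice of this sketch; a kernel that legitimately produces NaN for some inputs would need an explicit rule for that case.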

## Notes for using this prompt

- Drop into a fresh Claude Code session on a build machine with Rust 1.94 +
  Clang ≥ 17 + GCC ≥ 13 + CMake ≥ 3.27 + Docker.
- ClickHouse build is heavy: ~40 GB disk, ~30 min full build. Plan accordingly.
- Tantivy build is light: ~5 min.
- Quickwit is the operational shell for Tantivy benchmarks; it is easier than
  running the raw Tantivy bench harness.
- The 12-worker pattern matches the master-consolidation protocol: brainstorm
  (Opus) for kernel design, scaffolding (Sonnet) for FFI wrappers, review
  (Opus) for correctness gates.
- If you only have 24 hours (half-flex), cut Tantivy entirely and focus on
  ClickHouse kernels 1+2+4 (sum, avg, substring); these are the most
  ClickHouse-celebrated SIMD paths and the most impressive parity comparison.
- The REPORT.md should be writable as a blog post; that's the strategic
  amplification angle that turns a benchmark exercise into ecosystem signal.
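
For the side-by-side tables in REPORT.md, the p50/p95/p99 figures and the 5% acceptance threshold from the prompt could be computed along these lines. This is a minimal sketch using the nearest-rank percentile definition; the function names are hypothetical, and the actual report generator (W12) may use a different percentile convention or different tooling altogether.

```rust
/// Nearest-rank percentile of latency samples (any consistent unit).
/// Returns None for an empty sample set or a percentile outside (0, 100].
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest value with at least p% of samples at or below it.
    let rank = (p / 100.0 * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Relative change of the candidate against stock, in percent (positive = slower).
pub fn perf_delta_percent(stock: f64, candidate: f64) -> f64 {
    (candidate - stock) * 100.0 / stock
}

/// The prompt's bar: within 5% of stock, or faster.
pub fn meets_perf_bar(stock: f64, candidate: f64) -> bool {
    perf_delta_percent(stock, candidate) <= 5.0
}
```

Applying `meets_perf_bar` to the same percentile of both builds (for example p95 stock against p95 injected) keeps the comparison like-for-like; which percentile gates acceptance is left open by the prompt.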

## Follow-on opportunities (NOT this weekend)

- **Upstream PR cadence**: 1 ClickHouse PR per 2 weeks for each parity-or-better
  kernel. Tantivy PRs can move faster (Cargo-only build, far lighter than the
  ClickHouse CMake pipeline).
- **ScyllaDB Rust driver SIMD**: hash-function family swap. Similar shape.
- **Polars / cuDF** (DataFrame libraries): Polars is the Rust one and already
  vectorizes over Arrow-format columns; check if ndarray::simd primitives can
  replace its hand-rolled ones. cuDF is C++/CUDA on GPU, so CPU SIMD kernels
  do not apply there.
- **Apache Arrow Rust**: arrow-rs has vectorized kernels for
  filter/take/aggregate; ndarray::simd could plug in there too.
- **DataFusion** (Rust SQL engine): similar to Arrow path; the SIMD layer
  is generic enough to swap.

The pattern generalizes: any Rust-or-Rust-FFI-able data system with hot
SIMD paths is a candidate. ndarray::simd as the canonical Rust SIMD
substrate is a multi-year strategic position; this weekend is the proof
of concept.
