Commit f0c8b51

FSST LIKE: follow-ups doc with subagent-ready briefs
Three self-contained task briefs for the remaining DFA prefilter items: Shift-Or for short needles, Fat Teddy for multi-pattern OR, and an engine planner / cost-model that replaces the hardcoded routing cascade.

Each section is sized to be pasted into a subagent prompt: required context, files to touch, exit criteria, validation gates, known pitfalls. Includes a shared "Required context" block covering the FSST DFA architecture so each task brief stays focused on its own scope.

Recommended order when running sequentially:
1. Shift-Or: extends FoldedContains for needles ≤ 8 bytes.
2. Planner: refactors scan_to_bitbuf routing; needs Shift-Or added first so the matrix is non-trivial.
3. Fat Teddy: multi-pattern OR; benefits from the planner.

Signed-off-by: Claude <noreply@anthropic.com>
1 parent 24f94ed commit f0c8b51

1 file changed

Lines changed: 229 additions & 0 deletions

@@ -0,0 +1,229 @@
1+
# FSST LIKE follow-ups
2+
3+
Self-contained briefs for the three remaining DFA prefilter items. Each
4+
section is sized to be handed to a subagent as a complete prompt:
5+
context, constraints, files-to-touch, exit criteria, validation. The
6+
shared "Required context" block below applies to all three tasks.
7+
8+
---
9+
10+
## Required context (read before any task)
11+
12+
Vortex's FSST LIKE pushdown lives in `encodings/fsst/src/dfa/`:
13+
14+
| File | Role |
15+
|---|---|
16+
| `dfa/mod.rs` | `FsstMatcher` enum + `LikeKind::parse`. KMP byte-table and symbol-transition builders. Wildcard/CI helpers. |
17+
| `dfa/folded_contains.rs` | `FoldedContainsDfa` (`%needle%`, needle ≤127 bytes). Owns `scan_to_bitbuf` routing: escape-only memmem → dense short-circuit → Teddy + fused SSA → 1-byte → row_loop. |
18+
| `dfa/flat_contains.rs` | `FlatContainsDfa` (`%needle%`, 128–254 bytes). Same routing minus Teddy. |
19+
| `dfa/prefix.rs` | `FlatPrefixDfa` (`prefix%`). Single-direction DFA, fail-state on mismatch. |
20+
| `dfa/suffix.rs` | `SuffixMatcher` (`%suffix`). Backward scan. |
21+
| `dfa/multi_contains.rs` | `MultiContainsDfa` (`%seg1%seg2%`). Chained KMP. |
22+
| `dfa/anchor_scan.rs` | Streaming Teddy-2 / Teddy-3 + bucket collection. AVX2 / AVX-512 / NEON / scalar variants. SSA fusion is inline in each. |
23+
| `dfa/skip.rs` | Per-string skip strategies (memchr1/2/3, bitmap). |
24+
| `compute/like.rs` | `LikeKernel` entry — parses options, builds `FsstMatcher`, dispatches to `scan_to_bitbuf`. |
25+
| `benches/fsst_like.rs` | Divan benches. `fsst_contains` is parametric on a mined-needle corpus; `fsst_contains_{htt,ear,https}_*` cover SSA-density regimes; `fsst_not_contains_*` covers NOT LIKE. |
26+
27+
The DFA construction pipeline is:
28+
29+
1. **Parse** the pattern into a `LikeKind` (Prefix/Suffix/Contains/MultiContains).
30+
2. **Build the byte table** (`kmp_byte_transitions` for contains/multi, `build_prefix_byte_table` for prefix, `build_suffix_byte_table` for suffix). Wildcards (`b'_'`) fill the row with the advancing state; case-insensitive matching sets both letter cases via `set_advance`.
31+
3. **Build symbol transitions** (`build_symbol_transitions`) — for each `(state, FSST symbol code)`, simulate feeding the symbol's expansion bytes through the byte table.
32+
4. **Fuse** symbol transitions + the ESCAPE row into the final `(state × 256) → state` table.
33+
5. **Bucket-extract** progressing-c1 codes and (c2,c3) advancement sets for Teddy.
34+
6. **Capture SSA** (`single_step_accept_codes`) — codes whose state-0 transition lands directly at `accept`. These are passed through to the streaming Teddy as `ssa_codes`; Teddy AVX2/AVX-512/NEON fuse them via an extra per-block PSHUFB.
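
Steps 2 and 3 of the pipeline can be illustrated with a minimal sketch. It assumes a plain case-sensitive needle with no wildcards and no ESCAPE row, and the names `Row`, `kmp_byte_table`, and `symbol_transitions` are hypothetical stand-ins, not the signatures of the real `kmp_byte_transitions` / `build_symbol_transitions`.

```rust
/// One row per DFA state; `row[b]` is the next state after reading byte `b`.
/// `u8` states are enough for the needle lengths discussed here (up to 254 bytes).
type Row = [u8; 256];

/// KMP-style byte table for a plain needle. State `needle.len()` is the
/// sticky accept state. (Hypothetical helper, simplified for illustration.)
fn kmp_byte_table(needle: &[u8]) -> Vec<Row> {
    let n = needle.len();
    assert!(n >= 1 && n <= 254, "sketch assumes 1..=254 byte needles");
    let mut table: Vec<Row> = vec![[0u8; 256]; n + 1];
    table[n] = [n as u8; 256]; // once accepted, stay accepted
    table[0][needle[0] as usize] = 1;
    let mut fallback = 0usize; // state reached by the longest proper border
    for s in 1..n {
        table[s] = table[fallback]; // mismatches behave like the fallback state
        table[s][needle[s] as usize] = (s + 1) as u8; // the matching byte advances
        fallback = table[fallback][needle[s] as usize] as usize;
    }
    table
}

/// For each (state, symbol code), feed the symbol's expansion bytes through
/// the byte table and record the resulting state.
fn symbol_transitions(table: &[Row], symbols: &[&[u8]]) -> Vec<Vec<u8>> {
    (0..table.len())
        .map(|state| {
            symbols
                .iter()
                .map(|sym| {
                    sym.iter()
                        .fold(state as u8, |s, &b| table[s as usize][b as usize])
                })
                .collect()
        })
        .collect()
}
```

A symbol whose state-0 entry equals the accept state is what step 6 calls a single-step-accept code.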
35+
36+
`scan_to_bitbuf` routes between paths in this order:
37+
38+
```
39+
escape-only memmem [when no symbol contains any needle byte and pattern is wildcard-free + case-sensitive]
40+
ssa_saturated → 1-byte [SSA-density estimated > ~32k candidates total]
41+
triple Teddy + fused SSA + pair fallback
42+
pair Teddy + fused SSA
43+
escape_pair (1 bucket, c1=ESCAPE, ≤3 c2's, no SSA)
44+
1-byte progressing bitset
45+
row_loop scan_to_bitbuf_with
46+
```
47+
48+
Negation (`NOT LIKE`): all streaming paths handle the `negated: bool` flag by initializing the bitbuf inverted (all bits set) and clearing a row's bit on match.
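
The negation convention can be shown with a small sketch. It uses a `Vec<bool>` as a stand-in for the real bit buffer and a naive byte search as a stand-in for the DFA; `scan_like` is a hypothetical name.

```rust
/// Returns one flag per row. For `negated == true` the buffer starts all-set
/// and a match clears the row; for `negated == false` it starts all-clear and
/// a match sets the row.
fn scan_like(rows: &[&[u8]], needle: &[u8], negated: bool) -> Vec<bool> {
    assert!(!needle.is_empty(), "sketch assumes a non-empty needle");
    let mut bits = vec![negated; rows.len()];
    for (i, row) in rows.iter().enumerate() {
        // Naive stand-in for the matcher: any window equal to the needle.
        if row.windows(needle.len()).any(|w| w == needle) {
            bits[i] = !negated;
        }
    }
    bits
}
```

The point of the convention is that the scan loop only ever writes on a match, so the `LIKE` and `NOT LIKE` paths share the same inner loop.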
49+
50+
Test conventions: `rstest` cases preferred; tests live in `dfa/tests.rs`; `assert_arrays_eq!` for array comparisons; helpers `sym(bytes)` and `escaped(bytes)` build symbols and all-escape-encoded code streams.
51+
52+
Bench conventions: `cargo bench -p vortex-fsst --bench fsst_like --features _test-harness`. Compare against the bench snapshots in `git log` for recent commits.
53+
54+
Validation gates for any change:
55+
- `cargo test -p vortex-fsst --lib` (must keep passing all existing tests)
56+
- `cargo +nightly fmt --all` (clean)
57+
- `cargo clippy -p vortex-fsst --all-targets --all-features` (no new lints in changed files; preexisting lints on `mod.rs:498` / `anchor_scan.rs:3100+` / `dfa_compressed/` are out of scope)
58+
- `./scripts/public-api.sh` if any public API moves
59+
- `cargo bench -p vortex-fsst --bench fsst_like --features _test-harness` showing no regression on the existing parametric set
60+
61+
---
62+
63+
## Task A — Shift-Or / Bitap for short needles (≤8 bytes)
64+
65+
**Goal.** Add a `ShiftOr` matcher variant in the FSST DFA stack, used when `Contains` patterns have needles of length ≤ 8 bytes AND no SSA codes are present in the trained dict for the needle. Bit-parallel single-`u64`-state matcher; one ALU op per code byte instead of one table lookup + the Teddy verifier dispatch.
66+
67+
**Why.** For very short needles, the Teddy + verify dispatch is the bottleneck — there are many candidates per byte but the actual DFA work is trivial. Shift-Or replaces both the prefilter and the verifier with a single `state = (state << 1) | B[byte]` per byte, accept on a fixed high bit.
68+
69+
**Where it fits in the routing ladder.** Before fused-Teddy, after escape-only memmem and dense short-circuit. So in `folded_contains.rs::scan_to_bitbuf`, the order becomes:
70+
```
71+
escape-only memmem → dense short-circuit → shift-or → triple Teddy → pair Teddy → 1-byte → row_loop
72+
```
73+
74+
**Algorithm sketch.**
75+
- Needle of length L (1 ≤ L ≤ 8): build a 256-entry `B: [u64; 256]` table.
76+
- For each pattern position `i` in 0..L, for each byte value `b` that "matches" `needle[i]` (literal, or any byte if wildcard, or case-folded byte when `ci`): set `B[b]` bit `i` to `0` (rest are `1`).
77+
- State starts at `!0u64`. For each input byte `b`: `state = (state << 1) | B[b]`. Match when `state & (1 << (L - 1)) == 0`.
78+
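- For FSST: build B over the COMPRESSED stream by composing through the symbol table. Each FSST symbol code `c` (length up to 8) maps to a transition function that updates the state by feeding all of the symbol's bytes through `B`. Precompute `B_sym[c]: fn(u64) -> u64`. Or, equivalently, store `B_sym[c]` as `(shift: u32, or_mask: u64)`: because the state shifts one bit per input byte, the shift is `symbol_length` (not `8 * symbol_length`), and for a symbol with bytes `s_0 … s_{k-1}` the or_mask is the `OR` of `B[s_j] << (k - 1 - j)`. A match can end in the middle of a symbol, so the accept bit has to be observable for every byte position of the symbol, not only after the composed step.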
79+
- ESCAPE_CODE handling: ESCAPE consumes the next byte as a literal; encode that as a two-step state transition. Either build a dedicated ESCAPE table or fall back to the standard byte path for escape pairs.
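
The symbol composition can be sketched as follows. This is a sketch under stated assumptions, not the intended implementation: it ignores ESCAPE, wildcards, and case folding, and it deliberately leaves mask bits at or above `L` as `0` (rather than `1`) so that a cleared accept bit keeps shifting upward and a match ending mid-symbol stays visible after the composed step. `byte_masks`, `SymbolStep`, `compose`, and `contains_compressed` are hypothetical names.

```rust
/// Bit `i` of `masks[b]` is 0 iff byte `b` matches `needle[i]`.
/// Assumption of this sketch: bits >= needle.len() are 0, not 1.
fn byte_masks(needle: &[u8]) -> [u64; 256] {
    let len = needle.len();
    assert!((1..=8).contains(&len), "sketch assumes 1..=8 byte needles");
    let low = (1u64 << len) - 1;
    let mut masks = [low; 256];
    for (i, &n) in needle.iter().enumerate() {
        masks[n as usize] &= !(1u64 << i);
    }
    masks
}

/// Composed effect of feeding every byte of one symbol through Shift-Or.
#[derive(Clone, Copy)]
struct SymbolStep {
    shift: u32,   // one bit per symbol byte
    or_mask: u64, // OR of masks[s_j] << (k - 1 - j)
}

fn compose(masks: &[u64; 256], symbol: &[u8]) -> SymbolStep {
    assert!(symbol.len() <= 8, "FSST symbols are at most 8 bytes");
    let or_mask = symbol
        .iter()
        .fold(0u64, |acc, &b| (acc << 1) | masks[b as usize]);
    SymbolStep { shift: symbol.len() as u32, or_mask }
}

/// `codes` is a stream of symbol codes (no ESCAPE handling in this sketch).
fn contains_compressed(needle: &[u8], symbols: &[&[u8]], codes: &[u8]) -> bool {
    let masks = byte_masks(needle);
    let steps: Vec<SymbolStep> = symbols.iter().map(|s| compose(&masks, s)).collect();
    let accept_bit = needle.len() as u32 - 1;
    let mut state = !0u64;
    for &c in codes {
        let step = steps[c as usize];
        state = (state << step.shift) | step.or_mask;
        // Bits accept_bit .. accept_bit + shift - 1 record a match ending at
        // any byte position inside this symbol.
        let window = ((1u64 << step.shift) - 1) << accept_bit;
        if state & window != window {
            return true;
        }
    }
    false
}
```

With `L ≤ 8` and symbols of at most 8 bytes, the highest bit inspected is bit 14, so the window always fits in the `u64` state.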
80+
81+
**Files to touch.**
82+
- New file `dfa/shift_or.rs` containing `ShiftOrDfa` with `new(symbols, lengths, needle, ci)` and `matches(codes) -> bool`. Inner-loop must be `#[inline(always)]`.
83+
- `dfa/mod.rs`: add a `ShiftOr(ShiftOrDfa)` variant to `MatcherInner`. Add a build-time gate in `FsstMatcher::try_new_with`: prefer `ShiftOr` over `FoldedContains` when needle ≤ 8 bytes AND no SSA codes exist for the needle AND `!case_insensitive_with_non_ascii`.
84+
- `dfa/mod.rs::scan_to_bitbuf` dispatches new variant via `scan_to_bitbuf_with`.
85+
86+
**Constraints.**
87+
- Wildcard support: `_` at position `i` makes `B[b]` bit `i` zero for ALL `b`. Same as ANY-byte match.
88+
- Case-insensitivity: when `ci`, also clear `B[b]` for the case-flipped letter byte.
89+
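- The needle-length cap is 8. The high-bit accept only needs `L - 1 ≤ 63`, so the `u64` state is not the limiting factor by itself; the tighter cap keeps the accept bit plus one composed symbol step (symbols are up to 8 bytes, so up to 8 bits of shift) comfortably inside a single word.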
90+
- Don't enable when SSA codes exist — Shift-Or matches deterministic byte sequences; SSA introduces multi-byte symbol shortcuts that would over-approximate without a separate verifier.
91+
- The FSST symbol composition is the subtle part — write it carefully and add a property test that compares against the FoldedContains result on the same needle+data.
92+
93+
**Exit criteria.**
94+
- New `dfa/shift_or.rs` (~200 lines including doc-comment + tests).
95+
- `ShiftOrDfa` selected for ≤ 8-byte needles in `FsstMatcher::try_new_with` (unless SSA is present, falling through to FoldedContains).
96+
- At least 5 tests: 1-byte / 2-byte / 8-byte needles; wildcard; case-insensitive; with FSST symbols; vs FoldedContains on random data (property test).
97+
- All 169+ existing tests pass.
98+
- New bench `fsst_contains_short_<dataset>` (≤4-byte needles like `%abc%`, `%xy%`) added to `benches/fsst_like.rs`. ShiftOr should be ≥1.5× faster than FoldedContains on selective short needles.
99+
100+
**Risk / known pitfalls.**
101+
- The FSST symbol composition is easy to get wrong on multi-byte symbols. Cross-check against the existing FoldedContains for the same needle.
102+
- Beware that the state bit-shift direction is convention-dependent. The algorithm sketch uses the left-shift convention, where state bit `i` being `0` means "matched `needle[0..=i]` ending here" and acceptance is read from bit `L - 1`. Be consistent.
103+
- Don't ship if Shift-Or regresses any existing fsst_contains parametric bench.
104+
105+
---
106+
107+
## Task B — Fat Teddy / multi-pattern OR
108+
109+
**Goal.** Support `LIKE x OR LIKE y OR …` (and analogous `IN (x, y, …)` LIKE-of-string patterns) with a single Hyperscan-style Fat Teddy pass: 16 buckets, 4-byte fingerprint per bucket, AVX2 or AVX-512. Today every LIKE in the OR runs as a separate scan + boolean OR of results; Fat Teddy collapses them into one streaming pass.
110+
111+
**Why.** ClickBench Q23 (`UserAgent LIKE … OR LIKE … OR LIKE …`) is the canonical workload. Real-world `LIKE IN (…)` lists are typically 3–50 needles. A naive N-pass implementation is roughly `N×` slower than necessary; Fat Teddy can reach ~1.5× the single-pattern cost.
112+
113+
**Where it fits.** This is a NEW entry point in the FSST `compute/like.rs` / `kernel.rs` layer (one that takes `&[&[u8]]` patterns and returns one BoolArray per pattern OR a single `OR`-merged result). The single-pattern code path stays unchanged.
114+
115+
**Algorithm sketch.** (Reference: Hyperscan's `fdr/fdr_loadval.h` + `teddy_avx2.h`. There is a Rust port in `aho_corasick::packed::teddy`; *do not copy-paste*, but study the structure.)
116+
- Build 16 buckets across all needles. Pack needles into buckets greedily by first-byte (and second-byte for collision avoidance).
117+
- For each bucket, build a 4-byte fingerprint table: 16-entry nibble lookup × 4 successive byte positions (c1 through c4). The fingerprint match is bit-AND of the 4 PSHUFB results.
118+
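- Per block (AVX2): 4 loads (`v1` at `i`, `v2` at `i+1`, `v3` at `i+2`, `v4` at `i+3`), 8 PSHUFB lookups (low and high nibble for each of the 4 positions), then AND the results together. A byte lane holds only 8 bucket bits, so with 16 buckets on AVX2 the 256-bit vector covers 16 input bytes duplicated across both 128-bit halves (buckets 0–7 in one half, 8–15 in the other) rather than 32 distinct bytes. In the combined vector, bit `b` in lane `j` set ⇒ candidate for bucket `b` at offset `j`; a compare-against-zero + movemask finds the lanes that hold any candidate, and the lane byte identifies the bucket(s). Run the per-bucket verifier (DFA matches) for that needle.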
119+
- Buckets that share more than one needle pack via the cross-bucket scheme (also from Hyperscan).
120+
- ESCAPE / wildcard / FSST-symbol semantics: each per-bucket verifier is the existing single-pattern DFA, so the symbol semantics are already correct.
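
The fingerprint-then-verify structure can be sketched in scalar code. This is a sketch under simplifying assumptions: one needle per bucket, needles of at least 4 bytes, raw bytes rather than FSST codes, and a plain `starts_with` standing in for the per-bucket DFA verifier. `NibbleMasks`, `build_masks`, `candidates`, and `matches_any` are hypothetical names.

```rust
const FP: usize = 4; // fingerprint length in bytes

/// For each fingerprint position, a 16-entry table per nibble; each entry is
/// a 16-bit set of buckets whose needle has that nibble at that position.
struct NibbleMasks {
    lo: [[u16; 16]; FP],
    hi: [[u16; 16]; FP],
}

fn build_masks(needles: &[&[u8]]) -> NibbleMasks {
    assert!(needles.len() <= 16, "one needle per bucket, 16 buckets");
    let mut m = NibbleMasks { lo: [[0; 16]; FP], hi: [[0; 16]; FP] };
    for (bucket, needle) in needles.iter().enumerate() {
        assert!(needle.len() >= FP, "sketch assumes needles of >= 4 bytes");
        for p in 0..FP {
            m.lo[p][(needle[p] & 0x0f) as usize] |= 1 << bucket;
            m.hi[p][(needle[p] >> 4) as usize] |= 1 << bucket;
        }
    }
    m
}

/// Bucket bitset of candidates starting at `hay[i]`.
/// Caller guarantees `i + FP <= hay.len()`.
fn candidates(m: &NibbleMasks, hay: &[u8], i: usize) -> u16 {
    let mut c = u16::MAX;
    for p in 0..FP {
        let b = hay[i + p];
        // Scalar equivalent of the two PSHUFB lookups plus AND per position.
        c &= m.lo[p][(b & 0x0f) as usize] & m.hi[p][(b >> 4) as usize];
    }
    c
}

fn matches_any(needles: &[&[u8]], hay: &[u8]) -> bool {
    let m = build_masks(needles);
    if hay.len() < FP {
        return false;
    }
    for i in 0..=hay.len() - FP {
        let mut c = candidates(&m, hay, i);
        while c != 0 {
            let bucket = c.trailing_zeros() as usize;
            // Stand-in verifier; the brief uses the single-pattern DFA here.
            if hay[i..].starts_with(needles[bucket]) {
                return true;
            }
            c &= c - 1; // clear the lowest set bucket bit
        }
    }
    false
}
```

The SIMD versions compute the same per-offset bucket bitsets for a whole block at once; the verifier loop is unchanged.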
121+
122+
**Files to touch.**
123+
- New file `dfa/fat_teddy.rs` containing the bucket-packing algorithm, the AVX2 + AVX-512 + scalar passes, and bucket-identification helpers.
124+
- `dfa/mod.rs`: add a `MultiNeedleMatcher` (separate from `FsstMatcher`) with `try_new_multi(symbols, lengths, &[&[u8]])` returning per-pattern `FoldedContainsDfa`s + a packed `FatTeddyIndex`.
125+
- New entry point in `compute/like.rs` or a separate `compute/like_multi.rs` for the multi-pattern case. Or, simpler: add `fn like_or(...) -> Option<ArrayRef>` to `LikeKernel`-style trait. (Confer with the LikeKernel API; if there's no existing precedent for batched OR, defer the API design question and write a free function for now.)
126+
127+
**Constraints.**
128+
- All needles must be `Contains` shape (`%x%`). Mixing prefix/suffix needles in a single Fat Teddy pass is out of scope.
129+
- Wildcard and case-insensitive flags must be consistent across all needles in the batch.
130+
- Number of needles per Fat Teddy pass capped at 16 (1 per bucket). For more, chunk into groups of 16 and OR-merge across passes.
131+
- Fall back to N-pass when Fat Teddy doesn't apply (mixed shapes, > 16 of certain dense buckets, etc.).
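
The chunk-and-merge rule for more than 16 needles might look like the sketch below. `or_merge_chunked` and its closure parameter are hypothetical; the closure stands in for one Fat Teddy pass and returns one flag per row.

```rust
/// Splits `needles` into groups of at most `max_per_pass`, runs one pass per
/// group, and OR-merges the per-row results.
fn or_merge_chunked<F>(
    needles: &[&[u8]],
    rows: usize,
    max_per_pass: usize,
    mut scan_pass: F,
) -> Vec<bool>
where
    F: FnMut(&[&[u8]]) -> Vec<bool>,
{
    let mut merged = vec![false; rows];
    for chunk in needles.chunks(max_per_pass) {
        let hits = scan_pass(chunk);
        assert_eq!(hits.len(), rows, "each pass must return one flag per row");
        for (out, hit) in merged.iter_mut().zip(hits) {
            *out |= hit;
        }
    }
    merged
}
```

With 20 needles and `max_per_pass = 16` this runs two passes (16 + 4), which is the behaviour the cap describes.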
132+
133+
**Exit criteria.**
134+
- New `dfa/fat_teddy.rs` (~600–1000 lines including SIMD variants).
135+
- A new test set covering: 3, 8, 16-needle OR; ensure result equals OR-of-N-single-pattern results on the same data.
136+
- New bench `fsst_contains_or_<n>_<dataset>` with `n ∈ {3, 8, 16}`. Fat Teddy should be ≥1.5× faster than N single-pattern passes for n ≥ 4.
137+
- All existing 169+ tests pass.
138+
139+
**Risk / known pitfalls.**
140+
- The bucket-packing greedy can collide pathologically — write the property test against random needle sets to expose this.
141+
- ESCAPE_CODE c1 collides across patterns; needs cross-bucket handling per Hyperscan FDR.
142+
- Don't ship if the single-pattern parametric benches regress.
143+
144+
---
145+
146+
## Task C — Engine planner / cost-model routing
147+
148+
**Goal.** Replace the hardcoded routing cascade in `FoldedContainsDfa::scan_to_bitbuf` (and the parallel `FlatContainsDfa` / `MultiContainsDfa` cascades) with a small cost-model planner that picks the best scan path before execution.
149+
150+
**Why.** The cascade is hardcoded — every new variant (Shift-Or, Fat Teddy, dense short-circuit) is a manual `if` branch. The SSA-density gate I shipped is a one-off hack of this kind. A planner cleanly extends to N variants and is the right place to wire in column statistics (min/max, bloom filters, histogram) when those land.
151+
152+
**Inputs available at scan time** (free):
153+
- `n` (row count), `all_bytes.len()` (total compressed bytes).
154+
- `needle.len()`, `accept_state`, `|progressing_codes|`, `|SSA_codes|`, `|buckets|`, `|triple_buckets|`.
155+
- Whether `escape_only_pattern` is feasible.
156+
- Architecture features (AVX2/AVX-512/NEON detected).
157+
158+
**Inputs available with sampling** (~µs):
159+
- SSA density (already computed by `ssa_saturated`).
160+
- Estimated candidate density per path.
161+
162+
**Decision table to encode.**
163+
164+
```
165+
1. escape_only_pattern is Some? → escape_only memmem
166+
2. SSA codes present AND density > THRESHOLD? → 1-byte progressing bitset
167+
3. needle.len() ≤ 8 AND no SSA? → ShiftOr (after Task A)
168+
4. all_bytes.len() < 4 KiB? → row_loop
169+
5. triple buckets exist? → triple Teddy + SSA fusion + pair fallback
170+
6. exactly 1 pair bucket with c1=ESCAPE (no SSA, ≤3 c2)? → escape_pair (specialized)
171+
7. pair buckets exist? → pair Teddy + SSA fusion
172+
8. progressing codes exist? → 1-byte progressing bitset
173+
9. fallback → row_loop
174+
```
175+
176+
**Architecture.** Introduce a `ScanPlanner` struct that owns the inputs and exposes `plan(&self) -> ScanPlan`. `ScanPlan` is an enum of `EscapeOnly | OneByteSaturated | ShiftOr | TripleTeddy | PairTeddy | EscapePair | OneByteBitset | RowLoop`. `scan_to_bitbuf` dispatches on the plan instead of branching inline.
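
A minimal sketch of the planner shape, encoding the decision table row by row. The field names on `ScanContext` are assumptions made for illustration (the brief only names the inputs informally), and the 4 KiB threshold is taken from row 4 of the table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScanPlan {
    EscapeOnly,
    OneByteSaturated,
    ShiftOr,
    TripleTeddy,
    PairTeddy,
    EscapePair,
    OneByteBitset,
    RowLoop,
}

/// Hypothetical input bundle; every field is plain data, so planning does no
/// allocation.
#[derive(Debug, Clone, Copy, Default)]
struct ScanContext {
    all_bytes: usize,
    needle_len: usize,
    escape_only_feasible: bool,
    ssa_codes: usize,
    ssa_saturated: bool, // result of the existing density sample
    triple_buckets: usize,
    pair_buckets: usize,
    escape_pair_eligible: bool, // 1 bucket, c1=ESCAPE, <=3 c2, no SSA
    progressing_codes: usize,
}

struct ScanPlanner {
    ctx: ScanContext,
}

impl ScanPlanner {
    fn new(ctx: ScanContext) -> Self {
        Self { ctx }
    }

    /// First matching row of the decision table wins.
    fn plan(&self) -> ScanPlan {
        let c = &self.ctx;
        if c.escape_only_feasible {
            ScanPlan::EscapeOnly
        } else if c.ssa_codes > 0 && c.ssa_saturated {
            ScanPlan::OneByteSaturated
        } else if c.needle_len <= 8 && c.ssa_codes == 0 {
            ScanPlan::ShiftOr
        } else if c.all_bytes < 4096 {
            ScanPlan::RowLoop
        } else if c.triple_buckets > 0 {
            ScanPlan::TripleTeddy
        } else if c.escape_pair_eligible {
            ScanPlan::EscapePair
        } else if c.pair_buckets > 0 {
            ScanPlan::PairTeddy
        } else if c.progressing_codes > 0 {
            ScanPlan::OneByteBitset
        } else {
            ScanPlan::RowLoop
        }
    }
}
```

Keeping `plan` a pure function of `ScanContext` is what makes the per-row `test_planner_picks_*` tests cheap to write.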
177+
178+
**Cost model** (start simple — calibrated constants, not a learned model):
179+
180+
```rust
181+
fn cost(plan: ScanPlan, ctx: &ScanContext) -> u64 {
182+
// Setup + per-byte scan + per-candidate verify, in approximate ns.
183+
match plan {
184+
EscapeOnly => SETUP_MEMMEM + ctx.all_bytes / MEMMEM_THROUGHPUT_BYTES_PER_NS,
185+
ShiftOr => SETUP_SHIFTOR + ctx.all_bytes * SHIFTOR_NS_PER_BYTE,
186+
TripleTeddy => SETUP_TEDDY + ctx.all_bytes * TEDDY_NS_PER_BYTE + ctx.estimated_candidates * VERIFY_NS,
187+
...
188+
}
189+
}
190+
```
191+
192+
Constants come from the existing benches — calibrate per architecture (a tiny build-time decision: AVX-512 vs AVX2 vs NEON vs scalar).
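
A compilable variant of the cost function, to show how calibrated constants could drive an argmin over feasible plans. Every number below is a made-up placeholder, not a measured value, and `Plan`, `CostInputs`, `Calibration`, `cost_ns`, and `cheapest` are hypothetical names. Per-byte costs are kept in picoseconds so the arithmetic stays in integers.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Plan {
    EscapeOnly,
    ShiftOr,
    TripleTeddy,
    RowLoop,
}

struct CostInputs {
    all_bytes: u64,
    estimated_candidates: u64,
}

struct Calibration {
    setup_ns: u64,
    ps_per_byte: u64, // picoseconds per compressed byte
    verify_ns: u64,   // per candidate handed to the verifier
}

/// Placeholder constants; real values would come from the benches.
fn calibration(plan: Plan) -> Calibration {
    match plan {
        Plan::EscapeOnly => Calibration { setup_ns: 50, ps_per_byte: 30, verify_ns: 0 },
        Plan::ShiftOr => Calibration { setup_ns: 100, ps_per_byte: 250, verify_ns: 0 },
        Plan::TripleTeddy => Calibration { setup_ns: 400, ps_per_byte: 60, verify_ns: 15 },
        Plan::RowLoop => Calibration { setup_ns: 0, ps_per_byte: 1000, verify_ns: 0 },
    }
}

/// Setup + per-byte scan + per-candidate verify, in approximate ns.
fn cost_ns(plan: Plan, ctx: &CostInputs) -> u64 {
    let c = calibration(plan);
    c.setup_ns + ctx.all_bytes * c.ps_per_byte / 1000 + ctx.estimated_candidates * c.verify_ns
}

/// Cheapest of the plans that are feasible for this scan, if any.
fn cheapest(feasible: &[Plan], ctx: &CostInputs) -> Option<Plan> {
    feasible.iter().copied().min_by_key(|&p| cost_ns(p, ctx))
}
```

With these placeholders a tiny input favours `RowLoop` (no setup cost), while a large input with few candidates favours `TripleTeddy`; a high candidate count tips the balance back toward `ShiftOr`.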
193+
194+
**Files to touch.**
195+
- New `dfa/planner.rs` with `ScanPlanner`, `ScanContext`, `ScanPlan`, `cost`, `plan`.
196+
- `dfa/folded_contains.rs::scan_to_bitbuf`: extract each path into a method (`run_escape_only`, `run_shiftor`, `run_triple_teddy`, …), then have the top-level dispatch be `match self.planner().plan() { ... }`. Same refactor (smaller) for `flat_contains.rs` and `multi_contains.rs`.
197+
- Tracing: under `VORTEX_FSST_PLAN_TRACE`, print the inputs + chosen plan + estimated cost. Required to validate the planner picks the right thing on each bench.
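
One way the trace switch could be wired, as a sketch: the environment variable is read once and cached, so the hot path pays only a load. The output format and the function names (`plan_trace_enabled`, `format_plan_trace`, `trace_plan`) are assumptions; only the variable name comes from the brief.

```rust
/// Reads `VORTEX_FSST_PLAN_TRACE` once per process and caches the answer.
fn plan_trace_enabled() -> bool {
    static ENABLED: std::sync::OnceLock<bool> = std::sync::OnceLock::new();
    *ENABLED.get_or_init(|| std::env::var_os("VORTEX_FSST_PLAN_TRACE").is_some())
}

/// Hypothetical single-line trace format: inputs, chosen plan, estimated cost.
fn format_plan_trace(needle_len: usize, all_bytes: usize, plan: &str, est_cost_ns: u64) -> String {
    format!(
        "fsst-plan needle_len={needle_len} all_bytes={all_bytes} plan={plan} est_cost_ns={est_cost_ns}"
    )
}

/// Emits the trace line to stderr only when tracing is enabled.
fn trace_plan(needle_len: usize, all_bytes: usize, plan: &str, est_cost_ns: u64) {
    if plan_trace_enabled() {
        eprintln!("{}", format_plan_trace(needle_len, all_bytes, plan, est_cost_ns));
    }
}
```

Running a bench with `VORTEX_FSST_PLAN_TRACE=1` set would then print one line per planned scan.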
198+
199+
**Constraints.**
200+
- The planner must not perform any allocation in its decision (only constant-time arithmetic + the existing SSA-density sample).
201+
- The planner must produce the same plan that the existing cascade produces today on the bench corpus (lock this in with a per-bench `assert_eq!(plan, …)` test) — i.e. the planner refactor must not regress any bench, then improvements come from cleaner additions.
202+
- Architecture detection happens once at construction time, not per-call.
203+
204+
**Exit criteria.**
205+
- New `dfa/planner.rs` (~300 lines).
206+
- `scan_to_bitbuf` in folded / flat / multi all dispatch via the planner.
207+
- New tests `test_planner_picks_*` covering each decision-table row.
208+
- All 169+ existing tests pass, all existing benches at parity ± noise (1.05× threshold).
209+
- A regression-trace test that confirms the planner picks the same path as the legacy cascade for every existing fsst_like bench.
210+
- Doc comment on `ScanPlanner` explaining the cost-model rationale and the calibration source (which bench produced which constant).
211+
212+
**Risk / known pitfalls.**
213+
- The per-bench parity assertion is the load-bearing test. If a single planner decision diverges from the legacy cascade for any bench, you've either improved or regressed something — figure out which before declaring done.
214+
- Don't try to add column-statistics integration in this task. Just the routing refactor.
215+
216+
---
217+
218+
## Running these as subagents
219+
220+
Each task is sized for a single subagent worktree run. The expected protocol:
221+
222+
1. Launch with `subagent_type: "general-purpose"` and `isolation: "worktree"`.
223+
2. Pass the **Required context** block + the chosen task section verbatim as the prompt.
224+
3. Add at the end: *"Implement the task, run `cargo test -p vortex-fsst --lib` plus the validation gates listed, then commit on the worktree branch and print the diff stat + branch name."*
225+
4. After the subagent returns, review the diff, run benches locally on the parent worktree, integrate.
226+
227+
Tasks are mostly independent and can run in parallel: Shift-Or adds a new file + a `MatcherInner` variant; Fat Teddy adds new files + a separate matcher type; the planner refactors `scan_to_bitbuf` only. The one overlap to watch is that Task A also adds a routing entry to `scan_to_bitbuf` and Task C's decision table has a ShiftOr row, so expect a small merge there if A and C run concurrently.
228+
229+
Recommended order if running sequentially: **Task A (Shift-Or) → Task C (planner) → Task B (Fat Teddy)**. The planner becomes much more valuable after Shift-Or is in (it adds a routing decision); Fat Teddy is the largest and benefits from a stable planner.
