feat: Wave 1 — Lexical search with BM25F scoring and n-gram index#120
feat: Wave 1 — Lexical search with BM25F scoring and n-gram index#120leesharon wants to merge 40 commits into
Conversation
Introduce the rskim-search crate as the search and indexing foundation for the 3-layer code search system. All types and traits are I/O-free and follow the rskim-core patterns exactly. - crates/rskim-search: new crate with Cargo.toml, lib.rs, types.rs, traits.rs - types.rs: FileId, SearchField, MatchSpan, LineRange, TemporalFlags, SearchQuery (builder pattern), SearchResult, IndexStats, FileTable (I/O-free normalize), SearchError (thiserror), Result<T> alias + 10 unit tests covering all types - traits.rs: SearchLayer, LayerBuilder, FieldClassifier trait definitions with full doc comments - CLI: add `skim search` stub subcommand with full clap arg definitions (--ast, --limit, --blast-radius, --hot, --cold, --risky, --build, etc.) - Workspace: add rskim-search member, wire into rskim dependency - main.rs: register --ast and --limit as value-consuming flags - completions.rs: register search::command() for accurate tab completions - cli_search.rs: 4 integration tests (help, -h, stub with query, stub empty)
serde_json was declared as a dependency but never imported or used anywhere in the crate source. Removes unnecessary build-time overhead.
- Add #![allow(clippy::unwrap_used, clippy::expect_used, clippy::panic)] to error_conversion.rs and serialization_roundtrip.rs so crate-level deny lints do not break cargo clippy --tests - Replace 3.14 score literal with 2.75 to avoid clippy::approx_constant
Add 87 new tests across 6 test files (5 new + 1 extended): - error_conversion.rs: 89 lines (error type conversions) - file_table_edge_cases.rs: 180 lines (new) (FileTable corner cases) - serialization_roundtrip.rs: 273 lines (serde round-trip integrity) - trait_object_safety.rs: 68 lines (new) (trait object handling) - types_edge_cases.rs: 270 lines (new) (SearchQuery/types validation) - cli_search.rs: extended with exit code, help text validation Implementation changes supporting tests: - FileTable custom Serialize/Deserialize (emit paths vec only) - SearchQuery #[must_use] attribute - serde_json added to rskim-search dev-dependencies - Cargo.lock updated with rskim-search dependency in rskim crate All 1700+ workspace tests passing. Co-Authored-By: Claude <noreply@anthropic.com>
…t comments, doc drift - search.rs: replace manual print_help() body with command().print_help() so help text auto-tracks clap definition without drift (DRY violation fixed) - rskim/Cargo.toml: add INTENTIONAL comment on rskim-search dep to document it is scaffolding wired up in Wave 1, not dead weight - rskim-search/Cargo.toml: add readme/keywords/categories to match rskim-core metadata pattern; add explanatory comments to [lints.clippy] section - main.rs: update is_flag_with_value() docstring to explicitly cover both Args struct flags and subcommand-only flags (--ast, --limit, etc.) Co-Authored-By: Claude <noreply@anthropic.com>
…asts, fast-path Five issues from batch-1 audit of types.rs: - Deserialization now runs normalize() on each path before building the ids map, so a table serialized with ./src/main.rs round-trips idempotently (re-registering the same unnormalized path after deser returns the existing FileId rather than creating a duplicate). - Replace silent `as usize` / `as u64` casts in lookup() with usize::try_from(...).ok()? so truncation becomes a None return rather than wrong-file access. Add compile_error! to reject 32-bit targets where usize < u64 would make the casts lossy. - Swap clone order in register(): ids takes the clone, paths takes the move — avoids one unnecessary PathBuf allocation per new registration. - Fast-path in normalize(): skip Vec<Component> allocation entirely when the path contains no CurDir or ParentDir components (the common case). - Add two serialization_roundtrip tests covering the deser-normalize invariant: unnormalized-paths-idempotent-after-roundtrip and serialized-form-uses-normalized-paths. Co-Authored-By: Claude <noreply@anthropic.com>
…build/update stub coverage - file_table_edge_cases: add test_normalize_absolute_over_root documenting that /a/../../b produces /../b (I/O-free normalizer cannot clamp at root) - cli_search: add regression tests for --build, --rebuild, --update flags confirming each routes through the not-yet-implemented stub path Co-Authored-By: Claude <noreply@anthropic.com>
types.rs was 577 lines, exceeding the 400-line project convention.
Split into four focused modules under types/:
- file_table.rs — FileId, FileTable, custom serde impls, normalize(),
64-bit compile_error gate
- query.rs — SearchQuery (#[must_use]), TemporalFlags, SearchField
- result.rs — SearchResult, MatchSpan, LineRange, IndexStats
- error.rs — SearchError, Result alias
mod.rs barrel re-exports all types so lib.rs pub use is unchanged.
Inline unit tests moved into their respective sub-modules.
All 91 tests pass; clippy --workspace -D warnings clean.
Co-Authored-By: Claude <noreply@anthropic.com>
command().print_help() renders the command name from clap's
Command::new("search"), which omits the parent binary. Override with
.name("skim search") so help output shows "Usage: skim search".
Simplifier reverted to manual println pattern; restoring clap delegation since eliminating help text duplication was the explicit review finding (flagged by 4 reviewers). Uses let _ = to handle io::Result.
…xHashMap - Add register_within(path, root) that rejects paths escaping project root (leading .. or absolute after normalization) - Add MAX_FILE_TABLE_ENTRIES (10M) size limit on FileTable deserialization to prevent OOM from malicious/corrupted index files - Switch FileTable from HashMap to FxHashMap (rustc-hash) for faster path lookups — zero-cost change, already a transitive dependency - 7 new tests covering all three fixes
…m-search from rskim - Move rustc-hash = "1.1" from rskim-search inline declaration into [workspace.dependencies] so all crates follow the same pattern. - Remove the rskim-search dependency from the rskim CLI crate; no code in rskim imports it (search.rs is a pure stub). The dep will be re-added when Wave 1 wires up real search logic. Fixes batch-2 issues: cargo:rustc_hash_workspace, cargo:unused_rskim_search_dep Co-Authored-By: Claude <noreply@anthropic.com>
…ble normalization - register_within now verifies containment by joining root+path, re-normalizing, and checking starts_with(normalized_root); previously _root was unused so any relative path passed regardless of root - Extract register_normalized(PathBuf) private method so both register() and register_within() normalize exactly once instead of twice - Deserialization now returns Err on duplicate normalized paths (e.g., ["./src/main.rs","src/main.rs"]) rather than silently overwriting the ids map and breaking the paths.len() == ids.len() invariant - Add test_file_table_deserialization_rejects_at_limit_plus_one covering the actual rejection branch of the 10M-entry size limit (was previously untested) - Add test_file_table_deserialization_rejects_duplicate_normalized_paths - Add register_within tests: single-parent escape, subdirectory escape, root-itself acceptance; document that relative paths without ".." do resolve inside root when joined Co-Authored-By: Claude <noreply@anthropic.com>
…cture - Add SearchIndex super-trait (extends SearchLayer with file_table, stats, mtimes) - Change LayerBuilder::build() return type to Box<dyn SearchIndex> - Add CorruptedIndex error variant to SearchError - Add SearchField::as_u8/from_u8/all() for on-disk serialization - Define shared lexical types: Ngram, PostingEntry, IndexHeader, Bm25Params, IndexMetadata - Create lexical/ module (ngram, index_format, scoring, builder, query stubs) - Create fields/ module (tree_sitter_fields, serde_fields, markdown_fields stubs) - Add memmap2 workspace dependency, serde_json to rskim-search deps - Add search test fixtures (TypeScript, Rust, JSON, YAML, Markdown, Python) - Update trait_object_safety tests for SearchIndex
Implement `extract_ngrams` (index-time) and `extract_query_ngrams` (query-time) in `crates/rskim-search/src/lexical/ngram.rs`. - `extract_ngrams`: sliding 2-byte window over whitespace-delimited tokens; first and last bigram of each token receive BORDER_WEIGHT (2.0); interior bigrams receive INTERIOR_WEIGHT (1.0); repeated bigrams accumulate weights via FxHashMap. - `extract_query_ngrams`: greedy covering-set over raw trimmed query bytes; selects bigrams maximising uncovered position coverage; capped at 32 bigrams; all weights uniform 1.0. Edge cases handled: empty input, single-byte input, all-whitespace, Unicode multi-byte chars, CJK 3/4-byte sequences, very long text, repeated chars. Also suppress two pre-existing clippy warnings in scaffolding code (`type_complexity` in tree_sitter_fields.rs, `manual_contains` in markdown_fields.rs) so `cargo clippy -p rskim-search -- -D warnings` passes cleanly. Co-Authored-By: Claude <noreply@anthropic.com>
Implement `extract_ngrams` (index-time) and `extract_query_ngrams` (query-time) in `crates/rskim-search/src/lexical/ngram.rs`. - `extract_ngrams`: sliding 2-byte window over whitespace-delimited tokens; first and last bigram of each token receive BORDER_WEIGHT (2.0); interior bigrams receive INTERIOR_WEIGHT (1.0); repeated bigrams accumulate weights via FxHashMap; O(n) allocation. - `extract_query_ngrams`: greedy covering-set over raw trimmed query bytes; selects bigrams maximising uncovered position coverage; capped at 32 bigrams; all weights uniform 1.0. Edge cases handled: empty input, single-byte input, all-whitespace, Unicode multi-byte chars, CJK 3/4-byte sequences, very long text, repeated chars. No unwrap/expect/panic in library code. Also suppress two pre-existing clippy warnings in scaffolding code (`type_complexity` in tree_sitter_fields.rs, `manual_contains` in markdown_fields.rs) to satisfy `cargo clippy -p rskim-search -- -D warnings`. Co-Authored-By: Claude <noreply@anthropic.com>
- scoring.rs: BM25F formula with IDF + per-field boosted TF, length normalization, guards for zero avg_doc_len and zero df - tree_sitter_fields.rs: data-driven FxHashMap classifier for 13 tree-sitter languages; identifier classification only in declaration contexts to avoid indexing every variable reference - serde_fields.rs: JSON (serde_json-based), YAML/TOML (line-scan, no extra deps), Markdown (line-scan, code fences, links, headings) - 3 integration test files: bm25f_scoring (11 tests), field_classification (13 tests), serde_field_classification (27 tests) — all green - clippy -D warnings: clean
…10) - LexicalLayerBuilder: accumulates postings from multiple files via add_file(), dispatching tree-sitter languages through recursive AST walk and serde-based languages through classify_serde_fields(); always includes whole-file fallback posting under FunctionBody to guarantee searchable content for every file - Skips files >5MB or with avg_line_length >500 (minified detection) - Writes lexical.skidx, lexical.skpost, metadata.json atomically on build() - LexicalSearchLayer: minimal struct satisfying SearchIndex/SearchLayer; search() is a Phase-3 stub; open() deserializes metadata.json and mmaps index - 10 integration tests cover empty index, per-language paths, skip heuristics, disk layout, reopen, and metadata roundtrip; all 265 rskim-search tests pass Co-Authored-By: Claude <noreply@anthropic.com>
) Replace Phase 2 stub with full query pipeline in LexicalSearchLayer::search(): - Extract query n-grams via extract_query_ngrams (covering-set strategy) - Lookup posting lists from main index (binary search) + delta (linear scan) - Filter tombstoned doc_ids from main-index results - Accumulate per-document, per-field term frequencies via FxHashMap - Score each document with Bm25Scorer::score_term using per-field boosts - Sort descending by score, apply offset + limit, return Vec<(FileId, f32)> Added index_dir: PathBuf field to LexicalSearchLayer so tombstone and delta files can be located at query time without passing extra parameters. Added 13 integration tests in tests/lexical_query.rs covering: empty/blank queries, fixture-based ranking (UserService, AuthHandler, database_url, replicas), score ordering, limit/offset, positive scores, type-definition boost, and cross-open reproducibility. Co-Authored-By: Claude <noreply@anthropic.com>
…16) Replace the search subcommand stub with a full BM25F lexical search implementation. Walks repo files respecting .gitignore, builds a per-repo index keyed by repo-root hash under SKIM_CACHE_DIR, and auto-builds on first query. Supports --build, --rebuild, --stats, --clear-cache, --limit, --json, and snippet display in text output.
Remove dead code (merge_field_tfs, unix_now, PostingEntry size-check hack), eliminate unnecessary String allocation in YAML classifier, deduplicate cache-dir resolution in search.rs, tighten classifier_cache lookup in builder.rs, and replace language-behavior tests with behavior-asserting ones.
Remove unused test helpers (make_header, make_posting) from index_format.rs unit test module. These were duplicates of the helpers in the integration test file and generated dead_code warnings.
…ests Add 44 tests across three new files that exercise the full Wave 1 lexical search pipeline from real fixture files through to CLI output: - wave1_acceptance.rs (27 tests): end-to-end build→query→rank pipeline; covers all 6 language fixtures, field-boost ordering, pagination, persistence, reopen, unicode, duplicate idempotency, and corruption detection via the SearchIndex trait (not internal modules). - wave1_robustness.rs (10 tests): adversarial builder and query inputs — very long queries, CJK text, special chars, null bytes, empty/whitespace content, concurrent Arc<LexicalSearchLayer> reads, index size bounds, and a 100-query performance check (< 500ms total). - cli_search_adversarial.rs (7 tests): binary-level adversarial coverage — 5000-char queries, special-char queries, corrupt-index error with helpful --rebuild hint (correctly locates hash-keyed cache subdirectory), stats JSON validity, build-then-query, --limit 0, and help flag completeness. Co-Authored-By: Claude <noreply@anthropic.com>
- Split index_format.rs into module with dedicated concerns: delta, entry, reader, writer, tombstones - Restructured search command into: index (setup), output (formatting), mod (coordination) - Reorganized field extraction with dedicated tree_sitter_tables mapping - Updated tests to match refactored module structure (field_classification, index_format, lexical_query, serialization_roundtrip, types_edge_cases, wave1_acceptance, wave1_robustness) - Removed consolidated bm25f_scoring.rs (integrated into query module) - Unified test fixtures for better reusability Improves codebase clarity and maintainability without behavioral changes. Co-Authored-By: Claude <noreply@anthropic.com>
| let count = entry.posting_length as usize; | ||
| let file_count = self.header.file_count; | ||
|
|
||
| let mut result = Vec::with_capacity(count); |
There was a problem hiding this comment.
🔴 BLOCKING: Resource Exhaustion via Uncapped Vec::with_capacity
The read_postings method calls Vec::with_capacity(count) where count comes from untrusted on-disk data. A crafted index file with posting_length = u32::MAX would attempt to allocate ~51GB, causing OOM or panic.
Fix: Cap the capacity hint before allocation:
let max_possible = self.post_mmap.len().saturating_sub(start) / POSTING_ENTRY_SIZE;
let cap = count.min(max_possible);
let mut result = Vec::with_capacity(cap);This is a straightforward bounds check (3 lines) that prevents DoS via crafted index files.
| let entry = PostingEntry { | ||
| doc_id, | ||
| field_id: field.as_u8(), | ||
| position: node.start_byte() as u32, |
There was a problem hiding this comment.
🔴 BLOCKING: Unchecked Truncation of node.start_byte() to u32
node.start_byte() as u32 silently truncates on files near 5MB (the file size limit), potentially corrupting snippet positions. While unlikely given the 5MB cap, this violates Rust safety discipline used elsewhere (see line 179 which uses try_from).
Fix: Use try_from matching the existing pattern:
position: u32::try_from(node.start_byte()).unwrap_or(u32::MAX),The same issue exists at line 146 (range.start as u32). Both should be fixed together for consistency.
|
|
||
| let path_str = rel_path | ||
| .map(|p| p.display().to_string()) | ||
| .unwrap_or_else(|| "<unknown>".to_string()); |
There was a problem hiding this comment.
🔴 BLOCKING: CLI Depends on Concrete Type, Not Trait
output.rs imports and depends on LexicalSearchLayer directly (lines 8, 16, 79, 97, 107), violating Dependency Inversion Principle. This makes it impossible to add Wave 2 temporal or AST layers without rewriting all output functions.
Fix: Change function signatures to accept &dyn SearchIndex:
pub(super) fn print_text_results(
index: &dyn SearchIndex, // Not &LexicalSearchLayer
results: &[(FileId, f32)],
query_text: &str,
repo_root: &Path,
) -> anyhow::Result<()> {
// Use index.file_table() instead of layer.file_table()
}All callsites in search/mod.rs:135 already have LexicalSearchLayer which implements SearchIndex, so this is backward-compatible. The trait infrastructure is ready — just decouple the output layer.
| #[error("Serialization error: {0}")] | ||
| SerializationError(String), | ||
|
|
||
| /// The index file is corrupted or has an incompatible format version. |
There was a problem hiding this comment.
🔴 BLOCKING: Inconsistent Error Variant Naming
SearchError has both IndexError(String) (line 14) and CorruptedIndex { path, reason } (lines 29-36). Both describe index-related failures but with inconsistent shapes:
- Builder (line 676):
IndexError("file_id exceeds u32::MAX") - Reader (line 92):
CorruptedIndex { path, reason }
This creates confusion: which variant should be used? The structured variant with path context is more debuggable.
Fix: Consolidate into a single variant. Either:
- Rename
IndexError→IndexBuildErrorto clarify it's build-time only - Migrate all
IndexErrorcall sites toCorruptedIndexwith structured fields
The structured version is better for error handling and debugging.
Code Review SummaryReview Status: ⛔️ CHANGES_REQUESTED (4 blocking findings) High-Confidence Blocking Issues (≥80% confidence)Created inline comments for these critical findings:
Lower-Confidence Suggestions (60-79%)These are important but less urgent than the 4 blocking issues above: Security (80% confidence, medium priority)
Performance (90% confidence, HIGH priority)
Rust (82-90% confidence)
Complexity (92% confidence, MEDIUM priority)
Consistency (85-92% confidence)
Test Coverage (85-90% confidence)
DeduplicationMultiple reviewers flagged identical issues — consolidated into single inline comments:
Summary by Category
RecommendationDo not merge until the 4 inline-comment blocking issues are resolved. These are straightforward fixes (3-10 lines each) that prevent resource exhaustion, position corruption, and Wave 2 extensibility issues. The lower-confidence findings are important for code quality but can be tracked as tech debt if the blocking issues are fixed. See inline comments for specific fixes. |
Two denial-of-service vectors closed in IndexReader: 1. Vec::with_capacity OOM (issue #1): posting_length from an untrusted .skidx entry could be u32::MAX, triggering a ~51 GiB allocation before the per-entry bounds check fires. Cap count against the maximum entries that could physically exist in the mapped .skpost file. 2. Unbounded .skpost mmap (issue #3): only a divisibility check existed. Add MAX_SKPOST_BYTES = 4_000_000_000 (4 GiB) check in IndexReader::open, covering ~357 M PostingEntry records — well beyond any real repository. Co-Authored-By: Claude <noreply@anthropic.com>
Fields `reader`, `scorer`, and `metadata` were `pub(crate)` but are only accessed within query.rs via `self`. Restrict visibility to private to enforce encapsulation and prevent external coupling to internals. No accessor methods needed — all field access is self-contained. Co-Authored-By: Claude <noreply@anthropic.com>
Change resolve_result, print_text_results, and print_json_results to accept &dyn SearchIndex instead of &LexicalSearchLayer. The output module only needs file_table() and stats() which are on the SearchIndex trait — tying it to the concrete type was an unnecessary dependency on an internal implementation detail. show_stats still imports LexicalSearchLayer to construct it via ::open, which is the correct construction site. Co-Authored-By: Claude <noreply@anthropic.com>
Resolves builder.rs exceeding 400 lines (426 → 354) by extracting the tree-sitter AST walker into a dedicated lexical/walker.rs module. - Add WalkContext struct to encapsulate mutable walk state (source, classifier, doc_id, postings, doc_len), reducing walk_and_classify_inner from 7 params to 3 (ctx, node, depth) - walk_and_classify is pub(crate) in the new module, matching previous visibility - MAX_AST_DEPTH constant moved to walker.rs with its unit test - Fix as-u32 truncation in walker: use u32::try_from().unwrap_or(u32::MAX) for node.start_byte() and ngrams.len() casts (issues #3, #5) - Register walker module as pub(crate) in lexical/mod.rs Co-Authored-By: Claude <noreply@anthropic.com>
serde_fields.rs exceeded the 400-line project guideline at 477 lines, with 170 lines being inline unit tests. Extracted all #[cfg(test)] tests to tests/serde_field_classification.rs and renamed them with a low_level_ prefix to avoid collisions with existing integration tests. Source file is now 302 lines. All 43 tests pass. Co-Authored-By: Claude <noreply@anthropic.com>
DeltaReader::open() had no upper bound on the file it would mmap, unlike tombstones (40 MB) and metadata.json (100 MB) which both reject oversized files with CorruptedIndex. A crafted or runaway delta file could cause unbounded memory mapping. Add MAX_DELTA_BYTES = 100_000_000 (5 M records at 20 bytes each) and return SearchError::CorruptedIndex if the file exceeds it, matching the existing tombstone/metadata pattern. Co-Authored-By: Claude <noreply@anthropic.com>
- Fix unchecked as-u32 truncations in index_serde: range.start and ngrams.len() now use u32::try_from().unwrap_or(u32::MAX) matching the pattern used throughout the codebase - Fix same truncation in the whole-file fallback path for fallback_ngrams.len() - Reserve postings HashMap capacity on first file using content.len()/4 as a conservative ngram-count estimate, reducing rehash overhead for large repositories Co-Authored-By: Claude <noreply@anthropic.com>
… annotations All 9 bare #[must_use] annotations in crates/rskim-search/src now carry reason strings that explain what goes wrong when the return value is discarded, making the lint warnings actionable rather than generic. Co-Authored-By: Claude <noreply@anthropic.com>
DeltaReader::scan() performed a full linear mmap scan per query n-gram, making each search O(n*q) where n is delta record count and q is query n-gram count (up to 32). For a delta with 100k records this meant 3.2 M unnecessary comparisons per query. Build a FxHashMap<u64, Vec<usize>> (hash → byte offsets) in a single pass during open(). scan() now does a hash lookup plus one PostingEntry decode per match: O(1) per n-gram regardless of delta size. Public API (scan signature, open return type) is unchanged so no callers are affected. Co-Authored-By: Claude <noreply@anthropic.com>
…from CorruptedIndex Both IndexError and CorruptedIndex described index failures, creating ambiguity at call sites. IndexBuildError now clearly signals a failure during index construction (e.g. value exceeds u32::MAX), while CorruptedIndex signals a runtime integrity violation when reading an existing index. The #[error] display string is unchanged so existing error message assertions still pass. Updated all 5 sites: error.rs enum, builder.rs, writer.rs, and two test files. Co-Authored-By: Claude <noreply@anthropic.com>
…d delta+tombstone pipeline
- error_conversion.rs: 3 tests for CorruptedIndex Display format (path+reason
interpolation, template prefix) and source() chain (returns None — no #[source])
- index_format.rs: 3 tests for IndexReader::validate() — valid index passes,
empty index passes, and crafted out-of-bounds posting range is detected
- wave1_acceptance.rs: 3 integration tests that exercise delta/tombstone
filtering end-to-end through the query engine:
tombstoned_file_is_excluded_from_search_results
delta_entries_appear_in_search_results
delta_bypasses_tombstone_for_new_entries (verifies delta is not tombstoned)
Co-Authored-By: Claude <noreply@anthropic.com>
Two-pass approach: Pass 1 computes byte offsets with checked arithmetic (no I/O, no large allocation). Pass 2 streams PostingEntry bytes directly to BufWriter<File>. Eliminates the intermediate Vec<u8> that held the entire postings file in memory.
Add lookup_into/read_postings_into on IndexReader and scan_into on DeltaReader — caller-owned buffer variants that clear-and-fill instead of allocating per call. Original lookup/scan delegate to these for backward compatibility. In the query loop, postings_buf and ngram_docs are now declared once before the loop and reused across iterations, eliminating per-ngram Vec and HashMap allocations.
…tep comments - delta.rs: use mmap directly instead of `let data = &mmap[..]` alias - reader.rs: add #[must_use] to lookup_into (matches lookup) - query.rs: align step numbering with module-level pipeline doc
…unit tests - output::show_stats now accepts &dyn SearchIndex (DI) instead of opening the index internally; caller in mod.rs handles open+error - builder.rs: use or_insert_with return value directly instead of separate get() call for classifier cache - Remove minified_threshold_calculation and max_ast_depth_is_reasonable unit tests — constants-only assertions already covered by integration tests (minified_file_is_skipped, index_builder)
Summary
Full lexical search layer for skim — BM25F scoring with n-gram indexing over AST-classified code fields.
.skidx+.skpost) with memory-mapped reader, delta segments, and tombstone trackingskim searchwith auto-build,--rebuild,--stats, snippet display, JSON outputArchitecture
Key metrics
Test plan
cargo test -p rskim-search— 248 tests pass (unit + integration)cargo test -p rskim— 1001 tests pass (1 flaky perf test excluded — process spawn overhead, unrelated)cargo clippy -p rskim-search -p rskim -- -D warnings— clean