feat: Wave 1 — Lexical search with BM25F scoring and n-gram index by leesharon · Pull Request #120 · dean0x/skim

leesharon · 2026-04-05T16:20:31Z

Summary

Full lexical search layer for skim — BM25F scoring with n-gram indexing over AST-classified code fields.

N-gram extraction with border weighting for better boundary-aware matching
BM25F scoring with field-weighted classifiers (type definitions, function signatures, symbol names, etc.)
Index builder with tree-sitter AST field classification + serde-based classification for JSON/YAML/TOML/Markdown
Binary index format (.skidx + .skpost) with memory-mapped reader, delta segments, and tombstone tracking
CLI integration — skim search with auto-build, --rebuild, --stats, snippet display, JSON output
Performance — O(1) delta scans via FxHashMap, two-pass streaming writer (no in-memory buffer), reusable query buffers
Hardened — size guards on all mmap'd files, checked arithmetic, capacity caps against crafted indexes

Architecture

rskim-search/src/lexical/
├── ngram.rs          — N-gram extraction + border weighting
├── scoring.rs        — BM25F scorer with field boosts
├── builder.rs        — Index builder (AST + serde field classification)
├── walker.rs         — AST cursor walker for field classification
├── query.rs          — SearchLayer impl — query engine
└── index_format/
    ├── writer.rs     — Two-pass atomic write (.skidx + .skpost)
    ├── reader.rs     — Mmap'd binary search reader
    ├── delta.rs      — Append-only delta segment
    ├── tombstones.rs — Deleted doc_id tracking
    └── entry.rs      — Binary IndexEntry format

Key metrics

55 files changed, ~10,580 lines added
319 tests (unit + integration + acceptance + robustness + adversarial)
0 tech debt items remaining (PF-001 through PF-009 all resolved)
All clippy warnings clean

Test plan

Introduce the rskim-search crate as the search and indexing foundation for the 3-layer code search system. All types and traits are I/O-free and follow the rskim-core patterns exactly. - crates/rskim-search: new crate with Cargo.toml, lib.rs, types.rs, traits.rs - types.rs: FileId, SearchField, MatchSpan, LineRange, TemporalFlags, SearchQuery (builder pattern), SearchResult, IndexStats, FileTable (I/O-free normalize), SearchError (thiserror), Result<T> alias + 10 unit tests covering all types - traits.rs: SearchLayer, LayerBuilder, FieldClassifier trait definitions with full doc comments - CLI: add `skim search` stub subcommand with full clap arg definitions (--ast, --limit, --blast-radius, --hot, --cold, --risky, --build, etc.) - Workspace: add rskim-search member, wire into rskim dependency - main.rs: register --ast and --limit as value-consuming flags - completions.rs: register search::command() for accurate tab completions - cli_search.rs: 4 integration tests (help, -h, stub with query, stub empty)

serde_json was declared as a dependency but never imported or used anywhere in the crate source. Removes unnecessary build-time overhead.

- Add #![allow(clippy::unwrap_used, clippy::expect_used, clippy::panic)] to error_conversion.rs and serialization_roundtrip.rs so crate-level deny lints do not break cargo clippy --tests - Replace 3.14 score literal with 2.75 to avoid clippy::approx_constant

Add 87 new tests across 6 test files (5 new + 1 extended): - error_conversion.rs: 89 lines (error type conversions) - file_table_edge_cases.rs: 180 lines (new) (FileTable corner cases) - serialization_roundtrip.rs: 273 lines (serde round-trip integrity) - trait_object_safety.rs: 68 lines (new) (trait object handling) - types_edge_cases.rs: 270 lines (new) (SearchQuery/types validation) - cli_search.rs: extended with exit code, help text validation Implementation changes supporting tests: - FileTable custom Serialize/Deserialize (emit paths vec only) - SearchQuery #[must_use] attribute - serde_json added to rskim-search dev-dependencies - Cargo.lock updated with rskim-search dependency in rskim crate All 1700+ workspace tests passing. Co-Authored-By: Claude <noreply@anthropic.com>

…t comments, doc drift - search.rs: replace manual print_help() body with command().print_help() so help text auto-tracks clap definition without drift (DRY violation fixed) - rskim/Cargo.toml: add INTENTIONAL comment on rskim-search dep to document it is scaffolding wired up in Wave 1, not dead weight - rskim-search/Cargo.toml: add readme/keywords/categories to match rskim-core metadata pattern; add explanatory comments to [lints.clippy] section - main.rs: update is_flag_with_value() docstring to explicitly cover both Args struct flags and subcommand-only flags (--ast, --limit, etc.) Co-Authored-By: Claude <noreply@anthropic.com>

…asts, fast-path Five issues from batch-1 audit of types.rs: - Deserialization now runs normalize() on each path before building the ids map, so a table serialized with ./src/main.rs round-trips idempotently (re-registering the same unnormalized path after deser returns the existing FileId rather than creating a duplicate). - Replace silent `as usize` / `as u64` casts in lookup() with usize::try_from(...).ok()? so truncation becomes a None return rather than wrong-file access. Add compile_error! to reject 32-bit targets where usize < u64 would make the casts lossy. - Swap clone order in register(): ids takes the clone, paths takes the move — avoids one unnecessary PathBuf allocation per new registration. - Fast-path in normalize(): skip Vec<Component> allocation entirely when the path contains no CurDir or ParentDir components (the common case). - Add two serialization_roundtrip tests covering the deser-normalize invariant: unnormalized-paths-idempotent-after-roundtrip and serialized-form-uses-normalized-paths. Co-Authored-By: Claude <noreply@anthropic.com>

…build/update stub coverage - file_table_edge_cases: add test_normalize_absolute_over_root documenting that /a/../../b produces /../b (I/O-free normalizer cannot clamp at root) - cli_search: add regression tests for --build, --rebuild, --update flags confirming each routes through the not-yet-implemented stub path Co-Authored-By: Claude <noreply@anthropic.com>

types.rs was 577 lines, exceeding the 400-line project convention. Split into four focused modules under types/: - file_table.rs — FileId, FileTable, custom serde impls, normalize(), 64-bit compile_error gate - query.rs — SearchQuery (#[must_use]), TemporalFlags, SearchField - result.rs — SearchResult, MatchSpan, LineRange, IndexStats - error.rs — SearchError, Result alias mod.rs barrel re-exports all types so lib.rs pub use is unchanged. Inline unit tests moved into their respective sub-modules. All 91 tests pass; clippy --workspace -D warnings clean. Co-Authored-By: Claude <noreply@anthropic.com>

command().print_help() renders the command name from clap's Command::new("search"), which omits the parent binary. Override with .name("skim search") so help output shows "Usage: skim search".

Simplifier reverted to manual println pattern; restoring clap delegation since eliminating help text duplication was the explicit review finding (flagged by 4 reviewers). Uses let _ = to handle io::Result.

…xHashMap - Add register_within(path, root) that rejects paths escaping project root (leading .. or absolute after normalization) - Add MAX_FILE_TABLE_ENTRIES (10M) size limit on FileTable deserialization to prevent OOM from malicious/corrupted index files - Switch FileTable from HashMap to FxHashMap (rustc-hash) for faster path lookups — zero-cost change, already a transitive dependency - 7 new tests covering all three fixes

…m-search from rskim - Move rustc-hash = "1.1" from rskim-search inline declaration into [workspace.dependencies] so all crates follow the same pattern. - Remove the rskim-search dependency from the rskim CLI crate; no code in rskim imports it (search.rs is a pure stub). The dep will be re-added when Wave 1 wires up real search logic. Fixes batch-2 issues: cargo:rustc_hash_workspace, cargo:unused_rskim_search_dep Co-Authored-By: Claude <noreply@anthropic.com>

…ble normalization - register_within now verifies containment by joining root+path, re-normalizing, and checking starts_with(normalized_root); previously _root was unused so any relative path passed regardless of root - Extract register_normalized(PathBuf) private method so both register() and register_within() normalize exactly once instead of twice - Deserialization now returns Err on duplicate normalized paths (e.g., ["./src/main.rs","src/main.rs"]) rather than silently overwriting the ids map and breaking the paths.len() == ids.len() invariant - Add test_file_table_deserialization_rejects_at_limit_plus_one covering the actual rejection branch of the 10M-entry size limit (was previously untested) - Add test_file_table_deserialization_rejects_duplicate_normalized_paths - Add register_within tests: single-parent escape, subdirectory escape, root-itself acceptance; document that relative paths without ".." do resolve inside root when joined Co-Authored-By: Claude <noreply@anthropic.com>

…cture - Add SearchIndex super-trait (extends SearchLayer with file_table, stats, mtimes) - Change LayerBuilder::build() return type to Box<dyn SearchIndex> - Add CorruptedIndex error variant to SearchError - Add SearchField::as_u8/from_u8/all() for on-disk serialization - Define shared lexical types: Ngram, PostingEntry, IndexHeader, Bm25Params, IndexMetadata - Create lexical/ module (ngram, index_format, scoring, builder, query stubs) - Create fields/ module (tree_sitter_fields, serde_fields, markdown_fields stubs) - Add memmap2 workspace dependency, serde_json to rskim-search deps - Add search test fixtures (TypeScript, Rust, JSON, YAML, Markdown, Python) - Update trait_object_safety tests for SearchIndex

Implement `extract_ngrams` (index-time) and `extract_query_ngrams` (query-time) in `crates/rskim-search/src/lexical/ngram.rs`. - `extract_ngrams`: sliding 2-byte window over whitespace-delimited tokens; first and last bigram of each token receive BORDER_WEIGHT (2.0); interior bigrams receive INTERIOR_WEIGHT (1.0); repeated bigrams accumulate weights via FxHashMap. - `extract_query_ngrams`: greedy covering-set over raw trimmed query bytes; selects bigrams maximising uncovered position coverage; capped at 32 bigrams; all weights uniform 1.0. Edge cases handled: empty input, single-byte input, all-whitespace, Unicode multi-byte chars, CJK 3/4-byte sequences, very long text, repeated chars. Also suppress two pre-existing clippy warnings in scaffolding code (`type_complexity` in tree_sitter_fields.rs, `manual_contains` in markdown_fields.rs) so `cargo clippy -p rskim-search -- -D warnings` passes cleanly. Co-Authored-By: Claude <noreply@anthropic.com>

Implement `extract_ngrams` (index-time) and `extract_query_ngrams` (query-time) in `crates/rskim-search/src/lexical/ngram.rs`. - `extract_ngrams`: sliding 2-byte window over whitespace-delimited tokens; first and last bigram of each token receive BORDER_WEIGHT (2.0); interior bigrams receive INTERIOR_WEIGHT (1.0); repeated bigrams accumulate weights via FxHashMap; O(n) allocation. - `extract_query_ngrams`: greedy covering-set over raw trimmed query bytes; selects bigrams maximising uncovered position coverage; capped at 32 bigrams; all weights uniform 1.0. Edge cases handled: empty input, single-byte input, all-whitespace, Unicode multi-byte chars, CJK 3/4-byte sequences, very long text, repeated chars. No unwrap/expect/panic in library code. Also suppress two pre-existing clippy warnings in scaffolding code (`type_complexity` in tree_sitter_fields.rs, `manual_contains` in markdown_fields.rs) to satisfy `cargo clippy -p rskim-search -- -D warnings`. Co-Authored-By: Claude <noreply@anthropic.com>

- scoring.rs: BM25F formula with IDF + per-field boosted TF, length normalization, guards for zero avg_doc_len and zero df - tree_sitter_fields.rs: data-driven FxHashMap classifier for 13 tree-sitter languages; identifier classification only in declaration contexts to avoid indexing every variable reference - serde_fields.rs: JSON (serde_json-based), YAML/TOML (line-scan, no extra deps), Markdown (line-scan, code fences, links, headings) - 3 integration test files: bm25f_scoring (11 tests), field_classification (13 tests), serde_field_classification (27 tests) — all green - clippy -D warnings: clean

…10) - LexicalLayerBuilder: accumulates postings from multiple files via add_file(), dispatching tree-sitter languages through recursive AST walk and serde-based languages through classify_serde_fields(); always includes whole-file fallback posting under FunctionBody to guarantee searchable content for every file - Skips files >5MB or with avg_line_length >500 (minified detection) - Writes lexical.skidx, lexical.skpost, metadata.json atomically on build() - LexicalSearchLayer: minimal struct satisfying SearchIndex/SearchLayer; search() is a Phase-3 stub; open() deserializes metadata.json and mmaps index - 10 integration tests cover empty index, per-language paths, skip heuristics, disk layout, reopen, and metadata roundtrip; all 265 rskim-search tests pass Co-Authored-By: Claude <noreply@anthropic.com>

) Replace Phase 2 stub with full query pipeline in LexicalSearchLayer::search(): - Extract query n-grams via extract_query_ngrams (covering-set strategy) - Lookup posting lists from main index (binary search) + delta (linear scan) - Filter tombstoned doc_ids from main-index results - Accumulate per-document, per-field term frequencies via FxHashMap - Score each document with Bm25Scorer::score_term using per-field boosts - Sort descending by score, apply offset + limit, return Vec<(FileId, f32)> Added index_dir: PathBuf field to LexicalSearchLayer so tombstone and delta files can be located at query time without passing extra parameters. Added 13 integration tests in tests/lexical_query.rs covering: empty/blank queries, fixture-based ranking (UserService, AuthHandler, database_url, replicas), score ordering, limit/offset, positive scores, type-definition boost, and cross-open reproducibility. Co-Authored-By: Claude <noreply@anthropic.com>

…16) Replace the search subcommand stub with a full BM25F lexical search implementation. Walks repo files respecting .gitignore, builds a per-repo index keyed by repo-root hash under SKIM_CACHE_DIR, and auto-builds on first query. Supports --build, --rebuild, --stats, --clear-cache, --limit, --json, and snippet display in text output.

Remove dead code (merge_field_tfs, unix_now, PostingEntry size-check hack), eliminate unnecessary String allocation in YAML classifier, deduplicate cache-dir resolution in search.rs, tighten classifier_cache lookup in builder.rs, and replace language-behavior tests with behavior-asserting ones.

Remove unused test helpers (make_header, make_posting) from index_format.rs unit test module. These were duplicates of the helpers in the integration test file and generated dead_code warnings.

…ests Add 44 tests across three new files that exercise the full Wave 1 lexical search pipeline from real fixture files through to CLI output: - wave1_acceptance.rs (27 tests): end-to-end build→query→rank pipeline; covers all 6 language fixtures, field-boost ordering, pagination, persistence, reopen, unicode, duplicate idempotency, and corruption detection via the SearchIndex trait (not internal modules). - wave1_robustness.rs (10 tests): adversarial builder and query inputs — very long queries, CJK text, special chars, null bytes, empty/whitespace content, concurrent Arc<LexicalSearchLayer> reads, index size bounds, and a 100-query performance check (< 500ms total). - cli_search_adversarial.rs (7 tests): binary-level adversarial coverage — 5000-char queries, special-char queries, corrupt-index error with helpful --rebuild hint (correctly locates hash-keyed cache subdirectory), stats JSON validity, build-then-query, --limit 0, and help flag completeness. Co-Authored-By: Claude <noreply@anthropic.com>

- Split index_format.rs into module with dedicated concerns: delta, entry, reader, writer, tombstones - Restructured search command into: index (setup), output (formatting), mod (coordination) - Reorganized field extraction with dedicated tree_sitter_tables mapping - Updated tests to match refactored module structure (field_classification, index_format, lexical_query, serialization_roundtrip, types_edge_cases, wave1_acceptance, wave1_robustness) - Removed consolidated bm25f_scoring.rs (integrated into query module) - Unified test fixtures for better reusability Improves codebase clarity and maintainability without behavioral changes. Co-Authored-By: Claude <noreply@anthropic.com>

leesharon · 2026-04-05T18:38:53Z

+        let count = entry.posting_length as usize;
+        let file_count = self.header.file_count;
+
+        let mut result = Vec::with_capacity(count);


🔴 BLOCKING: Resource Exhaustion via Uncapped Vec::with_capacity

The read_postings method calls Vec::with_capacity(count) where count comes from untrusted on-disk data. A crafted index file with posting_length = u32::MAX would attempt to allocate ~51GB, causing OOM or panic.

Fix: Cap the capacity hint before allocation:

let max_possible = self.post_mmap.len().saturating_sub(start) / POSTING_ENTRY_SIZE; let cap = count.min(max_possible); let mut result = Vec::with_capacity(cap);

This is a straightforward bounds check (3 lines) that prevents DoS via crafted index files.

leesharon · 2026-04-05T18:39:01Z

+                let entry = PostingEntry {
+                    doc_id,
+                    field_id: field.as_u8(),
+                    position: node.start_byte() as u32,


🔴 BLOCKING: Unchecked Truncation of node.start_byte() to u32

node.start_byte() as u32 silently truncates on files near 5MB (the file size limit), potentially corrupting snippet positions. While unlikely given the 5MB cap, this violates Rust safety discipline used elsewhere (see line 179 which uses try_from).

Fix: Use try_from matching the existing pattern:

position: u32::try_from(node.start_byte()).unwrap_or(u32::MAX),

The same issue exists at line 146 (range.start as u32). Both should be fixed together for consistency.

leesharon · 2026-04-05T18:39:11Z

+
+    let path_str = rel_path
+        .map(|p| p.display().to_string())
+        .unwrap_or_else(|| "<unknown>".to_string());


🔴 BLOCKING: CLI Depends on Concrete Type, Not Trait

output.rs imports and depends on LexicalSearchLayer directly (lines 8, 16, 79, 97, 107), violating Dependency Inversion Principle. This makes it impossible to add Wave 2 temporal or AST layers without rewriting all output functions.

Fix: Change function signatures to accept &dyn SearchIndex:

pub(super) fn print_text_results( index: &dyn SearchIndex, // Not &LexicalSearchLayer results: &[(FileId, f32)], query_text: &str, repo_root: &Path, ) -> anyhow::Result<()> { // Use index.file_table() instead of layer.file_table() }

All callsites in search/mod.rs:135 already have LexicalSearchLayer which implements SearchIndex, so this is backward-compatible. The trait infrastructure is ready — just decouple the output layer.

leesharon · 2026-04-05T18:39:22Z

+    #[error("Serialization error: {0}")]
+    SerializationError(String),
+
+    /// The index file is corrupted or has an incompatible format version.


🔴 BLOCKING: Inconsistent Error Variant Naming

SearchError has both IndexError(String) (line 14) and CorruptedIndex { path, reason } (lines 29-36). Both describe index-related failures but with inconsistent shapes:

Builder (line 676): IndexError("file_id exceeds u32::MAX")

Reader (line 92): CorruptedIndex { path, reason }

This creates confusion: which variant should be used? The structured variant with path context is more debuggable.

Fix: Consolidate into a single variant. Either:

Rename IndexError → IndexBuildError to clarify it's build-time only

Migrate all IndexError call sites to CorruptedIndex with structured fields

The structured version is better for error handling and debugging.

leesharon · 2026-04-05T18:39:57Z

Code Review Summary

Review Status: ⛔️ CHANGES_REQUESTED (4 blocking findings)

High-Confidence Blocking Issues (≥80% confidence)

Created inline comments for these critical findings:

Security: Uncapped Vec::with_capacity in read_postings - Resource exhaustion via crafted index (90% confidence) — reader.rs:248
Rust: Unchecked as u32 truncation of node.start_byte() - Position corruption on large files (90% confidence) — builder.rs:335
Architecture: CLI depends on concrete LexicalSearchLayer, not trait - Breaks Wave 2 extensibility (90% confidence) — output.rs:79
Consistency: IndexError vs CorruptedIndex overlap - Confusing error handling (85% confidence) — error.rs:29

Lower-Confidence Suggestions (60-79%)

These are important but less urgent than the 4 blocking issues above:

Security (80% confidence, medium priority)

Missing size limit on delta file (200MB cap recommended) — delta.rs:100
TOCTOU race in metadata size guard (read-first pattern) — query.rs:70
Missing size limit on .skpost file (4GB cap recommended) — reader.rs:87
--limit flag accepts zero and usize::MAX — Clamp to 1-10,000 — search/mod.rs:38

Performance (90% confidence, HIGH priority)

Delta linear scan O(n) per query — Use binary search or in-memory hash index — delta.rs:111
Vec allocation per posting lookup (32 allocations per query) — Return iterator or reuse buffer — reader.rs:242
FxHashMap grows unbounded in builder — Pre-allocate with estimate — builder.rs:44
Redundant n-gram extraction for nested nodes — Track classified byte ranges — builder.rs:282

Rust (82-90% confidence)

Unchecked as u32 of range.start — Use try_from — builder.rs:146
Unchecked as u32 of ngrams.len() — Use try_from — builder.rs:151, 219, 340
IndexStats::clone() in accessor — Return reference instead — query.rs:216
Missing #[must_use] on Result-returning functions — Add to all public fallible functions

Complexity (92% confidence, MEDIUM priority)

serde_fields.rs exceeds 400 lines (477) — Extract tests to existing integration file
builder.rs exceeds 400 lines (426) — Extract walker to lexical/walker.rs
walk_and_classify_inner has 7 parameters — Encapsulate mutable state in struct
Moderate cyclomatic complexity in search/mod.rs:run() — Extract unimplemented flag checks

Consistency (85-92% confidence)

Inconsistent #[must_use] annotations — Standardize on reason strings
Inconsistent test naming (test_ prefix vs descriptive) — Pick one convention
Inconsistent section headers (// Unit Tests vs // Unit tests) — Standardize on capital T

Test Coverage (85-90% confidence)

No test for CorruptedIndex error variant
No test for IndexReader::validate() method
No test for MAX_METADATA_BYTES guard
No test for unimplemented CLI flags
Scoring edge cases (doc_count == 0, df > doc_count) not tested
Tombstone + search integration not end-to-end tested
Delta reader + search integration not end-to-end tested

Deduplication

Multiple reviewers flagged identical issues — consolidated into single inline comments:

as u32 truncation (security + rust reviewers)
IndexError vs CorruptedIndex (consistency + rust reviewers)
LexicalSearchLayer field visibility (architecture + rust reviewers)

Summary by Category

Finding	Blocking	Should-Fix	Lower-Confidence
Security	2	0	4
Architecture	1	2	1
Performance	2	2	0
Rust	1	4	4
Complexity	3	1	0
Consistency	2	2	3
Tests	3	2	0
Regression	0	0	1
Total	14	13	13

Recommendation

Do not merge until the 4 inline-comment blocking issues are resolved. These are straightforward fixes (3-10 lines each) that prevent resource exhaustion, position corruption, and Wave 2 extensibility issues. The lower-confidence findings are important for code quality but can be tracked as tech debt if the blocking issues are fixed.

See inline comments for specific fixes.

Two denial-of-service vectors closed in IndexReader: 1. Vec::with_capacity OOM (issue #1): posting_length from an untrusted .skidx entry could be u32::MAX, triggering a ~51 GiB allocation before the per-entry bounds check fires. Cap count against the maximum entries that could physically exist in the mapped .skpost file. 2. Unbounded .skpost mmap (issue #3): only a divisibility check existed. Add MAX_SKPOST_BYTES = 4_000_000_000 (4 GiB) check in IndexReader::open, covering ~357 M PostingEntry records — well beyond any real repository. Co-Authored-By: Claude <noreply@anthropic.com>

Fields `reader`, `scorer`, and `metadata` were `pub(crate)` but are only accessed within query.rs via `self`. Restrict visibility to private to enforce encapsulation and prevent external coupling to internals. No accessor methods needed — all field access is self-contained. Co-Authored-By: Claude <noreply@anthropic.com>

Change resolve_result, print_text_results, and print_json_results to accept &dyn SearchIndex instead of &LexicalSearchLayer. The output module only needs file_table() and stats() which are on the SearchIndex trait — tying it to the concrete type was an unnecessary dependency on an internal implementation detail. show_stats still imports LexicalSearchLayer to construct it via ::open, which is the correct construction site. Co-Authored-By: Claude <noreply@anthropic.com>

Resolves builder.rs exceeding 400 lines (426 → 354) by extracting the tree-sitter AST walker into a dedicated lexical/walker.rs module. - Add WalkContext struct to encapsulate mutable walk state (source, classifier, doc_id, postings, doc_len), reducing walk_and_classify_inner from 7 params to 3 (ctx, node, depth) - walk_and_classify is pub(crate) in the new module, matching previous visibility - MAX_AST_DEPTH constant moved to walker.rs with its unit test - Fix as-u32 truncation in walker: use u32::try_from().unwrap_or(u32::MAX) for node.start_byte() and ngrams.len() casts (issues #3, #5) - Register walker module as pub(crate) in lexical/mod.rs Co-Authored-By: Claude <noreply@anthropic.com>

serde_fields.rs exceeded the 400-line project guideline at 477 lines, with 170 lines being inline unit tests. Extracted all #[cfg(test)] tests to tests/serde_field_classification.rs and renamed them with a low_level_ prefix to avoid collisions with existing integration tests. Source file is now 302 lines. All 43 tests pass. Co-Authored-By: Claude <noreply@anthropic.com>

DeltaReader::open() had no upper bound on the file it would mmap, unlike tombstones (40 MB) and metadata.json (100 MB) which both reject oversized files with CorruptedIndex. A crafted or runaway delta file could cause unbounded memory mapping. Add MAX_DELTA_BYTES = 100_000_000 (5 M records at 20 bytes each) and return SearchError::CorruptedIndex if the file exceeds it, matching the existing tombstone/metadata pattern. Co-Authored-By: Claude <noreply@anthropic.com>

- Fix unchecked as-u32 truncations in index_serde: range.start and ngrams.len() now use u32::try_from().unwrap_or(u32::MAX) matching the pattern used throughout the codebase - Fix same truncation in the whole-file fallback path for fallback_ngrams.len() - Reserve postings HashMap capacity on first file using content.len()/4 as a conservative ngram-count estimate, reducing rehash overhead for large repositories Co-Authored-By: Claude <noreply@anthropic.com>

… annotations All 9 bare #[must_use] annotations in crates/rskim-search/src now carry reason strings that explain what goes wrong when the return value is discarded, making the lint warnings actionable rather than generic. Co-Authored-By: Claude <noreply@anthropic.com>

DeltaReader::scan() performed a full linear mmap scan per query n-gram, making each search O(n*q) where n is delta record count and q is query n-gram count (up to 32). For a delta with 100k records this meant 3.2 M unnecessary comparisons per query. Build a FxHashMap<u64, Vec<usize>> (hash → byte offsets) in a single pass during open(). scan() now does a hash lookup plus one PostingEntry decode per match: O(1) per n-gram regardless of delta size. Public API (scan signature, open return type) is unchanged so no callers are affected. Co-Authored-By: Claude <noreply@anthropic.com>

…from CorruptedIndex Both IndexError and CorruptedIndex described index failures, creating ambiguity at call sites. IndexBuildError now clearly signals a failure during index construction (e.g. value exceeds u32::MAX), while CorruptedIndex signals a runtime integrity violation when reading an existing index. The #[error] display string is unchanged so existing error message assertions still pass. Updated all 5 sites: error.rs enum, builder.rs, writer.rs, and two test files. Co-Authored-By: Claude <noreply@anthropic.com>

…d delta+tombstone pipeline - error_conversion.rs: 3 tests for CorruptedIndex Display format (path+reason interpolation, template prefix) and source() chain (returns None — no #[source]) - index_format.rs: 3 tests for IndexReader::validate() — valid index passes, empty index passes, and crafted out-of-bounds posting range is detected - wave1_acceptance.rs: 3 integration tests that exercise delta/tombstone filtering end-to-end through the query engine: tombstoned_file_is_excluded_from_search_results delta_entries_appear_in_search_results delta_bypasses_tombstone_for_new_entries (verifies delta is not tombstoned) Co-Authored-By: Claude <noreply@anthropic.com>

Two-pass approach: Pass 1 computes byte offsets with checked arithmetic (no I/O, no large allocation). Pass 2 streams PostingEntry bytes directly to BufWriter<File>. Eliminates the intermediate Vec<u8> that held the entire postings file in memory.

Add lookup_into/read_postings_into on IndexReader and scan_into on DeltaReader — caller-owned buffer variants that clear-and-fill instead of allocating per call. Original lookup/scan delegate to these for backward compatibility. In the query loop, postings_buf and ngram_docs are now declared once before the loop and reused across iterations, eliminating per-ngram Vec and HashMap allocations.

…tep comments - delta.rs: use mmap directly instead of `let data = &mmap[..]` alias - reader.rs: add #[must_use] to lookup_into (matches lookup) - query.rs: align step numbering with module-level pipeline doc

…unit tests - output::show_stats now accepts &dyn SearchIndex (DI) instead of opening the index internally; caller in mod.rs handles open+error - builder.rs: use or_insert_with return value directly instead of separate get() call for classifier cache - Remove minified_threshold_calculation and max_ast_depth_is_reasonable unit tests — constants-only assertions already covered by integration tests (minified_file_is_skipped, index_builder)

leesharon and others added 25 commits April 4, 2026 15:37

fix: remove unused serde_json dependency from rskim-search

15427b9

serde_json was declared as a dependency but never imported or used anywhere in the crate source. Removes unnecessary build-time overhead.

fix: set clap display name to 'skim search' for correct help output

92dc7d0

command().print_help() renders the command name from clap's Command::new("search"), which omits the parent binary. Override with .name("skim search") so help output shows "Usage: skim search".

fix: restore clap-delegated print_help (DRY over pattern conformity)

b8b334e

Simplifier reverted to manual println pattern; restoring clap delegation since eliminating help text duplication was the explicit review finding (flagged by 4 reviewers). Uses let _ = to handle io::Result.

chore: bump rskim-search to 2.1.0 to align with upstream release

c929a30

fix: address scrutinizer findings in Wave 1

ea1f29f

Remove unused test helpers (make_header, make_posting) from index_format.rs unit test module. These were duplicates of the helpers in the integration test file and generated dead_code warnings.

leesharon requested a review from dean0x as a code owner April 5, 2026 16:20

leesharon commented Apr 5, 2026

View reviewed changes

leesharon and others added 15 commits April 6, 2026 09:27

style(search): remove redundant slice aliases, add #[must_use], fix s…

276afae

…tep comments - delta.rs: use mmap directly instead of `let data = &mmap[..]` alias - reader.rs: add #[must_use] to lookup_into (matches lookup) - query.rs: align step numbering with module-level pipeline doc

dean0x closed this May 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Wave 1 — Lexical search with BM25F scoring and n-gram index#120

feat: Wave 1 — Lexical search with BM25F scoring and n-gram index#120
leesharon wants to merge 40 commits into
dean0x:mainfrom
leesharon:wave-1

leesharon commented Apr 5, 2026 •

edited

Loading

Uh oh!

leesharon Apr 5, 2026

Uh oh!

leesharon Apr 5, 2026

Uh oh!

leesharon Apr 5, 2026

Uh oh!

leesharon Apr 5, 2026

Uh oh!

leesharon commented Apr 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

leesharon commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Key metrics

Test plan

Uh oh!

leesharon Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

leesharon Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

leesharon Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

leesharon Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

leesharon commented Apr 5, 2026

Code Review Summary

High-Confidence Blocking Issues (≥80% confidence)

Lower-Confidence Suggestions (60-79%)

Security (80% confidence, medium priority)

Performance (90% confidence, HIGH priority)

Rust (82-90% confidence)

Complexity (92% confidence, MEDIUM priority)

Consistency (85-92% confidence)

Test Coverage (85-90% confidence)

Deduplication

Summary by Category

Recommendation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

leesharon commented Apr 5, 2026 •

edited

Loading