Skip to content

feat: Wave 1 — Lexical search with BM25F scoring and n-gram index#120

Closed
leesharon wants to merge 40 commits into
dean0x:mainfrom
leesharon:wave-1
Closed

feat: Wave 1 — Lexical search with BM25F scoring and n-gram index#120
leesharon wants to merge 40 commits into
dean0x:mainfrom
leesharon:wave-1

Conversation

@leesharon
Copy link
Copy Markdown

@leesharon leesharon commented Apr 5, 2026

Summary

Full lexical search layer for skim — BM25F scoring with n-gram indexing over AST-classified code fields.

  • N-gram extraction with border weighting for better boundary-aware matching
  • BM25F scoring with field-weighted classifiers (type definitions, function signatures, symbol names, etc.)
  • Index builder with tree-sitter AST field classification + serde-based classification for JSON/YAML/TOML/Markdown
  • Binary index format (.skidx + .skpost) with memory-mapped reader, delta segments, and tombstone tracking
  • CLI integrationskim search with auto-build, --rebuild, --stats, snippet display, JSON output
  • Performance — O(1) delta scans via FxHashMap, two-pass streaming writer (no in-memory buffer), reusable query buffers
  • Hardened — size guards on all mmap'd files, checked arithmetic, capacity caps against crafted indexes

Architecture

rskim-search/src/lexical/
├── ngram.rs          — N-gram extraction + border weighting
├── scoring.rs        — BM25F scorer with field boosts
├── builder.rs        — Index builder (AST + serde field classification)
├── walker.rs         — AST cursor walker for field classification
├── query.rs          — SearchLayer impl — query engine
└── index_format/
    ├── writer.rs     — Two-pass atomic write (.skidx + .skpost)
    ├── reader.rs     — Mmap'd binary search reader
    ├── delta.rs      — Append-only delta segment
    ├── tombstones.rs — Deleted doc_id tracking
    └── entry.rs      — Binary IndexEntry format

Key metrics

  • 55 files changed, ~10,580 lines added
  • 319 tests (unit + integration + acceptance + robustness + adversarial)
  • 0 tech debt items remaining (PF-001 through PF-009 all resolved)
  • All clippy warnings clean

Test plan

  • cargo test -p rskim-search — 248 tests pass (unit + integration)
  • cargo test -p rskim — 1001 tests pass (1 flaky perf test excluded — process spawn overhead, unrelated)
  • cargo clippy -p rskim-search -p rskim -- -D warnings — clean
  • Index write/read roundtrip tests (empty, single, multiple, large posting lists)
  • Delta writer/reader roundtrip + scan filtering
  • Tombstone persistence and binary search
  • Full search pipeline via SearchLayer::search (BM25F scoring, delta merge, tombstone exclusion)
  • Concurrent search safety (Arc + thread::scope)
  • Corrupted index detection (bad magic, wrong version, truncated files, out-of-bounds postings)
  • Adversarial inputs (null bytes, very long queries, unicode, whitespace-only)

leesharon and others added 25 commits April 4, 2026 15:37
Introduce the rskim-search crate as the search and indexing foundation
for the 3-layer code search system. All types and traits are I/O-free
and follow the rskim-core patterns exactly.

- crates/rskim-search: new crate with Cargo.toml, lib.rs, types.rs, traits.rs
- types.rs: FileId, SearchField, MatchSpan, LineRange, TemporalFlags,
  SearchQuery (builder pattern), SearchResult, IndexStats, FileTable
  (I/O-free normalize), SearchError (thiserror), Result<T> alias
  + 10 unit tests covering all types
- traits.rs: SearchLayer, LayerBuilder, FieldClassifier trait definitions
  with full doc comments
- CLI: add `skim search` stub subcommand with full clap arg definitions
  (--ast, --limit, --blast-radius, --hot, --cold, --risky, --build, etc.)
- Workspace: add rskim-search member, wire into rskim dependency
- main.rs: register --ast and --limit as value-consuming flags
- completions.rs: register search::command() for accurate tab completions
- cli_search.rs: 4 integration tests (help, -h, stub with query, stub empty)
serde_json was declared as a dependency but never imported or used
anywhere in the crate source. Removes unnecessary build-time overhead.
- Add #![allow(clippy::unwrap_used, clippy::expect_used, clippy::panic)]
  to error_conversion.rs and serialization_roundtrip.rs so crate-level
  deny lints do not break cargo clippy --tests
- Replace 3.14 score literal with 2.75 to avoid clippy::approx_constant
Add 87 new tests across 6 test files (5 new + 1 extended):
- error_conversion.rs: 89 lines (error type conversions)
- file_table_edge_cases.rs: 180 lines (new) (FileTable corner cases)
- serialization_roundtrip.rs: 273 lines (serde round-trip integrity)
- trait_object_safety.rs: 68 lines (new) (trait object handling)
- types_edge_cases.rs: 270 lines (new) (SearchQuery/types validation)
- cli_search.rs: extended with exit code, help text validation

Implementation changes supporting tests:
- FileTable custom Serialize/Deserialize (emit paths vec only)
- SearchQuery #[must_use] attribute
- serde_json added to rskim-search dev-dependencies
- Cargo.lock updated with rskim-search dependency in rskim crate

All 1700+ workspace tests passing.

Co-Authored-By: Claude <noreply@anthropic.com>
…t comments, doc drift

- search.rs: replace manual print_help() body with command().print_help()
  so help text auto-tracks clap definition without drift (DRY violation fixed)
- rskim/Cargo.toml: add INTENTIONAL comment on rskim-search dep to document
  it is scaffolding wired up in Wave 1, not dead weight
- rskim-search/Cargo.toml: add readme/keywords/categories to match rskim-core
  metadata pattern; add explanatory comments to [lints.clippy] section
- main.rs: update is_flag_with_value() docstring to explicitly cover both
  Args struct flags and subcommand-only flags (--ast, --limit, etc.)

Co-Authored-By: Claude <noreply@anthropic.com>
…asts, fast-path

Five issues from batch-1 audit of types.rs:

- Deserialization now runs normalize() on each path before building the ids
  map, so a table serialized with ./src/main.rs round-trips idempotently
  (re-registering the same unnormalized path after deser returns the existing
  FileId rather than creating a duplicate).

- Replace silent `as usize` / `as u64` casts in lookup() with
  usize::try_from(...).ok()? so truncation becomes a None return rather than
  wrong-file access. Add compile_error! to reject 32-bit targets where usize
  < u64 would make the casts lossy.

- Swap clone order in register(): ids takes the clone, paths takes the move —
  avoids one unnecessary PathBuf allocation per new registration.

- Fast-path in normalize(): skip Vec<Component> allocation entirely when the
  path contains no CurDir or ParentDir components (the common case).

- Add two serialization_roundtrip tests covering the deser-normalize invariant:
  unnormalized-paths-idempotent-after-roundtrip and
  serialized-form-uses-normalized-paths.

Co-Authored-By: Claude <noreply@anthropic.com>
…build/update stub coverage

- file_table_edge_cases: add test_normalize_absolute_over_root documenting
  that /a/../../b produces /../b (I/O-free normalizer cannot clamp at root)
- cli_search: add regression tests for --build, --rebuild, --update flags
  confirming each routes through the not-yet-implemented stub path

Co-Authored-By: Claude <noreply@anthropic.com>
types.rs was 577 lines, exceeding the 400-line project convention.
Split into four focused modules under types/:

- file_table.rs  — FileId, FileTable, custom serde impls, normalize(),
                   64-bit compile_error gate
- query.rs       — SearchQuery (#[must_use]), TemporalFlags, SearchField
- result.rs      — SearchResult, MatchSpan, LineRange, IndexStats
- error.rs       — SearchError, Result alias

mod.rs barrel re-exports all types so lib.rs pub use is unchanged.
Inline unit tests moved into their respective sub-modules.

All 91 tests pass; clippy --workspace -D warnings clean.

Co-Authored-By: Claude <noreply@anthropic.com>
command().print_help() renders the command name from clap's
Command::new("search"), which omits the parent binary. Override with
.name("skim search") so help output shows "Usage: skim search".
Simplifier reverted to manual println pattern; restoring clap delegation
since eliminating help text duplication was the explicit review finding
(flagged by 4 reviewers). Uses let _ = to handle io::Result.
…xHashMap

- Add register_within(path, root) that rejects paths escaping project
  root (leading .. or absolute after normalization)
- Add MAX_FILE_TABLE_ENTRIES (10M) size limit on FileTable deserialization
  to prevent OOM from malicious/corrupted index files
- Switch FileTable from HashMap to FxHashMap (rustc-hash) for faster
  path lookups — zero-cost change, already a transitive dependency
- 7 new tests covering all three fixes
…m-search from rskim

- Move rustc-hash = "1.1" from rskim-search inline declaration into
  [workspace.dependencies] so all crates follow the same pattern.
- Remove the rskim-search dependency from the rskim CLI crate; no code
  in rskim imports it (search.rs is a pure stub). The dep will be
  re-added when Wave 1 wires up real search logic.

Fixes batch-2 issues: cargo:rustc_hash_workspace, cargo:unused_rskim_search_dep

Co-Authored-By: Claude <noreply@anthropic.com>
…ble normalization

- register_within now verifies containment by joining root+path, re-normalizing,
  and checking starts_with(normalized_root); previously _root was unused so any
  relative path passed regardless of root
- Extract register_normalized(PathBuf) private method so both register() and
  register_within() normalize exactly once instead of twice
- Deserialization now returns Err on duplicate normalized paths (e.g.,
  ["./src/main.rs","src/main.rs"]) rather than silently overwriting the ids map
  and breaking the paths.len() == ids.len() invariant
- Add test_file_table_deserialization_rejects_at_limit_plus_one covering the
  actual rejection branch of the 10M-entry size limit (was previously untested)
- Add test_file_table_deserialization_rejects_duplicate_normalized_paths
- Add register_within tests: single-parent escape, subdirectory escape,
  root-itself acceptance; document that relative paths without ".." do resolve
  inside root when joined

Co-Authored-By: Claude <noreply@anthropic.com>
…cture

- Add SearchIndex super-trait (extends SearchLayer with file_table, stats, mtimes)
- Change LayerBuilder::build() return type to Box<dyn SearchIndex>
- Add CorruptedIndex error variant to SearchError
- Add SearchField::as_u8/from_u8/all() for on-disk serialization
- Define shared lexical types: Ngram, PostingEntry, IndexHeader, Bm25Params, IndexMetadata
- Create lexical/ module (ngram, index_format, scoring, builder, query stubs)
- Create fields/ module (tree_sitter_fields, serde_fields, markdown_fields stubs)
- Add memmap2 workspace dependency, serde_json to rskim-search deps
- Add search test fixtures (TypeScript, Rust, JSON, YAML, Markdown, Python)
- Update trait_object_safety tests for SearchIndex
Implement `extract_ngrams` (index-time) and `extract_query_ngrams`
(query-time) in `crates/rskim-search/src/lexical/ngram.rs`.

- `extract_ngrams`: sliding 2-byte window over whitespace-delimited
  tokens; first and last bigram of each token receive BORDER_WEIGHT
  (2.0); interior bigrams receive INTERIOR_WEIGHT (1.0); repeated
  bigrams accumulate weights via FxHashMap.

- `extract_query_ngrams`: greedy covering-set over raw trimmed query
  bytes; selects bigrams maximising uncovered position coverage;
  capped at 32 bigrams; all weights uniform 1.0.

Edge cases handled: empty input, single-byte input, all-whitespace,
Unicode multi-byte chars, CJK 3/4-byte sequences, very long text,
repeated chars.

Also suppress two pre-existing clippy warnings in scaffolding code
(`type_complexity` in tree_sitter_fields.rs, `manual_contains` in
markdown_fields.rs) so `cargo clippy -p rskim-search -- -D warnings`
passes cleanly.

Co-Authored-By: Claude <noreply@anthropic.com>
Implement `extract_ngrams` (index-time) and `extract_query_ngrams`
(query-time) in `crates/rskim-search/src/lexical/ngram.rs`.

- `extract_ngrams`: sliding 2-byte window over whitespace-delimited
  tokens; first and last bigram of each token receive BORDER_WEIGHT
  (2.0); interior bigrams receive INTERIOR_WEIGHT (1.0); repeated
  bigrams accumulate weights via FxHashMap; O(n) allocation.

- `extract_query_ngrams`: greedy covering-set over raw trimmed query
  bytes; selects bigrams maximising uncovered position coverage;
  capped at 32 bigrams; all weights uniform 1.0.

Edge cases handled: empty input, single-byte input, all-whitespace,
Unicode multi-byte chars, CJK 3/4-byte sequences, very long text,
repeated chars. No unwrap/expect/panic in library code.

Also suppress two pre-existing clippy warnings in scaffolding code
(`type_complexity` in tree_sitter_fields.rs, `manual_contains` in
markdown_fields.rs) to satisfy `cargo clippy -p rskim-search -- -D warnings`.

Co-Authored-By: Claude <noreply@anthropic.com>
- scoring.rs: BM25F formula with IDF + per-field boosted TF, length
  normalization, guards for zero avg_doc_len and zero df
- tree_sitter_fields.rs: data-driven FxHashMap classifier for 13
  tree-sitter languages; identifier classification only in declaration
  contexts to avoid indexing every variable reference
- serde_fields.rs: JSON (serde_json-based), YAML/TOML (line-scan, no
  extra deps), Markdown (line-scan, code fences, links, headings)
- 3 integration test files: bm25f_scoring (11 tests), field_classification
  (13 tests), serde_field_classification (27 tests) — all green
- clippy -D warnings: clean
…10)

- LexicalLayerBuilder: accumulates postings from multiple files via add_file(),
  dispatching tree-sitter languages through recursive AST walk and serde-based
  languages through classify_serde_fields(); always includes whole-file fallback
  posting under FunctionBody to guarantee searchable content for every file
- Skips files >5MB or with avg_line_length >500 (minified detection)
- Writes lexical.skidx, lexical.skpost, metadata.json atomically on build()
- LexicalSearchLayer: minimal struct satisfying SearchIndex/SearchLayer;
  search() is a Phase-3 stub; open() deserializes metadata.json and mmaps index
- 10 integration tests cover empty index, per-language paths, skip heuristics,
  disk layout, reopen, and metadata roundtrip; all 265 rskim-search tests pass

Co-Authored-By: Claude <noreply@anthropic.com>
)

Replace Phase 2 stub with full query pipeline in LexicalSearchLayer::search():
- Extract query n-grams via extract_query_ngrams (covering-set strategy)
- Lookup posting lists from main index (binary search) + delta (linear scan)
- Filter tombstoned doc_ids from main-index results
- Accumulate per-document, per-field term frequencies via FxHashMap
- Score each document with Bm25Scorer::score_term using per-field boosts
- Sort descending by score, apply offset + limit, return Vec<(FileId, f32)>

Added index_dir: PathBuf field to LexicalSearchLayer so tombstone and delta
files can be located at query time without passing extra parameters.

Added 13 integration tests in tests/lexical_query.rs covering: empty/blank
queries, fixture-based ranking (UserService, AuthHandler, database_url,
replicas), score ordering, limit/offset, positive scores, type-definition
boost, and cross-open reproducibility.

Co-Authored-By: Claude <noreply@anthropic.com>
…16)

Replace the search subcommand stub with a full BM25F lexical search
implementation. Walks repo files respecting .gitignore, builds a
per-repo index keyed by repo-root hash under SKIM_CACHE_DIR, and
auto-builds on first query. Supports --build, --rebuild, --stats,
--clear-cache, --limit, --json, and snippet display in text output.
Remove dead code (merge_field_tfs, unix_now, PostingEntry size-check hack),
eliminate unnecessary String allocation in YAML classifier, deduplicate
cache-dir resolution in search.rs, tighten classifier_cache lookup in
builder.rs, and replace language-behavior tests with behavior-asserting ones.
Remove unused test helpers (make_header, make_posting) from
index_format.rs unit test module. These were duplicates of the
helpers in the integration test file and generated dead_code warnings.
…ests

Add 44 tests across three new files that exercise the full Wave 1 lexical
search pipeline from real fixture files through to CLI output:

- wave1_acceptance.rs (27 tests): end-to-end build→query→rank pipeline;
  covers all 6 language fixtures, field-boost ordering, pagination,
  persistence, reopen, unicode, duplicate idempotency, and corruption
  detection via the SearchIndex trait (not internal modules).

- wave1_robustness.rs (10 tests): adversarial builder and query inputs —
  very long queries, CJK text, special chars, null bytes, empty/whitespace
  content, concurrent Arc<LexicalSearchLayer> reads, index size bounds,
  and a 100-query performance check (< 500ms total).

- cli_search_adversarial.rs (7 tests): binary-level adversarial coverage —
  5000-char queries, special-char queries, corrupt-index error with helpful
  --rebuild hint (correctly locates hash-keyed cache subdirectory), stats
  JSON validity, build-then-query, --limit 0, and help flag completeness.

Co-Authored-By: Claude <noreply@anthropic.com>
- Split index_format.rs into module with dedicated concerns: delta, entry, reader, writer, tombstones
- Restructured search command into: index (setup), output (formatting), mod (coordination)
- Reorganized field extraction with dedicated tree_sitter_tables mapping
- Updated tests to match refactored module structure (field_classification, index_format, lexical_query, serialization_roundtrip, types_edge_cases, wave1_acceptance, wave1_robustness)
- Removed consolidated bm25f_scoring.rs (integrated into query module)
- Unified test fixtures for better reusability

Improves codebase clarity and maintainability without behavioral changes.

Co-Authored-By: Claude <noreply@anthropic.com>
@leesharon leesharon requested a review from dean0x as a code owner April 5, 2026 16:20
let count = entry.posting_length as usize;
let file_count = self.header.file_count;

let mut result = Vec::with_capacity(count);
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 BLOCKING: Resource Exhaustion via Uncapped Vec::with_capacity

The read_postings method calls Vec::with_capacity(count) where count comes from untrusted on-disk data. A crafted index file with posting_length = u32::MAX would attempt to allocate ~51GB, causing OOM or panic.

Fix: Cap the capacity hint before allocation:

let max_possible = self.post_mmap.len().saturating_sub(start) / POSTING_ENTRY_SIZE;
let cap = count.min(max_possible);
let mut result = Vec::with_capacity(cap);

This is a straightforward bounds check (3 lines) that prevents DoS via crafted index files.

let entry = PostingEntry {
doc_id,
field_id: field.as_u8(),
position: node.start_byte() as u32,
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 BLOCKING: Unchecked Truncation of node.start_byte() to u32

node.start_byte() as u32 silently truncates on files near 5MB (the file size limit), potentially corrupting snippet positions. While unlikely given the 5MB cap, this violates Rust safety discipline used elsewhere (see line 179 which uses try_from).

Fix: Use try_from matching the existing pattern:

position: u32::try_from(node.start_byte()).unwrap_or(u32::MAX),

The same issue exists at line 146 (range.start as u32). Both should be fixed together for consistency.


let path_str = rel_path
.map(|p| p.display().to_string())
.unwrap_or_else(|| "<unknown>".to_string());
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 BLOCKING: CLI Depends on Concrete Type, Not Trait

output.rs imports and depends on LexicalSearchLayer directly (lines 8, 16, 79, 97, 107), violating Dependency Inversion Principle. This makes it impossible to add Wave 2 temporal or AST layers without rewriting all output functions.

Fix: Change function signatures to accept &dyn SearchIndex:

pub(super) fn print_text_results(
    index: &dyn SearchIndex,  // Not &LexicalSearchLayer
    results: &[(FileId, f32)],
    query_text: &str,
    repo_root: &Path,
) -> anyhow::Result<()> {
    // Use index.file_table() instead of layer.file_table()
}

All callsites in search/mod.rs:135 already have LexicalSearchLayer which implements SearchIndex, so this is backward-compatible. The trait infrastructure is ready — just decouple the output layer.

#[error("Serialization error: {0}")]
SerializationError(String),

/// The index file is corrupted or has an incompatible format version.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 BLOCKING: Inconsistent Error Variant Naming

SearchError has both IndexError(String) (line 14) and CorruptedIndex { path, reason } (lines 29-36). Both describe index-related failures but with inconsistent shapes:

  • Builder (line 676): IndexError("file_id exceeds u32::MAX")
  • Reader (line 92): CorruptedIndex { path, reason }

This creates confusion: which variant should be used? The structured variant with path context is more debuggable.

Fix: Consolidate into a single variant. Either:

  1. Rename IndexErrorIndexBuildError to clarify it's build-time only
  2. Migrate all IndexError call sites to CorruptedIndex with structured fields

The structured version is better for error handling and debugging.

@leesharon
Copy link
Copy Markdown
Author

Code Review Summary

Review Status: ⛔️ CHANGES_REQUESTED (4 blocking findings)

High-Confidence Blocking Issues (≥80% confidence)

Created inline comments for these critical findings:

  1. Security: Uncapped Vec::with_capacity in read_postings - Resource exhaustion via crafted index (90% confidence) — reader.rs:248
  2. Rust: Unchecked as u32 truncation of node.start_byte() - Position corruption on large files (90% confidence) — builder.rs:335
  3. Architecture: CLI depends on concrete LexicalSearchLayer, not trait - Breaks Wave 2 extensibility (90% confidence) — output.rs:79
  4. Consistency: IndexError vs CorruptedIndex overlap - Confusing error handling (85% confidence) — error.rs:29

Lower-Confidence Suggestions (60-79%)

These are important but less urgent than the 4 blocking issues above:

Security (80% confidence, medium priority)

  • Missing size limit on delta file (200MB cap recommended) — delta.rs:100
  • TOCTOU race in metadata size guard (read-first pattern) — query.rs:70
  • Missing size limit on .skpost file (4GB cap recommended) — reader.rs:87
  • --limit flag accepts zero and usize::MAX — Clamp to 1-10,000 — search/mod.rs:38

Performance (90% confidence, HIGH priority)

  • Delta linear scan O(n) per query — Use binary search or in-memory hash index — delta.rs:111
  • Vec allocation per posting lookup (32 allocations per query) — Return iterator or reuse buffer — reader.rs:242
  • FxHashMap grows unbounded in builder — Pre-allocate with estimate — builder.rs:44
  • Redundant n-gram extraction for nested nodes — Track classified byte ranges — builder.rs:282

Rust (82-90% confidence)

  • Unchecked as u32 of range.start — Use try_from — builder.rs:146
  • Unchecked as u32 of ngrams.len() — Use try_from — builder.rs:151, 219, 340
  • IndexStats::clone() in accessor — Return reference instead — query.rs:216
  • Missing #[must_use] on Result-returning functions — Add to all public fallible functions

Complexity (92% confidence, MEDIUM priority)

  • serde_fields.rs exceeds 400 lines (477) — Extract tests to existing integration file
  • builder.rs exceeds 400 lines (426) — Extract walker to lexical/walker.rs
  • walk_and_classify_inner has 7 parameters — Encapsulate mutable state in struct
  • Moderate cyclomatic complexity in search/mod.rs:run() — Extract unimplemented flag checks

Consistency (85-92% confidence)

  • Inconsistent #[must_use] annotations — Standardize on reason strings
  • Inconsistent test naming (test_ prefix vs descriptive) — Pick one convention
  • Inconsistent section headers (// Unit Tests vs // Unit tests) — Standardize on capital T

Test Coverage (85-90% confidence)

  • No test for CorruptedIndex error variant
  • No test for IndexReader::validate() method
  • No test for MAX_METADATA_BYTES guard
  • No test for unimplemented CLI flags
  • Scoring edge cases (doc_count == 0, df > doc_count) not tested
  • Tombstone + search integration not end-to-end tested
  • Delta reader + search integration not end-to-end tested

Deduplication

Multiple reviewers flagged identical issues — consolidated into single inline comments:

  • as u32 truncation (security + rust reviewers)
  • IndexError vs CorruptedIndex (consistency + rust reviewers)
  • LexicalSearchLayer field visibility (architecture + rust reviewers)

Summary by Category

Finding Blocking Should-Fix Lower-Confidence
Security 2 0 4
Architecture 1 2 1
Performance 2 2 0
Rust 1 4 4
Complexity 3 1 0
Consistency 2 2 3
Tests 3 2 0
Regression 0 0 1
Total 14 13 13

Recommendation

Do not merge until the 4 inline-comment blocking issues are resolved. These are straightforward fixes (3-10 lines each) that prevent resource exhaustion, position corruption, and Wave 2 extensibility issues. The lower-confidence findings are important for code quality but can be tracked as tech debt if the blocking issues are fixed.

See inline comments for specific fixes.

leesharon and others added 15 commits April 6, 2026 09:27
Two denial-of-service vectors closed in IndexReader:

1. Vec::with_capacity OOM (issue #1): posting_length from an untrusted
   .skidx entry could be u32::MAX, triggering a ~51 GiB allocation before
   the per-entry bounds check fires. Cap count against the maximum entries
   that could physically exist in the mapped .skpost file.

2. Unbounded .skpost mmap (issue #3): only a divisibility check existed.
   Add MAX_SKPOST_BYTES = 4_000_000_000 (4 GiB) check in IndexReader::open,
   covering ~357 M PostingEntry records — well beyond any real repository.

Co-Authored-By: Claude <noreply@anthropic.com>
Fields `reader`, `scorer`, and `metadata` were `pub(crate)` but are
only accessed within query.rs via `self`. Restrict visibility to private
to enforce encapsulation and prevent external coupling to internals.

No accessor methods needed — all field access is self-contained.

Co-Authored-By: Claude <noreply@anthropic.com>
Change resolve_result, print_text_results, and print_json_results to
accept &dyn SearchIndex instead of &LexicalSearchLayer. The output
module only needs file_table() and stats() which are on the SearchIndex
trait — tying it to the concrete type was an unnecessary dependency on
an internal implementation detail.

show_stats still imports LexicalSearchLayer to construct it via ::open,
which is the correct construction site.

Co-Authored-By: Claude <noreply@anthropic.com>
Resolves builder.rs exceeding 400 lines (426 → 354) by extracting the
tree-sitter AST walker into a dedicated lexical/walker.rs module.

- Add WalkContext struct to encapsulate mutable walk state (source,
  classifier, doc_id, postings, doc_len), reducing walk_and_classify_inner
  from 7 params to 3 (ctx, node, depth)
- walk_and_classify is pub(crate) in the new module, matching previous visibility
- MAX_AST_DEPTH constant moved to walker.rs with its unit test
- Fix as-u32 truncation in walker: use u32::try_from().unwrap_or(u32::MAX)
  for node.start_byte() and ngrams.len() casts (issues #3, #5)
- Register walker module as pub(crate) in lexical/mod.rs

Co-Authored-By: Claude <noreply@anthropic.com>
serde_fields.rs exceeded the 400-line project guideline at 477 lines,
with 170 lines being inline unit tests. Extracted all #[cfg(test)] tests
to tests/serde_field_classification.rs and renamed them with a low_level_
prefix to avoid collisions with existing integration tests.

Source file is now 302 lines. All 43 tests pass.

Co-Authored-By: Claude <noreply@anthropic.com>
DeltaReader::open() had no upper bound on the file it would mmap,
unlike tombstones (40 MB) and metadata.json (100 MB) which both
reject oversized files with CorruptedIndex. A crafted or runaway
delta file could cause unbounded memory mapping.

Add MAX_DELTA_BYTES = 100_000_000 (5 M records at 20 bytes each)
and return SearchError::CorruptedIndex if the file exceeds it,
matching the existing tombstone/metadata pattern.

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix unchecked as-u32 truncations in index_serde: range.start and
  ngrams.len() now use u32::try_from().unwrap_or(u32::MAX) matching
  the pattern used throughout the codebase
- Fix same truncation in the whole-file fallback path for fallback_ngrams.len()
- Reserve postings HashMap capacity on first file using content.len()/4
  as a conservative ngram-count estimate, reducing rehash overhead for
  large repositories

Co-Authored-By: Claude <noreply@anthropic.com>
… annotations

All 9 bare #[must_use] annotations in crates/rskim-search/src now carry
reason strings that explain what goes wrong when the return value is
discarded, making the lint warnings actionable rather than generic.

Co-Authored-By: Claude <noreply@anthropic.com>
DeltaReader::scan() performed a full linear mmap scan per query
n-gram, making each search O(n*q) where n is delta record count
and q is query n-gram count (up to 32). For a delta with 100k
records this meant 3.2 M unnecessary comparisons per query.

Build a FxHashMap<u64, Vec<usize>> (hash → byte offsets) in a
single pass during open(). scan() now does a hash lookup plus
one PostingEntry decode per match: O(1) per n-gram regardless of
delta size. Public API (scan signature, open return type) is
unchanged so no callers are affected.

Co-Authored-By: Claude <noreply@anthropic.com>
…from CorruptedIndex

Both IndexError and CorruptedIndex described index failures, creating
ambiguity at call sites. IndexBuildError now clearly signals a failure
during index construction (e.g. value exceeds u32::MAX), while
CorruptedIndex signals a runtime integrity violation when reading an
existing index. The #[error] display string is unchanged so existing
error message assertions still pass.

Updated all 5 sites: error.rs enum, builder.rs, writer.rs, and two test files.

Co-Authored-By: Claude <noreply@anthropic.com>
…d delta+tombstone pipeline

- error_conversion.rs: 3 tests for CorruptedIndex Display format (path+reason
  interpolation, template prefix) and source() chain (returns None — no #[source])
- index_format.rs: 3 tests for IndexReader::validate() — valid index passes,
  empty index passes, and crafted out-of-bounds posting range is detected
- wave1_acceptance.rs: 3 integration tests that exercise delta/tombstone
  filtering end-to-end through the query engine:
    tombstoned_file_is_excluded_from_search_results
    delta_entries_appear_in_search_results
    delta_bypasses_tombstone_for_new_entries (verifies delta is not tombstoned)

Co-Authored-By: Claude <noreply@anthropic.com>
Two-pass approach: Pass 1 computes byte offsets with checked arithmetic
(no I/O, no large allocation). Pass 2 streams PostingEntry bytes directly
to BufWriter<File>. Eliminates the intermediate Vec<u8> that held the
entire postings file in memory.
Add lookup_into/read_postings_into on IndexReader and scan_into on
DeltaReader — caller-owned buffer variants that clear-and-fill instead
of allocating per call. Original lookup/scan delegate to these for
backward compatibility.

In the query loop, postings_buf and ngram_docs are now declared once
before the loop and reused across iterations, eliminating per-ngram
Vec and HashMap allocations.
…tep comments

- delta.rs: use mmap directly instead of `let data = &mmap[..]` alias
- reader.rs: add #[must_use] to lookup_into (matches lookup)
- query.rs: align step numbering with module-level pipeline doc
…unit tests

- output::show_stats now accepts &dyn SearchIndex (DI) instead of
  opening the index internally; caller in mod.rs handles open+error
- builder.rs: use or_insert_with return value directly instead of
  separate get() call for classifier cache
- Remove minified_threshold_calculation and max_ast_depth_is_reasonable
  unit tests — constants-only assertions already covered by integration
  tests (minified_file_is_skipped, index_builder)
@dean0x dean0x closed this May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants