search() hangs on large repos — walkdir doesn't respect .gitignore #2

@btakita

Summary

engine.search() blocks indefinitely on repositories with many files (~200K+) because the corpus file walker uses walkdir::WalkDir instead of the ignore crate's WalkBuilder. As a result, .gitignore rules are not applied during file discovery, and every file under the root is visited.

Environment

  • sift: current main branch (git dep)
  • OS: Linux (Arch)
  • Rust: stable

Reproduction

use sift::{SearchInput, SearchOptions, Sift};

fn main() {
    // Point at any large repo with submodules/node_modules
    // e.g., a monorepo with ~200K files (excluding .git/)
    let engine = Sift::builder().build();
    let options = SearchOptions::default()
        .with_limit(10)
        .with_strategy("lexical".to_string());
    let input = SearchInput::new("/path/to/large/repo", "main").with_options(options);

    // This blocks indefinitely (the result is never observed)
    let _response = engine.search(input);
}

Test repo: a workspace with git submodules containing node_modules/, build outputs, etc. Total file count: ~1.87M files (excluding .git/), ~200K excluding node_modules/ and target/.

Observed: search() did not return within 40 seconds, at which point an external wrapper timed it out.

Expected: search completes in seconds by respecting .gitignore to skip node_modules/, build artifacts, etc.

Root cause

In src/search/corpus.rs:283, collect_file_paths uses walkdir::WalkDir:

fn collect_file_paths(root: &Path, ignore: Option<&Ignore>) -> Vec<PathBuf> {
    // ...
    for entry in WalkDir::new(root).sort_by_file_name().into_iter().flatten() {
        // ...
    }
}

This walks every file in the directory tree. The Ignore struct (from src/config.rs) only checks against .siftignore files and two hardcoded exclusions (target/**, .git/**). It does not read or respect .gitignore files.
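
To illustrate why this matters, here is a minimal std-only sketch contrasting a walk that prunes ignored directories with one that visits everything and filters afterwards. It is not sift's code: walk, is_ignored_dir, and under_ignored_dir are hypothetical names, and the hardcoded directory list is only a stand-in for real ignore rules.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for real ignore rules: skip directories by name.
fn is_ignored_dir(name: &str) -> bool {
    matches!(name, "node_modules" | "target" | ".git")
}

/// True if any parent directory of `rel` (a path relative to the root) is ignored.
fn under_ignored_dir(rel: &Path) -> bool {
    rel.parent().map_or(false, |dir| {
        dir.components()
            .any(|c| is_ignored_dir(&c.as_os_str().to_string_lossy()))
    })
}

/// Walks `root` and returns the non-ignored files plus the number of
/// directory entries visited. With `prune` set, ignored directories are
/// never descended into; otherwise everything is visited and filtered later.
fn walk(root: &Path, prune: bool) -> io::Result<(Vec<PathBuf>, usize)> {
    let mut files = Vec::new();
    let mut visited = 0usize;
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            visited += 1;
            let file_type = entry.file_type()?;
            if file_type.is_dir() {
                let skip = prune && is_ignored_dir(&entry.file_name().to_string_lossy());
                if !skip {
                    stack.push(entry.path());
                }
            } else if file_type.is_file() {
                files.push(entry.path());
            }
        }
    }
    // Post-filter: removes files under ignored directories.
    // When pruning, such files were never collected, so this changes nothing.
    files.retain(|p| !under_ignored_dir(p.strip_prefix(root).unwrap_or(p)));
    files.sort();
    Ok((files, visited))
}
```

Both modes return the same file list, but only the pruning mode avoids touching the contents of node_modules/ and similar directories. On a tree where most files live under ignored directories, the visited count is where the time goes.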

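The ignore crate (already a dependency) provides WalkBuilder, which by default respects .gitignore, global gitignore, and .ignore files. Switching WalkDir::new(root) to WalkBuilder::new(root).build() should fix this without changing the public API. Two WalkBuilder defaults are worth checking when switching: it skips hidden files and directories, and it applies .gitignore rules only inside a git repository unless require_git(false) is set.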

Workaround

We wrapped engine.search() in a thread with mpsc::recv_timeout (30-second default) to prevent our CLI from hanging:

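use std::time::Duration;
use anyhow::{bail, Result}; // assumption: bail! and Result come from anyhow

fn run_search_with_timeout(engine: Sift, input: SearchInput, timeout_secs: u64) -> Result<SearchResponse> {
    // A timeout of 0 disables the wrapper and calls search() directly.
    if timeout_secs == 0 { return engine.search(input); }
    let (tx, rx) = std::sync::mpsc::channel();
    // The worker thread is not cancelled on timeout; it keeps walking in the background.
    std::thread::spawn(move || { let _ = tx.send(engine.search(input)); });
    match rx.recv_timeout(Duration::from_secs(timeout_secs)) {
        Ok(result) => result,
        Err(_) => bail!("search timed out after {}s", timeout_secs),
    }
}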
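
For anyone who wants to try the same pattern without sift, here is a std-only sketch of the wrapper, generic over any blocking closure. run_with_timeout is a hypothetical helper name, not part of sift or our CLI.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a separate thread and waits at most `timeout` for its result.
/// Returns `None` on timeout or if the worker panicked. The worker thread is
/// not cancelled on timeout; it keeps running until `work` returns.
fn run_with_timeout<T, F>(work: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout, so ignore send errors.
        let _ = tx.send(work());
    });
    rx.recv_timeout(timeout).ok()
}
```

This only keeps the caller responsive. The detached thread still consumes CPU and I/O until the walk finishes, so it is a stopgap, not a fix.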

Suggested fix

Replace WalkDir with the ignore crate's WalkBuilder in collect_file_paths. This would automatically respect .gitignore at every level of the tree, dramatically reducing the file set for typical repositories. The Ignore struct could then layer .siftignore on top of the git-native ignore rules.
