A benchmark suite for measuring CodeRAG search quality against a grep baseline. Uses curated query datasets with expected results and computes standard information retrieval metrics.
- Package: `@code-rag/benchmarks` (private)
- Dependencies: `@code-rag/core`
- Dev dependencies: `tsx`, `vitest`
The benchmark package answers the question: How much better is CodeRAG's hybrid semantic search compared to simple text grep?
It loads a dataset of curated queries with known relevant files, runs each query through different search runners, computes IR metrics, and generates a comparison report.
All metrics are implemented as pure functions in `metrics.ts`:

| Metric | Function | Description |
|---|---|---|
| Precision@K | `precisionAtK(retrieved, relevant, k)` | Fraction of top-K results that are relevant |
| Recall@K | `recallAtK(retrieved, relevant, k)` | Fraction of relevant items found in top-K |
| MRR | `meanReciprocalRank(retrieved, relevant)` | 1 / rank of the first relevant result |
| nDCG@K | `ndcg(retrieved, relevant, k)` | Normalized Discounted Cumulative Gain (binary relevance) |
Metrics are computed at K=5 and K=10 for precision and recall.
> **Note:** Precision@K uses the standard IR definition where the denominator is always K, penalizing runners that return fewer than K results.
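A minimal sketch of how these pure functions might look, assuming retrieved results are ranked file-path arrays and the relevant set is a `Set<string>` (the exact types in `metrics.ts` may differ):

```ts
// Sketch of the IR metrics over ranked file-path lists.
export function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((p) => relevant.has(p)).length;
  return hits / k; // denominator is always K (standard IR definition)
}

export function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((p) => relevant.has(p)).length;
  return hits / relevant.size;
}

// Reciprocal rank for one query; the mean is taken across queries.
export function meanReciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((p) => relevant.has(p));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// nDCG with binary relevance: gain is 1 for a relevant item, 0 otherwise.
export function ndcg(retrieved: string[], relevant: Set<string>, k: number): number {
  const dcg = retrieved
    .slice(0, k)
    .reduce((sum, p, i) => sum + (relevant.has(p) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```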
Benchmark datasets are JSON files following this schema:
```ts
interface BenchmarkDataset {
  name: string;        // e.g., "coderag-self-benchmark"
  description: string;
  targetRepo: string;  // Path to the repo under test
  queries: BenchmarkQuery[];
}

interface BenchmarkQuery {
  id: string;          // Unique query identifier
  query: string;       // Natural language query text
  difficulty: 'easy' | 'medium' | 'hard';
  category: 'function_lookup' | 'concept_search' | 'cross_file' | 'error_investigation';
  expectedChunks: ExpectedChunk[];
  tags: string[];
}

interface ExpectedChunk {
  filePath: string;    // e.g., "packages/core/src/embedding/hybrid-search.ts"
  chunkType: string;   // e.g., "function", "class", "method"
  name: string;        // e.g., "HybridSearch"
  relevance: 'primary' | 'secondary';
}
```

Datasets are stored in the `datasets/` directory (e.g., `datasets/coderag-queries.json`).
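For illustration, a single-query dataset conforming to this schema might look like the following (the query text, id, and tags are hypothetical; the import path is assumed):

```ts
import type { BenchmarkDataset } from './types'; // hypothetical import path

const dataset: BenchmarkDataset = {
  name: 'coderag-self-benchmark',
  description: 'Queries run against the CodeRAG repository itself',
  targetRepo: '.',
  queries: [
    {
      id: 'hybrid-search-class',
      query: 'Where is hybrid semantic search implemented?',
      difficulty: 'easy',
      category: 'concept_search',
      expectedChunks: [
        {
          filePath: 'packages/core/src/embedding/hybrid-search.ts',
          chunkType: 'class',
          name: 'HybridSearch',
          relevance: 'primary',
        },
      ],
      tags: ['search'],
    },
  ],
};
```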
Uses `HybridSearch` from `@code-rag/core` to run queries (see the sketch after this list):

- Accepts a pre-configured `HybridSearch` instance
- Runs the query with a configurable `topK`
- Deduplicates file paths from results while preserving rank order
- Returns file paths, the full `SearchResult[]`, and duration in milliseconds
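A minimal sketch of such a runner, assuming `HybridSearch` exposes a `search(query, topK)` method and that each `SearchResult` carries a `filePath` field (both assumptions; the actual core API may differ):

```ts
import type { HybridSearch, SearchResult } from '@code-rag/core';

export async function runCodeRagQuery(
  search: HybridSearch, // pre-configured instance
  query: string,
  topK = 10,
): Promise<{ filePaths: string[]; results: SearchResult[]; durationMs: number }> {
  const start = performance.now();
  const results = await search.search(query, topK); // assumed method signature
  const durationMs = performance.now() - start;

  // Deduplicate file paths while preserving rank order.
  const seen = new Set<string>();
  const filePaths: string[] = [];
  for (const r of results) {
    if (!seen.has(r.filePath)) {
      seen.add(r.filePath);
      filePaths.push(r.filePath);
    }
  }
  return { filePaths, results, durationMs };
}
```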
Baseline comparison using system `grep` (see the sketch after this list):

- Executes `grep -rn --include=*.ts --include=*.js <query> <rootDir>`
- Parses output to count matches per file
- Ranks files by occurrence count (descending)
- Handles `grep` exit code 1 (no matches) gracefully
- 10 MB output buffer limit
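A sketch of the grep baseline using Node's `child_process` (the function name and return shape are assumptions; the flags and limits mirror the list above):

```ts
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

export async function runGrepQuery(query: string, rootDir: string): Promise<string[]> {
  let stdout = '';
  try {
    ({ stdout } = await execFileAsync(
      'grep',
      ['-rn', '--include=*.ts', '--include=*.js', query, rootDir],
      { maxBuffer: 10 * 1024 * 1024 }, // 10 MB output buffer limit
    ));
  } catch (err: any) {
    if (err.code === 1) return []; // exit code 1 means no matches, not an error
    throw err;
  }

  // Count matches per file from "path:line:text" output lines.
  const counts = new Map<string, number>();
  for (const line of stdout.split('\n')) {
    const file = line.split(':', 1)[0];
    if (file) counts.set(file, (counts.get(file) ?? 0) + 1);
  }

  // Rank files by occurrence count, descending.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).map(([file]) => file);
}
```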
The `runBenchmark()` function orchestrates a complete benchmark run:

```ts
const report = await runBenchmark(
  'datasets/coderag-queries.json', // dataset path
  'coderag',                       // runner name
  runnerFn,                        // (query: string) => { filePaths, durationMs }
);
```

Steps:

1. Load the dataset from JSON
2. For each query, extract the relevant file paths from its expected chunks
3. Run the query through the runner function
4. Compute per-query metrics (P@5, P@10, R@5, R@10, MRR, nDCG@10)
5. Compute aggregate metrics by averaging across all queries (see the sketch after this list)
6. Return a `BenchmarkReport` with all results
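The aggregation in step 5 is a plain mean over the per-query metrics. A sketch, assuming a flat per-query metrics record (the field names and `BenchmarkReport` internals are assumptions):

```ts
// Hypothetical per-query metrics shape; keys mirror the metrics in step 4.
type QueryMetrics = {
  p5: number; p10: number; r5: number; r10: number; mrr: number; ndcg10: number;
};

// Average each metric across all queries; assumes at least one query.
function aggregate(perQuery: QueryMetrics[]): QueryMetrics {
  const mean = (key: keyof QueryMetrics) =>
    perQuery.reduce((sum, m) => sum + m[key], 0) / perQuery.length;
  return {
    p5: mean('p5'), p10: mean('p10'), r5: mean('r5'),
    r10: mean('r10'), mrr: mean('mrr'), ndcg10: mean('ndcg10'),
  };
}
```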
The `perf/` directory provides utilities for measuring runtime and resource usage:

| Function | Description |
|---|---|
| `measureTime(fn)` | Wraps an async function, returns `{ result, durationMs }` |
| `measureMemory()` | Captures current heap and RSS usage in megabytes |
| `computePercentiles(values, [50, 95, 99])` | Computes percentile values from an array of measurements |
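A sketch of how these utilities might be implemented (the exact bodies, return field names, and percentile interpolation scheme are assumptions):

```ts
export async function measureTime<T>(
  fn: () => Promise<T>,
): Promise<{ result: T; durationMs: number }> {
  const start = performance.now();
  const result = await fn();
  return { result, durationMs: performance.now() - start };
}

export function measureMemory(): { heapMb: number; rssMb: number } {
  const { heapUsed, rss } = process.memoryUsage();
  return { heapMb: heapUsed / 1024 / 1024, rssMb: rss / 1024 / 1024 };
}

export function computePercentiles(values: number[], percentiles: number[]): number[] {
  const sorted = [...values].sort((a, b) => a - b);
  return percentiles.map((p) => {
    // Nearest-rank method; other interpolation schemes are possible.
    const idx = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
  });
}
```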
```bash
# Run with the default dataset
pnpm --filter @code-rag/benchmarks benchmark

# Run with a custom dataset
node --import tsx packages/benchmarks/src/run-benchmark.ts path/to/dataset.json
```

The CLI entry point (`run-benchmark.ts`) runs the grep baseline, generates a markdown comparison table, and writes a JSON report to `results/benchmark-report.json`.
The `generateMarkdownReport()` function produces a comparison table:

```markdown
# Benchmark Results

Date: 2026-02-23

## Aggregate Metrics

| Runner | P@5 | P@10 | R@5 | R@10 | MRR | nDCG@10 | Queries |
|--------|-----|------|-----|------|-----|---------|---------|
| grep | 0.2000 | 0.1500 | 0.3000 | 0.4500 | 0.5000 | 0.4200 | 20 |
| coderag | 0.6000 | 0.5000 | 0.7500 | 0.9000 | 0.8500 | 0.7800 | 20 |
```

Related:

- Hybrid Search -- the search algorithm being benchmarked
- Core -- the core library providing search functionality