test: auto-generated embedding strategy benchmark #76
For every symbol in the graph, generates a natural language query from splitIdentifier (e.g. buildGraph → "build graph") and checks if search finds that symbol in the top N. Tests all 286 unique symbols with zero human bias.

Results (minilm, 286 symbols):
- structured: Hit@1=75.5%, Hit@3=95.8%, 2 misses
- source: Hit@1=66.8%, Hit@3=88.1%, 11 misses
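The loop described above can be sketched roughly as follows. This is a minimal illustration, not the script from the PR: `hitAtK` and the `search(query, k)` callback are hypothetical names, and `search` is assumed to return an array of symbol names ranked best-first.

```javascript
// Split camelCase / PascalCase / snake_case / kebab-case names into words.
// Same shape as the benchmark helper reviewed in this PR.
function splitIdentifier(name) {
  return name
    .replace(/([a-z])([A-Z])/g, '$1 $2')
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
    .replace(/[_-]+/g, ' ')
    .trim()
    .toLowerCase();
}

// Hypothetical helper: for each symbol, build a query from its name, run the
// assumed `search(query, k)` callback, and count how often the symbol itself
// appears within the first k results.
function hitAtK(symbols, search, ks = [1, 3]) {
  const hits = Object.fromEntries(ks.map((k) => [k, 0]));
  const maxK = Math.max(...ks);
  for (const symbol of symbols) {
    const ranked = search(splitIdentifier(symbol), maxK);
    const rank = ranked.indexOf(symbol); // -1 when not in the top maxK
    for (const k of ks) {
      if (rank !== -1 && rank < k) hits[k] += 1;
    }
  }
  // Convert hit counts to rates in [0, 1].
  return Object.fromEntries(ks.map((k) => [k, hits[k] / symbols.length]));
}
```

Because every query is derived mechanically from a symbol name, nobody chooses which queries to include. The sketch assumes a non-empty symbol list.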
Claude finished @carlos-alm's task

PR Review: Auto-Generated Embedding Strategy Benchmark
Root Cause Analysis

Problem Being Solved: The previous benchmark used 14 hand-picked queries that introduced human bias into performance evaluation. With only 14 queries, results could be skewed by the specific selection of test cases, making it difficult to assess real-world performance across the entire codebase.

Why This Approach: Auto-generating queries from every symbol name in the graph (286 symbols) using

Risk Assessment: ✅ Low risk - this is a pure test improvement with no production code changes.

Critical Concerns

None identified. This is a test-only change that improves benchmarking methodology.

Phase 2: Configuration Change Detection ✅
Phase 3: Test Integrity ✅
Phase 4: Code Quality

✅ Excellent Improvements
✅ Implementation Quality
✅ Output and UX
Minor Observations
✅ Results Validation

The reported benchmark results show meaningful improvements:

These results demonstrate the structured strategy's superiority with statistical significance due to the large sample size.

Final Recommendation
Exceptional aspects:
Greptile Summary

This PR replaces the hand-picked 14-query benchmark with an auto-generated approach that tests all 286 unique symbols in the codebase. It generates queries by splitting symbol names (e.g.

Key changes:

Confidence Score: 5/5

Important Files Changed

Last reviewed commit: dd71a64
function splitIdentifier(name) {
  return name
    .replace(/([a-z])([A-Z])/g, '$1 $2')
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
    .replace(/[_-]+/g, ' ')
    .trim()
    .toLowerCase();
}
splitIdentifier in src/embedder.js:10-16 doesn't call .toLowerCase(), but this copy does. The mismatched logic could cause the benchmark to test different queries than the ones the actual embedding strategies use.
Suggested change:

function splitIdentifier(name) {
  return name
    .replace(/([a-z])([A-Z])/g, '$1 $2')
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
    .replace(/[_-]+/g, ' ')
    .trim();
}
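To make the mismatch raised in this comment concrete, the short sketch below runs both variants on a few sample names. The function names `splitKeepCase` and `splitLowerCase` are hypothetical labels for the two versions, and the sample identifiers are illustrative rather than taken from the codebase.

```javascript
// Variant without lowercasing (the version the review comment attributes
// to src/embedder.js).
function splitKeepCase(name) {
  return name
    .replace(/([a-z])([A-Z])/g, '$1 $2')
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
    .replace(/[_-]+/g, ' ')
    .trim();
}

// Variant with lowercasing (the benchmark copy under review).
function splitLowerCase(name) {
  return splitKeepCase(name).toLowerCase();
}

for (const name of ['buildGraph', 'parseHTMLDocument', 'snake_case_name']) {
  console.log(`${name}: "${splitKeepCase(name)}" vs "${splitLowerCase(name)}"`);
}
// buildGraph: "build Graph" vs "build graph"
// parseHTMLDocument: "parse HTML Document" vs "parse html document"
// snake_case_name: "snake case name" vs "snake case name"
```

The two variants only agree on names that contain no uppercase letters, so any camelCase or PascalCase symbol would be queried with different text depending on which copy is used.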
- Rename generated/BENCHMARKS.md → generated/BUILD-BENCHMARKS.md
- Add scripts/embedding-benchmark.js: CI runner that tests all models against auto-generated queries from every symbol in the graph
- Add scripts/update-embedding-report.js: generates historical EMBEDDING-BENCHMARKS.md with per-model Hit@1/3/5 trends
- Update benchmark.yml with separate build + embedding jobs
- Embedding job caches HF models, skips jina-code without HF_TOKEN
- Add HF_TOKEN secret support for gated model access
- Remove tests/search/embedding-benchmark.js (superseded by scripts/)
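One way the "skip gated models without HF_TOKEN" behaviour could look is sketched below. This is an assumption-heavy illustration, not the contents of scripts/embedding-benchmark.js: the `MODELS` list, its `gated` flag, and `selectModels` are all hypothetical.

```javascript
// Hypothetical model registry. The `gated` flag is an assumption used to mark
// models that need a Hugging Face token to download.
const MODELS = [
  { name: 'minilm', gated: false },
  { name: 'jina-code', gated: true },
];

// Return the models that can run in the given environment.
// Gated models are skipped unless HF_TOKEN is set to a non-empty value.
function selectModels(models, env = process.env) {
  const hasToken = Boolean(env.HF_TOKEN);
  return models.filter((m) => !m.gated || hasToken);
}
```

Passing the environment as a parameter keeps the selection logic testable without touching the real `process.env`.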
Claude finished @carlos-alm's task

Root Cause Analysis

Problem Being Solved: The previous benchmark used 14 hand-picked queries that introduced human bias into performance evaluation. With only 14 queries, results could be skewed by the specific selection of test cases, making it difficult to assess real-world performance across the entire codebase.

Why This Approach: Auto-generating queries from every symbol name in the graph (286 symbols) using

Risk Assessment:

Critical Concerns

BLOCKING:

Why This Matters: Case sensitivity affects embedding similarity scores. The benchmark is measuring performance against different queries than users actually run, making the results meaningless for validating real-world performance.

Phase 2: Configuration Change Detection ✅

No problematic configuration changes:

Phase 3: Test Integrity ✅

Test methodology strengthened:

Phase 4: Code Quality

✅ Good Improvements
Impact: 1 functions changed, 2 affected
Claude finished @carlos-alm's task

PR Review: Auto-Generated Embedding Strategy Benchmark

Root Cause Analysis

Problem Being Solved: The previous benchmark used 14 hand-picked queries that introduced human bias into performance evaluation. With only 14 queries, results could be skewed by the specific selection of test cases, making it difficult to assess real-world performance across the entire codebase.

Why This Approach: Auto-generating queries from every symbol name in the graph (286 symbols) using

Risk Assessment: ✅ Low risk - this is a pure test improvement with no production code changes.

Critical Concerns

RESOLVED ✅ The critical

Phase 2: Configuration Change Detection ✅

No problematic configuration changes:

Phase 3: Test Integrity ✅

Test methodology significantly strengthened:

Phase 4: Code Quality

✅ Excellent Improvements

✅ Implementation Quality

✅ CI/CD Enhancements

✅ Results Validation

The benchmark methodology now provides statistically significant results with 286 data points vs the previous 14, enabling confident assessment of embedding strategy performance.

Final Recommendation

Outstanding aspects:

Summary
splitIdentifier (e.g. buildGraph → "build graph"), removing all human bias
splitIdentifier into both strategies, adopts minilm as default model)

Benchmark results (minilm, 286 symbols)
Test plan
.test. in filename)