
Commit 8207787

feat: v1.6.0 search quality improvements (#26)
* feat: evaluation framework and frozen fixture set
* feat: v1.6.0 search quality - RRF fusion, intent classification, reranker
  - Reciprocal Rank Fusion (k=60, rank-based scoring)
  - Intent-aware query classification (5 types)
  - Hard test-file filtering (0% spec contamination)
  - Import-graph proximity reranking
  - File-level deduplication
  - Chunk optimization (100→50 lines, 0 overlap)
  - Embedding upgrade (@huggingface/transformers v3)

Note: Re-indexing recommended for best results due to chunking changes.
Existing indices remain readable — search still works without re-indexing.
1 parent 04a395f commit 8207787

20 files changed: +2244 -585 lines
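The Reciprocal Rank Fusion named in the commit message combines multiple ranked result lists by summing `1/(k + rank)` for each list a document appears in, so only ranks matter, never raw retriever scores. Below is a minimal TypeScript sketch; the function name, input shape, and example data are illustrative, not this repository's API. Only `k = 60` and rank-based scoring come from the commit itself.

```typescript
// Sketch of Reciprocal Rank Fusion (RRF): each result earns 1 / (k + rank)
// per ranked list it appears in, and the contributions are summed.
// k = 60 is the constant named in the commit; everything else is assumed.
function rrfFuse(rankedLists: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// Example: "a" is ranked first in both lists, so it wins regardless of how
// differently the two retrievers scale their raw scores.
const fused = rrfFuse([["a", "b", "c"], ["a", "c", "d"]]);
const top = [...fused.entries()].sort((x, y) => y[1] - x[1])[0][0];
console.log(top);
```

Because the fusion sees only ranks, it is immune to the score-distribution mismatch between vector similarity and keyword scores that the changelog entry alludes to.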

.gitignore
Lines changed: 1 addition & 0 deletions

@@ -15,3 +15,4 @@ nul
 *~
 .claude
 .codebase-intelligence.json
+.cursor/

AGENTS.md
Lines changed: 56 additions & 0 deletions

@@ -12,6 +12,62 @@ These are non-negotiable. Every PR, feature, and design decision must respect th
 - **No overclaiming in public docs**: README and CHANGELOG must be evidence-backed. Don't claim capabilities that aren't shipped and tested.
 - **internal-docs is private**: Never commit `internal-docs/` pointer changes unless explicitly intended. The submodule is always dirty locally; ignore it.
 
+## Evaluation Integrity (NON-NEGOTIABLE)
+
+These rules prevent metric gaming, overfitting, and false quality claims. Violation of these rules means the feature CANNOT ship.
+
+### Rule 1: Eval Sets are Frozen Before Implementation
+
+- **Define test queries and expected results BEFORE writing any code**
+- Commit the eval fixture (e.g., `tests/fixtures/eval-queries.json`) BEFORE starting implementation
+- **NEVER adjust expected results to match system output** - If the system returns different results, that's a failure, not a fixture bug
+- Exception: If the original expected result was factually wrong (file doesn't exist, query is ambiguous), document the correction with justification
+
+### Rule 2: Eval Sets Must Be General
+
+- **Minimum 20 queries** across diverse patterns (exact names, conceptual, multi-concept, edge cases)
+- Test on **multiple codebases** (minimum 2: one you control, one public/real-world)
+- Include queries that are HARD and likely to fail - don't cherry-pick easy wins
+- Eval set must represent real user queries, not synthetic examples designed to pass
+
+### Rule 3: Public Eval Methodology
+
+- Full eval harness code must be in `tests/` (public repository)
+- Eval fixtures must be public (or provide reproducible public examples)
+- Document how to run eval: `npm run eval -- /path/to/codebase`
+- Results must be reproducible by external users
+
+### Rule 4: No Score Manipulation
+
+- **NEVER add heuristics specifically to game eval metrics** (e.g., "if query contains X, boost Y")
+- **NEVER adjust scoring to break ties just to improve top-1 accuracy**
+- If you add ranking heuristics, they must be general-purpose and justified by search theory, not by "it makes test #7 pass"
+- Document all ranking heuristics with research citations or principled justification
+
+### Rule 5: Report Honestly
+
+- Report **both improvements AND failures** (e.g., "9/20 pass, 11/20 fail")
+- If top-3 recall is 80% but top-1 is 45%, say so - don't hide behind a single cherry-picked metric
+- Acknowledge when improvements are **workarounds** (filtering, heuristics) vs **fundamental** (better embeddings, ML models)
+- Include failure analysis in CHANGELOG: "Known limitations: struggles with multi-concept queries"
+
+### Rule 6: Cross-Check with Real Usage
+
+- Before claiming "X% improvement", test on a real codebase you didn't develop against
+- Ask: "Would this improvement generalize to a Python codebase? A Go codebase?"
+- If the improvement is framework-specific (e.g., Angular-only), say so explicitly
+
+### Violation Response
+
+If any agent violates these rules:
+
+1. **STOP immediately** - do not proceed with the release
+2. **Revert** any fixture adjustments made to game metrics
+3. **Re-run eval** with frozen fixtures
+4. **Document the violation** in internal-docs for learning
+5. **Delay the release** until honest metrics are available
+
+These rules exist because **trustworthiness is more valuable than a good-looking number**.
+
 ## Codebase Context
 
 **At start of each task:** Call `get_memory` to load team conventions.
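A frozen fixture of the kind Rule 1 prescribes might look like the sketch below. Every field name, path, and query here is a hypothetical illustration; only the fixture location `tests/fixtures/eval-queries.json` and the intent labels come from this commit.

```json
{
  "frozen_at": "2026-02-01",
  "queries": [
    {
      "id": 1,
      "query": "where is the embedding model loaded",
      "intent": "CONCEPTUAL",
      "expected_files": ["src/embeddings.ts"]
    },
    {
      "id": 2,
      "query": "refresh_index",
      "intent": "EXACT_NAME",
      "expected_files": ["src/indexer.ts"]
    }
  ]
}
```

Committing such a file before any implementation work, and diffing it at release time, is what makes "expected results were never adjusted to match system output" verifiable.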

CHANGELOG.md
Lines changed: 32 additions & 0 deletions

@@ -1,5 +1,37 @@
 # Changelog
 
+## [1.6.0](https://github.com/PatrickSys/codebase-context/compare/v1.5.1...v1.6.0) (2026-02-10)
+
+### Added
+
+- **Search Quality Improvements** — Weighted hybrid search with intent-aware classification
+  - Intent-aware query classification (EXACT_NAME, CONCEPTUAL, FLOW, CONFIG, WIRING)
+  - Reciprocal Rank Fusion (RRF, k=60) for robust rank-based score combination
+  - Hard test-file filtering (eliminates spec contamination in non-test queries)
+  - Import-graph proximity reranking (structural centrality boosting)
+  - File-level deduplication (one best chunk per file)
+- **Evaluation Harness** — Frozen fixture set with reproducible methodology
+- **Embedding Upgrade** — Granite model support (47M params, 8192 context)
+- **Chunk Optimization** — 100→50 lines, overlap 10→0, merge small chunks
+
+### Changed
+
+- **Dependencies**: `@xenova/transformers` v2 → `@huggingface/transformers` v3
+- **Indexing**: Tighter chunks (50 lines) with zero overlap
+- **Search**: RRF fusion immune to score distribution differences
+
+### Fixed
+
+- Intent-blind search (conceptual queries now classified and routed correctly)
+- Spec file contamination (test files hard-filtered from non-test query results)
+- Embedding truncation (granite's 8192 context eliminates previous 512 token limit)
+
+### Note
+
+**Re-indexing recommended** for best results due to chunking changes.
+Existing indices remain readable — search still works without re-indexing.
+To re-index: `refresh_index(incrementalOnly: false)` or delete `.codebase-context/` folder.
+
 ## [1.5.1](https://github.com/PatrickSys/codebase-context/compare/v1.5.0...v1.5.1) (2026-02-08)
 
 
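Two of the post-processing steps listed in the changelog, hard test-file filtering and file-level deduplication, can be sketched as follows. This is a hypothetical illustration, not this project's code; the chunk shape, function name, and test-file pattern are all assumptions.

```typescript
interface Chunk {
  file: string;
  score: number;
}

// Assumed naming convention for spec/test files; the real matcher may differ.
const TEST_FILE = /\.(spec|test)\.[jt]sx?$|(^|\/)__tests__\//;

function postProcess(results: Chunk[], isTestQuery: boolean): Chunk[] {
  // Hard filter: for non-test queries, drop test files entirely
  // instead of merely down-weighting them.
  const filtered = isTestQuery
    ? results
    : results.filter((r) => !TEST_FILE.test(r.file));

  // File-level deduplication: keep only the best-scoring chunk per file.
  const bestPerFile = new Map<string, Chunk>();
  for (const r of filtered) {
    const prev = bestPerFile.get(r.file);
    if (!prev || r.score > prev.score) bestPerFile.set(r.file, r);
  }
  return [...bestPerFile.values()].sort((a, b) => b.score - a.score);
}

const out = postProcess(
  [
    { file: "src/auth.ts", score: 0.9 },
    { file: "src/auth.spec.ts", score: 0.95 },
    { file: "src/auth.ts", score: 0.7 },
  ],
  false
);
console.log(out); // the spec file is dropped; one chunk remains for src/auth.ts
```

Filtering before deduplication matters here: a high-scoring spec chunk must not displace a production-code chunk from the same query's results.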
package.json
Lines changed: 2 additions & 1 deletion

@@ -94,10 +94,10 @@
     "type-check": "tsc --noEmit"
   },
   "dependencies": {
+    "@huggingface/transformers": "^3.8.1",
     "@lancedb/lancedb": "^0.4.0",
     "@modelcontextprotocol/sdk": "^1.25.2",
     "@typescript-eslint/typescript-estree": "^7.0.0",
-    "@xenova/transformers": "^2.17.0",
     "fuse.js": "^7.0.0",
     "glob": "^10.3.10",
     "hono": "4.11.7",
@@ -125,6 +125,7 @@
   "pnpm": {
     "onlyBuiltDependencies": [
       "esbuild",
+      "onnxruntime-node",
       "protobufjs",
       "sharp"
     ]
