Commit 3c1c53b

PatrickSys and claude authored
feat(indexing): OpenAI embeddings + broader language coverage (#57)
* feat(indexing): OpenAI embeddings + broader language coverage
  - Index meta stores embedding provider/model; search uses stored embedding config
  - Expand default indexing include globs to 30+ languages + config formats
  - Add LanceDB dimension mismatch guard for incremental updates
  - Curate Kotlin Tree-sitter grammar + fixture coverage
  - npm packaging: ship only docs/cli.md + docs/capabilities.md; exclude local drafts
  - Docs: clarify reindex vs refresh_index; document watcher auto-refresh
* refactor(embeddings): centralize EMBEDDING_PROVIDER parsing
* fix(embeddings): model-aware OpenAI dimensions + safe default model
  - text-embedding-3-large returns 3072 dims, not 1536; use a getter on OpenAIEmbeddingProvider so dimensions resolve after modelName is set
  - getConfiguredDimensions checks model name for 'large' before returning the OpenAI dimension value
  - mergeConfig now defaults to text-embedding-3-small when EMBEDDING_PROVIDER=openai and EMBEDDING_MODEL is unset, avoiding a 400 from the OpenAI API caused by sending 'Xenova/bge-small-en-v1.5'
  - add text-embedding-3-large test case; fix stale test description

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: update memory.json with auto-extracted commit history and refactor indexer and embeddings code for improved readability
  - Added multiple entries to memory.json for conventions and architectural decisions based on auto-extracted commit history.
  - Refactored indexer.ts to improve formatting and readability in exclusion patterns and conditional checks.
  - Enhanced embeddings/index.ts and types.ts for better code clarity and structure in provider name parsing.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent bcac3fa commit 3c1c53b
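
The embeddings fix described in the commit message can be illustrated with a minimal sketch. It is not the repository's code: the class and function names are borrowed from the message, but their bodies, the `ProviderName` type, the `"transformers"` provider label, and the `env` parameter are assumptions made for illustration. Only the model names and the 3072 and 1536 dimension counts are taken as given.

```typescript
// Hypothetical sketch; shapes and signatures are assumed, not copied from the repo.
type ProviderName = "transformers" | "openai";

interface EmbeddingConfig {
  provider: ProviderName;
  model: string;
}

const DEFAULT_LOCAL_MODEL = "Xenova/bge-small-en-v1.5";
const DEFAULT_OPENAI_MODEL = "text-embedding-3-small";

// Choose a provider-appropriate default so a local model name is never sent to the OpenAI API.
function mergeConfig(env: Record<string, string | undefined>): EmbeddingConfig {
  const rawProvider = env["EMBEDDING_PROVIDER"]?.trim().toLowerCase();
  const provider: ProviderName = rawProvider === "openai" ? "openai" : "transformers";
  const fallback = provider === "openai" ? DEFAULT_OPENAI_MODEL : DEFAULT_LOCAL_MODEL;
  return { provider, model: env["EMBEDDING_MODEL"]?.trim() || fallback };
}

class OpenAIEmbeddingProvider {
  modelName = DEFAULT_OPENAI_MODEL;

  // A getter is evaluated on every read, so it reflects a modelName assigned after construction.
  get dimensions(): number {
    return this.modelName.includes("large") ? 3072 : 1536;
  }
}

const config = mergeConfig({ EMBEDDING_PROVIDER: "openai" });
const provider = new OpenAIEmbeddingProvider();
provider.modelName = "text-embedding-3-large";
console.log(config.model, provider.dimensions);
```

A plain field initializer would be computed once at construction time, which is presumably why the message calls out a getter: the dimension lookup has to run after `modelName` has its final value.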

25 files changed: +575 -71 lines changed

.codebase-context/memory.json

Lines changed: 117 additions & 0 deletions
@@ -129,5 +129,122 @@
     "memory": "Never commit .planning/** or use gsd-tools commit; always use plain git commits with explicit messages",
     "reason": "We accidentally committed ignored .planning files and created pushed placeholder commits (e.g., --help). This is explicitly disallowed in this repo.",
     "date": "2026-02-20T19:08:22.195Z"
+  },
+  {
+    "id": "d4d3b072ea53",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix(watcher-tests): await ready + harden Windows cleanup (#55)",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-03-01T15:52:27.000Z",
+    "source": "git"
+  },
+  {
+    "id": "8821f0a1affe",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix(watcher): allow debounce 0 and harden test",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-28T17:16:59.000Z",
+    "source": "git"
+  },
+  {
+    "id": "c06d7e79009f",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix(watcher): queue refresh during indexing",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-28T17:12:35.000Z",
+    "source": "git"
+  },
+  {
+    "id": "73638343a916",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix(refs): prevent out-of-root file reads from index",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-28T15:58:03.000Z",
+    "source": "git"
+  },
+  {
+    "id": "8a0f5410d2e2",
+    "type": "decision",
+    "category": "architecture",
+    "memory": "refactor: eliminate all any types and consolidate type definitions (#46)",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-22T19:45:41.000Z",
+    "source": "git"
+  },
+  {
+    "id": "6a5bf4f56124",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix: close v1.8 post-merge integration gaps (#44)",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-22T17:58:51.000Z",
+    "source": "git"
+  },
+  {
+    "id": "8e014f2b09cd",
+    "type": "decision",
+    "category": "architecture",
+    "memory": "refactor: clean up formatting and improve readability in multiple files",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-21T12:50:44.000Z",
+    "source": "git"
+  },
+  {
+    "id": "3125c037fc40",
+    "type": "decision",
+    "category": "architecture",
+    "memory": "refactor: extract 11 MCP tool handlers into src/tools/ (#37)",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-20T22:21:55.000Z",
+    "source": "git"
+  },
+  {
+    "id": "6ae00519485a",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix(03-02): add regression guardrails for extraction and large-file safety",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-20T18:35:47.000Z",
+    "source": "git"
+  },
+  {
+    "id": "0080c6e64d64",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix(03-02): harden tree-sitter extraction against byte-offset and parser failures",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-20T18:33:19.000Z",
+    "source": "git"
+  },
+  {
+    "id": "92493e34e3e1",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix(02-tree-sitter-02): prevent symbol-aware chunk merging",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-20T14:41:29.000Z",
+    "source": "git"
+  },
+  {
+    "id": "32c95757f1b3",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix(02-01): fall back when tree-sitter parse has errors",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-20T14:38:35.000Z",
+    "source": "git"
+  },
+  {
+    "id": "a597568f48c2",
+    "type": "gotcha",
+    "category": "conventions",
+    "memory": "fix: guard null chunk.content crash + docs rewrite for v1.6.1",
+    "reason": "Auto-extracted from git commit history",
+    "date": "2026-02-15T13:04:10.000Z",
+    "source": "git"
   }
 ]

.npmignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+docs/TODO.md
+docs/visuals.md

CHANGELOG.md

Lines changed: 36 additions & 28 deletions
@@ -2,43 +2,41 @@
 
 ## [1.7.0](https://github.com/PatrickSys/codebase-context/compare/v1.6.1...v1.7.0) (2026-02-21)
 
-
 ### Features
 
-* **02-03:** implement keyword-index symbol reference lookup ([ccfc564](https://github.com/PatrickSys/codebase-context/commit/ccfc5649a3f4e321bbd3770e5945f83213e103a6))
-* **02-03:** register get_symbol_references MCP tool ([6f6bc3a](https://github.com/PatrickSys/codebase-context/commit/6f6bc3ae3bfa9af13c404028c1307d774b69291c))
-* **03-01:** add frozen controlled eval fixture and local codebase ([46736ed](https://github.com/PatrickSys/codebase-context/commit/46736ed4c4681767164682a774e1ddf08ee81768))
-* **03-03:** add multi-codebase eval runner command ([b065042](https://github.com/PatrickSys/codebase-context/commit/b065042f9a689d82485532872009af571d22db44))
-* **03-03:** centralize eval harness scoring logic ([5c5319b](https://github.com/PatrickSys/codebase-context/commit/5c5319b4a3c9caf30f7b31de3ee210bc153ee58c))
-* **04-01:** add curated grammar manifest, sync script, and publish inclusion ([908f39a](https://github.com/PatrickSys/codebase-context/commit/908f39a2c82a9630150262299ec8ae1f25c269ab))
-* **04-01:** update tree-sitter loader to resolve packaged grammars and fail closed ([458520f](https://github.com/PatrickSys/codebase-context/commit/458520ff3d24bd9ff6399b6bedfe1b6776fc6579))
-* **04-02:** add manifest-driven grammar CI test with fail-closed fallback ([2559405](https://github.com/PatrickSys/codebase-context/commit/2559405007e17bad6fffcf6ea61b97475f0da1e6))
-* **05-01:** create AST-aligned chunking engine with symbol tree builder ([f865abc](https://github.com/PatrickSys/codebase-context/commit/f865abc0a3877441b492695c02ddca12fe9b36c6))
-* **05-01:** wire AST-aligned chunker into GenericAnalyzer with 21 unit tests ([68a2d6d](https://github.com/PatrickSys/codebase-context/commit/68a2d6da844a9ffdb6104670c565f338487d2199))
-* **05-02:** add scope-aware prefix generation to AST chunks ([3dbd43e](https://github.com/PatrickSys/codebase-context/commit/3dbd43eec1d6cdf63ec4d5094c870bf2ee6b164d))
-* **06-01:** add index format metadata and headers ([a216c6d](https://github.com/PatrickSys/codebase-context/commit/a216c6dd2c7614b705525bc30ba8fddf918c7cf3))
-* **06-01:** gate index consumers on IndexMeta validation ([6a52c0d](https://github.com/PatrickSys/codebase-context/commit/6a52c0d33d408a7463e036eac8a650c461c86a43))
-* **06-02:** implement staging directory build and atomic swap for full rebuild ([d719801](https://github.com/PatrickSys/codebase-context/commit/d71980128795bdf8e7c7ab16beb350729a85e306))
-* **AST indexing:** Implement relationship index ([#38](https://github.com/PatrickSys/codebase-context/issues/38)) ([5b05092](https://github.com/PatrickSys/codebase-context/commit/5b05092b4d5a4a08b117fdc06a3292afdcc8764e))
-* expose all 10 MCP tools via CLI + document them ([#42](https://github.com/PatrickSys/codebase-context/issues/42)) ([7581fba](https://github.com/PatrickSys/codebase-context/commit/7581fbac5b4fd5bc52abc56d946bf55962870566))
-* references confidence, remove get_component_usage, ranked search hints ([#39](https://github.com/PatrickSys/codebase-context/issues/39)) ([33616aa](https://github.com/PatrickSys/codebase-context/commit/33616aa48b165d5cfd95c44bc416cb74c4fd5cbf))
-* rework decision-card to make it based on AST parsing ([#41](https://github.com/PatrickSys/codebase-context/issues/41)) ([ac4389d](https://github.com/PatrickSys/codebase-context/commit/ac4389d6cc55b7f8efc310a6e020bcd184a70adc))
-* symbol ranking, smart snippets, and edit decision card ([#40](https://github.com/PatrickSys/codebase-context/issues/40)) ([03964b3](https://github.com/PatrickSys/codebase-context/commit/03964b3f40cc0fa0caf9768747a39fb559daaa8e))
-* use tree-sitter symbols in generic analyzer ([b470709](https://github.com/PatrickSys/codebase-context/commit/b470709aa77f02325ed5a4e2b0710017020565da))
-
+- **02-03:** implement keyword-index symbol reference lookup ([ccfc564](https://github.com/PatrickSys/codebase-context/commit/ccfc5649a3f4e321bbd3770e5945f83213e103a6))
+- **02-03:** register get_symbol_references MCP tool ([6f6bc3a](https://github.com/PatrickSys/codebase-context/commit/6f6bc3ae3bfa9af13c404028c1307d774b69291c))
+- **03-01:** add frozen controlled eval fixture and local codebase ([46736ed](https://github.com/PatrickSys/codebase-context/commit/46736ed4c4681767164682a774e1ddf08ee81768))
+- **03-03:** add multi-codebase eval runner command ([b065042](https://github.com/PatrickSys/codebase-context/commit/b065042f9a689d82485532872009af571d22db44))
+- **03-03:** centralize eval harness scoring logic ([5c5319b](https://github.com/PatrickSys/codebase-context/commit/5c5319b4a3c9caf30f7b31de3ee210bc153ee58c))
+- **04-01:** add curated grammar manifest, sync script, and publish inclusion ([908f39a](https://github.com/PatrickSys/codebase-context/commit/908f39a2c82a9630150262299ec8ae1f25c269ab))
+- **04-01:** update tree-sitter loader to resolve packaged grammars and fail closed ([458520f](https://github.com/PatrickSys/codebase-context/commit/458520ff3d24bd9ff6399b6bedfe1b6776fc6579))
+- **04-02:** add manifest-driven grammar CI test with fail-closed fallback ([2559405](https://github.com/PatrickSys/codebase-context/commit/2559405007e17bad6fffcf6ea61b97475f0da1e6))
+- **05-01:** create AST-aligned chunking engine with symbol tree builder ([f865abc](https://github.com/PatrickSys/codebase-context/commit/f865abc0a3877441b492695c02ddca12fe9b36c6))
+- **05-01:** wire AST-aligned chunker into GenericAnalyzer with 21 unit tests ([68a2d6d](https://github.com/PatrickSys/codebase-context/commit/68a2d6da844a9ffdb6104670c565f338487d2199))
+- **05-02:** add scope-aware prefix generation to AST chunks ([3dbd43e](https://github.com/PatrickSys/codebase-context/commit/3dbd43eec1d6cdf63ec4d5094c870bf2ee6b164d))
+- **06-01:** add index format metadata and headers ([a216c6d](https://github.com/PatrickSys/codebase-context/commit/a216c6dd2c7614b705525bc30ba8fddf918c7cf3))
+- **06-01:** gate index consumers on IndexMeta validation ([6a52c0d](https://github.com/PatrickSys/codebase-context/commit/6a52c0d33d408a7463e036eac8a650c461c86a43))
+- **06-02:** implement staging directory build and atomic swap for full rebuild ([d719801](https://github.com/PatrickSys/codebase-context/commit/d71980128795bdf8e7c7ab16beb350729a85e306))
+- **AST indexing:** Implement relationship index ([#38](https://github.com/PatrickSys/codebase-context/issues/38)) ([5b05092](https://github.com/PatrickSys/codebase-context/commit/5b05092b4d5a4a08b117fdc06a3292afdcc8764e))
+- expose all 10 MCP tools via CLI + document them ([#42](https://github.com/PatrickSys/codebase-context/issues/42)) ([7581fba](https://github.com/PatrickSys/codebase-context/commit/7581fbac5b4fd5bc52abc56d946bf55962870566))
+- references confidence, remove get_component_usage, ranked search hints ([#39](https://github.com/PatrickSys/codebase-context/issues/39)) ([33616aa](https://github.com/PatrickSys/codebase-context/commit/33616aa48b165d5cfd95c44bc416cb74c4fd5cbf))
+- rework decision-card to make it based on AST parsing ([#41](https://github.com/PatrickSys/codebase-context/issues/41)) ([ac4389d](https://github.com/PatrickSys/codebase-context/commit/ac4389d6cc55b7f8efc310a6e020bcd184a70adc))
+- symbol ranking, smart snippets, and edit decision card ([#40](https://github.com/PatrickSys/codebase-context/issues/40)) ([03964b3](https://github.com/PatrickSys/codebase-context/commit/03964b3f40cc0fa0caf9768747a39fb559daaa8e))
+- use tree-sitter symbols in generic analyzer ([b470709](https://github.com/PatrickSys/codebase-context/commit/b470709aa77f02325ed5a4e2b0710017020565da))
 
 ### Bug Fixes
 
-* **02-01:** fall back when tree-sitter parse has errors ([8a7cd92](https://github.com/PatrickSys/codebase-context/commit/8a7cd92cab25b045b5108b1cba04773f644eab10))
-* **02-tree-sitter-02:** prevent symbol-aware chunk merging ([fd02625](https://github.com/PatrickSys/codebase-context/commit/fd0262516e262eff0c17646eaca021d6288c6647))
-* **03-02:** add regression guardrails for extraction and large-file safety ([a1c71de](https://github.com/PatrickSys/codebase-context/commit/a1c71de070b434f326dc80e627964c1540eea93f))
-* **03-02:** harden tree-sitter extraction against byte-offset and parser failures ([375a48f](https://github.com/PatrickSys/codebase-context/commit/375a48f231c85d72157aa74ea964db27bf9a983e))
+- **02-01:** fall back when tree-sitter parse has errors ([8a7cd92](https://github.com/PatrickSys/codebase-context/commit/8a7cd92cab25b045b5108b1cba04773f644eab10))
+- **02-tree-sitter-02:** prevent symbol-aware chunk merging ([fd02625](https://github.com/PatrickSys/codebase-context/commit/fd0262516e262eff0c17646eaca021d6288c6647))
+- **03-02:** add regression guardrails for extraction and large-file safety ([a1c71de](https://github.com/PatrickSys/codebase-context/commit/a1c71de070b434f326dc80e627964c1540eea93f))
+- **03-02:** harden tree-sitter extraction against byte-offset and parser failures ([375a48f](https://github.com/PatrickSys/codebase-context/commit/375a48f231c85d72157aa74ea964db27bf9a983e))
 
 ## [Unreleased]
 
 ### Added
 
-- **Definition-first ranking**: Exact-name searches now show the file that *defines* a symbol before files that use it. For example, searching `parseConfig` shows the function definition first, then callers.
+- **Definition-first ranking**: Exact-name searches now show the file that _defines_ a symbol before files that use it. For example, searching `parseConfig` shows the function definition first, then callers.
 
 ### Refactored
 
@@ -63,16 +61,26 @@
 - Shared eval scoring/reporting module (`src/eval/*`) used by both the CLI runner and the test suite.
 - Second frozen eval fixture plus an in-repo controlled TypeScript codebase for fully-offline eval runs.
 - Regression tests covering Tree-sitter Unicode slicing, parser cleanup/reset behavior, and large/generated file skipping.
+- **Tree-sitter symbol references** (PR #49): identifier scan excludes comment/string nodes; `confidence: "syntactic"` returned; `usageCount` reflects real AST occurrences, not regex matches.
+- **Import edge details** (PR #50): `importDetails` per edge (line number + imported symbols) persisted in `relationships.json`. Backward-compatible with existing `imports` field.
+- **2-hop transitive impact** (PR #50): `search --intent edit` impact now shows direct importers (hop 1) and their importers (hop 2), each labeled with distance. Capped at 20.
+- **Chokidar file watcher** (PR #52): index auto-refreshes in MCP server mode on file save (2 s debounce). No manual `reindex` needed during active editing sessions.
+- **CLI human formatters** (PR #48): all 9 commands now render as structured human-readable output. `--json` flag on every command for agent/pipe consumption.
+- **`status` + `reindex` formatters** (PR #56): status box with index health, progress, and last-built time. ASCII fallback via `CODEBASE_CONTEXT_ASCII=1`.
+- **`docs/cli.md` gallery** (PR #56): command reference with output previews for all 9 CLI commands.
 
 ### Changed
 
 - **Preflight response shape**: Renamed `reason` to `nextAction` for clarity. Removed internal fields (`evidenceLock`, `riskLevel`, `confidence`) so the output is stable and doesn't change shape unexpectedly.
-
+
 ### Fixed
 
 - Null-pointer crash in GenericAnalyzer when chunk content is undefined.
 - Tree-sitter symbol extraction now treats node offsets as UTF-8 byte ranges and evicts cached parsers on failures/timeouts.
 - **Post-merge integration gaps** (v1.8 audit): Removed orphaned `get_component_usage` source file, deleted phantom allowlist entry, removed dead guidance strings referencing the deleted tool. Added fallback decision card when `intelligence.json` is absent during edit-intent searches, now returns `ready: false` with actionable guidance instead of silently skipping.
+- Watcher initialization race: `onReady` hook ensures tests wait for chokidar readiness before asserting (PR #55).
+- Windows temp dir cleanup hardened with retry/backoff to fix `ENOTEMPTY`/`EPERM` test flakes (PR #55).
+- `--json` output now always pure JSON on stdout; status lines go to stderr (PR #48).
 
 ## [1.6.2] - 2026-02-17
 
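
The "2-hop transitive impact" entry in the CHANGELOG diff above describes a bounded walk over reverse import edges. A minimal sketch of that idea follows, assuming a reverse import map (file to its importers) is already available. The function name `transitiveImpact`, the `ReverseImports` and `ImpactEntry` types, and the file paths are hypothetical and not taken from the repository; only the two-hop limit, the distance labels, and the cap of 20 come from the changelog.

```typescript
// Hypothetical sketch of a 2-hop impact walk; names and data are illustrative only.
type ReverseImports = Map<string, string[]>; // file -> files that import it

interface ImpactEntry {
  file: string;
  hop: 1 | 2; // distance from the edited file
}

function transitiveImpact(target: string, importedBy: ReverseImports, cap = 20): ImpactEntry[] {
  const seen = new Set<string>([target]);
  const result: ImpactEntry[] = [];
  let frontier = [target];

  for (const hop of [1, 2] as const) {
    const next: string[] = [];
    for (const file of frontier) {
      for (const importer of importedBy.get(file) ?? []) {
        if (seen.has(importer)) continue; // keep the shortest distance only
        seen.add(importer);
        result.push({ file: importer, hop });
        if (result.length >= cap) return result; // stop once the cap is reached
        next.push(importer);
      }
    }
    frontier = next;
  }
  return result;
}

const graph: ReverseImports = new Map([
  ["src/config.ts", ["src/indexer.ts", "src/cli.ts"]],
  ["src/indexer.ts", ["src/server.ts", "src/cli.ts"]],
]);
console.log(transitiveImpact("src/config.ts", graph));
```

Tracking visited files means a file that is both a direct and an indirect importer is reported once, at hop 1. Whether the real implementation deduplicates this way is not stated in the changelog.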