Commit 983e6b4

phase4.5: implement AD-1/AD-2 hardening
1 parent 62a7f99 commit 983e6b4

19 files changed

Lines changed: 693 additions & 210 deletions

KnowCode.md

Lines changed: 107 additions & 19 deletions
Original file line number | Diff line number | Diff line change
@@ -816,6 +816,84 @@ You've essentially defined a **code intelligence system**, not a chatbot with em
816816

817817
---
818818

819+
## **Known Architectural Debt & Target State**
820+
821+
This section documents known architectural issues identified during review and the target state for each. Items are prioritised by impact.
822+
823+
### **AD-1: Monolithic Dependency Footprint** *(Priority: Critical)*
824+
825+
**Current state:** `pyproject.toml` requires FastAPI, FAISS, OpenAI, Gemini, numpy, uvicorn, and watchdog for *every* install — even users who only need `knowcode analyze` + `knowcode query`.
826+
827+
**Impact:** Slow installs, platform-specific failures (FAISS wheels, numpy ABI), increased vulnerability surface, and import-time latency for CLI-only users.
828+
829+
**Target state:** Core install (`pip install knowcode`) includes only: `click`, `networkx`, `pyyaml`, `pathspec`, `tree-sitter`, `tree-sitter-languages`, `GitPython`, `tiktoken`. Heavy dependencies move behind extras:
830+
831+
| Extra | Dependencies | Unlocks |
832+
|-------|-------------|---------|
833+
| `knowcode[server]` | `fastapi`, `uvicorn` | `knowcode server` |
834+
| `knowcode[search]` | `faiss-cpu`, `numpy` | `knowcode index`, `knowcode semantic-search` |
835+
| `knowcode[llm]` | `openai`, `google-genai`, `google-api-core` | `knowcode ask` |
836+
| `knowcode[watch]` | `watchdog` | `knowcode server --watch` |
837+
| `knowcode[all]` | All of the above | Batteries-included (preserves backward compatibility) |
838+
839+
Commands invoked without the required extra should fail fast with: *"Install knowcode[server] to use `knowcode server`"*.
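
One way to get this fail-fast behaviour is a small import guard that each optional command calls before doing any work. The sketch below is illustrative only: `require_extra` and `MissingExtraError` are hypothetical names, not taken from the KnowCode codebase. Only the wording of the error message comes from the text above.

```python
import importlib
from types import ModuleType


class MissingExtraError(RuntimeError):
    """Hypothetical error: a command needs an optional extra that is not installed."""


def require_extra(module_name: str, extra: str, command: str) -> ModuleType:
    """Import `module_name`, or fail fast with an install hint for `extra`."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so both are covered.
        raise MissingExtraError(
            f"Install knowcode[{extra}] to use `{command}`"
        ) from exc
```

A command handler could then call, for example, `require_extra("fastapi", "server", "knowcode server")` as its first statement, so the heavy import happens only when that command actually runs.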
840+
841+
### **AD-2: Hidden Side Effects in Query Paths** *(Priority: Critical)*
842+
843+
**Current state:** `KnowCodeService.retrieve_context_for_query()` auto-triggers `analyze()` and `_build_index()` if artifacts are missing. A read operation silently performs expensive writes.
844+
845+
**Impact:** Unpredictable latency in API/MCP server calls; surprises in CI/CD pipelines; makes the system non-deterministic from the caller's perspective.
846+
847+
**Target state:** Query methods fail fast with actionable errors when prerequisites are missing (e.g., *"Knowledge store not found. Run `knowcode analyze <dir>` first."*). Opt-in helpers `ensure_store()` and `ensure_index()` are available for callers who want the auto-build behavior.
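
The split between a strict read path and an opt-in builder can be sketched as follows. This is a minimal illustration, not the actual service code: `require_store`, `StoreNotFoundError`, and the `build` callback are assumptions, and the real `ensure_store()` signature may differ.

```python
from pathlib import Path
from typing import Callable

STORE_MISSING = "Knowledge store not found. Run `knowcode analyze <dir>` first."


class StoreNotFoundError(FileNotFoundError):
    """Hypothetical error raised by read paths when the store is absent."""


def require_store(store_path: Path) -> Path:
    """Read path: never builds anything, only checks and fails fast."""
    if not store_path.exists():
        raise StoreNotFoundError(STORE_MISSING)
    return store_path


def ensure_store(store_path: Path, build: Callable[[Path], None]) -> Path:
    """Opt-in helper: builds the store first if it is missing."""
    if not store_path.exists():
        build(store_path)  # expensive write, but the caller asked for it
    return require_store(store_path)
```

With this shape, query methods only ever call `require_store`, so any write is visible at the call site as an explicit `ensure_store(...)`.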
848+
849+
### **AD-3: No Schema Versioning on Persisted Artifacts** *(Priority: High)*
850+
851+
**Current state:** The JSON knowledge store and FAISS index contain no `schema_version` field. Data model changes silently corrupt existing stores.
852+
853+
**Impact:** No safe migration path; users must manually delete and rebuild after upgrades.
854+
855+
**Target state:** Top-level `schema_version` field in both the knowledge store JSON and the index metadata. A minimal migration shim validates version on load and either migrates or emits a clear error.
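
A migration shim of this kind might look like the sketch below. It is written under stated assumptions: the current version number, the choice to treat stores without a `schema_version` field as version 0, and all function names are hypothetical.

```python
import json
from typing import Any, Callable

CURRENT_SCHEMA_VERSION = 1  # assumed value for illustration


class SchemaVersionError(ValueError):
    """Hypothetical error for stores this build cannot read."""


def _migrate_0_to_1(data: dict[str, Any]) -> dict[str, Any]:
    # Assumption: legacy stores without a schema_version field count as version 0.
    return {**data, "schema_version": 1}


MIGRATIONS: dict[int, Callable[[dict[str, Any]], dict[str, Any]]] = {
    0: _migrate_0_to_1,
}


def load_store(raw_json: str) -> dict[str, Any]:
    """Parse a store, validate its version, and migrate it step by step."""
    data = json.loads(raw_json)
    version = data.get("schema_version", 0)
    if not isinstance(version, int) or version > CURRENT_SCHEMA_VERSION:
        raise SchemaVersionError(
            f"Unsupported schema_version {version!r}; "
            f"this build reads up to {CURRENT_SCHEMA_VERSION}."
        )
    while version < CURRENT_SCHEMA_VERSION:
        if version not in MIGRATIONS:
            raise SchemaVersionError(f"No migration from schema_version {version}.")
        data = MIGRATIONS[version](data)
        version = data["schema_version"]
    return data
```

Chaining one migration per version keeps each step small, and a store written by a newer release is rejected with a clear error instead of being misread.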
856+
857+
### **AD-4: Metadata Type Restriction** *(Priority: High)*
858+
859+
**Current state:** `Entity.metadata`, `Relationship.metadata`, and `CodeChunk.metadata` are typed as `dict[str, str]`, forcing stringification of booleans, integers, and lists.
860+
861+
**Target state:** Change to `dict[str, Any]` across all data models. Serialization/deserialization handles mixed types natively.
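
To illustrate why `dict[str, Any]` removes the need for stringification, here is a small round-trip sketch. The `Entity` class is a trimmed-down stand-in with only an `id` and `metadata`, not the real model, and the helper names are assumptions. It relies on metadata values being JSON-native types (booleans, numbers, strings, lists, dicts).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Entity:
    """Stand-in for the real model; only `metadata` matters here."""

    id: str
    metadata: dict[str, Any] = field(default_factory=dict)


def to_json(entity: Entity) -> str:
    """Serialize without converting metadata values to strings."""
    return json.dumps(asdict(entity))


def from_json(raw: str) -> Entity:
    """Rebuild the entity; JSON-native value types are preserved."""
    return Entity(**json.loads(raw))
```

Values outside JSON's type system, such as tuples or sets, would still need explicit handling, since `json` turns tuples into lists and rejects sets.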
862+
863+
### **AD-5: Configuration Error Handling** *(Priority: Medium)*
864+
865+
**Current state:** `AppConfig._load_from_yaml()` catches all exceptions, prints to stdout, and silently falls back to defaults. No schema validation on YAML keys.
866+
867+
**Target state:** Use `logging.warning()` instead of `print()`. In server/MCP contexts, raise on invalid configuration. Validate known config keys and warn on unrecognised ones.
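
The target behaviour could be approximated as below. This sketch validates an already-parsed YAML document so that it stays free of third-party imports. The logger name, the `KNOWN_KEYS` set, `ConfigError`, and the `strict` flag (standing in for server/MCP contexts) are all assumptions made for illustration.

```python
import logging
from typing import Any

logger = logging.getLogger("knowcode.config")  # logger name is an assumption

KNOWN_KEYS = {"models", "embedding", "reranking"}  # hypothetical key set


class ConfigError(ValueError):
    """Hypothetical error raised for invalid configuration in strict mode."""


def validate_config(raw: Any, *, strict: bool) -> dict[str, Any]:
    """Validate parsed YAML; `strict=True` models server/MCP contexts."""
    if not isinstance(raw, dict):
        if strict:
            raise ConfigError(
                f"Config root must be a mapping, got {type(raw).__name__}"
            )
        logger.warning("Invalid config root (%s); using defaults", type(raw).__name__)
        return {}
    # Warn on unrecognised keys and drop them; they never raise.
    for key in sorted(set(raw) - KNOWN_KEYS, key=str):
        logger.warning("Unrecognised config key: %s", key)
    return {k: v for k, v in raw.items() if k in KNOWN_KEYS}
```

Routing the messages through `logging` lets server deployments send them to their own handlers, which `print()` to stdout does not allow.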
868+
869+
### **AD-6: Service Layer Cohesion** *(Priority: Medium)*
870+
871+
**Current state:** `KnowCodeService` handles orchestration, caching, persistence, query classification, retrieval strategy selection, index validation, and auto-building — too many reasons to change.
872+
873+
**Target state:** Extract retrieval orchestration into a dedicated `RetrievalOrchestrator` class. `KnowCodeService` delegates to specialised components. Define `Protocol` interfaces for `EmbeddingProvider`, `VectorStore`, and `KnowledgeStoreProtocol` to decouple layers.
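
A rough shape for these interfaces, using `typing.Protocol`, is sketched here. Only the class names come from the text above. Every method name and signature (`embed`, `search`, `get_entity`, `retrieve`) is an assumption chosen to keep the example small.

```python
from typing import Any, Optional, Protocol, Sequence


class EmbeddingProvider(Protocol):
    def embed(self, text: str) -> Sequence[float]: ...


class VectorStore(Protocol):
    def search(self, vector: Sequence[float], k: int) -> list[str]: ...


class KnowledgeStoreProtocol(Protocol):
    def get_entity(self, entity_id: str) -> Optional[dict[str, Any]]: ...


class RetrievalOrchestrator:
    """Coordinates retrieval through the protocols, not concrete backends."""

    def __init__(
        self,
        embedder: EmbeddingProvider,
        vectors: VectorStore,
        store: KnowledgeStoreProtocol,
    ) -> None:
        self._embedder = embedder
        self._vectors = vectors
        self._store = store

    def retrieve(self, query: str, k: int = 5) -> list[dict[str, Any]]:
        """Embed the query, find nearest entity IDs, and resolve them."""
        ids = self._vectors.search(self._embedder.embed(query), k)
        entities = (self._store.get_entity(entity_id) for entity_id in ids)
        return [entity for entity in entities if entity is not None]
```

Because protocols are structural, a FAISS-backed store or an in-memory test double satisfies `VectorStore` without inheriting from it, so the orchestrator can be tested without the `search` extra installed.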
874+
875+
### **AD-7: Brittle Entity Identity** *(Priority: Medium)*
876+
877+
**Current state:** Entity IDs use `file_path::qualified_name`. File renames or moves break identity, poisoning temporal history and cached indexes.
878+
879+
**Target state:** Retain `file_path::qualified_name` as the primary ID but add a `content_hash` (SHA-256 of canonical source snippet) to entity metadata for rename-resilient correlation.
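
The text does not define what "canonical" means for a source snippet, so the sketch below picks one plausible normalisation as an assumption: unify line endings, remove common indentation, and strip trailing whitespace before hashing. The function name `content_hash` mirrors the metadata key but is otherwise hypothetical.

```python
import hashlib
import textwrap


def content_hash(snippet: str) -> str:
    """SHA-256 hex digest of a snippet after light normalisation (assumed rules)."""
    # Unify Windows and old-Mac line endings.
    normalized = snippet.replace("\r\n", "\n").replace("\r", "\n")
    # Remove common leading indentation so nesting depth does not matter.
    dedented = textwrap.dedent(normalized)
    # Drop surrounding blank lines and trailing whitespace on each line.
    lines = [line.rstrip() for line in dedented.strip("\n").split("\n")]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
```

Two entities with different `file_path::qualified_name` IDs but the same hash are candidates for being the same code before and after a rename or move. Any edit to the body changes the hash, so this supports correlation rather than replacing the primary ID.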
880+
881+
### **AD-8: Scalability Ceiling** *(Priority: Low — future concern)*
882+
883+
**Current state:** NetworkX in-memory graph + full JSON serialization. Adequate for small/medium repos but will hit memory and load-time walls on large monorepos (>100k entities).
884+
885+
**Target state:** Evaluate SQLite-backed storage for entities/edges/chunks with FTS, enabling incremental loads and partial queries. This is a Phase 6 concern.
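
As a rough picture of what partial queries could look like, the sketch below stores entities and edges in SQLite and answers a single lookup without loading a whole graph. The table layout, column names, and helper functions are assumptions, not a proposed schema. Full-text search is left out because FTS availability depends on how the local SQLite library was built.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store with hypothetical entities/edges tables."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS entities (
            id   TEXT PRIMARY KEY,
            kind TEXT NOT NULL,
            name TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS edges (
            source TEXT NOT NULL,
            target TEXT NOT NULL,
            kind   TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_edges_source ON edges(source);
        """
    )
    return conn


def callees(conn: sqlite3.Connection, entity_id: str) -> list[str]:
    """Return IDs directly called by `entity_id`, reading only matching rows."""
    rows = conn.execute(
        "SELECT target FROM edges WHERE source = ? AND kind = 'calls' ORDER BY target",
        (entity_id,),
    )
    return [row[0] for row in rows]
```

The index on `edges(source)` is what lets a query like this touch only the rows it needs, in contrast to deserializing the full JSON store into NetworkX first.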
886+
887+
### **AD-9: `[HARDENED]` Tag Clarity** *(Priority: Low)*
888+
889+
**Current state:** Layer descriptions throughout this document include `[HARDENED]` items that represent aspirational capabilities, not shipped features. This can mislead readers about the system's current state.
890+
891+
**Target state:** All `[HARDENED]` items are clearly labelled as *"ASPIRATIONAL — not yet implemented"* where they first appear (Section 1 preamble), and individual items are not removed — they remain as the north-star design.
892+
893+
---
894+
895+
> **Note on `[HARDENED]` tags:** Throughout the layer descriptions above, items marked `[HARDENED]` represent the *target design* for a production-grade system. They are **not yet implemented** in the current codebase. See the roadmap below for the phased plan to address them.
896+
819897
## **Implementation Status & Roadmap**
820898

821899
### **Phase 1: Foundation (COMPLETED)**
@@ -840,31 +918,41 @@ You've essentially defined a **code intelligence system**, not a chatbot with em
840918
13. **[x] Markdown Export (MVP)**: CLI `export` produces an index-style Markdown doc (see `docs_test/index.md`).
841919
14. **[ ] Multi-Level Doc Synthesis (Layer 7)**: Architecture/module/function narratives, change summaries, and freshness tracking.
842920

843-
### **Phase 5: Deep Analysis (NEXT)**
844-
15. **[ ] Static Behavioral Analysis (Layer 4)**: Data flow, state transitions, side-effect classification.
845-
16. **[ ] Intent Extraction (Layer 6)**: ADR/PR/commit intent linking beyond commit metadata.
846-
17. **[ ] Confidence Scoring (Layer 3)**: Weighted edges/entities by evidence source.
921+
### **Phase 4.5: Architectural Hardening (NEXT)** *(addresses AD-1 through AD-7)*
922+
15. **[x] Dependency Modularisation (AD-1)**: Move heavy dependencies behind optional extras (`server`, `search`, `llm`, `watch`, `all`). Core install stays lightweight.
923+
16. **[x] Side-Effect-Free Query Paths (AD-2)**: Remove auto-analyze/index from `retrieve_context_for_query()`. Fail fast with actionable errors. Add explicit `ensure_store()` / `ensure_index()` helpers.
924+
17. **[ ] Schema Versioning (AD-3)**: Add `schema_version` to knowledge store JSON and index metadata. Write migration shim for version validation on load.
925+
18. **[ ] Data Model Fixes (AD-4)**: Change `metadata: dict[str, str]` to `dict[str, Any]` across `Entity`, `Relationship`, and `CodeChunk`.
926+
19. **[ ] Configuration Hardening (AD-5)**: Replace `print()` with `logging`; raise on invalid config in server contexts; validate YAML schema.
927+
20. **[ ] Service Layer Decomposition (AD-6)**: Extract `RetrievalOrchestrator` from `KnowCodeService`. Define `Protocol` interfaces for `EmbeddingProvider`, `VectorStore`, `KnowledgeStoreProtocol`.
928+
21. **[ ] Entity Identity Resilience (AD-7)**: Add `content_hash` to entity metadata for rename-resilient correlation.
929+
22. **[ ] Layer Contract Tests**: Parser → `ParseResult` contract tests; store save/load roundtrip with schema version; retrieval golden-query tests; CLI smoke tests (Click runner); API endpoint contract tests (conditional on `server` extra).
930+
931+
### **Phase 5: Deep Analysis**
932+
23. **[ ] Static Behavioral Analysis (Layer 4)**: Data flow, state transitions, side-effect classification.
933+
24. **[ ] Intent Extraction (Layer 6)**: ADR/PR/commit intent linking beyond commit metadata.
934+
25. **[ ] Confidence Scoring (Layer 3)**: Weighted edges/entities by evidence source.
847935

848936
### **Phase 6: Enterprise (FUTURE)**
849-
18. **[ ] Security & RBAC**: Permissioned access and audit trails.
850-
19. **[ ] Scalability**: Large monorepo support and distributed processing.
851-
20. **[ ] Team Sharing**: Remote knowledge store sync and collaboration.
937+
26. **[ ] Security & RBAC**: Permissioned access and audit trails.
938+
27. **[ ] Scalability (AD-8)**: SQLite-backed storage for large monorepos; incremental graph loading; sharded indexes.
939+
28. **[ ] Team Sharing**: Remote knowledge store sync and collaboration.
852940

853941
### **Phase 7: Agentic Capabilities (COMPLETED v2.2)**
854-
21. **[x] Agent Architecture**: `Agent` class with configuration-driven model selection.
855-
22. **[x] Multi-Provider Support**: Google Gemini and OpenRouter/OpenAI integration.
856-
23. **[x] Rate Limiting**: Persistent RPM/RPD tracking and enforcement.
857-
24. **[x] Query Classification**: 6 task types (explain, debug, extend, review, locate, general).
858-
25. **[x] Smart Answer**: Local-first answering with configurable sufficiency threshold.
859-
26. **[x] VoyageAI Reranking**: Cross-encoder reranking with signal-based fallback.
942+
29. **[x] Agent Architecture**: `Agent` class with configuration-driven model selection.
943+
30. **[x] Multi-Provider Support**: Google Gemini and OpenRouter/OpenAI integration.
944+
31. **[x] Rate Limiting**: Persistent RPM/RPD tracking and enforcement.
945+
32. **[x] Query Classification**: 6 task types (explain, debug, extend, review, locate, general).
946+
33. **[x] Smart Answer**: Local-first answering with configurable sufficiency threshold.
947+
34. **[x] VoyageAI Reranking**: Cross-encoder reranking with signal-based fallback.
860948

861949
### **Phase 8: IDE Integration (COMPLETED v2.2)**
862-
27. **[x] MCP Server (Layer 10b)**: Tool exposure via STDIO for IDE agents.
863-
28. **[x] Core 4 Tools**: `search_codebase`, `get_entity_context`, `trace_calls`, `retrieve_context_for_query`.
864-
29. **[x] Sufficiency Scoring**: Context confidence metrics for local-first answering.
865-
30. **[x] Task-Specific Templates**: Debug/extend/review/explain/locate prioritization.
866-
31. **[x] Multi-hop Queries**: `trace_calls(depth=N)` and `get_impact()` analysis.
867-
32. **[x] Structured Responses**: JSON with `task_type` and `sufficiency_score`.
950+
35. **[x] MCP Server (Layer 10b)**: Tool exposure via STDIO for IDE agents.
951+
36. **[x] Core 4 Tools**: `search_codebase`, `get_entity_context`, `trace_calls`, `retrieve_context_for_query`.
952+
37. **[x] Sufficiency Scoring**: Context confidence metrics for local-first answering.
953+
38. **[x] Task-Specific Templates**: Debug/extend/review/explain/locate prioritization.
954+
39. **[x] Multi-hop Queries**: `trace_calls(depth=N)` and `get_impact()` analysis.
955+
40. **[x] Structured Responses**: JSON with `task_type` and `sufficiency_score`.
868956

869957
### **Supporting Tooling & QA (COMPLETED)**
870958
- **[x] Tests**: Unit/integration/e2e coverage for parsing, indexing, retrieval, API, CLI, storage, and analysis.

README.md

Lines changed: 38 additions & 4 deletions
@@ -18,15 +18,29 @@ KnowCode analyzes your codebase and builds a semantic graph of entities (functio
1818
uv venv
1919
source .venv/bin/activate # On Windows: .venv\Scripts\activate
2020

21-
# Install KnowCode (with dev dependencies)
22-
uv sync --dev
21+
# Install KnowCode for development (batteries included)
22+
uv sync --dev --extra all --extra mcp --extra voyageai
2323

2424
# Set API keys (only needed for the features you use; see aimodels.yaml)
2525
export VOYAGE_API_KEY_1="..." # embeddings + reranking (semantic search)
2626
export OPENAI_API_KEY="..." # embeddings (alternative to VoyageAI)
2727
export GOOGLE_API_KEY_1="..." # LLM (Gemini) for `knowcode ask`
2828
```
2929

30+
### Optional Dependency Extras
31+
32+
KnowCode now ships with a lightweight core install plus feature extras:
33+
34+
- `knowcode[server]` → `knowcode server`
35+
- `knowcode[search]` → `knowcode index`, `knowcode semantic-search`
36+
- `knowcode[llm]` → `knowcode ask`
37+
- `knowcode[watch]` → `knowcode server --watch`
38+
- `knowcode[all]` → union of `server`, `search`, `llm`, `watch`
39+
- `knowcode[mcp]` and `knowcode[voyageai]` remain available as before
40+
41+
Commands fail fast with actionable hints, e.g.:
42+
`Install knowcode[server] to use 'knowcode server'.`
43+
3044
## Quick Start
3145

3246
```bash
@@ -184,6 +198,11 @@ knowcode history "KnowledgeStore"
184198
### `ask`
185199
Ask questions about the codebase using an LLM agent. Requires an API key for at least one configured model in `aimodels.yaml`.
186200

201+
Prerequisites:
202+
- Knowledge store exists (`knowcode analyze <dir>`)
203+
- Semantic index exists (`knowcode index <dir>`)
204+
- LLM dependencies installed (`knowcode[llm]`)
205+
187206
```bash
188207
knowcode ask <question> [--config <path>]
189208
```
@@ -214,6 +233,9 @@ Start an MCP (Model Context Protocol) server for IDE agent integration.
214233
knowcode mcp-server [--store <path>] [--config <path>]
215234
```
216235

236+
Prerequisite: knowledge store must already exist (`knowcode analyze <dir>`).
237+
MCP read tools are deterministic and do not auto-run analysis.
238+
217239
**Tools Exposed:**
218240
- `search_codebase` - Search for code entities by name
219241
- `get_entity_context` - Get detailed context for an entity
@@ -389,8 +411,9 @@ ruff format src/
389411

390412
## Roadmap
391413

392-
See [KnowCode.md](KnowCode.md) for the full vision. The MVP focuses on:
414+
See [KnowCode.md](KnowCode.md) for the full vision and detailed architectural debt register.
393415

416+
**MVP (completed):**
394417
- ✅ Single monorepo support
395418
- ✅ Python, Markdown, YAML parsing
396419
- ✅ Snapshot-only analysis (no temporal tracking)
@@ -410,8 +433,19 @@ See [KnowCode.md](KnowCode.md) for the full vision. The MVP focuses on:
410433
- MCP server for IDE integration
411434
- VoyageAI cross-encoder reranking
412435

436+
**Next: v2.3 — Architectural Hardening:**
437+
- Modularise dependencies into optional extras (core install stays lightweight)
438+
- Remove hidden side effects from query paths (fail fast, not auto-build)
439+
- Add schema versioning to knowledge store and index artifacts
440+
- Fix `metadata` type restriction (`dict[str, str]` → `dict[str, Any]`)
441+
- Harden configuration loading (logging, validation, strict server mode)
442+
- Decompose `KnowCodeService` and introduce `Protocol` interfaces
443+
- Add layer contract tests (parser, store roundtrip, retrieval golden queries)
444+
413445
**Future releases:**
414-
- v3.0: Team sharing & Enterprise features (RBAC, SSO, etc.)
446+
- v2.4: Multi-level documentation synthesis
447+
- v3.0: Deep analysis (data flow, intent extraction, confidence scoring)
448+
- v4.0: Enterprise features (RBAC, scalability, team sharing)
415449

416450
## License
417451

docs/evolution.md

Lines changed: 27 additions & 18 deletions
@@ -100,28 +100,37 @@ flowchart TB
100100
13. **[x] Markdown Export (MVP)**: CLI `export` produces an index-style Markdown doc.
101101
14. **[ ] Multi-Level Doc Synthesis (Layer 7)**: Architecture/module/function narratives, change summaries, and freshness tracking.
102102

103+
### **Phase 4.5: Architectural Hardening (PARTIAL)**
104+
15. **[x] Dependency Modularisation (AD-1)**: Optional extras (`server`, `search`, `llm`, `watch`, `all`) with lightweight core install.
105+
16. **[x] Side-Effect-Free Query Paths (AD-2)**: Retrieval and MCP read tools fail fast on missing prerequisites; no auto analyze/index side effects.
106+
17. **[ ] Schema Versioning (AD-3)**: Persisted artifact schema versioning + migration shim.
107+
18. **[ ] Data Model Fixes (AD-4)**: Metadata fields move from `dict[str, str]` to `dict[str, Any]`.
108+
19. **[ ] Configuration Hardening (AD-5)**: Logging-based config warnings + strict server validation.
109+
20. **[ ] Service Layer Decomposition (AD-6)**: Retrieval orchestrator + protocol interfaces.
110+
21. **[ ] Entity Identity Resilience (AD-7)**: Add `content_hash` for rename-resilient correlation.
111+
103112
### **Phase 5: Deep Analysis (NEXT)**
104-
15. **[ ] Static Behavioral Analysis (Layer 4)**: Data flow, state transitions, side-effect classification.
105-
16. **[ ] Intent Extraction (Layer 6)**: ADR/PR/commit intent linking beyond commit metadata.
106-
17. **[ ] Confidence Scoring (Layer 3)**: Weighted edges/entities by evidence source.
113+
22. **[ ] Static Behavioral Analysis (Layer 4)**: Data flow, state transitions, side-effect classification.
114+
23. **[ ] Intent Extraction (Layer 6)**: ADR/PR/commit intent linking beyond commit metadata.
115+
24. **[ ] Confidence Scoring (Layer 3)**: Weighted edges/entities by evidence source.
107116

108117
### **Phase 6: Enterprise (FUTURE)**
109-
18. **[ ] Security & RBAC**: Permissioned access and audit trails.
110-
19. **[ ] Scalability**: Large monorepo support and distributed processing.
111-
20. **[ ] Team Sharing**: Remote knowledge store sync and collaboration.
118+
25. **[ ] Security & RBAC**: Permissioned access and audit trails.
119+
26. **[ ] Scalability**: Large monorepo support and distributed processing.
120+
27. **[ ] Team Sharing**: Remote knowledge store sync and collaboration.
112121

113122
### **Phase 7: Agentic Capabilities (COMPLETED v2.2)**
114-
21. **[x] Agent Architecture**: `Agent` class with configuration-driven model selection.
115-
22. **[x] Multi-Provider Support**: Google Gemini and OpenRouter/OpenAI integration.
116-
23. **[x] Rate Limiting**: Persistent RPM/RPD tracking and enforcement.
117-
24. **[x] Query Classification**: 6 task types (explain, debug, extend, review, locate, general).
118-
25. **[x] Smart Answer**: Local-first answering with configurable sufficiency threshold.
119-
26. **[x] VoyageAI Reranking**: Cross-encoder reranking with signal-based fallback.
123+
28. **[x] Agent Architecture**: `Agent` class with configuration-driven model selection.
124+
29. **[x] Multi-Provider Support**: Google Gemini and OpenRouter/OpenAI integration.
125+
30. **[x] Rate Limiting**: Persistent RPM/RPD tracking and enforcement.
126+
31. **[x] Query Classification**: 6 task types (explain, debug, extend, review, locate, general).
127+
32. **[x] Smart Answer**: Local-first answering with configurable sufficiency threshold.
128+
33. **[x] VoyageAI Reranking**: Cross-encoder reranking with signal-based fallback.
120129

121130
### **Phase 8: IDE Integration (COMPLETED v2.2)**
122-
27. **[x] MCP Server (Layer 10b)**: Tool exposure via STDIO for IDE agents.
123-
28. **[x] Core Tools**: `search_codebase`, `get_entity_context`, `trace_calls`.
124-
29. **[x] Sufficiency Scoring**: Context confidence metrics for local-first answering.
125-
30. **[x] Task-Specific Templates**: Debug/extend/review/explain/locate prioritization.
126-
31. **[x] Multi-hop Queries**: `trace_calls(depth=N)` and `get_impact()` analysis.
127-
32. **[x] Structured Responses**: JSON with `task_type` and `sufficiency_score`.
131+
34. **[x] MCP Server (Layer 10b)**: Tool exposure via STDIO for IDE agents.
132+
35. **[x] Core Tools**: `search_codebase`, `get_entity_context`, `trace_calls`.
133+
36. **[x] Sufficiency Scoring**: Context confidence metrics for local-first answering.
134+
37. **[x] Task-Specific Templates**: Debug/extend/review/explain/locate prioritization.
135+
38. **[x] Multi-hop Queries**: `trace_calls(depth=N)` and `get_impact()` analysis.
136+
39. **[x] Structured Responses**: JSON with `task_type` and `sufficiency_score`.

pyproject.toml

Lines changed: 21 additions & 4 deletions
@@ -13,18 +13,35 @@ dependencies = [
     "tree-sitter-languages>=1.10.0",
     "GitPython>=3.1.0",
     "tiktoken>=0.7.0",
+]
+
+[project.optional-dependencies]
+server = [
     "fastapi>=0.100.0",
     "uvicorn>=0.22.0",
-    "openai>=1.0.0",
+    "slowapi>=0.1.9",
+]
+search = [
     "faiss-cpu>=1.7.0",
     "numpy>=1.24.0",
-    "watchdog>=3.0.0",
+]
+llm = [
+    "openai>=1.0.0",
     "google-genai>=0.3.0",
     "google-api-core>=2.29.0",
+]
+watch = ["watchdog>=3.0.0"]
+all = [
+    "fastapi>=0.100.0",
+    "uvicorn>=0.22.0",
     "slowapi>=0.1.9",
+    "faiss-cpu>=1.7.0",
+    "numpy>=1.24.0",
+    "openai>=1.0.0",
+    "google-genai>=0.3.0",
+    "google-api-core>=2.29.0",
+    "watchdog>=3.0.0",
 ]
-
-[project.optional-dependencies]
 mcp = ["mcp>=1.0.0"]
 voyageai = ["voyageai>=0.2.0"]
