Skip to content

Commit 6682838

Browse files
tbitcsoz-agent
andcommitted
docs: Phase 0 baseline audit + AGENTS.md AG2 realignment
- docs/baseline-audit.md: architecture map, test inventory (208 pass/18 fail), untested modules, known breakpoints, ranked gap summary - AGENTS.md: AG2 four-layer architecture, agent roles (Planner/Builder/Verifier), Ollama policy, updated file registry, 12 project rules Co-Authored-By: Oz <oz-agent@warp.dev>
1 parent 367c56c commit 6682838

2 files changed

Lines changed: 251 additions & 16 deletions

File tree

AGENTS.md

Lines changed: 60 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -2,16 +2,39 @@
22

33
## Identity
44
- **Project**: specsmith
5-
- **Type**: CLI tool (Python) + AEE library — Spec Section 17.3
6-
- **Spec version**: 0.3.0
5+
- **Type**: CLI tool (Python) + AEE library + AG2 agent shell — Spec Section 17.3
6+
- **Spec version**: 0.3.10
77
- **Language**: Python 3.10+
88
- **Platforms**: Windows, Linux, macOS
9+
- **Agent layer**: AG2 (`ag2[ollama]`) over Ollama local models
910

1011
## Purpose
1112
Applied Epistemic Engineering toolkit for AI-assisted development. Treats belief systems
1213
like code: codable, testable, deployable. Co-installs the `epistemic` standalone library.
13-
Includes an AEE-integrated agentic client (`specsmith run`) supporting Claude, GPT, Gemini,
14-
and local Ollama models.
14+
Includes an AEE-integrated agentic client (`specsmith run`) and an AG2-based agent shell
15+
(`specsmith agent run`) supporting Planner/Builder/Verifier agents over local Ollama models.
16+
17+
## Agent Architecture (AG2 Realignment — April 2026)
18+
19+
The system has four layers:
20+
21+
1. **Product Surface** — specsmith CLI, VS Code plugin, PySide6 GUI, existing REPL
22+
2. **Agent Layer (AG2)** — Planner/Builder/Verifier agents in `src/specsmith/agents/`
23+
3. **Model Runtime (Ollama)** — local inference via `OllamaProvider`, structured outputs, tool calling
24+
4. **Verification Layer** — pytest, VS Code extension tests, traces, golden outputs
25+
26+
Do not collapse these layers into one blob. Each has clear boundaries.
27+
28+
### Agent Roles
29+
- **Planner**: understands tasks, generates execution plans, outputs breakdown + acceptance criteria
30+
- **Builder**: makes code/doc changes, wires features, patches defects
31+
- **Verifier**: runs tests, validates behavior, accepts or rejects changes
32+
33+
### Ollama Policy
34+
- Ollama is the default local model backend
35+
- Use structured outputs and tool calling whenever possible
36+
- Abstract model selection behind config — never hardcode one model
37+
- Primary orchestration model + optional lighter utility model
1538

1639
## Quick Commands
1740
- `pip install -e ".[dev]"` — dev install
@@ -47,27 +70,29 @@ and local Ollama models.
4770
- `src/specsmith/` — specsmith CLI package
4871
- `src/epistemic/` — standalone AEE library (canonical location)
4972
- `src/specsmith/epistemic/` — compatibility shim (re-exports from epistemic)
50-
- `src/specsmith/agent/` — agentic client (providers, tools, runner, hooks, skills)
73+
- `src/specsmith/agent/`existing agentic client (providers, tools, runner, hooks, skills)
5174
- `src/specsmith/agent/profiles/` — built-in agent profiles (planner, verifier, epistemic-auditor)
75+
- `src/specsmith/agents/`**AG2 agent shell** (new — Planner, Builder, Verifier)
76+
- `src/specsmith/agents/tools/` — AG2 tool surface (filesystem, shell, tests, git, docs, vscode)
77+
- `src/specsmith/agents/workflows/` — AG2 workflows (analyze_edit_test, bugfix, improve_specsmith)
78+
- `src/specsmith/agents/runtime/` — Ollama bridge for AG2
79+
- `src/specsmith/agents/config.py` — AG2 agent config from scaffold.yml
80+
- `src/specsmith/agents/cli.py``specsmith agent run|plan|status|verify` commands
5281
- `src/specsmith/templates/` — Jinja2 scaffold templates (incl. 4 new epistemic templates)
5382
- `src/specsmith/integrations/` — agent platform adapters
54-
- `src/specsmith/commands/` — harness slash command implementations _(planned)_
55-
- `src/specsmith/operations.py`typed ProjectOperations (file/git/search) _(planned)_
56-
- `src/specsmith/instinct.py`instinct persistence and continuous learning _(planned)_
57-
- `src/specsmith/memory.py`cross-session agent memory _(planned)_
58-
- `src/specsmith/eval/`EDD eval harness (Task/Trial/Grader/pass@k) _(planned)_
83+
- `src/specsmith/commands/` — harness slash command implementations _(stub — not yet wired)_
84+
- `src/specsmith/agents/instinct.py`instinct persistence and continuous learning _(planned)_
85+
- `src/specsmith/agents/memory.py`cross-session agent memory _(planned)_
86+
- `src/specsmith/agents/eval/`EDD eval harness (Task/Trial/Grader/pass@k) _(planned)_
87+
- `src/specsmith/agents/flags.py`feature flag system for tool schema gating _(planned)_
5988
- `src/specsmith/server/` — specsmith serve daemon (REST + WebSocket) _(planned)_
60-
- `src/specsmith/agent/spawner.py` — AgentTool subagent spawning _(planned)_
61-
- `src/specsmith/agent/orchestrator.py` — orchestrator meta-agent on Ollama _(planned)_
62-
- `src/specsmith/agent/teams.py` — agent team coordination via filesystem mailbox _(planned)_
63-
- `src/specsmith/agent/flags.py` — feature flag system for tool schema gating _(planned)_
64-
- `tests/` — test suite
89+
- `tests/` — test suite (208 pass, 18 sandbox failures as of 2026-04-20)
90+
- `docs/baseline-audit.md` — Phase 0 architecture audit
6591
- `docs/REQUIREMENTS.md` — formal requirements (extended April 2026)
6692
- `docs/TEST_SPEC.md` — test specifications
6793
- `docs/ARCHITECTURE.md` — architecture reference (extended April 2026)
6894
- `docs/governance/` — modular governance docs
6995
- `docs/AGENT-WORKFLOW-SPEC.md` — the specification itself
70-
- `C:\Users\trist\Development\BitConcepts\everything-claude-code` — ECC reference (local clone)
7196

7297
## Governance
7398
This project follows its own specification. See:
@@ -82,6 +107,8 @@ AGENTS.md is < 200 lines. Run `specsmith upgrade --full` to generate them if nee
82107
- Templates: jinja2
83108
- Config: pydantic + pyyaml
84109
- Output: rich
110+
- Agent shell: AG2 (`ag2[ollama]`)
111+
- Local LLM: Ollama v0.3+ (stdlib urllib, /api/chat)
85112
- Lint: ruff
86113
- Types: mypy (strict)
87114
- Tests: pytest + pytest-cov
@@ -92,6 +119,23 @@ AGENTS.md is < 200 lines. Run `specsmith upgrade --full` to generate them if nee
92119
- Service: FastAPI or aiohttp + websockets _(planned — specsmith serve)_
93120
- IDE: Eclipse Theia + @theia/ai-core _(planned — specsmith-ide repo)_
94121

122+
## Project Rules (AG2 Realignment)
123+
124+
These rules apply to all agents working on this codebase:
125+
126+
1. **Evidence over claims** — do not say "works" unless it is demonstrated with test output
127+
2. **Small safe steps** — prefer small validated improvements over large speculative rewrites
128+
3. **Preserve the existing product** — wrap and improve the current system before replacing major pieces
129+
4. **Tooling first** — a good tool loop beats a clever prompt
130+
5. **Tests are product** — if the system cannot prove itself, it is not ready to improve itself
131+
6. **Inspect before editing** — always read the relevant files before proposing changes
132+
7. **Preserve architectural boundaries** — do not collapse the four layers
133+
8. **Run the narrowest relevant tests** — after every change, run only the tests that cover it
134+
9. **Update docs alongside code** — undocumented features are governance violations (H14)
135+
10. **Use Ollama as default** — local model provider; cloud providers are opt-in
136+
11. **Keep AG2 shell modular** — tools, agents, workflows, and runtime are separate packages
137+
12. **Leave clear follow-up tasks** — when work is partial, document what remains
138+
95139
## Documentation Rule (H14 — Hard Rule)
96140

97141
The Read the Docs site (`docs/site/`) is the authoritative user manual.

docs/baseline-audit.md

Lines changed: 191 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,191 @@
1+
# Baseline Audit — specsmith
2+
3+
> Generated: 2026-04-20 (Phase 0 — AG2 Realignment)
4+
5+
## 1. Architecture Map
6+
7+
### Entrypoints
8+
9+
| Entrypoint | Module | Description |
10+
|---|---|---|
11+
| `specsmith` CLI | `cli.py` → Click `_AutoUpdateGroup` | 50+ commands. Auto-checks spec_version and PyPI updates on invocation. |
12+
| `specsmith run` | `agent/runner.py``AgentRunner` REPL | Agent loop: system prompt → provider → tool dispatch → hooks. Supports `/help`, `/tools`, `/model`, `/status`, `/save`, `/clear`. |
13+
| `specsmith gui` | `gui/app.py``launch()` | PySide6 (Qt6) desktop app. `GUIAgentRunner(AgentRunner)` overrides print/provider/tool methods to emit Qt signals. `AgentWorker(QThread)` runs off UI thread. |
14+
| VS Code extension | `extension.ts``activate()` | Activation event: `onStartupFinished`. 14 TypeScript source files. 30+ contributed commands. |
15+
16+
### Service Boundaries
17+
18+
```
19+
CLI Layer (cli.py)
20+
├── scaffolder.py — Jinja2 template render → project files
21+
├── auditor.py — health checks (file existence, REQ↔TEST, ledger)
22+
├── exporter.py — compliance reports, REQ coverage matrix
23+
├── importer.py — detect language/build/test → generate overlay
24+
├── config.py — Pydantic model for scaffold.yml (33 project types)
25+
├── differ.py — governance file drift detection
26+
├── doctor.py — environment diagnostic
27+
├── phase.py — project lifecycle phase management
28+
├── compressor.py — LEDGER.md archival
29+
├── ledger.py — CryptoAuditChain (SHA-256 append-only)
30+
├── retrieval.py — keyword scoring index (term-frequency, not BM25)
31+
├── profiles.py — execution profiles
32+
├── credit_analyzer.py — LLM credit spend analysis
33+
└── credits.py — rate limit profiles
34+
35+
Agent Layer (agent/)
36+
├── runner.py — REPL loop, tool execution, streaming, session state
37+
├── core.py — Message, Tool, CompletionResponse, ModelTier, BaseProvider
38+
├── tools.py — 20 tool handlers (all use _run_specsmith → subprocess)
39+
├── hooks.py — HookRegistry: Pre/PostTool, SessionStart, SessionEnd, H13
40+
├── skills.py — SKILL.md loader with domain priority
41+
├── optimizer.py — TokenEstimator, ResponseCache, ContextManager, ModelRouter, ToolFilter
42+
└── providers/
43+
├── anthropic.py — Claude (SDK: anthropic>=0.56)
44+
├── openai.py — GPT (SDK: openai>=1.0, also used for Mistral via base_url)
45+
├── gemini.py — Gemini (SDK: google-genai>=1.0, fallback google-generativeai)
46+
├── ollama.py — Ollama v0.3+ (stdlib urllib, /api/chat, tool calling, streaming)
47+
└── mistral.py — Mistral via openai SDK pointed at api.mistral.ai
48+
49+
Epistemic Layer (epistemic/ + specsmith/epistemic/)
50+
├── belief.py — BeliefArtifact dataclass
51+
├── stress_tester.py — 8 adversarial challenges, Logic Knot detection
52+
├── failure_graph.py — FailureModeGraph, equilibrium_check, Mermaid render
53+
├── recovery.py — RecoveryOperator, bounded proposals
54+
├── certainty.py — CertaintyEngine, weakest-link propagation
55+
├── session.py — AEESession facade
56+
└── trace.py — TraceVault SHA-256 append-only chain
57+
58+
GUI Layer (gui/)
59+
├── app.py — QApplication bootstrap, dark AEE theme
60+
├── main_window.py — QTabWidget, status bar, menu bar
61+
├── session_tab.py — per-tab: chat + input + meter + tool panel + provider bar
62+
├── worker.py — GUIAgentRunner + AgentWorker(QThread)
63+
└── widgets/ — chat_view, input_bar, provider_bar, token_meter, tool_panel, update_checker
64+
```
65+
66+
### VS Code Plugin Structure
67+
68+
```
69+
specsmith-vscode/src/
70+
├── extension.ts — activate(): tree views, commands, startup checks
71+
├── bridge.ts — SpecsmithBridge: child process (specsmith run --json-events), JSONL protocol
72+
├── SessionPanel.ts — webview: agent chat, auto-approve, model/provider switching
73+
├── GovernancePanel.ts — webview: 6-tab settings (General, Models, Execution, Tools, Agents, Help)
74+
├── SettingsPanel.ts — webview: global extension settings
75+
├── HelpPanel.ts — webview: help/docs
76+
├── OllamaManager.ts — Ollama model management (list, pull, delete, GPU detection)
77+
├── ModelRegistry.ts — fetch available models per provider
78+
├── ApiKeyManager.ts — secret storage for LLM API keys
79+
├── VenvManager.ts — Python venv detection/management
80+
├── ProjectTree.ts — sidebar tree: project folders + file operations
81+
├── EpistemicBar.ts — status bar: epistemic health indicator
82+
├── BugReporter.ts — interactive bug report filing
83+
└── types.ts — SpecsmithEvent, SessionConfig, SessionStatus types
84+
```
85+
86+
**Bridge protocol:** `SpecsmithBridge` spawns `specsmith run --json-events` as a child process. Communication is stdin (user messages, one per line) / stdout (JSONL events: `ready`, `llm_chunk`, `tool_started`, `tool_finished`, `tokens`, `turn_done`, `error`, `system`). Turn timeout: 5 minutes.
87+
88+
**Activation:** `onStartupFinished`. On activate: apply venv path, create tree views, register 30+ commands, startup checks (privacy notice, fetch models, update check, venv check, auto-open governance panel).
89+
90+
**No integration tests exist** for the VS Code extension.
91+
92+
### Model/Backend Assumptions per Provider
93+
94+
- **Anthropic:** SDK `anthropic>=0.56`. Streaming via SDK. Tool calling native.
95+
- **OpenAI:** SDK `openai>=1.0`. Also serves Mistral (base_url override). Tool calling native.
96+
- **Gemini:** SDK `google-genai>=1.0` (preferred) or `google-generativeai` (fallback). Auto-detects.
97+
- **Ollama:** Stdlib only (`urllib.request`). `/api/chat` for all completions. Tool calling v0.3+. `num_ctx` via `SPECSMITH_OLLAMA_NUM_CTX` (default 4096). `keep_alive=-1` to prevent model unload. Think parameter for reasoning models.
98+
- **Mistral:** Uses OpenAI SDK pointed at `api.mistral.ai`.
99+
100+
All providers are optional extras — specsmith core has zero LLM SDK dependencies.
101+
102+
## 2. Verification Results (2026-04-20)
103+
104+
### pytest (226 collected)
105+
106+
- **Passed:** 208
107+
- **Failed:** 18
108+
- **Skipped:** 0
109+
110+
**Failing tests (all sandbox/lifecycle + 1 scaffolder):**
111+
112+
| Test | Category |
113+
|---|---|
114+
| `test_sandbox_import::test_full_import_workflow` | sandbox import |
115+
| `test_sandbox_import::test_import_force_overwrites` | sandbox import |
116+
| `test_sandbox_import::test_import_idempotent_restart` | sandbox import |
117+
| `test_sandbox_import::test_import_preserves_existing_project_docs` | sandbox import |
118+
| `test_sandbox_import::test_import_force_overwrites_existing_docs` | sandbox import |
119+
| `test_sandbox_lifecycle_import::test_import_sets_inception_phase` | lifecycle import |
120+
| `test_sandbox_lifecycle_import::test_import_creates_governance_files` | lifecycle import |
121+
| `test_sandbox_lifecycle_import::test_import_then_phase_operations` | lifecycle import |
122+
| `test_sandbox_lifecycle_import::test_import_audit_includes_phase_readiness` | lifecycle import |
123+
| `test_sandbox_lifecycle_new::test_full_lifecycle_phases` | lifecycle new |
124+
| `test_sandbox_lifecycle_new::test_phase_gating_without_force` | lifecycle new |
125+
| `test_sandbox_lifecycle_new::test_governance_files_present` | lifecycle new |
126+
| `test_sandbox_lifecycle_upgrade::test_upgrade_migrates_workflow_to_session_protocol` | lifecycle upgrade |
127+
| `test_sandbox_lifecycle_upgrade::test_upgrade_preserves_workflow_content` | lifecycle upgrade |
128+
| `test_sandbox_lifecycle_upgrade::test_upgrade_then_audit_runs` | lifecycle upgrade |
129+
| `test_sandbox_lifecycle_upgrade::test_upgrade_idempotent` | lifecycle upgrade |
130+
| `test_sandbox_new::test_full_scaffold_workflow` | sandbox new |
131+
| `test_scaffolder::test_creates_expected_files` | scaffolder |
132+
133+
**Root cause:** Likely governance template drift — scaffolder output changed but sandbox test expectations weren't updated.
134+
135+
**Platform issue:** pytest cleanup crashes with `WinError 448` (untrusted mount point in temp dir). Does not affect test results.
136+
137+
### ruff (lint)
138+
139+
All checks passed. Zero issues.
140+
141+
### mypy (typecheck)
142+
143+
Success: 0 errors across 72 source files. One note: unused `keyring.*` override in pyproject.toml.
144+
145+
## 3. Untested Modules
146+
147+
**Critical (agent layer — zero test coverage):**
148+
- `agent/runner.py` — REPL loop, tool execution, streaming, session state, meta-commands
149+
- `agent/tools.py` — 20 tool handlers (all route through `_run_specsmith` subprocess wrapper)
150+
- `agent/hooks.py` — HookRegistry, trigger dispatch, H13 check
151+
- `agent/skills.py` — SKILL.md loading, domain priority
152+
- `agent/providers/anthropic.py` — Claude provider
153+
- `agent/providers/openai.py` — GPT/Mistral provider
154+
- `agent/providers/gemini.py` — Gemini provider
155+
- `agent/providers/ollama.py` — Ollama provider (tool calling, streaming, think parameter)
156+
- `commands/__init__.py` — empty stub, no slash commands implemented
157+
158+
**Secondary (supporting modules):**
159+
- `architect.py`, `auth.py`, `credit_analyzer.py`, `credits.py`, `doctor.py`
160+
- `ledger.py`, `ollama_cmds.py`, `patent.py`, `phase.py`, `plugins.py`
161+
- `profiles.py`, `releaser.py`, `retrieval.py`, `session.py`
162+
163+
**Excluded from mypy strict:**
164+
- `gui/` (requires PySide6)
165+
- `ollama_cmds`, `languages`, `phase`, `cli`, `importer`, `agent.providers.gemini`, `agent.runner`, `profiles`, `toolrules`, `tool_installer`
166+
167+
**VS Code plugin:** Zero integration tests. No test runner configured.
168+
169+
## 4. Known Breakpoints
170+
171+
1. **18 sandbox/lifecycle test failures** — governance template expectations are stale. Severity: medium (blocks CI green).
172+
2. **Tool handlers use raw subprocess**`_run_specsmith()` in `tools.py` shells out to `python -m specsmith <args>`. No structured error handling, no cross-platform abstraction, no typed results.
173+
3. **`commands/__init__.py` is empty** — slash commands documented in AGENTS.md and ARCHITECTURE.md are not implemented.
174+
4. **No agent/runner tests** — the entire REPL loop, tool dispatch, streaming, and session state management is untested.
175+
5. **No provider tests** — all 5 LLM providers have zero unit tests.
176+
6. **No VS Code extension tests** — plugin activation, bridge protocol, panel rendering are all untested.
177+
7. **Retrieval uses term-frequency** — not BM25 as documented in requirements.
178+
8. **pytest WinError 448** — temp directory cleanup fails on Windows. Cosmetic but noisy.
179+
180+
## 5. Gap Summary (ranked by severity)
181+
182+
1. **No agent layer tests** — runner, tools, hooks, skills, providers all untested → high risk for AG2 integration
183+
2. **18 failing sandbox tests** — CI is red → blocks safe development
184+
3. **Empty commands/** — REPL meta-commands not wired → blocks slash command surface
185+
4. **Tool handlers = raw subprocess** — no typed operations → AG2 tools must replace this
186+
5. **No VS Code extension tests** — plugin correctness is assumed, not proven
187+
6. **No AG2 integration** — the entire agent orchestration layer is missing
188+
7. **No eval harness** — cannot measure agent quality
189+
8. **No instinct/memory** — no cross-session learning
190+
9. **No feature flags** — no way to gate capabilities
191+
10. **No server daemon** — no WebSocket path for IDE integration

0 commit comments

Comments
 (0)