Commit b3ede36

feat: portfolio polish — CI, BaseAgent refactor, results in README
- Add .github/workflows/ci.yml running ruff + mypy strict + pytest with coverage (no API key needed — all tests use mock data; dummy ANTHROPIC_API_KEY in env)
- Extract agents/base.py with BaseAgent and TesterBaseAgent; refactor orchestrator/planner/coder/reviewer/tester to subclass it via lazy singleton while preserving the public *_node signatures used by graph/workflow.py
- Commit experiments/micro_run/ (2 generations, overall score 0.506 → 0.921) and embed the evolution chart plus a gen 0 → gen 1 metrics table and prompt-diff summary in README's new Results section
- Add pytest-cov to dev deps; add CI status badge to README
- Make repo pass mypy strict on 30 source files: dict[str, Any] everywhere, -> None on all test functions, targeted type: ignore at third-party boundaries (ChatAnthropic call-args, langgraph StateGraph generics), sandbox/__init__.py to fix duplicate-module discovery, typed rate_limited_call
- Add module docstrings to evolution/analyzer.py and evolution/evolver.py
1 parent 7d03403 commit b3ede36

33 files changed

Lines changed: 588 additions & 176 deletions

.github/workflows/ci.yml

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  lint-and-type:
+    name: Lint & Type
+    runs-on: ubuntu-latest
+    env:
+      ANTHROPIC_API_KEY: sk-ant-dummy
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+          cache: pip
+
+      - name: Install dev dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Ruff check
+        run: ruff check .
+
+      - name: Ruff format check
+        run: ruff format --check .
+
+      - name: Mypy (strict)
+        run: mypy .
+
+  test:
+    name: Tests & Coverage
+    runs-on: ubuntu-latest
+    env:
+      ANTHROPIC_API_KEY: sk-ant-dummy
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+          cache: pip
+
+      - name: Install dev dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Run pytest with coverage
+        run: |
+          pytest tests/ -v \
+            --cov=agents \
+            --cov=evolution \
+            --cov=models \
+            --cov-report=term \
+            --cov-report=xml
+
+      - name: Upload coverage artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-xml
+          path: coverage.xml
+          if-no-files-found: warn
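To run the same checks locally before pushing, a small Python driver along the following lines might be enough. This is only a sketch: the command lists are copied from the workflow in this file, while `CI_COMMANDS`, `run_checks`, and `run_subprocess` are illustrative names that do not exist in the repository. It assumes the dev dependencies are already installed, and it does not set the dummy `ANTHROPIC_API_KEY` that the workflow exports.

```python
from __future__ import annotations

import subprocess
from typing import Callable, Sequence

# Commands copied from the lint-and-type and test jobs of the workflow.
CI_COMMANDS: list[list[str]] = [
    ["ruff", "check", "."],
    ["ruff", "format", "--check", "."],
    ["mypy", "."],
    [
        "pytest", "tests/", "-v",
        "--cov=agents", "--cov=evolution", "--cov=models",
        "--cov-report=term", "--cov-report=xml",
    ],
]


def run_checks(run: Callable[[Sequence[str]], int]) -> int:
    """Run each command in order and return the first non-zero exit code."""
    for command in CI_COMMANDS:
        code = run(command)
        if code != 0:
            return code
    return 0


def run_subprocess(command: Sequence[str]) -> int:
    """Execute one command for real and report its exit code."""
    return subprocess.run(list(command), check=False).returncode
```

Calling `run_checks(run_subprocess)` from the repository root would execute the four commands in sequence. Unlike the workflow, which runs its two jobs independently, this sketch stops at the first failure. Passing the runner as a parameter keeps the ordering logic testable without invoking any tool.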

README.md

Lines changed: 42 additions & 0 deletions
@@ -1,5 +1,7 @@
 # Self-Evolving Code Generator

+[![CI](https://github.com/tathadn/self-evolving-codegen/actions/workflows/ci.yml/badge.svg)](https://github.com/tathadn/self-evolving-codegen/actions/workflows/ci.yml)
+
 > V2 of [multi-agent-codegen](https://github.com/tathadn/multi-agent-codegen) — the same pipeline, now with a self-evolving tester.

 A multi-agent AI code generation pipeline (LangGraph + Claude) where the **Tester agent autonomously improves its own test generation strategy** over successive generations through self-evaluation, failure analysis, and prompt evolution.
@@ -81,6 +83,46 @@ If a newly evolved prompt causes the overall score to drop by more than 15% rela

 ---

+## Results
+
+A committed run of the evolution loop lives in [`experiments/micro_run/`](experiments/micro_run/). The chart below is produced by `plot_evolution()` in [`evolution/visualize.py`](evolution/visualize.py) directly from the run's metrics history.
+
+![Evolution performance chart — micro_run](experiments/micro_run/evolution_chart.png)
+
+### Metrics — Generation 0 → Generation 1
+
+Source: [`metrics_gen_0.json`](experiments/micro_run/metrics_gen_0.json), [`metrics_gen_1.json`](experiments/micro_run/metrics_gen_1.json).
+
+| Metric | Gen 0 | Gen 1 | Δ |
+|---|---:|---:|---:|
+| **Overall score** | 0.506 | **0.921** | **+0.415** |
+| Bug detection rate | 0.00 | 1.00 | +1.00 |
+| False failure rate | 0.00 | 0.00 | 0.00 |
+| Redundancy rate | 0.000 | 0.042 | +0.042 |
+| Coverage quality (/10) | 5.0 | 8.5 | +3.5 |
+| Edge case coverage (/10) | 5.0 | 7.5 | +2.5 |
+
+**What the evolver fixed.** Gen 0's effective score was pinned at 0.506 because the LLM-as-Judge could not parse the tester's output at all — `weaknesses` in `metrics_gen_0.json` show two `Unterminated string` JSON decode errors. The analyzer surfaced this as a prompt-format failure; the evolver responded by writing generation 1's prompt with an explicit `test_main.py` output contract, JSON-serialisation escape rules, and dependency pins. With the format-parse failure gone, the underlying test generation was already strong: 50/50 tests passing, full bug detection, zero false failures, and coverage jumping from 5.0 to 8.5.
+
+<details>
+<summary><strong>Diff — <code>tester_gen_0.txt</code> → <code>tester_gen_1.txt</code> (click to expand)</strong></summary>
+
+The evolver preserved the original responsibilities block and added three new constraint sections. The most load-bearing additions:
+
+- **Mandatory output artifact** — "You MUST always produce at least one artifact with filename `test_main.py`". Forces a stable filename the sandbox runner can locate.
+- **JSON serialisation rules** — explicit escaping for backslashes, quotes, and newlines inside test source strings. Directly fixes the `Unterminated string` parse errors seen in gen 0.
+- **Dependency constraints** — pinned allowlist (FastAPI, requests, pytest, `pytest-asyncio`, etc.) matching the sandbox's preinstalled packages.
+
+See [`prompts/tester_gen_0.txt`](prompts/tester_gen_0.txt) and [`prompts/tester_gen_1.txt`](prompts/tester_gen_1.txt) for the full text.
+
+</details>
+
+### Test suite
+
+All tests are pure-Python and use [`evolution/mock_data.py`](evolution/mock_data.py): **zero API cost**, safe to run in CI without an Anthropic key. Coverage is reported in the [`Tests & Coverage`](.github/workflows/ci.yml) job of the CI workflow.
+
+---
+
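The Results text attributes the gen 0 failure to test source embedded in JSON without proper escaping. The sketch below illustrates that general class of problem and why explicit serialisation rules help. The payload shape (`filename` and `content` keys) and the sample test source are assumptions made for illustration, not the repository's actual schema, and the naive payload here is not claimed to reproduce the exact `Unterminated string` message recorded in `metrics_gen_0.json`.

```python
import json

# Hypothetical test file whose source contains quotes, a backslash, and newlines.
test_source = 'def test_greeting():\n    assert greet("Ada") == "Hello, Ada\\n"\n'

# Naive string concatenation leaves raw newlines and unescaped quotes in the JSON.
naive_payload = '{"filename": "test_main.py", "content": "' + test_source + '"}'

# json.dumps escapes quotes, backslashes, and newlines inside the string value.
safe_payload = json.dumps({"filename": "test_main.py", "content": test_source})


def parses(payload: str) -> bool:
    """Return True if the payload is valid JSON, False on a decode error."""
    try:
        json.loads(payload)
    except json.JSONDecodeError:
        return False
    return True
```

Here `parses(naive_payload)` is false while `parses(safe_payload)` is true, and decoding the safe payload returns the original source unchanged.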
 ## Quick Start

 ```bash

agents/base.py

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+"""Shared base class for pipeline agents.
+
+Every agent in the code-generation pipeline constructs a ``ChatAnthropic`` LLM
+and loads a system prompt from ``prompts/``. ``BaseAgent`` centralises that
+plumbing so concrete agents only declare model, token budget, and prompt name.
+``TesterBaseAgent`` extends it with the generation-aware fallback chain used
+by the self-evolving tester.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import ClassVar
+
+from langchain_anthropic import ChatAnthropic
+
+from config import MAX_TOKENS
+
+_PROMPTS_DIR = Path(__file__).parent.parent / "prompts"
+
+
+class BaseAgent:
+    """Common LLM + system-prompt wiring for pipeline agents.
+
+    Subclasses declare three class-level attributes and inherit the rest:
+
+    - ``model_name``: Anthropic model id (usually sourced from ``config.py``).
+    - ``max_tokens_key``: key into ``config.MAX_TOKENS`` for this agent's cap.
+    - ``prompt_name``: basename (without ``.md``) of the file in ``prompts/``.
+    """
+
+    model_name: ClassVar[str]
+    max_tokens_key: ClassVar[str]
+    prompt_name: ClassVar[str]
+
+    def __init__(self) -> None:
+        self.llm: ChatAnthropic = ChatAnthropic(  # type: ignore[call-arg]
+            model=self.model_name,
+            max_tokens=MAX_TOKENS[self.max_tokens_key],
+        )
+        self.system_prompt: str = self._load_prompt()
+
+    def _load_prompt(self) -> str:
+        """Load ``prompts/{prompt_name}.md`` as the system prompt."""
+        return (_PROMPTS_DIR / f"{self.prompt_name}.md").read_text()
+
+
+class TesterBaseAgent(BaseAgent):
+    """Tester agent with generation-aware prompt resolution.
+
+    Generation 0 uses the original ``prompts/tester.md``. Generation N > 0
+    uses ``prompts/tester_gen_{N}.txt``, falling back to the nearest earlier
+    generation that exists, then to the base prompt.
+    """
+
+    prompt_name: ClassVar[str] = "tester"
+    generation: int
+
+    def __init__(self, generation: int = 0) -> None:
+        self.generation = generation
+        super().__init__()
+
+    def _load_prompt(self) -> str:
+        if self.generation == 0:
+            return (_PROMPTS_DIR / "tester.md").read_text()
+        for gen in range(self.generation, 0, -1):
+            path = _PROMPTS_DIR / f"tester_gen_{gen}.txt"
+            if path.exists():
+                return path.read_text()
+        return (_PROMPTS_DIR / "tester.md").read_text()
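The fallback chain in `TesterBaseAgent._load_prompt` can be exercised without constructing an LLM. The sketch below re-implements only the lookup order against a temporary directory. `resolve_tester_prompt` and `demo` are hypothetical helpers written for this illustration and are not part of `agents/base.py`; the file names follow the `tester.md` / `tester_gen_{N}.txt` convention from the docstring.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def resolve_tester_prompt(prompts_dir: Path, generation: int) -> Path:
    """Return the prompt file a tester of the given generation would load.

    Tries the requested generation first, then each earlier evolved
    generation, and finally falls back to the base ``tester.md``.
    """
    for gen in range(generation, 0, -1):
        candidate = prompts_dir / f"tester_gen_{gen}.txt"
        if candidate.exists():
            return candidate
    return prompts_dir / "tester.md"


def demo() -> list[str]:
    """Resolve generations 0 to 4 when only gen 1 and gen 3 prompts exist."""
    with tempfile.TemporaryDirectory() as tmp:
        prompts_dir = Path(tmp)
        (prompts_dir / "tester.md").write_text("base prompt")
        (prompts_dir / "tester_gen_1.txt").write_text("gen 1 prompt")
        (prompts_dir / "tester_gen_3.txt").write_text("gen 3 prompt")
        return [resolve_tester_prompt(prompts_dir, g).name for g in range(5)]
```

With only the gen 1 and gen 3 files present, generation 2 resolves to the gen 1 prompt and generation 4 to the gen 3 prompt, which is the "nearest earlier generation" behaviour the docstring describes.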

agents/coder.py

Lines changed: 22 additions & 15 deletions
@@ -1,34 +1,40 @@
 from __future__ import annotations

-from pathlib import Path
+import json
+from typing import Any, ClassVar

-from langchain_anthropic import ChatAnthropic
 from langchain_core.messages import HumanMessage, SystemMessage
 from pydantic import BaseModel

-from config import CODER_MODEL, MAX_TOKENS
+from agents.base import BaseAgent
+from config import CODER_MODEL
 from models.schemas import AgentState, CodeArtifact

-_PROMPT = (Path(__file__).parent.parent / "prompts" / "coder.md").read_text()
-

 class ArtifactList(BaseModel):
     artifacts: list[CodeArtifact]


-def get_llm() -> ChatAnthropic:
-    return ChatAnthropic(
-        model=CODER_MODEL,
-        max_tokens=MAX_TOKENS["coder"],
-    )
+class CoderAgent(BaseAgent):
+    model_name: ClassVar[str] = CODER_MODEL
+    max_tokens_key: ClassVar[str] = "coder"
+    prompt_name: ClassVar[str] = "coder"
+
+
+_agent: CoderAgent | None = None
+
+
+def _get_agent() -> CoderAgent:
+    global _agent
+    if _agent is None:
+        _agent = CoderAgent()
+    return _agent


 def _build_prompt(state: AgentState) -> str:
     parts = [f"User request: {state.user_request}"]

     if state.plan:
-        import json
-
         parts.append(f"\nImplementation plan:\n{json.dumps(state.plan.model_dump(), indent=2)}")

     if state.review and not state.review.approved:
@@ -53,12 +59,13 @@ def _build_prompt(state: AgentState) -> str:
     return "\n".join(parts)


-def coder_node(state: AgentState) -> dict:
+def coder_node(state: AgentState) -> dict[str, Any]:
     """Generates or revises code artifacts based on the plan and feedback."""
-    llm = get_llm().with_structured_output(ArtifactList)
+    agent = _get_agent()
+    llm = agent.llm.with_structured_output(ArtifactList)

     messages = [
-        SystemMessage(content=_PROMPT),
+        SystemMessage(content=agent.system_prompt),
         HumanMessage(content=_build_prompt(state)),
     ]

agents/orchestrator.py

Lines changed: 20 additions & 13 deletions
@@ -1,29 +1,36 @@
 from __future__ import annotations

-from pathlib import Path
+from typing import Any, ClassVar

-from langchain_anthropic import ChatAnthropic
 from langchain_core.messages import HumanMessage, SystemMessage

-from config import MAX_TOKENS, ORCHESTRATOR_MODEL
+from agents.base import BaseAgent
+from config import ORCHESTRATOR_MODEL
 from models.schemas import AgentState, TaskStatus

-_PROMPT = (Path(__file__).parent.parent / "prompts" / "orchestrator.md").read_text()

+class OrchestratorAgent(BaseAgent):
+    model_name: ClassVar[str] = ORCHESTRATOR_MODEL
+    max_tokens_key: ClassVar[str] = "orchestrator"
+    prompt_name: ClassVar[str] = "orchestrator"

-def get_llm() -> ChatAnthropic:
-    return ChatAnthropic(
-        model=ORCHESTRATOR_MODEL,
-        max_tokens=MAX_TOKENS["orchestrator"],
-    )

+_agent: OrchestratorAgent | None = None

-def orchestrator_node(state: AgentState) -> dict:
+
+def _get_agent() -> OrchestratorAgent:
+    global _agent
+    if _agent is None:
+        _agent = OrchestratorAgent()
+    return _agent
+
+
+def orchestrator_node(state: AgentState) -> dict[str, Any]:
     """Entry point: interprets the user request and sets the initial status."""
-    llm = get_llm()
+    agent = _get_agent()

     messages = [
-        SystemMessage(content=_PROMPT),
+        SystemMessage(content=agent.system_prompt),
         HumanMessage(
             content=(
                 f"User request: {state.user_request}\n\n"
@@ -33,7 +40,7 @@ def orchestrator_node(state: AgentState) -> dict:
         ),
     ]

-    response = llm.invoke(messages)
+    response = agent.llm.invoke(messages)

     return {
         "messages": [response],

agents/planner.py

Lines changed: 20 additions & 12 deletions
@@ -1,30 +1,38 @@
 from __future__ import annotations

 import json
-from pathlib import Path
+from typing import Any, ClassVar

-from langchain_anthropic import ChatAnthropic
 from langchain_core.messages import HumanMessage, SystemMessage

-from config import MAX_TOKENS, PLANNER_MODEL
+from agents.base import BaseAgent
+from config import PLANNER_MODEL
 from models.schemas import AgentState, Plan

-_PROMPT = (Path(__file__).parent.parent / "prompts" / "planner.md").read_text()

+class PlannerAgent(BaseAgent):
+    model_name: ClassVar[str] = PLANNER_MODEL
+    max_tokens_key: ClassVar[str] = "planner"
+    prompt_name: ClassVar[str] = "planner"

-def get_llm() -> ChatAnthropic:
-    return ChatAnthropic(
-        model=PLANNER_MODEL,
-        max_tokens=MAX_TOKENS["planner"],
-    )

+_agent: PlannerAgent | None = None

-def planner_node(state: AgentState) -> dict:
+
+def _get_agent() -> PlannerAgent:
+    global _agent
+    if _agent is None:
+        _agent = PlannerAgent()
+    return _agent
+
+
+def planner_node(state: AgentState) -> dict[str, Any]:
     """Produces a structured implementation plan for the user's request."""
-    llm = get_llm().with_structured_output(Plan)
+    agent = _get_agent()
+    llm = agent.llm.with_structured_output(Plan)

     messages = [
-        SystemMessage(content=_PROMPT),
+        SystemMessage(content=agent.system_prompt),
         HumanMessage(content=f"Create an implementation plan for: {state.user_request}"),
     ]
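After this refactor each concrete agent is three `ClassVar` declarations on top of `BaseAgent`. The sketch below shows that declaration pattern in isolation, under stated assumptions: `FakeLLM`, `SketchBaseAgent`, `SummariserAgent`, the in-memory `MAX_TOKENS` and `PROMPTS` tables, and the model id string are all hypothetical stand-ins so the example runs without `langchain_anthropic`, `config.py`, or the `prompts/` directory.

```python
from __future__ import annotations

from typing import ClassVar

# Hypothetical stand-ins for config.MAX_TOKENS and the files under prompts/.
MAX_TOKENS: dict[str, int] = {"summariser": 1024}
PROMPTS: dict[str, str] = {"summariser": "Summarise the pipeline run."}


class FakeLLM:
    """Placeholder for ChatAnthropic so the sketch needs no API key."""

    def __init__(self, model: str, max_tokens: int) -> None:
        self.model = model
        self.max_tokens = max_tokens


class SketchBaseAgent:
    """Reads three class-level attributes to wire up the LLM and prompt."""

    model_name: ClassVar[str]
    max_tokens_key: ClassVar[str]
    prompt_name: ClassVar[str]

    def __init__(self) -> None:
        self.llm = FakeLLM(self.model_name, MAX_TOKENS[self.max_tokens_key])
        self.system_prompt = PROMPTS[self.prompt_name]


class SummariserAgent(SketchBaseAgent):
    """A new agent only declares its model, token budget key, and prompt."""

    model_name: ClassVar[str] = "example-model-id"
    max_tokens_key: ClassVar[str] = "summariser"
    prompt_name: ClassVar[str] = "summariser"
```

Instantiating `SummariserAgent()` yields an object whose `llm` and `system_prompt` were populated entirely by the base class, which is the division of labour the `BaseAgent` docstring describes.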
