tathadn
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 71 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 42 additions & 0 deletions
diff --git a/agents/base.py b/agents/base.py
Lines changed: 70 additions & 0 deletions
diff --git a/agents/coder.py b/agents/coder.py
Lines changed: 22 additions & 15 deletions
diff --git a/agents/orchestrator.py b/agents/orchestrator.py
Lines changed: 20 additions & 13 deletions
diff --git a/agents/planner.py b/agents/planner.py
Lines changed: 20 additions & 12 deletions
@@ -0,0 +1,71 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  lint-and-type:
+    name: Lint & Type
+    runs-on: ubuntu-latest
+    env:
+      ANTHROPIC_API_KEY: sk-ant-dummy
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+          cache: pip
+
+      - name: Install dev dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Ruff check
+        run: ruff check .
+
+      - name: Ruff format check
+        run: ruff format --check .
+
+      - name: Mypy (strict)
+        run: mypy .
+
+  test:
+    name: Tests & Coverage
+    runs-on: ubuntu-latest
+    env:
+      ANTHROPIC_API_KEY: sk-ant-dummy
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+          cache: pip
+
+      - name: Install dev dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Run pytest with coverage
+        run: |
+          pytest tests/ -v \
+            --cov=agents \
+            --cov=evolution \
+            --cov=models \
+            --cov-report=term \
+            --cov-report=xml
+
+      - name: Upload coverage artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-xml
+          path: coverage.xml
+          if-no-files-found: warn
@@ -1,5 +1,7 @@
 # Self-Evolving Code Generator
 
+[![CI](https://github.com/tathadn/self-evolving-codegen/actions/workflows/ci.yml/badge.svg)](https://github.com/tathadn/self-evolving-codegen/actions/workflows/ci.yml)
+
 > V2 of [multi-agent-codegen](https://github.com/tathadn/multi-agent-codegen) — the same pipeline, now with a self-evolving tester.
 
 A multi-agent AI code generation pipeline (LangGraph + Claude) where the **Tester agent autonomously improves its own test generation strategy** over successive generations through self-evaluation, failure analysis, and prompt evolution.
@@ -81,6 +83,46 @@ If a newly evolved prompt causes the overall score to drop by more than 15% rela
 
 ---
 
+## Results
+
+A committed run of the evolution loop lives in [`experiments/micro_run/`](experiments/micro_run/). The chart below is produced by `plot_evolution()` in [`evolution/visualize.py`](evolution/visualize.py) directly from the run's metrics history.
+
+![Evolution performance chart — micro_run](experiments/micro_run/evolution_chart.png)
+
+### Metrics — Generation 0 → Generation 1
+
+Source: [`metrics_gen_0.json`](experiments/micro_run/metrics_gen_0.json), [`metrics_gen_1.json`](experiments/micro_run/metrics_gen_1.json).
+
+| Metric | Gen 0 | Gen 1 | Δ |
+|---|---:|---:|---:|
+| **Overall score** | 0.506 | **0.921** | **+0.415** |
+| Bug detection rate | 0.00 | 1.00 | +1.00 |
+| False failure rate | 0.00 | 0.00 | 0.00 |
+| Redundancy rate | 0.000 | 0.042 | +0.042 |
+| Coverage quality (/10) | 5.0 | 8.5 | +3.5 |
+| Edge case coverage (/10) | 5.0 | 7.5 | +2.5 |
+
+**What the evolver fixed.** Gen 0's overall score was pinned at 0.506 because the LLM-as-Judge could not parse the tester's output at all: the `weaknesses` field in `metrics_gen_0.json` records two `Unterminated string` JSON decode errors. The analyzer surfaced this as a prompt-format failure, and the evolver responded by writing generation 1's prompt with an explicit `test_main.py` output contract, JSON-serialisation escape rules, and dependency pins. With the parse failure gone, the underlying test generation turned out to be strong already: 50/50 tests passing, full bug detection, zero false failures, and coverage quality rising from 5.0 to 8.5.
+
+<details>
+<summary><strong>Diff — <code>tester_gen_0.txt</code> → <code>tester_gen_1.txt</code> (click to expand)</strong></summary>
+
+The evolver preserved the original responsibilities block and added three new constraint sections. The most load-bearing additions:
+
+- **Mandatory output artifact** — "You MUST always produce at least one artifact with filename `test_main.py`". Forces a stable filename the sandbox runner can locate.
+- **JSON serialisation rules** — explicit escaping for backslashes, quotes, and newlines inside test source strings. Directly fixes the `Unterminated string` parse errors seen in gen 0.
+- **Dependency constraints** — pinned allowlist (FastAPI, requests, pytest, `pytest-asyncio`, etc.) matching the sandbox's preinstalled packages.
+
+See [`prompts/tester_gen_0.txt`](prompts/tester_gen_0.txt) and [`prompts/tester_gen_1.txt`](prompts/tester_gen_1.txt) for the full text.
+
+</details>
+
+### Test suite
+
+All tests are pure Python and use [`evolution/mock_data.py`](evolution/mock_data.py), so they incur **zero API cost** and are safe to run in CI without a real Anthropic key (the workflow only sets a dummy `ANTHROPIC_API_KEY`). Coverage is reported by the [`Tests & Coverage`](.github/workflows/ci.yml) job of the CI workflow.
+
+---
+
 ## Quick Start
 
 ```bash
 
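The README's "JSON serialisation rules" bullet concerns test source that is embedded as a string inside a JSON payload. The sketch below illustrates that failure mode and its fix under stated assumptions: the `filename`/`content` payload shape and the helper names are hypothetical (the actual `CodeArtifact` schema is not shown in this diff), and the exact decoder message depends on which unescaped character the parser hits first.

```python
import json

# Hypothetical test source containing the characters that must be escaped
# inside a JSON string: double quotes, a backslash, and newlines.
TEST_SOURCE = 'def test_greeting():\n    assert greet("a\\b") == "hi a\\b"\n'


def naive_payload(source: str) -> str:
    """Embed the source by string concatenation, with no escaping."""
    return '{"filename": "test_main.py", "content": "' + source + '"}'


def safe_payload(source: str) -> str:
    """Let json.dumps escape quotes, backslashes, and newlines."""
    return json.dumps({"filename": "test_main.py", "content": source})


def parses(payload: str) -> bool:
    """Return True if the payload is valid JSON."""
    try:
        json.loads(payload)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    print("naive payload parses:", parses(naive_payload(TEST_SOURCE)))
    print("safe payload parses: ", parses(safe_payload(TEST_SOURCE)))
```

The escaped payload round-trips: decoding it yields the original source text unchanged, which is the property the generation 1 prompt asks the tester to preserve.
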
@@ -0,0 +1,70 @@
+"""Shared base class for pipeline agents.
+
+Every agent in the code-generation pipeline constructs a ``ChatAnthropic`` LLM
+and loads a system prompt from ``prompts/``. ``BaseAgent`` centralises that
+plumbing so concrete agents only declare model, token budget, and prompt name.
+``TesterBaseAgent`` extends it with the generation-aware fallback chain used
+by the self-evolving tester.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import ClassVar
+
+from langchain_anthropic import ChatAnthropic
+
+from config import MAX_TOKENS
+
+_PROMPTS_DIR = Path(__file__).parent.parent / "prompts"
+
+
+class BaseAgent:
+    """Common LLM + system-prompt wiring for pipeline agents.
+
+    Subclasses declare three class-level attributes and inherit the rest:
+
+    - ``model_name``: Anthropic model id (usually sourced from ``config.py``).
+    - ``max_tokens_key``: key into ``config.MAX_TOKENS`` for this agent's cap.
+    - ``prompt_name``: basename (without ``.md``) of the file in ``prompts/``.
+    """
+
+    model_name: ClassVar[str]
+    max_tokens_key: ClassVar[str]
+    prompt_name: ClassVar[str]
+
+    def __init__(self) -> None:
+        self.llm: ChatAnthropic = ChatAnthropic(  # type: ignore[call-arg]
+            model=self.model_name,
+            max_tokens=MAX_TOKENS[self.max_tokens_key],
+        )
+        self.system_prompt: str = self._load_prompt()
+
+    def _load_prompt(self) -> str:
+        """Load ``prompts/{prompt_name}.md`` as the system prompt."""
+        return (_PROMPTS_DIR / f"{self.prompt_name}.md").read_text(encoding="utf-8")
+
+
+class TesterBaseAgent(BaseAgent):
+    """Tester agent with generation-aware prompt resolution.
+
+    Generation 0 uses the original ``prompts/tester.md``. Generation N > 0
+    uses ``prompts/tester_gen_{N}.txt``, falling back to the nearest earlier
+    generation that exists, then to the base prompt.
+    """
+
+    prompt_name: ClassVar[str] = "tester"
+    generation: int
+
+    def __init__(self, generation: int = 0) -> None:
+        self.generation = generation
+        super().__init__()
+
+    def _load_prompt(self) -> str:
+        if self.generation == 0:
+            return (_PROMPTS_DIR / "tester.md").read_text(encoding="utf-8")
+        for gen in range(self.generation, 0, -1):
+            path = _PROMPTS_DIR / f"tester_gen_{gen}.txt"
+            if path.exists():
+                return path.read_text(encoding="utf-8")
+        return (_PROMPTS_DIR / "tester.md").read_text(encoding="utf-8")
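
The lookup order in `TesterBaseAgent._load_prompt` can be exercised without constructing an LLM. The following is a standalone sketch rather than the PR's code: `resolve_tester_prompt` and `demo` are hypothetical helpers, and the prompt files are written to a temporary directory so the example is self-contained.

```python
import tempfile
from pathlib import Path


def resolve_tester_prompt(prompts_dir: Path, generation: int) -> Path:
    """Return the prompt file a tester at `generation` would load.

    Hypothetical helper mirroring the lookup order in the diff above:
    tester_gen_N.txt, then the nearest earlier generation, then tester.md.
    """
    for gen in range(generation, 0, -1):
        candidate = prompts_dir / f"tester_gen_{gen}.txt"
        if candidate.exists():
            return candidate
    # Generation 0, or no evolved prompt on disk, falls through to the base prompt.
    return prompts_dir / "tester.md"


def demo() -> list:
    """Build a throwaway prompts directory and resolve a few generations."""
    with tempfile.TemporaryDirectory() as tmp:
        prompts = Path(tmp)
        (prompts / "tester.md").write_text("base prompt", encoding="utf-8")
        (prompts / "tester_gen_1.txt").write_text("gen 1 prompt", encoding="utf-8")
        # Generations 2 and 3 are deliberately missing.
        return [resolve_tester_prompt(prompts, g).name for g in (0, 1, 3)]


if __name__ == "__main__":
    print(demo())
```

Because `range(0, 0, -1)` is empty, generation 0 reaches the base prompt through the final `return` in this sketch; the explicit `generation == 0` branch in the diff makes the same outcome visible at a glance.
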
@@ -1,34 +1,40 @@
 from __future__ import annotations
 
-from pathlib import Path
+import json
+from typing import Any, ClassVar
 
-from langchain_anthropic import ChatAnthropic
 from langchain_core.messages import HumanMessage, SystemMessage
 from pydantic import BaseModel
 
-from config import CODER_MODEL, MAX_TOKENS
+from agents.base import BaseAgent
+from config import CODER_MODEL
 from models.schemas import AgentState, CodeArtifact
 
-_PROMPT = (Path(__file__).parent.parent / "prompts" / "coder.md").read_text()
-
 
 class ArtifactList(BaseModel):
     artifacts: list[CodeArtifact]
 
 
-def get_llm() -> ChatAnthropic:
-    return ChatAnthropic(
-        model=CODER_MODEL,
-        max_tokens=MAX_TOKENS["coder"],
-    )
+class CoderAgent(BaseAgent):
+    model_name: ClassVar[str] = CODER_MODEL
+    max_tokens_key: ClassVar[str] = "coder"
+    prompt_name: ClassVar[str] = "coder"
+
+
+_agent: CoderAgent | None = None
+
+
+def _get_agent() -> CoderAgent:
+    global _agent
+    if _agent is None:
+        _agent = CoderAgent()
+    return _agent
 
 
 def _build_prompt(state: AgentState) -> str:
     parts = [f"User request: {state.user_request}"]
 
     if state.plan:
-        import json
-
         parts.append(f"\nImplementation plan:\n{json.dumps(state.plan.model_dump(), indent=2)}")
 
     if state.review and not state.review.approved:
@@ -53,12 +59,13 @@ def _build_prompt(state: AgentState) -> str:
     return "\n".join(parts)
 
 
-def coder_node(state: AgentState) -> dict:
+def coder_node(state: AgentState) -> dict[str, Any]:
     """Generates or revises code artifacts based on the plan and feedback."""
-    llm = get_llm().with_structured_output(ArtifactList)
+    agent = _get_agent()
+    llm = agent.llm.with_structured_output(ArtifactList)
 
     messages = [
-        SystemMessage(content=_PROMPT),
+        SystemMessage(content=agent.system_prompt),
         HumanMessage(content=_build_prompt(state)),
     ]
 
 
@@ -1,29 +1,36 @@
 from __future__ import annotations
 
-from pathlib import Path
+from typing import Any, ClassVar
 
-from langchain_anthropic import ChatAnthropic
 from langchain_core.messages import HumanMessage, SystemMessage
 
-from config import MAX_TOKENS, ORCHESTRATOR_MODEL
+from agents.base import BaseAgent
+from config import ORCHESTRATOR_MODEL
 from models.schemas import AgentState, TaskStatus
 
-_PROMPT = (Path(__file__).parent.parent / "prompts" / "orchestrator.md").read_text()
 
+class OrchestratorAgent(BaseAgent):
+    model_name: ClassVar[str] = ORCHESTRATOR_MODEL
+    max_tokens_key: ClassVar[str] = "orchestrator"
+    prompt_name: ClassVar[str] = "orchestrator"
 
-def get_llm() -> ChatAnthropic:
-    return ChatAnthropic(
-        model=ORCHESTRATOR_MODEL,
-        max_tokens=MAX_TOKENS["orchestrator"],
-    )
 
+_agent: OrchestratorAgent | None = None
 
-def orchestrator_node(state: AgentState) -> dict:
+
+def _get_agent() -> OrchestratorAgent:
+    global _agent
+    if _agent is None:
+        _agent = OrchestratorAgent()
+    return _agent
+
+
+def orchestrator_node(state: AgentState) -> dict[str, Any]:
     """Entry point: interprets the user request and sets the initial status."""
-    llm = get_llm()
+    agent = _get_agent()
 
     messages = [
-        SystemMessage(content=_PROMPT),
+        SystemMessage(content=agent.system_prompt),
         HumanMessage(
             content=(
                 f"User request: {state.user_request}\n\n"
@@ -33,7 +40,7 @@ def orchestrator_node(state: AgentState) -> dict:
         ),
     ]
 
-    response = llm.invoke(messages)
+    response = agent.llm.invoke(messages)
 
     return {
         "messages": [response],
 
@@ -1,30 +1,38 @@
 from __future__ import annotations
 
 import json
-from pathlib import Path
+from typing import Any, ClassVar
 
-from langchain_anthropic import ChatAnthropic
 from langchain_core.messages import HumanMessage, SystemMessage
 
-from config import MAX_TOKENS, PLANNER_MODEL
+from agents.base import BaseAgent
+from config import PLANNER_MODEL
 from models.schemas import AgentState, Plan
 
-_PROMPT = (Path(__file__).parent.parent / "prompts" / "planner.md").read_text()
 
+class PlannerAgent(BaseAgent):
+    model_name: ClassVar[str] = PLANNER_MODEL
+    max_tokens_key: ClassVar[str] = "planner"
+    prompt_name: ClassVar[str] = "planner"
 
-def get_llm() -> ChatAnthropic:
-    return ChatAnthropic(
-        model=PLANNER_MODEL,
-        max_tokens=MAX_TOKENS["planner"],
-    )
 
+_agent: PlannerAgent | None = None
 
-def planner_node(state: AgentState) -> dict:
+
+def _get_agent() -> PlannerAgent:
+    global _agent
+    if _agent is None:
+        _agent = PlannerAgent()
+    return _agent
+
+
+def planner_node(state: AgentState) -> dict[str, Any]:
     """Produces a structured implementation plan for the user's request."""
-    llm = get_llm().with_structured_output(Plan)
+    agent = _get_agent()
+    llm = agent.llm.with_structured_output(Plan)
 
     messages = [
-        SystemMessage(content=_PROMPT),
+        SystemMessage(content=agent.system_prompt),
         HumanMessage(content=f"Create an implementation plan for: {state.user_request}"),
     ]