AI-native-Systems-Research
diff --git a/‎.github/workflows/tests.yml‎
Lines changed: 59 additions & 0 deletions b/‎.github/workflows/tests.yml‎
Lines changed: 59 additions & 0 deletions
diff --git a/‎CLAUDE.md‎
Lines changed: 82 additions & 0 deletions b/‎CLAUDE.md‎
Lines changed: 82 additions & 0 deletions
diff --git a/‎README.md‎
Lines changed: 10 additions & 2 deletions b/‎README.md‎
Lines changed: 10 additions & 2 deletions
diff --git a/‎bin/nous-execute-stop‎
Lines changed: 85 additions & 0 deletions b/‎bin/nous-execute-stop‎
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,59 @@
+name: Tests
+
+on:
+  push:
+    branches: [main, reflective]
+  # Trigger on every PR regardless of base branch so contributors get
+  # CI feedback on long-running integration branches (e.g. tracking-N)
+  # in addition to PRs targeting main/reflective.
+  pull_request:
+
+# Cancel in-flight runs on the same PR/branch when a new push lands.
+# Only safe context expressions used here: github.workflow, github.ref,
+# github.event_name. None come from user-controlled input.
+concurrency:
+  group: tests-${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+
+jobs:
+  pytest:
+    name: pytest (Python ${{ matrix.python-version }})
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12"]
+
+    # No LLM API keys in the env. The no-live-LLM project principle
+    # (CLAUDE.md, tests/CLAUDE.md, tests/conftest.py autouse guard) says
+    # tests must mock LLMs, never call them. This is the outer line of
+    # defence; the conftest guard is the inner.
+    env:
+      OPENAI_API_KEY: ""
+      OPENAI_BASE_URL: ""
+      ANTHROPIC_API_KEY: ""
+
+    steps:
+      - uses: actions/checkout@v5
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: pip
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Run pytest
+        run: pytest -ra --strict-markers
+
+      - name: Upload pytest cache on failure
+        if: failure()
+        uses: actions/upload-artifact@v4
+        with:
+          name: pytest-cache-py${{ matrix.python-version }}
+          path: .pytest_cache/
+          if-no-files-found: ignore
@@ -0,0 +1,82 @@
+# Nous — project conventions
+
+This file is auto-loaded by Claude Code on every session in this repo. The
+rules below are non-negotiable; when they conflict with general AI/coding
+defaults, **the rules here win**.
+
+## 🚫 Tests must NEVER make live LLM calls
+
+**No unit, integration, or end-to-end test in this repo may make a real
+API call to Anthropic, OpenAI, or any other LLM provider. Period.**
+
+Why this is a hard rule:
+- Tests run on every CI build, every contributor's laptop, and every PR
+  rebase. Live LLM calls would burn tokens for no signal — the test
+  result depends on what the model said today, not on the code under test.
+- Token budget for `nous` is mission-critical. We refuse to spend it on
+  CI churn.
+- Live calls are non-deterministic. A flaky test from a model rephrasing
+  itself is worse than no test.
+
+**How to test correctly:**
+
+| Code under test | How to mock |
+|---|---|
+| `LLMDispatcher` | Pass `completion_fn=` in the constructor — a callable that returns canned `chat.completions`-shaped objects. See `tests/test_llm_dispatch.py`'s `_make_fake_completion` for the pattern. |
+| `CLIDispatcher` (claude -p subprocess) | Patch `orchestrator.cli_dispatch.subprocess.run` — return a `subprocess.CompletedProcess` with the JSON the test wants. See `tests/test_cli_dispatch.py`. |
+| `SDKDispatcher` (Claude Agent SDK) | Pass `sdk_runner=` in the constructor — a callable returning `SDKResult`. See `tests/test_sdk_dispatch.py`'s `_ScriptedRunner`. |
+| `InlineDispatcher` | Set up the `.nous_response_*` signal file in tmp_path before calling dispatch. |
+| Stub-driven flows | Use `StubDispatcher` from `orchestrator.dispatch` — it produces valid schema-conformant artifacts with no LLM at all. |
+
+**Active enforcement:** `tests/conftest.py` installs an autouse fixture
+(`block_live_llm_calls`) that:
+1. Strips `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` from the env so any
+   accidental real-client construction fails loudly instead of silently
+   billing.
+2. Patches `urllib.request.urlopen` to refuse `api.anthropic.com`,
+   `api.openai.com`, and `api.litellm.ai` hosts.
+3. Patches `claude_agent_sdk.query` (when installed) to a hard-fail.
+
+If a test triggers any of these guards, the fix is to inject a fake at
+the dispatcher's seam — never to disable the guard. The guards are the
+backstop; the seams are the contract.
+
+## Behavioral testing only
+
+When the test mock is in place, write **behavioral** tests:
+- ✓ Assert what's on disk after `dispatcher.dispatch(...)`.
+- ✓ Assert metrics rows in `llm_metrics.jsonl`.
+- ✓ Assert artifacts match a JSON Schema.
+- ✗ Don't assert which method was called on the mock.
+- ✗ Don't assert argv shape, internal helper invocation, or attribute access.
+
+The seam is the contract; the implementation is free to evolve.
+
+## Token-budget discipline (production code)
+
+Beyond tests, Nous itself must be frugal with tokens:
+- **Methodology stays in `CLAUDE.md`** (auto-loaded by Claude Code), not
+  in per-call prompts. The thin templates in `prompts/methodology/*_thin.md`
+  carry only per-iteration context.
+- **System blocks are cached** (`cache_control: ephemeral`). Any code
+  that constructs an SDK call with a static system_prompt should rely
+  on this, and any change that breaks within-iteration cache locality
+  must be measured (`nous cost --cache-stats`) and justified.
+- **Read-only mapping uses Explore subagents**, not Opus. See
+  `orchestrator/explore_design.py`.
+
+## PR workflow (project owner: @sriumcp)
+
+1. Branch off `upstream/reflective` (NOT `main`).
+2. Push to `origin` (the fork at `sriumcp/agentic-strategy-evolution`).
+3. Open PR with base `upstream/reflective`, head `sriumcp:<branch>`.
+4. PR body links the issue with `Closes #N` (or `Refs #N` for partials).
+5. Stack PRs when one logical change builds on another rather than waiting
+   for merge — see `docs/plans/CHECKPOINT.md` for the pattern.
+
+## See also
+
+- `docs/contributing/workflow.md` — full workflow doc.
+- `docs/security.md` — permission policy (#135).
+- `docs/architecture.md` — internals.
+- `docs/plans/CHECKPOINT.md` — current state of the #120 epic.
@@ -80,17 +80,25 @@ If you're using Anthropic directly via a LiteLLM proxy, point both vars at the p
 ### 1. Install Nous
 
 ```bash
-pip install "git+https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git"
+pip install "git+https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git@reflective"
 ```
 
+`reflective` is the active integration branch — that's where new work lands first. `main` lags slightly behind. To pin to a release, replace `@reflective` with a tag (`@v0.2.0`).
+
 For development (editable install with test dependencies):
 
 ```bash
-git clone https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git
+git clone -b reflective https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git
 cd agentic-strategy-evolution
 pip install -e ".[dev]"
 ```
 
+For the SDK-based dispatcher (`--agent sdk`, see `docs/architecture.md`), also install the optional `[sdk]` extra:
+
+```bash
+pip install -e ".[dev,sdk]"
+```
+
 ### 2. Configure models
 
 Two LLM calls per iteration, both via `claude -p`:
 
@@ -0,0 +1,85 @@
+#!/usr/bin/env python3
+"""Stop hook for the Nous executor session (issue #129).
+
+Runs after every Claude Code agent turn. Returns:
+    exit 0 → allow the agent to stop (its work is done).
+    exit 2 → block stopping; the structured reason on stderr is fed back
+             into the agent's conversation so it can react.
+
+A "stop is allowed" decision needs two pieces of evidence on disk:
+    1. ``$NOUS_ITER_DIR/principle_updates.json`` exists.
+    2. ``nous validate execution --dir $NOUS_ITER_DIR`` returns ``status: pass``.
+
+Both are deterministic — no LLM judgment, no agent self-assessment. The
+hook pairs with the ``/goal``-driven loop (#124) but is preferred wherever
+the success criterion is a schema check, because it's cheaper and more
+reliable than a Haiku evaluator.
+
+Configured per-campaign in ``.claude/settings.json`` (see #135). The
+orchestrator sets ``NOUS_ITER_DIR`` before launching the executor session.
+"""
+from __future__ import annotations
+
+import os
+import sys
+from pathlib import Path
+
+# When invoked as a Claude Code hook, the script's directory may not be
+# on PYTHONPATH. Add the repo root so `orchestrator.validate` imports.
+_HERE = Path(__file__).resolve().parent
+_REPO_ROOT = _HERE.parent
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+
+from orchestrator.validate import validate_execution  # noqa: E402
+
+
+_OK = 0
+_BLOCK = 2
+
+
+def main() -> int:
+    iter_dir_str = os.environ.get("NOUS_ITER_DIR")
+    if not iter_dir_str:
+        print(
+            "NOUS_ITER_DIR is not set. The orchestrator should export this "
+            "variable before launching the executor session.",
+            file=sys.stderr,
+        )
+        return _BLOCK
+
+    iter_dir = Path(iter_dir_str)
+    if not iter_dir.is_dir():
+        print(
+            f"iter_dir does not exist: {iter_dir}. NOUS_ITER_DIR is "
+            f"misconfigured or the executor was launched before init.",
+            file=sys.stderr,
+        )
+        return _BLOCK
+
+    principles = iter_dir / "principle_updates.json"
+    if not principles.exists():
+        print(
+            f"principle_updates.json is missing from {iter_dir}. "
+            f"Write the file (a JSON list, possibly empty: []) before stopping.",
+            file=sys.stderr,
+        )
+        return _BLOCK
+
+    result = validate_execution(iter_dir)
+    if result.get("status") != "pass":
+        errors = result.get("errors", [])
+        print(
+            f"validation failed for {iter_dir} ({len(errors)} error(s)). "
+            f"Fix these before stopping:",
+            file=sys.stderr,
+        )
+        for err in errors:
+            print(f"  - {err}", file=sys.stderr)
+        return _BLOCK
+
+    return _OK
+
+
+if __name__ == "__main__":
+    sys.exit(main())