Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
33c07a1
feat: add SDKDispatcher and --agent sdk flag (#121)
sriumcp May 24, 2026
bd330d7
feat: add deterministic Stop hook for executor completion (#129)
sriumcp May 24, 2026
b745558
security: per-campaign permission policy via .claude/settings.json (#…
sriumcp May 24, 2026
d61b9dc
feat: PreToolUse plan-enforcer hook (#128)
sriumcp May 24, 2026
ea8d02d
refactor: per-campaign CLAUDE.md generated at init + regenerated each…
sriumcp May 24, 2026
18613af
feat: channel notification at human gates (#130, Phase A)
sriumcp May 24, 2026
0861823
feat: campaign-index pure functions, foundation for nous-mcp (#126 Ph…
sriumcp May 24, 2026
d80230c
feat: orphan-worktree GC at run start (#133, Phase A)
sriumcp May 24, 2026
42e3557
perf: cache hit-rate stats + nous cost --cache-stats (#122)
sriumcp May 24, 2026
5c6215c
feat: nous status --watch / --line + snapshot reader (#127, Phase A)
sriumcp May 24, 2026
74b7eb0
feat: Routines payload builder for scheduled campaigns (#134, Phase A)
sriumcp May 24, 2026
b993203
feat: package nous as a Claude Code plugin (#125)
sriumcp May 24, 2026
473970b
feat: /goal-driven prompt builders for goal-bounded campaign mode (#1…
sriumcp May 24, 2026
bcc82a7
feat: explore-then-synthesize DESIGN orchestration helpers (#132, Pha…
sriumcp May 24, 2026
25ce1e8
perf: load methodology preamble as cached system_prompt (#122 Phase B)
sriumcp May 24, 2026
d6039e9
feat: tee SDK events to executor_log.jsonl (#127 Phase B)
sriumcp May 24, 2026
33b5811
refactor: thin prompt templates when CLAUDE.md is in scope (#131 Phas…
sriumcp May 24, 2026
3ca5070
chore: codify no-live-LLM-in-tests as a hard project principle
sriumcp May 24, 2026
d6d69cc
feat: run_goal_driven_iteration runner (#124 Phase B)
sriumcp May 24, 2026
4a45e13
feat: submit_routine HTTP POST with poster injection (#134 Phase B)
sriumcp May 24, 2026
9522fb2
feat: nous-mcp stdio server (#126 Phase B)
sriumcp May 24, 2026
5d8aa7a
feat: parse_reply + wait_for_reply for channel gate decisions (#130 P…
sriumcp May 24, 2026
32250bb
feat: make_isolated_arm_runner factory for harness-managed worktrees …
sriumcp May 24, 2026
f7a01f3
feat: parallel-arm orchestration helpers (#123, Phase A)
sriumcp May 24, 2026
9cb7fc4
feat: end-to-end isolated-runner tests for parallel arms (#123 Phase B)
sriumcp May 24, 2026
a186f2a
feat: make_sdk_explore_runner factory for Stage A (#132 Phase B)
sriumcp May 24, 2026
33952e6
docs: retro for the #120 Claude-Code-native uplift initiative
sriumcp May 24, 2026
13674db
Merge #136 (#121 SDK port + Phase B)
sriumcp May 24, 2026
bde2e61
Merge #137 (#129 Stop hook)
sriumcp May 24, 2026
add8204
Merge #138 (#135 Permission policy)
sriumcp May 24, 2026
b9e6f78
Merge #139 (#128 PreToolUse enforcer)
sriumcp May 24, 2026
723645f
Merge #140 (#131 CLAUDE.md + Phase B thin templates)
sriumcp May 24, 2026
1859788
Merge #141 (#130 Channels + Phase B reply parsing)
sriumcp May 24, 2026
2f39883
Merge #142 (#126 MCP server + Phase B stdio transport)
sriumcp May 24, 2026
8e851ed
Merge #143 (#133 Worktree GC + Phase B harness runner)
sriumcp May 24, 2026
315b1ce
Merge #144 (#122 Prompt caching + Phase B preamble loader)
sriumcp May 24, 2026
3857687
Merge #145 (#127 Status watch + Phase B SDK event tee)
sriumcp May 24, 2026
da0938f
Merge #146 (#134 Routines + Phase B submit)
sriumcp May 24, 2026
7872a78
Merge #147 (#125 Plugin)
sriumcp May 24, 2026
ef9eab5
Merge #148 (#124 /goal-driven + Phase B runner)
sriumcp May 24, 2026
ee71b8c
Merge #149 (#132 Explore design + Phase B SDK runner)
sriumcp May 24, 2026
e2c209a
Merge #150 (#123 Parallel arms + Phase B isolated runner integration)
sriumcp May 24, 2026
4cec326
Merge #151 (test policy: no live LLM calls)
sriumcp May 24, 2026
304ccef
Merge #152 (retro for #120)
sriumcp May 24, 2026
eac8c2a
ci: add pytest workflow for push and pull_request
sriumcp May 24, 2026
d68ad85
Merge ci/pytest: add pytest workflow for tracking-120
sriumcp May 24, 2026
322f851
ci: drop pull_request base-branch filter so any PR runs CI
sriumcp May 24, 2026
24f8a76
docs: pip install + git clone use the reflective branch (#120)
sriumcp May 24, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 59 additions & 0 deletions .github/workflows/tests.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
name: Tests

on:
push:
branches: [main, reflective]
# Trigger on every PR regardless of base branch so contributors get
# CI feedback on long-running integration branches (e.g. tracking-N)
# in addition to PRs targeting main/reflective.
pull_request:

# Cancel in-flight runs on the same PR/branch when a new push lands.
# Only safe context expressions used here: github.workflow, github.ref,
# github.event_name. None come from user-controlled input.
concurrency:
group: tests-${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}

jobs:
pytest:
name: pytest (Python ${{ matrix.python-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: ["3.11", "3.12"]

# No LLM API keys in the env. The no-live-LLM project principle
# (CLAUDE.md, tests/CLAUDE.md, tests/conftest.py autouse guard) says
# tests must mock LLMs, never call them. This is the outer line of
# defence; the conftest guard is the inner.
env:
OPENAI_API_KEY: ""
OPENAI_BASE_URL: ""
ANTHROPIC_API_KEY: ""

steps:
- uses: actions/checkout@v5

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: pip

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e ".[dev]"

- name: Run pytest
run: pytest -ra --strict-markers

- name: Upload pytest cache on failure
if: failure()
uses: actions/upload-artifact@v4
with:
name: pytest-cache-py${{ matrix.python-version }}
path: .pytest_cache/
if-no-files-found: ignore
82 changes: 82 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
# Nous — project conventions

This file is auto-loaded by Claude Code on every session in this repo. The
rules below are non-negotiable; when they conflict with general AI/coding
defaults, **the rules here win**.

## 🚫 Tests must NEVER make live LLM calls

**No unit, integration, or end-to-end test in this repo may make a real
API call to Anthropic, OpenAI, or any other LLM provider. Period.**

Why this is a hard rule:
- Tests run on every CI build, every contributor's laptop, and every PR
rebase. Live LLM calls would burn tokens for no signal — the test
result depends on what the model said today, not on the code under test.
- Token budget for `nous` is mission-critical. We refuse to spend it on
CI churn.
- Live calls are non-deterministic. A flaky test from a model rephrasing
itself is worse than no test.

**How to test correctly:**

| Code under test | How to mock |
|---|---|
| `LLMDispatcher` | Pass `completion_fn=` in the constructor — a callable that returns canned `chat.completions`-shaped objects. See `tests/test_llm_dispatch.py`'s `_make_fake_completion` for the pattern. |
| `CLIDispatcher` (claude -p subprocess) | Patch `orchestrator.cli_dispatch.subprocess.run` — return a `subprocess.CompletedProcess` with the JSON the test wants. See `tests/test_cli_dispatch.py`. |
| `SDKDispatcher` (Claude Agent SDK) | Pass `sdk_runner=` in the constructor — a callable returning `SDKResult`. See `tests/test_sdk_dispatch.py`'s `_ScriptedRunner`. |
| `InlineDispatcher` | Set up the `.nous_response_*` signal file in tmp_path before calling dispatch. |
| Stub-driven flows | Use `StubDispatcher` from `orchestrator.dispatch` — it produces valid schema-conformant artifacts with no LLM at all. |

**Active enforcement:** `tests/conftest.py` installs an autouse fixture
(`block_live_llm_calls`) that:
1. Strips `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` from the env so any
accidental real-client construction fails loudly instead of silently
billing.
2. Patches `urllib.request.urlopen` to refuse `api.anthropic.com`,
`api.openai.com`, and `api.litellm.ai` hosts.
3. Patches `claude_agent_sdk.query` (when installed) to a hard-fail.

If a test triggers any of these guards, the fix is to inject a fake at
the dispatcher's seam — never to disable the guard. The guards are the
backstop; the seams are the contract.

## Behavioral testing only

When the test mock is in place, write **behavioral** tests:
- ✓ Assert what's on disk after `dispatcher.dispatch(...)`.
- ✓ Assert metrics rows in `llm_metrics.jsonl`.
- ✓ Assert artifacts match a JSON Schema.
- ✗ Don't assert which method was called on the mock.
- ✗ Don't assert argv shape, internal helper invocation, or attribute access.

The seam is the contract; the implementation is free to evolve.

## Token-budget discipline (production code)

Beyond tests, Nous itself must be frugal with tokens:
- **Methodology stays in `CLAUDE.md`** (auto-loaded by Claude Code), not
in per-call prompts. The thin templates in `prompts/methodology/*_thin.md`
carry only per-iteration context.
- **System blocks are cached** (`cache_control: ephemeral`). Any code
that constructs an SDK call with a static system_prompt should rely
on this, and any change that breaks within-iteration cache locality
must be measured (`nous cost --cache-stats`) and justified.
- **Read-only mapping uses Explore subagents**, not Opus. See
`orchestrator/explore_design.py`.

## PR workflow (project owner: @sriumcp)

1. Branch off `upstream/reflective` (NOT `main`).
2. Push to `origin` (the fork at `sriumcp/agentic-strategy-evolution`).
3. Open PR with base `upstream/reflective`, head `sriumcp:<branch>`.
4. PR body links the issue with `Closes #N` (or `Refs #N` for partials).
5. Stack PRs when one logical change builds on another rather than waiting
for merge — see `docs/plans/CHECKPOINT.md` for the pattern.

## See also

- `docs/contributing/workflow.md` — full workflow doc.
- `docs/security.md` — permission policy (#135).
- `docs/architecture.md` — internals.
- `docs/plans/CHECKPOINT.md` — current state of the #120 epic.
12 changes: 10 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -80,17 +80,25 @@ If you're using Anthropic directly via a LiteLLM proxy, point both vars at the p
### 1. Install Nous

```bash
pip install "git+https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git"
pip install "git+https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git@reflective"
```

`reflective` is the active integration branch — that's where new work lands first. `main` lags slightly behind. To pin to a release, replace `@reflective` with a tag (`@v0.2.0`).

For development (editable install with test dependencies):

```bash
git clone https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git
git clone -b reflective https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git
cd agentic-strategy-evolution
pip install -e ".[dev]"
```

For the SDK-based dispatcher (`--agent sdk`, see `docs/architecture.md`), also install the optional `[sdk]` extra:

```bash
pip install -e ".[dev,sdk]"
```

### 2. Configure models

Two LLM calls per iteration, both via `claude -p`:
Expand Down
85 changes: 85 additions & 0 deletions bin/nous-execute-stop
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
#!/usr/bin/env python3
"""Stop hook for the Nous executor session (issue #129).

Runs after every Claude Code agent turn. Returns:
exit 0 → allow the agent to stop (its work is done).
exit 2 → block stopping; the structured reason on stderr is fed back
into the agent's conversation so it can react.

A "stop is allowed" decision needs two pieces of evidence on disk:
1. ``$NOUS_ITER_DIR/principle_updates.json`` exists.
2. ``nous validate execution --dir $NOUS_ITER_DIR`` returns ``status: pass``.

Both are deterministic — no LLM judgment, no agent self-assessment. The
hook pairs with the ``/goal``-driven loop (#124) but is preferred wherever
the success criterion is a schema check, because it's cheaper and more
reliable than a Haiku evaluator.

Configured per-campaign in ``.claude/settings.json`` (see #135). The
orchestrator sets ``NOUS_ITER_DIR`` before launching the executor session.
"""
from __future__ import annotations

import os
import sys
from pathlib import Path

# When invoked as a Claude Code hook, the script's directory may not be
# on PYTHONPATH. Add the repo root so `orchestrator.validate` imports.
_HERE = Path(__file__).resolve().parent
_REPO_ROOT = _HERE.parent
if str(_REPO_ROOT) not in sys.path:
sys.path.insert(0, str(_REPO_ROOT))

from orchestrator.validate import validate_execution # noqa: E402


_OK = 0
_BLOCK = 2


def main() -> int:
iter_dir_str = os.environ.get("NOUS_ITER_DIR")
if not iter_dir_str:
print(
"NOUS_ITER_DIR is not set. The orchestrator should export this "
"variable before launching the executor session.",
file=sys.stderr,
)
return _BLOCK

iter_dir = Path(iter_dir_str)
if not iter_dir.is_dir():
print(
f"iter_dir does not exist: {iter_dir}. NOUS_ITER_DIR is "
f"misconfigured or the executor was launched before init.",
file=sys.stderr,
)
return _BLOCK

principles = iter_dir / "principle_updates.json"
if not principles.exists():
print(
f"principle_updates.json is missing from {iter_dir}. "
f"Write the file (a JSON list, possibly empty: []) before stopping.",
file=sys.stderr,
)
return _BLOCK

result = validate_execution(iter_dir)
if result.get("status") != "pass":
errors = result.get("errors", [])
print(
f"validation failed for {iter_dir} ({len(errors)} error(s)). "
f"Fix these before stopping:",
file=sys.stderr,
)
for err in errors:
print(f" - {err}", file=sys.stderr)
return _BLOCK

return _OK


if __name__ == "__main__":
sys.exit(main())
Loading
Loading