feat: observational mode for live-system debugging campaigns (#220)

namasl · claude · web-flow · commit 3499dbe02956 · 2026-06-02T13:51:34.000-04:00
* feat: observational mode for live-system debugging campaigns

  Add target_system.observational flag so campaigns whose target is a live system (cluster, service, dataset) can use repo_path purely to grant the agent shell access, without per-iteration git worktree isolation.

  When observational=true:

  - run_iteration skips create_experiment_worktree and runs the executor directly in repo_path. Prevents the FileNotFoundError "Not a git repository" failure mode and avoids polluting a non-code target with per-iteration orphan branches and .nous-experiments/ subdirs.
  - The design and execute_analyze prompts swap their worktree paragraphs for observational equivalents via {{execution_environment}} and {{worktree_constraint}} placeholders, so the agent is told it is probing a live target rather than mutating an isolated worktree.

  Default behavior is unchanged: the flag is opt-in and the worktree path remains the default for code-evolution campaigns.

  Tested: 10 new tests + 337 existing tests pass.

* fix: allow target_system.observational in campaign schema

  The observational flag was wired into validation, prompts, and the iteration loop but the JSON schema still rejected it as an unknown property, so campaigns failed at load time.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* review: address PR #220 review feedback

  - Fix prompt body / lead-paragraph contradiction in execute_analyze.md. The lead said "no per-iteration git isolation" in observational mode, but Phase 2 still hardcoded `git checkout -- .` between conditions (which would fail with no .git) and framed result-path warnings as "the worktree is temporary." Replace the reset step with a new {{condition_reset}} placeholder and rephrase the persistence note to be accurate in both modes.
  - Fix validation bypass: extract _validate_campaign to a module-level validate_campaign() and call it at the top of run_iteration. The staticmethod was only invoked from LLMDispatcher.__init__, so inline-agent mode (which never builds an LLMDispatcher) silently coerced non-bool observational values via bool() further down.
  - Add regression test that create_experiment_worktree IS called when observational=False (existing tests would all pass if the gate were inverted).
  - Loosen brittle prompt-text assertions: import the fragment constants and assert constant identity / containment instead of substrings, so copy-edits to the prompt text don't churn six tests.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* rename: observational → live_target per reviewer feedback

  Reviewer flagged that "observational" collides with the existing observe-mode in execute_analyze.md, which means "the bundle has no code_changes arms" (a bundle-level property, not the infra-level concern of whether to skip worktree creation). The new flag controls executor environment (live system vs. isolated worktree), so `live_target` is a more accurate name.

  Mechanical rename across iteration.py, llm_dispatch.py, campaign.schema.yaml, and the test module.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: document live_target campaigns in README and quickstart

  Reviewer asked for user-facing docs on when and how to use live_target: true so the feature is discoverable without reading the PR description or schema. Adds a quickstart section with an example campaign and contrasts live_target (campaign-level, no worktree, all arms must be probes) with observe-mode arms (bundle-level, worktree still created). README points to the new section.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
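
The validation-bypass fix described in the commit message is easier to follow with a concrete value. The block below is a minimal sketch, not the project's code: `check_live_target` and `uses_worktree` are hypothetical helpers that mirror, under that assumption, the `isinstance(..., bool)` check in `validate_campaign()` and the `repo_path and not live_target` gate in `run_iteration`. It shows how a quoted YAML value such as `"false"` would be treated as true by `bool()` if validation were skipped.

```python
def check_live_target(target_system: dict) -> None:
    """Hypothetical stand-in for the live_target check in validate_campaign()."""
    value = target_system.get("live_target")
    if "live_target" in target_system and not isinstance(value, bool):
        raise ValueError(
            "Campaign 'target_system.live_target' must be a bool. "
            f"Got: {value!r}"
        )


def uses_worktree(target_system: dict) -> bool:
    """Hypothetical mirror of the run_iteration gate: a worktree is created
    only when repo_path is set and live_target is falsy."""
    live_target = bool(target_system.get("live_target", False))
    return bool(target_system.get("repo_path")) and not live_target


# A quoted YAML value arrives as the non-empty string "false", which is truthy.
quoted = {"repo_path": "/scratch/probe", "live_target": "false"}

# Without validation, the string coerces to True and the worktree is skipped.
coerced_without_validation = uses_worktree(quoted)

# With validation up front, the bad value is rejected before any coercion.
try:
    check_live_target(quoted)
    rejected = False
except ValueError:
    rejected = True

# Well-typed configs behave as the commit message describes.
default_case = uses_worktree({"repo_path": "/scratch/probe"})
live_case = uses_worktree({"repo_path": "/scratch/probe", "live_target": True})
```

In this sketch the unvalidated string silently selects live-target behaviour, which is the failure the up-front `validate_campaign(campaign)` call is meant to prevent in inline-agent mode.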
diff --git a/README.md b/README.md
@@ -123,6 +123,8 @@ When `repo_path` is set, the campaign directory is created inside the target rep
 
 The planner explores the codebase to discover metrics, knobs, and execution methods. You can optionally provide `observable_metrics` and `controllable_knobs` as hints — see [examples/campaign.yaml](examples/campaign.yaml) for all options.
 
+If your target is a *running* system rather than a codebase (a cluster, a deployed service, a scratch directory that isn't a git repo), set `target_system.live_target: true`. The executor then runs directly in `repo_path` with no per-iteration `git worktree`, and the planner is told up front that arms must be probes — see [docs/quickstart.md#live-target-campaigns-live_target-true](docs/quickstart.md#live-target-campaigns-live_target-true) for details.
+
 ### 5. Run a campaign
 
 ```bash
diff --git a/docs/quickstart.md b/docs/quickstart.md
@@ -125,6 +125,43 @@ After a campaign, your working directory contains:
 - **`runs/iter-N/inputs/`** — Agent-created input files (configs, workloads)
 - **`runs/iter-N/results/`** — Experiment output files
 
+## Live-target campaigns (`live_target: true`)
+
+By default Nous treats `repo_path` as a git repo and creates a fresh `git worktree` per iteration so that any source-code patches are isolated. For some campaigns there is no codebase to evolve — the thing you want to study is a *running* system: a Kubernetes cluster, a deployed service, a dataset on disk, a non-git scratch directory. Setting `live_target: true` tells Nous to skip worktree creation and run the executor directly inside `repo_path`.
+
+Use it when:
+
+- The target is a live system you are probing, not a codebase you are mutating (e.g. a GPU cluster, a production-like service, a workload generator).
+- `repo_path` points at a directory that is not a git repo, or is a git repo whose working tree must not be branched.
+- The bundle should only contain probe-style arms (config tweaks, command-line invocations, observation runs) — never `code_changes`.
+
+Example:
+
+```yaml
+research_question: >
+  Why does p99 latency spike when the cluster autoscaler kicks in?
+
+target_system:
+  name: "Staging GPU cluster"
+  description: >
+    Live Kubernetes cluster running our inference workload.
+    The agent probes the cluster via kubectl and Prometheus; it does
+    not modify source code.
+  repo_path: /scratch/cluster-probe   # any working directory; need not be a git repo
+  live_target: true
+
+prompts:
+  methodology_layer: "prompts/methodology"
+  domain_adapter_layer: null
+```
+
+How `live_target` differs from regular observe-mode arms:
+
+- **Observe mode** is a *bundle-level* property — an individual arm has no `code_changes`, so the executor skips patching and just runs commands. The campaign can still mix observe arms and evolve arms in the same bundle, and a worktree is still created.
+- **`live_target: true`** is a *campaign-level* property — it controls the *executor environment* (no worktree, run in `repo_path` directly) and tells the planner up front that the target is a shared running system, so every arm must be a probe. Bundles with `code_changes` arms are incoherent in this mode.
+
+Pick `live_target: true` when there is nothing meaningful to branch from; pick observe-mode arms when you have a real codebase but a particular iteration only needs to measure, not patch.
+
 ## Choosing a model
 
 Defaults (from `defaults.yaml`):
diff --git a/orchestrator/iteration.py b/orchestrator/iteration.py
@@ -339,6 +339,13 @@ def run_iteration(
     Returns:
         An IterationOutcome value: COMPLETED, CONTINUE, ABORTED, or REDESIGN.
     """
+    # Validate the campaign once, up front. The staticmethod on LLMDispatcher
+    # is also called from its constructor, but inline-agent mode never builds
+    # an LLMDispatcher — without this call, a non-bool `live_target` value
+    # would slip past validation and silently coerce via bool() below.
+    from orchestrator.llm_dispatch import validate_campaign
+    validate_campaign(campaign)
+
     engine = Engine(work_dir)
     repo_path = campaign.get("target_system", {}).get("repo_path")
 
@@ -454,7 +461,10 @@ def _max_turns_for(phase_key: str) -> int:
             cli_dispatcher.model = _model_for("execute_analyze")
             cli_dispatcher.max_turns = _max_turns_for("execute_analyze")
         exec_dispatcher = cli_dispatcher or llm_dispatcher
-        if repo_path:
+        live_target = bool(
+            campaign.get("target_system", {}).get("live_target", False)
+        )
+        if repo_path and not live_target:
             from orchestrator.worktree import (
                 create_experiment_worktree,
                 remove_experiment_worktree,
@@ -464,6 +474,12 @@ def _max_turns_for(phase_key: str) -> int:
             )
             (iter_dir / ".experiment_id").write_text(experiment_id)
             print(f"  Experiment worktree: {experiment_dir}")
+        elif repo_path:
+            # Live-target mode: executor runs directly in repo_path. The
+            # target system is running (cluster, service, dataset) and there
+            # is nothing to isolate — bundles must contain no code_changes arms.
+            experiment_dir = Path(repo_path)
+            print(f"  Live target: executor runs in {experiment_dir}")
         if cli_dispatcher:
             import contextlib
             ctx = cli_dispatcher.override_cwd(experiment_dir) if experiment_dir else contextlib.nullcontext()
diff --git a/orchestrator/llm_dispatch.py b/orchestrator/llm_dispatch.py
@@ -35,6 +35,85 @@
 # Schema cache: schema_name -> parsed schema dict
 _schema_cache: dict[str, dict] = {}
 
+# Prompt fragments that swap based on target_system.live_target. Worktree
+# mode is the default — code-evolution campaigns get an isolated git worktree
+# per iteration. Live-target mode is for running systems (clusters, services,
+# datasets) that the executor probes without per-iteration code mutation.
+# (The flag is `live_target` rather than `observational` to avoid colliding
+# with the existing "observe mode" in execute_analyze.md, which means
+# "the bundle has no code_changes arms.")
+_WORKTREE_EXECUTION_ENV = (
+    "You are running inside an isolated git worktree of the target system. "
+    "You own this worktree — reset it yourself with `git checkout -- .` "
+    "between conditions."
+)
+_LIVE_TARGET_EXECUTION_ENV = (
+    "You are running directly against a live target system, in its working "
+    "directory. There is no per-iteration git isolation, and your bundle "
+    "must contain no `code_changes` arms. Do not mutate the target system's "
+    "persistent state — your job is to probe, measure, and report. Treat "
+    "any files you create as scratch artifacts that belong under "
+    "`{{iter_dir}}/inputs/` or `{{iter_dir}}/results/`, not in the target "
+    "directory."
+)
+_WORKTREE_DESIGN_CONSTRAINT = (
+    "**Worktree isolation assumed.** The executor runs in a clean git "
+    "worktree. Each condition starts from clean state (`git checkout -- .` "
+    "runs between conditions). Design your experimental conditions assuming "
+    "this — don't include manual cleanup steps."
+)
+_LIVE_TARGET_DESIGN_CONSTRAINT = (
+    "**Live target system.** The executor runs directly against a running "
+    "system — no git worktree, no code-change arms. All arms must be pure "
+    "observations of system state (probes, metrics, log scrapes). Do not "
+    "include `code_changes` in any arm; do not assume mutation is possible "
+    "without explicit consent gates."
+)
+
+# Per-condition reset step in execute_analyze.md Phase 2. Worktree mode resets
+# tracked files between conditions; live-target mode has no checkout to
+# revert and instead reminds the agent not to mutate the live target.
+_WORKTREE_CONDITION_RESET = "Reset worktree: `git checkout -- .`"
+_LIVE_TARGET_CONDITION_RESET = (
+    "Do not mutate the target system between conditions. Any files you "
+    "wrote to the target directory during the previous condition must be "
+    "removed before the next one runs (this is your responsibility — "
+    "there is no automatic checkout)."
+)
+
+
+def validate_campaign(campaign: dict) -> None:
+    """Validate campaign config. Module-level so it can be called before any
+    dispatcher is constructed (e.g., from `run_iteration` in inline-agent mode,
+    where no LLMDispatcher is built and the staticmethod path is never taken).
+    """
+    ts = campaign.get("target_system")
+    if not isinstance(ts, dict):
+        raise ValueError(
+            "Campaign config missing 'target_system' section. "
+            "See examples/campaign.yaml for the expected format."
+        )
+    required = ["name", "description"]
+    missing = [k for k in required if k not in ts]
+    if missing:
+        raise ValueError(
+            f"Campaign 'target_system' missing required keys: {missing}. "
+            f"See examples/campaign.yaml for the expected format."
+        )
+    for field in ("observable_metrics", "controllable_knobs"):
+        val = ts.get(field)
+        if val is not None:
+            if not isinstance(val, list) or not all(isinstance(x, str) for x in val):
+                raise ValueError(
+                    f"Campaign 'target_system.{field}' must be a list of strings. "
+                    f"Got: {val!r}"
+                )
+    if "live_target" in ts and not isinstance(ts["live_target"], bool):
+        raise ValueError(
+            f"Campaign 'target_system.live_target' must be a bool. "
+            f"Got: {ts['live_target']!r}"
+        )
+
 
 class LLMDispatcher:
     """Dispatch agent roles to an LLM and produce schema-conformant artifacts."""
@@ -50,7 +129,7 @@ def __init__(
         completion_fn: Callable | None = None,
     ) -> None:
         self.work_dir = Path(work_dir)
-        self._validate_campaign(campaign)
+        validate_campaign(campaign)
         self.campaign = campaign
         self.model = model
         self.loader = PromptLoader(
@@ -84,29 +163,7 @@ def __init__(
                 dal,
             )
 
-    @staticmethod
-    def _validate_campaign(campaign: dict) -> None:
-        ts = campaign.get("target_system")
-        if not isinstance(ts, dict):
-            raise ValueError(
-                "Campaign config missing 'target_system' section. "
-                "See examples/campaign.yaml for the expected format."
-            )
-        required = ["name", "description"]
-        missing = [k for k in required if k not in ts]
-        if missing:
-            raise ValueError(
-                f"Campaign 'target_system' missing required keys: {missing}. "
-                f"See examples/campaign.yaml for the expected format."
-            )
-        for field in ("observable_metrics", "controllable_knobs"):
-            val = ts.get(field)
-            if val is not None:
-                if not isinstance(val, list) or not all(isinstance(x, str) for x in val):
-                    raise ValueError(
-                        f"Campaign 'target_system.{field}' must be a list of strings. "
-                        f"Got: {val!r}"
-                    )
+    _validate_campaign = staticmethod(validate_campaign)
 
     # ------------------------------------------------------------------
     # Public interface (satisfies Dispatcher protocol)
@@ -212,13 +269,17 @@ def _build_context(
         perspective: str | None,
     ) -> dict[str, str]:
         ts = self.campaign["target_system"]
+        live_target = bool(ts.get("live_target", False))
         ctx: dict[str, str] = {
             "target_system": ts["name"],
             "system_description": ts["description"],
             "observable_metrics": ", ".join(ts["observable_metrics"]) if ts.get("observable_metrics") else "Not specified — planner should discover from code",
             "controllable_knobs": ", ".join(ts["controllable_knobs"]) if ts.get("controllable_knobs") else "Not specified — planner should discover from code",
             "active_principles": self._format_principles(),
             "iteration": str(iteration),
+            "execution_environment": _LIVE_TARGET_EXECUTION_ENV if live_target else _WORKTREE_EXECUTION_ENV,
+            "worktree_constraint": _LIVE_TARGET_DESIGN_CONSTRAINT if live_target else _WORKTREE_DESIGN_CONSTRAINT,
+            "condition_reset": _LIVE_TARGET_CONDITION_RESET if live_target else _WORKTREE_CONDITION_RESET,
         }
 
         if phase == "design":
diff --git a/orchestrator/schemas/campaign.schema.yaml b/orchestrator/schemas/campaign.schema.yaml
@@ -53,6 +53,9 @@ properties:
         type: ["string", "null"]
         minLength: 1
         description: "Path to target system git repo. Used by CLIDispatcher for code-access agents. If set, experiments run in isolated worktrees."
+      live_target:
+        type: boolean
+        description: "If true, the executor runs directly in repo_path with no per-iteration git worktree. Use for campaigns that probe a running system (cluster, service, dataset) where there is no code to evolve. Bundles must contain no code_changes arms."
 
   metadata:
     type: object
diff --git a/prompts/methodology/design.md b/prompts/methodology/design.md
@@ -158,7 +158,7 @@ Now design a hypothesis bundle based on what you actually observed and verified:
 - Predictions must be directional, falsifiable, and reference specific observable metrics. Do not invent arbitrary numeric thresholds unless campaign.yaml specifies them.
 - Base all experiment parameters on verified system behavior — if you didn't probe it, don't assume it.
 - **No `sed`/`awk` for code changes.** When describing code modifications in problem framing or bundle arms, describe the *intent* (what to change and why). The executor agent will implement changes properly via file edits, verify they compile, and create reusable `git diff` patches. Never suggest inline shell regex as an implementation strategy.
-- **Worktree isolation assumed.** The executor runs in a clean git worktree. Each condition starts from clean state (`git checkout -- .` runs between conditions). Design your experimental conditions assuming this — don't include manual cleanup steps.
+- {{worktree_constraint}}
 
 ## Output — Write Files Directly
 
diff --git a/prompts/methodology/execute_analyze.md b/prompts/methodology/execute_analyze.md
@@ -1,6 +1,6 @@
 You are a scientific executor for the Nous hypothesis-driven experimentation framework.
 
-You have **shell access**. You are running inside an isolated git worktree of the target system. You own this worktree — reset it yourself with `git checkout -- .` between conditions.
+You have **shell access**. {{execution_environment}}
 
 Your job has FIVE phases — all in one session with full context:
 1. **Prepare** — build, create patches, validate ALL commands
@@ -105,7 +105,7 @@ arms:
 ```
 
 **Important:**
-- All output paths MUST use absolute paths under `{{iter_dir}}/results/`. Do NOT use relative paths — the experiment runs in a worktree that gets cleaned up.
+- All output paths MUST use absolute paths under `{{iter_dir}}/results/`. Do NOT use relative paths — only files under `{{iter_dir}}/` are guaranteed to persist past this session.
 - Create per-arm result subdirectories before writing output: `mkdir -p {{iter_dir}}/results/<arm_id>` (the top-level `results/` already exists, but per-arm subdirectories like `results/h-main/` do not).
 - If you create ANY input files for the experiment (config files, workload specs, policy definitions, parameter files), write them to `{{iter_dir}}/inputs/` and list them in the condition's `inputs` array. Do NOT write input files to `/tmp/` or other temporary locations — they will be lost and the experiment will not be reproducible.
 
@@ -114,13 +114,13 @@ arms:
 Run the experiment plan you wrote in Step 4 — execute every command exactly as written. The plan is the source of truth.
 
 For each condition:
-1. Reset worktree: `git checkout -- .`
+1. {{condition_reset}}
 2. Run the `cmd` from the plan
 3. Verify the `output` file was created at the expected path
 
 After each baseline+treatment pair with the same seed, compare key metrics. If they are byte-identical, STOP and investigate — the patch may not be affecting the code path.
 
-**All results must land in `{{iter_dir}}/results/`.** The worktree is temporary — anything written there will be lost.
+**All results must land in `{{iter_dir}}/results/`.** Only files under `{{iter_dir}}/` are guaranteed to persist — anything written elsewhere may be lost.
 
 ## Phase 3: Analyze and Write Findings
 
diff --git a/tests/test_live_target.py b/tests/test_live_target.py