feat: observational mode for live-system debugging campaigns #220

Merged
mtoslalibu merged 5 commits into AI-native-Systems-Research:main from namasl:feat/observational-mode on Jun 2, 2026

@namasl
Contributor

@namasl namasl commented May 27, 2026

Closes #219.

Summary

Adds an opt-in target_system.observational: bool flag that lets a campaign target a live system (running cluster, deployed service, dataset on disk) rather than a git-tracked codebase that mutates across iterations. In observational mode, the executor runs directly in repo_path and no per-iteration git worktree is created.

This decouples two concerns that the current code conflates: give the agent CLI access in a directory vs. create a per-iteration git worktree to isolate code mutations. Today setting repo_path triggers both; for live-system debugging the second is incoherent (the cluster isn't a thing you can git worktree add), and create_experiment_worktree raises FileNotFoundError: Not a git repository: <repo_path> every iteration.

Default is false — existing campaigns unaffected.

What changed

  • orchestrator/schemas/campaign.schema.yaml — accept observational: boolean under target_system.
  • orchestrator/llm_dispatch.py — validate the flag is a bool; expose execution_environment and worktree_constraint context keys with worktree vs. observational variants.
  • orchestrator/iteration.py — gate create_experiment_worktree behind if repo_path and not observational; in observational mode, point experiment_dir at repo_path directly (no .experiment_id written, no .nous-experiments/ created).
  • prompts/methodology/design.md, prompts/methodology/execute_analyze.md — replace the hardcoded worktree paragraphs with the two new placeholders, so the agent gets clear instructions matching the actual environment (no git checkout -- . in observational mode; treat target as live; bundles must contain no code_changes arms).
  • tests/test_observational.py — 10 new tests: schema validation, prompt fragment selection, end-to-end template rendering with observational substitutions, and an iteration-flow test that asserts create_experiment_worktree is never called and no .experiment_id / .nous-experiments/ artifacts are produced.
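
The gate described in the iteration.py bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: `resolve_experiment_dir` is a hypothetical helper, the worktree creator is passed in as a callable because the real `create_experiment_worktree` signature is not shown in this thread, and returning `None` when `repo_path` is unset is also an assumption.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_experiment_dir(
    repo_path: Optional[str],
    observational: bool,
    create_worktree: Callable[[Path], Path],
) -> Optional[Path]:
    """Pick the directory the executor runs in (hypothetical helper)."""
    if repo_path and not observational:
        # Code-evolution campaign: isolate mutations in a per-iteration worktree.
        return create_worktree(Path(repo_path))
    if repo_path:
        # Observational campaign: run directly in repo_path, no git artifacts.
        return Path(repo_path)
    # Assumption: with no repo_path there is no working directory to hand out.
    return None
```

The point of the sketch is that the two concerns are now decided separately: `repo_path` alone grants a working directory, and only the absence of the flag adds worktree isolation.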

Test plan

  • pytest tests/test_observational.py — 10 new tests pass.
  • pytest — full suite passes (348 tests, including test_integration_real_execution.py).
  • Manual: a campaign with repo_path: /scratch/dir (no .git) and observational: true validates against the schema and runs past the previously-failing worktree gate.

Out of scope / follow-ups

  • nous replay (orchestrator/cli.py:269) still calls create_experiment_worktree unconditionally. Replay support for observational campaigns can land in a follow-up — it isn't a typical use case for live-system runs.
  • Bundle-level enforcement that observational campaigns reject code_changes arms in _validate_bundle (currently only the prompt discourages them).
  • repo_path reads as "git repo" but in observational mode it's a working directory. A future rename to working_dir (with repo_path kept as a deprecated alias) would clarify intent.

🤖 Generated with Claude Code

namasl and others added 3 commits May 27, 2026 12:22
Add target_system.observational flag so campaigns whose target is a live
system (cluster, service, dataset) can use repo_path purely to grant the
agent shell access — without per-iteration git worktree isolation.

When observational=true:
- run_iteration skips create_experiment_worktree and runs the executor
  directly in repo_path. Prevents the FileNotFoundError "Not a git
  repository" failure mode and avoids polluting a non-code target with
  per-iteration orphan branches and .nous-experiments/ subdirs.
- The design and execute_analyze prompts swap their worktree paragraphs
  for observational equivalents via {{execution_environment}} and
  {{worktree_constraint}} placeholders, so the agent is told it is
  probing a live target rather than mutating an isolated worktree.

Default behavior is unchanged — the flag is opt-in and the worktree
path remains the default for code-evolution campaigns.

Tested: 10 new tests + 337 existing tests pass.

The observational flag was wired into validation, prompts, and the
iteration loop but the JSON schema still rejected it as an unknown
property, so campaigns failed at load time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Fix prompt body / lead-paragraph contradiction in execute_analyze.md.
  The lead said "no per-iteration git isolation" in observational mode,
  but Phase 2 still hardcoded `git checkout -- .` between conditions
  (which would fail with no .git) and framed result-path warnings as
  "the worktree is temporary." Replace the reset step with a new
  {{condition_reset}} placeholder and rephrase the persistence note
  to be accurate in both modes.

- Fix validation bypass: extract _validate_campaign to a module-level
  validate_campaign() and call it at the top of run_iteration. The
  staticmethod was only invoked from LLMDispatcher.__init__, so inline-
  agent mode (which never builds an LLMDispatcher) silently coerced
  non-bool observational values via bool() further down.

- Add regression test that create_experiment_worktree IS called when
  observational=False (existing tests would all pass if the gate were
  inverted).

- Loosen brittle prompt-text assertions: import the fragment constants
  and assert constant identity / containment instead of substrings,
  so copy-edits to the prompt text don't churn six tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
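
The validation-bypass fix above hinges on the difference between checking a type and coercing with `bool()`. A minimal sketch of a strict check, assuming the flag lives under `target_system` as the PR description states; `validate_observational_flag` is a hypothetical name and not the PR's `validate_campaign()`:

```python
def validate_observational_flag(campaign: dict) -> bool:
    """Return the flag if it is a real bool; reject anything else."""
    target = campaign.get("target_system") or {}
    value = target.get("observational", False)  # absent means the default, False
    if not isinstance(value, bool):
        raise ValueError(
            "target_system.observational must be a bool, "
            f"got {type(value).__name__}: {value!r}"
        )
    return value


# Why coercion is unsafe: any non-empty string is truthy,
# so a YAML value quoted as "false" would silently enable the mode.
assert bool("false") is True
```

Calling a check like this at the top of the iteration entry point, rather than only in a constructor, is what closes the gap for code paths that never build the dispatcher.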
@mtoslalibu
Collaborator

Hey Nick, nice work catching this — the crash when repo_path points at a non-git directory is a real problem and your GPU cluster use case makes sense.

One thing to flag though: Nous already has an "observe vs evolve" distinction built in. It's implicit and lives at the bundle level. When the planner designs arms without code_changes, the executor already knows to skip patching and just run CLI commands to measure things. The execute_analyze prompt even says: "If the bundle has NO code_changes (observe mode), skip this step entirely." We've run observe-mode campaigns before — repo_path points at a git repo, a worktree gets created, but the agent just runs commands without touching code. Works fine today.

So the real issue isn't that Nous needs a new "observational mode" — it's that Nous currently requires repo_path to be a git repo even when no code isolation is needed.

I'd suggest simplifying this to something like no_git: true (or even just gracefully skipping worktree creation when .git doesn't exist). The new prompt fragments ({{execution_environment}}, {{worktree_constraint}}, etc.) and behavioral instructions aren't really needed — the "don't mutate, only probe" behavior already happens naturally when the planner omits code_changes from the bundle. Adding a separate "observational" concept alongside the existing observe/evolve distinction would be confusing.

TL;DR: the use case is valid, but the fix should be ~20 lines of infrastructure gating (skip worktree when no .git), not a new behavioral mode that duplicates what observe mode already does. What do you think?

Reviewer flagged that "observational" collides with the existing
observe-mode in execute_analyze.md, which means "the bundle has no
code_changes arms" — a bundle-level property, not the infra-level
concern of whether to skip worktree creation.

The new flag controls executor environment (live system vs. isolated
worktree), so `live_target` is a more accurate name. Mechanical rename
across iteration.py, llm_dispatch.py, campaign.schema.yaml, and the
test module.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@namasl
Contributor Author

namasl commented May 27, 2026

Thanks for the close read — you're right that there's a naming collision, and I've pushed c421047 renaming observational → live_target everywhere (flag, constants, tests). The new name reflects what the flag actually controls: where the executor runs (a live system vs. an isolated worktree), not what the bundle contains (arms with or without code_changes). Those are orthogonal axes — a code-evolution campaign can have observe arms, and a live-target campaign could in principle have arms that mutate config — so they shouldn't share a name.

On dropping the prompt fragments and relying on bundle-level observe mode for the safety behavior: I don't think that holds up, for two reasons.

1. The signal arrives too late. The execute_analyze "skip if no code_changes" check fires during execution, after the planner has already designed the arms. But the planner needs to know at design time that there's no codebase to mutate, otherwise it'll happily propose code_changes arms against a directory that doesn't have a meaningful working copy. With live_target=true, the design prompt tells the planner up front: this is a running system, the arms must be probes, not patches. Without it, you'd get bundles that observe-mode then has to reject — wasted iteration, and a confusing failure mode for the user.

2. "No code_changes" is not the same as "don't mutate the target." Observe mode means "don't patch source files." It says nothing about whether the agent can kubectl apply, restart pods, or change cluster config — all things that do mutate the target system, just not via code. For a campaign against someone's production cluster (my actual use case here), the relevant guardrail is "probe, don't poke," and that has to be stated explicitly in the execution prompt. The existing observe-mode text doesn't cover it because it was written for an evolve-vs-observe codebase distinction, not a live-vs-isolated infra distinction.

So I'd push back on collapsing this into pure infra gating. The 20-line worktree skip is necessary but not sufficient — the prompts encode the behavioral contract that "this target is shared, real, and not yours to break," which is information the framework genuinely doesn't have today.

Happy to keep iterating on the wording of the fragments themselves if any of them read as redundant once you're looking at them next to the observe-mode text. And if you'd prefer I split the rename and the prompt argument into separate commits before merge, say the word.

@mtoslalibu
Collaborator

Thanks for the rename and the thoughtful response — the live_target naming is much clearer, and your two points about why the prompt fragments are needed (planner needs to know at design time, and "no code_changes" ≠ "don't mutate the target") make sense. I'm on board with the approach.

Did a closer review of the final diff. A few things worth addressing:


1. Missing directory existence check in live-target branch (the main one)

In iteration.py, the new elif repo_path: branch does:

experiment_dir = Path(repo_path)
print(f"  Live target: executor runs in {experiment_dir}")

No check that the directory actually exists. If someone has a typo in repo_path, the campaign will run through DESIGN and the human gate successfully, then fail with a confusing subprocess error deep in EXECUTE_ANALYZE. The worktree path catches this immediately via create_experiment_worktree's guard — the live-target path should fail fast too. A simple if not experiment_dir.exists(): raise RuntimeError(...) would do it.
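
A fail-fast version of that branch might look like the sketch below. It is an illustration only: `resolve_live_target_dir` is a hypothetical wrapper, the error wording is made up, and it uses `is_dir()` rather than the suggested `exists()` so that a path pointing at a regular file is rejected as well.

```python
from pathlib import Path


def resolve_live_target_dir(repo_path: str) -> Path:
    """Validate the live-target directory before any LLM work starts."""
    experiment_dir = Path(repo_path)
    if not experiment_dir.is_dir():
        # Fail here, not deep inside EXECUTE_ANALYZE with a subprocess error.
        raise RuntimeError(
            f"live_target repo_path is not an existing directory: {experiment_dir}"
        )
    print(f"  Live target: executor runs in {experiment_dir}")
    return experiment_dir
```

With this in place, a typo in `repo_path` surfaces before DESIGN and the human gate have been spent on the iteration.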


2. Two unconditional prompt changes affect all campaigns

These lines in execute_analyze.md were reworded for ALL campaigns, not just live_target ones:

  • Old: "the experiment runs in a worktree that gets cleaned up"

  • New: "only files under {{iter_dir}}/ are guaranteed to persist past this session"

  • Old: "The worktree is temporary — anything written there will be lost."

  • New: "Only files under {{iter_dir}}/ are guaranteed to persist — anything written elsewhere may be lost."

The new text is arguably better and semantically equivalent, but it's a prompt change that hits existing campaigns. Worth calling out explicitly — or making these placeholder-controlled too.
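
If the placeholder-controlled route were taken, it could be sketched as below. Everything here is hypothetical: the `{{persistence_note}}` placeholder, the `PERSISTENCE_NOTE` table, and `render_prompt` do not appear in the PR. Only the two quoted sentences and the `{{iter_dir}}` placeholder come from the diff discussed above.

```python
# Fragment per mode: existing campaigns keep the old wording,
# live_target campaigns get the new one.
PERSISTENCE_NOTE = {
    False: "the experiment runs in a worktree that gets cleaned up",
    True: "only files under {{iter_dir}}/ are guaranteed to persist past this session",
}


def render_prompt(template: str, live_target: bool, iter_dir: str) -> str:
    """Substitute the mode-specific fragment first, then the iteration directory."""
    text = template.replace("{{persistence_note}}", PERSISTENCE_NOTE[live_target])
    return text.replace("{{iter_dir}}", iter_dir)
```

The fragment is substituted before `{{iter_dir}}` so that a placeholder nested inside the chosen fragment still gets resolved.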


3. PR description still references observational

The title and body still say observational in many places but the code now uses live_target. Minor but confusing for anyone reading the PR later.


4. Minor test gaps

  • No test with live_target completely absent from the campaign dict (the most common real-world case — field just isn't there). The existing regression test sets live_target=False explicitly, which is slightly different.
  • No test for live_target=True when repo_path points at a real git repo — verifying it still skips the worktree even when .git exists.
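
The two missing cases could be covered along these lines. This is a sketch against a stand-in, not the project's test module: `choose_dir` is a hypothetical reduction of the gate in run_iteration, and placing `repo_path` under `target_system` is an assumption about the campaign layout.

```python
import tempfile
from pathlib import Path
from unittest.mock import Mock


def choose_dir(campaign: dict, create_worktree) -> Path:
    """Stand-in for the gate in run_iteration (hypothetical, for illustration)."""
    target = campaign.get("target_system", {})
    repo_path = Path(target["repo_path"])
    if target.get("live_target", False):
        return repo_path
    return create_worktree(repo_path)


def test_live_target_absent_uses_worktree():
    create_worktree = Mock(return_value=Path("/tmp/worktree"))
    campaign = {"target_system": {"repo_path": "/some/repo"}}  # key not present
    assert choose_dir(campaign, create_worktree) == Path("/tmp/worktree")
    create_worktree.assert_called_once_with(Path("/some/repo"))


def test_live_target_true_skips_worktree_even_with_git_dir():
    create_worktree = Mock()
    with tempfile.TemporaryDirectory() as repo:
        (Path(repo) / ".git").mkdir()  # directory looks like a git repo
        campaign = {"target_system": {"repo_path": repo, "live_target": True}}
        assert choose_dir(campaign, create_worktree) == Path(repo)
    create_worktree.assert_not_called()
```

The first test pins the default for campaigns that predate the flag; the second pins that the flag, not the presence of `.git`, decides whether a worktree is created.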

None of these are blockers except #1 (the existence check). The rest are small. Nice work overall — the feature is clean and well-tested.

@mtoslalibu
Collaborator

One more thing — could you add a short section to the README or quickstart showing how and when to use live_target: true? Something simple like:

  • When to use it (targeting a running cluster, service, or non-git directory)
  • An example campaign YAML snippet
  • How it differs from regular observe mode (no code_changes in bundle)

Right now the only documentation for this feature is the PR description and the schema's inline description field — would be nice to have something a user can find without reading git history.

Reviewer asked for user-facing docs on when and how to use
live_target: true so the feature is discoverable without reading the
PR description or schema. Adds a quickstart section with an example
campaign and contrasts live_target (campaign-level, no worktree, all
arms must be probes) with observe-mode arms (bundle-level, worktree
still created). README points to the new section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mtoslalibu mtoslalibu merged commit 3499dbe into AI-native-Systems-Research:main Jun 2, 2026