Commit 5064674

vabruzzo and claude committed
Fix bugs, update docs, and clean up test configs
Bug fixes:
- Fix fork point worktree ref in resample_session.py (was using "baseline" instead of fork point tag)
- Fix replay.py fork reset logic for replay context (always reset for forked sessions)
- Fix replay.py to honor --prompt at turn 1
- Fix transcript.py to handle string content in assistant messages
- Fix UI resample API to support replicate session keys (e.g. "2_r01")
- Fix CLI list to skip internal directories (_worktrees, .shadow_git)
- Clean up worktree parent directories after replay/resample
- Preserve thinking block signatures instead of stripping them (API rejects tampered signatures)

Docs:
- Fix resampling.md: thinking blocks listed as editable, now correctly shown as read-only
- Fix cli.md: broken anchor link to intervention testing section
- Fix output.md: add transcript.jsonl and uuid_map.json to session directory tree
- Update example configs to use current model name (claude-sonnet-4-20250514)
- Update web-ui.md: clarify API key needed for all resampling, not just edit & resample

Cleanup:
- Remove unused test configs (edit_test, final_chained, final_isolated_subagent, subagent_3session)
- Update test_resample.py to match signature preservation behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 421ab84 commit 5064674
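
One of the fixes listed in the commit message concerns assistant messages whose `content` is a plain string rather than a list of blocks. The following is a minimal sketch of that kind of normalization, assuming the Anthropic Messages API convention that `content` may be either a string or a list of typed blocks. `normalize_content` is a hypothetical name used for illustration; it is not taken from `transcript.py`.

```python
from typing import Any


def normalize_content(content: Any) -> list[dict]:
    """Return message content as a list of content blocks.

    Hypothetical helper: a bare string becomes a single text block,
    and bare strings found inside a list are wrapped the same way.
    """
    if content is None:
        return []
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    blocks = []
    for item in content:
        if isinstance(item, str):
            blocks.append({"type": "text", "text": item})
        else:
            blocks.append(item)
    return blocks
```

Code that iterates over blocks can then call `normalize_content(message["content"])` first and avoid branching on the content type at every use site.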

27 files changed

Lines changed: 231 additions & 574 deletions

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ Initial release.
2020
- **Subagent capture** — separate ATIF trajectories for each subagent invocation, linked to parent via `SubagentTrajectoryRef`
2121
- **API request capture** — local reverse proxy captures raw request/response bodies, system prompts, tool definitions, token usage, and compaction events
2222
- **Turn-level resampling** — replay a specific API request N times to study response variance (stateless, no tool execution)
23-
- **Intervention testing** — edit captured API requests (assistant text, tool results, system prompt) and resample with modified inputs; available from both CLI (`harness resample-edit`) and web UI
23+
- **Intervention testing** — edit captured API requests (thinking, text, tool results, system prompt) and resample with modified inputs; available from both CLI (`harness resample-edit`) and web UI
2424
- **Session-level resampling** — re-run a forked session N times with full tool execution (`harness resample-session`)
2525
- **Turn-level replay** — branch execution from any API turn with exact-match context, filesystem reset via git worktrees, and full tool execution; replicates run in parallel (`harness replay`)
2626
- **Transcript capture** — Claude Code transcript JSONL copied into session output for replay support

README.md

Lines changed: 7 additions & 14 deletions
@@ -1,6 +1,6 @@
11
# AgentLens
22

3-
Developed at [MATS Exploration Phase](https://www.matsprogram.org/) under [Neel Nanda](https://github.com/neelnanda-io), for a research project with [Greg Kocher](https://github.com/gregkocher).
3+
> **This repository has moved to [dreadnode/agent-lens](https://github.com/dreadnode/agent-lens).** This copy is no longer maintained — please use the new location for the latest code, issues, and contributions.
44
55
A harness for running multi-session agent trajectories using the Claude Agent SDK, capturing them in [ATIF](https://harborframework.com/docs/agents/trajectory-format) (Agent Trajectory Interchange Format), and tracking file state changes across sessions.
66

@@ -13,7 +13,7 @@ The harness takes a YAML config describing a sequence of sessions (prompts to an
1313
- **ATIF trajectories** — standardized JSON capturing every agent step, tool call, observation, and thinking block
1414
- **Shadow git change tracking** — automatic tracking of all file changes via an invisible git repo, with per-step write attribution and full unified diffs
1515
- **Session chaining** — three modes for controlling how sessions relate to each other (isolated, chained, forked)
16-
- **Resampling & replay** — study behavioral variance at multiple levels: stateless API resampling, intervention testing (edit assistant text, tool results, or system prompts and resample), session-level resampling, and turn-level replay with full tool execution from any branch point
16+
- **Resampling & replay** — study behavioral variance at multiple levels: stateless API resampling, intervention testing (edit inputs and resample), session-level resampling, and turn-level replay with full tool execution from any branch point
1717
- **Subagent capture** — separate ATIF trajectories for each subagent invocation, linked to the parent via `SubagentTrajectoryRef`
1818

1919
## Install
@@ -325,7 +325,7 @@ Edit a captured API request and resample with the modified version — the CLI e
325325
# Step 1: Dump the request for editing
326326
harness resample-edit runs/my-run --session 1 --request 5 --dump > edit.json
327327

328-
# Step 2: Edit the JSON (assistant text, tool results, system prompt...)
328+
# Step 2: Edit the JSON (thinking, text, tool results, system prompt...)
329329
# Step 3: Resample with the modified request
330330
harness resample-edit runs/my-run --session 1 --request 5 \
331331
--input edit.json --label "removed hedging" --count 5
@@ -335,13 +335,11 @@ Pipe through `jq` for programmatic edits:
335335

336336
```bash
337337
harness resample-edit runs/my-run --session 1 --request 5 --dump \
338-
| jq '.system = "You are a cautious engineer. Double-check everything."' \
338+
| jq '.messages[-1].content[0].thinking = "Be more direct."' \
339339
| harness resample-edit runs/my-run --session 1 --request 5 \
340-
--input - --label "cautious prompt" --count 10
340+
--input - --label "direct thinking" --count 10
341341
```
342342

343-
> **Note:** Thinking blocks cannot be edited — they carry cryptographic signatures validated by the API. See [Thinking blocks](docs/guide/resampling.md#thinking-blocks-not-editable) for details.
344-
345343
Variants are saved alongside vanilla resamples and appear in the web UI.
346344

347345
### `harness resample-session`
@@ -362,18 +360,13 @@ Replay a session from any API turn with full tool execution. Each replicate runs
362360
# List available turns
363361
harness replay runs/my-run --session 1 --list-turns
364362

365-
# Replay from turn 5, three times (only session 1 runs)
363+
# Replay from turn 5, three times (runs in parallel)
366364
harness replay runs/my-run --session 1 --turn 5 --count 3
367365

368-
# Replay session 1 turn 5, then continue with sessions 2, 3, etc.
369-
harness replay runs/my-run --session 1 --turn 5 --continue-sessions
370-
371366
# Replay with an additional prompt after tool results
372367
harness replay runs/my-run --session 1 --turn 5 --prompt "Try a different approach"
373368
```
374369

375-
By default, replay only runs the targeted session. Use `--continue-sessions` to also run subsequent sessions from the original config.
376-
377370
Replay creates new run directories (e.g. `replay_my-run_s1_t5_r01_<timestamp>/`) with full artifacts. Each includes a `replay_meta.json` with provenance linking back to the source run, session, and turn. The source working directory is never modified.
378371

379372
## Web UI
@@ -395,7 +388,7 @@ Open `http://localhost:5173`. The UI reads from the `runs/` directory and provid
395388
- **API captures** — request/response viewer with token usage, system prompts, tool definitions, compaction events
396389
- **Subagent viewer** — separate trajectory view for each subagent, with task prompt and return value
397390
- **Resamples** — compare N resample outputs for a given API turn
398-
- **Edit & Resample** — interactive message editor for intervention testing: edit assistant text, tool results, or system prompts in the conversation, then resample with the modified input to study how changes affect behavior (thinking blocks are shown read-only — see [why](docs/guide/resampling.md#thinking-blocks-not-editable))
391+
- **Edit & Resample** — interactive message editor for intervention testing: edit thinking, text, tool results, or system prompts in the conversation, then resample with the modified input to study how changes affect behavior
399392
- **Changelog** — per-step file write log across all sessions with expandable diffs
400393
- **Config viewer** — frozen YAML config from the run
401394
- **Analysis** — rendered markdown from `analysis.md`

docs/cli.md

Lines changed: 10 additions & 14 deletions
@@ -126,7 +126,7 @@ Results are saved to `session_NN/resamples/request_NNN/` (and `request_NNN_vNN/`
126126

127127
Edit a captured API request and resample with the modified version.
128128

129-
For intervention strategy and output details, see [Resampling & Replay](guide/resampling.md#intervention-testing).
129+
For intervention strategy and output details, see [Resampling & Replay](guide/resampling.md#intervention-testing-edit-resample).
130130

131131
```bash
132132
harness resample-edit <run_dir> [OPTIONS]
@@ -151,9 +151,7 @@ harness resample-edit <run_dir> [OPTIONS]
151151
harness resample-edit runs/my-run --session 1 --request 5 --dump > edit.json
152152
```
153153

154-
**Step 2** — Edit the JSON file (change assistant text, tool results, system prompt, etc.), then resample.
155-
156-
> **Do not edit thinking blocks.** They carry cryptographic signatures validated by the API — any modification will cause a 400 error. See [Thinking blocks](guide/resampling.md#thinking-blocks-not-editable) for details.
154+
**Step 2** — Edit the JSON file (change thinking, text, tool results, system prompt, etc.), then resample:
157155

158156
```bash
159157
harness resample-edit runs/my-run --session 1 --request 5 \
@@ -164,19 +162,19 @@ harness resample-edit runs/my-run --session 1 --request 5 \
164162

165163
```bash
166164
harness resample-edit runs/my-run --session 1 --request 5 --dump \
167-
| jq '.system = "You are a cautious engineer. Always check for edge cases."' \
165+
| jq '.messages[-1].content[0].thinking = "I should be more direct."' \
168166
| harness resample-edit runs/my-run --session 1 --request 5 \
169-
--input - --label "cautious prompt" --count 10
167+
--input - --label "direct thinking" --count 10
170168
```
171169

172170
### Batch interventions
173171

174172
```bash
175173
for req in 3 5 7 9; do
176174
harness resample-edit runs/my-run --session 1 --request $req --dump \
177-
| jq '(.messages[] | select(.role == "user") | .content[] | select(.type == "tool_result")).content = "Error: file not found"' \
175+
| jq '.messages[-1].content[0].thinking = "Skip exploration, go straight to implementation."' \
178176
| harness resample-edit runs/my-run --session 1 --request $req \
179-
--input - --label "tool-error" --count 5
177+
--input - --label "skip-exploration" --count 5
180178
done
181179
```
182180

@@ -239,20 +237,18 @@ Turns in session 1 (12 total):
239237

240238
### Replaying
241239

242-
By default, only the targeted session is replayed. Use `--continue-sessions` to also run sessions after it.
243-
244240
```bash
245-
# Replay from turn 5, three times (only session 1 runs)
241+
# Replay from turn 5, three times (runs in parallel)
246242
harness replay runs/my-run --session 1 --turn 5 --count 3
247243

248-
# Replay session 1 turn 5, then continue with sessions 2, 3, etc.
249-
harness replay runs/my-run --session 1 --turn 5 --continue-sessions
250-
251244
# Replay with an additional prompt
252245
harness replay runs/my-run --session 1 --turn 5 --prompt "Try a different approach"
253246

254247
# Replay from turn 1 (re-run from scratch)
255248
harness replay runs/my-run --session 1 --turn 1 --count 2
249+
250+
# Replay session 1 turn 5, then continue sessions 2..end
251+
harness replay runs/my-run --session 1 --turn 5 --continue-sessions
256252
```
257253

258254
Each replay creates a new run directory (e.g. `replay_my-run_s1_t5_r01_2026-03-16T00-00-00/`) with full artifacts including `replay_meta.json` for provenance tracking. The source working directory is never modified — each replicate operates in its own git worktree.
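
When post-processing many replay runs, it can help to group directories by source run, session, and turn. The sketch below parses a directory name of the form shown in the example above. The pattern is inferred from that single example, so treat it as an assumption; `replay_meta.json` remains the authoritative provenance record. `parse_replay_dir` is a hypothetical helper, not part of the harness.

```python
import re
from typing import Optional

# Pattern inferred from the example name
# "replay_my-run_s1_t5_r01_2026-03-16T00-00-00" (an assumption).
REPLAY_DIR_RE = re.compile(
    r"^replay_(?P<run>.+)_s(?P<session>\d+)_t(?P<turn>\d+)"
    r"_r(?P<replicate>\d+)_(?P<timestamp>.+)$"
)


def parse_replay_dir(name: str) -> Optional[dict]:
    """Split a replay directory name into its parts, or return None."""
    match = REPLAY_DIR_RE.match(name)
    if match is None:
        return None
    return {
        "run": match.group("run"),
        "session": int(match.group("session")),
        "turn": int(match.group("turn")),
        "replicate": int(match.group("replicate")),
        "timestamp": match.group("timestamp"),
    }
```

Names that do not follow the pattern return `None`, so ordinary run directories can be filtered out in the same pass.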

docs/glossary.md

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ A full-fidelity re-execution from a specific turn. Each replicate runs in an iso
7373

7474
### Intervention (variant)
7575

76-
A modified resample — the API request is edited before being sent (e.g. changing assistant text, tool results, or system prompt) to test counterfactuals. Thinking blocks cannot be edited due to cryptographic signature requirements. Variants are saved alongside vanilla resamples with a `_vNN` suffix and include the edited request for reproducibility.
76+
A modified resample — the API request is edited before being sent (e.g. changing a thinking block or system prompt) to test counterfactuals. Variants are saved alongside vanilla resamples with a `_vNN` suffix and include the edited request for reproducibility.
7777

7878
### Shadow git
7979

docs/guide/output.md

Lines changed: 0 additions & 2 deletions
@@ -13,8 +13,6 @@ runs/<run_name>/
1313
1414
├── session_01/
1515
│ ├── trajectory.json # ATIF v1.6 trajectory (parent)
16-
│ ├── transcript.jsonl # Claude Code transcript (for replay)
17-
│ ├── uuid_map.json # turn correlation map (transcript ↔ ATIF ↔ raw dumps)
1816
│ ├── session_diff.patch # unified diff of this session's changes
1917
│ ├── subagent_<name>_<id>.json # subagent ATIF trajectory (if any)
2018
│ ├── api_captures.jsonl # API request/response metadata

docs/guide/resampling.md

Lines changed: 19 additions & 46 deletions
@@ -25,7 +25,7 @@ Cheapest / fastest Most thorough
2525
| I want to... | Method | Command |
2626
|--------------|--------|---------|
2727
| Check if the model would say the same thing again | [Turn resample](#turn-level-resampling) | `harness resample` |
28-
| See what happens if the model had seen different text or tool results | [Intervention](#intervention-testing) | `harness resample-edit` |
28+
| See what happens if the model had different thinking | [Intervention](#intervention-testing) | `harness resample-edit` |
2929
| See what happens if a tool returned something different | [Intervention](#intervention-testing) | `harness resample-edit` |
3030
| Compare N complete trajectories for the same task | [Session resample](#session-level-resampling) | `harness resample-session` |
3131
| Branch from a specific point and let the agent continue | [Turn replay](#turn-level-replay) | `harness replay` |
@@ -87,18 +87,17 @@ session_01/resamples/request_005/
8787

8888
## Intervention testing
8989

90-
Edit the conversation inputs — text, tool results, or system prompt — then resample. This lets you test counterfactuals: "What would the model do differently if it had seen X instead of Y?"
90+
Edit the conversation inputs — thinking blocks, text, tool results, or system prompt — then resample. This lets you test counterfactuals: "What would the model do differently if it had seen X instead of Y?"
9191

9292
Like turn-level resampling, this is **stateless** — no tools execute. But the input is modified before sending, so you can study causal effects.
9393

9494
**What you can edit:**
9595

96-
- **Assistant text** — alter what the model said in prior turns (e.g., remove hedging, change a decision)
97-
- **Tool results** — change what a tool returned (e.g., different file contents, simulated errors)
96+
- **Thinking blocks** — change the model's internal reasoning
97+
- **Text responses** — alter what the model said in prior turns
98+
- **Tool results** — change what a tool returned (e.g., different file contents)
9899
- **System prompt** — modify instructions
99100

100-
> **Note:** Thinking blocks are visible in the dump and UI but are **not editable** — the API requires cryptographic signatures on thinking blocks that can't survive modification. They are preserved as-is so the model retains its original reasoning context. See [Thinking blocks](#thinking-blocks) for details.
101-
102101
### From the CLI
103102

104103
Two-step workflow: dump the request, edit it, resample.
@@ -107,7 +106,7 @@ Two-step workflow: dump the request, edit it, resample.
107106
# 1. Dump the request to a file
108107
harness resample-edit runs/my-run --session 1 --request 5 --dump > edit.json
109108

110-
# 2. Edit edit.json (change assistant text, tool results, system prompt...)
109+
# 2. Edit edit.json (change thinking, text, tool results, system prompt...)
111110

112111
# 3. Resample with the modified request
113112
harness resample-edit runs/my-run --session 1 --request 5 \
@@ -117,30 +116,28 @@ harness resample-edit runs/my-run --session 1 --request 5 \
117116
For scriptable interventions, pipe through `jq`:
118117

119118
```bash
120-
# Change the system prompt
121119
harness resample-edit runs/my-run --session 1 --request 5 --dump \
122-
| jq '.system = "You are a cautious engineer. Always check for edge cases."' \
120+
| jq '.messages[-1].content[0].thinking = "Be more direct."' \
123121
| harness resample-edit runs/my-run --session 1 --request 5 \
124-
--input - --label "cautious prompt" --count 10
122+
--input - --label "direct thinking" --count 10
125123
```
126124

127125
Batch across multiple requests:
128126

129127
```bash
130-
# Change a tool result across several turns
131128
for req in 3 5 7 9; do
132129
harness resample-edit runs/my-run --session 1 --request $req --dump \
133-
| jq '(.messages[] | select(.role == "user") | .content[] | select(.type == "tool_result")).content = "Error: file not found"' \
130+
| jq '.messages[-1].content[0].thinking = "Skip exploration."' \
134131
| harness resample-edit runs/my-run --session 1 --request $req \
135-
--input - --label "tool-error" --count 5
132+
--input - --label "skip-exploration" --count 5
136133
done
137134
```
138135

139136
### From the web UI
140137

141138
1. Open a session's API captures
142139
2. Click "Edit & Resample" on any request
143-
3. Modify text, tool results, or system prompts (thinking blocks are shown read-only)
140+
3. Modify thinking blocks, text, tool results, or system prompts
144141
4. Resample with the modified input
145142

146143
### Output
@@ -214,24 +211,22 @@ Bracketed tags (e.g. `[_step_1_3]`) indicate shadow git snapshots — turns wher
214211

215212
### Running
216213

217-
By default, replay **only runs the targeted session** — it branches from the specified turn and lets the agent continue until that session ends. Subsequent sessions from the original config are not run.
218-
219-
To replay the full remaining experiment (the targeted session *and* all sessions after it), use `--continue-sessions`.
220-
221214
```bash
222-
# Replay from turn 5, three times (only session 1 runs)
215+
# Replay from turn 5, three times (runs in parallel)
223216
harness replay runs/my-run --session 1 --turn 5 --count 3
224217

225-
# Replay session 1 turn 5, then continue with sessions 2, 3, etc.
226-
harness replay runs/my-run --session 1 --turn 5 --continue-sessions
227-
228218
# Replay with an additional prompt after tool results
229219
harness replay runs/my-run --session 1 --turn 5 --prompt "Try a different approach"
230220

231221
# Replay from turn 1 (re-run from scratch with same config)
232222
harness replay runs/my-run --session 1 --turn 1 --count 2
223+
224+
# Replay session 1 turn 5, then continue with sessions 2..end
225+
harness replay runs/my-run --session 1 --turn 5 --continue-sessions
233226
```
234227

228+
When `--continue-sessions` is enabled, each replicate runs the replayed session first, then continues with sessions `N+1..end` from the original config.
229+
235230
### Output
236231

237232
Each replay creates a new independent run directory:
@@ -258,28 +253,6 @@ runs/replay_my-run_s1_t5_r01_2026-03-16T00-00-00/
258253

259254
Each session generates a `uuid_map.json` that correlates entries across the three data formats (transcript, ATIF trajectory, raw API dumps). The primary join key is `tool_call_id`. The replay system uses this to find shadow git tags for filesystem reset.
260255

261-
### Thinking blocks (not editable)
262-
263-
> **Warning:** Thinking blocks cannot be edited in interventions. Any attempt to modify thinking content in a dumped request JSON will cause the API to reject the request with a 400 error. The UI editor shows thinking blocks as read-only.
264-
265-
#### Why: cryptographic signatures
266-
267-
When the Anthropic API returns a response with extended thinking enabled, each `thinking` block includes a cryptographic `signature` field. On subsequent requests, the API validates this signature to confirm the thinking content has not been tampered with. This is a server-side integrity check — there is no way to regenerate or forge a valid signature outside of Anthropic's infrastructure.
268-
269-
This means:
270-
- **Unmodified thinking blocks** have valid signatures and are accepted by the API
271-
- **Edited thinking blocks** have invalidated signatures and are rejected (HTTP 400)
272-
- **Stripped signatures** (keeping the text but removing the `signature` field) are also rejected
273-
274-
`redacted_thinking` blocks are similarly protected — they contain opaque encrypted content that cannot be inspected or modified.
275-
276-
#### What this means for interventions
277-
278-
All resampling methods preserve thinking blocks with their original signatures intact, so the model always sees its full original reasoning context. This is faithful — the model receives the same thinking it originally produced.
279-
280-
To test counterfactuals about model behavior, edit the fields that *are* modifiable:
281-
- **Assistant text** — change what the model said (its visible output)
282-
- **Tool results** — change what a tool returned (e.g., different file contents, simulated errors)
283-
- **System prompt** — change the instructions
256+
### Thinking signatures
284257

285-
These fields have no signature requirements and can be freely modified.
258+
When resampling, the harness automatically strips thinking block signatures from the request. Signatures are response-specific and would cause errors if replayed verbatim.
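
Before deciding how to handle signatures, it can be useful to see which thinking blocks in a dumped request actually carry one. The following is a minimal inspection sketch, assuming the dump follows the Anthropic Messages API layout (`messages[].content[]` with blocks of `type: "thinking"` and an optional `signature` field). The exact output format of `harness resample-edit --dump` is assumed rather than confirmed, and `list_thinking_blocks` is a hypothetical helper.

```python
def list_thinking_blocks(request: dict) -> list[tuple[int, int, bool]]:
    """Return (message_index, block_index, has_signature) per thinking block.

    Messages whose content is a plain string are skipped, since they
    cannot contain typed blocks.
    """
    found = []
    for msg_idx, message in enumerate(request.get("messages", [])):
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for block_idx, block in enumerate(content):
            if isinstance(block, dict) and block.get("type") == "thinking":
                found.append((msg_idx, block_idx, bool(block.get("signature"))))
    return found
```

The function only reads the request and never modifies it, so it is safe to run on a dump before or after manual edits.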

docs/guide/web-ui.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ Configure the UI via `ui/.env` or shell environment:
2222
| `ANTHROPIC_API_KEY` || Required for resampling via Anthropic API |
2323
| `ANTHROPIC_BASE_URL` | `https://api.anthropic.com` | Override the API base URL for resampling |
2424

25-
The resampling API keys are needed for any resampling in the UI (both vanilla resamples and "Edit & Resample"). The UI auto-detects whether to use OpenRouter or Anthropic based on the original run's API target.
25+
The resampling API keys are only needed if you use the "Edit & Resample" feature in the UI. The UI auto-detects whether to use OpenRouter or Anthropic based on the original run's API target.
2626

2727
## Features
2828

@@ -62,7 +62,7 @@ Compare N resample outputs for a given API turn side-by-side.
6262
### Edit & Resample
6363
Interactive message editor for intervention testing:
6464

65-
1. Edit assistant text, tool results, or system prompts (thinking blocks are shown read-only)
65+
1. Edit thinking blocks, text, tool results, or system prompts
6666
2. Resample with the modified input
6767
3. Compare original vs. variant responses
6868
