Bug fixes:
- Fix fork point worktree ref in resample_session.py (was using "baseline" instead of fork point tag)
- Fix replay.py fork reset logic for replay context (always reset for forked sessions)
- Fix replay.py to honor --prompt at turn 1
- Fix transcript.py to handle string content in assistant messages
- Fix UI resample API to support replicate session keys (e.g. "2_r01")
- Fix CLI list to skip internal directories (_worktrees, .shadow_git)
- Clean up worktree parent directories after replay/resample
- Preserve thinking block signatures instead of stripping them (API rejects tampered signatures)
Docs:
- Fix resampling.md: thinking blocks listed as editable, now correctly shown as read-only
- Fix cli.md: broken anchor link to intervention testing section
- Fix output.md: add transcript.jsonl and uuid_map.json to session directory tree
- Update example configs to use current model name (claude-sonnet-4-20250514)
- Update web-ui.md: clarify API key needed for all resampling, not just edit & resample
Cleanup:
- Remove unused test configs (edit_test, final_chained, final_isolated_subagent, subagent_3session)
- Update test_resample.py to match signature preservation behavior
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
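Note on the transcript.py fix: the message above says assistant messages may carry `content` as a plain string rather than a list of blocks. The actual patch is not shown on this page. The following is only a minimal sketch of one way to handle both shapes; `normalize_content` and `assistant_text` are hypothetical names, not functions from the repository.

```python
from typing import Any


def normalize_content(content: Any) -> list[dict[str, Any]]:
    """Return message content as a list of content-block dicts.

    Hypothetical helper: accepts a bare string, a list of blocks
    (dicts or bare strings), or None.
    """
    if content is None:
        return []
    if isinstance(content, str):
        # A bare string is treated as a single text block.
        return [{"type": "text", "text": content}]
    if isinstance(content, list):
        return [
            {"type": "text", "text": block} if isinstance(block, str) else block
            for block in content
        ]
    raise TypeError(f"unsupported content type: {type(content).__name__}")


def assistant_text(message: dict[str, Any]) -> str:
    """Concatenate the text blocks of one message, skipping other block types."""
    blocks = normalize_content(message.get("content"))
    return "".join(b.get("text", "") for b in blocks if b.get("type") == "text")
```

Normalizing once at the boundary keeps downstream code free of `isinstance` checks on every access.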
CHANGELOG.md: 1 addition, 1 deletion
@@ -20,7 +20,7 @@ Initial release.
 - **Subagent capture** — separate ATIF trajectories for each subagent invocation, linked to parent via `SubagentTrajectoryRef`
 - **API request capture** — local reverse proxy captures raw request/response bodies, system prompts, tool definitions, token usage, and compaction events
 - **Turn-level resampling** — replay a specific API request N times to study response variance (stateless, no tool execution)
-- **Intervention testing** — edit captured API requests (assistant text, tool results, system prompt) and resample with modified inputs; available from both CLI (`harness resample-edit`) and web UI
+- **Intervention testing** — edit captured API requests (thinking, text, tool results, system prompt) and resample with modified inputs; available from both CLI (`harness resample-edit`) and web UI
 - **Session-level resampling** — re-run a forked session N times with full tool execution (`harness resample-session`)
 - **Turn-level replay** — branch execution from any API turn with exact-match context, filesystem reset via git worktrees, and full tool execution; replicates run in parallel (`harness replay`)
 - **Transcript capture** — Claude Code transcript JSONL copied into session output for replay support
README.md: 7 additions, 14 deletions
@@ -1,6 +1,6 @@
 # AgentLens

-Developed at [MATS Exploration Phase](https://www.matsprogram.org/) under [Neel Nanda](https://github.com/neelnanda-io), for a research project with [Greg Kocher](https://github.com/gregkocher).
+> **This repository has moved to [dreadnode/agent-lens](https://github.com/dreadnode/agent-lens).** This copy is no longer maintained — please use the new location for the latest code, issues, and contributions.

 A harness for running multi-session agent trajectories using the Claude Agent SDK, capturing them in [ATIF](https://harborframework.com/docs/agents/trajectory-format) (Agent Trajectory Interchange Format), and tracking file state changes across sessions.

@@ -13,7 +13,7 @@ The harness takes a YAML config describing a sequence of sessions (prompts to an
 - **ATIF trajectories** — standardized JSON capturing every agent step, tool call, observation, and thinking block
 - **Shadow git change tracking** — automatic tracking of all file changes via an invisible git repo, with per-step write attribution and full unified diffs
 - **Session chaining** — three modes for controlling how sessions relate to each other (isolated, chained, forked)
-- **Resampling & replay** — study behavioral variance at multiple levels: stateless API resampling, intervention testing (edit assistant text, tool results, or system prompts and resample), session-level resampling, and turn-level replay with full tool execution from any branch point
+- **Resampling & replay** — study behavioral variance at multiple levels: stateless API resampling, intervention testing (edit inputs and resample), session-level resampling, and turn-level replay with full tool execution from any branch point
 - **Subagent capture** — separate ATIF trajectories for each subagent invocation, linked to the parent via `SubagentTrajectoryRef`

 ## Install
@@ -325,7 +325,7 @@ Edit a captured API request and resample with the modified version — the CLI e
> **Note:** Thinking blocks cannot be edited — they carry cryptographic signatures validated by the API. See [Thinking blocks](docs/guide/resampling.md#thinking-blocks-not-editable) for details.
-
 Variants are saved alongside vanilla resamples and appear in the web UI.

 ### `harness resample-session`
@@ -362,18 +360,13 @@ Replay a session from any API turn with full tool execution. Each replicate runs
 # Replay with an additional prompt after tool results
 harness replay runs/my-run --session 1 --turn 5 --prompt "Try a different approach"
 ```

-By default, replay only runs the targeted session. Use `--continue-sessions` to also run subsequent sessions from the original config.
-
 Replay creates new run directories (e.g. `replay_my-run_s1_t5_r01_<timestamp>/`) with full artifacts. Each includes a `replay_meta.json` with provenance linking back to the source run, session, and turn. The source working directory is never modified.

 ## Web UI
@@ -395,7 +388,7 @@ Open `http://localhost:5173`. The UI reads from the `runs/` directory and provid
 - **API captures** — request/response viewer with token usage, system prompts, tool definitions, compaction events
 - **Subagent viewer** — separate trajectory view for each subagent, with task prompt and return value
 - **Resamples** — compare N resample outputs for a given API turn
-- **Edit & Resample** — interactive message editor for intervention testing: edit assistant text, tool results, or system prompts in the conversation, then resample with the modified input to study how changes affect behavior (thinking blocks are shown read-only — see [why](docs/guide/resampling.md#thinking-blocks-not-editable))
+- **Edit & Resample** — interactive message editor for intervention testing: edit thinking, text, tool results, or system prompts in the conversation, then resample with the modified input to study how changes affect behavior
 - **Changelog** — per-step file write log across all sessions with expandable diffs
 - **Config viewer** — frozen YAML config from the run
 - **Analysis** — rendered markdown from `analysis.md`
**Step 2** — Edit the JSON file (change assistant text, tool results, system prompt, etc.), then resample.
-
-> **Do not edit thinking blocks.** They carry cryptographic signatures validated by the API — any modification will cause a 400 error. See [Thinking blocks](guide/resampling.md#thinking-blocks-not-editable) for details.
+**Step 2** — Edit the JSON file (change thinking, text, tool results, system prompt, etc.), then resample:
Each replay creates a new run directory (e.g. `replay_my-run_s1_t5_r01_2026-03-16T00-00-00/`) with full artifacts including `replay_meta.json` for provenance tracking. The source working directory is never modified — each replicate operates in its own git worktree.
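Note on replay output names: the example above follows the shape `replay_<run>_s<session>_t<turn>_r<replicate>_<timestamp>`. `replay_meta.json` is the documented source of provenance, and its fields are not shown on this page. As a convenience only, the directory name can also be split apart. This is a minimal sketch that assumes the name always follows the shape of the example; `REPLAY_DIR_RE`, `ReplayDir`, and `parse_replay_dir` are hypothetical names, not part of the harness.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the single example name in the docs.
REPLAY_DIR_RE = re.compile(
    r"^replay_(?P<run>.+)_s(?P<session>\d+)_t(?P<turn>\d+)"
    r"_r(?P<replicate>\d+)_(?P<timestamp>.+)$"
)


class ReplayDir(NamedTuple):
    run: str
    session: int
    turn: int
    replicate: int
    timestamp: str


def parse_replay_dir(name: str) -> Optional[ReplayDir]:
    """Split a replay directory name into its parts, or return None."""
    match = REPLAY_DIR_RE.match(name.rstrip("/"))
    if match is None:
        return None
    return ReplayDir(
        run=match["run"],
        session=int(match["session"]),
        turn=int(match["turn"]),
        replicate=int(match["replicate"]),
        timestamp=match["timestamp"],
    )
```

For anything beyond grouping directories in a listing, reading `replay_meta.json` is the safer choice, since run names may themselves contain underscores.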
docs/glossary.md: 1 addition, 1 deletion
@@ -73,7 +73,7 @@ A full-fidelity re-execution from a specific turn. Each replicate runs in an iso

 ### Intervention (variant)

-A modified resample — the API request is edited before being sent (e.g. changing assistant text, tool results, or system prompt) to test counterfactuals. Thinking blocks cannot be edited due to cryptographic signature requirements. Variants are saved alongside vanilla resamples with a `_vNN` suffix and include the edited request for reproducibility.
+A modified resample — the API request is edited before being sent (e.g. changing a thinking block or system prompt) to test counterfactuals. Variants are saved alongside vanilla resamples with a `_vNN` suffix and include the edited request for reproducibility.
Edit the conversation inputs — text, tool results, or system prompt — then resample. This lets you test counterfactuals: "What would the model do differently if it had seen X instead of Y?"
+Edit the conversation inputs — thinking blocks, text, tool results, or system prompt — then resample. This lets you test counterfactuals: "What would the model do differently if it had seen X instead of Y?"

 Like turn-level resampling, this is **stateless** — no tools execute. But the input is modified before sending, so you can study causal effects.

 **What you can edit:**

-- **Assistant text** — alter what the model said in prior turns (e.g., remove hedging, change a decision)
-- **Tool results** — change what a tool returned (e.g., different file contents, simulated errors)
+- **Thinking blocks** — change the model's internal reasoning
+- **Text responses** — alter what the model said in prior turns
+- **Tool results** — change what a tool returned (e.g., different file contents)
 - **System prompt** — modify instructions

-> **Note:** Thinking blocks are visible in the dump and UI but are **not editable** — the API requires cryptographic signatures on thinking blocks that can't survive modification. They are preserved as-is so the model retains its original reasoning context. See [Thinking blocks](#thinking-blocks) for details.
-
 ### From the CLI

 Two-step workflow: dump the request, edit it, resample.
By default, replay **only runs the targeted session** — it branches from the specified turn and lets the agent continue until that session ends. Subsequent sessions from the original config are not run.
-
-To replay the full remaining experiment (the targeted session *and* all sessions after it), use `--continue-sessions`.
-
 ```bash
-# Replay from turn 5, three times (only session 1 runs)
+# Replay from turn 5, three times (runs in parallel)
When `--continue-sessions` is enabled, each replicate runs the replayed session first, then continues with sessions `N+1..end` from the original config.
+
 ### Output

 Each replay creates a new independent run directory:
Each session generates a `uuid_map.json` that correlates entries across the three data formats (transcript, ATIF trajectory, raw API dumps). The primary join key is `tool_call_id`. The replay system uses this to find shadow git tags for filesystem reset.

-### Thinking blocks (not editable)
-
-> **Warning:** Thinking blocks cannot be edited in interventions. Any attempt to modify thinking content in a dumped request JSON will cause the API to reject the request with a 400 error. The UI editor shows thinking blocks as read-only.
-
-#### Why: cryptographic signatures
-
-When the Anthropic API returns a response with extended thinking enabled, each `thinking` block includes a cryptographic `signature` field. On subsequent requests, the API validates this signature to confirm the thinking content has not been tampered with. This is a server-side integrity check — there is no way to regenerate or forge a valid signature outside of Anthropic's infrastructure.
-
-This means:
-- **Unmodified thinking blocks** have valid signatures and are accepted by the API
-- **Edited thinking blocks** have invalidated signatures and are rejected (HTTP 400)
-- **Stripped signatures** (keeping the text but removing the `signature` field) are also rejected
-
-`redacted_thinking` blocks are similarly protected — they contain opaque encrypted content that cannot be inspected or modified.
-
-#### What this means for interventions
-
-All resampling methods preserve thinking blocks with their original signatures intact, so the model always sees its full original reasoning context. This is faithful — the model receives the same thinking it originally produced.
-
-To test counterfactuals about model behavior, edit the fields that *are* modifiable:
-- **Assistant text** — change what the model said (its visible output)
-- **Tool results** — change what a tool returned (e.g., different file contents, simulated errors)
-- **System prompt** — change the instructions
+### Thinking signatures

-These fields have no signature requirements and can be freely modified.
+When resampling, the harness automatically strips thinking block signatures from the request. Signatures are response-specific and would cause errors if replayed verbatim.
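Note on thinking blocks: the commit message and the "Thinking blocks" text in this diff both state that the API rejects thinking blocks that were edited or had their `signature` removed. A pre-flight check can catch this before a request is sent. This is a minimal sketch that assumes the dumped request follows the Anthropic Messages API shape (a `messages` list whose `content` is a list of typed blocks); `protected_blocks` and `modified_thinking_blocks` are hypothetical names, not harness functions.

```python
from typing import Any

# Block types that the text above describes as signature-protected.
PROTECTED_TYPES = {"thinking", "redacted_thinking"}


def protected_blocks(request: dict[str, Any]) -> dict[tuple[int, int], dict[str, Any]]:
    """Map (message index, block index) to each protected block in a request."""
    found: dict[tuple[int, int], dict[str, Any]] = {}
    for m_idx, message in enumerate(request.get("messages", [])):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # string content cannot hold thinking blocks
        for b_idx, block in enumerate(content):
            if isinstance(block, dict) and block.get("type") in PROTECTED_TYPES:
                found[(m_idx, b_idx)] = block
    return found


def modified_thinking_blocks(
    original: dict[str, Any], edited: dict[str, Any]
) -> list[tuple[int, int]]:
    """Return positions where a protected block was changed, removed, or added."""
    before = protected_blocks(original)
    after = protected_blocks(edited)
    changed = [pos for pos, block in before.items() if after.get(pos) != block]
    added = [pos for pos in after if pos not in before]
    return sorted(changed + added)
```

An empty result means every thinking block, including its signature, is identical to the captured request. Any other result lists the positions to restore before resampling. The check is positional, so inserting or deleting other blocks in the same message also shows up as a change.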
docs/guide/web-ui.md: 2 additions, 2 deletions
@@ -22,7 +22,7 @@ Configure the UI via `ui/.env` or shell environment:
 |`ANTHROPIC_API_KEY`| — | Required for resampling via Anthropic API |
 |`ANTHROPIC_BASE_URL`|`https://api.anthropic.com`| Override the API base URL for resampling |

-The resampling API keys are needed for any resampling in the UI (both vanilla resamples and "Edit & Resample"). The UI auto-detects whether to use OpenRouter or Anthropic based on the original run's API target.
+The resampling API keys are only needed if you use the "Edit & Resample" feature in the UI. The UI auto-detects whether to use OpenRouter or Anthropic based on the original run's API target.

 ## Features

@@ -62,7 +62,7 @@ Compare N resample outputs for a given API turn side-by-side.
 ### Edit & Resample
 Interactive message editor for intervention testing:

-1. Edit assistant text, tool results, or system prompts (thinking blocks are shown read-only)
+1. Edit thinking blocks, text, tool results, or system prompts