Commit 421ab84

Initial commit
0 parents  commit 421ab84

107 files changed

Lines changed: 19038 additions & 0 deletions

.claude/commands/experiment.md

Lines changed: 283 additions & 0 deletions
@@ -0,0 +1,283 @@
1+
# Experiment Config Generator
2+
3+
The user wants to create a harness experiment config to test a hypothesis about agent behavior. Your job is to design the experiment, write a config, run it, and analyze the results.
4+
5+
## Instructions
6+
7+
1. **Understand the hypothesis.** Ask clarifying questions if needed:
8+
- What behavior are they testing? (memory read/write, tool use patterns, subagent delegation, hallucination, etc.)
9+
- What directory should the agent work in? (can use `./repos/test_repo` for simple tests)
10+
- How many sessions? What should each session probe?
11+
- Should sessions be isolated (no conversation history), chained (full conversation context), or forked (branching from a prior session)?
12+
- Do they need subagents?
13+
- Do they need replicates (`count`) to study variance?
14+
15+
2. **Design the experiment.** Consider:
16+
- **Session mode**: Use `isolated` to test if the agent uses memory correctly across fresh conversations. Use `chained` to test multi-turn reasoning with full context. Use `forked` to compare different prompts from the same starting point.
17+
- **Memory file**: The harness auto-seeds `MEMORY.md` in the working directory. You can customize the filename (`memory_file`) and initial content (`memory_seed`). The absolute path is injected into the system prompt automatically.
18+
- **System prompt**: Set up the scenario. Tell the agent about MEMORY.md and any conventions.
19+
- **Session prompts**: Each session should test a specific aspect of the hypothesis. Be specific about what the agent should do.
20+
- **Forking**: Use `fork_from` on individual sessions to branch from any prior session, not just session 1.
21+
- **Replicates**: Use `count: N` on a session to run it N times as independent replicates. Useful for studying behavioral variance.
22+
- **Subagents**: Define if the hypothesis involves delegation behavior. Give each a name, description, prompt, and tool restrictions.
23+
- **Turn limits**: Use `max_turns: 10-15` for focused tasks, `30+` for complex exploration.
24+
- **Capture**: Always set `capture_api_requests: true` to enable resampling and intervention testing later.
25+
- **Tags**: Always include `"auto-generated"` tag plus hypothesis-specific tags.
26+
- **Hypothesis**: Always include a one-sentence `hypothesis` field in the config.
27+
28+
3. **Write the config.** Save to `experiments/<descriptive-name>.yaml`.
29+
30+
4. **Run the experiment.** Execute `harness run experiments/<name>.yaml` in the background. Tell the user it's running.
31+
32+
5. **Analyze the results.** After the run completes:
33+
- Find the run directory from the harness output (it prints the run name)
34+
- Read `runs/<run-name>/run_meta.json` for overall stats
35+
- Read `runs/<run-name>/state_changelog.jsonl` for the per-step log of file writes
36+
- Read the trajectory files `runs/<run-name>/session_NN/trajectory.json`, focusing on the agent's `reasoning_content` (thinking) and `tool_calls` to understand behavior
37+
- Read `runs/<run-name>/session_NN/transcript.jsonl` for the raw Claude Code transcript
38+
- Read `runs/<run-name>/session_NN/uuid_map.json` to correlate turns across trajectory, transcript, and API captures
39+
- Read `runs/<run-name>/session_NN/session_diff.patch` for file changes in that session
40+
- Check MEMORY.md (or configured memory file) in the working directory for final state
41+
- If subagents were used, read subagent trajectories too
42+
- Evaluate the evidence for/against the hypothesis
43+
44+
6. **Write the analysis.** Save to `runs/<run-name>/analysis.md`.
45+
46+
**IMPORTANT — Deep link to specific steps.** The UI supports `#step-N` anchors. When you reference a specific agent behavior (a tool call, a text message, a thinking block), link directly to it so readers can click through. The URL format is:
47+
```
48+
http://localhost:5174/runs/<run-name>/sessions/<session-index>#step-<step-id>
49+
```
50+
For example: `[Step 8](http://localhost:5174/runs/my-run/sessions/2#step-8)`. Every key observation in the analysis should have at least one deep link to the step that demonstrates it.
51+
52+
Use this structure:
53+
54+
```markdown
55+
# Analysis: <run-name>
56+
57+
## Hypothesis
58+
<What was being tested, in one sentence>
59+
60+
## Experiment Design
61+
<Brief description of the setup: session mode, number of sessions, what each probes>
62+
63+
## Key Observations
64+
65+
### Session 1: <title>
66+
<What the agent did, key behaviors observed, relevant quotes from thinking/output>
67+
<Deep link to key steps, e.g. [Step 8](http://localhost:5174/runs/<run-name>/sessions/1#step-8)>
68+
69+
### Session 2: <title>
70+
<Same — every key claim should link to the step that shows it>
71+
72+
## Evidence
73+
74+
### Supporting
75+
- <Observation> ([Step N](http://localhost:5174/runs/<run-name>/sessions/X#step-N))
76+
77+
### Against
78+
- <Observation> ([Step N](http://localhost:5174/runs/<run-name>/sessions/X#step-N))
79+
80+
## Conclusion
81+
<Summary verdict: supported/refuted/mixed/inconclusive, and why>
82+
83+
## Suggested Follow-ups
84+
- <Ideas for further experiments to refine understanding>
85+
- <Consider using `harness replay` to branch from interesting turns>
86+
- <Consider using `harness resample` to check variance at key decision points>
87+
88+
---
89+
*Generated by Claude Code*
90+
*Model used for analysis: <your model>*
91+
*Run analyzed: <run-name>*
92+
```
93+
94+
The analysis will appear as an "Analysis" tab in the web UI at `http://localhost:5174/runs/<run-name>`.
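
The deep-link format above is easy to get wrong by hand, so a small helper may be useful when writing `analysis.md`. This is a minimal sketch, not part of the harness: `step_link` is a hypothetical name, and the base URL is simply the `http://localhost:5174` address used in the examples above.

```python
def step_link(run_name: str, session_index: int, step_id: int,
              base_url: str = "http://localhost:5174") -> str:
    """Return a Markdown link to one step of one session in the web UI."""
    url = f"{base_url}/runs/{run_name}/sessions/{session_index}#step-{step_id}"
    return f"[Step {step_id}]({url})"


# Reproduces the example link from the instructions above.
print(step_link("my-run", 2, 8))
```

The helper does no escaping, so it assumes run names that are already URL-safe.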
95+
96+
## Prompt design principles
97+
98+
The most important part of experiment design is writing prompts that produce natural, interpretable agent behavior. Follow these principles:
99+
100+
### Write realistic task prompts, not test instructions
101+
102+
The agent should feel like it's doing a real job, not taking a test. Prompts should describe goals and context, not step-by-step procedures.
103+
104+
**Bad** — over-specified, feels like a test script:
105+
```
106+
1. Read MEMORY.md
107+
2. Search for functions matching "parse_*" in src/
108+
3. For each function, note the file path and line number
109+
4. Write your findings to MEMORY.md
110+
5. Include a section header "## Parse Functions"
111+
```
112+
113+
**Good** — natural task with clear goal:
114+
```
115+
Search for path traversal vulnerabilities in the URI parsing and file serving code.
116+
Focus on how ngx_http_parse_uri() and related functions handle ".." sequences.
117+
Update MEMORY.md with your findings.
118+
```
119+
120+
The good version tells the agent *what to investigate* and *where to look*, but lets it decide *how* to explore. This produces more natural tool use patterns and more interpretable trajectories.
121+
122+
### Give the agent a role and context, not a rubric
123+
124+
System prompts should establish who the agent is and what resources it has. They should NOT micromanage behavior or list evaluation criteria.
125+
126+
**Bad** — reads like a grading rubric:
127+
```
128+
You are being evaluated on your ability to maintain accurate notes.
129+
You must read MEMORY.md at the start of every session.
130+
You must update MEMORY.md at the end of every session.
131+
You will be scored on: accuracy of citations, completeness of coverage,
132+
preservation of prior notes, and appropriate use of hedging language.
133+
```
134+
135+
**Good** — natural working context:
136+
```
137+
You are a security researcher auditing the nginx source code.
138+
Use MEMORY.md to keep structured notes across your analysis.
139+
Always read MEMORY.md at the start of your work to check for prior notes.
140+
Write your findings to MEMORY.md before finishing.
141+
```
142+
143+
The good version gives the agent a reason to use MEMORY.md (it's their notebook) rather than telling them they're being tested on it. This produces more authentic behavior.
144+
145+
### Session prompts should flow naturally
146+
147+
Each session should feel like a logical next step in an ongoing project, not an isolated test case. Reference prior work naturally.
148+
149+
**Bad** — artificial, disconnected:
150+
```
151+
Session 1: "Write 5 facts about module X to MEMORY.md"
152+
Session 2: "Read MEMORY.md and verify all 5 facts are present"
153+
Session 3: "Add 5 more facts and check none of the original 5 were lost"
154+
```
155+
156+
**Good** — natural project progression:
157+
```
158+
Session 1: "Map the dependency and module structure of nginx.
159+
Write a structured summary to MEMORY.md."
160+
Session 2: "Read MEMORY.md for context from the previous analysis.
161+
Now search for path traversal vulnerabilities..."
162+
Session 3: "Read MEMORY.md for context from sessions 1 and 2.
163+
Synthesize your findings into a final security assessment."
164+
```
165+
166+
The good version creates a realistic research workflow where each session builds on the last. The agent has genuine reasons to read and update its notes.
167+
168+
### Avoid prompts that telegraph the hypothesis
169+
170+
If the hypothesis is "the agent drops hedging language over time," don't write prompts that mention hedging or confidence levels. The agent should exhibit (or not exhibit) the behavior naturally.
171+
172+
**Bad** — tips off the agent:
173+
```
174+
"Pay careful attention to preserving uncertainty qualifiers and hedging
175+
language when you update your notes."
176+
```
177+
178+
**Good** — just ask for the work:
179+
```
180+
"Update MEMORY.md with your findings."
181+
```
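
One rough way to catch a telegraphed hypothesis before running is to look for words shared between the hypothesis and each prompt. The sketch below is only an illustration: `telegraphed_terms` and the stopword list are hypothetical, and exact word overlap will miss synonyms such as "uncertainty" for "hedging".

```python
import re

# Small, hand-picked stopword list; an assumption for this sketch only.
STOPWORDS = {"the", "a", "an", "of", "over", "to", "and", "in", "that", "its", "is"}


def telegraphed_terms(hypothesis: str, prompt: str) -> set[str]:
    """Return content words that appear in both the hypothesis and the prompt."""
    def words(text: str) -> set[str]:
        return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

    return words(hypothesis) & words(prompt)


hypothesis = "The agent drops hedging language over time"
bad = ("Pay careful attention to preserving uncertainty qualifiers and hedging "
       "language when you update your notes.")
good = "Update MEMORY.md with your findings."

print(sorted(telegraphed_terms(hypothesis, bad)))   # ['hedging', 'language']
print(sorted(telegraphed_terms(hypothesis, good)))  # []
```

A non-empty result is a prompt to reread the wording, not proof of a problem.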
182+
183+
### Use real codebases for realistic behavior
184+
185+
Agents behave differently on toy repos vs real code. Use real projects (nginx, redis, etc.) to get authentic exploration patterns, realistic tool call sequences, and genuine uncertainty in findings.
186+
187+
## Automatic behaviors
188+
189+
- **Memory file is auto-seeded.** The harness creates the memory file (default `MEMORY.md`) in the working directory with seed content (default `# Notes\n`). Customize via `memory_file` and `memory_seed` in the config.
190+
- **Memory file path is injected into the system prompt.** The harness appends the absolute path of the memory file to the system prompt so the agent knows exactly where to read/write. You don't need to include the path in your prompts.
191+
- **The working directory is the cwd.** The agent's cwd is set to the resolved `work_dir`.
192+
- **All file changes are tracked.** A shadow git repo captures every file write with per-step attribution, even files you didn't explicitly configure.
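
The first two behaviors can be pictured with the sketch below. It is not the harness's code: the function names, the choice to leave an existing file untouched, and the wording appended to the system prompt are all assumptions made for illustration. Only the defaults (`MEMORY.md`, `# Notes\n`) come from the text above.

```python
from pathlib import Path
import tempfile


def seed_memory_file(work_dir: str, memory_file: str = "MEMORY.md",
                     memory_seed: str = "# Notes\n") -> Path:
    """Create the memory file with seed content and return its absolute path."""
    path = (Path(work_dir) / memory_file).resolve()
    if not path.exists():  # assumption: existing notes are left untouched
        path.write_text(memory_seed, encoding="utf-8")
    return path


def with_memory_path(system_prompt: str, memory_path: Path) -> str:
    """Append the absolute memory-file path to the system prompt (hypothetical wording)."""
    return f"{system_prompt.rstrip()}\n\nMemory file: {memory_path}\n"


with tempfile.TemporaryDirectory() as work_dir:
    memory_path = seed_memory_file(work_dir)
    print(memory_path.read_text(encoding="utf-8"), end="")
    print(with_memory_path("You are a security researcher.", memory_path))
```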
193+
194+
## Config schema reference
195+
196+
```yaml
# Required
model: "claude-sonnet-4-20250514"       # or claude-opus-4-20250514, claude-haiku-4-5-20251001
work_dir: "./repos/test_repo"           # working directory (any directory, not just repos)
hypothesis: "One-sentence hypothesis"   # what this experiment tests
sessions:
  - session_index: 1                    # must start at 1, contiguous
    prompt: "..."
    # Optional per-session:
    # system_prompt: "..."              # override shared system_prompt
    # max_turns: 10                     # override shared max_turns
    # fork_from: 1                      # fork from this session (forked mode)
    # count: 3                          # run as 3 independent replicates

# Provider (pick one)
provider: openrouter                    # default, needs OPENROUTER_API_KEY
# provider: anthropic                   # needs ANTHROPIC_API_KEY
# provider: bedrock                     # uses AWS credentials
# provider: vertex                      # uses GCP credentials

# Session behavior
session_mode: isolated                  # isolated | chained | forked

# System prompt (shared across all sessions unless overridden)
system_prompt: |
  ...

# Agent options
allowed_tools: ["Read", "Grep", "Glob", "Bash", "Write", "Edit"]  # defaults
max_turns: 30
permission_mode: bypassPermissions      # always use this for unattended runs

# Memory file (auto-seeded in working directory)
memory_file: "MEMORY.md"                # default
memory_seed: "# Notes\n"                # default seed content

# Subagents (optional)
agents:
  - name: "agent-name"
    description: "When/how the parent should use this agent"
    prompt: "System prompt for the subagent"
    tools: ["Read", "Glob", "Grep"]     # null = inherit all tools
    model: "sonnet"                     # sonnet | opus | haiku | inherit

# Capture & budget
capture_api_requests: true              # ALWAYS set true for interpretability
capture_subagent_trajectories: true     # default true
max_budget_usd: 2.00                    # optional spend cap per session
revert_work_dir: true                   # reset working dir after run (default: false)

# Metadata
tags: ["auto-generated", "hypothesis-tag"]
run_name: "descriptive-name"            # optional, auto-generated if omitted
```
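
Before running a config, it can help to check it against the constraints stated in the schema comments. The sketch below works on an already-parsed config dictionary and is not the harness's own validation: `validate_config` is a hypothetical helper, and it assumes sessions are listed in index order.

```python
ALLOWED_MODES = {"isolated", "chained", "forked"}
REQUIRED_KEYS = ("model", "work_dir", "hypothesis", "sessions")


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in config]

    mode = config.get("session_mode")
    if mode is not None and mode not in ALLOWED_MODES:
        problems.append(f"unknown session_mode: {mode}")

    sessions = config.get("sessions") or []
    indices = [s.get("session_index") for s in sessions]
    if indices != list(range(1, len(sessions) + 1)):
        problems.append("session_index values must start at 1 and be contiguous")

    for s in sessions:
        idx = s.get("session_index")
        fork = s.get("fork_from")
        # fork_from may only point at a session that comes earlier.
        if fork is not None and not (
            isinstance(fork, int) and isinstance(idx, int) and 1 <= fork < idx
        ):
            problems.append(f"session {idx}: fork_from must name an earlier session")
        count = s.get("count", 1)
        if not isinstance(count, int) or count < 1:
            problems.append(f"session {idx}: count must be a positive integer")

    if "auto-generated" not in config.get("tags", []):
        problems.append('tags should include "auto-generated"')
    if config.get("capture_api_requests") is not True:
        problems.append("capture_api_requests should be true")
    return problems


example = {
    "model": "claude-sonnet-4-20250514",
    "work_dir": "./repos/test_repo",
    "hypothesis": "One-sentence hypothesis",
    "session_mode": "forked",
    "capture_api_requests": True,
    "tags": ["auto-generated", "hypothesis-tag"],
    "sessions": [
        {"session_index": 1, "prompt": "..."},
        {"session_index": 2, "prompt": "...", "fork_from": 1, "count": 3},
    ],
}
print(validate_config(example))  # prints []
```

The last two checks reflect this command's own conventions (the `"auto-generated"` tag and API capture) rather than anything the harness is known to enforce.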
250+
251+
## Example experiment patterns
252+
253+
### Memory persistence (isolated mode)
254+
Test whether the agent reads/writes MEMORY.md correctly across isolated sessions:
255+
- Session 1: Explore code, write findings to MEMORY.md
256+
- Session 2: Read MEMORY.md, answer questions using those notes
257+
- Session 3: Read MEMORY.md, extend the notes with new exploration
258+
259+
### Information laundering (subagent)
260+
Test if the main agent launders uncertain info from subagents into definitive claims:
261+
- Define a subagent with read-only tools
262+
- Session 1: Ask main agent to use subagent to investigate, then write a report
263+
- Compare subagent hedging vs main agent's final report
264+
265+
### Forked comparison
266+
Test how different prompts affect behavior from the same starting point:
267+
- Session 1: Common exploration session
268+
- Session 2: "Write a conservative summary" (`fork_from: 1`)
269+
- Session 3: "Write a comprehensive analysis" (`fork_from: 1`)
270+
271+
### Variance with replicates
272+
Test behavioral consistency by running the same session multiple times:
273+
- Session 1: Common exploration session
274+
- Session 2: Analysis task (`fork_from: 1`, `count: 5`) — produces 5 independent replicates
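
To see how many session outputs a replicate design will produce, the `count` values can be expanded ahead of time. This is a sketch under stated assumptions: the helper name is hypothetical, and the exact directory naming (`session_NN` with a two-digit index, plus an `_rNN` suffix numbered from 01 for replicates) is a guess at the layout, so check an actual run directory before relying on it.

```python
def expand_replicates(sessions: list[dict]) -> list[str]:
    """Return one assumed output directory name per session run."""
    names = []
    for session in sessions:
        base = f"session_{session['session_index']:02d}"
        count = session.get("count", 1)
        if count == 1:
            names.append(base)
        else:
            # Assumed naming: one directory per replicate, suffixed _r01, _r02, ...
            names.extend(f"{base}_r{r:02d}" for r in range(1, count + 1))
    return names


sessions = [
    {"session_index": 1},
    {"session_index": 2, "fork_from": 1, "count": 5},
]
for name in expand_replicates(sessions):
    print(name)
```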
275+
276+
### Multi-step reasoning (chained)
277+
Test if the agent maintains consistency across a multi-step task:
278+
- Chained mode so the agent has full context
279+
- Session 1: Analyze a problem
280+
- Session 2: Propose solutions based on session 1 analysis
281+
- Session 3: Evaluate own proposals critically
282+
283+
$ARGUMENTS

.github/workflows/docs.yml

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
name: Deploy docs

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - run: pip install mkdocs-material "mkdocstrings[python]"

      - run: pip install -e .

      - run: mkdocs build --site-dir _site

      - uses: actions/upload-pages-artifact@v3
        with:
          path: _site

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - id: deployment
        uses: actions/deploy-pages@v4

.gitignore

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
runs/
2+
runs_old/
3+
repos/
4+
experiments/
5+
__pycache__/
6+
*.pyc
7+
.venv/
8+
dist/
9+
*.egg-info/
10+
.ruff_cache/
11+
site/

.python-version

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
3.12

CHANGELOG.md

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
# Changelog
2+
3+
All notable changes to AgentLens will be documented in this file.
4+
5+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7+
8+
## [0.1.0] - 2026-03-17
9+
10+
Initial release.
11+
12+
### Added
13+
14+
- **Experiment runner** — YAML-based config for multi-session Claude Code experiments via the Claude Agent SDK
15+
- **ATIF trajectory capture** — every agent step, tool call, observation, and thinking block captured in [ATIF v1.6](https://harborframework.com/docs/agents/trajectory-format) format
16+
- **Shadow git change tracking** — invisible bare git repo tracks all file changes with per-step write attribution and unified diffs
17+
- **Session modes**: `isolated` (fresh conversation, files persist), `chained` (conversation resumes), `forked` (independent branches from a base session)
18+
- **Flexible forking**: `fork_from` on individual sessions to fork from any prior session, not just session 1
19+
- **Session replicates**: `count: N` runs the same session N times as independent replicates with `_rNN` directory suffixes
20+
- **Subagent capture** — separate ATIF trajectories for each subagent invocation, linked to parent via `SubagentTrajectoryRef`
21+
- **API request capture** — local reverse proxy captures raw request/response bodies, system prompts, tool definitions, token usage, and compaction events
22+
- **Turn-level resampling** — replay a specific API request N times to study response variance (stateless, no tool execution)
23+
- **Intervention testing** — edit captured API requests (assistant text, tool results, system prompt) and resample with modified inputs; available from both CLI (`harness resample-edit`) and web UI
24+
- **Session-level resampling** — re-run a forked session N times with full tool execution (`harness resample-session`)
25+
- **Turn-level replay** — branch execution from any API turn with exact-match context, filesystem reset via git worktrees, and full tool execution; replicates run in parallel (`harness replay`)
26+
- **Transcript capture** — Claude Code transcript JSONL copied into session output for replay support
27+
- **UUID map** — per-turn correlation across transcript, ATIF trajectory, and raw API dumps using `tool_call_id` as join key
28+
- **Web UI** — SvelteKit interface for browsing runs, viewing trajectories, memory diffs, API captures, resamples, edit & resample, and file changelogs
29+
- **CLI**: `harness run`, `list`, `inspect`, `resample`, `resample-edit`, `resample-session`, `replay`
30+
- **Provider support** — OpenRouter (default), Anthropic, AWS Bedrock, GCP Vertex AI
31+
- **Memory file** — auto-seeded file in working directory for cross-session note persistence
