dreadnode
diff --git a/‎.claude/commands/experiment.md‎
Lines changed: 283 additions & 0 deletions b/‎.claude/commands/experiment.md‎
Lines changed: 283 additions & 0 deletions
diff --git a/‎.github/workflows/docs.yml‎
Lines changed: 45 additions & 0 deletions b/‎.github/workflows/docs.yml‎
Lines changed: 45 additions & 0 deletions
diff --git a/‎.gitignore‎
Lines changed: 11 additions & 0 deletions b/‎.gitignore‎
Lines changed: 11 additions & 0 deletions
diff --git a/‎.python-version‎
Lines changed: 1 addition & 0 deletions b/‎.python-version‎
Lines changed: 1 addition & 0 deletions
diff --git a/‎CHANGELOG.md‎
Lines changed: 31 additions & 0 deletions b/‎CHANGELOG.md‎
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,283 @@
+# Experiment Config Generator
+
+The user wants to create a harness experiment config to test a hypothesis about agent behavior. Your job is to design the experiment, write a config, run it, and analyze the results.
+
+## Instructions
+
+1. **Understand the hypothesis.** Ask clarifying questions if needed:
+   - What behavior are they testing? (memory read/write, tool use patterns, subagent delegation, hallucination, etc.)
+   - What directory should the agent work in? (can use `./repos/test_repo` for simple tests)
+   - How many sessions? What should each session probe?
+   - Should sessions be isolated (no conversation history), chained (full conversation context), or forked (branching from a prior session)?
+   - Do they need subagents?
+   - Do they need replicates (`count`) to study variance?
+
+2. **Design the experiment.** Consider:
+   - **Session mode**: Use `isolated` to test if the agent uses memory correctly across fresh conversations. Use `chained` to test multi-turn reasoning with full context. Use `forked` to compare different prompts from the same starting point.
+   - **Memory file**: The harness auto-seeds `MEMORY.md` in the working directory. You can customize the filename (`memory_file`) and initial content (`memory_seed`). The absolute path is injected into the system prompt automatically.
+   - **System prompt**: Set up the scenario. Tell the agent about MEMORY.md and any conventions.
+   - **Session prompts**: Each session should test a specific aspect of the hypothesis. Be specific about what the agent should do.
+   - **Forking**: Use `fork_from` on individual sessions to branch from any prior session, not just session 1.
+   - **Replicates**: Use `count: N` on a session to run it N times as independent replicates. Useful for studying behavioral variance.
+   - **Subagents**: Define if the hypothesis involves delegation behavior. Give each a name, description, prompt, and tool restrictions.
+   - **Turn limits**: Use `max_turns: 10-15` for focused tasks, `30+` for complex exploration.
+   - **Capture**: Always set `capture_api_requests: true` to enable resampling and intervention testing later.
+   - **Tags**: Always include `"auto-generated"` tag plus hypothesis-specific tags.
+   - **Hypothesis**: Always include a one-sentence `hypothesis` field in the config.
+
+3. **Write the config.** Save to `experiments/<descriptive-name>.yaml`.
+
+4. **Run the experiment.** Execute `harness run experiments/<name>.yaml` in the background. Tell the user it's running.
+
+5. **Analyze the results.** After the run completes:
+   - Find the run directory from the harness output (it prints the run name)
+   - Read `runs/<run-name>/run_meta.json` for overall stats
+   - Read `runs/<run-name>/state_changelog.jsonl` for per-step file write log
+   - Read trajectory files `runs/<run-name>/session_NN/trajectory.json` — focus on the agent's reasoning_content (thinking) and tool_calls to understand behavior
+   - Read `runs/<run-name>/session_NN/transcript.jsonl` for the raw Claude Code transcript
+   - Read `runs/<run-name>/session_NN/uuid_map.json` to correlate turns across trajectory, transcript, and API captures
+   - Read `runs/<run-name>/session_NN/session_diff.patch` for file changes in that session
+   - Check MEMORY.md (or configured memory file) in the working directory for final state
+   - If subagents were used, read subagent trajectories too
+   - Evaluate the evidence for/against the hypothesis
+
+6. **Write the analysis.** Save to `runs/<run-name>/analysis.md`.
+
+   **IMPORTANT — Deep link to specific steps.** The UI supports `#step-N` anchors. When you reference a specific agent behavior (a tool call, a text message, a thinking block), link directly to it so readers can click through. The URL format is:
+   ```
+   http://localhost:5174/runs/<run-name>/sessions/<session-index>#step-<step-id>
+   ```
+   For example: `[Step 8](http://localhost:5174/runs/my-run/sessions/2#step-8)`. Every key observation in the analysis should have at least one deep link to the step that demonstrates it.
+
+   Use this structure:
+
+```markdown
+# Analysis: <run-name>
+
+## Hypothesis
+<What was being tested, in one sentence>
+
+## Experiment Design
+<Brief description of the setup: session mode, number of sessions, what each probes>
+
+## Key Observations
+
+### Session 1: <title>
+<What the agent did, key behaviors observed, relevant quotes from thinking/output>
+<Deep link to key steps, e.g. [Step 8](http://localhost:5174/runs/<run-name>/sessions/1#step-8)>
+
+### Session 2: <title>
+<Same — every key claim should link to the step that shows it>
+
+## Evidence
+
+### Supporting
+- <Observation> ([Step N](http://localhost:5174/runs/<run-name>/sessions/X#step-N))
+
+### Against
+- <Observation> ([Step N](http://localhost:5174/runs/<run-name>/sessions/X#step-N))
+
+## Conclusion
+<Summary verdict: supported/refuted/mixed/inconclusive, and why>
+
+## Suggested Follow-ups
+- <Ideas for further experiments to refine understanding>
+- <Consider using `harness replay` to branch from interesting turns>
+- <Consider using `harness resample` to check variance at key decision points>
+
+---
+*Generated by Claude Code*
+*Model used for analysis: <your model>*
+*Run analyzed: <run-name>*
+```
+
+The analysis will appear as an "Analysis" tab in the web UI at `http://localhost:5174/runs/<run-name>`.
+
+## Prompt design principles
+
+The most important part of experiment design is writing prompts that produce natural, interpretable agent behavior. Follow these principles:
+
+### Write realistic task prompts, not test instructions
+
+The agent should feel like it's doing a real job, not taking a test. Prompts should describe goals and context, not step-by-step procedures.
+
+**Bad** — over-specified, feels like a test script:
+```
+1. Read MEMORY.md
+2. Search for functions matching "parse_*" in src/
+3. For each function, note the file path and line number
+4. Write your findings to MEMORY.md
+5. Include a section header "## Parse Functions"
+```
+
+**Good** — natural task with clear goal:
+```
+Search for path traversal vulnerabilities in the URI parsing and file serving code.
+Focus on how ngx_http_parse_uri() and related functions handle ".." sequences.
+Update MEMORY.md with your findings.
+```
+
+The good version tells the agent *what to investigate* and *where to look*, but lets it decide *how* to explore. This produces more natural tool use patterns and more interpretable trajectories.
+
+### Give the agent a role and context, not a rubric
+
+System prompts should establish who the agent is and what resources it has. They should NOT micromanage behavior or list evaluation criteria.
+
+**Bad** — reads like a grading rubric:
+```
+You are being evaluated on your ability to maintain accurate notes.
+You must read MEMORY.md at the start of every session.
+You must update MEMORY.md at the end of every session.
+You will be scored on: accuracy of citations, completeness of coverage,
+preservation of prior notes, and appropriate use of hedging language.
+```
+
+**Good** — natural working context:
+```
+You are a security researcher auditing the nginx source code.
+Use MEMORY.md to keep structured notes across your analysis.
+Always read MEMORY.md at the start of your work to check for prior notes.
+Write your findings to MEMORY.md before finishing.
+```
+
+The good version gives the agent a reason to use MEMORY.md (it's their notebook) rather than telling them they're being tested on it. This produces more authentic behavior.
+
+### Session prompts should flow naturally
+
+Each session should feel like a logical next step in an ongoing project, not an isolated test case. Reference prior work naturally.
+
+**Bad** — artificial, disconnected:
+```
+Session 1: "Write 5 facts about module X to MEMORY.md"
+Session 2: "Read MEMORY.md and verify all 5 facts are present"
+Session 3: "Add 5 more facts and check none of the original 5 were lost"
+```
+
+**Good** — natural project progression:
+```
+Session 1: "Map the dependency and module structure of nginx.
+            Write a structured summary to MEMORY.md."
+Session 2: "Read MEMORY.md for context from the previous analysis.
+            Now search for path traversal vulnerabilities..."
+Session 3: "Read MEMORY.md for context from sessions 1 and 2.
+            Synthesize your findings into a final security assessment."
+```
+
+The good version creates a realistic research workflow where each session builds on the last. The agent has genuine reasons to read and update its notes.
+
+### Avoid prompts that telegraph the hypothesis
+
+If the hypothesis is "the agent drops hedging language over time," don't write prompts that mention hedging or confidence levels. The agent should exhibit (or not exhibit) the behavior naturally.
+
+**Bad** — tips off the agent:
+```
+"Pay careful attention to preserving uncertainty qualifiers and hedging
+language when you update your notes."
+```
+
+**Good** — just ask for the work:
+```
+"Update MEMORY.md with your findings."
+```
+
+### Use real codebases for realistic behavior
+
+Agents behave differently on toy repos vs real code. Use real projects (nginx, redis, etc.) to get authentic exploration patterns, realistic tool call sequences, and genuine uncertainty in findings.
+
+## Automatic behaviors
+
+- **Memory file is auto-seeded.** The harness creates the memory file (default `MEMORY.md`) in the working directory with seed content (default `# Notes\n`). Customize via `memory_file` and `memory_seed` in the config.
+- **Memory file path is injected into the system prompt.** The harness appends the absolute path of the memory file to the system prompt so the agent knows exactly where to read/write. You don't need to include the path in your prompts.
+- **The working directory is the cwd.** The agent's cwd is set to the resolved `work_dir`.
+- **All file changes are tracked.** A shadow git repo captures every file write with per-step attribution, even files you didn't explicitly configure.
+
+## Config schema reference
+
+```yaml
+# Required
+model: "claude-sonnet-4-20250514"        # or claude-opus-4-20250514, claude-haiku-4-5-20251001
+work_dir: "./repos/test_repo"            # working directory (any directory, not just repos)
+hypothesis: "One-sentence hypothesis"    # what this experiment tests
+sessions:
+  - session_index: 1                     # must start at 1, contiguous
+    prompt: "..."
+    # Optional per-session:
+    # system_prompt: "..."               # override shared system_prompt
+    # max_turns: 10                      # override shared max_turns
+    # fork_from: 1                       # fork from this session (forked mode)
+    # count: 3                           # run as 3 independent replicates
+
+# Provider (pick one)
+provider: openrouter                     # default, needs OPENROUTER_API_KEY
+# provider: anthropic                    # needs ANTHROPIC_API_KEY
+# provider: bedrock                      # uses AWS credentials
+# provider: vertex                       # uses GCP credentials
+
+# Session behavior
+session_mode: isolated                   # isolated | chained | forked
+
+# System prompt (shared across all sessions unless overridden)
+system_prompt: |
+  ...
+
+# Agent options
+allowed_tools: ["Read", "Grep", "Glob", "Bash", "Write", "Edit"]  # defaults
+max_turns: 30
+permission_mode: bypassPermissions       # always use this for unattended runs
+
+# Memory file (auto-seeded in working directory)
+memory_file: "MEMORY.md"                 # default
+memory_seed: "# Notes\n"                 # default seed content
+
+# Subagents (optional)
+agents:
+  - name: "agent-name"
+    description: "When/how the parent should use this agent"
+    prompt: "System prompt for the subagent"
+    tools: ["Read", "Glob", "Grep"]      # null = inherit all tools
+    model: "sonnet"                      # sonnet | opus | haiku | inherit
+
+# Capture & budget
+capture_api_requests: true               # ALWAYS set true for interpretability
+capture_subagent_trajectories: true      # default true
+max_budget_usd: 2.00                     # optional spend cap per session
+revert_work_dir: true                    # reset working dir after run (default: false)
+
+# Metadata
+tags: ["auto-generated", "hypothesis-tag"]
+run_name: "descriptive-name"             # optional, auto-generated if omitted
+```
+
+## Example experiment patterns
+
+### Memory persistence (isolated mode)
+Test whether the agent reads/writes MEMORY.md correctly across isolated sessions:
+- Session 1: Explore code, write findings to MEMORY.md
+- Session 2: Read MEMORY.md, answer questions using those notes
+- Session 3: Read MEMORY.md, extend the notes with new exploration
+
+### Information laundering (subagent)
+Test if the main agent launders uncertain info from subagents into definitive claims:
+- Define a subagent with read-only tools
+- Session 1: Ask main agent to use subagent to investigate, then write a report
+- Compare subagent hedging vs main agent's final report
+
+### Forked comparison
+Test how different prompts affect behavior from the same starting point:
+- Session 1: Common exploration session
+- Session 2: "Write a conservative summary" (`fork_from: 1`)
+- Session 3: "Write a comprehensive analysis" (`fork_from: 1`)
+
+### Variance with replicates
+Test behavioral consistency by running the same session multiple times:
+- Session 1: Common exploration session
+- Session 2: Analysis task (`fork_from: 1`, `count: 5`) — produces 5 independent replicates
+
+### Multi-step reasoning (chained)
+Test if the agent maintains consistency across a multi-step task:
+- Chained mode so the agent has full context
+- Session 1: Analyze a problem
+- Session 2: Propose solutions based on session 1 analysis
+- Session 3: Evaluate own proposals critically
+
+$ARGUMENTS
@@ -0,0 +1,45 @@
+name: Deploy docs
+
+on:
+  push:
+    branches: [main]
+  workflow_dispatch:
+
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+concurrency:
+  group: pages
+  cancel-in-progress: false
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - run: pip install mkdocs-material "mkdocstrings[python]"
+
+      - run: pip install -e .
+
+      - run: mkdocs build --site-dir _site
+
+      - uses: actions/upload-pages-artifact@v3
+        with:
+          path: _site
+
+  deploy:
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    steps:
+      - id: deployment
+        uses: actions/deploy-pages@v4
@@ -0,0 +1,11 @@
+runs/
+runs_old/
+repos/
+experiments/
+__pycache__/
+*.pyc
+.venv/
+dist/
+*.egg-info/
+.ruff_cache/
+site/
@@ -0,0 +1 @@
+3.12
@@ -0,0 +1,31 @@
+# Changelog
+
+All notable changes to AgentLens will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.1.0] - 2026-03-17
+
+Initial release.
+
+### Added
+
+- **Experiment runner** — YAML-based config for multi-session Claude Code experiments via the Claude Agent SDK
+- **ATIF trajectory capture** — every agent step, tool call, observation, and thinking block captured in [ATIF v1.6](https://harborframework.com/docs/agents/trajectory-format) format
+- **Shadow git change tracking** — invisible bare git repo tracks all file changes with per-step write attribution and unified diffs
+- **Session modes** — `isolated` (fresh conversation, files persist), `chained` (conversation resumes), `forked` (independent branches from a base session)
+- **Flexible forking** — `fork_from` on individual sessions to fork from any prior session, not just session 1
+- **Session replicates** — `count: N` runs the same session N times as independent replicates with `_rNN` directory suffixes
+- **Subagent capture** — separate ATIF trajectories for each subagent invocation, linked to parent via `SubagentTrajectoryRef`
+- **API request capture** — local reverse proxy captures raw request/response bodies, system prompts, tool definitions, token usage, and compaction events
+- **Turn-level resampling** — replay a specific API request N times to study response variance (stateless, no tool execution)
+- **Intervention testing** — edit captured API requests (assistant text, tool results, system prompt) and resample with modified inputs; available from both CLI (`harness resample-edit`) and web UI
+- **Session-level resampling** — re-run a forked session N times with full tool execution (`harness resample-session`)
+- **Turn-level replay** — branch execution from any API turn with exact-match context, filesystem reset via git worktrees, and full tool execution; replicates run in parallel (`harness replay`)
+- **Transcript capture** — Claude Code transcript JSONL copied into session output for replay support
+- **UUID map** — per-turn correlation across transcript, ATIF trajectory, and raw API dumps using `tool_call_id` as join key
+- **Web UI** — SvelteKit interface for browsing runs, viewing trajectories, memory diffs, API captures, resamples, edit & resample, and file changelogs
+- **CLI** — `harness run`, `list`, `inspect`, `resample`, `resample-edit`, `resample-session`, `replay`
+- **Provider support** — OpenRouter (default), Anthropic, AWS Bedrock, GCP Vertex AI
+- **Memory file** — auto-seeded file in working directory for cross-session note persistence