Commit 3db11ba

Add impact analysis benchmark harness
* Add impact analysis benchmark harness
* Keep benchmark harness separate from results
1 parent 4ee130f commit 3db11ba

25 files changed

Lines changed: 4463 additions & 0 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 dist/
 npm/bin/
+/target/
 *.exe
 /supermodel
 /cli

benchmark/agent-impact/Dockerfile

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
FROM node:24-bookworm

RUN apt-get update \
  && apt-get install -y --no-install-recommends git ca-certificates python3 make g++ bash jq ripgrep time \
  && rm -rf /var/lib/apt/lists/*

RUN corepack enable \
  && npm install -g @openai/codex@0.128.0

WORKDIR /workspace

benchmark/agent-impact/README.md

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
# Agent Impact A/B Benchmark

This benchmark answers a different question than the compiler-only impact benchmark:

> Does impact-analysis context make a top agent repair real breakage with fewer tokens, less wall time, fewer tool calls, or a higher success rate?

The comparison must isolate one variable:

- **control:** agent receives the broken repository and task prompt.
- **impact:** agent receives the same broken repository and task prompt, plus `IMPACT_ANALYSIS.md` and `impact-analysis.json`.

Everything else is held constant: model, container image, repository commit, mutation, verifier, timeout, and prompt wording.

## Protocol

For each case:

1. Clone a pinned public repository commit.
2. Install dependencies inside Docker.
3. Run the verifier once to prove the clean checkout is green.
4. Apply a configured mutation or deletion.
5. Run the verifier again and require that it fails with at least one real source/test/type failure.
6. Run the control agent in a fresh Docker container.
7. Run the impact-context agent in another fresh Docker container.
8. Run the verifier after each agent.
9. Record success, wall time, token usage, tool calls, files changed, final diff, stdout/stderr, and raw agent JSONL.

The agent prompt forbids simply reverting the target mutation. The harness also checks that the configured `mutation.mustContain` text still exists after the agent run.
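
The revert guard described above could be sketched roughly as follows. This is an illustration only, not the harness's actual code: the helper name `checkMutationPreserved`, the `mutation.file` field, and the rule that a deleted target file counts as a revert are all assumptions. Only the `mutation.mustContain` field name comes from this README.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: returns true when the mutated text is still present
// in the target file after the agent has finished editing.
function checkMutationPreserved(repoDir, mutation) {
  const target = path.join(repoDir, mutation.file);
  // Assumption: a missing target file is treated the same as a revert.
  if (!fs.existsSync(target)) return false;
  return fs.readFileSync(target, 'utf8').includes(mutation.mustContain);
}

// A throwaway directory stands in for the agent's workspace.
const repoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'agent-impact-'));
fs.mkdirSync(path.join(repoDir, 'src'));
fs.writeFileSync(
  path.join(repoDir, 'src/queue.ts'),
  'export function add(task, priority) {}\n',
);

// Invented mutation config for illustration.
const mutation = {file: 'src/queue.ts', mustContain: 'add(task, priority)'};
const preserved = checkMutationPreserved(repoDir, mutation);
console.log(preserved);
```

A substring check like this is cheap but coarse: it cannot tell whether the agent kept the mutation and neutralized it some other way, so the final diff still deserves a manual look.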

## Ten-Repository Set

The initial manifest is [agent-impact-repos.json](./agent-impact-repos.json). It uses 10 pinned public repositories:

- `tinylibs/tinyspy`
- `tinylibs/tinybench`
- `sindresorhus/p-queue`
- `sindresorhus/ky`
- `sindresorhus/p-map`
- `sindresorhus/p-retry`
- `sindresorhus/p-timeout`
- `sindresorhus/p-throttle`
- `chalk/chalk`
- `sindresorhus/yoctocolors`

The set intentionally mixes implementation TypeScript, declaration-heavy packages, test/type-test repairs, class constructors, default exports, and shared helpers.

## Build

```bash
docker build -t supermodel-agent-impact:local benchmark/agent-impact
```

## Dry Run

```bash
node benchmark/agent-impact/run-agent-impact-ab.mjs --dry-run
```

This validates the manifest shape and writes the prompts/run plan without invoking the model.

## Implementation Tests

The API ranking implementation is tested in the public API repository. From the paired public API checkout:

```bash
export SUPERMODEL_PUBLIC_API_REPO=/path/to/supermodel-public-api
cd "$SUPERMODEL_PUBLIC_API_REPO/src/data-plane"
npm test -- --runInBand \
  impact-validation-ranking-regression.test.js
```

Those tests cover scoped validation ranking behavior. The benchmark harness itself can be checked without invoking a model using the dry-run command above.

## Full Run

```bash
node benchmark/agent-impact/run-agent-impact-ab.mjs \
  --image supermodel-agent-impact:local \
  --model gpt-5.5 \
  --codex-home ~/.codex
```

The runner writes one directory per case and arm under `target/agent-impact/`.

The summary records:

- `agentModel`, expected to be `gpt-5.5` for this run
- `agentRunner`, expected to be `codex-cli 0.128.0`
- per-arm aggregate success, time, tool calls, token usage, and agent file-level F1

Each arm directory contains:

- `prompt.md`
- `agent.jsonl`
- `agent.stdout`
- `agent.stderr`
- `metrics.json`
- `final.diff`
- `verify-before.log`
- `verify-after.log`
- `impact-analysis.json` and `IMPACT_ANALYSIS.md` for the impact arm

## Metrics

Primary:

- success rate
- agent file-level F1
- wall-clock seconds
- input tokens
- output tokens
- total tokens
- tool calls

Secondary:

- verifier failure category
- files changed
- diff line count
- whether changed files overlap predicted impact files
- whether the mutation was illegally reverted

Agent file-level F1 is computed as:

```text
actual files = files implicated by the broken verifier output after mutation
agent files = files changed by the agent after the mutation is committed as the baseline
precision = changed files that were actually implicated / changed files
recall = implicated files changed by the agent / implicated files
F1 = harmonic mean of precision and recall
```

This is not a substitute for verifier success. It measures whether the agent edited the right files, while verifier success measures whether the repair actually worked.
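
As a worked example of the definition above, the following standalone sketch scores one invented case. The function name `fileF1` and the file paths are hypothetical. The empty-set conventions mirror `computeFileF1` in this commit, but this version deduplicates its inputs through `Set`, which is an assumption about intended behavior rather than a copy of the committed code.

```javascript
// Standalone re-statement of the file-level F1 definition (illustrative only).
function fileF1(agentFiles, actualFiles) {
  const agent = new Set(agentFiles);
  const actual = new Set(actualFiles);
  const truePositives = [...agent].filter(file => actual.has(file)).length;
  // No edits and nothing implicated counts as perfect precision.
  const precision = agent.size === 0 ? (actual.size === 0 ? 1 : 0) : truePositives / agent.size;
  const recall = actual.size === 0 ? 1 : truePositives / actual.size;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return {precision, recall, f1};
}

// The agent changed two files; the verifier implicated two, with one in common.
const score = fileF1(
  ['src/queue.ts', 'test/helpers.ts'],
  ['src/queue.ts', 'test/queue.test.ts'],
);
console.log(score); // { precision: 0.5, recall: 0.5, f1: 0.5 }
```

Here the agent could still pass the verifier despite the F1 of 0.5, for example by fixing `src/queue.ts` in a way that makes the failing test pass untouched, which is why the two metrics are reported separately.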

## Impact Context

For production-quality runs, `impact-analysis.json` should come from the Supermodel impact endpoint or from the same local graph implementation used by that endpoint.

The context file must not include compiler ground truth from after the mutation. It can include:

- target file and symbol
- direct callers
- transitive callers
- affected files with confidence tiers
- entry points
- risk score

It must not include:

- actual compiler errors from the mutated repository
- the verifier output
- hand-labeled files that were discovered after running the mutation

## Reading Results

The benchmark supports three conclusions:

- **Positive:** impact arm succeeds more often or uses less time/tokens/tool calls with similar quality.
- **Neutral:** impact context changes little; product should not claim agent-efficiency gains yet.
- **Negative:** impact context increases distraction, false edits, or cost.

The per-case logs matter more than the aggregate. A single false-positive-heavy impact file can make the agent inspect or edit the wrong place; those failures should be visible in `agent.jsonl`, `final.diff`, and verifier logs.
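
One concrete reason to read per-case numbers: `aggregateByArm` in this commit pools true-positive, false-positive, and false-negative file counts across cases before computing F1, so a single noisy case can pull the arm-level figure well away from the typical case. The sketch below contrasts the pooled figure with a plain mean of per-case F1. The counts are invented, and `f1FromCounts` is a hypothetical helper that follows the same edge-case conventions as the committed code.

```javascript
// Per-case file counts; the numbers are invented for illustration.
const cases = [
  {tp: 1, fp: 0, fn: 0}, // agent edited exactly the one implicated file
  {tp: 1, fp: 3, fn: 0}, // agent also touched three unrelated files
];

// Hypothetical helper: F1 from raw counts.
function f1FromCounts({tp, fp, fn}) {
  const precision = tp + fp === 0 ? (fn === 0 ? 1 : 0) : tp / (tp + fp);
  const recall = tp + fn === 0 ? 1 : tp / (tp + fn);
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}

// Pooled (micro-averaged): sum the counts first, then compute F1 once.
const pooledCounts = cases.reduce(
  (acc, c) => ({tp: acc.tp + c.tp, fp: acc.fp + c.fp, fn: acc.fn + c.fn}),
  {tp: 0, fp: 0, fn: 0},
);
const pooledF1 = f1FromCounts(pooledCounts);

// Mean of per-case F1 (macro-averaged), shown only for comparison.
const meanF1 = cases.map(f1FromCounts).reduce((a, b) => a + b, 0) / cases.length;

console.log(pooledF1.toFixed(3)); // 0.571
console.log(meanF1.toFixed(3)); // 0.700
```

Neither number is wrong; they answer different questions. The pooled figure weights every edited file equally, while the mean weights every case equally.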

## Result Artifacts

Keep generated result artifacts separate from the harness PR so reviewers can evaluate runner logic and benchmark evidence independently.

For real impact ranking runs, check in result artifacts under `benchmark/agent-impact/results/<run-id>/` on a results branch. A complete result artifact should include:

- reproduction instructions with the exact API/CLI branches and command
- aggregate precision, recall, and F1 tables
- generated `report.md`
- sanitized `summary.json`
- per-case scoped packets when useful for audit
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
const fs = require('node:fs');

function summarizeAgentJsonl(jsonlPath) {
  if (!fs.existsSync(jsonlPath)) {
    return {events: 0, toolCalls: 0, failedToolCalls: 0, toolCallsByType: {}, usage: {}};
  }

  const lines = fs.readFileSync(jsonlPath, 'utf8').split('\n').filter(Boolean);
  const seenToolItems = new Set();
  let failedToolCalls = 0;
  const toolCallsByType = {};
  const usage = {};

  for (const line of lines) {
    let event;
    try {
      event = JSON.parse(line);
    } catch {
      continue;
    }

    const item = event.item;
    const itemType = item?.type;
    if (event.type === 'item.started' && isToolItemType(itemType)) {
      seenToolItems.add(item.id);
      toolCallsByType[itemType] = (toolCallsByType[itemType] || 0) + 1;
    } else if (event.type === 'item.completed' && isToolItemType(itemType) && !seenToolItems.has(item.id)) {
      seenToolItems.add(item.id);
      toolCallsByType[itemType] = (toolCallsByType[itemType] || 0) + 1;
    }

    if (event.type === 'item.completed' && isToolItemType(itemType) && item.status === 'failed') {
      failedToolCalls += 1;
    }

    collectUsage(event, usage);
  }

  return {
    events: lines.length,
    toolCalls: seenToolItems.size,
    failedToolCalls,
    toolCallsByType,
    usage,
  };
}

function isToolItemType(itemType) {
  return [
    'command_execution',
    'file_change',
    'mcp_tool_call',
    'tool_call',
    'function_call',
  ].includes(itemType);
}

function collectUsage(value, usage) {
  if (!value || typeof value !== 'object') return;
  for (const [key, nested] of Object.entries(value)) {
    if (typeof nested === 'number' && /tokens?|token_count/i.test(key)) {
      usage[key] = Math.max(usage[key] || 0, nested);
    } else if (nested && typeof nested === 'object') {
      collectUsage(nested, usage);
    }
  }
}

function computeFileF1(predictedFiles, actualFiles) {
  const predicted = new Set(predictedFiles);
  const actual = new Set(actualFiles);
  const truePositiveFiles = predictedFiles.filter(file => actual.has(file));
  const falsePositiveFiles = predictedFiles.filter(file => !actual.has(file));
  const falseNegativeFiles = actualFiles.filter(file => !predicted.has(file));
  const precision = predictedFiles.length === 0 ? (actualFiles.length === 0 ? 1 : 0) : truePositiveFiles.length / predictedFiles.length;
  const recall = actualFiles.length === 0 ? 1 : truePositiveFiles.length / actualFiles.length;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return {
    precision,
    recall,
    f1,
    truePositiveFiles,
    falsePositiveFiles,
    falseNegativeFiles,
  };
}

function aggregateByArm(cases) {
  const byArm = {};
  for (const arm of Array.from(new Set(cases.map(item => item.arm)))) {
    const armCases = cases.filter(item => item.arm === arm && !item.dryRun);
    const totals = armCases.reduce((acc, item) => {
      acc.successes += item.success ? 1 : 0;
      acc.wallTimeMs += item.wallTimeMs || 0;
      acc.toolCalls += item.agent?.toolCalls || 0;
      acc.failedToolCalls += item.agent?.failedToolCalls || 0;
      acc.tp += item.agentFileF1?.truePositiveFiles?.length || 0;
      acc.fp += item.agentFileF1?.falsePositiveFiles?.length || 0;
      acc.fn += item.agentFileF1?.falseNegativeFiles?.length || 0;
      for (const [key, value] of Object.entries(item.agent?.toolCallsByType || {})) {
        acc.toolCallsByType[key] = (acc.toolCallsByType[key] || 0) + value;
      }
      for (const [key, value] of Object.entries(item.agent?.usage || {})) {
        acc.usage[key] = (acc.usage[key] || 0) + value;
      }
      return acc;
    }, {successes: 0, wallTimeMs: 0, toolCalls: 0, failedToolCalls: 0, tp: 0, fp: 0, fn: 0, toolCallsByType: {}, usage: {}});

    const precision = totals.tp + totals.fp === 0 ? (totals.fn === 0 ? 1 : 0) : totals.tp / (totals.tp + totals.fp);
    const recall = totals.tp + totals.fn === 0 ? 1 : totals.tp / (totals.tp + totals.fn);
    byArm[arm] = {
      cases: armCases.length,
      successRate: armCases.length === 0 ? 0 : totals.successes / armCases.length,
      averageWallTimeMs: armCases.length === 0 ? 0 : totals.wallTimeMs / armCases.length,
      averageToolCalls: armCases.length === 0 ? 0 : totals.toolCalls / armCases.length,
      failedToolCalls: totals.failedToolCalls,
      toolCallsByType: totals.toolCallsByType,
      fileF1: {
        precision,
        recall,
        f1: precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall),
        tp: totals.tp,
        fp: totals.fp,
        fn: totals.fn,
      },
      usage: totals.usage,
    };
  }
  return byArm;
}

function isBenchmarkRelevantFile(file) {
  const normalized = file.replaceAll('\\', '/');
  return !normalized.startsWith('node_modules/') &&
    !normalized.startsWith('.pnpm-store/') &&
    !normalized.startsWith('dist/') &&
    !normalized.startsWith('coverage/') &&
    normalized !== 'IMPACT_ANALYSIS.md' &&
    normalized !== 'impact-analysis.json' &&
    /\.(ts|tsx|js|mjs|cjs|d\.ts)$/.test(normalized);
}

module.exports = {
  aggregateByArm,
  computeFileF1,
  isBenchmarkRelevantFile,
  summarizeAgentJsonl,
};
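
A note on how `collectUsage` merges token counts: it keeps the largest value seen for each token-like key, which fits a log that reports running totals. The sketch below copies the function from the diff so it runs on its own. The event shapes (`turn.completed`, `input_tokens`, `output_tokens`) are invented for illustration and may not match the real agent JSONL schema.

```javascript
// Copied from the diff above so the example is self-contained.
function collectUsage(value, usage) {
  if (!value || typeof value !== 'object') return;
  for (const [key, nested] of Object.entries(value)) {
    if (typeof nested === 'number' && /tokens?|token_count/i.test(key)) {
      usage[key] = Math.max(usage[key] || 0, nested);
    } else if (nested && typeof nested === 'object') {
      collectUsage(nested, usage);
    }
  }
}

// Hypothetical events; the real schema may differ.
const events = [
  {type: 'turn.completed', usage: {input_tokens: 1200, output_tokens: 80}},
  {type: 'turn.completed', usage: {input_tokens: 3100, output_tokens: 240}},
];

const usage = {};
for (const event of events) collectUsage(event, usage);
console.log(usage); // { input_tokens: 3100, output_tokens: 240 }
```

If the agent instead reported per-turn deltas, the true input total for these two events would be 4300, and a max-merge would undercount it. It may be worth confirming which convention the runner emits before comparing token usage across arms.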
