supermodeltools
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/benchmark/agent-impact/Dockerfile b/benchmark/agent-impact/Dockerfile
Lines changed: 10 additions & 0 deletions
diff --git a/benchmark/agent-impact/README.md b/benchmark/agent-impact/README.md
Lines changed: 174 additions & 0 deletions
diff --git a/benchmark/agent-impact/agent-impact-metrics.cjs b/benchmark/agent-impact/agent-impact-metrics.cjs
Lines changed: 148 additions & 0 deletions
@@ -1,5 +1,6 @@
 dist/
 npm/bin/
+/target/
 *.exe
 /supermodel
 /cli
 
@@ -0,0 +1,10 @@
+FROM node:24-bookworm
+
+RUN apt-get update \
+  && apt-get install -y --no-install-recommends git ca-certificates python3 make g++ bash jq ripgrep time \
+  && rm -rf /var/lib/apt/lists/*
+
+RUN corepack enable \
+  && npm install -g @openai/codex@0.128.0
+
+WORKDIR /workspace
@@ -0,0 +1,174 @@
+# Agent Impact A/B Benchmark
+
+This benchmark answers a different question than the compiler-only impact benchmark:
+
+> Does impact-analysis context help a top-tier coding agent repair real breakage with fewer tokens, less wall time, fewer tool calls, or a higher success rate?
+
+The comparison must isolate one variable:
+
+- **control:** agent receives the broken repository and task prompt.
+- **impact:** agent receives the same broken repository and task prompt, plus `IMPACT_ANALYSIS.md` and `impact-analysis.json`.
+
+Everything else is held constant: model, container image, repository commit, mutation, verifier, timeout, and prompt wording.
+
+## Protocol
+
+For each case:
+
+1. Clone a pinned public repository commit.
+2. Install dependencies inside Docker.
+3. Run the verifier once to prove the clean checkout is green.
+4. Apply a configured mutation or deletion.
+5. Run the verifier again and require that it fails with at least one real source/test/type failure.
+6. Run the control agent in a fresh Docker container.
+7. Run the impact-context agent in another fresh Docker container.
+8. Run the verifier after each agent.
+9. Record success, wall time, token usage, tool calls, files changed, final diff, stdout/stderr, and raw agent JSONL.
+
+The agent prompt forbids simply reverting the target mutation. The harness also checks that the configured `mutation.mustContain` text still exists after the agent run.
+
+## Ten-Repository Set
+
+The initial manifest is [agent-impact-repos.json](./agent-impact-repos.json). It uses 10 pinned public repositories:
+
+- `tinylibs/tinyspy`
+- `tinylibs/tinybench`
+- `sindresorhus/p-queue`
+- `sindresorhus/ky`
+- `sindresorhus/p-map`
+- `sindresorhus/p-retry`
+- `sindresorhus/p-timeout`
+- `sindresorhus/p-throttle`
+- `chalk/chalk`
+- `sindresorhus/yoctocolors`
+
+The set intentionally mixes implementation TypeScript, declaration-heavy packages, test/type-test repairs, class constructors, default exports, and shared helpers.
+
+## Build
+
+```bash
+docker build -t supermodel-agent-impact:local benchmark/agent-impact
+```
+
+## Dry Run
+
+```bash
+node benchmark/agent-impact/run-agent-impact-ab.mjs --dry-run
+```
+
+This validates the manifest shape and writes the prompts/run plan without invoking the model.
+
+## Implementation Tests
+
+The API ranking implementation is tested in the public API repository. From the paired public API checkout:
+
+```bash
+export SUPERMODEL_PUBLIC_API_REPO=/path/to/supermodel-public-api
+cd "$SUPERMODEL_PUBLIC_API_REPO/src/data-plane"
+npm test -- --runInBand \
+  impact-validation-ranking-regression.test.js
+```
+
+Those tests cover scoped validation ranking behavior. The benchmark harness itself can be checked without invoking a model using the dry-run command above.
+
+## Full Run
+
+```bash
+node benchmark/agent-impact/run-agent-impact-ab.mjs \
+  --image supermodel-agent-impact:local \
+  --model gpt-5.5 \
+  --codex-home ~/.codex
+```
+
+The runner writes one directory per case and arm under `target/agent-impact/`.
+
+The summary records:
+
+- `agentModel`, expected to be `gpt-5.5` for this run
+- `agentRunner`, expected to be `codex-cli 0.128.0`
+- per-arm aggregate success, time, tool calls, token usage, and agent file-level F1
+
+Each arm directory contains:
+
+- `prompt.md`
+- `agent.jsonl`
+- `agent.stdout`
+- `agent.stderr`
+- `metrics.json`
+- `final.diff`
+- `verify-before.log`
+- `verify-after.log`
+- `impact-analysis.json` and `IMPACT_ANALYSIS.md` for the impact arm
+
+## Metrics
+
+Primary:
+
+- success rate
+- agent file-level F1
+- wall-clock seconds
+- input tokens
+- output tokens
+- total tokens
+- tool calls
+
+Secondary:
+
+- verifier failure category
+- files changed
+- diff line count
+- whether changed files overlap predicted impact files
+- whether the mutation was illegally reverted
+
+Agent file-level F1 is computed as:
+
+```text
+actual files = files implicated by the broken verifier output after mutation
+agent files  = files changed by the agent after the mutation is committed as the baseline
+precision    = changed files that were actually implicated / changed files
+recall       = implicated files changed by the agent / implicated files
+F1           = harmonic mean of precision and recall
+```
+
+This is not a substitute for verifier success. It measures whether the agent edited the right files, while verifier success measures whether the repair actually worked.
+
+## Impact Context
+
+For production-quality runs, `impact-analysis.json` should come from the Supermodel impact endpoint or from the same local graph implementation used by that endpoint.
+
+The context file must not include compiler ground truth from after the mutation. It can include:
+
+- target file and symbol
+- direct callers
+- transitive callers
+- affected files with confidence tiers
+- entry points
+- risk score
+
+It must not include:
+
+- actual compiler errors from the mutated repository
+- the verifier output
+- hand-labeled files that were discovered after running the mutation
+
+## Reading Results
+
+The benchmark supports three conclusions:
+
+- **Positive:** the impact arm succeeds more often, or uses less time, fewer tokens, or fewer tool calls at similar quality.
+- **Neutral:** impact context changes little; the product should not claim agent-efficiency gains yet.
+- **Negative:** impact context increases distraction, false edits, or cost.
+
+The per-case logs matter more than the aggregate. A single false-positive-heavy impact file can make the agent inspect or edit the wrong place; those failures should be visible in `agent.jsonl`, `final.diff`, and verifier logs.
+
+## Result Artifacts
+
+Keep generated result artifacts separate from the harness PR so reviewers can evaluate runner logic and benchmark evidence independently.
+
+For real impact ranking runs, check in result artifacts under `benchmark/agent-impact/results/<run-id>/` on a results branch. A complete result artifact should include:
+
+- reproduction instructions with the exact API/CLI branches and command
+- aggregate precision, recall, and F1 tables
+- generated `report.md`
+- sanitized `summary.json`
+- per-case scoped packets when useful for audit
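The revert guard described in the README's Protocol section (the `mutation.mustContain` check) could look roughly like the sketch below. The runner's real implementation is not part of this diff, so `mutationStillPresent`, the `mutation.file` field, and the sample mutation text are all hypothetical names chosen for illustration; only `mutation.mustContain` comes from the README.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: `mutation.file` is an assumed field name, not taken from the manifest.
// Returns true when the marker text from `mutation.mustContain` is still present in the file.
function mutationStillPresent(repoDir, mutation) {
  const filePath = path.join(repoDir, mutation.file);
  // In this sketch, deleting the mutated file is treated the same as reverting it.
  if (!fs.existsSync(filePath)) return false;
  return fs.readFileSync(filePath, 'utf8').includes(mutation.mustContain);
}

// A throwaway directory stands in for the agent's workspace.
const repoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'agent-impact-'));
const mutation = {file: 'index.ts', mustContain: 'export function spy(target: object)'};

// Case 1: the agent repaired callers and left the mutated signature in place.
fs.writeFileSync(path.join(repoDir, mutation.file), `${mutation.mustContain} {}\n`);
const keptMutation = mutationStillPresent(repoDir, mutation);

// Case 2: the agent restored the old signature, which the prompt forbids.
fs.writeFileSync(path.join(repoDir, mutation.file), 'export function spy() {}\n');
const revertedMutation = mutationStillPresent(repoDir, mutation);

fs.rmSync(repoDir, {recursive: true, force: true});
console.log({keptMutation, revertedMutation});
```

A substring check like this is cheap but coarse: it cannot tell a genuine repair from an edit that keeps the marker text while neutralizing the change elsewhere, which is one reason the README treats verifier success and per-case logs as the primary evidence.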
@@ -0,0 +1,148 @@
+const fs = require('node:fs');
+
+function summarizeAgentJsonl(jsonlPath) {
+  if (!fs.existsSync(jsonlPath)) {
+    return {events: 0, toolCalls: 0, failedToolCalls: 0, toolCallsByType: {}, usage: {}};
+  }
+
+  const lines = fs.readFileSync(jsonlPath, 'utf8').split('\n').filter(Boolean);
+  const seenToolItems = new Set();
+  let failedToolCalls = 0;
+  const toolCallsByType = {};
+  const usage = {};
+
+  for (const line of lines) {
+    let event;
+    try {
+      event = JSON.parse(line);
+    } catch {
+      continue;
+    }
+
+    const item = event.item;
+    const itemType = item?.type;
+    // Count each tool item once: on `item.started`, or on `item.completed`
+    // when no start event was seen for that item id.
+    if (event.type === 'item.started' && isToolItemType(itemType)) {
+      seenToolItems.add(item.id);
+      toolCallsByType[itemType] = (toolCallsByType[itemType] || 0) + 1;
+    } else if (event.type === 'item.completed' && isToolItemType(itemType) && !seenToolItems.has(item.id)) {
+      seenToolItems.add(item.id);
+      toolCallsByType[itemType] = (toolCallsByType[itemType] || 0) + 1;
+    }
+
+    // Failures are only known once the item completes.
+    if (event.type === 'item.completed' && isToolItemType(itemType) && item.status === 'failed') {
+      failedToolCalls += 1;
+    }
+
+    collectUsage(event, usage);
+  }
+
+  return {
+    events: lines.length,
+    toolCalls: seenToolItems.size,
+    failedToolCalls,
+    toolCallsByType,
+    usage,
+  };
+}
+
+function isToolItemType(itemType) {
+  return [
+    'command_execution',
+    'file_change',
+    'mcp_tool_call',
+    'tool_call',
+    'function_call',
+  ].includes(itemType);
+}
+
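+function collectUsage(value, usage) {
+  // Walk the event recursively and keep the largest value seen for each token-like key.
+  // Taking the max rather than the sum is only correct if counters are cumulative or reported once.
+  if (!value || typeof value !== 'object') return;
+  for (const [key, nested] of Object.entries(value)) {
+    if (typeof nested === 'number' && /tokens?|token_count/i.test(key)) {
+      usage[key] = Math.max(usage[key] || 0, nested);
+    } else if (nested && typeof nested === 'object') {
+      collectUsage(nested, usage);
+    }
+  }
+}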
+
+function computeFileF1(predictedFiles, actualFiles) {
+  // Deduplicate both inputs so a repeated path cannot inflate the counts.
+  const predicted = new Set(predictedFiles);
+  const actual = new Set(actualFiles);
+  const truePositiveFiles = [...predicted].filter(file => actual.has(file));
+  const falsePositiveFiles = [...predicted].filter(file => !actual.has(file));
+  const falseNegativeFiles = [...actual].filter(file => !predicted.has(file));
+  // With no predictions, precision is 1 only if nothing was expected either.
+  const precision = predicted.size === 0 ? (actual.size === 0 ? 1 : 0) : truePositiveFiles.length / predicted.size;
+  const recall = actual.size === 0 ? 1 : truePositiveFiles.length / actual.size;
+  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
+  return {
+    precision,
+    recall,
+    f1,
+    truePositiveFiles,
+    falsePositiveFiles,
+    falseNegativeFiles,
+  };
+}
+
+function aggregateByArm(cases) {
+  const byArm = {};
+  for (const arm of Array.from(new Set(cases.map(item => item.arm)))) {
+    const armCases = cases.filter(item => item.arm === arm && !item.dryRun);
+    const totals = armCases.reduce((acc, item) => {
+      acc.successes += item.success ? 1 : 0;
+      acc.wallTimeMs += item.wallTimeMs || 0;
+      acc.toolCalls += item.agent?.toolCalls || 0;
+      acc.failedToolCalls += item.agent?.failedToolCalls || 0;
+      acc.tp += item.agentFileF1?.truePositiveFiles?.length || 0;
+      acc.fp += item.agentFileF1?.falsePositiveFiles?.length || 0;
+      acc.fn += item.agentFileF1?.falseNegativeFiles?.length || 0;
+      for (const [key, value] of Object.entries(item.agent?.toolCallsByType || {})) {
+        acc.toolCallsByType[key] = (acc.toolCallsByType[key] || 0) + value;
+      }
+      for (const [key, value] of Object.entries(item.agent?.usage || {})) {
+        acc.usage[key] = (acc.usage[key] || 0) + value;
+      }
+      return acc;
+    }, {successes: 0, wallTimeMs: 0, toolCalls: 0, failedToolCalls: 0, tp: 0, fp: 0, fn: 0, toolCallsByType: {}, usage: {}});
+
+    const precision = totals.tp + totals.fp === 0 ? (totals.fn === 0 ? 1 : 0) : totals.tp / (totals.tp + totals.fp);
+    const recall = totals.tp + totals.fn === 0 ? 1 : totals.tp / (totals.tp + totals.fn);
+    byArm[arm] = {
+      cases: armCases.length,
+      successRate: armCases.length === 0 ? 0 : totals.successes / armCases.length,
+      averageWallTimeMs: armCases.length === 0 ? 0 : totals.wallTimeMs / armCases.length,
+      averageToolCalls: armCases.length === 0 ? 0 : totals.toolCalls / armCases.length,
+      failedToolCalls: totals.failedToolCalls,
+      toolCallsByType: totals.toolCallsByType,
+      fileF1: {
+        precision,
+        recall,
+        f1: precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall),
+        tp: totals.tp,
+        fp: totals.fp,
+        fn: totals.fn,
+      },
+      usage: totals.usage,
+    };
+  }
+  return byArm;
+}
+
+function isBenchmarkRelevantFile(file) {
+  const normalized = file.replaceAll('\\', '/');
+  return !normalized.startsWith('node_modules/') &&
+    !normalized.startsWith('.pnpm-store/') &&
+    !normalized.startsWith('dist/') &&
+    !normalized.startsWith('coverage/') &&
+    normalized !== 'IMPACT_ANALYSIS.md' &&
+    normalized !== 'impact-analysis.json' &&
+    /\.(ts|tsx|js|mjs|cjs)$/.test(normalized); // `.d.ts` files already match via the `.ts` suffix
+}
+
+module.exports = {
+  aggregateByArm,
+  computeFileF1,
+  isBenchmarkRelevantFile,
+  summarizeAgentJsonl,
+};
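One point worth making explicit about `aggregateByArm`: it sums `tp`, `fp`, and `fn` across cases before computing precision and recall, so the per-arm `fileF1` is a micro-average rather than a mean of per-case F1 scores. The sketch below re-implements just that arithmetic to show how the two can diverge. `f1FromCounts` and the sample counts are illustrative and not part of the module.

```javascript
// Counts per case: tp = implicated files the agent changed,
// fp = changed files that were not implicated, fn = implicated files left untouched.
// Hypothetical helper mirroring the edge-case handling in aggregateByArm.
function f1FromCounts({tp, fp, fn}) {
  const precision = tp + fp === 0 ? (fn === 0 ? 1 : 0) : tp / (tp + fp);
  const recall = tp + fn === 0 ? 1 : tp / (tp + fn);
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}

// Made-up counts: one small, clean repair and one large, noisy one.
const cases = [
  {tp: 1, fp: 0, fn: 0},
  {tp: 2, fp: 6, fn: 2},
];

// Micro-average: pool the counts first, as aggregateByArm does.
const pooled = cases.reduce(
  (acc, c) => ({tp: acc.tp + c.tp, fp: acc.fp + c.fp, fn: acc.fn + c.fn}),
  {tp: 0, fp: 0, fn: 0},
);
const micro = f1FromCounts(pooled);

// Macro-average: score each case, then take the mean.
const macro = cases.reduce((sum, c) => sum + f1FromCounts(c), 0) / cases.length;

console.log({micro: micro.toFixed(3), macro: macro.toFixed(3)});
// micro is 3/7 (about 0.429); macro is 2/3 (about 0.667)
```

With micro-averaging, cases that touch many files dominate the aggregate, so a single noisy case can pull the arm-level F1 down sharply. Reading the per-case `metrics.json` alongside the summary avoids over-interpreting that number.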