feat: review per-call tool usage in the integration-tests dashboard #2659

Merged
tmeschter merged 4 commits into microsoft:main from tmeschter:260615-RecordTools
Jun 22, 2026
Conversation

@tmeschter

Member

Summary

Adds an end-to-end pipeline so the integration-tests dashboard can show exactly which tools were called in each agent run, for reviewability of nightly runs (not automated pass/fail comparison).

Delivered in three compartmentalized phases:

Phase 1 — capture (de434420)

  • tests/utils/agent-runner.ts: computeToolUsage records the ordered tool-call sequence per run (including the skill pseudo-tool, success joined by toolCallId, plus per-call durationMs and outputBytes), written to a per-run tool-usage-<token>.json blob named 1:1 with the run's markdown report.
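
For reviewers who want a concrete picture of the capture step, here is a minimal sketch of joining start and completion events by toolCallId. The event shapes, the `computeToolUsageSketch` name, the 1-based order, and the `utf8Length` helper are assumptions made for illustration; the actual computeToolUsage in agent-runner.ts may differ.

```typescript
// Hypothetical event shapes; the real agent-runner events may look different.
type ToolEvent =
  | { kind: "start"; toolCallId: string; name: string; args: unknown; atMs: number }
  | { kind: "complete"; toolCallId: string; success: boolean; output: string; atMs: number };

interface ToolCallRecord {
  order: number; // 1-based position in the run (assumed)
  toolCallId: string;
  name: string;
  args: unknown;
  success: boolean | null; // null when no completion event was seen
  durationMs: number | null;
  outputBytes: number | null;
}

// Counts the UTF-8 encoded length of a string without Node or DOM globals.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff && i + 1 < text.length) {
      const next = text.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // a surrogate pair encodes one 4-byte code point
        i++;
      } else {
        bytes += 3;
      }
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Builds the ordered call list; completions are joined to starts by toolCallId.
function computeToolUsageSketch(events: ToolEvent[]): ToolCallRecord[] {
  const records: ToolCallRecord[] = [];
  const open = new Map<string, { record: ToolCallRecord; startedAtMs: number }>();
  for (const ev of events) {
    if (ev.kind === "start") {
      const record: ToolCallRecord = {
        order: records.length + 1,
        toolCallId: ev.toolCallId,
        name: ev.name,
        args: ev.args,
        success: null,
        durationMs: null,
        outputBytes: null,
      };
      records.push(record);
      open.set(ev.toolCallId, { record, startedAtMs: ev.atMs });
    } else {
      const started = open.get(ev.toolCallId);
      if (!started) continue; // completion without a matching start is ignored
      started.record.success = ev.success;
      started.record.durationMs = ev.atMs - started.startedAtMs;
      started.record.outputBytes = utf8Length(ev.output);
    }
  }
  return records;
}
```

A call that never completes keeps `success: null`, which would correspond to the "?" indicator in the dashboard under this sketch's assumptions.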

Phase 2a — storage + API (08f682e8)

  • tests/scripts/upload-tool-usage.ts: uploads one Azure Table row per tool call (name, order, success, duration, output size); full arguments stay in the blob and are fetched on demand.
  • dashboard/api getToolUsage.ts: GET /api/tool-usage read endpoint with skill/test/branch/runId/runToken/runDate filters.
  • dashboard/infra: provisions the integrationtoolusage table and wires TOOL_USAGE_TABLE_NAME through the bicep modules.
  • CI: an "Upload tool usage to table" step in the integration and azure-deploy workflows.
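
The "one row per tool call" expansion can be pictured as a pure transform from a per-run blob to table rows. The sketch below is illustrative only: the `RunToolUsage` and `ToolUsageRow` shapes, the choice of skill as partition key, and the zero-padded row key format are all assumptions, not the layout used by upload-tool-usage.ts.

```typescript
// Hypothetical shape of the per-run blob contents.
interface CapturedCall {
  order: number;
  name: string;
  args: unknown;
  success: boolean;
  durationMs: number;
  outputBytes: number;
}

interface RunToolUsage {
  skill: string;
  test: string;
  runToken: string;
  calls: CapturedCall[];
}

// Hypothetical row layout; note that it carries no arguments.
interface ToolUsageRow {
  partitionKey: string;
  rowKey: string;
  test: string;
  runToken: string;
  order: number;
  toolName: string;
  success: boolean;
  durationMs: number;
  outputBytes: number;
}

// Expands one run into one row per call. Arguments are deliberately dropped,
// since they stay in the blob and are fetched on demand.
function toRows(run: RunToolUsage): ToolUsageRow[] {
  return run.calls.map((call) => ({
    partitionKey: run.skill,
    // Zero-padding keeps row keys sorted by call order (assumes order < 10000).
    rowKey: `${run.runToken}-${("0000" + call.order).slice(-4)}`,
    test: run.test,
    runToken: run.runToken,
    order: call.order,
    toolName: call.name,
    success: call.success,
    durationMs: call.durationMs,
    outputBytes: call.outputBytes,
  }));
}
```

Keeping the transform free of any storage client makes it straightforward to unit test on its own.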

Phase 2b — dashboard UI (e6cac93f)

  • A collapsible "Tools" toggle under each test item (passed and failed) in the details panel. On expand it lazy-loads the API filtered by skill + test + selected date, groups calls by runToken (one block per run, ordered by call order), and lists each call as order · ✓/✗/? · tool name · duration · output size. Clicking args fetches that call's full arguments on demand from the per-run blob.
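
The grouping described above (one block per run, calls ordered within it) can be sketched as follows. The `ToolCallRow` shape and the `groupByRunToken` name are assumptions for illustration and are not taken from App.tsx.

```typescript
// Hypothetical subset of the fields returned by GET /api/tool-usage.
interface ToolCallRow {
  runToken: string;
  order: number;
  toolName: string;
}

// Groups rows into one list per run and sorts each list by call order.
function groupByRunToken(rows: ToolCallRow[]): Map<string, ToolCallRow[]> {
  const groups = new Map<string, ToolCallRow[]>();
  for (const row of rows) {
    const existing = groups.get(row.runToken);
    if (existing) {
      existing.push(row);
    } else {
      groups.set(row.runToken, [row]);
    }
  }
  groups.forEach((calls) => calls.sort((a, b) => a.order - b.order));
  return groups;
}
```

A Map keeps run blocks in first-seen order, so the UI would render runs in whatever order the API returned them.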

Testing

  • tests: npm run typecheck, npm run lint, unit tests for capture + uploader transforms all green.
  • dashboard/api: tsc build clean.
  • dashboard: vite build + tsc --noEmit clean for changed files.
  • az bicep build clean (pre-existing tag warnings only).

tmeschter and others added 3 commits June 15, 2026 13:34
Each integration-test agent run now writes a per-run tool-usage-<token>.json
alongside its agent-metadata-<token>.md report (1:1 correlation by filename),
so the ordered list of tools called in a specific run can be reconstructed even
when the same stimulus runs multiple times in one directory.
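
As an illustration of the 1:1 filename correlation, the helper below derives the tool-usage filename from a report filename. The function name and the exact pattern are assumptions; only the two filename templates come from this commit message.

```typescript
// Maps agent-metadata-<token>.md to tool-usage-<token>.json.
// Returns null when the name does not look like a per-run report.
function toolUsageFileNameFor(reportFileName: string): string | null {
  const match = /^agent-metadata-(.+)\.md$/.exec(reportFileName);
  return match ? `tool-usage-${match[1]}.json` : null;
}
```

Because the token is carried over unchanged, two runs of the same stimulus in one directory produce distinct, pairable files.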

The capture records each tool call's name, arguments (secret-redacted, full),
toolCallId, success, and order, including the 'skill' pseudo-tool. The dashboard
blob enumerators exclude tool-usage-*.json from API enumeration for now.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 2a: add the pipeline that makes each integration-test run's tool
calls queryable from the dashboard.

- tests/scripts/upload-tool-usage.ts: new uploader writing one Azure Table
  row per tool call (name, order, success, duration, output size); full
  arguments stay in the per-run blob and are fetched on demand.
- dashboard/api getToolUsage.ts: GET /api/tool-usage read endpoint with
  skill/test/branch/runId/runToken filters.
- dashboard/infra: provision the integrationtoolusage table and wire the
  TOOL_USAGE_TABLE_NAME app setting through main/storage/function-app bicep.
- CI: add an 'Upload tool usage to table' step to the integration and
  azure-deploy workflows.
- agent-runner.ts: capture per-call wall-clock durationMs and UTF-8
  outputBytes alongside the existing tool sequence.
- tests: unit coverage for the uploader transforms and the new capture fields.
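
To make the filter handling of the read endpoint concrete, here is a rough sketch of turning query parameters into an OData filter string for a table query. The column names are assumed to match the query parameter names and `buildFilter` is a hypothetical helper; getToolUsage.ts may build its query differently.

```typescript
type FilterKey = "skill" | "test" | "branch" | "runId" | "runToken";
type ToolUsageQuery = Partial<Record<FilterKey, string>>;

// Builds an OData filter from the supplied parameters, or null when none are set.
function buildFilter(query: ToolUsageQuery): string | null {
  const keys: FilterKey[] = ["skill", "test", "branch", "runId", "runToken"];
  const clauses: string[] = [];
  for (const key of keys) {
    const value = query[key];
    if (value) {
      // OData string literals escape a single quote by doubling it.
      clauses.push(`${key} eq '${value.replace(/'/g, "''")}'`);
    }
  }
  return clauses.length > 0 ? clauses.join(" and ") : null;
}
```

Returning null for an empty query leaves it to the caller to decide whether an unfiltered request is allowed.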

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 2b: surface each run's tool calls for review in the dashboard.

- integration-tests App.tsx: add a collapsible 'Tools' toggle under every
  test item (passed and failed) in the details panel. On expand it lazy-loads
  GET /api/tool-usage filtered by skill + test + selected date, groups rows by
  runToken (one block per run), and lists each call as order, success
  indicator, tool name, duration, and output size. Clicking 'args' fetches the
  call's full arguments on demand from the per-run tool-usage blob.
- getToolUsage.ts: add a runDate filter and include durationMs/outputBytes in
  the projected rows.
- integration-tests.css: styles for the tools toggle, run blocks, call rows,
  metrics, and the args panel.
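
The per-call line (order, success indicator, tool name, duration, output size) could be rendered along these lines. The `CallRow` shape, the "n/a" placeholder, and the plain "ms" and "B" units are assumptions made for this sketch; the dashboard may format metrics differently.

```typescript
// Hypothetical row shape; null means the value was not captured.
interface CallRow {
  order: number;
  toolName: string;
  success: boolean | null;
  durationMs: number | null;
  outputBytes: number | null;
}

// Renders one call as "order · indicator · tool name · duration · output size".
function formatCall(row: CallRow): string {
  const indicator = row.success === true ? "✓" : row.success === false ? "✗" : "?";
  const duration = row.durationMs === null ? "n/a" : `${row.durationMs} ms`;
  const size = row.outputBytes === null ? "n/a" : `${row.outputBytes} B`;
  return [String(row.order), indicator, row.toolName, duration, size].join(" · ");
}
```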

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 16, 2026 20:49
Comment thread tests/scripts/upload-tool-usage.ts Dismissed

Copilot AI left a comment

Pull request overview

Adds end-to-end capture, storage, and UI rendering for per-call tool usage in integration-test agent runs so nightly runs can be reviewed at the “which tools ran, in what order, with what outcome/metrics” level.

Changes:

  • Capture ordered tool invocations per agent run (including skill) with success join, duration, and output size into per-run tool-usage-<token>.json.
  • Upload one Azure Table row per tool call and expose a new GET /api/tool-usage endpoint for querying tool-call history.
  • Add a dashboard “Tools” section under each test that lazy-loads tool-call rows and fetches per-call arguments on demand.
| File | Description |
| --- | --- |
| tests/utils/agent-runner.ts | Captures tool-call sequences + writes per-run tool-usage JSON alongside the markdown report. |
| tests/utils/tests/tool-usage.test.ts | Unit tests for tool-usage capture ordering, joins, metrics, and filename derivation. |
| tests/scripts/upload-tool-usage.ts | Uploads per-run tool calls into Azure Table storage (one row per tool call). |
| tests/scripts/tests/upload-tool-usage.test.ts | Tests deterministic uploader transforms (row keys, token derivation, row expansion). |
| tests/package.json | Adds upload:tool-usage script entry. |
| dashboard/sync/src/msbenchBlobEnumerator.ts | Excludes tool-usage blobs from msbench blob enumeration. |
| dashboard/src/integration-tests/integration-tests.css | Styles for the new per-test “Tools” UI section. |
| dashboard/src/integration-tests/App.tsx | Adds the “Tools” collapsible UI, grouping by runToken and lazy-loading args. |
| dashboard/infra/modules/storage.bicep | Provisions the integrationtoolusage table and outputs its name. |
| dashboard/infra/modules/function-app.bicep | Wires TOOL_USAGE_TABLE_NAME into Function App settings. |
| dashboard/infra/main.bicep | Adds tool-usage table param and passes it through modules. |
| dashboard/api/src/functions/getToolUsage.ts | New anonymous API endpoint to query tool-call rows with optional filters. |
| dashboard/api/src/functions/getData.ts | Updates blob layout documentation to include tool-usage files. |
| dashboard/api/src/blobEnumerator.ts | Updates blob exclusion rules (currently also excludes tool-usage blobs). |
| .github/workflows/test-azure-deploy.yml | Upload step for tool-usage rows after integration tests. |
| .github/workflows/test-all-integration.yml | Upload step for tool-usage rows after integration tests across skills. |

Copilot's findings

  • Files reviewed: 16/16 changed files
  • Comments generated: 3

Comment thread dashboard/api/src/blobEnumerator.ts Outdated
Comment thread dashboard/api/src/functions/getToolUsage.ts
Comment thread tests/scripts/upload-tool-usage.ts
- Include tool-usage-*.json blobs in the dashboard data tree so the
  on-demand args fetch can locate them (blobEnumerator).
- Require at least one filter on GET /api/tool-usage, returning 400
  otherwise to avoid unfiltered full-table scans.
- Batch Azure Table writes via submitTransaction in chunks of <=100
  per partition instead of sequential upserts; add unit tests for the
  grouping helper.
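
The batching constraint comes from Azure Table transactions, which accept at most 100 operations and require every entity in a transaction to share one partition key. A grouping helper in that spirit might look like the sketch below; `groupIntoBatches` is a hypothetical name and not necessarily the helper added in this commit.

```typescript
// Splits rows into batches where each batch holds a single partition key
// and at most maxBatchSize rows.
function groupIntoBatches<T extends { partitionKey: string }>(
  rows: T[],
  maxBatchSize: number = 100,
): T[][] {
  const byPartition = new Map<string, T[]>();
  for (const row of rows) {
    const existing = byPartition.get(row.partitionKey);
    if (existing) {
      existing.push(row);
    } else {
      byPartition.set(row.partitionKey, [row]);
    }
  }
  const batches: T[][] = [];
  byPartition.forEach((partitionRows) => {
    for (let i = 0; i < partitionRows.length; i += maxBatchSize) {
      batches.push(partitionRows.slice(i, i + maxBatchSize));
    }
  });
  return batches;
}
```

Each resulting batch could then be mapped to upsert actions and passed to one submitTransaction call, assuming the uploader uses the table client that way.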

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JasonYeMSFT

Member

Do you have a screenshot showing what this new UI looks like?

@tmeschter tmeschter merged commit f0db848 into microsoft:main Jun 22, 2026
11 checks passed
@tmeschter tmeschter deleted the 260615-RecordTools branch June 22, 2026 16:49
4 participants