feat: review per-call tool usage in the integration-tests dashboard #2659
Merged
Each integration-test agent run now writes a per-run tool-usage-<token>.json alongside its agent-metadata-<token>.md report. The two files share the same token, so they correlate 1:1 by filename, and the ordered list of tools called in a specific run can be reconstructed even when the same stimulus runs multiple times in one directory. The capture records each tool call's name, full arguments (with secrets redacted), toolCallId, success, and order, and it includes the 'skill' pseudo-tool. The dashboard blob enumerators exclude tool-usage-*.json from API enumeration for now.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
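The capture step described above can be sketched roughly as follows. This is a minimal illustration, not the code in agent-runner.ts: the `ToolEvent` shapes, the `captureToolUsage` and `toolUsageFileName` names, and the choice to number calls from 1 are all assumptions made for the example.

```typescript
// Hypothetical event shapes; the real agent-runner event types are not shown in this PR.
interface ToolStartEvent { kind: "start"; toolCallId: string; name: string; args: unknown }
interface ToolEndEvent { kind: "end"; toolCallId: string; success: boolean }
type ToolEvent = ToolStartEvent | ToolEndEvent;

interface ToolCallRecord {
  order: number;           // position of the call within the run (1-based here by assumption)
  name: string;            // tool name, including the 'skill' pseudo-tool
  args: unknown;           // assumed to be secret-redacted before it reaches this function
  toolCallId: string;
  success: boolean | null; // null when no completion event was seen for this call
}

// Build the ordered call list, joining each call's outcome by toolCallId.
function captureToolUsage(events: ToolEvent[]): ToolCallRecord[] {
  const outcomes = new Map<string, boolean>();
  for (const e of events) {
    if (e.kind === "end") outcomes.set(e.toolCallId, e.success);
  }
  const calls: ToolCallRecord[] = [];
  for (const e of events) {
    if (e.kind !== "start") continue;
    const outcome = outcomes.get(e.toolCallId);
    calls.push({
      order: calls.length + 1,
      name: e.name,
      args: e.args,
      toolCallId: e.toolCallId,
      success: outcome === undefined ? null : outcome,
    });
  }
  return calls;
}

// Derive the tool-usage file name from the report name so the two stay 1:1.
function toolUsageFileName(reportFileName: string): string | null {
  const match = /^agent-metadata-(.+)\.md$/.exec(reportFileName);
  return match ? `tool-usage-${match[1]}.json` : null;
}
```

Joining by toolCallId rather than by position keeps the outcome correct when completion events arrive out of order relative to the calls that started them.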
Phase 2a: add the pipeline that makes each integration-test run's tool calls queryable from the dashboard.
- tests/scripts/upload-tool-usage.ts: new uploader writing one Azure Table row per tool call (name, order, success, duration, output size); full arguments stay in the per-run blob and are fetched on demand.
- dashboard/api getToolUsage.ts: GET /api/tool-usage read endpoint with skill/test/branch/runId/runToken filters.
- dashboard/infra: provision the integrationtoolusage table and wire the TOOL_USAGE_TABLE_NAME app setting through main/storage/function-app bicep.
- CI: add an 'Upload tool usage to table' step to the integration and azure-deploy workflows.
- agent-runner.ts: capture per-call wall-clock durationMs and UTF-8 outputBytes alongside the existing tool sequence.
- tests: unit coverage for the uploader transforms and the new capture fields.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
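To make the "one row per tool call" transform concrete, here is a hedged sketch of how a per-run record might expand into table rows. The `RunContext`, `CapturedCall`, and `ToolUsageRow` types and the partition/row key scheme are hypothetical; the PR text does not say how upload-tool-usage.ts actually derives its keys.

```typescript
// Hypothetical types for illustration only.
interface RunContext { skill: string; test: string; branch: string; runId: string; runToken: string }

interface CapturedCall {
  order: number;
  name: string;
  args: unknown;                // present in the per-run blob, deliberately not uploaded
  success: boolean | null;
  durationMs: number | null;
  outputBytes: number | null;
}

interface ToolUsageRow extends RunContext {
  partitionKey: string;
  rowKey: string;
  toolName: string;
  order: number;
  success: boolean | null;
  durationMs: number | null;
  outputBytes: number | null;
}

// Expand one run into one row per tool call. Arguments are left out so that
// rows stay small; the dashboard fetches them from the blob on demand.
function toRows(run: RunContext, calls: CapturedCall[]): ToolUsageRow[] {
  return calls.map((call) => ({
    ...run,
    // Assumed key scheme: partition by skill, row key = run token + zero-padded order,
    // so rows for a run sort in call order within the partition.
    partitionKey: run.skill,
    rowKey: `${run.runToken}_${("0000" + call.order).slice(-4)}`,
    toolName: call.name,
    order: call.order,
    success: call.success,
    durationMs: call.durationMs,
    outputBytes: call.outputBytes,
  }));
}
```

Deriving the row key from the run token and call order makes the transform deterministic, so re-uploading the same run would overwrite its rows rather than duplicate them, provided the uploader uses upserts.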
Phase 2b: surface each run's tool calls for review in the dashboard.
- integration-tests App.tsx: add a collapsible 'Tools' toggle under every test item (passed and failed) in the details panel. On expand it lazy-loads GET /api/tool-usage filtered by skill + test + selected date, groups rows by runToken (one block per run), and lists each call as order, success indicator, tool name, duration, and output size. Clicking 'args' fetches the call's full arguments on demand from the per-run tool-usage blob.
- getToolUsage.ts: add a runDate filter and include durationMs/outputBytes in the projected rows.
- integration-tests.css: styles for the tools toggle, run blocks, call rows, metrics, and the args panel.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
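The grouping the 'Tools' panel performs on the fetched rows could look something like the sketch below. The `ToolUsageRow` shape and the `groupByRunToken` and `successMark` helper names are assumptions for illustration; the real logic lives in App.tsx and is not reproduced here.

```typescript
// Hypothetical row shape as returned by GET /api/tool-usage.
interface ToolUsageRow {
  runToken: string;
  order: number;
  toolName: string;
  success: boolean | null;
}

// Group rows into one block per run, then sort each block by call order.
function groupByRunToken(rows: ToolUsageRow[]): Map<string, ToolUsageRow[]> {
  const groups = new Map<string, ToolUsageRow[]>();
  for (const row of rows) {
    const group = groups.get(row.runToken);
    if (group) {
      group.push(row);
    } else {
      groups.set(row.runToken, [row]);
    }
  }
  groups.forEach((group) => group.sort((a, b) => a.order - b.order));
  return groups;
}

// Map the tri-state success value to the indicator shown next to each call.
function successMark(success: boolean | null): string {
  if (success === true) return "✓";
  if (success === false) return "✗";
  return "?";
}
```

Sorting inside each group, rather than relying on the order in which the API returns rows, keeps the display correct even if the table query interleaves rows from several runs.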
Contributor
Pull request overview
Adds end-to-end capture, storage, and UI rendering for per-call tool usage in integration-test agent runs so nightly runs can be reviewed at the “which tools ran, in what order, with what outcome/metrics” level.
Changes:
- Capture ordered tool invocations per agent run (including `skill`) with success join, duration, and output size into per-run `tool-usage-<token>.json`.
- Upload one Azure Table row per tool call and expose a new `GET /api/tool-usage` endpoint for querying tool-call history.
- Add a dashboard “Tools” section under each test that lazy-loads tool-call rows and fetches per-call arguments on demand.
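As a rough illustration of the two per-call metrics, the sketch below computes a wall-clock duration from two timestamps and an output size in UTF-8 bytes. The `finishCall` helper and its parameters are hypothetical; only the `durationMs` and `outputBytes` field names come from the PR.

```typescript
interface CallMetrics {
  durationMs: number;  // wall-clock time between call start and completion
  outputBytes: number; // size of the tool output when encoded as UTF-8
}

// Count UTF-8 bytes rather than string length: non-ASCII characters
// take more than one byte each, so the two numbers can differ.
function utf8ByteLength(output: string): number {
  return new TextEncoder().encode(output).length;
}

// Hypothetical helper: derive both metrics once a call has completed.
function finishCall(startedAtMs: number, endedAtMs: number, output: string): CallMetrics {
  return {
    durationMs: Math.max(0, endedAtMs - startedAtMs),
    outputBytes: utf8ByteLength(output),
  };
}
```

For example, `"héllo".length` is 5, while its UTF-8 encoding is 6 bytes, which is why the size is measured on the encoded form.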
Show a summary per file
| File | Description |
|---|---|
| tests/utils/agent-runner.ts | Captures tool-call sequences + writes per-run tool-usage JSON alongside the markdown report. |
| tests/utils/tests/tool-usage.test.ts | Unit tests for tool-usage capture ordering, joins, metrics, and filename derivation. |
| tests/scripts/upload-tool-usage.ts | Uploads per-run tool calls into Azure Table storage (one row per tool call). |
| tests/scripts/tests/upload-tool-usage.test.ts | Tests deterministic uploader transforms (row keys, token derivation, row expansion). |
| tests/package.json | Adds upload:tool-usage script entry. |
| dashboard/sync/src/msbenchBlobEnumerator.ts | Excludes tool-usage blobs from msbench blob enumeration. |
| dashboard/src/integration-tests/integration-tests.css | Styles for the new per-test “Tools” UI section. |
| dashboard/src/integration-tests/App.tsx | Adds the “Tools” collapsible UI, grouping by runToken and lazy-loading args. |
| dashboard/infra/modules/storage.bicep | Provisions the integrationtoolusage table and outputs its name. |
| dashboard/infra/modules/function-app.bicep | Wires TOOL_USAGE_TABLE_NAME into Function App settings. |
| dashboard/infra/main.bicep | Adds tool-usage table param and passes it through modules. |
| dashboard/api/src/functions/getToolUsage.ts | New anonymous API endpoint to query tool-call rows with optional filters. |
| dashboard/api/src/functions/getData.ts | Updates blob layout documentation to include tool-usage files. |
| dashboard/api/src/blobEnumerator.ts | Updates blob exclusion rules (currently also excludes tool-usage blobs). |
| .github/workflows/test-azure-deploy.yml | Upload step for tool-usage rows after integration tests. |
| .github/workflows/test-all-integration.yml | Upload step for tool-usage rows after integration tests across skills. |
Copilot's findings
- Files reviewed: 16/16 changed files
- Comments generated: 3
- Include tool-usage-*.json blobs in the dashboard data tree so the on-demand args fetch can locate them (blobEnumerator).
- Require at least one filter on GET /api/tool-usage, returning 400 otherwise to avoid unfiltered full-table scans.
- Batch Azure Table writes via submitTransaction in chunks of <=100 per partition instead of sequential upserts; add unit tests for the grouping helper.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
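The batching and filter rules in this commit can be sketched as two small pure functions. This is an illustration under assumptions, not the PR's code: `groupIntoBatches` and `hasAnyFilter` are hypothetical names, and the actual grouping helper and request handling may differ. The constraints it relies on are Azure Table Storage's own: a transaction may only touch one partition and may contain at most 100 operations.

```typescript
interface KeyedRow { partitionKey: string; rowKey: string }

const MAX_BATCH_SIZE = 100; // Azure Table transactions allow at most 100 operations

// Group rows by partition, then split each partition into chunks of <= maxBatch.
// Each returned chunk could be sent as one transaction (e.g. via submitTransaction).
function groupIntoBatches<T extends KeyedRow>(rows: T[], maxBatch: number = MAX_BATCH_SIZE): T[][] {
  const byPartition = new Map<string, T[]>();
  for (const row of rows) {
    const existing = byPartition.get(row.partitionKey);
    if (existing) {
      existing.push(row);
    } else {
      byPartition.set(row.partitionKey, [row]);
    }
  }
  const batches: T[][] = [];
  byPartition.forEach((partitionRows) => {
    for (let i = 0; i < partitionRows.length; i += maxBatch) {
      batches.push(partitionRows.slice(i, i + maxBatch));
    }
  });
  return batches;
}

// Filter names taken from the PR description; the endpoint is described as
// returning 400 when none of them is supplied.
const FILTER_KEYS = ["skill", "test", "branch", "runId", "runToken", "runDate"];

function hasAnyFilter(query: Record<string, string | undefined>): boolean {
  return FILTER_KEYS.some((key) => {
    const value = query[key];
    return value !== undefined && value.trim() !== "";
  });
}
```

Keeping both helpers free of Azure SDK calls is what makes them straightforward to unit test, which matches the commit's note about adding tests for the grouping helper.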
Member
Do you have a screenshot showing what the new UI looks like?
This was referenced Jun 18, 2026
JasonYeMSFT
approved these changes
Jun 19, 2026
Summary
Adds an end-to-end pipeline so the integration-tests dashboard can show exactly which tools were called in each agent run, for reviewability of nightly runs (not automated pass/fail comparison).
Delivered in three compartmentalized phases:
Phase 1: capture (de434420)
- `tests/utils/agent-runner.ts`: `computeToolUsage` records the ordered tool-call sequence per run (including the `skill` pseudo-tool, success joined by `toolCallId`, plus per-call `durationMs` and `outputBytes`), written to a per-run `tool-usage-<token>.json` blob named 1:1 with the run's markdown report.

Phase 2a: storage + API (08f682e8)
- `tests/scripts/upload-tool-usage.ts`: uploads one Azure Table row per tool call (name, order, success, duration, output size); full arguments stay in the blob and are fetched on demand.
- `dashboard/api` `getToolUsage.ts`: `GET /api/tool-usage` read endpoint with skill/test/branch/runId/runToken/runDate filters.
- `dashboard/infra`: provisions the `integrationtoolusage` table and wires `TOOL_USAGE_TABLE_NAME` through the bicep modules.

Phase 2b: dashboard UI (e6cac93f)
- integration-tests `App.tsx`: a collapsible Tools toggle under every test item lazy-loads `GET /api/tool-usage`, groups rows by `runToken` (one block per run, ordered by call order), and lists each call as order · ✓/✗/? · tool name · duration · output size. Clicking args fetches that call's full arguments on demand from the per-run blob.

Testing
- `tests`: `npm run typecheck`, `npm run lint`, and unit tests for capture + uploader transforms all green.
- `dashboard/api`: `tsc` build clean.
- `dashboard`: `vite build` + `tsc --noEmit` clean for changed files.
- `az bicep build` clean (pre-existing tag warnings only).