
Commit 54d5a25

Copilot, pelikhan, github-actions[bot], and claude authored
feat: add tool calls diff and tokens-per-turn to audit diff mode (#28494)
* feat: add tool calls diff and tokens-per-turn to audit diff mode
  - Add ToolCallDiffEntry, ToolCallsDiff, ToolCallsDiffSummary, BashCommandsDiff types
  - Add tokens-per-turn fields (Run1TokensPerTurn, Run2TokensPerTurn, TokensPerTurnChange) to RunMetricsDiff
  - Add ToolCallsDiff field to RunMetricsDiff for engine-level tool call analysis
  - Implement computeToolCallsDiff() diffing LogMetrics.ToolCalls between runs
  - Implement computeBashCommandsDiff() for bash-specific analysis (handles generic bash/Bash and per-command bash_* entries from Codex)
  - Implement isBashTool() helper for bash tool name matching
  - Update computeRunMetricsDiff() to compute tokens-per-turn and include tool calls diff
  - Add renderToolCallsDiffPrettySection() and renderBashCommandsDiffPrettySection()
  - Add renderToolCallsDiffMarkdownSection() and renderBashCommandsDiffMarkdownSection()
  - Update Run Metrics table in both renderers to show Tokens/turn row
  - Export ToolCallInfo type alias in pkg/cli/logs_models.go
  - Add 15 new unit tests covering all new functionality

  Agent-Logs-Url: https://github.com/github/gh-aw/sessions/dbe42488-aa10-4336-bfeb-170f75f44adf
  Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

* refactor: extract formatMaxSizeCell helper and remove redundant map traversal in bash diff
  - Extract formatMaxSizeCell() helper to remove duplicated max-size formatting in pretty and markdown renderers
  - Collect bash tools during main iteration in computeToolCallsDiff() so computeBashCommandsDiff() receives pre-filtered maps, avoiding a second traversal
  - Update tests to pass pre-filtered bash tool maps to computeBashCommandsDiff() directly

  Agent-Logs-Url: https://github.com/github/gh-aw/sessions/dbe42488-aa10-4336-bfeb-170f75f44adf
  Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

* docs(adr): add draft ADR-28494 for tool-call breakdown and tokens-per-turn in audit diff

  Generated by the Design Decision Gate workflow for PR #28494.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: aggregate duplicate tool names and always show dash for empty change cells
  - In computeToolCallsDiff: aggregate duplicate tool entries (sum CallCount, take max of sizes) instead of overwriting, matching how other consumers handle metrics appended from multiple log files
  - In renderToolCallsDiffPrettySection: always substitute "—" for empty change cells regardless of status, consistent with markdown renderer
  - Add TestComputeToolCallsDiff_DuplicateToolNames test

  Agent-Logs-Url: https://github.com/github/gh-aw/sessions/60183031-f19f-4034-bf21-75f6983738ad
  Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Peli de Halleux <pelikhan@users.noreply.github.com>
1 parent f3b8a40 commit 54d5a25

5 files changed

Lines changed: 942 additions & 0 deletions


Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
1+
# ADR-28494: Embed Tool-Call Breakdown and Tokens-per-Turn Metrics in Audit Diff Output
2+
3+
**Date**: 2026-04-25
4+
**Status**: Draft
5+
**Deciders**: pelikhan, Copilot
6+
7+
---
8+
9+
## Part 1 — Narrative (Human-Friendly)
10+
11+
### Context
12+
13+
The `audit diff` command compares two workflow runs side by side and surfaces metrics such as total token usage, turns, duration, and cache efficiency. When agents investigate cost regressions between runs, they often cannot determine *why* tokens changed: whether the regression comes from more turns, heavier per-turn usage, or a shift in which tools (bash, gh, edit, etc.) are being called. The existing `RunMetricsDiff` structure exposed aggregate token and turn counts, but neither a per-turn token rate nor a breakdown by tool type. Engine-level tool call data was already being parsed into `RunSummary.Metrics.ToolCalls` (populated by the Claude, Codex, and Copilot log parsers) but was never surfaced in the diff output.
14+
15+
### Decision
16+
17+
We will enrich `RunMetricsDiff` with two additions, a `ToolCallsDiff` structure and tokens-per-turn fields (`Run1TokensPerTurn`, `Run2TokensPerTurn`, `TokensPerTurnChange`), and render both in the existing pretty-console and markdown diff renderers. Tokens per turn uses effective tokens from the firewall proxy when available, falling back to the engine-level token count. Tool call data is sourced from the already-computed `LogMetrics.ToolCalls` slice; bash-related entries (`bash`, `Bash`, `bash_*`) are separated into a dedicated `BashCommandsDiff` sub-structure to expose the Codex-style per-command granularity. All new types follow the existing JSON-serialisable struct conventions used by `TokenUsageDiff` and `GitHubRateLimitDiff`.
18+
19+
### Alternatives Considered
20+
21+
#### Alternative 1: Separate `audit tool-calls` subcommand
22+
23+
A dedicated subcommand could show tool-call detail for a single run or a pair. It was considered because it avoids enlarging the diff output and keeps concerns separated. It was not chosen because the tool-call delta is meaningful only in the context of a comparison; the driver is always "why did cost change between run A and run B?" Putting the data in the diff keeps it contextual and eliminates extra round trips for agents.
24+
25+
#### Alternative 2: Log-level analysis only — do not change the diff output
26+
27+
Agents could perform deeper log analysis themselves by querying the raw run logs. This was considered because it keeps the diff output lean. It was not chosen because the relevant data (`RunSummary.Metrics.ToolCalls`) is already materialised in memory during diff computation; re-parsing logs adds latency and requires agents to implement the aggregation logic every time, which is exactly what prompted this change.
28+
29+
#### Alternative 3: Add data to JSON output only, no render changes
30+
31+
Extending the JSON struct without adding renderer support would satisfy machine consumers but not the human-readable console/markdown reports that are the primary use-case for `audit diff`. This approach was not chosen because the primary consumer is the rendered diff report read by agents and engineers.
32+
33+
### Consequences
34+
35+
#### Positive
36+
- Agents can immediately see which tool types drove a token or call-count change between two runs without additional log queries.
37+
- Per-turn token efficiency distinguishes "more turns" regressions from "heavier per-turn" regressions, enabling more targeted fixes.
38+
- Bash command granularity (via Codex's `bash_*` naming) exposes specific shell commands that changed frequency — actionable detail for prompt/workflow optimisation.
39+
- The `AllTools` slice provides a complete cross-run view, not just the delta, which helps verify expected tool usage patterns.
40+
41+
#### Negative
42+
- The diff output grows in length; runs with many tool types will produce lengthy "Tool Call Breakdown" sections that may be noisy when there are no significant changes.
43+
- The `isBashTool` helper encodes engine-specific naming conventions (`bash`, `Bash`, `bash_*`) directly in the diff logic, creating a coupling point that must be updated if a new engine uses a different shell tool naming scheme.
44+
- Tokens-per-turn uses integer division, silently discarding the fractional part; the resulting value can appear identical between two runs even when there is a small real difference.
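To make the truncation concrete, here is a small Go illustration. The token counts are made up and the helper name `intTokensPerTurn` is hypothetical; it only mirrors the integer division described in the last bullet.

```go
package main

import "fmt"

// intTokensPerTurn is a hypothetical helper that mirrors the integer
// division described above; it is not the project's actual function.
func intTokensPerTurn(tokens, turns int) int {
	if turns == 0 {
		return 0 // nothing to compute without turns
	}
	return tokens / turns // Go integer division drops the fractional part
}

func main() {
	// Two illustrative runs with slightly different totals over 10 turns.
	run1 := intTokensPerTurn(1999, 10) // 199.9 truncates to 199
	run2 := intTokensPerTurn(1990, 10) // 199.0 stays 199
	fmt.Println(run1, run2, run1 == run2)
}
```

Both runs report 199 tokens per turn, so the diff would show no change even though the totals differ by 9 tokens.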
45+
46+
#### Neutral
47+
- The new `ToolCallInfo` type alias is exported from `logs_models.go` alongside the existing `LogMetrics` alias, following the established aliasing pattern for shared workflow types.
48+
- Bash diff computation receives pre-filtered maps from the parent iteration to avoid a second traversal, which is a performance micro-optimisation that future readers should be aware of when modifying the iteration logic.
49+
50+
---
51+
52+
## Part 2 — Normative Specification (RFC 2119)
53+
54+
> The key words **MUST**, **MUST NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **SHOULD**, **SHOULD NOT**, **RECOMMENDED**, **MAY**, and **OPTIONAL** in this section are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119).
55+
56+
### Tokens-per-Turn Computation
57+
58+
1. Implementations **MUST** compute tokens-per-turn as `effectiveTokens / turns` where `effectiveTokens` is `TokenUsageSummary.TotalEffectiveTokens` when that value is greater than zero.
59+
2. Implementations **MUST** fall back to the engine-level token count (`WorkflowRun.TokenUsage`) when `TotalEffectiveTokens` is zero or the `TokenUsageSummary` is absent.
60+
3. Implementations **MUST NOT** compute a tokens-per-turn value when the turn count is zero (to avoid division by zero).
61+
4. Implementations **SHOULD** format the tokens-per-turn change as a percentage string (e.g., `+50%`, `-10%`) using the same `formatVolumeChange` helper applied to other percentage-point metrics.
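The four rules above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `tokenUsageSummary` is a simplified stand-in type, `tokensPerTurn` is a hypothetical name, returning `0` to mean "not computed" is an assumption, and `percentChange` is a hypothetical substitute for `formatVolumeChange`, whose real signature is not shown in this document.

```go
package main

import "fmt"

// tokenUsageSummary is a simplified stand-in for the firewall-proxy summary;
// only the field named in the specification is modelled.
type tokenUsageSummary struct {
	TotalEffectiveTokens int
}

// tokensPerTurn prefers effective tokens and falls back to the engine-level
// count. It returns 0 (assumed here to mean "not computed") for zero turns.
func tokensPerTurn(summary *tokenUsageSummary, engineTokens, turns int) int {
	if turns <= 0 {
		return 0 // rule 3: never divide by zero
	}
	tokens := engineTokens // rule 2: fallback
	if summary != nil && summary.TotalEffectiveTokens > 0 {
		tokens = summary.TotalEffectiveTokens // rule 1: preferred source
	}
	return tokens / turns
}

// percentChange is a hypothetical stand-in for formatVolumeChange.
// It returns an empty string when there is no baseline to compare against.
func percentChange(before, after int) string {
	if before == 0 {
		return ""
	}
	pct := float64(after-before) * 100 / float64(before)
	return fmt.Sprintf("%+.0f%%", pct)
}

func main() {
	// Run 1 has effective tokens; run 2 has no summary and uses the fallback.
	run1 := tokensPerTurn(&tokenUsageSummary{TotalEffectiveTokens: 1000}, 5000, 10)
	run2 := tokensPerTurn(nil, 1500, 10)
	fmt.Println(run1, run2, percentChange(run1, run2)) // prints "100 150 +50%"
}
```

Returning an empty change string for a zero baseline is a design choice of this sketch; it leaves the decision about what to display in that cell to the renderer.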
62+
63+
### Tool Calls Diff
64+
65+
1. Implementations **MUST** source tool call data from `RunSummary.Metrics.ToolCalls` (`LogMetrics.ToolCalls`) and **MUST NOT** re-parse raw log files during diff computation.
66+
2. Implementations **MUST** produce a `ToolCallsDiff` that classifies each tool as `new`, `removed`, `changed`, or `unchanged` relative to the baseline run.
67+
3. Implementations **MUST** include every tool seen in either run in the `AllTools` slice, sorted lexicographically by tool name.
68+
4. Implementations **MUST** return `nil` for `ToolCallsDiff` when both runs have no tool call data, to keep the output clean for runs predating this feature.
69+
5. Implementations **SHOULD** include per-entry `MaxInputSize` and `MaxOutputSize` values to provide token-size context for each tool type.
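As a rough sketch of rules 1 to 4, the classification could look like the Go code below. The type and function names are lowercase stand-ins rather than the PR's actual definitions, the entry fields other than `AllTools` are assumptions, and "changed" is assumed to mean that the call count differs. Size fields are left out to keep the example short. Duplicate tool names are summed before diffing, following the aggregation described in the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// toolCallInfo is a simplified stand-in for the engine-level tool call record.
type toolCallInfo struct {
	Name      string
	CallCount int
}

// toolCallDiffEntry uses assumed field names for illustration.
type toolCallDiffEntry struct {
	Name      string
	Status    string // "new", "removed", "changed", or "unchanged"
	Run1Count int
	Run2Count int
}

type toolCallsDiff struct {
	AllTools []toolCallDiffEntry
}

// aggregate sums call counts for duplicate tool names.
func aggregate(calls []toolCallInfo) map[string]int {
	out := make(map[string]int, len(calls))
	for _, c := range calls {
		out[c.Name] += c.CallCount
	}
	return out
}

// diffToolCalls classifies every tool seen in either run.
func diffToolCalls(run1, run2 []toolCallInfo) *toolCallsDiff {
	if len(run1) == 0 && len(run2) == 0 {
		return nil // no tool call data in either run
	}
	m1, m2 := aggregate(run1), aggregate(run2)
	names := make([]string, 0, len(m1)+len(m2))
	for n := range m1 {
		names = append(names, n)
	}
	for n := range m2 {
		if _, seen := m1[n]; !seen {
			names = append(names, n)
		}
	}
	sort.Strings(names) // lexicographic order by tool name

	diff := &toolCallsDiff{}
	for _, n := range names {
		c1, in1 := m1[n]
		c2, in2 := m2[n]
		status := "unchanged"
		switch {
		case !in1:
			status = "new"
		case !in2:
			status = "removed"
		case c1 != c2:
			status = "changed" // assumption: only call counts are compared
		}
		diff.AllTools = append(diff.AllTools, toolCallDiffEntry{n, status, c1, c2})
	}
	return diff
}

func main() {
	run1 := []toolCallInfo{{"bash", 3}, {"edit", 1}, {"bash", 2}}
	run2 := []toolCallInfo{{"bash", 5}, {"gh", 4}}
	for _, e := range diffToolCalls(run1, run2).AllTools {
		fmt.Printf("%s %s %d -> %d\n", e.Name, e.Status, e.Run1Count, e.Run2Count)
	}
}
```

In the example, the two `bash` entries from run 1 sum to 5 calls, so `bash` is reported as unchanged, `edit` as removed, and `gh` as new.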
70+
71+
### Bash-Specific Breakdown
72+
73+
1. Implementations **MUST** treat a tool name as a bash tool invocation when it equals `bash` under case-insensitive comparison (so both `bash` and `Bash` match) or when it starts with the prefix `bash_`, also compared case-insensitively.
74+
2. Implementations **MUST** aggregate all bash tool entries into a `BashCommandsDiff` sub-structure and report their combined call count for each run.
75+
3. Implementations **MUST** collect bash tool entries during the main tool iteration loop and **MUST NOT** perform a second traversal of the tool maps to build the bash diff.
76+
4. Implementations **MUST** return `nil` for `BashCommandsDiff` when no bash tool calls are present in either run.
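The matching rule in item 1 and the combined count in item 2 can be sketched in Go as shown below. `isBashTool` is the helper name used in the PR, but the body here is only one plausible reading of the rule; `bashCallTotal` and the map-of-counts input are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// isBashTool reports whether a tool name denotes a bash invocation:
// either exactly "bash" or prefixed with "bash_", ignoring case.
func isBashTool(name string) bool {
	lower := strings.ToLower(name)
	return lower == "bash" || strings.HasPrefix(lower, "bash_")
}

// bashCallTotal is a hypothetical helper that sums call counts over
// the bash tools in a name-to-count map.
func bashCallTotal(counts map[string]int) int {
	total := 0
	for name, n := range counts {
		if isBashTool(name) {
			total += n
		}
	}
	return total
}

func main() {
	for _, n := range []string{"bash", "Bash", "bash_ls", "BASH_git", "edit", "bashful"} {
		fmt.Printf("%s -> %t\n", n, isBashTool(n))
	}
	fmt.Println(bashCallTotal(map[string]int{"Bash": 2, "bash_ls": 3, "edit": 7}))
}
```

Note that a name such as `bashful` does not match, because the prefix rule requires the underscore.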
77+
78+
### Rendering
79+
80+
1. Implementations **MUST** render `ToolCallsDiff` in both the pretty-console and markdown output paths when the diff is non-nil.
81+
2. Implementations **MUST** render the tokens-per-turn row in the Run Metrics table when at least one of `Run1TokensPerTurn` or `Run2TokensPerTurn` is greater than zero.
82+
3. Implementations **SHOULD** use a `formatMaxSizeCell` helper (or equivalent) to format `run1 / run2` size pairs, displaying a placeholder when both values are zero and omitting an individual value when it is zero.
83+
4. Implementations **MAY** omit the Bash Commands sub-section from the rendered output when `BashDiff` is nil.
84+
85+
### Conformance
86+
87+
An implementation is considered conformant with this ADR if it satisfies all **MUST** and **MUST NOT** requirements above. Failure to meet any **MUST** or **MUST NOT** requirement constitutes non-conformance.
88+
89+
---
90+
91+
*This is a DRAFT ADR generated by the [Design Decision Gate](https://github.com/github/gh-aw/actions/runs/24940226956) workflow. The PR author must review, complete, and finalize this document before the PR can merge.*
