Skip to content

feat: capture agent statistics via OTel and surface in safe outputs#219

Merged
jamesadevine merged 8 commits into
mainfrom
feat/agent-stats
Apr 15, 2026
Merged

feat: capture agent statistics via OTel and surface in safe outputs#219
jamesadevine merged 8 commits into
mainfrom
feat/agent-stats

Conversation

@jamesadevine

Copy link
Copy Markdown
Collaborator

Summary

Captures agent execution statistics (token usage, duration, model, tool calls, turns) using Copilot CLI's native OpenTelemetry file exporter and surfaces them in safe output write actions.

How it works

  1. Stage 1: Pipeline sets COPILOT_OTEL_FILE_EXPORTER_PATH=/tmp/awf-tools/staging/otel.jsonl (always-on, zero cost)
  2. Stage 2: ado-aw execute parses the OTel JSONL, populates ExecutionContext.agent_stats
  3. Write actions: Append a collapsible stats block to body/description

Example output

<details>
<summary>Agent Stats (Daily Code Review)</summary>

| Metric | Value |
|--------|-------|
| Model | claude-sonnet-4.5 |
| Tokens | 32,949 input / 236 output (33,185 total) |
| Duration | 8s |
| Tool calls | 2 |
| Turns | 2 |

</details>

Safe outputs with stats

Each supports per-tool include-stats: false opt-out in front matter:

Safe Output Where stats appear
create-pull-request PR description
create-work-item Work item description
comment-on-work-item Comment body
add-pr-comment PR comment body
create-wiki-page Wiki page content
update-wiki-page Wiki page content

Opt-out example

safe-outputs:
  create-pull-request:
    target-branch: main
    include-stats: false  # suppress stats on PRs

Changes

File Change
src/agent_stats.rs (new) AgentStats struct, OTel JSONL parser (reuses ndjson), to_markdown(), append_stats_to_body(), 9 unit tests
src/safeoutputs/result.rs Added agent_name and agent_stats to ExecutionContext
src/main.rs Plumbs front_matter.name and loads otel.jsonl into context
templates/base.yml OTel env vars in AWF step (always-on)
6 safe output files Append stats via append_stats_to_body()
tests/fixtures/copilot-otel.jsonl Real OTel fixture for integration test

Closes #168

jamesadevine and others added 2 commits April 15, 2026 22:20
Create src/agent_stats.rs with AgentStats struct that parses Copilot
CLI OpenTelemetry file exporter output. Extracts model, token usage,
duration, tool calls, and turns from the last invoke_agent span and
execute_tool span counts.

Reuses ndjson::read_ndjson_file for JSONL parsing. Includes
to_markdown() renderer with collapsible details block.

9 unit tests including real copilot-otel.jsonl fixture.

Closes: part of #168

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add agent_name: Option<String> and agent_stats: Option<AgentStats>
  to ExecutionContext
- Plumb front_matter.name to ctx.agent_name in main.rs execute handler
- Load otel.jsonl from safe_output_dir, parse into ctx.agent_stats
  (non-fatal if missing — just logs debug/warn)
- Add OTel env vars to base.yml AWF step (always-on: COPILOT_OTEL_ENABLED,
  COPILOT_OTEL_FILE_EXPORTER_PATH, COPILOT_OTEL_EXPORTER_TYPE)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Looks good overall with a couple of things worth addressing before merge — one security gap and one misleading metric.


Findings

🔒 Security Concerns

  • src/agent_stats.rs:116-134 — OTel file values embedded without sanitization

    model (and indirectly agent_name) are inserted verbatim into the markdown stats block that gets appended to PR descriptions, work item bodies, wiki pages, and comments. The model value is sourced from the OTel JSONL at /tmp/awf-tools/staging/otel.jsonl. The agent itself executes inside the AWF container with the staging directory mounted as a writable volume (that's where safe output NDJSON files land). A misbehaving agent could overwrite otel.jsonl to inject content into these fields — at minimum breaking markdown table rendering, potentially worse.

    The existing SanitizeContent pipeline isn't applied anywhere in append_stats_to_body. A minimal fix: sanitize model through the same control-character + ##vso[ pipeline applied to agent-controlled fields before embedding it. agent_name comes from operator front matter (more trusted), but it too bypasses the sanitization that FrontMatter fields receive at compile time, since it's re-read fresh from disk at Stage 2.

🐛 Bugs / Logic Issues

  • src/agent_stats.rs:95-103tool_calls counts internal Copilot CLI spans

    The code counts all spans whose name starts with "execute_tool". The test fixture itself documents the problem inline: "execute_tool report_intent" is included in the count alongside "execute_tool bash", yielding a tool_calls of 2 when a user would reasonably interpret the output as "1 bash call." report_intent is a Copilot CLI-internal mechanism, not a user-visible tool invocation.

    The fix is to exclude known internal tool names. At minimum filter out "execute_tool report_intent" and other first-party Copilot CLI administrative tools, or more robustly only count spans where gen_ai.tool.type == "function" and the tool name isn't in a known internal set.

⚠️ Suggestions

  • src/safeoutputs/result.rsagent_name is redundant

    ExecutionContext now has both agent_name: Option<String> and agent_stats: Option<AgentStats>, where AgentStats already stores agent_name: String. The only consumer of ctx.agent_name (outside of setting it) appears to be stats population. This creates two sources of truth that could drift. Consider removing ExecutionContext.agent_name and reading it from ctx.agent_stats.as_ref().map(|s| s.agent_name.as_str()) at call sites, or just having to_markdown() not embed the name and instead pass it in at the call site.

  • AGENTS.mdinclude-stats is undocumented

    The new per-tool include-stats: false opt-out is a meaningful operator-facing config option that should be documented alongside the other per-tool safe-output configuration fields.

✅ What Looks Good

  • The from_otel_entries / from_otel_file split is clean — pure parsing logic is fully unit-testable without I/O, and the real-fixture integration test (test_from_otel_entries_real_fixture) gives high confidence against the actual data format.
  • Graceful degradation throughout: missing/malformed OTel file just logs a warning and continues without stats — no user-visible failure.
  • Using .last() on invoke_agent spans correctly handles multi-session OTel files where earlier sessions would have lower cumulative token counts.
  • Per-tool include-stats opt-out is a thoughtful addition for operators who want clean PR descriptions.

Generated by Rust PR Reviewer for issue #219 · ● 574.9K ·

Append a collapsible markdown stats block to safe outputs that produce
human-readable content. Each tool reads include_stats from its typed
config struct (deserialized via ctx.get_tool_config), matching the
existing config pattern used for all other tool options.

Safe outputs with stats:
- create-pull-request (PR description)
- create-work-item (work item description)
- comment-on-work-item (comment body)
- add-pr-comment (PR comment body)
- create-wiki-page (wiki page content)
- update-wiki-page (wiki page content)

Per-tool opt-out via front matter:
  safe-outputs:
    create-pull-request:
      include-stats: false

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jamesadevine and others added 3 commits April 15, 2026 22:49
…ent_name

Security:
- Sanitize model and agent_name before embedding in markdown stats
  block. Strips control chars, neutralizes ##vso[ pipeline commands,
  escapes pipe chars that break markdown tables. The OTel file is
  writable by the agent inside AWF, so these values are untrusted.

Bug fix:
- Filter out Copilot CLI internal tool spans (report_intent,
  permission) from tool_calls count. Only user-visible tool
  invocations are counted.

Cleanup:
- Remove redundant agent_name from ExecutionContext — AgentStats
  already stores it. Single source of truth.

Docs:
- Document include-stats option for all 6 safe outputs in AGENTS.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Azure DevOps does not render HTML <details>/<summary> as collapsible
sections — they display as raw text. Switch to a horizontal rule +
italic heading + table, which renders correctly across ADO PRs, work
items, and wiki pages.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace table with a single line using middle-dot separators:
🤖 _Agent Name_ · model · 45,230 in / 12,450 out · 23 tool calls · 4m 32s

Remove turns from output (low value to operators). Remove unused
total_tokens() method.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Mostly solid, but has a Default trait inconsistency across three config types that could cause subtle bugs if those configs are ever constructed outside of serde.


Findings

🐛 Bugs / Logic Issues

  • src/safeoutputs/comment_on_work_item.rs:98, create_wiki_page.rs:100, update_wiki_page.rs:96 — All three use #[derive(Default)], which means include_stats: bool derives to false (Rust's bool default), not true as intended. The serde path is correct (#[serde(default = "default_include_stats")]true), but Default::default() / ..Default::default() gives the wrong value. The three configs that received explicit Default impls (AddPrCommentConfig, CreateWorkItemConfig, CreatePrConfig) are correct; these three are not.

    This is masked in tests because ExecutionContext::agent_stats is always None in test fixtures, so append_stats_to_body returns the body unchanged regardless of include_stats. It would bite if a test ever uses real stats or if code constructs these configs directly.

  • src/agent_stats.rs:38INTERNAL_TOOL_NAMES[1] is "execute_tool permission", but the real OTel fixture emits the span as "permission" (without the execute_tool prefix). This entry is dead code — "permission" is already excluded by the starts_with("execute_tool") predicate. The corresponding test entry in test_internal_tools_excluded_from_count (line 353) tests a synthetic span name that the Copilot CLI doesn't actually emit.

⚠️ Suggestions

  • src/agent_stats.rs:156sanitize_for_markdown passes \n through the control-character filter (*c == '\n'). Since to_markdown() produces a deliberately single-line format, a model name or agent name containing a newline (possible if the OTel file is crafted or if front matter has a multi-line name) would silently break the formatting. Consider using .replace('\n', " ") or stripping all control chars uniformly.

  • src/safeoutputs/create_pr.rs:1260 — Stats are appended after the provenance footer: description → footer → stats. This puts the footer (a security watermark) between the human-readable description and the stats block. It may be intentional, but reversing to description → stats → footer keeps the provenance footer as the final, unambiguous marker.

  • src/safeoutputs/add_pr_comment.rs (EOF), create_work_item.rs (EOF) — Both files are missing a trailing newline after the default_include_stats function. cargo fmt and most linters will flag this.

✅ What Looks Good

  • The OTel parsing strategy (last invoke_agent span for totals, separate execute_tool spans for call count) is a clean heuristic that correctly handles multi-turn sessions without double-counting.
  • ##vso[ and ##[ injection via crafted OTel model names is correctly sanitized.
  • The opt-out mechanism (include-stats: false) is uniformly applied across all six safe-output surfaces.
  • Graceful degradation: OTel file absence and parse failures are non-fatal (warn/debug log), which is the right behavior for a best-effort stats feature.
  • 9 unit tests including a real fixture-based test with concrete assertions — good coverage of edge cases.

Generated by Rust PR Reviewer for issue #219 · ● 1.2M ·

…ooter order

Bugs fixed:
- CommentOnWorkItemConfig, CreateWikiPageConfig, UpdateWikiPageConfig:
  replace #[derive(Default)] with manual Default impls so include_stats
  defaults to true (not false from bool default)
- Remove dead "execute_tool permission" from INTERNAL_TOOL_NAMES — the
  real OTel span is "permission" (no prefix), already excluded by the
  starts_with("execute_tool") predicate

Improvements:
- Strip newlines in sanitize_for_markdown (single-line format)
- Reorder PR description: stats before provenance footer (footer is
  the final unambiguous security marker)
- Add trailing newlines to add_pr_comment.rs and create_work_item.rs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Good overall design and solid test coverage; one rendering bug with markdown escaping and a few minor issues worth cleaning up before merge.


Findings

🐛 Bugs / Logic Issues

src/agent_stats.rs:191sanitize_for_markdown doesn't escape _ or * before embedding name in italic markers

let name = sanitize_for_markdown(&self.agent_name);
// ...
"\u{1F916} _{name}_ \u{00B7} {model} \u{00B7} ..."

sanitize_for_markdown strips control chars, ##vso[, and | — but doesn't escape _ or *. A perfectly normal agent name like "Build_Agent_v2" produces _Build_Agent_v2_, which most markdown parsers render as Build (italic) followed by Agent_v2_ — breaking the italic and leaving a trailing underscore. Names with hyphens (very common, e.g. "code-review") are fine, but underscores are a real operator hazard.

Fix: add _ and * escaping before using name in a _..._ context:

fn sanitize_for_markdown(s: &str) -> String {
    s.chars()
        .filter(|c| !c.is_control())
        .collect::<String>()
        .replace("##vso[", "[vso-filtered][")
        .replace("##[", "[filtered][")
        .replace('_', "\\_")   // add
        .replace('*', "\\*")   // add
        .replace('|', "\\|")
}

(Alternatively, don't wrap name in _..._ if it's operator-supplied. model doesn't have this problem since model identifiers use hyphens.)


⚠️ Suggestions

src/agent_stats.rs:240 — Orphaned doc comment line before compute_duration

}        // ← closes append_stats_to_body
///      // ← this becomes an empty first line of compute_duration's doc
/// Times are `[seconds, nanoseconds]` arrays.
fn compute_duration(span: &Value) -> f64 {

The /// sits immediately after append_stats_to_body's closing brace with no function description for compute_duration. It reads as compute_duration's doc comment having an empty subject line. Should be:

/// Compute the wall-clock duration of a span in seconds.
/// Times are `[seconds, nanoseconds]` arrays.
fn compute_duration(span: &Value) -> f64 {

src/safeoutputs/*.rsdefault_include_stats() duplicated five times

add_pr_comment.rs, comment_on_work_item.rs, create_wiki_page.rs, create_work_item.rs, and update_wiki_page.rs each define an identical private fn default_include_stats() -> bool { true }. Meanwhile create_pr.rs correctly reuses the existing default_true() it already had.

Consider exporting from agent_stats.rs:

pub(crate) fn default_include_stats() -> bool { true }

...and using it in the five #[serde(default = "...")] attributes.


src/agent_stats.rs — No unit tests for append_stats_to_body

The append_stats_to_body function has two important branches — include_stats: false early-return and ctx.agent_stats: None pass-through — neither of which is directly tested. The fixture test exercises from_otel_entries and to_markdown well, but a quick table-driven test for append_stats_to_body would close the gap:

#[test]
fn test_append_stats_to_body_opt_out() {
    let ctx = ExecutionContext { agent_stats: Some(/* ... */), ..Default::default() };
    assert_eq!(append_stats_to_body("body", &ctx, false), "body");
}

#[test]
fn test_append_stats_to_body_no_stats() {
    let ctx = ExecutionContext::default(); // agent_stats: None
    assert_eq!(append_stats_to_body("body", &ctx, true), "body");
}

src/safeoutputs/create_wiki_page.rs, update_wiki_page.rs — Missing newline at end of file

Both patches end with \ No newline at end of file. Rust files should end with a newline.


✅ What Looks Good

  • The OTel parsing strategy is solid — using .last() on invoke_agent spans to get aggregated totals rather than summing individual chat spans avoids double-counting and is resilient to multi-turn sessions.
  • Graceful degradation in main.rs: OTel parse failures are log::warn (not fatal), so a missing or corrupt OTel file never blocks Stage 2 execution.
  • The real-fixture test with copilot-otel.jsonl is excellent — it validates turn counts, token aggregation, and tool-call filtering (including the report_intent exclusion) against actual CLI output.
  • The ##vso[/##[ injection prevention in sanitize_for_markdown is a good defence-in-depth measure, even though these commands aren't interpreted in ADO PR/comment bodies.

Generated by Rust PR Reviewer for issue #219 · ● 2.3M ·

…s, tests

- Remove italic underscores from agent name in stats line (plain text)
- Fix orphaned doc comment on compute_duration
- Consolidate 5 duplicate default_include_stats() fns into single
  pub(crate) fn in agent_stats.rs
- Add 3 unit tests for append_stats_to_body (opt-out, no-stats, with-stats)
- Fix trailing newlines

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jamesadevine jamesadevine merged commit 6f8d0f7 into main Apr 15, 2026
4 checks passed
@jamesadevine jamesadevine deleted the feat/agent-stats branch April 15, 2026 22:22
@github-actions

Copy link
Copy Markdown
Contributor

🔍 Rust PR Review

Summary: Looks good overall — clean design with appropriate error handling and good test coverage. A few observations worth noting.

Findings

🐛 Bugs / Logic Issues

  • src/agent_stats.rs:122–137\n---\n separator can render as Setext <h2> heading. to_markdown() starts with \n---\n, so when appended to a body that doesn't end with a newline, the result is:

    [last line of body]
    ---
    🤖 ...
    

    In CommonMark and ADO's renderer, --- directly following a non-empty line is a Setext heading marker for that line. The fix is to start with \n\n---\n (ensuring a blank separator line) or explicitly check that the body ends with a newline before appending.

  • src/agent_stats.rs:28AgentStats.turns is a dead field. It's populated from OTel (github.copilot.turn_count) and tested, but to_markdown() never renders it (the unit test at line 377 explicitly asserts !md.contains("turns")). Either render it or remove the field to avoid silent dead weight. As-is, it occupies space in every AgentStats clone for no observable effect.

⚠️ Suggestions

  • src/main.rs:228 — No file size cap before loading OTel JSONL into memory. read_ndjson_file reads the entire file before parsing. For a long-running agent with hundreds of tool calls, the OTel file can grow large. A defensive size check (e.g., reject if > 10 MB, matching the pattern used for patch files in create_pr.rs) would prevent unexpected memory spikes in Stage 2.

  • src/safeoutputs/create_pr.rs:460 — Inconsistent default function for include_stats. create_pr.rs uses the existing default_true fn: #[serde(default = "default_true", rename = "include-stats")], while all other files use crate::agent_stats::default_include_stats. Both return true, so no functional difference, but it's visually inconsistent when grep-searching for include-stats config fields.

  • src/agent_stats.rs:42INTERNAL_TOOL_NAMES comment says "Names must include the execute_tool prefix" but the const could be misread. Consider renaming to INTERNAL_TOOL_SPAN_NAMES and adding a brief example in the comment to make the exact match vs prefix-match distinction clearer for future maintainers who add entries.

✅ What Looks Good

  • Graceful degradation: OTel parse failure is a warn! not a hard error — Stage 2 proceeds normally without stats. Right call.
  • sanitize_for_markdown covers the right threat model: strips control chars, neutralizes ##vso[ and ##[ pipeline commands, escapes pipe chars for table safety. Appropriate for markdown-body contexts (not pipeline YAML).
  • Using .last() on invoke_agent spans correctly handles the multi-session fixture (two invocations in sequence — takes the final aggregate).
  • include-stats: false opt-out is consistently plumbed through all 6 tools with a proper serde default, compile-time checked in tests.
  • Test coverage is solid: 9 unit tests including real fixture, boundary cases for duration/number formatting, ##vso[ injection, and the internal tool exclusion logic.

Generated by Rust PR Reviewer for issue #219 · ● 669.9K ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: capture agent statistics (token usage, duration) and surface in safe output write actions

1 participant