# Revise blog post draft on GitHub Agentic Workflows #2176
```diff
@@ -9,13 +9,13 @@ Agentic workflows that run on every pull request can quietly accumulate large AP
 
 ---
 
-GitHub Agentic Workflows are like the a team of street sweepers that clean up little messes all over your repo. However, like all agentic work cost is a first-class concern. Chatbots work under a user's watchful eye, but automations like agentic workflows run out of view and costs can compound across an entire team's activity. Thankfully it is easier to improve CI automation efficiency than interactive desktop use. A developer's session can be hard to predict since tasks change minute to minute and context is reactive. An agentic workflow's task is fully specified in YAML and it runs the same job every time, which makes systematic optimization easier.
+GitHub Agentic Workflows are like a team of street sweepers that clean up little messes all over your repo. Like all agentic work, cost is a first-class concern. Chatbots work under a user's watchful eye, but automations like agentic workflows run out of view and costs can compound across an entire team's activity. Thankfully, it is easier to improve the efficiency of agentic automations than it is to optimize interactive agentic applications. A developer's session can be hard to predict because tasks change minute to minute, and the underlying context is reactive. In contrast, an agentic workflow's task is fully specified in YAML and it runs the same job every time, which makes systematic optimization easier.
 
 We build and maintain GitHub Agentic Workflows as a live product in our own repository, and we worry about our own token efficiency as much as our users do. In early April 2026, we began to systematically optimize the token usage of the workflows that we rely on every day. This post describes what we instrumented, how we optimized, and the results.
 
 ## Token efficiency
 
-The repositories that build GitHub Agentic Workflows use agentic workflows for their own CI. We have an Auto-Triage Issues workflow that labels every new issue for discoverability, a Contribution Check that audits incoming pull requests for contributor guideline compliance, a Test Quality Sentinel that reviews test depth on every ready-for-review PR, a Glossary Maintainer that keeps documentation in sync with codebase changes, and three daily quality checks—Daily Syntax Error Quality, Daily Compiler Quality, and Daily Community Attribution—that run on a schedule to test compiler error messages, assess code standards, and track community contributions. These run on production hardware against production API rate limits.
+The repositories that build GitHub Agentic Workflows use agentic workflows for their own CI. We have an `Auto-Triage Issues` workflow that labels every new issue for discoverability, a `Contribution Check` that audits incoming pull requests for contributor guideline compliance, a `Test Quality Sentinel` that reviews test depth on every ready-for-review PR, a `Glossary Maintainer` that keeps documentation in sync with codebase changes, and three daily quality checks — `Daily Syntax Error Quality`, `Daily Compiler Quality`, and `Daily Community Attribution` — which run on a schedule to test compiler error messages, assess code standards, and track community contributions, respectively. These run on production hardware against production API rate limits.
 
 The fastest path to understanding the true characteristics of a system is to depend on it yourself. When a workflow's context window grows by 20% because we accidentally added an unused MCP tool to a manifest, we see it in our own data. Running our own workflows gives us a strong incentive to improve.
```
```diff
@@ -29,9 +29,9 @@ Every workflow run now emits a `token-usage.jsonl` artifact with one record per
 
 Token data in hand, the next question was what to do with it. Rather than analyze it manually, we built two optimization workflows that run on a daily schedule.
 
-A **Daily Token Usage Auditor** reads token usage artifacts from all recent workflow runs, aggregates consumption by workflow and time period, and posts a structured report. Its job is to flag any workflow that has significantly increased its token footprint since the last report, surface the most expensive workflows, and note any runs that look anomalous (e.g., a workflow that normally completes in 4 LLM turns taking 18).
+A `Daily Token Usage Auditor` reads token usage artifacts from all recent workflow runs, aggregates consumption by workflow and time period, and posts a structured report. Its job is to flag any workflow that has significantly increased its token footprint since the last report, surface the most expensive workflows, and note any runs that look anomalous (e.g., a workflow that normally completes in 4 LLM turns taking 18 turns instead).
 
-A **Daily Token Optimizer** goes further. When an Auditor flags a heavy workflow, the Optimizer looks at the workflow's source and recent run logs and creates a new issue with concrete inefficiencies and specific changes. The Optimizer has consistently found many workflow inefficiencies that we had missed.
+A `Daily Token Optimizer` goes further. When the Auditor flags a heavy workflow, the Optimizer looks at the workflow's source and recent run logs and creates a new issue listing concrete inefficiencies and specific changes. The Optimizer has consistently found many workflow inefficiencies that we had missed.
 
 Of course, these are agentic workflows themselves, and their token usage also appears in the daily reports, creating a small virtuous cycle.
```
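To make the Auditor's job concrete, here is a minimal sketch of the kind of aggregation it could perform. The post does not show the `token-usage.jsonl` schema, so the field names (`workflow`, `run_id`, `input_tokens`, `output_tokens`), the `smoke-copilot` baseline key, and the anomaly threshold are assumptions for illustration, not the actual implementation.

```python
# Sketch of a token-usage auditor, assuming a hypothetical JSONL schema
# with one record per LLM call:
#   {"workflow": ..., "run_id": ..., "input_tokens": ..., "output_tokens": ...}
import json
from collections import defaultdict
from pathlib import Path

def load_records(artifact_dir: str):
    """Yield one dict per LLM call from every token-usage.jsonl artifact."""
    for path in Path(artifact_dir).rglob("token-usage.jsonl"):
        with open(path) as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

def aggregate(records):
    """Sum tokens and count LLM calls per (workflow, run)."""
    runs = defaultdict(lambda: {"tokens": 0, "llm_calls": 0})
    for r in records:
        key = (r["workflow"], r["run_id"])
        runs[key]["tokens"] += r["input_tokens"] + r["output_tokens"]
        runs[key]["llm_calls"] += 1
    return runs

def flag_anomalies(runs, typical_calls: dict, factor: float = 3.0):
    """Flag runs whose LLM-turn count far exceeds the workflow's baseline,
    e.g. a workflow that normally finishes in 4 turns taking 18."""
    flagged = []
    for (workflow, run_id), stats in runs.items():
        baseline = typical_calls.get(workflow)
        if baseline and stats["llm_calls"] > factor * baseline:
            flagged.append((workflow, run_id, stats["llm_calls"], baseline))
    return flagged

runs = aggregate(load_records("artifacts/"))
# Surface the most expensive runs, as in the daily report.
for wf_run, stats in sorted(runs.items(), key=lambda kv: -kv[1]["tokens"])[:5]:
    print(wf_run, stats)
# Hypothetical baseline: a workflow that normally completes in 5 turns.
print(flag_anomalies(runs, typical_calls={"smoke-copilot": 5}))
```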
```diff
@@ -75,7 +75,7 @@ where *m* is a model cost multiplier (Haiku = 0.25×, Sonnet = 1.0×, Opus = 5.0×)
 
 **The workload is a live repository.** The workflows we optimize are not operating on consistent benchmark data. A workflow that processes a 200-line PR diff one day genuinely uses more tokens than one processing a 5-line fix a few hours later. The difference is correct behavior, not inefficiency. Raw token counts can conflate workload variation with efficiency changes. We try to normalize for this by tracking LLM API call counts alongside token counts; if the number of LLM turns per run stays constant while tokens-per-call falls, that's a genuine efficiency improvement. If both fall together, it could mean less work is being done.
 
-**Does quality change?** This is the hardest question. A lighter model running a more constrained workflow might produce lower-quality output. We looked at the process-level signals like output tokens per LLM call, turn counts per run, and tool-call completion rates to approximate quality. For our optimized Smoke Copilot workflow all three remained stable across the optimization period even as token consumption fell. The workflow completes in exactly 5 LLM turns every run, before and after the optimizations. Of course, these are process signals, not outcome signals. We cannot directly observe whether the quality of agent output improved, degraded, or stayed flat, because we have no ground-truth labels for what "correct" output looks like. Measuring goodput—tokens per unit of correct work—requires additional instrumentation and thought.
+**Does quality change?** This is the hardest question. A lighter model running a more constrained workflow might produce lower-quality output. We looked at process-level signals like output tokens per LLM call, turn counts per run, and tool-call completion rates to approximate quality. For our optimized Smoke Copilot workflow, all three remained stable across the optimization period even as token consumption fell. The workflow completes in exactly 5 LLM turns every run, before and after the optimizations. Of course, these are process signals, not outcome signals. We cannot directly observe whether the quality of agent output improved, degraded, or stayed flat, because we have no ground-truth labels for what "correct" output looks like. Measuring output—tokens per unit of correct work—requires additional instrumentation and thought.
 
 ## Initial results
```

> **Author (Contributor):** This seemed like a search/replace typo
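The hunk header above quotes the effective-tokens formula only in truncated form; the authoritative definition is in the Effective Tokens specification linked in the next hunk. As an illustration only, the sketch below assumes effective tokens are raw tokens scaled by the per-model multiplier *m*, and pairs it with the workload-normalization check described in the context line (turns constant while tokens-per-call falls).

```python
# Illustrative only: assumes effective tokens = m * raw tokens, where m is
# the per-model cost multiplier quoted in the hunk header above. The real
# definition lives in the Effective Tokens specification.
MODEL_MULTIPLIER = {"haiku": 0.25, "sonnet": 1.0, "opus": 5.0}

def effective_tokens(records) -> float:
    """Scale each LLM call's raw tokens by its model's cost multiplier."""
    return sum(
        MODEL_MULTIPLIER[r["model"]] * (r["input_tokens"] + r["output_tokens"])
        for r in records
    )

def efficiency_signal(before, after) -> str:
    """Workload normalization per the post: compare turn counts and
    tokens-per-call between two comparable (llm_calls, total_tokens) pairs."""
    (calls_a, tokens_a), (calls_b, tokens_b) = before, after
    per_call_a, per_call_b = tokens_a / calls_a, tokens_b / calls_b
    if calls_b == calls_a and per_call_b < per_call_a:
        return "genuine efficiency improvement"  # same turns, cheaper turns
    if calls_b < calls_a and per_call_b < per_call_a:
        return "ambiguous: possibly less work being done"
    return "no clear signal"

calls = [{"model": "haiku", "input_tokens": 1000, "output_tokens": 200}]
print(effective_tokens(calls))                    # 0.25 * 1200 = 300.0
print(efficiency_signal((5, 40_000), (5, 28_000)))  # genuine efficiency improvement
```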
```diff
@@ -97,7 +97,7 @@ From these results, we highlight three patterns that account for most of the gains
 
 The tools that we use to optimize our workflows, like API-level observability, automated auditing workflows, MCP tool pruning, and CLI substitution, are all available today in the GitHub Agentic Workflows framework. The measurement methodology (workload normalization, effective tokens) is documented in the [Effective Tokens specification](https://github.com/github/gh-aw/blob/main/docs/src/content/docs/reference/effective-tokens-specification.md) and the data and analysis scripts for this study are published on the [`token-efficiency-paper`](https://github.com/github/gh-aw-firewall/tree/token-efficiency-paper) branch.
 
-The open questions are genuinely hard: measuring goodput requires outcome instrumentation that doesn't yet exist at scale for agentic CI workflows. We're building toward it. In the meantime, the proxy-level observability and the optimizer workflows have already changed how we develop and deploy new agentic automations—we add token monitoring from day one rather than retrofitting it later.
+The open questions are genuinely hard: measuring output requires outcome instrumentation that doesn't yet exist at scale for agentic CI workflows. We're building toward it. In the meantime, the proxy-level observability and the optimizer workflows have already changed how we develop and deploy new agentic automations—we add token monitoring from day one rather than retrofitting it later.
 
 If you're running agentic workflows in CI and wondering whether you're spending more than you need to, the first step is the same as ours: add the API proxy, turn on logging, and let the data tell you where to look.
```

> **Author (Contributor):** This also seemed like a typo
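For readers starting from zero, the closing advice ("add the API proxy, turn on logging") might look something like the wrapper below. This is a hypothetical sketch, not gh-aw's actual proxy: it assumes an LLM client whose responses expose `usage.input_tokens` and `usage.output_tokens`, and it appends one record per call in the same spirit as the `token-usage.jsonl` artifact.

```python
# Hypothetical illustration of the "turn on logging" first step: wrap each
# LLM API call and append one usage record per call to token-usage.jsonl.
import json
import time

def log_usage(workflow: str, run_id: str, call_llm, *args, **kwargs):
    """Invoke `call_llm` and record its reported token usage as one JSONL line.

    Assumes the client's response exposes `usage.input_tokens` and
    `usage.output_tokens`, as many LLM APIs do.
    """
    response = call_llm(*args, **kwargs)
    record = {
        "ts": time.time(),
        "workflow": workflow,
        "run_id": run_id,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
    with open("token-usage.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```

With records like these accumulating per run, the auditing and normalization sketches earlier in this diff have the data they need.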