Skillgym can evaluate whether agents preserve behavior, but it does not yet provide a token-efficient workflow for optimizing skill or project metadata. Maintainers and agents need a way to reduce billable token usage from files such as skills or repository instructions while proving that output quality remains protected by Skillgym cases.
This feature should make Skillgym suitable for an optimization loop where an agent writes or selects evals, captures baseline billable token usage, edits an explicitly provided metadata target, reruns the evals, and repeats until the requested budget, reduction, or iteration limit is reached.
Product Requirements
- Add a built-in token-usage reporter.
- Add a token-optimization Skillgym skill that instructs agents how to run this workflow safely.
- Keep the reporter stdout compact, strict JSON, and optimized for agent consumption.
- Reuse normal Skillgym artifacts for failure debugging instead of creating a second detailed token report format.
- Document the reporter and skill so another agent can discover and use them without reading implementation code.
Observed Findings
- Built-in reporters currently include standard, json, json-summary, github-actions, and html.
- Reporter selection is documented in docs/reporters.md and exposed through skillgym run <suite.ts> --reporter <name>.
- Reporter hooks receive final SuiteRunResult data in onSuiteFinish via src/reporters/contract.ts.
- Session usage is normalized in UsageReport with optional inputTokens, outputTokens, reasoningTokens, cacheTokens, totalTokens, and usage source metadata.
- The default reporter already exposes comparable billable token semantics through normalized total-token reporting, displayed as billable.
- json-summary is useful for LLM consumption, but it still includes more result and artifact detail than this optimization loop needs.
- skills/core.md already points agents to reporter and snapshot workflows, but no skill describes a metadata token optimization loop.
- DICTIONARY.md does not yet include token-usage or token-optimization.
Suggested Behavior
token-usage Reporter
The new built-in reporter should be selectable with:
skillgym run <suite.ts> --reporter token-usage
It should write only strict JSON to stdout. No progress lines, Markdown, prose, or extra diagnostics should be mixed into stdout.
The output should include:
- top-level passed for the suite run result
- top-level billable aggregate using only passed rows with provider-sourced billable usage
- top-level artifacts pointing to the suite-run artifact directory
- rows, one row per case x runner result
Suggested compact shape:
{
  "passed": true,
  "billable": { "sum": 4200, "avg": 2100 },
  "artifacts": "artifacts/...",
  "rows": [
    {
      "case": "keeps-critical-rule",
      "runner": "opencode:gpt-5.5",
      "passed": true,
      "billable": { "sum": 4200, "avg": 2100 }
    }
  ]
}
Case x runner rows are required because token usage can differ by runner and model.
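To make the suggested shape easier to consume from an agent or a test, it can be written down as types. The sketch below is illustrative only: the type names (BillableStats, TokenUsageRow, TokenUsageReport) are hypothetical, and allowing null for the top-level billable when no row qualifies is an assumption the proposal does not settle. Only the field names come from the shape above.

```typescript
// Hypothetical type names; only the JSON field names come from the proposal.
interface BillableStats {
  sum: number; // billable tokens summed over repeats
  avg: number; // billable tokens averaged over repeats
}

interface TokenUsageRow {
  case: string;
  runner: string;
  passed: boolean;
  // null for failed rows and for rows without provider-sourced usage
  billable: BillableStats | null;
  // usage marker for rows whose usage is derived-only or unavailable
  usage?: "derived" | "unavailable";
}

interface TokenUsageReport {
  passed: boolean;
  // Assumption: null when no row contributes to the aggregate
  billable: BillableStats | null;
  artifacts: string; // suite-run artifact directory, stated once
  rows: TokenUsageRow[];
}

const exampleReport: TokenUsageReport = {
  passed: true,
  billable: { sum: 4200, avg: 2100 },
  artifacts: "artifacts/...",
  rows: [
    {
      case: "keeps-critical-rule",
      runner: "opencode:gpt-5.5",
      passed: true,
      billable: { sum: 4200, avg: 2100 },
    },
  ],
};

// The reporter would emit exactly one JSON document; a consumer parses it back.
const serialized: string = JSON.stringify(exampleReport);
const parsedReport: TokenUsageReport = JSON.parse(serialized);
console.log(parsedReport.rows.length);
```

Keeping the row type flat mirrors the goal of compact stdout: an agent comparing two runs only needs case, runner, passed, and billable.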
Billable semantics:
- Only use billable token data equivalent to the default reporter's billable/normalized total-token semantics.
- Only aggregate provider-sourced token usage.
- Failed rows must not contribute to token aggregates because failed runs do not represent normal workflow cost.
- Rows without provider-sourced billable usage must not contribute to token aggregates.
- A failed row should still appear, with passed: false, billable: null, and enough failure signal to identify what broke.
- A row with unavailable or derived-only token usage should still appear, with billable: null and a usage marker such as usage: "derived" or usage: "unavailable".
- Top-level aggregate billable.sum should sum billable usage across included rows.
- Top-level aggregate billable.avg should average included row averages.
- Each successful comparable row should expose billable.sum and billable.avg rather than a single number, so repeats can be interpreted correctly.
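The aggregation rules above can be sketched as two small functions. This is a minimal illustration under stated assumptions, not the reporter implementation: RowInput, usageSource, and repeatTotals are hypothetical names for whatever the reporter derives from SuiteRunResult and UsageReport, and returning null for an empty aggregate is an assumption.

```typescript
interface BillableStats {
  sum: number;
  avg: number;
}

// Hypothetical input: one entry per case x runner, one billable total per repeat.
interface RowInput {
  case: string;
  runner: string;
  passed: boolean;
  usageSource: "provider" | "derived" | "unavailable";
  repeatTotals: number[];
}

interface Row {
  case: string;
  runner: string;
  passed: boolean;
  billable: BillableStats | null;
  usage?: "derived" | "unavailable";
}

// Build one output row; failed or non-provider rows keep billable: null.
function toRow(input: RowInput): Row {
  const base = { case: input.case, runner: input.runner, passed: input.passed };
  if (!input.passed) return { ...base, billable: null };
  const source = input.usageSource;
  if (source !== "provider") return { ...base, billable: null, usage: source };
  if (input.repeatTotals.length === 0) {
    return { ...base, billable: null, usage: "unavailable" };
  }
  const sum = input.repeatTotals.reduce((acc, n) => acc + n, 0);
  return { ...base, billable: { sum, avg: sum / input.repeatTotals.length } };
}

// Top level: sum of included row sums, and average of included row averages.
function aggregate(rows: Row[]): BillableStats | null {
  const included: BillableStats[] = [];
  for (const row of rows) {
    if (row.passed && row.billable !== null) included.push(row.billable);
  }
  if (included.length === 0) return null; // assumption: nothing to aggregate
  const sum = included.reduce((acc, b) => acc + b.sum, 0);
  const avg = included.reduce((acc, b) => acc + b.avg, 0) / included.length;
  return { sum, avg };
}

const sampleRows: Row[] = [
  { case: "keeps-critical-rule", runner: "runner-a", passed: true, usageSource: "provider", repeatTotals: [2000, 2200] },
  { case: "keeps-critical-rule", runner: "runner-b", passed: false, usageSource: "provider", repeatTotals: [900] },
  { case: "keeps-critical-rule", runner: "runner-c", passed: true, usageSource: "derived", repeatTotals: [1500] },
].map((input) => toRow(input as RowInput));

console.log(JSON.stringify(aggregate(sampleRows)));
```

In this sample only the first row is included, so the cheaper failed run and the derived-usage run do not lower the aggregate.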
Failure diagnostics:
- The reporter should be stdout-only.
- It should not write a separate token-usage-report.json unless existing artifacts later prove insufficient.
- It should include the top-level artifact directory once only, not repeated per row.
- Agents should use the normal Skillgym artifacts or rerun with the default reporter when a row fails.
token-optimization Skill
Add a Skillgym skill named token-optimization, discoverable through the existing skills mechanism.
The skill should instruct an agent to:
- Require an explicit optimization target from the invoker.
- Ask one short clarification question if the optimization target is missing.
- Identify an existing protection suite or create/strengthen the smallest suite that protects the target behavior.
- Run a baseline with --reporter token-usage before making minimization edits.
- Treat a passing baseline as a hard prerequisite for optimization edits.
- Use compact reporter stdout for before/after token comparison.
- If any row fails, inspect normal Skillgym artifacts/default reporter output and do not count lower token usage from that failed row as an improvement.
- Make the smallest safe metadata changes to the explicit optimization target.
- Rerun with --reporter token-usage after each change.
- Stop according to the invoker-provided token budget, percentage reduction target, or max iteration count.
- If no stopping rule is provided, default to one safe minimization pass plus one verification run.
- Describe snapshots as optional post-optimization regression protection after behavior is stable.
The skill should not prescribe which files to optimize by default. It may give examples, but the actual target must come from the invoker.
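The per-iteration decision in the loop above can be sketched as a single function. This is one possible reading, not the skill's definition: StoppingRule, decide, and the "investigate" outcome are hypothetical names, and comparing billable.sum (rather than billable.avg) against the budget is an assumption.

```typescript
interface BillableStats {
  sum: number;
  avg: number;
}

// Only the two top-level fields the loop needs from reporter stdout.
interface ReportSummary {
  passed: boolean;
  billable: BillableStats | null;
}

// Hypothetical encoding of the invoker-provided stopping rule.
interface StoppingRule {
  tokenBudget?: number; // stop once billable.sum is at or below this
  reductionPercent?: number; // stop once reduction vs. baseline reaches this
  maxIterations?: number; // stop after this many edit-and-rerun passes
}

type Decision = "continue" | "stop" | "investigate";

function decide(
  baseline: ReportSummary,
  current: ReportSummary,
  iteration: number,
  rule: StoppingRule = {},
): Decision {
  // A passing baseline is a hard prerequisite for any optimization edit.
  if (!baseline.passed || baseline.billable === null) {
    throw new Error("Baseline must pass with provider-sourced billable usage");
  }
  // A failing run never counts as an improvement, however few tokens it used.
  if (!current.passed || current.billable === null) return "investigate";

  const hasRule =
    rule.tokenBudget !== undefined ||
    rule.reductionPercent !== undefined ||
    rule.maxIterations !== undefined;
  // With no stopping rule: one minimization pass, then stop after verifying it.
  const maxIterations = rule.maxIterations ?? (hasRule ? Infinity : 1);

  const before = baseline.billable.sum;
  const after = current.billable.sum;
  const reduction = before > 0 ? ((before - after) / before) * 100 : 0;

  if (rule.tokenBudget !== undefined && after <= rule.tokenBudget) return "stop";
  if (rule.reductionPercent !== undefined && reduction >= rule.reductionPercent) return "stop";
  if (iteration >= maxIterations) return "stop";
  return "continue";
}

const baselineRun: ReportSummary = { passed: true, billable: { sum: 4200, avg: 2100 } };
const afterEdit: ReportSummary = { passed: true, billable: { sum: 3000, avg: 1500 } };
console.log(decide(baselineRun, afterEdit, 1, { reductionPercent: 20 }));
```

On "investigate", the agent would inspect the normal Skillgym artifacts or rerun with the default reporter before editing further.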
Documentation And Dictionary
Update documentation so users can discover:
- the token-usage reporter name
- the stdout JSON contract
- provider-sourced billable-token-only aggregation
- failed-row and derived-usage exclusion behavior
- artifact debugging expectations
- the token-optimization skill workflow
Update DICTIONARY.md with approved terms:
- token-usage
- token-optimization
Acceptance Criteria
- skillgym run <suite.ts> --reporter token-usage loads as a built-in reporter.
- Reporter stdout is strict JSON.
- Reporter stdout has one row per case x runner result.
- Passed rows with provider-sourced billable usage include { "sum": number, "avg": number }.
- Failed rows have billable: null and are excluded from all aggregates.
- Derived or unavailable usage rows have billable: null and are excluded from all aggregates.
- Top-level artifacts points to the suite-run artifact directory and is not repeated per row.
- The reporter does not create a separate detailed token report file.
- Tests cover built-in reporter loading/listing and the aggregation/exclusion semantics above.
- token-optimization skill exists and documents the target requirement, baseline passing rule, optimization loop, stopping rule, failure handling, and optional snapshot usage.
- Reporter docs and dictionary are updated.
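Several of these criteria can be checked mechanically against captured stdout. The following is a rough sketch of such a check, assuming the stdout has already been captured as a string. checkTokenUsageStdout is a hypothetical helper, not part of Skillgym, and accepting null for the top-level billable is an assumption.

```typescript
type JsonObject = Record<string, unknown>;

function isObject(value: unknown): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isBillable(value: unknown): boolean {
  return isObject(value) && typeof value.sum === "number" && typeof value.avg === "number";
}

// Returns a list of contract violations; an empty list means the checks passed.
function checkTokenUsageStdout(stdout: string): string[] {
  let parsed: unknown;
  try {
    // JSON.parse rejects progress lines or prose mixed into stdout.
    parsed = JSON.parse(stdout);
  } catch {
    return ["stdout is not strict JSON"];
  }
  const report = parsed;
  if (!isObject(report)) return ["top level is not a JSON object"];

  const problems: string[] = [];
  if (typeof report.passed !== "boolean") problems.push("passed must be a boolean");
  if (typeof report.artifacts !== "string") problems.push("artifacts must be a string");
  // Assumption: a null aggregate is allowed when no row qualifies.
  if (report.billable !== null && !isBillable(report.billable)) {
    problems.push("billable must be null or { sum, avg }");
  }
  const rows = report.rows;
  if (!Array.isArray(rows)) return [...problems, "rows must be an array"];

  for (let i = 0; i < rows.length; i++) {
    const row: unknown = rows[i];
    if (!isObject(row)) {
      problems.push(`rows[${i}] is not an object`);
      continue;
    }
    if (typeof row.case !== "string" || typeof row.runner !== "string") {
      problems.push(`rows[${i}] needs string case and runner`);
    }
    if ("artifacts" in row) problems.push(`rows[${i}] must not repeat artifacts`);
    if (row.billable !== null && !isBillable(row.billable)) {
      problems.push(`rows[${i}].billable must be null or { sum, avg }`);
    }
    if (row.passed !== true && row.billable !== null) {
      problems.push(`rows[${i}] did not pass but has billable usage`);
    }
  }
  return problems;
}

const validStdout = JSON.stringify({
  passed: true,
  billable: { sum: 4200, avg: 2100 },
  artifacts: "artifacts/example-run", // hypothetical path for illustration
  rows: [{ case: "keeps-critical-rule", runner: "runner-a", passed: true, billable: { sum: 4200, avg: 2100 } }],
});
console.log(checkTokenUsageStdout(validStdout).length);
```

A check like this covers only the stdout contract; the aggregation and exclusion semantics still need tests against real suite results.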
Resolution Summary
Skillgym should provide a compact token-usage reporter and a token-optimization skill so agents can safely minimize explicitly targeted metadata while preserving behavior through passing evals and comparing only provider-sourced billable token usage from successful runs.