Add token usage reporter and optimization skill #43

@V3RON

Description

Skillgym can evaluate whether agents preserve behavior, but it does not yet provide a token-efficient workflow for optimizing skill or project metadata. Maintainers and agents need a way to reduce billable token usage from files such as skills or repository instructions while proving that output quality remains protected by Skillgym cases.

This feature should make Skillgym suitable for an optimization loop where an agent writes or selects evals, captures baseline billable token usage, edits an explicitly provided metadata target, reruns the evals, and repeats until the requested budget, reduction, or iteration limit is reached.

Product Requirements

  1. Add a built-in token-usage reporter.
  2. Add a token-optimization Skillgym skill that instructs agents how to run this workflow safely.
  3. Keep the reporter stdout compact, strict JSON, and optimized for agent consumption.
  4. Reuse normal Skillgym artifacts for failure debugging instead of creating a second detailed token report format.
  5. Document the reporter and skill so another agent can discover and use them without reading implementation code.

Observed Findings

  • Built-in reporters currently include standard, json, json-summary, github-actions, and html.
  • Reporter selection is documented in docs/reporters.md and exposed through skillgym run <suite.ts> --reporter <name>.
  • Reporter hooks receive final SuiteRunResult data in onSuiteFinish via src/reporters/contract.ts.
  • Session usage is normalized in UsageReport with optional inputTokens, outputTokens, reasoningTokens, cacheTokens, totalTokens, and usage source metadata.
  • The default reporter already exposes comparable billable token semantics through normalized total-token reporting, displayed as billable.
  • json-summary is useful for LLM consumption, but it still includes more result and artifact detail than this optimization loop needs.
  • skills/core.md already points agents to reporter and snapshot workflows, but no skill describes a metadata token optimization loop.
  • DICTIONARY.md does not yet include token-usage or token-optimization.

Suggested Behavior

token-usage Reporter

The new built-in reporter should be selectable with:

skillgym run <suite.ts> --reporter token-usage

It should write only strict JSON to stdout. No progress lines, Markdown, prose, or extra diagnostics should be mixed into stdout.

The output should include:

  • top-level passed for the suite run result
  • top-level billable aggregate using only passed rows with provider-sourced billable usage
  • top-level artifacts pointing to the suite-run artifact directory
  • rows, one row per case x runner result

Suggested compact shape:

{
  "passed": true,
  "billable": { "sum": 4200, "avg": 2100 },
  "artifacts": "artifacts/...",
  "rows": [
    {
      "case": "keeps-critical-rule",
      "runner": "opencode:gpt-5.5",
      "passed": true,
      "billable": { "sum": 4200, "avg": 2100 }
    }
  ]
}

Case x runner rows are required because token usage can differ by runner and model.
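
The compact shape above could be typed roughly as follows. This is only a sketch for discussion, not an existing Skillgym type: the type names, the optional `failure` field, and the choice to allow a `null` top-level `billable` when no row qualifies are all assumptions. Only `usage: "derived"` and `usage: "unavailable"` come from the wording of this issue.

```typescript
// Sketch only: these type names are hypothetical, not part of Skillgym today.
type Billable = { sum: number; avg: number };

interface TokenUsageRow {
  case: string;
  runner: string;
  passed: boolean;
  // null for failed rows and for rows without provider-sourced usage.
  billable: Billable | null;
  // Marker for rows whose usage is not provider-sourced.
  usage?: "derived" | "unavailable";
  // Assumed field name for the "failure signal"; the issue does not name it.
  failure?: string;
}

interface TokenUsageReport {
  passed: boolean;
  // Assumption: null when no row contributes to the aggregate.
  billable: Billable | null;
  artifacts: string;
  rows: TokenUsageRow[];
}

// Illustrative report: one comparable row and one failed row.
const example: TokenUsageReport = {
  passed: false,
  billable: { sum: 4200, avg: 2100 },
  artifacts: "artifacts/...",
  rows: [
    {
      case: "keeps-critical-rule",
      runner: "opencode:gpt-5.5",
      passed: true,
      billable: { sum: 4200, avg: 2100 },
    },
    {
      case: "hypothetical-failing-case",
      runner: "opencode:gpt-5.5",
      passed: false,
      billable: null,
      failure: "assertion failed",
    },
  ],
};
```

The failed row stays visible in `rows` but leaves the top-level `billable` unchanged, which matches the exclusion rules listed under billable semantics.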

Billable semantics:

  • Only use billable token data equivalent to the default reporter's billable/normalized total-token semantics.
  • Only aggregate provider-sourced token usage.
  • Failed rows must not contribute to token aggregates because failed runs do not represent normal workflow cost.
  • Rows without provider-sourced billable usage must not contribute to token aggregates.
  • A failed row should still appear, with passed: false, billable: null, and enough failure signal to identify what broke.
  • A row with unavailable or derived-only token usage should still appear, with billable: null and a usage marker such as usage: "derived" or usage: "unavailable".
  • Top-level billable.sum should be the sum of billable.sum across included rows.
  • Top-level billable.avg should be the mean of the included rows' billable.avg values.
  • Each successful comparable row should expose billable.sum and billable.avg rather than a single number, so repeats can be interpreted correctly.
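
The aggregation rules above can be sketched as a pair of pure functions. The input shape (`RowInput`, `usageSource`, `repeatTotals`) is hypothetical and does not reflect the real `SuiteRunResult` or `UsageReport` types. The sketch also assumes that a row's `sum` and `avg` are taken over its repeats, which is one reading of the last bullet.

```typescript
type Billable = { sum: number; avg: number };

// Hypothetical input: one entry per case x runner result,
// with one normalized billable total per repeat.
interface RowInput {
  case: string;
  runner: string;
  passed: boolean;
  usageSource: "provider" | "derived" | "unavailable";
  repeatTotals: number[];
}

// A row is comparable only if it passed and its usage is provider-sourced.
function rowBillable(row: RowInput): Billable | null {
  if (!row.passed || row.usageSource !== "provider" || row.repeatTotals.length === 0) {
    return null;
  }
  const sum = row.repeatTotals.reduce((acc, n) => acc + n, 0);
  return { sum, avg: sum / row.repeatTotals.length };
}

// Top-level aggregate: sum of included row sums, mean of included row averages.
// Returning null when nothing is included is an assumption of this sketch.
function aggregate(rows: RowInput[]): Billable | null {
  const included = rows
    .map(rowBillable)
    .filter((b): b is Billable => b !== null);
  if (included.length === 0) return null;
  const sum = included.reduce((acc, b) => acc + b.sum, 0);
  const avg = included.reduce((acc, b) => acc + b.avg, 0) / included.length;
  return { sum, avg };
}
```

Under these assumptions, a passed provider-sourced row with repeats of 2000 and 2200 yields `{ sum: 4200, avg: 2100 }`, while a failed row or a derived-usage row yields `null` and leaves the aggregate untouched.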

Failure diagnostics:

  • The reporter should be stdout-only.
  • It should not write a separate token-usage-report.json unless existing artifacts later prove insufficient.
  • It should include the artifact directory exactly once, at the top level, and not repeat it per row.
  • Agents should use the normal Skillgym artifacts or rerun with the default reporter when a row fails.
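
On the consuming side, an agent could treat stdout as a single JSON document and look only at the failed rows plus the one top-level artifact path. The helper below is a minimal sketch with hypothetical names (`ParsedReport`, `rowsNeedingInspection`). It relies only on the standard `JSON.parse`, which throws if anything other than strict JSON reaches stdout.

```typescript
interface ParsedRow {
  case: string;
  runner: string;
  passed: boolean;
  billable: { sum: number; avg: number } | null;
}

interface ParsedReport {
  passed: boolean;
  artifacts: string;
  rows: ParsedRow[];
}

// Hypothetical helper: returns the artifact directory and the failed rows
// that need a closer look in the normal Skillgym artifacts.
function rowsNeedingInspection(stdout: string): { artifacts: string; failed: string[] } {
  // JSON.parse throws a SyntaxError if progress lines or prose are mixed in.
  const report = JSON.parse(stdout) as ParsedReport;
  const failed = report.rows
    .filter((row) => !row.passed)
    .map((row) => `${row.case} @ ${row.runner}`);
  return { artifacts: report.artifacts, failed };
}
```

Because the artifact directory appears once at the top level, the helper returns it alongside the failed row labels instead of looking for a path on each row.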

token-optimization Skill

Add a Skillgym skill named token-optimization, discoverable through the existing skills mechanism.

The skill should instruct an agent to:

  1. Require an explicit optimization target from the invoker.
  2. Ask one short clarification question if the optimization target is missing.
  3. Identify an existing protection suite or create/strengthen the smallest suite that protects the target behavior.
  4. Run a baseline with --reporter token-usage before making minimization edits.
  5. Treat a passing baseline as a hard prerequisite for optimization edits.
  6. Use compact reporter stdout for before/after token comparison.
  7. If any row fails, inspect the normal Skillgym artifacts or the default reporter output, and do not count lower token usage from that failed row as an improvement.
  8. Make the smallest safe metadata changes to the explicit optimization target.
  9. Rerun with --reporter token-usage after each change.
  10. Stop according to the invoker-provided token budget, percentage reduction target, or max iteration count.
  11. If no stopping rule is provided, default to one safe minimization pass plus one verification run.
  12. Describe snapshots as optional post-optimization regression protection after behavior is stable.

The skill should not prescribe which files to optimize by default. It may give examples, but the actual target must come from the invoker.
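
The stopping rules in steps 10 and 11 could be expressed as a small decision function. This is a sketch under stated assumptions: the names `StoppingRule` and `shouldStop` are hypothetical, the comparison uses top-level `billable.sum`, and `iteration` counts completed edit-and-rerun passes whose rerun passed. A failing rerun should be handled under step 7 and never passed to this function as an improvement.

```typescript
// Hypothetical shape for the invoker-provided stopping rule.
interface StoppingRule {
  tokenBudget?: number;      // stop once billable sum is at or below this
  reductionPercent?: number; // stop once this percentage reduction is reached
  maxIterations?: number;    // stop after this many edit-and-rerun passes
}

function shouldStop(
  rule: StoppingRule,
  baselineSum: number,
  currentSum: number,
  iteration: number,
): boolean {
  const hasRule =
    rule.tokenBudget !== undefined ||
    rule.reductionPercent !== undefined ||
    rule.maxIterations !== undefined;

  // Default from step 11: one safe minimization pass plus its verification run.
  if (!hasRule) return iteration >= 1;

  if (rule.tokenBudget !== undefined && currentSum <= rule.tokenBudget) return true;

  if (rule.reductionPercent !== undefined && baselineSum > 0) {
    const reduction = ((baselineSum - currentSum) / baselineSum) * 100;
    if (reduction >= rule.reductionPercent) return true;
  }

  if (rule.maxIterations !== undefined && iteration >= rule.maxIterations) return true;

  return false;
}
```

For example, with a baseline of 4200 and a budget of 3500, the loop would continue at 4000 tokens and stop at 3400. With no rule at all, it stops after the first completed pass.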

Documentation And Dictionary

Update documentation so users can discover:

  • the token-usage reporter name
  • the stdout JSON contract
  • provider-sourced billable-token-only aggregation
  • failed-row and derived-usage exclusion behavior
  • artifact debugging expectations
  • the token-optimization skill workflow

Update DICTIONARY.md with approved terms:

  • token-usage
  • token-optimization

Acceptance Criteria

  • skillgym run <suite.ts> --reporter token-usage loads as a built-in reporter.
  • Reporter stdout is strict JSON.
  • Reporter stdout has one row per case x runner result.
  • Passed rows with provider-sourced billable usage include { "sum": number, "avg": number }.
  • Failed rows have billable: null and are excluded from all aggregates.
  • Derived or unavailable usage rows have billable: null and are excluded from all aggregates.
  • Top-level artifacts points to the suite-run artifact directory and is not repeated per row.
  • The reporter does not create a separate detailed token report file.
  • Tests cover built-in reporter loading/listing and the aggregation/exclusion semantics above.
  • token-optimization skill exists and documents the target requirement, baseline passing rule, optimization loop, stopping rule, failure handling, and optional snapshot usage.
  • Reporter docs and dictionary are updated.

Resolution Summary

Skillgym should provide a compact token-usage reporter and a token-optimization skill so agents can safely minimize explicitly targeted metadata while preserving behavior through passing evals and comparing only provider-sourced billable token usage from successful runs.
