Add token usage reporter and optimization skill #43

Skillgym can evaluate whether agents preserve behavior, but it does not yet provide a token-efficient workflow for optimizing skill or project metadata. Maintainers and agents need a way to reduce billable token usage from files such as skills or repository instructions while proving that output quality remains protected by Skillgym cases.

This feature should make Skillgym suitable for an optimization loop where an agent writes or selects evals, captures baseline billable token usage, edits an explicitly provided metadata target, reruns the evals, and repeats until the requested budget, reduction, or iteration limit is reached.

## Product Requirements

1. Add a built-in `token-usage` reporter.
2. Add a `token-optimization` Skillgym skill that instructs agents how to run this workflow safely.
3. Keep the reporter stdout compact, strict JSON, and optimized for agent consumption.
4. Reuse normal Skillgym artifacts for failure debugging instead of creating a second detailed token report format.
5. Document the reporter and skill so another agent can discover and use them without reading implementation code.

## Observed Findings

- Built-in reporters currently include `standard`, `json`, `json-summary`, `github-actions`, and `html`.
- Reporter selection is documented in `docs/reporters.md` and exposed through `skillgym run <suite.ts> --reporter <name>`.
- Reporter hooks receive final `SuiteRunResult` data in `onSuiteFinish` via `src/reporters/contract.ts`.
- Session usage is normalized in `UsageReport` with optional `inputTokens`, `outputTokens`, `reasoningTokens`, `cacheTokens`, `totalTokens`, and usage source metadata.
- The default reporter already reports normalized total tokens, displayed as `billable`, which provides billable-token semantics the new reporter can match.
- `json-summary` is useful for LLM consumption, but it still includes more result and artifact detail than this optimization loop needs.
- `skills/core.md` already points agents to reporter and snapshot workflows, but no skill describes a metadata token optimization loop.
- `DICTIONARY.md` does not yet include `token-usage` or `token-optimization`.

## Suggested Behavior

### `token-usage` Reporter

The new built-in reporter should be selectable with:

```bash
skillgym run <suite.ts> --reporter token-usage
```

It should write only strict JSON to stdout. No progress lines, Markdown, prose, or extra diagnostics should be mixed into stdout.

The output should include:

- top-level `passed` for the suite run result
- top-level `billable` aggregate using only passed rows with provider-sourced billable usage
- top-level `artifacts` pointing to the suite-run artifact directory
- `rows`, one row per case x runner result

Suggested compact shape:

```json
{
  "passed": true,
  "billable": { "sum": 4200, "avg": 2100 },
  "artifacts": "artifacts/...",
  "rows": [
    {
      "case": "keeps-critical-rule",
      "runner": "opencode:gpt-5.5",
      "passed": true,
      "billable": { "sum": 4200, "avg": 2100 }
    }
  ]
}
```

Case x runner rows are required because token usage can differ by runner and model.
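As a rough sketch of the contract above, the shape could be written as TypeScript types plus a small guard that an agent might run on captured stdout. The names `BillableUsage`, `TokenUsageRow`, `TokenUsageReport`, and `parseTokenUsageReport` are hypothetical and are not existing Skillgym exports. Typing the top-level `billable` as nullable is also an assumption, since this issue does not say what is reported when no row is comparable.

```typescript
// Hypothetical types mirroring the suggested compact shape; not Skillgym exports.
interface BillableUsage {
  sum: number;
  avg: number;
}

interface TokenUsageRow {
  case: string;
  runner: string;
  passed: boolean;
  billable: BillableUsage | null; // null for failed or non-comparable rows
  usage?: "derived" | "unavailable"; // marker for rows without provider-sourced usage
}

interface TokenUsageReport {
  passed: boolean;
  billable: BillableUsage | null; // assumption: null when no row is comparable
  artifacts: string;
  rows: TokenUsageRow[];
}

// Parses reporter stdout. Throws if stdout is not a single JSON document
// or if the expected top-level fields are missing.
function parseTokenUsageReport(stdout: string): TokenUsageReport {
  // JSON.parse throws when progress lines or prose are mixed into stdout.
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("token-usage output must be a JSON object");
  }
  const report = parsed as Partial<TokenUsageReport>;
  if (
    typeof report.passed !== "boolean" ||
    typeof report.artifacts !== "string" ||
    !Array.isArray(report.rows)
  ) {
    throw new Error("token-usage output is missing passed, artifacts, or rows");
  }
  return report as TokenUsageReport;
}
```

A guard like this only checks the top-level fields; per-row validation is left out to keep the sketch short.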

Billable semantics:

- Use only billable token data with the same semantics as the default reporter's `billable` value (normalized total tokens).
- Only aggregate provider-sourced token usage.
- Failed rows must not contribute to token aggregates because failed runs do not represent normal workflow cost.
- Rows without provider-sourced billable usage must not contribute to token aggregates.
- A failed row should still appear, with `passed: false`, `billable: null`, and enough failure signal to identify what broke.
- A row with unavailable or derived-only token usage should still appear, with `billable: null` and a usage marker such as `usage: "derived"` or `usage: "unavailable"`.
- Top-level aggregate `billable.sum` should sum billable usage across included rows.
- Top-level aggregate `billable.avg` should be the mean of the included rows' `billable.avg` values.
- Each successful comparable row should expose `billable.sum` and `billable.avg` rather than a single number, so repeats can be interpreted correctly.
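The aggregation rules above could be sketched as follows. This is a minimal illustration, not the reporter implementation. It assumes a row's `sum` totals its repeats and its `avg` is the per-repeat mean, which is consistent with the example shape (4200 over two repeats gives 2100) but is not stated outright in this issue. `RowInput`, `BillableStats`, `rowBillable`, and `aggregateBillable` are hypothetical names, and returning `null` when no row is included is an assumption.

```typescript
// Hypothetical input: one billable total per repeat, or null when the provider
// did not report billable usage (derived or unavailable).
interface RowInput {
  passed: boolean;
  repeatTotals: number[] | null;
}

interface BillableStats {
  sum: number;
  avg: number;
}

// Per-row billable: null for failed rows and rows without provider-sourced usage.
function rowBillable(row: RowInput): BillableStats | null {
  if (!row.passed || row.repeatTotals === null || row.repeatTotals.length === 0) {
    return null;
  }
  const sum = row.repeatTotals.reduce((acc, n) => acc + n, 0);
  return { sum, avg: sum / row.repeatTotals.length };
}

// Top-level aggregate: sum of included row sums, mean of included row averages.
function aggregateBillable(rows: RowInput[]): BillableStats | null {
  const included = rows
    .map(rowBillable)
    .filter((b): b is BillableStats => b !== null);
  if (included.length === 0) {
    return null; // assumption: no comparable rows means no aggregate
  }
  const sum = included.reduce((acc, b) => acc + b.sum, 0);
  const avg = included.reduce((acc, b) => acc + b.avg, 0) / included.length;
  return { sum, avg };
}

const exampleAggregate = aggregateBillable([
  { passed: true, repeatTotals: [2000, 2200] }, // included
  { passed: false, repeatTotals: [100] },       // failed: excluded
  { passed: true, repeatTotals: null },         // no provider usage: excluded
]);
console.log(JSON.stringify(exampleAggregate)); // {"sum":4200,"avg":2100}
```

Averaging row averages rather than dividing the grand total by the number of repeats keeps a row with many repeats from dominating the top-level `avg`.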

Failure diagnostics:

- The reporter should be stdout-only.
- It should not write a separate `token-usage-report.json` unless existing artifacts later prove insufficient.
- It should include the artifact directory exactly once, at the top level, and not repeat it per row.
- Agents should use the normal Skillgym artifacts or rerun with the default reporter when a row fails.

### `token-optimization` Skill

Add a Skillgym skill named `token-optimization`, discoverable through the existing skills mechanism.

The skill should instruct an agent to:

1. Require an explicit optimization target from the invoker.
2. Ask one short clarification question if the optimization target is missing.
3. Identify an existing protection suite or create/strengthen the smallest suite that protects the target behavior.
4. Run a baseline with `--reporter token-usage` before making minimization edits.
5. Treat a passing baseline as a hard prerequisite for optimization edits.
6. Use compact reporter stdout for before/after token comparison.
7. If any row fails, inspect normal Skillgym artifacts/default reporter output and do not count lower token usage from that failed row as an improvement.
8. Make the smallest safe metadata changes to the explicit optimization target.
9. Rerun with `--reporter token-usage` after each change.
10. Stop according to the invoker-provided token budget, percentage reduction target, or max iteration count.
11. If no stopping rule is provided, default to one safe minimization pass plus one verification run.
12. Describe snapshots as optional post-optimization regression protection after behavior is stable.

The skill should not prescribe which files to optimize by default. It may give examples, but the actual target must come from the invoker.
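The stopping and failure-handling steps above could be expressed as a small decision function. This is only a sketch under stated assumptions: `StopRule`, `RunSummary`, `LoopDecision`, and `decideNextStep` are hypothetical names, and comparing runs on the top-level `billable.sum` (rather than `avg`) is an assumption this issue does not settle.

```typescript
// Hypothetical stopping rule supplied by the invoker; every field is optional.
interface StopRule {
  tokenBudget?: number;      // stop once billable sum is at or below this value
  reductionPercent?: number; // stop once reduction from baseline reaches this percentage
  maxIterations?: number;    // stop after this many edit-and-rerun cycles
}

// Hypothetical summary taken from one token-usage run.
interface RunSummary {
  passed: boolean;
  billableSum: number | null; // null when no comparable usage was reported
}

type LoopDecision = "continue" | "stop" | "inspect-failure";

function decideNextStep(
  baseline: RunSummary,
  current: RunSummary,
  iteration: number, // completed edit-and-rerun cycles, starting at 1
  rule: StopRule
): LoopDecision {
  // A passing baseline is a hard prerequisite for optimization edits.
  if (!baseline.passed || baseline.billableSum === null) {
    throw new Error("baseline must pass with billable usage before optimizing");
  }
  // A failed or non-comparable rerun never counts as an improvement.
  if (!current.passed || current.billableSum === null) {
    return "inspect-failure";
  }
  const hasRule =
    rule.tokenBudget !== undefined ||
    rule.reductionPercent !== undefined ||
    rule.maxIterations !== undefined;
  if (!hasRule) {
    // Default: one safe minimization pass plus one verification run.
    return iteration >= 1 ? "stop" : "continue";
  }
  if (rule.tokenBudget !== undefined && current.billableSum <= rule.tokenBudget) {
    return "stop";
  }
  if (rule.reductionPercent !== undefined) {
    const reduction =
      ((baseline.billableSum - current.billableSum) / baseline.billableSum) * 100;
    if (reduction >= rule.reductionPercent) {
      return "stop";
    }
  }
  if (rule.maxIterations !== undefined && iteration >= rule.maxIterations) {
    return "stop";
  }
  return "continue";
}
```

For example, with a baseline sum of 4200 and a passing rerun at 3000, a `reductionPercent` of 25 yields `"stop"`, because the reduction is roughly 28.6 percent.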

### Documentation And Dictionary

Update documentation so users can discover:

- the `token-usage` reporter name
- the stdout JSON contract
- provider-sourced billable-token-only aggregation
- failed-row and derived-usage exclusion behavior
- artifact debugging expectations
- the `token-optimization` skill workflow

Update `DICTIONARY.md` with approved terms:

- `token-usage`
- `token-optimization`

## Acceptance Criteria

- `skillgym run <suite.ts> --reporter token-usage` loads as a built-in reporter.
- Reporter stdout is strict JSON.
- Reporter stdout has one row per case x runner result.
- Passed rows with provider-sourced billable usage include `{ "sum": number, "avg": number }`.
- Failed rows have `billable: null` and are excluded from all aggregates.
- Derived or unavailable usage rows have `billable: null` and are excluded from all aggregates.
- Top-level `artifacts` points to the suite-run artifact directory and is not repeated per row.
- The reporter does not create a separate detailed token report file.
- Tests cover built-in reporter loading/listing and the aggregation/exclusion semantics above.
- `token-optimization` skill exists and documents the target requirement, baseline passing rule, optimization loop, stopping rule, failure handling, and optional snapshot usage.
- Reporter docs and dictionary are updated.

## Resolution Summary

Skillgym should provide a compact `token-usage` reporter and a `token-optimization` skill so agents can safely minimize explicitly targeted metadata while preserving behavior through passing evals and comparing only provider-sourced billable token usage from successful runs.
