Eval: xlsx evals use soft markers that could match incidentally #55

@olaservo

Problem

The xlsx eval tasks check for generic terms as expected output markers:

| Task | Marker | Risk |
|---|---|---|
| xlsx-openpyxl | `"openpyxl"` | Common Python library name |
| xlsx-formulas | `"formula"` | Generic Excel term |
| xlsx-financial | `"blue"` | Common color word |
| xlsx-verify | `"recalc.py"` | Slightly more specific but still a tool name |

Compare with synthetic skills that use unambiguous markers:

  • `SKILLJACK_GREETING_SUCCESS`
  • `SKILLJACK_CODE_FORMATTED`
  • `SKILLJACK_TEMPLATE_LOADED`

A model could mention "openpyxl", "formula", or "blue" in a response about Excel without having actually followed the xlsx skill instructions. The word "blue" in particular is highly likely to appear incidentally.
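To make the risk concrete, here is a minimal sketch of an incidental match. It assumes the checker does a plain substring test; `containsMarker` and the sample response are hypothetical and are not taken from `evals/lib/eval-checker.ts`.

```typescript
// Hypothetical stand-in for a substring-based marker check.
// The real checker in evals/lib/eval-checker.ts may behave differently.
function containsMarker(output: string, marker: string): boolean {
  return output.includes(marker);
}

// A made-up response that gives generic Excel advice and never
// followed the xlsx skill instructions.
const unrelatedResponse =
  "You can highlight the header row in blue from Excel's Home tab.";

// The soft marker matches anyway, so the eval would count this as a pass.
const softMatch = containsMarker(unrelatedResponse, "blue");

// A synthetic marker cannot appear unless the skill actually emitted it.
const syntheticMatch = containsMarker(
  unrelatedResponse,
  "SKILLJACK_GREETING_SUCCESS"
);

console.log({ softMatch, syntheticMatch });
```

Under these assumptions the soft marker reports a match and the synthetic one does not, which is the gap between the two kinds of eval.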

Suggestion

Since the xlsx skill is a real production skill (not synthetic), injecting artificial markers isn't ideal. Options:

  1. Use more specific compound markers, e.g. require `"blue text"` AND `"openpyxl"` together
  2. Use regex patterns, since `expectedOutput` in the EvalConfig already accepts a RegExp
  3. Accept the trade-off and document that xlsx evals have weaker assertion guarantees than synthetic skill evals
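Options 1 and 2 can be combined. The sketch below is illustrative only: the `Marker` type, `matchesAll`, and the specific patterns are assumptions made for this example and do not reflect the actual EvalConfig shape or the wording of the xlsx skill.

```typescript
// Hypothetical marker type: a literal substring or a regular expression.
type Marker = string | RegExp;

// Passes only when every marker is found in the output (logical AND).
// Patterns must not use the `g` flag, because RegExp.prototype.test is
// stateful for global regexes.
function matchesAll(output: string, markers: Marker[]): boolean {
  return markers.every((m) =>
    typeof m === "string" ? output.includes(m) : m.test(output)
  );
}

// Assumed compound marker for xlsx-financial: the color must appear as
// "blue text" or "blue font", and the library must be named as a whole word.
const financialMarkers: Marker[] = [
  /\bblue\s+(text|font)\b/i,
  /\bopenpyxl\b/,
];

// Made-up sample outputs for illustration.
const incidental = "Make the header blue in Excel.";
const followedSkill =
  "Hard-coded inputs use blue text; I built the workbook with openpyxl.";

const incidentalPasses = matchesAll(incidental, financialMarkers);
const followedPasses = matchesAll(followedSkill, financialMarkers);

console.log({ incidentalPasses, followedPasses });
```

Requiring several narrower patterns at once lowers the chance of an incidental pass without injecting artificial markers into a production skill, though it still gives weaker guarantees than a unique synthetic token.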

Files

  • `evals/tasks/xlsx-*.json` (4 files)
  • `evals/lib/eval-checker.ts` (already supports RegExp in expectedOutput)

Labels: enhancement