Problem
The xlsx eval tasks check for generic terms as expected output markers:
| Task | Marker | Risk |
| --- | --- | --- |
| xlsx-openpyxl | `"openpyxl"` | Common Python library name |
| xlsx-formulas | `"formula"` | Generic Excel term |
| xlsx-financial | `"blue"` | Common color word |
| xlsx-verify | `"recalc.py"` | Slightly more specific but still a tool name |
Compare with synthetic skills that use unambiguous markers:
- `SKILLJACK_GREETING_SUCCESS`
- `SKILLJACK_CODE_FORMATTED`
- `SKILLJACK_TEMPLATE_LOADED`
A model could mention "openpyxl", "formula", or "blue" in a response about Excel without having actually followed the xlsx skill instructions. The word "blue" in particular is highly likely to appear incidentally.
Suggestion
Since the xlsx skill is a real production skill (not synthetic), injecting artificial markers isn't ideal. Options:
- Use more specific compound markers — e.g., check for `"blue text"` AND `"openpyxl"` together
- Use regex patterns — the EvalConfig already supports RegExp
- Accept the trade-off and document that xlsx evals have weaker assertion guarantees than synthetic skill evals
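The first two options can be combined. The sketch below is a minimal illustration, not the actual code in `evals/lib/eval-checker.ts`: the `Marker` type, the `matchesMarker` and `matchesAll` helpers, the sample outputs, and the specific regex are all assumptions made for the example. It shows how requiring a word-bounded `blue text` pattern together with `"openpyxl"` could reject a response that only mentions both words incidentally.

```typescript
// Hypothetical sketch: these names are illustrative assumptions and are
// not taken from evals/lib/eval-checker.ts.
type Marker = string | RegExp;

// True when a single marker is present in the output.
// Regex markers should not use the `g` flag, so that test() stays stateless.
function matchesMarker(output: string, marker: Marker): boolean {
  return typeof marker === "string" ? output.includes(marker) : marker.test(output);
}

// Compound check: every marker must be present for the task to pass.
function matchesAll(output: string, markers: Marker[]): boolean {
  return markers.every((m) => matchesMarker(output, m));
}

// Assumed compound marker set for xlsx-financial: "blue text" as a phrase
// AND the library name, instead of the bare word "blue".
const financialMarkers: Marker[] = [/\bblue\s+text\b/i, "openpyxl"];

// Made-up sample responses for illustration.
const incidental = "The sky is blue, and openpyxl can read spreadsheets.";
const followedSkill = "Inputs use blue text; the workbook is written with openpyxl.";

console.log(matchesAll(incidental, financialMarkers)); // false
console.log(matchesAll(followedSkill, financialMarkers)); // true
```

A compound check like this still relies on wording the skill happens to produce, so it narrows the false-positive window rather than closing it.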
Files
- `evals/tasks/xlsx-*.json` (4 files)
- `evals/lib/eval-checker.ts` (already supports RegExp in expectedOutput)