Commit e5cf544

docs: align eval guidance with non-pinned Tessl default
1 parent 3fb3e43 commit e5cf544

2 files changed

Lines changed: 15 additions & 3 deletions

docs/agents/evals.md

Lines changed: 5 additions & 0 deletions
@@ -126,6 +126,11 @@ benchmark claims, or scoring rules.
 keys, replay logs, and one-off run details.
 - Keep hosted eval usage minimal while preserving confidence:
 - Use `scripts/run_eval_suite.sh` so variants match suite purpose and runs use the plugin context.
+- Use Tessl's default solver unless the account has model-selection entitlements and you intentionally
+want a representative frontier check. The hosted default is intentionally not pinned by policy.
+See Tessl model-selection notes if available:
+- https://docs.tessl.io/changelog
+- https://tessl.io/blog/why-were-changing-our-default-eval-model/
 - Main and reference scenarios run with both variants.
 - Regression scenarios run with context only by default. Run regression without-context only when
 intentionally checking whether a scenario should move back to reference.
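
The variant and solver policy described in this hunk can be sketched as a small decision function. This is an illustration only: `plan_run`, `RunPlan`, the suite names, and the variant labels are hypothetical and are not taken from `scripts/run_eval_suite.sh`, whose interface is not shown in this commit. The sketch also assumes that an intentional regression check runs both variants, which the guidance does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical labels; the real script may name its variants differently.
WITH_CONTEXT = "with-context"
WITHOUT_CONTEXT = "without-context"


@dataclass(frozen=True)
class RunPlan:
    variants: Tuple[str, ...]
    solver: Optional[str]  # None means "use Tessl's hosted default, unpinned"


def plan_run(suite: str, *, check_move_to_reference: bool = False,
             has_model_selection: bool = False,
             frontier_model: Optional[str] = None) -> RunPlan:
    """Pick variants and solver for one suite, following the documented policy."""
    if suite in ("main", "reference"):
        # Main and reference scenarios run with both variants.
        variants = (WITH_CONTEXT, WITHOUT_CONTEXT)
    elif suite == "regression":
        # Context only by default; add without-context only for an
        # intentional "should this move back to reference?" check (assumption:
        # that check runs both variants).
        if check_move_to_reference:
            variants = (WITH_CONTEXT, WITHOUT_CONTEXT)
        else:
            variants = (WITH_CONTEXT,)
    else:
        raise ValueError(f"unknown suite: {suite!r}")

    # Pin a model only when the account is entitled AND a model was requested.
    solver = frontier_model if (has_model_selection and frontier_model) else None
    return RunPlan(variants, solver)
```

Returning `None` for the solver keeps the default path free of any pinned model name, which mirrors the intent that the hosted default stays unpinned.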

docs/agents/workflow.md

Lines changed: 10 additions & 3 deletions
@@ -31,9 +31,16 @@ release-readiness.
 patch-bump dry-run for PR safety. For skill, eval, package, or release changes that should publish
 a new version, let Release Please bump the version before the exact-version publish check.
 
-- For skill behavior or eval changes, run hosted evals with Sonnet 4.6, but start with the smallest
-useful set to conserve Tessl daily rate-limit budget. Use `scripts/run_eval_suite.sh` so the run
-uses plugin context and the right variant policy.
+- For skill behavior or eval changes, run hosted evals with Tessl's default solver, but start with the
+smallest useful set to conserve Tessl daily rate-limit budget. Use `scripts/run_eval_suite.sh` so the
+run uses plugin context and the right variant policy.
+
+Model-selection note:
+Do not pin Sonnet in default commands; the script runs with the current Tessl default solver.
+If model-selection is available, Sonnet 4.6 or better is a good representative check. See Tessl
+model-selection and default-model discussions:
+- https://docs.tessl.io/changelog
+- https://tessl.io/blog/why-were-changing-our-default-eval-model/
 
 If any eval scenario's `task.md`, `criteria.json`, or `capability.txt` changed, run that exact
 scenario before finishing the PR. A pure move between `evals/`, `evals-reference/`, and
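
The "run that exact scenario" rule in the context lines of this hunk could be checked mechanically. A minimal sketch, assuming each scenario lives in its own directory and that changed paths are repo-relative POSIX paths (for example from `git diff --name-only`). The helper name `scenarios_to_rerun` is hypothetical, and the sketch does not try to detect pure moves.

```python
from pathlib import PurePosixPath

# File names come from the workflow note; the directory layout is an assumption.
SCENARIO_FILES = {"task.md", "criteria.json", "capability.txt"}


def scenarios_to_rerun(changed_paths):
    """Return the sorted scenario directories whose defining files changed."""
    hits = set()
    for raw in changed_paths:
        path = PurePosixPath(raw)
        if path.name in SCENARIO_FILES:
            # The parent directory is treated as the scenario to rerun.
            hits.add(str(path.parent))
    return sorted(hits)
```

Calling it with a list such as `["evals/a/task.md", "docs/agents/evals.md"]` yields only the scenario directory `evals/a`, since the docs file is not one of the three scenario-defining files.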
