chore(eval): add rstest-best-practices eval and report #56

Merged
SoonIter merged 1 commit into main from chore/rstest-best-practices-eval on Apr 29, 2026

@fi3ework
Member

Mirrors #54.

What to look at

  • Production change: skills/rstest-best-practices/SKILL.md +2 lines under Test-writing — prefer await expect(...).rejects.toThrow() / .resolves.toEqual() over try/catch + expect.fail. Surfaced by eval 2 (fetch-with-retry), the only with_skill failure.
  • Report describes the pre-commit (131-line) baseline run, not the post-fix skill. "[Done in this commit]" notes in report.md mark the recommendation bundled here. Numbers (74/75 with_skill vs 65/75 baseline, +12pp) are not re-run.
  • evals.json schema follows #54 (chore(eval): add migrate-to-rstest eval and report): fixture_root / runs_root / runner_instructions / notes / evals[]. The /tmp paths are defaults; the runner agent picks any OS scratch dir per runner_instructions.
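
As a rough illustration of the evals.json shape listed above, the top-level keys could be typed as follows. Only the five key names come from this PR; every field type, the `EvalCase` fields, and all sample values are assumptions made for the sketch and may not match the real file.

```typescript
// Sketch only: key names are from the PR description; all types and
// sample values are assumptions, not the repository's actual schema.
interface EvalCase {
  id: number; // hypothetical: the PR refers to "eval 2", "eval 5"
  name: string; // hypothetical: e.g. "fetch-with-retry"
}

interface EvalsFile {
  fixture_root: string; // default lives under /tmp per the PR description
  runs_root: string; // default lives under /tmp per the PR description
  runner_instructions: string; // assumed to be free-form text
  notes: string; // assumed to be free-form text
  evals: EvalCase[];
}

// Hypothetical sample; the paths and wording are placeholders.
const sample: EvalsFile = {
  fixture_root: "/tmp/example-fixtures",
  runs_root: "/tmp/example-runs",
  runner_instructions: "Use any OS scratch dir in place of the /tmp defaults.",
  notes: "Illustrative data only.",
  evals: [
    { id: 2, name: "fetch-with-retry" },
    { id: 5, name: "react-dropdown-browser-mode" },
  ],
};

console.log(sample.evals.map((e) => `${e.id}:${e.name}`).join(", "));
```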

Test plan

  • CI green

@fi3ework force-pushed the chore/rstest-best-practices-eval branch 2 times, most recently from 73d6aa2 to 5be7ab8, on April 29, 2026 07:44

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5be7ab879a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread skills-test/rstest-best-practices/report.md Outdated
@fi3ework force-pushed the chore/rstest-best-practices-eval branch from 5be7ab8 to 203262f on April 29, 2026 09:01

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 203262f168

Comment thread skills-test/rstest-best-practices/evals/evals.json Outdated
Ran the 10-eval suite from evals/evals.json against the current SKILL.md
(with_skill) and a no-skill baseline (without_skill), 1 sample per cell
on Sonnet 4.6. Aggregate: with_skill 74/75 (98.7%), without_skill
65/75 (86.7%) — +12 pts. Skill also cuts mean tokens ~17% and mean wall
time ~26%.
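
The aggregate figures can be rechecked with a few lines of arithmetic. The sketch below uses only the pass counts quoted above; the helper and variable names are illustrative. It also separates the percentage-point gap from the relative gain, which are easy to conflate.

```typescript
// Pass rate as a percentage of assertions passed.
function passRate(passed: number, total: number): number {
  return (passed / total) * 100;
}

const withSkill = passRate(74, 75); // with_skill
const withoutSkill = passRate(65, 75); // without_skill baseline

// Absolute gap in percentage points, and gain relative to the baseline.
const gapPoints = withSkill - withoutSkill;
const relativeGain = (gapPoints / withoutSkill) * 100;

console.log(`with_skill:    ${withSkill.toFixed(1)}%`); // 98.7%
console.log(`without_skill: ${withoutSkill.toFixed(1)}%`); // 86.7%
console.log(`gap:           +${gapPoints.toFixed(1)} pts`); // +12.0 pts
console.log(`relative gain: +${relativeGain.toFixed(1)}%`); // +13.8%
```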

Largest gap in browser-mode (eval 5 react-dropdown-browser-mode: 8/8 vs
3/8) — without the skill the model writes JSDOM-style tests
(querySelector + dispatchEvent KeyboardEvent + document.activeElement)
in a real-Chromium fixture, missing every benefit of @rstest/browser
+ Locator API + expect.element web-first retry.

Also tightens the Test-writing section to prefer
await expect(fn()).rejects.toThrow() / .resolves.toEqual() over
try/catch + expect.fail or .catch(e => e) patterns — surfaced by the
only with_skill failure (eval 2 fetch-with-retry).
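
A minimal sketch of the preferred assertion style, assuming a test file run by Rstest with `expect` and `test` imported from `@rstest/core`. The `fetchWithRetry` helper here is hypothetical and stands in for the eval 2 fixture, which is not shown in this PR.

```typescript
import { expect, test } from '@rstest/core';

// Hypothetical function under test (not the PR's fixture): calls `fetcher`
// up to `retries + 1` times and rethrows the last error if all attempts fail.
async function fetchWithRetry<T>(
  fetcher: () => Promise<T>,
  retries: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetcher();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Preferred: assert on the rejection directly. The assertion fails on its
// own if the promise resolves, so no try/catch or expect.fail is needed.
test('rejects once retries are exhausted', async () => {
  const failing = () => Promise.reject(new Error('boom'));
  await expect(fetchWithRetry(failing, 2)).rejects.toThrow('boom');
});

// Preferred: assert on the resolved value without awaiting it by hand.
test('resolves with the fetched value', async () => {
  const ok = () => Promise.resolve({ status: 'ok' });
  await expect(fetchWithRetry(ok, 2)).resolves.toEqual({ status: 'ok' });
});
```

Note that the `await` in front of `expect(...)` matters: without it the test can finish before the assertion settles.
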
@fi3ework force-pushed the chore/rstest-best-practices-eval branch from 203262f to 77c7296 on April 29, 2026 09:23
@fi3ework
Member Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Can't wait for the next one!

@SoonIter enabled auto-merge (squash) on April 29, 2026 10:02
@SoonIter merged commit 5beac9c into main on Apr 29, 2026 (4 checks passed)
@SoonIter deleted the chore/rstest-best-practices-eval branch on April 29, 2026 10:02