chore: add Evalite for optional LLM and task evals #1538

Draft
ieduardogf wants to merge 2 commits into master from chore/add-evalite-eval-runner

@ieduardogf
Collaborator

Summary

Adds Evalite (evalite@beta) and Vitest as dev dependencies so we can run scored evaluations (tasks, prompts, or LLM outputs) with a local UI and trace storage.

This PR introduces the plumbing only (dependencies, scripts, sample eval, README). Follow-up PRs are expected to add new eval suites by dropping additional *.eval.ts files under evals/ (or extending existing ones), without further tooling changes.

How to use

  1. Watch mode (development): yarn eval:dev
    Runs Evalite in watch mode, re-runs when eval files change, and serves the results UI (default http://localhost:3006 per Evalite docs).

  2. Single run (e.g. CI or quick check): yarn eval
    Runs all evals once and exits with a non-zero status if scores are below the threshold.

  3. Authoring — Add or edit files matching evals/**/*.eval.ts. See evals/my-eval.eval.ts for a minimal example using evalite() and deterministic scorers.

  4. Scorers — For string-only scorers (e.g. exactMatch), import from evalite/scorers/deterministic. The main evalite/scorers entry pulls in LLM-judge scorers that need the optional ai peer dependency.
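To make the threshold behaviour described in item 2 more concrete, here is a minimal sketch of how exact-match scores could be averaged and mapped to an exit code. It deliberately does not import Evalite, so it runs standalone. `EvalResult`, `exactMatchScore`, `meanScore`, `exitCodeFor`, and the 0.8 threshold are hypothetical names and values chosen for illustration; they are not Evalite's API or defaults.

```typescript
// Hypothetical, self-contained sketch. None of these names come from Evalite.
type EvalResult = {output: string; expected: string};

// Deterministic scorer: 1 when the output equals the expected string, else 0.
const exactMatchScore = ({output, expected}: EvalResult): number =>
    output === expected ? 1 : 0;

// Mean score in [0, 1] across all results (0 for an empty list).
const meanScore = (results: Array<EvalResult>): number =>
    results.length === 0
        ? 0
        : results.reduce((sum, result) => sum + exactMatchScore(result), 0) / results.length;

// Maps a mean score to a process exit code: 0 if the threshold is met, else 1.
const exitCodeFor = (score: number, threshold: number): number =>
    score >= threshold ? 0 : 1;

// Usage: one of two outputs matches, so the mean is 0.5 and an assumed
// threshold of 0.8 is not met.
const sampleResults: Array<EvalResult> = [
    {output: 'Hello World', expected: 'Hello World'},
    {output: 'Hello', expected: 'Hello World'},
];
const sampleScore = meanScore(sampleResults); // 0.5
const sampleExitCode = exitCodeFor(sampleScore, 0.8); // 1
```

A string-only scorer like this needs no model call, which is why it fits the dependency-free deterministic entry point mentioned in item 4.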

Docs

  • README Development section lists the new scripts and links the Evalite quickstart.

Ref: N/A

Add evalite@beta, vitest, yarn scripts, sample eval under evals/, and README notes.

Made-with: Cursor
@ieduardogf ieduardogf added the AI AI Generated label Apr 21, 2026
@ieduardogf ieduardogf requested review from Marcosld, atabel, aweell and yceballost and removed request for aweell April 21, 2026 16:26
@github-actions

Size stats

| | master | this branch | diff |
| --- | --- | --- | --- |
| Total JS | 16.1 MB | 16.1 MB | 0 B |
| JS without icons | 2.01 MB | 2.01 MB | 0 B |
| Lib overhead | 92.5 kB | 92.5 kB | 0 B |
| Lib overhead (gzip) | 19.9 kB | 19.9 kB | 0 B |

@github-actions

github-actions Bot commented Apr 21, 2026

Deploy preview for mistica-web ready!

Project: mistica-web
Status: ✅ Deploy successful!
Preview URL: https://mistica-ewew51csv-flows-projects-65bb050e.vercel.app
Latest Commit: cb9e417
Inspect: View deployment

Deployed with vercel-action

@github-actions

github-actions Bot commented Apr 21, 2026

Accessibility report
✔️ No issues found

ℹ️ You can run this locally by executing yarn audit-accessibility.

The evals/ directory imports from evalite subpath exports (e.g.
evalite/scorers/deterministic) which require a modern moduleResolution.
Excluding evals/ from tsconfig.production.json prevents gen-ts-defs from
trying to compile dev-only eval files during the library build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Apr 21, 2026

Screenshot tests report

✔️ All passing

Comment thread evals/my-eval.eval.ts
import {evalite} from 'evalite';
import {exactMatch} from 'evalite/scorers/deterministic';

evalite('My Eval', {
Contributor

I thought we were going to test this library against its competitors before pushing an installation and configuration to the repo. There are other options that could potentially be better, so perhaps we should test them before committing?

Contributor

Yes, I think we should test different tools.

Copilot AI left a comment

Pull request overview

Adds Evalite-based evaluation tooling (plus Vitest dependency) to the repo so developers can author and run local/CI-friendly scored eval suites under evals/, with minimal initial scaffolding.

Changes:

  • Add evalite (beta) and vitest dev dependencies plus yarn eval / yarn eval:dev scripts.
  • Introduce a minimal example eval file under evals/ and document the workflow in the README.
  • Exclude evals/ from the production TS declaration build config and vendor new Yarn cache artifacts.

Reviewed changes

Copilot reviewed 4 out of 123 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| tsconfig.production.json | Excludes evals/ from the production declaration emit config. |
| package.json | Adds Evalite/Vitest dev deps and new eval scripts. |
| evals/my-eval.eval.ts | Adds a minimal example Evalite suite using deterministic scoring. |
| README.md | Documents how to run/author evals and notes scorer import guidance. |
| .yarn/cache/why-is-node-running-npm-2.3.0-011cf61a18-58ebbf406e.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/token-types-npm-6.1.2-1f6e70d865-ddade9c99f.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/tinyrainbow-npm-3.1.0-35ba47f8ae-dbb16b4aa5.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/stream-shift-npm-1.0.3-c1c29210c7-a24c0a3f66.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/stackback-npm-0.0.2-73273dc92e-2d4dc4e64e.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/split2-npm-4.2.0-16aa3883ba-05d5410254.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/siginfo-npm-2.0.0-9bbac931f8-8aa5a98640.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/set-cookie-parser-npm-2.7.2-e1a4d1221b-9e1b09e718.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/real-require-npm-0.2.0-7f69dbc7b6-fa060f19f2.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/quick-format-unescaped-npm-4.0.4-7e22c9b7dc-7bc32b9935.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/on-exit-leak-free-npm-2.1.2-0d0c5ad67d-6ce7acdc7b.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/lodash.truncate-npm-4.4.2-bc50fe1663-b463d8a382.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/js-levenshtein-npm-1.1.6-ab883e61a3-409f052a7f.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/is-stream-npm-4.0.1-328fd196cc-cbea3f1fc2.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/get-port-npm-7.2.0-7f76d3f2ea-f8785ccdcc.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/fast-querystring-npm-1.1.2-81dfb4019b-7149f82ee9.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/fast-decode-uri-component-npm-1.0.1-578ba9fecf-427a48fe09.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/duplexify-npm-4.1.3-f0053971e9-9636a02734.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/atomic-sleep-npm-1.0.0-17d8a762a3-b95275afb2.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/abstract-logging-npm-2.0.1-b805b8edfa-6967d15e5a.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/@standard-schema-spec-npm-1.1.0-d3e5ccd2e2-6245ebef5e.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/@lukeed-ms-npm-2.0.2-5e69b6e173-6ae47ed3eb.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/@fastify-forwarded-npm-3.0.1-03d48a4e5e-2cde644dc3.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/@fastify-accept-negotiator-npm-2.0.1-d797505dde-7a2db0bb9f.zip | Yarn offline cache update (transitive dependency). |
| .yarn/cache/@borewit-text-codec-npm-0.2.2-11871252cc-7b4852c38b.zip | Yarn offline cache update (transitive dependency). |

Comment thread README.md
Comment on lines +163 to +165
- `yarn eval:dev` / `yarn eval`: [Evalite](https://v1.evalite.dev/guides/quickstart/) for task/LLM evals in
`evals/*.eval.ts` (use `evalite/scorers/deterministic` for scorers like `exactMatch` unless you add the `ai`
package for LLM-judge scorers from `evalite/scorers`)

Copilot AI Apr 22, 2026

README says eval files live in evals/*.eval.ts, but the PR description and intended workflow mention evals/**/*.eval.ts (and follow-up suites likely want nested folders). Suggest updating the README glob/path wording to match the intended recursive pattern (or adjust the tooling expectation if it’s actually non-recursive).
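The difference between the two patterns can be illustrated with a small sketch. It assumes conventional glob semantics, where `*` stays within one path segment and `**/` spans zero or more directories. `globToRegExp`, `matches`, and the nested path `evals/buttons/primary.eval.ts` are hypothetical and used only for illustration; how Evalite itself discovers files is not modelled here.

```typescript
// Converts a small subset of glob syntax into an anchored regular expression.
// Supported: "**/" (zero or more directories) and "*" (any characters except "/").
const globToRegExp = (glob: string): RegExp => {
    let source = '';
    let i = 0;
    while (i < glob.length) {
        if (glob.startsWith('**/', i)) {
            source += '(?:[^/]+/)*';
            i += 3;
        } else if (glob.charAt(i) === '*') {
            source += '[^/]*';
            i += 1;
        } else {
            // Escape characters that have a special meaning in regular expressions.
            source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, '\\$&');
            i += 1;
        }
    }
    return new RegExp(`^${source}$`);
};

const matches = (glob: string, path: string): boolean => globToRegExp(glob).test(path);

// Top-level eval file: matched by both patterns.
matches('evals/*.eval.ts', 'evals/my-eval.eval.ts'); // true
matches('evals/**/*.eval.ts', 'evals/my-eval.eval.ts'); // true

// Nested eval file (hypothetical path): matched only by the recursive pattern.
matches('evals/*.eval.ts', 'evals/buttons/primary.eval.ts'); // false
matches('evals/**/*.eval.ts', 'evals/buttons/primary.eval.ts'); // true
```

Under these assumptions, follow-up suites placed in subfolders would only be covered by the recursive wording.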

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 123 changed files in this pull request and generated no new comments.


@ieduardogf ieduardogf marked this pull request as draft April 23, 2026 08:17
Comment thread package.json
"vite-plugin-entry-shaking": "^0.5.1",
"vite-plugin-no-bundle": "^4.0.0"
"vite-plugin-no-bundle": "^4.0.0",
"vitest": "^4.1.5"
Contributor

We may not need Vitest. Can't we run Evalite with Jest?

Comment thread evals/my-eval.eval.ts
Contributor

Maybe we should follow a file convention more consistent with the one we use for tests/stories (-eval.ts instead of .eval.ts).

Comment thread package.json
Comment on lines +46 to +47
"eval:dev": "evalite watch",
"eval": "evalite"
Contributor

I think we only need one eval script. If you want to run it in watch mode, you can just run yarn eval watch.

Comment thread package.json
"eslint": "^8.57.0",
"eslint-plugin-mistica-local-rules": "workspace:*",
"eslint-plugin-storybook": "^10.2.8",
"evalite": "beta",
Contributor

Isn't there a stable release yet?

Contributor

I got confused by this too. They have a stable version, but it is rather stale, and they are actively working on their 1.0 beta. Anyway, the beta is quite old too (2 months).

Comment thread tsconfig.production.json
"examples"
"examples",
"evals"
]
Contributor

You should probably exclude the evals folder from the published mistica code too.

4 participants