fix(sdk): align script grader results #1659

Merged: christso merged 1 commit into main from grading-sdk-script-api on Jul 5, 2026

christso (Collaborator) commented Jul 5, 2026

Summary

SDK and script-grader authors can now return the finalized pass, score, reason, and optional checks[] vocabulary directly. The SDK builders, Zod schemas, Vitest/workspace adapters, Python helper example, generated assertion scaffolds, and script-grader docs all present that vocabulary as the public surface.
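As a rough illustration of the vocabulary described above, a grader result might look like the following TypeScript sketch. Only the field names pass, score, reason, and checks come from this PR; the type names, the per-check fields, the 0 to 1 score range, and the idea of printing the result as JSON are assumptions for illustration, not the actual @agentv/sdk schema.

```typescript
// Hypothetical shape: only pass / score / reason / checks are named in this PR.
interface GraderCheck {
  label: string;     // assumed field: short identifier for the check
  pass: boolean;     // assumed field: whether this individual check passed
  reason?: string;   // assumed field: optional explanation
}

interface GraderResult {
  pass: boolean;
  score: number;     // assumed to lie in [0, 1]
  reason: string;
  checks?: GraderCheck[];
}

const exampleResult: GraderResult = {
  pass: true,
  score: 1.0,
  reason: "All checks passed",
  checks: [
    { label: "returns-json", pass: true },
    { label: "mentions-refund-policy", pass: true, reason: "Found in paragraph 2" },
  ],
};

// Assumption: the grader hands its result back as a JSON document.
const wireJson: string = JSON.stringify(exampleResult);
console.log(wireJson);
```

Keeping checks optional in the sketch mirrors the PR text, which describes checks[] as optional alongside the three required-looking fields.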

Core script-grader parsing now accepts the finalized JSON protocol, derives aggregate score/pass from checks when needed, and carries reason/checks through the internal evaluator result. It still bridges checks into the current internal assertion_results shape so the artifact writer can be replaced separately by av-kfik.28.6.
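The derivation and bridging steps described above could be sketched as follows. The PR does not state the aggregation rule or the internal field names, so everything here is an assumption: the score is taken as the fraction of passing checks, pass as "every check passed", and the fields of the legacy assertion entry (text, passed, evidence) are hypothetical, as are the function and type names.

```typescript
// Hypothetical types; the real core parser and its schemas may differ.
interface Check { label: string; pass: boolean; reason?: string }
interface RawResult { pass?: boolean; score?: number; reason?: string; checks?: Check[] }
interface FinalResult { pass: boolean; score: number; reason: string; checks: Check[] }

function finalizeResult(raw: RawResult): FinalResult {
  const checks = raw.checks ?? [];
  const passedCount = checks.filter((c) => c.pass).length;

  // Assumed rule: score is the fraction of passing checks (0 when there are none).
  const derivedScore = checks.length > 0 ? passedCount / checks.length : 0;
  const score = raw.score ?? derivedScore;

  // Assumed rule: pass means every check passed; without checks, fall back to score >= 1.
  const derivedPass = checks.length > 0 ? passedCount === checks.length : score >= 1;
  const pass = raw.pass ?? derivedPass;

  const reason = raw.reason ?? `${passedCount}/${checks.length} checks passed`;
  return { pass, score, reason, checks };
}

// Hypothetical legacy entry shape used only to illustrate the bridge.
interface AssertionResult { text: string; passed: boolean; evidence: string }

function toAssertionResults(checks: Check[]): AssertionResult[] {
  return checks.map((c) => ({ text: c.label, passed: c.pass, evidence: c.reason ?? "" }));
}

const finalized = finalizeResult({
  checks: [
    { label: "a", pass: true },
    { label: "b", pass: false, reason: "missing field" },
  ],
});
console.log(finalized.score, finalized.pass); // prints: 0.5 false
const bridged = toAssertionResults(finalized.checks);
console.log(bridged.length); // prints: 2
```

Explicit values win in this sketch: a grader that supplies its own score or pass keeps them, and derivation only fills in what is missing, which is one reading of "when needed".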

Publish-surface considerations: this intentionally updates the experimental @agentv/sdk result surface instead of documenting stale assertions[]/passed aliases as supported public API. The deprecated CodeGraderResult type name remains as an alias to the new result schema, but the wire/public shape is the finalized vocabulary.
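A deprecated alias of the kind described above might be declared like this. CodeGraderResult is the name this PR mentions; ScriptGraderResult and summarizeResult are hypothetical names chosen for the sketch, and the real SDK presumably derives its type from a Zod schema rather than a hand-written interface.

```typescript
// Hypothetical name for the new result type; only CodeGraderResult appears in this PR.
interface ScriptGraderResult {
  pass: boolean;
  score: number;
  reason: string;
}

/** @deprecated Kept as an alias; it resolves to the finalized result shape. */
type CodeGraderResult = ScriptGraderResult;

// Code still annotated with the old name must now supply the finalized fields.
const legacyTyped: CodeGraderResult = {
  pass: false,
  score: 0.25,
  reason: "1 of 4 checks passed",
};

function summarizeResult(r: ScriptGraderResult): string {
  return `${r.pass ? "PASS" : "FAIL"} (${r.score}): ${r.reason}`;
}

const summaryLine = summarizeResult(legacyTyped);
console.log(summaryLine); // prints: FAIL (0.25): 1 of 4 checks passed
```

Because the alias points at the new shape, existing imports keep compiling by name, while any code that still builds assertions[] or passed fields would surface as a type error.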

Related: av-kfik.28.4

Validation

  • bun run build
  • bun test packages/sdk/test/define-script-grader.test.ts packages/sdk/test/workspace-grader.test.ts packages/sdk/test/vitest-workspace-grader.test.ts
  • bun test packages/core/test/evaluation/graders/script-grader-plain-text.test.ts packages/core/test/evaluation/script-grader-file-backed.test.ts packages/core/test/evaluation/script-grader-multimodal.test.ts
  • bun test packages/core/test/evaluation/graders.test.ts packages/core/test/evaluation/execution-metrics.test.ts
  • uv run pytest in examples/features/sdk-python
  • bun run lint
  • git diff --check
  • Smoke: bun --env-file=.env apps/cli/src/cli.ts eval run examples/features/script-grader-sdk/evals/suite.yaml --target local_cli exercised the SDK script grader path; script grader per-grader score was 1.0 with all returned checks passing. The overall eval scored 50% because the separate LLM rubric target did not resolve from the example-local target config.
  • Attempted live rerun with a temporary combined targets file and --grader-target openai; the LLM grader reached provider execution but failed after retries with pi-ai call failed: Connection error. Script-grader API dogfood is covered; live LLM rubric dogfood is blocked by provider connectivity in this worktree.

Compound Engineering
Codex

cloudflare-workers-and-pages Bot commented Jul 5, 2026


Deploying agentv with Cloudflare Pages

Latest commit: e8ddf96
Status: ✅  Deploy successful!
Preview URL: https://2886af14.agentv.pages.dev
Branch Preview URL: https://grading-sdk-script-api.agentv.pages.dev

christso force-pushed the grading-sdk-script-api branch from 1b2cf44 to e8ddf96 on July 5, 2026 04:08
christso marked this pull request as ready for review on July 5, 2026 04:10
christso merged commit 5001fa6 into main on Jul 5, 2026
8 checks passed
christso deleted the grading-sdk-script-api branch on July 5, 2026 04:11