fix(eval): reject authored direct input by christso · Pull Request #1646 · EntityProcess/agentv

christso · 2026-07-04T09:26:39Z

Summary

Implements the hard deprecation tracked in Bead av-kfik.27: authored direct input is no longer accepted in normal Promptfoo-aligned eval YAML.

Rejects authored top-level input and inline tests[].input at the schema, validator, parser, and runtime load boundaries, with actionable guidance to use top-level prompts plus default_test.vars / tests[].vars instead.
Keeps raw-case/internal compatibility narrow: external raw-case imports through tests: file://... may still carry internal input; normal authored eval YAML cannot.
Updates focused tests, public eval docs, Promptfoo parity docs, eval-writer skill data, generated schema, and migration guidance to make prompts + vars the canonical authoring model.
Converts nearby legacy parser/input_files tests away from public authored direct input; input_files + input is now rejected, while canonical prompt content file blocks with vars still render successfully.
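A minimal TypeScript sketch of the kind of authored-input check described in this summary, assuming the eval YAML has already been parsed into a plain object. The AuthoredEval and AuthoredTest shapes, GUIDANCE, and findAuthoredInputErrors are hypothetical names for illustration only; the actual checks in @agentv/core may be structured differently.

```typescript
// Hypothetical shapes for a parsed eval YAML document; only the fields this check reads.
interface AuthoredTest {
  vars?: Record<string, unknown>;
  input?: unknown;
  input_files?: unknown;
}

interface AuthoredEval {
  input?: unknown;
  prompts?: unknown[];
  default_test?: { vars?: Record<string, unknown> };
  // A string here (e.g. "file://cases.yaml") is an external raw-case import.
  tests?: AuthoredTest[] | string;
}

const GUIDANCE = "Use top-level prompts plus default_test.vars / tests[].vars instead.";

// Returns one actionable message per authored direct input; an empty array means the check passed.
function findAuthoredInputErrors(doc: AuthoredEval): string[] {
  const errors: string[] = [];
  if (doc.input !== undefined) {
    errors.push(`Top-level input is not supported. ${GUIDANCE}`);
  }
  // Only inline test arrays are checked; file:// imports are out of scope for this sketch.
  if (Array.isArray(doc.tests)) {
    doc.tests.forEach((test, i) => {
      if (test.input === undefined) return;
      const combo = test.input_files !== undefined ? " (input_files + input is also rejected)" : "";
      errors.push(`tests[${i}].input is not supported${combo}. ${GUIDANCE}`);
    });
  }
  return errors;
}
```

For example, a document with prompts: ["Hello {{name}}"] and tests: [{ vars: { name: "Ada" } }] yields no errors, while the same document with tests: [{ input: "Hello Ada" }] yields one message pointing at tests[0].input.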

Promptfoo Source Evidence

Promptfoo local clone: /home/entity/projects/promptfoo/promptfoo
Evidence commit: 6bfc5a0c7f16f9c4717ac731d276b578e63d0769 (6bfc5a0 chore(deps): update modelaudit schema generator to v0.2.47 (#9635))

Claims verified from source:

src/types/index.ts:851-858: Promptfoo test cases represent unique prompt inputs after substituting vars; vars is the test-row data surface.
src/types/index.ts:1033-1049: suites own providers, prompts, tests, and defaultTest; prompt entries are top-level suite data.
src/evaluator.ts:2165-2173: Promptfoo merges defaultTest.vars, scenario/data vars, and test vars.
src/evaluator.ts:2318-2322: Promptfoo propagates default test prompts/providers onto test cases.
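To make the prompts + vars model concrete, here is a small TypeScript sketch of how top-level prompts and layered vars expand into concrete prompt inputs. It rests on two assumptions: later layers override earlier ones (defaultTest vars, then scenario/data vars, then test vars), and templates use simple {{name}} placeholders. Promptfoo itself renders prompts with Nunjucks, so this illustrates the data flow only; mergeVars, renderPrompt, and expandCases are hypothetical helpers.

```typescript
type Vars = Record<string, string>;

// Assumed precedence: later layers override earlier ones (default < scenario/data < test).
function mergeVars(defaultVars: Vars = {}, scenarioVars: Vars = {}, testVars: Vars = {}): Vars {
  return { ...defaultVars, ...scenarioVars, ...testVars };
}

// Minimal {{name}} substitution; unknown placeholders are left untouched.
function renderPrompt(template: string, vars: Vars): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : match,
  );
}

// Each (test, prompt) pair yields one concrete prompt input after var substitution.
function expandCases(prompts: string[], defaultVars: Vars, tests: { vars?: Vars }[]): string[] {
  const rendered: string[] = [];
  for (const test of tests) {
    const vars = mergeVars(defaultVars, {}, test.vars);
    for (const prompt of prompts) rendered.push(renderPrompt(prompt, vars));
  }
  return rendered;
}
```

With one prompt "Say hello to {{ name }} in {{lang}}.", default vars { lang: "English" }, and two tests, the second test can override lang while the first inherits it from the defaults.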

Verification

Passed locally:

bun test packages/core/test/evaluation/prompt-input-authoring.test.ts packages/core/test/evaluation/input-files-shorthand.test.ts packages/core/test/evaluation/yaml-parser-metadata.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts
bun run lint
bun run typecheck
bun --filter @agentv/core build
git diff --check

bun run validate:examples currently fails because many existing repo examples still author tests[].input: of 108 eval YAML files checked, 6 are valid and 102 are invalid. This is recorded on av-kfik.16 for the examples-authoring worker, and on av-kfik.15 with the exact codemod conversion requirements. This PR intentionally does not duplicate that broad examples/codemod sweep.

Live Dogfood

Passed with a local OpenAI-compatible endpoint:

AGENTV_DOGFOOD_OPENAI_BASE_URL=http://127.0.0.1:10531/v1 \
AGENTV_DOGFOOD_OPENAI_API_KEY=local-proxy \
AGENTV_DOGFOOD_OPENAI_MODEL=gpt-5.4-mini \
AGENTV_NO_UPDATE_CHECK=1 \
bun apps/cli/src/cli.ts eval run prompt-vars.eval.yaml \
  --targets .agentv/targets.yaml \
  --output .agentv/results/prompt-vars-hard-deprecation \
  --grader-target local-grader \
  --threshold 0 \
  --no-cache

Result: PASS (1/1, mean 100%), written to the canonical .agentv/results/prompt-vars-hard-deprecation output directory.

Private evidence: EntityProcess/agentv-private branch evidence/av-kfik-27-input-hard-deprecation, commit 475b3d6.

Coordinator correction: the worker briefly pushed an agentv-private branch to the public EntityProcess/agentv remote while publishing evidence. That public branch was deleted with git push origin :agentv-private; do not use commit 19612e7c as evidence.

Compatibility Decision

Normal public authored eval YAML now rejects direct input everywhere this PR touches. External raw-case files loaded through tests: file://... remain the only tested internal compatibility path for input; they are deliberately kept out of canonical public docs and covered by prompt-input-authoring.test.ts.
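The compatibility carve-out can be sketched as an origin-aware rule. This is a hypothetical TypeScript illustration, not the agentv loader: it assumes the only raw-case signal is a tests value given as a file:// string, and originOfTests and assertInputAllowed are illustrative names.

```typescript
type CaseOrigin = "authored" | "raw-file";

// Assumption: a tests value that is a file:// string marks an external raw-case import.
function originOfTests(tests: unknown): CaseOrigin {
  return typeof tests === "string" && tests.startsWith("file://") ? "raw-file" : "authored";
}

// Raw-file cases may keep the internal input field; authored cases must not.
function assertInputAllowed(origin: CaseOrigin, testCase: { input?: unknown }): void {
  if (origin === "authored" && testCase.input !== undefined) {
    throw new Error("Authored input is not supported; use top-level prompts plus vars.");
  }
}
```

Keeping the decision keyed on where a case came from, rather than on its shape, is what lets the same input field be tolerated in raw-case imports while staying rejected in authored YAML.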

Beads

av-kfik.27: implementation PR, left open for coordinator review/merge.
av-kfik.15: updated with exact remaining codemod conversion requirements for top-level input, inline tests input, message arrays, and input_files + input.
av-kfik.16: updated with the exact validate:examples blocker and examples migration handoff.
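The exact av-kfik.15 codemod requirements are not reproduced in this PR. As a rough, hypothetical TypeScript sketch of the simplest case only: a plain-string input moves into vars.input (matching the vars.input usage mentioned in this PR's README scan), a top-level string input moves into default_test.vars.input, and a "{{input}}" prompt is added when none exists. Message arrays and input_files + input are only reported for manual migration. The mapping, the prompt template, and migrateStringInputs are all assumptions.

```typescript
type Vars = Record<string, unknown>;

interface LegacyTest {
  input?: unknown;
  vars?: Vars;
  [key: string]: unknown;
}

interface LegacyEval {
  input?: unknown;
  prompts?: string[];
  default_test?: { vars?: Vars; [key: string]: unknown };
  tests?: LegacyTest[];
  [key: string]: unknown;
}

// Hypothetical codemod: plain-string input becomes vars.input; other shapes are reported, not rewritten.
function migrateStringInputs(doc: LegacyEval): { doc: LegacyEval; manual: string[] } {
  const manual: string[] = [];
  let migrated = false;
  const { input: topInput, ...rest } = doc;
  const out: LegacyEval = { ...rest };

  if (typeof topInput === "string") {
    out.default_test = { ...doc.default_test, vars: { ...doc.default_test?.vars, input: topInput } };
    migrated = true;
  } else if (topInput !== undefined) {
    out.input = topInput; // left in place; needs a manual rewrite
    manual.push("input");
  }

  if (doc.tests) {
    out.tests = doc.tests.map((test, i): LegacyTest => {
      if (test.input === undefined) return test;
      if (typeof test.input !== "string" || test["input_files"] !== undefined) {
        manual.push(`tests[${i}].input`); // message array or input_files + input
        return test;
      }
      const { input, ...testRest } = test;
      migrated = true;
      return { ...testRest, vars: { ...test.vars, input } };
    });
  }

  // Reference the migrated variable from a top-level prompt if none exists yet.
  if (migrated && (!out.prompts || out.prompts.length === 0)) {
    out.prompts = ["{{input}}"];
  }
  return { doc: out, manual };
}
```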

Current CI Status

As of coordinator review on 2026-07-04, this draft PR is not mergeable: the GitHub Actions Test job fails because the full @agentv/core fixture set still contains legacy authored input surfaces (127 failures), and the Validate Evals job fails because repo examples still broadly use tests[].input. The follow-up migration work is recorded on av-kfik.15 and av-kfik.16. Keep this PR in draft until those branches are sequenced and CI is green.

cloudflare-workers-and-pages · 2026-07-04T09:26:57Z

Deploying agentv with Cloudflare Pages

Latest commit:	`b60671f`
Status:	✅ Deploy successful!
Preview URL:	https://70b42b69.agentv.pages.dev
Branch Preview URL:	https://promptfoo-input-hard-depreca.agentv.pages.dev


christso · 2026-07-04T09:36:16Z

Evidence correction verified: public EntityProcess/agentv no longer has an origin/agentv-private branch (git ls-remote --heads origin agentv-private returns empty), and the temporary local agentv-private worktree/branch at commit 19612e7 was removed. Correct evidence is in the separate EntityProcess/agentv-private repo on orphan branch evidence/av-kfik-27-input-hard-deprecation at commit 475b3d6d5f68496bf7c3377daafa22fb0136b96d; rev-list shows no parent and the branch contains source/prompt-vars.eval.yaml, source/targets.yaml, and run-bundle/.
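The "no parent" check relies on git rev-list --parents, which prints a commit's hash followed by its parent hashes on one line; a root (orphan) commit prints only its own hash. A small TypeScript sketch that runs and interprets it (parseRevListParents, isRootCommit, and revListParentsLine are illustrative names, and the commit must be available in the local clone):

```typescript
import { execFileSync } from "node:child_process";

// `git rev-list --parents -n 1 <commit>` prints "<commit> <parent>..." on one line.
function parseRevListParents(line: string): { commit: string; parents: string[] } {
  const [commit = "", ...parents] = line.trim().split(/\s+/);
  return { commit, parents };
}

// A root (orphan) commit has no parents, so the line holds only its own hash.
function isRootCommit(line: string): boolean {
  return parseRevListParents(line).parents.length === 0;
}

// Runs git in the current working directory's repository.
function revListParentsLine(commit: string): string {
  return execFileSync("git", ["rev-list", "--parents", "-n", "1", commit], { encoding: "utf8" });
}
```

In a clone of EntityProcess/agentv-private with the evidence branch fetched, isRootCommit(revListParentsLine("475b3d6d5f68496bf7c3377daafa22fb0136b96d")) should be true if the branch is orphaned as described.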

christso · 2026-07-04T13:10:17Z

The codemod migration for Beads av-kfik.15.1 / av-kfik.15 has now been pushed directly to this PR branch at commit 1e5d130.

Validation evidence:

Local: bun run lint passed
Local: bun run typecheck passed
Local: bun run validate:examples passed 108/108
Local: bun run test passed (agentv: 746 passed, dashboard: 153 passed, per the captured output tail)
README scan passed: no Promptfoo mentions and no public YAML/defineEval/evaluate tests[].input examples outside vars.input/task parameter usage
Remote CI on this PR (#1646) is green: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages all passed on run https://github.com/EntityProcess/agentv/actions/runs/28707200216

The earlier successor PR #1650 has been merged into this hard-deprecation branch; this PR is now the green-ready branch for av-kfik.27.

christso · 2026-07-04T14:04:02Z

Coordinator verification update after merging #1652:

#1652 (fix(eval): reject mixed criteria assertions) was merged into this branch and closed av-kfik.42.
CI on promptfoo-input-hard-deprecation is green at b60671ffdbe50ba5cf4823d229041f1684f4de62.
Local build in the integrated worktree passed (bun run build; only the pre-existing dashboard chunk-size warning).
Live dogfood passed with a real provider and real LLM grader using gpt-5.4-mini via the available Azure OpenAI secret.

Dogfood command shape:

bun apps/cli/src/cli.ts eval run examples/features/default-graders/evals/suite.yaml \
  --test-id greeting \
  --target azure \
  --grader-target azure \
  --workers 1 \
  --output .agentv/results/av-kfik-27-input-hard-deprecation-azure-openai-20260704T140222Z \
  --threshold 0.5 \
  --no-results-push

Result: PASS, 1/1, mean score 100%.

Credential note: the standard local OPENAI_API_KEY was a dummy value, so the successful live run used the Bitwarden secret azure-openai-chris-shared, which lists gpt-5.4 and gpt-5.4-mini; the process set AZURE_DEPLOYMENT_NAME=gpt-5.4-mini.

Private evidence branch: EntityProcess/agentv-private:evidence/av-kfik-27-input-hard-deprecation-20260704
Evidence commit: 2ca68df

christso · 2026-07-04T18:44:42Z

Correction: the dogfood run against the local OpenAI-compatible endpoint was rerun and is now the primary evidence.

Local endpoint check:

curl http://127.0.0.1:10531/v1/models
# includes gpt-5.4-mini
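The same endpoint check can be scripted. A minimal TypeScript sketch, assuming an OpenAI-compatible /v1/models response of the form { data: [{ id: ... }] } and a runtime with a global fetch (Bun, or Node 18+); listsModel and endpointHasModel are hypothetical helper names. Like the curl command, it sends no Authorization header, so add one if the proxy requires it.

```typescript
// Minimal shape of an OpenAI-compatible GET /v1/models response (assumed).
interface ModelList {
  data?: { id: string }[];
}

// True when the model list contains an entry with exactly this id.
function listsModel(body: ModelList, modelId: string): boolean {
  return (body.data ?? []).some((model) => model.id === modelId);
}

// Fetches <baseUrl>/models and checks for the model id.
async function endpointHasModel(baseUrl: string, modelId: string): Promise<boolean> {
  const res = await fetch(`${baseUrl.replace(/\/+$/, "")}/models`);
  if (!res.ok) throw new Error(`GET ${baseUrl}/models failed with HTTP ${res.status}`);
  return listsModel((await res.json()) as ModelList, modelId);
}
```

Usage: await endpointHasModel("http://127.0.0.1:10531/v1", "gpt-5.4-mini") before starting the dogfood run.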

Dogfood command shape:

LOCAL_OPENAI_PROXY_BASE_URL=http://127.0.0.1:10531/v1 \
LOCAL_OPENAI_PROXY_API_KEY=local \
LOCAL_OPENAI_PROXY_MODEL=gpt-5.4-mini \
bun apps/cli/src/cli.ts eval run examples/features/default-graders/evals/suite.yaml \
  --test-id greeting \
  --target local-openai \
  --grader-target local-openai-grader \
  --workers 1 \
  --output .agentv/results/av-kfik-27-input-hard-deprecation-local-openai-20260704T184248Z \
  --threshold 0.5 \
  --no-results-push

Result: PASS, 1/1, mean score 100%.

Private evidence branch: EntityProcess/agentv-private:evidence/av-kfik-27-input-hard-deprecation-20260704
Evidence commit: 4f476f0

The previous Azure OpenAI run remains in the evidence branch as secondary fallback evidence; the local OpenAI-compatible run is the correct primary dogfood for this PR.

Commits:

e851757 fix(eval): reject authored direct input
1e5d130 Migrate eval input authoring to prompts vars
b60671f fix(eval): reject mixed criteria assertions (#1652)

christso marked this pull request as ready for review July 4, 2026 14:04

christso merged commit 7741f9d into main Jul 4, 2026
8 checks passed

christso deleted the promptfoo-input-hard-deprecation branch July 4, 2026 14:05

christso merged 3 commits into main from promptfoo-input-hard-deprecation
