fix(eval): reject authored direct input #1646
Deploying agentv with
| Latest commit: | b60671f |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://70b42b69.agentv.pages.dev |
| Branch Preview URL: | https://promptfoo-input-hard-depreca.agentv.pages.dev |

Evidence correction verified: public EntityProcess/agentv no longer has an origin/agentv-private branch (git ls-remote --heads origin agentv-private returns empty), and the temporary local agentv-private worktree/branch at commit 19612e7 was removed. Correct evidence is in the separate EntityProcess/agentv-private repo on orphan branch evidence/av-kfik-27-input-hard-deprecation at commit 475b3d6d5f68496bf7c3377daafa22fb0136b96d; rev-list shows no parent, and the branch contains source/prompt-vars.eval.yaml, source/targets.yaml, and run-bundle/.

Codemod migration for Beads av-kfik.15.1 / av-kfik.15 is now pushed directly to this PR branch at commit 1e5d130. Validation evidence:
The earlier successor PR #1650 has been merged into this hard-deprecation branch; this PR is now the green-ready branch for av-kfik.27.

Coordinator verification update after merging #1652:
Dogfood command shape:
bun apps/cli/src/cli.ts eval run examples/features/default-graders/evals/suite.yaml \
  --test-id greeting \
  --target azure \
  --grader-target azure \
  --workers 1 \
  --output .agentv/results/av-kfik-27-input-hard-deprecation-azure-openai-20260704T140222Z \
  --threshold 0.5 \
  --no-results-push
Result: PASS, 1/1, mean score 100%.
Credential note: standard local Private evidence branch:

Correction: local OpenAI-compatible endpoint dogfood was rerun and is now the primary evidence.
Local endpoint check:
curl http://127.0.0.1:10531/v1/models
# includes gpt-5.4-mini
Dogfood command shape:
LOCAL_OPENAI_PROXY_BASE_URL=http://127.0.0.1:10531/v1 \
LOCAL_OPENAI_PROXY_API_KEY=local \
LOCAL_OPENAI_PROXY_MODEL=gpt-5.4-mini \
bun apps/cli/src/cli.ts eval run examples/features/default-graders/evals/suite.yaml \
  --test-id greeting \
  --target local-openai \
  --grader-target local-openai-grader \
  --workers 1 \
  --output .agentv/results/av-kfik-27-input-hard-deprecation-local-openai-20260704T184248Z \
  --threshold 0.5 \
  --no-results-push
Result: PASS, 1/1, mean score 100%.
Private evidence branch: The previous Azure OpenAI run remains in the evidence branch as secondary fallback evidence; the local OpenAI-compatible run is the correct primary dogfood for this PR.
Summary
- Implements Bead `av-kfik.27` hard-deprecation for authored direct input in normal Promptfoo-aligned eval YAML.
- Rejects top-level `input` and inline `tests[].input` at schema/validator/parser/runtime load boundaries, with actionable guidance to use top-level `prompts` plus `default_test.vars` / `tests[].vars`.
- External raw-case files loaded through `tests: file://...` may still carry internal `input`; normal authored eval YAML cannot.
- `input_files + input` is now rejected, while canonical prompt content file blocks with vars render successfully.

Promptfoo Source Evidence
Promptfoo local clone: `/home/entity/projects/promptfoo/promptfoo`
Evidence commit: `6bfc5a0c7f16f9c4717ac731d276b578e63d0769` (`6bfc5a0 chore(deps): update modelaudit schema generator to v0.2.47 (#9635)`)
Claims verified from source:
- `src/types/index.ts:851-858`: Promptfoo test cases represent unique prompt inputs after substituting `vars`; `vars` is the test-row data surface.
- `src/types/index.ts:1033-1049`: suites own `providers`, `prompts`, `tests`, and `defaultTest`; prompt entries are top-level suite data.
- `src/evaluator.ts:2165-2173`: Promptfoo merges `defaultTest.vars`, scenario/data vars, and test vars.
- `src/evaluator.ts:2318-2322`: Promptfoo propagates default test prompts/providers onto test cases.

Verification
Passed locally:
- `bun test packages/core/test/evaluation/prompt-input-authoring.test.ts packages/core/test/evaluation/input-files-shorthand.test.ts packages/core/test/evaluation/yaml-parser-metadata.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts`
- `bun run lint`
- `bun run typecheck`
- `bun --filter @agentv/core build`
- `git diff --check`

`bun run validate:examples` was run and currently fails because many existing repo examples still author `tests[].input`: 108 eval YAML files checked, 6 valid, 102 invalid. This is recorded on `av-kfik.16` for the `examples-authoring` worker and on `av-kfik.15` with exact codemod conversion requirements. This PR intentionally does not duplicate that broad examples/codemod sweep.

Live Dogfood
Passed with local OpenAI-compatible endpoint:
AGENTV_DOGFOOD_OPENAI_BASE_URL=http://127.0.0.1:10531/v1 \
AGENTV_DOGFOOD_OPENAI_API_KEY=local-proxy \
AGENTV_DOGFOOD_OPENAI_MODEL=gpt-5.4-mini \
AGENTV_NO_UPDATE_CHECK=1 \
bun apps/cli/src/cli.ts eval run prompt-vars.eval.yaml \
  --targets .agentv/targets.yaml \
  --output .agentv/results/prompt-vars-hard-deprecation \
  --grader-target local-grader \
  --threshold 0 \
  --no-cache
Result: `PASS (1/1, mean 100%)` using canonical `.agentv/results/prompt-vars-hard-deprecation` output.
Private evidence: `EntityProcess/agentv-private` branch `evidence/av-kfik-27-input-hard-deprecation`, commit `475b3d6`.
Coordinator correction: the worker briefly pushed an `agentv-private` branch to the public `EntityProcess/agentv` remote while publishing evidence. That public branch was deleted with `git push origin :agentv-private`; do not use commit `19612e7c` as evidence.

Compatibility Decision
Normal public authored eval YAML now rejects direct `input` everywhere this PR touches. External raw-case files loaded through `tests: file://...` remain the only tested internal compatibility path for `input`; they are deliberately kept out of canonical public docs and covered by `prompt-input-authoring.test.ts`.

Beads
- `av-kfik.27`: implementation PR, left open for coordinator review/merge.
- `av-kfik.15`: updated with exact remaining codemod conversion requirements for top-level input, inline tests input, message arrays, and `input_files + input`.
- `av-kfik.16`: updated with the exact `validate:examples` blocker and examples migration handoff.

Current CI Status
As of coordinator review on 2026-07-04, this draft PR is not mergeable: GitHub Actions Test fails because full @agentv/core fixtures still contain legacy authored input surfaces (127 failures), and Validate Evals fails because repo examples still broadly use tests[].input. The follow-up migration work is recorded on av-kfik.15 and av-kfik.16. Keep this PR draft until those branches are sequenced and CI is green.
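
For reviewers who want a concrete picture of what the failing examples trip over, here is a minimal sketch of the kind of check this PR describes. It is not the actual validator: the `EvalFile` and `TestCase` shapes, the `findAuthoredInput` helper, and the message wording are hypothetical and chosen only for illustration. The `{{name}}` placeholder in the usage example assumes Promptfoo-style template variables. A string-valued `tests` stands in for `tests: file://...`, which the sketch leaves uninspected, mirroring the compatibility decision.

```typescript
// Hypothetical shapes for illustration only; not AgentV's real schema types.
type TestCase = { id?: string; vars?: Record<string, unknown>; input?: unknown };
type EvalFile = {
  prompts?: unknown[];
  default_test?: { vars?: Record<string, unknown> };
  tests?: TestCase[] | string; // a string models `tests: file://...`
  input?: unknown;
};

// Guidance text paraphrased from the PR summary (assumed wording).
const GUIDANCE =
  "use top-level `prompts` plus `default_test.vars` / `tests[].vars` instead";

// Returns one message per authored direct-input surface found in the file.
function findAuthoredInput(evalFile: EvalFile): string[] {
  const errors: string[] = [];
  if ("input" in evalFile) {
    errors.push(`top-level input is not allowed; ${GUIDANCE}`);
  }
  // Only inline test rows are checked; externally loaded raw-case files
  // (string-valued `tests`) are deliberately not inspected here.
  if (Array.isArray(evalFile.tests)) {
    evalFile.tests.forEach((test, index) => {
      if ("input" in test) {
        errors.push(`tests[${index}].input is not allowed; ${GUIDANCE}`);
      }
    });
  }
  return errors;
}

// Usage: a legacy authored file versus the canonical prompts-plus-vars shape.
const legacy: EvalFile = { tests: [{ id: "greeting", input: "Say hello" }] };
const canonical: EvalFile = {
  prompts: ["Say hello to {{name}}"],
  tests: [{ id: "greeting", vars: { name: "Ada" } }],
};
console.log(findAuthoredInput(legacy).length); // 1
console.log(findAuthoredInput(canonical).length); // 0
```

Returning a list of messages rather than throwing on the first hit is a design choice of this sketch: it lets a migration sweep report every offending row in a file at once.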