AIQ Skill Eval Smoke Tests #236
Greptile Summary
This PR introduces the initial AI-Q Agent Skill evaluation harness, the
Confidence Score: 5/5
Safe to merge; the new eval harness and deploy skill are additive, the Dockerfile migration is well-structured, and the only flagged items are minor quality nits that do not affect the CI validation path or any existing functionality.
All changes are either new infrastructure (eval harness, skill docs, CI workflow) or a straightforward Dockerfile base-image migration. The regular PR CI path (spec discovery, validation, and dataset generation) is exercised on every qualifying PR and has no defects. The two inline findings are confined to the optional Harbor execution path (which requires manual dispatch) and a Python version mismatch in a generated environment Dockerfile; neither affects correctness of the generated datasets or the Dockerfile build verified locally.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant CI as GitHub Actions
participant SEA as skills_eval_agent.py
participant GEN as adapters/aiq-research/generate.py
participant Harbor as uvx harbor run
participant VER as aiq_checks.py (verifier)
participant AIQ as AI-Q Server
CI->>SEA: python3 skills_eval_agent.py --all [--run-harbor]
SEA->>SEA: _discover_specs() → list[spec paths]
SEA->>SEA: _validate_spec(spec)
SEA->>GEN: generate.py --spec basic.json --skill-dir aiq-research
GEN->>GEN: write task.toml, instruction.md, test.sh, aiq_checks.py
GEN-->>SEA: generated dataset dirs
SEA->>CI: upload artifact (always)
alt --run-harbor (manual dispatch only)
SEA->>Harbor: uvx harbor run -p task_root -a claude-code
Harbor->>AIQ: run agent instructions
AIQ-->>Harbor: agent trajectory
Harbor->>VER: test.sh → aiq_checks.py --spec --step N
VER->>AIQ: python3 /skills/aiq-research/scripts/aiq.py health
AIQ-->>VER: JSON response
VER->>VER: write reward.txt + aiq-checks.json
Harbor-->>SEA: exit code
SEA->>CI: upload results artifact
end
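The discovery and validation steps at the top of the diagram can be sketched roughly as follows. This is a minimal illustration rather than the PR's code: the helper names discover_specs and validate_spec, the required keys (skill, steps), and the generalization of the .agents/skills/aiq-research/eval/ layout to every skill directory are assumptions made for the example.

```python
import json
from pathlib import Path

# Hypothetical required keys; the real spec schema is not shown in this PR text.
REQUIRED_KEYS = ("skill", "steps")


def discover_specs(repo_root: Path) -> list[Path]:
    """Return eval spec files found under .agents/skills/<skill>/eval/."""
    skills_dir = repo_root / ".agents" / "skills"
    return sorted(skills_dir.glob("*/eval/*.json"))


def validate_spec(spec_path: Path) -> list[str]:
    """Return a list of human-readable problems; empty means the spec passed."""
    try:
        spec = json.loads(spec_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{spec_path}: invalid JSON ({exc})"]
    if not isinstance(spec, dict):
        return [f"{spec_path}: top-level value must be an object"]
    problems = [
        f"{spec_path}: missing key '{key}'"
        for key in REQUIRED_KEYS
        if key not in spec
    ]
    if "steps" in spec and not isinstance(spec["steps"], list):
        problems.append(f"{spec_path}: 'steps' must be a list")
    return problems
```

Returning a list of problems instead of raising on the first one lets a CI run report every broken spec in a single pass.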
Reviews (3): Last reviewed commit: "Format AIQ skill eval adapter"
Summary
This PR adds the initial AI-Q Agent Skill evaluation harness and a deploy skill so we can validate AI-Q skills in the same general style as the VSS skill-eval work, without copying VSS-specific Brev/GPU/video assumptions into AI-Q.
The first checked-in eval target is aiq-research. The baseline profile is intentionally a smoke test against a live AI-Q server: it verifies the skill can check server health and list available async research agents. It avoids model-generating chat/research calls for now so the harness can run cheaply and deterministically while we stabilize runner credentials, deploy expectations, and the AI-Q runtime contract.
What Changed
Added .github/skill-eval/ for AI-Q skill-eval orchestration:
- skills_eval_agent.py discovers skill eval specs, validates their schema, generates task datasets, and optionally invokes Harbor.
- adapters/aiq-research/generate.py turns .agents/skills/aiq-research/eval/*.json specs into Harbor-style task directories.
- verifiers/aiq_checks.py provides deterministic checks such as shell command checks, JSON command checks, and trajectory contains/not-contains checks.
- AGENTS.md and README.md document the eval workspace and local/CI usage.
Added the first aiq-research eval spec:
- .agents/skills/aiq-research/eval/basic.json
- Checks /health through the skill wrapper.
- Lists /v1/jobs/async/agents through scripts/aiq.py agents and requires both deep_researcher and shallow_researcher.
Added an aiq-deploy skill:
- .agents/skills/aiq-deploy/SKILL.md
- .agents/skills/aiq-deploy/README.md
- .claude/skills/aiq-deploy symlink for Claude-compatible skill discovery.
Added PR/dispatch CI for skill eval generation:
- .github/workflows/skills-eval.yml
- Uses the aiq-eval runner when run_harbor=true.
- Selectable agent (claude-code, codex, or oracle).
Updated Docker backend build path:
- deploy/Dockerfile now uses the NVIDIA Ubuntu Noble base for the builder stage.
- Installs uv through python3.12 -m pip.
Updated docs:
- docs/source/integration/agent-skills.md now includes the AI-Q skill eval/deploy usage path.
Why This Is Useful
This gives AI-Q a repeatable validation loop for Agent Skills. Instead of relying on an agent author to manually inspect a skill and hope it works, the repo can now generate concrete tasks that exercise the skill against a running AI-Q server and verify the outcome with deterministic checks.
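As a rough illustration of what a deterministic check can look like, the sketch below evaluates an agent listing and an agent trajectory without calling any model. The function names, the assumed payload shape (a JSON list of objects with a name field), and the fraction-passed reward are assumptions for the example; the PR's actual logic lives in verifiers/aiq_checks.py and may differ.

```python
import json

# Agent names the baseline spec requires, per the PR description.
EXPECTED_AGENTS = frozenset({"deep_researcher", "shallow_researcher"})


def check_agents(stdout: str, expected: frozenset[str] = EXPECTED_AGENTS) -> bool:
    """Pass only if the command output is JSON and names every expected agent."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, list):
        return False
    # Assumed payload shape: [{"name": "deep_researcher"}, ...]
    names = {item.get("name") for item in payload if isinstance(item, dict)}
    return expected <= names


def check_trajectory(text: str, contains: list[str], not_contains: list[str]) -> bool:
    """Pass if every required string appears and no forbidden string does."""
    return all(s in text for s in contains) and not any(s in text for s in not_contains)


def reward(results: list[bool]) -> float:
    """Fraction of checks that passed; 1.0 means every check succeeded."""
    return sum(results) / len(results) if results else 0.0
```

Because each check is a pure function of captured output, the same inputs always produce the same verdict, which is what makes the smoke test repeatable.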
The split between aiq-deploy and aiq-research is intentional:
- aiq-deploy owns getting AI-Q running and proving the runtime is reachable.
- aiq-research owns interacting with an already-running AI-Q server.
- .github/skill-eval owns converting skill specs into Harbor tasks and running/verifying them.
That separation should make it easier to add deeper evals later without embedding deployment assumptions into every research-skill task.
Current Scope
This PR is a smoke-harness foundation, not a full research-quality benchmark yet.
Covered now:
- AIQ_SERVER_URL support for local Harbor runs.
Not covered yet:
- /chat or research generation quality checks.
- TAVILY_API_KEY, SERPER_API_KEY, or EXA_API_KEY.
Local Validation Performed
Commands/checks run locally:
Docker/runtime checks performed locally:
docker build -f deploy/Dockerfile --target dev -t aiq-agent:skill-eval-test .
Then started the backend through Docker Compose on a non-default host port to avoid collisions with existing local services, and verified:
The deployed backend returned healthy status and listed both expected agents:
- deep_researcher
- shallow_researcher
Harbor/Codex eval was also run against the compose-deployed backend:
Result: 2/2, mean 1.000, no exceptions.
CI Notes
The normal PR CI path validates dataset generation. Full Harbor execution is intentionally gated behind manual workflow dispatch because it needs an AI-Q server URL and agent credentials on a self-hosted runner.
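The gating described here could look something like this in the orchestrator. It is a sketch under assumptions: build_harbor_command and plan_run are hypothetical helpers, and the only Harbor invocation used is the uvx harbor run -p ... -a ... form that appears in the sequence diagram; no other Harbor flags are implied.

```python
from pathlib import Path

# Agent names mentioned in this PR; treated here as the allowed values.
SUPPORTED_AGENTS = ("claude-code", "codex", "oracle")


def build_harbor_command(task_root: Path, agent: str) -> list[str]:
    """Build the argv for one Harbor run, mirroring the diagram's invocation."""
    if agent not in SUPPORTED_AGENTS:
        raise ValueError(f"unsupported agent: {agent!r}")
    return ["uvx", "harbor", "run", "-p", str(task_root), "-a", agent]


def plan_run(
    task_roots: list[Path], run_harbor: bool, agent: str = "claude-code"
) -> list[list[str]]:
    """Return the commands to execute; empty when Harbor is not requested."""
    if not run_harbor:
        # Normal PR CI stops after dataset generation.
        return []
    return [build_harbor_command(root, agent) for root in task_roots]
```

Keeping command construction separate from execution means the PR path can exercise the planning logic without needing a server URL or agent credentials.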
Follow-Ups
Suggested follow-up work after this lands:
- Add an aiq-research lifecycle eval that exercises a small research request once runner credentials and inference cost expectations are agreed.
- Decide whether the aiq-eval runner should own AI-Q deployment before Harbor runs, or whether it should target a pre-provisioned shared eval server.