AIQ Skill Eval Smoke Tests #236

Merged: 3 commits merged into develop from jonp/aiq-skill-eval-smoke on May 19, 2026

@freshyjmp freshyjmp commented May 13, 2026

Summary

This PR adds the initial AI-Q Agent Skill evaluation harness and a deploy skill so we can validate AI-Q skills in the same general style as the VSS skill-eval work, without copying VSS-specific Brev/GPU/video assumptions into AI-Q.

The first checked-in eval target is aiq-research. The baseline profile is intentionally a smoke test against a live AI-Q server: it verifies the skill can check server health and list available async research agents. It avoids model-generating chat/research calls for now so the harness can run cheaply and deterministically while we stabilize runner credentials, deploy expectations, and the AI-Q runtime contract.
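A minimal sketch of what a smoke probe of this kind could look like. The two endpoint paths come from this description; the function names, the injectable `get` callable, and the "HTTP 200 means pass" rule are illustrative assumptions, not the skill wrapper's actual behavior.

```python
import urllib.request
from typing import Callable

# Endpoint paths are taken from the PR description; everything else is illustrative.
SMOKE_PATHS = ("/health", "/v1/jobs/async/agents")


def http_get(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def smoke_check(base_url: str, get: Callable[[str], int] = http_get) -> dict[str, bool]:
    """Probe each smoke endpoint and record whether it answered with 200."""
    results: dict[str, bool] = {}
    for path in SMOKE_PATHS:
        try:
            results[path] = get(base_url.rstrip("/") + path) == 200
        except OSError:
            # URLError, HTTPError, and socket timeouts all subclass OSError.
            results[path] = False
    return results
```

Passing the fetcher as a parameter keeps the check testable without a live server, which fits the goal of a cheap, deterministic baseline. Against a real deployment it would be called as `smoke_check(os.environ["AIQ_SERVER_URL"])`.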

What Changed

  • Added .github/skill-eval/ for AI-Q skill-eval orchestration:

    • skills_eval_agent.py discovers skill eval specs, validates their schema, generates task datasets, and optionally invokes Harbor.
    • adapters/aiq-research/generate.py turns .agents/skills/aiq-research/eval/*.json specs into Harbor-style task directories.
    • verifiers/aiq_checks.py provides deterministic checks such as shell command checks, JSON command checks, and trajectory contains/not-contains checks.
    • AGENTS.md and README.md document the eval workspace and local/CI usage.
  • Added the first aiq-research eval spec:

    • .agents/skills/aiq-research/eval/basic.json
    • Verifies /health through the skill wrapper.
    • Verifies /v1/jobs/async/agents through scripts/aiq.py agents and requires both deep_researcher and shallow_researcher.
  • Added an aiq-deploy skill:

    • .agents/skills/aiq-deploy/SKILL.md
    • .agents/skills/aiq-deploy/README.md
    • .claude/skills/aiq-deploy symlink for Claude-compatible skill discovery.
    • Covers CLI, local web, Docker Compose, Kubernetes/Helm, FRAG mode checks, health verification, logs, rebuild, and stop flows.
  • Added PR/dispatch CI for skill eval generation:

    • .github/workflows/skills-eval.yml
    • Pull requests validate dataset generation.
    • Manual dispatch can run Harbor trials on a self-hosted aiq-eval runner when run_harbor=true.
    • Harbor agent selection is configurable via workflow inputs (claude-code, codex, or oracle).
  • Updated Docker backend build path:

    • deploy/Dockerfile now uses the NVIDIA Ubuntu Noble base for the builder stage.
    • Removed the Launchpad/deadsnakes PPA dependency, which failed in restricted or partially offline environments.
    • Uses distro Python 3.12 packages and installs uv through python3.12 -m pip.
  • Updated docs:

    • docs/source/integration/agent-skills.md now includes the AI-Q skill eval/deploy usage path.
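The discover-then-validate step described for skills_eval_agent.py might be sketched as follows. The directory layout mirrors the `.agents/skills/aiq-research/eval/*.json` path above; the required keys (`skill`, `steps`) and both function names are hypothetical, since the real schema is not shown in this PR.

```python
import json
from pathlib import Path

# Hypothetical minimal schema; the real validator's required fields are not shown here.
REQUIRED_KEYS = {"skill", "steps"}


def discover_specs(repo_root: Path) -> list[Path]:
    """Find eval specs laid out as .agents/skills/<skill>/eval/*.json."""
    return sorted((repo_root / ".agents" / "skills").glob("*/eval/*.json"))


def validate_spec(path: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means the spec passed."""
    try:
        spec = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc.msg})"]
    if not isinstance(spec, dict):
        return [f"{path}: top level must be an object"]
    problems = [f"{path}: missing key '{key}'" for key in sorted(REQUIRED_KEYS - set(spec))]
    if "steps" in spec and not (isinstance(spec["steps"], list) and spec["steps"]):
        problems.append(f"{path}: 'steps' must be a non-empty list")
    return problems
```

Returning a list of problems rather than raising on the first one lets a CI job report every malformed spec in a single run.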

Why This Is Useful

This gives AI-Q a repeatable validation loop for Agent Skills. Instead of relying on an agent author to manually inspect a skill and hope it works, the repo can now generate concrete tasks that exercise the skill against a running AI-Q server and verify the outcome with deterministic checks.

The split between aiq-deploy and aiq-research is intentional:

  • aiq-deploy owns getting AI-Q running and proving the runtime is reachable.
  • aiq-research owns interacting with an already-running AI-Q server.
  • .github/skill-eval owns converting skill specs into Harbor tasks and running/verifying them.

That separation should make it easier to add deeper evals later without embedding deployment assumptions into every research-skill task.

Current Scope

This PR is a smoke-harness foundation, not a full research-quality benchmark yet.

Covered now:

  • Skill spec discovery and validation.
  • Harbor-style dataset generation.
  • Health check task.
  • Agent listing task.
  • Docker-reachable AIQ_SERVER_URL support for local Harbor runs.
  • Manual Harbor execution with Claude Code or Codex agents.
  • Docker Compose backend build/start verification path.

Not covered yet:

  • End-to-end /chat or research generation quality checks.
  • Web-search-backed research flows requiring TAVILY_API_KEY, SERPER_API_KEY, or EXA_API_KEY.
  • FRAG-specific evals requiring external RAG services.
  • Always-on Harbor execution in regular PR CI. Harbor remains a manual self-hosted workflow path for now.

Local Validation Performed

Commands/checks run locally:

uv run ruff check .
uv run ruff format --check .
python3 -m py_compile \
  .github/skill-eval/adapters/aiq-research/generate.py \
  .github/skill-eval/skills_eval_agent.py \
  .github/skill-eval/verifiers/aiq_checks.py
python3 -m json.tool .agents/skills/aiq-research/eval/basic.json >/dev/null
python3 .github/skill-eval/skills_eval_agent.py --all \
  --output-dir /tmp/aiq-skill-eval/precommit-datasets

Docker/runtime checks performed locally:

docker build -f deploy/Dockerfile --target dev -t aiq-agent:skill-eval-test .

Then started the backend through Docker Compose on a non-default host port to avoid collisions with existing local services, and verified:

curl -sf http://127.0.0.1:18000/health
AIQ_SERVER_URL=http://127.0.0.1:18000 python3 .agents/skills/aiq-research/scripts/aiq.py agents

The deployed backend returned healthy status and listed both expected agents:

  • deep_researcher
  • shallow_researcher
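The "both expected agents are listed" condition can be expressed as a small deterministic check. This is a sketch only: it assumes the `agents` command prints a JSON array whose items are either plain strings or objects with a `name` field, which is a guess about the response shape and not something this PR documents.

```python
import json

# Agent names are taken from the PR description.
REQUIRED_AGENTS = {"deep_researcher", "shallow_researcher"}


def listed_agent_names(stdout: str) -> set[str]:
    """Extract agent names from JSON output.

    Assumes a JSON array of strings or of objects with a "name" field;
    the real response shape is not shown in this PR.
    """
    names: set[str] = set()
    for item in json.loads(stdout):
        if isinstance(item, str):
            names.add(item)
        elif isinstance(item, dict) and isinstance(item.get("name"), str):
            names.add(item["name"])
    return names


def missing_agents(stdout: str) -> set[str]:
    """Return the required agents that the output does not list."""
    return REQUIRED_AGENTS - listed_agent_names(stdout)
```

An empty result from `missing_agents` would correspond to a passing check; a non-empty set names exactly which agent is absent.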

Harbor/Codex eval was also run against the compose-deployed backend:

AIQ_SERVER_URL=http://host.docker.internal:18000 \
AIQ_SKILL_EVAL_AGENT=codex \
AIQ_SKILL_EVAL_MODEL=gpt-5.2 \
CODEX_FORCE_AUTH_JSON=1 \
AIQ_SKILL_EVAL_RESULTS_DIR=/tmp/aiq-skill-eval/compose-codex-retry-results \
python3 .github/skill-eval/skills_eval_agent.py --all --run-harbor \
  --output-dir /tmp/aiq-skill-eval/compose-codex-retry-datasets

Result: 2/2, mean 1.000, no exceptions.
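For readers unfamiliar with the summary format, the line above is consistent with two tasks that each earned a reward of 1.0. A sketch of that arithmetic, assuming per-task rewards of 0.0 or 1.0 (how Harbor itself aggregates results is not shown here, and `summarize` is a hypothetical helper):

```python
from statistics import fmean


def summarize(rewards: list[float]) -> str:
    """Format per-task rewards as '<passed>/<total>, mean <value>'."""
    if not rewards:
        return "0/0, mean n/a"
    passed = sum(1 for reward in rewards if reward >= 1.0)
    return f"{passed}/{len(rewards)}, mean {fmean(rewards):.3f}"


print(summarize([1.0, 1.0]))  # 2/2, mean 1.000
```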

CI Notes

The normal PR CI path validates dataset generation. Full Harbor execution is intentionally gated behind manual workflow dispatch because it needs an AI-Q server URL and agent credentials on a self-hosted runner.

Follow-Ups

Suggested follow-up work after this lands:

  • Add a deeper aiq-research lifecycle eval that exercises a small research request once runner credentials and inference cost expectations are agreed.
  • Add optional eval profiles for web-search-enabled and FRAG-backed modes.
  • Decide whether the self-hosted aiq-eval runner should own AI-Q deployment before Harbor runs, or whether it should target a pre-provisioned shared eval server.
  • Add more deploy-skill checks if the team wants the skill to actively provision Kubernetes/Helm environments instead of only guiding and verifying them.


greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR introduces the initial AI-Q Agent Skill evaluation harness, the aiq-deploy skill, and a smoke-test eval spec for aiq-research — plus a Docker builder migration from Ubuntu Jammy + deadsnakes PPA to Ubuntu Noble with uv-managed Python 3.13.

  • Eval harness (.github/skill-eval/): skills_eval_agent.py discovers and validates specs, adapters/aiq-research/generate.py produces Harbor-style task directories, and verifiers/aiq_checks.py runs deterministic shell, JSON-command, and trajectory checks. CI validates dataset generation on every qualifying PR; Harbor execution is gated to manual dispatch on a self-hosted runner.
  • aiq-deploy skill (.agents/skills/aiq-deploy/): New portable skill covering CLI, local web, Docker Compose, Kubernetes/Helm, and FRAG deployment modes with safety rules and a handoff path to aiq-research.
  • Dockerfile (deploy/Dockerfile): Removes the Launchpad/deadsnakes PPA dependency and switches to Noble's distro Python as a bootstrap for uv, which then installs and manages Python 3.13; the uv-python directory is copied into both distroless final stages so the venv resolves correctly.

Confidence Score: 5/5

Safe to merge; the new eval harness and deploy skill are additive, the Dockerfile migration is well-structured, and the only flagged items are minor quality nits that do not affect the CI validation path or any existing functionality.

All changes are either new infrastructure (eval harness, skill docs, CI workflow) or a straightforward Dockerfile base-image migration. The regular PR CI path — spec discovery, validation, and dataset generation — is exercised on every qualifying PR and has no defects. The two inline findings are confined to the optional Harbor execution path (which requires manual dispatch) and a Python version mismatch in a generated environment Dockerfile; neither affects correctness of the generated datasets or the Dockerfile build verified locally.

skills_eval_agent.py and aiq_checks.py are worth a second look if Harbor execution is enabled, specifically around how multi-step task roots are passed to Harbor and how the skills-dir path substitution handles non-standard paths.
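One way to address the multi-step task-root concern would be to pass Harbor only directories that directly contain a task.toml. The sketch below assumes a hypothetical dataset layout of `<dataset>/<spec>/<step>/task.toml`; whether Harbor expects per-step roots is not confirmed in this review, and `task_roots` here is an illustration, not the `_task_roots` in the PR.

```python
from pathlib import Path


def task_roots(dataset_dir: Path) -> list[Path]:
    """Return every directory under dataset_dir that directly contains task.toml."""
    return sorted(path.parent for path in dataset_dir.rglob("task.toml"))
```

With this approach a two-step spec yields two roots, one per step, and the parent directory that holds no task.toml is never returned.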

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/skill-eval/skills_eval_agent.py | Orchestrates spec discovery, validation, dataset generation, and optional Harbor execution; _task_roots collapses multi-step specs to a parent directory that has no task.toml, which may not be what Harbor expects when --run-harbor is used. |
| .github/skill-eval/verifiers/aiq_checks.py | Deterministic verifier for shell, JSON-command, and trajectory checks; AIQ_EVAL_SKILLS_DIR is injected into the shell command string without quoting, breaking on paths with spaces. |
| .github/skill-eval/adapters/aiq-research/generate.py | Generates Harbor task directories from spec JSON; environment Dockerfile uses python:3.12-slim while the project targets Python 3.13. |
| deploy/Dockerfile | Migrates builder base from Ubuntu Jammy + deadsnakes PPA to Ubuntu Noble with uv-managed Python 3.13; adds COPY of uv-python dir to distroless final stages so the venv resolves correctly — logic is sound. |
| .github/workflows/skills-eval.yml | CI workflow validates dataset generation on every PR touching skill files and gates Harbor execution behind manual dispatch on a self-hosted runner; permissions, concurrency, and artifact upload configuration look correct. |
| .agents/skills/aiq-research/eval/basic.json | Smoke-test spec covering health-check and agent-listing steps; trajectory and json_command checks are well-formed and intentionally avoid model-generating calls. |
| .agents/skills/aiq-deploy/SKILL.md | New aiq-deploy skill covering CLI, local web, Docker Compose, Kubernetes, and FRAG modes with clear safety rules; no issues found. |
| docs/source/integration/agent-skills.md | Documentation updated to cover both aiq-deploy and aiq-research skills with install instructions for Claude Code, Codex, and OpenCode; straightforward and accurate. |
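The quoting issue noted for aiq_checks.py is the kind of problem `shlex.quote` from the standard library is meant for. A minimal sketch, assuming the command is built by substituting a placeholder into a template string; the `{skills_dir}` placeholder and the `render_command` helper are hypothetical and do not reflect the verifier's actual code.

```python
import shlex


def render_command(template: str, skills_dir: str) -> str:
    """Substitute the (hypothetical) {skills_dir} placeholder with a shell-quoted path."""
    return template.replace("{skills_dir}", shlex.quote(skills_dir))


cmd = render_command(
    "python3 {skills_dir}/aiq-research/scripts/aiq.py health",
    "/tmp/my skills",
)
# The shell now sees the path as a single argument despite the space.
print(shlex.split(cmd))
```

Paths without special characters pass through `shlex.quote` unchanged, so quoting unconditionally does not alter commands that already work.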

Sequence Diagram

sequenceDiagram
    participant CI as GitHub Actions
    participant SEA as skills_eval_agent.py
    participant GEN as adapters/aiq-research/generate.py
    participant Harbor as uvx harbor run
    participant VER as aiq_checks.py (verifier)
    participant AIQ as AI-Q Server

    CI->>SEA: python3 skills_eval_agent.py --all [--run-harbor]
    SEA->>SEA: _discover_specs() → list[spec paths]
    SEA->>SEA: _validate_spec(spec)
    SEA->>GEN: generate.py --spec basic.json --skill-dir aiq-research
    GEN->>GEN: write task.toml, instruction.md, test.sh, aiq_checks.py
    GEN-->>SEA: generated dataset dirs
    SEA->>CI: upload artifact (always)

    alt --run-harbor (manual dispatch only)
        SEA->>Harbor: uvx harbor run -p task_root -a claude-code
        Harbor->>AIQ: run agent instructions
        AIQ-->>Harbor: agent trajectory
        Harbor->>VER: test.sh → aiq_checks.py --spec --step N
        VER->>AIQ: python3 /skills/aiq-research/scripts/aiq.py health
        AIQ-->>VER: JSON response
        VER->>VER: write reward.txt + aiq-checks.json
        Harbor-->>SEA: exit code
        SEA->>CI: upload results artifact
    end
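The verifier step at the end of the diagram (trajectory checks, then writing reward.txt and aiq-checks.json) could be sketched roughly as below. The two output file names come from the diagram; the result structure, the all-or-nothing reward rule, and both function names are assumptions made for illustration.

```python
import json
from pathlib import Path


def trajectory_check(trajectory: str, contains=(), not_contains=()) -> list[dict]:
    """Run substring checks over an agent trajectory and record each outcome."""
    results = []
    for needle in contains:
        results.append({"check": f"contains:{needle}", "passed": needle in trajectory})
    for needle in not_contains:
        results.append({"check": f"not_contains:{needle}", "passed": needle not in trajectory})
    return results


def write_outcome(results: list[dict], out_dir: Path) -> float:
    """Write reward.txt and aiq-checks.json; reward is 1.0 only if every check passed."""
    reward = 1.0 if results and all(r["passed"] for r in results) else 0.0
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "reward.txt").write_text(f"{reward}\n")
    (out_dir / "aiq-checks.json").write_text(json.dumps(results, indent=2))
    return reward
```

A not-contains check is one way a spec could assert that the agent avoided model-generating calls, for example by requiring that the trajectory never mentions `/chat`.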

Reviews (3): Last reviewed commit: "Format AIQ skill eval adapter"

Comment thread .github/skill-eval/skills_eval_agent.py
Comment thread .github/skill-eval/adapters/aiq-research/generate.py
@AjayThorve added the enhancement and AIQ2.2 labels on May 13, 2026
@freshyjmp force-pushed the jonp/aiq-skill-eval-smoke branch from 4379c5e to 45600c6 on May 14, 2026 22:08
@exactlyallan self-assigned this on May 15, 2026
@exactlyallan closed this pull request by merging all changes into develop in 9310327 on May 19, 2026