Implement AI Dev Kit Test Framework and Evaluation Skill #409

@QuentinAmbard

Description

Milestone: AI Dev Kit Test Framework

Overview

Build a comprehensive test and evaluation framework for AI Dev Kit skills and tools, with an orchestrating skill that ties everything together and enables continuous improvement.

Vision

A complete testing pyramid that validates skills at multiple levels, from basic unit tests to sophisticated LLM-based evaluations, all integrated into the development workflow.

Test Levels

| Level | Type | What It Tests | Tool |
|-------|------|---------------|------|
| 1 | Unit Tests | Individual functions, classes | pytest |
| 2 | Integration Tests | Complete workflows, API interactions | pytest + Databricks |
| 3 | Static Skill Eval | Skill quality without execution | LLM Judge |
| 4 | Thinking Eval | LLM reasoning during execution | LLM Judge |
| 5 | Output Eval | Final artifacts vs. source of truth | LLM Judge |
| 6 | Self-Improvement Loop | Iterative fixes based on eval feedback | Orchestration Skill |
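As a concrete instance of Level 1, a pytest-style unit test for a small helper might look like this (`parse_skill_frontmatter` is a hypothetical illustration, not an actual AI Dev Kit API):

```python
# Level 1 sketch: a unit test for a hypothetical helper used by a skill.

def parse_skill_frontmatter(text: str) -> dict:
    """Parse `key: value` lines from a SKILL.md frontmatter block."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        if key and value:
            fields[key.strip()] = value.strip()
    return fields


def test_parse_skill_frontmatter():
    text = "name: test-framework\ndescription: Orchestrates all test types"
    fields = parse_skill_frontmatter(text)
    assert fields["name"] == "test-framework"
    assert fields["description"].startswith("Orchestrates")
```

Run with `pytest` as usual; Level 2 tests follow the same shape but exercise real workspace interactions.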

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Test Framework Skill                       │
│  (Orchestrates all test types, saves results, suggests fixes)   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────┐ ┌───────────┐ ┌────────┐ ┌──────────┐ ┌──────────┐  │
│  │  Unit  │ │Integration│ │ Static │ │ Thinking │ │  Output  │  │
│  │  Tests │ │   Tests   │ │  Eval  │ │   Eval   │ │   Eval   │  │
│  └───┬────┘ └─────┬─────┘ └───┬────┘ └────┬─────┘ └────┬─────┘  │
│      │            │           │           │            │        │
│      └────────────┴───────────┼───────────┴────────────┘        │
│                               │                                 │
│                       ┌───────▼───────┐                         │
│                       │    MLflow     │                         │
│                       │    Metrics    │                         │
│                       └───────────────┘                         │
└─────────────────────────────────────────────────────────────────┘

Workflow

On-Demand PR Testing

  1. PR is opened/updated
  2. Test framework runs all relevant tests
  3. Results saved to MLflow with branch/commit tags
  4. Compare against previous runs (main branch baseline)
  5. LLM grades each test and provides summary
  6. Self-improvement suggestions generated

Chained Evaluation Flow

For complex evaluations (thinking + output), tests follow a consistent pattern:

  1. Data Generation: Execute data gen skill with test instructions
  2. Downstream Task: Execute dependent skill (pipeline, dashboard, etc.)
  3. Evaluate Both: Assess thinking and output at each step
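The three-step chain above can be sketched as a generic runner that threads each step's output into the next and judges both the reasoning trace and the artifact. The step functions and the `judge` callable are placeholders for real skill invocations and LLM-judge calls:

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    output: str
    thinking_score: float
    output_score: float


def run_chained_eval(steps, judge):
    """Run each step on the previous step's output, judging both the
    reasoning trace (Level 4) and the artifact (Level 5) at every stage."""
    results, upstream = [], None
    for name, run_step in steps:
        output, thinking = run_step(upstream)   # e.g. data gen, then pipeline
        results.append(StepResult(
            name=name,
            output=output,
            thinking_score=judge(thinking),
            output_score=judge(output),
        ))
        upstream = output                       # feed the downstream task
    return results
```

The same runner covers a two-step data-gen → pipeline chain or longer chains ending in a dashboard.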

MLflow Integration

Each test run logs:

  • Run metadata: branch, commit, timestamp, PR number
  • Metrics: pass/fail counts, scores per criteria, execution times
  • Artifacts: full traces, comparison reports, recommendations

Enables:

  • Historical trend analysis
  • PR vs main comparison
  • Regression detection

Self-Improvement Loop (This Issue)

The final piece: an orchestrating skill that:

  1. Runs all test types
  2. Aggregates results
  3. Identifies patterns in failures
  4. Generates fix suggestions
  5. Can iteratively apply fixes and re-test
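A minimal sketch of that loop, with `run_suite`, `suggest_fix`, and `apply_fix` standing in for the orchestrating skill's real tools:

```python
def self_improvement_loop(run_suite, suggest_fix, apply_fix,
                          max_iterations=3, require_approval=None):
    """Run tests, generate fixes for failures, and re-test until the suite
    passes or the iteration budget is exhausted."""
    for iteration in range(max_iterations):
        failures = run_suite()                  # steps 1-2: run and aggregate
        if not failures:
            return {"status": "passed", "iterations": iteration}
        for failure in failures:                # steps 3-4: analyze, suggest
            fix = suggest_fix(failure)
            if require_approval and not require_approval(fix):
                continue                        # fixes gated behind approval
            apply_fix(fix)                      # step 5: apply, then re-test
    return {"status": "failing", "iterations": max_iterations}
```

Bounding the loop with `max_iterations` and an approval hook keeps automated fixing from churning indefinitely or applying unreviewed changes.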

Skill Capabilities

```yaml
skill: test-framework
commands:
  - run-all: Execute full test suite
  - run-unit: Execute unit tests only
  - run-integration: Execute integration tests only
  - run-evals: Execute all LLM evaluations
  - compare-pr: Compare current branch to main
  - suggest-fixes: Generate improvement recommendations
  - auto-fix: Apply suggestions and re-test (with approval)
```
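One possible shape for the command surface is a simple dispatcher over the per-type tools (handler names and return values here are placeholders, not the skill's real interface):

```python
def dispatch(command: str, handlers: dict):
    """Route a skill command (e.g. `run-unit`) to its handler."""
    if command not in handlers:
        raise ValueError(f"unknown command: {command}")
    return handlers[command]()


# Illustrative handlers for the three test-running commands.
handlers = {
    "run-unit": lambda: "unit results",
    "run-integration": lambda: "integration results",
    "run-evals": lambda: "eval results",
}
# `run-all` composes the individual runners in order.
handlers["run-all"] = lambda: [
    handlers[c]() for c in ("run-unit", "run-integration", "run-evals")
]
```

Keeping each test type behind its own command keeps the tools composable, which is what lets `run-all` and `compare-pr` reuse them.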

Deliverables

  1. Python tools: One script per test type (reusable, composable)
  2. Test Framework Skill: SKILL.md that orchestrates everything
  3. MLflow schema: Standardized metrics and artifacts
  4. Documentation: How to add tests, create expectations, interpret results

Acceptance Criteria
