Overview
Evaluate the quality of skill outputs by comparing them against a pre-defined source of truth.
Concept
After a skill execution produces artifacts (dashboards, pipelines, data, etc.), we:
- Download/serialize the final assets
- Compare against expected outputs (source of truth)
- Have an LLM judge evaluate the differences
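The serialize-then-compare step can be sketched as follows. This is a minimal illustration, not the actual implementation; `serialize_artifact` and `diff_against_truth` are hypothetical names, and it assumes artifacts can be represented as plain dicts:

```python
import json

def serialize_artifact(artifact: dict) -> str:
    """Render an artifact as deterministic JSON so diffs are stable across runs."""
    return json.dumps(artifact, indent=2, sort_keys=True)

def diff_against_truth(actual: dict, expected: dict) -> list[str]:
    """Return the top-level keys whose values differ from the source of truth."""
    keys = set(actual) | set(expected)
    return sorted(k for k in keys if actual.get(k) != expected.get(k))
```

The key-level diff is only a pre-filter; the semantic comparison is left to the LLM judge described below.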
Test Instructions
Key principle: We maintain a fixed set of detailed markdown instruction files that we always run the same way. This ensures:
- Reproducible datasets across runs
- Comparable outputs between PRs/branches
- Meaningful source-of-truth comparisons
Instruction Structure
tests/eval_scenarios/
├── 01-data-generation.md # Always runs first
├── 02-sdp-pipeline.md # Depends on data gen output
├── 03-unstructured-data.md # Additional data sources
├── 04-knowledge-assistant.md # Depends on unstructured data
├── 05-dashboard.md # Depends on pipeline tables
├── 06-genie-space.md # Depends on pipeline tables
└── ... # Covers all skills
Dependency Chain
Tasks run in sequence with explicit dependencies:
Data Generation → SDP Pipeline → Unstructured Data →   KA    → Dashboard → Genie → ...
       ↓               ↓                 ↓              ↓           ↓
   raw data       bronze/silver/    documents,       config    dashboard
    parquet        gold tables       PDFs, etc.      + index    definition
Each step produces outputs that can be compared against expected artifacts.
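The sequencing above can be sketched as a small runner that skips any step whose dependency failed. The `CHAIN` table and `run_chain` are illustrative assumptions, not the real dependency declaration:

```python
# Hypothetical dependency table: (step, dependency or None).
CHAIN = [
    ("01-data-generation", None),
    ("02-sdp-pipeline", "01-data-generation"),
    ("03-unstructured-data", "02-sdp-pipeline"),
    ("04-knowledge-assistant", "03-unstructured-data"),
    ("05-dashboard", "02-sdp-pipeline"),
    ("06-genie-space", "02-sdp-pipeline"),
]

def run_chain(run_step) -> dict[str, bool]:
    """Run steps in order; a step whose dependency failed is marked failed too."""
    status: dict[str, bool] = {}
    for step, dep in CHAIN:
        if dep is not None and not status.get(dep, False):
            status[step] = False  # dependency failed or was skipped
            continue
        status[step] = run_step(step)
    return status
```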
Comparison Approach
For each step in the chain, we maintain:
- Source of truth: the expected output, serialized (e.g., dashboard.yaml, pipeline.json)
- Expectations file: a list of mandatory checks and facts the output must satisfy
Expectations File Format
source_of_truth: dashboard.yaml
mandatory_facts:
- "Must have exactly 3 widgets"
- "Must include a date filter"
- "Revenue values must be in USD format"
- "Chart titles must match expected naming"
The LLM judge handles the comparison, so no rigid structural rules are needed: it can interpret intent and determine whether the output meets the expectations semantically.
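One way to wire the mandatory facts into a judge call is to ask for one PASS/FAIL line per fact and parse them back in order. A sketch under stated assumptions: the prompt wording, `build_judge_prompt`, and `parse_verdicts` are hypothetical, and the actual LLM call is left out:

```python
JUDGE_PROMPT = """You are comparing a generated artifact against a source of truth.

Source of truth:
{expected}

Actual output:
{actual}

For each fact below, answer PASS or FAIL on its own line, in order:
{facts}
"""

def build_judge_prompt(expected: str, actual: str, facts: list[str]) -> str:
    fact_lines = "\n".join(f"{i + 1}. {f}" for i, f in enumerate(facts))
    return JUDGE_PROMPT.format(expected=expected, actual=actual, facts=fact_lines)

def parse_verdicts(response: str, n_facts: int) -> list[bool]:
    """Map the judge's PASS/FAIL lines back onto the mandatory facts."""
    lines = [l.strip().upper() for l in response.splitlines() if l.strip()]
    verdicts = ["PASS" in l for l in lines[:n_facts]]
    # Treat missing answers as failures rather than silent passes.
    verdicts += [False] * (n_facts - len(verdicts))
    return verdicts
```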
Implementation
- Run skills in dependency order with fixed instructions
- Serialize outputs at each step to comparable format (JSON/YAML)
- Load source of truth and expectations for each step
- Send both to LLM judge with the mandatory facts
- LLM determines if each fact is satisfied
- Produce diff report with scores
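Tying the steps above together, a single evaluation step might look like the following sketch. The helper callables (`serialize`, `load_truth`, `load_facts`, `judge`) are assumed interfaces, not existing functions:

```python
def evaluate_step(step, serialize, load_truth, load_facts, judge) -> dict:
    """Serialize one step's output, judge it against the source of truth,
    and return a report entry matching the output format below."""
    actual = serialize(step)        # current run's artifact, serialized
    expected = load_truth(step)     # source-of-truth artifact
    facts = load_facts(step)        # mandatory facts for this step
    results = judge(expected, actual, facts)  # one bool per fact
    score = sum(results) / len(results) if results else 1.0
    return {
        "step": step,
        "overall_match": score,
        "mandatory_facts": [{"fact": f, "passed": p} for f, p in zip(facts, results)],
    }
```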
Deliverable
A Python tool/script that:
- Downloads/serializes skill outputs at each step
- Compares against source of truth per step
- Validates mandatory facts via LLM judge
- Outputs detailed comparison report (per step and overall)
- Saves metrics to MLflow
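For the MLflow part, one option is to flatten the per-step report into scalar metrics first. `report_to_metrics` and the metric naming scheme are assumptions for illustration; only the commented-out `mlflow.log_metric` calls are real MLflow API:

```python
def report_to_metrics(report: dict) -> dict[str, float]:
    """Flatten the per-step report into MLflow-style scalar metrics."""
    metrics: dict[str, float] = {}
    for step in report["steps"]:
        name = step["step"].replace("-", "_")
        metrics[f"{name}.overall_match"] = step["overall_match"]
        facts = step["mandatory_facts"]
        passed = sum(1 for f in facts if f["passed"])
        metrics[f"{name}.facts_passed_ratio"] = passed / len(facts) if facts else 1.0
    return metrics

# Logging is then one call per metric, e.g.:
# import mlflow
# with mlflow.start_run(run_name="full_chain_eval"):
#     for key, value in report_to_metrics(report).items():
#         mlflow.log_metric(key, value)
```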
Output Format
test_name: "full_chain_eval"
steps:
  - step: "01-data-generation"
    overall_match: 0.95
    mandatory_facts:
      - fact: "Must generate ~50K customers"
        passed: true
  - step: "05-dashboard"
    overall_match: 0.85
    mandatory_facts:
      - fact: "Must have 3 widgets"
        passed: true
      - fact: "Revenue in USD format"
        passed: false
        details: "Revenue displayed as raw numbers"
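A report in this shape can be aggregated into an overall score plus a list of failed facts for the final summary. `summarize` is a hypothetical helper operating on the parsed YAML:

```python
def summarize(report: dict) -> dict:
    """Aggregate per-step scores and collect failed facts from the report."""
    steps = report["steps"]
    failed = [(s["step"], f["fact"], f.get("details", ""))
              for s in steps
              for f in s["mandatory_facts"] if not f["passed"]]
    return {
        "mean_match": sum(s["overall_match"] for s in steps) / len(steps),
        "failed_facts": failed,
    }
```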
Acceptance Criteria