Overview
Evaluate the quality of skill outputs by comparing them against a pre-defined source of truth.
Concept
After a skill execution produces artifacts (dashboards, pipelines, data, etc.), we:
- Download/serialize the final assets
- Compare against expected outputs (source of truth)
- Have an LLM judge evaluate the differences
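The serialize-then-compare step can be sketched as follows. This is a minimal illustration, not the actual implementation; `serialize_artifact` and `diff_against_truth` are hypothetical names, and it assumes artifacts can be represented as plain dicts:

```python
import json

def serialize_artifact(artifact: dict) -> str:
    """Render an artifact as deterministic JSON so diffs are stable across runs."""
    return json.dumps(artifact, indent=2, sort_keys=True)

def diff_against_truth(actual: dict, expected: dict) -> list[str]:
    """Return the top-level keys whose values differ from the source of truth."""
    keys = set(actual) | set(expected)
    return sorted(k for k in keys if actual.get(k) != expected.get(k))
```

The key-level diff is only a pre-filter; the semantic comparison is left to the LLM judge described below.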
Test Instructions
Key principle: We maintain a fixed set of detailed markdown instruction files that we always run the same way. This ensures:
- Reproducible datasets across runs
- Comparable outputs between PRs/branches
- Meaningful source-of-truth comparisons
Instruction Structure
tests/eval_scenarios/
├── 01-data-generation.md # Always runs first
├── 02-sdp-pipeline.md # Depends on data gen output
├── 03-unstructured-data.md # Additional data sources
├── 04-knowledge-assistant.md # Depends on unstructured data
├── 05-dashboard.md # Depends on pipeline tables
├── 06-genie-space.md # Depends on pipeline tables
└── ... # Covers all skills
Dependency Chain
Tasks run in sequence with explicit dependencies:
Data Generation → SDP Pipeline → Unstructured Data →   KA    → Dashboard → Genie → ...
       ↓               ↓                 ↓              ↓           ↓
   raw data       bronze/silver/    documents,       config    dashboard
    parquet        gold tables       PDFs, etc.      + index    definition
Each step produces outputs that can be compared against expected artifacts.
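The sequencing above can be sketched as a small runner that skips any step whose dependency failed. The `CHAIN` table and `run_chain` are illustrative assumptions, not the real dependency declaration:

```python
# Hypothetical dependency table: (step, dependency or None).
CHAIN = [
    ("01-data-generation", None),
    ("02-sdp-pipeline", "01-data-generation"),
    ("03-unstructured-data", "02-sdp-pipeline"),
    ("04-knowledge-assistant", "03-unstructured-data"),
    ("05-dashboard", "02-sdp-pipeline"),
    ("06-genie-space", "02-sdp-pipeline"),
]

def run_chain(run_step) -> dict[str, bool]:
    """Run steps in order; a step whose dependency failed is marked failed too."""
    status: dict[str, bool] = {}
    for step, dep in CHAIN:
        if dep is not None and not status.get(dep, False):
            status[step] = False  # dependency failed or was skipped
            continue
        status[step] = run_step(step)
    return status
```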
Comparison Approach
For each step in the chain, we maintain:
- Source of truth: the expected output, serialized (e.g., dashboard.yaml, pipeline.json)
- Expectations file: a list of mandatory checks and facts the output must satisfy
Expectations File Format
source_of_truth: dashboard.yaml
mandatory_facts:
- "Must have exactly 3 widgets"
- "Must include a date filter"
- "Revenue values must be in USD format"
- "Chart titles must match expected naming"
The LLM judge handles the comparison, so no rigid structural rules are needed: it can interpret intent and determine whether the output meets the expectations semantically.
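One way to wire the mandatory facts into a judge call is to ask for one PASS/FAIL line per fact and parse them back in order. A sketch under stated assumptions: the prompt wording, `build_judge_prompt`, and `parse_verdicts` are hypothetical, and the actual LLM call is left out:

```python
JUDGE_PROMPT = """You are comparing a generated artifact against a source of truth.

Source of truth:
{expected}

Actual output:
{actual}

For each fact below, answer PASS or FAIL on its own line, in order:
{facts}
"""

def build_judge_prompt(expected: str, actual: str, facts: list[str]) -> str:
    fact_lines = "\n".join(f"{i + 1}. {f}" for i, f in enumerate(facts))
    return JUDGE_PROMPT.format(expected=expected, actual=actual, facts=fact_lines)

def parse_verdicts(response: str, n_facts: int) -> list[bool]:
    """Map the judge's PASS/FAIL lines back onto the mandatory facts."""
    lines = [l.strip().upper() for l in response.splitlines() if l.strip()]
    verdicts = ["PASS" in l for l in lines[:n_facts]]
    # Treat missing answers as failures rather than silent passes.
    verdicts += [False] * (n_facts - len(verdicts))
    return verdicts
```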
Implementation
- Run skills in dependency order with fixed instructions
- Serialize outputs at each step to comparable format (JSON/YAML)
- Load source of truth and expectations for each step
- Send both to LLM judge with the mandatory facts
- LLM determines if each fact is satisfied
- Produce diff report with scores
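Tying the steps above together, a single evaluation step might look like the following sketch. The helper callables (`serialize`, `load_truth`, `load_facts`, `judge`) are assumed interfaces, not existing functions:

```python
def evaluate_step(step, serialize, load_truth, load_facts, judge) -> dict:
    """Serialize one step's output, judge it against the source of truth,
    and return a report entry matching the output format below."""
    actual = serialize(step)        # current run's artifact, serialized
    expected = load_truth(step)     # source-of-truth artifact
    facts = load_facts(step)        # mandatory facts for this step
    results = judge(expected, actual, facts)  # one bool per fact
    score = sum(results) / len(results) if results else 1.0
    return {
        "step": step,
        "overall_match": score,
        "mandatory_facts": [{"fact": f, "passed": p} for f, p in zip(facts, results)],
    }
```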
Deliverable
A Python tool/script that:
- Downloads/serializes skill outputs at each step
- Compares against source of truth per step
- Validates mandatory facts via LLM judge
- Outputs detailed comparison report (per step and overall)
- Saves metrics to MLflow
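For the MLflow part, one option is to flatten the per-step report into scalar metrics first. `report_to_metrics` and the metric naming scheme are assumptions for illustration; only the commented-out `mlflow.log_metric` calls are real MLflow API:

```python
def report_to_metrics(report: dict) -> dict[str, float]:
    """Flatten the per-step report into MLflow-style scalar metrics."""
    metrics: dict[str, float] = {}
    for step in report["steps"]:
        name = step["step"].replace("-", "_")
        metrics[f"{name}.overall_match"] = step["overall_match"]
        facts = step["mandatory_facts"]
        passed = sum(1 for f in facts if f["passed"])
        metrics[f"{name}.facts_passed_ratio"] = passed / len(facts) if facts else 1.0
    return metrics

# Logging is then one call per metric, e.g.:
# import mlflow
# with mlflow.start_run(run_name="full_chain_eval"):
#     for key, value in report_to_metrics(report).items():
#         mlflow.log_metric(key, value)
```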
Output Format
test_name: "full_chain_eval"
steps:
  - step: "01-data-generation"
    overall_match: 0.95
    mandatory_facts:
      - fact: "Must generate ~50K customers"
        passed: true
  - step: "05-dashboard"
    overall_match: 0.85
    mandatory_facts:
      - fact: "Must have 3 widgets"
        passed: true
      - fact: "Revenue in USD format"
        passed: false
        details: "Revenue displayed as raw numbers"
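A report in this shape can be aggregated into an overall score plus a list of failed facts for the final summary. `summarize` is a hypothetical helper operating on the parsed YAML:

```python
def summarize(report: dict) -> dict:
    """Aggregate per-step scores and collect failed facts from the report."""
    steps = report["steps"]
    failed = [(s["step"], f["fact"], f.get("details", ""))
              for s in steps
              for f in s["mandatory_facts"] if not f["passed"]]
    return {
        "mean_match": sum(s["overall_match"] for s in steps) / len(steps),
        "failed_facts": failed,
    }
```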
Acceptance Criteria