Output Eval - Source of Truth Comparison #408

@QuentinAmbard

Description

Overview

Evaluate the quality of skill outputs by comparing them against a pre-defined source of truth.

Concept

After a skill execution produces artifacts (dashboards, pipelines, data, etc.), we:

  1. Download/serialize the final assets
  2. Compare against expected outputs (source of truth)
  3. Have an LLM judge evaluate the differences

Test Instructions

Key principle: We maintain a fixed set of detailed markdown instruction files that we always run the same way. This ensures:

  • Reproducible datasets across runs
  • Comparable outputs between PRs/branches
  • Meaningful source-of-truth comparisons

Instruction Structure

tests/eval_scenarios/
├── 01-data-generation.md      # Always runs first
├── 02-sdp-pipeline.md         # Depends on data gen output
├── 03-unstructured-data.md    # Additional data sources
├── 04-knowledge-assistant.md  # Depends on unstructured data
├── 05-dashboard.md            # Depends on pipeline tables
├── 06-genie-space.md          # Depends on pipeline tables
└── ...                        # Covers all skills

Dependency Chain

Tasks run in sequence with explicit dependencies:

Data Generation → SDP Pipeline → Unstructured Data → KA → Dashboard → Genie → ...
      ↓               ↓                ↓              ↓        ↓
   raw data      bronze/silver/    documents,     config    dashboard
   parquet       gold tables       PDFs, etc.     + index   definition

Each step produces outputs that can be compared against expected artifacts.
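The sequential chain above can be sketched as a simple ordered runner. This is a minimal sketch, not the actual harness: `run_chain` and the `run_skill` callable are hypothetical names, and the step list mirrors the scenario tree shown earlier.

```python
from pathlib import Path

# Fixed, ordered instruction files (hypothetical names matching the tree
# above); each step may consume artifacts produced by earlier steps.
STEPS = [
    "01-data-generation.md",
    "02-sdp-pipeline.md",
    "03-unstructured-data.md",
    "04-knowledge-assistant.md",
    "05-dashboard.md",
    "06-genie-space.md",
]

def run_chain(scenario_dir: Path, run_skill) -> dict:
    """Run each step in order, collecting the serialized output per step.

    `run_skill` is a caller-supplied callable:
    (instruction_path, prior_outputs) -> serialized_output.
    """
    outputs: dict = {}
    for step in STEPS:
        # Every step sees the outputs of all previous steps, making the
        # dependency chain explicit.
        outputs[step] = run_skill(scenario_dir / step, dict(outputs))
    return outputs
```

Keeping the step order in one list makes the "always run the same way" principle enforceable in code: a new skill gets a new instruction file appended to the list, nothing else changes.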

Comparison Approach

For each step in the chain, we maintain:

  • Source of truth: Expected output serialized (e.g., dashboard.yaml, pipeline.json)
  • Expectations file: Mandatory checks and facts

Expectations File Format

source_of_truth: dashboard.yaml
mandatory_facts:
  - "Must have exactly 3 widgets"
  - "Must include a date filter"
  - "Revenue values must be in USD format"
  - "Chart titles must match expected naming"
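Loading an expectations file might look like the sketch below. The `Expectations` dataclass and `parse_expectations` helper are assumptions for illustration; the input is the mapping a YAML parser (e.g. `yaml.safe_load`) would produce from the file above.

```python
from dataclasses import dataclass

@dataclass
class Expectations:
    source_of_truth: str          # path to the expected serialized artifact
    mandatory_facts: list         # human-readable checks for the LLM judge

def parse_expectations(data: dict) -> Expectations:
    """Validate the mapping parsed from an expectations file.

    Fails fast on malformed files so a broken expectation never silently
    passes an eval run.
    """
    if not isinstance(data.get("source_of_truth"), str):
        raise ValueError("expectations file needs a 'source_of_truth' path")
    facts = data.get("mandatory_facts") or []
    if not facts:
        raise ValueError("expectations file needs at least one mandatory fact")
    return Expectations(data["source_of_truth"], [str(f) for f in facts])
```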

The LLM judge handles the comparison, so rigid structural rules are unnecessary: it can interpret intent and determine whether the output satisfies each expectation semantically.
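One way to structure the judge call is to put both serialized artifacts and the fact list into a single prompt and ask for machine-readable verdicts. The prompt wording and the requested JSON shape below are illustrative assumptions, not a fixed contract:

```python
def build_judge_prompt(expected: str, actual: str, facts: list) -> str:
    """Assemble the comparison prompt sent to the LLM judge.

    The judge is asked to return one verdict per mandatory fact as JSON so
    the harness can parse pass/fail results per fact.
    """
    fact_list = "\n".join(f"{i + 1}. {fact}" for i, fact in enumerate(facts))
    return (
        "Compare the actual output against the expected source of truth.\n"
        "Judge semantic equivalence, not byte-for-byte equality.\n\n"
        f"EXPECTED (source of truth):\n{expected}\n\n"
        f"ACTUAL (skill output):\n{actual}\n\n"
        f"MANDATORY FACTS:\n{fact_list}\n\n"
        'Reply with JSON only: {"facts": [{"fact": str, "passed": bool, '
        '"details": str}], "overall_match": float}'
    )
```

Numbering the facts in the prompt makes it easy to align the judge's verdicts back to the expectations file when building the diff report.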

Implementation

  • Run skills in dependency order with fixed instructions
  • Serialize outputs at each step to comparable format (JSON/YAML)
  • Load source of truth and expectations for each step
  • Send both to LLM judge with the mandatory facts
  • LLM determines if each fact is satisfied
  • Produce diff report with scores

Deliverable

A Python tool/script that:

  1. Downloads/serializes skill outputs at each step
  2. Compares against source of truth per step
  3. Validates mandatory facts via LLM judge
  4. Outputs detailed comparison report (per step and overall)
  5. Saves metrics to MLflow
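For the MLflow step, the per-step report can be flattened into scalar metrics before logging. The `report_to_metrics` helper and its key naming scheme are assumptions; only the commented-out `mlflow.log_metrics` call at the end is the real MLflow API.

```python
def report_to_metrics(report: dict) -> dict:
    """Flatten the per-step report into scalar metrics.

    Produces e.g. {"01-data-generation/overall_match": 0.95,
                   "01-data-generation/facts_passed_ratio": 1.0, ...}
    (slashes are legal in MLflow metric names).
    """
    metrics = {}
    for step in report["steps"]:
        name = step["step"]
        facts = step["mandatory_facts"]
        metrics[f"{name}/overall_match"] = float(step["overall_match"])
        metrics[f"{name}/facts_passed_ratio"] = (
            sum(f["passed"] for f in facts) / len(facts) if facts else 1.0
        )
    return metrics

# In the harness, this dict would be logged inside an MLflow run:
#   with mlflow.start_run(run_name=report["test_name"]):
#       mlflow.log_metrics(report_to_metrics(report))
```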

Output Format

test_name: "full_chain_eval"
steps:
  - step: "01-data-generation"
    overall_match: 0.95
    mandatory_facts:
      - fact: "Must generate ~50K customers"
        passed: true
  - step: "05-dashboard"
    overall_match: 0.85
    mandatory_facts:
      - fact: "Must have 3 widgets"
        passed: true
      - fact: "Revenue in USD format"
        passed: false
        details: "Revenue displayed as raw numbers"
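A report in the format above rolls up naturally into a single chain-level verdict. The `summarize` function below is a hypothetical sketch of that aggregation, assuming the report is parsed into a dict with the structure shown:

```python
def summarize(report: dict) -> dict:
    """Roll per-step results up into one pass/fail summary for the chain."""
    steps = report["steps"]
    # Collect every mandatory fact that the LLM judge marked as failed.
    failed = [
        (s["step"], f["fact"])
        for s in steps
        for f in s["mandatory_facts"]
        if not f["passed"]
    ]
    return {
        "mean_overall_match": sum(s["overall_match"] for s in steps) / len(steps),
        "failed_facts": failed,
        # The chain passes only if every mandatory fact in every step passed.
        "passed": not failed,
    }
```

With the example report above, the dashboard step's USD-format failure would make the whole chain fail even though both match scores are high, which is the point of keeping mandatory facts separate from the fuzzy match score.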

Acceptance Criteria

  • Fixed instruction files created for each skill
  • Source of truth artifacts created for each step
  • Asset serialization for key resource types
  • Expectations file format defined
  • LLM comparison implemented
  • MLflow metrics logging implemented
