## Overview

Evaluate the LLM's reasoning process during skill execution: how well did it think through the problem?
## Concept
Given a set of business/functional instructions (e.g., "build a pipeline doing X"), we:
- Execute the skill with the instructions
- Capture the full execution trace (thinking, tool calls, errors, retries)
- Have an LLM judge evaluate the reasoning quality
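The captured trace can be modeled as a small data structure; the field names below (`thinking`, `tool_calls`, `errors`) are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    succeeded: bool
    retries: int = 0

@dataclass
class ExecutionTrace:
    """Everything captured while a skill runs on one instruction file."""
    instructions: str  # the markdown instructions given to the LLM
    thinking: list[str] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)

    @property
    def failure_count(self) -> int:
        # Failed tool calls feed the Efficiency/Recovery criteria below.
        return sum(1 for c in self.tool_calls if not c.succeeded)

trace = ExecutionTrace(instructions="build a pipeline doing X")
trace.tool_calls.append(ToolCall(name="create_table", succeeded=False, retries=2))
trace.tool_calls.append(ToolCall(name="create_table", succeeded=True))
```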
## Test Instructions
**Key principle:** We maintain a fixed set of detailed markdown instruction files that we always run the same way. This ensures:
- Reproducible datasets across runs
- Comparable results between PRs/branches
- Consistent baseline for regression detection
### Instruction Structure

```
tests/eval_scenarios/
├── 01-data-generation.md     # Always runs first
├── 02-sdp-pipeline.md        # Depends on data gen output
├── 03-unstructured-data.md   # Additional data sources
├── 04-knowledge-assistant.md # Depends on unstructured data
├── 05-dashboard.md           # Depends on pipeline tables
├── 06-genie-space.md         # Depends on pipeline tables
└── ...                       # Covers all skills
```
### Dependency Chain

Tasks run in sequence with explicit dependencies:

```
Data Generation → SDP Pipeline → Unstructured Data →   KA    → Dashboard → Genie → ...
      ↓                ↓                ↓              ↓           ↓
  raw data       bronze/silver/    documents,       config     dashboard
  parquet        gold tables       PDFs, etc.       + index    definition
```

Each step's output becomes the next step's input. This mirrors real-world usage and tests the full skill chain.
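The chained execution can be sketched as a small runner that threads each step's output into the next step; the step names and executor signature here are assumptions for illustration:

```python
def run_chain(steps, execute):
    """Run steps in order; each step sees the previous step's output.

    steps:   list of (name, instructions) pairs, already in dependency order.
    execute: callable(instructions, prior_output) -> output for one step.
    """
    prior = None
    results = {}
    for name, instructions in steps:
        prior = execute(instructions, prior)  # step output feeds the next step
        results[name] = prior
    return results

# Dummy executor that just records what it saw, to show the data flow.
out = run_chain(
    [("01-data-generation", "generate raw data"),
     ("02-sdp-pipeline", "build bronze/silver/gold")],
    lambda instr, prior: {"instr": instr, "input": prior},
)
```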
## Evaluation Criteria

| Criteria     | Description                               |
|--------------|-------------------------------------------|
| Efficiency   | How many round trips? Tool call failures? |
| Clarity      | Did the LLM struggle or show confusion?   |
| Recovery     | How well did it handle errors?            |
| Completeness | Did it complete all required steps?       |
| Corrections  | What had to be manually corrected?        |
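The criteria above can be encoded as a rubric and folded into the judge prompt; the prompt wording and 1–5 scale below are assumptions, not a fixed format:

```python
RUBRIC = {
    "efficiency":   "How many round trips? Tool call failures?",
    "clarity":      "Did the LLM struggle or show confusion?",
    "recovery":     "How well did it handle errors?",
    "completeness": "Did it complete all required steps?",
    "corrections":  "What had to be manually corrected?",
}

def judge_prompt(trace_text: str) -> str:
    """Build the prompt sent to the LLM judge (format is illustrative)."""
    lines = [f"- {name}: {question} Score 1-5." for name, question in RUBRIC.items()]
    return (
        "Evaluate the reasoning quality of the following execution trace "
        "against each criterion:\n"
        + "\n".join(lines)
        + "\n\nTrace:\n"
        + trace_text
    )

prompt = judge_prompt("TOOL_CALL create_table\nERROR table exists")
```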
## Implementation
- Execute skills in dependency order with fixed instructions
- Log full conversation trace for each step
- Parse trace for metrics (tool calls, errors, retries)
- Send to LLM judge with evaluation rubric
- Produce structured scores per step and overall
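The trace-parsing step might look like the sketch below; the `TOOL_CALL`/`ERROR`/`RETRY` line prefixes are an assumed trace format, since the real log layout depends on the agent framework:

```python
import re

def parse_metrics(trace_lines):
    """Count coarse metrics from a raw trace; the line prefixes are assumptions."""
    metrics = {"tool_calls": 0, "errors": 0, "retries": 0}
    for line in trace_lines:
        if re.match(r"TOOL_CALL\b", line):
            metrics["tool_calls"] += 1
        elif re.match(r"ERROR\b", line):
            metrics["errors"] += 1
        elif re.match(r"RETRY\b", line):
            metrics["retries"] += 1
    return metrics

m = parse_metrics([
    "TOOL_CALL create_table",
    "ERROR table exists",
    "RETRY create_table",
])
```

These counts feed the Efficiency and Recovery criteria directly, alongside the judge's qualitative scores.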
## Deliverable
A Python tool/script that:
- Runs skills in sequence with predefined markdown instructions
- Captures execution trace for each step
- Evaluates reasoning quality via LLM judge
- Outputs scores and observations per step
- Saves metrics to MLflow
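For the MLflow step, per-criterion scores can be flattened into metric names before logging; the `step/criterion` naming scheme is an assumption:

```python
def to_mlflow_metrics(step_name: str, scores: dict[str, float]) -> dict[str, float]:
    """Flatten per-step judge scores into names suitable for mlflow.log_metrics()."""
    return {f"{step_name}/{criterion}": value for criterion, value in scores.items()}

metrics = to_mlflow_metrics("02-sdp-pipeline", {"efficiency": 4, "recovery": 3})

# In the real tool, these would be logged inside a run, e.g.:
#   with mlflow.start_run(run_name="skill-eval"):
#       mlflow.log_metrics(metrics)
```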
## Acceptance Criteria