## Overview

Evaluate the LLM's reasoning process during skill execution: how well did it think through the problem?
## Concept
Given a set of business/functional instructions (e.g., "build a pipeline doing X"), we:
- Execute the skill with the instructions
- Capture the full execution trace (thinking, tool calls, errors, retries)
- Have an LLM judge evaluate the reasoning quality
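The captured trace can be modeled as a small data structure; the field names below (`thinking`, `tool_calls`, `errors`) are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    succeeded: bool
    retries: int = 0

@dataclass
class ExecutionTrace:
    """Everything captured while a skill runs on one instruction file."""
    instructions: str  # the markdown instructions given to the LLM
    thinking: list[str] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)

    @property
    def failure_count(self) -> int:
        # Failed tool calls feed the Efficiency/Recovery criteria below.
        return sum(1 for c in self.tool_calls if not c.succeeded)

trace = ExecutionTrace(instructions="build a pipeline doing X")
trace.tool_calls.append(ToolCall(name="create_table", succeeded=False, retries=2))
trace.tool_calls.append(ToolCall(name="create_table", succeeded=True))
```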
## Test Instructions
**Key principle:** We maintain a fixed set of detailed markdown instruction files that we always run the same way. This ensures:
- Reproducible datasets across runs
- Comparable results between PRs/branches
- Consistent baseline for regression detection
### Instruction Structure

```
tests/eval_scenarios/
├── 01-data-generation.md     # Always runs first
├── 02-sdp-pipeline.md        # Depends on data gen output
├── 03-unstructured-data.md   # Additional data sources
├── 04-knowledge-assistant.md # Depends on unstructured data
├── 05-dashboard.md           # Depends on pipeline tables
├── 06-genie-space.md         # Depends on pipeline tables
└── ...                       # Covers all skills
```
### Dependency Chain

Tasks run in sequence with explicit dependencies:

```
Data Generation → SDP Pipeline → Unstructured Data →   KA    → Dashboard → Genie → ...
      ↓                ↓                ↓              ↓           ↓
  raw data       bronze/silver/    documents,       config     dashboard
  parquet        gold tables       PDFs, etc.       + index    definition
```

Each step's output becomes the next step's input. This mirrors real-world usage and tests the full skill chain.
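The chained execution can be sketched as a small runner that threads each step's output into the next step; the step names and executor signature here are assumptions for illustration:

```python
def run_chain(steps, execute):
    """Run steps in order; each step sees the previous step's output.

    steps:   list of (name, instructions) pairs, already in dependency order.
    execute: callable(instructions, prior_output) -> output for one step.
    """
    prior = None
    results = {}
    for name, instructions in steps:
        prior = execute(instructions, prior)  # step output feeds the next step
        results[name] = prior
    return results

# Dummy executor that just records what it saw, to show the data flow.
out = run_chain(
    [("01-data-generation", "generate raw data"),
     ("02-sdp-pipeline", "build bronze/silver/gold")],
    lambda instr, prior: {"instr": instr, "input": prior},
)
```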
## Evaluation Criteria

| Criteria     | Description                               |
|--------------|-------------------------------------------|
| Efficiency   | How many round trips? Tool call failures? |
| Clarity      | Did the LLM struggle or show confusion?   |
| Recovery     | How well did it handle errors?            |
| Completeness | Did it complete all required steps?       |
| Corrections  | What had to be manually corrected?        |
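The criteria above can be encoded as a rubric and folded into the judge prompt; the prompt wording and 1–5 scale below are assumptions, not a fixed format:

```python
RUBRIC = {
    "efficiency":   "How many round trips? Tool call failures?",
    "clarity":      "Did the LLM struggle or show confusion?",
    "recovery":     "How well did it handle errors?",
    "completeness": "Did it complete all required steps?",
    "corrections":  "What had to be manually corrected?",
}

def judge_prompt(trace_text: str) -> str:
    """Build the prompt sent to the LLM judge (format is illustrative)."""
    lines = [f"- {name}: {question} Score 1-5." for name, question in RUBRIC.items()]
    return (
        "Evaluate the reasoning quality of the following execution trace "
        "against each criterion:\n"
        + "\n".join(lines)
        + "\n\nTrace:\n"
        + trace_text
    )

prompt = judge_prompt("TOOL_CALL create_table\nERROR table exists")
```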
## Implementation
- Execute skills in dependency order with fixed instructions
- Log full conversation trace for each step
- Parse trace for metrics (tool calls, errors, retries)
- Send to LLM judge with evaluation rubric
- Produce structured scores per step and overall
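The trace-parsing step might look like the sketch below; the `TOOL_CALL`/`ERROR`/`RETRY` line prefixes are an assumed trace format, since the real log layout depends on the agent framework:

```python
import re

def parse_metrics(trace_lines):
    """Count coarse metrics from a raw trace; the line prefixes are assumptions."""
    metrics = {"tool_calls": 0, "errors": 0, "retries": 0}
    for line in trace_lines:
        if re.match(r"TOOL_CALL\b", line):
            metrics["tool_calls"] += 1
        elif re.match(r"ERROR\b", line):
            metrics["errors"] += 1
        elif re.match(r"RETRY\b", line):
            metrics["retries"] += 1
    return metrics

m = parse_metrics([
    "TOOL_CALL create_table",
    "ERROR table exists",
    "RETRY create_table",
])
```

These counts feed the Efficiency and Recovery criteria directly, alongside the judge's qualitative scores.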
## Deliverable
A Python tool/script that:
- Runs skills in sequence with predefined markdown instructions
- Captures execution trace for each step
- Evaluates reasoning quality via LLM judge
- Outputs scores and observations per step
- Saves metrics to MLflow
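For the MLflow step, per-criterion scores can be flattened into metric names before logging; the `step/criterion` naming scheme is an assumption:

```python
def to_mlflow_metrics(step_name: str, scores: dict[str, float]) -> dict[str, float]:
    """Flatten per-step judge scores into names suitable for mlflow.log_metrics()."""
    return {f"{step_name}/{criterion}": value for criterion, value in scores.items()}

metrics = to_mlflow_metrics("02-sdp-pipeline", {"efficiency": 4, "recovery": 3})

# In the real tool, these would be logged inside a run, e.g.:
#   with mlflow.start_run(run_name="skill-eval"):
#       mlflow.log_metrics(metrics)
```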
## Acceptance Criteria