Thinking Eval - LLM Reasoning Process Assessment #407

@QuentinAmbard

Description

Overview

Evaluate the LLM's reasoning process during skill execution - how well did it think through the problem?

Concept

Given a set of business/functional instructions (e.g., "build a pipeline doing X"), we:

  1. Execute the skill with the instructions
  2. Capture the full execution trace (thinking, tool calls, errors, retries)
  3. Have an LLM judge evaluate the reasoning quality
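The three steps above can be sketched as a minimal execute → capture → judge loop. Everything here is a hypothetical placeholder (`run_skill`, `judge_reasoning`, the trace event schema); a real implementation would drive the agent and call an actual judge model.

```python
# Sketch of the execute -> capture -> judge loop.
# run_skill() and judge_reasoning() are hypothetical placeholders.

def run_skill(instructions: str) -> list[dict]:
    """Execute the skill and return the full trace (thinking, tool calls, errors)."""
    # Placeholder: a real implementation would drive the agent here.
    return [
        {"type": "thinking", "text": "Plan the pipeline steps"},
        {"type": "tool_call", "name": "create_table", "error": None},
    ]

def judge_reasoning(trace: list[dict]) -> dict:
    """Send the trace to an LLM judge; here, a stub that only counts failures."""
    failures = sum(1 for event in trace if event.get("error"))
    return {"efficiency": 5 - failures, "observations": f"{failures} tool failures"}

trace = run_skill("build a pipeline doing X")
scores = judge_reasoning(trace)
```

The real judge call would replace the counting stub with a rubric-driven LLM request over the serialized trace.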

Test Instructions

Key principle: We maintain a fixed set of detailed markdown instruction files that we always run the same way. This ensures:

  • Reproducible datasets across runs
  • Comparable results between PRs/branches
  • Consistent baseline for regression detection

Instruction Structure

tests/eval_scenarios/
├── 01-data-generation.md      # Always runs first
├── 02-sdp-pipeline.md         # Depends on data gen output
├── 03-unstructured-data.md    # Additional data sources
├── 04-knowledge-assistant.md  # Depends on unstructured data
├── 05-dashboard.md            # Depends on pipeline tables
├── 06-genie-space.md          # Depends on pipeline tables
└── ...                        # Covers all skills

Dependency Chain

Tasks run in sequence with explicit dependencies:

Data Generation → SDP Pipeline → Unstructured Data → KA → Dashboard → Genie → ...
      ↓               ↓                ↓              ↓        ↓
   raw data      bronze/silver/    documents,     config    dashboard
   parquet       gold tables       PDFs, etc.     + index   definition

Each step's output becomes the next step's input. This mirrors real-world usage and tests the full skill chain.
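Because the scenario files carry numeric prefixes, lexicographic ordering is enough to enforce the dependency chain. A sketch of that runner, where each step's output is accumulated into a context dict that later steps can read (`execute_step` is a hypothetical executor, stubbed here):

```python
from pathlib import Path

def execute_step(instructions: str, context: dict) -> dict:
    """Hypothetical executor stub; records which upstream outputs were visible."""
    return {"ok": True, "upstream": list(context)}

def run_chain(scenario_dir: str) -> dict[str, dict]:
    """Run *.md scenarios in sorted order so 01-, 02-, ... prefixes set the sequence."""
    context: dict = {}            # accumulated outputs from earlier steps
    results: dict[str, dict] = {}
    for path in sorted(Path(scenario_dir).glob("*.md")):
        output = execute_step(path.read_text(), context)
        context[path.stem] = output   # this step's output feeds later steps
        results[path.stem] = output
    return results
```

Keeping the ordering in the filenames (rather than in code) means adding a new skill scenario only requires dropping a new numbered file into the directory.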

Evaluation Criteria

Criteria       Description
------------   -----------------------------------------
Efficiency     How many round trips? Tool call failures?
Clarity        Did the LLM struggle or show confusion?
Recovery       How well did it handle errors?
Completeness   Did it complete all required steps?
Corrections    What had to be manually corrected?
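The criteria above can be assembled directly into the judge prompt. A hedged sketch (the 1-5 scale and prompt wording are assumptions, not decided in this issue):

```python
# Rubric taken verbatim from the evaluation criteria table;
# the scoring scale and prompt phrasing are assumptions.
RUBRIC = {
    "efficiency": "How many round trips? Tool call failures?",
    "clarity": "Did the LLM struggle or show confusion?",
    "recovery": "How well did it handle errors?",
    "completeness": "Did it complete all required steps?",
    "corrections": "What had to be manually corrected?",
}

def build_judge_prompt(trace_text: str) -> str:
    """Render the rubric and the captured trace into a single judge prompt."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
    return (
        "Score the following execution trace from 1-5 on each criterion "
        "and justify each score.\n"
        f"Criteria:\n{criteria}\n\nTrace:\n{trace_text}"
    )
```

Keeping the rubric as data makes it easy to version alongside the scenario files and compare judge behavior across PRs.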

Implementation

  • Execute skills in dependency order with fixed instructions
  • Log full conversation trace for each step
  • Parse trace for metrics (tool calls, errors, retries)
  • Send to LLM judge with evaluation rubric
  • Produce structured scores per step and overall
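The "parse trace for metrics" step can be deterministic, computed before the LLM judge ever sees the trace. A sketch, assuming a flat list of trace events with `type`, `error`, and `retry_of` fields (that schema is an assumption):

```python
# Deterministic metric extraction from a captured trace.
# The event schema (type / error / retry_of) is an assumption.
def trace_metrics(trace: list[dict]) -> dict[str, int]:
    return {
        "tool_calls": sum(1 for e in trace if e["type"] == "tool_call"),
        "errors": sum(1 for e in trace if e.get("error")),
        "retries": sum(1 for e in trace if e.get("retry_of") is not None),
    }
```

Hard counts like these complement the judge's subjective scores and make regressions between branches directly comparable.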

Deliverable

A Python tool/script that:

  1. Runs skills in sequence with predefined markdown instructions
  2. Captures execution trace for each step
  3. Evaluates reasoning quality via LLM judge
  4. Outputs scores and observations per step
  5. Saves metrics to MLflow
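For step 5, per-step judge scores need flattening into MLflow metric names. A hedged sketch: `mlflow.log_metric` and `mlflow.start_run` are real MLflow APIs, but the run name and `step.criterion` naming scheme are assumptions.

```python
# Flatten per-step judge scores into "step.criterion" metric names,
# then log them in one MLflow run. Naming scheme is an assumption.
def flatten_scores(step: str, scores: dict[str, float]) -> dict[str, float]:
    return {f"{step}.{criterion}": value for criterion, value in scores.items()}

def log_to_mlflow(all_scores: dict[str, dict[str, float]]) -> None:
    import mlflow  # imported lazily so the helper above stays dependency-free
    with mlflow.start_run(run_name="thinking-eval"):
        for step, scores in all_scores.items():
            for name, value in flatten_scores(step, scores).items():
                mlflow.log_metric(name, value)
```

Logging everything under one run keeps a full chain execution comparable run-to-run in the MLflow UI.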

Acceptance Criteria

  • Fixed instruction files created for each skill
  • Dependency chain execution implemented
  • Execution trace capture implemented
  • LLM judge prompt for thinking eval created
  • MLflow metrics logging implemented
