
feat: error analysis with reasoning steps#65

Merged: EYH0602 merged 6 commits into release-0.1.0 from feat-analysis-rsn-err on Sep 10, 2025


EYH0602 (Member) commented Sep 10, 2025

No description provided.

EYH0602 requested a review from Copilot on September 10, 2025, 23:34

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds error analysis functionality with reasoning steps to the tfbench library. The system now captures detailed error information when type inference tasks fail and provides structured analysis of incorrect answers using OpenAI's API.

  • Enhanced evaluation functions to return Result types with error messages instead of simple boolean values
  • Added comprehensive error analysis module with predefined error categories and OpenAI-based classification
  • Created visualization scripts for plotting error distribution across models and dataset splits
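
The first bullet can be illustrated with a minimal sketch. It assumes a hand-rolled `Ok`/`Err` pair rather than whatever Result library the PR actually uses, and a whitespace-insensitive string comparison as a stand-in for the real checker. The names `evaluate_one`, `collect_incorrect`, and `normalize` are hypothetical and are not taken from `src/tfbench/evaluation.py`.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Ok:
    """Successful evaluation; carries no error message."""


@dataclass(frozen=True)
class Err:
    """Failed evaluation; `reason` explains why the answer was rejected."""

    reason: str


Result = Union[Ok, Err]


def normalize(signature: str) -> str:
    """Collapse runs of whitespace so trivially different spellings match."""
    return " ".join(signature.split())


def evaluate_one(ground_truth: str, answer: Optional[str]) -> Result:
    """Hypothetical checker: returns Err with a message instead of False."""
    if answer is None or not answer.strip():
        return Err("empty answer")
    expected, got = normalize(ground_truth), normalize(answer)
    if expected != got:
        return Err(f"expected `{expected}`, got `{got}`")
    return Ok()


def collect_incorrect(ground_truths, answers):
    """Return (ground_truth, answer, reason) triples for failed evaluations."""
    incorrect = []
    for truth, answer in zip(ground_truths, answers):
        result = evaluate_one(truth, answer)
        if isinstance(result, Err):
            incorrect.append((truth, answer, result.reason))
    return incorrect
```

Compared with a boolean, the `Err` value keeps the failure reason next to the failing task, so a later analysis step can consume it without re-running the check.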

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `src/tfbench/evaluation.py` | Modified evaluation functions to return Result types with error messages and updated get_incorrect to include error reasons |
| `src/tfbench/error_analysis.py` | New module implementing error classification system with OpenAI integration and predefined error categories |
| `src/tfbench/__init__.py` | Removed deprecated evaluate function from exports |
| `scripts/error_plot.py` | New script for generating pie chart visualizations of error category distributions |
| `scripts/error_cls.py` | Empty file for error classification functionality |
| `scripts/error_analysis.py` | New script for running error analysis on incorrect benchmark results |
| `pyproject.toml` | Added pokepalette dependency for color palettes in visualization |
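
The classification flow described for `error_analysis.py` might look roughly like the sketch below. Everything in it is an assumption made for illustration: the category labels are invented, the prompt wording is made up, and the model call is passed in as a plain callable (`ask_model`) instead of a real OpenAI client, so no actual API signature is implied.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Iterable, Tuple


class ErrorCategory(str, Enum):
    # Hypothetical labels; the real module defines its own categories.
    SYNTAX = "syntax"
    WRONG_ARITY = "wrong_arity"
    WRONG_CONSTRAINT = "wrong_constraint"
    OTHER = "other"


def build_prompt(task: str, answer: str, reason: str) -> str:
    """Ask for exactly one label so the reply is easy to parse."""
    labels = ", ".join(c.value for c in ErrorCategory)
    return (
        "Classify the error in this type-inference answer.\n"
        f"Task: {task}\nAnswer: {answer}\nChecker message: {reason}\n"
        f"Reply with exactly one of: {labels}"
    )


def parse_category(reply: str) -> ErrorCategory:
    """Map a model reply to a category, falling back to OTHER."""
    cleaned = reply.strip().lower()
    for category in ErrorCategory:
        if cleaned == category.value:
            return category
    return ErrorCategory.OTHER


def classify_all(
    incorrect: Iterable[Tuple[str, str, str]],
    ask_model: Callable[[str], str],
) -> Counter:
    """Tally categories over (task, answer, reason) triples."""
    counts: Counter = Counter()
    for task, answer, reason in incorrect:
        reply = ask_model(build_prompt(task, answer, reason))
        counts[parse_category(reply)] += 1
    return counts
```

The resulting `Counter` is the kind of per-category tally a pie chart needs: its values give the slice sizes and its keys give the labels. Keeping the model behind a callable also lets the tallying logic be tested with a fake model.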


EYH0602 merged commit 42c2925 into release-0.1.0 on Sep 10, 2025
4 checks passed
EYH0602 added a commit that referenced this pull request Sep 30, 2025
* restructure project

* fix: ollama new type api

* fix lint

* add ruff

* add ruff to ci

* refactor: llm api (#60)

New LLM generation workflow.

* add an empty .env

* refactor OpenAI util class

* use new OpenAI client in main

* assume .env unchanged

* fix: response processing

* use new Gemini client in main

* enable reasoning effort from cli

* document why two gemini wrappers

* add Claude API

* add claude models to supported list

* handle UnionType for Literal ReasoningEffort

* add vLLM support and use it as default option

* fix: use vLLM chat interface instead of gen

* env add vllm api key

* add VLLM_HOST and VLLM_PORT

* add vllm server mode

* add vLLM in dependencies

* doc: instruct to run vllm from uv

* make deprecated ollama a standalone script

* doc: revise ollama

* use 3.12

* add Ollama models

* fix: ollama model name

* fix: ollama model name

* fix: Gemini use its own EFFORT_TOKEN_MAP

* remove unused imports

* fix: google-genai version

* fix: ci with uv run

* feat: load TF-Bench from HuggingFace by default (#61)

* update tfb to huggingface with base and pure splits

* feat: load tfbench from huggingface

* remove mandatory path

* avoid loading vLLM for now

* remove vLLM option in main

* feat: update response processing inside tfbench package (#62)

* answer cannot be None from LM

* move evaluation logic inside the tfbench package

* fix: orjson writes binary, error is not an option

* fix: use pure as parameter in main._eval

* feat: script to analyze saved generation results

* use orjsonl in main for consistency

* feat: evaluation prove type equiv using TypeOperators (#64)

* fix: allow generation to fail

* remove unnecessary imports

* fix: OpenAI response add reasoning summary

* fix: load_gen_results_json type

* fix: analysis_saved script

* fix: evaluation benchmark name

* fix: OpenAI response API add summary

* use pydantic-v2

* extract incorrect task-answer pairs

* fix: groundtruth error (#63)

* fix: missing type class and typevar in benchmark

* fix: order of tasks in tfb

* fix: allow load_gen_results to load error

* remove error_cls unused imports

* extract type variables from source code

* add GHC type check by proving type equiv

* fix: cp -> process

* fix: API change for AST

* feat: type prover support new type definition

* test: ghc and type_util

* feat: use prover_evaluate for base split

* test: add real tfbench test cases, which the deprecated evaluation failed

* alt  error to syntax parsing error

* feat: typeclass constraints reorder

* fix: AST.get_all_nodes_of_type ignores the root itself

* reorder_constraints using compiler frontend static analysis

* feat: add type definitions for pure tasks

* test: check type equivalence prover after rewriting mono types

* fix: handle type classes alone when adding new definitions

* feat: define new types automatically for pure tasks

* ghc prover remove standalone type class

* doc: detailed docstring for prover_evaluate

* script: analysis_saved runs both splits

* fix: experiment use prover_evaluate

* feat: error analysis with reasoning steps (#65)

* error analysis use prover

* error analysis script

* feat: record model name when doing error analysis

* add plot script for error analysis

* adjust row and column spacing

* update color map

* revise error_analysis default path

* test: list constructor

* remove tmp file

* fix: main  missing pure parameter to

* error analysis only output category

* default error analysis model to gpt-5-mini

* adjust fontsize for 5 pies in a row

* doc: require GHC >= 9.2.1 for ImpredicativeTypes

* feat: add default option using transformers (#67)

* add transformers generation as default

* remove None option for router

* remove vllm option for ease of dependency

* Update src/tfbench/lm/_hf.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/tfbench/lm/_hf.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove unnecessary imports

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* doc: make readme and export clearer (#68)

* add transformers generation as default

* Update src/tfbench/lm/_hf.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/tfbench/lm/_hf.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove unnecessary imports

* doc: improve instructions

* fix: unused parameter and import

* enable github actions on main commits

* doc: add badges and images

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
EYH0602 deleted the feat-analysis-rsn-err branch on November 26, 2025, 21:21
