feat: error analysis with reasoning steps by EYH0602 · Pull Request #65 · SecurityLab-UCD/TF-Bench

EYH0602 · 2025-09-10T23:34:53Z

No description provided.

Copilot

Pull Request Overview

This PR adds error analysis functionality with reasoning steps to the tfbench library. The system now captures detailed error information when type inference tasks fail and provides structured analysis of incorrect answers using OpenAI's API.

Enhanced evaluation functions to return Result types with error messages instead of simple boolean values
Added comprehensive error analysis module with predefined error categories and OpenAI-based classification
Created visualization scripts for plotting error distribution across models and dataset splits
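As a rough illustration of the first point, the sketch below shows one way an evaluation function can return a Result-style value carrying an error message instead of a bare boolean. It is not the code from this PR: `Ok`, `Err`, `normalize`, `evaluate_one`, and `collect_incorrect` are hypothetical names, and the whitespace-insensitive string comparison is only a stand-in for the project's actual type-equivalence check.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Ok:
    """Successful evaluation; carries no payload in this sketch."""


@dataclass(frozen=True)
class Err:
    """Failed evaluation with a human-readable reason."""

    reason: str


# Hypothetical Result alias; the PR's real Result type is not shown on this page.
Result = Union[Ok, Err]


def normalize(signature: str) -> str:
    """Collapse runs of whitespace so trivially different spellings compare equal."""
    return " ".join(signature.split())


def evaluate_one(ground_truth: str, answer: Optional[str]) -> Result:
    """Compare one answer to its ground truth and explain any failure."""
    if answer is None or not answer.strip():
        return Err("empty answer")
    expected, got = normalize(ground_truth), normalize(answer)
    if expected != got:
        return Err(f"expected `{expected}`, got `{got}`")
    return Ok()


def collect_incorrect(
    pairs: Iterable[Tuple[str, Optional[str]]],
) -> List[Tuple[str, Optional[str], str]]:
    """Return (ground_truth, answer, reason) for every pair that fails."""
    failures = []
    for ground_truth, answer in pairs:
        result = evaluate_one(ground_truth, answer)
        if isinstance(result, Err):
            failures.append((ground_truth, answer, result.reason))
    return failures
```

Keeping the reason next to each failed pair lets a later analysis step consume the message directly, without re-running the check.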

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.


File	Description
src/tfbench/evaluation.py	Modified evaluation functions to return Result types with error messages and updated get_incorrect to include error reasons
src/tfbench/error_analysis.py	New module implementing error classification system with OpenAI integration and predefined error categories
src/tfbench/__init__.py	Removed deprecated evaluate function from exports
scripts/error_plot.py	New script for generating pie chart visualizations of error category distributions
scripts/error_cls.py	Empty file for error classification functionality
scripts/error_analysis.py	New script for running error analysis on incorrect benchmark results
pyproject.toml	Added pokepalette dependency for color palettes in visualization
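
To make the classification and plotting rows above more concrete, here is a minimal sketch of mapping free-text classifier replies onto a fixed category set and computing the per-category shares that a pie chart would display. The category names and the helpers `parse_category` and `category_shares` are assumptions for illustration only; the PR's actual category list and its OpenAI call are not shown on this page, so the model replies are passed in as plain strings.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class ErrorCategory(str, Enum):
    # Hypothetical labels; the real predefined categories may differ.
    SYNTAX = "syntax"
    WRONG_ARITY = "wrong_arity"
    OTHER = "other"


def parse_category(raw: str) -> ErrorCategory:
    """Map a model reply onto a known category, falling back to OTHER."""
    cleaned = raw.strip().lower()
    for category in ErrorCategory:
        if category.value == cleaned:
            return category
    return ErrorCategory.OTHER


def category_shares(replies: Iterable[str]) -> Dict[ErrorCategory, float]:
    """Return the fraction of replies in each category (empty dict if no replies)."""
    counts = Counter(parse_category(reply) for reply in replies)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}
```

Falling back to a catch-all category keeps malformed replies from being dropped silently, so the shares always sum to one over the replies that were received.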


src/tfbench/error_analysis.py

* resturcture project
* fix: ollama new type api
* fix lint
* add ruff
* add ruff to ci
* refactor: llm api (#60) New LLM generation workflow.
* add an empty .env
* refactor OpenAI util class
* use new OpenAI client in main
* assume .env unchanged
* fix: response processing
* use new Gemini client in main
* enable reasoning effort from cli
* document why two gemini wrapper
* add Claude API
* add claude models to supported list
* handle UnionType for Literal ReasoningEffort
* add vLLM support and use it as default option
* fix: use vLLM chat interface instead of gen
* env add vllm api key
* add VLLM_HOST and VLLM_PORT
* add vllm server mode
* add vLLM in dependencies
* doc: instruct to run vllm from uv
* make deprecated ollama a standalone script
* doc: revise ollama
* use 3.12
* add Ollama models
* fix: ollama model name
* fix: ollama model name
* fix: Gemini use its own EFFORT_TOKEN_MAP
* remove unused imports
* fix: google-genai version
* fix: ci with uv run
* feat: load TF-Bench from HuggingFace by default (#61)
* update tfb to huggingface with base and pure splits
* feat: load tfbench from huggingface
* remove mandatory path
* avoid loading vLLM for now
* remove vLLM option in main
* feat: update response processing inside tfbench package (#62)
* answer cannot be None from LM
* move evaluation logic inside the tfbench package
* fix: orjson writes binary, error is not an option
* fix: use pure as parameter in main._eval
* feat: script to analysis saved generation results
* use orjsonl in main for consistancy
* feat: evaluation prove type equiv using TypeOperators (#64)
* fix: allow generation to fail
* remove unnecessary imports
* fix: OpenAI response add reasoning summary
* fix: load_gen_results_json type
* fix: analysis_saved script
* fix: evaluation benchmark name
* fix: OpenAI response API add summary
* use pydantic-v2
* extract incorrect task-answer pairs
* fix: groundtruth error (#63)
* fix: missing type class and typevar in benchmark
* fix: order of tasks in tfb
* fix: allow load_gen_results to load error
* remove error_cls unused imports
* extract type variables from source code
* add GHC type check by proving type equiv
* fix: cp -> process
* fix: API change for AST
* feat: type prover support new type definition
* test: ghc and type_util
* feat: use prover_evaluate for base split
* test: add real tfbench test cases, which the deprecated evaluation failed
* alt error to syntax parsing error
* feat: typeclass constrains reorder
* fix: AST.get_all_nodes_of_type ignores the root itself
* reorder_constraints using compiler frontend static analysis
* feat: add type definitions for pure tasks
* test: check type equivalence prover after rewriting mono types
* fix: handle type classes alone when ading new definitions
* feat: define new types automatically for pure tasks
* ghc prover remove standalone type class
* doc: detaile docstring for prover_evaluate
* script: analysis_saved run both split
* fix: experiment use prover_evaluate
* feat: error analysis with reasoning steps (#65)
* error analysis use prover
* error analysis script
* feat: record model name when doing error analysis
* add plot script for error analysis
* adjust row and column spacing
* update color map
* revise error_analysis default path
* test: list constructor
* remove tmp file
* fix: main missing pure parameter to
* error analysis only output category
* default error analysis model to gpt-5-mini
* adjust fontsize for 5 pies in a row
* doc: require GHC >= 9.2.1 for ImpredicativeTypes
* feat: add default option using transformers (#67)
* add transformers generation as default
* remove None option for router
* remove vllm option for ease of dependency
* Update src/tfbench/lm/_hf.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update src/tfbench/lm/_hf.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* remove unnecessary imports
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* doc: make readme and export clearer (#68)
* add transformers generation as default
* Update src/tfbench/lm/_hf.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update src/tfbench/lm/_hf.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* remove unnecessary imports
* doc: improve instructions
* fix: unused parameter and import
* enable github actions on main commits
* doc: add badges and images
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

EYH0602 added 6 commits September 10, 2025 05:46

error analysis use prover

926a842

error analysis script

c257b45

feat: record model name when doing error analysis

aabe41c

add plot script for error analysis

fa7cd25

adjust row and column spacing

a4e9ff9

update color map

41ec837

EYH0602 requested a review from Copilot September 10, 2025 23:34

Copilot AI reviewed Sep 10, 2025


src/tfbench/error_analysis.py (resolved)

EYH0602 merged commit 42c2925 into release-0.1.0 Sep 10, 2025
4 checks passed

EYH0602 deleted the feat-analysis-rsn-err branch November 26, 2025 21:21
