feat: add MiniMax as configurable evaluation LLM provider #351
Open

octo-patch wants to merge 1 commit into EvolvingLMMs-Lab:main from
Conversation
Summary

Add support for MiniMax M2.7 as an alternative LLM provider for benchmark evaluation (MagnifierBench, MathVista, MM-Vet) and the Syphus data generation pipeline. Previously, evaluation judging was hardcoded to OpenAI GPT-4. The file-by-file change list is under Changes below.
Motivation
The benchmark evaluation system (MagnifierBench, MathVista, MM-Vet) previously hardcoded OpenAI GPT-4 as the judge model. This PR makes the judge configurable, so users can choose alternative providers such as MiniMax M2.7 (1M-token context window) as a cost-effective evaluation backend.
Configuration
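MiniMax is selected automatically when `MINIMAX_API_KEY` is present in the environment; otherwise the client falls back to OpenAI. Benchmarks can also opt in explicitly through the backward-compatible `eval_provider` parameter. A minimal sketch of that detection logic, for illustration only (the function name and exact precedence rules are assumptions, not taken from the diff):

```python
import os


def detect_eval_provider() -> str:
    """Pick the evaluation LLM provider from environment variables.

    Illustrative precedence: an explicit MINIMAX_API_KEY wins,
    otherwise fall back to OpenAI.
    """
    if os.getenv("MINIMAX_API_KEY"):
        return "minimax"
    if os.getenv("OPENAI_API_KEY"):
        return "openai"
    raise RuntimeError("Set MINIMAX_API_KEY or OPENAI_API_KEY to run evaluation.")
```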
Or via YAML config (the key name below mirrors the new `eval_provider` parameter; the surrounding schema is illustrative):
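```yaml
# Illustrative benchmark config; only eval_provider is introduced by this PR.
eval_provider: minimax   # "openai" remains the default
```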
Changes

- Add `pipeline/benchmarks/utils/eval_llm.py`: a configurable evaluation LLM client supporting OpenAI and MiniMax providers, with auto-detection via environment variables, temperature clamping, and think-tag stripping (sketched below)
- Update `magnifierbench.py`, `mathvista.py`, and `mmvet.py` to use the configurable eval LLM client, with a backward-compatible `eval_provider` parameter
- Update Syphus `file_utils.py` with MiniMax provider documentation and temperature clamping when `MINIMAX_API_KEY` is set
- Update the README with MiniMax configuration docs and a badge
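For illustration, the response cleanup described above might look roughly like this; the regex and the clamp bounds are assumptions, not taken from the diff, and MiniMax reasoning models are assumed to emit `<think>...</think>` blocks that must be removed before score parsing:

```python
import re

THINK_TAG_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think_tags(text: str) -> str:
    """Drop <think>...</think> reasoning blocks so score parsing
    only sees the model's final answer."""
    return THINK_TAG_RE.sub("", text).strip()


def clamp_temperature(temperature: float) -> float:
    """Clamp temperature into an assumed MiniMax-accepted range (0, 1];
    the exact bounds used by the PR are not shown here."""
    return min(max(temperature, 0.01), 1.0)


# Example:
raw = "<think>The answer mentions two objects...</think>Score: 8"
assert strip_think_tags(raw) == "Score: 8"
```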
Test plan

- 24 unit tests
- 4 integration tests