
feat: add MiniMax as configurable evaluation LLM provider #351

Open

octo-patch wants to merge 1 commit into EvolvingLMMs-Lab:main from octo-patch:feature/add-minimax-provider

Conversation

@octo-patch

Summary

  • Add a configurable evaluation LLM client (pipeline/benchmarks/utils/eval_llm.py) supporting OpenAI and MiniMax providers, with auto-detection, temperature clamping, and think-tag stripping (sketched after this list)
  • Update MagnifierBench, MathVista, and MM-Vet evaluation datasets to use the configurable client instead of hardcoded OpenAI API calls, with backward-compatible eval_provider parameter
  • Update Syphus data generation pipeline with MiniMax provider documentation and temperature handling
  • Add 24 unit tests and 4 integration tests (all passing)
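
For orientation, here is a minimal sketch of how the provider registry, auto-detection, and output cleanup described above could fit together. Every helper name and the clamping thresholds below are assumptions for illustration, not the PR's actual code; only the base URL and model name come from the configuration section.

```python
# Hypothetical sketch of the provider registry; the real
# pipeline/benchmarks/utils/eval_llm.py may be structured differently.
import os
import re

PROVIDERS = {
    "openai": {"env_key": "OPENAI_API_KEY", "base_url": None,
               "default_model": "gpt-4", "max_temperature": 2.0},
    "minimax": {"env_key": "MINIMAX_API_KEY",
                "base_url": "https://api.minimax.io/v1",
                "default_model": "MiniMax-M2.7", "max_temperature": 1.0},
}

THINK_TAG = re.compile(r"<think>.*?</think>", re.DOTALL)

def detect_provider() -> str:
    """Auto-detect the provider: explicit env var wins, else key presence."""
    explicit = os.environ.get("EVAL_LLM_PROVIDER")
    if explicit:
        return explicit
    return "minimax" if os.environ.get("MINIMAX_API_KEY") else "openai"

def clamp_temperature(provider: str, temperature: float) -> float:
    """Clamp temperature to the provider's supported range (values assumed)."""
    return min(temperature, PROVIDERS[provider]["max_temperature"])

def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from model output."""
    return THINK_TAG.sub("", text).strip()
```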

Motivation

The benchmark evaluation system (MagnifierBench, MathVista, MM-Vet) previously hardcoded OpenAI GPT-4 as the evaluation judge LLM. This PR makes the evaluation LLM configurable, enabling users to choose alternative providers like MiniMax M2.7 (1M context window) as a cost-effective evaluation backend.

Configuration

```bash
# For benchmark evaluation
export EVAL_LLM_PROVIDER="minimax"
export MINIMAX_API_KEY="your-key"

# For Syphus data generation (via liteLLM)
export MINIMAX_API_KEY="your-key"
export OPENAI_API_ENGINE="openai/MiniMax-M2.7"
export OPENAI_API_BASE="https://api.minimax.io/v1"
```
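
As a rough illustration, the Syphus path might route through liteLLM like the call below under these environment variables; the prompt and temperature are placeholders, and the real logic in mimic-it/syphus/file_utils.py may differ.

```python
# Hedged sketch of a liteLLM call using the env vars above; not the
# actual file_utils.py implementation.
import os
from litellm import completion

response = completion(
    model=os.environ["OPENAI_API_ENGINE"],   # e.g. "openai/MiniMax-M2.7"
    api_base=os.environ["OPENAI_API_BASE"],  # MiniMax's OpenAI-compatible endpoint
    api_key=os.environ["MINIMAX_API_KEY"],
    messages=[{"role": "user", "content": "Generate one QA pair for this caption."}],
    temperature=0.7,  # clamped by file_utils.py when MINIMAX_API_KEY is set
)
print(response.choices[0].message.content)
```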

Or via YAML config:

```yaml
datasets:
  - name: magnifierbench
    eval_provider: minimax
    api_key: your-key
```
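
A hypothetical call site for the client itself: EvalLLMClient is named in the changes table below, but the constructor arguments and the chat-completion method shown here are guesses based on the summary, not confirmed signatures.

```python
# Hypothetical usage; exact constructor and method names may differ.
from pipeline.benchmarks.utils.eval_llm import EvalLLMClient

client = EvalLLMClient(provider="minimax")  # or "openai" (the default)
reply = client.chat_completion(
    messages=[{"role": "user", "content": "Is '42' the correct answer? Reply yes or no."}],
    temperature=0.0,  # clamped per provider before the request
)
print(reply)  # think-tags already stripped from MiniMax output
```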

Changes

| File | Change |
| --- | --- |
| pipeline/benchmarks/utils/eval_llm.py | New configurable LLM client with provider registry |
| pipeline/benchmarks/utils/__init__.py | Package init |
| pipeline/benchmarks/datasets/magnifierbench.py | Use EvalLLMClient, add eval_provider/eval_model params |
| pipeline/benchmarks/datasets/mathvista.py | Use EvalLLMClient, add eval_provider/eval_model params |
| pipeline/benchmarks/datasets/mmvet.py | Use EvalLLMClient, replace OpenAI() client |
| mimic-it/syphus/file_utils.py | Add MiniMax docs, temperature clamping, query_llm() alias |
| unit_tests/test_eval_llm.py | 24 unit tests |
| unit_tests/test_eval_llm_integration.py | 4 integration tests |
| README.md | MiniMax badge, config docs |

Test plan

  • 24 unit tests passing (provider config, init, temperature clamping, think-tag stripping, chat completion, retry logic); see the pytest sketch after this list
  • 4 integration tests passing against live MiniMax API (basic completion, judge yes/no, scoring, auto-detect)
  • Verify backward compatibility: existing OpenAI-based evaluation works unchanged when no eval_provider is set
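
For flavor, pytest cases for the clamping and stripping behaviors might look like the following; the helper names reuse the hypothetical sketch above, and the PR's actual test names will differ.

```python
# Illustrative pytest cases, not the PR's actual tests.
from pipeline.benchmarks.utils.eval_llm import clamp_temperature, strip_think_tags

def test_minimax_temperature_is_clamped():
    # assumes MiniMax caps temperature at 1.0 (an assumption, not from the PR)
    assert clamp_temperature("minimax", 1.8) <= 1.0

def test_think_tags_are_stripped():
    raw = "<think>chain of thought...</think>Final answer: yes"
    assert strip_think_tags(raw) == "Final answer: yes"
```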

Commit message

Add support for MiniMax M2.7 as an alternative LLM provider for benchmark
evaluation (MagnifierBench, MathVista, MM-Vet) and the Syphus data generation
pipeline. Previously, evaluation judging was hardcoded to OpenAI GPT-4.

Changes:
- Add pipeline/benchmarks/utils/eval_llm.py: Configurable evaluation LLM
  client supporting OpenAI and MiniMax providers with auto-detection via
  environment variables, temperature clamping, and think-tag stripping
- Update magnifierbench.py, mathvista.py, mmvet.py to use configurable
  eval LLM client with backward-compatible eval_provider parameter
- Update Syphus file_utils.py with MiniMax provider documentation and
  temperature clamping when MINIMAX_API_KEY is set
- Add 24 unit tests and 4 integration tests
- Update README with MiniMax configuration docs and badge
