feat(tests): add EvalScope evaluation suite with needle-in-haystack and acc test#908

Open
Potterluo wants to merge 3 commits into ModelEngine-Group:develop from Potterluo:feature_pytest_evalscope

Conversation

@Potterluo (Contributor)

Summary

This PR introduces a comprehensive evaluation suite built on EvalScope (v1.5.2) to automate accuracy testing for LLMs. It supports two primary evaluation modes:

  • Mainstream Benchmarks: aime24, aime25, aime26, gsm8k, longbench_v2, ceval, cmmlu, humaneval, mmlu, mmlu_pro, etc.
  • Needle In A Haystack: Long-context retrieval evaluation with configurable context length ranges.

Key Changes

1. New Utility Class: EvalScopeRunner

  • Encapsulates run() and collect_results() logic.
  • Handles timestamp-based result directory discovery and JSON report parsing.
  • Exports metrics via the existing @export_vars decorator for database storage.
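
A minimal sketch of what such a runner could look like. The class name, `collect_results()`, and the timestamped-result-directory convention come from this PR description; the directory naming scheme and the JSON report fields below (`dataset_name`, `score`) are illustrative assumptions, and `run()` / the `@export_vars` integration are omitted:

```python
import json
from pathlib import Path

class EvalScopeRunner:
    """Sketch: discover the newest timestamped result dir and parse JSON reports.

    Assumes EvalScope writes results under work_dir/<timestamp>/ and that each
    report is a JSON object with "dataset_name" and "score" keys (illustrative).
    """

    def __init__(self, work_dir):
        self.work_dir = Path(work_dir)

    def latest_result_dir(self) -> Path:
        # Timestamp-named dirs (e.g. 20260409_092500) sort lexicographically,
        # so the last entry is the most recent run.
        dirs = sorted(d for d in self.work_dir.iterdir() if d.is_dir())
        if not dirs:
            raise FileNotFoundError(f"no result directories under {self.work_dir}")
        return dirs[-1]

    def collect_results(self) -> dict:
        metrics = {}
        for report in self.latest_result_dir().rglob("*.json"):
            data = json.loads(report.read_text())
            metrics[data["dataset_name"]] = data["score"]
        return metrics
```
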

2. Refactored Test Cases (test_evalscope.py)

  • Configuration building moved to local helper functions (_build_general_task_config, _build_needle_task_config).
  • Environment variable support:
    • SCOPE_DATASET_ROOT / SCOPE_TREST_LIST
    • SCOPE_NEEDLE_MIN / SCOPE_NEEDLE_MAX
  • Clear separation between test logic and infrastructure code.

3. Enhanced Needle Content

  • Replaced the public "San Francisco sandwich" needle with a fictional, unique passage to avoid training data contamination.
  • Added corresponding Chinese needle for bilingual subset testing.
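
For readers unfamiliar with the technique, needle-in-a-haystack insertion can be sketched like this. The passages below are invented placeholders, not the PR's actual fictional needles, and `insert_needle` is a hypothetical helper shown only to illustrate depth-based placement:

```python
# Placeholder needles -- illustrative only, not the passages added by this PR.
NEEDLE_EN = "The archivist Yolen Marsh hid the brass key inside the clocktower hollow."
NEEDLE_ZH = "档案员尤伦·马什把铜钥匙藏在钟楼的空洞里。"

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Place the needle at a relative depth in the context.

    depth = 0.0 inserts at the start of the haystack, 1.0 at the end;
    the model is then asked a retrieval question only the needle answers.
    """
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]
```
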

4. Documentation (README.md)

  • Bilingual (Chinese/English) guide covering:
    • Environment setup and dataset preparation (online/offline).
    • Configuration variables and local override methods.
    • Execution commands and result output structure.
    • Example JSON output and screenshot references.

How to Test

```shell
cd test

# Mainstream benchmarks
pytest suites/E2E/test_evalscope.py::test_eval_accuracy

# Needle In A Haystack
pytest suites/E2E/test_evalscope.py::test_needle_task

# All EvalScope tests
pytest --feature=evalscope
```
(Screenshots: pic1, pic2, pic3)

…upport

- Introduce EvalScopeRunner utility class to encapsulate task execution and result collection.
- Add test cases for mainstream benchmarks (aime, gsm8k, mmlu, etc.) and needle-in-haystack evaluation.
- Support environment variable overrides for dataset root, task list, and needle context lengths.
- Refactor configuration building into dedicated helper functions for clarity.
- Include detailed README with setup instructions, usage examples, and result interpretation.
Potterluo force-pushed the feature_pytest_evalscope branch from 8a6a84a to cad3dd6 on April 9, 2026 at 09:25.
| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `SCOPE_DATASET_ROOT` | | Root directory where datasets are stored |
| `SCOPE_TREST_LIST` | `aime24,gsm8k` (example) | Comma-separated list of datasets to evaluate |
Contributor

Typo here: `SCOPE_TREST_LIST`
