feat(tests): add EvalScope evaluation suite with needle-in-haystack and acc test#908

Open
Potterluo wants to merge 3 commits into ModelEngine-Group:develop from Potterluo:feature_pytest_evalscope

Conversation

@Potterluo (Contributor)

Summary

This PR introduces a comprehensive evaluation suite built on EvalScope (v1.5.2) to automate accuracy testing for LLMs. It supports two primary evaluation modes:

  • Mainstream Benchmarks: aime24, aime25, aime26, gsm8k, longbench_v2, ceval, cmmlu, humaneval, mmlu, mmlu_pro, etc.
  • Needle In A Haystack: Long-context retrieval evaluation with configurable context length ranges.

Key Changes

1. New Utility Class: EvalScopeRunner

  • Encapsulates run() and collect_results() logic.
  • Handles timestamp-based result directory discovery and JSON report parsing.
  • Exports metrics via the existing @export_vars decorator for database storage.
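
A minimal sketch of what such a runner could look like. The class name, `collect_results()`, and the timestamped-result-directory convention come from this PR description; the directory naming scheme and the JSON report fields below (`dataset_name`, `score`) are illustrative assumptions, and `run()` / the `@export_vars` integration are omitted:

```python
import json
from pathlib import Path

class EvalScopeRunner:
    """Sketch: discover the newest timestamped result dir and parse JSON reports.

    Assumes EvalScope writes results under work_dir/<timestamp>/ and that each
    report is a JSON object with "dataset_name" and "score" keys (illustrative).
    """

    def __init__(self, work_dir):
        self.work_dir = Path(work_dir)

    def latest_result_dir(self) -> Path:
        # Timestamp-named dirs (e.g. 20260409_092500) sort lexicographically,
        # so the last entry is the most recent run.
        dirs = sorted(d for d in self.work_dir.iterdir() if d.is_dir())
        if not dirs:
            raise FileNotFoundError(f"no result directories under {self.work_dir}")
        return dirs[-1]

    def collect_results(self) -> dict:
        metrics = {}
        for report in self.latest_result_dir().rglob("*.json"):
            data = json.loads(report.read_text())
            metrics[data["dataset_name"]] = data["score"]
        return metrics
```
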

2. Refactored Test Cases (test_evalscope.py)

  • Configuration building moved to local helper functions (_build_general_task_config, _build_needle_task_config).
  • Environment variable support:
    • SCOPE_DATASET_ROOT / SCOPE_TREST_LIST
    • SCOPE_NEEDLE_MIN / SCOPE_NEEDLE_MAX
  • Clear separation between test logic and infrastructure code.

3. Enhanced Needle Content

  • Replaced the public "San Francisco sandwich" needle with a fictional, unique passage to avoid training data contamination.
  • Added corresponding Chinese needle for bilingual subset testing.
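
For readers unfamiliar with the technique, needle-in-a-haystack insertion can be sketched like this. The passages below are invented placeholders, not the PR's actual fictional needles, and `insert_needle` is a hypothetical helper shown only to illustrate depth-based placement:

```python
# Placeholder needles -- illustrative only, not the passages added by this PR.
NEEDLE_EN = "The archivist Yolen Marsh hid the brass key inside the clocktower hollow."
NEEDLE_ZH = "档案员尤伦·马什把铜钥匙藏在钟楼的空洞里。"

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Place the needle at a relative depth in the context.

    depth = 0.0 inserts at the start of the haystack, 1.0 at the end;
    the model is then asked a retrieval question only the needle answers.
    """
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]
```
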

4. Documentation (README.md)

  • Bilingual (Chinese/English) guide covering:
    • Environment setup and dataset preparation (online/offline).
    • Configuration variables and local override methods.
    • Execution commands and result output structure.
    • Example JSON output and screenshot references.

How to Test

```shell
cd test

# Mainstream benchmarks
pytest suites/E2E/test_evalscope.py::test_eval_accuracy

# Needle In A Haystack
pytest suites/E2E/test_evalscope.py::test_needle_task

# All EvalScope tests
pytest --feature=evalscope
```
(Screenshots: pic1, pic2, pic3)

…upport

- Introduce EvalScopeRunner utility class to encapsulate task execution and result collection.
- Add test cases for mainstream benchmarks (aime, gsm8k, mmlu, etc.) and needle-in-haystack evaluation.
- Support environment variable overrides for dataset root, task list, and needle context lengths.
- Refactor configuration building into dedicated helper functions for clarity.
- Include detailed README with setup instructions, usage examples, and result interpretation.
Potterluo force-pushed the feature_pytest_evalscope branch from 8a6a84a to cad3dd6 on April 9, 2026 at 09:25.
| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| `SCOPE_DATASET_ROOT` | | Root directory where datasets are stored |
| `SCOPE_TREST_LIST` | `aime24,gsm8k` (example) | Comma-separated list of datasets to evaluate |
Contributor

Typo here: `SCOPE_TREST_LIST`
