
#4: WIP DRAFT accuracy eval and evaluator; add GPQA for testing#27

Closed
nvzhihanj wants to merge 3 commits into main from feat/zhihanj-accuracy

Conversation

@nvzhihanj
Collaborator

@nvzhihanj nvzhihanj commented Nov 14, 2025

Address part 1 of #4

Added:

  • Evaluators for GPQA and AIME (referencing the GPT-OSS implementation)
  • Support for repeats and pass@k (note that repeats produce N-length responses but only a 1-length ground truth, to keep the logic simple; the current assumption is that we don't repeat ground truths or the input pickle)
  • CLI interfaces for eval and eval-results
  • eval still needs integration with the benchmark (and requires a deterministic SampleOrder), plus changes to accommodate repeats
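
The PR doesn't show how pass@k is computed over the N repeated responses; a common choice is the unbiased estimator from the HumanEval/Codex literature, sketched here (function name and signature are illustrative, not the PR's actual API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n total (c correct), is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 repeats, 1 correct, pass@1 is the plain accuracy: 0.5
print(pass_at_k(2, 1, 1))  # → 0.5
```

This averages over all k-subsets of the n repeats rather than just the first k, which reduces variance when n > k.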

TODOs (in this PR):

  • Add integration tests for eval-results (a sample
  • Add unit tests for evaluator components

TODOs (in next PR):

  • Hook up eval to benchmarks
  • Add GPT-OSS chat template and system prompt integration for eval
  • Add eval example (using small/full GPQA dataset on GPT-OSS)

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@nvzhihanj nvzhihanj self-assigned this Nov 14, 2025
@github-actions

github-actions Bot commented Nov 14, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested a review from arekay-nv November 14, 2025 06:23
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread tests/unit/eval/test_evaluator.py Fixed
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread src/inference_endpoint/commands/eval_results.py Fixed
Comment thread src/inference_endpoint/eval/evaluate.py Fixed
import pandas as pd

# Import evaluators to register them
from . import evaluators # noqa: F401
print(f"Loaded {len(df)} questions")

# Limit samples if requested
if num_samples is not None and num_samples < len(df):

Do we still need this feature? It looks like it's only used for debugging / testing

@arekay-nv arekay-nv closed this Feb 12, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators Feb 12, 2026
@arekay-nv arekay-nv deleted the feat/zhihanj-accuracy branch April 2, 2026 03:06

3 participants