
#4: WIP DRAFT accuracy eval and evaluator; add GPQA for testing#27

Closed
nvzhihanj wants to merge 3 commits into main from feat/zhihanj-accuracy

Conversation

@nvzhihanj
Collaborator

@nvzhihanj nvzhihanj commented Nov 14, 2025

Address part 1 of #4

Added:

  • Evaluators for GPQA and AIME (referencing the GPT-OSS implementation)
  • Support for repeats and pass@k (note that repeats produce N-length responses but only a 1-length ground truth, to keep the logic simple; the current assumption is that we don't repeat ground truths or the input pickle)
  • CLI interfaces for eval and eval-results
  • eval still needs integration with the benchmark (and requires a deterministic SampleOrder), plus changes to accommodate repeats
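
The PR doesn't show how pass@k is computed over the N repeated responses; a common choice is the unbiased estimator from the HumanEval/Codex literature, sketched here (function name and signature are illustrative, not the PR's actual API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n total (c correct), is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 repeats, 1 correct, pass@1 is the plain accuracy: 0.5
print(pass_at_k(2, 1, 1))  # → 0.5
```

This averages over all k-subsets of the n repeats rather than just the first k, which reduces variance when n > k.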

TODOs (in this PR):

  • Add integration tests for eval-results (a sample
  • Add unit tests for evaluator components

TODOs (in next PR):

  • Hook up eval to benchmarks
  • Add GPT-OSS chat template and system prompt integration for eval
  • Add eval example (using small/full GPQA dataset on GPT-OSS)

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@nvzhihanj nvzhihanj self-assigned this Nov 14, 2025
@github-actions

github-actions Bot commented Nov 14, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested a review from arekay-nv November 14, 2025 06:23
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread tests/unit/eval/test_evaluator.py Fixed
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread src/inference_endpoint/commands/eval.py Fixed
Comment thread src/inference_endpoint/commands/eval_results.py Fixed
Comment thread src/inference_endpoint/eval/evaluate.py Fixed
import pandas as pd

# Import evaluators to register them
from . import evaluators # noqa: F401
print(f"Loaded {len(df)} questions")

# Limit samples if requested
if num_samples is not None and num_samples < len(df):

Do we still need this feature? It looks like it's only used for debugging / testing

@arekay-nv arekay-nv closed this Feb 12, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators Feb 12, 2026
@arekay-nv arekay-nv deleted the feat/zhihanj-accuracy branch April 2, 2026 03:06

3 participants