feat(byob): add completions_logprob endpoint and extend scorers/datasets #953
Conversation
Force-pushed from b241165 to ca07cb8
@kanishks-23 Please amend sign-offs to the pushed commits and then launch the pipeline with the following syntax to make sure tests pass
Force-pushed from ca07cb8 to cdaa77a
/ok to test cdaa77a
Force-pushed from cdaa77a to 5ae25d7
/ok to test 5ae25d7
Force-pushed from 5ae25d7 to bb06b86
/ok to test bb06b86
Add support for logprob-based multiple-choice evaluation within the BYOB framework, plus several new scorers and dataset-loading enhancements.

Endpoint and runner:
- Add completions_logprob as a supported endpoint type with Pydantic validation, CLI support, and health-check handling
- Logprob scoring uses /v1/completions with max_tokens=0, echo=true, logprobs=1 to score candidate answers
- New call_model_loglikelihood helper in runner.py

Scorers:
- multiple_choice_acc: logprob-based MCQ scoring
- mcq_letter_extract: parse the first A-D letter from free-form responses
- gsm8k_answer: GSM8K-style numeric answer extraction
- boolean_yesno: yes/no task scorer
- chrf / chrF++ (sacreBLEU-style character n-gram F-score)
- Extended ROUGE support

Datasets:
- HuggingFace loader parses additional query parameters
- trust_remote_code=true enabled where required

Signed-off-by: kanishks <kanishks@nvidia.com>
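For reference, a minimal sketch of what logprob-based choice scoring against an OpenAI-compatible /v1/completions endpoint can look like. The helper name, endpoint URL, and offset handling below are illustrative assumptions, not the PR's actual call_model_loglikelihood implementation.

```python
# Illustrative sketch only: score a candidate answer by summing its token
# log-probabilities, using echo=true / max_tokens=0 so the endpoint returns
# logprobs for the prompt itself without generating anything.
import requests

ENDPOINT_URL = "http://localhost:8000/v1/completions"  # assumed local server

def score_choice(model: str, prompt: str, choice: str) -> float:
    """Return the summed log-probability of `choice` conditioned on `prompt`."""
    payload = {
        "model": model,
        "prompt": prompt + choice,
        "max_tokens": 0,   # generate nothing; we only want scores
        "echo": True,      # return logprobs for the prompt tokens themselves
        "logprobs": 1,
    }
    resp = requests.post(ENDPOINT_URL, json=payload, timeout=60)
    resp.raise_for_status()
    lp = resp.json()["choices"][0]["logprobs"]
    # Keep only the tokens that fall inside the candidate answer, using the
    # character offsets reported by the endpoint.
    start = len(prompt)
    return sum(
        logprob
        for offset, logprob in zip(lp["text_offset"], lp["token_logprobs"])
        if offset >= start and logprob is not None
    )

# The highest-scoring candidate is taken as the model's answer, e.g.:
# predicted = max(choices, key=lambda c: score_choice(model, question, c))
```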
bb06b86 to
45e9484
Compare
|
/ok to test 45e9484 |
When @benchmark(choices_field=...) is set but the row doesn't expose that field, eval_logic silently falls back to @benchmark(choices=...). That makes mis-named or missing fields hard to diagnose. Emit a warning with sample_id and choices_field so the fallback is visible in run logs. Signed-off-by: kanishks <kanishks@nvidia.com>
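Roughly, the fallback-plus-warning behaviour described above might look like the sketch below; the function and parameter names are hypothetical, not the eval_logic code itself.

```python
# Illustrative sketch of the visible fallback: prefer the per-row choices
# field, and log a warning (with sample_id and choices_field) when it is
# missing, before using the static @benchmark(choices=...) list.
import logging

logger = logging.getLogger(__name__)

def resolve_choices(row: dict, sample_id: str, choices_field: str | None,
                    default_choices: list[str]) -> list[str]:
    """Prefer the per-row choices field, falling back to the static list."""
    if choices_field:
        if choices_field in row:
            return row[choices_field]
        # Make the previously silent fallback visible in run logs.
        logger.warning(
            "sample %s: choices_field '%s' not found in row; "
            "falling back to @benchmark(choices=...)",
            sample_id, choices_field,
        )
    return default_choices
```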
HuggingFaceFetcher.fetch built the cache filename from (dataset_name, config, split, filters) only, so URIs that differ only in their data_files= / field= query params collided on the same cache entry. The first fetch's content was returned for every subsequent language regardless of which file was actually requested.

This silently broke per-language datasets that share a single HF repo but split content across files (e.g. IndicGenBench/FLORES_in's flores_en_<lang>_test.json, IndicGenBench/CrossSum_in's crosssum_english-<lang>_test.json). Every language's task evaluated against whichever language was fetched first.

Append `datafile-<value>` and `field-<value>` segments to the cache filename when those options are present so each (data_files, field) pair gets a distinct on-disk cache entry. Adds a regression test that fetches three different data_files= URIs and asserts each lands in its own cache file with its own contents.

Signed-off-by: kanishks <kanishks@nvidia.com>
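A sketch of the cache-key fix under the description above; the function signature and naming are illustrative, not the actual HuggingFaceFetcher code.

```python
# Illustrative sketch: append the data_files and field query parameters to
# the cache filename so URIs that differ only in those params no longer
# collide on one cache entry.
from pathlib import Path

def cache_filename(dataset_name: str, config: str, split: str,
                   filters_hash: str, data_files: str | None = None,
                   field: str | None = None) -> Path:
    parts = [dataset_name.replace("/", "__"), config, split, filters_hash]
    if data_files:
        parts.append(f"datafile-{Path(data_files).stem}")
    if field:
        parts.append(f"field-{field}")
    return Path("_".join(parts) + ".json")

# With this, flores_en_hi_test.json and flores_en_ta_test.json from the same
# HF repo map to distinct cache files instead of sharing one entry.
```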
/ok to test 9556770
Looks good. Overall, architecturally, logprobs should be realized somewhere other than endpoint_type, because endpoint_type only indicates the endpoint itself, not behaviour (just as images, video, and audio can be realized through both chat and completions). I think logprobs would be better realized through a Strategy or some other mechanism. Similarly, choices, num_fewshots, and others should fall behind extras. Approved, contingent on agreement that the architecture is most likely going to change in the future.
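To make the suggestion concrete, the split the reviewer describes could look something like the following; this is purely illustrative and every name here is hypothetical, not existing code.

```python
# Hypothetical config shape: endpoint_type describes only the endpoint,
# a separate strategy selects generation vs. logprob scoring, and
# task-specific knobs (choices, num_fewshot, ...) live under extras.
from typing import Literal
from pydantic import BaseModel, Field

class EvalConfig(BaseModel):
    endpoint_type: Literal["chat", "completions"]           # the endpoint itself
    strategy: Literal["generate", "logprob"] = "generate"   # how it is queried
    extras: dict = Field(default_factory=dict)               # choices, num_fewshot, etc.
```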
/ok to test 9556770
Adds BYOB support for logprob-based multiple-choice evaluation and extends the dataset-loading and scoring utilities needed by the Sovereign benchmark suite.