feat(byob): add completions_logprob endpoint and extend scorers/datasets #953

Merged
wprazuch merged 3 commits into
NVIDIA-NeMo:mainfrom
kanishks-23:kanishks/sovereign_benchmarks
May 7, 2026

Conversation

@kanishks-23
Contributor

Adds BYOB support for logprob-based multiple-choice evaluation and extends the dataset-loading and scoring utilities needed by the Sovereign benchmark suite.

  • Added completions_logprob as a supported endpoint type
    • Pydantic validation now accepts target.api_endpoint.type: completions_logprob
    • CLI --model_type completions_logprob is allowed
    • Endpoint health checks treat completions_logprob like a /v1/completions endpoint
  • Added BYOB logprob evaluation flow
    • Uses /v1/completions with max_tokens=0, echo=true, and logprobs=1
    • Scores candidate continuations using returned token logprobs
    • Supports multiple-choice ranking via one request per candidate answer
    • Supports nested choice fields such as choices.text
  • Added/updated BYOB scorers
    • multiple_choice_acc: returns acc, acc_norm, and acc_greedy
    • mcq_letter_extract: supports A-J options and handles empty/None responses safely
    • Added task-oriented scorers for GSM8K-style numeric answers, yes/no tasks, chrF, and ROUGE
  • Extended Hugging Face dataset URI support
    • Parses extra query params beyond split
    • Supports trust_remote_code=true
    • Supports row filtering via filter_field/filter_value, including multiple filters with suffixes
    • This allows BYOB benchmarks to consume datasets where language is stored as a row field instead of a HF config
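The logprob evaluation flow in the bullets above can be sketched as follows. This is a hypothetical illustration, not the PR's code: the function names (`continuation_logprob`, `rank_choices`), the response-payload shape, and the prompt-length bookkeeping are assumptions based on the OpenAI-style `/v1/completions` logprobs format the PR targets.

```python
# Hypothetical sketch: score candidate continuations from an echoed
# /v1/completions response (max_tokens=0, echo=true, logprobs=1).

def continuation_logprob(response, prompt_len_tokens):
    """Sum the logprobs of the echoed tokens that belong to the candidate
    continuation, i.e. everything after the prompt's tokens."""
    token_logprobs = response["choices"][0]["logprobs"]["token_logprobs"]
    tail = token_logprobs[prompt_len_tokens:]
    # The first echoed token's logprob is typically None; skip None entries.
    return sum(lp for lp in tail if lp is not None)

def rank_choices(responses_by_choice, prompt_len_tokens, choice_lens):
    """One request per candidate answer was already made; pick the winner
    by raw summed logprob (acc) and by length-normalized logprob (acc_norm)."""
    scores = {c: continuation_logprob(r, prompt_len_tokens)
              for c, r in responses_by_choice.items()}
    norm = {c: s / choice_lens[c] for c, s in scores.items()}
    return max(scores, key=scores.get), max(norm, key=norm.get)
```

Because `acc` sums raw logprobs while `acc_norm` divides by candidate length, a longer candidate with a slightly worse per-token probability can win one metric and lose the other — which is why the scorer reports both.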

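The extended dataset-URI handling could be parsed roughly as below. The URI scheme, key names, defaults, and the suffix convention (`filter_field_2`/`filter_value_2`) are assumptions for illustration, not the loader's actual code.

```python
# Hypothetical sketch of parsing extra query params from an HF dataset URI,
# including trust_remote_code and suffixed filter_field/filter_value pairs.
from urllib.parse import parse_qs, urlparse

def parse_dataset_uri(uri):
    parsed = urlparse(uri)
    qs = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    opts = {
        "path": parsed.netloc + parsed.path,
        "split": qs.pop("split", "train"),
        "trust_remote_code": qs.pop("trust_remote_code", "false").lower() == "true",
    }
    filters = []
    for key in list(qs):  # snapshot, since we pop pairs as we go
        if key.startswith("filter_field"):
            suffix = key[len("filter_field"):]  # "" or "_2", "_3", ...
            value_key = "filter_value" + suffix
            if value_key in qs:
                filters.append((qs.pop(key), qs.pop(value_key)))
    opts["filters"] = filters
    opts["extra"] = qs  # any remaining params pass through untouched
    return opts
```

Row filtering on fields like `language` then happens after loading, which is what lets a single HF repo serve per-language benchmarks without per-language configs.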
@kanishks-23 kanishks-23 requested review from a team as code owners April 30, 2026 10:44
@copy-pr-bot

copy-pr-bot Bot commented Apr 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added documentation Improvements or additions to documentation nemo-evaluator tests community-request labels Apr 30, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 2, 2026
@kanishks-23 kanishks-23 force-pushed the kanishks/sovereign_benchmarks branch from b241165 to ca07cb8 Compare May 3, 2026 14:28
Comment thread docs/libraries/nemo-evaluator/extending/byob/cli.md Outdated
@wprazuch
Contributor

wprazuch commented May 4, 2026

@kanishks-23 Please amend sign-offs to the pushed commits, then launch the pipeline with the following syntax to make sure tests pass

@kanishks-23 kanishks-23 force-pushed the kanishks/sovereign_benchmarks branch from ca07cb8 to cdaa77a Compare May 4, 2026 15:15
@kanishks-23
Contributor Author

/ok to test cdaa77a

@kanishks-23 kanishks-23 changed the title Add completions_logprob endpoint and extend BYOB scorers/datasets for Sovereign benchmarks feat(byob): add completions_logprob endpoint and extend scorers/datasets May 4, 2026
@kanishks-23 kanishks-23 force-pushed the kanishks/sovereign_benchmarks branch from cdaa77a to 5ae25d7 Compare May 4, 2026 15:40
@kanishks-23
Contributor Author

/ok to test 5ae25d7

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 4, 2026
@kanishks-23 kanishks-23 force-pushed the kanishks/sovereign_benchmarks branch from 5ae25d7 to bb06b86 Compare May 5, 2026 07:57
@kanishks-23
Contributor Author

/ok to test bb06b86

@wprazuch
Contributor

wprazuch commented May 5, 2026

/ok to test bb06b86

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 5, 2026
Add support for logprob-based multiple-choice evaluation within the BYOB
framework, plus several new scorers and dataset-loading enhancements.

Endpoint and runner:
- Add completions_logprob as a supported endpoint type with Pydantic
  validation, CLI support, and health-check handling
- Logprob scoring uses /v1/completions with max_tokens=0, echo=true,
  logprobs=1 to score candidate answers
- New call_model_loglikelihood helper in runner.py

Scorers:
- multiple_choice_acc: logprob-based MCQ scoring
- mcq_letter_extract: parse first A-D letter from free-form responses
- gsm8k_answer: GSM8K-style numeric answer extraction
- boolean_yesno: yes/no task scorer
- chrf / chrF++ (sacreBLEU-style character n-gram F-score)
- Extended ROUGE support

Datasets:
- HuggingFace loader parses additional query parameters
- trust_remote_code=true enabled where required

Signed-off-by: kanishks <kanishks@nvidia.com>
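The letter-extraction scorer listed in this commit message might look like the minimal sketch below. The regex and the None/empty handling are assumptions; note the range is illustrative — the PR description says A-J while this commit message says A-D.

```python
# Hypothetical sketch of mcq_letter_extract: pull the first standalone
# option letter out of a free-form model response, tolerating empty input.
import re

_LETTER = re.compile(r"\b([A-J])\b")

def mcq_letter_extract(response):
    """Return the first standalone A-J letter, or None for empty/None input."""
    if not response:
        return None  # handles both None and "" safely
    match = _LETTER.search(response)
    return match.group(1) if match else None
```

First-match extraction is simple but can misfire on prose like "I think it's D" (the pronoun "I" matches first), so a production scorer would likely anchor on patterns such as "answer is X" before falling back to this.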
@kanishks-23 kanishks-23 force-pushed the kanishks/sovereign_benchmarks branch from bb06b86 to 45e9484 Compare May 5, 2026 08:57
@wprazuch
Contributor

wprazuch commented May 5, 2026

/ok to test 45e9484

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 5, 2026
  When @benchmark(choices_field=...) is set but the row doesn't expose
  that field, eval_logic silently falls back to @benchmark(choices=...).
  That makes mis-named or missing fields hard to diagnose. Emit a
  warning with sample_id and choices_field so the fallback is visible
  in run logs.

Signed-off-by: kanishks <kanishks@nvidia.com>
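The fallback this commit makes visible might be shaped like the sketch below; `resolve_choices` and its signature are hypothetical names for illustration, not the actual `eval_logic` internals.

```python
# Hypothetical sketch: prefer the per-row choices_field, fall back to the
# static @benchmark(choices=...) value, and warn so the fallback is visible.
import logging

logger = logging.getLogger(__name__)

def resolve_choices(row, sample_id, choices_field=None, choices=None):
    if choices_field:
        if choices_field in row:
            return row[choices_field]
        # Previously silent; a mis-named field now shows up in run logs.
        logger.warning(
            "sample %s: choices_field %r not found in row; "
            "falling back to @benchmark(choices=...)",
            sample_id, choices_field,
        )
    return choices
```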
HuggingFaceFetcher.fetch built the cache filename from (dataset_name, config, split, filters) only, so URIs that differ only in their data_files= / field= query params collided on the same cache entry. The first fetch's content was returned for every subsequent language regardless of which file was actually requested.

This silently broke per-language datasets that share a single HF repo but split content across files (e.g. IndicGenBench / FLORES_in's flores_en_<lang>_test.json, IndicGenBench / CrossSum_in's crosssum_english-<lang>_test.json). Every language's task evaluated against whichever language was fetched first.

Append `datafile-<value>` and `field-<value>` segments to the cache filename when those options are present so each (data_files, field) pair gets a distinct on-disk cache entry.

Adds a regression test that fetches three different data_files= URIs and asserts each lands in its own cache file with its own contents.

Signed-off-by: kanishks <kanishks@nvidia.com>
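The fix described above can be sketched with a hypothetical `cache_filename` helper (not `HuggingFaceFetcher`'s actual code): appending `datafile-<value>` and `field-<value>` segments means two URIs that differ only in `data_files=` no longer map to the same on-disk entry.

```python
# Hypothetical sketch: derive a cache filename that is distinct per
# (dataset, config, split, filters, data_files, field) combination.

def cache_filename(dataset_name, config, split,
                   filters=(), data_files=None, field=None):
    parts = [dataset_name.replace("/", "_"), config or "default", split]
    parts += [f"{k}-{v}" for k, v in filters]
    if data_files:
        parts.append(f"datafile-{data_files}")  # new segment from this fix
    if field:
        parts.append(f"field-{field}")          # new segment from this fix
    return "__".join(parts) + ".json"
```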
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 6, 2026
@wprazuch
Contributor

wprazuch commented May 6, 2026

/ok to test 9556770

Collaborator

@prokotg prokotg left a comment


Looks good. Overall, architecturally, logprobs should be realized somewhere other than endpoint_type, because endpoint_type only indicates the endpoint itself, not behaviour (just as images, video, and audio can be realized through both chat and completions). I think logprobs would be better realized through a Strategy or a similar mechanism. Similarly, choices, num_fewshots, and others should fall behind extras. Approved, contingent on agreement that the architecture is most likely going to change in the future.

@wprazuch
Contributor

wprazuch commented May 6, 2026

/ok to test 9556770

@wprazuch wprazuch merged commit 231526c into NVIDIA-NeMo:main May 7, 2026
49 checks passed
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 7, 2026

Labels

community-request documentation Improvements or additions to documentation nemo-evaluator tests
