feat(byob): add completions_logprob endpoint and extend scorers/datasets #953
Conversation
Force-pushed from b241165 to ca07cb8
@kanishks-23 Please amend sign-offs to the pushed commits and then launch the pipeline with the following syntax to make sure tests pass
Force-pushed from ca07cb8 to cdaa77a
/ok to test cdaa77a
Force-pushed from cdaa77a to 5ae25d7
/ok to test 5ae25d7
Force-pushed from 5ae25d7 to bb06b86
/ok to test bb06b86
Add support for logprob-based multiple-choice evaluation within the BYOB framework, plus several new scorers and dataset-loading enhancements.

Endpoint and runner:
- Add completions_logprob as a supported endpoint type with Pydantic validation, CLI support, and health-check handling
- Logprob scoring uses /v1/completions with max_tokens=0, echo=true, logprobs=1 to score candidate answers
- New call_model_loglikelihood helper in runner.py

Scorers:
- multiple_choice_acc: logprob-based MCQ scoring
- mcq_letter_extract: parse the first A-D letter from free-form responses
- gsm8k_answer: GSM8K-style numeric answer extraction
- boolean_yesno: yes/no task scorer
- chrf / chrF++ (sacreBLEU-style character n-gram F-score)
- Extended ROUGE support

Datasets:
- HuggingFace loader parses additional query parameters
- trust_remote_code=true enabled where required

Signed-off-by: kanishks <kanishks@nvidia.com>
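For reference, a minimal sketch of what logprob-based choice scoring against an OpenAI-compatible /v1/completions endpoint can look like. The helper name, endpoint URL, and offset handling below are illustrative assumptions, not the PR's actual call_model_loglikelihood implementation.

```python
# Illustrative sketch only: score a candidate answer by summing its token
# log-probabilities, using echo=true / max_tokens=0 so the endpoint returns
# logprobs for the prompt itself without generating anything.
import requests

ENDPOINT_URL = "http://localhost:8000/v1/completions"  # assumed local server

def score_choice(model: str, prompt: str, choice: str) -> float:
    """Return the summed log-probability of `choice` conditioned on `prompt`."""
    payload = {
        "model": model,
        "prompt": prompt + choice,
        "max_tokens": 0,   # generate nothing; we only want scores
        "echo": True,      # return logprobs for the prompt tokens themselves
        "logprobs": 1,
    }
    resp = requests.post(ENDPOINT_URL, json=payload, timeout=60)
    resp.raise_for_status()
    lp = resp.json()["choices"][0]["logprobs"]
    # Keep only the tokens that fall inside the candidate answer, using the
    # character offsets reported by the endpoint.
    start = len(prompt)
    return sum(
        logprob
        for offset, logprob in zip(lp["text_offset"], lp["token_logprobs"])
        if offset >= start and logprob is not None
    )

# The highest-scoring candidate is taken as the model's answer, e.g.:
# predicted = max(choices, key=lambda c: score_choice(model, question, c))
```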
bb06b86 to
45e9484
Compare
|
/ok to test 45e9484 |
When @benchmark(choices_field=...) is set but the row doesn't expose that field, eval_logic silently falls back to @benchmark(choices=...). That makes mis-named or missing fields hard to diagnose. Emit a warning with sample_id and choices_field so the fallback is visible in run logs. Signed-off-by: kanishks <kanishks@nvidia.com>
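Roughly, the fallback-plus-warning behaviour described above might look like the sketch below; the function and parameter names are hypothetical, not the eval_logic code itself.

```python
# Illustrative sketch of the visible fallback: prefer the per-row choices
# field, and log a warning (with sample_id and choices_field) when it is
# missing, before using the static @benchmark(choices=...) list.
import logging

logger = logging.getLogger(__name__)

def resolve_choices(row: dict, sample_id: str, choices_field: str | None,
                    default_choices: list[str]) -> list[str]:
    """Prefer the per-row choices field, falling back to the static list."""
    if choices_field:
        if choices_field in row:
            return row[choices_field]
        # Make the previously silent fallback visible in run logs.
        logger.warning(
            "sample %s: choices_field '%s' not found in row; "
            "falling back to @benchmark(choices=...)",
            sample_id, choices_field,
        )
    return default_choices
```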
HuggingFaceFetcher.fetch built the cache filename from (dataset_name, config, split, filters) only, so URIs that differ only in their data_files= / field= query params collided on the same cache entry. The first fetch's content was returned for every subsequent language regardless of which file was actually requested.

This silently broke per-language datasets that share a single HF repo but split content across files (e.g. IndicGenBench/FLORES_in's flores_en_<lang>_test.json, IndicGenBench/CrossSum_in's crosssum_english-<lang>_test.json). Every language's task evaluated against whichever language was fetched first.

Append `datafile-<value>` and `field-<value>` segments to the cache filename when those options are present so each (data_files, field) pair gets a distinct on-disk cache entry. Adds a regression test that fetches three different data_files= URIs and asserts each lands in its own cache file with its own contents.

Signed-off-by: kanishks <kanishks@nvidia.com>
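A sketch of the cache-key fix under the description above; the function signature and naming are illustrative, not the actual HuggingFaceFetcher code.

```python
# Illustrative sketch: append the data_files and field query parameters to
# the cache filename so URIs that differ only in those params no longer
# collide on one cache entry.
from pathlib import Path

def cache_filename(dataset_name: str, config: str, split: str,
                   filters_hash: str, data_files: str | None = None,
                   field: str | None = None) -> Path:
    parts = [dataset_name.replace("/", "__"), config, split, filters_hash]
    if data_files:
        parts.append(f"datafile-{Path(data_files).stem}")
    if field:
        parts.append(f"field-{field}")
    return Path("_".join(parts) + ".json")

# With this, flores_en_hi_test.json and flores_en_ta_test.json from the same
# HF repo map to distinct cache files instead of sharing one entry.
```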
/ok to test 9556770
Looks good. Overall, architecturally, logprobs should be realized somewhere other than endpoint_type, because endpoint_type only indicates the endpoint itself, not behaviour (just as images, video, and audio can be realized through both chat and completions). I think logprobs would be better realized through a Strategy or some other mechanism. Similarly, choices, num_fewshots, and others should fall behind extras. Approved, contingent on agreement that the architecture is most likely going to change in the future.
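To make the suggestion concrete, the split the reviewer describes could look something like the following; this is purely illustrative and every name here is hypothetical, not existing code.

```python
# Hypothetical config shape: endpoint_type describes only the endpoint,
# a separate strategy selects generation vs. logprob scoring, and
# task-specific knobs (choices, num_fewshot, ...) live under extras.
from typing import Literal
from pydantic import BaseModel, Field

class EvalConfig(BaseModel):
    endpoint_type: Literal["chat", "completions"]           # the endpoint itself
    strategy: Literal["generate", "logprob"] = "generate"   # how it is queried
    extras: dict = Field(default_factory=dict)               # choices, num_fewshot, etc.
```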
/ok to test 9556770
Adds BYOB support for logprob-based multiple-choice evaluation and extends the dataset-loading and scoring utilities needed by the Sovereign benchmark suite.