Commit 231526c

feat(byob): add completions_logprob endpoint and extend scorers/datasets (#953)
Adds BYOB support for logprob-based multiple-choice evaluation and extends the dataset loading and scoring utilities needed by the Sovereign benchmark suite.

- Added completions_logprob as a supported endpoint type
  - Pydantic validation now accepts target.api_endpoint.type: completions_logprob
  - CLI --model_type completions_logprob is allowed
  - Endpoint health checks treat completions_logprob like a /v1/completions endpoint
- Added BYOB logprob evaluation flow
  - Uses /v1/completions with max_tokens=0, echo=true, and logprobs=1
  - Scores candidate continuations using returned token logprobs
  - Supports multiple-choice ranking via one request per candidate answer
  - Supports nested choice fields such as choices.text
- Added/updated BYOB scorers
  - multiple_choice_acc: returns acc, acc_norm, and acc_greedy
  - mcq_letter_extract: supports A-J options and handles empty/None responses safely
  - Added task-oriented scorers for GSM8K-style numeric answers, yes/no tasks, chrF, and ROUGE
- Extended Hugging Face dataset URI support
  - Parses extra query params beyond split
  - Supports trust_remote_code=true
  - Supports row filtering via filter_field/filter_value, including multiple filters with suffixes
  - This allows BYOB benchmarks to consume datasets where language is stored as a row field instead of a HF config

---------

Signed-off-by: kanishks <kanishks@nvidia.com>
1 parent c606b06 commit 231526c

26 files changed

Lines changed: 3221 additions & 77 deletions

docs/libraries/nemo-evaluator/extending/byob/benchmark-decorator.md

Lines changed: 40 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -21,12 +21,18 @@ def check(sample: ScorerInput) -> dict:
2121
| `dataset` | `str` | required | Path to JSONL file or `hf://` URI |
2222
| `prompt` | `str` | required | Format string with `{field}` placeholders, or path to template file |
2323
| `target_field` | `str` | `"target"` | Dataset field containing ground truth |
24-
| `endpoint_type` | `str` | `"chat"` | `"chat"` or `"completions"` |
24+
| `endpoint_type` | `str` | `"chat"` | `"chat"`, `"completions"`, or `"completions_logprob"` |
2525
| `requirements` | `list` or `str` | `None` | Pip deps (list or path to requirements.txt) |
2626
| `field_mapping` | `dict` | `None` | Maps source columns to prompt field names |
2727
| `extra` | `dict` | `None` | Framework-specific params (judge config, etc.) |
2828
| `response_field` | `str` | `None` | JSONL field with pre-generated responses (eval-only mode) |
2929
| `system_prompt` | `str` | `None` | System prompt string or path to template file |
30+
| `choices` | `list[str]` | `None` | Static candidate continuations for `endpoint_type="completions_logprob"` |
31+
| `choices_field` | `str` | `None` | Dataset field containing per-row candidate continuations for `endpoint_type="completions_logprob"`; dotted paths such as `choices.text` are supported |
32+
| `num_fewshot` | `int` | `0` | Number of few-shot examples to prepend to each prompt |
33+
| `fewshot_split` | `str` | `None` | Optional split to sample few-shot examples from |
34+
| `fewshot_template` | `str` | `None` | Optional template for rendering few-shot examples |
35+
| `fewshot_separator` | `str` | `"\n\n"` | Separator between rendered few-shot examples |
3036

3137
## Name Normalization
3238

@@ -141,6 +147,39 @@ def review(sample: ScorerInput) -> dict:
141147
System prompts support Jinja2 templates with the same detection rules as user prompts.
142148
:::
143149

150+
## Logprob Multiple-Choice Benchmarks
151+
152+
Use `endpoint_type="completions_logprob"` when the benchmark should score
153+
candidate answers by likelihood instead of asking the model to generate a
154+
free-form answer. This mode calls an OpenAI-compatible `/v1/completions`
155+
endpoint with `max_tokens=0`, `echo=true`, and `logprobs=1`.
156+
157+
Static choices:
158+
159+
```python
160+
@benchmark(
161+
name="mmlu-mini",
162+
dataset="data.jsonl",
163+
prompt="Question: {question}\nAnswer:",
164+
target_field="answer",
165+
endpoint_type="completions_logprob",
166+
choices=[" A", " B", " C", " D"],
167+
)
168+
```
169+
170+
Per-row choices, including nested HuggingFace fields:
171+
172+
```python
173+
@benchmark(
174+
name="arc-mini",
175+
dataset="hf://my-org/arc-hi?split=test",
176+
prompt="Question: {{question}}\nAnswer:",
177+
target_field="answerKey",
178+
endpoint_type="completions_logprob",
179+
choices_field="choices.text",
180+
)
181+
```
182+
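To illustrate what a dotted `choices_field` such as `choices.text` refers to, here is a minimal sketch of a nested lookup. The helper name `resolve_field` and the sample row are hypothetical, and the actual BYOB resolver may behave differently (for example around missing keys); the sketch only shows the intended row access.

```python
from typing import Any, Mapping


def resolve_field(row: Mapping[str, Any], path: str) -> Any:
    """Follow a dotted path such as "choices.text" through nested mappings."""
    # Prefer an exact column match so flat names that contain dots still work.
    if path in row:
        return row[path]
    value: Any = row
    for key in path.split("."):
        value = value[key]  # raises KeyError if a segment is missing
    return value


# A made-up ARC-style row whose candidates live under a struct-like column.
row = {
    "question": "Which gas do plants absorb?",
    "choices": {"label": ["A", "B"], "text": ["Oxygen", "Carbon dioxide"]},
    "answerKey": "B",
}
choices = resolve_field(row, "choices.text")
```

With this row, `choices` is the list stored at `row["choices"]["text"]`, which is what the logprob flow would score one candidate at a time.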
144183
## See Also
145184

146185
- {ref}`byob` -- BYOB overview and quickstart

docs/libraries/nemo-evaluator/extending/byob/cli.md

Lines changed: 13 additions & 0 deletions
@@ -97,6 +97,19 @@ Use `nemo-evaluator-byob --list` to see the exact `eval_type` for each installed
9797
benchmark. This avoids guessing the normalized name.
9898
:::
9999

100+
For logprob-based multiple-choice benchmarks, use a completions endpoint that
101+
supports `echo` and `logprobs`:
102+
103+
```bash
104+
nemo-evaluator run_eval \
105+
--eval_type byob_<normalized_name>.<normalized_name> \
106+
--model_url http://localhost:8000 \
107+
--model_id my-model \
108+
--model_type completions_logprob \
109+
--output_dir ./results \
110+
--api_key_name API_KEY
111+
```
112+
100113
## See Also
101114

102115
- {ref}`byob` -- BYOB overview and quickstart

docs/libraries/nemo-evaluator/extending/byob/datasets.md

Lines changed: 105 additions & 0 deletions
@@ -45,6 +45,15 @@ hf://org/dataset?split=validation
4545
hf://org/dataset/config?split=test
4646
```
4747

48+
BYOB also accepts selected HuggingFace `load_dataset` options as query
49+
parameters:
50+
51+
```
52+
hf://org/dataset?split=test&trust_remote_code=true
53+
hf://org/dataset?split=validation&filter_field=language&filter_value=hi
54+
hf://org/dataset?split=test&data_files=file.json&field=examples
55+
```
56+
4857
### Examples
4958

5059
```python
@@ -66,6 +75,39 @@ hf://org/dataset/config?split=test
6675
)
6776
```
6877

78+
Load a dataset that ships a custom loading script by allowing remote code execution:
79+
80+
```python
81+
@benchmark(
82+
name="indommlu",
83+
dataset="hf://indolem/IndoMMLU?split=test&trust_remote_code=true",
84+
prompt="{question}\n\n{options}\n\nAnswer:",
85+
target_field="answer",
86+
)
87+
```
88+
89+
Filter a multilingual dataset to one language:
90+
91+
```python
92+
@benchmark(
93+
name="boolq-hi",
94+
dataset="hf://sarvamai/boolq-indic?split=validation&filter_field=language&filter_value=hi",
95+
prompt="Passage: {passage}\nQuestion: {question}\nAnswer:",
96+
target_field="answer",
97+
)
98+
```
99+
100+
Load a nested JSON field from a specific dataset file:
101+
102+
```python
103+
@benchmark(
104+
name="flores-hi",
105+
dataset="hf://google/IndicGenBench_flores_in?split=test&data_files=flores_en_hi_test.json&field=examples",
106+
prompt="Translate to Hindi: {source}",
107+
target_field="target",
108+
)
109+
```
110+
69111
:::{note}
70112
HuggingFace dataset fetching requires the `datasets` pip package. Install it with `pip install datasets`.
71113
:::
@@ -78,6 +120,69 @@ Downloaded datasets are cached at `~/.cache/nemo_evaluator/hf_datasets/` by defa
78120

79121
When no split is specified, the HuggingFace `datasets` library defaults are used. If the result is a `DatasetDict` (multiple splits), the first available split is selected automatically.
80122

123+
### Query Parameters
124+
125+
| Parameter | Description |
126+
|-----------|-------------|
127+
| `split` | Split passed to `datasets.load_dataset`, for example `test` or `validation`. |
128+
| `trust_remote_code=true` | Passes `trust_remote_code=True` to `datasets.load_dataset`. Required by some datasets with custom loading scripts. |
129+
| `filter_field` / `filter_value` | Filters rows after loading, keeping rows where `str(row[filter_field]) == filter_value`. |
130+
| `filter_field_1` / `filter_value_1`, etc. | Additional row filters applied in order. |
131+
| `data_files` | Passes `data_files` to `datasets.load_dataset`, useful for repositories that store examples in individual JSON files. |
132+
| `field` | Passes `field` to `datasets.load_dataset`, useful for JSON files where examples live under a top-level key such as `examples`. |
133+
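The query parameters in the table above can be pictured with a small parser. This is a sketch under stated assumptions, not the BYOB loader: `parse_hf_uri` and `apply_filters` are hypothetical names, the suffix numbering is assumed to start at `_1` as in the table, and the rows are plain dicts rather than a HuggingFace `Dataset`.

```python
from urllib.parse import parse_qsl


def parse_hf_uri(uri: str):
    """Split an hf:// URI into (dataset path, remaining params, row filters)."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    path, _, query = uri[len(prefix):].partition("?")
    params = dict(parse_qsl(query))

    # Collect filter_field/filter_value, then filter_field_1/filter_value_1, ...
    filters = []
    suffix, n = "", 0
    while f"filter_field{suffix}" in params:
        field = params.pop(f"filter_field{suffix}")
        value = params.pop(f"filter_value{suffix}")  # KeyError if unpaired
        filters.append((field, value))
        n += 1
        suffix = f"_{n}"
    return path, params, filters


def apply_filters(rows, filters):
    """Keep rows where str(row[field]) == value, applying filters in order."""
    for field, value in filters:
        rows = [row for row in rows if str(row[field]) == value]
    return rows


path, params, filters = parse_hf_uri(
    "hf://org/dataset?split=validation"
    "&filter_field=language&filter_value=hi"
    "&filter_field_1=label&filter_value_1=1"
)
rows = [
    {"language": "hi", "label": 1},
    {"language": "hi", "label": 0},
    {"language": "en", "label": 1},
]
kept = apply_filters(rows, filters)
```

Because the comparison goes through `str(...)`, the integer `label` value `1` matches the query-string value `"1"`.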
134+
::::{warning}
135+
If you put `hf://` URIs with `&` query parameters in shell command
136+
templates, quote the dataset argument:
137+
138+
```bash
139+
--dataset "{{config.params.extra.dataset.path}}"
140+
```
141+
142+
Otherwise the shell treats `&` as a background-command separator.
143+
::::
144+
145+
### `extra.dataset.*` namespace
146+
147+
BYOB groups dataset-related configuration under
148+
`config.params.extra.dataset.*` in the FDF / run_config:
149+
150+
| Key | Description |
151+
|-----|-------------|
152+
| `path` | Dataset file path or `hf://` URI (compile-time default from `@benchmark(dataset=...)`). |
153+
| `num_fewshot` | Optional few-shot example count (lm-eval-harness parity). |
154+
| `field_mapping` | Informational mirror of `@benchmark(field_mapping=...)`. |
155+
| `choices` / `choices_field` | Informational mirror of `@benchmark(choices=...)` / `@benchmark(choices_field=...)`. |
156+
157+
### Overriding the dataset at run time
158+
159+
The `@benchmark` decorator's `dataset=` value is the compile-time default. To
160+
swap it for a single run without rebuilding the benchmark, set
161+
`config.params.extra.dataset.path` via the launcher's run_config or CLI. The
162+
launcher deep-merges via OmegaConf, so sibling keys under `extra.dataset`
163+
(`num_fewshot`, `field_mapping`, etc.) and under `extra` (`benchmark_module`,
164+
`requirements`, …) are preserved.
165+
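The effect of the deep merge can be pictured with a plain-Python stand-in. This is not OmegaConf and not the launcher's code; `deep_merge` and the example values are hypothetical and only show why sibling keys survive when `extra.dataset.path` is overridden.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base, key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # descend into dicts
        else:
            merged[key] = value  # scalars and lists are replaced outright
    return merged


# Made-up compile-time defaults and a run-time override of a single leaf.
compiled = {
    "extra": {
        "benchmark_module": "my_benchmark.py",
        "dataset": {"path": "data.jsonl", "num_fewshot": 5},
    }
}
override = {"extra": {"dataset": {"path": "hf://other/foo?split=test"}}}
merged = deep_merge(compiled, override)
```

Only the `path` leaf changes; `num_fewshot` and `benchmark_module` keep their compile-time values.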
166+
```bash
167+
nemo-evaluator-launcher run --config my_config.yaml \
168+
-o 'evaluation.tasks.<task_name>.nemo_evaluator_config.config.params.extra.dataset.path=hf://other/foo?split=test'
169+
```
170+
171+
Or in a run_config YAML:
172+
173+
```yaml
174+
evaluation:
175+
tasks:
176+
- name: <task_name>
177+
nemo_evaluator_config:
178+
config:
179+
params:
180+
extra:
181+
dataset:
182+
path: hf://other/foo?split=test
183+
num_fewshot: 5
184+
```
185+
81186
## Field Mapping
82187
83188
Use `field_mapping` to rename dataset columns so they match the `{placeholder}` names in your prompt template. The mapping is applied after loading the dataset and before prompt rendering.

docs/libraries/nemo-evaluator/extending/byob/scorers.md

Lines changed: 103 additions & 4 deletions
@@ -11,9 +11,9 @@ Every scorer receives a single `ScorerInput` dataclass importable from `nemo_eva
1111
```python
1212
@dataclass
1313
class ScorerInput:
14-
response: str # Model output
14+
response: str # Model output (or argmax choice in logprob mode)
1515
target: Any # Ground truth from dataset
16-
metadata: dict # Full dataset row as a dict
16+
metadata: dict # Dataset row + per-call response metadata
1717
model_call_fn: Optional[Callable] = None
1818
config: Dict[str, Any] = field(default_factory=dict)
1919
conversation: Optional[List[dict]] = None
@@ -22,14 +22,26 @@ class ScorerInput:
2222

2323
| Field | Description |
2424
|-------|-------------|
25-
| `response` | The model output text for the current sample. |
25+
| `response` | The model output text for the current sample. In `completions_logprob` mode this is set to the choice with the highest sum-logprob (i.e. the argmax). |
2626
| `target` | The ground-truth value read from the field specified by `target_field` in `@benchmark`. |
27-
| `metadata` | The entire dataset row as a dictionary, useful for accessing additional fields beyond the target. |
27+
| `metadata` | Shared bag for **dataset-row fields and per-call response metadata**. Standard scorers use it to access any column on the row (e.g. `sample.metadata["passage"]`). Strategies that produce extra per-call data write namespaced keys (prefixed with `_`) into this dict before invoking the scorer. |
2828
| `model_call_fn` | Reserved for multi-turn evaluation (not yet implemented). |
2929
| `config` | Extra configuration passed through `extra=` in `@benchmark` (e.g. judge settings). |
3030
| `conversation` | Reserved for multi-turn benchmarks (not yet implemented). |
3131
| `turn_index` | Reserved for multi-turn benchmarks (not yet implemented). |
3232

33+
### Reserved metadata keys
34+
35+
`MultipleChoiceStrategy` (selected by `endpoint_type="completions_logprob"`) writes the following keys into `ScorerInput.metadata` before invoking the scorer:
36+
37+
| Key | Type | Description |
38+
|-----|------|-------------|
39+
| `_choices` | `list[str]` | Candidate continuations resolved from `choices=` or `choices_field=` on `@benchmark`. |
40+
| `_choices_logprobs` | `list[float]` | Per-choice sum log-probabilities returned by the loglikelihood call. Same length as `_choices`. |
41+
| `_choices_is_greedy` | `list[bool]` | Per-choice booleans: `True` when every continuation token equals the top-1 prediction (i.e. the choice would have been produced under greedy decoding). Same length as `_choices`. |
42+
43+
`response` is also set to `_choices[argmax(_choices_logprobs)]` so legacy text-based scorers continue to work in logprob mode.
44+
3345
## The @scorer Decorator
3446

3547
The `@scorer` decorator marks a function as a BYOB scorer. It validates the function signature at decoration time and sets an internal `_is_scorer` flag used by the framework.
@@ -57,6 +69,11 @@ Import built-in scorers from `nemo_evaluator.contrib.byob.scorers`:
5769
| `bleu` | `{"bleu_1": float, "bleu_2": float, "bleu_3": float, "bleu_4": float}` | Sentence-level BLEU-1 through BLEU-4 with add-1 smoothing |
5870
| `rouge` | `{"rouge_1": float, "rouge_2": float, "rouge_l": float}` | ROUGE-1, ROUGE-2, ROUGE-L F1 scores |
5971
| `retrieval_metrics` | `{"precision_at_k": float, "recall_at_k": float, "mrr": float, "ndcg": float}` | Retrieval quality metrics |
72+
| `multiple_choice_acc` | `{"acc": float, "acc_norm": float, "acc_greedy": float}` | Multiple-choice loglikelihood ranking. `acc` matches lm-evaluation-harness MMLU-style raw argmax; `acc_norm` is per-byte length-normalized argmax (ARC/BoolQ style); `acc_greedy` is a diagnostic that checks whether the highest-loglikelihood choice among those flagged greedy matches the gold answer. Requires `endpoint_type="completions_logprob"` and either `choices=` or `choices_field=` on `@benchmark`. |
73+
| `mcq_letter_extract` | `{"correct": bool, "parsed": bool}` | Extracts an A-J letter from free-form text (handles "A", "A)", "The answer is B", "(C)", "Option D", and `\boxed{E}`). Targets may be letters, integer indices, or verbatim choice text from the metadata `a`/`b`/`c`/`d` keys. Empty or `None` responses are treated as unparsed rather than raising. |
74+
| `gsm8k_answer` | `{"correct": bool, "parsed": bool}` | Canonical GSM8K numeric extractor. Tries the `#### <number>` marker first, then `\boxed{<number>}`, then falls back to the last number in the response. Strips commas and normalizes trailing zeros. |
75+
| `boolean_yesno` | `{"correct": bool, "parsed": bool}` | Extracts English yes/no decisions from free-form text. Recognizes tokens such as yes/no/yep/nope/true/false. |
76+
| `chrf` | `{"chrf": float, "chrf_pp": float}` | Sentence-level chrF and chrF++ in [0, 100]. Pure-Python sacrebleu-style formula (character 1- to 6-gram F2; chrF++ adds word 1- and 2-gram F2). |
6077
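As an illustration of the extraction order described for `gsm8k_answer` (the `####` marker, then `\boxed{...}`, then the last number), here is a rough sketch. It is not the library's implementation: `extract_number`, `_normalize`, and the regular expression are assumptions made for this example, and the real scorer may accept more number formats.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Optional sign, digits with optional thousands commas, optional decimals.
_NUM = r"-?\d[\d,]*(?:\.\d+)?"


def _normalize(raw: str) -> Optional[str]:
    """Strip commas and trailing zeros so "1,234.50" becomes "1234.5"."""
    try:
        value = Decimal(raw.replace(",", ""))
    except InvalidOperation:
        return None
    text = format(value, "f")
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


def extract_number(response: Optional[str]) -> Optional[str]:
    """Try '#### N', then '\\boxed{N}', then the last number in the text."""
    if not response:
        return None  # empty or None responses count as unparsed
    for pattern in (rf"####\s*({_NUM})", rf"\\boxed\{{({_NUM})\}}"):
        match = re.search(pattern, response)
        if match:
            return _normalize(match.group(1))
    found = re.findall(_NUM, response)
    return _normalize(found[-1]) if found else None
```

A scorer built on this would compare the normalized prediction with the normalized target and report `parsed` as whether `extract_number` returned a value.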

6178
### Usage example
6279

@@ -119,6 +136,88 @@ def combined(sample: ScorerInput) -> dict:
119136
return {**em, **f1}
120137
```
121138

139+
## Multiple-choice loglikelihood ranking
140+
141+
For MMLU-, ARC-, and BoolQ-style benchmarks, BYOB supports per-choice
142+
loglikelihood ranking with **lm-evaluation-harness parity**:
143+
144+
```python
145+
from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
146+
from nemo_evaluator.contrib.byob.scorers import multiple_choice_acc
147+
148+
@benchmark(
149+
name="mmlu-mini",
150+
dataset="hf://my-org/my-mmlu?split=test",
151+
prompt="Question: {question}\nAnswer:",
152+
target_field="answer", # gold letter, e.g. "B"
153+
endpoint_type="completions_logprob", # enables loglikelihood scoring
154+
choices=[" A", " B", " C", " D"], # same static candidates for every row
155+
num_fewshot=5, # optional fewshot prefix
156+
)
157+
@scorer
158+
def mmlu_score(sample: ScorerInput) -> dict:
159+
return multiple_choice_acc(sample) # {acc, acc_norm, acc_greedy}
160+
```
161+
162+
For datasets with **per-row variable choices** (e.g. ARC), set
163+
`choices_field` instead of `choices`:
164+
165+
```python
166+
@benchmark(
167+
...,
168+
choices_field="choices_text", # row["choices_text"] is a list[str]
169+
)
170+
```
171+
172+
Nested/dotted fields are also supported for HuggingFace datasets that store
173+
choices under a struct-like column:
174+
175+
```python
176+
@benchmark(
177+
...,
178+
choices_field="choices.text", # row["choices"]["text"]
179+
)
180+
```
181+
182+
### How it works
183+
184+
`MultipleChoiceStrategy` (selected automatically when
185+
`endpoint_type="completions_logprob"`) calls the OpenAI-compatible
186+
`/v1/completions` endpoint once per choice, exactly like lm-eval's
187+
`local-completions` adapter:
188+
189+
```text
190+
POST /v1/completions
191+
{
192+
"model": "...",
193+
"prompt": "<context><continuation>",
194+
"max_tokens": 0,
195+
"logprobs": 1,
196+
"echo": true,
197+
"temperature": 0
198+
}
199+
```
200+
201+
The runner inspects `logprobs.text_offset` to locate the continuation
202+
token span, sums `token_logprobs` over that span, and decides
203+
`is_greedy` by checking whether each continuation token matches the
204+
top-1 entry of `top_logprobs`. The resulting per-choice
205+
`(sum_logprob, is_greedy)` tuples are written into `ScorerInput.metadata`
206+
under the reserved keys `_choices`, `_choices_logprobs`, and
207+
`_choices_is_greedy`. `multiple_choice_acc` then computes:
208+
209+
- `acc` -- 1.0 iff `argmax(metadata["_choices_logprobs"]) == gold_index`
210+
(MMLU canonical).
211+
- `acc_norm` -- 1.0 iff
212+
`argmax(metadata["_choices_logprobs"][i] /
213+
max(len(metadata["_choices"][i].encode("utf-8")), 1)) == gold_index`
214+
(ARC/BoolQ canonical, per-byte length normalization).
215+
- `acc_greedy` -- 1.0 iff the highest-loglikelihood **greedy** choice
216+
matches gold (diagnostic).
217+
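The three metrics in the list above can be sketched directly from the reserved metadata keys. This is an illustrative re-implementation, not the source of `multiple_choice_acc`; the function name `multiple_choice_metrics` is hypothetical, and returning `0.0` for `acc_greedy` when no choice is flagged greedy is an assumption.

```python
from typing import Dict, List


def _argmax(values: List[float]) -> int:
    # Ties resolve to the earliest index.
    return max(range(len(values)), key=values.__getitem__)


def multiple_choice_metrics(
    choices: List[str],
    logprobs: List[float],
    is_greedy: List[bool],
    gold_index: int,
) -> Dict[str, float]:
    # acc: raw argmax over summed logprobs.
    acc = float(_argmax(logprobs) == gold_index)

    # acc_norm: divide each sum by the UTF-8 byte length of its choice.
    normed = [
        lp / max(len(choice.encode("utf-8")), 1)
        for lp, choice in zip(logprobs, choices)
    ]
    acc_norm = float(_argmax(normed) == gold_index)

    # acc_greedy: best-scoring choice among those flagged greedy (assumed 0.0
    # when no choice is greedy).
    greedy_indices = [i for i, flag in enumerate(is_greedy) if flag]
    if greedy_indices:
        best_greedy = max(greedy_indices, key=logprobs.__getitem__)
        acc_greedy = float(best_greedy == gold_index)
    else:
        acc_greedy = 0.0

    return {"acc": acc, "acc_norm": acc_norm, "acc_greedy": acc_greedy}


metrics = multiple_choice_metrics(
    choices=[" yes", " certainly not"],
    logprobs=[-4.0, -6.5],
    is_greedy=[False, True],
    gold_index=1,
)
```

In this made-up case the shorter choice wins on raw logprob, while per-byte normalization favours the longer gold choice, which is the situation `acc_norm` is meant to handle.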
218+
The gold answer can be a letter (`"A"`..`"J"`), an integer index, or
219+
the verbatim choice string -- `multiple_choice_acc` handles all three.
220+
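One way to picture how the three gold-answer forms could be reduced to a single index is sketched below. `resolve_gold_index` is a hypothetical helper, and the precedence it uses (integer, then verbatim text, then letter) is an assumption rather than the documented behaviour of `multiple_choice_acc`.

```python
from typing import Any, Optional, Sequence

LETTERS = "ABCDEFGHIJ"


def resolve_gold_index(target: Any, choices: Sequence[str]) -> Optional[int]:
    """Map a letter, integer index, or verbatim choice string to an index."""
    if isinstance(target, bool):
        return None  # bool is an int subclass; do not treat it as an index
    if isinstance(target, int):
        return target if 0 <= target < len(choices) else None

    text = str(target).strip()
    # Verbatim choice text, ignoring the leading space that logprob
    # continuations such as " A" usually carry.
    stripped = [choice.strip() for choice in choices]
    if text in stripped:
        return stripped.index(text)

    # Single letter A..J mapped to positions 0..9.
    letter = text.upper()
    if len(letter) == 1 and letter in LETTERS:
        index = LETTERS.index(letter)
        return index if index < len(choices) else None
    return None
```

Returning `None` for an unresolvable target lets the caller decide whether to count the sample as incorrect or to skip it.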
122221
## See Also
123222

124223
- {ref}`byob` -- BYOB overview and quickstart
