NVIDIA-NeMo
diff --git a/docs/libraries/nemo-evaluator/extending/byob/benchmark-decorator.md b/docs/libraries/nemo-evaluator/extending/byob/benchmark-decorator.md
Lines changed: 40 additions & 1 deletion
diff --git a/docs/libraries/nemo-evaluator/extending/byob/cli.md b/docs/libraries/nemo-evaluator/extending/byob/cli.md
Lines changed: 13 additions & 0 deletions
diff --git a/docs/libraries/nemo-evaluator/extending/byob/datasets.md b/docs/libraries/nemo-evaluator/extending/byob/datasets.md
Lines changed: 105 additions & 0 deletions
diff --git a/docs/libraries/nemo-evaluator/extending/byob/scorers.md b/docs/libraries/nemo-evaluator/extending/byob/scorers.md
Lines changed: 103 additions & 4 deletions
@@ -21,12 +21,18 @@ def check(sample: ScorerInput) -> dict:
 | `dataset` | `str` | required | Path to JSONL file or `hf://` URI |
 | `prompt` | `str` | required | Format string with `{field}` placeholders, or path to template file |
 | `target_field` | `str` | `"target"` | Dataset field containing ground truth |
-| `endpoint_type` | `str` | `"chat"` | `"chat"` or `"completions"` |
+| `endpoint_type` | `str` | `"chat"` | `"chat"`, `"completions"`, or `"completions_logprob"` |
 | `requirements` | `list` or `str` | `None` | Pip deps (list or path to requirements.txt) |
 | `field_mapping` | `dict` | `None` | Maps source columns to prompt field names |
 | `extra` | `dict` | `None` | Framework-specific params (judge config, etc.) |
 | `response_field` | `str` | `None` | JSONL field with pre-generated responses (eval-only mode) |
 | `system_prompt` | `str` | `None` | System prompt string or path to template file |
+| `choices` | `list[str]` | `None` | Static candidate continuations for `endpoint_type="completions_logprob"` |
+| `choices_field` | `str` | `None` | Dataset field containing per-row candidate continuations for `endpoint_type="completions_logprob"`; dotted paths such as `choices.text` are supported |
+| `num_fewshot` | `int` | `0` | Number of few-shot examples to prepend to each prompt |
+| `fewshot_split` | `str` | `None` | Optional split to sample few-shot examples from |
+| `fewshot_template` | `str` | `None` | Optional template for rendering few-shot examples |
+| `fewshot_separator` | `str` | `"\n\n"` | Separator between rendered few-shot examples |
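
The few-shot parameters in the table above could combine roughly as follows. This is a minimal sketch, not BYOB's implementation: `build_fewshot_prompt` is a hypothetical helper, it assumes templates are rendered with `str.format`, and it takes the first `num_fewshot` examples instead of sampling them.

```python
from typing import Mapping, Sequence


def build_fewshot_prompt(
    examples: Sequence[Mapping[str, str]],
    row: Mapping[str, str],
    prompt: str,
    fewshot_template: str,
    num_fewshot: int = 0,
    fewshot_separator: str = "\n\n",
) -> str:
    """Hypothetical helper: prepend rendered examples to the rendered prompt."""
    # Assumption: use the first `num_fewshot` examples (no random sampling).
    shots = [fewshot_template.format(**ex) for ex in examples[:num_fewshot]]
    # The separator goes between shots and between the last shot and the query.
    return fewshot_separator.join(shots + [prompt.format(**row)])


examples = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "3 + 3?", "answer": "6"},
]
text = build_fewshot_prompt(
    examples,
    row={"question": "5 + 5?"},
    prompt="Question: {question}\nAnswer:",
    fewshot_template="Question: {question}\nAnswer: {answer}",
    num_fewshot=2,
)
print(text)
```

With `num_fewshot=0` the helper returns only the rendered prompt, which matches the documented default.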
 
 ## Name Normalization
 
@@ -141,6 +147,39 @@ def review(sample: ScorerInput) -> dict:
 System prompts support Jinja2 templates with the same detection rules as user prompts.
 :::
 
+## Logprob Multiple-Choice Benchmarks
+
+Use `endpoint_type="completions_logprob"` when the benchmark should score
+candidate answers by likelihood instead of asking the model to generate a
+free-form answer. This mode calls an OpenAI-compatible `/v1/completions`
+endpoint with `max_tokens=0`, `echo=true`, and `logprobs=1`.
+
+Static choices:
+
+```python
+@benchmark(
+    name="mmlu-mini",
+    dataset="data.jsonl",
+    prompt="Question: {question}\nAnswer:",
+    target_field="answer",
+    endpoint_type="completions_logprob",
+    choices=[" A", " B", " C", " D"],
+)
+```
+
+Per-row choices, including nested HuggingFace fields:
+
+```python
+@benchmark(
+    name="arc-mini",
+    dataset="hf://my-org/arc-hi?split=test",
+    prompt="Question: {{question}}\nAnswer:",
+    target_field="answerKey",
+    endpoint_type="completions_logprob",
+    choices_field="choices.text",
+)
+```
+
 ## See Also
 
 - {ref}`byob` -- BYOB overview and quickstart
 
@@ -97,6 +97,19 @@ Use `nemo-evaluator-byob --list` to see the exact `eval_type` for each installed
 benchmark. This avoids guessing the normalized name.
 :::
 
+For logprob-based multiple-choice benchmarks, use a completions endpoint that
+supports `echo` and `logprobs`:
+
+```bash
+nemo-evaluator run_eval \
+  --eval_type byob_<normalized_name>.<normalized_name> \
+  --model_url http://localhost:8000 \
+  --model_id my-model \
+  --model_type completions_logprob \
+  --output_dir ./results \
+  --api_key_name API_KEY
+```
+
 ## See Also
 
 - {ref}`byob` -- BYOB overview and quickstart
 
@@ -45,6 +45,15 @@ hf://org/dataset?split=validation
 hf://org/dataset/config?split=test
 ```
 
+BYOB also accepts selected HuggingFace `load_dataset` options as query
+parameters:
+
+```
+hf://org/dataset?split=test&trust_remote_code=true
+hf://org/dataset?split=validation&filter_field=language&filter_value=hi
+hf://org/dataset?split=test&data_files=file.json&field=examples
+```
+
 ### Examples
 
 ```python
@@ -66,6 +75,39 @@ hf://org/dataset/config?split=test
 )
 ```
 
+Load a dataset that ships a custom loading script by allowing remote code execution (enable this only for repositories you trust):
+
+```python
+@benchmark(
+    name="indommlu",
+    dataset="hf://indolem/IndoMMLU?split=test&trust_remote_code=true",
+    prompt="{question}\n\n{options}\n\nAnswer:",
+    target_field="answer",
+)
+```
+
+Filter a multilingual dataset to one language:
+
+```python
+@benchmark(
+    name="boolq-hi",
+    dataset="hf://sarvamai/boolq-indic?split=validation&filter_field=language&filter_value=hi",
+    prompt="Passage: {passage}\nQuestion: {question}\nAnswer:",
+    target_field="answer",
+)
+```
+
+Load a nested JSON field from a specific dataset file:
+
+```python
+@benchmark(
+    name="flores-hi",
+    dataset="hf://google/IndicGenBench_flores_in?split=test&data_files=flores_en_hi_test.json&field=examples",
+    prompt="Translate to Hindi: {source}",
+    target_field="target",
+)
+```
+
 :::{note}
 HuggingFace dataset fetching requires the `datasets` pip package. Install it with `pip install datasets`.
 :::
@@ -78,6 +120,69 @@ Downloaded datasets are cached at `~/.cache/nemo_evaluator/hf_datasets/` by defa
 
 When no split is specified, the HuggingFace `datasets` library defaults are used. If the result is a `DatasetDict` (multiple splits), the first available split is selected automatically.
 
+### Query Parameters
+
+| Parameter | Description |
+|-----------|-------------|
+| `split` | Split passed to `datasets.load_dataset`, for example `test` or `validation`. |
+| `trust_remote_code=true` | Passes `trust_remote_code=True` to `datasets.load_dataset`. Required by some datasets with custom loading scripts. |
+| `filter_field` / `filter_value` | Filters rows after loading, keeping rows where `str(row[filter_field]) == filter_value`. |
+| `filter_field_1` / `filter_value_1`, etc. | Additional row filters applied in order. |
+| `data_files` | Passes `data_files` to `datasets.load_dataset`, useful for repositories that store examples in individual JSON files. |
+| `field` | Passes `field` to `datasets.load_dataset`, useful for JSON files where examples live under a top-level key such as `examples`. |
+
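The query parameters in the table can be pictured with a small parser. This is an illustrative sketch using only the standard library; `parse_hf_uri` and `keep_row` are hypothetical names, and BYOB's actual parsing may differ.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_hf_uri(uri: str) -> dict:
    """Hypothetical helper: split an hf:// URI into path, options, and filters."""
    parts = urlsplit(uri)
    if parts.scheme != "hf":
        raise ValueError(f"not an hf:// URI: {uri!r}")
    params = dict(parse_qsl(parts.query))
    # Collect filter_field/filter_value, then filter_field_1/filter_value_1, ...
    filters = []
    suffix, n = "", 0
    while f"filter_field{suffix}" in params:
        field = params.pop(f"filter_field{suffix}")
        value = params.pop(f"filter_value{suffix}")
        filters.append((field, value))
        n += 1
        suffix = f"_{n}"
    return {"path": parts.netloc + parts.path, "options": params, "filters": filters}


def keep_row(row: dict, filters: list) -> bool:
    """Apply the documented comparison: str(row[field]) == value."""
    return all(str(row[field]) == value for field, value in filters)


parsed = parse_hf_uri(
    "hf://org/dataset?split=validation&filter_field=language&filter_value=hi"
)
print(parsed)
```

Query values arrive as strings, which is why the documented filter compares against `str(row[filter_field])`.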
+::::{warning}
+If you put `hf://` URIs with `&` query parameters in shell command
+templates, quote the dataset argument:
+
+```bash
+--dataset "{{config.params.extra.dataset.path}}"
+```
+
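+Otherwise the shell treats the unquoted `&` as the background operator and
+parses the rest of the URI separately, so the dataset argument is silently
+truncated at the first `&`.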
+::::
+
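When a command line is assembled from Python instead of a template, the same quoting can be done with the standard library's `shlex.quote`. A short sketch, where `my-tool` is a placeholder command name and not a real CLI:

```python
import shlex

uri = "hf://org/dataset?split=test&trust_remote_code=true"

# shlex.quote wraps the value in single quotes because it contains
# shell-special characters such as `?` and `&`.
command = f"my-tool --dataset {shlex.quote(uri)}"
print(command)

# Round trip: a POSIX shell would see the URI as one intact argument.
argv = shlex.split(command)
```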
+### `extra.dataset.*` namespace
+
+BYOB groups dataset-related configuration under
+`config.params.extra.dataset.*` in the FDF / run_config:
+
+| Key | Description |
+|-----|-------------|
+| `path` | Dataset file path or `hf://` URI (compile-time default from `@benchmark(dataset=...)`). |
+| `num_fewshot` | Optional few-shot example count (lm-eval-harness parity). |
+| `field_mapping` | Informational mirror of `@benchmark(field_mapping=...)`. |
+| `choices` / `choices_field` | Informational mirror of `@benchmark(choices=...)` / `@benchmark(choices_field=...)`. |
+
+### Overriding the dataset at run time
+
+The `@benchmark` decorator's `dataset=` value is the compile-time default. To
+swap it for a single run without rebuilding the benchmark, set
+`config.params.extra.dataset.path` via the launcher's run_config or CLI. The
+launcher deep-merges via OmegaConf, so sibling keys under `extra.dataset`
+(`num_fewshot`, `field_mapping`, etc.) and under `extra` (`benchmark_module`,
+`requirements`, …) are preserved.
+
+```bash
+nemo-evaluator-launcher run --config my_config.yaml \
+  -o 'evaluation.tasks.<task_name>.nemo_evaluator_config.config.params.extra.dataset.path=hf://other/foo?split=test'
+```
+
+Or in a run_config YAML:
+
+```yaml
+evaluation:
+  tasks:
+    - name: <task_name>
+      nemo_evaluator_config:
+        config:
+          params:
+            extra:
+              dataset:
+                path: hf://other/foo?split=test
+                num_fewshot: 5
+```
+
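The effect of a deep merge on sibling keys can be illustrated with plain dictionaries. This sketch does not use OmegaConf and is not the launcher's code; `deep_merge` is a hypothetical helper and the `benchmark_module` value is made up for the example.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Hypothetical helper: recursively merge `override` into a copy of `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


compiled = {
    "extra": {
        "benchmark_module": "my_bench.py",  # illustrative value
        "dataset": {"path": "data.jsonl", "num_fewshot": 5},
    }
}
override = {"extra": {"dataset": {"path": "hf://other/foo?split=test"}}}
merged = deep_merge(compiled, override)
print(merged)
```

Only `path` changes; `num_fewshot` and `benchmark_module` survive because the merge descends into nested mappings instead of replacing `extra` wholesale.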
 ## Field Mapping
 
 Use `field_mapping` to rename dataset columns so they match the `{placeholder}` names in your prompt template. The mapping is applied after loading the dataset and before prompt rendering.
 
@@ -11,9 +11,9 @@ Every scorer receives a single `ScorerInput` dataclass importable from `nemo_eva
 ```python
 @dataclass
 class ScorerInput:
-    response: str              # Model output
+    response: str              # Model output (or argmax choice in logprob mode)
     target: Any                # Ground truth from dataset
-    metadata: dict             # Full dataset row as a dict
+    metadata: dict             # Dataset row + per-call response metadata
     model_call_fn: Optional[Callable] = None
     config: Dict[str, Any] = field(default_factory=dict)
     conversation: Optional[List[dict]] = None
@@ -22,14 +22,26 @@ class ScorerInput:
 
 | Field | Description |
 |-------|-------------|
-| `response` | The model output text for the current sample. |
+| `response` | The model output text for the current sample. In `completions_logprob` mode this is set to the choice with the highest sum-logprob (i.e. the argmax). |
 | `target` | The ground-truth value read from the field specified by `target_field` in `@benchmark`. |
-| `metadata` | The entire dataset row as a dictionary, useful for accessing additional fields beyond the target. |
+| `metadata` | Shared bag for **dataset-row fields and per-call response metadata**. Standard scorers use it to access any column on the row (e.g. `sample.metadata["passage"]`). Strategies that produce extra per-call data write namespaced keys (prefixed with `_`) into this dict before invoking the scorer. |
 | `model_call_fn` | Reserved for multi-turn evaluation (not yet implemented). |
 | `config` | Extra configuration passed through `extra=` in `@benchmark` (e.g. judge settings). |
 | `conversation` | Reserved for multi-turn benchmarks (not yet implemented). |
 | `turn_index` | Reserved for multi-turn benchmarks (not yet implemented). |
 
+### Reserved metadata keys
+
+`MultipleChoiceStrategy` (selected by `endpoint_type="completions_logprob"`) writes the following keys into `ScorerInput.metadata` before invoking the scorer:
+
+| Key | Type | Description |
+|-----|------|-------------|
+| `_choices` | `list[str]` | Candidate continuations resolved from `choices=` or `choices_field=` on `@benchmark`. |
+| `_choices_logprobs` | `list[float]` | Per-choice sum log-probabilities returned by the loglikelihood call. Same length as `_choices`. |
+| `_choices_is_greedy` | `list[bool]` | Per-choice booleans: `True` when every continuation token equals the top-1 prediction (i.e. the choice would have been produced under greedy decoding). Same length as `_choices`. |
+
+`response` is also set to `_choices[argmax(_choices_logprobs)]` so legacy text-based scorers continue to work in logprob mode.
+
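A custom scorer can read these keys directly. The sketch below works on a plain `metadata` dict so that it runs without the framework; `choice_probabilities` is a hypothetical helper that turns the per-choice sum-logprobs into a normalized distribution over the candidates.

```python
import math


def choice_probabilities(metadata: dict) -> dict:
    """Hypothetical helper: softmax over the per-choice sum-logprobs."""
    logprobs = metadata["_choices_logprobs"]
    peak = max(logprobs)  # subtract the max for numerical stability
    weights = [math.exp(lp - peak) for lp in logprobs]
    total = sum(weights)
    return {c: w / total for c, w in zip(metadata["_choices"], weights)}


# Illustrative metadata shaped like the reserved keys in the table above.
metadata = {
    "_choices": [" A", " B"],
    "_choices_logprobs": [math.log(0.6), math.log(0.2)],
    "_choices_is_greedy": [True, False],
}
probs = choice_probabilities(metadata)
print(probs)
```

Inside a `@scorer` function the same dict would be available as `sample.metadata`.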
 ## The @scorer Decorator
 
 The `@scorer` decorator marks a function as a BYOB scorer. It validates the function signature at decoration time and sets an internal `_is_scorer` flag used by the framework.
@@ -57,6 +69,11 @@ Import built-in scorers from `nemo_evaluator.contrib.byob.scorers`:
 | `bleu` | `{"bleu_1": float, "bleu_2": float, "bleu_3": float, "bleu_4": float}` | Sentence-level BLEU-1 through BLEU-4 with add-1 smoothing |
 | `rouge` | `{"rouge_1": float, "rouge_2": float, "rouge_l": float}` | ROUGE-1, ROUGE-2, ROUGE-L F1 scores |
 | `retrieval_metrics` | `{"precision_at_k": float, "recall_at_k": float, "mrr": float, "ndcg": float}` | Retrieval quality metrics |
+| `multiple_choice_acc` | `{"acc": float, "acc_norm": float, "acc_greedy": float}` | Multiple-choice loglikelihood ranking. `acc` matches lm-evaluation-harness MMLU-style raw argmax; `acc_norm` is per-byte length-normalized argmax (ARC/BoolQ style); `acc_greedy` checks whether the highest-loglikelihood choice among those flagged greedy matches the gold answer (diagnostic). Requires `endpoint_type="completions_logprob"` and either `choices=` or `choices_field=` on `@benchmark`. |
+| `mcq_letter_extract` | `{"correct": bool, "parsed": bool}` | Extracts an A-J letter from free-form text (handles "A", "A)", "The answer is B", "(C)", "Option D", and `\boxed{E}`). Targets may be letters, integer indices, or verbatim choice text from the metadata `a`/`b`/`c`/`d` keys. Empty or `None` responses are treated as unparsed rather than raising. |
+| `gsm8k_answer` | `{"correct": bool, "parsed": bool}` | Canonical GSM8K numeric extractor. Tries the `#### <number>` marker first, then `\boxed{<number>}`, then falls back to the last number in the response. Strips commas and normalizes trailing zeros. |
+| `boolean_yesno` | `{"correct": bool, "parsed": bool}` | Extracts English yes/no decisions from free-form text. Recognizes tokens such as yes/no/yep/nope/true/false. |
+| `chrf` | `{"chrf": float, "chrf_pp": float}` | Sentence-level chrF and chrF++ in [0, 100]. Pure-Python sacrebleu-style formula (character 1- to 6-gram F2; chrF++ adds word 1- and 2-gram F2). |
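
The fallback order described for `gsm8k_answer` can be sketched as below. This is an independent illustration under stated assumptions, not the library's code: `extract_gsm8k_number` and the number pattern are hypothetical, and the built-in scorer may accept more formats.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed number shape: optional sign, digits with commas, optional decimals.
_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def _normalize(text: str) -> str:
    """Strip commas and trailing zeros, e.g. "1,234.50" -> "1234.5"."""
    return format(Decimal(text.replace(",", "")).normalize(), "f")


def extract_gsm8k_number(response: str) -> Optional[str]:
    """Hypothetical helper: '#### n' first, then boxed n, then the last number."""
    for pattern in (rf"####\s*({_NUMBER})", rf"\\boxed\{{({_NUMBER})\}}"):
        match = re.search(pattern, response)
        if match:
            return _normalize(match.group(1))
    numbers = re.findall(_NUMBER, response)
    return _normalize(numbers[-1]) if numbers else None


print(extract_gsm8k_number("3 boxes of 6 eggs. #### 18"))
```

Returning `None` when nothing parses corresponds to the `parsed` flag in the scorer's output.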
 
 ### Usage example
 
@@ -119,6 +136,88 @@ def combined(sample: ScorerInput) -> dict:
     return {**em, **f1}
 ```
 
+## Multiple-choice loglikelihood ranking
+
+For MMLU-, ARC-, and BoolQ-style benchmarks, BYOB supports per-choice
+loglikelihood ranking with **lm-evaluation-harness parity**:
+
+```python
+from nemo_evaluator.contrib.byob import benchmark, scorer, ScorerInput
+from nemo_evaluator.contrib.byob.scorers import multiple_choice_acc
+
+@benchmark(
+    name="mmlu-mini",
+    dataset="hf://my-org/my-mmlu?split=test",
+    prompt="Question: {question}\nAnswer:",
+    target_field="answer",                    # gold letter, e.g. "B"
+    endpoint_type="completions_logprob",      # enables loglikelihood scoring
+    choices=[" A", " B", " C", " D"],         # same candidates for every row
+    num_fewshot=5,                            # optional few-shot prefix
+)
+@scorer
+def mmlu_score(sample: ScorerInput) -> dict:
+    return multiple_choice_acc(sample)        # {acc, acc_norm, acc_greedy}
+```
+
+For datasets with **per-row variable choices** (e.g. ARC), set
+`choices_field` instead of `choices`:
+
+```python
+@benchmark(
+    ...,
+    choices_field="choices_text",             # row["choices_text"] is a list[str]
+)
+```
+
+Nested/dotted fields are also supported for HuggingFace datasets that store
+choices under a struct-like column:
+
+```python
+@benchmark(
+    ...,
+    choices_field="choices.text",             # row["choices"]["text"]
+)
+```
+
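Dotted-path lookup of this kind amounts to walking nested mappings. A minimal sketch, assuming each path segment is a dictionary key; `resolve_dotted` is a hypothetical helper and the row below is illustrative data in an ARC-like shape.

```python
from typing import Any, Mapping


def resolve_dotted(row: Mapping[str, Any], path: str) -> Any:
    """Hypothetical helper: follow "a.b.c" through nested mappings."""
    value: Any = row
    for key in path.split("."):
        value = value[key]  # raises KeyError if a segment is missing
    return value


# Illustrative row with choices stored under a struct-like column.
row = {
    "question": "Which color is the sky on a clear day?",
    "choices": {"text": ["red", "blue"], "label": ["A", "B"]},
    "answerKey": "B",
}
candidates = resolve_dotted(row, "choices.text")
print(candidates)
```

A path without dots, such as `answerKey`, resolves to the top-level column unchanged.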
+### How it works
+
+`MultipleChoiceStrategy` (selected automatically when
+`endpoint_type="completions_logprob"`) calls the OpenAI-compatible
+`/v1/completions` endpoint once per choice, exactly like lm-eval's
+`local-completions` adapter:
+
+```text
+POST /v1/completions
+{
+  "model": "...",
+  "prompt": "<context><continuation>",
+  "max_tokens": 0,
+  "logprobs": 1,
+  "echo": true,
+  "temperature": 0
+}
+```
+
+The runner inspects `logprobs.text_offset` to locate the continuation
+token span, sums `token_logprobs` over that span, and sets `is_greedy`
+by checking whether every continuation token matches the top-1 entry of
+`top_logprobs`. The candidate strings, their per-choice sum-logprobs,
+and their greedy flags are written into `ScorerInput.metadata` under the
+reserved keys `_choices`, `_choices_logprobs`, and
+`_choices_is_greedy`. `multiple_choice_acc` then computes:
+
+- `acc` -- 1.0 iff `argmax(metadata["_choices_logprobs"]) == gold_index`
+  (MMLU canonical).
+- `acc_norm` -- 1.0 iff the index `i` that maximizes
+  `metadata["_choices_logprobs"][i] /
+  max(len(metadata["_choices"][i].encode("utf-8")), 1)` equals `gold_index`
+  (ARC/BoolQ canonical, per-byte length normalization).
+- `acc_greedy` -- 1.0 iff the highest-loglikelihood **greedy** choice
+  matches gold (diagnostic).
+
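The span-location step described before the list can be sketched against a completions-style `logprobs` object. This is an illustration, not the runner's code: `score_continuation` is a hypothetical helper, the response below is hand-written, and it assumes no token straddles the boundary between context and continuation.

```python
from typing import Tuple


def score_continuation(logprobs: dict, context: str) -> Tuple[float, bool]:
    """Hypothetical helper: sum continuation logprobs and flag greedy matches."""
    offsets = logprobs["text_offset"]
    # First token whose character offset is at or past the end of the context.
    start = next(i for i, off in enumerate(offsets) if off >= len(context))
    total, is_greedy = 0.0, True
    for i in range(start, len(offsets)):
        total += logprobs["token_logprobs"][i]
        top = logprobs["top_logprobs"][i]
        best = max(top, key=top.get)  # top-1 token at this position
        is_greedy = is_greedy and best == logprobs["tokens"][i]
    return total, is_greedy


# Hand-written response for the prompt "Answer:" + " B" (values are made up).
fake = {
    "tokens": ["Answer", ":", " B"],
    "text_offset": [0, 6, 7],
    "token_logprobs": [None, -0.1, -1.2],
    "top_logprobs": [None, {":": -0.1}, {" A": -0.5, " B": -1.2}],
}
result = score_continuation(fake, context="Answer:")
print(result)
```

Here the continuation ` B` is not the top-1 token at its position, so the choice is scored but not flagged greedy.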
+The gold answer can be a letter (`"A"`..`"J"`), an integer index, or
+the verbatim choice string -- `multiple_choice_acc` handles all three.
+
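The three metrics and the gold-answer handling can be reproduced in a few lines. This is a sketch under assumptions, not the source of `multiple_choice_acc`: both function names are hypothetical, verbatim choice text is checked before letters, and `acc_greedy` is taken to be 0.0 when no choice is flagged greedy.

```python
from typing import Dict, List, Sequence, Union


def resolve_gold(target: Union[str, int], choices: Sequence[str]) -> int:
    """Hypothetical helper: map an index, letter, or choice text to an index."""
    if isinstance(target, int):
        return target
    stripped = [c.strip() for c in choices]
    if target.strip() in stripped:  # assumption: verbatim text wins over letters
        return stripped.index(target.strip())
    letter = target.strip().upper()
    if len(letter) == 1 and "A" <= letter <= "J":
        return ord(letter) - ord("A")
    raise ValueError(f"cannot resolve gold answer {target!r}")


def multiple_choice_metrics(
    choices: List[str], logprobs: List[float], is_greedy: List[bool], gold: int
) -> Dict[str, float]:
    """Hypothetical re-implementation of acc, acc_norm, and acc_greedy."""
    def argmax(values: List[float]) -> int:
        return max(range(len(values)), key=values.__getitem__)

    # Per-byte length normalization, guarding against empty choices.
    normed = [lp / max(len(c.encode("utf-8")), 1) for lp, c in zip(logprobs, choices)]
    greedy = [i for i, flag in enumerate(is_greedy) if flag]
    best_greedy = max(greedy, key=logprobs.__getitem__) if greedy else None
    return {
        "acc": float(argmax(logprobs) == gold),
        "acc_norm": float(argmax(normed) == gold),
        "acc_greedy": float(best_greedy == gold),
    }


choices = ["yes", "absolutely not"]
gold = resolve_gold("B", choices)
metrics = multiple_choice_metrics(choices, [-3.0, -4.2], [True, False], gold)
print(metrics)
```

In this example the longer choice loses on raw sum-logprob but wins after per-byte normalization, which is the case `acc_norm` is meant to capture.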
 ## See Also
 
 - {ref}`byob` -- BYOB overview and quickstart