
Commit ca07cb8

committed
Address review comments
Signed-off-by: kanishks <kanishks@nvidia.com>
1 parent 3ac7a3e commit ca07cb8

15 files changed

Lines changed: 590 additions & 230 deletions

File tree

docs/libraries/nemo-evaluator/extending/byob/datasets.md

Lines changed: 42 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -136,12 +136,53 @@ If you put `hf://` URIs with `&` query parameters in shell command
136136
templates, quote the dataset argument:
137137

138138
```bash
139-
--dataset "{{config.params.extra.dataset_uri}}"
139+
--dataset "{{config.params.extra.dataset.path}}"
140140
```
141141

142142
Otherwise the shell treats `&` as a background-command separator.
143143
::::
144144

145+
### `extra.dataset.*` namespace
146+
147+
BYOB groups dataset-related configuration under
148+
`config.params.extra.dataset.*` in the FDF / run_config:
149+
150+
| Key | Description |
151+
|-----|-------------|
152+
| `path` | Dataset file path or `hf://` URI (compile-time default from `@benchmark(dataset=...)`). |
153+
| `num_fewshot` | Optional few-shot example count (lm-eval-harness parity). |
154+
| `field_mapping` | Informational mirror of `@benchmark(field_mapping=...)`. |
155+
| `choices` / `choices_field` | Informational mirror of `@benchmark(choices=...)` / `@benchmark(choices_field=...)`. |
156+
157+
### Overriding the dataset at run time
158+
159+
The `@benchmark` decorator's `dataset=` value is the compile-time default. To
160+
swap it for a single run without rebuilding the benchmark, set
161+
`config.params.extra.dataset.path` via the launcher's run_config or CLI. The
162+
launcher deep-merges via OmegaConf, so sibling keys under `extra.dataset`
163+
(`num_fewshot`, `field_mapping`, etc.) and under `extra` (`benchmark_module`,
164+
`requirements`, …) are preserved.
165+
166+
```bash
167+
nemo-evaluator-launcher run --config my_config.yaml \
168+
-o 'evaluation.tasks.<task_name>.nemo_evaluator_config.config.params.extra.dataset.path=hf://other/foo?split=test'
169+
```
170+
171+
Or in a run_config YAML:
172+
173+
```yaml
174+
evaluation:
175+
tasks:
176+
- name: <task_name>
177+
nemo_evaluator_config:
178+
config:
179+
params:
180+
extra:
181+
dataset:
182+
path: hf://other/foo?split=test
183+
num_fewshot: 5
184+
```
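The sibling-preserving behaviour described above can be illustrated with a minimal sketch. It uses a plain-Python recursive merge as a stand-in for the launcher's OmegaConf merge; `deep_merge` and the sample values (`my_bench.py`, `data/train.jsonl`) are hypothetical and not part of the launcher.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge ``override`` into a copy of ``base``.

    Stand-in for a deep merge; nested dicts are merged key by key,
    every other value is replaced wholesale.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical compile-time defaults, shaped like ``config.params.extra``.
compiled_extra = {
    "benchmark_module": "my_bench.py",
    "requirements": [],
    "dataset": {"path": "data/train.jsonl", "num_fewshot": 5},
}

# A run-time override that only touches ``extra.dataset.path``.
run_override = {"dataset": {"path": "hf://other/foo?split=test"}}

merged = deep_merge(compiled_extra, run_override)
print(merged["dataset"])
```

Only `path` changes; `num_fewshot`, `benchmark_module`, and `requirements` carry over from the compile-time defaults.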
185+
145186
## Field Mapping
146187
147188
Use `field_mapping` to rename dataset columns so they match the `{placeholder}` names in your prompt template. The mapping is applied after loading the dataset and before prompt rendering.
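As a rough sketch of the renaming step, the snippet below applies a mapping to one row and then renders a prompt. It assumes the mapping is keyed by dataset column with the placeholder name as the value; `apply_field_mapping`, the sample row, and the prompt string are hypothetical and not taken from the BYOB runner.

```python
def apply_field_mapping(row: dict, field_mapping: dict) -> dict:
    """Return a copy of ``row`` with columns renamed per ``field_mapping``.

    Assumed direction: {dataset_column: placeholder_name}. Columns that are
    not listed in the mapping keep their original name.
    """
    return {field_mapping.get(col, col): value for col, value in row.items()}


# Hypothetical dataset row whose column names differ from the prompt placeholders.
row = {"query": "What is 2 + 2?", "gold": "4"}
mapped = apply_field_mapping(row, {"query": "question", "gold": "answer"})

# Rendering happens after the rename, so the template sees ``{question}``.
prompt = "Q: {question}\nA:".format(**mapped)
print(prompt)
```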

docs/libraries/nemo-evaluator/extending/byob/scorers.md

Lines changed: 24 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -13,28 +13,34 @@ Every scorer receives a single `ScorerInput` dataclass importable from `nemo_eva
1313
class ScorerInput:
1414
response: str # Model output (or argmax choice in logprob mode)
1515
target: Any # Ground truth from dataset
16-
metadata: dict # Full dataset row as a dict
16+
metadata: dict # Dataset row + per-call response metadata
1717
model_call_fn: Optional[Callable] = None
1818
config: Dict[str, Any] = field(default_factory=dict)
1919
conversation: Optional[List[dict]] = None
2020
turn_index: Optional[int] = None
21-
choices: Optional[List[str]] = None
22-
choices_logprobs: Optional[List[float]] = None
23-
choices_is_greedy: Optional[List[bool]] = None
2421
```
2522

2623
| Field | Description |
2724
|-------|-------------|
2825
| `response` | The model output text for the current sample. In `completions_logprob` mode this is set to the choice with the highest sum-logprob (i.e. the argmax). |
2926
| `target` | The ground-truth value read from the field specified by `target_field` in `@benchmark`. |
30-
| `metadata` | The entire dataset row as a dictionary, useful for accessing additional fields beyond the target. |
27+
| `metadata` | Shared bag for **dataset-row fields and per-call response metadata**. Standard scorers use it to access any column on the row (e.g. `sample.metadata["passage"]`). Strategies that produce extra per-call data write namespaced keys (prefixed with `_`) into this dict before invoking the scorer. |
3128
| `model_call_fn` | Reserved for multi-turn evaluation (not yet implemented). |
3229
| `config` | Extra configuration passed through `extra=` in `@benchmark` (e.g. judge settings). |
3330
| `conversation` | Reserved for multi-turn benchmarks (not yet implemented). |
3431
| `turn_index` | Reserved for multi-turn benchmarks (not yet implemented). |
35-
| `choices` | Populated by `MultipleChoiceStrategy` with the candidate continuation list (resolved from `choices=` or `choices_field=` on `@benchmark`). |
36-
| `choices_logprobs` | Per-choice sum log-probabilities returned by the loglikelihood call. Same length as `choices`. |
37-
| `choices_is_greedy` | Per-choice booleans: True when every continuation token equals the top-1 prediction (i.e. the choice would have been produced under greedy decoding). Same length as `choices`. |
32+
33+
### Reserved metadata keys
34+
35+
`MultipleChoiceStrategy` (selected by `endpoint_type="completions_logprob"`) writes the following keys into `ScorerInput.metadata` before invoking the scorer:
36+
37+
| Key | Type | Description |
38+
|-----|------|-------------|
39+
| `_choices` | `list[str]` | Candidate continuations resolved from `choices=` or `choices_field=` on `@benchmark`. |
40+
| `_choices_logprobs` | `list[float]` | Per-choice sum log-probabilities returned by the loglikelihood call. Same length as `_choices`. |
41+
| `_choices_is_greedy` | `list[bool]` | Per-choice booleans: `True` when every continuation token equals the top-1 prediction (i.e. the choice would have been produced under greedy decoding). Same length as `_choices`. |
42+
43+
`response` is also set to `_choices[argmax(_choices_logprobs)]` so legacy text-based scorers continue to work in logprob mode.
3844

3945
## The @scorer Decorator
4046

@@ -196,15 +202,16 @@ The runner inspects `logprobs.text_offset` to locate the continuation
196202
token span, sums `token_logprobs` over that span, and decides
197203
`is_greedy` by checking whether each continuation token matches the
198204
top-1 entry of `top_logprobs`. The resulting per-choice
199-
`(sum_logprob, is_greedy)` tuples are placed on `ScorerInput.choices`,
200-
`choices_logprobs`, and `choices_is_greedy`. `multiple_choice_acc`
201-
then computes:
202-
203-
- `acc` -- 1.0 iff `argmax(choices_logprobs) == gold_index` (MMLU
204-
canonical).
205-
- `acc_norm` -- 1.0 iff `argmax(choices_logprobs[i] /
206-
max(len(choices[i].encode("utf-8")), 1)) == gold_index` (ARC/BoolQ
207-
canonical, per-byte length normalization).
205+
`(sum_logprob, is_greedy)` tuples are written into `ScorerInput.metadata`
206+
under the reserved keys `_choices`, `_choices_logprobs`, and
207+
`_choices_is_greedy`. `multiple_choice_acc` then computes:
208+
209+
- `acc` -- 1.0 iff `argmax(metadata["_choices_logprobs"]) == gold_index`
210+
(MMLU canonical).
211+
- `acc_norm` -- 1.0 iff
212+
`argmax(metadata["_choices_logprobs"][i] /
213+
max(len(metadata["_choices"][i].encode("utf-8")), 1)) == gold_index`
214+
(ARC/BoolQ canonical, per-byte length normalization).
208215
- `acc_greedy` -- 1.0 iff the highest-loglikelihood **greedy** choice
209216
matches gold (diagnostic).
210217
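The three metrics above can be sketched directly from the reserved metadata keys. This is an illustrative reimplementation under stated assumptions, not the library's `multiple_choice_acc`: the function name is hypothetical, ties resolve to the first maximal index, and `acc_greedy` is taken to be 0.0 when no choice is greedy.

```python
from typing import Dict, List


def multiple_choice_metrics(metadata: dict, gold_index: int) -> Dict[str, float]:
    """Compute acc / acc_norm / acc_greedy from the reserved ``_choices*`` keys."""
    choices: List[str] = metadata["_choices"]
    logprobs: List[float] = metadata["_choices_logprobs"]
    is_greedy: List[bool] = metadata["_choices_is_greedy"]
    indices = range(len(choices))

    # acc: plain argmax over the summed log-probabilities.
    best = max(indices, key=lambda i: logprobs[i])

    # acc_norm: argmax after dividing by the UTF-8 byte length of each choice.
    best_norm = max(
        indices,
        key=lambda i: logprobs[i] / max(len(choices[i].encode("utf-8")), 1),
    )

    # acc_greedy: argmax restricted to greedy choices (assumption: no greedy
    # choice means the sample scores 0.0).
    greedy = [i for i in indices if is_greedy[i]]
    best_greedy = max(greedy, key=lambda i: logprobs[i]) if greedy else None

    return {
        "acc": float(best == gold_index),
        "acc_norm": float(best_norm == gold_index),
        "acc_greedy": float(best_greedy == gold_index),
    }


# Hypothetical sample: the long gold answer loses on raw logprob but wins
# once the score is normalized per byte.
metadata = {
    "_choices": ["yes", "no, definitely not"],
    "_choices_logprobs": [-2.0, -6.0],
    "_choices_is_greedy": [False, True],
}
metrics = multiple_choice_metrics(metadata, gold_index=1)
print(metrics)
```

The example is chosen so the metrics disagree, which is the case where reporting `acc_norm` alongside `acc` matters.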

packages/nemo-evaluator/src/nemo_evaluator/client/client.py

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -232,6 +232,36 @@ async def _make_request():
232232
response = await self._retry_with_backoff(_make_request)
233233
return response.choices[0].text or ""
234234

235+
async def loglikelihood(self, prompt: str, **kwargs) -> dict:
236+
"""Score *prompt* for per-token loglikelihoods (lm-eval-harness contract).
237+
238+
Posts ``/v1/completions`` with ``echo=true, logprobs=1, max_tokens=0``
239+
so the server returns per-token log-probabilities for the entire
240+
prompt without generating new tokens. Returns the full response body
241+
as a dict so callers can inspect ``logprobs.tokens``,
242+
``logprobs.token_logprobs``, ``logprobs.text_offset``, and
243+
``logprobs.top_logprobs``.
244+
245+
Honours ``self.semaphore`` and ``self._retry_with_backoff`` exactly
246+
like ``chat_completion`` / ``completion``.
247+
"""
248+
params = {
249+
"model": self.model_id,
250+
"prompt": prompt,
251+
"max_tokens": 0,
252+
"temperature": 0.0,
253+
"logprobs": 1,
254+
"echo": True,
255+
**kwargs,
256+
}
257+
258+
async def _make_request():
259+
async with self.semaphore:
260+
return await self.client.completions.create(**params)
261+
262+
response = await self._retry_with_backoff(_make_request)
263+
return response.model_dump()
264+
235265
def completions(
236266
self,
237267
prompts: List[str],
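A caller could turn the `logprobs` block of the returned body into a per-choice score roughly as follows. This is a sketch, not the runner's code: `score_continuation` is a hypothetical name, the input is assumed to be the `choices[0]["logprobs"]` dict of a completions response, and the context/continuation boundary is assumed to fall exactly on a token boundary.

```python
from typing import Tuple


def score_continuation(logprobs: dict, context_len: int) -> Tuple[float, bool]:
    """Return ``(sum_logprob, is_greedy)`` for the continuation tokens.

    A token belongs to the continuation when its character offset is at or
    beyond ``context_len`` (the length of the context part of the prompt).
    """
    total = 0.0
    is_greedy = True
    for token, lp, offset, top in zip(
        logprobs["tokens"],
        logprobs["token_logprobs"],
        logprobs["text_offset"],
        logprobs["top_logprobs"],
    ):
        if offset < context_len:
            continue  # still inside the context; not scored
        total += lp
        # Greedy only if the echoed token is the most likely entry at this position.
        if max(top, key=top.get) != token:
            is_greedy = False
    return total, is_greedy


# Hand-built example for the prompt "The sky is blue" with context "The sky is".
# The first echoed token has no log-probability, hence the None entries.
example = {
    "tokens": ["The", " sky", " is", " blue"],
    "token_logprobs": [None, -1.2, -0.4, -0.7],
    "text_offset": [0, 3, 7, 10],
    "top_logprobs": [None, {" sky": -1.2}, {" is": -0.4}, {" grey": -0.5, " blue": -0.7}],
}
total, greedy = score_continuation(example, context_len=len("The sky is"))
print(total, greedy)
```

Here the continuation ` blue` is scored at -0.7 but is not greedy, because ` grey` ranks higher at that position.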

packages/nemo-evaluator/src/nemo_evaluator/contrib/byob/cli.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -202,7 +202,8 @@ def byob_compile(args=None):
202202
for name, fdf in compiled.items():
203203
eval_entry = fdf["evaluations"][0]
204204
print(f" - {eval_entry['name']} (normalized: {name})")
205-
ds = fdf["defaults"]["config"]["params"]["extra"]["dataset"]
205+
ds_cfg = fdf["defaults"]["config"]["params"]["extra"]["dataset"]
206+
ds = ds_cfg["path"] if isinstance(ds_cfg, dict) else ds_cfg
206207
print(f" Dataset: {ds}")
207208
if os.path.exists(ds):
208209
with open(ds) as f:

packages/nemo-evaluator/src/nemo_evaluator/contrib/byob/compiler.py

Lines changed: 29 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -42,26 +42,22 @@
4242
# Jinja2 command template for runner invocation
4343
# NOTE: Use plain string concatenation to avoid f-string escaping issues with {{ }}
4444
#
45-
# Dataset resolution precedence (Req 2 - "swap the input file, keep same
46-
# task name / prompt / scoring"):
45+
# Dataset-related config is grouped under ``config.params.extra.dataset.*``:
4746
#
48-
# 1. config.params.extra.dataset_uri (override; hf:// URI or local path)
49-
# 2. config.params.extra.dataset (compile-time default from @benchmark)
47+
# - ``path`` -- dataset file path or ``hf://`` URI (compile-time
48+
# default from ``@benchmark(dataset=...)``)
49+
# - ``num_fewshot`` -- optional few-shot example count (lm-eval-harness
50+
# parity)
51+
# - ``field_mapping``, ``choices``, ``choices_field`` -- informational
52+
# metadata; the runner picks up the live values from
53+
# the ``@benchmark`` registry, but they appear in the
54+
# FDF for inspection / override.
5055
#
51-
# When dataset_uri is set on a run (e.g. via
52-
# ``evaluation.tasks[0].nemo_evaluator_config.config.params.extra.dataset_uri=...``)
53-
# the runner fetches that URI instead, without any change to the benchmark
54-
# module, prompt template, scorer, or task name.
5556
COMMAND_TEMPLATE = (
5657
"python -m nemo_evaluator.contrib.byob.runner"
5758
" --benchmark-module {{config.params.extra.benchmark_module}}"
5859
" --benchmark-name {{config.params.task}}"
59-
"{% if config.params.extra.dataset_uri is defined"
60-
" and config.params.extra.dataset_uri is not none %}"
61-
' --dataset "{{config.params.extra.dataset_uri}}"'
62-
"{% else %}"
63-
' --dataset "{{config.params.extra.dataset}}"'
64-
"{% endif %}"
60+
' --dataset "{{config.params.extra.dataset.path}}"'
6561
" --output-dir {{config.output_dir}}"
6662
" --model-url {{target.api_endpoint.url}}"
6763
" --model-id {{target.api_endpoint.model_id}}"
@@ -84,9 +80,9 @@
8480
"{% if config.params.request_timeout is not none %}"
8581
" --request-timeout {{config.params.request_timeout}}"
8682
"{% endif %}"
87-
"{% if config.params.extra.num_fewshot is defined"
88-
" and config.params.extra.num_fewshot is not none %}"
89-
" --num-fewshot {{config.params.extra.num_fewshot}}"
83+
"{% if config.params.extra.dataset.num_fewshot is defined"
84+
" and config.params.extra.dataset.num_fewshot is not none %}"
85+
" --num-fewshot {{config.params.extra.dataset.num_fewshot}}"
9086
"{% endif %}"
9187
)
9288

@@ -108,29 +104,27 @@ def _build_fdf(
108104
Returns:
109105
FDF dict ready for YAML serialization.
110106
"""
111-
extra_params: dict = {
112-
"benchmark_module": benchmark_module_ref,
113-
"dataset": dataset_path,
114-
# ``dataset_uri`` is the Req 2 override slot: setting it at run
115-
# time (e.g. to a different hf:// URI with the same schema) makes
116-
# the BYOB runner load that dataset instead of ``dataset`` while
117-
# keeping task name, prompt template, and scorer unchanged. Null
118-
# by default so the compile-time ``dataset`` is used.
119-
"dataset_uri": None,
120-
"requirements": bench.requirements,
121-
}
122-
# Propagate field_mapping if declared
107+
# Dataset-specific config grouped under ``extra.dataset.*`` so that all
108+
# dataset-shaped settings (path, fewshot count, field mapping, candidate
109+
# choices) live under one namespace and don't pollute the top level of
110+
# ``extra``.
111+
dataset_params: dict = {"path": dataset_path}
123112
if bench.field_mapping:
124-
extra_params["field_mapping"] = bench.field_mapping
125-
# Few-shot defaults (lm-eval-harness parity)
113+
dataset_params["field_mapping"] = bench.field_mapping
126114
if bench.num_fewshot:
127-
extra_params["num_fewshot"] = bench.num_fewshot
115+
dataset_params["num_fewshot"] = bench.num_fewshot
128116
# Multiple-choice loglikelihood metadata (informational; the runner
129-
# picks up choices/choices_field from the @benchmark itself)
117+
# picks up choices/choices_field from the @benchmark registry itself).
130118
if bench.choices is not None:
131-
extra_params["choices"] = list(bench.choices)
119+
dataset_params["choices"] = list(bench.choices)
132120
if bench.choices_field is not None:
133-
extra_params["choices_field"] = bench.choices_field
121+
dataset_params["choices_field"] = bench.choices_field
122+
123+
extra_params: dict = {
124+
"benchmark_module": benchmark_module_ref,
125+
"dataset": dataset_params,
126+
"requirements": bench.requirements,
127+
}
134128
# Propagate judge config(s) from @benchmark kwargs
135129
# Supports: judge={...}, judge_1={...}, judge_2={...}, etc.
136130
for key, value in bench.extra_config.items():
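For run configs written against the old flat layout, the move to `extra.dataset.*` can be sketched as a small migration. `migrate_extra` is a hypothetical helper, not part of this commit; it assumes the old precedence shown in the removed template (a non-null `dataset_uri` wins over `dataset`).

```python
def migrate_extra(legacy: dict) -> dict:
    """Move legacy flat dataset keys under ``extra["dataset"]``.

    Hypothetical helper for illustration; expects the pre-commit layout in
    which ``dataset`` is a plain string.
    """
    dataset_keys = ("num_fewshot", "field_mapping", "choices", "choices_field")
    moved = set(dataset_keys) | {"dataset", "dataset_uri"}

    # Keys such as benchmark_module and requirements stay at the top level.
    extra = {k: v for k, v in legacy.items() if k not in moved}

    # Old precedence: dataset_uri (when set) overrode the compile-time dataset.
    uri = legacy.get("dataset_uri")
    dataset = {"path": uri if uri is not None else legacy.get("dataset")}
    for key in dataset_keys:
        if legacy.get(key) is not None:
            dataset[key] = legacy[key]

    extra["dataset"] = dataset
    return extra


legacy_extra = {
    "benchmark_module": "bench.py",
    "dataset": "data.jsonl",
    "dataset_uri": "hf://other/foo?split=test",
    "num_fewshot": 5,
    "requirements": [],
}
new_extra = migrate_extra(legacy_extra)
print(new_extra)
```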

packages/nemo-evaluator/src/nemo_evaluator/contrib/byob/containerize.py

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -118,7 +118,7 @@ def generate_dockerfile(
118118
def rewrite_fdf_paths(fdf: dict, pkg_name: str) -> dict:
119119
"""Rewrite host-local paths in an FDF dict to container paths.
120120
121-
Transforms ``extra.benchmark_module`` and ``extra.dataset`` from
121+
Transforms ``extra.benchmark_module`` and ``extra.dataset.path`` from
122122
absolute host paths to container-relative paths under ``/opt/byob/``.
123123
124124
Args:
@@ -136,12 +136,13 @@ def rewrite_fdf_paths(fdf: dict, pkg_name: str) -> dict:
136136
filename = os.path.basename(benchmark_module)
137137
extra["benchmark_module"] = f"/opt/byob/code/{filename}"
138138

139-
dataset = extra.get("dataset", "")
139+
dataset_cfg = extra.get("dataset") or {}
140+
dataset = dataset_cfg.get("path", "") if isinstance(dataset_cfg, dict) else ""
140141
if dataset and not dataset.startswith(
141142
("hf://", "s3://", "gs://", "http://", "https://")
142143
):
143144
filename = os.path.basename(dataset)
144-
extra["dataset"] = f"/opt/byob/data/{filename}"
145+
dataset_cfg["path"] = f"/opt/byob/data/{filename}"
145146

146147
return fdf
147148

@@ -193,7 +194,8 @@ def prepare_build_context(
193194
# Copy or fetch dataset to data/
194195
data_dir = context / "data"
195196
data_dir.mkdir(parents=True, exist_ok=True)
196-
dataset = extra.get("dataset", "")
197+
dataset_cfg = extra.get("dataset") or {}
198+
dataset = dataset_cfg.get("path", "") if isinstance(dataset_cfg, dict) else ""
197199
if dataset:
198200
if os.path.isfile(dataset):
199201
# Local file — copy directly
@@ -207,7 +209,7 @@ def prepare_build_context(
207209
result = fetcher.fetch(dataset, cache_dir=data_dir)
208210
# Update the FDF to point to the local filename inside /opt/byob/data/
209211
local_name = result.local_path.name
210-
extra["dataset"] = f"/opt/byob/data/{local_name}"
212+
dataset_cfg["path"] = f"/opt/byob/data/{local_name}"
211213
# Move/copy if fetched to a different location than data_dir
212214
if result.local_path.parent != data_dir:
213215
shutil.copy2(str(result.local_path), str(data_dir / local_name))
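The path handling in `rewrite_fdf_paths` can be exercised in isolation with a reduced sketch. `rewrite_dataset_path` is a hypothetical stand-alone function that mirrors only the dataset branch of the diff above; the sample paths are made up.

```python
import os

REMOTE_PREFIXES = ("hf://", "s3://", "gs://", "http://", "https://")


def rewrite_dataset_path(extra: dict) -> dict:
    """Rewrite a local ``extra["dataset"]["path"]`` to ``/opt/byob/data/<file>``.

    Remote URIs are left untouched. A legacy string-valued ``dataset`` is
    skipped entirely, matching the isinstance guard in the diff.
    """
    dataset_cfg = extra.get("dataset") or {}
    path = dataset_cfg.get("path", "") if isinstance(dataset_cfg, dict) else ""
    if path and not path.startswith(REMOTE_PREFIXES):
        dataset_cfg["path"] = f"/opt/byob/data/{os.path.basename(path)}"
    return extra


local = rewrite_dataset_path({"dataset": {"path": "/home/me/data/train.jsonl", "num_fewshot": 5}})
remote = rewrite_dataset_path({"dataset": {"path": "hf://other/foo?split=test"}})
legacy = rewrite_dataset_path({"dataset": "/home/me/data/train.jsonl"})
print(local, remote, legacy)
```

One consequence worth noting: an FDF that still carries the old string-valued `dataset` passes through unchanged, so its host path is not rewritten.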

packages/nemo-evaluator/src/nemo_evaluator/contrib/byob/decorators.py

Lines changed: 14 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -47,14 +47,18 @@ class ScorerInput:
4747
4848
This is the single argument passed to all BYOB scorer functions.
4949
Standard scorers use response, target, and metadata.
50-
Advanced scorers (judge, multi-turn, multiple-choice loglikelihood)
51-
use the optional fields.
52-
53-
For multiple-choice loglikelihood evaluation (lm-eval-harness style),
54-
``MultipleChoiceStrategy`` populates ``choices``, ``choices_logprobs``,
55-
and ``choices_is_greedy`` before invoking the scorer. ``response`` is
56-
set to ``choices[argmax(choices_logprobs)]`` so legacy text-based
57-
scorers also work.
50+
Advanced scorers (judge, multi-turn) use the optional fields.
51+
52+
Strategies that produce extra per-call data (e.g. ``MultipleChoiceStrategy``) write namespaced keys into
53+
``metadata`` before invoking the scorer. Reserved keys currently in
54+
use:
55+
56+
* ``_choices`` -- candidate continuations (list[str])
57+
* ``_choices_logprobs`` -- per-choice sum log-probabilities (list[float])
58+
* ``_choices_is_greedy`` -- per-choice greedy flags (list[bool])
59+
60+
``response`` is set to ``_choices[argmax(_choices_logprobs)]`` for
61+
multiple-choice mode so legacy text-based scorers also work.
5862
"""
5963

6064
response: str
@@ -65,10 +69,6 @@ class ScorerInput:
6569
config: Dict[str, Any] = field(default_factory=dict)
6670
conversation: Optional[List[dict]] = None
6771
turn_index: Optional[int] = None
68-
# Multiple-choice loglikelihood fields (mirrors lm-eval-harness)
69-
choices: Optional[List[str]] = None
70-
choices_logprobs: Optional[List[float]] = None
71-
choices_is_greedy: Optional[List[bool]] = None
7272

7373

7474
@dataclass
@@ -89,10 +89,8 @@ class BenchmarkDefinition:
8989
_is_jinja2: bool = False
9090
system_prompt: Optional[str] = None
9191
_is_system_prompt_jinja2: bool = False
92-
# Multiple-choice loglikelihood support (mirrors lm-eval-harness)
9392
choices: Optional[List[str]] = None
9493
choices_field: Optional[str] = None
95-
# Few-shot prompting (mirrors lm-eval-harness --num_fewshot)
9694
num_fewshot: int = 0
9795
fewshot_split: Optional[str] = None
9896
fewshot_template: Optional[str] = None
@@ -186,10 +184,7 @@ def benchmark(
186184
prompt: Python format string with {field} placeholders, or path to
187185
a template file (.txt, .md, .jinja, .jinja2).
188186
target_field: JSONL field containing ground truth.
189-
endpoint_type: ``"chat"``, ``"completions"``, or
190-
``"completions_logprob"``. The last value enables per-choice
191-
loglikelihood ranking (lm-evaluation-harness ``local-completions``
192-
parity) and requires either ``choices`` or ``choices_field``.
187+
endpoint_type: ``"chat"``, ``"completions"``, or ``"completions_logprob"``.
193188
requirements: Pip dependencies. Either a list of specifiers
194189
(e.g., ["rouge-score>=0.1.2"]) or a path to a
195190
requirements.txt file. None means no extra deps.
@@ -215,8 +210,7 @@ def benchmark(
215210
num_fewshot: Number of few-shot examples to prepend to each
216211
prompt. Examples are sampled deterministically from
217212
``fewshot_split`` (or the first ``num_fewshot`` rows of the
218-
evaluation dataset when ``fewshot_split`` is None). Mirrors
219-
lm-eval-harness's ``--num_fewshot`` flag.
213+
evaluation dataset when ``fewshot_split`` is None).
220214
fewshot_split: HuggingFace split name to sample few-shot examples
221215
from (e.g. ``"train"`` or ``"dev"``). Only meaningful when the
222216
primary ``dataset`` is an ``hf://`` URI.
