Draft
83 changes: 82 additions & 1 deletion docs/evaluation/other-benchmarks.md
@@ -127,6 +127,87 @@ After all jobs are complete, you can check the results in `<OUTPUT_DIR>/eval-res
}
```

### FACTS Grounding

[FACTS Grounding](https://www.kaggle.com/benchmarks/google/facts-grounding) is a Google DeepMind and Google Research benchmark for measuring whether long-form model responses are grounded in a provided context document.
The benchmark evaluates factuality with respect to the supplied document, rather than open-world factual recall.

- Benchmark definition: [`nemo_skills/dataset/facts_grounding/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/facts_grounding/__init__.py)
- Original benchmark leaderboard: [FACTS Grounding on Kaggle](https://www.kaggle.com/benchmarks/google/facts-grounding)
- Public data is available on Hugging Face as [`google/FACTS-grounding-public`](https://huggingface.co/datasets/google/FACTS-grounding-public).
- The public split contains 860 examples. Each example includes a user request, a long context document, and a full prompt.
- Metrics follow the FACTS Grounding setup: a 3-judge ensemble checks each response for grounding and eligibility, and the evaluation reports unadjusted factuality, eligibility rate, and eligibility-adjusted final factuality.
- The leaderboard-style score is `final_factuality`. Ineligible responses are counted as inaccurate in this final score.
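
The eligibility adjustment described above can be illustrated with a minimal Python sketch. It is a simplification: it assumes one combined verdict per response, whereas the actual pipeline aggregates three judges. The `Verdict` class and `summarize` function are hypothetical names used only for illustration and are not part of NeMo-Skills.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical per-response verdict (not a NeMo-Skills type)."""

    grounded: bool  # response judged fully grounded in the context document
    eligible: bool  # response judged to adequately address the user request


def summarize(verdicts):
    """Compute the three headline rates from a list of verdicts."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    total = len(verdicts)
    return {
        # Grounding alone, ignoring eligibility.
        "unadjusted_factuality": sum(v.grounded for v in verdicts) / total,
        # Share of responses that pass the eligibility check.
        "eligibility_rate": sum(v.eligible for v in verdicts) / total,
        # An ineligible response counts as inaccurate, so it only
        # contributes when it is both grounded and eligible.
        "final_factuality": sum(v.grounded and v.eligible for v in verdicts) / total,
    }
```

Under this simplification, `final_factuality` can never exceed `unadjusted_factuality`, because the adjustment only removes credit from ineligible responses.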

#### Data Preparation

Prepare the public split and the evaluation prompts:

```bash
ns prepare_data facts_grounding
```
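
To sanity-check the prepared data, a short Python sketch like the one below may help. It assumes `prepare.py` wrote `test.jsonl` next to the dataset's `__init__.py`, with the four fields `id`, `full_prompt`, `user_request`, and `context_document`. The helper names `load_jsonl` and `missing_fields` are illustrative and not part of NeMo-Skills.

```python
import json

# Fields that prepare.py writes for every example.
EXPECTED_FIELDS = {"id", "full_prompt", "user_request", "context_document"}


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, "rt", encoding="utf-8") as fin:
        return [json.loads(line) for line in fin if line.strip()]


def missing_fields(rows):
    """Return {row index: sorted missing field names} for incomplete rows."""
    problems = {}
    for index, row in enumerate(rows):
        missing = EXPECTED_FIELDS - set(row)
        if missing:
            problems[index] = sorted(missing)
    return problems
```

For example, when working from a source checkout, `rows = load_jsonl("nemo_skills/dataset/facts_grounding/test.jsonl")` should return 860 rows for the public split, and `missing_fields(rows)` should be empty. The relative path is an assumption about where the package is installed.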

#### Running the Evaluation

FACTS Grounding uses LLM judges through the NVIDIA Inference API by default, so make sure `NVIDIA_API_KEY` is defined.
The default NeMo-Skills judge ensemble uses Gemini 3.1 Pro Preview, GPT-5.2, and Claude Opus 4.5.

```bash
export NVIDIA_API_KEY=<>
ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --server_type=vllm \
    --server_gpus=8 \
    --benchmarks=facts_grounding \
    --output_dir=<OUTPUT_DIR> \
    --server_args="--max-model-len 65536" \
    ++parse_reasoning=True \
    ++max_concurrent_requests=24
```

You can override the judge set with `++judge_models=[...]`.
The default 3-judge ensemble is:

```text
gcp/google/gemini-3.1-pro-preview
azure/openai/gpt-5.2
aws/anthropic/claude-opus-4-5
```

#### Verifying Results

After all jobs are complete, check the results in `<OUTPUT_DIR>/eval-results/facts_grounding/metrics.json`.
The results table is printed to stdout and captured in the summarize-results srun log.
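
To pull a single metric out of `metrics.json` programmatically, a sketch along these lines could be used. The exact nesting of the file is not described here, so the code makes no assumption about it and searches the whole JSON tree for a key. The function names `find_metric` and `report` are hypothetical.

```python
import json
from pathlib import Path


def find_metric(node, name, path=()):
    """Yield (path, value) for every occurrence of key `name` in nested JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                yield path + (key,), value
            else:
                yield from find_metric(value, name, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_metric(item, name, path + (str(index),))


def report(metrics_path, name="final_factuality"):
    """Load a metrics file and list every place where `name` appears."""
    metrics = json.loads(Path(metrics_path).read_text(encoding="utf-8"))
    return list(find_metric(metrics, name))
```

Calling `report("<OUTPUT_DIR>/eval-results/facts_grounding/metrics.json")` returns a list of `(path, value)` pairs, which also shows where the metric sits in the file.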

Example public-split results (Nemotron-3-Nano, `facts_grounding`):

```text
--------------------------------------- facts_grounding ---------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | unadjusted_factuality | eligibility_rate | final_factuality | grounding_correct | quality_passed | factuality_correct | num_judges
pass@1          | 860         | 943        | 285         | 41.90%                | 95.12%           | 39.81%           | 17.91%            | 95.12%         | 17.09%             | 3
```

Per-judge unadjusted factuality and eligibility:

| Judge | Unadjusted factuality | Eligibility |
|:---|---:|---:|
| Gemini 3.1 Pro Preview | 45.93% | 88.60% |
| GPT-5.2 | 24.19% | 83.14% |
| Claude Opus 4.5 | 55.58% | 89.77% |
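
As a quick arithmetic check, the per-judge unadjusted factuality values in this table average to the `unadjusted_factuality` shown in the results table. The sketch below only recomputes numbers already listed here. Reading the aggregate as a simple mean across judges is an inference from these numbers, not a statement about the implementation.

```python
# Per-judge unadjusted factuality, copied from the table above (percent).
per_judge_unadjusted = {
    "Gemini 3.1 Pro Preview": 45.93,
    "GPT-5.2": 24.19,
    "Claude Opus 4.5": 55.58,
}

# Per-judge eligibility, copied from the same table (percent).
per_judge_eligibility = {
    "Gemini 3.1 Pro Preview": 88.60,
    "GPT-5.2": 83.14,
    "Claude Opus 4.5": 89.77,
}

mean_unadjusted = sum(per_judge_unadjusted.values()) / len(per_judge_unadjusted)
mean_eligibility = sum(per_judge_eligibility.values()) / len(per_judge_eligibility)

print(f"mean unadjusted factuality: {mean_unadjusted:.2f}%")  # 41.90%
print(f"mean eligibility: {mean_eligibility:.2f}%")  # 87.17%
```

The mean eligibility (87.17%) does not match the reported `eligibility_rate` (95.12%), so eligibility appears to be combined across judges in some other way than a simple average.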

Sentence-level grounding labels from the JSON-style judges:

| Label | Share |
|:---|---:|
| Supported | 59.06% |
| Unsupported | 21.73% |
| Contradictory | 1.07% |
| No RAD | 18.05% |

!!! note
    These numbers are for the public split with the NeMo-Skills default judge set. They are useful for local comparison but are not directly comparable to the Kaggle leaderboard's private-split score.

### HotpotQA

[HotpotQA](https://hotpotqa.github.io/) is a multi-hop question-answering benchmark that requires reasoning over multiple Wikipedia paragraphs. Two variants are supported:
@@ -288,4 +369,4 @@ eval(
output_dir="/workspace/experiments/aa-omniscience-eval",
data_dir="/workspace/data_dir"
)
```
43 changes: 43 additions & 0 deletions nemo_skills/dataset/facts_grounding/__init__.py
@@ -0,0 +1,43 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Google FACTS Grounding benchmark
# Evaluates LLMs' ability to generate long-form responses grounded in a provided context document.
# See: https://huggingface.co/datasets/google/FACTS-grounding-public

METRICS_TYPE = "facts_grounding"
GENERATION_ARGS = "++prompt_config=generic/facts_grounding ++inference.tokens_to_generate=8192"

# LLM judge evaluation using NVIDIA Inference API.
# ``model`` seeds server/client setup in the nemo-skills pipeline; the judge
# task itself spins up one client per entry in ``judge_models`` (see
# ``nemo_skills.inference.eval.facts_grounding_judge``). All three judges are
# reachable via the same base URL, so only the ``model`` field varies per call.
JUDGE_PIPELINE_ARGS = {
    "generation_module": "nemo_skills.inference.eval.facts_grounding_judge",
    "model": "gcp/google/gemini-3.1-pro-preview",
    "server_type": "openai",
    "server_address": "https://inference-api.nvidia.com/v1",
}
# Default 3-judge ensemble — closest available counterparts to the
# google-facts reference's gemini-3-pro / gpt-5.2 / claude-opus-4-5.
# Override via ``++judge_models=[...]`` on the CLI.
JUDGE_ARGS = (
    "++prompt_config=judge/facts_grounding "
    "++generation_key=judgement "
    "++add_generation_stats=False "
    "++inference.temperature=0.5 "
    "++inference.tokens_to_generate=8192 "
    "++judge_models=[gcp/google/gemini-3.1-pro-preview,azure/openai/gpt-5.2,aws/anthropic/claude-opus-4-5]"
)
71 changes: 71 additions & 0 deletions nemo_skills/dataset/facts_grounding/prepare.py
@@ -0,0 +1,71 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
from pathlib import Path

from datasets import load_dataset
from tqdm import tqdm


def write_data_to_file(output_file, examples):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for row in tqdm(examples, desc=f"Writing {output_file.name}"):
            json.dump(row, fout, ensure_ascii=False)
            fout.write("\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--split",
        default="test",
        choices=("test",),
        help="Dataset split to process.",
    )
    args = parser.parse_args()

    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)

    # Download the FACTS Grounding examples
    ds = load_dataset("google/FACTS-grounding-public", "examples", split="public")

    examples = []
    for idx, sample in enumerate(ds):
        examples.append(
            {
                "id": f"facts_grounding_{idx}",
                "full_prompt": sample["full_prompt"],
                "user_request": sample["user_request"],
                "context_document": sample["context_document"],
            }
        )

    output_file = data_dir / f"{args.split}.jsonl"
    write_data_to_file(output_file, examples)

    # Download evaluation prompts used by the judge
    eval_ds = load_dataset("google/FACTS-grounding-public", "evaluation_prompts", split="prompts")
    eval_prompts = {}
    for item in eval_ds:
        eval_prompts[item["evaluation_method"]] = item["evaluation_prompt"]

    eval_prompts_file = data_dir / "eval_prompts.json"
    with open(eval_prompts_file, "wt", encoding="utf-8") as f:
        json.dump(eval_prompts, f, ensure_ascii=False, indent=2)

    print(f"Wrote {len(examples)} examples to {output_file}")
    print(f"Wrote {len(eval_prompts)} evaluation prompts to {eval_prompts_file}")