NVIDIA-NeMo · MahanFathi · Apr 16, 2026 · Apr 16, 2026 · Apr 20, 2026 · Apr 29, 2026
diff --git a/docs/evaluation/other-benchmarks.md b/docs/evaluation/other-benchmarks.md
@@ -127,6 +127,87 @@ After all jobs are complete, you can check the results in `<OUTPUT_DIR>/eval-res
 }
 ```
 
+### FACTS Grounding
+
+[FACTS Grounding](https://www.kaggle.com/benchmarks/google/facts-grounding) is a Google DeepMind and Google Research benchmark for measuring whether long-form model responses are grounded in a provided context document.
+The benchmark evaluates factuality with respect to the supplied document, rather than open-world factual recall.
+
+- Benchmark definition: [`nemo_skills/dataset/facts_grounding/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/facts_grounding/__init__.py)
+- Original benchmark leaderboard: [FACTS Grounding on Kaggle](https://www.kaggle.com/benchmarks/google/facts-grounding)
+- Public data is available on Hugging Face as [`google/FACTS-grounding-public`](https://huggingface.co/datasets/google/FACTS-grounding-public).
+- The public split contains 860 examples. Each example includes a user request, a long context document, and a full prompt.
+- Metrics follow the FACTS Grounding setup: a 3-judge ensemble scores each response for grounding and eligibility, and the run reports unadjusted factuality, eligibility rate, and eligibility-adjusted final factuality.
+- The leaderboard-style score is `final_factuality`. Ineligible responses are counted as inaccurate in this final score.
+
+#### Data Preparation
+
+Prepare the public split and the evaluation prompts:
+
+```bash
+ns prepare_data facts_grounding
+```
+
+#### Running the Evaluation
+
+FACTS Grounding uses LLM judges through the NVIDIA Inference API by default, so `NVIDIA_API_KEY` must be set in the environment.
+The default NeMo-Skills judge ensemble uses Gemini 3.1 Pro Preview, GPT-5.2, and Claude Opus 4.5.
+
+```bash
+export NVIDIA_API_KEY=<YOUR_API_KEY>
+ns eval \
+    --cluster=<CLUSTER_NAME> \
+    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+    --server_type=vllm \
+    --server_gpus=8 \
+    --benchmarks=facts_grounding \
+    --output_dir=<OUTPUT_DIR> \
+    --server_args="--max-model-len 65536" \
+    ++parse_reasoning=True \
+    ++max_concurrent_requests=24
+```
+
+You can override the judge set with `++judge_models=[...]`.
+The default 3-judge ensemble is:
+
+```text
+gcp/google/gemini-3.1-pro-preview
+azure/openai/gpt-5.2
+aws/anthropic/claude-opus-4-5
+```
+
+#### Verifying Results
+
+After all jobs are complete, check the results in `<OUTPUT_DIR>/eval-results/facts_grounding/metrics.json`.
+The results table is printed to stdout and captured in the summarize-results srun log.
+
+Example public-split results (Nemotron-3-Nano, `facts_grounding`):
+
+```text
+--------------------------------------- facts_grounding ---------------------------------------
+evaluation_mode | num_entries | avg_tokens | gen_seconds | unadjusted_factuality | eligibility_rate | final_factuality | grounding_correct | quality_passed | factuality_correct | num_judges
+pass@1          | 860         | 943        | 285         | 41.90%                | 95.12%           | 39.81%           | 17.91%            | 95.12%         | 17.09%             | 3
+```
+
+Per-judge unadjusted factuality and eligibility:
+
+| Judge | Unadjusted factuality | Eligibility |
+|:---|---:|---:|
+| Gemini 3.1 Pro Preview | 45.93% | 88.60% |
+| GPT-5.2 | 24.19% | 83.14% |
+| Claude Opus 4.5 | 55.58% | 89.77% |
+
+Sentence-level grounding labels from the JSON-style judges:
+
+| Label | Share |
+|:---|---:|
+| Supported | 59.06% |
+| Unsupported | 21.73% |
+| Contradictory | 1.07% |
+| No RAD | 18.05% |
+
+!!! note
+    These numbers are for the public split and the NeMo-Skills default judge set. They are useful for local comparison, but they are not directly comparable to the Kaggle leaderboard's private-split score.
+
 ### HotpotQA
 
 [HotpotQA](https://hotpotqa.github.io/) is a multi-hop question-answering benchmark that requires reasoning over multiple Wikipedia paragraphs. Two variants are supported:
@@ -288,4 +369,4 @@ eval(
     output_dir="/workspace/experiments/aa-omniscience-eval",
     data_dir="/workspace/data_dir"
 )
-```
+```
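The metric description in the FACTS Grounding docs hunk above can be illustrated with a small sketch. The first part uses hypothetical per-response verdicts from a single judge to show how counting ineligible responses as inaccurate lowers the final score. The field names `grounded` and `eligible` and the helper `single_judge_scores` are assumptions for illustration, not the names used by `nemo_skills.inference.eval.facts_grounding_judge`. The second part only re-derives numbers from the example tables: the reported `unadjusted_factuality` equals the mean of the three per-judge values, while the reported `eligibility_rate` does not equal the per-judge mean, so eligibility is presumably combined by a different rule that the docs do not state.

```python
# Part 1: hypothetical verdicts from ONE judge (illustrative data only).
verdicts = [
    {"grounded": True, "eligible": True},
    {"grounded": True, "eligible": False},   # grounded but ineligible
    {"grounded": False, "eligible": True},
    {"grounded": True, "eligible": True},
]


def single_judge_scores(rows):
    """Return (unadjusted factuality, eligibility rate, final factuality)."""
    n = len(rows)
    unadjusted = sum(r["grounded"] for r in rows) / n
    eligibility = sum(r["eligible"] for r in rows) / n
    # Ineligible responses are counted as inaccurate in the final score.
    final = sum(r["grounded"] and r["eligible"] for r in rows) / n
    return unadjusted, eligibility, final


unadjusted, eligibility, final = single_judge_scores(verdicts)

# Part 2: consistency check on the example results quoted in the docs.
per_judge_unadjusted = [45.93, 24.19, 55.58]   # Gemini, GPT, Claude
per_judge_eligibility = [88.60, 83.14, 89.77]

mean_unadjusted = round(sum(per_judge_unadjusted) / 3, 2)
mean_eligibility = round(sum(per_judge_eligibility) / 3, 2)

print(unadjusted, eligibility, final)
print(mean_unadjusted)    # matches the reported unadjusted_factuality (41.90%)
print(mean_eligibility)   # differs from the reported eligibility_rate (95.12%)
```

In the toy data the final score can never exceed the unadjusted one, which matches the reported 39.81% versus 41.90%.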
diff --git a/nemo_skills/dataset/facts_grounding/__init__.py b/nemo_skills/dataset/facts_grounding/__init__.py
@@ -0,0 +1,43 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Google FACTS Grounding benchmark
+# Evaluates LLMs' ability to generate long-form responses grounded in a provided context document.
+# See: https://huggingface.co/datasets/google/FACTS-grounding-public
+
+METRICS_TYPE = "facts_grounding"
+GENERATION_ARGS = "++prompt_config=generic/facts_grounding ++inference.tokens_to_generate=8192"
+
+# LLM judge evaluation using NVIDIA Inference API.
+# ``model`` seeds server/client setup in the nemo-skills pipeline; the judge
+# task itself spins up one client per entry in ``judge_models`` (see
+# ``nemo_skills.inference.eval.facts_grounding_judge``). All three judges are
+# reachable via the same base URL, so only the ``model`` field varies per call.
+JUDGE_PIPELINE_ARGS = {
+    "generation_module": "nemo_skills.inference.eval.facts_grounding_judge",
+    "model": "gcp/google/gemini-3.1-pro-preview",
+    "server_type": "openai",
+    "server_address": "https://inference-api.nvidia.com/v1",
+}
+# Default 3-judge ensemble — closest available counterparts to the
+# google-facts reference's gemini-3-pro / gpt-5.2 / claude-opus-4-5.
+# Override via ``++judge_models=[...]`` on the CLI.
+JUDGE_ARGS = (
+    "++prompt_config=judge/facts_grounding "
+    "++generation_key=judgement "
+    "++add_generation_stats=False "
+    "++inference.temperature=0.5 "
+    "++inference.tokens_to_generate=8192 "
+    "++judge_models=[gcp/google/gemini-3.1-pro-preview,azure/openai/gpt-5.2,aws/anthropic/claude-opus-4-5]"
+)
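To make the shape of `JUDGE_ARGS` concrete, here is a minimal sketch that splits the override string into key/value pairs and recovers the default judge list. It is only an illustration of the string format: the real pipeline hands these overrides to its own config machinery, and the helper `parse_overrides` below is hypothetical, not part of NeMo-Skills.

```python
import shlex

# Copied from the JUDGE_ARGS definition in the diff above.
JUDGE_ARGS = (
    "++prompt_config=judge/facts_grounding "
    "++generation_key=judgement "
    "++add_generation_stats=False "
    "++inference.temperature=0.5 "
    "++inference.tokens_to_generate=8192 "
    "++judge_models=[gcp/google/gemini-3.1-pro-preview,azure/openai/gpt-5.2,aws/anthropic/claude-opus-4-5]"
)


def parse_overrides(arg_string):
    """Split '++key=value' tokens into a dict of raw string values."""
    overrides = {}
    for token in shlex.split(arg_string):
        key, _, value = token.lstrip("+").partition("=")
        overrides[key] = value
    return overrides


parsed = parse_overrides(JUDGE_ARGS)
# The judge list is a bracketed, comma-separated string with no spaces.
judges = parsed["judge_models"].strip("[]").split(",")

for name in judges:
    print(name)
```

A CLI override such as `++judge_models=[...]` replaces only the last entry, so the prompt config, temperature, and token budget stay as defined here.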
diff --git a/nemo_skills/dataset/facts_grounding/prepare.py b/nemo_skills/dataset/facts_grounding/prepare.py
@@ -0,0 +1,71 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import json
+from pathlib import Path
+
+from datasets import load_dataset
+from tqdm import tqdm
+
+
+def write_data_to_file(output_file, examples):
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for row in tqdm(examples, desc=f"Writing {output_file.name}"):
+            json.dump(row, fout, ensure_ascii=False)
+            fout.write("\n")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--split",
+        default="test",
+        choices=("test",),
+        help="Output split name. The Hugging Face `public` split is written to <split>.jsonl.",
+    )
+    args = parser.parse_args()
+
+    data_dir = Path(__file__).absolute().parent
+    data_dir.mkdir(exist_ok=True)
+
+    # Download the FACTS Grounding examples
+    ds = load_dataset("google/FACTS-grounding-public", "examples", split="public")
+
+    examples = []
+    for idx, sample in enumerate(ds):
+        examples.append(
+            {
+                "id": f"facts_grounding_{idx}",
+                "full_prompt": sample["full_prompt"],
+                "user_request": sample["user_request"],
+                "context_document": sample["context_document"],
+            }
+        )
+
+    output_file = data_dir / f"{args.split}.jsonl"
+    write_data_to_file(output_file, examples)
+
+    # Download evaluation prompts used by the judge
+    eval_ds = load_dataset("google/FACTS-grounding-public", "evaluation_prompts", split="prompts")
+    eval_prompts = {}
+    for item in eval_ds:
+        eval_prompts[item["evaluation_method"]] = item["evaluation_prompt"]
+
+    eval_prompts_file = data_dir / "eval_prompts.json"
+    with open(eval_prompts_file, "wt", encoding="utf-8") as f:
+        json.dump(eval_prompts, f, ensure_ascii=False, indent=2)
+
+    print(f"Wrote {len(examples)} examples to {output_file}")
+    print(f"Wrote {len(eval_prompts)} evaluation prompts to {eval_prompts_file}")
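A quick way to sanity-check the output of `prepare.py` is to read the JSONL file back and confirm that every row carries the four fields the script writes. The sketch below assumes nothing beyond the record layout shown in the diff. The helper `load_examples` is hypothetical, and the demo writes a single made-up record to a temporary file instead of downloading the dataset.

```python
import json
import tempfile
from pathlib import Path

# The four keys prepare.py writes for every example.
REQUIRED_KEYS = {"id", "full_prompt", "user_request", "context_document"}


def load_examples(path):
    """Read a JSONL file and fail loudly on rows with missing keys."""
    examples = []
    with open(path, "rt", encoding="utf-8") as fin:
        for line_number, line in enumerate(fin, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
            examples.append(row)
    return examples


# Demo with one made-up record in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "test.jsonl"
    record = {
        "id": "facts_grounding_0",
        "full_prompt": "placeholder prompt",
        "user_request": "placeholder request",
        "context_document": "placeholder document",
    }
    sample_file.write_text(json.dumps(record, ensure_ascii=False) + "\n", encoding="utf-8")
    loaded = load_examples(sample_file)

print(f"Loaded {len(loaded)} example(s)")
```

Pointing `load_examples` at the real `test.jsonl` should yield 860 rows for the public split, per the docs in this PR.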