feat(tasks): Add RASB 26H1 evaluator integration by pbelcak · Pull Request #943 · NVIDIA-NeMo/Evaluator

pbelcak · 2026-04-27T20:19:16Z

feat(tasks): Add RASB 26H1 evaluator integration

What is RASB 26H1?

RASB 26H1 is the first snapshot of the Real Agent Scaffolds Bench. You can find the details of the benchmark in the tech report (NVIDIA Internal link).

Summary

Adds RASB 26H1 as a NeMo Evaluator task:

Adds the rasb-26h1.rasb_26h1 harness, runner, output parser, and unit tests.
Adds a local RASB evaluator image build helper under docker/rasb-26h1-local/.
Adds a launcher example config at packages/nemo-evaluator-launcher/examples/local_rasb_26h1.yaml.
Adds benchmark documentation at docs/evaluation/benchmarks/rasb-26h1.md and links it from the benchmark catalog.
Packages the RASB framework.yml, README.md, and QUICKSTART.md.
Adds minimal launcher support for explicit local image references such as localhost/rasb-26h1:local.

Motivation

RASB runs each benchmark environment inside its own Docker container. This integration lets NeMo Evaluator and NeMo Evaluator Launcher run the benchmark while preserving RASB's existing per-environment Docker lifecycle.
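As a rough illustration of what a per-environment container lifecycle involves, the sketch below builds the start and cleanup commands for one environment without executing them. The function name, the container naming scheme, and the image name are hypothetical and are not taken from the RASB runner; only the `docker run` and `docker rm` flags are standard Docker CLI options.

```python
import shlex


def lifecycle_commands(env_id: str, image: str) -> dict[str, list[str]]:
    """Build (but do not run) the Docker commands for one environment."""
    # Hypothetical naming scheme: one uniquely named container per environment.
    container = f"rasb-env-{env_id}"
    return {
        # Start a detached container that is removed automatically when it exits.
        "start": ["docker", "run", "--detach", "--rm", "--name", container, image],
        # Force-remove the container during teardown, even if it is still running.
        "cleanup": ["docker", "rm", "--force", container],
    }


# Placeholder environment id and image, for illustration only.
commands = lifecycle_commands("env-001", "example.invalid/rasb-env:latest")
for step, argv in commands.items():
    print(f"{step}: {shlex.join(argv)}")
```

Because each environment owns its container, the process that drives the benchmark needs access to a Docker daemon, which is why the integration has to preserve this lifecycle rather than replace it.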

The generic launcher changes are limited to what is needed for local RASB development images: recognizing loopback/local image references and skipping registry metadata extraction for those local references.

Validation

Ran focused unit tests:

.venv/bin/python -m pytest \
  packages/nemo-evaluator/tests/unit_tests/rasb_26h1 \
  packages/nemo-evaluator/tests/unit_tests/core/test_custom_harness_discovery.py

Result: 17 passed

.venv/bin/python -m pytest \
  packages/nemo-evaluator-launcher/tests/unit_tests/test_helpers.py::TestIsLocalImagePath \
  packages/nemo-evaluator-launcher/tests/unit_tests/test_mapping.py

Result: 22 passed

Additional checks:

git show --check HEAD
.venv/bin/python -m compileall -q <changed Python files>

Also ran a one-environment RASB smoke test against azure/anthropic/claude-opus-4-5 using https://inference-api.nvidia.com.

Result:

1 RASB environment completed
30 samples evaluated
27 passed
3 failed benchmark requirements
0 runtime/API errors
overall pass rate: 0.9
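
The reported pass rate follows directly from the counts above. A minimal check, assuming the rate is passed samples divided by all evaluated samples (the helper name is illustrative and not part of the RASB output parser):

```python
def pass_rate(passed: int, failed: int, errors: int = 0) -> float:
    """Fraction of evaluated samples that passed."""
    # Assumption: runtime/API errors would count toward the denominator.
    total = passed + failed + errors
    if total == 0:
        raise ValueError("no samples evaluated")
    return passed / total


# Counts from the smoke test: 27 passed, 3 failed requirements, 0 errors.
rate = pass_rate(passed=27, failed=3, errors=0)
print(f"{rate:.1f}")
```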

Documentation

Added:

docs/evaluation/benchmarks/rasb-26h1.md
package-level README.md
package-level QUICKSTART.md

The docs cover the local image flow, NGC image flow, dataset location, endpoint configuration, and RASB's nested Docker requirement.

Backwards Compatibility

This should not affect existing benchmark tasks. The tiny launcher behavior change only applies to explicit local/loopback image references such as localhost/..., localhost:port/..., 127.0.0.1:port/..., or local image file paths.
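
To make the scope of that change concrete, here is a minimal sketch of how such references could be told apart from ordinary registry images. It is not the launcher's implementation: the function name and the exact rules (loopback host before the first `/`, or a path starting with `/`, `./`, or `../`) are assumptions made for illustration.

```python
LOOPBACK_HOSTS = ("localhost", "127.0.0.1")


def is_local_image_reference(image: str) -> bool:
    """Return True for loopback-registry references and local file paths."""
    # Assumption: a local image file is given as an absolute or relative path.
    if image.startswith(("/", "./", "../")):
        return True
    # The registry host, if any, is the part before the first "/".
    host, sep, _ = image.partition("/")
    if not sep:
        return False  # e.g. "ubuntu:22.04" has no registry component
    # Drop an optional ":port" suffix before comparing the hostname.
    hostname = host.split(":", 1)[0]
    return hostname in LOOPBACK_HOSTS


examples = [
    "localhost/rasb-26h1:local",
    "localhost:5000/rasb-26h1:local",
    "127.0.0.1:5000/rasb-26h1:local",
    "/tmp/rasb-26h1.sqsh",
    "nvcr.io/example/image:1.0",
]
for ref in examples:
    print(ref, is_local_image_reference(ref))
```

Under these assumptions, only the first four references would skip registry metadata extraction; any image hosted on a remote registry takes the existing code path unchanged.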

Signed-off-by: Peter Belcak <pbelcak@nvidia.com>

copy-pr-bot · 2026-04-27T20:19:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

marta-sd

Thank you for your contribution, @pbelcak ❤️
Could you please refactor your PR to use BYOB for adding your benchmark? I think it fits your use case well 🙂 Please let us know if you have any questions.

gchlebus · 2026-05-04T15:25:42Z

@pbelcak Please let us know if you need any support from the Evaluator team on this PR. Thanks!
FYI: We have a BYOB agent skill that you could use during the onboarding.

feat(tasks): add RASB 26H1 evaluator integration

774f28a

Signed-off-by: Peter Belcak <pbelcak@nvidia.com>

pbelcak requested review from a team and agronskiy as code owners April 27, 2026 20:19

github-actions Bot added the documentation (Improvements or additions to documentation), nemo-evaluator-launcher, nemo-evaluator, tests, CI, and community-request labels Apr 27, 2026

marta-sd reviewed Apr 28, 2026


svcnvidia-nemo-ci added the waiting-on-customer (Waiting on the original author to respond) label Apr 28, 2026

pbelcak wants to merge 1 commit into NVIDIA-NeMo:main from pbelcak:main
