
feat(tasks): Add RASB 26H1 evaluator integration #943

Open

pbelcak wants to merge 1 commit into NVIDIA-NeMo:main from pbelcak:main

Conversation


@pbelcak pbelcak commented Apr 27, 2026

feat(tasks): Add RASB 26H1 evaluator integration

What is RASB 26H1?

RASB 26H1 is the first snapshot of the Real Agent Scaffolds Bench. You can find the details of the benchmark in the tech report (NVIDIA Internal link).

Summary

Adds RASB 26H1 as a NeMo Evaluator task:

  • Adds the rasb-26h1.rasb_26h1 harness, runner, output parser, and unit tests.
  • Adds a local RASB evaluator image build helper under docker/rasb-26h1-local/.
  • Adds a launcher example config at packages/nemo-evaluator-launcher/examples/local_rasb_26h1.yaml.
  • Adds straightforward benchmark documentation under docs/evaluation/benchmarks/rasb-26h1.md and links it from the benchmark catalog.
  • Packages the RASB framework.yml, README.md, and QUICKSTART.md.
  • Adds minimal launcher support for explicit local image references such as localhost/rasb-26h1:local.

Motivation

RASB runs each benchmark environment inside its own Docker container. This integration lets NeMo Evaluator and NeMo Evaluator Launcher run the benchmark while preserving RASB's existing per-environment Docker lifecycle.

The generic launcher changes are limited to what is needed for local RASB development images: recognizing loopback/local image references and skipping registry metadata extraction for those local references.
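
For illustration, here is a minimal Python sketch of the kind of check involved, assuming a helper along the lines of the is_local_image_path name suggested by the unit tests in the Validation section; the actual implementation in the launcher may differ:

import re
from pathlib import Path

# Loopback registries: localhost/..., localhost:port/..., 127.0.0.1:port/...
_LOOPBACK_RE = re.compile(r"^(localhost|127\.0\.0\.1)(:\d+)?/")

def is_local_image_path(image_ref: str) -> bool:
    """Return True for explicit local/loopback image references,
    for which registry metadata extraction is skipped."""
    if _LOOPBACK_RE.match(image_ref):
        return True
    # An existing filesystem path is treated as a local image file.
    return Path(image_ref).exists()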

Validation

Ran focused unit tests:

.venv/bin/python -m pytest \
  packages/nemo-evaluator/tests/unit_tests/rasb_26h1 \
  packages/nemo-evaluator/tests/unit_tests/core/test_custom_harness_discovery.py

Result: 17 passed

.venv/bin/python -m pytest \
  packages/nemo-evaluator-launcher/tests/unit_tests/test_helpers.py::TestIsLocalImagePath \
  packages/nemo-evaluator-launcher/tests/unit_tests/test_mapping.py

Result: 22 passed

Additional checks:

git show --check HEAD
.venv/bin/python -m compileall -q <changed Python files>

Also ran a one-environment RASB smoke test against azure/anthropic/claude-opus-4-5 using https://inference-api.nvidia.com.

Result:

  • 1 RASB environment completed
  • 30 samples evaluated
  • 27 passed
  • 3 failed benchmark requirements
  • 0 runtime/API errors
  • overall pass rate: 0.9

Documentation

Added:

  • docs/evaluation/benchmarks/rasb-26h1.md
  • package-level README.md
  • package-level QUICKSTART.md

The docs cover the local image flow, NGC image flow, dataset location, endpoint configuration, and RASB's nested Docker requirement.

Backwards Compatibility

This should not affect existing benchmark tasks. The tiny launcher behavior change only applies to explicit local/loopback image references such as localhost/..., localhost:port/..., 127.0.0.1:port/..., or local image file paths.
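
As a usage illustration against the hypothetical sketch above (the nvcr.io reference is a made-up example of a normal registry path that should not be treated as local):

assert is_local_image_path("localhost/rasb-26h1:local")
assert is_local_image_path("localhost:5000/rasb-26h1:local")
assert is_local_image_path("127.0.0.1:5000/rasb-26h1:local")
assert not is_local_image_path("nvcr.io/example/rasb-26h1:26h1")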

Signed-off-by: Peter Belcak <pbelcak@nvidia.com>
@pbelcak pbelcak requested review from a team and agronskiy as code owners April 27, 2026 20:19

copy-pr-bot Bot commented Apr 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


@marta-sd marta-sd left a comment


Thank you for the contribution, @pbelcak ❤️
Could you please refactor your PR to use BYOB (Bring Your Own Benchmark) for adding your benchmark? I think it fits your use case well 🙂 Please let us know if you have any questions.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer (Waiting on the original author to respond) label Apr 28, 2026

gchlebus commented May 4, 2026

@pbelcak Please let us know if you need any support from the Evaluator team on this PR. Thanks!
FYI: We have a BYOB agent skill that you could use during onboarding.


Labels

CI, community-request, documentation, nemo-evaluator, nemo-evaluator-launcher, tests, waiting-on-customer
