
feat: add local no-docker mode for nemo-evaluator-launcher#786

Open
mishana wants to merge 4 commits into NVIDIA-NeMo:main from mishana:codex/no-docker-local-launcher

Conversation


@mishana mishana commented Mar 1, 2026

Summary

  • add --no-docker flag to nemo-evaluator-launcher run
  • add execution.use_docker (default: true) for local executor config
  • support host-process execution path (no docker run) in local executor scripts
  • keep Docker as default behavior for backward compatibility
  • add tests and docs for no-docker mode
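
Putting the options above together, a local no-docker run might be configured roughly like this. This is a hypothetical sketch: only `execution.type`, `execution.use_docker`, and `deployment.type` are named in this PR; the surrounding config shape is assumed.

```yaml
# Hypothetical local-executor config sketch for no-docker mode.
execution:
  type: local        # no-docker mode is restricted to the local executor
  use_docker: false  # default is true, which preserves existing Docker behavior
deployment:
  type: none         # no-docker mode requires deployment.type=none
```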

Rationale

This change enables running nemo-evaluator-launcher in environments where users do not have root access and cannot run Docker at all. A concrete example is workloads running on GKE Autopilot pods, where Docker-in-containerd patterns are prohibited.

This is especially useful for R&D workflows where teams want to run evaluations inside managed, restricted compute environments while still using the launcher orchestration and task config experience.

Notes

  • no-docker mode is restricted to execution.type=local with deployment.type=none
  • existing Docker-based local flows remain unchanged by default

Validation

  • uv run pytest tests/unit_tests/test_get_eval_factory_command.py tests/unit_tests/test_local_executor.py tests/unit_tests/test_cli_integration.py -q
  • uv run ruff check ... on modified files

@mishana mishana requested review from a team as code owners March 1, 2026 04:19

copy-pr-bot Bot commented Mar 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mishana mishana changed the title Add local no-docker mode for nemo-evaluator-launcher feat: add local no-docker mode for nemo-evaluator-launcher Mar 1, 2026
Signed-off-by: mishana <mishana4life@gmail.com>
@mishana mishana force-pushed the codex/no-docker-local-launcher branch from a92a469 to ca28143 Compare March 1, 2026 04:24

prokotg commented Mar 2, 2026

hi @mishana, thank you for your contribution! Really appreciated!

This looks very good overall. One important missing piece: when someone uses nemo-evaluator-launcher with --no-docker, it is now assumed that the appropriate package is installed (i.e., the task from the config is available in the Python environment). Right now we're re-using the container-resolution process, in which we check whether the task is available in any container and then pull that particular container (except here we do not pull). Instead, we should utilise the Nemotron Evaluator Core API to check whether the user has a wheel that contains the appropriate task. Please let us know if that's something you'd like to contribute as well (I can point to relevant resources).

@prokotg prokotg self-assigned this Mar 2, 2026
Signed-off-by: mishana <mishana4life@gmail.com>

mishana commented Mar 2, 2026

Addressed. I added a no-docker preflight validation that uses NeMo Evaluator Core API (nemo_evaluator.api.get_available_evaluations) to verify each requested harness/task is actually installed locally before execution.

Details:

  • execution.use_docker=false now checks local harness/task availability early and fails with actionable install guidance when missing.
  • Added unit coverage for success/failure paths in no-docker mode.
  • Updated local executor docs to explicitly call out the local harness-wheel requirement.

This is now in commit ff8c248 on this PR branch.
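
The preflight check described above can be sketched as follows. This is illustrative only: the PR uses nemo_evaluator.api.get_available_evaluations, whose exact return shape is not shown in this thread, so it is modeled here as a plain `{harness: {task, ...}}` mapping.

```python
def validate_task_available_locally(
    harness: str,
    task: str,
    available_tasks_by_harness: dict[str, set[str]],
) -> None:
    """Fail fast, with install guidance, if a harness/task is not installed locally.

    Sketch of the no-docker preflight validation; in the real PR the availability
    map would come from nemo_evaluator.api.get_available_evaluations.
    """
    tasks = available_tasks_by_harness.get(harness)
    if tasks is None:
        raise RuntimeError(
            f"Harness '{harness}' is not installed locally. Install the matching "
            f"NeMo Evaluator wheel before running with execution.use_docker=false."
        )
    if task not in tasks:
        raise RuntimeError(
            f"Task '{task}' not found in locally installed harness '{harness}'. "
            f"Available tasks: {sorted(tasks)}"
        )
```

Running this before any job is launched means a missing wheel fails early with actionable guidance instead of mid-run.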

mishana added 2 commits March 2, 2026 20:17
Signed-off-by: mishana <mishana4life@gmail.com>
This reverts commit 07e6372.

Signed-off-by: mishana <mishana4life@gmail.com>
@mishana mishana force-pushed the codex/no-docker-local-launcher branch from 2c10a58 to 665f689 Compare March 3, 2026 01:51

prokotg commented Mar 3, 2026

/ok to test 665f689


@prokotg prokotg left a comment


For now, fixing the TOCTOU issue is not obligatory. There is one suggestion which is important. Please fix the linting; we'll run tests and hopefully have it merged. Thank you!

)
# For unlisted tasks, use the full harness.task format
# For listed tasks, use just the task name (existing behavior)
if task_definition.get("is_unlisted"):

Suggested change
if task_definition.get("is_unlisted") or (cfg.execution.type == "local" and cfg.execution.use_docker):

Normally, we send only task invocation because that's unique to a container. Now, in a mixed environment, that's no longer the case.

if result.returncode == 0:
killed_something = True
# Try to stop script process group if a pid file is present.
pid_file = output_dir / "logs" / "stage.pid"

This unfortunately introduces a TOCTOU vulnerability. If anything happens to the stage.pid file (overwrite, symlink race), we risk killing an unrelated process, or enabling other malicious actions. Since we operate on a third-party process, we cannot use the typical flock approach. If we were to stay with the PID file, then checking the start time / process info before killing could be beneficial. For the scope of this PR, it's okay to skip this.


@pribalta pribalta left a comment


Power Review

Reviewed 10 files, left 18 suggestions: 1 error, 12 warnings, 5 suggestions.

Dimensions Analyzed

| Dimension | 🔴 | ⚠️ | 💡 |
| --- | --- | --- | --- |
| ai-code-hygiene/architecture | 0 | 2 | 0 |
| ai-code-hygiene/patterns | 0 | 1 | 0 |
| code-correctness/api-contracts | 0 | 1 | 0 |
| code-correctness/boundary-conditions | 1 | 1 | 2 |
| code-correctness/concurrency | 0 | 1 | 0 |
| code-correctness/data-integrity | 0 | 3 | 0 |
| code-correctness/error-handling | 0 | 1 | 2 |
| code-correctness/logic-correctness | 0 | 1 | 1 |
| code-correctness/resource-lifecycle | 0 | 1 | 0 |
| **Total** | 1 | 12 | 5 |

Across the add local no-docker mode files we swept,
And 18 suggestions we have kept.
AI may code fast and read at scale,
But human judgment tips the final scale.

⏱️ Reviewed in 6.9 min (est. manual review: ~38 min, 6x faster)


Automated review by Power Review — findings are advisory.

pid_file = output_dir / "logs" / "stage.pid"
if pid_file.exists():
try:
pid = int(pid_file.read_text().strip())

[Power Review] 🔴 Missing validation that PID is positive before os.killpg — PID 0 kills caller's process group · 95% confidence

Per the knowledge document (Missing Input Validation, CWE-20): data arriving from outside the trust boundary (file read) is used without checking range. The PID is read from a file (stage.pid) and passed directly to os.killpg(pid, signal.SIGTERM). If the file contains 0, os.killpg(0, SIGTERM) sends SIGTERM to every process in the calling process group — potentially killing the launcher itself and sibling processes. A negative value would also be invalid. The int() conversion alone does not guard against these dangerous values.

💡 Suggestion: After parsing the PID, validate it is strictly positive before use: if pid <= 0: raise ValueError(f'Invalid PID: {pid}'). This should be placed before the os.killpg() call and within the existing try/except block so that invalid values are caught gracefully.
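
The suggested guard can be sketched as a small helper. This is illustrative (the PR's parsing is inline in kill_job, and the helper name here is hypothetical): the point is that int() alone accepts 0 and negative values, both of which are dangerous to pass to os.killpg.

```python
def parse_stage_pid(pid_text: str) -> int:
    """Parse and validate a PID read from stage.pid before signaling.

    Rejects zero and negative values, since os.killpg(0, SIGTERM) would signal
    the caller's own process group. The validated PID can then be passed to
    os.killpg(pid, signal.SIGTERM).
    """
    pid = int(pid_text.strip())  # raises ValueError on empty/non-numeric content
    if pid <= 0:
        raise ValueError(f"Invalid PID in stage.pid: {pid}")
    return pid
```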

else:
os.kill(pid, signal.SIGTERM)
killed_something = True
except (OSError, ValueError):

[Power Review] ⚠️ Swallowed exception in PID-based kill path · 90% confidence

Per the knowledge document (ERROR: Swallowed Exception With No Action): 'An exception is caught and silently ignored. The caller has no way to know the operation failed.' Here, except (OSError, ValueError): pass discards all diagnostic information when the PID file read or process kill fails. A ValueError from int() (corrupt PID file) or an OSError other than 'no such process' (e.g., EPERM — permission denied) would be silently lost, making debugging kill failures very difficult. Downgraded to warning because the killed_something flag does communicate functional outcome and the code falls through to Docker-based killing as a secondary strategy.

💡 Suggestion: At minimum, log the exception at debug/warning level so operators can diagnose kill failures. Consider also narrowing the OSError handling — e.g., only silencing ProcessLookupError (errno ESRCH) when the process is already dead, while letting permission errors propagate.

source "$task_dir/.secrets.env"
{% endif -%}

echo "$$" > "$logs_dir/stage.pid"

[Power Review] ⚠️ `$$` in a subshell stores the parent shell's PID, not the subshell's PID · 88% confidence

Per knowledge doc 'Wrong Variable or Field Referenced': $$ inside a bash subshell (...) always expands to the PID of the top-level script process, not the subshell's PID. Since this echo is inside a (...) subshell block (starting at line 63), every task's stage.pid file will contain the same PID — the parent script's PID. The kill_job function in executor.py then calls os.killpg(pid, signal.SIGTERM) on this PID, which sends SIGTERM to the entire process group. In parallel execution mode, killing any single task would terminate ALL running tasks instead of just the targeted one. This is a stale/wrong value bug — the stored PID does not identify the process the code intends to kill.

💡 Suggestion: Use $BASHPID (available since Bash 4.0, 2009) instead of $$ to capture the actual subshell PID. Additionally, consider whether os.killpg in kill_job is the correct signal mechanism — if the subshell shares its process group with the parent, you may need os.kill(pid, signal.SIGTERM) or have the subshell create its own process group (e.g., via set -m or wrapping with setsid).
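
The pitfall can be demonstrated directly (requires bash, where $BASHPID is available since 4.0): a command substitution runs in a subshell just like the template's (...) block, yet `$$` inside it still expands to the parent's PID.

```shell
#!/usr/bin/env bash
# Demonstration of the $$-vs-$BASHPID pitfall flagged above.
parent=$$
# The $( ... ) command substitution runs in a subshell.
result=$(echo "$$ $BASHPID")
sub_dollar=${result%% *}    # what 'echo "$$" > stage.pid' would have recorded
sub_bashpid=${result##* }   # the subshell's actual PID
echo "parent=$parent subshell-dollar=$sub_dollar subshell-BASHPID=$sub_bashpid"
```

The first field always equals the parent's PID, while the second is the subshell's own PID, which is what stage.pid should hold.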

else:
os.kill(pid, signal.SIGTERM)
killed_something = True
except (OSError, ValueError):

[Power Review] ⚠️ Swallowed exceptions in kill_job PID cleanup · 85% confidence

The knowledge document states: 'Never swallow exceptions without logging or re-raising.' While the exceptions caught here are specific (OSError, ValueError) rather than bare, they are silently discarded with pass. An OSError could indicate a permission-denied condition (not just process-already-gone), and a ValueError could indicate a corrupt PID file — both worth logging for debuggability of kill failures.

💡 Suggestion: Add a debug/warning log for the caught exceptions. For example: except (OSError, ValueError) as exc: logger.debug("Could not signal process from %s: %s", pid_file, exc). This preserves the non-fatal behavior while leaving a diagnostic trail.

if result.returncode == 0:
killed_something = True
# Try to stop script process group if a pid file is present.
pid_file = output_dir / "logs" / "stage.pid"

[Power Review] ⚠️ Filesystem TOCTOU on pid_file.exists() then pid_file.read_text() · 85% confidence

The knowledge document explicitly flags os.path.exists(path) then file operation as a Non-Atomic Check-Then-Act (TOCTOU) anti-pattern (CWE-367): 'filesystem TOCTOU — os.path.exists(path) then open(path) ... another process can replace the file between check and use. Instead: open first and handle the error.' Here, pid_file.exists() is checked on line 810, then pid_file.read_text() is called on line 812. The shell template's trap 'rm -f "$logs_dir/stage.pid"' EXIT (run.template.sh line 70) actively removes this file on process exit, creating a real window where the file disappears between check and read. While the except (OSError, ValueError): pass block on line 818 would catch the resulting FileNotFoundError, the exists() guard is redundant and constitutes a textbook TOCTOU pattern.

💡 Suggestion: Remove the pid_file.exists() check and rely solely on exception handling. This is both more correct and more concise:

try:
    pid = int(pid_file.read_text().strip())
    if hasattr(os, 'killpg'):
        os.killpg(pid, signal.SIGTERM)
    else:
        os.kill(pid, signal.SIGTERM)
    killed_something = True
except (OSError, ValueError):
    pass

_validate_task_available_locally(
task_query=task.name,
task_definition=task_definition,
available_tasks_by_harness=local_available_tasks or {},

[Power Review] 💡 Defensive or {} is unreachable — consider assertion instead · 85% confidence

At line 208, available_tasks_by_harness=local_available_tasks or {} uses or {} as a fallback. However, local_available_tasks is set to a dict at line 156 (inside if not use_docker), and _validate_task_available_locally is only called at line 204 (inside if not use_docker). So local_available_tasks is guaranteed to be a non-None dict at this point. The or {} can never activate, making it a tautological guard. This falls under the 'Redundant or Tautological Condition' suggestion pattern.

💡 Suggestion: Replace local_available_tasks or {} with just local_available_tasks, optionally preceded by assert local_available_tasks is not None to document the invariant.

Raises:
RuntimeError: If the run script fails.
"""
use_docker = bool(cfg.execution.get("use_docker", True))

[Power Review] 💡 Missing test for boundary between use_docker=true and use_docker=false in kill_job · 80% confidence

Per the knowledge document (Missing Test for 'Just Inside / Just Outside' Pair): the kill_job method now branches on use_docker (line 802/821) but the test files in this diff do not include any tests for the kill_job path in no-docker mode — specifically testing: (1) kill with valid PID file, (2) kill with missing PID file, (3) kill with empty/invalid PID file content, (4) kill with use_docker=False and no container name. These boundary transitions should be exercised.

💡 Suggestion: Add test cases in test_local_executor.py covering kill_job for no-docker jobs: valid PID file, missing PID file, malformed PID file content (empty, non-numeric, zero, negative), and the transition where use_docker=False with no container name.
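
The boundary cases above can be sketched as plain assertions. The `read_stage_pid` helper below is hypothetical (the real parsing is inline in kill_job, whose dependencies are not shown in this diff); in the repo these would become pytest cases in test_local_executor.py.

```python
import pathlib
import tempfile

def read_stage_pid(pid_file: pathlib.Path) -> int:
    """Read and validate a PID from stage.pid; raises ValueError when unusable."""
    pid = int(pid_file.read_text().strip())  # ValueError on empty/non-numeric
    if pid <= 0:
        raise ValueError(f"Invalid PID: {pid}")
    return pid

def check_boundaries() -> None:
    d = pathlib.Path(tempfile.mkdtemp())
    pid_file = d / "stage.pid"
    # Valid PID round-trips.
    pid_file.write_text("4242\n")
    assert read_stage_pid(pid_file) == 4242
    # Empty, non-numeric, zero, and negative content must all be rejected.
    for bad in ("", "abc", "0", "-7"):
        pid_file.write_text(bad)
        try:
            read_stage_pid(pid_file)
            raise AssertionError(f"accepted bad PID {bad!r}")
        except ValueError:
            pass
    # A missing file surfaces as FileNotFoundError (a subclass of OSError).
    try:
        read_stage_pid(d / "missing.pid")
        raise AssertionError("accepted missing file")
    except FileNotFoundError:
        pass

check_boundaries()
```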

available_tasks_by_harness: dict[str, set[str]],
) -> None:
"""Validate that a task exists in locally installed NeMo Evaluator packages."""
harness_name = str(task_definition.get("harness") or "")

[Power Review] 💡 Missing defensive handling when both harness and task name are empty in _validate_task_available_locally · 70% confidence

Per the knowledge document (Lack of Defensive default/else Branch): the function handles two cases — harness_name is truthy (lines 98-114) and harness_name is falsy (lines 116-123). In the falsy branch, if task_name is also empty (both harness and task keys are missing or empty in task_definition), the code searches for "" across all harnesses and produces a generic error message. A more explicit guard at the top of the function would provide a clearer, earlier error message for this malformed-input boundary.

💡 Suggestion: Add an early guard: if not harness_name and not task_name: raise ValueError(f"Task '{task_query}' has no harness or task name defined in its task definition.") before the main branching logic.

{% if task.dataset_env_var_value -%}
export NEMO_EVALUATOR_DATASET_DIR="{{ task.dataset_env_var_value }}"
{% endif -%}
{{ task.eval_factory_command }} > "$logs_dir/client_stdout.log" 2>&1

[Power Review] 💡 Host eval command exit code not explicitly checked · 70% confidence

Per the knowledge document (SUGGESTION: Shell Command Without Error Check): 'A shell script runs commands without checking the exit code.' In the no-docker branch, {{ task.eval_factory_command }} > "$logs_dir/client_stdout.log" 2>&1 runs the evaluation command, and exit_code=$? captures its status. However, unlike the Docker path which has explicit success/failure echo statements and a structured exit, the no-docker path has no diagnostic logging on failure. If the eval command fails, the only record is in client_stdout.log — there's no stderr message indicating the command failed, its exit code, or guidance for the operator.

💡 Suggestion: Add a failure diagnostic similar to the Docker path, e.g.:

{{ task.eval_factory_command }} > "$logs_dir/client_stdout.log" 2>&1
exit_code=$?
if [ $exit_code -ne 0 ]; then
    echo "Evaluation command failed with exit code $exit_code" >&2
fi

if not container_name:
raise ValueError(f"No container name found for job {job_id}")
use_docker = bool(job_data.data.get("use_docker", True))
output_dir = pathlib.Path(job_data.data.get("output_dir", ""))

[Power Review] 💡 Empty-string fallback for output_dir could create relative PID file path · 70% confidence

Per the knowledge document (WARNING: Error Message Without Context / ERROR: Null/None Dereference on Fallible Return): job_data.data.get("output_dir", "") defaults to an empty string. pathlib.Path("") / "logs" / "stage.pid" produces the relative path logs/stage.pid, which could inadvertently match a file in the process's current working directory. While pid_file.exists() guards against reading a nonexistent file, if such a relative path happens to exist, kill_job could send SIGTERM to an unrelated process.

💡 Suggestion: Validate that output_dir is non-empty and absolute before using it for PID file lookups. If missing, skip the PID-based kill or raise an error early:

output_dir_str = job_data.data.get("output_dir", "")
if not output_dir_str:
    raise ValueError(f"No output directory found for job {job_id}")
output_dir = pathlib.Path(output_dir_str)

@chtruong814 chtruong814 added the waiting-on-customer label (Waiting on the original author to respond) and removed the waiting-for-customer label on Apr 14, 2026

Labels

  • community-request
  • documentation (Improvements or additions to documentation)
  • nemo-evaluator-launcher
  • tests
  • waiting-on-customer (Waiting on the original author to respond)
