feat: add local no-docker mode for nemo-evaluator-launcher #786
mishana wants to merge 4 commits into NVIDIA-NeMo:main
Conversation
Signed-off-by: mishana <mishana4life@gmail.com>
mishana force-pushed from a92a469 to ca28143
hi @mishana, thank you for your contribution! Really appreciated! This looks overall very well done. One important missing piece is that when someone uses
Signed-off-by: mishana <mishana4life@gmail.com>
Addressed. I added a no-docker preflight validation that uses the NeMo Evaluator Core API. Details:
This is now in commit
Signed-off-by: mishana <mishana4life@gmail.com>
This reverts commit 07e6372. Signed-off-by: mishana <mishana4life@gmail.com>
mishana force-pushed from 2c10a58 to 665f689
/ok to test 665f689
prokotg left a comment
For now, fixing the TOCTOU is not obligatory. There is one suggestion that is important. Please fix linting; we'll run tests and hopefully have it merged. Thank you!
    )
    # For unlisted tasks, use the full harness.task format
    # For listed tasks, use just the task name (existing behavior)
    if task_definition.get("is_unlisted"):
    if task_definition.get("is_unlisted") or (cfg.execution.type == "local" and cfg.execution.use_docker):
Normally, we send only task invocation because that's unique to a container. Now, in a mixed environment, that's no longer the case.
    if result.returncode == 0:
        killed_something = True
    # Try to stop script process group if a pid file is present.
    pid_file = output_dir / "logs" / "stage.pid"
This unfortunately introduces a TOCTOU vulnerability. If anything happens to the stage.pid file (overwrite, symlink race), we risk killing an alien process or enabling other malicious actions. Since we operate on a third-party process, we cannot use the typical flock approach. If we were to stay with a PID file, then checking the start time / process info before killing could be beneficial. For the scope of this MR, it's okay to skip this.
pribalta left a comment
Power Review
Reviewed 10 files, left 18 findings: 1 error, 12 warnings, 5 suggestions.
Dimensions Analyzed
| Dimension | 🔴 | ⚠️ | 💡 |
|---|---|---|---|
| ai-code-hygiene/architecture | 0 | 2 | 0 |
| ai-code-hygiene/patterns | 0 | 1 | 0 |
| code-correctness/api-contracts | 0 | 1 | 0 |
| code-correctness/boundary-conditions | 1 | 1 | 2 |
| code-correctness/concurrency | 0 | 1 | 0 |
| code-correctness/data-integrity | 0 | 3 | 0 |
| code-correctness/error-handling | 0 | 1 | 2 |
| code-correctness/logic-correctness | 0 | 1 | 1 |
| code-correctness/resource-lifecycle | 0 | 1 | 0 |
| Total | 1 | 12 | 5 |
Across the add local no-docker mode files we swept,
And 18 suggestions we have kept.
AI may code fast and read at scale,
But human judgment tips the final scale.
⏱️ Reviewed in 6.9 min (est. manual review: ~38 min, 6x faster)
Automated review by Power Review — findings are advisory. Learn more about how it works.
    pid_file = output_dir / "logs" / "stage.pid"
    if pid_file.exists():
        try:
            pid = int(pid_file.read_text().strip())
[Power Review] 🔴 Missing validation that PID is positive before os.killpg — PID 0 kills caller's process group · 95% confidence
Per the knowledge document (Missing Input Validation, CWE-20): data arriving from outside the trust boundary (file read) is used without checking range. The PID is read from a file (stage.pid) and passed directly to os.killpg(pid, signal.SIGTERM). If the file contains 0, os.killpg(0, SIGTERM) sends SIGTERM to every process in the calling process group — potentially killing the launcher itself and sibling processes. A negative value would also be invalid. The int() conversion alone does not guard against these dangerous values.
💡 Suggestion: After parsing the PID, validate it is strictly positive before use: if pid <= 0: raise ValueError(f'Invalid PID: {pid}'). This should be placed before the os.killpg() call and within the existing try/except block so that invalid values are caught gracefully.
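The suggested guard can be sketched as a small standalone helper (the name `read_validated_pid` and the function shape are illustrative, not the launcher's actual code):

```python
import pathlib


def read_validated_pid(pid_file: pathlib.Path) -> int:
    """Parse a PID from a file, rejecting zero and negative values.

    os.killpg(0, ...) would signal the caller's own process group, so the
    range check must happen before any kill call.
    """
    pid = int(pid_file.read_text().strip())
    if pid <= 0:
        raise ValueError(f"Invalid PID in {pid_file}: {pid}")
    return pid
```

Because `int()` raises `ValueError` for corrupt content and the guard raises it for non-positive values, the existing `except (OSError, ValueError)` block catches both cases without any new exception types.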
            else:
                os.kill(pid, signal.SIGTERM)
            killed_something = True
        except (OSError, ValueError):
There was a problem hiding this comment.
[Power Review]
Per the knowledge document (ERROR: Swallowed Exception With No Action): 'An exception is caught and silently ignored. The caller has no way to know the operation failed.' Here, except (OSError, ValueError): pass discards all diagnostic information when the PID file read or process kill fails. A ValueError from int() (corrupt PID file) or an OSError other than 'no such process' (e.g., EPERM — permission denied) would be silently lost, making debugging kill failures very difficult. Downgraded to warning because the killed_something flag does communicate functional outcome and the code falls through to Docker-based killing as a secondary strategy.
💡 Suggestion: At minimum, log the exception at debug/warning level so operators can diagnose kill failures. Consider also narrowing the OSError handling — e.g., only silencing ProcessLookupError (errno ESRCH) when the process is already dead, while letting permission errors propagate.
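One way to keep the non-fatal behavior while leaving a diagnostic trail (a sketch; the `try_terminate` helper and logger setup are assumptions, not the launcher's code):

```python
import logging
import os
import signal

logger = logging.getLogger(__name__)


def try_terminate(pid: int) -> bool:
    """Send SIGTERM to pid; return True if the signal was delivered.

    Only ProcessLookupError (process already gone) is tolerated quietly;
    other OS errors such as EPERM are logged so kill failures can be
    diagnosed later.
    """
    try:
        os.kill(pid, signal.SIGTERM)
        return True
    except ProcessLookupError:
        logger.debug("Process %d already gone", pid)
        return False
    except OSError as exc:
        logger.warning("Could not signal PID %d: %s", pid, exc)
        return False
```

The caller can still fall through to the Docker-based kill path on `False`, so functional behavior is unchanged.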
    source "$task_dir/.secrets.env"
    {% endif -%}

    echo "$$" > "$logs_dir/stage.pid"
[Power Review]
Per knowledge doc 'Wrong Variable or Field Referenced': $$ inside a bash subshell (...) always expands to the PID of the top-level script process, not the subshell's PID. Since this echo is inside a (...) subshell block (starting at line 63), every task's stage.pid file will contain the same PID — the parent script's PID. The kill_job function in executor.py then calls os.killpg(pid, signal.SIGTERM) on this PID, which sends SIGTERM to the entire process group. In parallel execution mode, killing any single task would terminate ALL running tasks instead of just the targeted one. This is a stale/wrong value bug — the stored PID does not identify the process the code intends to kill.
💡 Suggestion: Use $BASHPID (available since Bash 4.0, 2009) instead of $$ to capture the actual subshell PID. Additionally, consider whether os.killpg in kill_job is the correct signal mechanism — if the subshell shares its process group with the parent, you may need os.kill(pid, signal.SIGTERM) or have the subshell create its own process group (e.g., via set -m or wrapping with setsid).
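The difference is easy to demonstrate in isolation (a standalone sketch, not part of the actual template):

```shell
#!/usr/bin/env bash
# $$ is the top-level shell's PID everywhere, even inside (...) subshells;
# $BASHPID tracks the current process, so inside (...) it names the
# subshell itself.
echo "top-level: \$\$=$$ \$BASHPID=$BASHPID"    # identical here
(
  echo "subshell:  \$\$=$$ \$BASHPID=$BASHPID"  # $$ unchanged, $BASHPID differs
)
```

Writing `$BASHPID` into stage.pid would make each parallel task's file identify its own subshell rather than the shared parent script.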
            else:
                os.kill(pid, signal.SIGTERM)
            killed_something = True
        except (OSError, ValueError):
[Power Review]
The knowledge document states: 'Never swallow exceptions without logging or re-raising.' While the exceptions caught here are specific (OSError, ValueError) rather than bare, they are silently discarded with pass. An OSError could indicate a permission-denied condition (not just process-already-gone), and a ValueError could indicate a corrupt PID file — both worth logging for debuggability of kill failures.
💡 Suggestion: Add a debug/warning log for the caught exceptions. For example: except (OSError, ValueError) as exc: logger.debug("Could not signal process from %s: %s", pid_file, exc). This preserves the non-fatal behavior while leaving a diagnostic trail.
    if result.returncode == 0:
        killed_something = True
    # Try to stop script process group if a pid file is present.
    pid_file = output_dir / "logs" / "stage.pid"
[Power Review]
The knowledge document explicitly flags os.path.exists(path) then file operation as a Non-Atomic Check-Then-Act (TOCTOU) anti-pattern (CWE-367): 'filesystem TOCTOU — os.path.exists(path) then open(path) ... another process can replace the file between check and use. Instead: open first and handle the error.' Here, pid_file.exists() is checked on line 810, then pid_file.read_text() is called on line 812. The shell template's trap 'rm -f "$logs_dir/stage.pid"' EXIT (run.template.sh line 70) actively removes this file on process exit, creating a real window where the file disappears between check and read. While the except (OSError, ValueError): pass block on line 818 would catch the resulting FileNotFoundError, the exists() guard is redundant and constitutes a textbook TOCTOU pattern.
💡 Suggestion: Remove the pid_file.exists() check and rely solely on exception handling. This is both more correct and more concise:
    try:
        pid = int(pid_file.read_text().strip())
        if hasattr(os, 'killpg'):
            os.killpg(pid, signal.SIGTERM)
        else:
            os.kill(pid, signal.SIGTERM)
        killed_something = True
    except (OSError, ValueError):
        pass

    _validate_task_available_locally(
        task_query=task.name,
        task_definition=task_definition,
        available_tasks_by_harness=local_available_tasks or {},
[Power Review] 💡 Defensive or {} is unreachable — consider assertion instead · 85% confidence
At line 208, available_tasks_by_harness=local_available_tasks or {} uses or {} as a fallback. However, local_available_tasks is set to a dict at line 156 (inside if not use_docker), and _validate_task_available_locally is only called at line 204 (inside if not use_docker). So local_available_tasks is guaranteed to be a non-None dict at this point. The or {} can never activate, making it a tautological guard. This falls under the 'Redundant or Tautological Condition' suggestion pattern.
💡 Suggestion: Replace local_available_tasks or {} with just local_available_tasks, optionally preceded by assert local_available_tasks is not None to document the invariant.
    Raises:
        RuntimeError: If the run script fails.
    """
    use_docker = bool(cfg.execution.get("use_docker", True))
[Power Review] 💡 Missing test for boundary between use_docker=true and use_docker=false in kill_job · 80% confidence
Per the knowledge document (Missing Test for 'Just Inside / Just Outside' Pair): the kill_job method now branches on use_docker (line 802/821) but the test files in this diff do not include any tests for the kill_job path in no-docker mode — specifically testing: (1) kill with valid PID file, (2) kill with missing PID file, (3) kill with empty/invalid PID file content, (4) kill with use_docker=False and no container name. These boundary transitions should be exercised.
💡 Suggestion: Add test cases in test_local_executor.py covering kill_job for no-docker jobs: valid PID file, missing PID file, malformed PID file content (empty, non-numeric, zero, negative), and the transition where use_docker=False with no container name.
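A minimal sketch of the malformed-content boundary cases (the `parse_pid` helper stands in for whatever kill_job actually uses to read stage.pid; it is an assumption, not the launcher's API):

```python
def parse_pid(text: str) -> int:
    """Stand-in for kill_job's PID parsing: int() plus a positivity check."""
    pid = int(text.strip())
    if pid <= 0:
        raise ValueError(f"Invalid PID: {pid}")
    return pid


# Boundary cases the review asks to exercise: empty, non-numeric,
# zero, and negative content must all be rejected.
for bad in ["", "abc", "0", "-5"]:
    try:
        parse_pid(bad)
    except ValueError:
        pass  # expected
    else:
        raise AssertionError(f"{bad!r} should have been rejected")

assert parse_pid("1234\n") == 1234
```

The missing-file and no-container cases would additionally need a tmp_path fixture and an executor instance, which are specific to the project's test harness.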
        available_tasks_by_harness: dict[str, set[str]],
    ) -> None:
        """Validate that a task exists in locally installed NeMo Evaluator packages."""
        harness_name = str(task_definition.get("harness") or "")
[Power Review] 💡 Missing defensive handling when both harness and task name are empty in _validate_task_available_locally · 70% confidence
Per the knowledge document (Lack of Defensive default/else Branch): the function handles two cases — harness_name is truthy (lines 98-114) and harness_name is falsy (lines 116-123). In the falsy branch, if task_name is also empty (both harness and task keys are missing or empty in task_definition), the code searches for "" across all harnesses and produces a generic error message. A more explicit guard at the top of the function would provide a clearer, earlier error message for this malformed-input boundary.
💡 Suggestion: Add an early guard: if not harness_name and not task_name: raise ValueError(f"Task '{task_query}' has no harness or task name defined in its task definition.") before the main branching logic.
    {% if task.dataset_env_var_value -%}
    export NEMO_EVALUATOR_DATASET_DIR="{{ task.dataset_env_var_value }}"
    {% endif -%}
    {{ task.eval_factory_command }} > "$logs_dir/client_stdout.log" 2>&1
[Power Review] 💡 Host eval command exit code not explicitly checked · 70% confidence
Per the knowledge document (SUGGESTION: Shell Command Without Error Check): 'A shell script runs commands without checking the exit code.' In the no-docker branch, {{ task.eval_factory_command }} > "$logs_dir/client_stdout.log" 2>&1 runs the evaluation command, and exit_code=$? captures its status. However, unlike the Docker path which has explicit success/failure echo statements and a structured exit, the no-docker path has no diagnostic logging on failure. If the eval command fails, the only record is in client_stdout.log — there's no stderr message indicating the command failed, its exit code, or guidance for the operator.
💡 Suggestion: Add a failure diagnostic similar to the Docker path, e.g.:
    {{ task.eval_factory_command }} > "$logs_dir/client_stdout.log" 2>&1
    exit_code=$?
    if [ $exit_code -ne 0 ]; then
        echo "Evaluation command failed with exit code $exit_code" >&2
    fi

    if not container_name:
        raise ValueError(f"No container name found for job {job_id}")
    use_docker = bool(job_data.data.get("use_docker", True))
    output_dir = pathlib.Path(job_data.data.get("output_dir", ""))
[Power Review] 💡 Empty-string fallback for output_dir could create relative PID file path · 70% confidence
Per the knowledge document (WARNING: Error Message Without Context / ERROR: Null/None Dereference on Fallible Return): job_data.data.get("output_dir", "") defaults to an empty string. pathlib.Path("") / "logs" / "stage.pid" produces the relative path logs/stage.pid, which could inadvertently match a file in the process's current working directory. While pid_file.exists() guards against reading a nonexistent file, if such a relative path happens to exist, kill_job could send SIGTERM to an unrelated process.
💡 Suggestion: Validate that output_dir is non-empty and absolute before using it for PID file lookups. If missing, skip the PID-based kill or raise an error early:
    output_dir_str = job_data.data.get("output_dir", "")
    if not output_dir_str:
        raise ValueError(f"No output directory found for job {job_id}")
    output_dir = pathlib.Path(output_dir_str)
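The pathlib behavior behind this finding is easy to confirm: joining onto `Path("")` silently yields a relative path rather than an error.

```python
import pathlib

# Path("") joined with more components becomes a relative path, so a
# missing output_dir would resolve against the current working directory.
p = pathlib.Path("") / "logs" / "stage.pid"
assert str(p) == str(pathlib.Path("logs") / "stage.pid")
assert not p.is_absolute()
```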
Summary
- Add a `--no-docker` flag to `nemo-evaluator-launcher run`
- Add `execution.use_docker` (default: `true`) to the local executor config
- Run evaluation commands directly on the host (no `docker run`) in local executor scripts

Rationale
This change enables running `nemo-evaluator-launcher` in environments where users do not have root access and cannot run Docker at all. A concrete example is workloads running on GKE Autopilot pods, where Docker-in-containerd patterns are prohibited. This is especially useful for R&D workflows where teams want to run evaluations inside managed, restricted compute environments while still using the launcher orchestration and task config experience.
Notes
- `execution.type=local` with `deployment.type=none`

Validation
- `uv run pytest tests/unit_tests/test_get_eval_factory_command.py tests/unit_tests/test_local_executor.py tests/unit_tests/test_cli_integration.py -q`
- `uv run ruff check ...` on modified files