Skip to content

Speed-up COMET judge scoring#1447

Open
lilithgrigoryan wants to merge 11 commits into
NVIDIA-NeMo:mainfrom
lilithgrigoryan:lgrigoryan/multi-device-comet
Open

Speed-up COMET judge scoring#1447
lilithgrigoryan wants to merge 11 commits into
NVIDIA-NeMo:mainfrom
lilithgrigoryan:lgrigoryan/multi-device-comet

Conversation

@lilithgrigoryan
Copy link
Copy Markdown
Collaborator

@lilithgrigoryan lilithgrigoryan commented May 13, 2026

Summary by CodeRabbit

  • New Features

    • COMET evaluation supports distributed multi-GPU execution.
    • Added configurable inference precision for model weight casting (default: bf16).
    • Comet judge accepts a configurable batch size (default: 16) and adapts tasks to available GPUs.
    • Scored outputs are written once by the primary process.
  • Bug Fixes

    • Custom judge configurations now apply across all evaluation paths.
    • Improved concurrent handling for judge dependency installation.

Review Change Stack

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
@lilithgrigoryan lilithgrigoryan marked this pull request as draft May 13, 2026 20:34
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 13, 2026

📝 Walkthrough

Walkthrough

Adds distributed execution and numeric-precision control to COMET evaluation: optional dtype casting on load, GPU-count-driven prediction, global-rank-0 output gating, a --precision CLI flag, judge kwargs propagation to custom judge steps, and per-node guarded install with precision and batch-size passed to judge tasks.

Changes

COMET distributed and precision evaluation

Layer / File(s) Summary
COMET evaluator distributed and precision infrastructure
nemo_skills/evaluation/evaluator/comet.py
Imports torch/torch.distributed; monkey-patches torch.load to allow weights_only=False; adds _is_global_rank_zero() and _PRECISION_TO_DTYPE; updates load_comet_model signature to accept an optional dtype, cast weights when provided, and defer device placement.
COMET inference execution with GPU detection and ranking
nemo_skills/evaluation/evaluator/comet.py
Validates and reads input JSONL directly, detects torch.cuda.device_count() for gpus argument, calls comet_model.predict(..., gpus=num_gpus), attaches per-sample comet scores, and writes the output JSONL and .done only on global rank 0.
COMET evaluator CLI configuration
nemo_skills/evaluation/evaluator/comet.py
Adds --precision CLI option (default bf16) with choices from _PRECISION_TO_DTYPE; passes the selected dtype into load_comet_model in main().
Judge pipeline kwargs application to custom judge path
nemo_skills/pipeline/eval.py
When judge_step_fn is provided, parses judge_pipeline_kwargs and merges them into judge_pipeline_args before calling the custom judge step creator.
COMET judge task creation with per-node locking and configuration
nemo_skills/pipeline/judges/comet_judge.py
Reads precision (default bf16) and batch_size (default 16) from judge_pipeline_args; appends --precision and --batch-size to the evaluator invocation; replaces unconditional pip install with a per-node lock/ready-file guarded install sequence; sets num_tasks to judge_server_gpus or 1.

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main objective of the pull request—optimizing COMET judge scoring performance through distributed execution, configurable precision, and process I/O gating.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 83-84: The default batch_size in comet_judge.py is hardcoded to 16
via judge_pipeline_args.get("batch_size", 16), which diverges from the
evaluator's new default of 64; update the default to match by changing the
fallback value to 64 so that batch_size = judge_pipeline_args.get("batch_size",
64) (leave precision handling as-is) and ensure any documentation/comments
reflect the aligned default.
- Around line 120-137: The current install_cmd string in comet_judge.py can
leave LOCK_DIR behind if the installing process is killed, causing other tasks
to spin forever; update the shell sequence built in the install_cmd variable so
the lock creator registers a trap to remove LOCK_DIR on EXIT/termination (and
exit nonzero on failure) before attempting pip install, ensuring the trap is
cleared or lock removed after success and READY_FILE is touched so siblings
never hang waiting on a stale lock; locate and update the install_cmd assignment
in comet_judge.py (symbol: install_cmd) to add the trap cleanup logic around the
mkdir "$LOCK_DIR" / pip install / touch "$READY_FILE" steps.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aca1cb60-0104-428d-aac4-f5a22157a066

📥 Commits

Reviewing files that changed from the base of the PR and between 410b337 and b8b5955.

📒 Files selected for processing (3)
  • nemo_skills/evaluation/evaluator/comet.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/judges/comet_judge.py

Comment on lines +83 to +84
precision = judge_pipeline_args.get("precision", "bf16")
batch_size = judge_pipeline_args.get("batch_size", 16)
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the default batch size aligned with the evaluator.

comet.py now defaults --batch-size to 64, but this path still injects 16 when batch_size is omitted. That means the default pipeline invocation stays slower than running the evaluator directly, which undercuts the goal of this change.

🛠️ Proposed fix
-    batch_size = judge_pipeline_args.get("batch_size", 16)
+    batch_size = judge_pipeline_args.get("batch_size", 64)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 83 - 84, The default
batch_size in comet_judge.py is hardcoded to 16 via
judge_pipeline_args.get("batch_size", 16), which diverges from the evaluator's
new default of 64; update the default to match by changing the fallback value to
64 so that batch_size = judge_pipeline_args.get("batch_size", 64) (leave
precision handling as-is) and ensure any documentation/comments reflect the
aligned default.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's set the batch_size default to 16 in comet.py

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

@AlexGrinch, understood — keeping 16 as the canonical default and aligning comet.py to it. Here's the change needed:

  • Update --batch-size default in comet.py from 6416

🏁 Script executed:

#!/bin/bash
# Find the batch-size argument definition in comet.py
rg -n "batch.size" nemo_skills/evaluation/evaluator/comet.py

Length of output: 181


🏁 Script executed:

#!/bin/bash
rg -n -A5 '"--batch-size"' nemo_skills/evaluation/evaluator/comet.py

Length of output: 169


@AlexGrinch, confirmed — the CLI argument at line 162 has default=64 while the function signature at line 83 already uses 16. Here's the fix:

# nemo_skills/evaluation/evaluator/comet.py  line 162
-        default=64,
+        default=16,

That brings the CLI default in line with the function default and the judge pipeline's batch_size fallback of 16.


✏️ Learnings added
Learnt from: AlexGrinch
Repo: NVIDIA-NeMo/Skills PR: 1447
File: nemo_skills/pipeline/judges/comet_judge.py:83-84
Timestamp: 2026-05-15T22:20:07.609Z
Learning: In `nemo_skills/pipeline/judges/comet_judge.py` and `nemo_skills/evaluation/evaluator/comet.py`, the canonical default batch size for COMET judge scoring is `16`. Both files should use `16` as the default `--batch-size` / `batch_size` fallback, not `64`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


  • 📌 Commit the changes to this branch

Comment thread nemo_skills/pipeline/judges/comet_judge.py
@AlexGrinch AlexGrinch self-assigned this May 15, 2026
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
@lilithgrigoryan lilithgrigoryan marked this pull request as ready for review May 19, 2026 15:14
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/evaluation/evaluator/comet.py`:
- Around line 52-56: The call to comet_model.predict currently uses
torch.cuda.device_count() on every process, causing each rank to request all
GPUs; modify the code so GPU count is only derived when this process is global
rank 0 (use _is_global_rank_zero() guard) or, if the process group is
initialized (dist.is_initialized()), pass gpus=1 for non-zero ranks;
specifically, ensure the torch.cuda.device_count() call and any nearby logic
that computes gpus is moved inside the _is_global_rank_zero() branch (or
replaced by using the launcher-provided per-rank GPU value) and make
comet_model.predict(...) use gpus=1 on distributed/non-zero ranks so only rank 0
requests the full device set and performs the I/O/predict work.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b93dfd33-e00b-41c8-922f-b60d25619d26

📥 Commits

Reviewing files that changed from the base of the PR and between b8b5955 and 5d03073.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/evaluator/comet.py
  • nemo_skills/pipeline/judges/comet_judge.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/pipeline/judges/comet_judge.py

Comment thread nemo_skills/evaluation/evaluator/comet.py
@naymaraq naymaraq self-requested a review May 19, 2026 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants