Speed-up COMET judge scoring by lilithgrigoryan · Pull Request #1447 · NVIDIA-NeMo/Skills

lilithgrigoryan · 2026-05-13T20:32:43Z

Summary by CodeRabbit

New Features
- COMET evaluation supports distributed multi-GPU execution.
- Added configurable inference precision for model weight casting (default: bf16).
- Comet judge accepts a configurable batch size (default: 16) and adapts tasks to available GPUs.
- Scored outputs are written once by the primary process.
Bug Fixes
- Custom judge configurations now apply across all evaluation paths.
- Improved concurrent handling for judge dependency installation.

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

coderabbitai · 2026-05-13T20:39:27Z

📝 Walkthrough

Walkthrough

Adds distributed execution and numeric-precision control to COMET evaluation: optional dtype casting on load, GPU-count-driven prediction, global-rank-0 output gating, a --precision CLI flag, judge kwargs propagation to custom judge steps, and per-node guarded install with precision and batch-size passed to judge tasks.

Changes

COMET distributed and precision evaluation

Layer / File(s)	Summary
COMET evaluator distributed and precision infrastructure `nemo_skills/evaluation/evaluator/comet.py`	Imports `torch`/`torch.distributed`; monkey-patches `torch.load` to allow `weights_only=False`; adds `_is_global_rank_zero()` and `_PRECISION_TO_DTYPE`; updates `load_comet_model` signature to accept an optional `dtype`, cast weights when provided, and defer device placement.
COMET inference execution with GPU detection and ranking `nemo_skills/evaluation/evaluator/comet.py`	Validates and reads input JSONL directly, detects `torch.cuda.device_count()` for `gpus` argument, calls `comet_model.predict(..., gpus=num_gpus)`, attaches per-sample `comet` scores, and writes the output JSONL and `.done` only on global rank 0.
COMET evaluator CLI configuration `nemo_skills/evaluation/evaluator/comet.py`	Adds `--precision` CLI option (default `bf16`) with choices from `_PRECISION_TO_DTYPE`; passes the selected dtype into `load_comet_model` in `main()`.
Judge pipeline kwargs application to custom judge path `nemo_skills/pipeline/eval.py`	When `judge_step_fn` is provided, parses `judge_pipeline_kwargs` and merges them into `judge_pipeline_args` before calling the custom judge step creator.
COMET judge task creation with per-node locking and configuration `nemo_skills/pipeline/judges/comet_judge.py`	Reads `precision` (default `bf16`) and `batch_size` (default `16`) from `judge_pipeline_args`; appends `--precision` and `--batch-size` to the evaluator invocation; replaces unconditional `pip install` with a per-node lock/ready-file guarded install sequence; sets `num_tasks` to `judge_server_gpus or 1`.

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the main objective of the pull request—optimizing COMET judge scoring performance through distributed execution, configurable precision, and process I/O gating.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 83-84: The default batch_size in comet_judge.py is hardcoded to 16
via judge_pipeline_args.get("batch_size", 16), which diverges from the
evaluator's new default of 64; update the default to match by changing the
fallback value to 64 so that batch_size = judge_pipeline_args.get("batch_size",
64) (leave precision handling as-is) and ensure any documentation/comments
reflect the aligned default.
- Around line 120-137: The current install_cmd string in comet_judge.py can
leave LOCK_DIR behind if the installing process is killed, causing other tasks
to spin forever; update the shell sequence built in the install_cmd variable so
the lock creator registers a trap to remove LOCK_DIR on EXIT/termination (and
exit nonzero on failure) before attempting pip install, ensuring the trap is
cleared or lock removed after success and READY_FILE is touched so siblings
never hang waiting on a stale lock; locate and update the install_cmd assignment
in comet_judge.py (symbol: install_cmd) to add the trap cleanup logic around the
mkdir "$LOCK_DIR" / pip install / touch "$READY_FILE" steps.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aca1cb60-0104-428d-aac4-f5a22157a066

📥 Commits

Reviewing files that changed from the base of the PR and between 410b337 and b8b5955.

📒 Files selected for processing (3)

nemo_skills/evaluation/evaluator/comet.py
nemo_skills/pipeline/eval.py
nemo_skills/pipeline/judges/comet_judge.py

coderabbitai · 2026-05-13T20:39:31Z

+    precision = judge_pipeline_args.get("precision", "bf16")
+    batch_size = judge_pipeline_args.get("batch_size", 16)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the default batch size aligned with the evaluator.

comet.py now defaults --batch-size to 64, but this path still injects 16 when batch_size is omitted. That means the default pipeline invocation stays slower than running the evaluator directly, which undercuts the goal of this change.

🛠️ Proposed fix

- batch_size = judge_pipeline_args.get("batch_size", 16) + batch_size = judge_pipeline_args.get("batch_size", 64)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 83 - 84, The default batch_size in comet_judge.py is hardcoded to 16 via judge_pipeline_args.get("batch_size", 16), which diverges from the evaluator's new default of 64; update the default to match by changing the fallback value to 64 so that batch_size = judge_pipeline_args.get("batch_size", 64) (leave precision handling as-is) and ensure any documentation/comments reflect the aligned default.

Let's set the batch_size default to 16 in comet.py

🧩 Analysis chain

@AlexGrinch, understood — keeping 16 as the canonical default and aligning comet.py to it. Here's the change needed:

Update --batch-size default in comet.py from 64 → 16

🏁 Script executed:

#!/bin/bash # Find the batch-size argument definition in comet.py rg -n "batch.size" nemo_skills/evaluation/evaluator/comet.py

Length of output: 181

🏁 Script executed:

#!/bin/bash rg -n -A5 '"--batch-size"' nemo_skills/evaluation/evaluator/comet.py

Length of output: 169

@AlexGrinch, confirmed — the CLI argument at line 162 has default=64 while the function signature at line 83 already uses 16. Here's the fix:

# nemo_skills/evaluation/evaluator/comet.py line 162 - default=64, + default=16,

That brings the CLI default in line with the function default and the judge pipeline's batch_size fallback of 16.

✏️ Learnings added

Learnt from: AlexGrinch Repo: NVIDIA-NeMo/Skills PR: 1447 File: nemo_skills/pipeline/judges/comet_judge.py:83-84 Timestamp: 2026-05-15T22:20:07.609Z Learning: In `nemo_skills/pipeline/judges/comet_judge.py` and `nemo_skills/evaluation/evaluator/comet.py`, the canonical default batch size for COMET judge scoring is `16`. Both files should use `16` as the default `--batch-size` / `batch_size` fallback, not `64`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

📌 Commit the changes to this branch

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/evaluation/evaluator/comet.py`:
- Around line 52-56: The call to comet_model.predict currently uses
torch.cuda.device_count() on every process, causing each rank to request all
GPUs; modify the code so GPU count is only derived when this process is global
rank 0 (use _is_global_rank_zero() guard) or, if the process group is
initialized (dist.is_initialized()), pass gpus=1 for non-zero ranks;
specifically, ensure the torch.cuda.device_count() call and any nearby logic
that computes gpus is moved inside the _is_global_rank_zero() branch (or
replaced by using the launcher-provided per-rank GPU value) and make
comet_model.predict(...) use gpus=1 on distributed/non-zero ranks so only rank 0
requests the full device set and performs the I/O/predict work.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b93dfd33-e00b-41c8-922f-b60d25619d26

📥 Commits

Reviewing files that changed from the base of the PR and between b8b5955 and 5d03073.

📒 Files selected for processing (2)

nemo_skills/evaluation/evaluator/comet.py
nemo_skills/pipeline/judges/comet_judge.py

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/pipeline/judges/comet_judge.py

lilithgrigoryan added 10 commits May 13, 2026 17:38

add test benchmark

6cc348a

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

adjusting the prompt

2116a34

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

hardcode

ef3f194

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

8gpu run

fa12d26

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

set num gpus to 8

fe11b9f

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

add lock for installing unbabel comet

d23bfb1

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

fix loading model and IO

2caafae

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

fix writing

91875d5

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

add bfloat16

561b142

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

clean up

b8b5955

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

lilithgrigoryan marked this pull request as draft May 13, 2026 20:34

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

AlexGrinch self-assigned this May 15, 2026

restore batch size, add proper hook

5d03073

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

lilithgrigoryan marked this pull request as ready for review May 19, 2026 15:14

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread nemo_skills/evaluation/evaluator/comet.py

lilithgrigoryan requested a review from AlexGrinch May 19, 2026 15:19

naymaraq self-requested a review May 19, 2026 17:16

naymaraq added the run GPU tests label May 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed-up COMET judge scoring#1447

Speed-up COMET judge scoring#1447
lilithgrigoryan wants to merge 11 commits into
NVIDIA-NeMo:mainfrom
lilithgrigoryan:lgrigoryan/multi-device-comet

lilithgrigoryan commented May 13, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 13, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 13, 2026 •

edited

Loading

Uh oh!

AlexGrinch May 15, 2026

Uh oh!

coderabbitai Bot May 15, 2026

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		precision = judge_pipeline_args.get("precision", "bf16")
		batch_size = judge_pipeline_args.get("batch_size", 16)

Conversation

lilithgrigoryan commented May 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AlexGrinch May 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lilithgrigoryan commented May 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading

coderabbitai Bot May 13, 2026 •

edited

Loading