Speed-up COMET judge scoring#1447
Conversation
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
📝 WalkthroughWalkthroughAdds distributed execution and numeric-precision control to COMET evaluation: optional dtype casting on load, GPU-count-driven prediction, global-rank-0 output gating, a ChangesCOMET distributed and precision evaluation
🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 83-84: The default batch_size in comet_judge.py is hardcoded to 16
via judge_pipeline_args.get("batch_size", 16), which diverges from the
evaluator's new default of 64; update the default to match by changing the
fallback value to 64 so that batch_size = judge_pipeline_args.get("batch_size",
64) (leave precision handling as-is) and ensure any documentation/comments
reflect the aligned default.
- Around line 120-137: The current install_cmd string in comet_judge.py can
leave LOCK_DIR behind if the installing process is killed, causing other tasks
to spin forever; update the shell sequence built in the install_cmd variable so
the lock creator registers a trap to remove LOCK_DIR on EXIT/termination (and
exit nonzero on failure) before attempting pip install, ensuring the trap is
cleared or lock removed after success and READY_FILE is touched so siblings
never hang waiting on a stale lock; locate and update the install_cmd assignment
in comet_judge.py (symbol: install_cmd) to add the trap cleanup logic around the
mkdir "$LOCK_DIR" / pip install / touch "$READY_FILE" steps.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: aca1cb60-0104-428d-aac4-f5a22157a066
📒 Files selected for processing (3)
nemo_skills/evaluation/evaluator/comet.pynemo_skills/pipeline/eval.pynemo_skills/pipeline/judges/comet_judge.py
| precision = judge_pipeline_args.get("precision", "bf16") | ||
| batch_size = judge_pipeline_args.get("batch_size", 16) |
There was a problem hiding this comment.
Keep the default batch size aligned with the evaluator.
comet.py now defaults --batch-size to 64, but this path still injects 16 when batch_size is omitted. That means the default pipeline invocation stays slower than running the evaluator directly, which undercuts the goal of this change.
🛠️ Proposed fix
- batch_size = judge_pipeline_args.get("batch_size", 16)
+ batch_size = judge_pipeline_args.get("batch_size", 64)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 83 - 84, The default
batch_size in comet_judge.py is hardcoded to 16 via
judge_pipeline_args.get("batch_size", 16), which diverges from the evaluator's
new default of 64; update the default to match by changing the fallback value to
64 so that batch_size = judge_pipeline_args.get("batch_size", 64) (leave
precision handling as-is) and ensure any documentation/comments reflect the
aligned default.
There was a problem hiding this comment.
Let's set the batch_size default to 16 in comet.py
There was a problem hiding this comment.
🧩 Analysis chain
@AlexGrinch, understood — keeping 16 as the canonical default and aligning comet.py to it. Here's the change needed:
- Update
--batch-sizedefault incomet.pyfrom64→16
🏁 Script executed:
#!/bin/bash
# Find the batch-size argument definition in comet.py
rg -n "batch.size" nemo_skills/evaluation/evaluator/comet.py
Length of output: 181
🏁 Script executed:
#!/bin/bash
rg -n -A5 '"--batch-size"' nemo_skills/evaluation/evaluator/comet.py
Length of output: 169
@AlexGrinch, confirmed — the CLI argument at line 162 has default=64 while the function signature at line 83 already uses 16. Here's the fix:
# nemo_skills/evaluation/evaluator/comet.py line 162
- default=64,
+ default=16,That brings the CLI default in line with the function default and the judge pipeline's batch_size fallback of 16.
✏️ Learnings added
Learnt from: AlexGrinch
Repo: NVIDIA-NeMo/Skills PR: 1447
File: nemo_skills/pipeline/judges/comet_judge.py:83-84
Timestamp: 2026-05-15T22:20:07.609Z
Learning: In `nemo_skills/pipeline/judges/comet_judge.py` and `nemo_skills/evaluation/evaluator/comet.py`, the canonical default batch size for COMET judge scoring is `16`. Both files should use `16` as the default `--batch-size` / `batch_size` fallback, not `64`.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
- 📌 Commit the changes to this branch
Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/evaluation/evaluator/comet.py`:
- Around line 52-56: The call to comet_model.predict currently uses
torch.cuda.device_count() on every process, causing each rank to request all
GPUs; modify the code so GPU count is only derived when this process is global
rank 0 (use _is_global_rank_zero() guard) or, if the process group is
initialized (dist.is_initialized()), pass gpus=1 for non-zero ranks;
specifically, ensure the torch.cuda.device_count() call and any nearby logic
that computes gpus is moved inside the _is_global_rank_zero() branch (or
replaced by using the launcher-provided per-rank GPU value) and make
comet_model.predict(...) use gpus=1 on distributed/non-zero ranks so only rank 0
requests the full device set and performs the I/O/predict work.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b93dfd33-e00b-41c8-922f-b60d25619d26
📒 Files selected for processing (2)
nemo_skills/evaluation/evaluator/comet.pynemo_skills/pipeline/judges/comet_judge.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/pipeline/judges/comet_judge.py
Summary by CodeRabbit
New Features
Bug Fixes