Skip to content

Add MT-Bench and Komt-Bench (Korean MT-Bench) benchmarks#1378

Open
bzantium wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
bzantium:feat/mt-bench-komt-bench
Open

Add MT-Bench and Komt-Bench (Korean MT-Bench) benchmarks#1378
bzantium wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
bzantium:feat/mt-bench-komt-bench

Conversation

@bzantium
Copy link
Copy Markdown
Contributor

@bzantium bzantium commented Apr 17, 2026

Closes #1377

Summary

Add MT-Bench (80 two-turn English questions) and Komt-Bench (80 two-turn Korean questions) with LLM-as-judge evaluation.

Architecture

Both benchmarks share the same generation + judge modules:

  • mt_bench.py — Two-turn generation: answer_1 for question_1, then answer_2 conditioned on full conversation
  • mt_bench_judge.py — Per-turn 1-10 scoring with multi-judge averaging (default 8 calls), reference-aware prompts for math/reasoning/coding categories

Metrics

  • MTBenchMetrics — Parses [[rating]] from judge output; per-turn and per-category averages
  • KomtBenchMetrics — Extends MTBenchMetrics with sqrt(score) penalty for non-Korean responses (code-output questions exempted)

Data

Benchmark Source Format
mt-bench FastChat repo questions + GPT-4 ref answers prepare.py downloads
komt-bench LGAI-EXAONE/KoMT-Bench prepare.py downloads

MT-Bench reuses existing English judge prompts at judge/mt-bench/. Komt-Bench adds Korean judge prompts at judge/komt-bench/ (4 turn/ref variants).

Changes

New files (12)

  • nemo_skills/dataset/{mt-bench,komt-bench}/__init__.py + prepare.py each
  • nemo_skills/inference/eval/mt_bench.py, mt_bench_judge.py
  • nemo_skills/evaluation/metrics/mt_bench_metrics.py, komt_bench_metrics.py
  • nemo_skills/prompt/config/judge/komt-bench/ — 4 YAMLs

Modified files (1)

  • nemo_skills/evaluation/metrics/map_metrics.py — register mt_bench + komt_bench (+4 lines)

Test plan

  • ns eval --benchmarks mt-bench --model <model> — verify two-turn generation + judge scoring + metrics
  • ns eval --benchmarks komt-bench --model <model> — verify Korean prompts + language penalty
  • Existing benchmarks unaffected (metrics map additions only)

Summary by CodeRabbit

  • New Features

    • MT-Bench evaluation: two-turn generation, judging, and Hydra-configured CLI entrypoints.
    • KoMT-Bench (Korean MT-Bench) dataset support with dataset preparation utilities.
    • New Korean judge prompt templates for single- and multi-turn evaluation (with/without references).
    • Multi-judge execution with aggregated and per-turn outputs; language-aware scoring that applies a penalty when answers mismatch a target language.
  • Tests

    • Comprehensive tests validating MT-Bench metrics, averaging, penalty logic, multi-judge behavior, and error handling.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds MT-Bench and Komt-Bench: dataset initializers and prep scripts, a two‑turn generation task, an LLM-as-judge task with multi‑judge support, MTBench metrics (parsing judge ratings and optional language penalty), Korean judge prompt templates, and unit tests for metrics.

Changes

Cohort / File(s) Summary
Dataset Config & Exports
nemo_skills/dataset/mt-bench/__init__.py, nemo_skills/dataset/komt-bench/__init__.py
New package initializers exporting default benchmark settings: METRICS_TYPE, GENERATION_MODULE, GENERATION_ARGS, JUDGE_PIPELINE_ARGS, and JUDGE_ARGS.
Dataset Preparation Scripts
nemo_skills/dataset/mt-bench/prepare.py, nemo_skills/dataset/komt-bench/prepare.py
New CLI scripts to download/normalize datasets into test.jsonl, parse stringified fields, enforce exactly two turns, include up to two references, and set target_language for Komt-Bench.
Two-turn Generation
nemo_skills/inference/eval/mt_bench.py
Adds Hydra-configured MTBenchGenerationTask and config dataclasses: generates turn‑1, conditions turn‑2 on [Q1,A1,Q2], optionally parses reasoning, returns combined results and token counts; exposes generate(cfg) entrypoint.
LLM-as-Judge Task
nemo_skills/inference/eval/mt_bench_judge.py
Adds MTBenchJudgeTask and config: selects per-turn/category prompt templates (with/without refs), runs N parallel judgements with 180s timeout, returns primary and aggregated judgements and supports multi-judge outputs; Hydra entrypoint provided.
Metrics Implementation & Registry
nemo_skills/evaluation/metrics/mt_bench_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
Adds MTBenchMetrics (parses [[rating]], averages per-turn/category, supports multi-judge lists, optional ISO‑639-1 target_language validation and sqrt penalty on mismatch) and registers "mt_bench" in metrics map.
Korean Judge Prompts
nemo_skills/prompt/config/judge/komt-bench/turn1.yaml, .../turn1_with_ref.yaml, .../turn2.yaml, .../turn2_with_ref.yaml
New Korean prompt templates for turn1/turn2 with and without reference answers; require final [[rating]] token in output.
Tests
tests/test_mt_bench_metrics.py
New comprehensive unit tests for MTBenchMetrics: parsing, aggregation, multi-judge averaging, invalid-rating handling, language-penalty behavior, ISO code validation, and missing-field error cases.

Sequence Diagram(s)

sequenceDiagram
    participant Dataset as Dataset\n(prepared test.jsonl)
    participant Gen as Generation Task\n(MTBenchGenerationTask)
    participant Judge as Judge Task\n(MTBenchJudgeTask)
    participant Metrics as Metrics\n(MTBenchMetrics)

    Dataset->>Gen: send {question_1, question_2, refs?, category, target_language?}
    Gen->>Gen: generate answer_1 from question_1
    Gen->>Gen: generate answer_2 from [Q1,A1,Q2]
    Gen->>Judge: submit Qs, As, refs, category for judging
    par per turn
        Judge->>Judge: select prompt (with/without ref)
        Judge->>Judge: run N parallel judge calls (timeout 180s)
        Judge->>Judge: extract [[rating]] from outputs
    end
    Judge->>Metrics: send judgements, all_judgements, num_judges, answers, target_language
    Metrics->>Metrics: parse ratings, apply sqrt penalty if language mismatch, aggregate per-turn/category/overall
    Metrics-->>Dataset: emit pass@1 and aggregated metrics
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 23.81% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main objective of the PR: adding two benchmarks (MT-Bench and Komt-Bench) with clear reference to the Korean variant.
Linked Issues check ✅ Passed The PR implementation comprehensively addresses all linked issue objectives: two-turn generation module [mt_bench.py], judge module [mt_bench_judge.py], MTBenchMetrics and KomtBenchMetrics implementations, Korean judge prompts [judge/komt-bench/], dataset prepare scripts, and metric registration in map_metrics.py.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the PR objectives; files added focus exclusively on MT-Bench/Komt-Bench implementation including generation, judging, metrics, prompts, data preparation, and test coverage with no unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/dataset/komt-bench/prepare.py`:
- Around line 29-41: The current loop writes to output_file while transforming
rows which can leave a partial test.jsonl on failure; instead, first iterate ds
and build a list of entries (using the same logic that constructs entry from
turns, reference, question_id, category, question_1/2, ref_answer_1/2), then
open output_file once and write all json.dumps(entry) lines from that in-memory
list; update prepare.py to compute entries from ds first (e.g., collect into a
list named entries) and only then open output_file for writing and flush/write
entries to avoid partial files.
- Around line 31-39: Validate that parsed 'turns' and 'reference' both have
exactly two elements before building 'entry' (use the existing variables 'turns'
and 'reference'); if not, raise a ValueError (include the row["question_id"] and
row content for context) instead of falling back to empty string/None, then
proceed to construct 'entry' with turns[0], turns[1], reference[0],
reference[1]; replace the current len-checks around
"question_1"/"question_2"/"ref_answer_1"/"ref_answer_2" with this strict
validation to enforce the fixed two-turn benchmark invariant.

In `@nemo_skills/dataset/mt-bench/__init__.py`:
- Around line 17-28: Add MT-Bench documentation and CI/test coverage: document
the new METRICS_TYPE="mt_bench" and the generation/judge wiring
(GENERATION_MODULE="nemo_skills.inference.eval.mt_bench", JUDGE_PIPELINE_ARGS
and JUDGE_ARGS) in docs/evaluation/external-benchmarks.md with example run
commands (how to invoke the generation and judge pipelines and expected metric
outputs) and expected results for tested models; create a
tests/slurm-tests/mt-bench/ directory mirroring other benchmark test folders
(include SLURM job scripts, minimal input configs and assertions that run the
generation and judge pipelines and validate metrics); and add mt-bench to the CI
by updating .github/workflows/gpu_tests.yml (and any test matrix) so the new
SLURM test directory is executed in CI. Ensure filenames and commands reference
the above symbols (METRICS_TYPE, GENERATION_MODULE, JUDGE_PIPELINE_ARGS,
JUDGE_ARGS) so the tests exercise the actual code paths.

In `@nemo_skills/dataset/mt-bench/prepare.py`:
- Around line 47-48: Read and transform the entire input (questions_file) into
memory or a temporary buffer first, and only after all rows are successfully
processed open output_file for writing (or write to a temporary file and
atomically replace output_file with os.replace) to avoid leaving a partially
written test.jsonl on failure; adjust the existing with-open that currently
opens fout early so that only reading/parsing happens before opening/writing,
referencing the questions_file and output_file variables used in the current
block.
- Around line 45-60: The loop that builds entry currently tolerates missing
turns and missing ref answers (using fallback ""/None/.get()) and should instead
fail fast: validate that data["turns"] has exactly two items and that
ref_answers contains qid (ref_answers[qid]) before constructing entry; if
validation fails, raise a clear exception (e.g., ValueError including qid) so
malformed records are rejected. Replace the conditional fallbacks for
"question_2", "ref_answer_1", and "ref_answer_2" with direct indexing
(data["turns"][1], ref_answers[qid]["turn1"], ref_answers[qid]["turn2"]) after
the checks to enforce strict two-turn, aligned refs.

In `@nemo_skills/evaluation/metrics/komt_bench_metrics.py`:
- Around line 59-84: This code silently defaults critical fields using dict.get
(category, question_id, answer_1, answer_2, judgement_turn1, judgement_turn2)
which can mis-score records; change to fail-fast by accessing those keys
directly (e.g., pred["question_id"], pred["answer_1"], etc.) or validate
presence at the start of processing and raise a clear exception if missing, then
keep using _parse_rating, CODE_OUTPUT_QUESTION_IDS and _detect_korean as before;
ensure the multi-judge branches also require and validate
all_judgements_turn1/all_judgements_turn2 exist before computing averages so
malformed inputs raise an error instead of using silent defaults.

In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py`:
- Around line 45-59: The code silently masks malformed judge records by using
pred.get(...) with defaults for "category" and fallback
"judgement_turn1"/"judgement_turn2"; change these to direct key access (e.g.,
pred["category"], pred["judgement_turn1"], pred["judgement_turn2"]) so missing
fields raise immediately, and validate the retrieved values before calling
self._parse_rating (raise a clear error mentioning the record/pred id if types
are wrong). Keep the existing multi-judge handling for
"all_judgements_turn1"/"all_judgements_turn2" but ensure you still validate
their presence/type when expected.
- Around line 22-24: The MTBenchMetrics __init__ currently only calls
self.reset(), bypassing BaseMetrics initialization; change
MTBenchMetrics.__init__ to call the base constructor first (super().__init__()
or BaseMetrics.__init__(self)) and then call self.reset() so any base attributes
set in BaseMetrics.__init__ are initialized while still resetting subclass state
via reset().

In `@nemo_skills/inference/eval/mt_bench_judge.py`:
- Around line 121-126: The except asyncio.TimeoutError handler currently returns
an empty string which makes timeouts appear as missing scores; instead surface
the failure by raising a specific exception (e.g., raise RuntimeError or a
custom JudgeTimeoutError) so the caller/metrics layer can count it separately;
keep the LOG.warning call (using LOG.warning, JUDGE_TIMEOUT, and data_point) but
replace the final "return \"\"" with raising the exception containing context
(turn, JUDGE_TIMEOUT, category) so timeouts are explicit failures rather than
dropped scores.
- Around line 96-107: The current branch populates judge_input with
ref_answer_1/2 using data_point.get(..., ""), which silently masks missing
reference answers; when category is in NEED_REF_CATS you must require those keys
to be present. Replace the .get(...) calls with direct indexing (e.g.,
data_point["ref_answer_1"] and data_point["ref_answer_2"]) inside the blocks
that check category in NEED_REF_CATS so a KeyError surfaces for bad records (or
add an explicit check that raises a clear error if either key is missing) and
keep the surrounding fields assigned as-is to judge_input.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 798045c1-5a4b-410a-92b5-f738273ad309

📥 Commits

Reviewing files that changed from the base of the PR and between 420d006 and 7b9eb25.

📒 Files selected for processing (13)
  • nemo_skills/dataset/komt-bench/__init__.py
  • nemo_skills/dataset/komt-bench/prepare.py
  • nemo_skills/dataset/mt-bench/__init__.py
  • nemo_skills/dataset/mt-bench/prepare.py
  • nemo_skills/evaluation/metrics/komt_bench_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/mt_bench_metrics.py
  • nemo_skills/inference/eval/mt_bench.py
  • nemo_skills/inference/eval/mt_bench_judge.py
  • nemo_skills/prompt/config/judge/komt-bench/turn1.yaml
  • nemo_skills/prompt/config/judge/komt-bench/turn1_with_ref.yaml
  • nemo_skills/prompt/config/judge/komt-bench/turn2.yaml
  • nemo_skills/prompt/config/judge/komt-bench/turn2_with_ref.yaml

Comment on lines +29 to +41
with open(output_file, "wt", encoding="utf-8") as fout:
for row in ds:
turns = ast.literal_eval(row["turns"])
reference = ast.literal_eval(row["reference"])
entry = {
"question_id": int(row["question_id"]),
"category": row["category"],
"question_1": turns[0] if len(turns) > 0 else "",
"question_2": turns[1] if len(turns) > 1 else "",
"ref_answer_1": reference[0] if len(reference) > 0 else None,
"ref_answer_2": reference[1] if len(reference) > 1 else None,
}
fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Compute entries first, then write test.jsonl in a separate step.

Opening test.jsonl before transformation finishes can leave a partial file if one row fails.

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/komt-bench/prepare.py` around lines 29 - 41, The current
loop writes to output_file while transforming rows which can leave a partial
test.jsonl on failure; instead, first iterate ds and build a list of entries
(using the same logic that constructs entry from turns, reference, question_id,
category, question_1/2, ref_answer_1/2), then open output_file once and write
all json.dumps(entry) lines from that in-memory list; update prepare.py to
compute entries from ds first (e.g., collect into a list named entries) and only
then open output_file for writing and flush/write entries to avoid partial
files.

Comment thread nemo_skills/dataset/komt-bench/prepare.py Outdated
Comment on lines +17 to +28
METRICS_TYPE = "mt_bench"
GENERATION_MODULE = "nemo_skills.inference.eval.mt_bench"
GENERATION_ARGS = "++prompt_format=openai"

JUDGE_PIPELINE_ARGS = {
"generation_module": "nemo_skills.inference.eval.mt_bench_judge",
"model": "gpt-4o-2024-08-06",
"server_type": "openai",
"server_address": "https://api.openai.com/v1",
}

JUDGE_ARGS = "++num_judges=8"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Docs mentioning MT-Bench / KoMT-Bench =="
rg -n --iglob '*.md' 'mt[-_ ]bench|komt[-_ ]bench|komt_bench|mt_bench|KoMT-Bench'

echo
echo "== Benchmark dataset/config files =="
fd -i 'mt-bench|komt-bench' nemo_skills

echo
echo "== Slurm / CI test references =="
rg -n --iglob '*slurm*' --iglob '*test*' 'mt[-_ ]bench|komt[-_ ]bench|komt_bench|mt_bench'

Repository: NVIDIA-NeMo/Skills

Length of output: 105


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Searching for MT-Bench references in repo ==="
# Search more broadly without glob restrictions first
rg -i 'mt.?bench' --max-count 20

echo
echo "=== Looking at docs structure ==="
find . -type f -name "*.md" | head -20

echo
echo "=== Checking for SLURM test files ==="
find . -type f -name "*slurm*" -o -name "*test*.py" | grep -i slurm | head -10

echo
echo "=== Checking benchmark directories ==="
ls -la nemo_skills/dataset/ 2>/dev/null || echo "Directory not found"

echo
echo "=== Looking for CI/workflow files ==="
find . -type f \( -name "*.yml" -o -name "*.yaml" \) | grep -E '(workflow|github|ci)' | head -10

Repository: NVIDIA-NeMo/Skills

Length of output: 15635


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Checking external-benchmarks.md for MT-Bench/KoMT-Bench ==="
cat ./docs/evaluation/external-benchmarks.md | grep -i "mt.bench\|komt" || echo "No MT-Bench/KoMT-Bench found in external-benchmarks.md"

echo
echo "=== Listing all SLURM test directories ==="
ls -1 ./tests/slurm-tests/

echo
echo "=== Checking if mt-bench SLURM test exists ==="
ls -la ./tests/slurm-tests/mt-bench 2>/dev/null || echo "MT-Bench SLURM test directory NOT found"

echo
echo "=== Checking if komt-bench SLURM test exists ==="
ls -la ./tests/slurm-tests/komt-bench 2>/dev/null || echo "KoMT-Bench SLURM test directory NOT found"

echo
echo "=== Checking CI test configuration files ==="
find ./tests -name "*.py" -type f | head -5 | xargs grep -l "mt.bench\|komt" 2>/dev/null || echo "No MT-Bench/KoMT-Bench references in test files"

echo
echo "=== Checking GPU tests workflow ==="
grep -n "mt.bench\|komt" ./.github/workflows/gpu_tests.yml || echo "No MT-Bench/KoMT-Bench in GPU tests"

echo
echo "=== Checking mt-bench __init__.py file ==="
head -30 nemo_skills/dataset/mt-bench/__init__.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1929


Add benchmark documentation and SLURM test coverage for MT-Bench.

While the config wiring is correct, the benchmark setup is incomplete:

  1. Missing documentation: MT-Bench is not documented in docs/evaluation/external-benchmarks.md. Add example run commands and expected results for tested models.

  2. Missing SLURM test coverage: No SLURM test directory exists for MT-Bench (all other benchmarks have corresponding tests in tests/slurm-tests/). Add comprehensive test coverage following the pattern of existing benchmarks.

  3. Missing CI integration: MT-Bench is not referenced in .github/workflows/gpu_tests.yml or other test configurations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mt-bench/__init__.py` around lines 17 - 28, Add MT-Bench
documentation and CI/test coverage: document the new METRICS_TYPE="mt_bench" and
the generation/judge wiring
(GENERATION_MODULE="nemo_skills.inference.eval.mt_bench", JUDGE_PIPELINE_ARGS
and JUDGE_ARGS) in docs/evaluation/external-benchmarks.md with example run
commands (how to invoke the generation and judge pipelines and expected metric
outputs) and expected results for tested models; create a
tests/slurm-tests/mt-bench/ directory mirroring other benchmark test folders
(include SLURM job scripts, minimal input configs and assertions that run the
generation and judge pipelines and validate metrics); and add mt-bench to the CI
by updating .github/workflows/gpu_tests.yml (and any test matrix) so the new
SLURM test directory is executed in CI. Ensure filenames and commands reference
the above symbols (METRICS_TYPE, GENERATION_MODULE, JUDGE_PIPELINE_ARGS,
JUDGE_ARGS) so the tests exercise the actual code paths.

Comment thread nemo_skills/dataset/mt-bench/prepare.py
Comment thread nemo_skills/dataset/mt-bench/prepare.py Outdated
Comment on lines +47 to +48
with open(questions_file, "rt", encoding="utf-8") as fin, \
open(output_file, "wt", encoding="utf-8") as fout:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Avoid partial test.jsonl writes if a later row fails.

fout is opened before transformation finishes, so a mid-run failure can leave a truncated dataset file.

Proposed fix
-    with open(questions_file, "rt", encoding="utf-8") as fin, \
-         open(output_file, "wt", encoding="utf-8") as fout:
+    entries = []
+    with open(questions_file, "rt", encoding="utf-8") as fin:
         for line in fin:
             data = json.loads(line)
             qid = data["question_id"]
@@
-            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
+            entries.append(entry)
+
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for entry in entries:
+            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mt-bench/prepare.py` around lines 47 - 48, Read and
transform the entire input (questions_file) into memory or a temporary buffer
first, and only after all rows are successfully processed open output_file for
writing (or write to a temporary file and atomically replace output_file with
os.replace) to avoid leaving a partially written test.jsonl on failure; adjust
the existing with-open that currently opens fout early so that only
reading/parsing happens before opening/writing, referencing the questions_file
and output_file variables used in the current block.

Comment on lines +59 to +84
pred = predictions[0]
category = pred.get("category", "unknown")
question_id = pred.get("question_id")

# Support multi-judge averaging
all_j1 = pred.get("all_judgements_turn1")
all_j2 = pred.get("all_judgements_turn2")

if all_j1 and all_j2:
scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]
scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]
score_turn1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None
score_turn2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None
else:
score_turn1 = self._parse_rating(pred.get("judgement_turn1", ""))
score_turn2 = self._parse_rating(pred.get("judgement_turn2", ""))

# Apply language penalty
skip_penalty = question_id in CODE_OUTPUT_QUESTION_IDS
penalized_turn1 = score_turn1
penalized_turn2 = score_turn2

if not skip_penalty:
answer_1 = pred.get("answer_1", "")
answer_2 = pred.get("answer_2", "")
if score_turn1 is not None and not _detect_korean(answer_1):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fail fast on missing benchmark schema fields.

This path treats category, question_id, answer_*, and the fallback judgement_turn* fields as optional, but they drive penalty exemption and score parsing. Defaulting them here can silently mis-score malformed records, especially by applying the Korean penalty to code-output questions when question_id is missing.

Suggested fix
-        category = pred.get("category", "unknown")
-        question_id = pred.get("question_id")
+        category = pred["category"]
+        question_id = pred["question_id"]
...
-            score_turn1 = self._parse_rating(pred.get("judgement_turn1", ""))
-            score_turn2 = self._parse_rating(pred.get("judgement_turn2", ""))
+            score_turn1 = self._parse_rating(pred["judgement_turn1"])
+            score_turn2 = self._parse_rating(pred["judgement_turn2"])
...
-            answer_1 = pred.get("answer_1", "")
-            answer_2 = pred.get("answer_2", "")
+            answer_1 = pred["answer_1"]
+            answer_2 = pred["answer_2"]

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/komt_bench_metrics.py` around lines 59 - 84,
This code silently defaults critical fields using dict.get (category,
question_id, answer_1, answer_2, judgement_turn1, judgement_turn2) which can
mis-score records; change to fail-fast by accessing those keys directly (e.g.,
pred["question_id"], pred["answer_1"], etc.) or validate presence at the start
of processing and raise a clear exception if missing, then keep using
_parse_rating, CODE_OUTPUT_QUESTION_IDS and _detect_korean as before; ensure the
multi-judge branches also require and validate
all_judgements_turn1/all_judgements_turn2 exist before computing averages so
malformed inputs raise an error instead of using silent defaults.

Comment thread nemo_skills/evaluation/metrics/mt_bench_metrics.py Outdated
Comment thread nemo_skills/evaluation/metrics/mt_bench_metrics.py Outdated
Comment thread nemo_skills/inference/eval/mt_bench_judge.py Outdated
Comment on lines +121 to +126
except asyncio.TimeoutError:
LOG.warning(
"Judge call timed out (turn=%d, timeout=%ds, category=%s)",
turn, JUDGE_TIMEOUT, data_point.get("category", ""),
)
return ""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't turn judge timeouts into dropped scores.

Returning "" here makes the metrics layer treat the timed-out judge as “missing” and exclude it from the average, so runs with partial judge failure can still report normal-looking scores. This should surface as an explicit failure or be counted separately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/mt_bench_judge.py` around lines 121 - 126, The
except asyncio.TimeoutError handler currently returns an empty string which
makes timeouts appear as missing scores; instead surface the failure by raising
a specific exception (e.g., raise RuntimeError or a custom JudgeTimeoutError) so
the caller/metrics layer can count it separately; keep the LOG.warning call
(using LOG.warning, JUDGE_TIMEOUT, and data_point) but replace the final "return
\"\"" with raising the exception containing context (turn, JUDGE_TIMEOUT,
category) so timeouts are explicit failures rather than dropped scores.

@bzantium
Copy link
Copy Markdown
Contributor Author

bzantium commented Apr 17, 2026

Evaluation Results

Tested with two models (8 judges per turn, GPT-4o-2024-08-06 as judge):

MT-Bench (English)

Category kanana-2-30b-a3b-instruct-2601 Qwen3-30B-A3B-Instruct-2507
Overall 8.30 8.88
Turn 1 8.67 9.22
Turn 2 7.93 8.54
Coding 8.55 9.34
Extraction 7.59 8.58
Humanities 9.04 9.46
Math 8.28 8.42
Reasoning 6.58 7.61
Roleplay 8.73 9.03
STEM 9.04 9.51
Writing 8.58 9.09

Komt-Bench (Korean, with language penalty)

Category kanana-2-30b-a3b-instruct-2601 Qwen3-30B-A3B-Instruct-2507
Overall (penalized) 8.18 8.69
Overall (raw) 8.25 8.78
Turn 1 (penalized) 8.37 8.99
Turn 2 (penalized) 7.99 8.40
Coding 8.41 8.77
Extraction 7.33 7.65
Humanities 8.98 8.73
Math 9.09 9.98
Reasoning 6.89 8.71
Roleplay 8.19 8.09
STEM 8.70 9.20
Writing 7.86 8.44

Both benchmarks run end-to-end: prepare → two-turn generation → multi-judge scoring → metrics aggregation.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
nemo_skills/evaluation/metrics/mt_bench_metrics.py (2)

58-59: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize BaseMetrics here.

MTBenchMetrics still bypasses BaseMetrics.__init__, so compute_no_answer is never set and the new mt_bench registry entry can fail as soon as get_metrics(..., **kwargs) forwards constructor args. Call super().__init__(compute_no_answer=compute_no_answer) instead.

Suggested fix
 class MTBenchMetrics(BaseMetrics):
-    def __init__(self):
-        self.reset()
+    def __init__(self, compute_no_answer: bool = True):
+        super().__init__(compute_no_answer=compute_no_answer)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py` around lines 58 - 59,
MTBenchMetrics.__init__ currently calls self.reset() directly and bypasses
BaseMetrics.__init__, so compute_no_answer is never initialized; modify
MTBenchMetrics.__init__ to call
super().__init__(compute_no_answer=compute_no_answer) (passing the
compute_no_answer parameter) before any reset/use of instance state so
BaseMetrics sets up compute_no_answer and related defaults used later by
get_metrics and mt_bench registry.

89-111: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on malformed judge records.

The .get(..., "") / "unknown" fallbacks hide broken inputs and can silently drop scores or collapse categories into a fake bucket. Use direct indexing for required judge fields and validate the multi-judge payload explicitly instead of falling back to the single-judge path when a list is missing or empty. As per coding guidelines, don't use .get() for dictionary keys when the code expects them to be present.

Suggested fix
-        all_j1 = pred.get("all_judgements_turn1")
-        all_j2 = pred.get("all_judgements_turn2")
-        if all_j1 and all_j2:
+        all_j1 = pred["all_judgements_turn1"]
+        all_j2 = pred["all_judgements_turn2"]
+        if all_j1 and all_j2:
             scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]
             scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]
             score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None
             score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None
         else:
-            score_t1 = self._parse_rating(pred.get("judgement_turn1", ""))
-            score_t2 = self._parse_rating(pred.get("judgement_turn2", ""))
+            score_t1 = self._parse_rating(pred["judgement_turn1"])
+            score_t2 = self._parse_rating(pred["judgement_turn2"])
@@
-            answer_1 = pred.get("answer_1", "")
-            answer_2 = pred.get("answer_2", "")
+            answer_1 = pred["answer_1"]
+            answer_2 = pred["answer_2"]
@@
-            "category": pred.get("category", "unknown"),
+            "category": pred["category"],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py` around lines 89 - 111,
The code currently uses .get fallbacks for required judge fields and silently
falls back to single-judge handling; change this to fail fast by validating
presence and shape of multi-judge payloads: in the block using all_j1/all_j2
replace pred.get(...) with direct indexing (pred["all_judgements_turn1"],
pred["all_judgements_turn2"]) and raise a descriptive exception if they are
missing or not non-empty lists; remove the .get(..., "") fallbacks for
"judgement_turn1"/"judgement_turn2" and require those keys when single-judge
mode is expected (use pred["judgement_turn1"] / pred["judgement_turn2"] after
validating the payload type), and keep the existing usage of _parse_rating,
_validate_target_language, answer_1/answer_2 and language penalty logic
unchanged—this ensures malformed records are validated up front instead of being
hidden by defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py`:
- Around line 58-59: MTBenchMetrics.__init__ currently calls self.reset()
directly and bypasses BaseMetrics.__init__, so compute_no_answer is never
initialized; modify MTBenchMetrics.__init__ to call
super().__init__(compute_no_answer=compute_no_answer) (passing the
compute_no_answer parameter) before any reset/use of instance state so
BaseMetrics sets up compute_no_answer and related defaults used later by
get_metrics and mt_bench registry.
- Around line 89-111: The code currently uses .get fallbacks for required judge
fields and silently falls back to single-judge handling; change this to fail
fast by validating presence and shape of multi-judge payloads: in the block
using all_j1/all_j2 replace pred.get(...) with direct indexing
(pred["all_judgements_turn1"], pred["all_judgements_turn2"]) and raise a
descriptive exception if they are missing or not non-empty lists; remove the
.get(..., "") fallbacks for "judgement_turn1"/"judgement_turn2" and require
those keys when single-judge mode is expected (use pred["judgement_turn1"] /
pred["judgement_turn2"] after validating the payload type), and keep the
existing usage of _parse_rating, _validate_target_language, answer_1/answer_2
and language penalty logic unchanged—this ensures malformed records are
validated up front instead of being hidden by defaults.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a909f557-c1b1-41e3-9f83-a5ae75899249

📥 Commits

Reviewing files that changed from the base of the PR and between 70e3c8d and 2720d78.

📒 Files selected for processing (5)
  • nemo_skills/dataset/komt-bench/__init__.py
  • nemo_skills/dataset/komt-bench/prepare.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/mt_bench_metrics.py
  • tests/test_mt_bench_metrics.py
✅ Files skipped from review due to trivial changes (1)
  • nemo_skills/dataset/komt-bench/prepare.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/komt-bench/init.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
nemo_skills/dataset/mt-bench/prepare.py (2)

47-48: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Buffer the transformed rows before opening test.jsonl for writing.

If a later row fails validation, this leaves a partially written dataset behind.

Suggested fix
-    with open(questions_file, "rt", encoding="utf-8") as fin, \
-         open(output_file, "wt", encoding="utf-8") as fout:
+    entries = []
+    with open(questions_file, "rt", encoding="utf-8") as fin:
         for line in fin:
             data = json.loads(line)
             qid = data["question_id"]
             turns = data["turns"]
             if len(turns) != 2:
                 raise ValueError(f"MT-Bench question {qid} expected 2 turns, got {len(turns)}")
             ref = ref_answers.get(qid)
             entry = {
                 "question_id": qid,
                 "category": data["category"],
                 "question_1": turns[0],
                 "question_2": turns[1],
                 "ref_answer_1": ref["turn1"] if ref is not None else None,
                 "ref_answer_2": ref["turn2"] if ref is not None else None,
             }
-            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
+            entries.append(entry)
+
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for entry in entries:
+            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")

As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mt-bench/prepare.py` around lines 47 - 48, The code
currently opens the output file (output_file / fout) while streaming
transformations from questions_file, which can leave a partially written
test.jsonl if validation later fails; instead, read and parse questions_file
first, buffer all transformed/validated rows in memory (e.g., build a list of
transformed entries inside the parsing loop in prepare.py), perform all
validation and transformations on that buffer, and only after successful
completion open output_file for writing and iterate the buffered list to write
lines; reference the existing variables/questions_file, output_file, fin/fout
and the transformation/validation logic so you only move the file open for fout
to after buffering and validation.

44-45: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when a reference-scored category is missing reference turns.

mt_bench_judge.py unconditionally uses the _with_ref prompts for math/reasoning/coding, so emitting None here silently degrades the judge input for exactly the categories that need references. Keep refs optional for other categories, but require both turns for the reference-scored ones.

Suggested fix
+NEED_REF_CATS = {"math", "reasoning", "coding"}
+
 ...
-            ref_answers[qid] = {"turn1": turns[0], "turn2": turns[1] if len(turns) > 1 else None}
+            ref_answers[qid] = {"turn1": turns[0], "turn2": turns[1] if len(turns) > 1 else None}
 
     with open(questions_file, "rt", encoding="utf-8") as fin, \
          open(output_file, "wt", encoding="utf-8") as fout:
         for line in fin:
             data = json.loads(line)
             qid = data["question_id"]
             turns = data["turns"]
             if len(turns) != 2:
                 raise ValueError(f"MT-Bench question {qid} expected 2 turns, got {len(turns)}")
-            # Reference answers are populated upstream only for math/reasoning/coding;
-            # other categories legitimately have none.
-            ref = ref_answers.get(qid, {})
+            ref = ref_answers.get(qid)
+            if data["category"] in NEED_REF_CATS:
+                if ref is None or ref["turn1"] is None or ref["turn2"] is None:
+                    raise ValueError(f"MT-Bench question {qid} is missing required reference answers")
             entry = {
                 "question_id": qid,
                 "category": data["category"],
                 "question_1": turns[0],
                 "question_2": turns[1],
-                "ref_answer_1": ref.get("turn1"),
-                "ref_answer_2": ref.get("turn2"),
+                "ref_answer_1": ref["turn1"] if ref is not None else None,
+                "ref_answer_2": ref["turn2"] if ref is not None else None,
             }

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 57-64

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mt-bench/prepare.py` around lines 44 - 45, When building
ref_answers from data (the assignment using data["choices"][0]["turns"] and
ref_answers[qid] = {"turn1": turns[0], "turn2": turns[1] if len(turns) > 1 else
None}), enforce that both reference turns exist for categories that require
references (math, reasoning, coding) and raise an explicit error if they are
missing instead of writing None; keep optional behavior for other categories.
Locate the ref extraction logic (the turns = data["choices"][0]["turns"] block
and the later similar block at lines ~57-64) and change it to directly access
required keys/indices (avoid .get()) and validate len(turns) >= 2 for the
reference-scored categories, throwing a clear exception with qid and category
when the check fails so the pipeline fails fast.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/inference/eval/mt_bench_judge.py`:
- Around line 139-141: The code currently treats num_judges <= 1 as the default
path which silently accepts 0 or negative values; update the validation around
num_judges (the variable assigned from self.cfg.num_judges in mt_bench_judge.py)
to explicitly reject invalid values: if num_judges < 1 raise a ValueError (with
a helpful message referencing num_judges), and keep num_judges == 1 as a valid
case if that is intended; ensure any later logic that assumed the fallback path
still works when validation raises.

---

Duplicate comments:
In `@nemo_skills/dataset/mt-bench/prepare.py`:
- Around line 47-48: The code currently opens the output file (output_file /
fout) while streaming transformations from questions_file, which can leave a
partially written test.jsonl if validation later fails; instead, read and parse
questions_file first, buffer all transformed/validated rows in memory (e.g.,
build a list of transformed entries inside the parsing loop in prepare.py),
perform all validation and transformations on that buffer, and only after
successful completion open output_file for writing and iterate the buffered list
to write lines; reference the existing variables/questions_file, output_file,
fin/fout and the transformation/validation logic so you only move the file open
for fout to after buffering and validation.
- Around line 44-45: When building ref_answers from data (the assignment using
data["choices"][0]["turns"] and ref_answers[qid] = {"turn1": turns[0], "turn2":
turns[1] if len(turns) > 1 else None}), enforce that both reference turns exist
for categories that require references (math, reasoning, coding) and raise an
explicit error if they are missing instead of writing None; keep optional
behavior for other categories. Locate the ref extraction logic (the turns =
data["choices"][0]["turns"] block and the later similar block at lines ~57-64)
and change it to directly access required keys/indices (avoid .get()) and
validate len(turns) >= 2 for the reference-scored categories, throwing a clear
exception with qid and category when the check fails so the pipeline fails fast.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 52a68afb-43d6-407e-a4f0-cda71ae3e776

📥 Commits

Reviewing files that changed from the base of the PR and between 2720d78 and 214a4ea.

📒 Files selected for processing (5)
  • nemo_skills/dataset/komt-bench/prepare.py
  • nemo_skills/dataset/mt-bench/prepare.py
  • nemo_skills/evaluation/metrics/mt_bench_metrics.py
  • nemo_skills/inference/eval/mt_bench_judge.py
  • tests/test_mt_bench_metrics.py
✅ Files skipped from review due to trivial changes (1)
  • nemo_skills/dataset/komt-bench/prepare.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/evaluation/metrics/mt_bench_metrics.py

Comment on lines +139 to +141
num_judges = self.cfg.num_judges

if num_judges <= 1:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject invalid num_judges values instead of silently falling back to one judge.

num_judges <= 1 makes 0 and negative values behave like the default path, so a bad Hydra/CLI value still produces scores instead of failing fast.

Suggested fix
         num_judges = self.cfg.num_judges
+        if num_judges < 1:
+            raise ValueError(f"num_judges must be >= 1, got {num_judges}")
 
-        if num_judges <= 1:
+        if num_judges == 1:
             judgement_turn1, judgement_turn2 = await asyncio.gather(
                 self._safe_judge(data_point, all_data, turn=1),
                 self._safe_judge(data_point, all_data, turn=2),

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/mt_bench_judge.py` around lines 139 - 141, The
code currently treats num_judges <= 1 as the default path which silently accepts
0 or negative values; update the validation around num_judges (the variable
assigned from self.cfg.num_judges in mt_bench_judge.py) to explicitly reject
invalid values: if num_judges < 1 raise a ValueError (with a helpful message
referencing num_judges), and keep num_judges == 1 as a valid case if that is
intended; ensure any later logic that assumed the fallback path still works when
validation raises.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py`:
- Around line 91-100: The current conditional silently falls back to
single-judge fields when only one of all_judgements_turn1 / all_judgements_turn2
is present; update the logic in the block that reads all_j1 =
pred.get("all_judgements_turn1") and all_j2 = pred.get("all_judgements_turn2")
to detect the case where exactly one is present and fail loudly (e.g., raise a
ValueError or similar) with a clear message referencing the record id/context,
otherwise proceed to compute average scores using self._parse_rating for each
list; keep the existing single-judge fallback (parsing pred["judgement_turn1"] /
pred["judgement_turn2"]) only when neither all_j1 nor all_j2 is present.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e29d5f99-f0ca-4ed9-aa24-1f430c18d00a

📥 Commits

Reviewing files that changed from the base of the PR and between 214a4ea and 845f550.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/metrics/mt_bench_metrics.py
  • tests/test_mt_bench_metrics.py

Comment on lines +91 to +100
all_j1 = pred.get("all_judgements_turn1")
all_j2 = pred.get("all_judgements_turn2")
if all_j1 and all_j2:
scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]
scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]
score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None
score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None
else:
score_t1 = self._parse_rating(pred["judgement_turn1"])
score_t2 = self._parse_rating(pred["judgement_turn2"])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail loudly when only one multi-judge field is present.

if all_j1 and all_j2 silently falls back to the single-judge path when a record contains just one of all_judgements_turn1 / all_judgements_turn2. That can mis-score partially corrupted judge output instead of surfacing the data issue.

Suggested fix
-        all_j1 = pred.get("all_judgements_turn1")
-        all_j2 = pred.get("all_judgements_turn2")
-        if all_j1 and all_j2:
+        has_all_judgements = "all_judgements_turn1" in pred or "all_judgements_turn2" in pred
+        if has_all_judgements:
+            all_j1 = pred["all_judgements_turn1"]
+            all_j2 = pred["all_judgements_turn2"]
+            if not all_j1 or not all_j2:
+                raise ValueError(
+                    "multi-judge records must include both all_judgements_turn1 and all_judgements_turn2"
+                )
             scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]
             scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]
             score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None
             score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None
         else:

As per coding guidelines, required judge fields should fail loudly instead of being silently downgraded.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
all_j1 = pred.get("all_judgements_turn1")
all_j2 = pred.get("all_judgements_turn2")
if all_j1 and all_j2:
scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]
scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]
score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None
score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None
else:
score_t1 = self._parse_rating(pred["judgement_turn1"])
score_t2 = self._parse_rating(pred["judgement_turn2"])
has_all_judgements = "all_judgements_turn1" in pred or "all_judgements_turn2" in pred
if has_all_judgements:
all_j1 = pred["all_judgements_turn1"]
all_j2 = pred["all_judgements_turn2"]
if not all_j1 or not all_j2:
raise ValueError(
"multi-judge records must include both all_judgements_turn1 and all_judgements_turn2"
)
scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]
scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]
score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None
score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None
else:
score_t1 = self._parse_rating(pred["judgement_turn1"])
score_t2 = self._parse_rating(pred["judgement_turn2"])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py` around lines 91 - 100,
The current conditional silently falls back to single-judge fields when only one
of all_judgements_turn1 / all_judgements_turn2 is present; update the logic in
the block that reads all_j1 = pred.get("all_judgements_turn1") and all_j2 =
pred.get("all_judgements_turn2") to detect the case where exactly one is present
and fail loudly (e.g., raise a ValueError or similar) with a clear message
referencing the record id/context, otherwise proceed to compute average scores
using self._parse_rating for each list; keep the existing single-judge fallback
(parsing pred["judgement_turn1"] / pred["judgement_turn2"]) only when neither
all_j1 nor all_j2 is present.

Copy link
Copy Markdown
Collaborator

@Kipok Kipok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bzantium is this benchmark widely used? We had English mtbench before but removed it because our researchers found it low-quality, noisy and not very relevant to modern models. So, wondering if you find this benchmark valuable on your side or not

@bzantium
Copy link
Copy Markdown
Contributor Author

bzantium commented May 12, 2026

@Kipok Fair point on the quality concerns. We don't use MT-Bench or KoMT-Bench for frontier model comparisons either; the headroom is gone up there and we reach for other benchmarks. Where we still find them useful is the sub-30B range, where the score distribution is wide enough that the noise floor doesn't drown the signal.

Agreed that it's not suitable for tight model-vs-model rankings. We treat it more as a coarse threshold: "does this checkpoint clear the ~8.0 line or fall below it?" rather than as a precise score. KoMT-Bench specifically is still one of the few open Korean multi-turn dialog benchmarks with a reference answer set, so even with the limitations it's worth keeping in the suite for smaller models.

@bzantium bzantium force-pushed the feat/mt-bench-komt-bench branch 2 times, most recently from 0f91275 to e44da44 Compare May 12, 2026 06:40
Add MT-Bench and its Korean adaptation KoMT-Bench, both judge-based
multi-turn dialog benchmarks (80 prompts × 2 turns, 8 categories).

Datasets:
- mt-bench: LMSYS FastChat questions + GPT-4 references, pulled from the
  upstream GitHub raw URLs.
- komt-bench: LGAI-EXAONE/KoMT-Bench from the project's GitHub repo (the
  HuggingFace parquet only ships 39/80 references; the GitHub repo carries
  the full 80/80, matching upstream evaluations).

Evaluator:
- nemo_skills/inference/eval/mt_bench.py wraps the standard generation task
  for two-turn dialog rollouts; mt_bench_judge.py is a multi-judge variant.
- MTBenchMetrics aggregates 1-10 judge scores and applies an optional sqrt
  language penalty when target_language is set on the entry. KoMT-Bench sets
  target_language: ko on all but qids 138 and 140 (JSON / CSV format
  outputs, exempted per upstream detector.py).
- Language detection uses MediaPipe's language_detector.tflite (~315KB,
  lazy fetch to ~/.cache/nemo_skills/ or $NEMO_SKILLS_CACHE) for more
  reliable Korean detection than langdetect's heuristic.

Korean judge prompts in nemo_skills/prompt/config/judge/komt-bench/ mirror
the upstream FastChat turn1 / turn2 templates with Korean instructions.

Dockerfile picks up libgl1-mesa-glx / libgles2-mesa / libegl1-mesa for the
mediapipe runtime.

Signed-off-by: bzantium <ryumin93@gmail.com>
@bzantium bzantium force-pushed the feat/mt-bench-komt-bench branch from e44da44 to 203923e Compare May 12, 2026 06:48
@shuoyangd
Copy link
Copy Markdown
Collaborator

@eagle705 Any thoughts here?

Signed-off-by: Minho Ryu <ryumin93@gmail.com>
@bzantium
Copy link
Copy Markdown
Contributor Author

bzantium commented May 15, 2026

Following up with concrete numbers in the sub-30B range. Ran the suite on nine open models from 0.6B to 4B (Qwen3, Qwen3.5, gemma-3, gemma-4), each in thinking and non-thinking mode where applicable, for 17 (model, mode) pairs. Generation on a single H100, judge gpt-4o-2024-08-06.

Model Mode mt-bench komt-bench
Qwen3-0.6B think 5.81 4.12
Qwen3-0.6B nothink 4.99 3.45
Qwen3.5-0.8B think 3.23 2.74
Qwen3.5-0.8B nothink 4.42 3.11
gemma-3-1b-it n/a 6.45 4.71
Qwen3-1.7B think 7.54 5.91
Qwen3-1.7B nothink 7.01 5.49
Qwen3.5-2B think 6.26 5.60
Qwen3.5-2B nothink 6.33 4.81
gemma-4-E2B-it think 8.60 8.61
gemma-4-E2B-it nothink 8.50 7.93
Qwen3-4B think 8.50 7.58
Qwen3-4B nothink 8.08 7.00
Qwen3.5-4B think 8.10 7.88
Qwen3.5-4B nothink 7.84 6.96
gemma-4-E4B-it think 8.81 8.89
gemma-4-E4B-it nothink 8.73 8.53

A few takeaways supporting the points above:

  • Distribution is wide. mt-bench spans 3.23 to 8.81 (5.6 pt spread); komt-bench spans 2.74 to 8.89 (6.15 pt). At 1B-scale and below, the table cleanly stratifies models by capability without saturating.
  • The 8.0 threshold is informative. mt-bench ≥ 8.0 picks 7 of 17 runs; komt-bench ≥ 8.0 picks only 3, all from the gemma-4 family. As a binary "usable for casual (Korean) dialog?" gate this is the kind of signal we actually act on; the precise rank between 8.50 and 8.60 is not.
  • Korean signal that MCQ benchmarks compress out. Among the four 4B-class models, three clear mt-bench ≥ 8.0 but only one (gemma-4-E4B-it) clears komt-bench ≥ 8.0. Qwen3-4B drops 0.92 points from English to Korean; gemma-4-E4B-it actually edges higher on Korean. MMMLU and KMMLU-style benchmarks tend to collapse this gap into a couple of accuracy points.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add MT-Bench and Komt-Bench (Korean MT-Bench) benchmarks

3 participants