Add MT-Bench and Komt-Bench (Korean MT-Bench) benchmarks by bzantium · Pull Request #1378 · NVIDIA-NeMo/Skills

bzantium · 2026-04-17T02:22:45Z

Summary

Add MT-Bench (80 two-turn English questions) and Komt-Bench (80 two-turn Korean questions) with LLM-as-judge evaluation.

Architecture

Both benchmarks share the same generation + judge modules:

mt_bench.py — Two-turn generation: answer_1 for question_1, then answer_2 conditioned on full conversation
mt_bench_judge.py — Per-turn 1-10 scoring with multi-judge averaging (default 8 calls), reference-aware prompts for math/reasoning/coding categories

Metrics

MTBenchMetrics — Parses [[rating]] from judge output; per-turn and per-category averages
KomtBenchMetrics — Extends MTBenchMetrics with sqrt(score) penalty for non-Korean responses (code-output questions exempted)

Data

Benchmark	Source	Format
mt-bench	FastChat repo questions + GPT-4 ref answers	`prepare.py` downloads
komt-bench	LGAI-EXAONE/KoMT-Bench	`prepare.py` downloads

MT-Bench reuses existing English judge prompts at judge/mt-bench/. Komt-Bench adds Korean judge prompts at judge/komt-bench/ (4 turn/ref variants).

Changes

New files (12)

nemo_skills/dataset/{mt-bench,komt-bench}/ — __init__.py + prepare.py each
nemo_skills/inference/eval/mt_bench.py, mt_bench_judge.py
nemo_skills/evaluation/metrics/mt_bench_metrics.py, komt_bench_metrics.py
nemo_skills/prompt/config/judge/komt-bench/ — 4 YAMLs

Modified files (1)

nemo_skills/evaluation/metrics/map_metrics.py — register mt_bench + komt_bench (+4 lines)

Test plan

ns eval --benchmarks mt-bench --model <model> — verify two-turn generation + judge scoring + metrics
ns eval --benchmarks komt-bench --model <model> — verify Korean prompts + language penalty
Existing benchmarks unaffected (metrics map additions only)

Summary by CodeRabbit

New Features
- MT-Bench evaluation: two-turn generation, judging, and Hydra-configured CLI entrypoints.
- KoMT-Bench (Korean MT-Bench) dataset support with dataset preparation utilities.
- New Korean judge prompt templates for single- and multi-turn evaluation (with/without references).
- Multi-judge execution with aggregated and per-turn outputs; language-aware scoring that applies a penalty when answers mismatch a target language.
Tests
- Comprehensive tests validating MT-Bench metrics, averaging, penalty logic, multi-judge behavior, and error handling.

coderabbitai · 2026-04-17T02:29:41Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds MT-Bench and Komt-Bench: dataset initializers and prep scripts, a two‑turn generation task, an LLM-as-judge task with multi‑judge support, MTBench metrics (parsing judge ratings and optional language penalty), Korean judge prompt templates, and unit tests for metrics.

Changes

Cohort / File(s)	Summary
Dataset Config & Exports `nemo_skills/dataset/mt-bench/__init__.py`, `nemo_skills/dataset/komt-bench/__init__.py`	New package initializers exporting default benchmark settings: `METRICS_TYPE`, `GENERATION_MODULE`, `GENERATION_ARGS`, `JUDGE_PIPELINE_ARGS`, and `JUDGE_ARGS`.
Dataset Preparation Scripts `nemo_skills/dataset/mt-bench/prepare.py`, `nemo_skills/dataset/komt-bench/prepare.py`	New CLI scripts to download/normalize datasets into `test.jsonl`, parse stringified fields, enforce exactly two turns, include up to two references, and set `target_language` for Komt-Bench.
Two-turn Generation `nemo_skills/inference/eval/mt_bench.py`	Adds Hydra-configured `MTBenchGenerationTask` and config dataclasses: generates turn‑1, conditions turn‑2 on [Q1,A1,Q2], optionally parses reasoning, returns combined results and token counts; exposes `generate(cfg)` entrypoint.
LLM-as-Judge Task `nemo_skills/inference/eval/mt_bench_judge.py`	Adds `MTBenchJudgeTask` and config: selects per-turn/category prompt templates (with/without refs), runs N parallel judgements with 180s timeout, returns primary and aggregated judgements and supports multi-judge outputs; Hydra entrypoint provided.
Metrics Implementation & Registry `nemo_skills/evaluation/metrics/mt_bench_metrics.py`, `nemo_skills/evaluation/metrics/map_metrics.py`	Adds `MTBenchMetrics` (parses `[[rating]]`, averages per-turn/category, supports multi-judge lists, optional ISO‑639-1 `target_language` validation and sqrt penalty on mismatch) and registers `"mt_bench"` in metrics map.
Korean Judge Prompts `nemo_skills/prompt/config/judge/komt-bench/turn1.yaml`, `.../turn1_with_ref.yaml`, `.../turn2.yaml`, `.../turn2_with_ref.yaml`	New Korean prompt templates for turn1/turn2 with and without reference answers; require final `[[rating]]` token in output.
Tests `tests/test_mt_bench_metrics.py`	New comprehensive unit tests for `MTBenchMetrics`: parsing, aggregation, multi-judge averaging, invalid-rating handling, language-penalty behavior, ISO code validation, and missing-field error cases.

Sequence Diagram(s)

sequenceDiagram
    participant Dataset as Dataset\n(prepared test.jsonl)
    participant Gen as Generation Task\n(MTBenchGenerationTask)
    participant Judge as Judge Task\n(MTBenchJudgeTask)
    participant Metrics as Metrics\n(MTBenchMetrics)

    Dataset->>Gen: send {question_1, question_2, refs?, category, target_language?}
    Gen->>Gen: generate answer_1 from question_1
    Gen->>Gen: generate answer_2 from [Q1,A1,Q2]
    Gen->>Judge: submit Qs, As, refs, category for judging
    par per turn
        Judge->>Judge: select prompt (with/without ref)
        Judge->>Judge: run N parallel judge calls (timeout 180s)
        Judge->>Judge: extract [[rating]] from outputs
    end
    Judge->>Metrics: send judgements, all_judgements, num_judges, answers, target_language
    Metrics->>Metrics: parse ratings, apply sqrt penalty if language mismatch, aggregate per-turn/category/overall
    Metrics-->>Dataset: emit pass@1 and aggregated metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 23.81% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely summarizes the main objective of the PR: adding two benchmarks (MT-Bench and Komt-Bench) with clear reference to the Korean variant.
Linked Issues check	✅ Passed	The PR implementation comprehensively addresses all linked issue objectives: two-turn generation module [mt_bench.py], judge module [mt_bench_judge.py], MTBenchMetrics and KomtBenchMetrics implementations, Korean judge prompts [judge/komt-bench/], dataset prepare scripts, and metric registration in map_metrics.py.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with the PR objectives; files added focus exclusively on MT-Bench/Komt-Bench implementation including generation, judging, metrics, prompts, data preparation, and test coverage with no unrelated modifications.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/dataset/komt-bench/prepare.py`:
- Around line 29-41: The current loop writes to output_file while transforming
rows which can leave a partial test.jsonl on failure; instead, first iterate ds
and build a list of entries (using the same logic that constructs entry from
turns, reference, question_id, category, question_1/2, ref_answer_1/2), then
open output_file once and write all json.dumps(entry) lines from that in-memory
list; update prepare.py to compute entries from ds first (e.g., collect into a
list named entries) and only then open output_file for writing and flush/write
entries to avoid partial files.
- Around line 31-39: Validate that parsed 'turns' and 'reference' both have
exactly two elements before building 'entry' (use the existing variables 'turns'
and 'reference'); if not, raise a ValueError (include the row["question_id"] and
row content for context) instead of falling back to empty string/None, then
proceed to construct 'entry' with turns[0], turns[1], reference[0],
reference[1]; replace the current len-checks around
"question_1"/"question_2"/"ref_answer_1"/"ref_answer_2" with this strict
validation to enforce the fixed two-turn benchmark invariant.

In `@nemo_skills/dataset/mt-bench/__init__.py`:
- Around line 17-28: Add MT-Bench documentation and CI/test coverage: document
the new METRICS_TYPE="mt_bench" and the generation/judge wiring
(GENERATION_MODULE="nemo_skills.inference.eval.mt_bench", JUDGE_PIPELINE_ARGS
and JUDGE_ARGS) in docs/evaluation/external-benchmarks.md with example run
commands (how to invoke the generation and judge pipelines and expected metric
outputs) and expected results for tested models; create a
tests/slurm-tests/mt-bench/ directory mirroring other benchmark test folders
(include SLURM job scripts, minimal input configs and assertions that run the
generation and judge pipelines and validate metrics); and add mt-bench to the CI
by updating .github/workflows/gpu_tests.yml (and any test matrix) so the new
SLURM test directory is executed in CI. Ensure filenames and commands reference
the above symbols (METRICS_TYPE, GENERATION_MODULE, JUDGE_PIPELINE_ARGS,
JUDGE_ARGS) so the tests exercise the actual code paths.

In `@nemo_skills/dataset/mt-bench/prepare.py`:
- Around line 47-48: Read and transform the entire input (questions_file) into
memory or a temporary buffer first, and only after all rows are successfully
processed open output_file for writing (or write to a temporary file and
atomically replace output_file with os.replace) to avoid leaving a partially
written test.jsonl on failure; adjust the existing with-open that currently
opens fout early so that only reading/parsing happens before opening/writing,
referencing the questions_file and output_file variables used in the current
block.
- Around line 45-60: The loop that builds entry currently tolerates missing
turns and missing ref answers (using fallback ""/None/.get()) and should instead
fail fast: validate that data["turns"] has exactly two items and that
ref_answers contains qid (ref_answers[qid]) before constructing entry; if
validation fails, raise a clear exception (e.g., ValueError including qid) so
malformed records are rejected. Replace the conditional fallbacks for
"question_2", "ref_answer_1", and "ref_answer_2" with direct indexing
(data["turns"][1], ref_answers[qid]["turn1"], ref_answers[qid]["turn2"]) after
the checks to enforce strict two-turn, aligned refs.

In `@nemo_skills/evaluation/metrics/komt_bench_metrics.py`:
- Around line 59-84: This code silently defaults critical fields using dict.get
(category, question_id, answer_1, answer_2, judgement_turn1, judgement_turn2)
which can mis-score records; change to fail-fast by accessing those keys
directly (e.g., pred["question_id"], pred["answer_1"], etc.) or validate
presence at the start of processing and raise a clear exception if missing, then
keep using _parse_rating, CODE_OUTPUT_QUESTION_IDS and _detect_korean as before;
ensure the multi-judge branches also require and validate
all_judgements_turn1/all_judgements_turn2 exist before computing averages so
malformed inputs raise an error instead of using silent defaults.

In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py`:
- Around line 45-59: The code silently masks malformed judge records by using
pred.get(...) with defaults for "category" and fallback
"judgement_turn1"/"judgement_turn2"; change these to direct key access (e.g.,
pred["category"], pred["judgement_turn1"], pred["judgement_turn2"]) so missing
fields raise immediately, and validate the retrieved values before calling
self._parse_rating (raise a clear error mentioning the record/pred id if types
are wrong). Keep the existing multi-judge handling for
"all_judgements_turn1"/"all_judgements_turn2" but ensure you still validate
their presence/type when expected.
- Around line 22-24: The MTBenchMetrics __init__ currently only calls
self.reset(), bypassing BaseMetrics initialization; change
MTBenchMetrics.__init__ to call the base constructor first (super().__init__()
or BaseMetrics.__init__(self)) and then call self.reset() so any base attributes
set in BaseMetrics.__init__ are initialized while still resetting subclass state
via reset().

In `@nemo_skills/inference/eval/mt_bench_judge.py`:
- Around line 121-126: The except asyncio.TimeoutError handler currently returns
an empty string which makes timeouts appear as missing scores; instead surface
the failure by raising a specific exception (e.g., raise RuntimeError or a
custom JudgeTimeoutError) so the caller/metrics layer can count it separately;
keep the LOG.warning call (using LOG.warning, JUDGE_TIMEOUT, and data_point) but
replace the final "return \"\"" with raising the exception containing context
(turn, JUDGE_TIMEOUT, category) so timeouts are explicit failures rather than
dropped scores.
- Around line 96-107: The current branch populates judge_input with
ref_answer_1/2 using data_point.get(..., ""), which silently masks missing
reference answers; when category is in NEED_REF_CATS you must require those keys
to be present. Replace the .get(...) calls with direct indexing (e.g.,
data_point["ref_answer_1"] and data_point["ref_answer_2"]) inside the blocks
that check category in NEED_REF_CATS so a KeyError surfaces for bad records (or
add an explicit check that raises a clear error if either key is missing) and
keep the surrounding fields assigned as-is to judge_input.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 798045c1-5a4b-410a-92b5-f738273ad309

📥 Commits

Reviewing files that changed from the base of the PR and between 420d006 and 7b9eb25.

📒 Files selected for processing (13)

nemo_skills/dataset/komt-bench/__init__.py
nemo_skills/dataset/komt-bench/prepare.py
nemo_skills/dataset/mt-bench/__init__.py
nemo_skills/dataset/mt-bench/prepare.py
nemo_skills/evaluation/metrics/komt_bench_metrics.py
nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/evaluation/metrics/mt_bench_metrics.py
nemo_skills/inference/eval/mt_bench.py
nemo_skills/inference/eval/mt_bench_judge.py
nemo_skills/prompt/config/judge/komt-bench/turn1.yaml
nemo_skills/prompt/config/judge/komt-bench/turn1_with_ref.yaml
nemo_skills/prompt/config/judge/komt-bench/turn2.yaml
nemo_skills/prompt/config/judge/komt-bench/turn2_with_ref.yaml

coderabbitai · 2026-04-17T02:29:44Z

+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for row in ds:
+            turns = ast.literal_eval(row["turns"])
+            reference = ast.literal_eval(row["reference"])
+            entry = {
+                "question_id": int(row["question_id"]),
+                "category": row["category"],
+                "question_1": turns[0] if len(turns) > 0 else "",
+                "question_2": turns[1] if len(turns) > 1 else "",
+                "ref_answer_1": reference[0] if len(reference) > 0 else None,
+                "ref_answer_2": reference[1] if len(reference) > 1 else None,
+            }
+            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")


🛠️ Refactor suggestion | 🟠 Major

Compute entries first, then write test.jsonl in a separate step.

Opening test.jsonl before transformation finishes can leave a partial file if one row fails.

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/komt-bench/prepare.py` around lines 29 - 41, The current loop writes to output_file while transforming rows which can leave a partial test.jsonl on failure; instead, first iterate ds and build a list of entries (using the same logic that constructs entry from turns, reference, question_id, category, question_1/2, ref_answer_1/2), then open output_file once and write all json.dumps(entry) lines from that in-memory list; update prepare.py to compute entries from ds first (e.g., collect into a list named entries) and only then open output_file for writing and flush/write entries to avoid partial files.

coderabbitai · 2026-04-17T02:29:44Z

+METRICS_TYPE = "mt_bench"
+GENERATION_MODULE = "nemo_skills.inference.eval.mt_bench"
+GENERATION_ARGS = "++prompt_format=openai"
+
+JUDGE_PIPELINE_ARGS = {
+    "generation_module": "nemo_skills.inference.eval.mt_bench_judge",
+    "model": "gpt-4o-2024-08-06",
+    "server_type": "openai",
+    "server_address": "https://api.openai.com/v1",
+}
+
+JUDGE_ARGS = "++num_judges=8"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== Docs mentioning MT-Bench / KoMT-Bench ==" rg -n --iglob '*.md' 'mt[-_ ]bench|komt[-_ ]bench|komt_bench|mt_bench|KoMT-Bench' echo echo "== Benchmark dataset/config files ==" fd -i 'mt-bench|komt-bench' nemo_skills echo echo "== Slurm / CI test references ==" rg -n --iglob '*slurm*' --iglob '*test*' 'mt[-_ ]bench|komt[-_ ]bench|komt_bench|mt_bench'

Repository: NVIDIA-NeMo/Skills

Length of output: 105

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== Searching for MT-Bench references in repo ===" # Search more broadly without glob restrictions first rg -i 'mt.?bench' --max-count 20 echo echo "=== Looking at docs structure ===" find . -type f -name "*.md" | head -20 echo echo "=== Checking for SLURM test files ===" find . -type f -name "*slurm*" -o -name "*test*.py" | grep -i slurm | head -10 echo echo "=== Checking benchmark directories ===" ls -la nemo_skills/dataset/ 2>/dev/null || echo "Directory not found" echo echo "=== Looking for CI/workflow files ===" find . -type f $ -name "*.yml" -o -name "*.yaml" $ | grep -E '(workflow|github|ci)' | head -10

Repository: NVIDIA-NeMo/Skills

Length of output: 15635

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== Checking external-benchmarks.md for MT-Bench/KoMT-Bench ===" cat ./docs/evaluation/external-benchmarks.md | grep -i "mt.bench\|komt" || echo "No MT-Bench/KoMT-Bench found in external-benchmarks.md" echo echo "=== Listing all SLURM test directories ===" ls -1 ./tests/slurm-tests/ echo echo "=== Checking if mt-bench SLURM test exists ===" ls -la ./tests/slurm-tests/mt-bench 2>/dev/null || echo "MT-Bench SLURM test directory NOT found" echo echo "=== Checking if komt-bench SLURM test exists ===" ls -la ./tests/slurm-tests/komt-bench 2>/dev/null || echo "KoMT-Bench SLURM test directory NOT found" echo echo "=== Checking CI test configuration files ===" find ./tests -name "*.py" -type f | head -5 | xargs grep -l "mt.bench\|komt" 2>/dev/null || echo "No MT-Bench/KoMT-Bench references in test files" echo echo "=== Checking GPU tests workflow ===" grep -n "mt.bench\|komt" ./.github/workflows/gpu_tests.yml || echo "No MT-Bench/KoMT-Bench in GPU tests" echo echo "=== Checking mt-bench __init__.py file ===" head -30 nemo_skills/dataset/mt-bench/__init__.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1929

Add benchmark documentation and SLURM test coverage for MT-Bench.

While the config wiring is correct, the benchmark setup is incomplete:

Missing documentation: MT-Bench is not documented in docs/evaluation/external-benchmarks.md. Add example run commands and expected results for tested models.

Missing SLURM test coverage: No SLURM test directory exists for MT-Bench (all other benchmarks have corresponding tests in tests/slurm-tests/). Add comprehensive test coverage following the pattern of existing benchmarks.

Missing CI integration: MT-Bench is not referenced in .github/workflows/gpu_tests.yml or other test configurations.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mt-bench/__init__.py` around lines 17 - 28, Add MT-Bench documentation and CI/test coverage: document the new METRICS_TYPE="mt_bench" and the generation/judge wiring (GENERATION_MODULE="nemo_skills.inference.eval.mt_bench", JUDGE_PIPELINE_ARGS and JUDGE_ARGS) in docs/evaluation/external-benchmarks.md with example run commands (how to invoke the generation and judge pipelines and expected metric outputs) and expected results for tested models; create a tests/slurm-tests/mt-bench/ directory mirroring other benchmark test folders (include SLURM job scripts, minimal input configs and assertions that run the generation and judge pipelines and validate metrics); and add mt-bench to the CI by updating .github/workflows/gpu_tests.yml (and any test matrix) so the new SLURM test directory is executed in CI. Ensure filenames and commands reference the above symbols (METRICS_TYPE, GENERATION_MODULE, JUDGE_PIPELINE_ARGS, JUDGE_ARGS) so the tests exercise the actual code paths.

coderabbitai · 2026-04-17T02:29:44Z

+    with open(questions_file, "rt", encoding="utf-8") as fin, \
+         open(output_file, "wt", encoding="utf-8") as fout:


🛠️ Refactor suggestion | 🟠 Major

Avoid partial test.jsonl writes if a later row fails.

fout is opened before transformation finishes, so a mid-run failure can leave a truncated dataset file.

Proposed fix

- with open(questions_file, "rt", encoding="utf-8") as fin, \ - open(output_file, "wt", encoding="utf-8") as fout: + entries = [] + with open(questions_file, "rt", encoding="utf-8") as fin: for line in fin: data = json.loads(line) qid = data["question_id"] @@ - fout.write(json.dumps(entry, ensure_ascii=False) + "\n") + entries.append(entry) + + with open(output_file, "wt", encoding="utf-8") as fout: + for entry in entries: + fout.write(json.dumps(entry, ensure_ascii=False) + "\n")

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/mt-bench/prepare.py` around lines 47 - 48, Read and transform the entire input (questions_file) into memory or a temporary buffer first, and only after all rows are successfully processed open output_file for writing (or write to a temporary file and atomically replace output_file with os.replace) to avoid leaving a partially written test.jsonl on failure; adjust the existing with-open that currently opens fout early so that only reading/parsing happens before opening/writing, referencing the questions_file and output_file variables used in the current block.

coderabbitai · 2026-04-17T02:29:44Z

+        pred = predictions[0]
+        category = pred.get("category", "unknown")
+        question_id = pred.get("question_id")
+
+        # Support multi-judge averaging
+        all_j1 = pred.get("all_judgements_turn1")
+        all_j2 = pred.get("all_judgements_turn2")
+
+        if all_j1 and all_j2:
+            scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]
+            scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]
+            score_turn1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None
+            score_turn2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None
+        else:
+            score_turn1 = self._parse_rating(pred.get("judgement_turn1", ""))
+            score_turn2 = self._parse_rating(pred.get("judgement_turn2", ""))
+
+        # Apply language penalty
+        skip_penalty = question_id in CODE_OUTPUT_QUESTION_IDS
+        penalized_turn1 = score_turn1
+        penalized_turn2 = score_turn2
+
+        if not skip_penalty:
+            answer_1 = pred.get("answer_1", "")
+            answer_2 = pred.get("answer_2", "")
+            if score_turn1 is not None and not _detect_korean(answer_1):


⚠️ Potential issue | 🟠 Major

Fail fast on missing benchmark schema fields.

This path treats category, question_id, answer_*, and the fallback judgement_turn* fields as optional, but they drive penalty exemption and score parsing. Defaulting them here can silently mis-score malformed records, especially by applying the Korean penalty to code-output questions when question_id is missing.

Suggested fix

- category = pred.get("category", "unknown") - question_id = pred.get("question_id") + category = pred["category"] + question_id = pred["question_id"] ... - score_turn1 = self._parse_rating(pred.get("judgement_turn1", "")) - score_turn2 = self._parse_rating(pred.get("judgement_turn2", "")) + score_turn1 = self._parse_rating(pred["judgement_turn1"]) + score_turn2 = self._parse_rating(pred["judgement_turn2"]) ... - answer_1 = pred.get("answer_1", "") - answer_2 = pred.get("answer_2", "") + answer_1 = pred["answer_1"] + answer_2 = pred["answer_2"]

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/komt_bench_metrics.py` around lines 59 - 84, This code silently defaults critical fields using dict.get (category, question_id, answer_1, answer_2, judgement_turn1, judgement_turn2) which can mis-score records; change to fail-fast by accessing those keys directly (e.g., pred["question_id"], pred["answer_1"], etc.) or validate presence at the start of processing and raise a clear exception if missing, then keep using _parse_rating, CODE_OUTPUT_QUESTION_IDS and _detect_korean as before; ensure the multi-judge branches also require and validate all_judgements_turn1/all_judgements_turn2 exist before computing averages so malformed inputs raise an error instead of using silent defaults.

coderabbitai · 2026-04-17T02:29:44Z

+        except asyncio.TimeoutError:
+            LOG.warning(
+                "Judge call timed out (turn=%d, timeout=%ds, category=%s)",
+                turn, JUDGE_TIMEOUT, data_point.get("category", ""),
+            )
+            return ""


⚠️ Potential issue | 🟠 Major

Don't turn judge timeouts into dropped scores.

Returning "" here makes the metrics layer treat the timed-out judge as “missing” and exclude it from the average, so runs with partial judge failure can still report normal-looking scores. This should surface as an explicit failure or be counted separately.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/mt_bench_judge.py` around lines 121 - 126, The except asyncio.TimeoutError handler currently returns an empty string which makes timeouts appear as missing scores; instead surface the failure by raising a specific exception (e.g., raise RuntimeError or a custom JudgeTimeoutError) so the caller/metrics layer can count it separately; keep the LOG.warning call (using LOG.warning, JUDGE_TIMEOUT, and data_point) but replace the final "return \"\"" with raising the exception containing context (turn, JUDGE_TIMEOUT, category) so timeouts are explicit failures rather than dropped scores.

bzantium · 2026-04-17T03:03:58Z

Evaluation Results

Tested with two models (8 judges per turn, GPT-4o-2024-08-06 as judge):

MT-Bench (English)

Category	kanana-2-30b-a3b-instruct-2601	Qwen3-30B-A3B-Instruct-2507
Overall	8.30	8.88
Turn 1	8.67	9.22
Turn 2	7.93	8.54
Coding	8.55	9.34
Extraction	7.59	8.58
Humanities	9.04	9.46
Math	8.28	8.42
Reasoning	6.58	7.61
Roleplay	8.73	9.03
STEM	9.04	9.51
Writing	8.58	9.09

Komt-Bench (Korean, with language penalty)

Category	kanana-2-30b-a3b-instruct-2601	Qwen3-30B-A3B-Instruct-2507
Overall (penalized)	8.18	8.69
Overall (raw)	8.25	8.78
Turn 1 (penalized)	8.37	8.99
Turn 2 (penalized)	7.99	8.40
Coding	8.41	8.77
Extraction	7.33	7.65
Humanities	8.98	8.73
Math	9.09	9.98
Reasoning	6.89	8.71
Roleplay	8.19	8.09
STEM	8.70	9.20
Writing	7.86	8.44

Both benchmarks run end-to-end: prepare → two-turn generation → multi-judge scoring → metrics aggregation.

coderabbitai

♻️ Duplicate comments (2)

nemo_skills/evaluation/metrics/mt_bench_metrics.py (2)

58-59: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize BaseMetrics here.

MTBenchMetrics still bypasses BaseMetrics.__init__, so compute_no_answer is never set and the new mt_bench registry entry can fail as soon as get_metrics(..., **kwargs) forwards constructor args. Call super().__init__(compute_no_answer=compute_no_answer) instead.

Suggested fix

 class MTBenchMetrics(BaseMetrics):
-    def __init__(self):
-        self.reset()
+    def __init__(self, compute_no_answer: bool = True):
+        super().__init__(compute_no_answer=compute_no_answer)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py` around lines 58 - 59,
MTBenchMetrics.__init__ currently calls self.reset() directly and bypasses
BaseMetrics.__init__, so compute_no_answer is never initialized; modify
MTBenchMetrics.__init__ to call
super().__init__(compute_no_answer=compute_no_answer) (passing the
compute_no_answer parameter) before any reset/use of instance state so
BaseMetrics sets up compute_no_answer and related defaults used later by
get_metrics and mt_bench registry.

89-111: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on malformed judge records.

The .get(..., "") / "unknown" fallbacks hide broken inputs and can silently drop scores or collapse categories into a fake bucket. Use direct indexing for required judge fields and validate the multi-judge payload explicitly instead of falling back to the single-judge path when a list is missing or empty. As per coding guidelines, don't use .get() for dictionary keys when the code expects them to be present.

Suggested fix

-        all_j1 = pred.get("all_judgements_turn1")
-        all_j2 = pred.get("all_judgements_turn2")
-        if all_j1 and all_j2:
+        all_j1 = pred["all_judgements_turn1"]
+        all_j2 = pred["all_judgements_turn2"]
+        if all_j1 and all_j2:
             scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]
             scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]
             score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None
             score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None
         else:
-            score_t1 = self._parse_rating(pred.get("judgement_turn1", ""))
-            score_t2 = self._parse_rating(pred.get("judgement_turn2", ""))
+            score_t1 = self._parse_rating(pred["judgement_turn1"])
+            score_t2 = self._parse_rating(pred["judgement_turn2"])
@@
-            answer_1 = pred.get("answer_1", "")
-            answer_2 = pred.get("answer_2", "")
+            answer_1 = pred["answer_1"]
+            answer_2 = pred["answer_2"]
@@
-            "category": pred.get("category", "unknown"),
+            "category": pred["category"],

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py` around lines 89 - 111,
The code currently uses .get fallbacks for required judge fields and silently
falls back to single-judge handling; change this to fail fast by validating
presence and shape of multi-judge payloads: in the block using all_j1/all_j2
replace pred.get(...) with direct indexing (pred["all_judgements_turn1"],
pred["all_judgements_turn2"]) and raise a descriptive exception if they are
missing or not non-empty lists; remove the .get(..., "") fallbacks for
"judgement_turn1"/"judgement_turn2" and require those keys when single-judge
mode is expected (use pred["judgement_turn1"] / pred["judgement_turn2"] after
validating the payload type), and keep the existing usage of _parse_rating,
_validate_target_language, answer_1/answer_2 and language penalty logic
unchanged—this ensures malformed records are validated up front instead of being
hidden by defaults.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py`:
- Around line 58-59: MTBenchMetrics.__init__ currently calls self.reset()
directly and bypasses BaseMetrics.__init__, so compute_no_answer is never
initialized; modify MTBenchMetrics.__init__ to call
super().__init__(compute_no_answer=compute_no_answer) (passing the
compute_no_answer parameter) before any reset/use of instance state so
BaseMetrics sets up compute_no_answer and related defaults used later by
get_metrics and mt_bench registry.
- Around line 89-111: The code currently uses .get fallbacks for required judge
fields and silently falls back to single-judge handling; change this to fail
fast by validating presence and shape of multi-judge payloads: in the block
using all_j1/all_j2 replace pred.get(...) with direct indexing
(pred["all_judgements_turn1"], pred["all_judgements_turn2"]) and raise a
descriptive exception if they are missing or not non-empty lists; remove the
.get(..., "") fallbacks for "judgement_turn1"/"judgement_turn2" and require
those keys when single-judge mode is expected (use pred["judgement_turn1"] /
pred["judgement_turn2"] after validating the payload type), and keep the
existing usage of _parse_rating, _validate_target_language, answer_1/answer_2
and language penalty logic unchanged—this ensures malformed records are
validated up front instead of being hidden by defaults.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a909f557-c1b1-41e3-9f83-a5ae75899249

📥 Commits

Reviewing files that changed from the base of the PR and between 70e3c8d and 2720d78.

📒 Files selected for processing (5)

nemo_skills/dataset/komt-bench/__init__.py
nemo_skills/dataset/komt-bench/prepare.py
nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/evaluation/metrics/mt_bench_metrics.py
tests/test_mt_bench_metrics.py

✅ Files skipped from review due to trivial changes (1)

nemo_skills/dataset/komt-bench/prepare.py

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/dataset/komt-bench/init.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

nemo_skills/dataset/mt-bench/prepare.py (2)

47-48: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Buffer the transformed rows before opening test.jsonl for writing.

If a later row fails validation, this leaves a partially written dataset behind.

Suggested fix

-    with open(questions_file, "rt", encoding="utf-8") as fin, \
-         open(output_file, "wt", encoding="utf-8") as fout:
+    entries = []
+    with open(questions_file, "rt", encoding="utf-8") as fin:
         for line in fin:
             data = json.loads(line)
             qid = data["question_id"]
             turns = data["turns"]
             if len(turns) != 2:
                 raise ValueError(f"MT-Bench question {qid} expected 2 turns, got {len(turns)}")
             ref = ref_answers.get(qid)
             entry = {
                 "question_id": qid,
                 "category": data["category"],
                 "question_1": turns[0],
                 "question_2": turns[1],
                 "ref_answer_1": ref["turn1"] if ref is not None else None,
                 "ref_answer_2": ref["turn2"] if ref is not None else None,
             }
-            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
+            entries.append(entry)
+
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for entry in entries:
+            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")

As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mt-bench/prepare.py` around lines 47 - 48, The code
currently opens the output file (output_file / fout) while streaming
transformations from questions_file, which can leave a partially written
test.jsonl if validation later fails; instead, read and parse questions_file
first, buffer all transformed/validated rows in memory (e.g., build a list of
transformed entries inside the parsing loop in prepare.py), perform all
validation and transformations on that buffer, and only after successful
completion open output_file for writing and iterate the buffered list to write
lines; reference the existing variables/questions_file, output_file, fin/fout
and the transformation/validation logic so you only move the file open for fout
to after buffering and validation.

44-45: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when a reference-scored category is missing reference turns.

mt_bench_judge.py unconditionally uses the _with_ref prompts for math/reasoning/coding, so emitting None here silently degrades the judge input for exactly the categories that need references. Keep refs optional for other categories, but require both turns for the reference-scored ones.

Suggested fix

+NEED_REF_CATS = {"math", "reasoning", "coding"}
+
 ...
-            ref_answers[qid] = {"turn1": turns[0], "turn2": turns[1] if len(turns) > 1 else None}
+            ref_answers[qid] = {"turn1": turns[0], "turn2": turns[1] if len(turns) > 1 else None}
 
     with open(questions_file, "rt", encoding="utf-8") as fin, \
          open(output_file, "wt", encoding="utf-8") as fout:
         for line in fin:
             data = json.loads(line)
             qid = data["question_id"]
             turns = data["turns"]
             if len(turns) != 2:
                 raise ValueError(f"MT-Bench question {qid} expected 2 turns, got {len(turns)}")
-            # Reference answers are populated upstream only for math/reasoning/coding;
-            # other categories legitimately have none.
-            ref = ref_answers.get(qid, {})
+            ref = ref_answers.get(qid)
+            if data["category"] in NEED_REF_CATS:
+                if ref is None or ref["turn1"] is None or ref["turn2"] is None:
+                    raise ValueError(f"MT-Bench question {qid} is missing required reference answers")
             entry = {
                 "question_id": qid,
                 "category": data["category"],
                 "question_1": turns[0],
                 "question_2": turns[1],
-                "ref_answer_1": ref.get("turn1"),
-                "ref_answer_2": ref.get("turn2"),
+                "ref_answer_1": ref["turn1"] if ref is not None else None,
+                "ref_answer_2": ref["turn2"] if ref is not None else None,
             }

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 57-64

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/mt-bench/prepare.py` around lines 44 - 45, When building
ref_answers from data (the assignment using data["choices"][0]["turns"] and
ref_answers[qid] = {"turn1": turns[0], "turn2": turns[1] if len(turns) > 1 else
None}), enforce that both reference turns exist for categories that require
references (math, reasoning, coding) and raise an explicit error if they are
missing instead of writing None; keep optional behavior for other categories.
Locate the ref extraction logic (the turns = data["choices"][0]["turns"] block
and the later similar block at lines ~57-64) and change it to directly access
required keys/indices (avoid .get()) and validate len(turns) >= 2 for the
reference-scored categories, throwing a clear exception with qid and category
when the check fails so the pipeline fails fast.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/inference/eval/mt_bench_judge.py`:
- Around line 139-141: The code currently treats num_judges <= 1 as the default
path which silently accepts 0 or negative values; update the validation around
num_judges (the variable assigned from self.cfg.num_judges in mt_bench_judge.py)
to explicitly reject invalid values: if num_judges < 1 raise a ValueError (with
a helpful message referencing num_judges), and keep num_judges == 1 as a valid
case if that is intended; ensure any later logic that assumed the fallback path
still works when validation raises.

---

Duplicate comments:
In `@nemo_skills/dataset/mt-bench/prepare.py`:
- Around line 47-48: The code currently opens the output file (output_file /
fout) while streaming transformations from questions_file, which can leave a
partially written test.jsonl if validation later fails; instead, read and parse
questions_file first, buffer all transformed/validated rows in memory (e.g.,
build a list of transformed entries inside the parsing loop in prepare.py),
perform all validation and transformations on that buffer, and only after
successful completion open output_file for writing and iterate the buffered list
to write lines; reference the existing variables/questions_file, output_file,
fin/fout and the transformation/validation logic so you only move the file open
for fout to after buffering and validation.
- Around line 44-45: When building ref_answers from data (the assignment using
data["choices"][0]["turns"] and ref_answers[qid] = {"turn1": turns[0], "turn2":
turns[1] if len(turns) > 1 else None}), enforce that both reference turns exist
for categories that require references (math, reasoning, coding) and raise an
explicit error if they are missing instead of writing None; keep optional
behavior for other categories. Locate the ref extraction logic (the turns =
data["choices"][0]["turns"] block and the later similar block at lines ~57-64)
and change it to directly access required keys/indices (avoid .get()) and
validate len(turns) >= 2 for the reference-scored categories, throwing a clear
exception with qid and category when the check fails so the pipeline fails fast.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 52a68afb-43d6-407e-a4f0-cda71ae3e776

📥 Commits

Reviewing files that changed from the base of the PR and between 2720d78 and 214a4ea.

📒 Files selected for processing (5)

nemo_skills/dataset/komt-bench/prepare.py
nemo_skills/dataset/mt-bench/prepare.py
nemo_skills/evaluation/metrics/mt_bench_metrics.py
nemo_skills/inference/eval/mt_bench_judge.py
tests/test_mt_bench_metrics.py

✅ Files skipped from review due to trivial changes (1)

nemo_skills/dataset/komt-bench/prepare.py

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/evaluation/metrics/mt_bench_metrics.py

coderabbitai · 2026-05-01T03:22:42Z

+        num_judges = self.cfg.num_judges
+
+        if num_judges <= 1:


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject invalid num_judges values instead of silently falling back to one judge.

num_judges <= 1 makes 0 and negative values behave like the default path, so a bad Hydra/CLI value still produces scores instead of failing fast.

Suggested fix

num_judges = self.cfg.num_judges + if num_judges < 1: + raise ValueError(f"num_judges must be >= 1, got {num_judges}") - if num_judges <= 1: + if num_judges == 1: judgement_turn1, judgement_turn2 = await asyncio.gather( self._safe_judge(data_point, all_data, turn=1), self._safe_judge(data_point, all_data, turn=2),

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/mt_bench_judge.py` around lines 139 - 141, The code currently treats num_judges <= 1 as the default path which silently accepts 0 or negative values; update the validation around num_judges (the variable assigned from self.cfg.num_judges in mt_bench_judge.py) to explicitly reject invalid values: if num_judges < 1 raise a ValueError (with a helpful message referencing num_judges), and keep num_judges == 1 as a valid case if that is intended; ensure any later logic that assumed the fallback path still works when validation raises.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py`:
- Around line 91-100: The current conditional silently falls back to
single-judge fields when only one of all_judgements_turn1 / all_judgements_turn2
is present; update the logic in the block that reads all_j1 =
pred.get("all_judgements_turn1") and all_j2 = pred.get("all_judgements_turn2")
to detect the case where exactly one is present and fail loudly (e.g., raise a
ValueError or similar) with a clear message referencing the record id/context,
otherwise proceed to compute average scores using self._parse_rating for each
list; keep the existing single-judge fallback (parsing pred["judgement_turn1"] /
pred["judgement_turn2"]) only when neither all_j1 nor all_j2 is present.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e29d5f99-f0ca-4ed9-aa24-1f430c18d00a

📥 Commits

Reviewing files that changed from the base of the PR and between 214a4ea and 845f550.

📒 Files selected for processing (2)

nemo_skills/evaluation/metrics/mt_bench_metrics.py
tests/test_mt_bench_metrics.py

coderabbitai · 2026-05-01T06:48:33Z

+        all_j1 = pred.get("all_judgements_turn1")
+        all_j2 = pred.get("all_judgements_turn2")
+        if all_j1 and all_j2:
+            scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]
+            scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]
+            score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None
+            score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None
+        else:
+            score_t1 = self._parse_rating(pred["judgement_turn1"])
+            score_t2 = self._parse_rating(pred["judgement_turn2"])


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail loudly when only one multi-judge field is present.

if all_j1 and all_j2 silently falls back to the single-judge path when a record contains just one of all_judgements_turn1 / all_judgements_turn2. That can mis-score partially corrupted judge output instead of surfacing the data issue.

Suggested fix

- all_j1 = pred.get("all_judgements_turn1") - all_j2 = pred.get("all_judgements_turn2") - if all_j1 and all_j2: + has_all_judgements = "all_judgements_turn1" in pred or "all_judgements_turn2" in pred + if has_all_judgements: + all_j1 = pred["all_judgements_turn1"] + all_j2 = pred["all_judgements_turn2"] + if not all_j1 or not all_j2: + raise ValueError( + "multi-judge records must include both all_judgements_turn1 and all_judgements_turn2" + ) scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None] scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None] score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None else:

As per coding guidelines, required judge fields should fail loudly instead of being silently downgraded.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

all_j1 = pred.get("all_judgements_turn1")

all_j2 = pred.get("all_judgements_turn2")

if all_j1 and all_j2:

scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]

scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]

score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None

score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None

else:

score_t1 = self._parse_rating(pred["judgement_turn1"])

score_t2 = self._parse_rating(pred["judgement_turn2"])

has_all_judgements = "all_judgements_turn1" in pred or "all_judgements_turn2" in pred

if has_all_judgements:

all_j1 = pred["all_judgements_turn1"]

all_j2 = pred["all_judgements_turn2"]

if not all_j1 or not all_j2:

raise ValueError(

"multi-judge records must include both all_judgements_turn1 and all_judgements_turn2"

)

scores_t1 = [s for s in (self._parse_rating(j) for j in all_j1) if s is not None]

scores_t2 = [s for s in (self._parse_rating(j) for j in all_j2) if s is not None]

score_t1 = sum(scores_t1) / len(scores_t1) if scores_t1 else None

score_t2 = sum(scores_t2) / len(scores_t2) if scores_t2 else None

else:

score_t1 = self._parse_rating(pred["judgement_turn1"])

score_t2 = self._parse_rating(pred["judgement_turn2"])

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/mt_bench_metrics.py` around lines 91 - 100, The current conditional silently falls back to single-judge fields when only one of all_judgements_turn1 / all_judgements_turn2 is present; update the logic in the block that reads all_j1 = pred.get("all_judgements_turn1") and all_j2 = pred.get("all_judgements_turn2") to detect the case where exactly one is present and fail loudly (e.g., raise a ValueError or similar) with a clear message referencing the record id/context, otherwise proceed to compute average scores using self._parse_rating for each list; keep the existing single-judge fallback (parsing pred["judgement_turn1"] / pred["judgement_turn2"]) only when neither all_j1 nor all_j2 is present.

Kipok

@bzantium is this benchmark widely used? We had English mtbench before but removed it because our researchers found it low-quality, noisy and not very relevant to modern models. So, wondering if you find this benchmark valuable on your side or not

bzantium · 2026-05-12T06:10:12Z

@Kipok Fair point on the quality concerns. We don't use MT-Bench or KoMT-Bench for frontier model comparisons either; the headroom is gone up there and we reach for other benchmarks. Where we still find them useful is the sub-30B range, where the score distribution is wide enough that the noise floor doesn't drown the signal.

Agreed that it's not suitable for tight model-vs-model rankings. We treat it more as a coarse threshold: "does this checkpoint clear the ~8.0 line or fall below it?" rather than as a precise score. KoMT-Bench specifically is still one of the few open Korean multi-turn dialog benchmarks with a reference answer set, so even with the limitations it's worth keeping in the suite for smaller models.

Add MT-Bench and its Korean adaptation KoMT-Bench, both judge-based multi-turn dialog benchmarks (80 prompts × 2 turns, 8 categories). Datasets: - mt-bench: LMSYS FastChat questions + GPT-4 references, pulled from the upstream GitHub raw URLs. - komt-bench: LGAI-EXAONE/KoMT-Bench from the project's GitHub repo (the HuggingFace parquet only ships 39/80 references; the GitHub repo carries the full 80/80, matching upstream evaluations). Evaluator: - nemo_skills/inference/eval/mt_bench.py wraps the standard generation task for two-turn dialog rollouts; mt_bench_judge.py is a multi-judge variant. - MTBenchMetrics aggregates 1-10 judge scores and applies an optional sqrt language penalty when target_language is set on the entry. KoMT-Bench sets target_language: ko on all but qids 138 and 140 (JSON / CSV format outputs, exempted per upstream detector.py). - Language detection uses MediaPipe's language_detector.tflite (~315KB, lazy fetch to ~/.cache/nemo_skills/ or $NEMO_SKILLS_CACHE) for more reliable Korean detection than langdetect's heuristic. Korean judge prompts in nemo_skills/prompt/config/judge/komt-bench/ mirror the upstream FastChat turn1 / turn2 templates with Korean instructions. Dockerfile picks up libgl1-mesa-glx / libgles2-mesa / libegl1-mesa for the mediapipe runtime. Signed-off-by: bzantium <ryumin93@gmail.com>

shuoyangd · 2026-05-12T18:44:14Z

@eagle705 Any thoughts here?

Signed-off-by: Minho Ryu <ryumin93@gmail.com>

bzantium · 2026-05-15T06:38:27Z

Following up with concrete numbers in the sub-30B range. Ran the suite on nine open models from 0.6B to 4B (Qwen3, Qwen3.5, gemma-3, gemma-4), each in thinking and non-thinking mode where applicable, for 17 (model, mode) pairs. Generation on a single H100, judge gpt-4o-2024-08-06.

Model	Mode	mt-bench	komt-bench
Qwen3-0.6B	think	5.81	4.12
Qwen3-0.6B	nothink	4.99	3.45
Qwen3.5-0.8B	think	3.23	2.74
Qwen3.5-0.8B	nothink	4.42	3.11
gemma-3-1b-it	n/a	6.45	4.71
Qwen3-1.7B	think	7.54	5.91
Qwen3-1.7B	nothink	7.01	5.49
Qwen3.5-2B	think	6.26	5.60
Qwen3.5-2B	nothink	6.33	4.81
gemma-4-E2B-it	think	8.60	8.61
gemma-4-E2B-it	nothink	8.50	7.93
Qwen3-4B	think	8.50	7.58
Qwen3-4B	nothink	8.08	7.00
Qwen3.5-4B	think	8.10	7.88
Qwen3.5-4B	nothink	7.84	6.96
gemma-4-E4B-it	think	8.81	8.89
gemma-4-E4B-it	nothink	8.73	8.53

A few takeaways supporting the points above:

Distribution is wide. mt-bench spans 3.23 to 8.81 (5.6 pt spread); komt-bench spans 2.74 to 8.89 (6.15 pt). At 1B-scale and below, the table cleanly stratifies models by capability without saturating.
The 8.0 threshold is informative. mt-bench ≥ 8.0 picks 7 of 17 runs; komt-bench ≥ 8.0 picks only 3, all from the gemma-4 family. As a binary "usable for casual (Korean) dialog?" gate this is the kind of signal we actually act on; the precise rank between 8.50 and 8.60 is not.
Korean signal that MCQ benchmarks compress out. Among the four 4B-class models, three clear mt-bench ≥ 8.0 but only one (gemma-4-E4B-it) clears komt-bench ≥ 8.0. Qwen3-4B drops 0.92 points from English to Korean; gemma-4-E4B-it actually edges higher on Korean. MMMLU and KMMLU-style benchmarks tend to collapse this gap into a couple of accuracy points.

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

gwarmstrong mentioned this pull request Apr 27, 2026

fix(proof-arena-judge): pin MathArena/*_2025_outputs to 2026-03-25 revisions #1403

Merged

2 tasks

shuoyangd self-assigned this Apr 29, 2026

shuoyangd mentioned this pull request Apr 29, 2026

Add Korean benchmarks: KMMLU, KMMLU-Pro, KMMLU-Redux, KoIFEval, KorMedMCQA, KoSimpleQA #1375

Merged

bzantium force-pushed the feat/mt-bench-komt-bench branch from 70e3c8d to 2720d78 Compare May 1, 2026 03:00

coderabbitai Bot reviewed May 1, 2026

View reviewed changes

Kipok reviewed May 12, 2026

View reviewed changes

bzantium force-pushed the feat/mt-bench-komt-bench branch 2 times, most recently from 0f91275 to e44da44 Compare May 12, 2026 06:40

bzantium force-pushed the feat/mt-bench-komt-bench branch from e44da44 to 203923e Compare May 12, 2026 06:48

Update Dockerfile.nemo-skills

e4b5103

Signed-off-by: Minho Ryu <ryumin93@gmail.com>

		with open(questions_file, "rt", encoding="utf-8") as fin, \
		open(output_file, "wt", encoding="utf-8") as fout:

Conversation

bzantium commented Apr 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Metrics

Data

Changes

New files (12)

Modified files (1)

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

bzantium commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Evaluation Results

MT-Bench (English)

Komt-Bench (Korean, with language penalty)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Kipok left a comment

Choose a reason for hiding this comment

Uh oh!

bzantium commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shuoyangd commented May 12, 2026

Uh oh!

bzantium commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

bzantium commented Apr 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading

bzantium commented Apr 17, 2026 •

edited

Loading

bzantium commented May 12, 2026 •

edited

Loading

bzantium commented May 15, 2026 •

edited

Loading