feat(datasets): add HLE dataset #301

Merged
SJTUyh merged 3 commits into AISBench:master from ivanbao9783:feat/hle-dataset
May 20, 2026

@ivanbao9783 commented May 15, 2026
Collaborator

PR Type

  • Feature
  • Bugfix
  • Docs
  • CI/CD
  • Refactor
  • Perf
  • Dependency
  • Test-Cases
  • Other

Related Issue
Relates to #(issue ID)

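🔍 Motivation

Adapt the HLE (Humanity's Last Exam) multimodal visual question answering dataset to the AISBench evaluation framework. A judge model performs the automatic evaluation, the full set of metrics (accuracy, confidence interval, calibration error) is computed, and the logic stays consistent with the official HLE evaluation.

Main goals:

  • Support inference and evaluation on the HLE dataset
  • Fully implement the official calib_err() and dump_metrics() computation logic
  • Support multimodal (text + image) input processing
  • Provide documentation in Chinese and English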

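📝 Modification

New files (4)

| File path | Description |
| --- | --- |
| ais_bench/benchmark/datasets/hle.py | HLE dataset and evaluator implementation, including HLEDataset, HLEJGDataset, HLEJudgeEvaluator, and functions such as calib_err() and dump_metrics() |
| ais_bench/benchmark/configs/datasets/hle/hle_llmjudge.py | HLE dataset configuration file, including configs such as hle_reader_cfg, hle_infer_cfg, hle_judge_infer_cfg, and hle_eval_cfg |
| ais_bench/benchmark/configs/datasets/hle/README.md | Chinese documentation |
| ais_bench/benchmark/configs/datasets/hle/README_en.md | English documentation |

Modified files (2)

| File path | Change |
| --- | --- |
| ais_bench/benchmark/summarizers/default.py | _format_table() now checks the value type, so string-valued metrics (e.g. "8.0% +/- 2.38%") are displayed as-is |
| ais_bench/benchmark/datasets/__init__.py | Imports the contents of the new hle module |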

📐 Associated Test Results

[Test report] Integrating the HLE multimodal closed-book academic benchmark into AISBench

⚠️ BC-breaking (Optional)

NA

⚠️ Performance degradation (Optional)

NA

🌟 Use cases (Optional)

NA

✅ Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

  • Suggested Reviewers: @xxx
  • Relevant Module Owners: @xxx
  • Other Collaboration Notes:

🌟 Useful CI Command

| Command | Introduction |
| --- | --- |
| /gemini review | Performs a code review for the current pull request in its current state by Gemini. |
| /gemini summary | Provides a summary of the current pull request in its current state by Gemini. |
| /gemini help | Displays a list of available commands of Gemini. |
| /readthedocs build | Triggers a build of the documentation for the current pull request in its current state by Read the Docs. |

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for the HLE (Humanity's Last Exam) dataset, including documentation, LLM-judge configurations, and evaluation logic. Key additions include the HLEDataset and HLEJudgeEvaluator classes, along with updates to the summarizer to handle string-based metrics. Feedback from the review identifies a critical logic error in the calibration error calculation that skips the final data bin, suggests performance optimizations for dataset loading using vectorization, and recommends replacing print and assert statements with proper logging and exception handling.


cerr = 0
total_examples = len(confidence)
for i in range(len(bins) - 1):

Contributor

high

Logic error: range(len(bins) - 1) skips the last bin, so it never takes part in the calculation. Iterate over all bins instead.

Suggested change
- for i in range(len(bins) - 1):
+ for i in range(len(bins)):
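
To make the effect of the two loop bounds concrete, here is a minimal sketch in plain Python. It assumes the bin construction quoted later in this thread (index ranges of `beta` samples over data sorted by confidence, with the last range stretched to the end of the data). `rms_calib_err` and its `skip_last` flag are hypothetical names for illustration, not the PR's `calib_err()`.

```python
import math

def rms_calib_err(confidence, correct, beta, skip_last):
    """RMS calibration error over equal-size bins of samples sorted by confidence.

    Hypothetical helper for illustration; assumes len(confidence) >= beta.
    """
    pairs = sorted(zip(confidence, correct))
    n = len(pairs)
    bins = [[i * beta, (i + 1) * beta] for i in range(n // beta)]
    bins[-1] = [bins[-1][0], n]  # last bin absorbs the remainder
    upper = len(bins) - 1 if skip_last else len(bins)
    total = 0.0
    for start, end in bins[:upper]:
        chunk = pairs[start:end]
        avg_conf = sum(c for c, _ in chunk) / len(chunk)
        avg_corr = sum(k for _, k in chunk) / len(chunk)
        total += len(chunk) / n * (avg_conf - avg_corr) ** 2
    return math.sqrt(total)

# Four samples, bins of two: the low-confidence bin is perfectly calibrated,
# the high-confidence bin is fully overconfident.
conf = [0.0, 0.0, 1.0, 1.0]
hits = [0, 0, 0, 0]
print(rms_calib_err(conf, hits, beta=2, skip_last=True))   # last bin ignored
print(rms_calib_err(conf, hits, beta=2, skip_last=False))  # last bin counted
```

In this toy input the overconfident samples all land in the final bin, so the result changes from zero to a clearly non-zero error depending on the loop bound. Which bound matches the official HLE script is a separate question that this sketch does not answer.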

@ivanbao9783 ivanbao9783 May 18, 2026

Collaborator Author

Comment on lines +71 to +75
dataset = []
for i in range(len(data)):
line = data.iloc[i]
parsed_item = parse_hle_item(line.to_dict())
dataset.append(parsed_item)

Contributor

medium

Processing the DataFrame row by row with a loop and iloc is inefficient. Consider using to_dict('records') for vectorized processing to speed up data loading.

Suggested change
- dataset = []
- for i in range(len(data)):
-     line = data.iloc[i]
-     parsed_item = parse_hle_item(line.to_dict())
-     dataset.append(parsed_item)
+ dataset = [parse_hle_item(row) for row in data.to_dict('records')]

Collaborator Author

Keeping this as is, to stay consistent with the parsing style of the other AISBench datasets.

Comment thread ais_bench/benchmark/datasets/hle.py Outdated
}
)
except json.JSONDecodeError:
print(f"Error: wrong format prediction: {cleaned}")

Contributor

medium

Consider using logger.error instead of the print statement so the message integrates with the system's log management.

Suggested change
- print(f"Error: wrong format prediction: {cleaned}")
+ logger.error(f"wrong format prediction: {cleaned}")
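
As a rough illustration of the suggested pattern, the sketch below uses the standard-library `logging` module as a stand-in for the project's own logger, whose API is not shown in this thread. `parse_or_log` is a hypothetical helper name.

```python
import json
import logging

logger = logging.getLogger("hle_demo")  # stand-in for the project's logger

def parse_or_log(raw):
    """Return the parsed JSON value, or None after logging a parse failure."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Goes through the logging handlers, so level, format and
        # destination are controlled in one place.
        logger.error("wrong format prediction: %s", raw)
        return None

logging.basicConfig(level=logging.INFO)
print(parse_or_log('{"correct": "yes", "confidence": 90}'))
print(parse_or_log("not json"))
```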

Collaborator Author

This is key information; I would like it printed directly to the screen.

Collaborator Author

Fixed.

Comment thread ais_bench/benchmark/datasets/hle.py Outdated
elif p == "infty" or p == "infinity" or p == "max":
cerr = np.maximum(cerr, difference)
else:
assert False, "p must be '1', '2', or 'infty'"

Contributor

medium

Using assert False to handle an expected error case is not recommended. Raise a ValueError instead.

Suggested change
- assert False, "p must be '1', '2', or 'infty'"
+ raise ValueError("p must be '1', '2', or 'infty'")
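
One reason the explicit exception is safer: assert statements are removed when Python runs with the -O flag, so an `assert False` branch stops guarding anything. The sketch below validates the norm argument up front. `check_norm` is a hypothetical helper, not a function from the PR.

```python
def check_norm(p):
    """Return a canonical norm name, raising ValueError for unsupported values."""
    if p in ("1", "2"):
        return p
    if p in ("infty", "infinity", "max"):
        return "infty"
    raise ValueError(f"p must be '1', '2', or 'infty', got {p!r}")

try:
    check_norm("3")
except ValueError as exc:
    print(f"rejected: {exc}")
```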

Collaborator Author

Collaborator Author

Fixed.

Comment thread ais_bench/benchmark/datasets/hle.py Outdated
@ivanbao9783 ivanbao9783 force-pushed the feat/hle-dataset branch 4 times, most recently from 3e69213 to 77731e2 Compare May 18, 2026 07:40

Copilot AI left a comment

Contributor

Pull request overview

This PR adds support for the HLE (Humanity’s Last Exam) multimodal VQA dataset to AISBench, including dataset loading, LLM-judge-based evaluation (accuracy + calibration metrics), corresponding configs/docs, and summarizer support for string-formatted metrics.

Changes:

  • Added HLEDataset / HLEJGDataset and HLEJudgeEvaluator with metric utilities (calib_err, dump_metrics).
  • Added HLE dataset config and bilingual README docs.
  • Updated default summarizer to allow string metric values and display them without numeric formatting.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| ais_bench/benchmark/datasets/hle.py | New HLE dataset + judge evaluator implementation and metric computations. |
| ais_bench/benchmark/configs/datasets/hle/hle_llmjudge.py | New HLE task configuration for inference + judge inference + evaluation. |
| ais_bench/benchmark/configs/datasets/hle/README.md | New Chinese documentation for deploying/using HLE. |
| ais_bench/benchmark/configs/datasets/hle/README_en.md | New English documentation for deploying/using HLE. |
| ais_bench/benchmark/summarizers/default.py | Allows string-valued metrics and prints them as-is in summary tables. |
| ais_bench/benchmark/datasets/__init__.py | Exposes HLE dataset module via package import. |
| tests/UT/datasets/test_hle.py | Adds unit tests for HLE parsing, metrics, and evaluator scoring. |
| tests/UT/summarizers/test_default.py | Updates summarizer unit test expectations for string metric handling. |

Comments suppressed due to low confidence (1)

ais_bench/benchmark/summarizers/default.py:116

  • Allowing any string-valued key through _pick_up_results can accidentally treat non-metric string fields as metrics (and makes the existing "unknown result format" path unreachable in some cases). Consider only allowing str for known metric keys (e.g. those in METRIC_WHITELIST) or for specific metrics like accuracy, instead of all keys.
                _rst, _dm = {}, []
                for metric, score in result.items():
                    if metric not in METRIC_BLACKLIST and isinstance(score, (int, float, str)):
                        _rst[metric] = score
                        _dm.append(metric)
                    else:
                        continue
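
A minimal sketch of the narrower filter described above, under stated assumptions: the contents of `METRIC_BLACKLIST` are made up, and `STR_METRIC_WHITELIST` and `pick_metrics` are hypothetical names that do not exist in the summarizer.

```python
METRIC_BLACKLIST = {"details"}        # hypothetical contents
STR_METRIC_WHITELIST = {"accuracy"}   # hypothetical: keys allowed to carry strings

def pick_metrics(result):
    """Keep numeric metrics, plus string values only for whitelisted keys."""
    picked = {}
    for metric, score in result.items():
        if metric in METRIC_BLACKLIST:
            continue
        is_number = isinstance(score, (int, float))
        is_allowed_str = isinstance(score, str) and metric in STR_METRIC_WHITELIST
        if is_number or is_allowed_str:
            picked[metric] = score
    return picked

print(pick_metrics({
    "accuracy": "8.0% +/- 2.38%",   # formatted metric, kept
    "calibration_error": 3.1,       # numeric metric, kept
    "model_name": "demo",           # non-metric string, dropped
    "details": 1,                   # blacklisted, dropped
}))
```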

Comment thread ais_bench/benchmark/datasets/hle.py Outdated
return HLEDataset


def parse_predictions(predictions: list) -> list[dict]:

Collaborator Author

Fixed.


cerr = 0
total_examples = len(confidence)
for i in range(len(bins) - 1):

Collaborator Author

Comment thread ais_bench/benchmark/datasets/hle.py Outdated
elif p == "infty" or p == "infinity" or p == "max":
cerr = np.maximum(cerr, difference)
else:
assert False, "p must be '1', '2', or 'infty'"

Collaborator Author

Fixed.

Comment on lines +121 to +125
}
)
except json.JSONDecodeError:
print(f"Error: wrong format prediction: {cleaned}")
continue

Collaborator Author

Fixed.

Comment thread ais_bench/benchmark/datasets/hle.py Outdated
Comment on lines +16 to +17
from ais_bench.benchmark.openicl.icl_evaluator.icl_base_evaluator import \
BaseEvaluator

Collaborator Author

Fixed.

Comment on lines +36 to +41
## Available Dataset Tasks

| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
| --- | --- | --- | --- | --- | --- |
| hle | HLE dataset | Accuracy, Calibration Error | 0-shot | Chat format | hle_llmjudge.py |

Collaborator Author

Not an issue; the table in the md file renders correctly.


correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.

confidence: The extracted confidence score between 0|\%| and 100|\%| from [response]. Put 100 if there is no confidence score available.

Collaborator Author

Kept consistent with the official prompt.

Comment on lines 294 to 300
# Verify the results
self.assertIn("test_model", raw_results)
self.assertIn("test_ds", raw_results["test_model"])

self.assertIn("test_model", parsed_results)
- self.assertNotIn("test_ds", parsed_results["test_model"])
+ self.assertIn("test_ds", parsed_results["test_model"])

Collaborator Author

Not an issue; the UT case was adapted only for this single-point change and does not validate str values.

Comment on lines +1 to +10
import unittest
import json
import os
import tempfile
import numpy as np
import pandas as pd
from unittest.mock import patch
from datasets import Dataset

from ais_bench.benchmark.datasets.hle import (

Collaborator Author

Fixed.

Comment on lines +8 to +9
from ais_bench.benchmark.utils.postprocess.model_postprocessors import \
extract_non_reasoning_content

Collaborator Author

Fixed.

Comment thread ais_bench/benchmark/datasets/hle.py Outdated
}
)
except json.JSONDecodeError:
print(f"Error: wrong format prediction: {cleaned}")

Collaborator

[review] 1. logger.error and print behave the same here: outside debug mode, the stdout of this subprocess is redirected to a file, so print cannot write directly to the screen either. A printed "Error" is also not prominent enough.
I therefore suggest replacing it with logger.error.

Collaborator Author

Changed to logger.error.

Comment on lines +22 to +26
# Download with modelscope (requires modelscope to be installed)
modelscope download --dataset cais/hle --local_dir {tool_root_path}/ais_bench/datasets/hle/

# Download with huggingface-cli (requires transformers to be installed and a login)
huggingface-cli download cais/hle --repo-type dataset --local-dir {tool_root_path}/ais_bench/datasets/hle/

Collaborator

[review] 1. modelscope is not an AISBench dependency, so this command additionally requires installing modelscope.
2. huggingface-cli is an AISBench dependency, but newer versions use the hf binary and the huggingface-cli command no longer works.
3. On networks in mainland China, huggingface-cli may not be able to complete the download.

I therefore suggest following the other datasets: list only where to obtain the data, without concrete download commands.

Collaborator Author

Fixed.

    1. Remove unreferenced dependencies
    2. Change print to logger.error
    3. Update the README to give only the data source
    4. Update dump_metrics: default return value when judge_results is invalid
    5. Change the return type of parse_predictions from list[dict] to List[Dict[str, Any]]
    6. Change the exception raised in calib_err to ValueError
    7. Add HLE dataset information to get_started
@ivanbao9783 ivanbao9783 temporarily deployed to smoke-test-approval May 20, 2026 03:00 — with GitHub Actions Inactive
@SJTUyh SJTUyh merged commit 58e0e8e into AISBench:master May 20, 2026
6 checks passed
"additionalProperties": False,
},
"name": "ExtractedAnswer",
"strict": True,

@Libotry Libotry May 21, 2026

Collaborator

[review] issue: The schema sets "strict": True at the JSON-schema wrapper level and also requires a strict field inside the model output. That field is not consumed by the evaluator, but it increases output complexity and raises the chance of schema validation failures.
suggestion: Keep only the top-level json_schema.strict=True and remove the inner strict property and its required entry.

"enum": ["yes", "no"],
"type": "string",
},
"confidence": {"title": "Confidence", "type": "integer"},

Collaborator

[review] issue: The confidence field is defined only as an integer without a valid range. The downstream evaluator assumes the value is between 0 and 100, so out-of-range values can silently corrupt calibration metrics.
suggestion: Add minimum: 0 and maximum: 100 to the schema, and add defensive validation or clamping in the evaluator before metric calculation.
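
A sketch of what the bounded field and a matching evaluator-side check could look like. `minimum` and `maximum` are standard JSON Schema keywords, but whether a given serving backend enforces them in structured-output mode would need to be verified. `CONFIDENCE_SCHEMA` and `within_bounds` are hypothetical names.

```python
# Hypothetical fragment of the judge's JSON schema with a bounded confidence.
CONFIDENCE_SCHEMA = {
    "title": "Confidence",
    "type": "integer",
    "minimum": 0,
    "maximum": 100,
}

def within_bounds(value, schema=CONFIDENCE_SCHEMA):
    """Return True when value is an int inside the schema's inclusive range."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return schema["minimum"] <= value <= schema["maximum"]

print(within_bounds(100))   # True
print(within_bounds(101))   # False
print(within_bounds("90"))  # False
```

Keeping the check on the evaluator side as well means a backend that ignores the schema bounds still cannot feed out-of-range values into the calibration metrics unnoticed.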

role="HUMAN",
prompt_mm={
"text": {"type": "text", "text": "{question}"},
"image": {"type": "image_url", "image_url": {"url": "{image}"}},

Collaborator

[review] issue: The dataset image field is forwarded directly as image_url to the remote model service. If the dataset source is not fully trusted, this creates SSRF and internal resource access risk on the backend service side.
suggestion: Restrict allowed image sources with protocol and domain allowlists, and prefer controlled local-path mapping or preprocessed data URIs instead of arbitrary remote URLs.
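
A minimal sketch of a scheme and host allowlist, using only `urllib.parse`. The policy values and the `is_allowed_image` helper are hypothetical. The sketch also assumes that inline images arrive as `data:image/...` URIs, which this thread does not confirm.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https", "data"}      # hypothetical policy
ALLOWED_HOSTS = {"images.example.com"}   # hypothetical trusted host

def is_allowed_image(url):
    """Accept data: URIs with an image media type, or https URLs on trusted hosts."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False
    if parsed.scheme == "data":
        return parsed.path.startswith("image/")
    return parsed.hostname in ALLOWED_HOSTS

for candidate in (
    "data:image/png;base64,AAAA",
    "https://images.example.com/q1.png",
    "http://169.254.169.254/latest/meta-data",
    "file:///etc/passwd",
):
    print(candidate, is_allowed_image(candidate))
```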

host_port=8080,
url="",
max_out_len=8192,
batch_size=100,

Collaborator

[review] issue: The judge model configuration depends on multiple empty placeholders and localhost defaults, such as model="", api_key="", and url="". This creates hidden runtime dependencies and makes failures environment-dependent and harder to diagnose.
suggestion: Require at least one of model or url, load sensitive and environment-specific values from external configuration or environment variables, and fail fast with explicit validation errors.
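
One possible shape for the fail-fast check, reading the values from environment variables. The variable names `JUDGE_MODEL`, `JUDGE_URL`, and `JUDGE_API_KEY` and the `load_judge_settings` helper are assumptions made for this sketch, not settings that AISBench defines.

```python
import os

def load_judge_settings(env=os.environ):
    """Read judge connection settings, failing fast when they are incomplete."""
    model = env.get("JUDGE_MODEL", "").strip()
    url = env.get("JUDGE_URL", "").strip()
    api_key = env.get("JUDGE_API_KEY", "").strip()
    if not model and not url:
        # Surface the misconfiguration before any request is sent.
        raise ValueError("set at least one of JUDGE_MODEL or JUDGE_URL")
    return {"model": model, "url": url, "api_key": api_key}

print(load_judge_settings({"JUDGE_URL": "http://localhost:8080/v1"}))
```

Passing the mapping as a parameter keeps the function testable without touching the real process environment.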

host_ip="localhost",
host_port=8080,
url="",
max_out_len=8192,

Collaborator

[review] issue: max_out_len=8192 and batch_size=100 are aggressive defaults for a judge task that only needs short structured JSON output. This increases memory usage, timeout risk, throughput instability, and serving cost.
suggestion: Reduce max_out_len to a tighter range such as 256-512 and tune batch_size based on actual serving capacity and load-test results.

),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)

Collaborator

[review] issue: The file mixes judge protocol definition, prompt design, schema definition, model connection settings, and dataset wiring in one place. This increases coupling and makes future *_llmjudge.py configurations more likely to drift or duplicate logic.
suggestion: Extract shared judge schema, reusable prompt fragments, and common judge-model defaults into a shared module so dataset-specific files only define the parts that differ.

),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)

Collaborator

[review] issue: The file mixes judge protocol definition, prompt design, schema definition, model connection settings, and dataset wiring in one place. This increases coupling and makes future *_llmjudge.py configurations more likely to drift or duplicate logic.
suggestion: Extract shared judge schema, reusable prompt fragments, and common judge-model defaults into a shared module so dataset-specific files only define the parts that differ.

correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.

confidence: The extracted confidence score between 0|\%| and 100|\%| from [response]. Put 100 if there is no confidence score available.
""".strip()

Collaborator

[review] issue: The prompt relies heavily on natural-language compliance for answer extraction and correctness judgment, but it does not explicitly constrain ambiguous cases such as multiple final answers, malformed confidence text, or partially structured responses. That leaves room for non-deterministic judge behavior.
suggestion: Add explicit rules for ambiguous responses, multiple extracted answers, invalid confidence formats, and empty answers so the judge behavior is more deterministic and reproducible.

chat_template_kwargs=dict(
enable_thinking=False,
),
),

Collaborator

[review] issue: The prompt relies heavily on natural-language compliance for answer extraction and correctness judgment, but it does not explicitly constrain ambiguous cases such as multiple final answers, malformed confidence text, or partially structured responses. That leaves room for non-deterministic judge behavior.
suggestion: Add explicit rules for ambiguous responses, multiple extracted answers, invalid confidence formats, and empty answers so the judge behavior is more deterministic and reproducible.

infer_cfg=hle_infer_cfg,
judge_infer_cfg=hle_judge_infer_cfg,
eval_cfg=hle_eval_cfg,
)

Collaborator

[review] issue: The dataset path is hardcoded to a specific parquet file location. This reduces portability across environments and makes packaging, CI, and external reuse more brittle.
suggestion: Move the dataset path into an external config layer or resolve it through a dataset registry so the config is easier to reuse across machines and deployment environments.

logger.debug(f"Loading HLE dataset from: {resolved_path}")

if not os.path.exists(resolved_path):
raise FileNotFoundError(f"HLE parquet file not found: {resolved_path}")

Collaborator

[review] Issue: HLEDataset.load reads the entire parquet file into a pandas DataFrame and then iterates row by row to build a Python list before converting it again into a HuggingFace Dataset. This introduces unnecessary memory amplification and poor scalability for larger datasets.
Suggestion: Avoid the pandas-to-list-to-Dataset conversion chain. Prefer a direct parquet-to-dataset loading path or a vectorized transformation pipeline to reduce memory usage and improve throughput.


Returns:
Dictionary with 'content' (formatted prompt) and 'answer' fields.
"""

Collaborator

[review] Issue: The image field is appended directly into the message payload as image_url without any validation or normalization. If dataset content is not fully trusted, this can propagate unsafe URLs into downstream services and create SSRF or unintended remote fetch risk.
Suggestion: Validate image sources before adding them to the prompt payload. Restrict supported schemes and trusted domains, and normalize the representation expected by the downstream model interface.

logger = AISLogger()


def parse_hle_item(item: Dict) -> Dict:

Collaborator

[review] Issue: parse_hle_item silently falls back to empty strings for missing question, image, and answer. This makes bad input look valid and shifts data quality failures into later inference or evaluation stages, where root-cause analysis becomes harder.
Suggestion: Enforce required fields explicitly, or at minimum log and count malformed samples. Failing fast on structurally invalid records will improve observability and data quality control.
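
A small sketch of the fail-fast variant. It assumes that `question` and `answer` are mandatory while `image` may legitimately be empty for text-only items; that split is a guess, and `require_fields` is a hypothetical helper.

```python
REQUIRED_FIELDS = ("question", "answer")  # assumption: image may be absent

def require_fields(item, required=REQUIRED_FIELDS):
    """Raise ValueError naming every required field that is missing or empty."""
    missing = [name for name in required if not item.get(name)]
    if missing:
        raise ValueError(f"malformed HLE record, missing: {', '.join(missing)}")
    return item

try:
    require_fields({"question": "What is 2 + 2?", "image": ""})
except ValueError as exc:
    print(exc)
```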

cleaned = re.sub(r"[\n\t\r]+", " ", pred_str)
cleaned = re.sub(r"\s+", " ", cleaned)
logger.debug(f"\n cleaned_pred_input: {cleaned}")
try:

Collaborator

[review] Issue: parse_predictions tries to recover malformed JSON by flattening all newlines, tabs, and repeated whitespace before parsing. This can alter valid string content, hide formatting issues, and still fail on common LLM JSON defects such as trailing prose or fenced code blocks.
Suggestion: Use a stricter structured-output contract and parse the raw response first. If recovery is needed, implement targeted cleanup for known wrappers such as markdown fences rather than globally rewriting whitespace.
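
A sketch of the "parse raw first, then apply targeted cleanup" order, handling only one known wrapper: a single surrounding markdown code fence. `parse_judge_json` is a hypothetical helper, and other defects such as trailing prose are deliberately left to fail.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker
FENCED_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)

def parse_judge_json(raw):
    """Parse raw judge output, unwrapping one surrounding markdown fence if present."""
    try:
        return json.loads(raw)           # untouched input first
    except json.JSONDecodeError:
        match = FENCED_RE.match(raw)
        if match is None:
            raise                        # unknown defect: let the caller count it
        return json.loads(match.group(1))

wrapped = FENCE + 'json\n{"correct": "no", "confidence": 10}\n' + FENCE
print(parse_judge_json(wrapped))
```

Because whitespace inside the payload is never rewritten, string values that contain newlines or tabs survive unchanged.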

for pred_str in predictions:
logger.debug(f"\n original_pred_input: {pred_str}")
cleaned = re.sub(r"[\n\t\r]+", " ", pred_str)
cleaned = re.sub(r"\s+", " ", cleaned)

Collaborator

[review] Issue: Invalid judge outputs are logged and then skipped silently. Later metric computation still uses the original expected sample count, which can understate or distort results while hiding the true parse failure rate.
Suggestion: Surface parse failure statistics explicitly in the returned metrics, and consider treating excessive parse failures as an evaluation error instead of silently continuing.
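
A sketch of how the failure count could travel with the parsed results. `parse_with_stats` and the key names in `stats` are hypothetical and are not part of the PR's return format.

```python
import json

def parse_with_stats(predictions):
    """Parse each prediction as JSON and report how many failed."""
    parsed, failed = [], 0
    for raw in predictions:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            failed += 1
    total = len(predictions)
    stats = {
        "expected_sample_num": total,
        "valid_judge_sample_num": len(parsed),
        "parse_failure_rate": failed / total if total else 0.0,
    }
    return parsed, stats

_, stats = parse_with_stats(['{"correct": "yes", "confidence": 80}', "oops"])
print(stats)
```

A caller could then compare `parse_failure_rate` against a threshold of its choosing and abort the evaluation instead of reporting metrics.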

return np.abs(avg_conf - avg_correct)

bins = [[i * beta, (i + 1) * beta] for i in range(num_bins)]
bins[-1] = [bins[-1][0], len(confidence)]

Collaborator

[review] Issue: The binning logic in calib_err is flawed for datasets with at least beta samples. num_bins = len(confidence) // beta combined with for i in range(len(bins) - 1) means the final bin is never processed, so part of the dataset is excluded from calibration error calculation.
Suggestion: Iterate over all bins, not all bins minus one. This is a correctness bug in metric computation and should be fixed before relying on reported calibration values.

avg_correct = np.nanmean(correct)
return np.abs(avg_conf - avg_correct)

bins = [[i * beta, (i + 1) * beta] for i in range(num_bins)]

Collaborator

[review] Issue: The bin construction uses fixed-size chunks and then overwrites only the final bin end index. When len(confidence) is not an exact multiple of beta, the remainder samples are not assigned to any bin at all.
Suggestion: Build bins so the final range always includes the tail remainder, or use a simpler slice-based loop such as for start in range(0, len(confidence), beta) to cover the full dataset deterministically.
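
The slice-based loop mentioned in the suggestion can be sketched as follows; `slice_bins` is a hypothetical helper. Unlike the quoted construction, which stretches the final fixed-size bin to `len(confidence)`, this variant yields a separate, shorter tail bin, so the two are not numerically equivalent.

```python
def slice_bins(n, beta):
    """Return [start, end) index ranges of width beta that cover range(n)."""
    return [(start, min(start + beta, n)) for start in range(0, n, beta)]

print(slice_bins(250, 100))  # [(0, 100), (100, 200), (200, 250)]
```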

"confidence_half_width": "+/- 0%",
"calibration_error": 0,
"sample_num": n,
}

Collaborator

[review] Issue: dump_metrics divides by n without guarding against n == 0. An empty reference set will trigger division-by-zero errors in both accuracy and confidence interval calculations.
Suggestion: Add an explicit empty-input guard and return a well-defined zero-sample result before performing any metric math.
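
A sketch of the guard, assuming accuracy is reported as a percentage with a 95% normal-approximation half-width. That formula is an assumption here; the thread does not show the PR's exact computation. `accuracy_with_interval` is a hypothetical helper.

```python
import math

def accuracy_with_interval(num_correct, n):
    """Return (accuracy %, 95% half-width %) using the normal approximation."""
    if n == 0:
        return 0.0, 0.0  # well-defined zero-sample result, no division
    acc = 100.0 * num_correct / n
    half_width = 1.96 * math.sqrt(acc * (100.0 - acc) / n)
    return round(acc, 2), round(half_width, 2)

print(accuracy_with_interval(0, 0))
print(accuracy_with_interval(40, 500))
```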

if not judge_results:
logger.error(UTILS_CODES.UNKNOWN_ERROR, "No available judge_results")
return {
"accuracy": "0%",

Collaborator

[issue] Issue: The code logs a mismatch when parsed prediction count differs from expected count, but it still computes accuracy using the full expected count and the reduced parsed set. This couples data loss and metric bias in a way that is hard to interpret operationally.
Suggestion: Either fail the evaluation on count mismatch or return both expected_sample_num and valid_judge_sample_num so consumers can distinguish model performance from parser failure.

the evaluated predictions.

Args:
judge_results: List of judge response dicts with 'correct' and 'confidence'.

Collaborator

[review] Issue: Correctness is derived using "yes" in judge.get("correct", ""), which is too loose. Any unexpected string containing yes as a substring would be counted as correct, even if the field is malformed.
Suggestion: Use an exact normalized comparison such as str(value).strip().lower() == "yes" to keep the metric contract strict and predictable.
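
The difference between the substring test and the exact comparison shows up on a few inputs. `is_correct` is a hypothetical helper wrapping the expression from the suggestion.

```python
def is_correct(judge):
    """True only when the 'correct' field is exactly 'yes' after normalization."""
    return str(judge.get("correct", "")).strip().lower() == "yes"

for value in (" Yes ", "yes, but ambiguous", "eyes", None):
    loose = "yes" in str(value).lower()
    print(repr(value), "substring:", loose, "exact:", is_correct({"correct": value}))
```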


Returns:
Dictionary containing:
- accuracy: Percentage string with confidence interval

Collaborator

[review] Issue: Confidence values are appended without type validation and then converted using np.array(confidence) / 100. Non-numeric values from judge output can cause dtype degradation or runtime failures during metric computation.
Suggestion: Validate and coerce confidence values explicitly before aggregation, reject invalid entries with clear accounting, and clamp the accepted range to 0-100.
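
A sketch of explicit coercion with accounting, run before any array math. `coerce_confidences` is a hypothetical helper. Rejecting NaN and clamping everything else is one possible policy, not the PR's behavior.

```python
def coerce_confidences(values):
    """Return (accepted, rejected_count); accepted values are floats in [0, 100]."""
    accepted, rejected = [], 0
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            rejected += 1              # e.g. None or "high"
            continue
        if number != number:           # NaN never equals itself
            rejected += 1
            continue
        accepted.append(min(100.0, max(0.0, number)))
    return accepted, rejected

print(coerce_confidences([90, "85", "high", None, 140, -3]))
```

Returning the rejected count alongside the cleaned values lets the evaluator report how many judge outputs were unusable.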



@ICL_EVALUATORS.register_module()
class HLEJudgeEvaluator(LLMJudgeCorrectEvaluator):

Collaborator

[review] Issue: HLEJudgeEvaluator.score validates only list-length equality before parsing. It does not check response schema completeness, parse failure ratio, or confidence validity before returning metrics. This makes the evaluation path too permissive for a structured judge pipeline.
Suggestion: Add validation gates for parse success rate, required keys, and confidence range, and return explicit error states when judge outputs are not trustworthy enough for metric reporting.
