feat(datasets): add HLE dataset by ivanbao9783 · Pull Request #301 · AISBench/benchmark

ivanbao9783 · 2026-05-15T07:26:17Z

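PR Type

Related Issue
Relates to #(issue ID)

🔍 Motivation

Adapt the HLE (Humanity's Last Exam) multimodal visual question answering dataset to the AISBench evaluation framework. Evaluation is automated with a judge model, computes the full set of metrics (accuracy, confidence interval, calibration error), and stays consistent with the official HLE evaluation logic.

Main goals:

Support inference and evaluation on the HLE dataset
Fully implement the official calib_err() and dump_metrics() computation logic
Support multimodal (text + image) input handling
Provide documentation in Chinese and English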

📝 Modification

New files (4)

| File path | Description |
| --- | --- |
| `ais_bench/benchmark/datasets/hle.py` | HLE dataset and evaluator implementation, including `HLEDataset`, `HLEJGDataset`, `HLEJudgeEvaluator`, and functions such as `calib_err()` and `dump_metrics()` |
| `ais_bench/benchmark/configs/datasets/hle/hle_llmjudge.py` | HLE dataset configuration file, including `hle_reader_cfg`, `hle_infer_cfg`, `hle_judge_infer_cfg`, `hle_eval_cfg`, and related settings |
| `ais_bench/benchmark/configs/datasets/hle/README.md` | Chinese documentation |
| `ais_bench/benchmark/configs/datasets/hle/README_en.md` | English documentation |

Modified files (2)

| File path | Change |
| --- | --- |
| `ais_bench/benchmark/summarizers/default.py` | `_format_table()` now checks the value type so that string-valued metrics (e.g. "8.0% +/- 2.38%") are displayed as-is |
| `ais_bench/benchmark/datasets/__init__.py` | Imports the contents of the new hle module |

📐 Associated Test Results

[Test report] Integrating the HLE multimodal closed-book academic benchmark into AISBench

⚠️ BC-breaking (Optional)

NA

⚠️ Performance degradation (Optional)

NA

🌟 Use cases (Optional)

NA

✅ Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests, and the case that causes the bug is added to the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

Suggested Reviewers: @xxx
Relevant Module Owners: @xxx
Other Collaboration Notes:

🌟 Useful CI Command

| Command | Introduction |
| --- | --- |
| `/gemini review` | Performs a Gemini code review of the pull request in its current state. |
| `/gemini summary` | Provides a Gemini summary of the pull request in its current state. |
| `/gemini help` | Displays the list of available Gemini commands. |
| `/readthedocs build` | Triggers a Read the Docs documentation build for the pull request in its current state. |

gemini-code-assist

Code Review

This pull request introduces support for the HLE (Humanity's Last Exam) dataset, including documentation, LLM-judge configurations, and evaluation logic. Key additions include the HLEDataset and HLEJudgeEvaluator classes, along with updates to the summarizer to handle string-based metrics. Feedback from the review identifies a critical logic error in the calibration error calculation that skips the final data bin, suggests performance optimizations for dataset loading using vectorization, and recommends replacing print and assert statements with proper logging and exception handling.

gemini-code-assist · 2026-05-15T07:27:27Z

+
+    cerr = 0
+    total_examples = len(confidence)
+    for i in range(len(bins) - 1):


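Logic error: range(len(bins) - 1) causes the last bin to be skipped, so it never contributes to the result. The loop should iterate over all bins.

Suggested change

-    for i in range(len(bins) - 1):
+    for i in range(len(bins)):

The implementation is kept consistent with the official script.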
https://github.com/centerforaisafety/hle/blob/main/hle_eval/run_judge_results.py
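
To make the disputed loop bound concrete, the following is a minimal sketch rather than the PR's code. It reuses the bin construction and the `p` branches quoted in this thread, and assumes samples are sorted by confidence before binning. The function name `calib_err_sketch` and the `include_last_bin` switch are hypothetical and exist only to compare the two loop bounds side by side.

```python
import numpy as np


def calib_err_sketch(confidence, correct, p="2", beta=100, include_last_bin=False):
    """Binned calibration error; `include_last_bin` is a hypothetical switch."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Assumption: samples are ordered by confidence before binning.
    order = np.argsort(confidence, kind="stable")
    confidence, correct = confidence[order], correct[order]

    num_bins = len(confidence) // beta
    if num_bins == 0:
        # Fewer samples than one bin: fall back to a single global gap.
        return float(np.abs(np.nanmean(confidence) - np.nanmean(correct)))

    bins = [[i * beta, (i + 1) * beta] for i in range(num_bins)]
    bins[-1] = [bins[-1][0], len(confidence)]  # stretch the last bin over the tail

    # range(len(bins) - 1) versus range(len(bins)) is the point under review.
    last = len(bins) if include_last_bin else len(bins) - 1

    cerr = 0.0
    total = len(confidence)
    for start, end in bins[:last]:
        gap = abs(np.nanmean(confidence[start:end]) - np.nanmean(correct[start:end]))
        weight = (end - start) / total
        if p == "2":
            cerr += weight * gap ** 2
        elif p == "1":
            cerr += weight * gap
        elif p in ("infty", "infinity", "max"):
            cerr = max(cerr, gap)
        else:
            raise ValueError("p must be '1', '2', or 'infty'")

    return float(np.sqrt(cerr)) if p == "2" else float(cerr)
```

With eight samples and `beta=4`, where only the second bin is miscalibrated, the error is reported only when `include_last_bin=True`; with the `len(bins) - 1` bound that bin never enters the sum.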

gemini-code-assist · 2026-05-15T07:27:27Z

+        dataset = []
+        for i in range(len(data)):
+            line = data.iloc[i]
+            parsed_item = parse_hle_item(line.to_dict())
+            dataset.append(parsed_item)


Processing the DataFrame row by row with a loop and iloc is inefficient. Consider using to_dict('records') to vectorize the processing and speed up data loading.

Suggested change

-        dataset = []
-        for i in range(len(data)):
-            line = data.iloc[i]
-            parsed_item = parse_hle_item(line.to_dict())
-            dataset.append(parsed_item)
+        dataset = [parse_hle_item(row) for row in data.to_dict('records')]

Kept consistent with the parsing style used by other AISBench datasets.
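
For reference, a small sketch of the two loading styles on a toy DataFrame. `parse_item` is a hypothetical stand-in for `parse_hle_item`, and the column names are assumptions. Note that `to_dict("records")` is not vectorized in the NumPy sense; its benefit is that it avoids building a pandas Series for every row.

```python
import pandas as pd


def parse_item(row: dict) -> dict:
    # Hypothetical stand-in for parse_hle_item; the real field handling may differ.
    return {"content": row["question"], "answer": row["answer"]}


data = pd.DataFrame({"question": ["q1", "q2"], "answer": ["a1", "a2"]})

# Style used in the PR: positional access, one Series per row.
loop_result = []
for i in range(len(data)):
    loop_result.append(parse_item(data.iloc[i].to_dict()))

# Style suggested in the review: convert once, then iterate plain dicts.
records_result = [parse_item(row) for row in data.to_dict("records")]
```

Both produce the same list for this toy input, so the choice is mainly about speed and consistency with the rest of the codebase.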

gemini-code-assist · 2026-05-15T07:27:27Z

+                }
+            )
+        except json.JSONDecodeError:
+            print(f"Error: wrong format prediction: {cleaned}")


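Consider using logger.error instead of print so the message integrates with the system's log management.

Suggested change

-            print(f"Error: wrong format prediction: {cleaned}")
+            logger.error(f"wrong format prediction: {cleaned}")

This is key information, and we would like it printed directly to the screen.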

gemini-code-assist · 2026-05-15T07:27:28Z

+            elif p == "infty" or p == "infinity" or p == "max":
+                cerr = np.maximum(cerr, difference)
+            else:
+                assert False, "p must be '1', '2', or 'infty'"


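Using assert False to handle an expected error case is not recommended. Raise a ValueError instead.

Suggested change

-                assert False, "p must be '1', '2', or 'infty'"
+                raise ValueError("p must be '1', '2', or 'infty'")

The implementation is kept consistent with the official script.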
https://github.com/centerforaisafety/hle/blob/main/hle_eval/run_judge_results.py

Copilot

Pull request overview

This PR adds support for the HLE (Humanity’s Last Exam) multimodal VQA dataset to AISBench, including dataset loading, LLM-judge-based evaluation (accuracy + calibration metrics), corresponding configs/docs, and summarizer support for string-formatted metrics.

Changes:

Added HLEDataset / HLEJGDataset and HLEJudgeEvaluator with metric utilities (calib_err, dump_metrics).
Added HLE dataset config and bilingual README docs.
Updated default summarizer to allow string metric values and display them without numeric formatting.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| `ais_bench/benchmark/datasets/hle.py` | New HLE dataset + judge evaluator implementation and metric computations. |
| `ais_bench/benchmark/configs/datasets/hle/hle_llmjudge.py` | New HLE task configuration for inference + judge inference + evaluation. |
| `ais_bench/benchmark/configs/datasets/hle/README.md` | New Chinese documentation for deploying/using HLE. |
| `ais_bench/benchmark/configs/datasets/hle/README_en.md` | New English documentation for deploying/using HLE. |
| `ais_bench/benchmark/summarizers/default.py` | Allows string-valued metrics and prints them as-is in summary tables. |
| `ais_bench/benchmark/datasets/__init__.py` | Exposes HLE dataset module via package import. |
| `tests/UT/datasets/test_hle.py` | Adds unit tests for HLE parsing, metrics, and evaluator scoring. |
| `tests/UT/summarizers/test_default.py` | Updates summarizer unit test expectations for string metric handling. |

Comments suppressed due to low confidence (1)

ais_bench/benchmark/summarizers/default.py:116

Allowing any string-valued key through _pick_up_results can accidentally treat non-metric string fields as metrics (and makes the existing "unknown result format" path unreachable in some cases). Consider only allowing str for known metric keys (e.g. those in METRIC_WHITELIST) or for specific metrics like accuracy, instead of all keys.

                _rst, _dm = {}, []
                for metric, score in result.items():
                    if metric not in METRIC_BLACKLIST and isinstance(score, (int, float, str)):
                        _rst[metric] = score
                        _dm.append(metric)
                    else:
                        continue
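
One way to narrow the check, sketched under assumptions: numeric scores pass as before, while string scores pass only for an explicit set of keys. `STRING_METRIC_KEYS`, `pick_up_metrics`, and the placeholder blacklist contents are hypothetical; the real `METRIC_BLACKLIST` and `METRIC_WHITELIST` in `default.py` may look different.

```python
from typing import Dict, List, Tuple

METRIC_BLACKLIST = {"type"}          # placeholder contents, assumed for the example
STRING_METRIC_KEYS = {"accuracy"}    # hypothetical: keys allowed to carry str values


def pick_up_metrics(result: Dict[str, object]) -> Tuple[Dict[str, object], List[str]]:
    """Keep numeric metrics, plus string metrics for explicitly allowed keys."""
    picked, names = {}, []
    for metric, score in result.items():
        if metric in METRIC_BLACKLIST:
            continue
        is_number = isinstance(score, (int, float))
        is_allowed_str = isinstance(score, str) and metric in STRING_METRIC_KEYS
        if is_number or is_allowed_str:
            picked[metric] = score
            names.append(metric)
    return picked, names
```

With this shape, a free-text field such as `details` is no longer reported as a metric, while `"8.0% +/- 2.38%"` under `accuracy` still is.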


ivanbao9783 · 2026-05-19T01:43:52Z

+        return HLEDataset
+
+
+def parse_predictions(predictions: list) -> list[dict]:


ivanbao9783 · 2026-05-19T01:45:00Z

+
+    cerr = 0
+    total_examples = len(confidence)
+    for i in range(len(bins) - 1):


The implementation is kept consistent with the official script.
https://github.com/centerforaisafety/hle/blob/main/hle_eval/run_judge_results.py

ivanbao9783 · 2026-05-19T01:44:27Z

+            elif p == "infty" or p == "infinity" or p == "max":
+                cerr = np.maximum(cerr, difference)
+            else:
+                assert False, "p must be '1', '2', or 'infty'"


ivanbao9783 · 2026-05-19T01:44:35Z

+                }
+            )
+        except json.JSONDecodeError:
+            print(f"Error: wrong format prediction: {cleaned}")
+            continue


ivanbao9783 · 2026-05-19T01:45:30Z

+from ais_bench.benchmark.openicl.icl_evaluator.icl_base_evaluator import \
+    BaseEvaluator


ivanbao9783 · 2026-05-19T06:54:23Z

+## Available Dataset Tasks
+
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- |
+| hle | HLE dataset | Accuracy, Calibration Error | 0-shot | Chat format | hle_llmjudge.py |
+


Not an issue; the table in the md file renders correctly.

ivanbao9783 · 2026-05-19T01:49:20Z

+
+    correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.
+
+    confidence: The extracted confidence score between 0|\%| and 100|\%| from [response]. Put 100 if there is no confidence score available.


Kept consistent with the official prompt.

ivanbao9783 · 2026-05-19T06:55:04Z

        # Verify results
        self.assertIn("test_model", raw_results)
        self.assertIn("test_ds", raw_results["test_model"])

        self.assertIn("test_model", parsed_results)
-        self.assertNotIn("test_ds", parsed_results["test_model"])
+        self.assertIn("test_ds", parsed_results["test_model"])



Not an issue; the UT case was adapted for this single-point change and does not validate str values.

ivanbao9783 · 2026-05-19T01:51:10Z

+import unittest
+import json
+import os
+import tempfile
+import numpy as np
+import pandas as pd
+from unittest.mock import patch
+from datasets import Dataset
+
+from ais_bench.benchmark.datasets.hle import (


ivanbao9783 · 2026-05-19T01:51:18Z

+from ais_bench.benchmark.utils.postprocess.model_postprocessors import \
+    extract_non_reasoning_content


SJTUyh · 2026-05-18T11:12:19Z

+                }
+            )
+        except json.JSONDecodeError:
+            print(f"Error: wrong format prediction: {cleaned}")


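[review] 1. logger.error and print behave the same way here: outside debug mode, stdout inside this subprocess is redirected to a file, so print cannot write directly to the screen either. The "Error" emitted by print is also not conspicuous enough.
I therefore suggest replacing it with logger.error.

Changed to logger.error.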

SJTUyh · 2026-05-18T11:26:30Z

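+# Download with modelscope (requires modelscope to be installed)
+modelscope download --dataset cais/hle --local_dir {tool_root_path}/ais_bench/datasets/hle/
+
+# Download with huggingface-cli (requires transformers to be installed and a logged-in session)
+huggingface-cli download cais/hle --repo-type dataset --local-dir {tool_root_path}/ais_bench/datasets/hle/


[review] 1. modelscope is not an AISBench dependency, so this command requires installing modelscope first.
2. huggingface-cli is an AISBench dependency, but newer versions ship the hf binary instead, so the huggingface-cli command does not work.
3. On networks in mainland China, huggingface-cli may not be able to complete the download.

I therefore suggest following the other datasets: state only where the data can be obtained, without concrete download commands.

1. Removed unreferenced dependencies
2. Replaced print with logger.error
3. Revised the README to provide only the data source
4. Changed the default return value of dump_metrics for the invalid judge_results case
5. Changed the return type of parse_predictions from list[dict] to List[Dict[str, Any]]
6. Changed the exception raised in calib_err to ValueError
7. Added HLE dataset material to get_started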

Libotry · 2026-05-21T12:57:29Z

+            "additionalProperties": False,
+        },
+        "name": "ExtractedAnswer",
+        "strict": True,


[review] issue: The schema sets "strict": True at the JSON-schema wrapper level and also requires a strict field inside the model output. That field is not consumed by the evaluator, but it increases output complexity and raises the chance of schema validation failures.
suggestion: Keep only the top-level json_schema.strict=True and remove the inner strict property and its required entry.

Libotry · 2026-05-21T13:00:06Z

+                    "enum": ["yes", "no"],
+                    "type": "string",
+                },
+                "confidence": {"title": "Confidence", "type": "integer"},


[review] issue: The confidence field is defined only as an integer without a valid range. The downstream evaluator assumes the value is between 0 and 100, so out-of-range values can silently corrupt calibration metrics.
suggestion: Add minimum: 0 and maximum: 100 to the schema, and add defensive validation or clamping in the evaluator before metric calculation.

Libotry · 2026-05-21T13:01:09Z

+                    role="HUMAN",
+                    prompt_mm={
+                        "text": {"type": "text", "text": "{question}"},
+                        "image": {"type": "image_url", "image_url": {"url": "{image}"}},


[review] issue: The dataset image field is forwarded directly as image_url to the remote model service. If the dataset source is not fully trusted, this creates SSRF and internal resource access risk on the backend service side.
suggestion: Restrict allowed image sources with protocol and domain allowlists, and prefer controlled local-path mapping or preprocessed data URIs instead of arbitrary remote URLs.
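
A minimal sketch of such an allowlist check, assuming images arrive either as `data:image/...` URIs or as HTTPS URLs on known hosts. The function name, `ALLOWED_SCHEMES`, and `ALLOWED_HOSTS` are hypothetical, and the host shown is a placeholder rather than a real data source for HLE.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https", "data"}   # assumption: plain http and file are rejected
ALLOWED_HOSTS = {"example.com"}       # placeholder host, not a real HLE source


def is_allowed_image_source(url: str) -> bool:
    """Return True only for inline image data or HTTPS URLs on allowed hosts."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False
    if parsed.scheme == "data":
        # Inline payloads never trigger a remote fetch; require an image media type.
        return url.startswith("data:image/")
    return parsed.hostname in ALLOWED_HOSTS
```

A check like this would run before the `{image}` value is placed into `image_url`, so that link-local or internal addresses never reach the model service.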

Libotry · 2026-05-21T13:06:39Z

+        host_port=8080,
+        url="",
+        max_out_len=8192,
+        batch_size=100,


[review] issue: The judge model configuration depends on multiple empty placeholders and localhost defaults, such as model="", api_key="", and url="". This creates hidden runtime dependencies and makes failures environment-dependent and harder to diagnose.
suggestion: Require at least one of model or url, load sensitive and environment-specific values from external configuration or environment variables, and fail fast with explicit validation errors.
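
As an illustration of the fail-fast idea, the sketch below reads the judge settings from environment variables and raises immediately when neither a model nor a URL is provided. The variable names `JUDGE_MODEL`, `JUDGE_URL`, and `JUDGE_API_KEY` and the helper `load_judge_settings` are hypothetical; AISBench may already provide its own configuration mechanism.

```python
import os
from typing import Dict, Mapping, Optional


def load_judge_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read judge connection settings and fail fast when they are unusable."""
    env = os.environ if env is None else env
    model = env.get("JUDGE_MODEL", "").strip()      # hypothetical variable name
    url = env.get("JUDGE_URL", "").strip()          # hypothetical variable name
    api_key = env.get("JUDGE_API_KEY", "").strip()  # hypothetical variable name

    if not model and not url:
        raise ValueError("Set at least one of JUDGE_MODEL or JUDGE_URL")
    return {"model": model, "url": url, "api_key": api_key}
```

Passing a mapping explicitly keeps the function easy to test without touching the real process environment.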

Libotry · 2026-05-21T13:08:12Z

+        host_ip="localhost",
+        host_port=8080,
+        url="",
+        max_out_len=8192,


[review] issue: max_out_len=8192 and batch_size=100 are aggressive defaults for a judge task that only needs short structured JSON output. This increases memory usage, timeout risk, throughput instability, and serving cost.
suggestion: Reduce max_out_len to a tighter range such as 256-512 and tune batch_size based on actual serving capacity and load-test results.

Libotry · 2026-05-21T13:10:36Z

+    ),
+    retriever=dict(type=ZeroRetriever),
+    inferencer=dict(type=GenInferencer),
+)


[review] issue: The file mixes judge protocol definition, prompt design, schema definition, model connection settings, and dataset wiring in one place. This increases coupling and makes future *_llmjudge.py configurations more likely to drift or duplicate logic.
suggestion: Extract shared judge schema, reusable prompt fragments, and common judge-model defaults into a shared module so dataset-specific files only define the parts that differ.

Libotry · 2026-05-21T13:11:56Z

+    correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.
+
+    confidence: The extracted confidence score between 0|\%| and 100|\%| from [response]. Put 100 if there is no confidence score available.
+""".strip()


[review] issue: The prompt relies heavily on natural-language compliance for answer extraction and correctness judgment, but it does not explicitly constrain ambiguous cases such as multiple final answers, malformed confidence text, or partially structured responses. That leaves room for non-deterministic judge behavior.
suggestion: Add explicit rules for ambiguous responses, multiple extracted answers, invalid confidence formats, and empty answers so the judge behavior is more deterministic and reproducible.

Libotry · 2026-05-21T13:16:20Z

+            chat_template_kwargs=dict(
+                enable_thinking=False,
+            ),
+        ),


[review] issue: The prompt relies heavily on natural-language compliance for answer extraction and correctness judgment, but it does not explicitly constrain ambiguous cases such as multiple final answers, malformed confidence text, or partially structured responses. That leaves room for non-deterministic judge behavior.
suggestion: Add explicit rules for ambiguous responses, multiple extracted answers, invalid confidence formats, and empty answers so the judge behavior is more deterministic and reproducible.

Libotry · 2026-05-21T13:17:57Z

+        infer_cfg=hle_infer_cfg,
+        judge_infer_cfg=hle_judge_infer_cfg,
+        eval_cfg=hle_eval_cfg,
+    )


[review] issue: The dataset path is hardcoded to a specific parquet file location. This reduces portability across environments and makes packaging, CI, and external reuse more brittle.
suggestion: Move the dataset path into an external config layer or resolve it through a dataset registry so the config is easier to reuse across machines and deployment environments.

Libotry · 2026-05-21T13:34:16Z

+        logger.debug(f"Loading HLE dataset from: {resolved_path}")
+
+        if not os.path.exists(resolved_path):
+            raise FileNotFoundError(f"HLE parquet file not found: {resolved_path}")


[review] Issue: HLEDataset.load reads the entire parquet file into a pandas DataFrame and then iterates row by row to build a Python list before converting it again into a HuggingFace Dataset. This introduces unnecessary memory amplification and poor scalability for larger datasets.
Suggestion: Avoid the pandas-to-list-to-Dataset conversion chain. Prefer a direct parquet-to-dataset loading path or a vectorized transformation pipeline to reduce memory usage and improve throughput.

Libotry · 2026-05-21T13:37:02Z

+
+    Returns:
+        Dictionary with 'content' (formatted prompt) and 'answer' fields.
+    """


[review] Issue: The image field is appended directly into the message payload as image_url without any validation or normalization. If dataset content is not fully trusted, this can propagate unsafe URLs into downstream services and create SSRF or unintended remote fetch risk.
Suggestion: Validate image sources before adding them to the prompt payload. Restrict supported schemes and trusted domains, and normalize the representation expected by the downstream model interface.

Libotry · 2026-05-21T13:38:03Z

+logger = AISLogger()
+
+
+def parse_hle_item(item: Dict) -> Dict:


[review] Issue: parse_hle_item silently falls back to empty strings for missing question, image, and answer. This makes bad input look valid and shifts data quality failures into later inference or evaluation stages, where root-cause analysis becomes harder.
Suggestion: Enforce required fields explicitly, or at minimum log and count malformed samples. Failing fast on structurally invalid records will improve observability and data quality control.
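
A possible shape for the "log and count" variant, as a sketch only. It assumes `question` and `answer` are required while `image` may be empty for text-only items; `REQUIRED_FIELDS` and `split_valid_items` are hypothetical names.

```python
from typing import Dict, List, Tuple

REQUIRED_FIELDS = ("question", "answer")  # assumption: image is optional


def split_valid_items(items: List[Dict]) -> Tuple[List[Dict], List[int]]:
    """Separate usable records from records with missing or empty required fields."""
    valid, bad_indices = [], []
    for idx, item in enumerate(items):
        missing = [key for key in REQUIRED_FIELDS if not item.get(key)]
        if missing:
            bad_indices.append(idx)  # the caller can log or raise on these
        else:
            valid.append(item)
    return valid, bad_indices
```

Returning the offending indices lets the loader decide whether to raise, or to log the count and continue.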

Libotry · 2026-05-21T13:39:13Z

+        cleaned = re.sub(r"[\n\t\r]+", " ", pred_str)
+        cleaned = re.sub(r"\s+", " ", cleaned)
+        logger.debug(f"\n cleaned_pred_input: {cleaned}")
+        try:


[review] Issue: parse_predictions tries to recover malformed JSON by flattening all newlines, tabs, and repeated whitespace before parsing. This can alter valid string content, hide formatting issues, and still fail on common LLM JSON defects such as trailing prose or fenced code blocks.
Suggestion: Use a stricter structured-output contract and parse the raw response first. If recovery is needed, implement targeted cleanup for known wrappers such as markdown fences rather than globally rewriting whitespace.
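
The sketch below shows one way to do targeted recovery: parse the raw text first, and only if that fails, unwrap a single surrounding markdown fence. The helper names are hypothetical, and the fence marker is built at runtime purely so that this example contains no literal fence.

```python
import json
import re
from typing import Any, Dict, Optional

FENCE = "`" * 3  # three backticks, assembled at runtime
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def _unwrap_fence(text: str) -> str:
    """Return the body of a single surrounding code fence, or the text unchanged."""
    match = _FENCED.match(text)
    return match.group(1) if match else text


def parse_judge_output(raw: str) -> Optional[Dict[str, Any]]:
    """Try the raw response first, then the fence-unwrapped form; None on failure."""
    text = raw.strip()
    for candidate in (text, _unwrap_fence(text)):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Unlike global whitespace rewriting, this leaves string values inside the JSON untouched and reports anything else as a parse failure.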

Libotry · 2026-05-21T13:39:50Z

+    for pred_str in predictions:
+        logger.debug(f"\n original_pred_input: {pred_str}")
+        cleaned = re.sub(r"[\n\t\r]+", " ", pred_str)
+        cleaned = re.sub(r"\s+", " ", cleaned)


[review] Issue: Invalid judge outputs are logged and then skipped silently. Later metric computation still uses the original expected sample count, which can understate or distort results while hiding the true parse failure rate.
Suggestion: Surface parse failure statistics explicitly in the returned metrics, and consider treating excessive parse failures as an evaluation error instead of silently continuing.

Libotry · 2026-05-22T06:30:03Z

+        return np.abs(avg_conf - avg_correct)
+
+    bins = [[i * beta, (i + 1) * beta] for i in range(num_bins)]
+    bins[-1] = [bins[-1][0], len(confidence)]


[review] Issue: The binning logic in calib_err is flawed for datasets with at least beta samples. num_bins = len(confidence) // beta combined with for i in range(len(bins) - 1) means the final bin is never processed, so part of the dataset is excluded from calibration error calculation.
Suggestion: Iterate over all bins, not all bins minus one. This is a correctness bug in metric computation and should be fixed before relying on reported calibration values.

Libotry · 2026-05-22T06:31:11Z

+        avg_correct = np.nanmean(correct)
+        return np.abs(avg_conf - avg_correct)
+
+    bins = [[i * beta, (i + 1) * beta] for i in range(num_bins)]


[review] Issue: The bin construction uses fixed-size chunks and then overwrites only the final bin end index. When len(confidence) is not an exact multiple of beta, the remainder samples are not assigned to any bin at all.
Suggestion: Build bins so the final range always includes the tail remainder, or use a simpler slice-based loop such as for start in range(0, len(confidence), beta) to cover the full dataset deterministically.
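
For comparison, a short sketch of both bin layouts. `quoted_style_bins` mirrors the two construction lines quoted above and assumes at least `beta` samples; `slice_bins` is the slice-based alternative from the suggestion. Both function names are hypothetical. In the quoted construction the overwrite of `bins[-1]` stretches the final bin to `len(confidence)`, so the tail does land in the last bin; whether that bin is counted then depends on the loop bound.

```python
from typing import List, Tuple


def quoted_style_bins(n: int, beta: int) -> List[Tuple[int, int]]:
    """Bin ranges as built in the quoted code; assumes n >= beta."""
    bins = [[i * beta, (i + 1) * beta] for i in range(n // beta)]
    bins[-1] = [bins[-1][0], n]  # the last bin absorbs the remainder
    return [(start, end) for start, end in bins]


def slice_bins(n: int, beta: int) -> List[Tuple[int, int]]:
    """Slice-based ranges: the remainder becomes its own, smaller bin."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return [(start, min(start + beta, n)) for start in range(0, n, beta)]
```

The two layouts weight the tail differently, so switching to the slice-based form would change reported values relative to the official script even where both cover every sample.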

Libotry · 2026-05-22T06:31:57Z

+            "confidence_half_width": "+/- 0%",
+            "calibration_error": 0,
+            "sample_num": n,
+        }


[review] Issue: dump_metrics divides by n without guarding against n == 0. An empty reference set will trigger division-by-zero errors in both accuracy and confidence interval calculations.
Suggestion: Add an explicit empty-input guard and return a well-defined zero-sample result before performing any metric math.

Libotry · 2026-05-22T06:32:40Z

+    if not judge_results:
+        logger.error(UTILS_CODES.UNKNOWN_ERROR, "No available judge_results")
+        return {
+            "accuracy": "0%",


[issue] Issue: The code logs a mismatch when parsed prediction count differs from expected count, but it still computes accuracy using the full expected count and the reduced parsed set. This couples data loss and metric bias in a way that is hard to interpret operationally.
Suggestion: Either fail the evaluation on count mismatch or return both expected_sample_num and valid_judge_sample_num so consumers can distinguish model performance from parser failure.
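
A sketch of what reporting both counts could look like. The key names follow the suggestion above; `summarize_counts`, `parse_failure_num`, and the two accuracy keys are hypothetical additions, and the exact-match test on `correct` is an assumption about the judge output.

```python
from typing import Dict, List


def summarize_counts(expected: int, judge_results: List[dict]) -> Dict[str, float]:
    """Report accuracy against both the expected and the successfully parsed counts."""
    valid = len(judge_results)
    n_correct = sum(1 for judge in judge_results if judge.get("correct") == "yes")
    return {
        "expected_sample_num": expected,
        "valid_judge_sample_num": valid,
        "parse_failure_num": expected - valid,
        # Guard both divisions so empty inputs yield zeros instead of errors.
        "accuracy_over_expected": 100.0 * n_correct / expected if expected else 0.0,
        "accuracy_over_valid": 100.0 * n_correct / valid if valid else 0.0,
    }
```

Seeing the two accuracies side by side makes it clear how much of a low score comes from the model and how much from unparseable judge output.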

Libotry · 2026-05-25T14:54:14Z

+    the evaluated predictions.
+
+    Args:
+        judge_results: List of judge response dicts with 'correct' and 'confidence'.


[review] Issue: Correctness is derived using "yes" in judge.get("correct", ""), which is too loose. Any unexpected string containing yes as a substring would be counted as correct, even if the field is malformed.
Suggestion: Use an exact normalized comparison such as str(value).strip().lower() == "yes" to keep the metric contract strict and predictable.
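
A tiny sketch of the difference between the substring test and the exact comparison; `is_correct` is a hypothetical helper name.

```python
def is_correct(value: object) -> bool:
    """Exact, case-insensitive match on the judge's 'correct' field."""
    return str(value).strip().lower() == "yes"


# The substring test accepts malformed values that merely contain "yes".
loose_match = "yes" in "yes/no unclear"
strict_match = is_correct("yes/no unclear")
```

Here `loose_match` is `True` while `strict_match` is `False`; values such as `" Yes "` still pass the strict check after normalization.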

Libotry · 2026-05-25T14:54:54Z

+
+    Returns:
+        Dictionary containing:
+            - accuracy: Percentage string with confidence interval


[review] Issue: Confidence values are appended without type validation and then converted using np.array(confidence) / 100. Non-numeric values from judge output can cause dtype degradation or runtime failures during metric computation.
Suggestion: Validate and coerce confidence values explicitly before aggregation, reject invalid entries with clear accounting, and clamp the accepted range to 0-100.
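
One way to coerce and account for confidence values before they reach `np.array(confidence) / 100`, sketched with a hypothetical helper `coerce_confidences`. It assumes numeric strings such as `"85"` should be accepted and that out-of-range numbers are clamped rather than rejected.

```python
import math
from typing import Any, Iterable, List, Tuple


def coerce_confidences(values: Iterable[Any]) -> Tuple[List[float], int]:
    """Return confidences scaled to [0, 1] and the number of rejected entries."""
    accepted: List[float] = []
    rejected = 0
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            rejected += 1  # non-numeric, e.g. "high" or None
            continue
        if math.isnan(number) or math.isinf(number):
            rejected += 1
            continue
        clamped = min(100.0, max(0.0, number))
        accepted.append(clamped / 100.0)
    return accepted, rejected
```

The rejected count can then be surfaced next to the metrics instead of being lost during array conversion.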

Libotry · 2026-05-25T14:56:43Z

+
+
+@ICL_EVALUATORS.register_module()
+class HLEJudgeEvaluator(LLMJudgeCorrectEvaluator):


[review] Issue: HLEJudgeEvaluator.score validates only list-length equality before parsing. It does not check response schema completeness, parse failure ratio, or confidence validity before returning metrics. This makes the evaluation path too permissive for a structured judge pipeline.
Suggestion: Add validation gates for parse success rate, required keys, and confidence range, and return explicit error states when judge outputs are not trustworthy enough for metric reporting.

github-actions bot added the docs and feature labels May 15, 2026

gemini-code-assist bot reviewed May 15, 2026

ivanbao9783 force-pushed the feat/hle-dataset branch 4 times, most recently from 3e69213 to 77731e2 May 18, 2026 07:40

feat(datasets): add HLE datasets(code & doc & UT)

d420ea4

ivanbao9783 force-pushed the feat/hle-dataset branch from 77731e2 to d420ea4 May 18, 2026 07:42

Merge branch 'master' into feat/hle-dataset

19f632d

github-actions bot added the test-cases label May 18, 2026

zhongzhouTan-coder requested a review from Copilot May 18, 2026 07:57

Copilot started reviewing on behalf of zhongzhouTan-coder May 18, 2026 07:58

Copilot AI reviewed May 18, 2026

SJTUyh reviewed May 18, 2026

ivanbao9783 force-pushed the feat/hle-dataset branch from ed57500 to ae2fbce May 20, 2026 02:24

SJTUyh approved these changes May 20, 2026

ivanbao9783 temporarily deployed to smoke-test-approval May 20, 2026 03:00 with GitHub Actions (inactive)

SJTUyh merged commit 58e0e8e into AISBench:master May 20, 2026
6 checks passed

Libotry reviewed May 21, 2026

Libotry reviewed May 22, 2026

Libotry reviewed May 25, 2026

SJTUyh mentioned this pull request Jun 24, 2026

[RFC] Integrating the HLE (Humanity's Last Exam) benchmark into AISBench #281

Closed
