feat(datasets): add HLE dataset #301
Code Review
This pull request introduces support for the HLE (Humanity's Last Exam) dataset, including documentation, LLM-judge configurations, and evaluation logic. Key additions include the HLEDataset and HLEJudgeEvaluator classes, along with updates to the summarizer to handle string-based metrics. Feedback from the review identifies a critical logic error in the calibration error calculation that skips the final data bin, suggests performance optimizations for dataset loading using vectorization, and recommends replacing print and assert statements with proper logging and exception handling.
    cerr = 0
    total_examples = len(confidence)
    for i in range(len(bins) - 1):
    dataset = []
    for i in range(len(data)):
        line = data.iloc[i]
        parsed_item = parse_hle_item(line.to_dict())
        dataset.append(parsed_item)
Kept as is, to stay consistent with the parsing style used by the other AISBench datasets.
            }
        )
    except json.JSONDecodeError:
        print(f"Error: wrong format prediction: {cleaned}")
    elif p == "infty" or p == "infinity" or p == "max":
        cerr = np.maximum(cerr, difference)
    else:
        assert False, "p must be '1', '2', or 'infty'"
3e69213 to 77731e2
77731e2 to d420ea4
Pull request overview
This PR adds support for the HLE (Humanity’s Last Exam) multimodal VQA dataset to AISBench, including dataset loading, LLM-judge-based evaluation (accuracy + calibration metrics), corresponding configs/docs, and summarizer support for string-formatted metrics.
Changes:
- Added HLEDataset / HLEJGDataset and HLEJudgeEvaluator with metric utilities (calib_err, dump_metrics).
- Added HLE dataset config and bilingual README docs.
- Updated default summarizer to allow string metric values and display them without numeric formatting.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| ais_bench/benchmark/datasets/hle.py | New HLE dataset + judge evaluator implementation and metric computations. |
| ais_bench/benchmark/configs/datasets/hle/hle_llmjudge.py | New HLE task configuration for inference + judge inference + evaluation. |
| ais_bench/benchmark/configs/datasets/hle/README.md | New Chinese documentation for deploying/using HLE. |
| ais_bench/benchmark/configs/datasets/hle/README_en.md | New English documentation for deploying/using HLE. |
| ais_bench/benchmark/summarizers/default.py | Allows string-valued metrics and prints them as-is in summary tables. |
| ais_bench/benchmark/datasets/__init__.py | Exposes HLE dataset module via package import. |
| tests/UT/datasets/test_hle.py | Adds unit tests for HLE parsing, metrics, and evaluator scoring. |
| tests/UT/summarizers/test_default.py | Updates summarizer unit test expectations for string metric handling. |
Comments suppressed due to low confidence (1)
ais_bench/benchmark/summarizers/default.py:116
- Allowing any string-valued key through _pick_up_results can accidentally treat non-metric string fields as metrics (and makes the existing "unknown result format" path unreachable in some cases). Consider only allowing str for known metric keys (e.g. those in METRIC_WHITELIST) or for specific metrics like accuracy, instead of all keys.
    _rst, _dm = {}, []
    for metric, score in result.items():
        if metric not in METRIC_BLACKLIST and isinstance(score, (int, float, str)):
            _rst[metric] = score
            _dm.append(metric)
        else:
            continue
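A minimal sketch of the narrower filter suggested in this comment, assuming a separate allowlist for string-valued metrics. The names `STR_METRIC_KEYS` and `pick_up_metrics`, and the contents of both sets, are hypothetical and are not taken from the summarizer module.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical contents; the real blacklist lives in the summarizer module.
METRIC_BLACKLIST = {"bp", "sys_len", "ref_len"}
# Hypothetical allowlist of metrics that are allowed to be strings.
STR_METRIC_KEYS = {"accuracy", "confidence_half_width"}


def pick_up_metrics(result: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Keep numeric metrics, plus string metrics only when allowlisted."""
    picked: Dict[str, Any] = {}
    order: List[str] = []
    for metric, score in result.items():
        if metric in METRIC_BLACKLIST:
            continue
        is_number = isinstance(score, (int, float))
        is_known_str = isinstance(score, str) and metric in STR_METRIC_KEYS
        if is_number or is_known_str:
            picked[metric] = score
            order.append(metric)
    return picked, order
```

With this shape, a free-text field such as a note or a path is dropped instead of being rendered as a metric column.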
    return HLEDataset


    def parse_predictions(predictions: list) -> list[dict]:
            }
        )
    except json.JSONDecodeError:
        print(f"Error: wrong format prediction: {cleaned}")
        continue
    from ais_bench.benchmark.openicl.icl_evaluator.icl_base_evaluator import \
        BaseEvaluator
## Available Dataset Tasks

| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
| --- | --- | --- | --- | --- | --- |
| hle | HLE dataset | Accuracy, Calibration Error | 0-shot | Chat format | hle_llmjudge.py |
Not an issue: the table renders correctly in the Markdown file.
    correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.

    confidence: The extracted confidence score between 0|\%| and 100|\%| from [response]. Put 100 if there is no confidence score available.
    # Verify the results
    self.assertIn("test_model", raw_results)
    self.assertIn("test_ds", raw_results["test_model"])

    self.assertIn("test_model", parsed_results)
    self.assertNotIn("test_ds", parsed_results["test_model"])
    self.assertIn("test_ds", parsed_results["test_model"])
Not an issue: the UT case is adapted for this single-point change and does not add str validation.
    import unittest
    import json
    import os
    import tempfile
    import numpy as np
    import pandas as pd
    from unittest.mock import patch
    from datasets import Dataset

    from ais_bench.benchmark.datasets.hle import (
    from ais_bench.benchmark.utils.postprocess.model_postprocessors import \
        extract_non_reasoning_content
            }
        )
    except json.JSONDecodeError:
        print(f"Error: wrong format prediction: {cleaned}")
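[review] 1. logger.error and print behave the same way here: outside debug mode, stdout inside this subprocess is redirected to a file, so print cannot write directly to the screen either. An "Error" emitted through print is also not conspicuous enough.
The suggestion is therefore to replace it with logger.error.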
Changed to logger.error.
    # Download with modelscope (requires modelscope to be installed)
    modelscope download --dataset cais/hle --local_dir {tool_root_path}/ais_bench/datasets/hle/

    # Download with huggingface-cli (requires transformers to be installed and a logged-in session)
    huggingface-cli download cais/hle --repo-type dataset --local-dir {tool_root_path}/ais_bench/datasets/hle/
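[review] 1. modelscope is not an AISBench dependency, so this command requires installing modelscope separately.
2. huggingface-cli is an AISBench dependency, but newer versions use the hf binary, so the huggingface-cli command does not work there.
3. On networks in mainland China, huggingface-cli may not be able to complete the download.
The suggestion is therefore to follow the other datasets: state only where the data can be obtained, without specific download commands.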
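1. Removed unreferenced dependencies
2. Changed print to logger.error
3. Updated the README to provide only the data source
4. Changed the default return value of dump_metrics for the invalid judge_results case
5. Changed the return type of parse_predictions from list[dict] to List[Dict[str, Any]]
6. Changed the exception raised in calib_err to ValueError
7. Added HLE dataset material to get_started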
ed57500 to ae2fbce
            "additionalProperties": False,
        },
        "name": "ExtractedAnswer",
        "strict": True,
[review] issue: The schema sets "strict": True at the JSON-schema wrapper level and also requires a strict field inside the model output. That field is not consumed by the evaluator, but it increases output complexity and raises the chance of schema validation failures.
suggestion: Keep only the top-level json_schema.strict=True and remove the inner strict property and its required entry.
            "enum": ["yes", "no"],
            "type": "string",
        },
        "confidence": {"title": "Confidence", "type": "integer"},
[review] issue: The confidence field is defined only as an integer without a valid range. The downstream evaluator assumes the value is between 0 and 100, so out-of-range values can silently corrupt calibration metrics.
suggestion: Add minimum: 0 and maximum: 100 to the schema, and add defensive validation or clamping in the evaluator before metric calculation.
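A minimal sketch of the bounded property this comment asks for, using the standard JSON Schema keywords `minimum` and `maximum`. Whether the serving backend enforces these keywords in structured-output mode is an assumption, so the sketch pairs the schema with a plain-Python check; `CONFIDENCE_PROPERTY` and `within_schema_bounds` are hypothetical names.

```python
# Hypothetical fragment: only the confidence property quoted above, with the
# bounds the reviewer suggests added.
CONFIDENCE_PROPERTY = {
    "title": "Confidence",
    "type": "integer",
    "minimum": 0,
    "maximum": 100,
}


def within_schema_bounds(value: object, prop: dict = CONFIDENCE_PROPERTY) -> bool:
    """Mirror the schema constraint in Python for evaluator-side validation."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return prop["minimum"] <= value <= prop["maximum"]
```

Keeping the bounds in one dictionary lets the schema and the defensive check read the same limits.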
    role="HUMAN",
    prompt_mm={
        "text": {"type": "text", "text": "{question}"},
        "image": {"type": "image_url", "image_url": {"url": "{image}"}},
[review] issue: The dataset image field is forwarded directly as image_url to the remote model service. If the dataset source is not fully trusted, this creates SSRF and internal resource access risk on the backend service side.
suggestion: Restrict allowed image sources with protocol and domain allowlists, and prefer controlled local-path mapping or preprocessed data URIs instead of arbitrary remote URLs.
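One way to apply the allowlist idea is a small check that runs before the image value is placed in the payload. This is only a sketch: the allowed host `datasets.example.org` is a placeholder, and the policy (https from listed hosts, or inline `data:image/` URIs) is an assumption rather than anything defined in this PR.

```python
from urllib.parse import urlparse

# Hypothetical policy values; adjust to the deployment's trusted sources.
ALLOWED_HOSTS = {"datasets.example.org"}


def is_allowed_image_url(url: str) -> bool:
    """Accept inline image data URIs and https URLs from allowlisted hosts."""
    parsed = urlparse(url)
    scheme = parsed.scheme.lower()
    if scheme == "data":
        # Inline payloads never trigger a remote fetch by the model service.
        return url.lower().startswith("data:image/")
    if scheme != "https":
        # Rejects http, file, ftp and scheme-less values.
        return False
    return (parsed.hostname or "").lower() in ALLOWED_HOSTS
```

Records that fail the check could be logged and skipped, or rewritten into data URIs during preprocessing.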
    host_port=8080,
    url="",
    max_out_len=8192,
    batch_size=100,
[review] issue: The judge model configuration depends on multiple empty placeholders and localhost defaults, such as model="", api_key="", and url="". This creates hidden runtime dependencies and makes failures environment-dependent and harder to diagnose.
suggestion: Require at least one of model or url, load sensitive and environment-specific values from external configuration or environment variables, and fail fast with explicit validation errors.
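The fail-fast validation could look roughly like the following. The environment variable names (`HLE_JUDGE_MODEL`, `HLE_JUDGE_URL`, `HLE_JUDGE_API_KEY`) and the function name are hypothetical; AISBench may already provide its own mechanism for this.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_judge_endpoint(
    cfg: Mapping[str, str], env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Fill empty placeholders from the environment and fail fast if unusable."""
    env = os.environ if env is None else env
    # Hypothetical variable names, used only when the config value is empty.
    model = cfg.get("model") or env.get("HLE_JUDGE_MODEL", "")
    url = cfg.get("url") or env.get("HLE_JUDGE_URL", "")
    api_key = cfg.get("api_key") or env.get("HLE_JUDGE_API_KEY", "")
    if not model and not url:
        raise ValueError(
            "judge model config requires at least one of 'model' or 'url'"
        )
    return {"model": model, "url": url, "api_key": api_key}
```

Raising at configuration time turns an environment-dependent runtime failure into an explicit message before any inference starts.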
    host_ip="localhost",
    host_port=8080,
    url="",
    max_out_len=8192,
[review] issue: max_out_len=8192 and batch_size=100 are aggressive defaults for a judge task that only needs short structured JSON output. This increases memory usage, timeout risk, throughput instability, and serving cost.
suggestion: Reduce max_out_len to a tighter range such as 256-512 and tune batch_size based on actual serving capacity and load-test results.
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=GenInferencer),
    )
[review] issue: The file mixes judge protocol definition, prompt design, schema definition, model connection settings, and dataset wiring in one place. This increases coupling and makes future *_llmjudge.py configurations more likely to drift or duplicate logic.
suggestion: Extract shared judge schema, reusable prompt fragments, and common judge-model defaults into a shared module so dataset-specific files only define the parts that differ.
    correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.

    confidence: The extracted confidence score between 0|\%| and 100|\%| from [response]. Put 100 if there is no confidence score available.
    """.strip()
[review] issue: The prompt relies heavily on natural-language compliance for answer extraction and correctness judgment, but it does not explicitly constrain ambiguous cases such as multiple final answers, malformed confidence text, or partially structured responses. That leaves room for non-deterministic judge behavior.
suggestion: Add explicit rules for ambiguous responses, multiple extracted answers, invalid confidence formats, and empty answers so the judge behavior is more deterministic and reproducible.
        chat_template_kwargs=dict(
            enable_thinking=False,
        ),
    ),
[review] issue: The prompt relies heavily on natural-language compliance for answer extraction and correctness judgment, but it does not explicitly constrain ambiguous cases such as multiple final answers, malformed confidence text, or partially structured responses. That leaves room for non-deterministic judge behavior.
suggestion: Add explicit rules for ambiguous responses, multiple extracted answers, invalid confidence formats, and empty answers so the judge behavior is more deterministic and reproducible.
        infer_cfg=hle_infer_cfg,
        judge_infer_cfg=hle_judge_infer_cfg,
        eval_cfg=hle_eval_cfg,
    )
[review] issue: The dataset path is hardcoded to a specific parquet file location. This reduces portability across environments and makes packaging, CI, and external reuse more brittle.
suggestion: Move the dataset path into an external config layer or resolve it through a dataset registry so the config is easier to reuse across machines and deployment environments.
    logger.debug(f"Loading HLE dataset from: {resolved_path}")

    if not os.path.exists(resolved_path):
        raise FileNotFoundError(f"HLE parquet file not found: {resolved_path}")
[review] Issue: HLEDataset.load reads the entire parquet file into a pandas DataFrame and then iterates row by row to build a Python list before converting it again into a HuggingFace Dataset. This introduces unnecessary memory amplification and poor scalability for larger datasets.
Suggestion: Avoid the pandas-to-list-to-Dataset conversion chain. Prefer a direct parquet-to-dataset loading path or a vectorized transformation pipeline to reduce memory usage and improve throughput.
    Returns:
        Dictionary with 'content' (formatted prompt) and 'answer' fields.
    """
[review] Issue: The image field is appended directly into the message payload as image_url without any validation or normalization. If dataset content is not fully trusted, this can propagate unsafe URLs into downstream services and create SSRF or unintended remote fetch risk.
Suggestion: Validate image sources before adding them to the prompt payload. Restrict supported schemes and trusted domains, and normalize the representation expected by the downstream model interface.
    logger = AISLogger()


    def parse_hle_item(item: Dict) -> Dict:
[review] Issue: parse_hle_item silently falls back to empty strings for missing question, image, and answer. This makes bad input look valid and shifts data quality failures into later inference or evaluation stages, where root-cause analysis becomes harder.
Suggestion: Enforce required fields explicitly, or at minimum log and count malformed samples. Failing fast on structurally invalid records will improve observability and data quality control.
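A sketch of explicit required-field enforcement. It assumes that `question` and `answer` are mandatory while `image` may be empty for text-only items; that split, the `id` key, and the name `parse_item_strict` are assumptions, and the output shape is simplified compared with the real `parse_hle_item`.

```python
from typing import Any, Dict

# Assumption: image is optional because some items are text-only.
REQUIRED_FIELDS = ("question", "answer")


def parse_item_strict(item: Dict[str, Any]) -> Dict[str, Any]:
    """Raise on structurally invalid records instead of defaulting to ''."""
    missing = [
        key for key in REQUIRED_FIELDS
        if not str(item.get(key) or "").strip()
    ]
    if missing:
        record_id = item.get("id", "<unknown>")
        raise ValueError(f"HLE record {record_id} is missing fields: {missing}")
    return {
        "question": item["question"],
        "image": item.get("image") or "",
        "answer": item["answer"],
    }
```

A caller that prefers counting over failing can catch the `ValueError`, log it, and increment a malformed-sample counter.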
    cleaned = re.sub(r"[\n\t\r]+", " ", pred_str)
    cleaned = re.sub(r"\s+", " ", cleaned)
    logger.debug(f"\n cleaned_pred_input: {cleaned}")
    try:
[review] Issue: parse_predictions tries to recover malformed JSON by flattening all newlines, tabs, and repeated whitespace before parsing. This can alter valid string content, hide formatting issues, and still fail on common LLM JSON defects such as trailing prose or fenced code blocks.
Suggestion: Use a stricter structured-output contract and parse the raw response first. If recovery is needed, implement targeted cleanup for known wrappers such as markdown fences rather than globally rewriting whitespace.
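The "parse raw first, then targeted cleanup" approach might be sketched as below. It handles only one known wrapper, a Markdown code fence around the whole response, and leaves other defects (such as trailing prose) as parse failures; `parse_judge_json` is a hypothetical helper, not the PR's `parse_predictions`.

```python
import json
import re
from typing import Any, Dict, Optional

FENCE = "`" * 3  # Markdown code fence, built this way to keep the source readable
_FENCED = re.compile(
    r"^" + FENCE + r"[a-zA-Z]*\s*(.*?)\s*" + FENCE + r"$", re.DOTALL
)


def _strip_fence(text: str) -> str:
    """Return the body of a fully fenced response, or the text unchanged."""
    match = _FENCED.match(text)
    return match.group(1) if match else text


def parse_judge_json(raw: str) -> Optional[Dict[str, Any]]:
    """Parse the raw response first; fall back to removing a code fence."""
    text = raw.strip()
    for candidate in (text, _strip_fence(text)):
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Because whitespace inside string values is never rewritten, valid JSON content reaches the evaluator unchanged.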
    for pred_str in predictions:
        logger.debug(f"\n original_pred_input: {pred_str}")
        cleaned = re.sub(r"[\n\t\r]+", " ", pred_str)
        cleaned = re.sub(r"\s+", " ", cleaned)
[review] Issue: Invalid judge outputs are logged and then skipped silently. Later metric computation still uses the original expected sample count, which can understate or distort results while hiding the true parse failure rate.
Suggestion: Surface parse failure statistics explicitly in the returned metrics, and consider treating excessive parse failures as an evaluation error instead of silently continuing.
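Surfacing the failure rate could be as simple as returning counts next to the parsed results. In this sketch the key names follow the `expected_sample_num` / `valid_judge_sample_num` wording used elsewhere in this review, while `parse_failure_num`, the function name, and the 10% default threshold are assumptions.

```python
import json
from typing import Any, Dict, List, Tuple


def parse_with_stats(
    predictions: List[str], max_failure_rate: float = 0.1
) -> Tuple[List[Dict[str, Any]], Dict[str, int]]:
    """Parse judge outputs and report how many could not be used."""
    parsed: List[Dict[str, Any]] = []
    for raw in predictions:
        try:
            obj = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue  # counted as a failure below
        if isinstance(obj, dict):
            parsed.append(obj)
    total = len(predictions)
    failed = total - len(parsed)
    if total and failed / total > max_failure_rate:
        raise ValueError(
            f"{failed}/{total} judge outputs failed to parse; "
            "metrics would not be trustworthy"
        )
    stats = {
        "expected_sample_num": total,
        "valid_judge_sample_num": len(parsed),
        "parse_failure_num": failed,
    }
    return parsed, stats
```

Merging `stats` into the returned metrics lets consumers separate model performance from parser failures.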
        return np.abs(avg_conf - avg_correct)

    bins = [[i * beta, (i + 1) * beta] for i in range(num_bins)]
    bins[-1] = [bins[-1][0], len(confidence)]
[review] Issue: The binning logic in calib_err is flawed for datasets with at least beta samples. num_bins = len(confidence) // beta combined with for i in range(len(bins) - 1) means the final bin is never processed, so part of the dataset is excluded from calibration error calculation.
Suggestion: Iterate over all bins, not all bins minus one. This is a correctness bug in metric computation and should be fixed before relying on reported calibration values.
        avg_correct = np.nanmean(correct)
        return np.abs(avg_conf - avg_correct)

    bins = [[i * beta, (i + 1) * beta] for i in range(num_bins)]
[review] Issue: The bin construction uses fixed-size chunks and then overwrites only the final bin end index. When len(confidence) is not an exact multiple of beta, the remainder samples are not assigned to any bin at all.
Suggestion: Build bins so the final range always includes the tail remainder, or use a simpler slice-based loop such as for start in range(0, len(confidence), beta) to cover the full dataset deterministically.
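The slice-based loop mentioned in this suggestion can be sketched in plain Python as follows. This is not the PR's `calib_err`: it uses lists instead of NumPy arrays, it puts any remainder in its own smaller tail bin, and it raises `ValueError` for an unknown `p`. Since the PR description states that matching the official HLE evaluation logic is a goal, results from a loop that visits every bin may differ from numbers produced by the original loop.

```python
import math
from typing import Sequence


def calib_err_all_bins(
    confidence: Sequence[float],
    correct: Sequence[float],
    p: str = "2",
    beta: int = 100,
) -> float:
    """Binned calibration error that visits every bin, including the tail."""
    if p not in {"1", "2", "infty", "infinity", "max"}:
        raise ValueError("p must be '1', '2', or 'infty'")
    if beta <= 0 or len(confidence) != len(correct):
        raise ValueError("beta must be positive and inputs must align")
    total = len(confidence)
    if total == 0:
        return 0.0
    # Sort samples by confidence so each bin holds similar confidences.
    order = sorted(range(total), key=lambda i: confidence[i])
    conf = [confidence[i] for i in order]
    corr = [correct[i] for i in order]
    cerr = 0.0
    for start in range(0, total, beta):  # covers the remainder as well
        bin_conf = conf[start:start + beta]
        bin_corr = corr[start:start + beta]
        n = len(bin_conf)
        diff = abs(sum(bin_conf) / n - sum(bin_corr) / n)
        if p == "2":
            cerr += n / total * diff ** 2
        elif p == "1":
            cerr += n / total * diff
        else:
            cerr = max(cerr, diff)
    return math.sqrt(cerr) if p == "2" else cerr
```

For example, with confidences `[0.9, 0.9, 0.1, 0.1]`, correctness `[1, 1, 0, 0]`, and `beta=2`, both bins are off by 0.1, so the weighted error is about 0.1 for every supported `p`.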
        "confidence_half_width": "+/- 0%",
        "calibration_error": 0,
        "sample_num": n,
    }
[review] Issue: dump_metrics divides by n without guarding against n == 0. An empty reference set will trigger division-by-zero errors in both accuracy and confidence interval calculations.
Suggestion: Add an explicit empty-input guard and return a well-defined zero-sample result before performing any metric math.
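An empty-input guard could be placed ahead of the metric math as in this sketch. The half-width uses the normal-approximation formula 1.96 * sqrt(acc * (100 - acc) / n), which is assumed here rather than quoted from the PR; the key names follow the result dictionary shown above, and `accuracy_with_interval` is a hypothetical name.

```python
import math
from typing import Dict, Sequence, Union


def accuracy_with_interval(
    correct_flags: Sequence[bool], n: int
) -> Dict[str, Union[str, int]]:
    """Accuracy and 95% half-width as strings, safe for n == 0."""
    if n <= 0:
        # Well-defined zero-sample result instead of a ZeroDivisionError.
        return {
            "accuracy": "0%",
            "confidence_half_width": "+/- 0%",
            "sample_num": 0,
        }
    accuracy = round(100 * sum(correct_flags) / n, 2)
    half_width = round(1.96 * math.sqrt(accuracy * (100 - accuracy) / n), 2)
    return {
        "accuracy": f"{accuracy}%",
        "confidence_half_width": f"+/- {half_width}%",
        "sample_num": n,
    }
```

With 40 correct answers out of 500, this yields `8.0%` and `+/- 2.38%`, the same format as the example string in the PR description.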
    if not judge_results:
        logger.error(UTILS_CODES.UNKNOWN_ERROR, "No available judge_results")
        return {
            "accuracy": "0%",
[issue] Issue: The code logs a mismatch when parsed prediction count differs from expected count, but it still computes accuracy using the full expected count and the reduced parsed set. This couples data loss and metric bias in a way that is hard to interpret operationally.
Suggestion: Either fail the evaluation on count mismatch or return both expected_sample_num and valid_judge_sample_num so consumers can distinguish model performance from parser failure.
    the evaluated predictions.

    Args:
        judge_results: List of judge response dicts with 'correct' and 'confidence'.
[review] Issue: Correctness is derived using "yes" in judge.get("correct", ""), which is too loose. Any unexpected string containing yes as a substring would be counted as correct, even if the field is malformed.
Suggestion: Use an exact normalized comparison such as str(value).strip().lower() == "yes" to keep the metric contract strict and predictable.
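The difference between the substring test and the exact comparison is easy to show in a few lines. `is_judged_correct` is a hypothetical helper name; the comparison itself is the one given in the suggestion.

```python
from typing import Any, Dict


def is_judged_correct(judge: Dict[str, Any]) -> bool:
    """Exact, normalized match on the 'correct' field."""
    return str(judge.get("correct", "")).strip().lower() == "yes"


def is_judged_correct_loose(judge: Dict[str, Any]) -> bool:
    """The substring check described in the issue, kept for comparison."""
    return "yes" in judge.get("correct", "")
```

A malformed value such as `"yes, but the units are wrong"` passes the loose check and fails the strict one.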
    Returns:
        Dictionary containing:
            - accuracy: Percentage string with confidence interval
[review] Issue: Confidence values are appended without type validation and then converted using np.array(confidence) / 100. Non-numeric values from judge output can cause dtype degradation or runtime failures during metric computation.
Suggestion: Validate and coerce confidence values explicitly before aggregation, reject invalid entries with clear accounting, and clamp the accepted range to 0-100.
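Explicit coercion before aggregation might look like this sketch. Accepting a trailing percent sign, rejecting booleans, and clamping rather than rejecting out-of-range values are all policy assumptions; `coerce_confidence` is a hypothetical helper.

```python
import math
from typing import Any, Optional


def coerce_confidence(value: Any) -> Optional[float]:
    """Return confidence as a fraction in [0, 1], or None if unusable."""
    if isinstance(value, bool):
        return None  # True/False would otherwise be read as 1/0
    if isinstance(value, str):
        value = value.strip().rstrip("%")
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if math.isnan(number) or math.isinf(number):
        return None
    # Clamp to the 0-100 range the metric code expects, then scale.
    return min(max(number, 0.0), 100.0) / 100
```

Entries that come back as `None` can be counted and reported alongside the metrics so that rejected values are accounted for.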
    @ICL_EVALUATORS.register_module()
    class HLEJudgeEvaluator(LLMJudgeCorrectEvaluator):
[review] Issue: HLEJudgeEvaluator.score validates only list-length equality before parsing. It does not check response schema completeness, parse failure ratio, or confidence validity before returning metrics. This makes the evaluation path too permissive for a structured judge pipeline.
Suggestion: Add validation gates for parse success rate, required keys, and confidence range, and return explicit error states when judge outputs are not trustworthy enough for metric reporting.
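🔍 Motivation
Adapt the HLE (Humanity's Last Exam) multimodal visual question answering dataset to the AISBench evaluation framework, using a judge model for automatic evaluation. The integration supports computing the full set of metrics (accuracy, confidence interval, and calibration error) and stays consistent with the official HLE evaluation logic.
Main goals:
- calib_err() and dump_metrics() calculation logic
📝 Modification
New files (4)
- ais_bench/benchmark/datasets/hle.py: HLEDataset, HLEJGDataset, HLEJudgeEvaluator, and functions such as calib_err() and dump_metrics()
- ais_bench/benchmark/configs/datasets/hle/hle_llmjudge.py: configurations such as hle_reader_cfg, hle_infer_cfg, hle_judge_infer_cfg, and hle_eval_cfg
- ais_bench/benchmark/configs/datasets/hle/README.md
- ais_bench/benchmark/configs/datasets/hle/README_en.md
Modified files (1)
- ais_bench/benchmark/summarizers/default.py: the _format_table() method gains a type check so that string-typed metric values are displayed directly (e.g. "8.0% +/- 2.38%")
- ais_bench/benchmark/datasets/__init__.py
📐 Associated Test Results
[Test report] Integrating the HLE multimodal closed-book academic evaluation into AISBench
NA
NA
🌟 Use cases (Optional)
NA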
🌟 Useful CI Command
- /gemini review
- /gemini summary
- /gemini help
- /readthedocs build