bugfix: Support for extracting answers expressed in thousand separato…#360
yejj710 wants to merge 1 commit into
Code Review
This pull request updates the GSM8K dataset post-processing logic to remove commas from the text before extracting numbers. The review feedback highlights that blindly replacing all commas can cause issues by merging comma-separated numbers (e.g., "1,2" into "12") and suggests using a regular expression to target only thousand separators.
  @TEXT_POSTPROCESSORS.register_module('gsm8k')
  def gsm8k_postprocess(text: str) -> str:
-     text = text.split('Question:')[0]
+     text = text.split('Question:')[0].replace(',', '')
Blindly replacing all commas with an empty string can lead to incorrect parsing of numbers that are separated by commas without spaces (e.g., a list of numbers like 1,2 or coordinates). For example, 1,2 would be merged into 12, causing the postprocessor to extract 12 instead of 2.
To remove only thousand separators, you can use a regular expression that targets commas preceded by a digit and followed by exactly three digits ending at a word boundary (`\b` also matches before a decimal point, so `1,234.5` is handled).
-     text = text.split('Question:')[0].replace(',', '')
+     text = re.sub(r'(?<=\d),(?=\d{3}\b)', '', text.split('Question:')[0])
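The difference between the two approaches can be checked on a few inputs. The following is a minimal sketch, not the project's code: the helper names `strip_all_commas` and `strip_thousands_separators` are hypothetical, and only the regular expression is taken from the suggestion above.

```python
import re

# Pattern from the suggested change: a comma that follows a digit and is
# followed by exactly three digits ending at a word boundary.
THOUSANDS_SEP = re.compile(r'(?<=\d),(?=\d{3}\b)')


def strip_all_commas(text: str) -> str:
    """Hypothetical helper: the blanket replacement from the original patch."""
    return text.replace(',', '')


def strip_thousands_separators(text: str) -> str:
    """Hypothetical helper: remove only commas that look like thousand separators."""
    return THOUSANDS_SEP.sub('', text)


if __name__ == '__main__':
    for sample in ['1,234', '1,234,567.5', '1,2', 'x = 1, y = 2']:
        print(repr(sample),
              '| all commas removed:', repr(strip_all_commas(sample)),
              '| separators only:', repr(strip_thousands_separators(sample)))
```

The regex is still a heuristic. A list such as `1,234` that really means the two numbers 1 and 234 is merged into `1234`, and a number immediately followed by letters (for example `3,000km`) keeps its comma because there is no word boundary after the third digit.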
PR Type
Related Issue
Fixes #(issue ID) / Relates to #(issue ID)
🔍 Motivation
#359
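When extracting answers, the extraction logic is incomplete, which leads to lower-than-expected scores.
📝 Modification
Following the logic used to extract the golden answers, replace ',' with an empty string.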
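A rough illustration of why the score drops, assuming the post-processor takes the last number found in the prediction. The function `extract_last_number` and its number regex are hypothetical stand-ins, since the rest of `gsm8k_postprocess` is not shown in this PR.

```python
import re


def extract_last_number(text: str, strip_commas: bool) -> str:
    """Hypothetical stand-in for the post-processor; not the project's code."""
    # Keep only the text before the next 'Question:' marker, as in the diff.
    text = text.split('Question:')[0]
    if strip_commas:
        # The change proposed in this PR: drop every comma before extraction.
        text = text.replace(',', '')
    # Assumed extraction rule: the last integer or decimal in the text.
    numbers = re.findall(r'\d+\.\d+|\d+', text)
    return numbers[-1] if numbers else ''


prediction = 'So the total is 1,234 dollars.\nQuestion: next one has 5 apples'
gold = '1234'  # golden answers are assumed to be stored without commas

before = extract_last_number(prediction, strip_commas=False)
after = extract_last_number(prediction, strip_commas=True)
print(before == gold, after == gold)
```

Without comma handling the comma splits `1,234` into `1` and `234`, so a correct prediction is compared against the gold answer as `234` and counted as wrong.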