Add open-r1/OpenR1-Math-220k dataset and nvidia/OpenMathReasoning to RL and fix reward function #3629
Conversation
This Pull Request introduces significant improvements to the Reinforcement Learning (RL) reward and evaluation logic, specifically for math-based datasets. Key highlights include the addition of multiprocessing for math_verify to prevent trainer hangs, support for new datasets (OpenR1-Math-220k, OpenMathReasoning), and more robust evaluation modes such as majority voting and pass@1 estimation. The inclusion of comprehensive unit tests for the new grading logic and multiprocessing pool is a strong positive.
🔍 General Feedback
- Multiprocessing for Math Verification: The move to a process-isolated, timeout-bounded math_verify pool is a critical stability improvement, especially for complex symbolic computations that might hang.
- Improved Answer Extraction: The new answer extraction and normalization logic is more robust and handles Multiple Choice Questions (MCQ) better by accepting both the option letter and the option value as correct.
- Evaluation Modes: The addition of maj@K and pass@1 metrics provides a more complete picture of model performance beyond simple pass@K.
- LaTeX Normalization: The fix_latex_escaping utility addresses common issues with improperly escaped LaTeX strings, although it needs some tuning to avoid over-correcting common English words.
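The process-isolated, timeout-bounded verification described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: `verify_with_timeout` and `_check_answer` are hypothetical names, the real worker would call `math_verify` rather than compare strings, and the fork start method is assumed (Unix only).

```python
import multiprocessing as mp


def _check_answer(gold: str, pred: str) -> bool:
    # Stand-in for the real math_verify call, which can hang on
    # pathological symbolic inputs; here we just compare strings.
    return gold.strip() == pred.strip()


def verify_with_timeout(gold: str, pred: str, timeout: float = 5.0) -> bool:
    """Run the check in a separate process so a hang or crash cannot
    stall the trainer; a timeout is scored as an incorrect answer."""
    ctx = mp.get_context("fork")  # Unix-only; avoids re-importing the caller
    with ctx.Pool(processes=1) as pool:
        async_result = pool.apply_async(_check_answer, (gold, pred))
        try:
            return async_result.get(timeout=timeout)
        except mp.TimeoutError:
            pool.terminate()  # kill the stuck worker process
            return False
```

A production version would presumably keep one long-lived pool and reuse it across reward computations to amortize process startup; the sketch creates a pool per call for clarity.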
NicoGrande left a comment:
Looks good overall! In general, can we add type hints to arguments and return types in the new functions we are adding? It makes it much easier to understand the intent of a function and reason about its correctness!
That's a great suggestion.
…RL and fix reward function

Co-authored-by: A9isha <mazumdera@google.com>
Description
Co-authored with @A9isha
This PR introduces significant improvements to the RL reward and evaluation logic for math-based datasets:
- Added multiprocessing for math_verify to prevent trainer and evaluation hangs
- Added support for new datasets (OpenR1-Math-220k, OpenMathReasoning)
- Answer extraction and normalization logic now handles Multiple Choice Questions by accepting both the option letter and the option value as correct
- Added more evaluation modes, such as majority voting and pass@1 estimation
- Sync weights to vllm prior to the pre-RL evaluation
- Added are_equal_under_sympy() for symbolic equality checking as a fallback before math_verify

Tests

Tested on the OpenMathInstruct and OpenR1 datasets
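The majority-voting and pass@1 metrics listed above can be sketched with a plurality vote over extracted answers plus the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), from the Codex/HumanEval paper. Function names here are illustrative, not the PR's actual API:

```python
from collections import Counter
from math import comb


def majority_vote(answers: list[str]) -> str:
    """maj@K: grade only the most frequent extracted answer among K samples."""
    return Counter(answers).most_common(1)[0][0]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples, of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=10 samples of which c=5 are correct, `pass_at_k(10, 5, 1)` gives 1 − 5/10 = 0.5, i.e. the fraction of correct samples, as expected for k=1.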
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.