
Add open-r1/OpenR1-Math-220k dataset and nvidia/OpenMathReasoning to RL and fix reward function #3629

Merged
copybara-service[bot] merged 1 commit into main from rl-debug on Apr 23, 2026
Conversation

@SurbhiJainUSC (Collaborator) commented Apr 10, 2026

Description

Co-authored with @A9isha

This PR introduces significant improvements to the RL reward and evaluation logic for math-based datasets.

  • Added multiprocessing for math_verify to prevent trainer and evaluation hangs.

  • Added support for new datasets (OpenR1-Math-220k, OpenMathReasoning).

  • Reworked answer extraction and normalization to handle multiple-choice questions, accepting both the option letter and its value as correct.

  • Added evaluation modes such as majority voting (maj@K) and pass@1 estimation.

  • Synced weights to vLLM prior to the pre-RL evaluation.

  • Added are_equal_under_sympy() for symbolic equality checking as a fallback before math_verify.
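The multiprocessing change above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: `grade` is a hypothetical stand-in for the real math_verify-based comparison, and the sketch assumes a POSIX host (it uses the "fork" start method). The point is the isolation pattern — a verification call that hangs on a pathological symbolic expression is killed after a timeout instead of stalling the trainer.

```python
import multiprocessing as mp


def grade(gold: str, pred: str) -> bool:
  # Hypothetical stand-in; the real code would call math_verify here.
  return gold.strip() == pred.strip()


def _worker(conn, gold: str, pred: str) -> None:
  # Runs in the child process; any error counts as incorrect.
  try:
    conn.send(grade(gold, pred))
  except Exception:
    conn.send(False)
  finally:
    conn.close()


def grade_with_timeout(gold: str, pred: str, timeout_s: float = 5.0) -> bool:
  """Run the grading call in a child process; kill it if it hangs."""
  ctx = mp.get_context("fork")  # POSIX-only; avoids re-importing __main__
  recv_end, send_end = ctx.Pipe(duplex=False)
  proc = ctx.Process(target=_worker, args=(send_end, gold, pred))
  proc.start()
  send_end.close()
  # Wait up to timeout_s for a result; a silent child means a hang.
  result = recv_end.recv() if recv_end.poll(timeout_s) else False
  proc.join(timeout=1.0)
  if proc.is_alive():
    proc.kill()  # hung symbolic comparison: reap it, score as incorrect
    proc.join()
  return result
```

A pooled version would keep long-lived worker processes instead of forking per call, but the timeout-and-kill contract is the same.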

Tests

Tested on the OpenMathInstruct and OpenR1 datasets.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov codecov Bot commented Apr 10, 2026

@SurbhiJainUSC force-pushed the rl-debug branch 8 times, most recently from ffcdb16 to 5b5bcfd on April 15, 2026
@SurbhiJainUSC force-pushed the rl-debug branch 2 times, most recently from def1309 to cf7471d on April 20, 2026
@SurbhiJainUSC changed the title from "Rl debug" to "Add open-r1/OpenR1-Math-220k dataset and nvidia/OpenMathReasoning to RL and fix reward function" on Apr 20, 2026
@SurbhiJainUSC force-pushed the rl-debug branch 6 times, most recently from 50b9d63 to 0f10655 on April 20, 2026
@github-actions github-actions Bot commented: 🤖 Hi @SurbhiJainUSC, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions github-actions Bot left a comment
## 📋 Review Summary

This Pull Request introduces significant improvements to the Reinforcement Learning (RL) reward and evaluation logic, specifically for math-based datasets. Key highlights include the addition of multiprocessing for math_verify to prevent trainer hangs, support for new datasets (OpenR1-Math-220k, OpenMathReasoning), and more robust evaluation modes like majority voting and pass@1 estimation. The inclusion of comprehensive unit tests for the new grading logic and multiprocessing pool is a strong positive.

🔍 General Feedback

  • Multiprocessing for Math Verification: The move to a process-isolated, timeout-bounded math_verify pool is a critical improvement for stability, especially when dealing with complex symbolic computations that might hang.
  • Improved Answer Extraction: The new answer extraction and normalization logic is more robust and handles Multiple Choice Questions (MCQ) better by allowing both option letters and values as correct.
  • Evaluation Modes: The addition of maj@K and pass@1 metrics provides a more complete picture of model performance beyond simple pass@K.
  • LaTeX Normalization: The fix_latex_escaping utility addresses common issues with improperly escaped LaTeX strings, although it requires some tuning to avoid over-correcting common English words.
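The maj@K and pass@1 modes mentioned above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: function names are made up, and `pass_at_k` is the standard unbiased estimator (pass@1 reduces to the fraction of correct samples, c/n).

```python
from collections import Counter
from math import comb


def maj_at_k(sampled_answers: list[str], gold: str) -> bool:
  """maj@K: True if the most common of K sampled answers matches gold."""
  most_common, _count = Counter(sampled_answers).most_common(1)[0]
  return most_common == gold


def pass_at_k(n: int, c: int, k: int) -> float:
  """Unbiased pass@k estimate from n samples of which c are correct.

  This is the probability that at least one of k answers drawn without
  replacement from the n samples is correct; pass@1 is the k=1 case.
  """
  if n - c < k:  # fewer than k incorrect samples: a correct one is guaranteed
    return 1.0
  return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 5 are correct, pass@1 is 0.5, while maj@K depends on which answer the incorrect samples cluster on.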

@SurbhiJainUSC force-pushed the rl-debug branch 4 times, most recently from ef788e9 to 7815e66 on April 21, 2026
@SurbhiJainUSC force-pushed the rl-debug branch 8 times, most recently from 6476df0 to 7668316 on April 22, 2026
@NicoGrande (Collaborator) left a comment

Looks good overall! In general, can we add type hints to arguments / return types in the new functions we are adding? It makes it much easier to understand the intent of the function and reason about correctness!

@SurbhiJainUSC force-pushed the rl-debug branch 3 times, most recently from a8611e6 to f1002cf on April 22, 2026
@SurbhiJainUSC (Collaborator, Author) replied:

Looks good overall! In general, can we add type hints to arguments / return types in the new functions we are adding? It makes it much easier to understand the intent of the function and reason about correctness!

That's a great suggestion.

@NicoGrande (Collaborator) left a comment

LGTM

…RL and fix reward function

Co-authored-by: A9isha <mazumdera@google.com>
copybara-service[bot] merged commit c54d88f into main on Apr 23, 2026
41 of 43 checks passed
copybara-service[bot] deleted the rl-debug branch on April 23, 2026 at 22:24