The DeepSeek-R1 tech report explores how reinforcement learning (RL) can spark advanced chain-of-thought reasoning in language models. Notably, it introduced the concept of a sudden “aha moment” during RL training with verifiable rewards, in which the model rapidly develops more sophisticated reasoning and self-reflection. This discovery, along with many others, has inspired a wave of open-source reproduction and extension projects.
Below, I summarize some key notes and insights from these follow-up efforts. For a deeper dive into DeepSeek-R1 itself and the blog posts and papers below, please refer to the original sources.
- HuggingFace's open reproduction of the DeepSeek-R1 project, aiming to replicate the entire pipeline described in the original DeepSeek-R1 tech report.
- Including model training (SFT and GRPO) and evaluation on tasks like AIME 2024, MATH-500, and code generation benchmarks.
- Reproduces DeepSeek R1-Zero on the countdown and multiplication tasks, using Qwen2.5-3B-Instruct, which developed self-verification through RL.
- With PPO, GRPO, and PRIME, the team found that long CoT emerges in every case and that all three algorithms seem to work well.
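To make "verifiable reward" concrete for the countdown task, here is a minimal sketch of a rule-based checker. It is my own illustration, not code from the project: the function names are hypothetical, and I assume the task asks for an arithmetic expression that uses each given number exactly once and evaluates to the target.

```python
import ast
import operator

# Only the four basic arithmetic operators are accepted.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed expression, recording every integer literal in `used`."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, used)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported syntax")


def countdown_reward(expression, numbers, target):
    """Return 1.0 for a valid expression that hits the target, else 0.0.

    Assumption: every number in `numbers` must be used exactly once.
    """
    used = []
    try:
        value = _evaluate(ast.parse(expression.strip(), mode="eval"), used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if sorted(used) != sorted(numbers):
        return 0.0
    return 1.0 if abs(value - target) < 1e-9 else 0.0
```

Walking the syntax tree instead of calling `eval` keeps the checker from executing arbitrary model output, and it rejects operators such as `**` that the task does not allow.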
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL
- Github repo: https://github.com/agentica-project/deepscaler
- Authors present DeepScaleR-1.5B-Preview, a 1.5B-parameter language model finetuned via RL on math reasoning tasks. It surpasses O1-Preview on AIME2024 and performs strongly across multiple competition-level math benchmarks.
- The work builds on DeepSeek-R1-Distill-Qwen-1.5B.
- Iterative Context Lengthening (“Think Shorter, Then Longer”)
- A critical challenge is the long outputs for math reasoning, which slow training and can lead to repetitive, unhelpful content. The team uses a staged approach with GRPO:
- Begin with an 8K context window to keep outputs shorter and focus on correct reasoning.
- Increase to 16K once the model plateaus and starts hitting the 8K limit.
- Finally, increase to 24K to fully exploit longer chain-of-thought once the model has learned to reason more concisely.
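The staged 8K, 16K, 24K schedule can be sketched as a small controller that decides when to move to the next context length. This is only an illustration under my own assumptions: the class name, the plateau test, and the thresholds are hypothetical, and the actual project may have switched stages by manual inspection of the training curves.

```python
from dataclasses import dataclass, field

# Context limits for the three stages (8K -> 16K -> 24K tokens).
STAGES = (8_192, 16_384, 24_576)


@dataclass
class ContextScheduler:
    clip_threshold: float = 0.05  # hypothetical: share of responses hitting the limit
    plateau_tol: float = 0.005    # hypothetical: max reward spread counted as a plateau
    window: int = 3               # number of recent evaluations to compare
    stage: int = 0
    rewards: list = field(default_factory=list)

    @property
    def max_tokens(self) -> int:
        return STAGES[self.stage]

    def update(self, mean_reward: float, response_lengths: list) -> int:
        """Record one evaluation (non-empty lengths) and return the limit to use next."""
        limit = self.max_tokens
        clipped = sum(n >= limit for n in response_lengths) / len(response_lengths)
        self.rewards.append(mean_reward)
        recent = self.rewards[-self.window:]
        plateaued = (
            len(recent) == self.window
            and max(recent) - min(recent) < self.plateau_tol
        )
        last_stage = self.stage == len(STAGES) - 1
        # Lengthen only when reward has stalled AND outputs are being truncated.
        if plateaued and clipped > self.clip_threshold and not last_stage:
            self.stage += 1
            self.rewards.clear()
        return self.max_tokens
```

Requiring both conditions reflects the notes above: a plateau alone does not justify a longer window unless the model is actually running into the current limit.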
- Github repo: https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero, PDF
- The authors collect and clean a dataset of 57k reasoning and math problems, combining open-source math and logic tasks and some synthetic samples.
- They show that scaling the dataset (e.g., going from ~7k to ~57k samples) significantly improves both response correctness and response length (i.e., deeper reasoning steps).
- As training proceeds, models’ response lengths grow steadily, mirroring prior observations from systems like DeepSeek-R1-Zero.
- Emergent “Aha Moments”
- "Step moment": the report highlights “step-function” jumps, where the model suddenly transitions to substantially better performance and longer, more thorough solutions.
- The authors also observe increased average correct reflection length once the model passes certain training thresholds (Figure 7 of the PDF).
Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead
- Github repo: https://github.com/RLHFlow/Online-DPO-R1
- Rule-based approaches like iterative DPO and RAFT can substantially enhance a capable base model’s performance on challenging math benchmarks, while remaining simpler to implement than full PPO training.
- While iterative DPO is simpler to implement and yields strong gains, PPO-based models still reach the highest overall performance. Nonetheless, the gap between DPO and PPO narrows substantially if the model is warm-started with SFT.
- Importance of a strong base model: attempts to apply iterative DPO on Llama-3-8B-Instruct did not yield significant improvements. The authors suggest that Qwen2.5’s stronger math pre-training allows rule-based RL to succeed.
- “Aha Moments” and self-reflection
- The authors did not observe a clear “aha moment” with small-scale open-source models. Instead, they note that Qwen2.5-Math-7B-Base already uses reflective reasoning patterns, and neither iterative DPO nor PPO specifically increases their frequency.
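One way to picture how rule-based rewards feed iterative DPO and RAFT is the data-construction step in each iteration. The sketch below is my own simplification, not the repository's code: the function names are hypothetical, `is_correct` stands in for any rule-based answer checker, and the pairing rule (one random correct response against one random incorrect response) is an assumption.

```python
import random


def build_dpo_pair(prompt, responses, is_correct, rng=None):
    """Turn sampled responses into at most one (chosen, rejected) pair.

    Returns [] when all samples are correct or all are wrong, because
    such a prompt carries no preference signal.
    """
    rng = rng or random.Random(0)
    correct = [r for r in responses if is_correct(r)]
    wrong = [r for r in responses if not is_correct(r)]
    if not correct or not wrong:
        return []
    return [{
        "prompt": prompt,
        "chosen": rng.choice(correct),
        "rejected": rng.choice(wrong),
    }]


def raft_filter(prompt, responses, is_correct):
    """RAFT-style selection: keep only verified-correct samples for fine-tuning."""
    return [{"prompt": prompt, "completion": r} for r in responses if is_correct(r)]
```

In an iterative setup, these functions would run on fresh samples from the current policy at each round, followed by a DPO or supervised update on the resulting data.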
There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study
- Github repo: https://github.com/sail-sg/oat-zero
- The study challenges the common observation that the "Aha Moment" emerges during training, finding that:
- Aha moment at epoch 0: Self-reflection patterns were observed in base models from the start, without requiring any post-training. This suggests the "Aha moment" is present right from the beginning.
- Superficial Self-Reflection (SSR): While some models show self-reflection behavior, many of these are not effective and do not lead to correct answers.
- Response length and RL dynamics: The increasing response length often associated with the "aha moment" is not linked to emergent self-reflection but results from RL optimizing rule-based reward functions. The paper demonstrates that response length changes are a consequence of RL dynamics and not necessarily related to improved reasoning.
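The distinction between self-reflection that appears and self-reflection that helps can be measured with a simple tally. The sketch below is a rough illustration, not the study's methodology: the keyword list and function name are my own assumptions, and keyword matching is only a crude proxy for reflective behavior.

```python
import re

# Hypothetical keyword pool; a real analysis would need a more careful detector.
REFLECTION_PATTERN = re.compile(
    r"\b(wait|recheck|rethink|re-evaluate|try again|double[- ]check)\b",
    re.IGNORECASE,
)


def reflection_stats(samples):
    """Summarize how often reflection appears and how often it pays off.

    `samples` is a list of (response_text, is_correct) tuples.
    """
    reflective = [ok for text, ok in samples if REFLECTION_PATTERN.search(text)]
    total = len(samples)
    return {
        "reflective_rate": len(reflective) / total if total else 0.0,
        "reflective_accuracy": sum(reflective) / len(reflective) if reflective else 0.0,
        # Reflection that still ends in a wrong answer (the "superficial" case).
        "superficial_count": sum(1 for ok in reflective if not ok),
    }
```

Running this on base-model samples before any RL step is one way to check the "epoch 0" claim: a nonzero `reflective_rate` there means the pattern predates training.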
Demystifying Long Chain-of-Thought Reasoning in LLMs
- A detailed analysis of long CoT reasoning in LLMs. These long CoTs often include branching, backtracking, error-checking, and other advanced features similar to deeper analytical thinking.
- SFT with long CoTs
- Fine-tuning a base model on long CoT data consistently outperforms short CoT training and continues improving with more data.
- Reward shaping for stable length growth
- Simply relying on a verifiable reward can lead to unstable or explosive CoT length growth, causing outputs to exceed the context window and degrade performance.
- The authors introduce a cosine reward function (varies by answer correctness and sequence length) to help stabilize length scaling.
- In addition, applying a repetition penalty mitigates reward hacking where models add repeated content just to accumulate more reward.
- Importance of verifiable signals
- RL becomes more stable and effective when the model’s final answers can be verified using ground-truth signals.
- RL from base models vs. long-CoT–SFT models
- Training RL directly from smaller base models sometimes fails to spark the “aha” behaviors.
- Initializing RL with a model already tuned for long CoT consistently yields more robust exploration of longer reasoning paths.
- Latent abilities and data origins
- The paper also suggests that capabilities like error-correction may already exist as latent skills in base models (due to large-scale pretraining on forum discussions and step-by-step solutions).
- RL and SFT can bring these “buried” skills to the surface with proper rewards, penalties, and prompts.
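The cosine reward and repetition penalty mentioned above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function names and the numeric endpoints are assumptions chosen for the example, and I assume the shaping favors shorter correct answers, penalizes short wrong answers more than long ones, and applies a fixed penalty when the output hits the length limit.

```python
import math


def cosine_interp(length, max_length, at_zero, at_max):
    """Cosine-shaped interpolation from `at_zero` (length 0) to `at_max` (full length)."""
    progress = min(max(length / max_length, 0.0), 1.0)
    return at_max + 0.5 * (at_zero - at_max) * (1.0 + math.cos(math.pi * progress))


def cosine_length_reward(correct, length, max_length,
                         r_correct=(2.0, 1.0),   # assumed (reward at length 0, at max length)
                         r_wrong=(-10.0, 0.0),   # assumed (reward at length 0, at max length)
                         r_exceed=-10.0):        # assumed penalty for hitting the limit
    """Reward that depends on both answer correctness and response length."""
    if length >= max_length:
        return r_exceed
    at_zero, at_max = r_correct if correct else r_wrong
    return cosine_interp(length, max_length, at_zero, at_max)


def repeated_ngram_fraction(tokens, n=3):
    """Fraction of n-grams that duplicate an earlier n-gram (0.0 means no repetition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

A training loop could subtract a multiple of `repeated_ngram_fraction` from the shaped reward, so that padding a wrong answer with repeated text no longer reduces its penalty for free.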
LIMO: Less is More for Reasoning
- The authors challenge the assumption that complex mathematical reasoning requires massive amounts of fine-tuning data. Instead, they show that using fewer than 1k carefully curated training examples can dramatically enhance a model’s ability to solve advanced math problems.
- The LIMO Hypothesis: In foundation models that already have extensive domain knowledge “baked in” from pre-training, sophisticated reasoning capabilities can emerge with only a small number of carefully designed demonstrations.
- Related: LIMR: Less is More for RL Scaling
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
- Authors use Knights and Knaves (K&K) puzzles as controlled data that allows precise measurement of logical reasoning performance.
- They generate Knights and Knaves puzzles of varying complexity (2–8 inhabitants), ensuring each puzzle has a unique, verifiable solution.
- This procedural approach eliminates ambiguity and makes reward-checking straightforward.
- Modified REINFORCE++ algorithm
- Instead of PPO or GRPO, the authors use a modified REINFORCE++ approach.
- Key tweaks include a KL-divergence penalty term to avoid drifting too far from the base model’s distribution and a specialized KL estimator.
- Emergent reasoning behaviors
- As training progresses, the model begins to show verification, multi-path exploration, and self-reflection in the reasoning portion of its output.
- Response length tends to grow over training, but length alone is not a reliable indicator of improved reasoning accuracy.
- Generalization beyond training data
- Despite training on only 5K synthetic logic puzzles, the model demonstrates a transfer effect to tougher math benchmarks like AIME.
- The authors note it outperforms a simple SFT approach, which tends to memorize specific formats rather than develop robust reasoning.
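To show why Knights and Knaves puzzles make reward-checking straightforward, here is a small brute-force checker. It is a sketch under my own assumptions, not the paper's generator: statements are encoded as Python callables over a candidate assignment, the function names are hypothetical, and the reward covers answer correctness only.

```python
from itertools import product


def solve(statements):
    """Return every assignment consistent with the puzzle.

    `statements[i]` is what inhabitant i says, as a function of the full
    assignment `world` (a tuple of bools, True = knight). Knights tell the
    truth and knaves lie, so statement i must be true exactly when i is a knight.
    """
    n = len(statements)
    return [
        world
        for world in product([True, False], repeat=n)
        if all(stmt(world) == world[i] for i, stmt in enumerate(statements))
    ]


def answer_reward(statements, predicted):
    """Return 1.0 if `predicted` matches the puzzle's unique solution, else 0.0."""
    solutions = solve(statements)
    if len(solutions) != 1:
        raise ValueError("puzzle must have exactly one solution")
    return 1.0 if tuple(predicted) == solutions[0] else 0.0


# Example: A says "B is a knave"; B says "A and I are both knights".
puzzle = [
    lambda w: not w[1],
    lambda w: w[0] and w[1],
]
```

For this example the only consistent assignment is A as a knight and B as a knave. With at most 8 inhabitants there are only 256 assignments to enumerate, so the same check can also filter generated puzzles down to those with a unique solution.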
Technical reports
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
Test-time scaling
Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems
Scaling Test-Time Compute Without Verification or RL is Suboptimal
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Concise/Efficient reasoning
Self-Training Elicits Concise Reasoning in Large Language Models
LightThinker: Thinking Step-by-Step Compression
Training Language Models to Reason Efficiently
Math reasoning
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Self-Rewarding Correction for Mathematical Reasoning
More
Thinking Preference Optimization
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
From System 1 to System 2: A Survey of Reasoning Large Language Models
Thanks to Dang Nguyen and Isha Puri for pointing to some of the relevant papers!