Consider using the RL algorithm GRPO for training with the SFT data.
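As a rough illustration of what this suggestion could involve, the sketch below shows the group-relative advantage step that GRPO is built around, with the SFT reference answer reused as a reward signal. Everything here is an assumption made for illustration: the exact-match reward, the function names `exact_match_reward` and `group_relative_advantages`, and the use of the population standard deviation are not taken from the original suggestion, and a real setup would sample the completions from the policy model.

```python
from statistics import mean, pstdev


def exact_match_reward(completion: str, reference: str) -> float:
    """Hypothetical reward: 1.0 if the completion matches the SFT reference answer."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Population std
    is an arbitrary choice for this sketch; eps avoids division by zero when
    all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One SFT example: a reference answer, plus a group of completions that would
# normally be sampled from the policy (hard-coded here for illustration).
reference = "42"
completions = ["42", "41", " 42 ", "forty-two"]

rewards = [exact_match_reward(c, reference) for c in completions]
advantages = group_relative_advantages(rewards)

for c, r, a in zip(completions, rewards, advantages):
    print(f"{c!r}: reward={r}, advantage={a:.3f}")
```

Completions that match the reference get a positive advantage and the others a negative one. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is worth keeping in mind when the SFT data is very easy or very hard for the model.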