[model] feat: support ByteDance Seed-OSS 36B model (verl-project#3347)

chenhaiq · web-flow · commit e90f18c40aa6 · 2025-09-04T22:41:58.000+08:00
### What does this PR do?

Support the ByteDance Seed-OSS 36B model:

1. Add RL and SFT examples.
2. Support MFU metrics.

Requirement: `pip install "transformers>=4.56.0"`

Note: vLLM v0.10.0 does not support Seed-OSS, but it can fall back to transformers to get the model working.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

(TaskRunner pid=373084) step:2 - global_seqlen/min:6260 - global_seqlen/max:11318 - global_seqlen/minmax_diff:5058 - global_seqlen/balanced_min:8466 - global_seqlen/balanced_max:8468 - global_seqlen/mean:8467.375 - actor/entropy:0.47251570224761963 - actor/kl_loss:0.03297248564194888 - actor/kl_coef:0.001 - actor/pg_loss:-0.0494408356025815 - actor/pg_clipfrac:0.019900403218343854 - actor/ppo_kl:0.020935473148711026 - actor/pg_clipfrac_lower:9.349289757665247e-05 - actor/grad_norm:0.47875913605093956 - perf/mfu/actor:0.2823303751694612 - perf/max_memory_allocated_gb:134.74115753173828 - perf/max_memory_reserved_gb:141.615234375 - perf/cpu_memory_used_gb:150.75712203979492 - actor/lr:1e-06 - training/global_step:2 - training/epoch:0 - critic/score/mean:0.3515625 - critic/score/max:1.0 - critic/score/min:0.0 - critic/rewards/mean:0.3515625 - critic/rewards/max:1.0 - critic/rewards/min:0.0 - critic/advantages/mean:-0.023741308599710464 - critic/advantages/max:0.7071057558059692 - critic/advantages/min:-0.7071057558059692 - critic/returns/mean:-0.023741308599710464 - critic/returns/max:0.7071057558059692 - critic/returns/min:-0.7071057558059692 - response_length/mean:444.4296875 - response_length/max:1024.0 - response_length/min:50.0 - response_length/clip_ratio:0.140625 - response_length_non_aborted/mean:444.4296875 - response_length_non_aborted/max:1024.0 - response_length_non_aborted/min:50.0 - response_length_non_aborted/clip_ratio:0.140625 - response/aborted_ratio:0.0 - prompt_length/mean:84.78125 - prompt_length/max:141.0 - prompt_length/min:54.0 - prompt_length/clip_ratio:0.0 - timing_s/start_profile:6.250300793908536e-05 - timing_s/generate_sequences:21.979598999023438 - timing_s/generation_timing/max:22.295286178588867 - timing_s/generation_timing/min:21.753456115722656 - timing_s/generation_timing/topk_ratio:0.125 - timing_s/gen:39.58543623800506 - timing_s/reward:0.031087818002561107 - timing_s/old_log_prob:17.46088112698635 - timing_s/ref:5.804751824995037 - timing_s/adv:0.003937039989978075 - timing_s/update_actor:57.383965655986685 - timing_s/step:120.27422251200187 - timing_s/stop_profile:6.923600449226797e-05 - timing_per_token_ms/gen:0.6958608511260053 - timing_per_token_ms/ref:0.08569290696637147 - timing_per_token_ms/adv:5.8120727940744256e-05 - timing_per_token_ms/update_actor:0.8471333449857052 - perf/total_num_tokens:67739 - perf/time_per_step:120.27422251200187 - perf/throughput:70.40057980133741

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
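
The step-2 metrics in the Test section can be cross-checked against the GRPO example's settings. The sketch below is a small sanity check, not code from this PR. It assumes that `perf/throughput` means tokens per second per GPU (total tokens divided by step time and GPU count) and that the logged run used `data.train_batch_size=64`, `rollout.n=2`, and 8 GPUs, as in `examples/grpo_trainer/run_seed_oss_36b.sh`. All variable names are illustrative.

```python
# Values copied from the step-2 log line in the Test section.
total_num_tokens = 67739
time_per_step_s = 120.27422251200187
mean_prompt_len = 84.78125
mean_response_len = 444.4296875

# Assumed run settings, taken from examples/grpo_trainer/run_seed_oss_36b.sh.
train_batch_size = 64
rollouts_per_prompt = 2
n_gpus = 8

# Each prompt is sampled `rollouts_per_prompt` times.
num_sequences = train_batch_size * rollouts_per_prompt

# Token count rebuilt from the mean prompt and response lengths.
tokens_from_means = num_sequences * (mean_prompt_len + mean_response_len)

# Average token load per GPU (compare with global_seqlen/mean).
tokens_per_gpu = total_num_tokens / n_gpus

# Assumed definition: tokens per second per GPU.
throughput = total_num_tokens / (time_per_step_s * n_gpus)

print(num_sequences, tokens_from_means, tokens_per_gpu, round(throughput, 2))
```

Under these assumptions the three derived values line up with the logged `perf/total_num_tokens`, `global_seqlen/mean`, and `perf/throughput`.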
diff --git a/examples/grpo_trainer/run_seed_oss_36b.sh b/examples/grpo_trainer/run_seed_oss_36b.sh
@@ -0,0 +1,48 @@
+set -x
+
+python3 -m verl.trainer.main_ppo \
+    algorithm.adv_estimator=grpo \
+    data.train_files=$HOME/data/gsm8k/train.parquet \
+    data.val_files=$HOME/data/gsm8k/test.parquet \
+    data.train_batch_size=64 \
+    data.max_prompt_length=512 \
+    data.max_response_length=1024 \
+    data.filter_overlong_prompts=True \
+    data.truncation='error' \
+    actor_rollout_ref.model.path=ByteDance-Seed/Seed-OSS-36B-Base \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.model.use_remove_padding=True \
+    actor_rollout_ref.model.enable_gradient_checkpointing=True \
+    actor_rollout_ref.model.use_fused_kernels=True \
+    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.actor.use_kl_loss=True \
+    actor_rollout_ref.actor.kl_loss_coef=0.001 \
+    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
+    actor_rollout_ref.actor.entropy_coeff=0 \
+    actor_rollout_ref.actor.use_dynamic_bsz=True \
+    actor_rollout_ref.actor.strategy=fsdp2 \
+    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True \
+    actor_rollout_ref.actor.fsdp_config.param_offload=True \
+    actor_rollout_ref.actor.fsdp_config.param_offload=True \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
+    actor_rollout_ref.rollout.n=2 \
+    actor_rollout_ref.rollout.free_cache_engine=True \
+    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
+    actor_rollout_ref.ref.fsdp_config.param_offload=True \
+    actor_rollout_ref.ref.strategy=fsdp2 \
+    algorithm.use_kl_in_reward=False \
+    trainer.critic_warmup=0 \
+    trainer.logger='["console"]' \
+    trainer.project_name='verl_grpo_seed_oss_36b' \
+    trainer.experiment_name='seed_oss_36b' \
+    trainer.val_before_train=False \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=20 \
+    trainer.test_freq=5 \
+    trainer.total_epochs=15 "$@"
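
The script above selects `algorithm.adv_estimator=grpo` with `actor_rollout_ref.rollout.n=2`, and the test log reports `critic/advantages/max` and `critic/advantages/min` of about ±0.7071. The following is a minimal sketch of group-normalized advantages that reproduces that magnitude. It assumes the group statistic is the sample (unbiased) standard deviation and that a small epsilon of `1e-6` is added to the denominator; both are assumptions for illustration, and `grpo_advantages` is a hypothetical helper, not a function from this PR.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's rollouts by group mean and std.

    Assumes at least two rollouts per prompt and the sample (n - 1) variance.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / (n - 1)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Two rollouts for one prompt: one correct (reward 1), one wrong (reward 0).
mixed = grpo_advantages([1.0, 0.0])

# Both rollouts agree, so neither carries a learning signal.
tied = grpo_advantages([1.0, 1.0])

print(mixed, tied)
```

With two rollouts per prompt, a mixed group can only produce advantages near ±1/√2 and a tied group produces zeros, which is consistent with the logged extremes.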
diff --git a/examples/sft/gsm8k/run_seed_oss_36b_sft.sh b/examples/sft/gsm8k/run_seed_oss_36b_sft.sh
@@ -0,0 +1,31 @@
+set -x
+
+if [ "$#" -lt 2 ]; then
+    echo "Usage: run_seed_oss_36b_sft.sh <nproc_per_node> <save_path> [other_configs...]"
+    exit 1
+fi
+
+nproc_per_node=$1
+save_path=$2
+
+# Shift the arguments so $@ refers to the rest
+shift 2
+
+torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
+     -m verl.trainer.fsdp_sft_trainer \
+    data.train_files=$HOME/data/gsm8k/train.parquet \
+    data.val_files=$HOME/data/gsm8k/test.parquet \
+    data.prompt_key=extra_info \
+    data.response_key=extra_info \
+    optim.lr=1e-4 \
+    data.prompt_dict_keys=['question'] \
+    +data.response_dict_keys=['answer'] \
+    data.micro_batch_size=4 \
+    model.partial_pretrain=ByteDance-Seed/Seed-OSS-36B-Base \
+    trainer.default_local_dir=$save_path \
+    trainer.project_name=gsm8k-sft \
+    trainer.experiment_name=gsm8k-sft-seed-oss-36b \
+    trainer.logger=console \
+    trainer.total_training_steps=1 \
+    ulysses_sequence_parallel_size=2 \
+    use_remove_padding=true "$@"
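
The SFT script points both `data.prompt_key` and `data.response_key` at the `extra_info` column and then narrows each with `prompt_dict_keys=['question']` and `response_dict_keys=['answer']`. A rough sketch of that nested lookup is shown below. It assumes each dataset row stores a dictionary under `extra_info` containing `question` and `answer` fields; `extract_field` and the sample row are hypothetical and only illustrate the idea.

```python
def extract_field(row, column_key, dict_keys):
    """Select a column from a row, then walk into it using the given keys."""
    value = row[column_key]
    for key in dict_keys:
        value = value[key]
    return value


# Hypothetical row shaped like a preprocessed GSM8K record.
row = {
    "extra_info": {
        "question": "What is 2 + 3?",
        "answer": "2 + 3 = 5\n#### 5",
    }
}

prompt = extract_field(row, "extra_info", ["question"])
response = extract_field(row, "extra_info", ["answer"])

print(prompt)
print(response)
```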
diff --git a/verl/utils/flops_counter.py b/verl/utils/flops_counter.py
@@ -30,6 +30,7 @@
     "minicpmo",
     "mistral",
     "gemma3_text",
+    "seed_oss",
 }
 
 
@@ -130,6 +131,7 @@ def __init__(self, config: PretrainedConfig):
             "minicpmo": self._estimate_qwen2_flops,
             "mistral": self._estimate_qwen2_flops,
             "gemma3_text": self._estimate_gemma3_flops,
+            "seed_oss": self._estimate_qwen2_flops,
         }
         self.config = config
 
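
The diff above registers `seed_oss` as a supported model type and routes it to the existing Qwen2-style estimator. The sketch below illustrates that pattern: a lookup table keyed by `model_type` and a dense-decoder FLOPs estimate. It is a simplified approximation written for illustration, not the code in `verl/utils/flops_counter.py`. The `ToyConfig` class, the function names, the fallback of `0.0` for unknown types, and the toy sizes are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a Hugging Face PretrainedConfig."""
    model_type: str
    hidden_size: int
    intermediate_size: int
    num_hidden_layers: int
    num_attention_heads: int
    num_key_value_heads: int
    head_dim: int
    vocab_size: int


def estimate_dense_tflops(cfg, batch_seqlens, delta_time):
    """Approximate achieved TFLOPS for a forward + backward pass."""
    tokens_sum = sum(batch_seqlens)
    q_size = cfg.num_attention_heads * cfg.head_dim
    kv_size = cfg.num_key_value_heads * cfg.head_dim

    # Per-layer weights: gated MLP (3 matrices) and Q, K, V, O projections.
    mlp_params = 3 * cfg.hidden_size * cfg.intermediate_size
    attn_params = cfg.hidden_size * (q_size + 2 * kv_size + q_size)
    # Input embedding plus LM head.
    embed_params = 2 * cfg.vocab_size * cfg.hidden_size
    dense_params = (mlp_params + attn_params) * cfg.num_hidden_layers + embed_params

    # 6 FLOPs per parameter per token (2 forward, 4 backward).
    dense_flops = 6 * dense_params * tokens_sum
    # Attention score and value matmuls scale with sequence length squared.
    seqlen_sq_sum = sum(s * s for s in batch_seqlens)
    attn_flops = (12 * seqlen_sq_sum * cfg.head_dim
                  * cfg.num_attention_heads * cfg.num_hidden_layers)
    return (dense_flops + attn_flops) / delta_time / 1e12


# Model types that share the same dense-decoder estimate in this sketch.
ESTIMATORS = {
    "qwen2": estimate_dense_tflops,
    "seed_oss": estimate_dense_tflops,
}


def estimate_tflops(cfg, batch_seqlens, delta_time):
    """Dispatch on model_type; unknown types report 0.0 in this sketch."""
    estimator = ESTIMATORS.get(cfg.model_type)
    if estimator is None:
        return 0.0
    return estimator(cfg, batch_seqlens, delta_time)


toy = ToyConfig("seed_oss", hidden_size=4, intermediate_size=8,
                num_hidden_layers=2, num_attention_heads=2,
                num_key_value_heads=1, head_dim=2, vocab_size=10)
print(estimate_tflops(toy, [3, 5], delta_time=1.0))
```

An MFU figure such as `perf/mfu/actor` would then be the achieved TFLOPS divided by the hardware's peak TFLOPS, which this sketch leaves out.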

`135`	`137`