#!/usr/bin/env bash
set -euo pipefail
set -x

# Joint training recipe (DAPO + Jackpot) on GSM8K.
# - Two namespaces (`large`, `small`) are trained jointly.
# - Rollout/logprob provider is `small`.
# - Pairwise KL in `trainer.topologies` couples updates between both models.

# Dataset preparation (from docs/examples/gsm8k_example.rst):
#   cd examples/data_preprocess
#   python3 gsm8k.py --local_save_dir ~/data/gsm8k
DATA_ROOT=${DATA_ROOT:-$HOME/data/gsm8k}
TRAIN_FILE=${TRAIN_FILE:-$DATA_ROOT/train.parquet}
VAL_FILE=${VAL_FILE:-$DATA_ROOT/test.parquet}

if [[ ! -f "$TRAIN_FILE" || ! -f "$VAL_FILE" ]]; then
  echo "Missing GSM8K parquet files under $DATA_ROOT."
  echo "Run: cd examples/data_preprocess && python3 gsm8k.py --local_save_dir $DATA_ROOT"
  exit 1
fi

LARGE_MODEL=${LARGE_MODEL:-Qwen/Qwen3-0.6B-Base}
SMALL_MODEL=${SMALL_MODEL:-Qwen/Qwen2.5-0.5B}
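# Note: the pairwise KL terms below compare token-level distributions, which
# assumes both namespaces share a tokenizer/vocabulary. The Qwen defaults here
# are expected to be token-compatible; if you swap in models from different
# families, verify tokenizer compatibility first.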

TRAIN_BATCH_SIZE=${TRAIN_BATCH_SIZE:-64}
PPO_MINI_BATCH_SIZE=${PPO_MINI_BATCH_SIZE:-64}
ROLLOUT_N=${ROLLOUT_N:-8}
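# With the defaults above, each generation step samples TRAIN_BATCH_SIZE *
# ROLLOUT_N = 64 * 8 = 512 responses (before any DAPO group filtering below).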

# DAPO clipping / filtering knobs.
CLIP_RATIO_LOW=${CLIP_RATIO_LOW:-0.2}
CLIP_RATIO_HIGH=${CLIP_RATIO_HIGH:-0.28}
ENABLE_FILTER_GROUPS=${ENABLE_FILTER_GROUPS:-True}
FILTER_GROUPS_METRIC=${FILTER_GROUPS_METRIC:-acc}
MAX_NUM_GEN_BATCHES=${MAX_NUM_GEN_BATCHES:-10}
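# How these interact (standard DAPO semantics): the PPO ratio is clipped to
# [1 - CLIP_RATIO_LOW, 1 + CLIP_RATIO_HIGH]; the asymmetric upper bound
# ("clip-higher") leaves more room to raise the probability of rare tokens.
# Group filtering drops prompt groups whose ROLLOUT_N samples all score the
# same on FILTER_GROUPS_METRIC (no advantage signal) and regenerates, up to
# MAX_NUM_GEN_BATCHES generation rounds per training batch.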

# DAPO overlong buffer knobs.
ENABLE_OVERLONG_BUFFER=${ENABLE_OVERLONG_BUFFER:-True}
OVERLONG_BUFFER_LEN=${OVERLONG_BUFFER_LEN:-2048}
OVERLONG_PENALTY_FACTOR=${OVERLONG_PENALTY_FACTOR:-1.0}
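# The overlong buffer applies DAPO's soft length penalty: responses that run
# into the last OVERLONG_BUFFER_LEN tokens before max_response_length are
# penalized linearly, scaled by OVERLONG_PENALTY_FACTOR, as they approach the
# hard cap, rather than being penalized only at truncation.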

# Jackpot knobs (OBRS correction):
# - actor.use_jackpot=True enables Jackpot token reweighting.
# - actor.jackpot_use_latest_logits=True recomputes overlap using the current policy's logits.
# - actor.jackpot_log_probs_to_keep controls the top-k width used for overlap mass estimation.
# - actor.jackpot_lambda controls acceptance strictness (higher => stricter correction).
# - actor.jackpot_clip_ratio caps Jackpot importance weights for stability.
# - actor.jackpot_use_topk_renorm=True renormalizes overlap mass inside the kept top-k.
# - rollout.calculate_log_probs=True and rollout.log_probs_to_keep must stay enabled for Jackpot.
JACKPOT_LOGPROBS_TO_KEEP=${JACKPOT_LOGPROBS_TO_KEEP:-20}
JACKPOT_LAMBDA=${JACKPOT_LAMBDA:-1.0}
JACKPOT_CLIP_RATIO=${JACKPOT_CLIP_RATIO:-16.0}
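
# Optional sanity check: Jackpot reads the rollout's kept top-k log-probs, so
# fail fast here if the top-k width is not a positive integer. The same value
# is wired to both actor.jackpot_log_probs_to_keep and rollout.log_probs_to_keep
# below, so they always match.
if ! [[ "$JACKPOT_LOGPROBS_TO_KEEP" =~ ^[0-9]+$ ]] || (( JACKPOT_LOGPROBS_TO_KEEP < 1 )); then
  echo "JACKPOT_LOGPROBS_TO_KEEP must be a positive integer, got: $JACKPOT_LOGPROBS_TO_KEEP"
  exit 1
fi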

FWD_KL_SMALL=${FWD_KL_SMALL:-0.1}
REV_KL_LARGE=${REV_KL_LARGE:-0.00}
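# KL pair semantics (mirrors trainer.topologies below): FWD_KL_SMALL weights
# the forward pair (p=large, q=small) trained on `small`, pulling the small
# model toward the large one; REV_KL_LARGE weights the reverse pair trained on
# `large` (with importance sampling) and is disabled by default (coef 0.00).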

python3 -m recipe.dapo.main_dapo \
  data.train_files="$TRAIN_FILE" \
  data.val_files="$VAL_FILE" \
  data.train_batch_size="${TRAIN_BATCH_SIZE}" \
  data.max_prompt_length=1024 \
  data.max_response_length=4096 \
  data.filter_overlong_prompts=True \
  data.truncation=error \
  +ray_kwargs.ray_init.object_store_memory=144000000000 \
  trainer.namespace=large \
  trainer.train_namespaces=[large,small] \
  trainer.rollout_from=small \
  trainer.critic_warmup=0 \
  trainer.logger=[console,wandb] \
  trainer.project_name=jackpot_gsm8k_release_dapo \
  trainer.experiment_name=qwen_dual_kl_dapo_gsm8k \
  trainer.n_gpus_per_node=1 \
  trainer.nnodes=1 \
  trainer.save_freq=16 \
  trainer.test_freq=16 \
  trainer.total_epochs=8 \
  trainer.max_actor_ckpt_to_keep=1 \
  trainer.val_before_train=False \
  trainer.validation_use_train_namespace=True \
  trainer.resource_pool_name=global_pool \
  actor_rollout_ref.model.path="${LARGE_MODEL}" \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.model.use_remove_padding=True \
  actor_rollout_ref.model.enable_gradient_checkpointing=True \
  actor_rollout_ref.actor.ppo_mini_batch_size="${PPO_MINI_BATCH_SIZE}" \
  actor_rollout_ref.actor.use_dynamic_bsz=True \
  actor_rollout_ref.actor.ppo_max_token_len_per_gpu=9000 \
  actor_rollout_ref.actor.use_kl_loss=False \
  actor_rollout_ref.actor.clip_ratio_low="${CLIP_RATIO_LOW}" \
  actor_rollout_ref.actor.clip_ratio_high="${CLIP_RATIO_HIGH}" \
  actor_rollout_ref.actor.clip_ratio_c=10.0 \
  actor_rollout_ref.actor.use_jackpot=True \
  actor_rollout_ref.actor.jackpot_use_latest_logits=True \
  actor_rollout_ref.actor.jackpot_log_probs_to_keep="${JACKPOT_LOGPROBS_TO_KEEP}" \
  actor_rollout_ref.actor.jackpot_lambda="${JACKPOT_LAMBDA}" \
  actor_rollout_ref.actor.jackpot_clip_ratio="${JACKPOT_CLIP_RATIO}" \
  actor_rollout_ref.actor.jackpot_use_topk_renorm=True \
  actor_rollout_ref.actor.entropy_coeff=0 \
  actor_rollout_ref.actor.fsdp_config.param_offload=True \
  actor_rollout_ref.actor.fsdp_config.strategy=fsdp2 \
  actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
  actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
  actor_rollout_ref.rollout.n="${ROLLOUT_N}" \
  actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=30000 \
  actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True \
  actor_rollout_ref.rollout.free_cache_engine=True \
  actor_rollout_ref.rollout.mode=sync \
  actor_rollout_ref.rollout.calculate_log_probs=True \
  actor_rollout_ref.rollout.log_probs_to_keep="${JACKPOT_LOGPROBS_TO_KEEP}" \
  actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=30000 \
  actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True \
  actor_rollout_ref.ref.fsdp_config.param_offload=True \
  actor_rollout_ref.ref.fsdp_config.strategy=fsdp2 \
  algorithm.adv_estimator=grpo \
  algorithm.use_kl_in_reward=False \
  algorithm.filter_groups.enable="${ENABLE_FILTER_GROUPS}" \
  algorithm.filter_groups.max_num_gen_batches="${MAX_NUM_GEN_BATCHES}" \
  algorithm.filter_groups.metric="${FILTER_GROUPS_METRIC}" \
  reward_model.reward_manager=dapo \
  reward_model.overlong_buffer.enable="${ENABLE_OVERLONG_BUFFER}" \
  reward_model.overlong_buffer.len="${OVERLONG_BUFFER_LEN}" \
  reward_model.overlong_buffer.penalty_factor="${OVERLONG_PENALTY_FACTOR}" \
  "+trainer.topologies=[{name:dual_kl,rollout:small,logprob:small,train:[small,large],logprob_map:{large:large},kl_pairs:[{name:fwd_small_vs_large,train:small,p:large,q:small,mode:forward,coef:${FWD_KL_SMALL}},{name:rev_large_vs_small,train:large,p:large,q:small,mode:reverse,coef:${REV_KL_LARGE},use_is:true}]}]" \
  '+trainer.topology_loop=[{name:dual_kl,repeat:1}]' \
  "+trainer.worker_namespaces=[{name:small,train:true,config:{trainer:{rollout_from:small},actor_rollout_ref:{model:{path:'${SMALL_MODEL}'},actor:{optim:{lr:1e-6},ppo_mini_batch_size:${PPO_MINI_BATCH_SIZE},use_dynamic_bsz:true,ppo_max_token_len_per_gpu:12000,use_jackpot:true,jackpot_use_latest_logits:true,jackpot_log_probs_to_keep:${JACKPOT_LOGPROBS_TO_KEEP},jackpot_lambda:${JACKPOT_LAMBDA},jackpot_clip_ratio:${JACKPOT_CLIP_RATIO},jackpot_use_topk_renorm:true,fsdp_config:{param_offload:true,optimizer_offload:true}},ref:{log_prob_max_token_len_per_gpu:30000},rollout:{tensor_model_parallel_size:1,gpu_memory_utilization:0.6,log_prob_max_token_len_per_gpu:30000,log_prob_use_dynamic_bsz:true,free_cache_engine:true,mode:sync,calculate_log_probs:true,log_probs_to_keep:${JACKPOT_LOGPROBS_TO_KEEP}}}}}]" \
  "$@"
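
# Example usage (the script filename is illustrative). Environment variables
# override the defaults above, and trailing arguments are forwarded to the
# trainer as extra overrides via "$@":
#   DATA_ROOT=/mnt/data/gsm8k ROLLOUT_N=4 \
#     bash run_dapo_jackpot_gsm8k.sh trainer.total_epochs=4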