Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,7 @@ Our long-term goal is to **advance personalized, practically useful agents with
✅ **Release Track 1:** Fully async OpenClaw-RL framework with Binary RL + OPD
✅ Best recipe discovery via demonstration experiments
✅ Support LoRA Training
✅ Single-GPU INT4 QLoRA with offload-rollout and external PRM API (24 GB VRAM)
✅ Deploy training on [Tinker](https://thinkingmachines.ai/tinker/)
✅ Deploy training on [Fireworks AI](https://fireworks.ai)

Expand Down Expand Up @@ -241,6 +242,26 @@ If you're interested in any of these, feel free to open an issue to discuss your

For detailed environment setup, see [Slime](https://github.com/THUDM/slime) or [`./instructions/README.md`](./instructions/README.md).

#### Only have a single GPU (≥ 24 GB)?

We support **single-GPU INT4 QLoRA** training and inference via:

- **INT4 QLoRA** — bitsandbytes NF4 quantization + LoRA (rank=4), base weights ~2 GB VRAM
- **Colocate + CPU Offload** — training and rollout share one GPU, offloading when idle
- **External PRM API** — PRM scoring via any OpenAI-compatible API, no local PRM GPU needed

```bash
cd slime
export HF_CKPT="/path/to/Qwen3-4B" # original HF weights (not pre-quantized)
export OPENAI_API_KEY="your-api-key" # external LLM API key (for PRM & OPD)
export OPENAI_BASE_URL="https://your-api-base"
export EXTERNAL_MODEL="your-model-name"

bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_single_gpu_int4_qlora.sh
```

See [`./openclaw-combine/README.md`](./openclaw-combine/README.md) for full parameter reference and [`./openclaw-test/README.md`](./openclaw-test/README.md) for single-GPU evaluation scripts.



#### Don't have a GPU?
Expand Down
50 changes: 50 additions & 0 deletions openclaw-combine/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,3 +66,53 @@ openclaw-combine/
├── combine_loss.py # Weighted advantage: w_rl * GRPO + w_opd * teacher
└── results/ # Runtime records (auto-created)
```

---

## Single-GPU INT4 QLoRA

Train and serve on a **single 24 GB GPU** using `run_qwen3_4b_openclaw_combine_single_gpu_int4_qlora.sh`.

### Required Environment Variables

| Variable | Description |
|---|---|
| `HF_CKPT` | Path to original Qwen3-4B HF weights (**not** pre-quantized) |
| `OPENAI_API_KEY` | API key for external LLM (used by PRM and OPD teacher) |
| `OPENAI_BASE_URL` | Base URL for the external LLM API |
| `EXTERNAL_MODEL` | Model name served by the external API |

### Optional Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MICRO_BATCH_SIZE` | `1` | Micro-batch size per GPU |
| `ROLLOUT_BATCH_SIZE` | `16` | Samples collected before each training step |

### Key Features Enabled

| Feature | Flag / Config | Description |
|---|---|---|
| INT4 quantization | `--fsdp-load-in-4bit` | bitsandbytes NF4 quantization. Compresses base model weights to ~4 bits, reducing VRAM from ~8 GB (bf16) to ~2 GB for Qwen3-4B. Only LoRA adapters are trained in full precision. |
| LoRA fine-tuning | `--use-lora --lora-rank 4` | Low-Rank Adaptation. Freezes base weights and inserts small trainable matrices into attention/MLP layers.
| Gradient checkpointing | `--gradient-checkpointing` | Trades compute for memory by recomputing activations during backward pass instead of storing them. |
| Colocate (train + rollout on same GPU) | `--colocate` | Runs both the FSDP training actor and the SGLang rollout engine on the same GPU(s). Essential for single-GPU setups. |
| Rollout offload | `--offload-rollout` | When training starts, the SGLang rollout engine is offloaded from GPU to free VRAM for the training forward/backward pass. When training finishes, the engine is reloaded for the next rollout. This alternating pattern allows a single GPU to handle both roles. |
| External PRM API | `--prm-use-external-api` | Sends PRM scoring requests to an external OpenAI-compatible API instead of hosting a local PRM model on GPU. Eliminates the need for a dedicated PRM GPU — critical for single-GPU setups. Configured via `PRM_EXTERNAL_API_BASE`, `PRM_EXTERNAL_MODEL`, and `PRM_EXTERNAL_API_KEY`. |
| Train sequence truncation | `SLIME_TRAIN_MAX_SEQ_LEN=4096` | Caps the token length of each training sample. Sequences longer than this are truncated from the **left** (dropping early prompt tokens, preserving the response). Prevents OOM on long rollouts and keeps per-sample memory bounded. |
| Logit chunking | `SLIME_LOGIT_CHUNK_SIZE=512` | Computes log-softmax and entropy in chunks of this size instead of materializing the full `[seq_len, vocab_size]` tensor at once. For Qwen3-4B (vocab ~152 K), this reduces peak allocation from ~7.5 GB to ~0.9 GB per sample. Set to `0` to disable chunking. |

### Quick Start

```bash
cd slime

export HF_CKPT="/path/to/Qwen3-4B"
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://your-api-base"
export EXTERNAL_MODEL="your-model-name"

bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_single_gpu_int4_qlora.sh
```

The model is served at `http://0.0.0.0:30000/v1` by default. For single-GPU evaluation scripts, see [`../openclaw-test/README.md`](../openclaw-test/README.md).
Original file line number Diff line number Diff line change
@@ -0,0 +1,192 @@
#!/bin/bash

set -euo pipefail
set -x
export PYTORCH_ALLOC_CONF=expandable_segments:True
export SLIME_TRAINING_SAMPLES_FILE="../results/samples.jsonl"
export PYTHONUNBUFFERED=1
export SLIME_LOGIT_CHUNK_SIZE=512
export PYTHONFAULTHANDLER=1
export OPENCLAW_GATEWAY_TOKEN=""
export OPENAI_API_KEY=""
export OPENCLAW_GATEWAY_URL=""
export OPENCLAW_WORKSPACE="$HOME/.openclaw/workspace"
export OPENAI_BASE_URL="" # point to your external LLM
export EXTERNAL_MODEL="" # model name for the external LLM
export SLIME_TRAIN_MAX_SEQ_LEN=4096 # truncation will happen if context + response exceed this length
export SGLANG_ENABLE_LOGITS_PROCESSER_CHUNK=True
export SGLANG_LOGITS_PROCESSER_CHUNK_SIZE=128 # to avoid OOM with long context + response in INT4
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
SLIME_ROOT="$(cd -- "${SCRIPT_DIR}/../slime" &>/dev/null && pwd)"

NUM_GPUS=${NUM_GPUS:-1}
ACTOR_GPUS=${ACTOR_GPUS:-1}
ROLLOUT_GPUS=${ROLLOUT_GPUS:-1}

# For bitsandbytes QLoRA, use the original HF checkpoint (do NOT use INT4 dir).
HF_CKPT=${HF_CKPT:-to/models/Qwen3-4B}
REF_LOAD=${REF_LOAD:-${HF_CKPT}}
SAVE_CKPT=${SAVE_CKPT:-/root/shared-nvme/OpenClaw-RL/models/qwen3_4b_openclaw_combine_single_gpu_int4_qlora_ckpt}
PRM_MODEL_PATH=${PRM_MODEL_PATH:-${HF_CKPT}}
# External PRM API (OpenAI-compatible)
PRM_EXTERNAL_API_BASE=${PRM_EXTERNAL_API_BASE:-${OPENAI_BASE_URL:-}}
PRM_EXTERNAL_MODEL=${PRM_EXTERNAL_MODEL:-${EXTERNAL_MODEL:-}}
PRM_EXTERNAL_API_KEY=${PRM_EXTERNAL_API_KEY:-${OPENAI_API_KEY:-}}

if [[ -z "${PRM_EXTERNAL_API_BASE}" || -z "${PRM_EXTERNAL_MODEL}" ]]; then
echo "PRM_EXTERNAL_API_BASE and PRM_EXTERNAL_MODEL are required (or set OPENAI_BASE_URL / EXTERNAL_MODEL)."
exit 1
fi

export SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-qwen3-4b-int4}"
export HOST="0.0.0.0"
export PORT="${PORT:-30000}"

export OPENCLAW_RECORD_ENABLED="${OPENCLAW_RECORD_ENABLED:-1}"
export OPENCLAW_RECORD_FILE="${SCRIPT_DIR}/results/qwen3_4b_single_gpu_int4_qlora_record.jsonl"
export OPENCLAW_EVAL_MODE="${OPENCLAW_EVAL_MODE:-1}"

export OPENCLAW_COMBINE_W_RL="${OPENCLAW_COMBINE_W_RL:-1.0}"
export OPENCLAW_COMBINE_W_OPD="${OPENCLAW_COMBINE_W_OPD:-1.0}"
export TRAIN_EPOCHS="${TRAIN_EPOCHS:-1}"
export PRM_M="${PRM_M:-1}"
export OPENCLAW_OPD_TEACHER_LP_MAX_CONCURRENCY="${OPENCLAW_OPD_TEACHER_LP_MAX_CONCURRENCY:-1}"

ray stop --force || true
pkill -9 sglang || true
pkill -9 ray || true

export MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
export no_proxy="127.0.0.1,${MASTER_ADDR}"
ray start --head --node-ip-address "${MASTER_ADDR}" --num-gpus "${NUM_GPUS}" --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265

CKPT_ARGS=(
--hf-checkpoint "${HF_CKPT}"
--ref-load "${REF_LOAD}"
--save "${SAVE_CKPT}"
--save-interval 20
)

ROLLOUT_ARGS=(
--disable-rollout-global-dataset
--rollout-function-path openclaw_combine_rollout.generate_rollout_openclaw_combine
--num-rollout ${NUM_ROLLOUT:-20000}
--rollout-batch-size ${ROLLOUT_BATCH_SIZE:-16}
--n-samples-per-prompt 1
--rollout-max-response-len ${ROLLOUT_MAX_RESPONSE_LEN:-4096}
--rollout-max-context-len ${ROLLOUT_MAX_CONTEXT_LEN:-22768}
--rollout-temperature ${ROLLOUT_TEMPERATURE:-0.6}
--reward-key score
--num-steps-per-rollout 1
)

COMBINE_ARGS=(
--advantage-estimator grpo
--disable-rewards-normalization
--loss-type custom_loss
--custom-loss-function-path combine_loss.combine_loss_function
--use-kl-loss
--kl-loss-coef 0.0
--kl-loss-type low_var_kl
--entropy-coef 0.00
--eps-clip 0.2
--eps-clip-high 0.28
)

OPTIMIZER_ARGS=(
--optimizer adam
--lr ${LR:-1e-5}
--lr-decay-style constant
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.98
)

PERF_ARGS=(
--micro-batch-size ${MICRO_BATCH_SIZE:-1}
--max-tokens-per-gpu ${MAX_TOKENS_PER_GPU:-4096}
--gradient-checkpointing
)

# FSDP QLoRA (INT4 base + LoRA adapters)
QLORA_ARGS=(
--use-lora
--lora-rank ${LORA_RANK:-4}
--lora-alpha ${LORA_ALPHA:-4}
--lora-target-modules "${LORA_TARGET_MODULES:-q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj}"
--fsdp-load-in-4bit
--fsdp-bnb-4bit-quant-type ${FSDP_BNB_4BIT_QUANT_TYPE:-nf4}
--fsdp-bnb-4bit-compute-dtype ${FSDP_BNB_4BIT_COMPUTE_DTYPE:-bfloat16}
--fsdp-bnb-4bit-use-double-quant
--fsdp-prepare-model-for-kbit-training
)

SGLANG_ARGS=(
--rollout-num-gpus-per-engine 1
--sglang-context-length ${SGLANG_CONTEXT_LENGTH:-22768}
--sglang-mem-fraction-static ${SGLANG_MEM_FRACTION_STATIC:-0.6}
--sglang-reasoning-parser ${SGLANG_REASONING_PARSER:-qwen3}
--sglang-tool-call-parser ${SGLANG_TOOL_CALL_PARSER:-qwen25}
)

# External PRM: no local PRM engine / GPU allocation.
PRM_ARGS=(
--prm-enable
--prm-use-external-api
--prm-num-gpus 1
--prm-m ${PRM_M}
--prm-temperature ${PRM_TEMPERATURE:-0.6}
--prm-max-new-tokens ${PRM_MAX_NEW_TOKENS:-4096}
--prm-external-api-base "${PRM_EXTERNAL_API_BASE}"
--prm-external-model "${PRM_EXTERNAL_MODEL}"
)
if [[ -n "${PRM_EXTERNAL_API_KEY}" ]]; then
PRM_ARGS+=(--prm-external-api-key "${PRM_EXTERNAL_API_KEY}")
fi

CUSTOM_ARGS=(
--custom-generate-function-path openclaw_combine_api_server.generate
--custom-rm-path openclaw_combine_api_server.reward_func
)

WANDB_ARGS=()
if [[ "${USE_WANDB:-0}" == "1" && -n "${WANDB_API_KEY:-}" ]]; then
WANDB_ARGS=(
--use-wandb
--wandb-project ${WANDB_PROJECT:-openclaw_rl_int4}
--wandb-group qwen3-4b-openclaw-combine-int4-qlora
--wandb-key ${WANDB_API_KEY}
)
fi

RUNTIME_ENV_JSON="{
\"env_vars\": {
\"PYTHONPATH\": \"${SCRIPT_DIR}:${SCRIPT_DIR}/../openclaw-opd:${SLIME_ROOT}\",
\"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
\"OPENCLAW_EVAL_MODE\": \"${OPENCLAW_EVAL_MODE}\",
\"OPENCLAW_COMBINE_W_RL\": \"${OPENCLAW_COMBINE_W_RL}\",
\"OPENCLAW_COMBINE_W_OPD\": \"${OPENCLAW_COMBINE_W_OPD}\",
\"TRAIN_EPOCHS\": \"${TRAIN_EPOCHS}\"
}
}"

ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
-- python3 train.py \
--train-backend fsdp \
--offload-rollout \
--actor-num-nodes 1 \
--actor-num-gpus-per-node "${ACTOR_GPUS}" \
--rollout-num-gpus "${ROLLOUT_GPUS}" \
--num-gpus-per-node "${NUM_GPUS}" \
--colocate \
"${CKPT_ARGS[@]}" \
"${ROLLOUT_ARGS[@]}" \
"${OPTIMIZER_ARGS[@]}" \
"${COMBINE_ARGS[@]}" \
"${PERF_ARGS[@]}" \
"${SGLANG_ARGS[@]}" \
"${PRM_ARGS[@]}" \
"${QLORA_ARGS[@]}" \
"${CUSTOM_ARGS[@]}" \
"${WANDB_ARGS[@]}"
Loading