Gen-Verse · wuwanchun · Apr 20, 2026 · Apr 20, 2026
diff --git a/README.md b/README.md
@@ -117,6 +117,7 @@ Our long-term goal is to **advance personalized, practically useful agents with
 ✅ **Release Track 1:** Fully async OpenClaw-RL framework with Binary RL + OPD  
 ✅ Best recipe discovery via demonstration experiments  
 ✅ Support LoRA Training  
+✅ Single-GPU INT4 QLoRA with offload-rollout and external PRM API (24 GB VRAM)  
 ✅ Deploy training on [Tinker](https://thinkingmachines.ai/tinker/)  
 ✅ Deploy training on [Fireworks AI](https://fireworks.ai)
 
@@ -241,6 +242,26 @@ If you're interested in any of these, feel free to open an issue to discuss your
 
 For detailed environment setup, see [Slime](https://github.com/THUDM/slime) or [`./instructions/README.md`](./instructions/README.md).
 
+#### Only have a single GPU (≥ 24 GB)?
+
+We support **single-GPU INT4 QLoRA** training and inference via:
+
+- **INT4 QLoRA** — bitsandbytes NF4 quantization + LoRA (rank=4), base weights ~2 GB VRAM
+- **Colocate + CPU Offload** — training and rollout share one GPU, offloading when idle
+- **External PRM API** — PRM scoring via any OpenAI-compatible API, no local PRM GPU needed
+
+```bash
+cd slime
+export HF_CKPT="/path/to/Qwen3-4B"           # original HF weights (not pre-quantized)
+export OPENAI_API_KEY="your-api-key"           # external LLM API key (for PRM & OPD)
+export OPENAI_BASE_URL="https://your-api-base"
+export EXTERNAL_MODEL="your-model-name"
+
+bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_single_gpu_int4_qlora.sh
+```
+
+See [`./openclaw-combine/README.md`](./openclaw-combine/README.md) for full parameter reference and [`./openclaw-test/README.md`](./openclaw-test/README.md) for single-GPU evaluation scripts.
+
 
 
 #### Don't have a GPU?

diff --git a/openclaw-combine/README.md b/openclaw-combine/README.md
@@ -66,3 +66,53 @@ openclaw-combine/
 ├── combine_loss.py                           # Weighted advantage: w_rl * GRPO + w_opd * teacher
 └── results/                                  # Runtime records (auto-created)
 ```
+
+---
+
+## Single-GPU INT4 QLoRA
+
+Train and serve on a **single 24 GB GPU** using `run_qwen3_4b_openclaw_combine_single_gpu_int4_qlora.sh`.
+
+### Required Environment Variables
+
+| Variable | Description |
+|---|---|
+| `HF_CKPT` | Path to original Qwen3-4B HF weights (**not** pre-quantized) |
+| `OPENAI_API_KEY` | API key for external LLM (used by PRM and OPD teacher) |
+| `OPENAI_BASE_URL` | Base URL for the external LLM API |
+| `EXTERNAL_MODEL` | Model name served by the external API |
+
+### Optional Environment Variables
+
+| Variable | Default | Description |
+|---|---|---|
+| `MICRO_BATCH_SIZE` | `1` | Micro-batch size per GPU |
+| `ROLLOUT_BATCH_SIZE` | `16` | Samples collected before each training step |
+
+### Key Features Enabled
+
+| Feature | Flag / Config | Description |
+|---|---|---|
+| INT4 quantization | `--fsdp-load-in-4bit` | bitsandbytes NF4 quantization. Compresses base model weights to ~4 bits, reducing VRAM from ~8 GB (bf16) to ~2 GB for Qwen3-4B. Only LoRA adapters are trained in full precision. |
+| LoRA fine-tuning | `--use-lora --lora-rank 4` | Low-Rank Adaptation. Freezes base weights and inserts small trainable matrices into attention/MLP layers.
+| Gradient checkpointing | `--gradient-checkpointing` | Trades compute for memory by recomputing activations during backward pass instead of storing them. |
+| Colocate (train + rollout on same GPU) | `--colocate` | Runs both the FSDP training actor and the SGLang rollout engine on the same GPU(s). Essential for single-GPU setups. |
+| Rollout offload | `--offload-rollout` | When training starts, the SGLang rollout engine is offloaded from GPU to free VRAM for the training forward/backward pass. When training finishes, the engine is reloaded for the next rollout. This alternating pattern allows a single GPU to handle both roles. |
+| External PRM API | `--prm-use-external-api` | Sends PRM scoring requests to an external OpenAI-compatible API instead of hosting a local PRM model on GPU. Eliminates the need for a dedicated PRM GPU — critical for single-GPU setups. Configured via `PRM_EXTERNAL_API_BASE`, `PRM_EXTERNAL_MODEL`, and `PRM_EXTERNAL_API_KEY`. |
+| Train sequence truncation | `SLIME_TRAIN_MAX_SEQ_LEN=4096` | Caps the token length of each training sample. Sequences longer than this are truncated from the **left** (dropping early prompt tokens, preserving the response). Prevents OOM on long rollouts and keeps per-sample memory bounded. |
+| Logit chunking | `SLIME_LOGIT_CHUNK_SIZE=512` | Computes log-softmax and entropy in chunks of this size instead of materializing the full `[seq_len, vocab_size]` tensor at once. For Qwen3-4B (vocab ~152 K), this reduces peak allocation from ~7.5 GB to ~0.9 GB per sample. Set to `0` to disable chunking. |
+
+### Quick Start
+
+```bash
+cd slime
+
+export HF_CKPT="/path/to/Qwen3-4B"
+export OPENAI_API_KEY="your-api-key"
+export OPENAI_BASE_URL="https://your-api-base"
+export EXTERNAL_MODEL="your-model-name"
+
+bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_single_gpu_int4_qlora.sh
+```
+
+The model is served at `http://0.0.0.0:30000/v1` by default. For single-GPU evaluation scripts, see [`../openclaw-test/README.md`](../openclaw-test/README.md).
diff --git a/openclaw-combine/run_qwen3_4b_openclaw_combine_qlora_single_gpu_int4.sh b/openclaw-combine/run_qwen3_4b_openclaw_combine_qlora_single_gpu_int4.sh
@@ -0,0 +1,192 @@
+#!/bin/bash
+
+set -euo pipefail
+set -x
+export PYTORCH_ALLOC_CONF=expandable_segments:True
+export SLIME_TRAINING_SAMPLES_FILE="../results/samples.jsonl"
+export PYTHONUNBUFFERED=1
+export SLIME_LOGIT_CHUNK_SIZE=512
+export PYTHONFAULTHANDLER=1
+export OPENCLAW_GATEWAY_TOKEN=""
+export OPENAI_API_KEY=""
+export OPENCLAW_GATEWAY_URL=""
+export OPENCLAW_WORKSPACE="$HOME/.openclaw/workspace"
+export OPENAI_BASE_URL=""   # point to your external LLM
+export EXTERNAL_MODEL=""    # model name for the external LLM
+export SLIME_TRAIN_MAX_SEQ_LEN=4096 # truncation will happen if context + response exceed this length
+export SGLANG_ENABLE_LOGITS_PROCESSER_CHUNK=True
+export SGLANG_LOGITS_PROCESSER_CHUNK_SIZE=128 # to avoid OOM with long context + response in INT4
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
+SLIME_ROOT="$(cd -- "${SCRIPT_DIR}/../slime" &>/dev/null && pwd)"
+
+NUM_GPUS=${NUM_GPUS:-1}
+ACTOR_GPUS=${ACTOR_GPUS:-1}
+ROLLOUT_GPUS=${ROLLOUT_GPUS:-1}
+
+# For bitsandbytes QLoRA, use the original HF checkpoint (do NOT use INT4 dir).
+HF_CKPT=${HF_CKPT:-to/models/Qwen3-4B}
+REF_LOAD=${REF_LOAD:-${HF_CKPT}}
+SAVE_CKPT=${SAVE_CKPT:-/root/shared-nvme/OpenClaw-RL/models/qwen3_4b_openclaw_combine_single_gpu_int4_qlora_ckpt}
+PRM_MODEL_PATH=${PRM_MODEL_PATH:-${HF_CKPT}}
+# External PRM API (OpenAI-compatible)
+PRM_EXTERNAL_API_BASE=${PRM_EXTERNAL_API_BASE:-${OPENAI_BASE_URL:-}}
+PRM_EXTERNAL_MODEL=${PRM_EXTERNAL_MODEL:-${EXTERNAL_MODEL:-}}
+PRM_EXTERNAL_API_KEY=${PRM_EXTERNAL_API_KEY:-${OPENAI_API_KEY:-}}
+
+if [[ -z "${PRM_EXTERNAL_API_BASE}" || -z "${PRM_EXTERNAL_MODEL}" ]]; then
+  echo "PRM_EXTERNAL_API_BASE and PRM_EXTERNAL_MODEL are required (or set OPENAI_BASE_URL / EXTERNAL_MODEL)."
+  exit 1
+fi
+
+export SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-qwen3-4b-int4}"
+export HOST="0.0.0.0"
+export PORT="${PORT:-30000}"
+
+export OPENCLAW_RECORD_ENABLED="${OPENCLAW_RECORD_ENABLED:-1}"
+export OPENCLAW_RECORD_FILE="${SCRIPT_DIR}/results/qwen3_4b_single_gpu_int4_qlora_record.jsonl"
+export OPENCLAW_EVAL_MODE="${OPENCLAW_EVAL_MODE:-1}"
+
+export OPENCLAW_COMBINE_W_RL="${OPENCLAW_COMBINE_W_RL:-1.0}"
+export OPENCLAW_COMBINE_W_OPD="${OPENCLAW_COMBINE_W_OPD:-1.0}"
+export TRAIN_EPOCHS="${TRAIN_EPOCHS:-1}"
+export PRM_M="${PRM_M:-1}"
+export OPENCLAW_OPD_TEACHER_LP_MAX_CONCURRENCY="${OPENCLAW_OPD_TEACHER_LP_MAX_CONCURRENCY:-1}"
+
+ray stop --force || true
+pkill -9 sglang || true
+pkill -9 ray || true
+
+export MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
+export no_proxy="127.0.0.1,${MASTER_ADDR}"
+ray start --head --node-ip-address "${MASTER_ADDR}" --num-gpus "${NUM_GPUS}" --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
+
+CKPT_ARGS=(
+  --hf-checkpoint "${HF_CKPT}"
+  --ref-load "${REF_LOAD}"
+  --save "${SAVE_CKPT}"
+  --save-interval 20
+)
+
+ROLLOUT_ARGS=(
+  --disable-rollout-global-dataset
+  --rollout-function-path openclaw_combine_rollout.generate_rollout_openclaw_combine
+  --num-rollout ${NUM_ROLLOUT:-20000}
+  --rollout-batch-size ${ROLLOUT_BATCH_SIZE:-16}
+  --n-samples-per-prompt 1
+  --rollout-max-response-len ${ROLLOUT_MAX_RESPONSE_LEN:-4096}
+  --rollout-max-context-len ${ROLLOUT_MAX_CONTEXT_LEN:-22768}
+  --rollout-temperature ${ROLLOUT_TEMPERATURE:-0.6}
+  --reward-key score
+  --num-steps-per-rollout 1
+)
+
+COMBINE_ARGS=(
+  --advantage-estimator grpo
+  --disable-rewards-normalization
+  --loss-type custom_loss
+  --custom-loss-function-path combine_loss.combine_loss_function   
+  --use-kl-loss
+  --kl-loss-coef 0.0
+  --kl-loss-type low_var_kl
+  --entropy-coef 0.00
+  --eps-clip 0.2
+  --eps-clip-high 0.28
+)
+
+OPTIMIZER_ARGS=(
+  --optimizer adam
+  --lr ${LR:-1e-5}
+  --lr-decay-style constant
+  --weight-decay 0.1
+  --adam-beta1 0.9
+  --adam-beta2 0.98
+)
+
+PERF_ARGS=(
+  --micro-batch-size ${MICRO_BATCH_SIZE:-1}
+  --max-tokens-per-gpu ${MAX_TOKENS_PER_GPU:-4096}
+  --gradient-checkpointing
+)
+
+# FSDP QLoRA (INT4 base + LoRA adapters)
+QLORA_ARGS=(
+  --use-lora
+  --lora-rank ${LORA_RANK:-4}
+  --lora-alpha ${LORA_ALPHA:-4}
+  --lora-target-modules "${LORA_TARGET_MODULES:-q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj}"
+  --fsdp-load-in-4bit
+  --fsdp-bnb-4bit-quant-type ${FSDP_BNB_4BIT_QUANT_TYPE:-nf4}
+  --fsdp-bnb-4bit-compute-dtype ${FSDP_BNB_4BIT_COMPUTE_DTYPE:-bfloat16}
+  --fsdp-bnb-4bit-use-double-quant
+  --fsdp-prepare-model-for-kbit-training
+)
+
+SGLANG_ARGS=(
+  --rollout-num-gpus-per-engine 1
+  --sglang-context-length ${SGLANG_CONTEXT_LENGTH:-22768}
+  --sglang-mem-fraction-static ${SGLANG_MEM_FRACTION_STATIC:-0.6}
+  --sglang-reasoning-parser ${SGLANG_REASONING_PARSER:-qwen3}
+  --sglang-tool-call-parser ${SGLANG_TOOL_CALL_PARSER:-qwen25}
+)
+
+# External PRM: no local PRM engine / GPU allocation.
+PRM_ARGS=(
+  --prm-enable
+  --prm-use-external-api
+  --prm-num-gpus 1
+  --prm-m ${PRM_M}
+  --prm-temperature ${PRM_TEMPERATURE:-0.6}
+  --prm-max-new-tokens ${PRM_MAX_NEW_TOKENS:-4096}
+  --prm-external-api-base "${PRM_EXTERNAL_API_BASE}"
+  --prm-external-model "${PRM_EXTERNAL_MODEL}"
+)
+if [[ -n "${PRM_EXTERNAL_API_KEY}" ]]; then
+  PRM_ARGS+=(--prm-external-api-key "${PRM_EXTERNAL_API_KEY}")
+fi
+
+CUSTOM_ARGS=(
+  --custom-generate-function-path openclaw_combine_api_server.generate
+  --custom-rm-path openclaw_combine_api_server.reward_func
+)
+
+WANDB_ARGS=()
+if [[ "${USE_WANDB:-0}" == "1" && -n "${WANDB_API_KEY:-}" ]]; then
+  WANDB_ARGS=(
+    --use-wandb
+    --wandb-project ${WANDB_PROJECT:-openclaw_rl_int4}
+    --wandb-group qwen3-4b-openclaw-combine-int4-qlora
+    --wandb-key ${WANDB_API_KEY}
+  )
+fi
+
+RUNTIME_ENV_JSON="{
+  \"env_vars\": {
+    \"PYTHONPATH\": \"${SCRIPT_DIR}:${SCRIPT_DIR}/../openclaw-opd:${SLIME_ROOT}\",
+    \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
+    \"OPENCLAW_EVAL_MODE\": \"${OPENCLAW_EVAL_MODE}\",
+    \"OPENCLAW_COMBINE_W_RL\": \"${OPENCLAW_COMBINE_W_RL}\",
+    \"OPENCLAW_COMBINE_W_OPD\": \"${OPENCLAW_COMBINE_W_OPD}\",
+    \"TRAIN_EPOCHS\": \"${TRAIN_EPOCHS}\"
+  }
+}"
+
+ray job submit --address="http://127.0.0.1:8265" \
+  --runtime-env-json="${RUNTIME_ENV_JSON}" \
+  -- python3 train.py \
+  --train-backend fsdp \
+  --offload-rollout \
+  --actor-num-nodes 1 \
+  --actor-num-gpus-per-node "${ACTOR_GPUS}" \
+  --rollout-num-gpus "${ROLLOUT_GPUS}" \
+  --num-gpus-per-node "${NUM_GPUS}" \
+  --colocate \
+  "${CKPT_ARGS[@]}" \
+  "${ROLLOUT_ARGS[@]}" \
+  "${OPTIMIZER_ARGS[@]}" \
+  "${COMBINE_ARGS[@]}" \
+  "${PERF_ARGS[@]}" \
+  "${SGLANG_ARGS[@]}" \
+  "${PRM_ARGS[@]}" \
+  "${QLORA_ARGS[@]}" \
+  "${CUSTOM_ARGS[@]}" \
+  "${WANDB_ARGS[@]}"