# OpenClaw Agent Training - Extraversion Personality

Train an LLM agent to exhibit more extraverted personality traits using reinforcement learning.

## Overview

This training program uses GRPO (Group Relative Policy Optimization) to train Qwen2.5-7B-Instruct to respond with more extraverted characteristics:
- Outgoing, energetic, enthusiastic tone
- Social engagement and excitement
- Positive, upbeat language
- Action-oriented expressions

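For intuition, GRPO skips the learned value model: each query is answered N times and every response is scored relative to its own group. A minimal sketch of that group-relative normalization, assuming a standard zero-mean, unit-variance scheme (the epsilon and exact formula are assumptions; implementations vary):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one group of N responses to zero mean and unit std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four responses to the same query, scored for extraversion.
print(group_relative_advantages([0.2, 0.8, -0.5, 0.6]))
```
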
## Architecture

```
User Query → fake_vllm_endpoint.py → Swarm Server (8 GPUs)
                    ↓
    Generate N=4 responses in parallel
                    ↓
    Evaluate with ExtraversionGrader (OpenJudge)
                    ↓
    Compute rewards & update model (GRPO)
                    ↓
    Return best response to user
```

## Prerequisites

```bash
pip install py-openjudge datasets
```

## Setup

### 1. Download Dataset

```bash
cd tutorial/opencode_build_openclaw_agent
python download_dataset.py
```

This downloads the `holistic-ai/personality_manipulation` dataset and extracts extraversion examples.

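If you want to inspect the data yourself, a minimal sketch using the `datasets` library is shown below; the split and column names are assumptions, so check the dataset card or `download_dataset.py` for the real schema:

```python
from datasets import load_dataset

# Load the personality manipulation dataset from the Hugging Face Hub.
ds = load_dataset("holistic-ai/personality_manipulation", split="train")

# Keep only extraversion examples ("trait" is an assumed column name).
extraversion = ds.filter(lambda row: str(row.get("trait", "")).lower() == "extraversion")
print(f"{len(extraversion)} extraversion examples")
```
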
### 2. Configure API Key

Edit `on_compute_relative_reward.py` and set your API key for the judge model:

```python
model = OpenAIChatModel(
    model="qwen-plus",
    api_key="YOUR_API_KEY_HERE",  # Change this
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
```

## Training

### Step 1: Start Swarm Server

On your GPU server (with 8 GPUs available):

```bash
ajet-swarm start
```

Or with monitoring:

```bash
(ajet-swarm start &> ajet-swarm-server.log) & (ajet-swarm overwatch)
```

### Step 2: Start Fake vLLM Endpoint

In a new terminal:

```bash
cd tutorial/opencode_build_openclaw_agent

# Option 1: Use OpenJudge pointwise grading (default)
export AJET_SWARM_URL="http://localhost:10086"
export NUM_REPEAT=4
export REWARD_MODE=pointwise
export DASHSCOPE_API_KEY=your_api_key_here
python fake_vllm_endpoint.py

# Option 2: Use OpenJudge listwise ranking
export AJET_SWARM_URL="http://localhost:10086"
export NUM_REPEAT=4
export REWARD_MODE=listwise
export DASHSCOPE_API_KEY=your_api_key_here
python fake_vllm_endpoint.py
```

This starts the training proxy on `http://localhost:8090`.

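To sanity-check the proxy before wiring up OpenClaw, you can send it a single chat request. The sketch below assumes the proxy exposes an OpenAI-compatible `/v1/chat/completions` route and that the `openai` Python client is installed:

```python
from openai import OpenAI

# The training proxy is assumed not to validate API keys, so a placeholder works.
client = OpenAI(base_url="http://localhost:8090/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",  # illustrative; the proxy may ignore the model name
    messages=[{"role": "user", "content": "What are your thoughts on Paris?"}],
)
print(resp.choices[0].message.content)
```
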
### Step 3: Configure OpenClaw to Use Training Endpoint

OpenClaw needs to connect to the fake vLLM endpoint instead of a real model server. Configure it to use `http://localhost:8090` as its LLM backend.

### Step 4: Send Training Requests

Option A - Manual testing via the OpenClaw Web UI or CLI:

```bash
openclaw agent --message "What are your thoughts on Paris?" --thinking high
```

Option B - Automated dataset iteration:

```bash
python mock_user_request.py
```

This iterates through the `personality_manipulation` dataset and sends each question via the OpenClaw CLI.

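For reference, here is a stripped-down sketch of what an automated driver like `mock_user_request.py` might do; the dataset column names and CLI flags are assumptions borrowed from the manual example above, not the script's verified contents:

```python
import subprocess
from datasets import load_dataset

# Push each dataset question through the OpenClaw CLI, which forwards it
# to the training proxy on port 8090.
ds = load_dataset("holistic-ai/personality_manipulation", split="train")
for row in ds:
    question = row.get("question") or row.get("prompt")  # assumed column names
    if not question:
        continue
    subprocess.run(
        ["openclaw", "agent", "--message", question, "--thinking", "high"],
        check=False,  # keep iterating even if one request fails
    )
```
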
## Configuration

Key parameters in `fake_vllm_endpoint.py`:

- `n_gpu=8` - Number of GPUs for training
- `batch_size=32` - Training batch size
- `num_repeat=4` - GRPO N parameter (responses per query)
- `model` - Base model path

Environment variables for reward computation:

- `REWARD_MODE` - Reward computation mode: `pointwise` (default) or `listwise`
- `DASHSCOPE_API_KEY` - API key for the OpenJudge LLM grader
- `JUDGE_BASE_URL` - Base URL for the judge model API (default: DashScope)
- `JUDGE_MODEL` - Judge model name (default: `qwen-plus`)

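For reference, reading these variables in Python typically looks like the sketch below; the fallback defaults mirror the list above, but the exact code in `on_compute_relative_reward.py` may differ:

```python
import os

# Reward-computation settings, read from the environment with fallbacks.
reward_mode = os.environ.get("REWARD_MODE", "pointwise")   # "pointwise" or "listwise"
api_key = os.environ["DASHSCOPE_API_KEY"]                  # required: judge model credentials
judge_base_url = os.environ.get(
    "JUDGE_BASE_URL", "https://dashscope.aliyuncs.com/compatible-mode/v1"
)
judge_model = os.environ.get("JUDGE_MODEL", "qwen-plus")
```
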
## Reward Function

Two OpenJudge-based reward modes are available:

### 1. Pointwise Mode (Default)

Uses the OpenJudge LLM grader to evaluate each response independently:
- Evaluates extraversion traits on a 1-10 scale
- Provides detailed reasoning for each score
- Scores are normalized to [-1, 1] for GRPO training

```bash
export REWARD_MODE=pointwise
export DASHSCOPE_API_KEY=your_api_key_here
```

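The exact mapping lives in `on_compute_relative_reward.py`; as an illustration, a linear rescaling of a 1-10 judge score into [-1, 1] could look like this (the formula is an assumption, not the script's verified implementation):

```python
def normalize_score(score: float) -> float:
    """Map a 1-10 judge score linearly onto [-1, 1]."""
    return (score - 5.5) / 4.5

assert normalize_score(1) == -1.0 and normalize_score(10) == 1.0
```
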
### 2. Listwise Mode

Uses OpenJudge to rank all responses together:
- Compares responses directly against each other
- Produces relative rankings
- Best for capturing subtle differences

```bash
export REWARD_MODE=listwise
export DASHSCOPE_API_KEY=your_api_key_here
```

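As with pointwise mode, the details live in the reward script; one common way to turn a best-to-worst ranking of N candidates into GRPO-friendly rewards is an even spread over [-1, 1], sketched here under that assumption:

```python
def rank_to_rewards(ranking: list[int]) -> list[float]:
    """Convert a best-to-worst ranking of candidate indices into rewards in [-1, 1].

    ranking[0] is the index of the best response, ranking[-1] the worst.
    """
    n = len(ranking)
    rewards = [0.0] * n
    for position, idx in enumerate(ranking):
        # Best candidate gets +1, worst gets -1, the rest are spaced evenly.
        rewards[idx] = (1.0 - 2.0 * position / (n - 1)) if n > 1 else 0.0
    return rewards

# Example: response 2 judged best, then 0, 3, and 1.
print(rank_to_rewards([2, 0, 3, 1]))  # rewards indexed by candidate
```
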
## Monitoring

Check training progress:

```bash
# View swarm status
ajet-swarm overwatch

# Check request history
curl http://localhost:8090/requests

# Health check
curl http://localhost:8090/health
```

## Files

- `fake_vllm_endpoint.py` - Main training server
- `on_compute_relative_reward.py` - Extraversion reward function
- `on_user_submit_new_requests.py` - Request handler
- `download_dataset.py` - Dataset downloader
- `mock_user_request.py` - Automated testing client

## Troubleshooting

**Import errors**: LSP warnings about unresolved imports are normal; the dependencies are available at runtime.

**Connection refused**: Ensure the swarm server is running on port 10086.

**All episodes failed**: Check GPU availability and the swarm server logs.

## Notes

- Training is passive - the endpoint waits for incoming requests rather than iterating over a dataset itself
- Each request generates N=4 responses, scores them with the reward function, updates the model via GRPO, and returns the best response to the user
- The model gradually learns to produce more extraverted responses over time