Commit 99642c5

binary-husky and claude committed
Improve map-verl-config skill with schema requirements
Add detailed instructions for updating the config schema, explaining that every nested level needs a dataclass to avoid raw dict issues at runtime.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 2802a00 commit 99642c5

3 files changed

Lines changed: 52 additions & 12 deletions

ajet/copilot/job.py

Lines changed: 10 additions & 9 deletions
@@ -110,15 +110,6 @@ def __init__(
         if not (all(p is None for p in length_params) or all(p is not None for p in length_params)):
             raise ValueError("(`max_prompt_length`, `max_response_length`, `max_model_len`, `max_response_length_in_one_turn`) must all be None or all be non-None")
 
-        # Validate: when lora_rank > 0, load_format must be safetensors
-        if lora_rank is not None and lora_rank > 0:
-            if lora_load_format != "safetensors":
-                raise ValueError(f"When lora_rank > 0, lora_load_format must be 'safetensors', got '{lora_load_format}'")
-            if lr is None:
-                raise ValueError("lr should be provided for lora training")
-            if lr <= 1e-5:
-                raise ValueError(f"lr should usually be greater than 1e-5 for lora training, got {lr}")
-
         self.config_as_dict: dict = self.build_job_from_yaml(base_yaml_config)
         self.config = Config.update_from_dict_recursive(Config(), self.config_as_dict)
 
@@ -198,6 +189,16 @@ def __init__(
 
         assert self.max_prompt_length + self.max_response_length <= self.max_model_len, "illegal token length"
         assert self.max_response_length_in_one_turn <= self.max_response_length
+
+        # Validate: when lora_rank > 0, load_format must be safetensors
+        if self.lora_rank > 0:
+            if self.lora_load_format != "safetensors":
+                raise ValueError(f"When lora_rank > 0, lora_load_format must be 'safetensors', got '{self.lora_load_format}'")
+            if self.lr is None:
+                raise ValueError("lr should be provided for lora training")
+            if self.lr <= 1e-5:
+                raise ValueError(f"lr should usually be greater than 1e-5 for lora training, got {self.lr}")
+
         if self.backbone == "trinity":
             raise NotImplementedError("Trinity backbone is not yet supported in AgentJetJob.")
 
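The relocated LoRA check now reads the resolved `self.*` attributes instead of the raw constructor arguments. The rules it enforces can be sketched as a standalone function. The name `validate_lora_settings` is hypothetical and does not appear in `job.py`. The sketch assumes that `lora_rank` has already been resolved to an integer (the new code drops the old `is not None` guard) and that the three inner checks are siblings under the rank check, since this page does not preserve indentation.

```python
from typing import Optional


def validate_lora_settings(lora_rank: int, lora_load_format: str, lr: Optional[float]) -> None:
    """Hypothetical helper mirroring the checks this commit moves to the end of __init__."""
    if lora_rank > 0:
        # LoRA training requires the safetensors load format.
        if lora_load_format != "safetensors":
            raise ValueError(
                f"When lora_rank > 0, lora_load_format must be 'safetensors', got '{lora_load_format}'"
            )
        # A learning rate must be given explicitly ...
        if lr is None:
            raise ValueError("lr should be provided for lora training")
        # ... and must be strictly greater than 1e-5.
        if lr <= 1e-5:
            raise ValueError(f"lr should usually be greater than 1e-5 for lora training, got {lr}")


validate_lora_settings(0, "auto", None)         # rank 0: all LoRA checks are skipped
validate_lora_settings(8, "safetensors", 1e-4)  # valid LoRA configuration
```

Note that `lr == 1e-5` is rejected, because the comparison is `<=`.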

ajet/copilot/map-verl-config/SKILL.md

Lines changed: 12 additions & 3 deletions
@@ -5,7 +5,7 @@ license: Complete terms in LICENSE.txt
 ---
 
 
-1. find user requested verl config in in codebase/agentjet/ajet/default_config/verl/verl_default.yaml
+1. find user requested verl config in codebase/agentjet/ajet/default_config/verl/verl_default.yaml
 
 2. check `codebase/agentjet/ajet/default_config/verl/config_auto_convertion_verl.jsonc`, whether a mapping to this config already exists.
 
@@ -15,5 +15,14 @@ license: Complete terms in LICENSE.txt
 
 5. ask user whether to add to AgentJetJob (ajet/copilot/job.py), if the user confirms:
    - learn how other config is added in ajet/copilot/job.py
-   - add to __init__, update docstring
-   - add to ajet/default_config/ajet_config_schema.py
+   - add to __init__ signature (with type hint and default None)
+   - update docstring with parameter description
+   - add instance attribute assignment with cast()
+   - add mapping to `overrides` dict
+
+6. **CRITICAL**: update `ajet/default_config/ajet_config_schema.py`
+   - the schema must have a dataclass for EVERY nested level in the config path
+   - e.g., for `ajet.trainer_common.optim.lr`, need:
+     - `AjetOptim` dataclass with `lr: float = 1e-6`
+     - `AjetTrainerCommon` must have `optim: AjetOptim = field(default_factory=AjetOptim)`
+   - if parent dataclass is missing the nested field, config loading will store it as a raw dict instead of a typed dataclass, causing `getattr()` to fail at runtime
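The failure mode described in step 6 can be reproduced with a small sketch. `AjetOptim` and `AjetTrainerCommon` follow the names given in the skill text. `BrokenTrainerCommon` and `update_recursive` are hypothetical: the latter is a toy stand-in for the real loader (`Config.update_from_dict_recursive`), whose actual behavior may differ.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any


@dataclass
class AjetOptim:
    lr: float = 1e-6


@dataclass
class AjetTrainerCommon:
    # Declaring the nested level lets the loader recurse into a typed object.
    optim: AjetOptim = field(default_factory=AjetOptim)


@dataclass
class BrokenTrainerCommon:
    """Hypothetical parent that forgot to declare the nested `optim` field."""


def update_recursive(obj: Any, data: dict) -> Any:
    """Toy loader (an assumption, not AgentJet's implementation)."""
    declared = {f.name for f in fields(obj)}
    for key, value in data.items():
        current = getattr(obj, key, None) if key in declared else None
        if isinstance(value, dict) and is_dataclass(current):
            update_recursive(current, value)  # typed nested level: recurse
        else:
            setattr(obj, key, value)  # undeclared level: the dict is stored as-is
    return obj


typed = update_recursive(AjetTrainerCommon(), {"optim": {"lr": 1e-4}})
raw = update_recursive(BrokenTrainerCommon(), {"optim": {"lr": 1e-4}})

print(type(typed.optim).__name__, typed.optim.lr)  # attribute access works
print(type(raw.optim).__name__)                    # a plain dict
try:
    getattr(raw.optim, "lr")
except AttributeError as exc:
    print("getattr failed:", exc)
```

In this sketch the value is still reachable as `raw.optim["lr"]`, but any code that expects attribute access on a typed dataclass breaks, which is why every level of the config path needs its own dataclass.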

tutorial/example_train_multi_model/README.md

Lines changed: 30 additions & 0 deletions
@@ -199,3 +199,33 @@ REMOTE_14B_BATCH_SIZE = 8 # Batch size for 14B model
 REMOTE_7B_ALLOCATE_GPU_PER_NODE = 8 # GPUs for 7B model
 REMOTE_14B_ALLOCATE_GPU_PER_NODE = 8 # GPUs for 14B model
 ```
+
+
+## cheat sheet
+
+PROJECT_DIR="/mnt/data_cpfs/qingxu.fu/agentjet/hello-agentjet"
+
+# --- Swarm Server 1 ---
+tmux new-session -d -s "SWARM_SERVER_M1" # warning: do not add command here, otherwise it will be executed immediately and the session will exit
+tmux send-keys -t "SWARM_SERVER_M1" "cd ${PROJECT_DIR}" Enter
+tmux send-keys -t "SWARM_SERVER_M1" "source .venv/bin/activate" Enter
+tmux send-keys -t "SWARM_SERVER_M1" "export SETUPTOOLS_USE_DISTUTILS=local" Enter
+tmux send-keys -t "SWARM_SERVER_M1" "ajet-swarm start --swarm-port=10086" Enter
+echo "Started SWARM_SERVER_M1 on port 10086"
+
+# --- Swarm Server 2 ---
+tmux new-session -d -s "SWARM_SERVER_M2"
+tmux send-keys -t "SWARM_SERVER_M2" "cd ${PROJECT_DIR}" Enter
+tmux send-keys -t "SWARM_SERVER_M2" "source .venv/bin/activate" Enter
+tmux send-keys -t "SWARM_SERVER_M2" "export SETUPTOOLS_USE_DISTUTILS=local" Enter
+tmux send-keys -t "SWARM_SERVER_M2" "ajet-swarm start --swarm-port=10087" Enter
+echo "Started SWARM_SERVER_M2 on port 10087"
+
+# --- Swarm Client ---
+tmux new-session -d -s "SWARM_CLIENT_EXP1"
+tmux send-keys -t "SWARM_CLIENT_EXP1" "cd ${PROJECT_DIR}" Enter
+tmux send-keys -t "SWARM_CLIENT_EXP1" "source .venv/bin/activate" Enter
+tmux send-keys -t "SWARM_CLIENT_EXP1" "export SETUPTOOLS_USE_DISTUTILS=local" Enter
+tmux send-keys -t "SWARM_CLIENT_EXP1" "sleep 30s" Enter
+tmux send-keys -t "SWARM_CLIENT_EXP1" "python -m tutorial.example_train_multi_model.trans_roll_lora" Enter
+echo "Started SWARM_CLIENT_EXP1"
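The three blocks in the cheat sheet share one pattern: create a detached session, then send each setup command followed by `Enter`. A minimal sketch of that pattern as a command generator follows. The helper `tmux_session_commands` is hypothetical and not part of the repository. It only builds and prints the `tmux` invocations, so it runs without tmux installed.

```python
import shlex
from typing import List


def tmux_session_commands(session: str, steps: List[str]) -> List[List[str]]:
    """Hypothetical helper: one detached session, then one send-keys call per step."""
    # The session is created empty; a command passed to new-session would run
    # immediately and the session would exit when it finished.
    commands = [["tmux", "new-session", "-d", "-s", session]]
    for step in steps:
        commands.append(["tmux", "send-keys", "-t", session, step, "Enter"])
    return commands


PROJECT_DIR = "/mnt/data_cpfs/qingxu.fu/agentjet/hello-agentjet"
COMMON = [
    f"cd {PROJECT_DIR}",
    "source .venv/bin/activate",
    "export SETUPTOOLS_USE_DISTUTILS=local",
]

plan: List[List[str]] = []
for name, port in [("SWARM_SERVER_M1", 10086), ("SWARM_SERVER_M2", 10087)]:
    plan += tmux_session_commands(name, COMMON + [f"ajet-swarm start --swarm-port={port}"])
plan += tmux_session_commands(
    "SWARM_CLIENT_EXP1",
    COMMON + ["sleep 30s", "python -m tutorial.example_train_multi_model.trans_roll_lora"],
)

for argv in plan:
    print(shlex.join(argv))  # to execute instead: subprocess.run(argv, check=True)
```

The client keeps the `sleep 30s` step from the cheat sheet, which gives both swarm servers time to start before the training script connects.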
