Commit a0dd2e4

fix: update image URL in swarm_intro.md and enhance Dockerfile for editable installation
1 parent 3e57540 commit a0dd2e4

4 files changed: +20 -6 lines changed


ajet/backbone/main_verl.py

Lines changed: 10 additions & 1 deletion
@@ -64,7 +64,16 @@ def run_ppo(config: DictConfig) -> None:
             num_cpus=config.ray_init.num_cpus,
         )

-    atexit.register(lambda: ray.shutdown())  # ray shutdown on exit
+    def on_shutdown():
+        if ray.is_initialized():
+            ray.shutdown()
+        if config.ajet.enable_experimental_interchange_server:
+            if config.ajet.enable_swarm_mode:
+                from ajet.tuner_lib.experimental.interchange_utils import http_change_engine_status
+                print("Changing engine status to OFFLINE before shutdown...")
+                http_change_engine_status(config, "ENGINE.OFFLINE", global_step=0)
+
+    atexit.register(on_shutdown)  # ray shutdown on exit

     # Create a remote instance of the TaskRunner class, and
     # Execute the `run` method of the TaskRunner instance remotely and wait for it to complete
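
The hunk above replaces a bare `ray.shutdown()` lambda with a guarded shutdown followed by a status notification. Below is a minimal, library-free sketch of that pattern, not the project's implementation. `FakeRuntime`, `make_shutdown_hook`, and the `notify` callback are hypothetical stand-ins for Ray and `http_change_engine_status`.

```python
from typing import Callable, List, Optional


class FakeRuntime:
    """Hypothetical stand-in for a runtime such as Ray (illustration only)."""

    def __init__(self) -> None:
        self.running = True
        self.stop_calls = 0

    def is_initialized(self) -> bool:
        return self.running

    def shutdown(self) -> None:
        self.running = False
        self.stop_calls += 1


def make_shutdown_hook(
    runtime: FakeRuntime,
    notify: Optional[Callable[[str], None]] = None,
) -> Callable[[], None]:
    """Build an exit hook: stop the runtime only if it is up, then report OFFLINE."""

    def on_shutdown() -> None:
        # Guard the call so an already-stopped runtime is not shut down twice.
        if runtime.is_initialized():
            runtime.shutdown()
        # Notify only when a callback was supplied (stands in for the config flags in the diff).
        if notify is not None:
            notify("ENGINE.OFFLINE")

    return on_shutdown


runtime = FakeRuntime()
status_log: List[str] = []
hook = make_shutdown_hook(runtime, notify=status_log.append)
hook()  # called directly here to show the effect
```

In a real entry point the hook would be passed to `atexit.register(hook)`, as the diff does with `on_shutdown`. Note that in this sketch a second call skips the runtime shutdown but still sends the notification.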

ajet/default_config/ajet_ts_default.yaml

Lines changed: 2 additions & 0 deletions
@@ -48,6 +48,8 @@ ajet:
     # maximum number of parallel environments / simulate workers
     max_env_worker: 128

+  trainer_common:
+    logger: tensorboard

 # ------------------ no need to modify ------------------
 hydra:
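
The added keys set a default logger backend. Below is a small sketch of how such a nested default might be read with a fallback. It uses a plain dict rather than the project's Hydra `DictConfig`. The nesting under `ajet` is assumed from the hunk header, and `get_by_path`, the `"console"` fallback, and the `project` key are hypothetical.

```python
from collections.abc import Mapping
from typing import Any

# Plain-dict mirror of the YAML fragment added in this hunk.
# The nesting under "ajet" is an assumption based on the hunk header.
DEFAULTS = {"ajet": {"trainer_common": {"logger": "tensorboard"}}}


def get_by_path(cfg: Mapping, dotted: str, default: Any = None) -> Any:
    """Walk a nested mapping along a dotted key path; return default if a key is missing."""
    node: Any = cfg
    for key in dotted.split("."):
        if not isinstance(node, Mapping) or key not in node:
            return default
        node = node[key]
    return node


# Present key: the configured value is returned.
logger_name = get_by_path(DEFAULTS, "ajet.trainer_common.logger", default="console")
# Absent key (hypothetical "project"): the fallback is returned.
missing = get_by_path(DEFAULTS, "ajet.trainer_common.project", default="unset")
```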

docs/en/swarm_intro.md

Lines changed: 7 additions & 5 deletions
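@@ -1,13 +1,15 @@
 # Flexible Training, Dynamic Scaling: Tongyi Releases AgentJet, a Distributed Multi-Agent LLM Swarm Training Framework

 <div align="center">
-<img width="500" alt="AgentJet" src="agentjet.jpg"/>
+<img width="500" alt="AgentJet" src="https://doc.agentjet.top/agentjet.jpg"/>
 </div>

-TLDR: Tongyi open-sources AgentJet (AJet), a next-generation, cutting-edge, easy-to-use agentic reinforcement learning framework with fully distributed Swarm Training capability,
-which fully decouples training from inference and handles the training of complex LLM multi-agent systems more efficiently.
-On one hand, in AgentJet researchers can use very simple code to plug multiple different LLMs into the reinforcement learning training of a single multi-agent system at once, achieving true non-shared-parameter multi-agent reinforcement learning;
-on the other hand, researchers run agents on any device (even their own laptops) to take part in training directly, and agent Rollout nodes can be added, removed, or modified dynamically at any time, forming a swarm training network that is not limited by the environment, allows bugs to be fixed at any time, and self-heals from external environment crashes.
+TLDR: Tongyi open-sources AgentJet (AJet), a next-generation, cutting-edge, easy-to-use agentic reinforcement learning framework.
+AgentJet offers fully distributed Swarm Training,
+fully decouples training from inference, greatly simplifies the training workflow for single-agent and multi-agent LLM systems, and handles the training of complex multi-agent systems more efficiently.
+On one hand, in AgentJet researchers can use very simple code to plug multiple different LLMs into the RL training of a single multi-agent system at once, achieving true non-shared-parameter multi-agent reinforcement learning (MARL);
+on the other hand, researchers can run agents on any device (such as a laptop) to take part in training directly,
+and can add, remove, or modify agent Rollout nodes dynamically at any time, building a swarm training network that is not limited by the environment, allows bugs to be fixed at any time, and self-heals from external environment crashes.
 In addition, AgentJet is fully open source, rich in examples, ready to use out of the box, and open to community contribution, and it ships with token-level tracing and debugging tools and a version-by-version training performance tracking platform;
 it also provides skills (SKILLs) for Vibe Coding developers, letting tools such as Claude Code assist with your agent orchestration and training debugging in one step.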

scripts/docker/dockerfile_zh

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ RUN . /opt/venv/bin/activate && uv pip install flash_attn-2.8.3+cu12torch2.7cxx1

 # cache friendly layer for code changes
 COPY . .
+RUN . /opt/venv/bin/activate && uv pip install -e .[verl] -i https://mirrors.aliyun.com/pypi/simple/
 RUN wget https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/astuner_archive/dataset.tar.gz
 RUN mkdir -p /mnt/data_cpfs/model_cache/modelscope
 RUN tar -xzf dataset.tar.gz -C /mnt/data_cpfs/model_cache/modelscope/ && rm dataset.tar.gz
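
The added `RUN` line installs the copied source tree in editable mode. Below is a sketch of one way to check from inside the image that a distribution is editable. It assumes the installer records a PEP 610 `direct_url.json` with `dir_info.editable`. The helper names and the sample payload (including the `/workspace` path) are hypothetical, and the distribution name to pass would need to be confirmed against the project's packaging metadata.

```python
import json
from importlib import metadata
from typing import Optional


def direct_url_is_editable(direct_url_text: Optional[str]) -> bool:
    """Return True if a PEP 610 direct_url.json payload marks the install as editable."""
    if not direct_url_text:
        return False
    info = json.loads(direct_url_text)
    return bool(info.get("dir_info", {}).get("editable", False))


def is_editable_install(dist_name: str) -> bool:
    """Look up an installed distribution and inspect its direct_url.json, if present."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return False
    # read_text returns None when the file is absent from the distribution metadata.
    return direct_url_is_editable(dist.read_text("direct_url.json"))


# Hypothetical payload, shaped like what an editable install of a local directory records.
sample = '{"url": "file:///workspace", "dir_info": {"editable": true}}'
sample_result = direct_url_is_editable(sample)
```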
