Commit 3566cd5

Add scheduling mechanism and new workload (#1025)
This pull request introduces several new configuration files and significant updates to the disaggregated (disagg) connection logic and workload orchestration for the LightX2V project. The main focus is on supporting distributed inference with improved network handling, chunked data transfer, and dynamic workload simulation. The changes enhance reliability, configurability, and usability for running and testing disaggregated video inference pipelines.

**Key changes:**

### New configuration and workload simulation

* Added four new configuration files for disaggregated controller, encoder, transformer, and decoder modes, each specifying model parameters, quantization settings, RDMA protocol details, and distributed ranks.
* Introduced a workload staging configuration (`wan22_i2v_workload_stages.json`) to define warmup and change phases for dynamic load testing.
* Added `run_user.py` example script to simulate dynamic user workloads, sending requests to the controller based on stage specifications and supporting configurable request rates.

### Disagg connection reliability and protocol improvements

* Implemented `_normalize_loopback_host` to ensure consistent use of IPv4 loopback addresses, controlled by the `DISAGG_FORCE_IPV4_LOOPBACK` environment variable, improving local and mixed-protocol deployments.
* Enhanced error handling in the transfer loop and status synchronization, logging exceptions and preventing crashes during data transfer and status updates.
* Added support for chunked data transfer in `send_data`, controlled by the `MOONCAKE_TRANSFER_CHUNK_BYTES` environment variable, to handle large tensors more efficiently and robustly.

### Protocol and metadata updates

* Updated ZMQ communication to include `receiver_engine_rank` in multipart messages for both encoder and transformer threads, ensuring correct routing and status updates in distributed settings.
* Improved local IP detection logic in `mooncake.py` to better select a non-loopback IPv4 address for outbound connections, enhancing compatibility in multi-host environments.

### New libs

* locust

These changes collectively improve the flexibility, reliability, and scalability of the disaggregated inference pipeline, making it easier to configure, test, and deploy in distributed environments.
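The loopback normalization and chunked transfer described above can be illustrated with a minimal sketch. This is not the code from the commit: the function names `normalize_loopback_host` and `chunk_ranges`, the set of loopback aliases, the default values, and the way each environment variable is interpreted are all assumptions made for illustration. Only the variable names `DISAGG_FORCE_IPV4_LOOPBACK` and `MOONCAKE_TRANSFER_CHUNK_BYTES` come from the description.

```python
import os
from typing import Iterator, Mapping, Tuple


def normalize_loopback_host(host: str, env: Mapping[str, str] = os.environ) -> str:
    """Map loopback aliases to 127.0.0.1 unless the flag disables it.

    Assumption: the flag defaults to enabled and "0"/"false"/"no"/"off"
    turn it off. The real semantics may differ.
    """
    flag = env.get("DISAGG_FORCE_IPV4_LOOPBACK", "1").strip().lower()
    if flag in ("0", "false", "no", "off"):
        return host
    # Strip the brackets used for IPv6 literals such as "[::1]".
    if host.strip("[]").lower() in ("localhost", "::1"):
        return "127.0.0.1"
    return host


def chunk_ranges(total_bytes: int, env: Mapping[str, str] = os.environ) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover total_bytes.

    Assumption: an unset or non-positive chunk size means "send in one piece".
    """
    chunk = int(env.get("MOONCAKE_TRANSFER_CHUNK_BYTES", "0"))
    if chunk <= 0:
        if total_bytes > 0:
            yield (0, total_bytes)
        return
    for offset in range(0, total_bytes, chunk):
        yield (offset, min(chunk, total_bytes - offset))
```

A sender built this way would issue one transfer per `(offset, length)` pair, so a failure affects a single chunk rather than the whole tensor.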
1 parent f35e4d1 commit 3566cd5

34 files changed

Lines changed: 7010 additions & 703 deletions
Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
+{
+    "infer_steps": 4,
+    "in_dim": 36,
+    "dim": 5120,
+    "ffn_dim": 13824,
+    "freq_dim": 256,
+    "num_heads": 40,
+    "num_layers": 40,
+    "out_dim": 16,
+    "eps": 1e-06,
+    "model_type": "i2v",
+    "target_video_length": 81,
+    "text_len": 512,
+    "target_height": 480,
+    "target_width": 832,
+    "self_attn_1_type": "sage_attn2",
+    "cross_attn_1_type": "sage_attn2",
+    "cross_attn_2_type": "sage_attn2",
+    "sample_guide_scale": [
+        3.5,
+        3.5
+    ],
+    "sample_shift": 5.0,
+    "enable_cfg": false,
+    "cpu_offload": true,
+    "offload_granularity": "block",
+    "rdma_buffer_slot_size": 8192,
+    "t5_cpu_offload": false,
+    "vae_cpu_offload": false,
+    "fps": 16,
+    "use_image_encoder": false,
+    "boundary_step_index": 2,
+    "denoising_step_list": [
+        1000,
+        750,
+        500,
+        250
+    ],
+    "dit_quantized": true,
+    "dit_quant_scheme": "int8-q8f",
+    "high_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
+    "low_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
+    "high_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
+    "low_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
+    "image_path": "/root/zht/LightX2V/assets/inputs/imgs/img_0.jpg",
+    "disagg_mode": "controller",
+    "disagg_config": {
+        "bootstrap_addr": "192.168.0.166",
+        "bootstrap_room": 0,
+        "ranks": 8,
+        "encoder_engine_rank": 0,
+        "transformer_engine_rank": 1,
+        "decoder_engine_rank": 2,
+        "protocol": "rdma",
+        "local_hostname": "192.168.0.166",
+        "metadata_server": "P2PHANDSHAKE",
+        "service_env": {
+            "RDMA_IFACE": "erdma_0"
+        },
+        "remote_workdir": "/root/zht/LightX2V",
+        "remote_python_executable": "python",
+        "remote_activate_cmd": "source /root/miniconda3/etc/profile.d/conda.sh && conda activate lightx2v && export LD_LIBRARY_PATH=/root/miniconda3/envs/lightx2v/lib:${LD_LIBRARY_PATH:-}",
+        "remote_log_dir": "/root/zht/LightX2V/save_results",
+        "use_remote_proxy": true,
+        "remote_proxy_req_base_port": 28000,
+        "ssh_user": "root",
+        "ssh_options": [
+            "-i",
+            "/root/.ssh/id_ed25519_zht",
+            "-o",
+            "BatchMode=yes",
+            "-o",
+            "StrictHostKeyChecking=no"
+        ],
+        "static_instance_slots": [
+            {
+                "instance_type": "encoder",
+                "host": "192.168.0.139",
+                "engine_rank": 0,
+                "cuda_device": 0,
+                "env": {
+                    "MOONCAKE_DEVICE_NAME": "eth0",
+                    "MOONCAKE_LOCAL_HOSTNAME": "192.168.0.139"
+                }
+            },
+            {
+                "instance_type": "transformer",
+                "host": "192.168.0.166",
+                "engine_rank": 1,
+                "cuda_device": 0,
+                "env": {
+                    "MOONCAKE_DEVICE_NAME": "eth0",
+                    "MOONCAKE_LOCAL_HOSTNAME": "192.168.0.166"
+                }
+            },
+            {
+                "instance_type": "transformer",
+                "host": "192.168.0.166",
+                "engine_rank": 2,
+                "cuda_device": 1,
+                "env": {
+                    "MOONCAKE_DEVICE_NAME": "eth0",
+                    "MOONCAKE_LOCAL_HOSTNAME": "192.168.0.166"
+                }
+            },
+            {
+                "instance_type": "transformer",
+                "host": "192.168.0.166",
+                "engine_rank": 3,
+                "cuda_device": 2,
+                "env": {
+                    "MOONCAKE_DEVICE_NAME": "eth0",
+                    "MOONCAKE_LOCAL_HOSTNAME": "192.168.0.166"
+                }
+            },
+            {
+                "instance_type": "transformer",
+                "host": "192.168.0.166",
+                "engine_rank": 4,
+                "cuda_device": 3,
+                "env": {
+                    "MOONCAKE_DEVICE_NAME": "eth0",
+                    "MOONCAKE_LOCAL_HOSTNAME": "192.168.0.166"
+                }
+            },
+            {
+                "instance_type": "transformer",
+                "host": "192.168.0.166",
+                "engine_rank": 5,
+                "cuda_device": 4,
+                "env": {
+                    "MOONCAKE_DEVICE_NAME": "eth0",
+                    "MOONCAKE_LOCAL_HOSTNAME": "192.168.0.166"
+                }
+            },
+            {
+                "instance_type": "transformer",
+                "host": "192.168.0.166",
+                "engine_rank": 6,
+                "cuda_device": 5,
+                "env": {
+                    "MOONCAKE_DEVICE_NAME": "eth0",
+                    "MOONCAKE_LOCAL_HOSTNAME": "192.168.0.166"
+                }
+            },
+            {
+                "instance_type": "decoder",
+                "host": "192.168.0.139",
+                "engine_rank": 7,
+                "cuda_device": 1,
+                "env": {
+                    "MOONCAKE_DEVICE_NAME": "eth0",
+                    "MOONCAKE_LOCAL_HOSTNAME": "192.168.0.139"
+                }
+            }
+        ]
+    }
+}
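The controller config above lists eight `static_instance_slots` against `"ranks": 8`. A small sanity check over that layout might look like the sketch below. The helper `summarize_slots` is hypothetical, and both rules it enforces (engine ranks covering exactly `0..ranks-1`, and one instance per host and `cuda_device` pair) are assumptions read off this one config, not documented requirements of LightX2V.

```python
from collections import Counter


def summarize_slots(config: dict) -> Counter:
    """Check static_instance_slots for consistency and count instances per role."""
    disagg = config["disagg_config"]
    slots = disagg["static_instance_slots"]

    ranks = sorted(slot["engine_rank"] for slot in slots)
    # Assumption: engine ranks are expected to be exactly 0..ranks-1.
    if ranks != list(range(disagg["ranks"])):
        raise ValueError(f"engine ranks {ranks} do not cover 0..{disagg['ranks'] - 1}")

    # Assumption: each (host, cuda_device) pair should run a single instance.
    devices = [(slot["host"], slot["cuda_device"]) for slot in slots]
    if len(set(devices)) != len(devices):
        raise ValueError("two slots share the same host and cuda_device")

    return Counter(slot["instance_type"] for slot in slots)


# Trimmed stand-in for the controller config (3 slots instead of 8).
example = {
    "disagg_config": {
        "ranks": 3,
        "static_instance_slots": [
            {"instance_type": "encoder", "host": "192.168.0.139", "engine_rank": 0, "cuda_device": 0},
            {"instance_type": "transformer", "host": "192.168.0.166", "engine_rank": 1, "cuda_device": 0},
            {"instance_type": "decoder", "host": "192.168.0.139", "engine_rank": 2, "cuda_device": 1},
        ],
    }
}
role_counts = summarize_slots(example)
```

Run against the full controller config, the same function would report one encoder, six transformers, and one decoder.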
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+{
+    "infer_steps": 4,
+    "in_dim": 36,
+    "dim": 5120,
+    "ffn_dim": 13824,
+    "freq_dim": 256,
+    "num_heads": 40,
+    "num_layers": 40,
+    "out_dim": 16,
+    "eps": 1e-06,
+    "model_type": "i2v",
+    "target_video_length": 81,
+    "text_len": 512,
+    "target_height": 480,
+    "target_width": 832,
+    "self_attn_1_type": "sage_attn2",
+    "cross_attn_1_type": "sage_attn2",
+    "cross_attn_2_type": "sage_attn2",
+    "sample_guide_scale": [
+        3.5,
+        3.5
+    ],
+    "sample_shift": 5.0,
+    "enable_cfg": false,
+    "cpu_offload": true,
+    "offload_granularity": "block",
+    "rdma_buffer_slot_size": 8192,
+    "t5_cpu_offload": false,
+    "vae_cpu_offload": false,
+    "fps": 16,
+    "use_image_encoder": false,
+    "boundary_step_index": 2,
+    "denoising_step_list": [
+        1000,
+        750,
+        500,
+        250
+    ],
+    "dit_quantized": true,
+    "dit_quant_scheme": "int8-q8f",
+    "high_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
+    "low_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
+    "high_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
+    "low_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
+    "image_path": "/root/zht/LightX2V/assets/inputs/imgs/img_0.jpg",
+    "disagg_mode": "decoder",
+    "disagg_config": {
+        "bootstrap_addr": "192.168.0.166",
+        "bootstrap_room": 0,
+        "ranks": 8,
+        "encoder_engine_rank": 0,
+        "transformer_engine_rank": 1,
+        "decoder_engine_rank": 2,
+        "protocol": "rdma",
+        "local_hostname": "192.168.0.139",
+        "metadata_server": "P2PHANDSHAKE"
+    }
+}
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+{
+    "infer_steps": 4,
+    "in_dim": 36,
+    "dim": 5120,
+    "ffn_dim": 13824,
+    "freq_dim": 256,
+    "num_heads": 40,
+    "num_layers": 40,
+    "out_dim": 16,
+    "eps": 1e-06,
+    "model_type": "i2v",
+    "target_video_length": 81,
+    "text_len": 512,
+    "target_height": 480,
+    "target_width": 832,
+    "self_attn_1_type": "sage_attn2",
+    "cross_attn_1_type": "sage_attn2",
+    "cross_attn_2_type": "sage_attn2",
+    "sample_guide_scale": [
+        3.5,
+        3.5
+    ],
+    "sample_shift": 5.0,
+    "enable_cfg": false,
+    "cpu_offload": true,
+    "offload_granularity": "block",
+    "rdma_buffer_slot_size": 8192,
+    "t5_cpu_offload": false,
+    "vae_cpu_offload": false,
+    "fps": 16,
+    "use_image_encoder": false,
+    "boundary_step_index": 2,
+    "denoising_step_list": [
+        1000,
+        750,
+        500,
+        250
+    ],
+    "dit_quantized": true,
+    "dit_quant_scheme": "int8-q8f",
+    "high_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
+    "low_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
+    "high_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
+    "low_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
+    "image_path": "/root/zht/LightX2V/assets/inputs/imgs/img_0.jpg",
+    "disagg_mode": "encoder",
+    "disagg_config": {
+        "bootstrap_addr": "192.168.0.166",
+        "bootstrap_room": 0,
+        "ranks": 8,
+        "encoder_engine_rank": 0,
+        "transformer_engine_rank": 1,
+        "decoder_engine_rank": 2,
+        "protocol": "rdma",
+        "local_hostname": "192.168.0.139",
+        "metadata_server": "P2PHANDSHAKE"
+    }
+}
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+{
+    "infer_steps": 4,
+    "in_dim": 36,
+    "dim": 5120,
+    "ffn_dim": 13824,
+    "freq_dim": 256,
+    "num_heads": 40,
+    "num_layers": 40,
+    "out_dim": 16,
+    "eps": 1e-06,
+    "model_type": "i2v",
+    "target_video_length": 81,
+    "text_len": 512,
+    "target_height": 480,
+    "target_width": 832,
+    "self_attn_1_type": "sage_attn2",
+    "cross_attn_1_type": "sage_attn2",
+    "cross_attn_2_type": "sage_attn2",
+    "sample_guide_scale": [
+        3.5,
+        3.5
+    ],
+    "sample_shift": 5.0,
+    "enable_cfg": false,
+    "cpu_offload": true,
+    "offload_granularity": "block",
+    "rdma_buffer_slot_size": 8192,
+    "t5_cpu_offload": false,
+    "vae_cpu_offload": false,
+    "fps": 16,
+    "use_image_encoder": false,
+    "boundary_step_index": 2,
+    "denoising_step_list": [
+        1000,
+        750,
+        500,
+        250
+    ],
+    "dit_quantized": true,
+    "dit_quant_scheme": "int8-q8f",
+    "high_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
+    "low_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
+    "high_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
+    "low_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
+    "image_path": "/root/zht/LightX2V/assets/inputs/imgs/img_0.jpg",
+    "disagg_mode": "transformer",
+    "disagg_config": {
+        "bootstrap_addr": "192.168.0.166",
+        "bootstrap_room": 0,
+        "ranks": 8,
+        "encoder_engine_rank": 0,
+        "transformer_engine_rank": 1,
+        "decoder_engine_rank": 2,
+        "protocol": "rdma",
+        "local_hostname": "192.168.0.166",
+        "metadata_server": "P2PHANDSHAKE"
+    }
+}
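The decoder, encoder, and transformer configs above are identical except for `disagg_mode` and `disagg_config.local_hostname`. One way to keep them in sync is to derive each from a shared base, as in the sketch below. The helper `make_role_config`, the `ROLE_HOSTS` table, and the trimmed `base` dict are hypothetical and not part of the commit, which ships the three files separately.

```python
import copy


def make_role_config(base: dict, mode: str, local_hostname: str) -> dict:
    """Return a copy of base with the two role-specific fields overridden."""
    cfg = copy.deepcopy(base)  # deep copy so the nested disagg_config is not shared
    cfg["disagg_mode"] = mode
    cfg["disagg_config"]["local_hostname"] = local_hostname
    return cfg


# Trimmed base: the real configs also carry the model and quantization fields.
base = {
    "infer_steps": 4,
    "disagg_mode": "encoder",
    "disagg_config": {
        "bootstrap_addr": "192.168.0.166",
        "ranks": 8,
        "protocol": "rdma",
        "local_hostname": "192.168.0.139",
        "metadata_server": "P2PHANDSHAKE",
    },
}

# Host per role, taken from the three configs in this commit.
ROLE_HOSTS = {
    "encoder": "192.168.0.139",
    "transformer": "192.168.0.166",
    "decoder": "192.168.0.139",
}

role_configs = {mode: make_role_config(base, mode, host) for mode, host in ROLE_HOSTS.items()}
```

Each entry of `role_configs` could then be written out with `json.dump` to produce the per-role file.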
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
