add base test #1044
Conversation
Code Review
This pull request introduces a benchmarking and monitoring framework for the disaggregated inference system, adding baseline configurations, inference scripts, a workload controller, and utilities for metrics extraction. The review feedback focuses on improving portability by replacing hardcoded absolute paths with relative paths or dynamic resolution across JSON, Python, and shell files. Additionally, the reviewer suggested merging redundant metrics scripts to enhance maintainability and capturing the current time once during metrics dumping to ensure timestamp consistency.
```json
"high_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
"low_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
"high_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
"low_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
```
The configuration file contains hardcoded absolute paths rooted in a specific user's home directory (/root/zht). This makes the configuration non-portable and difficult for other developers to use. These paths should be made relative to the project root or replaced with placeholders that can be substituted at runtime (e.g., via environment variables).
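One way to implement the placeholder approach: keep `${VAR}`-style tokens in the JSON and expand them from the environment when the config is loaded. A minimal sketch; the `MODEL_ROOT` variable, the helper name, and the file name in the example are illustrative, not part of the PR:

```python
import json
import os
import re


def resolve_env_placeholders(raw: str) -> str:
    """Expand ${VAR} tokens from the environment; leave unknown ones intact."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), m.group(0)), raw)


# Hypothetical config text using a placeholder instead of /root/zht/...:
raw_config = '{"high_noise_quantized_ckpt": "${MODEL_ROOT}/Wan2.2-Distill-Models/high_noise.safetensors"}'
os.environ["MODEL_ROOT"] = "/opt/models"
config = json.loads(resolve_env_placeholders(raw_config))
print(config["high_noise_quantized_ckpt"])
```

Unset variables are deliberately left as-is so a missing `MODEL_ROOT` fails loudly at model-load time rather than silently resolving to an empty path.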
```json
{
  "model_path": "/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B",
```

```python
parser.add_argument("--model_path", type=str, default="/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B")
parser.add_argument("--base_config_json", type=str, default="/root/zht/LightX2V/configs/disagg/baseline/wan22_moe_i2v_baseline.json")
parser.add_argument("--save_dir", type=str, default="/root/zht/LightX2V/save_results")
```
The default values for model_path, base_config_json, and save_dir arguments are hardcoded with an absolute path containing a specific user's home directory (/root/zht). This makes the script not portable. It's better to use relative paths or make these arguments required.
Suggested change:

```diff
-parser.add_argument("--model_path", type=str, default="/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B")
-parser.add_argument("--base_config_json", type=str, default="/root/zht/LightX2V/configs/disagg/baseline/wan22_moe_i2v_baseline.json")
-parser.add_argument("--save_dir", type=str, default="/root/zht/LightX2V/save_results")
+parser.add_argument("--model_path", type=str, required=True, help="Path to the model directory.")
+parser.add_argument("--base_config_json", type=str, required=True, help="Path to the base configuration JSON file.")
+parser.add_argument("--save_dir", type=str, default="save_results", help="Directory to save results.")
```
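An alternative to making every argument required is to derive defaults from the script's own location. A minimal sketch, assuming the script sits two directory levels below the repo root; the helper name and hop count are illustrative:

```python
from pathlib import Path


def default_paths(script_file: str, hops: int = 2) -> dict:
    """Derive portable defaults from the script location instead of /root/zht.

    `hops` is how many directory levels sit between the script and the repo
    root; a real script would pass __file__.
    """
    repo_root = Path(script_file).resolve().parents[hops]
    return {
        "base_config_json": repo_root / "configs/disagg/baseline/wan22_moe_i2v_baseline.json",
        "save_dir": repo_root / "save_results",
    }


# For a hypothetical script at <repo>/scripts/disagg/infer.py:
paths = default_paths("/repo/scripts/disagg/infer.py")
print(paths["save_dir"])
```

This keeps the one-command invocation working on any checkout while still letting users override the defaults on the command line.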
```python
parser.add_argument("--generate_requests", type=int, default=10)
parser.add_argument("--generate_interval_s", type=float, default=0.0)

parser.add_argument("--metrics_output_json", type=str, default="/root/zht/LightX2V/save_results/baseline_controller_metrics.json")
```
The default value for metrics_output_json is hardcoded with an absolute path containing a specific user's home directory (/root/zht). This makes the script not portable. Please use a relative path.
Suggested change:

```diff
-parser.add_argument("--metrics_output_json", type=str, default="/root/zht/LightX2V/save_results/baseline_controller_metrics.json")
+parser.add_argument("--metrics_output_json", type=str, default="save_results/baseline_controller_metrics.json", help="Path to save controller metrics.")
```
```python
summary = {
    "requests": self._to_plain(received_results),
    "monitor_samples": self._to_plain(self._monitor_samples),
    "generated_at": time.time(),
}
if batch_request_start_ts is not None:
    summary["batch_total_time_s"] = time.time() - batch_request_start_ts
if self._controller_start_ts is not None:
    summary["controller_uptime_s"] = time.time() - self._controller_start_ts
```
time.time() is called multiple times within this function to calculate different time-related metrics (generated_at, batch_total_time_s, controller_uptime_s). This can lead to slight inconsistencies in timestamps. It's better to capture the current time once at the beginning of the function and reuse that value for all calculations within the function.
Suggested change:

```diff
-summary = {
-    "requests": self._to_plain(received_results),
-    "monitor_samples": self._to_plain(self._monitor_samples),
-    "generated_at": time.time(),
-}
-if batch_request_start_ts is not None:
-    summary["batch_total_time_s"] = time.time() - batch_request_start_ts
-if self._controller_start_ts is not None:
-    summary["controller_uptime_s"] = time.time() - self._controller_start_ts
+now = time.time()
+summary = {
+    "requests": self._to_plain(received_results),
+    "monitor_samples": self._to_plain(self._monitor_samples),
+    "generated_at": now,
+}
+if batch_request_start_ts is not None:
+    summary["batch_total_time_s"] = now - batch_request_start_ts
+if self._controller_start_ts is not None:
+    summary["controller_uptime_s"] = now - self._controller_start_ts
```
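A further hedge, beyond reusing one timestamp: wall-clock time can jump if NTP adjusts the system clock, so durations are safer when measured with a monotonic clock, keeping `time.time()` only for the wall-clock `generated_at` field. A minimal sketch; the variable names are illustrative, not from the PR:

```python
import time

# time.time() can move backwards on clock adjustments; time.monotonic()
# never does, so it is the safer source for elapsed-time metrics.
start = time.monotonic()               # e.g. captured when the controller starts
# ... work would happen here ...
elapsed_s = time.monotonic() - start   # duration, immune to clock jumps
generated_at = time.time()             # wall-clock timestamp for the summary
```

This would require storing monotonic start marks alongside (or instead of) the existing `*_start_ts` wall-clock values.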
```python
parser.add_argument(
    "--metrics",
    default="/root/zht/LightX2V/save_results/baseline_controller_metrics.json",
    help="Input baseline metrics json path",
)
parser.add_argument(
    "--output",
    default="/root/zht/LightX2V/save_results/base_wan22_i2v.csv",
    help="Output csv path",
)
```
The default paths for --metrics and --output are hardcoded with a user-specific absolute path (/root/zht). This makes the script not portable. Please consider using relative paths or making the arguments required.
Suggested change:

```python
parser.add_argument(
    "--metrics",
    required=True,
    help="Input baseline metrics json path",
)
parser.add_argument(
    "--output",
    required=True,
    help="Output csv path",
)
```

```python
parser.add_argument(
    "--metrics",
    default="/root/zht/LightX2V/save_results/baseline_controller_metrics_4steps_p8.json",
    help="Input GPU metrics json path",
)
parser.add_argument(
    "--output",
    default="/root/zht/LightX2V/save_results/base_wan22_i2v_util.csv",
    help="Output csv path",
)
```
The default paths for --metrics and --output are hardcoded with a user-specific absolute path (/root/zht). This makes the script not portable. Please consider using relative paths or making the arguments required.
Suggested change:

```python
parser.add_argument(
    "--metrics",
    required=True,
    help="Input GPU metrics json path",
)
parser.add_argument(
    "--output",
    required=True,
    help="Output csv path",
)
```

```python
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import csv
import json
from collections import defaultdict
from pathlib import Path


def _fmt_float3(value):
    try:
        return f"{float(value):.3f}"
    except (TypeError, ValueError):
        return ""


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Extract average GPU utilization rows from disagg_controller_metrics.json")
    parser.add_argument(
        "--metrics",
        default="/root/zht/LightX2V/save_results/disagg_controller_metrics.json",
        help="Input disagg metrics json path",
    )
    parser.add_argument(
        "--output",
        default="/root/zht/LightX2V/save_results/disagg_wan22_i2v_util.csv",
        help="Output csv path",
    )
    return parser.parse_args()


def main() -> int:
    args = parse_args()
    metrics_path = Path(args.metrics)
    out_path = Path(args.output)

    if not metrics_path.is_file():
        raise FileNotFoundError(f"metrics file not found: {metrics_path}")

    with metrics_path.open("r", encoding="utf-8") as f:
        payload = json.load(f)

    samples = payload.get("monitor_samples", [])
    if not isinstance(samples, list):
        raise ValueError(f"invalid metrics format: monitor_samples must be a list, got {type(samples)}")

    grouped_samples = defaultdict(list)
    for item in samples:
        if not isinstance(item, dict):
            continue
        if item.get("status") != "ok":
            continue
        sample_ts = item.get("sample_ts_from_global_start_s")
        if sample_ts is None:
            continue
        try:
            grouped_samples[float(sample_ts)].append(item)
        except (TypeError, ValueError):
            continue

    rows = []
    for sample_ts in sorted(grouped_samples):
        group = grouped_samples[sample_ts]
        gpu_utils = []
        mem_utils = []
        for item in group:
            gpu_util = item.get("gpu_utilization")
            mem_used = item.get("gpu_memory_used_mb")
            mem_total = item.get("gpu_memory_total_mb")
            if gpu_util is not None:
                gpu_utils.append(float(gpu_util))
            if mem_used is not None and mem_total:
                mem_utils.append(float(mem_used) / float(mem_total) * 100.0)

        if not gpu_utils or not mem_utils:
            continue

        rows.append(
            {
                "time_from_start_s": _fmt_float3(sample_ts),
                "avg_gpu_utilization": _fmt_float3(sum(gpu_utils) / len(gpu_utils)),
                "avg_gpu_memory_occupancy_rate": _fmt_float3(sum(mem_utils) / len(mem_utils)),
            }
        )

    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=[
                "time_from_start_s",
                "avg_gpu_utilization",
                "avg_gpu_memory_occupancy_rate",
            ],
        )
        writer.writeheader()
        for row in rows:
            writer.writerow(row)

    print(f"wrote {len(rows)} rows to {out_path}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```
This script is nearly identical to scripts/disagg/extract_base_utilization.py. The only significant difference is the default file paths. To improve maintainability and reduce code duplication, consider merging these two scripts into a single one that accepts different file paths as arguments. You could keep two separate wrapper shell scripts if you want to preserve the simple invocation.
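As a sketch of the merged layout: a single argparse front end with required paths can serve both the baseline and disagg runs, with the old default paths living only in thin wrapper scripts. The flag names below mirror the scripts under review; the invocation shown is illustrative:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # One parser serves both the baseline and disagg invocations; the
    # differing default paths move into the wrapper scripts, not this code.
    parser = argparse.ArgumentParser(
        description="Extract average GPU utilization rows from a controller metrics json"
    )
    parser.add_argument("--metrics", required=True, help="Input metrics json path")
    parser.add_argument("--output", required=True, help="Output csv path")
    return parser


# A wrapper shell script for the disagg case might effectively run:
args = build_parser().parse_args(
    ["--metrics", "save_results/disagg_controller_metrics.json", "--output", "disagg_util.csv"]
)
print(args.output)
```

The grouping and averaging logic is already identical in both scripts, so only the CLI layer needs to change.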
```bash
SCRIPT_NAMES=("run_baseline.sh")

lightx2v_path=${LIGHTX2V_PATH:-/root/zht/LightX2V}
```
The lightx2v_path variable is hardcoded to /root/zht/LightX2V as a fallback. This makes the script not portable. It should be determined dynamically relative to the script's location.
Suggested change:

```diff
-lightx2v_path=${LIGHTX2V_PATH:-/root/zht/LightX2V}
+lightx2v_path=${LIGHTX2V_PATH:-"$(dirname "$0")/../.."}
```
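If the fallback should also survive later `cd`s, the script directory can be resolved to an absolute path first. A minimal sketch, assuming the script sits two levels below the repo root (adjust the `../..` to the real layout):

```shell
#!/usr/bin/env bash
# Resolve the directory containing this script to an absolute path, then
# derive the repo root from it; $LIGHTX2V_PATH still wins if it is set.
script_dir="$(cd "$(dirname "$0")" && pwd)"
lightx2v_path="${LIGHTX2V_PATH:-$(cd "${script_dir}/../.." && pwd)}"
echo "lightx2v_path=${lightx2v_path}"
```

Because `pwd` returns an absolute path, downstream scripts that change directory keep pointing at the same repo root.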
```bash
lightx2v_path=${LIGHTX2V_PATH:-/root/zht/LightX2V}
model_path=${WAN22_MOE_MODEL_PATH:-/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B}
```
The lightx2v_path and model_path variables have hardcoded fallbacks with a user-specific path (/root/zht). This harms portability. lightx2v_path should be determined dynamically relative to the script's location, and model_path should probably be passed as an argument or configured in a more portable way.
Suggested change:

```diff
-lightx2v_path=${LIGHTX2V_PATH:-/root/zht/LightX2V}
-model_path=${WAN22_MOE_MODEL_PATH:-/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B}
+lightx2v_path=${LIGHTX2V_PATH:-"$(dirname "$0")/../.."}
+model_path=${WAN22_MOE_MODEL_PATH:-"${lightx2v_path}/models/Wan-AI/Wan2.2-I2V-A14B"}
```
This pull request introduces new configuration files, adds a flexible inference script, and implements tools for extracting and analyzing controller metrics related to latency and GPU utilization. It also enhances the controller service to record and export detailed metrics for improved monitoring and analysis.
Key changes:
**Configuration and Inference:**
- Added `wan22_moe_i2v_baseline.json` (4-step inference) and `wan22_moe_i2v_baseline_50steps.json` (50-step inference) configurations, supporting the Wan2.2 I2V model with parallelization options. [1] [2]
- Added an `infer.py` script for running inference with various LightX2V models, supporting a wide range of tasks and flexible command-line arguments.

**Controller Metrics and Monitoring:**
- Enhanced the controller service (`controller.py`) to record detailed monitoring samples, track controller uptime, and save all metrics (including requests and periodic samples) to a configurable JSON file on shutdown, enabling more thorough post-run analysis. [1] [2] [3] [4] [5]

**Metrics Extraction Utilities:**
- Added `extract_base_latency.py` to extract and export baseline request latency data from the metrics JSON to CSV, including finish time and e2e latency.
- Added `extract_base_utilization.py` to process and export average GPU utilization and memory occupancy rates from controller monitoring samples to CSV for further analysis.