Skip to content

add base test#1044

Open
zhtshr wants to merge 1 commit intoModelTC:mainfrom
zhtshr:zht_dev
Open

add base test#1044
zhtshr wants to merge 1 commit intoModelTC:mainfrom
zhtshr:zht_dev

Conversation

@zhtshr
Copy link
Copy Markdown
Contributor

@zhtshr zhtshr commented Apr 29, 2026

This pull request introduces new configuration files, adds a flexible inference script, and implements tools for extracting and analyzing controller metrics related to latency and GPU utilization. It also enhances the controller service to record and export detailed metrics for improved monitoring and analysis.

Key changes:

Configuration and Inference:

  • Added two new model configuration files for different inference settings: wan22_moe_i2v_baseline.json (4-step inference) and wan22_moe_i2v_baseline_50steps.json (50-step inference), supporting the Wan2.2 I2V model with parallelization options. [1] [2]
  • Introduced a new, comprehensive infer.py script for running inference with various LightX2V models, supporting a wide range of tasks and flexible command-line arguments.

Controller Metrics and Monitoring:

  • Enhanced the controller (controller.py) to record detailed monitoring samples, track controller uptime, and save all metrics (including requests and periodic samples) to a configurable JSON file on shutdown. This enables more thorough post-run analysis. [1] [2] [3] [4] [5]

Metrics Extraction Utilities:

  • Added extract_base_latency.py to extract and export baseline request latency data from metrics JSON to CSV, including finish time and e2e latency.
  • Added extract_base_utilization.py to process and export average GPU utilization and memory occupancy rates from controller monitoring samples to CSV for further analysis.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a benchmarking and monitoring framework for the disaggregated inference system, adding baseline configurations, inference scripts, a workload controller, and utilities for metrics extraction. The review feedback focuses on improving portability by replacing hardcoded absolute paths with relative paths or dynamic resolution across JSON, Python, and shell files. Additionally, the reviewer suggested merging redundant metrics scripts to enhance maintainability and capturing the current time once during metrics dumping to ensure timestamp consistency.

Comment on lines +40 to +43
"high_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
"low_noise_quantized_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
"high_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_int8_lightx2v_4step.safetensors",
"low_noise_original_ckpt": "/root/zht/LightX2V/models/lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_int8_lightx2v_4step.safetensors",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The configuration file contains hardcoded absolute paths with a specific user's home directory (/root/zht). This makes the configuration not portable and difficult for other developers to use. These paths should be made relative to the project root or be replaced by placeholders that can be substituted at runtime (e.g., via environment variables).

@@ -0,0 +1,36 @@
{
"model_path": "/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This configuration file contains a hardcoded absolute path with a specific user's home directory (/root/zht). This harms portability. Please consider using a relative path or a placeholder that can be configured at runtime.

Comment on lines +71 to +73
parser.add_argument("--model_path", type=str, default="/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B")
parser.add_argument("--base_config_json", type=str, default="/root/zht/LightX2V/configs/disagg/baseline/wan22_moe_i2v_baseline.json")
parser.add_argument("--save_dir", type=str, default="/root/zht/LightX2V/save_results")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The default values for model_path, base_config_json, and save_dir arguments are hardcoded with an absolute path containing a specific user's home directory (/root/zht). This makes the script not portable. It's better to use relative paths or make these arguments required.

Suggested change
parser.add_argument("--model_path", type=str, default="/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B")
parser.add_argument("--base_config_json", type=str, default="/root/zht/LightX2V/configs/disagg/baseline/wan22_moe_i2v_baseline.json")
parser.add_argument("--save_dir", type=str, default="/root/zht/LightX2V/save_results")
parser.add_argument("--model_path", type=str, required=True, help="Path to the model directory.")
parser.add_argument("--base_config_json", type=str, required=True, help="Path to the base configuration JSON file.")
parser.add_argument("--save_dir", type=str, default="save_results", help="Directory to save results.")

parser.add_argument("--generate_requests", type=int, default=10)
parser.add_argument("--generate_interval_s", type=float, default=0.0)

parser.add_argument("--metrics_output_json", type=str, default="/root/zht/LightX2V/save_results/baseline_controller_metrics.json")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The default value for metrics_output_json is hardcoded with an absolute path containing a specific user's home directory (/root/zht). This makes the script not portable. Please use a relative path.

Suggested change
parser.add_argument("--metrics_output_json", type=str, default="/root/zht/LightX2V/save_results/baseline_controller_metrics.json")
parser.add_argument("--metrics_output_json", type=str, default="save_results/baseline_controller_metrics.json", help="Path to save controller metrics.")

Comment on lines +1082 to +1090
summary = {
"requests": self._to_plain(received_results),
"monitor_samples": self._to_plain(self._monitor_samples),
"generated_at": time.time(),
}
if batch_request_start_ts is not None:
summary["batch_total_time_s"] = time.time() - batch_request_start_ts
if self._controller_start_ts is not None:
summary["controller_uptime_s"] = time.time() - self._controller_start_ts
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

time.time() is called multiple times within this function to calculate different time-related metrics (generated_at, batch_total_time_s, controller_uptime_s). This can lead to slight inconsistencies in timestamps. It's better to capture the current time once at the beginning of the function and reuse that value for all calculations within the function.

Suggested change
summary = {
"requests": self._to_plain(received_results),
"monitor_samples": self._to_plain(self._monitor_samples),
"generated_at": time.time(),
}
if batch_request_start_ts is not None:
summary["batch_total_time_s"] = time.time() - batch_request_start_ts
if self._controller_start_ts is not None:
summary["controller_uptime_s"] = time.time() - self._controller_start_ts
now = time.time()
summary = {
"requests": self._to_plain(received_results),
"monitor_samples": self._to_plain(self._monitor_samples),
"generated_at": now,
}
if batch_request_start_ts is not None:
summary["batch_total_time_s"] = now - batch_request_start_ts
if self._controller_start_ts is not None:
summary["controller_uptime_s"] = now - self._controller_start_ts

Comment on lines +20 to +28
"--metrics",
default="/root/zht/LightX2V/save_results/baseline_controller_metrics.json",
help="Input baseline metrics json path",
)
parser.add_argument(
"--output",
default="/root/zht/LightX2V/save_results/base_wan22_i2v.csv",
help="Output csv path",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The default paths for --metrics and --output are hardcoded with a user-specific absolute path (/root/zht). This makes the script not portable. Please consider using relative paths or making the arguments required.

    parser.add_argument(
        "--metrics",
        required=True,
        help="Input baseline metrics json path",
    )
    parser.add_argument(
        "--output",
        required=True,
        help="Output csv path",
    )

Comment on lines +21 to +29
"--metrics",
default="/root/zht/LightX2V/save_results/baseline_controller_metrics_4steps_p8.json",
help="Input GPU metrics json path",
)
parser.add_argument(
"--output",
default="/root/zht/LightX2V/save_results/base_wan22_i2v_util.csv",
help="Output csv path",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The default paths for --metrics and --output are hardcoded with a user-specific absolute path (/root/zht). This makes the script not portable. Please consider using relative paths or making the arguments required.

    parser.add_argument(
        "--metrics",
        required=True,
        help="Input GPU metrics json path",
    )
    parser.add_argument(
        "--output",
        required=True,
        help="Output csv path",
    )

Comment on lines +1 to +106
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import csv
import json
from collections import defaultdict
from pathlib import Path


def _fmt_float3(value):
try:
return f"{float(value):.3f}"
except (TypeError, ValueError):
return ""


def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Extract average GPU utilization rows from disagg_controller_metrics.json")
parser.add_argument(
"--metrics",
default="/root/zht/LightX2V/save_results/disagg_controller_metrics.json",
help="Input disagg metrics json path",
)
parser.add_argument(
"--output",
default="/root/zht/LightX2V/save_results/disagg_wan22_i2v_util.csv",
help="Output csv path",
)
return parser.parse_args()


def main() -> int:
args = parse_args()
metrics_path = Path(args.metrics)
out_path = Path(args.output)

if not metrics_path.is_file():
raise FileNotFoundError(f"metrics file not found: {metrics_path}")

with metrics_path.open("r", encoding="utf-8") as f:
payload = json.load(f)

samples = payload.get("monitor_samples", [])
if not isinstance(samples, list):
raise ValueError(f"invalid metrics format: monitor_samples must be a list, got {type(samples)}")

grouped_samples = defaultdict(list)
for item in samples:
if not isinstance(item, dict):
continue
if item.get("status") != "ok":
continue
sample_ts = item.get("sample_ts_from_global_start_s")
if sample_ts is None:
continue
try:
grouped_samples[float(sample_ts)].append(item)
except (TypeError, ValueError):
continue

rows = []
for sample_ts in sorted(grouped_samples):
group = grouped_samples[sample_ts]
gpu_utils = []
mem_utils = []
for item in group:
gpu_util = item.get("gpu_utilization")
mem_used = item.get("gpu_memory_used_mb")
mem_total = item.get("gpu_memory_total_mb")
if gpu_util is not None:
gpu_utils.append(float(gpu_util))
if mem_used is not None and mem_total:
mem_utils.append(float(mem_used) / float(mem_total) * 100.0)

if not gpu_utils or not mem_utils:
continue

rows.append(
{
"time_from_start_s": _fmt_float3(sample_ts),
"avg_gpu_utilization": _fmt_float3(sum(gpu_utils) / len(gpu_utils)),
"avg_gpu_memory_occupancy_rate": _fmt_float3(sum(mem_utils) / len(mem_utils)),
}
)

out_path.parent.mkdir(parents=True, exist_ok=True)
with out_path.open("w", encoding="utf-8", newline="") as f:
writer = csv.DictWriter(
f,
fieldnames=[
"time_from_start_s",
"avg_gpu_utilization",
"avg_gpu_memory_occupancy_rate",
],
)
writer.writeheader()
for row in rows:
writer.writerow(row)

print(f"wrote {len(rows)} rows to {out_path}")
return 0


if __name__ == "__main__":
raise SystemExit(main())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This script is nearly identical to scripts/disagg/extract_base_utilization.py. The only significant difference is the default file paths. To improve maintainability and reduce code duplication, consider merging these two scripts into a single one that accepts different file paths as arguments. You could keep two separate wrapper shell scripts if you want to preserve the simple invocation.


SCRIPT_NAMES=("run_baseline.sh")

lightx2v_path=${LIGHTX2V_PATH:-/root/zht/LightX2V}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The lightx2v_path variable is hardcoded to /root/zht/LightX2V as a fallback. This makes the script not portable. It should be determined dynamically relative to the script's location.

Suggested change
lightx2v_path=${LIGHTX2V_PATH:-/root/zht/LightX2V}
lightx2v_path=${LIGHTX2V_PATH:-"$(dirname "$0")/../.."}

Comment on lines +5 to +6
lightx2v_path=${LIGHTX2V_PATH:-/root/zht/LightX2V}
model_path=${WAN22_MOE_MODEL_PATH:-/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The lightx2v_path and model_path variables have hardcoded fallbacks with a user-specific path (/root/zht). This harms portability. lightx2v_path should be determined dynamically relative to the script's location, and model_path should probably be passed as an argument or configured in a more portable way.

Suggested change
lightx2v_path=${LIGHTX2V_PATH:-/root/zht/LightX2V}
model_path=${WAN22_MOE_MODEL_PATH:-/root/zht/LightX2V/models/Wan-AI/Wan2.2-I2V-A14B}
lightx2v_path=${LIGHTX2V_PATH:-"$(dirname "$0")/../.."}
model_path=${WAN22_MOE_MODEL_PATH:-"${lightx2v_path}/models/Wan-AI/Wan2.2-I2V-A14B"}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant