Commit 4082ac9

feat: add Moore Threads MUSA runner (S5000/S4000) — moorethreads_vllm… (#47)
* feat: add Moore Threads MUSA runner (S5000/S4000) — moorethreads_vllm_musa_57ff5443

  Adds the AccelMark runner skeleton for Moore Threads MTT S5000 / S4000 GPUs via the official vllm-musa platform plugin. The plugin auto-patches vLLM at import time (torchada CUDA→MUSA aliasing + pymtml + Triton patches), so the standard vLLM Python API is preserved and the runner mirrors the structure of ascend_vllm_ascend.

  What is included:
  * runners/moorethreads_vllm_musa_57ff5443/ — runner.py, meta.json (with suite_support self-declaration), requirements.txt, README.md
  * configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml.example

  The README platforms matrix updates automatically from the runner's meta.json (no hand-editing required, thanks to the onboarding decoupling that landed in the preceding commit). The Moore Threads environment detector also already lives at runners/platforms/moorethreads.py in the same earlier commit.

  Notes:
  * Capability flags are conservative: SUPPORTED_QUANTIZATION_BACKENDS only declares compressed-tensors; FP8 / AWQ / GPTQ-Marlin will be enabled in a follow-up runner version once real-hardware smoke tests confirm kernel coverage on MUSA.
  * This code has not yet been validated on physical S5000 / S4000 silicon; all suites are marked "pending" in suite_support and smoke testing will land as a new runner folder with a fresh hash.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* update moore runner
* add moore schema
* upload moore results
* update

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 6a038ca commit 4082ac9

22 files changed

Lines changed: 2218 additions & 61 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -93,6 +93,7 @@ Reference runners live under `runners/` (see each folder’s `meta.json`). The t
 | Huawei Ascend NPU | `ascend_vllm_ascend_d4aa9fda` | vllm-ascend ||||||||
 | Apple Silicon | `apple_mlx_lm_9546b8b5` | mlx-lm ||||||||
 | Google TPU | `google_vllm_tpu_68cc9ffa` | vllm-tpu ||||||||
+| Moore Threads GPU | `moorethreads_vllm_musa_f2f6f965` | vllm-musa ||||||||
 
 _Legend: ✓ validated · ⋯ author-declared (not smoke-tested in this repo yet) · — unsupported._
 <!-- platforms-matrix:end -->
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# AccelMark runner config — moorethreads_vllm_musa_f2f6f965 (vllm-musa on Moore Threads)
+#
+# Copy this file to runner_moorethreads_vllm_musa_f2f6f965.yaml (remove
+# .example suffix) and edit as needed for your hardware. The actual .yaml
+# is gitignored.
+#
+# These settings adapt the runner to your hardware environment. They are
+# recorded in result.json task.extra_config for transparency but are NOT
+# part of the benchmark identity (not hashed into run_id).
+#
+# Merge priority: CLI flags > suite-specific > global defaults > runner defaults
+
+# ── Global defaults (apply to all suites) ─────────────────────────────────────
+
+# Tensor parallel size — number of Moore Threads GPUs to use (default: 1).
+# For multi-card runs make sure to export VLLM_WORKER_MULTIPROC_METHOD=spawn.
+tensor_parallel_size: 1
+
+# Disable Triton CUDA-graph / compilation. Set true if you hit Triton kernel
+# errors on first request (most common on S3000 / S80 paths).
+enforce_eager: false
+
+# Maximum number of sequences in a batch (default: 256).
+# Reduce on lower-memory cards: 128 on 24 GB cards, 64 on 16 GB cards.
+max_num_seqs: 256
+
+# Fraction of MUSA HBM reserved for the KV cache (default: 0.85). Reduce if
+# you hit OOM; the vLLM flag is named gpu_memory_utilization but applies to
+# MUSA HBM via torchada.
+gpu_memory_utilization: 0.85
+
+# Pass-through kwargs forwarded directly to vLLM LLM() / AsyncEngineArgs().
+# Unknown keys are dropped automatically with a warning, so this is safe to
+# use across vLLM 0.10.x / 0.13.x.
+# engine_kwargs:
+#   swap_space: 8
+#   max_seq_len_to_capture: 4096
+
+# ── Suite-specific overrides ───────────────────────────────────────────────────
+
+suites:
+  suite_D:
+    # Long-context — reduce batch size and reserve more memory.
+    max_num_seqs: 32
+    gpu_memory_utilization: 0.80
+
+  suite_F:
+    max_num_seqs: 128
+
+# ── Speculative decoding (suite_A / suite_D extra scenario) ─────────────────
+# Uncomment to enable. vllm-musa accepts the same speculative_config dict as
+# upstream vLLM; the runner translates flat keys (speculative_model,
+# num_speculative_tokens, ...) into speculative_config automatically.
+#
+# suites:
+#   suite_A:
+#     engine_kwargs:
+#       speculative_model: "meta-llama/Llama-3.2-1B-Instruct"
+#       num_speculative_tokens: 4
+#       speculative_draft_tensor_parallel_size: 1
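The merge priority stated in the config header (CLI flags > suite-specific > global defaults > runner defaults) can be pictured as a layered dictionary merge. The sketch below is only an illustration of that ordering, not the runner's actual code: `resolve_config`, its parameters, and the `runner_defaults` values are hypothetical names chosen for the example, and nested keys such as `engine_kwargs` are merged shallowly here, which may differ from what the runner does.

```python
from typing import Any, Optional


def resolve_config(
    runner_defaults: dict[str, Any],
    file_config: dict[str, Any],
    suite_id: str,
    cli_flags: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge settings lowest priority first, so later layers win."""
    # Everything in the file except the "suites" mapping is a global default.
    global_defaults = {k: v for k, v in file_config.items() if k != "suites"}
    suite_overrides = (file_config.get("suites") or {}).get(suite_id) or {}

    merged: dict[str, Any] = {}
    for layer in (runner_defaults, global_defaults, suite_overrides, cli_flags or {}):
        # Skip None so an unset CLI flag does not clobber a configured value.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


# Values copied from the example config above.
file_config = {
    "tensor_parallel_size": 1,
    "enforce_eager": False,
    "max_num_seqs": 256,
    "gpu_memory_utilization": 0.85,
    "suites": {
        "suite_D": {"max_num_seqs": 32, "gpu_memory_utilization": 0.80},
        "suite_F": {"max_num_seqs": 128},
    },
}
# Hypothetical built-in defaults, assumed for this example only.
runner_defaults = {"tensor_parallel_size": 1, "max_num_seqs": 256}

suite_d = resolve_config(runner_defaults, file_config, "suite_D")
suite_f = resolve_config(
    runner_defaults, file_config, "suite_F", cli_flags={"max_num_seqs": 64}
)
print(suite_d["max_num_seqs"], suite_d["gpu_memory_utilization"])
print(suite_f["max_num_seqs"], suite_f["gpu_memory_utilization"])
```

Under these assumptions, `suite_D` picks up its own `max_num_seqs` and `gpu_memory_utilization`, while a CLI value for `suite_F` overrides the suite-level `max_num_seqs` and the global `gpu_memory_utilization` still applies.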
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+{
+  "subset_score": 0.07,
+  "baseline_delta": -0.53,
+  "valid": false,
+  "framework": "vllm-musa",
+  "precision": "BF16",
+  "notes": "Integrated accuracy check \u2014 used same vllm-musa instance as benchmark."
+}
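To read the accuracy record above, it can help to recover the baseline it was compared against. The sketch below assumes that `baseline_delta` is defined as `subset_score - baseline`; that definition is not stated in this diff, so treat the result as an inference rather than a documented value. The variable names are illustrative.

```python
# Fields copied from the accuracy record above.
record = {"subset_score": 0.07, "baseline_delta": -0.53, "valid": False}

# Assumption: baseline_delta = subset_score - baseline.
implied_baseline = round(record["subset_score"] - record["baseline_delta"], 2)

print(f"implied baseline: {implied_baseline}")
print(f"score as a fraction of baseline: {record['subset_score'] / implied_baseline:.1%}")
```

Under that assumption the implied baseline is 0.6, so the measured score is well below it, which is consistent with the record being marked `"valid": false`.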
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+{
+  "collected_at": "2026-05-18T09:21:31.092840+00:00",
+  "accelerators": [
+    {
+      "index": 0,
+      "name": "MTT S4000",
+      "vendor": "Moore Threads",
+      "memory_gb": 48.0,
+      "driver_version": "2.7.0",
+      "firmware_version": null,
+      "supports_bf16": true
+    }
+  ],
+  "accelerator_platform": "moorethreads",
+  "accelerator_topology": null,
+  "intra_node_interconnect": null,
+  "cpu": {
+    "model": "Intel(R) Xeon(R) Gold 6430",
+    "physical_cores": 64,
+    "logical_cores": 128,
+    "numa_nodes": 2
+  },
+  "system_memory_gb": 1007.5,
+  "pcie_generation": "PCIe 16x/16x",
+  "cpu_accelerator_bandwidth_gbs": null,
+  "network_interfaces": [
+    {
+      "name": "mlx5_0",
+      "type": "InfiniBand/RoCE",
+      "bandwidth_gbps": null
+    },
+    {
+      "name": "mlx5_1",
+      "type": "InfiniBand/RoCE",
+      "bandwidth_gbps": null
+    },
+    {
+      "name": "mlx5_bond_0",
+      "type": "InfiniBand/RoCE",
+      "bandwidth_gbps": null
+    }
+  ],
+  "os": "Ubuntu Jammy Jellyfish (development branch)",
+  "python_version": "3.10.8",
+  "kernel_version": "5.15.0-105-generic",
+  "runtime_version": "Moore Threads Driver 2.7.0",
+  "pytorch_version": "2.2.0"
+}
Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
+{
+  "schema_version": "1.0",
+  "suite_id": "suite_A",
+  "implementation_id": "moorethreads_vllm_musa_f2f6f965",
+  "chip": {
+    "name": "MTT S4000",
+    "vendor": "Moore Threads",
+    "count": 1,
+    "memory_gb": 48.0,
+    "interconnect_intra_node": null,
+    "interconnect_inter_node": null
+  },
+  "environment": {
+    "collected_at": "2026-05-18T09:21:31.092840+00:00",
+    "accelerators": [
+      {
+        "index": 0,
+        "name": "MTT S4000",
+        "vendor": "Moore Threads",
+        "memory_gb": 48.0,
+        "driver_version": "2.7.0",
+        "firmware_version": null,
+        "supports_bf16": true
+      }
+    ],
+    "accelerator_platform": "moorethreads",
+    "accelerator_topology": null,
+    "intra_node_interconnect": null,
+    "cpu": {
+      "model": "Intel(R) Xeon(R) Gold 6430",
+      "physical_cores": 64,
+      "logical_cores": 128,
+      "numa_nodes": 2
+    },
+    "system_memory_gb": 1007.5,
+    "pcie_generation": "PCIe 16x/16x",
+    "cpu_accelerator_bandwidth_gbs": null,
+    "network_interfaces": [
+      {
+        "name": "mlx5_0",
+        "type": "InfiniBand/RoCE",
+        "bandwidth_gbps": null
+      },
+      {
+        "name": "mlx5_1",
+        "type": "InfiniBand/RoCE",
+        "bandwidth_gbps": null
+      },
+      {
+        "name": "mlx5_bond_0",
+        "type": "InfiniBand/RoCE",
+        "bandwidth_gbps": null
+      }
+    ],
+    "os": "Ubuntu Jammy Jellyfish (development branch)",
+    "python_version": "3.10.8",
+    "kernel_version": "5.15.0-105-generic",
+    "runtime_version": "Moore Threads Driver 2.7.0",
+    "pytorch_version": "2.2.0"
+  },
+  "software": {
+    "framework": "vllm-musa",
+    "framework_version": "0.4.2",
+    "driver_version": "2.7.0",
+    "runtime_version": "Moore Threads Driver 2.7.0",
+    "os": "Ubuntu Jammy Jellyfish (development branch)",
+    "python_version": "3.10.8"
+  },
+  "model": {
+    "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
+    "model_revision": "8afb486c1db24fe5011ec46dfbe5b5dccdb575c2",
+    "model_name": null,
+    "model_note": null,
+    "model_source": "local",
+    "architecture": "dense",
+    "parameter_count_b": 8.0,
+    "precision": "BF16",
+    "effective_dtype": "float16",
+    "quantization_method": null,
+    "model_format": "HuggingFace original"
+  },
+  "task": {
+    "scenario": "offline",
+    "num_runs": 3,
+    "warmup_runs": 1,
+    "parallelism": {
+      "tensor_parallel_size": 1,
+      "pipeline_parallel_size": 1,
+      "expert_parallel_size": 1,
+      "data_parallel_size": 1
+    },
+    "extra_config": null,
+    "runtime_metrics": null
+  },
+  "metrics": {
+    "offline": {
+      "results_by_concurrency": [
+        {
+          "client_concurrency": 8,
+          "throughput_tokens_per_sec": 332.62,
+          "throughput_tokens_per_sec_per_chip": 332.62,
+          "throughput_tokens_per_sec_total": 922.83,
+          "elapsed_seconds_median": 43.4,
+          "peak_memory_gb": null,
+          "power_watts_avg": null,
+          "power_watts_peak": null,
+          "oom": false,
+          "_throughput_note": "output_only",
+          "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs."
+        },
+        {
+          "client_concurrency": 32,
+          "throughput_tokens_per_sec": 331.64,
+          "throughput_tokens_per_sec_per_chip": 331.64,
+          "throughput_tokens_per_sec_total": 920.1,
+          "elapsed_seconds_median": 43.6,
+          "peak_memory_gb": null,
+          "power_watts_avg": null,
+          "power_watts_peak": null,
+          "oom": false,
+          "_throughput_note": "output_only",
+          "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs."
+        },
+        {
+          "client_concurrency": 128,
+          "throughput_tokens_per_sec": 331.76,
+          "throughput_tokens_per_sec_per_chip": 331.76,
+          "throughput_tokens_per_sec_total": 920.46,
+          "elapsed_seconds_median": 43.6,
+          "peak_memory_gb": null,
+          "power_watts_avg": null,
+          "power_watts_peak": null,
+          "oom": false,
+          "_throughput_note": "output_only",
+          "_concurrency_note": "client_concurrency is the number of requests sent simultaneously. The inference engine batches internally; this does not directly set engine parameters like max_num_seqs."
+        }
+      ]
+    }
+  },
+  "accuracy": {
+    "subset_score": null,
+    "baseline_delta": null,
+    "valid": false,
+    "notes": "Run --scenario accuracy to check model accuracy."
+  },
+  "meta": {
+    "submitted_by": "JuhaoLiang1997",
+    "submission_type": "individual",
+    "date": "2026-05-18",
+    "time": "17:34:52",
+    "run_id": "cabb7bd0",
+    "run_name": "mtt_s4000x1_suite_A_moorethreads_vllm_musa_f2f6f965_cabb7bd0",
+    "flagged": null,
+    "reproduce_script": "runners/moorethreads_vllm_musa_f2f6f965/runner.py",
+    "env_info_file": "../env_info.json",
+    "log_file": "run.log",
+    "samples_file": "samples.jsonl",
+    "notes": null,
+    "benchmark_start_time": "2026-05-18T09:26:10.676960+00:00",
+    "benchmark_end_time": "2026-05-18T09:34:52.667112+00:00",
+    "benchmark_elapsed_minutes": 8.7,
+    "model_load_seconds": 116.8
+  }
+}
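A few of the numbers in this result file can be cross-checked against each other. The sketch below recomputes the per-chip throughput, the spread across concurrency levels, the ratio between the `_total` and output-only figures, and the elapsed time from the two timestamps. The values are copied from the JSON above; the reading that `throughput_tokens_per_sec_total` counts prompt plus generated tokens is an assumption suggested by `"_throughput_note": "output_only"`, not something the file states, and the variable names are illustrative.

```python
from datetime import datetime

chip_count = 1  # chip.count in the result file

# (client_concurrency, output tok/s, per-chip tok/s, total tok/s)
rows = [
    (8, 332.62, 332.62, 922.83),
    (32, 331.64, 331.64, 920.10),
    (128, 331.76, 331.76, 920.46),
]

# Per-chip throughput should equal output throughput divided by chip count.
per_chip_ok = all(abs(out / chip_count - per_chip) < 1e-9 for _, out, per_chip, _ in rows)

# Relative spread of output throughput across the three concurrency levels.
outputs = [out for _, out, _, _ in rows]
spread = (max(outputs) - min(outputs)) / max(outputs)

# Ratio of "total" to output-only throughput at each concurrency level.
ratios = [round(total / out, 2) for _, out, _, total in rows]

# Elapsed wall-clock time between the recorded start and end timestamps.
start = datetime.fromisoformat("2026-05-18T09:26:10.676960+00:00")
end = datetime.fromisoformat("2026-05-18T09:34:52.667112+00:00")
elapsed_minutes = round((end - start).total_seconds() / 60, 1)

print(per_chip_ok, f"{spread:.2%}", ratios, elapsed_minutes)
```

The recomputed elapsed time matches `benchmark_elapsed_minutes`, and the output throughput varies by well under one percent between 8 and 128 concurrent requests, so in this run raising client concurrency did not change the measured throughput.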
