Commit 99e4a06

feat: add Moore Threads MUSA runner (S5000/S4000) — moorethreads_vllm_musa_57ff5443
Adds the AccelMark runner skeleton for Moore Threads MTT S5000 / S4000 GPUs via the official vllm-musa platform plugin. The plugin auto-patches vLLM at import time (torchada CUDA→MUSA aliasing + pymtml + Triton patches), so the standard vLLM Python API is preserved and the runner mirrors the structure of ascend_vllm_ascend.

What is included:
* runners/moorethreads_vllm_musa_57ff5443/ — runner.py, meta.json (with suite_support self-declaration), requirements.txt, README.md
* configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml.example

The README platforms matrix updates automatically from the runner's meta.json (no hand-editing required, thanks to the onboarding decoupling that landed in the preceding commit). The Moore Threads environment detector also already lives at runners/platforms/moorethreads.py in the same earlier commit.

Notes:
* Capability flags are conservative: SUPPORTED_QUANTIZATION_BACKENDS only declares compressed-tensors; FP8 / AWQ / GPTQ-Marlin will be enabled in a follow-up runner version once real-hardware smoke tests confirm kernel coverage on MUSA.
* This code has not yet been validated on physical S5000 / S4000 silicon; all suites are marked "pending" in suite_support and smoke testing will land as a new runner folder with a fresh hash.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 3529759 commit 99e4a06

6 files changed

Lines changed: 917 additions & 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -93,6 +93,7 @@ Reference runners live under `runners/` (see each folder’s `meta.json`). The t
 | Huawei Ascend NPU | `ascend_vllm_ascend_d4aa9fda` | vllm-ascend ||||||||
 | Apple Silicon | `apple_mlx_lm_9546b8b5` | mlx-lm ||||||||
 | Google TPU | `google_vllm_tpu_68cc9ffa` | vllm-tpu ||||||||
+| Moore Threads GPU | `moorethreads_vllm_musa_57ff5443` | vllm-musa ||||||||
 
 _Legend: ✓ validated · ⋯ author-declared (not smoke-tested in this repo yet) · — unsupported._
 <!-- platforms-matrix:end -->

configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml.example

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# AccelMark runner config — moorethreads_vllm_musa_57ff5443 (vllm-musa on Moore Threads)
#
# Copy this file to runner_moorethreads_vllm_musa_57ff5443.yaml (remove
# .example suffix) and edit as needed for your hardware. The actual .yaml
# is gitignored.
#
# These settings adapt the runner to your hardware environment. They are
# recorded in result.json task.extra_config for transparency but are NOT
# part of the benchmark identity (not hashed into run_id).
#
# Merge priority: CLI flags > suite-specific > global defaults > runner defaults

# ── Global defaults (apply to all suites) ─────────────────────────────────────

# Tensor parallel size — number of Moore Threads GPUs to use (default: 1).
# For multi-card runs make sure to export VLLM_WORKER_MULTIPROC_METHOD=spawn.
tensor_parallel_size: 1

# Disable Triton CUDA-graph / compilation. Set true if you hit Triton kernel
# errors on first request (most common on S3000 / S80 paths).
enforce_eager: false

# Maximum number of sequences in a batch (default: 256).
# Reduce on lower-memory cards: 128 on 24 GB cards, 64 on 16 GB cards.
max_num_seqs: 256

# Fraction of MUSA device memory vLLM may use for weights, activations, and
# the KV cache (default: 0.85). Reduce if you hit OOM; the vLLM flag is named
# gpu_memory_utilization but applies to MUSA HBM via torchada.
gpu_memory_utilization: 0.85

# Pass-through kwargs forwarded directly to vLLM LLM() / AsyncEngineArgs().
# Unknown keys are dropped automatically with a warning, so this is safe to
# use across vLLM 0.10.x / 0.13.x.
# engine_kwargs:
#   swap_space: 8
#   max_seq_len_to_capture: 4096

# ── Suite-specific overrides ───────────────────────────────────────────────────

suites:
  suite_D:
    # Long-context: reduce batch size and leave more memory headroom.
    max_num_seqs: 32
    gpu_memory_utilization: 0.80

  suite_F:
    # Consumer / edge GPU: enforce_eager is often safer for first runs.
    # enforce_eager: true
    max_num_seqs: 128

# ── Speculative decoding (suite_A / suite_D extra scenario) ─────────────────
# To enable, merge this into the `suites:` mapping above (a second top-level
# `suites:` key would be a duplicate). vllm-musa accepts upstream vLLM's
# speculative_config dict; the runner builds it from the flat keys below.
#
# suites:
#   suite_A:
#     engine_kwargs:
#       speculative_model: "meta-llama/Llama-3.2-1B-Instruct"
#       num_speculative_tokens: 4
#       speculative_draft_tensor_parallel_size: 1

runners/moorethreads_vllm_musa_57ff5443/README.md

Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
# moorethreads_vllm_musa_57ff5443 — Moore Threads MUSA Runner (vllm-musa)

AccelMark runner for Moore Threads MUSA GPUs using
[vllm-musa](https://github.com/MooreThreads/vllm-musa), the official vLLM
platform plugin for MUSA hardware.

> **Status:** This runner is **untested on real silicon at the time of
> commit**. The code is written against the public `vllm-musa` plugin
> documentation and follows the structural template of the
> `ascend_vllm_ascend_*` runner. A smoke test on an S5000 / S4000 system is
> planned; capability flags and dtype mappings may be adjusted in a follow-up
> runner version (new hash, new folder) based on the findings.

## How vllm-musa works

`vllm-musa` is a vLLM **platform plugin** (auto-detected on `import vllm`)
that makes the standard vLLM Python API run on Moore Threads MUSA GPUs. It
relies on three components:

| Component | Role |
|---|---|
| `torchada` | CUDA→MUSA compatibility layer for PyTorch; aliases `torch.cuda.*` to MUSA so most code paths run unmodified |
| `pymtml` (`mthreads-ml-py`) | Moore Threads Management Library bindings, equivalent to `nvidia-ml-py` |
| Triton patches | Runtime monkey-patches in `vllm_musa_platform.patches.*` that fix `triton.attention` and `worker` modules for MUSA's Triton compiler |

The standard `vllm.LLM`, `vllm.AsyncLLMEngine`, and `vllm.SamplingParams`
remain the entry points, so this runner reuses roughly 95% of the logic
from the NVIDIA / Ascend vLLM runners.
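
One way to sanity-check that the three components are installed before launching a run is sketched below. It only probes for the modules with `importlib.util.find_spec` and never touches the device. The helper name `probe_musa_stack` is hypothetical, and the top-level module names `torchada`, `pymtml`, and `vllm_musa_platform` are assumed from the names used in this README.

```python
import importlib.util

# Module names assumed from this README; adjust if the installed packages differ.
MUSA_STACK_MODULES = ("torchada", "pymtml", "vllm_musa_platform")


def probe_musa_stack(modules=MUSA_STACK_MODULES):
    """Return {module_name: True/False} without importing the modules."""
    found = {}
    for name in modules:
        try:
            found[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-installed packages.
            found[name] = False
    return found


if __name__ == "__main__":
    for name, ok in probe_musa_stack().items():
        print(f"{name:20s} {'found' if ok else 'MISSING'}")
```

A module being found only shows that it is installed; whether the plugin actually patches vLLM still has to be confirmed on MUSA hardware.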

## Supported suites

| Suite | Description | Notes |
|-------|-------------|-------|
| Suite A | Single-chip, Llama-3-8B | Pending smoke test on S4000 / S5000 |
| Suite B | Multi-chip, Llama-3-70B | Requires multiple Moore Threads cards + MCCL TP |
| Suite C | Quantization, Llama-3.1-8B | FP8 skipped (no native FP8 in current MUSA hardware); compressed-tensors W8A8/W8A16 candidate; AWQ / GPTQ pending validation |
| Suite D | Long context ~28K input, Llama-3.1-8B | Reduce `max_num_seqs` and `gpu_memory_utilization` |
| Suite E | Multi-chip scaling, Llama-3-8B | Validates MCCL tensor parallelism |
| Suite F | Consumer/edge, Qwen2.5-0.5B | Recommended starting point for S4000 single-card systems |

## Hardware compatibility

| GPU | BF16 | TP via MCCL | FP8 | Notes |
|-----|------|-------------|-----|-------|
| MTT S5000 | | | | Recommended public reference target (FA3 via MATE) |
| MTT S4000 | | | | Validated path with PyTorch SDPA-based FlashAttention |
| MTT S3000 | ⚠️ | ⚠️ | | May work via `--enforce-eager`; not the public reference |
| MTT S80 | ⚠️ | | | Consumer card; treat as best-effort |

## Prerequisites

Install the MUSA stack in this exact order; Python packages alone are not
sufficient:

**1. MUSA toolkit + driver**

Match the toolkit version to your card firmware. Reference:
<https://developer.mthreads.com/musa/>

**2. PyTorch with MUSA support (torch + torchada)**

The recommended path is the official Moore Threads container, which ships a
pre-built `torch==2.7.1` together with `torchada` and `pymtml`. Pull it with:

```bash
docker pull sh-harbor.mthreads.com/mcctest/musa-compile:rc4.3.3-torch2.7-20251120
```

**3. Runner dependencies**

Inside the MUSA container:

```bash
pip install -r runners/moorethreads_vllm_musa_57ff5443/requirements.txt
```

This installs `vllm-musa==0.1.1`, which auto-pulls a validated vLLM core
(`0.10.1.1` by default). To use vLLM `0.13.0` instead (V1-only engine):

```bash
pip install vllm==0.13.0 --no-deps --upgrade
pip install 'depyf==0.20.0' 'llguidance>=1.3.0,<1.4.0' \
    'lm-format-enforcer==0.11.3' 'outlines_core==0.2.11' \
    'xgrammar==0.1.27' 'compressed-tensors==0.12.2'
```

## Required environment variables

```bash
# Device visibility (works like CUDA_VISIBLE_DEVICES)
export MUSA_VISIBLE_DEVICES=0,1,2,3

# Recommended for multi-process workers (TP > 1)
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
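
When a run is launched from a Python wrapper rather than a shell, the same two variables can be set on a copy of the environment. The following is a minimal sketch; the helper `musa_env` is hypothetical and not part of the runner.

```python
import os


def musa_env(device_ids, base_env=None):
    """Return a copy of the environment with the MUSA variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MUSA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    if len(device_ids) > 1:
        # Multi-process workers (TP > 1): spawn is the recommended start method.
        env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    return env


if __name__ == "__main__":
    env = musa_env([0, 1, 2, 3])
    print(env["MUSA_VISIBLE_DEVICES"])  # 0,1,2,3
```

The returned dict can be passed as `env=` to `subprocess.run`, which leaves the parent process environment unchanged.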

## Basic usage

```bash
# Verify the plugin is loaded before running anything else
python -c "from vllm_musa_platform import musa_platform_plugin; print('ok')"

# Suite F (single-card S4000 / S5000)
python run.py --runner moorethreads_vllm_musa_57ff5443 --suite suite_F

# Suite A (single-card datacenter benchmark)
python run.py --runner moorethreads_vllm_musa_57ff5443 --suite suite_A

# Multi-card tensor parallelism (e.g. 8 x S5000 on a single host)
VLLM_WORKER_MULTIPROC_METHOD=spawn \
python run.py --runner moorethreads_vllm_musa_57ff5443 \
    --suite suite_B \
    --tensor-parallel-size 8

# Local model cache
python run.py --runner moorethreads_vllm_musa_57ff5443 \
    --suite suite_A \
    --model-path /data/models/Meta-Llama-3-8B-Instruct
```
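
For scripted sweeps over several suites, the command lines above can be assembled programmatically. This is a sketch only: `build_run_command` is a hypothetical helper, and it uses just the `run.py` flags that appear in this README (`--runner`, `--suite`, `--tensor-parallel-size`, `--model-path`, `--enforce-eager`).

```python
import shlex
import sys


def build_run_command(runner, suite, tensor_parallel_size=None,
                      model_path=None, enforce_eager=False):
    """Assemble the run.py argument list from the flags shown in this README."""
    cmd = [sys.executable, "run.py", "--runner", runner, "--suite", suite]
    if tensor_parallel_size is not None:
        cmd += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if model_path is not None:
        cmd += ["--model-path", str(model_path)]
    if enforce_eager:
        cmd.append("--enforce-eager")
    return cmd


if __name__ == "__main__":
    cmd = build_run_command("moorethreads_vllm_musa_57ff5443", "suite_B",
                            tensor_parallel_size=8)
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to launch it
```

Building an argument list rather than a shell string avoids quoting problems when a model path contains spaces.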

## Runner config

Copy the example config and adjust for your hardware:

```bash
cp configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml.example \
   configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml
```

Key settings:

| Field | Default | Notes |
|-------|---------|-------|
| `tensor_parallel_size` | 1 | Number of MUSA GPUs for tensor parallelism |
| `enforce_eager` | false | Disable CUDA-graph / compilation; useful for pre-S4000 cards or while debugging |
| `max_num_seqs` | 256 | Max concurrent sequences; reduce on lower-memory cards |
| `gpu_memory_utilization` | 0.85 | Fraction of device memory vLLM may use (weights, activations, KV cache); reduce if OOM |
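
The example config documents the merge priority as CLI flags > suite-specific > global defaults > runner defaults. The sketch below illustrates that ordering with plain dicts; `resolve_settings` and its arguments are hypothetical and are not the runner's actual loader.

```python
def resolve_settings(runner_defaults, config, suite, cli_flags):
    """Merge settings; later layers override earlier ones."""
    global_defaults = {k: v for k, v in config.items() if k != "suites"}
    suite_overrides = config.get("suites", {}).get(suite, {})
    merged = {}
    for layer in (runner_defaults, global_defaults, suite_overrides, cli_flags):
        # Skip unset CLI flags (None) so they do not mask lower layers.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


# Values mirror the example YAML config in this commit.
config = {
    "tensor_parallel_size": 1,
    "enforce_eager": False,
    "max_num_seqs": 256,
    "gpu_memory_utilization": 0.85,
    "suites": {
        "suite_D": {"max_num_seqs": 32, "gpu_memory_utilization": 0.80},
        "suite_F": {"max_num_seqs": 128},
    },
}
runner_defaults = {"max_num_seqs": 256, "gpu_memory_utilization": 0.85}

settings = resolve_settings(
    runner_defaults, config, "suite_D",
    cli_flags={"tensor_parallel_size": 2, "enforce_eager": None},
)
print(settings)
```

For `suite_D` with `--tensor-parallel-size 2`, the suite block wins over the global `max_num_seqs`, and the CLI value wins over the global `tensor_parallel_size`.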

## Triton / kernel compilation errors

If you encounter errors during Triton graph capture on first request,
disable graph capture with `--enforce-eager`:

```bash
python run.py --runner moorethreads_vllm_musa_57ff5443 \
    --suite suite_F --enforce-eager
```

Or set it persistently in the runner config YAML:

```yaml
enforce_eager: true
```

## HBM OOM errors

Reduce `gpu_memory_utilization` and/or `max_num_seqs`, either globally or
per-suite (Suite D is the most memory-hungry due to long-context inputs):

```yaml
gpu_memory_utilization: 0.80
max_num_seqs: 128

suites:
  suite_D:
    max_num_seqs: 32
    gpu_memory_utilization: 0.78
```
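
A rough per-sequence KV-cache estimate helps explain why Suite D needs a much smaller `max_num_seqs`. The sketch below assumes the publicly documented Llama-3.1-8B dimensions (32 layers, 8 KV heads, head dimension 128) and a BF16 (2-byte) KV cache; those numbers are assumptions and do not come from this runner. The estimate ignores weights, activations, and allocator overhead.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(num_tokens, **model_dims):
    """KV cache size in GiB for a given number of cached tokens."""
    return num_tokens * kv_cache_bytes_per_token(**model_dims) / 2**30


# Assumed Llama-3.1-8B dimensions (GQA with 8 KV heads), BF16 KV cache.
LLAMA_3_1_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)

per_seq = kv_cache_gib(28 * 1024, **LLAMA_3_1_8B)   # one ~28K-token sequence
print(f"{per_seq:.2f} GiB per sequence")            # 3.50 GiB per sequence
print(f"{32 * per_seq:.0f} GiB for 32 sequences")   # 112 GiB for 32 sequences
```

vLLM allocates KV blocks on demand, so the 32-sequence figure is an upper bound that applies only when every sequence is at full length at the same time.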

## Known gaps (pre-smoke-test)

The following items are placeholders and **must be re-validated** on real
S4000 / S5000 hardware:

- **Memory peak**: relies on `torch.cuda.max_memory_allocated()` which
  torchada aliases to MUSA. If this returns 0 or `None`, fall back to
  `pymtml.mtmlDeviceGetMemoryInfo()`.
- **MCCL teardown**: assumes the same `cleanup_dist_env_and_memory` entry
  point as upstream vLLM. If MCCL leaves a hanging process group, the
  fallback path explicitly destroys the torch.distributed group.
- **Quantization**: `SUPPORTED_QUANTIZATION_BACKENDS` currently lists only
  `compressed-tensors`. AWQ / GPTQ-Marlin / FP8 are intentionally excluded
  until kernel coverage on MUSA is confirmed.
- **Chip-count detection**: `_get_chip_count()` prefers `pymtml` over
  `torch.cuda.device_count()`. On hosts where pymtml is missing this may
  miscount; in that case the torch fallback should still work because
  torchada provides `torch.cuda.device_count()`.
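
The "prefer one source, fall back to another" pattern described for chip counting can be sketched generically as below. This is not the runner's `_get_chip_count()`; `first_positive_count` and `_count_via_torch` are hypothetical names, and no pymtml call is shown because its exact API is not given here.

```python
def first_positive_count(providers):
    """Return the first positive integer produced by the given callables."""
    for provider in providers:
        try:
            count = provider()
        except Exception:
            # A missing library or failed driver call moves on to the next source.
            continue
        if isinstance(count, int) and count > 0:
            return count
    return 0


def _count_via_torch():
    import torch  # torchada is expected to alias torch.cuda.* to MUSA
    return torch.cuda.device_count()


if __name__ == "__main__":
    # A pymtml-based provider would go first in this list.
    print(first_positive_count([_count_via_torch]))
```

Treating a result of 0 as "try the next source" matches the memory-peak item as well, where a zero reading is the signal to fall back.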

## Requirements

See `requirements.txt` for the pinned plugin / extras list. The heavy
dependencies (torch + torchada + MUSA toolkit) must come from the Moore
Threads container; do not install them from PyPI.

Minimum environment:
- Moore Threads MTT S4000 or newer (S3000 / S80 best-effort)
- MUSA toolkit + driver matching card firmware
- torch 2.7.1 (Moore Threads MUSA build) + torchada ≥ 0.1.9
- Python 3.10+
- vllm-musa 0.1.1 (vLLM core 0.10.1.1 or 0.13.0)

runners/moorethreads_vllm_musa_57ff5443/meta.json

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
{
  "id": "moorethreads_vllm_musa_57ff5443",
  "platform": "moorethreads",
  "name": "vllm-musa on Moore Threads MUSA GPU",
  "framework": "vllm-musa",
  "submitted_by": "JuhaoLiang1997",
  "description": "AccelMark runner for Moore Threads MTT S4000 / S5000 MUSA GPUs via the vllm-musa platform plugin (vLLM 0.10.x / 0.13.x + torchada CUDA→MUSA compatibility + pymtml). API-compatible with standard vLLM; MCCL-based tensor parallelism. FP8 excluded — not supported on current MUSA hardware. Quantization limited to compressed-tensors (W8A8/W8A16) pending real-hardware validation of AWQ / GPTQ / FP8 paths.",
  "supersedes_chain": [],
  "notes": "Initial Moore Threads runner. Written from the public vllm-musa documentation and the structural template of ascend_vllm_ascend_d4aa9fda; capability flags, dtype mapping and teardown sequence are placeholders awaiting smoke-testing on real S4000 / S5000 silicon.",
  "created": "2026-05-15",
  "hardware_label": null,
  "suite_support": {
    "A": "pending",
    "B": "pending",
    "C": "pending",
    "D": "pending",
    "E": "pending",
    "F": "pending",
    "G": "unsupported"
  }
}

runners/moorethreads_vllm_musa_57ff5443/requirements.txt

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# AccelMark -- Moore Threads MUSA vllm-musa runner dependencies
#
# This runner is designed to run inside the official Moore Threads MUSA
# container (which already ships torch + torchada built for the MUSA
# toolkit) and only installs the vLLM platform plugin + accelmark extras
# on top of it.
#
# Reference image (not yet smoke-tested; subject to change):
#   sh-harbor.mthreads.com/mcctest/musa-compile:rc4.3.3-torch2.7-20251120
# Reference docker command:
#   docker run -d --net host --privileged --pid=host --shm-size 500g \
#     -v $PWD:/ws -w /ws \
#     --name accelmark-musa \
#     sh-harbor.mthreads.com/mcctest/musa-compile:rc4.3.3-torch2.7-20251120 \
#     sleep infinity
#   docker exec -it accelmark-musa bash
#
# Pre-installed in the container (do NOT reinstall via pip):
#   torch==2.7.1            (built for MUSA with torchada)
#   torchada>=0.1.9         (CUDA→MUSA compatibility layer)
#   mthreads-ml-py>=2.2.5   (pymtml — MTML bindings)
#
# vLLM core: the plugin pulls in a compatible version automatically, but for
# reproducibility we pin to one of the validated combinations below.
# Pick ONE of these two stacks (the matching commands are in the install
# guide in README.md):
#
# stack A — vLLM 0.10.1.1 (V0 + V1 engines):
#   pip install -r requirements.txt   # vllm-musa pulls in vllm==0.10.1.1
#
# stack B — vLLM 0.13.0 (V1-only):
#   pip install -r requirements.txt   # vllm-musa pulls in vllm==0.10.1.1
#   pip install vllm==0.13.0 --no-deps --upgrade
#   pip install 'depyf==0.20.0' 'llguidance>=1.3.0,<1.4.0' \
#     'lm-format-enforcer==0.11.3' 'outlines_core==0.2.11' \
#     'xgrammar==0.1.27' 'compressed-tensors==0.12.2'

# vLLM MUSA platform plugin (PyPI: vllm-musa, GitHub: MooreThreads/vllm-musa)
vllm-musa==0.1.1

# Transformers stack — pin to versions compatible with vLLM 0.10.x / 0.13.x
transformers==4.46.3
tokenizers==0.20.3
huggingface-hub==0.26.5
accelerate==1.2.1
safetensors==0.4.5

# AccelMark dependencies (not bundled in the image)
numpy==1.26.4
jsonschema==4.25.1
psutil==7.1.0
tqdm==4.67.1

# Async support
aiohttp==3.12.15

# Config file parsing
PyYAML==6.0.2
