feat: add Moore Threads MUSA runner (S5000/S4000) — moorethreads_vllm_musa_57ff5443

JuhaoLiang1997 · cursoragent · JuhaoLiang1997 · commit 99e4a06be3d1 · 2026-05-15T13:13:11.000+08:00
Adds the AccelMark runner skeleton for Moore Threads MTT S5000 / S4000
GPUs via the official vllm-musa platform plugin. The plugin auto-patches
vLLM at import time (torchada CUDA→MUSA aliasing + pymtml + Triton
patches), so the standard vLLM Python API is preserved and the runner
mirrors the structure of ascend_vllm_ascend.

What is included:

* runners/moorethreads_vllm_musa_57ff5443/ — runner.py, meta.json
  (with suite_support self-declaration), requirements.txt, README.md
* configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml.example
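
The example config documents a merge priority of CLI flags >
suite-specific > global defaults > runner defaults. A minimal Python
sketch of that resolution follows. It assumes a shallow merge and a
hypothetical resolve_config() helper; the runner's actual merge code is
not shown in this commit message and may differ.

```python
from typing import Any, Mapping


def resolve_config(
    runner_defaults: Mapping[str, Any],
    file_config: Mapping[str, Any],
    suite: str,
    cli_flags: Mapping[str, Any],
) -> dict[str, Any]:
    """Merge layers so later ones win: runner < global < suite < CLI."""
    # Global defaults are every top-level key except the per-suite table.
    global_defaults = {k: v for k, v in file_config.items() if k != "suites"}
    suite_overrides = (file_config.get("suites") or {}).get(suite) or {}
    merged: dict[str, Any] = {}
    for layer in (runner_defaults, global_defaults, suite_overrides, cli_flags):
        # Treat None as "not set" so an unset CLI flag does not mask lower layers.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


# Values below are taken from the example YAML in this commit.
file_config = {
    "tensor_parallel_size": 1,
    "max_num_seqs": 256,
    "gpu_memory_utilization": 0.85,
    "suites": {"suite_D": {"max_num_seqs": 32, "gpu_memory_utilization": 0.80}},
}
resolved = resolve_config(
    runner_defaults={"enforce_eager": False, "max_num_seqs": 256},
    file_config=file_config,
    suite="suite_D",
    cli_flags={"tensor_parallel_size": 8, "enforce_eager": None},
)
print(resolved)
```

In this sketch the suite_D override wins over the global max_num_seqs,
and the CLI value wins over the file's tensor_parallel_size.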

The README platforms matrix updates automatically from the runner's
meta.json (no hand-editing required, thanks to the onboarding
decoupling that landed in the preceding commit). The Moore Threads
environment detector, runners/platforms/moorethreads.py, was added in
that same commit.
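
As an illustration of how a matrix row could be derived from meta.json,
the sketch below maps suite_support values to the README legend symbols.
render_matrix_row(), its platform_label argument and the "validated"
status key are assumptions made for this example; the repository's real
generator is not part of this commit and may work differently.

```python
import json

# Symbols follow the README legend. Mapping "pending" to the author-declared
# marker is an assumption based on the row this commit adds.
STATUS_SYMBOLS = {"validated": "✓", "pending": "⋯", "unsupported": "—"}


def render_matrix_row(meta: dict, platform_label: str) -> str:
    """Build one Markdown table row from a runner's meta.json contents."""
    support = meta["suite_support"]
    # Sort suite letters so the cells come out in A..G column order.
    cells = [STATUS_SYMBOLS[support[suite]] for suite in sorted(support)]
    head = [platform_label, f"`{meta['id']}`", meta["framework"]]
    return "| " + " | ".join(head + cells) + " |"


meta = json.loads("""
{
  "id": "moorethreads_vllm_musa_57ff5443",
  "framework": "vllm-musa",
  "suite_support": {"A": "pending", "B": "pending", "C": "pending",
                    "D": "pending", "E": "pending", "F": "pending",
                    "G": "unsupported"}
}
""")
row = render_matrix_row(meta, "Moore Threads GPU")
print(row)
```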

Notes:

* Capability flags are conservative: SUPPORTED_QUANTIZATION_BACKENDS only
  declares compressed-tensors; FP8 / AWQ / GPTQ-Marlin will be enabled in
  a follow-up runner version once real-hardware smoke tests confirm kernel
  coverage on MUSA.
* This code has not yet been validated on physical S5000 / S4000 silicon;
  suites A-F are marked "pending" in suite_support (G is "unsupported"),
  and any fixes from smoke testing will land as a new runner folder with
  a fresh hash.
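
The conservative capability flag can be pictured as a simple gate on
the requested quantization methods. The sketch below is illustrative
only: split_by_support() is a hypothetical helper, and the container
type of SUPPORTED_QUANTIZATION_BACKENDS in runner.py is assumed.

```python
# Mirrors the single backend this commit declares; the frozenset type is assumed.
SUPPORTED_QUANTIZATION_BACKENDS = frozenset({"compressed-tensors"})


def split_by_support(requested: list[str]) -> tuple[list[str], list[str]]:
    """Return (runnable, skipped) quantization methods, preserving order."""
    runnable = [m for m in requested if m in SUPPORTED_QUANTIZATION_BACKENDS]
    skipped = [m for m in requested if m not in SUPPORTED_QUANTIZATION_BACKENDS]
    return runnable, skipped


runnable, skipped = split_by_support(
    ["fp8", "compressed-tensors", "awq", "gptq_marlin"]
)
print("run:", runnable)
print("skip:", skipped)
```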

Co-authored-by: Cursor <cursoragent@cursor.com>
diff --git a/README.md b/README.md
@@ -93,6 +93,7 @@ Reference runners live under `runners/` (see each folder’s `meta.json`). The t
 | Huawei Ascend NPU | `ascend_vllm_ascend_d4aa9fda` | vllm-ascend | ✓ | ✓ | ✓ | ✓ | ✓ | — | — |
 | Apple Silicon | `apple_mlx_lm_9546b8b5` | mlx-lm | ⋯ | — | — | ⋯ | — | ⋯ | — |
 | Google TPU | `google_vllm_tpu_68cc9ffa` | vllm-tpu | ✓ | — | — | ✓ | — | ✓ | — |
+| Moore Threads GPU | `moorethreads_vllm_musa_57ff5443` | vllm-musa | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | — |
 
 _Legend: ✓ validated · ⋯ author-declared (not smoke-tested in this repo yet) · — unsupported._
 <!-- platforms-matrix:end -->
diff --git a/configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml.example b/configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml.example
@@ -0,0 +1,62 @@
+# AccelMark runner config — moorethreads_vllm_musa_57ff5443 (vllm-musa on Moore Threads)
+#
+# Copy this file to runner_moorethreads_vllm_musa_57ff5443.yaml (remove
+# .example suffix) and edit as needed for your hardware. The actual .yaml
+# is gitignored.
+#
+# These settings adapt the runner to your hardware environment. They are
+# recorded in result.json task.extra_config for transparency but are NOT
+# part of the benchmark identity (not hashed into run_id).
+#
+# Merge priority: CLI flags > suite-specific > global defaults > runner defaults
+
+# ── Global defaults (apply to all suites) ─────────────────────────────────────
+
+# Tensor parallel size — number of Moore Threads GPUs to use (default: 1).
+# For multi-card runs make sure to export VLLM_WORKER_MULTIPROC_METHOD=spawn.
+tensor_parallel_size: 1
+
+# Disable Triton CUDA-graph / compilation. Set true if you hit Triton kernel
+# errors on first request (most common on S3000 / S80 paths).
+enforce_eager: false
+
+# Maximum number of sequences in a batch (default: 256).
+# Reduce on lower-memory cards: 128 on 24 GB cards, 64 on 16 GB cards.
+max_num_seqs: 256
+
+# Fraction of MUSA device memory vLLM may use for weights, activations and
+# the KV cache (default: 0.85). Reduce if you hit OOM; the vLLM flag is named
+# gpu_memory_utilization but applies to MUSA memory via torchada.
+gpu_memory_utilization: 0.85
+
+# Pass-through kwargs forwarded directly to vLLM LLM() / AsyncEngineArgs().
+# Unknown keys are dropped automatically with a warning, so this is safe to
+# use across vLLM 0.10.x / 0.13.x.
+# engine_kwargs:
+#   swap_space: 8
+#   max_seq_len_to_capture: 4096
+
+# ── Suite-specific overrides ───────────────────────────────────────────────────
+
+suites:
+  suite_D:
+    # Long-context: reduce batch size and leave more memory headroom.
+    max_num_seqs: 32
+    gpu_memory_utilization: 0.80
+
+  suite_F:
+    # Consumer / edge GPU — enforce_eager often safer for first runs.
+    # enforce_eager: true
+    max_num_seqs: 128
+
+# ── Speculative decoding (suite_A / suite_D extra scenario) ─────────────────
+# To enable, merge the commented suite_A entry below into the `suites:` block
+# above (do not add a second top-level `suites:` key). vllm-musa accepts the
+# same speculative_config dict as upstream vLLM; the runner translates flat keys
+# (speculative_model, num_speculative_tokens, ...) into speculative_config.
+# suites:
+#   suite_A:
+#     engine_kwargs:
+#       speculative_model: "meta-llama/Llama-3.2-1B-Instruct"
+#       num_speculative_tokens: 4
+#       speculative_draft_tensor_parallel_size: 1
diff --git a/runners/moorethreads_vllm_musa_57ff5443/README.md b/runners/moorethreads_vllm_musa_57ff5443/README.md
@@ -0,0 +1,200 @@
+# moorethreads_vllm_musa_57ff5443 — Moore Threads MUSA Runner (vllm-musa)
+
+AccelMark runner for Moore Threads MUSA GPUs using
+[vllm-musa](https://github.com/MooreThreads/vllm-musa), the official vLLM
+platform plugin for MUSA hardware.
+
+> **Status:** This runner is **untested on real silicon at the time of
+> commit**. The code is written against the public `vllm-musa` plugin
+> documentation and follows the structural template of the
+> `ascend_vllm_ascend_*` runner. A smoke test on an S5000 / S4000 system
+> is planned; capability flags and dtype mappings may be adjusted in a
+> follow-up runner version (new hash, new folder) based on the findings.
+
+## How vllm-musa works
+
+`vllm-musa` is a vLLM **platform plugin** (auto-detected on `import vllm`)
+that makes the standard vLLM Python API run on Moore Threads MUSA GPUs. It
+relies on three components:
+
+| Component | Role |
+|---|---|
+| `torchada` | CUDA→MUSA compatibility layer for PyTorch — aliases `torch.cuda.*` to MUSA so most code paths run unmodified |
+| `pymtml` (`mthreads-ml-py`) | Moore Threads Management Library bindings, equivalent to `nvidia-ml-py` |
+| Triton patches | Runtime monkey-patches in `vllm_musa_platform.patches.*` that fix `triton.attention` and `worker` modules for MUSA's Triton compiler |
+
+The standard `vllm.LLM`, `vllm.AsyncLLMEngine`, and `vllm.SamplingParams`
+remain the entry points — this runner therefore reuses ~95% of the logic
+from the NVIDIA / Ascend vLLM runners.
+
+## Supported suites
+
+| Suite | Description | Notes |
+|-------|-------------|-------|
+| Suite A | Single-chip, Llama-3-8B | Pending smoke test on S4000 / S5000 |
+| Suite B | Multi-chip, Llama-3-70B | Requires multiple Moore Threads cards + MCCL TP |
+| Suite C | Quantization, Llama-3.1-8B | FP8 skipped (no native FP8 in current MUSA hardware); compressed-tensors W8A8/W8A16 candidate; AWQ / GPTQ pending validation |
+| Suite D | Long context ~28K input, Llama-3.1-8B | Reduce `max_num_seqs` and `gpu_memory_utilization` |
+| Suite E | Multi-chip scaling, Llama-3-8B | Validates MCCL tensor parallelism |
+| Suite F | Consumer/edge, Qwen2.5-0.5B | Recommended starting point for S4000 single-card systems |
+
+## Hardware compatibility
+
+| GPU | BF16 | TP via MCCL | FP8 | Notes |
+|-----|------|-------------|-----|-------|
+| MTT S5000 | ✅ | ✅ | ❌ | Recommended public reference target (FA3 via MATE) |
+| MTT S4000 | ✅ | ✅ | ❌ | Validated path with PyTorch SDPA-based FlashAttention |
+| MTT S3000 | ⚠️ | ⚠️ | ❌ | May work via `--enforce-eager`; not the public reference |
+| MTT S80 | ⚠️ | — | ❌ | Consumer card; treat as best-effort |
+
+## Prerequisites
+
+You must install the MUSA stack in this exact order — Python packages alone
+are not sufficient:
+
+**1. MUSA toolkit + driver**
+
+Match the toolkit version to your card firmware. Reference:
+<https://developer.mthreads.com/musa/>
+
+**2. PyTorch with MUSA support (torch + torchada)**
+
+The recommended path is the official Moore Threads container, which ships a
+pre-built `torch==2.7.1` together with `torchada` and `pymtml`. Pull it with:
+
+```bash
+docker pull sh-harbor.mthreads.com/mcctest/musa-compile:rc4.3.3-torch2.7-20251120
+```
+
+**3. Runner dependencies**
+
+Inside the MUSA container:
+
+```bash
+pip install -r runners/moorethreads_vllm_musa_57ff5443/requirements.txt
+```
+
+This installs `vllm-musa==0.1.1` which auto-pulls a validated vLLM core
+(`0.10.1.1` by default). To use vLLM `0.13.0` instead (V1-only engine):
+
+```bash
+pip install vllm==0.13.0 --no-deps --upgrade
+pip install 'depyf==0.20.0' 'llguidance>=1.3.0,<1.4.0' \
+            'lm-format-enforcer==0.11.3' 'outlines_core==0.2.11' \
+            'xgrammar==0.1.27' 'compressed-tensors==0.12.2'
+```
+
+## Required environment variables
+
+```bash
+# Device visibility (works like CUDA_VISIBLE_DEVICES)
+export MUSA_VISIBLE_DEVICES=0,1,2,3
+
+# Recommended for multi-process workers (TP > 1)
+export VLLM_WORKER_MULTIPROC_METHOD=spawn
+```
+
+## Basic usage
+
+```bash
+# Verify the plugin is loaded before running anything else
+python -c "from vllm_musa_platform import musa_platform_plugin; print('ok')"
+
+# Suite F (single-card S4000 / S5000)
+python run.py --runner moorethreads_vllm_musa_57ff5443 --suite suite_F
+
+# Suite A (single-card datacenter benchmark)
+python run.py --runner moorethreads_vllm_musa_57ff5443 --suite suite_A
+
+# Multi-card tensor parallelism (e.g. 8 x S5000 on a single host)
+VLLM_WORKER_MULTIPROC_METHOD=spawn \
+python run.py --runner moorethreads_vllm_musa_57ff5443 \
+    --suite suite_B \
+    --tensor-parallel-size 8
+
+# Local model cache
+python run.py --runner moorethreads_vllm_musa_57ff5443 \
+    --suite suite_A \
+    --model-path /data/models/Meta-Llama-3-8B-Instruct
+```
+
+## Runner config
+
+Copy the example config and adjust for your hardware:
+
+```bash
+cp configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml.example \
+   configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml
+```
+
+Key settings:
+
+| Field | Default | Notes |
+|-------|---------|-------|
+| `tensor_parallel_size` | 1 | Number of MUSA GPUs for tensor parallelism |
+| `enforce_eager` | false | Disable CUDA-graph / compilation; useful for pre-S4000 cards or while debugging |
+| `max_num_seqs` | 256 | Max concurrent sequences; reduce on lower-memory cards |
+| `gpu_memory_utilization` | 0.85 | Fraction of device memory vLLM may use (weights, activations, KV cache); reduce if OOM |
+
+## Triton / kernel compilation errors
+
+If you encounter errors during Triton graph capture on first request,
+disable graph capture with `--enforce-eager`:
+
+```bash
+python run.py --runner moorethreads_vllm_musa_57ff5443 \
+    --suite suite_F --enforce-eager
+```
+
+Or set persistently in the runner config YAML:
+
+```yaml
+enforce_eager: true
+```
+
+## HBM OOM errors
+
+Reduce `gpu_memory_utilization` and/or `max_num_seqs`, either globally or
+per-suite (Suite D is the most memory-hungry due to long-context inputs):
+
+```yaml
+gpu_memory_utilization: 0.80
+max_num_seqs: 128
+
+suites:
+  suite_D:
+    max_num_seqs: 32
+    gpu_memory_utilization: 0.78
+```
+
+## Known gaps (pre-smoke-test)
+
+The following items are placeholders and **must be re-validated** on real
+S4000 / S5000 hardware:
+
+- **Memory peak**: relies on `torch.cuda.max_memory_allocated()` which
+  torchada aliases to MUSA. If this returns 0 or `None`, fall back to
+  `pymtml.mtmlDeviceGetMemoryInfo()`.
+- **MCCL teardown**: assumes the same `cleanup_dist_env_and_memory` entry
+  point as upstream vLLM. If MCCL leaves a hanging process group, the
+  fallback path explicitly destroys the torch.distributed group.
+- **Quantization**: `SUPPORTED_QUANTIZATION_BACKENDS` currently lists only
+  `compressed-tensors`. AWQ / GPTQ-Marlin / FP8 are intentionally excluded
+  until kernel coverage on MUSA is confirmed.
+- **Chip-count detection**: `_get_chip_count()` prefers `pymtml` over
+  `torch.cuda.device_count()`. On hosts where pymtml is missing this may
+  miscount; in that case the torch fallback should still work because
+  torchada provides `torch.cuda.device_count()`.
+
+## Requirements
+
+See `requirements.txt` for the pinned plugin / extras list. The heavy
+dependencies (torch + torchada + MUSA toolkit) must come from the Moore
+Threads container; do not install them from PyPI.
+
+Minimum environment:
+- Moore Threads MTT S4000 or newer (S3000 / S80 best-effort)
+- MUSA toolkit + driver matching card firmware
+- torch 2.7.1 (Moore Threads MUSA build) + torchada ≥ 0.1.9
+- Python 3.10+
+- vllm-musa 0.1.1 (vLLM core 0.10.1.1 or 0.13.0)
diff --git a/runners/moorethreads_vllm_musa_57ff5443/meta.json b/runners/moorethreads_vllm_musa_57ff5443/meta.json
@@ -0,0 +1,21 @@
+{
+  "id": "moorethreads_vllm_musa_57ff5443",
+  "platform": "moorethreads",
+  "name": "vllm-musa on Moore Threads MUSA GPU",
+  "framework": "vllm-musa",
+  "submitted_by": "JuhaoLiang1997",
+  "description": "AccelMark runner for Moore Threads MTT S4000 / S5000 MUSA GPUs via the vllm-musa platform plugin (vLLM 0.10.x / 0.13.x + torchada CUDA→MUSA compatibility + pymtml). API-compatible with standard vLLM; MCCL-based tensor parallelism. FP8 excluded — not supported on current MUSA hardware. Quantization limited to compressed-tensors (W8A8/W8A16) pending real-hardware validation of AWQ / GPTQ / FP8 paths.",
+  "supersedes_chain": [],
+  "notes": "Initial Moore Threads runner. Written from the public vllm-musa documentation and the structural template of ascend_vllm_ascend_d4aa9fda; capability flags, dtype mapping and teardown sequence are placeholders awaiting smoke-testing on real S4000 / S5000 silicon.",
+  "created": "2026-05-15",
+  "hardware_label": null,
+  "suite_support": {
+    "A": "pending",
+    "B": "pending",
+    "C": "pending",
+    "D": "pending",
+    "E": "pending",
+    "F": "pending",
+    "G": "unsupported"
+  }
+}
diff --git a/runners/moorethreads_vllm_musa_57ff5443/requirements.txt b/runners/moorethreads_vllm_musa_57ff5443/requirements.txt
@@ -0,0 +1,58 @@
+# AccelMark -- Moore Threads MUSA vllm-musa runner dependencies
+#
+# This runner is designed to run inside the official Moore Threads MUSA
+# container (which already ships torch + torchada built for the MUSA
+# toolkit) and only installs the vLLM platform plugin + accelmark extras
+# on top of it.
+#
+# Target image (not yet smoke-tested; subject to change at smoke-test time):
+#   sh-harbor.mthreads.com/mcctest/musa-compile:rc4.3.3-torch2.7-20251120
+# Reference docker command:
+#   docker run -d --net host --privileged --pid=host --shm-size 500g \
+#     -v $PWD:/ws -w /ws \
+#     --name accelmark-musa \
+#     sh-harbor.mthreads.com/mcctest/musa-compile:rc4.3.3-torch2.7-20251120 \
+#     sleep infinity
+#   docker exec -it accelmark-musa bash
+#
+# Pre-installed in the container (do NOT reinstall via pip):
+#   torch==2.7.1              (built for MUSA with torchada)
+#   torchada>=0.1.9           (CUDA→MUSA compatibility layer)
+#   mthreads-ml-py>=2.2.5     (pymtml — MTML bindings)
+#
+# vLLM core: the plugin pulls in a compatible version automatically, but for
+# reproducibility we pin to one of the validated combinations below.
+# Pick ONE of these two stacks (the matching install commands are in the
+# Prerequisites section of README.md):
+#
+#   stack A — vLLM 0.10.1.1 (V0 + V1 engines):
+#     pip install -e .   # plugin auto-installs vllm==0.10.1.1
+#
+#   stack B — vLLM 0.13.0 (V1-only):
+#     pip install -e .                      # plugin installs vllm==0.10.1.1
+#     pip install vllm==0.13.0 --no-deps --upgrade
+#     pip install 'depyf==0.20.0' 'llguidance>=1.3.0,<1.4.0' \
+#                 'lm-format-enforcer==0.11.3' 'outlines_core==0.2.11' \
+#                 'xgrammar==0.1.27' 'compressed-tensors==0.12.2'
+
+# vLLM MUSA platform plugin (PyPI: vllm-musa, GitHub: MooreThreads/vllm-musa)
+vllm-musa==0.1.1
+
+# Transformers stack — pin to versions compatible with vLLM 0.10.x / 0.13.x
+transformers==4.46.3
+tokenizers==0.20.3
+huggingface-hub==0.26.5
+accelerate==1.2.1
+safetensors==0.4.5
+
+# AccelMark dependencies (not bundled in the image)
+numpy==1.26.4
+jsonschema==4.25.1
+psutil==7.1.0
+tqdm==4.67.1
+
+# Async support
+aiohttp==3.12.15
+
+# Config file parsing
+PyYAML==6.0.2
diff --git a/runners/moorethreads_vllm_musa_57ff5443/runner.py b/runners/moorethreads_vllm_musa_57ff5443/runner.py