|
| 1 | +# moorethreads_vllm_musa_57ff5443 — Moore Threads MUSA Runner (vllm-musa) |
| 2 | + |
| 3 | +AccelMark runner for Moore Threads MUSA GPUs using |
| 4 | +[vllm-musa](https://github.com/MooreThreads/vllm-musa), the official vLLM |
| 5 | +platform plugin for MUSA hardware. |
| 6 | + |
| 7 | +> **Status:** This runner is **untested on real silicon at the time of |
| 8 | +> commit**. The code is written against the public `vllm-musa` plugin |
| 9 | +> documentation and follows the structural template of the |
| 10 | +> `ascend_vllm_ascend_*` runner. A smoke test on an S5000 / S4000 system is |
| 11 | +> planned; capability flags and dtype mappings may change in a follow-up |
| 12 | +> runner version (new hash, new folder) based on the findings. |
| 13 | +
|
| 14 | +## How vllm-musa works |
| 15 | + |
| 16 | +`vllm-musa` is a vLLM **platform plugin** (auto-detected on `import vllm`) |
| 17 | +that makes the standard vLLM Python API run on Moore Threads MUSA GPUs. It |
| 18 | +relies on three components: |
| 19 | + |
| 20 | +| Component | Role | |
| 21 | +|---|---| |
| 22 | +| `torchada` | CUDA→MUSA compatibility layer for PyTorch — aliases `torch.cuda.*` to MUSA so most code paths run unmodified | |
| 23 | +| `pymtml` (`mthreads-ml-py`) | Moore Threads Management Library bindings, equivalent to `nvidia-ml-py` | |
| 24 | +| Triton patches | Runtime monkey-patches in `vllm_musa_platform.patches.*` that fix `triton.attention` and `worker` modules for MUSA's Triton compiler | |
| 25 | + |
| 26 | +The standard `vllm.LLM`, `vllm.AsyncLLMEngine`, and `vllm.SamplingParams` |
| 27 | +remain the entry points — this runner therefore reuses ~95% of the logic |
| 28 | +from the NVIDIA / Ascend vLLM runners. |
| 29 | + |
| 30 | +## Supported suites |
| 31 | + |
| 32 | +| Suite | Description | Notes | |
| 33 | +|-------|-------------|-------| |
| 34 | +| Suite A | Single-chip, Llama-3-8B | Pending smoke test on S4000 / S5000 | |
| 35 | +| Suite B | Multi-chip, Llama-3-70B | Requires multiple Moore Threads cards + MCCL TP | |
| 36 | +| Suite C | Quantization, Llama-3.1-8B | FP8 skipped (no native FP8 in current MUSA hardware); compressed-tensors W8A8/W8A16 candidate; AWQ / GPTQ pending validation | |
| 37 | +| Suite D | Long context ~28K input, Llama-3.1-8B | Reduce `max_num_seqs` and `gpu_memory_utilization` | |
| 38 | +| Suite E | Multi-chip scaling, Llama-3-8B | Validates MCCL tensor parallelism | |
| 39 | +| Suite F | Consumer/edge, Qwen2.5-0.5B | Recommended starting point for S4000 single-card systems | |
| 40 | + |
| 41 | +## Hardware compatibility |
| 42 | + |
| 43 | +| GPU | BF16 | TP via MCCL | FP8 | Notes | |
| 44 | +|-----|------|-------------|-----|-------| |
| 45 | +| MTT S5000 | ✅ | ✅ | ❌ | Recommended public reference target (FA3 via MATE) | |
| 46 | +| MTT S4000 | ✅ | ✅ | ❌ | Validated path with PyTorch SDPA-based FlashAttention | |
| 47 | +| MTT S3000 | ⚠️ | ⚠️ | ❌ | May work via `--enforce-eager`; not the public reference | |
| 48 | +| MTT S80 | ⚠️ | — | ❌ | Consumer card; treat as best-effort | |
| 49 | + |
| 50 | +## Prerequisites |
| 51 | + |
| 52 | +You must install the MUSA stack in this exact order — Python packages alone |
| 53 | +are not sufficient: |
| 54 | + |
| 55 | +**1. MUSA toolkit + driver** |
| 56 | + |
| 57 | +Match the toolkit version to your card firmware. Reference: |
| 58 | +<https://developer.mthreads.com/musa/> |
| 59 | + |
| 60 | +**2. PyTorch with MUSA support (torch + torchada)** |
| 61 | + |
| 62 | +The recommended path is the official Moore Threads container, which ships a |
| 63 | +pre-built `torch==2.7.1` together with `torchada` and `pymtml`. Pull it with: |
| 64 | + |
| 65 | +```bash |
| 66 | +docker pull sh-harbor.mthreads.com/mcctest/musa-compile:rc4.3.3-torch2.7-20251120 |
| 67 | +``` |
| 68 | + |
| 69 | +**3. Runner dependencies** |
| 70 | + |
| 71 | +Inside the MUSA container: |
| 72 | + |
| 73 | +```bash |
| 74 | +pip install -r runners/moorethreads_vllm_musa_57ff5443/requirements.txt |
| 75 | +``` |
| 76 | + |
| 77 | +This installs `vllm-musa==0.1.1`, which pulls in a validated vLLM core |
| 78 | +(`0.10.1.1` by default). To use vLLM `0.13.0` instead (V1-only engine): |
| 79 | + |
| 80 | +```bash |
| 81 | +pip install vllm==0.13.0 --no-deps --upgrade |
| 82 | +pip install 'depyf==0.20.0' 'llguidance>=1.3.0,<1.4.0' \ |
| 83 | + 'lm-format-enforcer==0.11.3' 'outlines_core==0.2.11' \ |
| 84 | + 'xgrammar==0.1.27' 'compressed-tensors==0.12.2' |
| 85 | +``` |
| 86 | + |
| 87 | +## Required environment variables |
| 88 | + |
| 89 | +```bash |
| 90 | +# Device visibility (works like CUDA_VISIBLE_DEVICES) |
| 91 | +export MUSA_VISIBLE_DEVICES=0,1,2,3 |
| 92 | + |
| 93 | +# Recommended for multi-process workers (TP > 1) |
| 94 | +export VLLM_WORKER_MULTIPROC_METHOD=spawn |
| 95 | +``` |
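
Because `MUSA_VISIBLE_DEVICES` is described as working like `CUDA_VISIBLE_DEVICES`, a quick pre-flight check can compare the number of visible devices with the tensor-parallel size you intend to request. The sketch below is illustrative only: `visible_musa_devices` and `check_tp_fits` are hypothetical helper names, not part of the runner, and the sketch assumes that an unset variable means all devices are visible.

```python
import os


def visible_musa_devices(env=None):
    """Return the device IDs listed in MUSA_VISIBLE_DEVICES, or None if unset.

    Hypothetical helper; assumes a comma-separated list, as with
    CUDA_VISIBLE_DEVICES.
    """
    env = os.environ if env is None else env
    raw = env.get("MUSA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: assume every device is visible
    return [item.strip() for item in raw.split(",") if item.strip()]


def check_tp_fits(tensor_parallel_size, env=None):
    """Return False only when the environment clearly exposes too few devices."""
    devices = visible_musa_devices(env)
    if devices is None:
        return True  # cannot tell from the environment alone
    return tensor_parallel_size <= len(devices)


if __name__ == "__main__":
    print(check_tp_fits(4, {"MUSA_VISIBLE_DEVICES": "0,1,2,3"}))
```

A check like this only inspects the environment; it does not confirm that the listed devices exist or are healthy.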
| 96 | + |
| 97 | +## Basic usage |
| 98 | + |
| 99 | +```bash |
| 100 | +# Verify the plugin is loaded before running anything else |
| 101 | +python -c "from vllm_musa_platform import musa_platform_plugin; print('ok')" |
| 102 | + |
| 103 | +# Suite F (single-card S4000 / S5000) |
| 104 | +python run.py --runner moorethreads_vllm_musa_57ff5443 --suite suite_F |
| 105 | + |
| 106 | +# Suite A (single-card datacenter benchmark) |
| 107 | +python run.py --runner moorethreads_vllm_musa_57ff5443 --suite suite_A |
| 108 | + |
| 109 | +# Multi-card tensor parallelism (e.g. 8 x S5000 on a single host) |
| 110 | +VLLM_WORKER_MULTIPROC_METHOD=spawn \ |
| 111 | +python run.py --runner moorethreads_vllm_musa_57ff5443 \ |
| 112 | + --suite suite_B \ |
| 113 | + --tensor-parallel-size 8 |
| 114 | + |
| 115 | +# Local model cache |
| 116 | +python run.py --runner moorethreads_vllm_musa_57ff5443 \ |
| 117 | + --suite suite_A \ |
| 118 | + --model-path /data/models/Meta-Llama-3-8B-Instruct |
| 119 | +``` |
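
The plugin check at the top of the block above can also be done from Python without importing the package, which is handy in a pre-flight script. This is a minimal sketch: `plugin_installed` is a hypothetical helper, and it only confirms that the `vllm_musa_platform` package is importable, not that vLLM has registered or activated the platform.

```python
import importlib.util


def plugin_installed(module_name="vllm_musa_platform"):
    """Return True if a top-level module with this name can be located.

    find_spec() does not execute the module, so this is cheap, but it
    says nothing about whether the plugin loads correctly at runtime.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if not plugin_installed():
        print("vllm_musa_platform not found; install the runner requirements first")
```

The `python -c` import shown earlier remains the stronger check, since it executes the plugin's import path.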
| 120 | + |
| 121 | +## Runner config |
| 122 | + |
| 123 | +Copy the example config and adjust for your hardware: |
| 124 | + |
| 125 | +```bash |
| 126 | +cp configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml.example \ |
| 127 | + configs/runner_configs/runner_moorethreads_vllm_musa_57ff5443.yaml |
| 128 | +``` |
| 129 | + |
| 130 | +Key settings: |
| 131 | + |
| 132 | +| Field | Default | Notes | |
| 133 | +|-------|---------|-------| |
| 134 | +| `tensor_parallel_size` | 1 | Number of MUSA GPUs for tensor parallelism | |
| 135 | +| `enforce_eager` | false | Disable CUDA-graph / compilation; useful for pre-S4000 cards or while debugging | |
| 136 | +| `max_num_seqs` | 256 | Max concurrent sequences; reduce on lower-memory cards | |
| 137 | +| `gpu_memory_utilization` | 0.85 | Fraction of device memory vLLM may use (weights, activations, and KV cache); reduce if OOM | |
| 138 | + |
| 139 | +## Triton / kernel compilation errors |
| 140 | + |
| 141 | +If you encounter errors during Triton graph capture on first request, |
| 142 | +disable graph capture with `--enforce-eager`: |
| 143 | + |
| 144 | +```bash |
| 145 | +python run.py --runner moorethreads_vllm_musa_57ff5443 \ |
| 146 | + --suite suite_F --enforce-eager |
| 147 | +``` |
| 148 | + |
| 149 | +Or set persistently in the runner config YAML: |
| 150 | + |
| 151 | +```yaml |
| 152 | +enforce_eager: true |
| 153 | +``` |
| 154 | +
|
| 155 | +## HBM OOM errors |
| 156 | +
|
| 157 | +Reduce `gpu_memory_utilization` and/or `max_num_seqs`, either globally or |
| 158 | +per-suite (Suite D is the most memory-hungry due to long-context inputs): |
| 159 | + |
| 160 | +```yaml |
| 161 | +gpu_memory_utilization: 0.80 |
| 162 | +max_num_seqs: 128 |
| 163 | +
|
| 164 | +suites: |
| 165 | + suite_D: |
| 166 | + max_num_seqs: 32 |
| 167 | + gpu_memory_utilization: 0.78 |
| 168 | +``` |
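
One way to read the YAML above is that keys under `suites.<name>` override the top-level values for that suite only. The sketch below illustrates that reading; `resolve_suite_config` is a hypothetical name, and the merge rule is an assumption rather than the runner's confirmed behavior.

```python
def resolve_suite_config(config, suite):
    """Merge per-suite overrides on top of the global settings.

    Assumed semantics: every top-level key except `suites` is a global
    default, and `suites[suite]` replaces matching keys.
    """
    resolved = {key: value for key, value in config.items() if key != "suites"}
    overrides = (config.get("suites") or {}).get(suite) or {}
    resolved.update(overrides)
    return resolved


# The same values as the YAML example, written as a Python dict.
config = {
    "gpu_memory_utilization": 0.80,
    "max_num_seqs": 128,
    "suites": {
        "suite_D": {"max_num_seqs": 32, "gpu_memory_utilization": 0.78},
    },
}

if __name__ == "__main__":
    print(resolve_suite_config(config, "suite_D"))
    print(resolve_suite_config(config, "suite_A"))
```

Under this reading, Suite D runs with the tighter limits while every other suite keeps the global values.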
| 169 | + |
| 170 | +## Known gaps (pre-smoke-test) |
| 171 | + |
| 172 | +The following items are placeholders and **must be re-validated** on real |
| 173 | +S4000 / S5000 hardware: |
| 174 | + |
| 175 | +- **Memory peak**: relies on `torch.cuda.max_memory_allocated()` which |
| 176 | + torchada aliases to MUSA. If this returns 0 or `None`, fall back to |
| 177 | + `pymtml.mtmlDeviceGetMemoryInfo()`. |
| 178 | +- **MCCL teardown**: assumes the same `cleanup_dist_env_and_memory` entry |
| 179 | + point as upstream vLLM. If MCCL leaves a hanging process group, the |
| 180 | + fallback path explicitly destroys the torch.distributed group. |
| 181 | +- **Quantization**: `SUPPORTED_QUANTIZATION_BACKENDS` currently lists only |
| 182 | + `compressed-tensors`. AWQ / GPTQ-Marlin / FP8 are intentionally excluded |
| 183 | + until kernel coverage on MUSA is confirmed. |
| 184 | +- **Chip count detection**: `_get_chip_count()` prefers `pymtml` over |
| 185 | + `torch.cuda.device_count()`. On hosts where pymtml is missing this may |
| 186 | + miscount; in that case the torch fallback should still work because |
| 187 | + torchada provides `torch.cuda.device_count()`. |
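
The memory-peak fallback described in the list above can be sketched as a small wrapper that accepts the primary reading only when it is a positive number. `peak_memory_bytes` is a hypothetical helper, not the runner's actual code, and both sources are passed in as zero-argument callables so that the sketch makes no assumption about the exact `pymtml` call signature.

```python
def peak_memory_bytes(primary, fallback):
    """Return the peak memory reading in bytes, preferring `primary`.

    `primary` would typically be torch.cuda.max_memory_allocated (aliased
    to MUSA by torchada); `fallback` would wrap a pymtml memory query.
    Both are zero-argument callables supplied by the caller.
    """
    try:
        value = primary()
    except Exception:
        value = None  # treat a failing primary source like a missing reading
    if value:  # rejects both None and 0
        return int(value)
    return int(fallback())


if __name__ == "__main__":
    # Stand-in callables; real code would pass the torch and pymtml queries.
    print(peak_memory_bytes(lambda: 0, lambda: 2048))
```

Injecting the two sources keeps the decision logic testable on machines without MUSA hardware.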
| 188 | + |
| 189 | +## Requirements |
| 190 | + |
| 191 | +See `requirements.txt` for the pinned plugin / extras list. The heavy |
| 192 | +dependencies (torch + torchada + MUSA toolkit) must come from the Moore |
| 193 | +Threads container; do not install them from PyPI. |
| 194 | + |
| 195 | +Minimum environment: |
| 196 | +- Moore Threads MTT S4000 or newer (S3000 / S80 best-effort) |
| 197 | +- MUSA toolkit + driver matching card firmware |
| 198 | +- torch 2.7.1 (Moore Threads MUSA build) + torchada ≥ 0.1.9 |
| 199 | +- Python 3.10+ |
| 200 | +- vllm-musa 0.1.1 (vLLM core 0.10.1.1 or 0.13.0) |