AccelMark is a community-driven benchmark leaderboard for AI accelerators. The most valuable contribution is running the benchmark on hardware not yet in the leaderboard and submitting your results.
Got a GPU? Here's the shortest path to getting on the leaderboard:
# 1. Fork the repo on GitHub, then clone your fork
git clone https://github.com/<you>/AccelMark.git
cd AccelMark
pip install -e .
pip install -r runners/nvidia_vllm_47f5d58e/requirements.txt
# 2. Set your name (one-time setup)
cp configs/submitter.yaml.example configs/submitter.yaml
# Edit configs/submitter.yaml — add your name or GitHub username
# 3. Run the benchmark (~11 min on A100 for default scenarios)
# Accuracy gate runs automatically before the benchmark starts.
# Output directory is auto-named using run_name, e.g.:
# results/community/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm_47f5d58e_ed4b0557
python run.py --runner nvidia_vllm_47f5d58e --suite suite_A
# 4. Open a pull request with your result
git checkout -b submit/<your-hardware>
git add results/ && git commit -m "results: <hardware> on suite_A"
git push origin submit/<your-hardware>
gh pr create # or open the PR via the GitHub web UIThat's it. CI validates the result automatically; merging the PR publishes it to the leaderboard.
Each runner ships its own requirements.txt. Install for the runner you want to use:
pip install -r runners/nvidia_vllm_47f5d58e/requirements.txtTo see all available runners and their install commands:
python run.py --listcp configs/submitter.yaml.example configs/submitter.yamlEdit configs/submitter.yaml:
# Shown publicly on the leaderboard — can be your name, GitHub username, or organization
submitted_by: your_name_or_github_username
submission_type: individual # individual / vendor / research
organization: null # optional, e.g. "MIT" or "Google DeepMind"This file is gitignored — it never gets committed.
If you have models downloaded locally, set their paths so you don't
need to specify --model-path every time:
cp configs/models_local.yaml.example configs/models_local.yamlEdit configs/models_local.yaml:
models:
meta-llama/Meta-Llama-3-8B-Instruct:
local_path: /your/path/to/Meta-Llama-3-8B-Instruct
meta-llama/Llama-3.1-8B-Instruct:
local_path: null # not downloaded yetconfigs/models_local.yaml is gitignored. Once configured, you don't
need --model-path on the command line.
If you want to permanently change a runner's defaults (e.g. raise
max_num_seqs, enable enforce_eager, set tensor_parallel_size) without
adding flags to every invocation, drop a yaml at
configs/runner_configs/runner_<runner_id>.yaml. The file is
gitignored — only *.yaml.example companions are checked into the
repo. That makes the override strictly local to your machine and keeps
the canonical defaults intact for everyone else.
cp configs/runner_configs/runner_nvidia_vllm_47f5d58e.yaml.example \
configs/runner_configs/runner_nvidia_vllm_47f5d58e.yaml
# edit freely — your benchmarks now pick up the overrides automaticallypython run.py --runner nvidia_vllm_47f5d58e --suite suite_AThis runs the suite's default scenarios (accuracy gate → offline → online)
in sequence and produces a single merged result.json. If the accuracy gate fails,
the benchmark is aborted (use --skip-accuracy-gate to override).
Use --scenario all to also include extra scenarios defined in the suite (e.g. interactive, sustained),
or --scenario offline (or any other scenario name) to run a single scenario.
# Override the output directory if needed
python run.py --runner nvidia_vllm_47f5d58e \
--suite suite_A \
--output-dir ./results/verified/nvidia_a100_sxm4_80gbx1_suite_A_nvidia_vllm_47f5d58e_ed4b0557python run.py --runner nvidia_vllm_47f5d58e --suite suite_A --scenario offline# Suite B: Llama-3-70B on 4 chips
python run.py --runner nvidia_vllm_47f5d58e \
--suite suite_B \
--tensor-parallel-size 4 # (or set tensor_parallel_size: 4 in the runner config yaml)Suite B does not require a specific chip count — use however many chips your hardware needs to fit the 70B model. The result records the actual chip count and the leaderboard groups results by chip count for fair comparison.
Suite A on multiple chips is not recommended. Llama-3-8B fits on a single chip, so multi-chip adds communication overhead without a meaningful use case. Use Suite B (70B) or Suite E (scaling benchmark) for multi-chip runs.
Suite E runs the same workload at 1×, 2×, 4×, and 8× chip counts automatically and measures how efficiently throughput scales. Specify the maximum chip count you want to test:
python run.py --runner nvidia_vllm_47f5d58e \
--suite suite_ESet the chip count range via tensor_parallel_size in the runner config yaml,
or pass --tensor-parallel-size directly to the runner.
Suite E handles the scaling automatically — it runs at 1×, 2×, and 4× chips
sequentially and computes scaling_efficiency = N_chip_throughput / (1_chip_throughput × N).
Suite F uses Qwen2.5-0.5B-Instruct and short prompts (sharegpt_edge_v1).
It is aimed at single-GPU consumer hardware (≥4 GB VRAM). On pre-Ampere GPUs,
use --enforce-eager with the NVIDIA vLLM runner — see
runners/nvidia_vllm_47f5d58e/README.md.
python run.py --runner nvidia_vllm_47f5d58e --suite suite_FFull specs: suites/README.md.
Note on concurrency vs batch size: The offline scenario sweeps
client-side concurrency (how many requests the load generator fires
simultaneously) — not the inference engine's internal batch size.
The engine's internal batching (e.g. vLLM's max_num_seqs) is
configured separately in the runner and is not varied by the suite.
Results report concurrency values, not batch sizes.
python run.py --runner nvidia_vllm_47f5d58e \
--suite suite_A \
--model-path /path/to/local/model| Scenario | Primary metric | What it tells you |
|---|---|---|
| offline | tokens/sec | Max throughput when the GPU is fully loaded |
| online | max valid QPS | How many users/sec this chip can serve within a latency SLA |
| interactive | TTFT p99 | Single-user latency when the system is idle |
| sustained | sustained tok/s + throttle ratio | Whether throughput degrades over a long fixed-concurrency run (typically 30 min; Suite F uses 15 min) |
| accuracy | subset score | Model output quality against an MMLU baseline |
| speculative (extra) | tokens/sec + acceptance rate | Offline throughput with speculative decoding (draft model); also captures task.runtime_metrics.acceptance_rate if the runner implements get_runtime_metrics(). Suites A and D. |
| burst (extra) | burst degradation ratio | TTFT p99 during burst windows vs steady windows; stress-tests KV cache eviction and scheduler resilience. Suites A and B. |
Which scenarios are included in the default run and which are extras depends on
the suite. Check the suite's suite.json or see suites/README.md.
Running speculative on Suite A or D: the draft model path is resolved automatically from
speculative_draft_model_idin the suite config (respectingconfigs/models_local.yaml). No manual engine config is required.
throughput_tokens_per_sec — input + output tokens per second. Offline
scenario primary metric.
max_valid_qps — highest request rate (queries/sec) where TTFT p99 stays
under the suite's SLA. Online scenario primary metric.
ttft_ms_p99 — 99th percentile time-to-first-token in milliseconds.
Interactive scenario primary metric.
throttle_ratio — min_throughput / max_throughput over the 30-minute
sustained run. 1.0 = perfectly stable. A100 reference: ~0.87 at concurrency 8.
quality_efficiency — Suite C only. speedup_vs_bf16 × accuracy_retention.
Rewards quantized formats that are both fast and accurate.
See suites/README.md for full details on all suites and metrics.
| Suite | Default run | Notes |
|---|---|---|
| A | ~11 min | offline + online |
| B | ~20 min | 4× chips (A100-80GB); ~37 min on A100-40GB |
| C | ~22 min | offline only, all 5 precision formats |
| D | ~22 min | Long-context (offline only) |
| E | ~9 min | Up to 4× scaling |
| F | ~10 min | offline + online + interactive |
| G | ~35 min | MoE multi-chip (varies with chip count) |
Sustained adds about 30 minutes on datacenter suites (A–E); Suite F uses a 15-minute sustained profile.
Add interactive (~35 min on Suite A) by running with --scenario interactive or --scenario all.
Add speculative (~3 min extra on Suite A, ~24 min extra on Suite D) or burst (~18 min extra on Suite A/B) with --scenario speculative / --scenario burst.
When you run the suite, accuracy runs automatically as the first step. If accuracy fails, the benchmark is aborted.
============================================================
Step 1: Accuracy Gate
Must pass before benchmark runs.
============================================================
Score: 62/100 = 0.6200
Baseline: 0.6000
Delta: +0.0200 (min allowed: -0.10 — matches `accuracy_threshold_delta` in many suites; Suite C uses per-format thresholds)
Valid: True
✓ Accuracy gate passed: 0.62 (delta=0.02, valid=True)
The accuracy check uses the same model instance as the benchmark — same framework, same precision, same inference stack.
If accuracy fails:
- Check model weights and revision against the suite spec
- Common cause: quantized weights with too much quality loss
- Use
--skip-accuracy-gateonly for debugging — results submitted with a failed accuracy gate are permanently flagged on the leaderboard
Resuming an interrupted run: Re-running the same command resumes from where it stopped. Completed steps are skipped automatically.
After a successful run, validate locally and open a PR:
# Validate the produced files against the schemas (the same check CI runs).
python runners/validate_submission.py \
results/community/<run_name>/result.json
# Stage just the new result and env file.
git checkout -b submit/<your-hardware>
git add results/community/<run_name>/
git commit -m "results: <hardware> on suite_A"
git push origin submit/<your-hardware>
# Open the PR — either via the GitHub web UI or:
gh pr create --fillWhat gets committed is only the new files under results/community/<run_name>/:
your result.json, env_info.json, and (optionally) samples.jsonl. Nothing
else in the repo should change.
CI then re-runs the schema validator and the runner-folder integrity check. When both pass and a contributor reviews the diff, the PR is merged and your result shows up on the leaderboard on the next site build.
The static site is generated from results/ by leaderboard/generate.py.
After dropping your result into results/community/<run_name>/, you can
preview the final UI before opening the PR:
python leaderboard/generate.py # writes leaderboard/site/leaderboard.js + api/
python -m http.server -d leaderboard/site 8000 # serve the static site
# open http://localhost:8000Both leaderboard.js and leaderboard/site/api/ are gitignored — the GitHub
Actions workflow regenerates them on every merge to main.
| Tier | How to get it | Leaderboard placement |
|---|---|---|
| community | Submit a PR and pass CI validation | Community tab |
| verified | Independently reproduced on the same hardware/runner within 5% | Main leaderboard |
To promote a community result to verified, anyone with the same hardware
and runner can run the same suite and open a follow-up PR that lands the
reproduction in results/verified/. Maintainers do not gate this — every
verified result is itself reproducible by definition.
The frontend treats every submission as data — there's nothing per-vendor or per-chip hand-coded on the UI side. A few conventions are worth knowing if you want your result to look its best:
The chip-detail page (#/chip/<slug>) is keyed on the chip model alone, not on chip_count. That means a single chip page aggregates every fan-out you've ever submitted (×1, ×4, ×8, ×16) into one overview, with the runs table sorted by chip_count ascending and the per-suite KPI card flagging the deployment behind the best score (e.g. ×8 badge next to the metric).
Implication: if your runner emits one result.json per chip-count, you don't need to invent fake chip names to keep them apart — submit them with the same chip field and they'll merge cleanly. Old …-x<N> URLs from before this change auto-redirect to the bare-model slug, so existing shared links keep working.
Vendor accents (chip dot, vendor pill, peer card border) are driven entirely by assets/js/data.js's VENDOR_COLORS map. New vendors get a deterministic fallback colour from a 9-entry palette the first time they appear in the dataset — no frontend change required to ship the result.
If you want a brand-accurate accent for a new vendor (e.g. your accelerator's official colour), open a one-line PR to VENDOR_COLORS:
export const VENDOR_COLORS = {
// …
Cerebras: "#ff6f3c", // ← add yours
};VENDOR_ORDER (used to lay out the rankings facet pill row) is derived from Object.keys(VENDOR_COLORS), so the same edit also pins your vendor's position in the brand list. Vendors not in the map are appended alphabetically after the brand-named ones.
The run-detail modal's Visualize tab is hidden when result.viz is absent. Populate it to surface scenario-specific charts:
Suite-specific shapes the frontend understands today:
viz.type |
Suites | Required keys |
|---|---|---|
decode |
A · F · G | offline.labels[], offline.throughput[] |
multichip |
B | offline.labels[], offline.throughput[], offline.throughput_per_chip[] |
quant |
C | precisions[], throughput[], optional accuracy[] |
longctx |
D | offline.labels[], offline.throughput[], interactive.{ttft_p50,…,tpot_p99} |
scaling |
E | chip_counts[], throughput[], efficiency_pct[] |
viz is fully optional — runners that don't emit it still get a clean Details / Implementation modal. When present, the same fields drive the per-suite head-to-head charts on the Compare page (so two basket runs render directly comparable visualisations instead of falling back to a metric-table-only view).
The leaderboard surfaces the value of meta.submitted_by as @<handle> next to your result on every list (home recent, suite cards, chip-detail submissions table). Anything that looks like a GitHub login, an email, or a Name <email> form is reduced to the local-part — see submitterHandle in assets/js/utils.js.
AccelMark separates the model identifier (used for leaderboard comparisons) from the model path (where weights load from at runtime).
model_id and model_revision in result.json always use canonical
HuggingFace identifiers — they don't change regardless of where you load from.
# Download for offline use
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
--local-dir /your/path/Meta-Llama-3-8B-Instruct
# Use a local copy
python run.py --runner nvidia_vllm_47f5d58e --model-path /your/path/...If your local copy was downloaded at a different revision, add a note in
meta.notes of result.json.
A "runner" here is a Python class that wraps an inference framework (vLLM,
SGLang, mlx-lm, …) and exposes the AccelMark standard interface. Adding
one for an existing platform (NVIDIA, AMD, Ascend, Apple, Google TPU,
Moore Threads, …) does not require touching any shared file. The full
walk-through lives in runners/README.md; the short
version is:
- Copy
runners/template/runner.pyinto a temporary folder and fill in the three required methods (load_model,inference_fn_offline,release_resources) plusinference_fn_streamingif your framework has a streaming API. - Compute the hash and rename the folder:
python runners/hash_runner.py runners/tmp/produces e.g.nvidia_myframework_3f8a2c1d. - Write
meta.jsonnext to it, includingsuite_support— that field is how the top-levelREADME.mdtable picks up your runner. You never editREADME.mdyourself. - Add a
requirements.txt. - Validate:
python runners/validate_runners.py --dir runners/<your_folder>. - Regenerate the README matrix locally:
python tools/generate_platforms_matrix.py.
# runners/your_platform_{hash8}/runner.py
from runners.benchmark_runner import BenchmarkRunner, InferenceRequest
from loadgen.types import InferenceResult
class MyFrameworkRunner(BenchmarkRunner):
SUPPORTS_STREAMING = True # set False if no streaming API
SUPPORTS_BATCHING = True # set False if serial only (e.g. mlx-lm)
SUPPORTS_ONLINE = True
SUPPORTS_MULTI_CHIP = True # set False if no tensor parallelism
def load_model(self, model_path: str, parallelism: dict) -> None:
tp_size = parallelism["tensor_parallel_size"]
self.model = MyFramework.load(model_path, tp=tp_size)
def inference_fn_offline(self, requests: list[InferenceRequest]) -> list[InferenceResult]:
prompts = [self._format_prompt(r.prompt) for r in requests]
outputs = self.model.generate(prompts)
return [InferenceResult(
first_token_time_ms=None,
total_time_ms=o.elapsed_ms,
output_tokens=o.num_tokens,
input_tokens=o.num_input_tokens,
success=True,
) for o in outputs]
async def inference_fn_streaming(self, request: InferenceRequest) -> InferenceResult:
... # required if SUPPORTS_STREAMING = True
def release_resources(self) -> None:
del self.model
self.model = None
def _get_framework_name(self) -> str:
return "MyFramework"
def _get_framework_version(self) -> str:
import myframework
return myframework.__version__
if __name__ == "__main__":
MyFrameworkRunner().main()All orchestration (result building, accuracy reuse, Suite E, etc.) is
inherited from BenchmarkRunner automatically.
Checklist for a new-runner PR (existing platform):
- Runner folder named
{platform}_{name}_{hash8}with correct hash -
runner.pysubclassesBenchmarkRunnerand passesrunners/validate_runners.py -
meta.jsonpresent and valid (seerunners/meta.schema.json), withsuite_supportdeclared for every suite your runner can or cannot run -
requirements.txtincluded -
tools/generate_platforms_matrix.py --checkpasses locally (CI also enforces this) - At least one reference result in
results/community/(validated by CI)
If you are bringing up a new platform (e.g. a vendor not yet in
schema/platforms.json), the only additional file you need to ship is
runners/platforms/<your_platform>.py
which exports module-level collect(), detect_runtime_version() and a
few optional helpers. The collector at runners/collect_env.py
auto-discovers it; no change to that file is required. See
runners/README.md
for the full protocol and a worked example.
Optional polish steps when the new platform stabilises:
- Add an entry to
schema/platforms.jsonso the README matrix renders a pretty hardware label and stable sort order. Until then, the matrix renders the bare identifier andvalidate_runners.pyemits a non-fatal warning prompting this follow-up.
See DEVELOPMENT.md for the full implementation reference.
If a result looks wrong:
- Open a GitHub Issue using the "Challenge a Result" template
- Include: the submission name, what looks wrong, and ideally your own run on the same hardware as evidence
The community discusses the report in the issue. If the consensus is that
the result is suspicious, a PR sets meta.flagged on that result to a
reason string and the entry shows up with a
- Fix a bug — open a PR with a description and test if possible
- Improve runners — better error messages, edge case handling
- Update cloud pricing — edit
schema/cloud_pricing.json, open a PR titleddata: update cloud pricing YYYY-MMwith source URLs in the description - Propose a new suite — open an Issue with model, chip count, scenarios, and rationale; new suites require a reference result before going live
- Do not modify
schema/accuracy_subset.jsonl— it is immutable - Do not modify other people's results in
results/ - Vendor employees may submit results for their own chips (shown with a Vendor badge);
disclose affiliation by tagging
[vendor]in your submitter name - Results submitted with
--enforce-eagerare valid but noted — they may underrepresent true hardware capability.--enforce-eageris a runner-specific flag (not a baserun.pyflag); it can also be set permanently viaenforce_eager: truein the runner config yaml atconfigs/runner_configs/runner_<id>.yaml. - Results submitted with
--skip-accuracy-gateare permanently flagged
{ "viz": { "type": "decode", // bandwidth-bound suite (A/F/G default) "offline": { "labels": [1, 2, 4, 8, 16, 32], "throughput": [120, 230, 410, 760, 1100, 1380] }, "interactive": { // optional, suite_D-style "ttft_p50": 78, "ttft_p90": 110, "ttft_p99": 135, "tpot_p50": 9, "tpot_p90": 11, "tpot_p99": 14 } } }