Each suite defines a fully-specified benchmark configuration for comparing AI accelerators apples-to-apples.
| Suite | Model | Chips | Scenarios | Purpose |
|---|---|---|---|---|
| Suite A | Llama-3-8B-Instruct | 1 | offline, online (+ interactive, sustained, speculative, burst extra) | Standard single-chip inference. Required for leaderboard entry. |
| Suite B | Llama-3-70B-Instruct | flexible | offline, online (+ interactive, sustained, burst extra) | Large model multi-chip inference |
| Suite C | Llama-3.1-8B-Instruct | 1 | offline (+ online, sustained extra) | Quantization efficiency (BF16/FP8/W8A8/W8A16/W4A16) |
| Suite D | Llama-3.1-8B-Instruct | 1 | offline (+ interactive, online, sustained, speculative extra) | Long-context inference (~28K input tokens) |
| Suite E | Llama-3-8B-Instruct | 1×/2×/4×/8× | offline | Multi-chip scaling efficiency |
| Suite F | Qwen2.5-0.5B-Instruct | 1 (recommended) | offline, online, interactive | Consumer/edge single-GPU inference |
| Suite G | Mixtral-8x7B-Instruct-v0.1 | ≥2 (auto) | offline, online (+ interactive, sustained extra) | MoE multi-chip inference |
The times below are measured wall-clock times on a single NVIDIA A100-SXM4-80GB running vLLM 0.7.3,
as recorded in meta.benchmark_elapsed_minutes of each submitted result.json.
benchmark_elapsed_minutes is the sum of per-scenario benchmark times
(it excludes model load and the sleep gaps between scenarios).
Times on other hardware will differ.
Default scenarios only:
| Suite | Scenario | rc | Formula | Wall time |
|---|---|---|---|---|
| A | offline | 100 | 10s/run × 4 × 3 conc | ~2 min |
| A | online | 300 | Σ(elapsed × 3) / 3 QPS | ~7 min |
| A | (interactive — extra) | 150 | 659s/run × 3 runs | (~33 min) |
| A | (sustained — extra) | — | 32 min fixed | (~32 min) |
| A | (speculative — extra) | 100 | same as offline, draft model loaded | (~3 min) |
| A | (burst — extra) | 300 | num_runs × (burst_interval + burst_duration) | (~18 min) |
| Suite A default total | | | | ~13 min |
| B | offline | 100 | 21s/run × 4 × 3 conc | ~4 min |
| B | online | 200 | Σ(elapsed × 3) / 4 QPS | ~23 min |
| B | (interactive — extra) | 50 | 780s/run × 3 runs | (~39 min) |
| B | (sustained — extra) | — | 32 min fixed | (~32 min) |
| B | (burst — extra) | 200 | num_runs × (burst_interval + burst_duration) | (~18 min) |
| Suite B default total | | | | ~26 min |
| C | offline (×5 formats) | 100 | 4s/run × 4 × 3 conc × 5 fmt | ~22 min |
| C | (online — extra) | 300 | Σ(elapsed × 3) / 4 QPS × 5 fmt | (~48 min) |
| C | (sustained — extra) | — | 15 min fixed × 5 fmt | (~76 min) |
| Suite C default total | | | | ~22 min |
| D | offline | 50 | 220s/run × 3 × 2 conc | ~22 min |
| D | (interactive — extra) | 100 | 1124s/run × 2 runs | (~37 min) |
| D | (online — extra) | 200 | Σ(elapsed × 2) / 3 QPS | (~38 min) |
| D | (sustained — extra) | — | 32 min fixed | (~32 min) |
| D | (speculative — extra) | 50 | same as offline, draft model loaded | (~24 min) |
| Suite D default total | | | | ~22 min |
| E | offline (1×/2×/4×) | 150 | per-chip runs × 4 × 3 conc | ~9 min |
| Suite E default total | | | | ~9 min |
| F | offline | 200 | 8s/run × 4 × 3 conc | ~2 min |
| F | online | 300 | Σ(elapsed × 3) / 2 QPS | ~3 min |
| F | interactive | 150 | 94s/run × 3 runs | ~5 min |
| F | (sustained — extra) | — | 15 min fixed | (~15 min) |
| Suite F default total | | | | ~10 min |
Total default (A–F): ~85 min · Total all-scenarios (A–F): ~420 min · Suite G default: ~35 min (2× chip; varies with MoE routing overhead)
rc = request count per run. elapsed = elapsed_seconds_median from result.json (one run).
Formula for offline: elapsed × (num_runs + 1 warmup) × num_concurrency_levels.
Formula for online/interactive: elapsed × num_runs (no warmup run).
Rows and times shown in parentheses are extra scenarios; run them with `--scenario all`.
Sustained scenario (extra, opt-in): adds ~30 min on datacenter suites (A–E); Suite F uses a 15-minute profile. Run with
`--scenario sustained` or `--scenario all`. Not included in default suite runs.
Counts are defined per suite in suites/<suite_id>/suite.json. Typical patterns:
offline: rc=100 (A, B, C); rc=50 (D, long-context); rc=150 (E, scaling); rc=200 (F, fast model)
online: orc=300 (A, C, F) — robust p99 at practical QPS levels
orc=200 (B, D) — 70B/long-context; p95 is primary tail metric
interactive: irc=150 (A, F); irc=100 (D, long-context p90 primary); irc=50 (B, 70B decode ~15s/req)
Serial execution — one request at a time. interactive_warmup_runs=0 for all suites.
Total wall time = `elapsed_seconds_median × num_runs` (interactive/online, per QPS level) or `elapsed_seconds_median × (num_runs + warmup_runs) × num_concurrency_levels` (offline). `elapsed_seconds_median` in result.json covers one run, not the full suite.
Always use the suite’s request_count, online_request_count, and
interactive_request_count fields as the source of truth.
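The wall-time formulas above can be sketched as a small helper. This is an illustrative sketch only: the function name `estimate_wall_seconds` and its parameters are hypothetical and not part of the benchmark code.

```python
def estimate_wall_seconds(scenario: str, elapsed_seconds_median: float,
                          num_runs: int, warmup_runs: int = 0,
                          num_levels: int = 1) -> float:
    """Estimate total wall time for one scenario from a single-run median."""
    if scenario == "offline":
        # Offline repeats every concurrency level, warmup run included.
        return elapsed_seconds_median * (num_runs + warmup_runs) * num_levels
    if scenario in ("online", "interactive"):
        # No warmup run; for online, call once per QPS level and sum the results.
        return elapsed_seconds_median * num_runs
    raise ValueError(f"unknown scenario: {scenario}")


# Suite A offline row: 10 s/run, 3 runs + 1 warmup, 3 concurrency levels.
suite_a_offline = estimate_wall_seconds("offline", 10, num_runs=3,
                                        warmup_runs=1, num_levels=3)
print(suite_a_offline / 60)  # 2.0 (minutes)
```

The result matches the ~2 min listed for Suite A offline in the timing table.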
Single-chip inference — minimum required for leaderboard entry
Model: meta-llama/Meta-Llama-3-8B-Instruct
Chips: 1
Precision: BF16
Measures maximum throughput when all requests are sent at once. vLLM's internal scheduler handles batching.
concurrency_levels: [8, 32, 128] — client-side concurrency (requests sent simultaneously)
request_count: 100
num_runs: 3 + 1 warmup
Primary metric: throughput_tokens_per_sec (input + output tokens)
Measures maximum sustainable QPS while meeting a latency SLA. Requests arrive following a Poisson process (realistic service traffic).
online_qps_levels: [5, 25, 100]
online_sla_ttft_ms: 500 — p99 TTFT must be < 500ms to pass
online_request_count: 300
num_runs: 3 (no warmup)
Primary metric: max_valid_qps
max_valid_qps = the highest QPS level where p99 TTFT < 500ms.
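As a minimal sketch of that rule, the helper below picks the highest QPS level whose p99 TTFT is under the SLA. The function signature and the latency numbers are made up for illustration, and the sketch assumes a level counts on its own merits even if a lower level failed.

```python
from typing import Mapping, Optional


def max_valid_qps(p99_ttft_ms_by_qps: Mapping[float, float],
                  sla_ttft_ms: float = 500.0) -> Optional[float]:
    """Return the highest QPS level with p99 TTFT strictly below the SLA."""
    passing = [qps for qps, p99 in p99_ttft_ms_by_qps.items() if p99 < sla_ttft_ms]
    return max(passing) if passing else None


# Hypothetical p99 TTFT (ms) at Suite A's QPS levels [5, 25, 100].
example = {5: 120.0, 25: 310.0, 100: 740.0}
print(max_valid_qps(example))  # 25
```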
Measures single-request latency in isolation (no concurrency).
interactive_request_count: 150
num_runs: 3 (no warmup)
Primary metrics: ttft_ms_p50, ttft_ms_p99
> Interactive is an **extra** scenario for Suite A. Run with `--scenario all` or `--scenario interactive`.
Runs 100 MMLU multiple-choice questions through the same model and framework as the benchmark. Runs automatically as the first step when running a suite.
accuracy_questions: 100
accuracy_threshold_delta: 0.10 — valid if score ≥ baseline − 0.10 (see suite.json)
Primary metric: subset_score (fraction correct)
Suite C uses per-format accuracy_thresholds in suite_C/suite.json instead of a single delta.
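The single-delta accuracy gate can be illustrated as follows. The helper names and the baseline value of 0.66 are assumptions for this sketch; real baselines and thresholds come from the suite configuration.

```python
from typing import Sequence


def subset_score(is_correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(is_correct) / len(is_correct)


def passes_accuracy_gate(score: float, baseline: float,
                         threshold_delta: float = 0.10) -> bool:
    """Valid if score >= baseline - threshold_delta."""
    return score >= baseline - threshold_delta


# Made-up run: 61 of 100 questions correct, hypothetical baseline of 0.66.
answers = [True] * 61 + [False] * 39
score = subset_score(answers)
print(score, passes_accuracy_gate(score, baseline=0.66))  # 0.61 True
```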
30-minute fixed-concurrency load test. Detects KV cache exhaustion, thermal throttling, and memory fragmentation that point-in-time benchmarks miss.
sustained_concurrency: 8 — requests kept in-flight simultaneously
duration_minutes: 30
sample_interval_seconds: 60 — throughput snapshot every minute
warmup_minutes: 2
A100 reference result: 527 tok/s sustained, throttle ratio 0.91, no throttle onset detected.
Key output metrics:
- `sustained_throughput_tokens_per_sec`: average post-warmup throughput
- `throttle_ratio`: min/max throughput ratio. 1.0 = no degradation. Lower = more throttling.
- `throttle_onset_minute`: when throughput first dropped below 90% of peak
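One way these three metrics could be derived from the per-minute throughput snapshots is sketched below. The function is hypothetical; it assumes "peak" means the maximum over all post-warmup samples and that minutes are counted from 1 starting at the first post-warmup sample.

```python
from typing import Dict, Optional, Sequence


def throttle_metrics(samples: Sequence[float],
                     onset_fraction: float = 0.90) -> Dict[str, Optional[float]]:
    """samples: post-warmup throughput snapshots (tok/s), one per minute."""
    peak = max(samples)
    # First minute whose throughput falls below 90% of the peak, if any.
    onset = next((minute for minute, tps in enumerate(samples, start=1)
                  if tps < onset_fraction * peak), None)
    return {
        "sustained_throughput_tokens_per_sec": sum(samples) / len(samples),
        "throttle_ratio": min(samples) / peak,
        "throttle_onset_minute": onset,
    }


# Made-up samples: throughput sags from minute 4 onward.
metrics = throttle_metrics([500, 510, 505, 450, 440])
print(metrics["throttle_onset_minute"])  # 4
```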
Run explicitly with --scenario sustained. Not part of the default run.
python run.py --runner nvidia_vllm_47f5d58e --suite suite_A --scenario sustained

Runs the offline workload with a draft model loaded for speculative token generation. The loadgen path is identical to offline; only the engine configuration changes.
speculative_draft_model_id: meta-llama/Llama-3.2-1B-Instruct
speculative_draft_model_revision: 9213176726f574b556790deb65791e0c5aa438b6
speculative_num_tokens: 4 — draft tokens proposed per step
request_count: 100
num_runs: 3 + 1 warmup
Primary metric: throughput_tokens_per_sec (offline)
The draft model path is resolved automatically via _resolve_model_path() (respects
configs/models_local.yaml). Runners may override get_runtime_metrics() to expose
acceptance_rate and mean_accepted_tokens in task.runtime_metrics.
python run.py --runner nvidia_vllm_47f5d58e --suite suite_A --scenario speculative

Alternates between a steady arrival rate and a 5× burst. Tests KV cache eviction behavior and scheduler responsiveness under transient overload.
burst_steady_qps: 5 — QPS during steady windows
burst_peak_qps: 25 — QPS during burst windows
burst_duration_seconds: 30 — duration of each burst window
burst_interval_seconds: 120 — duration of each steady window between bursts
num_runs: 3 (cycles)
online_request_count: 300 — request pool size (same as online)
Primary metric: burst_degradation_ratio (burst_ttft_p99 / steady_ttft_p99)
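The primary metric can be sketched as below. The nearest-rank percentile is an assumption of this sketch (the source does not state how p99 is computed), and the TTFT samples are synthetic.

```python
from typing import Sequence


def p99(values: Sequence[float]) -> float:
    """Nearest-rank 99th percentile (assumed method, integer arithmetic)."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def burst_degradation_ratio(steady_ttft_ms: Sequence[float],
                            burst_ttft_ms: Sequence[float]) -> float:
    """burst_ttft_p99 / steady_ttft_p99; 1.0 means bursts cost nothing."""
    return p99(burst_ttft_ms) / p99(steady_ttft_ms)


# Synthetic samples: every burst-window TTFT is twice its steady counterpart.
steady = list(range(1, 101))
burst = [2 * v for v in steady]
print(burst_degradation_ratio(steady, burst))  # 2.0
```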
python run.py --runner nvidia_vllm_47f5d58e --suite suite_A --scenario burst

Large model multi-chip inference
Model: meta-llama/Meta-Llama-3-70B-Instruct
Chips: flexible — use however many your hardware requires
Precision: BF16
Default scenarios match Suite A’s offline + online workload at 70B scale.
Optional interactive and sustained scenarios are defined in suite_B/suite.json
(scenarios.extra). The chip count is flexible — use however many chips your hardware needs.
Scaling efficiency vs Suite A (reference: N chips used for Suite B):
efficiency = (Suite B throughput / N) / (Suite A throughput / 1)
A value of 0.8 with N=4 means 4 chips deliver 3.2× the single-chip throughput.
sustained_concurrency: 4 — lower than Suite A due to higher memory pressure per request
duration_minutes: 30
Run with --scenario sustained. Concurrency set to 4 because the 70B model
occupies most GPU memory, leaving less room for KV cache than the 8B model.
Same two-state burst pattern as Suite A, using online_request_count (200) as the
request pool.
burst_steady_qps: 5
burst_peak_qps: 25
burst_duration_seconds: 30
burst_interval_seconds: 120
python run.py --runner nvidia_vllm_47f5d58e --suite suite_B --scenario burst

Quantization efficiency: speed vs quality tradeoff
"How much faster does quantization make this chip, and what quality is lost?"
Suite C runs an offline workload similar to Suite A's (same dataset, output_tokens_max 512)
in five precision formats, using fixed pre-quantized HuggingFace checkpoints.
All formats use the same Llama-3.1-8B base model, so accuracy differences
reflect quantization only, not model version differences.
| Base model | meta-llama/Llama-3.1-8B-Instruct |
| Chips | 1 |
| Default scenarios | accuracy, offline |
| Extra scenarios | online, sustained |
| Primary metric | quality_efficiency (best across all formats) |
| Run time | ~31 min on A100 (default scenarios, all 5 formats) |
| Format | Checkpoint | Accuracy threshold | Notes |
|---|---|---|---|
| BF16 | meta-llama/Llama-3.1-8B-Instruct | ±0.03 | Baseline |
| FP8 | RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8 | ±0.03 | Fast on H100/MI300X; emulated on A100 |
| W8A8 | RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 | ±0.04 | INT8 weights + activations |
| W8A16 | RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16 | ±0.03 | INT8 weights, FP16 activations |
| W4A16 | RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | ±0.05 | INT4 weights (AWQ), FP16 activations |
Each format runs against the same 100 prompts with concurrency levels
[1, 4, 16, 64] from suite_C/suite.json (not the same sweep as Suite A’s
[8, 32, 128]). Format availability depends on the runner's
SUPPORTED_QUANTIZATION_BACKENDS declaration — unsupported formats are
skipped automatically by matching each entry's engine_kwargs.quantization
against the runner's backend list.
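The skip logic described above might look roughly like this. The layout of each format entry and the precision-to-backend pairing are assumptions made for illustration; only the idea of matching `engine_kwargs.quantization` against the runner's backend list comes from the text.

```python
from typing import Dict, List, Sequence

# Hypothetical format entries; the real ones live in suite_C/suite.json.
FORMATS: List[Dict] = [
    {"precision": "BF16", "engine_kwargs": {}},
    {"precision": "FP8", "engine_kwargs": {"quantization": "fp8"}},
    {"precision": "W8A8", "engine_kwargs": {"quantization": "compressed-tensors"}},
]


def runnable_formats(formats: Sequence[Dict],
                     supported_backends: Sequence[str]) -> List[str]:
    """Keep formats whose quantization backend the runner declares."""
    keep = []
    for entry in formats:
        backend = entry["engine_kwargs"].get("quantization")
        # Unquantized BF16 has no backend requirement, so it always runs.
        if backend is None or backend in supported_backends:
            keep.append(entry["precision"])
    return keep


print(runnable_formats(FORMATS, ["compressed-tensors", "gptq_marlin"]))
# ['BF16', 'W8A8']
```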
speedup_vs_bf16: throughput ratio relative to BF16 baseline.
1.20 = 20% more throughput than BF16.
quality_efficiency: throughput × accuracy_score. Rewards both
speed and accuracy simultaneously. The leaderboard primary metric is the
best quality_efficiency across all evaluated formats.
accuracy per format: each format has its own accuracy baseline and threshold. Accuracy below threshold is flagged but does not block the run.
| Format | Throughput | Accuracy | Speedup | Quality Eff | Compute dtype |
|---|---|---|---|---|---|
| BF16 | 5,336 tok/s | 0.57 | 1.000× | 3,042 | bfloat16 |
| FP8 | 5,179 tok/s | 0.57 | 0.971× | 2,952 | bfloat16 (emulated) |
| W8A8 | 6,399 tok/s | 0.59 | 1.199× | 3,776 | bfloat16 |
| W8A16 | 4,939 tok/s | 0.57 | 0.925× | 2,815 | bfloat16 |
| W4A16 | 5,095 tok/s | 0.57 | 0.955× | 2,904 | float16 |
W8A8 wins on A100 because it uses INT8 tensor cores. FP8 shows no speedup because A100 lacks native FP8 hardware — compute falls back to BF16. On H100, FP8 would show ~1.5-1.8× speedup.
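The two derived metrics can be recomputed from the A100 reference figures in this section, as in the sketch below. The helper names are hypothetical, and because the published accuracy values are rounded to two decimals, recomputed quality_efficiency values may differ from the reported ones by a point or so.

```python
def speedup_vs_bf16(throughput: float, bf16_throughput: float) -> float:
    """Throughput ratio relative to the BF16 baseline."""
    return throughput / bf16_throughput


def quality_efficiency(throughput: float, accuracy: float) -> float:
    """Throughput multiplied by accuracy score."""
    return throughput * accuracy


# (throughput tok/s, accuracy) from the A100 reference results.
results = {
    "BF16": (5336, 0.57),
    "FP8": (5179, 0.57),
    "W8A8": (6399, 0.59),
    "W8A16": (4939, 0.57),
    "W4A16": (5095, 0.57),
}

# The leaderboard metric is the best quality_efficiency across formats.
best = max(results, key=lambda fmt: quality_efficiency(*results[fmt]))
w8a8_speedup = speedup_vs_bf16(results["W8A8"][0], results["BF16"][0])
print(best, round(w8a8_speedup, 3))  # W8A8 1.199
```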
Declare which quantization backends your runner's framework supports. The strings are the engine's own backend identifiers (vLLM names shown), NOT suite precision tags such as W8A8/FP8/W4A16:
# In your runner class:
SUPPORTED_QUANTIZATION_BACKENDS = ["fp8", "compressed-tensors", "gptq_marlin"] # vLLM full
SUPPORTED_QUANTIZATION_BACKENDS = ["compressed-tensors", "gptq_marlin"] # No native FP8
SUPPORTED_QUANTIZATION_BACKENDS = [] # BF16 only

Each format's checkpoint must be available locally. Add to
configs/models_local.yaml:
models:
  "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8":
    local_path: /data/models/llama31-8b-fp8
  "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8":
    local_path: /data/models/llama31-8b-w8a8
  "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a16":
    local_path: /data/models/llama31-8b-w8a16
  "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16":
    local_path: /data/models/llama31-8b-w4a16

python run.py --runner nvidia_vllm_47f5d58e --suite suite_C

Runs all supported formats in sequence. Each format runs in a separate subprocess for clean GPU state. BF16 always runs first as the baseline.
Long-context inference
Model: meta-llama/Llama-3.1-8B-Instruct (128K native context window)
Chips: 1
Precision: BF16
max_model_len: 30208 (KV budget for this benchmark)
Input tokens: p50 ~28,650 (dataset sharegpt_longctx_v1; p99 ~29,932)
Output tokens: up to 256
Tests the chip's ability to handle long-context workloads. Llama-3.1-8B
is used (not 3.0) because it natively supports 128K context. The suite caps
max_model_len at 30,208 and uses prompts near ~28K tokens (not full 32K)
so runs remain reproducible and within practical memory limits on common GPUs.
concurrency_levels: [1, 4]
request_count: 50
num_runs: 2 + 1 warmup
Primary metric: throughput_tokens_per_sec
Long-context inference is dominated by the prefill phase (~28K input tokens), which is compute-bound and tests raw FLOPS more than memory bandwidth.
OOM on some batch sizes is expected and recorded as valid data.
A chip that OOMs at batch_size=4 but succeeds at batch_size=1 will show
"oom": true for that row — useful information, not a failure.
interactive_request_count: 100 — each request ~11s sequential at 28K context
num_runs: 2 (no warmup)
Primary metrics: ttft_ms_p50, ttft_ms_p90 — p90 is primary; p95 marginal at 100 reqs
Interactive is extra for Suite D: 100 reqs × 2 runs ≈ 37 min — expensive at 28K context.
Run with --scenario all or --scenario interactive.
online_qps_levels: [0.5, 1, 2]
online_request_count: 200
Extra due to cost: QPS=0.5 alone takes ~13 min (rate-bound: 200 reqs / 0.5 QPS × 2 runs).
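The rate-bound estimate in the note above is simple arithmetic; a sketch with a hypothetical helper name:

```python
def rate_bound_minutes(request_count: int, qps: float, num_runs: int) -> float:
    """Lower bound on wall time when request arrival rate is the bottleneck."""
    return request_count / qps * num_runs / 60


# Suite D online at QPS=0.5: 200 requests, 2 runs.
print(round(rate_bound_minutes(200, 0.5, 2), 1))  # 13.3
```

This is only a lower bound: it ignores the time needed to finish the final in-flight requests.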
sustained_concurrency: 8
duration_minutes: 30
A100 reference result: 52 tok/s sustained, throttle ratio 0.85, throttle onset at minute 13. Low absolute throughput is expected due to the ~28K input token prefill overhead — what matters is the throttle ratio relative to peak.
Runs the offline workload at ~28K context with a 1B draft model. Speculative decoding at long context is prefill-bound — acceptance rate and speedup will differ significantly from Suite A.
speculative_draft_model_id: meta-llama/Llama-3.2-1B-Instruct
speculative_draft_model_revision: 9213176726f574b556790deb65791e0c5aa438b6
speculative_num_tokens: 4
request_count: 50
num_runs: 2 + 1 warmup
python run.py --runner nvidia_vllm_47f5d58e --suite suite_D --scenario speculative

Multi-chip scaling efficiency
Model: meta-llama/Meta-Llama-3-8B-Instruct
Chips: 1×, 2× required; 4×, 8× optional
Scenario: offline only
Holds the model constant (8B, fits on any single chip) and varies only chip count. This isolates the scaling dimension from chip speed.
Scaling efficiency is the primary metric:
scaling_efficiency = N_chip_throughput / (1_chip_throughput × N)
1.00 = perfect linear scaling
0.85 = 4 chips give 3.4× speedup (15% lost to communication)
0.50 = 4 chips give only 2× speedup (poor interconnect)
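A direct transcription of the scaling formula, with made-up throughput numbers (the function name is hypothetical):

```python
def scaling_efficiency(n_chip_throughput: float,
                       one_chip_throughput: float,
                       n_chips: int) -> float:
    """N-chip throughput divided by N times the single-chip throughput."""
    return n_chip_throughput / (one_chip_throughput * n_chips)


# Made-up example: 1,000 tok/s on 1 chip and 3,400 tok/s on 4 chips.
print(scaling_efficiency(3400, 1000, 4))  # 0.85
```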
concurrency_levels: [8, 32, 128]
request_count: 150
num_runs: 3 + 1 warmup
chip_counts_required: [1, 2]
chip_counts_optional: [4, 8]
chip_counts_all: [1, 2, 4, 8]
# 4-chip machine
python run.py --runner nvidia_vllm_47f5d58e --suite suite_E --max-chips 4
# 8-chip machine
python run.py --runner nvidia_vllm_47f5d58e --suite suite_E --max-chips 8

Minimum requirement: both 1× and 2× must succeed for the submission to pass validation.
Request datasets are stored in datasets/ and shared across suites.
Each dataset is versioned and immutable — changing prompts creates a new
version rather than modifying the existing one.
| Dataset | Used by | Prompts | Input p50 | Output p50 |
|---|---|---|---|---|
| sharegpt_standard_v1 | Suite A, B, C, E, G | 500 | ~280 tokens | ~310 tokens |
| sharegpt_longctx_v1 | Suite D | 200 | ~28,650 tokens | up to 256 (suite cap) |
| sharegpt_edge_v1 | Suite F | 500 | ~95 tokens | ~150 tokens |
Suite JSON files reference datasets by name:

"dataset": "sharegpt_standard_v1"

Each line in requests.jsonl:

{"request_id": 0, "prompt": "...", "input_tokens": 245, "conversation_id": "sg_00001", "turn_index": 0, "prompt_type": "conversational"}

These files must not be edited manually. Changing prompts invalidates comparisons with existing results.
Prompt type distribution (sharegpt_standard_v1):
conversational: 40% — everyday dialogue, advice, Q&A
summarization: 30% — long input, short output
code_generation: 20% — write/fix/explain code
reasoning: 10% — step-by-step analysis, math
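To check a dataset's prompt-type mix without modifying it, the records can be read and tallied. This is a read-only sketch that works on in-memory lines; the helper name is hypothetical and the sample records are synthetic, using only the `prompt_type` field from the record format shown above.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def prompt_type_distribution(jsonl_lines: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of requests per prompt_type."""
    counts = Counter(json.loads(line)["prompt_type"]
                     for line in jsonl_lines if line.strip())
    total = sum(counts.values())
    return {ptype: n / total for ptype, n in counts.items()}


# Synthetic 10-record sample mirroring the 40/30/20/10 split.
types = (["conversational"] * 4 + ["summarization"] * 3
         + ["code_generation"] * 2 + ["reasoning"])
lines = [json.dumps({"request_id": i, "prompt_type": t})
         for i, t in enumerate(types)]
dist = prompt_type_distribution(lines)
print(dist["conversational"], dist["reasoning"])  # 0.4 0.1
```

With a real file, pass the open file handle in place of `lines`.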
Consumer/edge single-GPU inference
Model: Qwen/Qwen2.5-0.5B-Instruct
Chips: 1 (recommended — no hard constraint)
Precision: BF16 (auto-fallback to FP16 on pre-Ampere)
Suite F is designed for consumer and edge GPUs: RTX 3090, RTX 4090, A10, L4, and pre-Ampere hardware including V100 and T4. The model (0.5B parameters, ~1 GB in FP16) fits comfortably on any GPU with 4+ GB VRAM.
Precision handling
precision_required: BF16 with allowed_precisions: [FP16, BF16] (order matches
suite_F/suite.json). Ampere+ GPUs
(RTX 3090/4090, A100, H100) use BF16 natively. Pre-Ampere GPUs (V100, T4, RTX 20xx)
automatically fall back to FP16 via allowed_precisions — no warning, no flag, since
FP16 is an explicitly accepted precision for this suite. Results are labeled with
the actual precision used.
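The fallback rule can be sketched as a small selection function. The function is hypothetical and is not the runner's actual implementation; it relies on the fact that NVIDIA Ampere GPUs report CUDA compute capability 8.x, while V100 (7.0) and T4 (7.5) are below that.

```python
from typing import Sequence


def select_precision(compute_capability_major: int,
                     precision_required: str = "BF16",
                     allowed_precisions: Sequence[str] = ("FP16", "BF16")) -> str:
    """Pick the precision a GPU would run Suite F with."""
    supports_bf16 = compute_capability_major >= 8  # Ampere and newer
    if precision_required == "BF16" and not supports_bf16:
        if "FP16" in allowed_precisions:
            return "FP16"  # accepted fallback for this suite
        raise ValueError("GPU lacks BF16 and FP16 is not an allowed precision")
    return precision_required


print(select_precision(8))  # BF16 (e.g. RTX 3090, A100)
print(select_precision(7))  # FP16 (e.g. V100, T4)
```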
Why Qwen2.5-0.5B?
- Smallest practical instruction-tuned model with full vLLM support since v0.4.0
- Fits in 4 GB VRAM in FP16 — accessible to the widest range of consumer hardware
- Stable `Qwen2ForCausalLM` architecture avoids newer-vLLM-only features
- Apache 2.0 licensed
Accuracy note: Absolute MMLU score for a 0.5B model is ~0.35–0.40, well below larger models. The accuracy gate exists to detect broken quantization or misconfigured precision — not to evaluate model quality. The threshold (±0.10) is intentionally wider than datacenter suites.
Same structure as Suite A — offline, online, and interactive. Concurrency levels are smaller (4/16/64 vs 8/32/128) because a 0.5B model saturates consumer GPUs at lower concurrency.
concurrency_levels: [4, 16, 64]
online_qps_levels: [10, 40] — QPS=2 excluded; rate-bound at 0.5B scale, below practical range
online_sla_ttft_ms: 500
online_request_count: 300
request_count: 200 (offline)
interactive_request_count: 150
num_runs: 3 + 1 warmup
Shorter wall time than datacenter suites so consumer GPUs stay within a practical budget.
sustained_concurrency: 32
duration_minutes: 15
sample_interval_seconds: 60
warmup_minutes: 1
Run with --scenario sustained. Not part of the default run.
# Standard run (Ampere+)
python runners/nvidia_vllm_47f5d58e/runner.py --suite suite_F
# Pre-Ampere GPU (V100, T4, RTX 20xx) — required flag
python runners/nvidia_vllm_47f5d58e/runner.py --suite suite_F --enforce-eager
# Or set persistently: enforce_eager: true in
# configs/runner_configs/runner_nvidia_vllm_47f5d58e.yaml under suites.suite_F
# Single scenario
python runners/nvidia_vllm_47f5d58e/runner.py --suite suite_F --scenario offline

For runner-specific hardware compatibility details (including pre-Ampere guidance),
see runners/nvidia_vllm_47f5d58e/README.md.
Suite F does not enforce a single-chip constraint. Developers are free to run with TP > 1. However, Suite F is designed for single-chip consumer hardware — multi-chip results over PCIe will show poor scaling efficiency and reflect the interconnect bottleneck rather than GPU capability. For apples-to-apples consumer comparisons, submit single-chip results.
MoE multi-chip inference
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
Chips: auto — minimum 2×A100-80GB or 4×A100-40GB (~90GB BF16)
Precision: BF16
Suite G targets Mixture-of-Experts architectures. Mixtral-8x7B uses 8 experts per layer with top-2 routing (~47B total parameters, ~13B active per token). The model requires at least two datacenter GPUs due to its memory footprint.
required_chips is set to "auto" — runners use all available GPUs via
tensor parallelism. The leaderboard groups results by chip type and chip count
naturally, so no scaling sweep is needed (unlike Suite E).
Default scenarios match Suite A's offline + online workload at MoE scale.
Optional interactive and sustained are available via --scenario all.
concurrency_levels: [4, 16, 64] — lower than Suite A due to larger memory footprint
online_qps_levels: [2, 10, 40]
online_sla_ttft_ms: 500
request_count: 100 (offline)
online_request_count: 300
interactive_request_count: 150
num_runs: 3 + 1 warmup
sustained_concurrency: 8
duration_minutes: 30
sample_interval_seconds: 60
warmup_minutes: 2
Runners that expose MoE-specific statistics should override get_runtime_metrics()
to return expert routing data:

{
    "expert_load_balance": 0.12,     # std dev of expert activation frequency
    "mean_experts_per_token": 2.0    # mean number of experts activated per token
}

These are recorded in task.runtime_metrics and displayed on the leaderboard
but do not affect ranking.
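How a runner might derive these two values from raw routing counts is sketched below. The function name and its inputs are hypothetical, and the use of a population standard deviation over normalized activation frequencies is an assumption; the source only says "std dev of expert activation frequency".

```python
import statistics
from typing import Dict, Sequence


def moe_runtime_metrics(expert_activation_counts: Sequence[int],
                        total_tokens: int) -> Dict[str, float]:
    """Summarize expert routing from per-expert activation counts."""
    total_activations = sum(expert_activation_counts)
    # Share of all activations handled by each expert.
    frequencies = [c / total_activations for c in expert_activation_counts]
    return {
        "expert_load_balance": statistics.pstdev(frequencies),
        "mean_experts_per_token": total_activations / total_tokens,
    }


# Made-up, perfectly balanced top-2 routing over 8 experts and 1,000 tokens.
metrics_moe = moe_runtime_metrics([250] * 8, total_tokens=1000)
print(metrics_moe)
```

With perfectly balanced routing the load-balance value is 0.0; larger values indicate that some experts are used more often than others.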
# 2-GPU machine (A100-80GB)
python run.py --runner nvidia_vllm_47f5d58e --suite suite_G
# 4-GPU machine (A100-40GB)
python run.py --runner nvidia_vllm_47f5d58e --suite suite_G

The MMLU accuracy baseline for Mixtral-8x7B is pending: bf16_baseline_score
is set to null in schema/accuracy_baselines.json. Run the accuracy scenario
on 2×A100-80GB BF16 to establish the baseline before accepting community
submissions.
- Open a GitHub Issue using the Propose a new suite template
- Specify: model, chip count, scenarios, and rationale
- Discuss the proposal in the issue thread — interested contributors weigh in
- Create `suites/suite_X/suite.json` referencing a shared dataset (or add a new dataset to `datasets/`)
- If custom orchestration is needed, add `suites/suite_X/suite.py` (see DEVELOPMENT.md for the suite plugin interface)
- Submit a reference result on at least one chip before the suite appears on the main leaderboard
See DEVELOPMENT.md for the full guide.