│ ├─process_changelog.py # perf-changelog parsing and trim_conc
+│ ├─summarize.py # markdown summary generation
+│ └─test_process_result.py
+└─experimental/ # non-core experiments
 ```

 ## Terminology
@@ -54,13 +76,13 @@ InferenceX is an open-source, automated benchmarking system that continuously tr
 ## Key Technologies

-- **Python 3.13**: Core automation and config generation
-- **Pydantic**: Configuration validation (V2 with strict mode)
-- **Bash**: Benchmark execution and infrastructure orchestration
-- **YAML**: Configuration files
-- **GitHub Actions**: CI/CD workflows
-- **Evals**: lm-eval validation of benchmark results
-- **pytest**: Testing framework
+- Python 3.13: Core automation and config generation
+- Pydantic: Configuration validation (V2 with strict mode)
+- Bash: Benchmark execution and infrastructure orchestration
+- YAML: Configuration files
+- GitHub Actions: CI/CD workflows
+- Evals: lm-eval validation of benchmark results
+- pytest: Testing framework

 ## Development Workflow
@@ -110,15 +132,6 @@ python utils/summarize.py
 When working with benchmark configurations, use these valid values:

-**Models (model-prefix)**:
-- `dsr1` - DeepSeek-R1-0528
-- `dsv4` - DeepSeek-V4-Pro
-- `gptoss` - GPT-OSS-120B
-
-**Precisions**:
-- `fp4`
-- `fp8`
-
 **Frameworks**:
 - `sglang` - SGLang inference engine
 - `trt` - TensorRT-LLM
@@ -128,18 +141,6 @@ When working with benchmark configurations, use these valid values:
 - `dynamo-sglang` - NVIDIA Dynamo with SGLang backend
 - `sglang-disagg` - SGLang disaggregated inference

-**Runners (NVIDIA)**:
-- `b200` - NVIDIA B200 GPU
-- `b200-trt` - NVIDIA B200 with TensorRT
-- `h100` - NVIDIA H100 GPU
-- `h200` - NVIDIA H200 GPU
-- `gb200` - NVIDIA GB200 (multi-node)
-
-**Runners (AMD)**:
-- `mi300x` - AMD MI300X GPU
-- `mi325x` - AMD MI325X GPU
-- `mi355x` - AMD MI355X GPU
-
 **Sequence Lengths (ISL/OSL)**:
 - `1k1k` - 1024 input / 1024 output
 - `8k1k` - 8192 input / 1024 output
@@ -177,18 +178,36 @@ When working with benchmark configurations, use these valid values:
 PRs do **not** run the sweep automatically — `run-sweep.yml` is gated on a label. Pick exactly one of the two; setting both is rejected by the workflow.

-| Label | Behavior | When to use |
-|-------|----------|-------------|
-| `sweep-enabled` | Runs the sweep with `--trim-conc`: each parallelism config is reduced to its single highest configured concurrency point. | Default for most PRs — validates the change runs end-to-end without consuming the full cluster. |
-| `full-sweep-enabled` | Runs the full intermediate concurrency sweep, identical to a push-to-main run. | Use when intermediate concurrency points actually matter for the PR (e.g., a recipe change expected to shift the throughput/latency curve, not just its endpoints). |
+`sweep-enabled` - Runs the sweep with `--trim-conc`: each parallelism config is reduced to its single highest configured concurrency point. Default for most PRs — validates the change runs end-to-end without consuming the full cluster.
+`full-sweep-enabled` - Runs the full intermediate concurrency sweep, identical to a push-to-main run. Use when intermediate concurrency points actually matter for the PR (e.g., a recipe change expected to shift the throughput/latency curve, not just its endpoints).

 Notes:
 - The two labels are mutually exclusive — `run-sweep.yml`'s `setup` job fails fast with an explicit error if both are present.
-- Push-to-main always runs the full (untrimmed) sweep unless `[skip-sweep]` is in the commit message; the trim only applies to PR runs that opt in via `sweep-enabled`.
+- Push-to-main always runs the full untrimmed sweep unless `[skip-sweep]` is in the commit message; the trim only applies to PR runs that opt in via `sweep-enabled`.
 - The trimming logic lives in `trim_conc()` in `utils/process_changelog.py` — single-node entries are grouped by every non-`conc` field and only the highest-`conc` entry per group is kept; multi-node entries have their `conc` list collapsed to `[max(conc)]`.
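
The trimming rule in the last note can be illustrated with a short sketch. This is not the code in `utils/process_changelog.py`. It assumes entries are plain dicts, that a multi-node entry is one whose `conc` is a list, and that a single-node entry carries a scalar `conc`. The function name `trim_conc_sketch` and the `model`/`tp` fields in the sample data are hypothetical.

```python
from typing import Any


def trim_conc_sketch(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Illustrative trim: keep only the highest concurrency point per config."""
    multi_node: list[dict[str, Any]] = []
    best_by_group: dict[tuple, dict[str, Any]] = {}

    for entry in entries:
        conc = entry["conc"]
        if isinstance(conc, list):
            # Assumed multi-node shape: collapse the list to [max(conc)].
            multi_node.append({**entry, "conc": [max(conc)]})
            continue
        # Assumed single-node shape: group by every field except `conc`.
        key = tuple(sorted((k, repr(v)) for k, v in entry.items() if k != "conc"))
        current = best_by_group.get(key)
        if current is None or conc > current["conc"]:
            best_by_group[key] = entry

    # Multi-node entries first, then one single-node entry per group
    # in first-seen group order (this ordering is a choice of the sketch).
    return multi_node + list(best_by_group.values())


# Hypothetical sample data, not taken from the real master configs.
sample = [
    {"model": "dsr1", "tp": 8, "conc": 4},
    {"model": "dsr1", "tp": 8, "conc": 64},
    {"model": "dsr1", "tp": 4, "conc": 16},
    {"model": "dsr1", "tp": 16, "conc": [8, 32, 128]},
]
trimmed = trim_conc_sketch(sample)
```

Here the two `tp=8` entries share every non-`conc` field, so only the `conc=64` entry survives, and the multi-node list becomes `[128]`.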
* ref: workflow ref to dispatch from; usually the branch containing the workflow.
+* inputs[ref]: checkout ref used by jobs and matrix generation.
+* inputs[test-name]: display name in GitHub Actions.
+* inputs[generate-cli-command]: arguments passed to utils/matrix_logic/generate_sweep_configs.py. Can be tested locally.
+
+To monitor: `gh run watch <RUN_ID> --repo SemiAnalysisAI/InferenceX --exit-status`
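
The `ref` and `inputs[...]` fields listed above have the shape of a GitHub workflow-dispatch request body. The sketch below assumes the dispatch goes through GitHub's REST "create a workflow dispatch event" endpoint, which takes a top-level `ref` plus an `inputs` object. The actual invocation is not visible in this excerpt, so the helper name `build_dispatch_payload`, the branch name, the test name, and the placeholder argument string are all hypothetical.

```python
import json


def build_dispatch_payload(
    workflow_ref: str,
    checkout_ref: str,
    test_name: str,
    generate_cli_command: str,
) -> dict:
    """Assemble a workflow-dispatch body from the four documented fields."""
    return {
        # Ref the workflow itself is dispatched from.
        "ref": workflow_ref,
        "inputs": {
            # Ref checked out by jobs and used for matrix generation.
            "ref": checkout_ref,
            # Display name shown in GitHub Actions.
            "test-name": test_name,
            # Arguments forwarded to generate_sweep_configs.py.
            "generate-cli-command": generate_cli_command,
        },
    }


payload = build_dispatch_payload(
    workflow_ref="my-feature-branch",   # hypothetical branch
    checkout_ref="my-feature-branch",   # hypothetical branch
    test_name="example dispatch",       # hypothetical display name
    generate_cli_command="<arguments for generate_sweep_configs.py>",
)
body = json.dumps(payload, indent=2)
```

Building the payload as a dict first makes it easy to check the `generate-cli-command` string locally before sending anything.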
+
 ### Adding a New Benchmark Configuration

 1. Add entry to `.github/configs/nvidia-master.yaml` or `amd-master.yaml`
@@ -314,38 +333,15 @@ When upgrading Docker images in benchmark scripts and master configs .yaml:
 ## Evals (Accuracy Validation)

-Evals run optional accuracy checks to ensure model outputs aren't degraded by inference optimizations. They can run alongside benchmarks or independently in eval-only mode.
-
-### When Evals Run
-
-Evals run as **separate workflow jobs** from throughput benchmarks (eval-only mode). The `EVAL_ONLY` flag skips throughput benchmarking and only runs lm-eval.
-
-**Single-node** eval selection:
-- All TPs at **highest concurrency** and **median concurrency** per (model, runner, framework, precision, ISL, OSL, spec-decoding, dp-attn)
-- Only on `8k1k` sequence length
+Evals are optional accuracy checks that ensure inference optimizations do not degrade model outputs. Keep detailed eval reference material in `utils/evals/EVALS.md`; this top-level file should only carry the essentials needed during routine agent runs.

-**Multi-node** eval selection:
-- Entry with **highest max eligible concurrency** per (model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn)
-- Only `8k1k` sequence length
-- Eval runs at `eval-conc`, the upper median concurrency from the selected config
+Quick pointers:
+- Eval selection is marked by `mark_eval_entries()` in `utils/matrix_logic/generate_sweep_configs.py`.
+- Eval workflow jobs run separately from throughput jobs in eval-only mode (`EVAL_ONLY=true`).
+- Generate normal configs with eval markings by default, skip evals with `--no-evals`, or generate only eval jobs with `--evals-only`.
+- Benchmark/eval helpers live in `benchmarks/benchmark_lib.sh`; aggregated eval output is produced by `utils/collect_eval_results.py`.

-This selection logic is in `mark_eval_entries()` in `utils/matrix_logic/generate_sweep_configs.py`.
-
-**Workflow separation**: Eval jobs are independent from benchmark jobs:
-- `run-sweep.yml`: `sweep-evals` (single-node) and `sweep-multi-node-evals` (multi-node)
-- Both use their respective benchmark templates with `eval-only: true`
-- `collect-evals` depends only on eval jobs, not benchmark jobs
-
-**Multi-node eval infrastructure**:
-- AMD (MI355X): `server.sh` skips `bench.sh` when `EVAL_ONLY=true`, runs lm-eval directly
-- NVIDIA Slurm multi-node (GB200, GB300, B200, B300, H100, H200): srt-slurm invokes its `lm-eval` runner from `do_sweep.py` as a post/eval-only step using `INFMAX_WORKSPACE`
-
-### Eval Framework: lm-eval
-
-The default eval framework is [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (`lm-eval`).
-
-### Running Evals via CLI
+### CLI

 ```bash
 # Generate configs (evals marked by default on 8k1k subset)