
Commit 2d3a3f3

init (#1312)
1 parent 5ee0aac commit 2d3a3f3

2 files changed

Lines changed: 195 additions & 211 deletions

AGENTS.md

Lines changed: 97 additions & 209 deletions
@@ -9,42 +9,64 @@ InferenceX is an open-source, automated benchmarking system that continuously tr
99
## Directory Structure
1010

1111
```
12-
├── benchmarks/ # Shell scripts for running benchmarks
13-
│ ├── benchmark_lib.sh # Shared benchmarking/eval utilities
14-
│ ├── dsr1_*.sh # Deepseek R1-specific benchmark scripts
15-
│ └── gptoss_*.sh # gptoss-specific benchmark scripts
16-
├── runners/ # Launch scripts for different hardware
17-
│ ├── launch_b200/h100/h200-*.sh # NVIDIA launcher scripts
18-
│ └── launch_mi*.sh # AMD launcher scripts
19-
├── utils/ # Python utilities
20-
│ ├── matrix_logic/ # Config generation and validation
21-
│ │ ├── generate_sweep_configs.py # CLI for generating benchmark matrix
22-
│ │ ├── validation.py # Pydantic validation models
23-
│ │ └── test_*.py # Unit tests
24-
│ ├── bench_serving/ # Benchmark serving client (upstreamed from vLLM)
25-
│ │ ├── benchmark_serving.py # Main benchmark client script
26-
│ │ ├── backend_request_func.py # Backend-specific request functions
27-
│ │ └── benchmark_utils.py # Utility functions
28-
│ ├── evals/ # Eval task definitions for lm-eval
29-
│ │ ├── EVALS.md # Evals documentation
30-
│ │ ├── gsm8k.yaml
31-
│ │ └── gpqa_diamond.yaml
32-
│ ├── collect_eval_results.py # Aggregates eval results
33-
│ ├── process_result.py # Post-processes benchmark results
34-
│ ├── process_changelog.py # Processes perf-changelog.yaml
35-
│ └── summarize.py # Generates markdown summaries
36-
├── .github/
37-
│ ├── workflows/ # GitHub Actions CI/CD
38-
│ │ ├── run-sweep.yml # Main performance sweep
39-
│ │ ├── e2e-tests.yml # End-to-end testing
40-
│ │ ├── benchmark-tmpl.yml # Single-node benchmark job template
41-
│ │ ├── benchmark-multinode-tmpl.yml # Multi-node benchmark job template
42-
│ │ └── collect-evals.yml # Eval results collection
43-
│ └── configs/ # Master configuration files
44-
│ ├── nvidia-master.yaml
45-
│ ├── amd-master.yaml
46-
│ └── runners.yaml
47-
└── perf-changelog.yaml # Triggers benchmarks on changes
12+
.
13+
├─AGENTS.md # agent instructions
14+
├─perf-changelog.yaml # benchmark trigger log; append-only; preserve whitespace
15+
├─benchmarks/
16+
│ ├─benchmark_lib.sh # shared benchmark/eval/server helpers
17+
│ ├─single_node/ # single-node benchmark entrypoints
18+
│ │ ├─agentic/ # agentic benchmark scripts
19+
│ │ ├─chat_templates/ # model chat templates, e.g. DeepSeek-V4 thinking
20+
│ │ ├─*_mtp.sh # MTP/spec-decoding scripts
21+
│ │ └─*.sh # per model/precision/hardware/framework scripts
22+
│ └─multi_node/ # multinode benchmark entrypoints
23+
│ ├─agentic_srt.sh
24+
│ ├─amd_utils/ # AMD multinode Slurm/server/bench helpers
25+
│ │ ├─bench.sh
26+
│ │ ├─env.sh
27+
│ │ ├─job.slurm
28+
│ │ ├─models.yaml
29+
│ │ ├─server.sh
30+
│ │ ├─submit.sh
31+
│ │ └─sync.py
32+
│ ├─*_sglang-disagg.sh # SGLang disaggregated multinode scripts
33+
│ ├─*_dynamo-trt.sh # Dynamo/TensorRT multinode scripts
34+
│ └─srt-slurm-recipes/ # checked-in external recipe YAMLs
35+
│ ├─sglang/deepseek-v4/8k1k/
36+
│ └─vllm/deepseek-v4/8k1k/
37+
├─runners/ # hardware launcher scripts
38+
├─utils/
39+
│ ├─matrix_logic/ # benchmark matrix generation/validation/tests
40+
│ │ ├─generate_sweep_configs.py # full-sweep/test-config CLI
41+
│ │ ├─validation.py # Pydantic schemas
42+
│ │ ├─test_generate_sweep_configs.py
43+
│ │ └─test_validation.py
44+
│ ├─bench_serving/ # serving benchmark client
45+
│ │ ├─benchmark_serving.py
46+
│ │ ├─backend_request_func.py
47+
│ │ ├─benchmark_utils.py
48+
│ │ ├─encoding_dsv4.py
49+
│ │ └─KNOWN_LIMITATION.md
50+
│ ├─evals/ # lm-eval task configs and score validation
51+
│ │ ├─EVALS.md
52+
│ │ ├─gsm8k.yaml
53+
│ │ ├─gpqa_diamond.yaml
54+
│ │ ├─thresholds.json
55+
│ │ ├─utils.py
56+
│ │ └─validate_scores.py
57+
│ ├─agentic-benchmark/ # agentic benchmark collection/analysis helpers
58+
│ ├─trace-replay/ # trace replay utilities
59+
│ ├─constants.py
60+
│ ├─collect_results.py
61+
│ ├─collect_eval_results.py
62+
│ ├─compare_results.py
63+
│ ├─calc_success_rate.py
64+
│ ├─process_result.py # benchmark aggregation/normalization
65+
│ ├─process_agentic_result.py
66+
│ ├─process_changelog.py # perf-changelog parsing and trim_conc
67+
│ ├─summarize.py # markdown summary generation
68+
│ └─test_process_result.py
69+
└─experimental/ # non-core experiments
4870
```
4971

5072
## Terminology
@@ -54,13 +76,13 @@ InferenceX is an open-source, automated benchmarking system that continuously tr
5476

5577
## Key Technologies
5678

57-
- **Python 3.13**: Core automation and config generation
58-
- **Pydantic**: Configuration validation (V2 with strict mode)
59-
- **Bash**: Benchmark execution and infrastructure orchestration
60-
- **YAML**: Configuration files
61-
- **GitHub Actions**: CI/CD workflows
62-
- **Evals**: lm-eval validation of benchmark results
63-
- **pytest**: Testing framework
79+
- Python 3.13: Core automation and config generation
80+
- Pydantic: Configuration validation (V2 with strict mode)
81+
- Bash: Benchmark execution and infrastructure orchestration
82+
- YAML: Configuration files
83+
- GitHub Actions: CI/CD workflows
84+
- Evals: lm-eval validation of benchmark results
85+
- pytest: Testing framework
6486

6587
## Development Workflow
6688

@@ -110,15 +132,6 @@ python utils/summarize.py
110132

111133
When working with benchmark configurations, use these valid values:
112134

113-
**Models (model-prefix)**:
114-
- `dsr1` - DeepSeek-R1-0528
115-
- `dsv4` - DeepSeek-V4-Pro
116-
- `gptoss` - GPT-OSS-120B
117-
118-
**Precisions**:
119-
- `fp4`
120-
- `fp8`
121-
122135
**Frameworks**:
123136
- `sglang` - SGLang inference engine
124137
- `trt` - TensorRT-LLM
@@ -128,18 +141,6 @@ When working with benchmark configurations, use these valid values:
128141
- `dynamo-sglang` - NVIDIA Dynamo with SGLang backend
129142
- `sglang-disagg` - SGLang disaggregated inference
130143

131-
**Runners (NVIDIA)**:
132-
- `b200` - NVIDIA B200 GPU
133-
- `b200-trt` - NVIDIA B200 with TensorRT
134-
- `h100` - NVIDIA H100 GPU
135-
- `h200` - NVIDIA H200 GPU
136-
- `gb200` - NVIDIA GB200 (multi-node)
137-
138-
**Runners (AMD)**:
139-
- `mi300x` - AMD MI300X GPU
140-
- `mi325x` - AMD MI325X GPU
141-
- `mi355x` - AMD MI355X GPU
142-
143144
**Sequence Lengths (ISL/OSL)**:
144145
- `1k1k` - 1024 input / 1024 output
145146
- `8k1k` - 8192 input / 1024 output
@@ -177,18 +178,36 @@ When working with benchmark configurations, use these valid values:
177178

178179
PRs do **not** run the sweep automatically — `run-sweep.yml` is gated on a label. Pick exactly one of the two; setting both is rejected by the workflow.
179180

180-
| Label | Behavior | When to use |
181-
|-------|----------|-------------|
182-
| `sweep-enabled` | Runs the sweep with `--trim-conc`: each parallelism config is reduced to its single highest configured concurrency point. | Default for most PRs — validates the change runs end-to-end without consuming the full cluster. |
183-
| `full-sweep-enabled` | Runs the full intermediate concurrency sweep, identical to a push-to-main run. | Use when intermediate concurrency points actually matter for the PR (e.g., a recipe change expected to shift the throughput/latency curve, not just its endpoints). |
181+
`sweep-enabled` - Runs the sweep with `--trim-conc`: each parallelism config is reduced to its single highest configured concurrency point. Default for most PRs — validates the change runs end-to-end without consuming the full cluster.
182+
`full-sweep-enabled` - Runs the full intermediate concurrency sweep, identical to a push-to-main run. Use when intermediate concurrency points actually matter for the PR (e.g., a recipe change expected to shift the throughput/latency curve, not just its endpoints).
184183

185184
Notes:
186185
- The two labels are mutually exclusive — `run-sweep.yml`'s `setup` job fails fast with an explicit error if both are present.
187-
- Push-to-main always runs the full (untrimmed) sweep unless `[skip-sweep]` is in the commit message; the trim only applies to PR runs that opt in via `sweep-enabled`.
186+
- Push-to-main always runs the full untrimmed sweep unless `[skip-sweep]` is in the commit message; the trim only applies to PR runs that opt in via `sweep-enabled`.
188187
- The trimming logic lives in `trim_conc()` in `utils/process_changelog.py` — single-node entries are grouped by every non-`conc` field and only the highest-`conc` entry per group is kept; multi-node entries have their `conc` list collapsed to `[max(conc)]`.
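
The trimming rule in the last note can be sketched in Python. This is an illustrative reimplementation, not the code in `utils/process_changelog.py`: the function name `trim_conc_sketch`, the dict-shaped entries, and the convention that a list-valued `conc` marks a multi-node entry are all assumptions made for the example.

```python
def trim_conc_sketch(entries):
    """Keep only the highest concurrency point per config (hypothetical sketch)."""
    multi_node = []
    best_by_group = {}
    for entry in entries:
        conc = entry["conc"]
        if isinstance(conc, list):
            # Assumed multi-node shape: collapse the conc list to [max(conc)].
            multi_node.append({**entry, "conc": [max(conc)]})
            continue
        # Single-node: group by every field except conc.
        key = tuple(sorted((k, repr(v)) for k, v in entry.items() if k != "conc"))
        current = best_by_group.get(key)
        if current is None or conc > current["conc"]:
            best_by_group[key] = entry
    return multi_node + list(best_by_group.values())


if __name__ == "__main__":
    sample = [
        {"model": "dsr1", "tp": 8, "conc": 4},
        {"model": "dsr1", "tp": 8, "conc": 64},
        {"model": "dsr1", "tp": 8, "conc": [1, 8, 32]},
    ]
    for kept in trim_conc_sketch(sample):
        print(kept)
```

Grouping on a tuple of the remaining fields means two entries collapse only when every non-`conc` field matches, which is the behavior the note describes.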
189188

190189
## Common Tasks
191190

191+
### Dispatching jobs
192+
193+
When asked to do a run or a sweep, dispatch `e2e-tests.yml` with the `gh` CLI:
194+
```
195+
gh api -X POST \
196+
/repos/SemiAnalysisAI/InferenceX/actions/workflows/e2e-tests.yml/dispatches \
197+
-f ref='<ref>' \
198+
-f 'inputs[ref]=<input ref>' \
199+
-f 'inputs[test-name]=<name>' \
200+
-f 'inputs[generate-cli-command]=<command>'
201+
```
202+
Input meanings:
203+
204+
* ref: workflow ref to dispatch from; usually the branch containing the workflow.
205+
* inputs[ref]: checkout ref used by jobs and matrix generation.
206+
* inputs[test-name]: display name in GitHub Actions.
207+
* inputs[generate-cli-command]: arguments passed to utils/matrix_logic/generate_sweep_configs.py. Can be tested locally.
208+
209+
To monitor: `gh run watch <RUN_ID> --repo SemiAnalysisAI/InferenceX --exit-status`
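
For scripted dispatches, the same `gh api` call can be assembled as an argument list. This is a minimal sketch: `build_dispatch_command` and its parameter names are hypothetical helpers for illustration, and only the endpoint and the four `-f` fields come from the command shown in this section.

```python
import shlex


def build_dispatch_command(workflow_ref, checkout_ref, test_name,
                           generate_cli_command,
                           repo="SemiAnalysisAI/InferenceX"):
    """Return the gh argv for dispatching e2e-tests.yml (hypothetical helper)."""
    return [
        "gh", "api", "-X", "POST",
        f"/repos/{repo}/actions/workflows/e2e-tests.yml/dispatches",
        "-f", f"ref={workflow_ref}",                    # workflow ref to dispatch from
        "-f", f"inputs[ref]={checkout_ref}",            # checkout ref used by jobs
        "-f", f"inputs[test-name]={test_name}",         # display name in Actions
        "-f", f"inputs[generate-cli-command]={generate_cli_command}",
    ]


if __name__ == "__main__":
    cmd = build_dispatch_command(
        workflow_ref="main",
        checkout_ref="main",
        test_name="example-run",
        generate_cli_command=(
            "full-sweep --config-files .github/configs/nvidia-master.yaml "
            "--evals-only"
        ),
    )
    # Print the shell-quoted command for review; run it with
    # subprocess.run(cmd, check=True) once the values look right.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when the `generate-cli-command` value contains spaces.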
210+
192211
### Adding a New Benchmark Configuration
193212

194213
1. Add entry to `.github/configs/nvidia-master.yaml` or `amd-master.yaml`
@@ -314,38 +333,15 @@ When upgrading Docker images in benchmark scripts and master configs .yaml:
314333

315334
## Evals (Accuracy Validation)
316335

317-
Evals run optional accuracy checks to ensure model outputs aren't degraded by inference optimizations. They can run alongside benchmarks or independently in eval-only mode.
318-
319-
### When Evals Run
320-
321-
Evals run as **separate workflow jobs** from throughput benchmarks (eval-only mode). The `EVAL_ONLY` flag skips throughput benchmarking and only runs lm-eval.
322-
323-
**Single-node** eval selection:
324-
- All TPs at **highest concurrency** and **median concurrency** per (model, runner, framework, precision, ISL, OSL, spec-decoding, dp-attn)
325-
- Only on `8k1k` sequence length
336+
Evals are optional accuracy checks that ensure inference optimizations do not degrade model outputs. Keep detailed eval reference material in `utils/evals/EVALS.md`; this top-level file should only carry the essentials needed during routine agent runs.
326337

327-
**Multi-node** eval selection:
328-
- Entry with **highest max eligible concurrency** per (model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn)
329-
- Only `8k1k` sequence length
330-
- Eval runs at `eval-conc`, the upper median concurrency from the selected config
338+
Quick pointers:
339+
- Eval selection is marked by `mark_eval_entries()` in `utils/matrix_logic/generate_sweep_configs.py`.
340+
- Eval workflow jobs run separately from throughput jobs in eval-only mode (`EVAL_ONLY=true`).
341+
- Generate normal configs with eval markings by default, skip evals with `--no-evals`, or generate only eval jobs with `--evals-only`.
342+
- Benchmark/eval helpers live in `benchmarks/benchmark_lib.sh`; aggregated eval output is produced by `utils/collect_eval_results.py`.
331343

332-
This selection logic is in `mark_eval_entries()` in `utils/matrix_logic/generate_sweep_configs.py`.
333-
334-
**Workflow separation**: Eval jobs are independent from benchmark jobs:
335-
- `run-sweep.yml`: `sweep-evals` (single-node) and `sweep-multi-node-evals` (multi-node)
336-
- `e2e-tests.yml`: `test-sweep-evals` and `test-sweep-multi-node-evals`
337-
- Both use their respective benchmark templates with `eval-only: true`
338-
- `collect-evals` depends only on eval jobs, not benchmark jobs
339-
340-
**Multi-node eval infrastructure**:
341-
- AMD (MI355X): `server.sh` skips `bench.sh` when `EVAL_ONLY=true`, runs lm-eval directly
342-
- NVIDIA Slurm multi-node (GB200, GB300, B200, B300, H100, H200): srt-slurm invokes its `lm-eval` runner from `do_sweep.py` as a post/eval-only step using `INFMAX_WORKSPACE`
343-
344-
### Eval Framework: lm-eval
345-
346-
The default eval framework is [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (`lm-eval`).
347-
348-
### Running Evals via CLI
344+
### CLI
349345

350346
```bash
351347
# Generate configs (evals marked by default on 8k1k subset)
@@ -357,118 +353,12 @@ python utils/matrix_logic/generate_sweep_configs.py full-sweep \
357353
--config-files .github/configs/nvidia-master.yaml \
358354
--no-evals
359355
360-
# Generate ONLY the eval subset (excludes non-eval configs)
356+
# Generate only the eval subset (excludes non-eval configs)
361357
python utils/matrix_logic/generate_sweep_configs.py full-sweep \
362358
--config-files .github/configs/nvidia-master.yaml \
363359
--evals-only
364360
```
365361

366-
### Eval Integration in Benchmark Scripts
367-
368-
All benchmark scripts in `benchmarks/` follow one of two flows:
369-
370-
```bash
371-
# Combined mode (benchmark + eval):
372-
# 1. Start server (with --context-length expansion if EVAL_ONLY=true)
373-
# 2. wait_for_server_ready
374-
# 3. run_benchmark_serving (skipped automatically when EVAL_ONLY=true)
375-
# 4. Run evals:
376-
if [ "${RUN_EVAL}" = "true" ]; then
377-
run_eval --framework lm-eval --port "$PORT"
378-
append_lm_eval_summary # Writes meta_env.json and moves artifacts
379-
fi
380-
381-
# Eval-only mode (EVAL_ONLY=true):
382-
# 1. Compute eval context via compute_eval_context_length
383-
# 2. Start server with that context (--context-length or --max-model-len)
384-
# 3. wait_for_server_ready
385-
# 4. run_benchmark_serving returns immediately (skipped)
386-
# 5. run_eval + append_lm_eval_summary
387-
```
388-
389-
**Multi-node AMD** (`benchmarks/multi_node/amd_utils/server.sh`):
390-
- Skips `bench.sh` when `EVAL_ONLY=true`
391-
- Runs lm-eval via `run_eval` against the router on port 30000
392-
- Copies eval artifacts to `/run_logs/slurm_job-*/eval_results/`
393-
394-
**Multi-node NVIDIA Slurm** (GB200, GB300, B200, B300, H100, H200 via srt-slurm):
395-
- Uses the srt-slurm `lm-eval` runner as a post/eval-only step from `do_sweep.py`
396-
- Mounts the InferenceX checkout from `INFMAX_WORKSPACE` at `/infmax-workspace`
397-
- `lm-eval` runner sources `benchmark_lib.sh` from `/infmax-workspace`
398-
399-
### Key Eval Functions in `benchmarks/benchmark_lib.sh`
400-
401-
| Function | Description |
402-
|----------|-------------|
403-
| `run_eval` | Unified entrypoint - dispatches to framework-specific runner |
404-
| `run_lm_eval` | Runs lm-eval harness against the OpenAI-compatible endpoint |
405-
| `append_lm_eval_summary` | Writes `meta_env.json` and moves eval artifacts to workspace |
406-
| `_install_lm_eval_deps` | Installs lm-eval dependencies |
407-
| `_patch_lm_eval` | Patches lm-eval for reasoning tokens and TRT compatibility |
408-
| `compute_eval_context_length` | Computes eval context length (requested benchmark context, capped at model native max) |
409-
| `get_native_max_context_length` | Extracts model's native max context length from HF config |
410-
411-
### Eval Results Collection
412-
413-
Eval results are collected by `.github/workflows/collect-evals.yml`:
414-
415-
1. Downloads all `eval_*` artifacts
416-
2. Runs `utils/collect_eval_results.py` to aggregate results
417-
3. Outputs `agg_eval_<exp_name>.json` with all eval metrics
418-
4. Publishes summary table to GitHub Step Summary
419-
420-
### Fetching Eval Results
421-
422-
```bash
423-
# Download eval results artifact
424-
gh run download <RUN_ID> --repo SemiAnalysisAI/InferenceX -n eval_results_all -D ./evals
425-
426-
# View eval summary
427-
cat ./evals/agg_eval_all.json | jq -r '
428-
.[] | [.hw, .framework, .precision, .tp, .conc, .task, (.score * 100 | round | . / 100)]
429-
| @tsv' | column -t
430-
431-
# Filter to specific hardware
432-
cat ./evals/agg_eval_all.json | jq '[.[] | select(.hw == "B200")]'
433-
```
434-
435-
### Eval Metrics
436-
437-
| Field | Description |
438-
|-------|-------------|
439-
| `score` | Primary metric (exact match for GSM8K) |
440-
| `em_strict` | Strict exact match (requires `####` format) |
441-
| `em_flexible` | Flexible extraction (looser number matching) |
442-
| `n_eff` | Number of samples evaluated |
443-
| `task` | Eval task name (e.g., `gsm8k`) |
444-
445-
### Environment Variables for Evals
446-
447-
| Variable | Default | Description |
448-
|----------|---------|-------------|
449-
| `RUN_EVAL` | `false` | Enable eval after throughput benchmark |
450-
| `EVAL_ONLY` | `false` | Skip throughput, only run evals (set by workflow) |
451-
| `EVAL_FRAMEWORK` | `lm-eval` | Eval framework to use |
452-
| `EVAL_TASKS_DIR` | `utils/evals/gsm8k.yaml` | Path to lm-eval task YAML |
453-
| `EVAL_RESULT_DIR` | `/tmp/eval_out-*` | Output directory for eval results |
454-
| `EVAL_MAX_MODEL_LEN` | `16384` | Max context for eval (set by `compute_eval_context_length`) |
455-
| `EVAL_CONCURRENT_REQUESTS` | `64` | Concurrent requests during eval |
456-
457-
### Adding a New Eval Task
458-
459-
1. Create a task YAML in `utils/evals/` (follow lm-eval task format)
460-
2. Set `EVAL_TASKS_DIR=utils/evals/<your_task>.yaml` when running benchmarks
461-
3. Update `utils/collect_eval_results.py` if new metrics need extraction
462-
463-
### lm-eval Patches
464-
465-
The codebase includes patches for lm-eval compatibility (`_patch_lm_eval`):
466-
467-
1. **Reasoning token handling**: Extracts `reasoning_content` when `message.content` is empty
468-
2. **TRT compatibility**: Avoids injecting `{"type": "text"}` for non-HF tokenizers
469-
470-
These patches are applied via `sitecustomize.py` in `PYTHONPATH`.
471-
472362
## Key Files to Understand
473363

474364
- `utils/matrix_logic/validation.py` - Defines all configuration schemas
@@ -499,9 +389,7 @@ Markers available: `slow`, `integration`
499389

500390
## Fetching GitHub Actions Benchmark Results
501391

502-
When asked to analyze benchmark results from a GitHub Actions run URL, use the `gh` CLI.
503-
504-
### Commands
392+
When asked to analyze benchmark results from a GitHub Actions run:
505393
```bash
506394
# List artifacts for a run
507395
gh api /repos/SemiAnalysisAI/InferenceX/actions/runs/<RUN_ID>/artifacts --jq '.artifacts[].name'
