Consolidate --workers flag with auto mode, resilient barrier DNS, and docs (#53)
* Consolidate --num-workers and --sync into --workers with automatic load division
Replace two overlapping flags (--sync JSON and --num-workers) with a single
--workers N flag. When N > 1, barrier sync is enabled automatically and
concurrency/rate are divided across workers so config files always express
total desired load.
Breaking changes:
- --sync flag removed (barrier sync now implicit when --workers > 1)
- --num-workers renamed to --workers
- concurrency/rate in configs now mean total, not per-worker
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
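The load division described above can be illustrated with a short sketch. It assumes the rule documented in the README change in this commit (integer division, remainder handed to the lowest-indexed workers); the function name `workerShare` is hypothetical and is not taken from the repository. Only integer totals are shown, since the commit does not say how fractional rates are split.

```go
package main

import "fmt"

// workerShare returns the part of a total (for example, concurrency) that the
// worker at the given zero-based index would receive: total/workers for
// everyone, plus one extra unit for each of the first total%workers workers.
// Hypothetical helper for illustration; workers must be greater than zero.
func workerShare(total, workers, index int) int {
	share := total / workers
	if index < total%workers {
		share++
	}
	return share
}

func main() {
	// Mirrors the README example: concurrency=10, workers=3.
	for i := 0; i < 3; i++ {
		fmt.Printf("worker %d: concurrency %d\n", i, workerShare(10, 3, i))
	}
}
```

The shares always add up to the configured total, which is what lets a config file express total load regardless of the worker count.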
* Retry DNS NXDOMAIN in barrier instead of failing fast
In Kubernetes, headless service DNS records take a few seconds to
propagate after pod creation. NXDOMAIN is expected early on and
should be retried with backoff (like connection errors), not treated
as a permanent failure after 3 attempts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
* Document container image registry and PR image tags in README
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
* Document --workers flag with auto load division and barrier sync
Update README and CLAUDE.md to replace old --sync '{"workers":N}'
references with the new --workers flag, and document automatic load
division behavior.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
* Add --workers auto to compute worker count from max concurrency
--workers auto sets workers = ceil(max_concurrency / 1024) so each
worker handles at most 1024 concurrent streams. Config is parsed
before the kube branch so auto-resolution works for both local and
Kubernetes deploys.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
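The sizing rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's implementation: the names `maxConcurrency` and `autoWorkers` are hypothetical, stages are reduced to a plain slice of concurrency values, and the floor of one worker for very small configs is an assumption.

```go
package main

import "fmt"

// streamsPerWorker is the per-worker cap stated in the commit message.
const streamsPerWorker = 1024

// maxConcurrency returns the largest concurrency across all stages.
func maxConcurrency(stages []int) int {
	peak := 0
	for _, c := range stages {
		if c > peak {
			peak = c
		}
	}
	return peak
}

// autoWorkers computes ceil(peak / streamsPerWorker) with integer arithmetic.
// Returning at least one worker is an assumption made for this sketch.
func autoWorkers(peak int) int {
	if peak <= streamsPerWorker {
		return 1
	}
	return (peak + streamsPerWorker - 1) / streamsPerWorker
}

func main() {
	stages := []int{4, 64, 512, 2048}
	peak := maxConcurrency(stages)
	fmt.Printf("peak=%d workers=%d\n", peak, autoWorkers(peak))
}
```

For the staircase `[4, 64, 512, 2048]` the peak is 2048, so this rule gives 2 workers.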
* Address review: clarify auto workers sizing and fix builtin shadow
- Rename `max` variable in MaxConcurrency to avoid shadowing Go builtin
- Add rationale for the 1024 streams-per-worker threshold
- Document that --workers auto sizes for peak stage, which may leave
workers idle during low-concurrency staircase stages
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
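To make the idle-worker caveat concrete, the sketch below counts how many workers would receive zero streams at a given stage. It assumes the division rule from the README change (integer division, remainder to lower-indexed workers), under which a worker is idle only when the stage's concurrency is smaller than the worker count. The function name `idleWorkers` and the stage values are hypothetical.

```go
package main

import "fmt"

// idleWorkers returns how many of the workers get zero streams when a stage's
// total concurrency is split with the remainder going to the lowest-indexed
// workers. Every worker gets at least one stream once concurrency >= workers.
func idleWorkers(concurrency, workers int) int {
	if concurrency >= workers {
		return 0
	}
	return workers - concurrency
}

func main() {
	// Hypothetical config: a 4096-stream peak would size to 4 workers under
	// the ceil(max_concurrency / 1024) rule.
	const workers = 4
	for _, c := range []int{2, 64, 4096} {
		fmt.Printf("concurrency=%d: %d of %d workers idle\n", c, idleWorkers(c, workers), workers)
	}
}
```

Only the first stage leaves workers idle here (2 of 4), which is the case where an explicit `--workers N` gives tighter control.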
---------
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CLAUDE.md: 4 additions & 4 deletions
@@ -36,7 +36,7 @@ go test ./... -v
 - **Client-side recording** — JSONL per worker. One line per completed request with timestamps, TTFT, per-token ITL array, token counts, status. Merging across workers = cat + sort.
 - **Timestamps** — JSON file per worker with start_time, rampup_end_time, end_time. Used to query Prometheus for server-side metrics.
 - **Mock server** — configurable TTFT, ITL, and output token count. Serves streaming SSE responses on `/v1/chat/completions`.
-- **Barrier sync** — multi-pod synchronization via HTTP barrier. Pod-0 (leader) runs a barrier server; all pods negotiate a common start time before measured stages. Configured via `--sync '{"workers":N}'` CLI flag and `barrier()` in Starlark DSL.
+- **Barrier sync** — multi-pod synchronization via HTTP barrier. Pod-0 (leader) runs a barrier server; all pods negotiate a common start time before measured stages. Enabled automatically when `--workers N` (N > 1); concurrency/rate are divided across workers so configs express total load. Use `barrier()` in Starlark DSL for explicit sync points.
 
 ## Deployment
 
@@ -48,13 +48,13 @@ just deploy my-bench http://vllm:8000/v1 config.json N_WORKERS=4
 
 ### Multi-pod synchronization
 
-Use `--sync` to synchronize benchmark start across pods:
+Use `--workers N` to enable barrier sync and automatic load division across pods:
-An implicit `barrier()` is inserted before the first measured stage. In Starlark, use explicit `barrier()` for additional sync points:
+Concurrency and rate values in configs express **total** desired load — each worker gets its share automatically. An implicit `barrier()` is inserted before the first measured stage. In Starlark, use explicit `barrier()` for additional sync points:
README.md: 31 additions & 5 deletions
@@ -72,22 +72,33 @@ scenario(
 
 Each goroutine stream can run multi-turn conversations, carrying real model responses forward into subsequent turns. This exercises server-side KV cache reuse (prefix caching) and produces realistic conversation-shaped traffic.
 
-### Synchronized multi-pod start
+### Synchronized multi-pod start with automatic load division
 
-When running across multiple pods, `--sync '{"workers":N}'` enables barrier synchronization. All pods negotiate a common start time via an HTTP barrier protocol — pod-0 (leader) runs the barrier server, workers discover it via `BARRIER_ADDR` (set automatically in the Job manifest to the leader pod's DNS name). Barriers are first-class in the Starlark DSL:
+When running across multiple pods, `--workers N` (where N > 1) enables barrier synchronization and automatically divides load across workers. Concurrency and rate values in config files always express the **total** desired load — each worker gets its fair share via integer division, with remainder distributed to lower-indexed workers (e.g. `concurrency=10, workers=3` → 4, 3, 3).
+
+```bash
+# Run with 4 workers — each gets 1/4 of the configured concurrency and rate
+All pods negotiate a common start time via an HTTP barrier protocol — pod-0 (leader) runs the barrier server, workers discover it via `BARRIER_ADDR` (set automatically in the Job manifest to the leader pod's DNS name). An implicit barrier is inserted before the first measured stage. Barriers are first-class in the Starlark DSL:
 
 ```python
 scenario(
     stages=[
         stage("2m", concurrency=16, warmup=True),
-        barrier(), # implicit one added automatically
+        # implicit barrier fires here — all workers sync before measured stages
         stage("5m", concurrency=64),
-        barrier(drain=True), # drain pool before workload switch
+        barrier(drain=True), # explicit: drain pool before workload switch
         stage("5m", concurrency=64, workload=other),
     ],
 )
 ```
 
+With `--workers auto`, the worker count is `ceil(max_concurrency / 1024)`, sized for the **peak** stage. In staircase configs where concurrency ramps up across stages (e.g. `[4, 64, 512, 2048]`), early low-concurrency stages will have some workers with very few or zero streams. Use an explicit `--workers N` if you need tighter control.
+
+With `--workers 1` (the default), no barrier sync or load division occurs.
+
 ### Ramp-up and warmup
 
 A configurable warmup phase brings the server to steady state before measurement begins, and ramp-up staggers stream starts to avoid synchronized request patterns that would otherwise create artificial load spikes.
@@ -154,7 +165,7 @@ Merging across workers: `cat requests_*.jsonl`.
 just deploy my-benchmark http://vllm-server:8000/v1 config.star 8
 ```
 
-This creates a ConfigMap with your config and launches an Indexed Job with 8 pods. Each pod auto-detects its worker ID from `JOB_COMPLETION_INDEX` and the barrier server address from `BARRIER_ADDR`. Sync is enabled automatically via `--sync '{"workers":N}'` in the manifest.
+This creates a ConfigMap with your config and launches an Indexed Job with 8 pods. Each pod auto-detects its worker ID from `JOB_COMPLETION_INDEX` and the barrier server address from `BARRIER_ADDR`. The manifest passes `--workers N` so barrier sync and load division are enabled automatically.