Consolidate --workers flag with auto mode, resilient barrier DNS, and docs (#53)

tlrmchlsmth · claude · web-flow · commit ddae7b79fbf0 · 2026-05-19T03:01:30.000Z
* Consolidate --num-workers and --sync into --workers with automatic load division

Replace two overlapping flags (--sync JSON and --num-workers) with a single
--workers N flag. When N > 1, barrier sync is enabled automatically and
concurrency/rate are divided across workers so config files always express
total desired load.

Breaking changes:
- --sync flag removed (barrier sync now implicit when --workers > 1)
- --num-workers renamed to --workers
- concurrency/rate in configs now mean total, not per-worker
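
A minimal sketch of the load division described above, assuming the
remainder-to-lower-index rule documented in the README (concurrency=10,
workers=3 gives 4, 3, 3). The names `shareOfConcurrency` and `shareOfRate`
are hypothetical and are not the repository's actual helpers:

```go
package main

import "fmt"

// shareOfConcurrency is a hypothetical helper: it returns the number of
// streams worker `workerID` runs when `total` streams are split across
// `nWorkers`. Remainder streams go to the lowest-indexed workers.
func shareOfConcurrency(total, nWorkers, workerID int) int {
	if nWorkers <= 1 {
		return total
	}
	share := total / nWorkers
	if workerID < total%nWorkers {
		share++
	}
	return share
}

// shareOfRate is a hypothetical helper: rates are floats, so an even
// split needs no remainder handling.
func shareOfRate(total float64, nWorkers int) float64 {
	if nWorkers <= 1 {
		return total
	}
	return total / float64(nWorkers)
}

func main() {
	for id := 0; id < 3; id++ {
		fmt.Printf("worker %d: concurrency=%d rate=%.2f\n",
			id, shareOfConcurrency(10, 3, id), shareOfRate(30, 3))
	}
}
```

The per-worker shares always sum to the configured total, which is what
lets a config file express total load regardless of the worker count.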

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

* Retry DNS NXDOMAIN in barrier instead of failing fast

In Kubernetes, headless service DNS records take a few seconds to
propagate after pod creation. NXDOMAIN is expected early on and
should be retried with backoff (like connection errors), not treated
as a permanent failure after 3 attempts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

* Document container image registry and PR image tags in README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

* Document --workers flag with auto load division and barrier sync

Update README and CLAUDE.md to replace old --sync '{"workers":N}'
references with the new --workers flag, and document automatic load
division behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

* Add --workers auto to compute worker count from max concurrency

--workers auto sets workers = ceil(max_concurrency / 1024) so each
worker handles at most 1024 concurrent streams. Config is parsed
before the kube branch so auto-resolution works for both local and
Kubernetes deploys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

* Address review: clarify auto workers sizing and fix builtin shadow

- Rename `max` variable in MaxConcurrency to avoid shadowing Go builtin
- Add rationale for the 1024 streams-per-worker threshold
- Document that --workers auto sizes for peak stage, which may leave
  workers idle during low-concurrency staircase stages
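
To make the peak-stage sizing concrete, here is a small worked sketch.
It assumes the ceil(max_concurrency / 1024) rule from this PR and the
remainder-to-lower-index split described in the README; `autoWorkers`,
`perWorker`, and the stage values are made up for illustration.

```go
package main

import "fmt"

// streamsPerWorker is the per-worker threshold named in the commit message.
const streamsPerWorker = 1024

// autoWorkers (hypothetical) sizes the worker count for the peak stage.
func autoWorkers(stages []int) int {
	peak := 0
	for _, c := range stages {
		if c > peak {
			peak = c
		}
	}
	if peak <= 0 {
		return 1
	}
	return (peak + streamsPerWorker - 1) / streamsPerWorker
}

// perWorker (hypothetical) splits one stage's total concurrency across
// nWorkers, giving remainder streams to the lowest-indexed workers.
func perWorker(total, nWorkers int) []int {
	shares := make([]int, nWorkers)
	for id := range shares {
		shares[id] = total / nWorkers
		if id < total%nWorkers {
			shares[id]++
		}
	}
	return shares
}

func main() {
	stages := []int{2, 64, 4096} // staircase: total concurrency per stage
	n := autoWorkers(stages)
	fmt.Println("workers:", n)
	for _, c := range stages {
		fmt.Println(c, perWorker(c, n))
	}
	// Prints:
	// workers: 4
	// 2 [1 1 0 0]
	// 64 [16 16 16 16]
	// 4096 [1024 1024 1024 1024]
}
```

In the first stage two of the four workers receive zero streams, which
is the idle-worker case the note above refers to.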

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

---------

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -36,7 +36,7 @@ go test ./... -v
 - **Client-side recording** — JSONL per worker. One line per completed request with timestamps, TTFT, per-token ITL array, token counts, status. Merging across workers = cat + sort.
 - **Timestamps** — JSON file per worker with start_time, rampup_end_time, end_time. Used to query Prometheus for server-side metrics.
 - **Mock server** — configurable TTFT, ITL, and output token count. Serves streaming SSE responses on `/v1/chat/completions`.
-- **Barrier sync** — multi-pod synchronization via HTTP barrier. Pod-0 (leader) runs a barrier server; all pods negotiate a common start time before measured stages. Configured via `--sync '{"workers":N}'` CLI flag and `barrier()` in Starlark DSL.
+- **Barrier sync** — multi-pod synchronization via HTTP barrier. Pod-0 (leader) runs a barrier server; all pods negotiate a common start time before measured stages. Enabled automatically when `--workers N` (N > 1); concurrency/rate are divided across workers so configs express total load. Use `barrier()` in Starlark DSL for explicit sync points.
 
 ## Deployment
 
@@ -48,13 +48,13 @@ just deploy my-bench http://vllm:8000/v1 config.json N_WORKERS=4
 
 ### Multi-pod synchronization
 
-Use `--sync` to synchronize benchmark start across pods:
+Use `--workers N` to enable barrier sync and automatic load division across pods:
 
 ```bash
-nyann-bench generate --config scenario.star --sync '{"workers":4,"timeout":"10m"}'
+nyann-bench generate --config scenario.star --workers 4 --worker-id 0
 ```
 
-An implicit `barrier()` is inserted before the first measured stage. In Starlark, use explicit `barrier()` for additional sync points:
+Concurrency and rate values in configs express **total** desired load — each worker gets its share automatically. An implicit `barrier()` is inserted before the first measured stage. In Starlark, use explicit `barrier()` for additional sync points:
 
 ```python
 scenario(
diff --git a/README.md b/README.md
@@ -72,22 +72,33 @@ scenario(
 
 Each goroutine stream can run multi-turn conversations, carrying real model responses forward into subsequent turns. This exercises server-side KV cache reuse (prefix caching) and produces realistic conversation-shaped traffic.
 
-### Synchronized multi-pod start
+### Synchronized multi-pod start with automatic load division
 
-When running across multiple pods, `--sync '{"workers":N}'` enables barrier synchronization. All pods negotiate a common start time via an HTTP barrier protocol — pod-0 (leader) runs the barrier server, workers discover it via `BARRIER_ADDR` (set automatically in the Job manifest to the leader pod's DNS name). Barriers are first-class in the Starlark DSL:
+When running across multiple pods, `--workers N` (where N > 1) enables barrier synchronization and automatically divides load across workers. Concurrency and rate values in config files always express the **total** desired load — each worker gets its fair share via integer division, with remainder distributed to lower-indexed workers (e.g. `concurrency=10, workers=3` → 4, 3, 3).
+
+```bash
+# Run with 4 workers — each gets 1/4 of the configured concurrency and rate
+nyann-bench generate --target http://vllm:8000/v1 --config scenario.star --workers 4 --worker-id 0
+```
+
+All pods negotiate a common start time via an HTTP barrier protocol — pod-0 (leader) runs the barrier server, workers discover it via `BARRIER_ADDR` (set automatically in the Job manifest to the leader pod's DNS name). An implicit barrier is inserted before the first measured stage. Barriers are first-class in the Starlark DSL:
 
 ```python
 scenario(
     stages=[
         stage("2m", concurrency=16, warmup=True),
-        barrier(),                                  # implicit one added automatically
+        # implicit barrier fires here — all workers sync before measured stages
         stage("5m", concurrency=64),
-        barrier(drain=True),                        # drain pool before workload switch
+        barrier(drain=True),  # explicit: drain pool before workload switch
         stage("5m", concurrency=64, workload=other),
     ],
 )
 ```
 
+With `--workers auto`, the worker count is `ceil(max_concurrency / 1024)`, sized for the **peak** stage. In staircase configs where concurrency ramps up across stages (e.g. `[4, 64, 512, 2048]`), early low-concurrency stages will have some workers with very few or zero streams. Use an explicit `--workers N` if you need tighter control.
+
+With `--workers 1` (the default), no barrier sync or load division occurs.
+
 ### Ramp-up and warmup
 
 A configurable warmup phase brings the server to steady state before measurement begins, and ramp-up staggers stream starts to avoid synchronized request patterns that would otherwise create artificial load spikes.
@@ -154,7 +165,7 @@ Merging across workers: `cat requests_*.jsonl`.
 just deploy my-benchmark http://vllm-server:8000/v1 config.star 8
 ```
 
-This creates a ConfigMap with your config and launches an Indexed Job with 8 pods. Each pod auto-detects its worker ID from `JOB_COMPLETION_INDEX` and the barrier server address from `BARRIER_ADDR`. Sync is enabled automatically via `--sync '{"workers":N}'` in the manifest.
+This creates a ConfigMap with your config and launches an Indexed Job with 8 pods. Each pod auto-detects its worker ID from `JOB_COMPLETION_INDEX` and the barrier server address from `BARRIER_ADDR`. The manifest passes `--workers N` so barrier sync and load division are enabled automatically.
 
 ## Installation
 
@@ -168,6 +179,21 @@ Or pull the container:
 docker pull ghcr.io/neuralmagic/nyann-bench:latest
 ```
 
+## Container images
+
+CI pushes multi-platform (`linux/amd64`, `linux/arm64`) images to GitHub Container Registry on every push to `main` and every pull request:
+
+| Event | Tag | Example |
+|-------|-----|---------|
+| Push to `main` | `latest`, `sha-<commit>` | `ghcr.io/neuralmagic/nyann-bench:latest` |
+| Pull request | `pr-<number>` | `ghcr.io/neuralmagic/nyann-bench:pr-47` |
+
+To use a PR image for testing:
+
+```bash
+docker pull ghcr.io/neuralmagic/nyann-bench:pr-47
+```
+
 ## Development
 
 ```bash
diff --git a/cmd/nyann-bench/generate.go b/cmd/nyann-bench/generate.go
@@ -25,7 +25,7 @@ func generateCmd() *cobra.Command {
 		cfgInput    string
 		outputDir   string
 		workerID    int
-		workers     int
+		workersFlag string
 		metricsAddr string
 		kubeFlags   kube.Flags
 	)
@@ -63,6 +63,20 @@ Workload types:
   corpus      Sliding window over real text files
   gsm8k       GSM8K math problems with streaming eval`,
 		RunE: func(cmd *cobra.Command, args []string) error {
+			// Parse config early — needed to resolve --workers auto
+			sc, err := config.Parse(cfgInput)
+			if err != nil {
+				return fmt.Errorf("config: %w", err)
+			}
+
+			workers, err := config.ResolveWorkers(workersFlag, sc.MaxConcurrency())
+			if err != nil {
+				return err
+			}
+			if workersFlag == "auto" {
+				slog.Info("Auto-resolved workers", "workers", workers, "max_concurrency", sc.MaxConcurrency())
+			}
+
 			if kubeFlags.IsEnabled(cmd) {
 				cfg, err := kubeFlags.ToConfig()
 				if err != nil {
@@ -91,12 +105,6 @@ Workload types:
 				}
 			}
 
-			// Parse config
-			sc, err := config.Parse(cfgInput)
-			if err != nil {
-				return fmt.Errorf("config: %w", err)
-			}
-
 			sc.Workers = workers
 			sc.WorkerID = workerID
 
@@ -163,7 +171,7 @@ Workload types:
 	cmd.Flags().StringVar(&cfgInput, "config", "{}", "Workload config (JSON file, inline JSON, or .star file)")
 	cmd.Flags().StringVar(&outputDir, "output-dir", "", "Directory for JSONL + timestamp files (omit for stdout-only)")
 	cmd.Flags().IntVar(&workerID, "worker-id", 0, "Worker identifier (for multi-container runs)")
-	cmd.Flags().IntVar(&workers, "workers", 1, "Total number of workers (enables barrier sync and divides load when > 1)")
+	cmd.Flags().StringVar(&workersFlag, "workers", "1", `Number of workers: integer or "auto" (auto = ceil(max_concurrency/1024))`)
 	cmd.Flags().StringVar(&metricsAddr, "metrics", "", "Prometheus metrics listen address (e.g. :9090)")
 
 	kube.RegisterFlags(cmd, &kubeFlags)
diff --git a/pkg/barrier/barrier_test.go b/pkg/barrier/barrier_test.go
@@ -204,26 +204,21 @@ func TestContextCancel(t *testing.T) {
 	}
 }
 
-func TestDNSFailureFails(t *testing.T) {
-	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+func TestDNSFailureRetriesUntilTimeout(t *testing.T) {
+	// DNS NXDOMAIN should be retried (Kubernetes headless service DNS
+	// takes time to propagate), not treated as a permanent failure.
+	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
 	defer cancel()
 
-	// Use a hostname that definitely won't resolve
 	addr := "this-host-does-not-exist-barrier-test.invalid:8080"
 
-	start := time.Now()
-	_, err := WaitForStart(ctx, addr, 0, 0, 2, 30*time.Second)
-	elapsed := time.Since(start)
+	_, err := WaitForStart(ctx, addr, 0, 0, 2, 3*time.Second)
 
 	if err == nil {
-		t.Fatal("expected error for unresolvable hostname, got nil")
+		t.Fatal("expected timeout error for unresolvable hostname, got nil")
 	}
-	if !strings.Contains(err.Error(), "DNS lookup failed") {
-		t.Errorf("expected DNS failure message, got: %v", err)
-	}
-	// Should fail within a few seconds, not wait for the full 30s timeout
-	if elapsed > 15*time.Second {
-		t.Errorf("DNS failure took %v — should fail fast, not wait for full timeout", elapsed)
+	if !strings.Contains(err.Error(), "timed out") {
+		t.Errorf("expected timeout error, got: %v", err)
 	}
 }
 
diff --git a/pkg/barrier/client.go b/pkg/barrier/client.go
@@ -56,20 +56,17 @@ func WaitForStart(ctx context.Context, addr string, workerID, barrierID, nWorker
 				return time.Time{}, fmt.Errorf("barrier: timed out waiting for barrier server at %s after %d attempts: %w", addr, attempt, ctx.Err())
 			}
 
-			// DNS NXDOMAIN — hostname doesn't exist, retrying won't help
+			// DNS NXDOMAIN — in Kubernetes, headless service DNS records
+			// take a few seconds to propagate after pod creation, so
+			// NXDOMAIN is expected early on. Keep retrying.
 			var dnsErr *net.DNSError
 			if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
 				dnsFailures++
-				if dnsFailures >= 3 {
-					host := addr
-					if h, _, splitErr := net.SplitHostPort(addr); splitErr == nil {
-						host = h
-					}
-					return time.Time{}, fmt.Errorf("barrier: DNS lookup failed for %q — "+
-						"the hostname does not resolve. For Kubernetes Indexed Jobs, BARRIER_ADDR must "+
-						"include the headless service name (e.g. <job>-0.<service>): %w", host, err)
+				if dnsFailures <= 3 {
+					slog.Warn("Barrier DNS lookup failed, retrying", "addr", addr, "attempt", dnsFailures, "error", err)
+				} else {
+					slog.Debug("Barrier DNS still not resolved", "addr", addr, "attempt", dnsFailures)
 				}
-				slog.Warn("Barrier DNS lookup failed, retrying", "addr", addr, "attempt", dnsFailures, "max_attempts", 3, "error", err)
 			} else if attempt <= 3 {
 				slog.Debug("Barrier server not ready, retrying", "error", err, "backoff", backoff)
 			} else {
diff --git a/pkg/config/config_test.go b/pkg/config/config_test.go
@@ -221,6 +221,58 @@ func TestDivideRate(t *testing.T) {
 	}
 }
 
+func TestMaxConcurrency(t *testing.T) {
+	sc := &config.ScenarioConfig{
+		Stages: []config.ScenarioStage{
+			{Concurrency: 16, Warmup: true},
+			{Concurrency: 64},
+			{Concurrency: 128},
+			{Concurrency: 32},
+		},
+	}
+	if got := sc.MaxConcurrency(); got != 128 {
+		t.Errorf("MaxConcurrency() = %d, want 128", got)
+	}
+
+	empty := &config.ScenarioConfig{}
+	if got := empty.MaxConcurrency(); got != 0 {
+		t.Errorf("MaxConcurrency() on empty = %d, want 0", got)
+	}
+}
+
+func TestResolveWorkers(t *testing.T) {
+	tests := []struct {
+		flag           string
+		maxConcurrency int
+		want           int
+		wantErr        bool
+	}{
+		{"1", 0, 1, false},
+		{"4", 0, 4, false},
+		{"auto", 0, 1, false},
+		{"auto", 1, 1, false},
+		{"auto", 1024, 1, false},
+		{"auto", 1025, 2, false},
+		{"auto", 2048, 2, false},
+		{"auto", 2049, 3, false},
+		{"auto", 4096, 4, false},
+		{"auto", 10000, 10, false},
+		{"0", 0, 0, true},
+		{"-1", 0, 0, true},
+		{"abc", 0, 0, true},
+	}
+	for _, tt := range tests {
+		got, err := config.ResolveWorkers(tt.flag, tt.maxConcurrency)
+		if (err != nil) != tt.wantErr {
+			t.Errorf("ResolveWorkers(%q, %d) error = %v, wantErr %v", tt.flag, tt.maxConcurrency, err, tt.wantErr)
+			continue
+		}
+		if got != tt.want {
+			t.Errorf("ResolveWorkers(%q, %d) = %d, want %d", tt.flag, tt.maxConcurrency, got, tt.want)
+		}
+	}
+}
+
 func TestInsertImplicitBarrier(t *testing.T) {
 	sc := &config.ScenarioConfig{
 		Stages: []config.ScenarioStage{
diff --git a/pkg/config/scenario.go b/pkg/config/scenario.go
@@ -1,6 +1,8 @@
 package config
 
 import (
+	"fmt"
+	"strconv"
 	"time"
 )
 
@@ -105,6 +107,38 @@ func DivideRate(total float64, nWorkers int) float64 {
 	return total / float64(nWorkers)
 }
 
+// MaxConcurrency returns the highest concurrency value across all stages.
+func (sc *ScenarioConfig) MaxConcurrency() int {
+	highest := 0
+	for _, s := range sc.Stages {
+		if s.Concurrency > highest {
+			highest = s.Concurrency
+		}
+	}
+	return highest
+}
+
+// ResolveWorkers converts a --workers flag value to an integer.
+// "auto" computes ceil(maxConcurrency / 1024) so each worker handles at most
+// 1024 concurrent streams — beyond that, goroutine scheduling overhead and
+// per-connection memory become significant on a single pod.
+func ResolveWorkers(flag string, maxConcurrency int) (int, error) {
+	if flag == "auto" {
+		if maxConcurrency <= 0 {
+			return 1, nil
+		}
+		return (maxConcurrency + 1023) / 1024, nil
+	}
+	n, err := strconv.Atoi(flag)
+	if err != nil {
+		return 0, fmt.Errorf("--workers must be a positive integer or \"auto\", got %q", flag)
+	}
+	if n < 1 {
+		return 0, fmt.Errorf("--workers must be >= 1, got %d", n)
+	}
+	return n, nil
+}
+
 // InsertImplicitBarrier adds a barrier before all stages so workers sync
 // before warmup begins. This is called when --workers > 1 to ensure a sync
 // point even without explicit barrier() calls.