feat(worker): bounded scan-concurrency pool (scan N hosts at once) (#592)

remyluslosius · web-flow · commit e2673b801fa5 · 2026-06-17T12:59:53.000-04:00

The in-process worker drained the scan queue strictly one job at a time:
dequeue -> run the full ~11-min scan -> dequeue next. A fleet sweep was
therefore serial (96 hosts x 11 min ~= 17 h), even though the design already
supports cross-host concurrency (the per-host advisory lock + SKIP LOCKED).

Run up to [server].scan_concurrency claim/process loops at once (default 4,
clamped to >= 1). Each loop claims a disjoint job via SKIP LOCKED, so N distinct
hosts scan in parallel; the per-host advisory lock still serializes same-host
work, so no new locking is needed. At concurrency 8 a 96-host sweep drops from
~17 h to ~2 h. Within each host the single-connection/many-sessions model is
unchanged.

Spec system-job-queue v1.1.0 (C-07/AC-12). Test: a DB-backed integration test
proves exactly N run at once and the (N+1)th waits (it fails if the loop is
forced serial), plus a clamp unit test.

Also corrects SCALING_GUIDE.md, which wrongly claimed 'serve does not drain the
scan-job queue' (it does, via WithScanProcessor), and documents the knob in the
shipped openwatch.toml.
diff --git a/docs/guides/SCALING_GUIDE.md b/docs/guides/SCALING_GUIDE.md
@@ -17,22 +17,44 @@ OpenWatch has two long-lived processes and one database:
 
 | Component | What it does | How you scale it today |
 |-----------|--------------|------------------------|
-| `openwatch serve` | HTTPS API + embedded UI + in-process schedulers (liveness, intelligence, discovery) | Vertical: more CPU/RAM on the host. The process is stateless apart from PostgreSQL, so horizontal API replicas are possible in principle but not yet packaged (see "Not yet implemented"). |
-| `openwatch worker` | Drains the PostgreSQL scan-job queue and runs Kensa scans over SSH | Run more `openwatch worker` processes against the same database. The queue uses `SELECT ... FOR UPDATE SKIP LOCKED`, so multiple workers cooperate without double-claiming a job. |
+| `openwatch serve` | HTTPS API + embedded UI + in-process schedulers (liveness, intelligence, discovery) **and an in-process worker that drains the scan-job queue** | Raise `[server].scan_concurrency` (how many scans run at once in this process); then vertical CPU/RAM. Stateless apart from PostgreSQL. |
+| `openwatch worker` | An **optional, additional** process that also drains the scan-job queue and runs Kensa scans over SSH | Run one or more for extra/off-box capacity. The queue uses `SELECT ... FOR UPDATE SKIP LOCKED`, so the serve worker and any `openwatch worker` processes cooperate without double-claiming a job. |
 | PostgreSQL | All state: hosts, scans, transactions, audit events, queue | Vertical first (CPU, RAM, faster disk), then tune `max_connections` and the OpenWatch pool size. |
 
-The scan worker is a separate process from the API server. `openwatch serve`
-runs the liveness, intelligence, and discovery schedulers in-process, but it
-does **not** drain the scan-job queue. You must run `openwatch worker`
-separately for scans to execute. Verify the split in `cmd/openwatch/main.go`
-(`cmdServe`) and `cmd/openwatch/worker.go` (`cmdWorker`).
+`openwatch serve` runs an in-process worker that **does** drain the scan-job
+queue — the single-binary deployment scans with no extra process. By default it
+runs **`scan_concurrency` (4) scans concurrently** (`internal/worker/worker.go`,
+wired in `internal/server/server.go`). A separate `openwatch worker` is
+optional, for additional or off-box capacity.
 
 ## Scaling the scan workers
 
 Scans are the most resource-intensive work OpenWatch does: each one opens an SSH
 session to a target host and runs Kensa's native YAML checks. Worker throughput
 is the usual first bottleneck.
 
+### Scan concurrency (the first knob to turn)
+
+The in-process worker runs `[server].scan_concurrency` scan loops at once
+(default `4`). Each loop independently claims a job with `SKIP LOCKED`, so up to
+that many **different hosts** scan in parallel; a per-host advisory lock still
+prevents two scans of the **same** host from overlapping. This is the simplest
+way to clear a large queue — one config value, no extra processes:
+
+```toml
+# /etc/openwatch/openwatch.toml
+[server]
+scan_concurrency = 8
+```
+
+Restart `openwatch` to apply. Sizing: scans are SSH/IO-bound (they spend most of
+their time waiting on the remote host), so concurrency can comfortably exceed
+CPU core count. Mind two ceilings — the PostgreSQL pool (`[database].max_connections`
+/ pool size: each in-flight scan uses a connection plus the advisory-lock
+transaction) and how many simultaneous SSH sessions your targets and network
+tolerate. `8`–`16` is a reasonable range for a few dozen to a few hundred hosts;
+set it to `1` to restore strictly one-at-a-time draining.
+
 ### Run more worker processes
 
 The scan queue is PostgreSQL-native and claims one job at a time per worker with
diff --git a/internal/config/config.go b/internal/config/config.go
@@ -46,6 +46,11 @@ type ServerConfig struct {
 	Listen  string `toml:"listen"`
 	TLSCert string `toml:"tls_cert"`
 	TLSKey  string `toml:"tls_key"`
+	// ScanConcurrency is how many scan/job loops the in-process worker runs
+	// at once. The PostgreSQL queue (SKIP LOCKED) and the per-host advisory
+	// lock make concurrent draining safe; this only bounds the fan-out. Set
+	// to 1 to restore strictly-serial draining. Default 4.
+	ScanConcurrency int `toml:"scan_concurrency"`
 }
 
 // DatabaseConfig governs the PostgreSQL connection.
@@ -65,9 +70,10 @@ type LoggingConfig struct {
 func Defaults() *Config {
 	return &Config{
 		Server: ServerConfig{
-			Listen:  "0.0.0.0:8443",
-			TLSCert: "/etc/openwatch/tls/cert.pem",
-			TLSKey:  "/etc/openwatch/tls/key.pem",
+			Listen:          "0.0.0.0:8443",
+			TLSCert:         "/etc/openwatch/tls/cert.pem",
+			TLSKey:          "/etc/openwatch/tls/key.pem",
+			ScanConcurrency: 4,
 		},
 		Database: DatabaseConfig{
 			DSN:            "postgres://openwatch@localhost/openwatch?sslmode=disable",
diff --git a/internal/server/server.go b/internal/server/server.go
@@ -303,12 +303,14 @@ func New(cfg *config.Config, pool *pgxpool.Pool) *Server {
 		},
 	}
 
-	// Stage-0 in-process worker that drains diagnostics.test_job from the
-	// queue. Started by Run, stopped on shutdown. Spec
+	// In-process worker that drains the job queue (diagnostics, discovery,
+	// and — via WithScanProcessor — scans). Started by Run, stopped on
+	// shutdown. ScanConcurrency fans it out so a fleet of queued scans does
+	// not drain one host at a time (system-job-queue C-07). Spec
 	// release-stage-0-signoff AC-10.
 	var wkr *worker.Worker
 	if pool != nil {
-		wkr = worker.New(pool)
+		wkr = worker.New(pool).WithConcurrency(cfg.Server.ScanConcurrency)
 	}
 	return &Server{cfg: cfg, router: r, srv: srv, cm: cm, wkr: wkr, handlers: apiHandlers}
 }
diff --git a/internal/worker/concurrency_test.go b/internal/worker/concurrency_test.go
@@ -0,0 +1,101 @@
+// @spec system-job-queue
+//
+// Bounded-concurrency drain: the in-process worker runs up to N claim/process
+// loops at once so a fleet of queued jobs does not drain one at a time. SKIP
+// LOCKED gives each loop a disjoint job; the per-host advisory lock (covered by
+// scan_worker_test.go) still serializes same-host scans.
+
+package worker
+
+import (
+	"context"
+	"testing"
+	"time"
+
+	"github.com/Hanalyx/openwatch/internal/correlation"
+	"github.com/Hanalyx/openwatch/internal/db/dbtest"
+	"github.com/Hanalyx/openwatch/internal/queue"
+	"github.com/google/uuid"
+)
+
+// blockingDiscovery signals each invocation on started, then blocks until
+// release is closed. host.discovery is a convenient vehicle: it has no per-host
+// lock, so it isolates the worker's fan-out from the scan path's serialization.
+type blockingDiscovery struct {
+	started chan struct{}
+	release chan struct{}
+}
+
+func (b *blockingDiscovery) RunDiscovery(_ context.Context, _ uuid.UUID) error {
+	b.started <- struct{}{}
+	<-b.release
+	return nil
+}
+
+// @ac AC-12
+// AC-12: with WithConcurrency(N) and N+1 blocking jobs, exactly N run at once
+// and the (N+1)th waits for a free loop.
+func TestWorker_BoundedConcurrency(t *testing.T) {
+	t.Run("system-job-queue/AC-12", func(t *testing.T) {
+		pool := dbtest.Pool(t)
+		const n = 3
+		d := &blockingDiscovery{started: make(chan struct{}, 16), release: make(chan struct{})}
+		w := New(pool).WithDiscovery(d).WithConcurrency(n)
+
+		// Enqueue N+1 host.discovery jobs for distinct hosts.
+		for i := 0; i < n+1; i++ {
+			ctx := correlation.Set(context.Background(), correlation.Generate("test"))
+			// Pass the map directly — Enqueue marshals it (passing pre-marshaled
+			// bytes would double-encode into a JSON string).
+			body := map[string]string{"host_id": uuid.NewString()}
+			if _, err := queue.Enqueue(ctx, pool, "host.discovery", body); err != nil {
+				t.Fatalf("enqueue %d: %v", i, err)
+			}
+		}
+
+		w.Start(context.Background())
+		defer w.Stop()
+
+		// Exactly N jobs enter RunDiscovery concurrently.
+		for i := 0; i < n; i++ {
+			select {
+			case <-d.started:
+			case <-time.After(5 * time.Second):
+				t.Fatalf("only %d/%d jobs started concurrently", i, n)
+			}
+		}
+		// The (N+1)th must NOT have started — fan-out is bounded at N.
+		select {
+		case <-d.started:
+			t.Fatal("more than N jobs ran at once — concurrency not bounded")
+		case <-time.After(400 * time.Millisecond):
+			// good: bounded
+		}
+
+		// Free the in-flight loops; the (N+1)th now claims a freed loop.
+		close(d.release)
+		select {
+		case <-d.started:
+		case <-time.After(5 * time.Second):
+			t.Fatal("the (N+1)th job never ran after release — a freed loop did not pick it up")
+		}
+	})
+}
+
+// @ac AC-12
+// AC-12 (clamp): a concurrency < 1 clamps to 1 so the default worker stays
+// strictly serial; a positive value is kept.
+func TestWorker_WithConcurrencyClamps(t *testing.T) {
+	t.Run("system-job-queue/AC-12", func(t *testing.T) {
+		cases := map[int]int{0: 1, -5: 1, 1: 1, 8: 8}
+		for in, want := range cases {
+			if got := New(nil).WithConcurrency(in).concurrency; got != want {
+				t.Errorf("WithConcurrency(%d) = %d, want %d", in, got, want)
+			}
+		}
+		// New defaults to serial.
+		if got := New(nil).concurrency; got != 1 {
+			t.Errorf("New default concurrency = %d, want 1", got)
+		}
+	})
+}
diff --git a/internal/worker/worker.go b/internal/worker/worker.go
@@ -42,23 +42,44 @@ type HostDiscoveryRunner interface {
 
 // Worker drains pending jobs from job_queue. One Worker per process is
 // enough for Stage 0; multi-worker setups are Stage 2.
+//
+// The worker can run several claim/process loops concurrently
+// (WithConcurrency) so a fleet of queued scans does not drain one host at a
+// time; the queue's SKIP LOCKED claim and the scan path's per-host advisory
+// lock keep concurrent draining safe (system-job-queue C-07).
 type Worker struct {
-	pool      *pgxpool.Pool
-	stop      chan struct{}
-	wg        sync.WaitGroup
-	discovery HostDiscoveryRunner
-	scanProc  *ScanWorker
+	pool        *pgxpool.Pool
+	stop        chan struct{}
+	wg          sync.WaitGroup
+	discovery   HostDiscoveryRunner
+	scanProc    *ScanWorker
+	concurrency int
 }
 
 // New constructs a Worker bound to the given pool. Call Start to begin
-// the drain loop and Stop to exit cleanly.
+// the drain loop and Stop to exit cleanly. Defaults to one (serial) loop;
+// call WithConcurrency to fan out.
 func New(pool *pgxpool.Pool) *Worker {
 	return &Worker{
-		pool: pool,
-		stop: make(chan struct{}),
+		pool:        pool,
+		stop:        make(chan struct{}),
+		concurrency: 1,
 	}
 }
 
+// WithConcurrency sets how many claim/process loops run at once. A value < 1
+// clamps to 1 (strictly serial). Each loop independently claims jobs via
+// SKIP LOCKED, so N loops process up to N distinct hosts in parallel while the
+// per-host advisory lock still serializes same-host work. Spec
+// system-job-queue C-07.
+func (w *Worker) WithConcurrency(n int) *Worker {
+	if n < 1 {
+		n = 1
+	}
+	w.concurrency = n
+	return w
+}
+
 // WithDiscovery registers the OS Discovery runner. When set, the
 // worker processes host.discovery jobs by calling Discover; nil keeps
 // the legacy behavior (host.discovery fails as unsupported).
@@ -81,11 +102,18 @@ func (w *Worker) WithScanProcessor(sw *ScanWorker) *Worker {
 	return w
 }
 
-// Start kicks off the drain loop on a background goroutine. Returns
-// immediately. Safe to call once per Worker.
+// Start kicks off w.concurrency drain loops on background goroutines. Returns
+// immediately. Safe to call once per Worker. Each loop claims jobs
+// independently (SKIP LOCKED), so up to w.concurrency jobs run at once.
 func (w *Worker) Start(ctx context.Context) {
-	w.wg.Add(1)
-	go w.loop(ctx)
+	n := w.concurrency
+	if n < 1 {
+		n = 1
+	}
+	w.wg.Add(n)
+	for i := 0; i < n; i++ {
+		go w.loop(ctx)
+	}
 }
 
 // Stop signals the loop to exit and waits for the in-flight drain (if
diff --git a/packaging/common/openwatch.toml b/packaging/common/openwatch.toml
@@ -11,6 +11,11 @@
 listen   = "0.0.0.0:8443"
 tls_cert = "/etc/openwatch/tls/cert.pem"
 tls_key  = "/etc/openwatch/tls/key.pem"
+# How many compliance scans run at once in this process (different hosts in
+# parallel; same-host scans never overlap). Default 4. Raise for large fleets,
+# minding the DB pool and how many SSH sessions your targets tolerate; set 1
+# for strictly one-at-a-time. See docs/guides/SCALING_GUIDE.md.
+scan_concurrency = 4
 
 [database]
 # Replace with a real DSN. The package does not provision Postgres.
diff --git a/specs/system/job-queue.spec.yaml b/specs/system/job-queue.spec.yaml
@@ -1,7 +1,7 @@
 spec:
   id: system-job-queue
   title: PostgreSQL job queue with correlation propagation
-  version: "1.0.0"
+  version: "1.1.0"
   status: approved
   tier: 1
 
@@ -62,6 +62,20 @@ spec:
       description: Forbidigo lint MUST reject raw "INSERT INTO job_queue" outside internal/queue/ and "http.DefaultClient" outside internal/httpclient/
       type: technical
       enforcement: error
+    - id: C-07
+      description: >-
+        The in-process worker MUST be able to run up to a configured number of
+        claim/process loops CONCURRENTLY (ServerConfig.ScanConcurrency, default
+        4, clamped to >= 1), so a fleet of queued scans does not drain strictly
+        one host at a time. Concurrency is safe by construction and adds NO new
+        locking: SKIP LOCKED (C-04) guarantees the N loops claim disjoint jobs,
+        and the scan path's per-host advisory lock (system-worker-subcommand
+        C-09) still serializes any two jobs for the SAME host. Each loop keeps
+        the existing per-job contract (a fresh correlation context per job,
+        C-02). The default is conservative; operators tune it up for large
+        fleets or down to 1 to restore strictly-serial draining.
+      type: technical
+      enforcement: error
 
   acceptance_criteria:
     - id: AC-01
@@ -105,3 +119,12 @@ spec:
       description: Lint forbids raw "INSERT INTO job_queue" outside internal/queue/ and "http.DefaultClient" outside internal/httpclient/.
       priority: high
       references_constraints: [C-06]
+    - id: AC-12
+      description: >-
+        With Worker.WithConcurrency(N) and N+1 pending jobs whose processing
+        blocks, the worker runs exactly N jobs concurrently and holds the
+        (N+1)th pending until one finishes (bounded fan-out). Stop waits for all
+        in-flight loops to drain. WithConcurrency clamps a value < 1 up to 1, so
+        the default-1 worker stays strictly serial.
+      priority: critical
+      references_constraints: [C-07]
