PRD: go-worker Production Hardening & Performance

Background

go-worker runs in sensitive production environments (international banking and medical institutions). Reliability, correctness, predictable performance, and safe shutdown behavior are required. The current implementation exhibits lifecycle, concurrency, and rate-limiting inconsistencies that can lead to task loss, unexpected retries, goroutine leaks, and panics.

Goals

Provide correct, deterministic task lifecycle semantics (queued → running → terminal) with no duplicate executions.
Enforce rate limits and concurrency caps precisely and predictably.
Ensure cancellation, timeouts, retries, and shutdown are safe, idempotent, and leak-free.
Make behavior observable (metrics, logs, tracing) with minimal overhead.
Keep API ergonomics while enabling strong operational controls.
Provide an optional durable backend (Redis) with at-least-once delivery, leasing, and DLQ support.

Non-Goals

Additional durable backends beyond Redis (e.g., Postgres) are out of scope for now.
Building a full scheduler with cron semantics.
Providing a multi-tenant authorization system inside the library.

Assumptions

The library should remain usable as an embedded component (not a standalone service).
gRPC is an optional interface; the core worker should not depend on it.

Users / Use Cases

Internal services enqueue critical tasks with strict SLAs.
Operators need to safely drain tasks for deploys or incidents.
Compliance teams need traceability and reliable audit data for tasks.

Functional Requirements

Task lifecycle correctness - Exactly-once execution per task registration (no duplicate dispatch). - Terminal state is immutable once reached. - Task status transitions are thread-safe and race-free.
Rate limiting and concurrency control - Single, consistent limiter for execution (not double-applied). - Support low values (including 0 retries, 1 worker, and low TPS). - Burst handling should be deterministic and configurable.
Timeouts & cancellation - Task execution must receive the task’s context (with timeout/deadline). - Cancellation should prevent execution or stop in-flight tasks. - Cancellation should not reschedule tasks unless explicitly requested.
Retries - Allow zero retries (disabled) and configurable max retries. - Implement exponential backoff with jitter without blocking shared locks. - Retries must not deadlock or block the scheduler.
Shutdown / drain semantics - StopGraceful(ctx) and StopNow() must be idempotent and non-blocking. - Support graceful drain (finish running tasks) and hard stop (cancel). - No channel closes while writers are active.
Result handling - Provide safe, multi-subscriber result streams (fan-out). - Allow bounded buffer size with backpressure strategy.
Registry management - Provide optional retention policy (TTL or max entries) to avoid unbounded memory.
gRPC behavior - Per-client streaming should not steal results from other clients. - Idempotency keys should be enforced (optional) with conflict response. - Surface NotFound when canceling/querying unknown tasks.

Non-Functional Requirements

Safety: No panics on normal operation; no data races under -race.
Reliability: No goroutine leaks; Stop() always completes.
Performance: Minimum overhead per task; no global locks in steady state.
Observability: Provide counters and histograms for queue depth, latency, success/failure, retry counts, and cancellations.
Compatibility: Preserve public API where possible; breaking changes are versioned.

API / Behavior Changes (Proposed)

Replace dual scheduling (scheduler + direct channel send) with a single scheduling path.
Clarify defaults using typed time.Duration constants (e.g., 5 * time.Minute).
Replace GetResults() with SubscribeResults(buffer) and keep GetResults() as a compatibility shim for one release cycle.
Provide StopGraceful(ctx) and StopNow() to separate drain vs cancel semantics.
When a durable backend is enabled, use RegisterDurableTask(s) instead of RegisterTask(s).

Observability Requirements

Metrics (Prometheus or OpenTelemetry): - tasks_scheduled_total - tasks_running - tasks_completed_total - tasks_failed_total - tasks_cancelled_total - task_latency_seconds (histogram) - queue_depth - retry_count_total
Structured logging hooks (interface-based, no stdlib log dependency).
Optional tracing spans per task execution.

Security & Compliance

gRPC examples should demonstrate TLS and interceptor usage.
Optional hooks for authentication/authorization in gRPC server.
Avoid leaking sensitive payloads in logs by default.

Migration Plan

Introduce new API variants while keeping existing functions deprecated for one release cycle.
Provide a compatibility shim for old GetResults() behavior.
Document breaking changes and safe upgrade path.

Status vs Repo (January 30, 2026)

Status updated: February 5, 2026

Functional Requirements (January 30, 2026)

Task lifecycle correctness: Done for in-memory tasks; durable tasks are at-least-once by design.
Rate limiting & concurrency: Done; single scheduling path with deterministic burst (min(maxWorkers, maxTasks)).
Timeouts & cancellation: Done; task contexts inherit deadlines and cancellation is propagated.
Retries: Done; exponential backoff with jitter and non-blocking scheduling.
Shutdown / drain semantics: Done; StopGraceful + StopNow are idempotent and safe.
Result handling: Done; fan-out SubscribeResults + drop policy + GetResults shim.
Registry management: Done; retention policy and cleanup implemented.
gRPC behavior: Done; fan-out streaming, idempotency enforcement, NotFound for missing tasks.

Non-Functional Requirements (January 30, 2026)

Safety: Partial; panic recovery in hooks/tracer/broadcaster exists, but go test -race ./... still needs to be validated on your machine.
Safety: Partial; panic recovery in hooks/tracer/broadcaster exists, but go test -race ./... still needs to be validated on your machine (race fix landed in resultBroadcaster).
Reliability: Done; stop paths are idempotent and workers/scheduler loops are tracked via WaitGroups.
Performance: Done; queue scheduling avoids global locks in steady state beyond the queue/registry.
Observability: Done; metrics snapshot + OpenTelemetry metrics/tracing and examples.
Compatibility: Done; breaking changes documented, shim for GetResults.

API / Behavior Changes (January 30, 2026)

Single scheduling path: Done.
Defaults clarified: Done (typed time.Duration constants).
GetResults() replacement: Done as SubscribeResults + compatibility shim.
StopGraceful/StopNow: Done.
Durable mode registration: Done; RegisterTask(s) disabled when durable backend is enabled.

Security & Compliance (January 30, 2026)

TLS/interceptor examples: Done (__examples/grpc).
Auth hook: Done (WithGRPCAuth).
Avoid logging payloads by default: Partial; core avoids logging payloads, but examples/tests should be reviewed for sensitive logging.

Durable Backend

Redis durable backend: Done (leases, DLQ, replay/inspect tools, gRPC durable API).
Multi-backend support: Not started.
Global coordination: Done; global rate limiting (WithRedisDurableGlobalRateLimit) and leader lock (WithRedisDurableLeaderLock) added with tests and docs.

Admin Service & Admin UI

Admin Service (gRPC + HTTP gateway): Implemented with gaps (resumable SSE, central audit log, per-endpoint metrics, gateway artifact API parity).
Admin UI (Next.js + Tailwind): Implemented with gaps (cross-resource timeline, trend analytics, richer operator workflows, stronger diagnostics UX).
Docker/Compose: Done (Dockerfile + compose.admin.yaml + cert generation script).
Schedule management: Done (create/delete/pause via gateway + UI).
Admin action counters: Done (pause/resume/replay counts exposed in overview).
UI polish: In progress (actions, events, run detail, docs; deeper productization still in progress).
Canonical admin status: See PRD-admin-service.md for detailed service/UI gap matrix and priority backlog.

Milestones

Stability Patch: Fix critical panics, timeout propagation, cancellation correctness, and shutdown deadlocks.
Scheduling & Retry Refactor: Unify scheduling path, correct rate limiting, non-blocking retry strategy.
Observability & Metrics: Add metrics and structured logging hooks.
gRPC Enhancements: Fan-out results, idempotency, better error codes.
Memory & Retention: Task registry retention policy and cleanup.

Acceptance Criteria

All unit tests pass; new tests cover cancellation, retries, shutdown, rate limiting, and fan-out streaming.
go test -race ./... passes with no data races.
StopGraceful(ctx) returns within a bounded timeout even with pending tasks.
No panics in GetTask, middleware logging, or streaming when tasks have no result/error yet.
Task execution respects deadlines; canceled tasks do not execute.

Competitive Landscape & Gaps

Comparable Go packages

Worker pools (concurrency focus): ants, pond, tunny. These provide fast pool management but lack task lifecycle, retries, hooks, and results fan‑out.
Distributed/durable queues: river (Postgres‑backed) and gocraft/work (Redis‑backed). These provide persistence, multi‑node coordination, and UIs, but are heavier and require external infrastructure.

Gaps vs comparable packages

Durability: Redis-backed at-least-once exists, but no transactional enqueue across services and no exactly-once guarantees.
Operational tooling: workerctl CLI exists for queue inspection and DLQ replay; admin UI and admin service are now implemented (ongoing polish and feature depth).
Distributed coordination: multi-node guidance + lease renewal exist; global rate limiting and leader lock now exist; no quorum controls.
Scheduling: delayed scheduling and cron are implemented; UX and admin controls still need refinement.

Nice‑to‑have improvements

DLQ & replay: current replay utility is basic; add safety guards, dry-run, and filtering.
Queue segmentation (multiple named queues, weighted priorities).
Typed task payloads (optional typed registry, stronger compile‑time checks).
More observability knobs (labels for task name/status; exemplar support).
Operational guidance (sizing/rate‑limit recommendations, best‑practice defaults).

Roadmap (future milestones)

Durable backend: add additional backends (Postgres) and stronger transactional enqueue semantics.
Operational tooling: admin service/UI are implemented; close remaining productization gaps (resumable events, auditability, analytics, advanced workflows).
Scheduled jobs: cron/delayed scheduling layer (cron now implemented; cron UX improvements TBD).
Multi‑node coordination: optional distributed workers via backend (global rate limit/leader lock implemented; quorum/leader election improvements TBD).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PRD: go-worker Production Hardening & Performance

Background

Goals

Non-Goals

Assumptions

Users / Use Cases

Functional Requirements

Non-Functional Requirements

API / Behavior Changes (Proposed)

Observability Requirements

Security & Compliance

Migration Plan

Status vs Repo (January 30, 2026)

Functional Requirements (January 30, 2026)

Non-Functional Requirements (January 30, 2026)

API / Behavior Changes (January 30, 2026)

Security & Compliance (January 30, 2026)

Durable Backend

Admin Service & Admin UI

Milestones

Acceptance Criteria

Competitive Landscape & Gaps

Comparable Go packages

Gaps vs comparable packages

Nice‑to‑have improvements

Roadmap (future milestones)

Uh oh!

FilesExpand file tree

PRD.md

Latest commit

History

PRD.md

File metadata and controls

PRD: go-worker Production Hardening & Performance

Background

Goals

Non-Goals

Assumptions

Users / Use Cases

Functional Requirements

Non-Functional Requirements

API / Behavior Changes (Proposed)

Observability Requirements

Security & Compliance

Migration Plan

Status vs Repo (January 30, 2026)

Functional Requirements (January 30, 2026)

Non-Functional Requirements (January 30, 2026)

API / Behavior Changes (January 30, 2026)

Security & Compliance (January 30, 2026)

Durable Backend

Admin Service & Admin UI

Milestones

Acceptance Criteria

Competitive Landscape & Gaps

Comparable Go packages

Gaps vs comparable packages

Nice‑to‑have improvements

Roadmap (future milestones)