go-worker runs in sensitive production environments (international banking and medical institutions). Reliability, correctness, predictable performance, and safe shutdown behavior are required. The current implementation exhibits lifecycle, concurrency, and rate-limiting inconsistencies that can lead to task loss, unexpected retries, goroutine leaks, and panics.
- Provide correct, deterministic task lifecycle semantics (queued → running → terminal) with no duplicate executions.
- Enforce rate limits and concurrency caps precisely and predictably.
- Ensure cancellation, timeouts, retries, and shutdown are safe, idempotent, and leak-free.
- Make behavior observable (metrics, logs, tracing) with minimal overhead.
- Keep API ergonomics while enabling strong operational controls.
- Provide an optional durable backend (Redis) with at-least-once delivery, leasing, and DLQ support.
- Additional durable backends beyond Redis (e.g., Postgres) are out of scope for now.
- Building a full scheduler with cron semantics.
- Providing a multi-tenant authorization system inside the library.
- The library should remain usable as an embedded component (not a standalone service).
- gRPC is an optional interface; the core worker should not depend on it.
- Internal services enqueue critical tasks with strict SLAs.
- Operators need to safely drain tasks for deploys or incidents.
- Compliance teams need traceability and reliable audit data for tasks.
- Task lifecycle correctness - Exactly-once execution per task registration (no duplicate dispatch). - Terminal state is immutable once reached. - Task status transitions are thread-safe and race-free.
- Rate limiting and concurrency control - Single, consistent limiter for execution (not double-applied). - Support low values (including 0 retries, 1 worker, and low TPS). - Burst handling should be deterministic and configurable.
- Timeouts & cancellation - Task execution must receive the task’s context (with timeout/deadline). - Cancellation should prevent execution or stop in-flight tasks. - Cancellation should not reschedule tasks unless explicitly requested.
- Retries - Allow zero retries (disabled) and configurable max retries. - Implement exponential backoff with jitter without blocking shared locks. - Retries must not deadlock or block the scheduler.
- Shutdown / drain semantics
-
StopGraceful(ctx)andStopNow()must be idempotent and non-blocking. - Support graceful drain (finish running tasks) and hard stop (cancel). - No channel closes while writers are active. - Result handling - Provide safe, multi-subscriber result streams (fan-out). - Allow bounded buffer size with backpressure strategy.
- Registry management - Provide optional retention policy (TTL or max entries) to avoid unbounded memory.
- gRPC behavior - Per-client streaming should not steal results from other clients. - Idempotency keys should be enforced (optional) with conflict response. - Surface NotFound when canceling/querying unknown tasks.
- Safety: No panics on normal operation; no data races under
-race. - Reliability: No goroutine leaks;
Stop()always completes. - Performance: Minimum overhead per task; no global locks in steady state.
- Observability: Provide counters and histograms for queue depth, latency, success/failure, retry counts, and cancellations.
- Compatibility: Preserve public API where possible; breaking changes are versioned.
- Replace dual scheduling (scheduler + direct channel send) with a single scheduling path.
- Clarify defaults using typed
time.Durationconstants (e.g.,5 * time.Minute). - Replace
GetResults()withSubscribeResults(buffer)and keepGetResults()as a compatibility shim for one release cycle. - Provide
StopGraceful(ctx)andStopNow()to separate drain vs cancel semantics. - When a durable backend is enabled, use
RegisterDurableTask(s)instead ofRegisterTask(s).
- Metrics (Prometheus or OpenTelemetry): - tasks_scheduled_total - tasks_running - tasks_completed_total - tasks_failed_total - tasks_cancelled_total - task_latency_seconds (histogram) - queue_depth - retry_count_total
- Structured logging hooks (interface-based, no stdlib log dependency).
- Optional tracing spans per task execution.
- gRPC examples should demonstrate TLS and interceptor usage.
- Optional hooks for authentication/authorization in gRPC server.
- Avoid leaking sensitive payloads in logs by default.
- Introduce new API variants while keeping existing functions deprecated for one release cycle.
- Provide a compatibility shim for old
GetResults()behavior. - Document breaking changes and safe upgrade path.
Status updated: February 5, 2026
- Task lifecycle correctness: Done for in-memory tasks; durable tasks are at-least-once by design.
- Rate limiting & concurrency: Done; single scheduling path with deterministic burst (
min(maxWorkers, maxTasks)). - Timeouts & cancellation: Done; task contexts inherit deadlines and cancellation is propagated.
- Retries: Done; exponential backoff with jitter and non-blocking scheduling.
- Shutdown / drain semantics: Done;
StopGraceful+StopNoware idempotent and safe. - Result handling: Done; fan-out
SubscribeResults+ drop policy +GetResultsshim. - Registry management: Done; retention policy and cleanup implemented.
- gRPC behavior: Done; fan-out streaming, idempotency enforcement, NotFound for missing tasks.
- Safety: Partial; panic recovery in hooks/tracer/broadcaster exists, but
go test -race ./...still needs to be validated on your machine. - Safety: Partial; panic recovery in hooks/tracer/broadcaster exists, but
go test -race ./...still needs to be validated on your machine (race fix landed inresultBroadcaster). - Reliability: Done; stop paths are idempotent and workers/scheduler loops are tracked via WaitGroups.
- Performance: Done; queue scheduling avoids global locks in steady state beyond the queue/registry.
- Observability: Done; metrics snapshot + OpenTelemetry metrics/tracing and examples.
- Compatibility: Done; breaking changes documented, shim for
GetResults.
- Single scheduling path: Done.
- Defaults clarified: Done (typed
time.Durationconstants). GetResults()replacement: Done asSubscribeResults+ compatibility shim.- StopGraceful/StopNow: Done.
- Durable mode registration: Done;
RegisterTask(s)disabled when durable backend is enabled.
- TLS/interceptor examples: Done (
__examples/grpc). - Auth hook: Done (
WithGRPCAuth). - Avoid logging payloads by default: Partial; core avoids logging payloads, but examples/tests should be reviewed for sensitive logging.
- Redis durable backend: Done (leases, DLQ, replay/inspect tools, gRPC durable API).
- Multi-backend support: Not started.
- Global coordination: Done; global rate limiting (
WithRedisDurableGlobalRateLimit) and leader lock (WithRedisDurableLeaderLock) added with tests and docs.
- Admin Service (gRPC + HTTP gateway): Implemented with gaps (resumable SSE, central audit log, per-endpoint metrics, gateway artifact API parity).
- Admin UI (Next.js + Tailwind): Implemented with gaps (cross-resource timeline, trend analytics, richer operator workflows, stronger diagnostics UX).
- Docker/Compose: Done (
Dockerfile+compose.admin.yaml+ cert generation script). - Schedule management: Done (create/delete/pause via gateway + UI).
- Admin action counters: Done (pause/resume/replay counts exposed in overview).
- UI polish: In progress (actions, events, run detail, docs; deeper productization still in progress).
- Canonical admin status: See
PRD-admin-service.mdfor detailed service/UI gap matrix and priority backlog.
- Stability Patch: Fix critical panics, timeout propagation, cancellation correctness, and shutdown deadlocks.
- Scheduling & Retry Refactor: Unify scheduling path, correct rate limiting, non-blocking retry strategy.
- Observability & Metrics: Add metrics and structured logging hooks.
- gRPC Enhancements: Fan-out results, idempotency, better error codes.
- Memory & Retention: Task registry retention policy and cleanup.
- All unit tests pass; new tests cover cancellation, retries, shutdown, rate limiting, and fan-out streaming.
go test -race ./...passes with no data races.StopGraceful(ctx)returns within a bounded timeout even with pending tasks.- No panics in
GetTask, middleware logging, or streaming when tasks have no result/error yet. - Task execution respects deadlines; canceled tasks do not execute.
- Worker pools (concurrency focus):
ants,pond,tunny. These provide fast pool management but lack task lifecycle, retries, hooks, and results fan‑out. - Distributed/durable queues:
river(Postgres‑backed) andgocraft/work(Redis‑backed). These provide persistence, multi‑node coordination, and UIs, but are heavier and require external infrastructure.
- Durability: Redis-backed at-least-once exists, but no transactional enqueue across services and no exactly-once guarantees.
- Operational tooling:
workerctlCLI exists for queue inspection and DLQ replay; admin UI and admin service are now implemented (ongoing polish and feature depth). - Distributed coordination: multi-node guidance + lease renewal exist; global rate limiting and leader lock now exist; no quorum controls.
- Scheduling: delayed scheduling and cron are implemented; UX and admin controls still need refinement.
- DLQ & replay: current replay utility is basic; add safety guards, dry-run, and filtering.
- Queue segmentation (multiple named queues, weighted priorities).
- Typed task payloads (optional typed registry, stronger compile‑time checks).
- More observability knobs (labels for task name/status; exemplar support).
- Operational guidance (sizing/rate‑limit recommendations, best‑practice defaults).
- Durable backend: add additional backends (Postgres) and stronger transactional enqueue semantics.
- Operational tooling: admin service/UI are implemented; close remaining productization gaps (resumable events, auditability, analytics, advanced workflows).
- Scheduled jobs: cron/delayed scheduling layer (cron now implemented; cron UX improvements TBD).
- Multi‑node coordination: optional distributed workers via backend (global rate limit/leader lock implemented; quorum/leader election improvements TBD).