Add Prefect Readiness gate to task worker and server startup #59

@PhillSimonds

Problem

During fresh deployments of the infrahub and infrahub-enterprise Helm charts, infrahubTaskWorker pods (and, more rarely, infrahubServer pods) start before the bundled prefect-server service is ready to accept connections. The Python application catches the connection error but exits non-zero after its bounded retry loop, landing the pod in CrashLoopBackOff until Prefect finishes booting.

Symptom

$ kubectl get pods -n infrahub
infrahub-infrahub-task-worker-xxxxx-yyyyy   0/1   CrashLoopBackOff   3 (52s ago)   2m
infrahub-infrahub-task-worker-xxxxx-zzzzz   0/1   Error              3 (41s ago)   2m

$ kubectl logs infrahub-infrahub-task-worker-xxxxx-yyyyy
...
httpx.ConnectError: All connection attempts failed

Pods eventually recover once Prefect is up, because Kubernetes restarts the crashed containers with exponential backoff, but cold starts become noisy and slow. The failure is reproducible on essentially every fresh helm install on an empty cluster.

Proposed fix

Add an optional initContainer, enabled by default, to both the infrahub-task-worker and infrahub-server deployment templates. It polls http://prefect-server:4200/api/health and lets the main container start only once the endpoint responds. A minimal busybox-based implementation is enough: it needs only wget and a polling loop with a timeout.
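
As a rough illustration, the rendered initContainer could look something like the sketch below when the defaults listed under "New values shape" are used. The container name and the environment variable names are assumptions made for this example, not part of the proposal, and the timeout is approximate because the time spent inside each wget attempt is not counted.

```yaml
initContainers:
  - name: wait-for-prefect            # hypothetical container name
    image: busybox:1.37
    env:                              # variable names are illustrative
      - name: PREFECT_HEALTH_URL
        value: "http://prefect-server:4200/api/health"
      - name: POLL_SECONDS
        value: "5"
      - name: TIMEOUT_SECONDS
        value: "300"
    command:
      - sh
      - -c
      - |
        # Poll the health endpoint until it answers or the deadline passes.
        elapsed=0
        until wget -q -T 5 -O /dev/null "$PREFECT_HEALTH_URL"; do
          if [ "$elapsed" -ge "$TIMEOUT_SECONDS" ]; then
            echo "Prefect not reachable at $PREFECT_HEALTH_URL after ${TIMEOUT_SECONDS}s" >&2
            exit 1
          fi
          echo "Waiting for Prefect at $PREFECT_HEALTH_URL (${elapsed}s elapsed)"
          sleep "$POLL_SECONDS"
          elapsed=$((elapsed + POLL_SECONDS))
        done
```

Passing the settings as environment variables keeps the script itself free of template expressions, so the same shell loop can be reused unchanged by both deployments.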

New values shape (per deployment):

infrahubTaskWorker:
  waitForPrefect:
    enabled: true
    image:
      repository: busybox
      tag: "1.37"
    url: "http://prefect-server:4200/api/health"
    pollSeconds: 5
    timeoutSeconds: 300

Users who disable the bundled prefect-server or point to an external Prefect can override url or set enabled: false.
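
For instance, the two override paths might be expressed in a user-supplied values file roughly as follows. This is a sketch under two assumptions: the infrahubServer key mirrors the infrahubTaskWorker block shown above, and the external hostname is a placeholder rather than a real endpoint.

```yaml
# Variant A: point the gate at an external Prefect.
# The hostname below is a placeholder.
infrahubTaskWorker:
  waitForPrefect:
    url: "http://prefect.example.internal:4200/api/health"
infrahubServer:                       # assumed to mirror infrahubTaskWorker
  waitForPrefect:
    url: "http://prefect.example.internal:4200/api/health"
---
# Variant B: skip the gate entirely.
infrahubTaskWorker:
  waitForPrefect:
    enabled: false
infrahubServer:
  waitForPrefect:
    enabled: false
```

Only the keys being changed need to appear; Helm merges them over the chart defaults.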

Why not fix this in Infrahub code

The Python application already retries with a bounded count. Extending that in code risks masking legitimate misconfiguration (wrong PREFECT_API_URL, unreachable external Prefect) and produces noisier logs. An initContainer:

  1. Fails fast with a clear Init:Error when Prefect is truly unreachable, rather than after a partial retry sequence in the main container
  2. Keeps transient startup errors out of application logs
  3. Is visible in kubectl describe pod as a discrete init phase
  4. Applies to any Infrahub version without coupling to application code

Scope

  • charts/infrahub/templates/infrahub-task-worker.yaml
  • charts/infrahub/templates/infrahub-server.yaml
  • charts/infrahub/values.yaml
  • Shared helper in charts/infrahub/templates/_helpers.tpl
  • Inheritance/defaults carried through to the enterprise chart flavors
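
One way the shared helper and its call site could fit together is sketched below. The helper name is hypothetical, the helper assumes it receives a waitForPrefect values block as its context, and the call site assumes the pod spec does not already define initContainers (otherwise the entry would need to be appended to the existing list).

```yaml
{{/*
charts/infrahub/templates/_helpers.tpl (sketch)
Hypothetical helper; "." is expected to be a waitForPrefect values block.
*/}}
{{- define "infrahub.waitForPrefect.initContainer" -}}
- name: wait-for-prefect
  image: "{{ .image.repository }}:{{ .image.tag }}"
  env:
    - name: PREFECT_HEALTH_URL
      value: {{ .url | quote }}
    - name: POLL_SECONDS
      value: {{ .pollSeconds | quote }}
    - name: TIMEOUT_SECONDS
      value: {{ .timeoutSeconds | quote }}
  command:
    - sh
    - -c
    - |
      elapsed=0
      until wget -q -T 5 -O /dev/null "$PREFECT_HEALTH_URL"; do
        [ "$elapsed" -ge "$TIMEOUT_SECONDS" ] && exit 1
        sleep "$POLL_SECONDS"
        elapsed=$((elapsed + POLL_SECONDS))
      done
{{- end }}

# charts/infrahub/templates/infrahub-task-worker.yaml (excerpt, pod spec level)
    spec:
      {{- with .Values.infrahubTaskWorker.waitForPrefect }}
      {{- if .enabled }}
      initContainers:
        {{- include "infrahub.waitForPrefect.initContainer" . | nindent 8 }}
      {{- end }}
      {{- end }}
```

The infrahub-server template would make the same include call with its own values block, which keeps the polling logic in a single place.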
