Problem
During fresh deployments of the infrahub and infrahub-enterprise Helm charts, infrahubTaskWorker pods (and, more rarely, infrahubServer pods) start before the bundled prefect-server service is ready to accept connections. The Python application catches the connection error but exits non-zero after its bounded retry loop, landing the pod in CrashLoopBackOff until Prefect finishes booting.
Symptom
$ kubectl get pods -n infrahub
infrahub-infrahub-task-worker-xxxxx-yyyyy 0/1 CrashLoopBackOff 3 (52s ago) 2m
infrahub-infrahub-task-worker-xxxxx-zzzzz 0/1 Error 3 (41s ago) 2m
$ kubectl logs infrahub-infrahub-task-worker-xxxxx-yyyyy
...
httpx.ConnectError: All connection attempts failed
Pods eventually recover via Kubernetes' exponential restart backoff, but cold starts become noisy and slow. The issue reproduces on essentially every fresh helm install on an empty cluster.
Proposed fix
Add an optional initContainer, enabled by default, to both the infrahub-task-worker and infrahub-server deployment templates. It waits for http://prefect-server:4200/api/health to respond before the main container starts. A minimal busybox-based implementation is enough: it only needs wget and a polling loop with a timeout.
New values shape (per deployment):
infrahubTaskWorker:
  waitForPrefect:
    enabled: true
    image:
      repository: busybox
      tag: "1.37"
    url: "http://prefect-server:4200/api/health"
    pollSeconds: 5
    timeoutSeconds: 300
Users who disable the bundled prefect-server or point to an external Prefect can override url or set enabled: false.
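To make the proposal concrete, here is a minimal sketch of the polling loop such an initContainer could run. It is not taken from the chart: the function names `probe` and `wait_for_url` and the argument order (URL, poll seconds, timeout seconds) are hypothetical, chosen to mirror the `url`, `pollSeconds`, and `timeoutSeconds` values above. It assumes a POSIX shell and a `wget` that supports `-q`, `-T`, and `-O`, as the busybox applet does.

```sh
#!/bin/sh
# Hypothetical wait-for-Prefect helper; names and argument order are
# illustrative only and do not exist in the chart today.

# probe URL: succeeds (exit 0) when the URL returns a successful response.
probe() {
  wget -q -T 2 -O /dev/null "$1"
}

# wait_for_url URL POLL_SECONDS TIMEOUT_SECONDS
# Polls until probe succeeds, or returns 1 once the time spent sleeping
# reaches TIMEOUT_SECONDS. Time spent inside probe itself is not counted,
# so the timeout is approximate. POLL_SECONDS is assumed to be positive.
wait_for_url() {
  url="$1"
  poll="$2"
  timeout="$3"
  elapsed=0
  until probe "$url"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s waiting for $url" >&2
      return 1
    fi
    sleep "$poll"
    elapsed=$((elapsed + poll))
  done
  echo "$url is ready (waited about ${elapsed}s)"
}

# Example invocation with the proposed defaults:
# wait_for_url "http://prefect-server:4200/api/health" 5 300
```

A non-zero return from `wait_for_url` would make the init container fail, which is what surfaces the Init:Error state when Prefect never becomes reachable.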
Why not fix this in Infrahub code
The Python application already retries with a bounded count. Extending that in code risks masking legitimate misconfiguration (wrong PREFECT_API_URL, unreachable external Prefect) and produces noisier logs. An initContainer:
- Fails fast with a clear Init:Error when Prefect is truly unreachable, instead of a partial-retry in the main container
- Keeps transient startup errors out of application logs
- Is visible in kubectl describe pod as a discrete init phase
- Applies to any Infrahub version without coupling to application code
Scope
- charts/infrahub/templates/infrahub-task-worker.yaml
- charts/infrahub/templates/infrahub-server.yaml
- charts/infrahub/values.yaml
- Shared helper in charts/infrahub/templates/_helpers.tpl
- Inheritance/defaults carried through to the enterprise chart flavors