Run OAP init job in the main phase to fix helm --wait deadlock (#190)
The OAP init job was a `post-install,post-upgrade,post-rollback` hook. Under
`helm upgrade --install --wait`, Helm waits for all release resources to become
Ready before running post-* hooks, but the OAP Deployment runs in `-Dmode=no-init`
and never becomes Ready until the init job creates the storage schema. The hook
therefore never runs, and the install hangs until `--wait` times out. Any install
against fresh storage, the typical new-user case, hits this.
Hooks cannot fix this with embedded storage subcharts: a pre-* hook init job
cannot reach main-phase storage, and a post-* hook deadlocks under `--wait`.
So the init job now runs as a normal main-phase resource alongside storage and
the OAP Deployment, which blocks in no-init mode until the schema appears.
To avoid `spec.template is immutable` failures on upgrade (a Job's pod template
cannot be patched), the Job name carries an 8-char hash of the chart values, so
a changed spec yields a new Job and Helm prunes the previous one. A new optional
`oapInit.ttlSecondsAfterFinished` can auto-clean finished Jobs (off by default;
left off for GitOps tools that would otherwise recreate the Job).
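
The naming scheme described above could look roughly like the following Helm
template. This is a minimal sketch, not the chart's actual template: the
`oap-init` name prefix, the placeholder image, the `-Dmode=init` option, and the
choice to hash the whole of `.Values` with `sha256sum` are all assumptions made
for illustration. Only `oapInit.ttlSecondsAfterFinished` and the 8-character
hash suffix come from the description above.

```yaml
# Hypothetical templates/ Job manifest; names and hash input are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  # No "helm.sh/hook" annotation, so Helm treats the Job as an ordinary
  # main-phase resource and creates it alongside storage and the OAP Deployment.
  # The 8-character suffix changes whenever the chart values change, which
  # yields a new Job instead of a patch to an immutable pod template.
  name: {{ printf "%s-oap-init-%s" .Release.Name (.Values | toYaml | sha256sum | trunc 8) }}
spec:
  {{- if .Values.oapInit.ttlSecondsAfterFinished }}
  # Only rendered when the value is set; empty keeps finished Jobs around.
  ttlSecondsAfterFinished: {{ .Values.oapInit.ttlSecondsAfterFinished }}
  {{- end }}
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: oap-init
          image: example/oap-server:tag  # placeholder image, not the chart default
          env:
            - name: JAVA_OPTS
              value: "-Dmode=init"       # assumed counterpart of -Dmode=no-init
```

Because the old Job name no longer appears in the rendered manifests after a
values change, Helm removes it as part of the upgrade, which is the pruning
behaviour the message refers to.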
The OAP Deployment startupProbe default failureThreshold is raised 9 -> 30
(90s -> 300s) so the pod waits for the init job during a cold start instead of
being restarted.
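
As an illustration of the two settings discussed above, a values override might
look like this. The probe fields mirror the documented defaults; the `600` TTL
is an arbitrary example value, not a recommendation from the chart.

```yaml
# Example values override (illustrative only).
oap:
  startupProbe:
    tcpSocket:
      port: 12800
    failureThreshold: 30   # new default, previously 9
    periodSeconds: 10      # budget = 30 * 10s = 300s before the pod is restarted
oapInit:
  # Optional; the default "" keeps the finished Job.
  # Leave it empty under Argo CD or Flux, which would recreate a deleted Job.
  ttlSecondsAfterFinished: 600
```

If schema creation on a slow storage backend takes longer than 300s, raising
`failureThreshold` further extends the budget in steps of `periodSeconds`.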
Docs (values.yaml, chart README, root README) updated accordingly.
chart/skywalking/README.md (2 additions & 1 deletion)
@@ -68,7 +68,7 @@ The following table lists the configurable parameters of the Skywalking chart an
 |`oap.nodeSelector`| OAP labels for master pod assignment |`{}`|
 |`oap.tolerations`| OAP tolerations |`[]`|
 |`oap.resources`| OAP node resources requests & limits |`{} - cpu limit must be an integer`|
-| `oap.startupProbe` | Configuration fields for the [startupProbe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)| `tcpSocket.port: 12800` <br> `failureThreshold: 9` <br> `periodSeconds: 10`
+| `oap.startupProbe` | Configuration fields for the [startupProbe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). The default budget (`failureThreshold` * `periodSeconds` = 300s) is large enough for OAP to wait in no-init mode while the OAP init Job creates the storage schema. | `tcpSocket.port: 12800` <br> `failureThreshold: 30` <br> `periodSeconds: 10`
 | `oap.livenessProbe` | Configuration fields for the [livenessProbe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) | `tcpSocket.port: 12800` <br> `initialDelaySeconds: 5` <br> `periodSeconds: 10`
 | `oap.readinessProbe` | Configuration fields for the [readinessProbe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) | `tcpSocket.port: 12800` <br> `initialDelaySeconds: 5` <br> `periodSeconds: 10`
 |`oap.env`| OAP environment variables |`[]`|
@@ -109,6 +109,7 @@ The following table lists the configurable parameters of the Skywalking chart an
 |`oapInit.nodeSelector`| OAP init job labels for master pod assignment |`{}`|
+|`oapInit.ttlSecondsAfterFinished`| Seconds after which the finished OAP init Job (and its Pod) is auto-deleted by the Kubernetes TTL-after-finished controller. Empty keeps the Job. Leave empty with GitOps tools (Argo CD/Flux), which would recreate it after deletion. |`""`|
 |`satellite.name`| Satellite deployment name |`satellite`|