Cap per-VM STLS retry attempts (AB#38327355)#8653
Conversation
The aks-secure-tls-bootstrap client has no native long-running retry
loop: each invocation makes a single attempt and exits. Re-invocation
is driven by kubelet's Restart=always pulling the failed STLS unit via
its Wants= relation.
On 2026-06-04, a single stuck VM in eastus2euap emitted 2,837 STLS
failure events over 53+ hours — 96% of all STLS fail events in the
region — dropping per-event QoS to ~93% even though only 68 VMs
actually failed.
Add a Linux-side wrapper script
(parts/linux/cloud-init/artifacts/secure-tls-bootstrap-retry-cap.sh)
that the systemd unit invokes instead of the binary directly. The
wrapper:
- Tracks attempt count + first-attempt timestamp in
/var/lib/aks-secure-tls-bootstrap/, wiped per provisioning session
by cse_config.sh and reset across reboots via boot-id check.
- Applies exponential backoff (1s -> 300s default) between attempts.
- Caps per-session retries: 50 attempts OR 7200s wall-clock,
whichever fires first. Both bounds are env-var configurable.
- On cap, emits a distinct guest-agent event with TaskName
'AKS.Bootstrap.SecureTLSBootstrapping.RetryCapReached' carrying
the last FinalErrorType, attempt count, elapsed seconds, and
client version. Once capped, further invocations exit 0
immediately without re-running the binary or re-emitting events.
Plumb 4 new knobs end-to-end through both deployment paths:
- Legacy CSE: SecureTLSBootstrappingConfig fields in types.go ->
template funcs in baker.go -> env vars in cse_cmd.sh -> read by
cse_config.sh and written to /etc/default/secure-tls-bootstrap.
- aks-node-controller: 4 new proto fields (slots 16-19) in
bootstrapping_config.proto -> regenerated .pb.go -> mapped in
parser.go to the same env vars.
Backward-compat safe across the 6-month VHD window: the systemd unit
file ships with the VHD, so old VHDs keep their old unit pointing
directly at the binary. Old VHD + new CSE is therefore a no-op for
this change. New VHD + old CSE falls back to wrapper defaults.
Linux only — Windows tracked separately (AB#38327356/57).
Dashboards filtering on TaskName ==
'AKS.Bootstrap.SecureTLSBootstrapping' must be updated to also union
the .RetryCapReached suffix; spam-count denominators should exclude
the parent when the suffix fires for the same VM.
Tests:
- ShellSpec: 11 new tests in secure_tls_bootstrap_retry_cap_spec.sh
plus 3 new It blocks in cse_config_spec.sh for state-dir wipe and
env-var forwarding.
- Go: parser_test.go + baker_test.go assertions for the 4 new
fields/env-var mappings.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR Title Lint Failed ❌Current Title: Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns: Conventional Commits Format:
Guidelines:
Examples:
Please update your PR title and the lint check will run again automatically. |
|
The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).
|
There was a problem hiding this comment.
Pull request overview
This PR introduces a Linux-side retry-cap wrapper for Secure TLS Bootstrapping (STLS) to prevent runaway systemd/kubelet crash loops from generating excessive STLS failure events per VM, while keeping transient-retry behavior via bounded exponential backoff. It also wires four new retry-cap configuration knobs through both the legacy CSE path and the aks-node-controller path.
Changes:
- Add
secure-tls-bootstrap-retry-cap.shwrapper and update the STLS systemd unit to invoke it instead of calling the STLS binary directly. - Plumb four new retry-cap settings (max attempts, max total seconds, initial backoff, max backoff) through AgentBaker templates and aks-node-controller proto/parser mapping to env vars.
- Add ShellSpec and Go tests covering the wrapper behavior and the new configuration mappings.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| vhdbuilder/packer/vhd-image-builder-mariner.json | Stage the new retry-cap wrapper script into the Mariner VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-mariner-cvm.json | Stage the new retry-cap wrapper script into the Mariner CVM VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-mariner-arm64.json | Stage the new retry-cap wrapper script into the Mariner ARM64 VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-flatcar.json | Stage the new retry-cap wrapper script into the Flatcar VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json | Stage the new retry-cap wrapper script into the Flatcar ARM64 VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-cvm.json | Stage the new retry-cap wrapper script into the CVM VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-base.json | Stage the new retry-cap wrapper script into the base Linux VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-arm64-gen2.json | Stage the new retry-cap wrapper script into the ARM64 Gen2 VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-arm64-gb.json | Stage the new retry-cap wrapper script into the ARM64 GB VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-acl.json | Stage the new retry-cap wrapper script into the ACL VHD build inputs. |
| vhdbuilder/packer/vhd-image-builder-acl-arm64.json | Stage the new retry-cap wrapper script into the ACL ARM64 VHD build inputs. |
| vhdbuilder/packer/packer_source.sh | Install the wrapper into /opt/azure/containers/ on the VHD with executable permissions. |
| spec/parts/linux/cloud-init/artifacts/secure_tls_bootstrap_retry_cap_spec.sh | Add ShellSpec coverage for wrapper state handling, caps, and env overrides. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Add ShellSpec coverage for state-dir wipe and env-var forwarding behavior. |
| pkg/agent/datamodel/types.go | Add new SecureTLSBootstrappingConfig fields + getters for retry-cap knobs. |
| pkg/agent/baker.go | Expose new template funcs for retry-cap fields so they can be rendered into env vars. |
| pkg/agent/baker_test.go | Assert legacy CSE env-var rendering includes the four new retry-cap variables. |
| parts/linux/cloud-init/artifacts/secure-tls-bootstrap.service | Switch systemd unit ExecStart to the wrapper script. |
| parts/linux/cloud-init/artifacts/secure-tls-bootstrap-retry-cap.sh | Introduce wrapper implementing capped retries + capped terminal event emission. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Reset retry-cap state per provisioning session and forward retry-cap env vars to the default file. |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Add env-var exports for the four new retry-cap knobs in legacy CSE command template. |
| aks-node-controller/proto/aksnodeconfig/v1/bootstrapping_config.proto | Add proto fields for retry-cap knobs (aks-node-controller path). |
| aks-node-controller/pkg/gen/aksnodeconfig/v1/bootstrapping_config.pb.go | Regenerate Go bindings for the updated proto. |
| aks-node-controller/parser/parser.go | Map the new proto fields into CSE env vars. |
| aks-node-controller/parser/parser_test.go | Extend parser tests to assert env vars exist for the new knobs. |
Files not reviewed (1)
- aks-node-controller/pkg/gen/aksnodeconfig/v1/bootstrapping_config.pb.go: Language not supported
| [Unit] | ||
| Description=AKS Secure TLS Bootstrap Client | ||
| ConditionPathExists=/opt/bin/aks-secure-tls-bootstrap-client | ||
| ConditionPathExists=/opt/azure/containers/secure-tls-bootstrap-retry-cap.sh | ||
| Wants=network-online.target |
| compute_backoff() { | ||
| # Loop-double from INITIAL_BACKOFF_SECONDS, n times, capped at MAX_BACKOFF_SECONDS. | ||
| # Loop-based to avoid integer overflow on large attempts (2^63 etc.). | ||
| local n="$1" | ||
| local b="$INITIAL_BACKOFF_SECONDS" | ||
| local i=0 | ||
| while [ "$i" -lt "$n" ]; do | ||
| b=$((b * 2)) | ||
| if [ "$b" -ge "$MAX_BACKOFF_SECONDS" ]; then | ||
| echo "$MAX_BACKOFF_SECONDS" | ||
| return | ||
| fi | ||
| i=$((i + 1)) | ||
| done | ||
| echo "$b" | ||
| } |
| else | ||
| # Defensive fallback (jq is on the VHD, but if missing we still emit | ||
| # a valid event so the cap signal isn't lost). | ||
| message="{\"Status\":\"Failure\",\"Reason\":\"RetryCapReached\",\"Attempts\":\"$attempts\",\"ElapsedSeconds\":\"$elapsed\",\"FinalErrorType\":\"$final_error\",\"MaxAttempts\":\"$MAX_ATTEMPTS\",\"MaxTotalSeconds\":\"$MAX_TOTAL_SECONDS\",\"ClientVersion\":\"$client_version\"}" | ||
| fi |
|
AgentBaker Linux PR gate — E2E failure (mixed: 2 leaves likely shared infra/test-fixture, 2 leaves possibly PR-related)
Failing leaves, grouped: Group A — shared with sibling PR #8654, very likely NOT caused by this PR:
Group B — possibly PR-correlated, needs the VM logs:
This PR's changes (relevant slice):
The two Ubuntu2404 leaves (Group B) are scenarios that exercise CSE on freshly-baked 24.04 VHDs. The PR injects a new artifact into every VHD via Confidence:
Strongest alternative for Group B (less likely): generic E2E flake on Ubuntu2404 image bring-up — refuted by the deterministic shape across two leaves on the same image and the fact that other Ubuntu2404 scenarios in the same run did NOT fail. Recommended next action (owner: PR author):
Posted by Clawpilot AgentBaker gate detective. |
What & why
ADO: AB#38327355
The
aks-secure-tls-bootstrapclient has no native long-running retry loop — each invocation makes a single attempt and exits. Re-invocation is driven by kubelet'sRestart=alwayspulling the failed STLS unit via itsWants=relation.On 2026-06-04, a single stuck VM (
b409a057-44a0-4b06-a6de-2e24e59a90a5in sub37deca37-...) hitGetNonceFailure: produced zero addresses(DNS) and emitted 2,837 STLS failure events in 53+ hours. That single VM was 96% of all STLS fail events in eastus2euap on Jun 5–6 and dropped per-event QoS to ~93% even though only 68 VMs actually failed.Approach
New Linux wrapper
parts/linux/cloud-init/artifacts/secure-tls-bootstrap-retry-cap.shthat the systemd unit invokes instead of the binary directly. The wrapper:/var/lib/aks-secure-tls-bootstrap/, wiped per provisioning session bycse_config.shand reset across reboots via boot-id check.AKS.Bootstrap.SecureTLSBootstrapping.RetryCapReachedcarrying the lastFinalErrorType, attempt count, elapsed seconds, and client version. Once capped, further invocations exit0immediately without re-running the binary or re-emitting events.Defaults are conservative — a typical transient blip recovers well within the budget. A stuck VM under these defaults emits ~30 binary-failure events + 1
RetryCapReachedevent over ~2 h, then goes quiet (vs. the 2,837 events / 53 h observed today).Full plumbing — 4 new knobs in both deployment paths
Legacy CSE:
SecureTLSBootstrappingConfigfields intypes.go→ template funcs inbaker.go→ env vars incse_cmd.sh→ read bycse_config.shand written to/etc/default/secure-tls-bootstrap.aks-node-controller: 4 new proto fields (slots 16–19) in
bootstrapping_config.proto→ regenerated.pb.go→ mapped inparser.goto the same env vars.Env vars:
SECURE_TLS_BOOTSTRAPPING_MAX_ATTEMPTS,..._MAX_TOTAL_SECONDS,..._INITIAL_BACKOFF_SECONDS,..._MAX_BACKOFF_SECONDS.Backward compatibility (6-month VHD window)
Existing dashboards filtering on
TaskName == "AKS.Bootstrap.SecureTLSBootstrapping"will NOT see the new.RetryCapReachedevents. To avoid regressing the per-event QoS metric this PR is fixing, dashboards must:AKS.Bootstrap.SecureTLSBootstrappingandAKS.Bootstrap.SecureTLSBootstrapping.RetryCapReached)..RetryCapReachedevent in the same provisioning session.I'll follow up with the QoS dashboard owners separately.
Scope notes
Azure/aks-secure-tls-bootstrapso this isn't an AgentBaker wrapper — out of scope for this PR.Tests
secure_tls_bootstrap_retry_cap_spec.sh(first invocation, Nth under cap, exact cap, sentinel persistence, env-var overrides, missing state dir, max-backoff respected). 3 newItblocks incse_config_spec.shfor state-dir wipe and env-var forwarding. All passing, shellcheck clean.parser_test.go+baker_test.gocover all 4 new fields → env-var mappings.go test ./pkg/agent/...andcd aks-node-controller && go test ./parser/...both green.Acceptance criteria (from ADO)
max_attempts).max_total_seconds)..RetryCapReachedTaskName suffix) for Kusto / dashboards.BootstrappingConfig.Pre-existing failures (NOT introduced by this PR)
TestDownloadHotfix_MatchingBaseUpgradesinaks-node-controller/hotfix_test.go— date-sensitive, fails on baselinemain.make validate-shellSC3014 failures in unrelated files — present on baselinemain.🤖 Generated by GitHub Copilot