Skip to content

Cap per-VM STLS retry attempts (AB#38327355)#8653

Open
djsly wants to merge 1 commit into
mainfrom
djsly/djsly-stls-retry-cap
Open

Cap per-VM STLS retry attempts (AB#38327355)#8653
djsly wants to merge 1 commit into
mainfrom
djsly/djsly-stls-retry-cap

Conversation

@djsly
Copy link
Copy Markdown
Collaborator

@djsly djsly commented Jun 7, 2026

What & why

ADO: AB#38327355

The aks-secure-tls-bootstrap client has no native long-running retry loop — each invocation makes a single attempt and exits. Re-invocation is driven by kubelet's Restart=always pulling the failed STLS unit via its Wants= relation.

On 2026-06-04, a single stuck VM (b409a057-44a0-4b06-a6de-2e24e59a90a5 in sub 37deca37-...) hit GetNonceFailure: produced zero addresses (DNS) and emitted 2,837 STLS failure events in 53+ hours. That single VM was 96% of all STLS fail events in eastus2euap on Jun 5–6 and dropped per-event QoS to ~93% even though only 68 VMs actually failed.

Approach

New Linux wrapper parts/linux/cloud-init/artifacts/secure-tls-bootstrap-retry-cap.sh that the systemd unit invokes instead of the binary directly. The wrapper:

  • Tracks attempt count + first-attempt timestamp in /var/lib/aks-secure-tls-bootstrap/, wiped per provisioning session by cse_config.sh and reset across reboots via boot-id check.
  • Applies exponential backoff (1 s → 300 s default cap) between attempts.
  • Caps per-session retries: 50 attempts OR 7200 s wall-clock, whichever fires first. Both bounds are env-var configurable from the apiserver.
  • On cap, emits a distinct guest-agent event with TaskName AKS.Bootstrap.SecureTLSBootstrapping.RetryCapReached carrying the last FinalErrorType, attempt count, elapsed seconds, and client version. Once capped, further invocations exit 0 immediately without re-running the binary or re-emitting events.

Defaults are conservative — a typical transient blip recovers well within the budget. A stuck VM under these defaults emits ~30 binary-failure events + 1 RetryCapReached event over ~2 h, then goes quiet (vs. the 2,837 events / 53 h observed today).

Full plumbing — 4 new knobs in both deployment paths

Legacy CSE: SecureTLSBootstrappingConfig fields in types.go → template funcs in baker.go → env vars in cse_cmd.sh → read by cse_config.sh and written to /etc/default/secure-tls-bootstrap.

aks-node-controller: 4 new proto fields (slots 16–19) in bootstrapping_config.proto → regenerated .pb.go → mapped in parser.go to the same env vars.

Env vars: SECURE_TLS_BOOTSTRAPPING_MAX_ATTEMPTS, ..._MAX_TOTAL_SECONDS, ..._INITIAL_BACKOFF_SECONDS, ..._MAX_BACKOFF_SECONDS.

Backward compatibility (6-month VHD window)

  • New VHD + new CSE → wrapper present, works.
  • New VHD + old CSE → wrapper present, old CSE doesn't pass the new env vars → wrapper uses safe defaults. ✅
  • Old VHD + new CSE → systemd unit file ships with the VHD (CSE only writes the drop-in), so old VHDs keep their old unit pointing directly at the binary. No regression. ✅

⚠️ Dashboard / Kusto coordination required

Existing dashboards filtering on TaskName == "AKS.Bootstrap.SecureTLSBootstrapping" will NOT see the new .RetryCapReached events. To avoid regressing the per-event QoS metric this PR is fixing, dashboards must:

  1. Union both TaskNames (AKS.Bootstrap.SecureTLSBootstrapping and AKS.Bootstrap.SecureTLSBootstrapping.RetryCapReached).
  2. When spam-counting per-VM events, exclude the parent TaskName for any VM that also has a .RetryCapReached event in the same provisioning session.

I'll follow up with the QoS dashboard owners separately.

Scope notes

  • Linux only. Windows STLS has the same crash-loop scenario but PowerShell parity needs its own design review — tracked in AB#38327356 / AB#38327357.
  • Long-term hardening: adding native state-tracking to Azure/aks-secure-tls-bootstrap so this isn't an AgentBaker wrapper — out of scope for this PR.

Tests

  • ShellSpec — 11 new tests in secure_tls_bootstrap_retry_cap_spec.sh (first invocation, Nth under cap, exact cap, sentinel persistence, env-var overrides, missing state dir, max-backoff respected). 3 new It blocks in cse_config_spec.sh for state-dir wipe and env-var forwarding. All passing, shellcheck clean.
  • Goparser_test.go + baker_test.go cover all 4 new fields → env-var mappings. go test ./pkg/agent/... and cd aks-node-controller && go test ./parser/... both green.

Acceptance criteria (from ADO)

  • ✅ Hard ceiling on attempts (max_attempts).
  • ✅ Hard ceiling on wall-clock time (max_total_seconds).
  • ✅ Exponential backoff with max delay between attempts.
  • ✅ Distinct terminal event (.RetryCapReached TaskName suffix) for Kusto / dashboards.
  • ✅ Transient retry behavior preserved (defaults allow ~30 retries over ~2 h).
  • ✅ All configurable from the apiserver-side BootstrappingConfig.

Pre-existing failures (NOT introduced by this PR)

  • TestDownloadHotfix_MatchingBaseUpgrades in aks-node-controller/hotfix_test.go — date-sensitive, fails on baseline main.
  • make validate-shell SC3014 failures in unrelated files — present on baseline main.

🤖 Generated by GitHub Copilot

The aks-secure-tls-bootstrap client has no native long-running retry
loop: each invocation makes a single attempt and exits. Re-invocation
is driven by kubelet's Restart=always pulling the failed STLS unit via
its Wants= relation.

On 2026-06-04, a single stuck VM in eastus2euap emitted 2,837 STLS
failure events over 53+ hours — 96% of all STLS fail events in the
region — dropping per-event QoS to ~93% even though only 68 VMs
actually failed.

Add a Linux-side wrapper script
(parts/linux/cloud-init/artifacts/secure-tls-bootstrap-retry-cap.sh)
that the systemd unit invokes instead of the binary directly. The
wrapper:

  - Tracks attempt count + first-attempt timestamp in
    /var/lib/aks-secure-tls-bootstrap/, wiped per provisioning session
    by cse_config.sh and reset across reboots via boot-id check.
  - Applies exponential backoff (1s -> 300s default) between attempts.
  - Caps per-session retries: 50 attempts OR 7200s wall-clock,
    whichever fires first. Both bounds are env-var configurable.
  - On cap, emits a distinct guest-agent event with TaskName
    'AKS.Bootstrap.SecureTLSBootstrapping.RetryCapReached' carrying
    the last FinalErrorType, attempt count, elapsed seconds, and
    client version. Once capped, further invocations exit 0
    immediately without re-running the binary or re-emitting events.

Plumb 4 new knobs end-to-end through both deployment paths:

  - Legacy CSE: SecureTLSBootstrappingConfig fields in types.go ->
    template funcs in baker.go -> env vars in cse_cmd.sh -> read by
    cse_config.sh and written to /etc/default/secure-tls-bootstrap.
  - aks-node-controller: 4 new proto fields (slots 16-19) in
    bootstrapping_config.proto -> regenerated .pb.go -> mapped in
    parser.go to the same env vars.

Backward-compat safe across the 6-month VHD window: the systemd unit
file ships with the VHD, so old VHDs keep their old unit pointing
directly at the binary. Old VHD + new CSE is therefore a no-op for
this change. New VHD + old CSE falls back to wrapper defaults.

Linux only — Windows tracked separately (AB#38327356/57).

Dashboards filtering on TaskName ==
'AKS.Bootstrap.SecureTLSBootstrapping' must be updated to also union
the .RetryCapReached suffix; spam-count denominators should exclude
the parent when the suffix fires for the same VM.

Tests:
  - ShellSpec: 11 new tests in secure_tls_bootstrap_retry_cap_spec.sh
    plus 3 new It blocks in cse_config_spec.sh for state-dir wipe and
    env-var forwarding.
  - Go: parser_test.go + baker_test.go assertions for the 4 new
    fields/env-var mappings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

PR Title Lint Failed ❌

Current Title: Cap per-VM STLS retry attempts (AB#38327355)

Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:

Conventional Commits Format:

  • feat: add new feature - for new features
  • fix: resolve bug in component - for bug fixes
  • docs: update README - for documentation changes
  • refactor: improve code structure - for refactoring
  • test: add unit tests - for test additions
  • chore: remove dead code - for maintenance tasks
  • chore(deps): update dependencies - for updating dependencies
  • ci: update build pipeline - for CI/CD changes

Guidelines:

  • Use lowercase for the type and description
  • Keep the description concise but descriptive
  • Use imperative mood (e.g., "add" not "adds" or "added")
  • Don't end with a period

Examples:

  • feat(windows): add secure TLS bootstrapping for Windows nodes
  • fix: resolve kubelet certificate rotation issue
  • docs: update installation guide
  • Added new feature
  • Fix bug.
  • Update docs

Please update your PR title and the lint check will run again automatically.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 7, 2026

The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJun 7, 2026, 4:02 PM

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a Linux-side retry-cap wrapper for Secure TLS Bootstrapping (STLS) to prevent runaway systemd/kubelet crash loops from generating excessive STLS failure events per VM, while keeping transient-retry behavior via bounded exponential backoff. It also wires four new retry-cap configuration knobs through both the legacy CSE path and the aks-node-controller path.

Changes:

  • Add secure-tls-bootstrap-retry-cap.sh wrapper and update the STLS systemd unit to invoke it instead of calling the STLS binary directly.
  • Plumb four new retry-cap settings (max attempts, max total seconds, initial backoff, max backoff) through AgentBaker templates and aks-node-controller proto/parser mapping to env vars.
  • Add ShellSpec and Go tests covering the wrapper behavior and the new configuration mappings.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
vhdbuilder/packer/vhd-image-builder-mariner.json Stage the new retry-cap wrapper script into the Mariner VHD build inputs.
vhdbuilder/packer/vhd-image-builder-mariner-cvm.json Stage the new retry-cap wrapper script into the Mariner CVM VHD build inputs.
vhdbuilder/packer/vhd-image-builder-mariner-arm64.json Stage the new retry-cap wrapper script into the Mariner ARM64 VHD build inputs.
vhdbuilder/packer/vhd-image-builder-flatcar.json Stage the new retry-cap wrapper script into the Flatcar VHD build inputs.
vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json Stage the new retry-cap wrapper script into the Flatcar ARM64 VHD build inputs.
vhdbuilder/packer/vhd-image-builder-cvm.json Stage the new retry-cap wrapper script into the CVM VHD build inputs.
vhdbuilder/packer/vhd-image-builder-base.json Stage the new retry-cap wrapper script into the base Linux VHD build inputs.
vhdbuilder/packer/vhd-image-builder-arm64-gen2.json Stage the new retry-cap wrapper script into the ARM64 Gen2 VHD build inputs.
vhdbuilder/packer/vhd-image-builder-arm64-gb.json Stage the new retry-cap wrapper script into the ARM64 GB VHD build inputs.
vhdbuilder/packer/vhd-image-builder-acl.json Stage the new retry-cap wrapper script into the ACL VHD build inputs.
vhdbuilder/packer/vhd-image-builder-acl-arm64.json Stage the new retry-cap wrapper script into the ACL ARM64 VHD build inputs.
vhdbuilder/packer/packer_source.sh Install the wrapper into /opt/azure/containers/ on the VHD with executable permissions.
spec/parts/linux/cloud-init/artifacts/secure_tls_bootstrap_retry_cap_spec.sh Add ShellSpec coverage for wrapper state handling, caps, and env overrides.
spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh Add ShellSpec coverage for state-dir wipe and env-var forwarding behavior.
pkg/agent/datamodel/types.go Add new SecureTLSBootstrappingConfig fields + getters for retry-cap knobs.
pkg/agent/baker.go Expose new template funcs for retry-cap fields so they can be rendered into env vars.
pkg/agent/baker_test.go Assert legacy CSE env-var rendering includes the four new retry-cap variables.
parts/linux/cloud-init/artifacts/secure-tls-bootstrap.service Switch systemd unit ExecStart to the wrapper script.
parts/linux/cloud-init/artifacts/secure-tls-bootstrap-retry-cap.sh Introduce wrapper implementing capped retries + capped terminal event emission.
parts/linux/cloud-init/artifacts/cse_config.sh Reset retry-cap state per provisioning session and forward retry-cap env vars to the default file.
parts/linux/cloud-init/artifacts/cse_cmd.sh Add env-var exports for the four new retry-cap knobs in legacy CSE command template.
aks-node-controller/proto/aksnodeconfig/v1/bootstrapping_config.proto Add proto fields for retry-cap knobs (aks-node-controller path).
aks-node-controller/pkg/gen/aksnodeconfig/v1/bootstrapping_config.pb.go Regenerate Go bindings for the updated proto.
aks-node-controller/parser/parser.go Map the new proto fields into CSE env vars.
aks-node-controller/parser/parser_test.go Extend parser tests to assert env vars exist for the new knobs.
Files not reviewed (1)
  • aks-node-controller/pkg/gen/aksnodeconfig/v1/bootstrapping_config.pb.go: Language not supported

Comment on lines 1 to 4
[Unit]
Description=AKS Secure TLS Bootstrap Client
ConditionPathExists=/opt/bin/aks-secure-tls-bootstrap-client
ConditionPathExists=/opt/azure/containers/secure-tls-bootstrap-retry-cap.sh
Wants=network-online.target
Comment on lines +95 to +110
compute_backoff() {
# Loop-double from INITIAL_BACKOFF_SECONDS, n times, capped at MAX_BACKOFF_SECONDS.
# Loop-based to avoid integer overflow on large attempts (2^63 etc.).
local n="$1"
local b="$INITIAL_BACKOFF_SECONDS"
local i=0
while [ "$i" -lt "$n" ]; do
b=$((b * 2))
if [ "$b" -ge "$MAX_BACKOFF_SECONDS" ]; then
echo "$MAX_BACKOFF_SECONDS"
return
fi
i=$((i + 1))
done
echo "$b"
}
Comment on lines +143 to +147
else
# Defensive fallback (jq is on the VHD, but if missing we still emit
# a valid event so the cap signal isn't lost).
message="{\"Status\":\"Failure\",\"Reason\":\"RetryCapReached\",\"Attempts\":\"$attempts\",\"ElapsedSeconds\":\"$elapsed\",\"FinalErrorType\":\"$final_error\",\"MaxAttempts\":\"$MAX_ATTEMPTS\",\"MaxTotalSeconds\":\"$MAX_TOTAL_SECONDS\",\"ClientVersion\":\"$client_version\"}"
fi
@djsly
Copy link
Copy Markdown
Collaborator Author

djsly commented Jun 8, 2026

AgentBaker Linux PR gate — E2E failure (mixed: 2 leaves likely shared infra/test-fixture, 2 leaves possibly PR-related)

  • Run: 167080476 (failed)
  • Failed task: Run AgentBaker E2E → AzureCLI → exit 1 (DONE 457 tests, 95 skipped, 7 failures in 1584.008s)

Failing leaves, grouped:

Group A — shared with sibling PR #8654, very likely NOT caused by this PR:

  • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/default (2.94s) — test_helpers.go:227 🔴, empty error body
  • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/scriptless_nbc (0.00s) — same
  • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated (root container)
    Sibling PR stls: resolve apiserver IP locally and inject as APISERVER_IP env var #8654 build 167080536 shows the identical sub-3s shape on the same scenario despite touching unrelated code (APISERVER_IP injection vs STLS retry cap). Two PRs failing the same NetworkIsolated leaf at sub-3s with empty errors strongly indicates a shared test-fixture / private-cluster precondition issue, not a PR regression.

Group B — possibly PR-correlated, needs the VM logs:

  • Test_Ubuntu2404Gen2_McrChinaCloud/scriptless_nbc (310.64s) — scenario_test.go:2382 🔴, empty error body
  • Test_Ubuntu2404Gen2/default (355.07s) — scenario_test.go:2327 🔴, empty error body
  • Test_Ubuntu2404Gen2_McrChinaCloud + Test_Ubuntu2404Gen2 (root containers)

This PR's changes (relevant slice):

  • New parts/linux/cloud-init/artifacts/secure-tls-bootstrap-retry-cap.sh (+318), modified secure-tls-bootstrap.service (+5/-2), cse_cmd.sh (+4), cse_config.sh (+21).
  • New pkg/agent/datamodel/types.go (+51) and protobuf additions in aks-node-controller/proto/aksnodeconfig/v1/bootstrapping_config.proto (+20) plus generated .pb.go (+145/-65).
  • Touches every vhdbuilder/packer/vhd-image-builder-*.json (+5 each) — packer_source.sh (+3) sources the new retry-cap script into the VHD.

The two Ubuntu2404 leaves (Group B) are scenarios that exercise CSE on freshly-baked 24.04 VHDs. The PR injects a new artifact into every VHD via packer_source.sh and wraps secure-tls-bootstrap with a retry cap; if the new retry-cap script returns non-zero on the AKSLTS / MCR-China cloud paths, CSE would fail and the scenario validators would log an empty 🔴 FAIL: at scenario_test.go:2327 / :2382 (those are the high-level scenario-runner error printers — the real reason is in the per-scenario VM logs).

Confidence:

  • Group A: Medium-high "not this PR" — cross-PR same-shape evidence is strong; the actual cause is in the NetworkIsolated test fixture / private-cluster setup.
  • Group B: Medium-low without the per-scenario VM logs — symptoms are scenario-level fail markers with empty bodies; the new STLS retry-cap script + secure-tls-bootstrap.service change are plausible mechanisms (especially for the China-cloud / AKSLTS paths) but not proven from the build-log signal alone.

Strongest alternative for Group B (less likely): generic E2E flake on Ubuntu2404 image bring-up — refuted by the deterministic shape across two leaves on the same image and the fact that other Ubuntu2404 scenarios in the same run did NOT fail.

Recommended next action (owner: PR author):

  1. Open scenario-logs for Test_Ubuntu2404Gen2/default and Test_Ubuntu2404Gen2_McrChinaCloud/scriptless_nbc and look for secure-tls-bootstrap.service status, the new retry-cap helper's stdout/stderr, and CSE exit codes. The :2327 / :2382 lines in scenario_test.go are the per-scenario error printer — the real assertion is one frame above.
  2. For Group A (ImagePullIdentityBinding_NetworkIsolated): coordinate with stls: resolve apiserver IP locally and inject as APISERVER_IP env var #8654 — likely a shared infra/test-fixture issue independent of either PR.
  3. Rerun the failing job; if Group B reproduces deterministically, suspect the retry-cap helper integration; if it goes green, treat as flake.

Posted by Clawpilot AgentBaker gate detective.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants