
stls: resolve apiserver IP locally and inject as APISERVER_IP env var #8654

Open
djsly wants to merge 2 commits into main from djsly/stls-skip-dns-lookup

@djsly djsly commented Jun 7, 2026

Collaborator

Summary

Eliminates infinite STLS bootstrap retries caused by node-local DNS failures by injecting a pre-resolved API server IP into the STLS client via an APISERVER_IP environment variable.

Tracked by AB#38327357 (parent feature AB#34681743, Cameron Meissner).

Background

When DNS is broken on a node at STLS bootstrap time (CoreDNS not ready, systemd-resolved race, custom DNS misconfig), the STLS client emits

rpc error: code = DeadlineExceeded ... name resolver error: produced zero addresses

and retries forever (retry.WithMax(math.MaxUint) in the gRPC interceptor). Root cause: client/internal/bootstrap/grpc.go::getServiceClient builds the dial target as fmt.Sprintf("%s:443", cfg.APIServerFQDN), so gRPC's built-in dns resolver re-resolves the FQDN on every retry attempt.

Reference incident: VMId b409a057-44a0-4b06-a6de-2e24e59a90a5, 53+ hours of retries on Jun 5–6.

What this PR does

configureAndStartSecureTLSBootstrapping in parts/linux/cloud-init/artifacts/cse_config.sh now resolves the API server IP at provisioning time and writes it as APISERVER_IP=<addr> into /etc/default/secure-tls-bootstrap (the systemd EnvironmentFile for secure-tls-bootstrap.service).

Resolution order:

  1. If API_SERVER_NAME is already an IPv4 literal → use it as-is.
  2. For *.privatelink.* hosts → IMDS instance tag aksAPIServerIPAddress (same source as reconcile-private-hosts.sh; survives cluster DNS failures).
  3. getent ahostsv4 ${API_SERVER_NAME} (IPv4 preferred).
  4. getent ahostsv6 ${API_SERVER_NAME} (IPv6 fallback).
  5. Any non-[0-9a-fA-F:.] result is discarded.

If every step fails (e.g., DNS already down on a public cluster at CSE time), APISERVER_IP stays empty, the line is not emitted, and STLS falls back to its existing FQDN-dial behaviour — no regression.
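
The resolution order above can be sketched in shell roughly as follows. This is a minimal illustration, not the code from the PR: the function names is_ipv4_literal, imds_apiserver_ip, sanitize_ip, and resolve_apiserver_ip are hypothetical, and the IMDS query simply mirrors the excerpt quoted in the review comment on this PR.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical, not taken from cse_config.sh.

# Succeeds when $1 has dotted-quad IPv4 shape (octet ranges are not checked).
is_ipv4_literal() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# Prints the value of the aksAPIServerIPAddress instance tag from IMDS, if any.
imds_apiserver_ip() {
    curl -sSL -m 5 -H "Metadata: true" \
        "http://169.254.169.254/metadata/instance/compute/tags?api-version=2019-03-11&format=text" 2>/dev/null \
        | tr ';' '\n' \
        | awk -F: 'tolower($1) == "aksapiserveripaddress" { print $2; exit }'
}

# Prints $1 unchanged if it contains only IP-literal characters, else prints nothing.
sanitize_ip() {
    case "$1" in
        ''|*[!0-9a-fA-F:.]*) ;;          # empty or unexpected characters: discard
        *) printf '%s\n' "$1" ;;
    esac
}

# Prints a best-effort IP for API server host $1; prints nothing if every step fails.
# Always returns success so the caller never fails on resolution.
resolve_apiserver_ip() {
    local host="$1" ip=""
    if is_ipv4_literal "$host"; then
        ip="$host"
    fi
    if [ -z "$ip" ]; then
        case "$host" in
            *.privatelink.*) ip="$(sanitize_ip "$(imds_apiserver_ip)")" ;;
        esac
    fi
    # getent prints "<address> <socktype> [name]" per line; keep the first address.
    [ -n "$ip" ] || ip="$(getent ahostsv4 "$host" 2>/dev/null | awk 'NR == 1 { print $1 }')"
    [ -n "$ip" ] || ip="$(getent ahostsv6 "$host" 2>/dev/null | awk 'NR == 1 { print $1 }')"
    sanitize_ip "$ip"
}
```

A caller would run APISERVER_IP="$(resolve_apiserver_ip "${API_SERVER_NAME}")" and treat an empty result as "keep the FQDN dial path".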

The companion Azure/aks-secure-tls-bootstrap#180 PR:

  • Adds APIServerIP to bootstrap.Config, populated from os.Getenv("APISERVER_IP").
  • Switches getServiceClient to a passthrough:/// dial target when the IP is set, with tls.Config.ServerName = cfg.APIServerFQDN and grpc.WithAuthority(net.JoinHostPort(cfg.APIServerFQDN, "443")) so TLS SAN validation and HTTP/2 :authority still resolve against the FQDN. The gRPC dns resolver is bypassed entirely.

Backward / forward compatibility (6-month VHD window)

| AgentBaker (CSE) | STLS client binary | Behaviour |
| --- | --- | --- |
| Old (no env var) | Old (no env reader) | Status quo: DNS dial on every retry. |
| Old (no env var) | New (env reader) | APISERVER_IP unset → STLS falls back to FQDN dial. Identical to status quo. |
| New (writes env var) | Old (no env reader) | Old binary silently ignores unknown env vars. FQDN dial. Identical to status quo. |
| New (writes env var) | New (env reader) | Fix activates: STLS dials the IP literal, no DNS at gRPC time. |

All four cells are safe — the fix only activates when both sides are new.

Why this lives in cse_config.sh and not RP / parser

STLS is started at cse_main.sh line ~390 — before the cluster-validation nslookup at line ~515. We cannot rely on a side effect of that later DNS call. Resolving and writing the env var inside configureAndStartSecureTLSBootstrapping, right before the EnvironmentFile write, is the only place that gives us both:

  • A live cluster context (so the IMDS tag and getent can run).
  • An atomic write of APISERVER_IP and BOOTSTRAP_FLAGS into the same EnvironmentFile.

No RP-side / pkg/agent/datamodel/types.go / proto changes — IP is resolved locally by CSE.
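
As a rough sketch of the single-write step described in this section, the helper below emits BOOTSTRAP_FLAGS and, only when an IP was resolved, APISERVER_IP into one file. The function name write_stls_env_file, its arguments, and the temp-file-then-rename approach are assumptions made for illustration; the PR's actual write logic may differ.

```shell
# Hypothetical helper, not taken from cse_config.sh.
# $1 = EnvironmentFile path, $2 = bootstrap flags, $3 = resolved IP (may be empty).
write_stls_env_file() {
    local env_file="$1" flags="$2" ip="$3" tmp
    # Build the file next to its target so the final rename stays on one filesystem.
    tmp="$(mktemp "${env_file}.XXXXXX")" || return 1
    {
        printf 'BOOTSTRAP_FLAGS=%s\n' "$flags"
        # Omit the line entirely when resolution failed, so the client keeps dialling the FQDN.
        if [ -n "$ip" ]; then
            printf 'APISERVER_IP=%s\n' "$ip"
        fi
    } > "$tmp"
    # The rename replaces the target in one step, so readers never see a half-written file.
    mv -f "$tmp" "$env_file"
}
```

Under these assumptions the call site would be write_stls_env_file /etc/default/secure-tls-bootstrap "${BOOTSTRAP_FLAGS}" "${APISERVER_IP}".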

aks-node-controller parity

The ANC provisioning path invokes the same configureAndStartSecureTLSBootstrapping function via cse_main.sh, so no additional ANC code change is required. Confirmed by grep: no separate STLS bootstrap path exists in aks-node-controller/.

Testing

Unit (ShellSpec)

spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh adds 7 new It blocks covering every resolver branch:

  1. API_SERVER_NAME is an IPv4 literal → echoed as-is.
  2. getent ahostsv4 returns an IPv4 → captured, written.
  3. ahostsv4 fails, ahostsv6 returns an IPv6 → captured.
  4. *.privatelink.* host + IMDS returns valid tag → IP from IMDS, no getent call.
  5. *.privatelink.* host + IMDS returns no tag → falls through to DNS.
  6. All resolvers fail → APISERVER_IP= line absent, CSE does not fail.
  7. IMDS returns garbage → discarded by sanity-strip, falls through to DNS.

Full suite: 97/97 green under podman + the standard aksdataplanedev.azurecr.io/shellspec base image.

E2E

New test Test_Ubuntu2204_SecureTLSBootstrapping_APIServerIPEnvVar in e2e/scenario_test.go provisions an STLS-enabled Ubuntu 22.04 node and validates that /etc/default/secure-tls-bootstrap contains APISERVER_IP=. This is a smoke test of the AgentBaker side.
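
The assertion amounts to a line-anchored match on the EnvironmentFile. A minimal shell equivalent is shown below as an assumption about what the check does; the e2e test itself lives in a Go file and its exact validator is not shown in this PR description. The name has_apiserver_ip_line is hypothetical.

```shell
# Hypothetical check, mirroring what the smoke test is described as validating.
# Succeeds when the file at $1 contains a line starting with APISERVER_IP=.
has_apiserver_ip_line() {
    grep -q '^APISERVER_IP=' "$1"
}
```

On a provisioned node this could be run as has_apiserver_ip_line /etc/default/secure-tls-bootstrap.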

The full DNS-blackhole e2e (assert STLS bootstrap succeeds with DNS broken) requires the new STLS client binary baked into the VHD and will be added in a follow-up PR once #180 lands.

Lint / generate

  • make validate-shell: no new shellcheck issues introduced (pre-existing SC3014 warnings unchanged).
  • make generate: no snapshot drift (verified via GOPROXY=https://proxy.golang.org,direct GENERATE_TEST_DATA="true" go test ./pkg/agent/...).

Risks / open items

  • CSE-time DNS already broken on a public cluster: no IMDS tag (private only), getent fails, APISERVER_IP="", STLS falls back to FQDN dial — same as today. Not worse, but the fix doesn't help this narrow case. Future RP-side enhancement is the only complete fix; out of scope.
  • API server IP changes mid-bootstrap: extremely rare within the short STLS window; it would surface as a TCP/TLS error rather than a DNS error, and retries would eventually succeed. Acceptable; orthogonal to AB#38327355 (retry cap).
  • Windows: out of scope.
  • kube-apiserver-proxy literal naming (from the ADO item): the string does not appear in either repo. The real dial target is ${APIServerFQDN}. Flagging for cross-check with @cameronmeissner before landing.

Related work items

  • AB#38327357 — this PR
  • AB#34681743 — parent STLS Phase 1 feature
  • AB#38327355 — STLS retry cap (sibling)
  • AB#38327356 — per-VM STLS QoS metric (sibling)

🤖 Generated by GitHub Copilot

djsly and others added 2 commits June 7, 2026 11:52
When DNS is broken at node provisioning time (CoreDNS not ready,
systemd-resolved race, custom DNS misconfig), the STLS client retries
the bootstrap gRPC dial forever with the error:

  name resolver error: produced zero addresses

This was the root cause of the Jun 5-6 stuck-VM incident (VMId
b409a057-44a0-4b06-a6de-2e24e59a90a5, 53+ hours of retries).

configureAndStartSecureTLSBootstrapping now resolves the apiserver
IP locally and writes it into /etc/default/secure-tls-bootstrap as
APISERVER_IP=<addr>. The companion STLS client change reads this env
var and dials the IP literal via grpc passthrough://, bypassing the
gRPC dns:/// resolver entirely.

Resolution order, all best-effort:
  1. If API_SERVER_NAME is already an IPv4 literal, use as-is.
  2. For *.privatelink.* hosts, query the IMDS aksAPIServerIPAddress
     tag (same source reconcile-private-hosts.sh uses; works even
     when cluster DNS is already broken at CSE time).
  3. getent ahostsv4, then ahostsv6.
  4. Final sanity strip: any value not matching [0-9a-fA-F:.] is
     discarded.

If every step fails APISERVER_IP stays empty, the env var line is
omitted, and the STLS client falls back to its existing FQDN dial
path. CSE never fails on this resolution.

Backward / forward compat (all four cells safe):
  - Old CSE + old client: unchanged (FQDN dial).
  - Old CSE + new client: APISERVER_IP unset, FQDN fallback.
  - New CSE + old client: ignores the env var, FQDN dial.
  - New CSE + new client: dials IP, no DNS at gRPC time.

Covers both legacy CSE and aks-node-controller paths (ANC execs
cse_main.sh -> cse_config.sh, so one shell fix covers both).

Refs: AB#38327357

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… config

Adds Test_Ubuntu2204_SecureTLSBootstrapping_APIServerIPEnvVar which
provisions an STLS-enabled Ubuntu 22.04 node and verifies that
/etc/default/secure-tls-bootstrap contains APISERVER_IP=. This guards
the new resolver block in configureAndStartSecureTLSBootstrapping
against regressions.

The full DNS-blackhole e2e test (assert STLS bootstrap succeeds with
DNS broken) requires the companion STLS client binary baked into the
VHD and is tracked as a follow-up.

Refs: AB#38327357
Companion STLS client PR: Azure/aks-secure-tls-bootstrap#180

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot commented Jun 7, 2026

Contributor

PR Title Lint Failed ❌

Current Title: stls: resolve apiserver IP locally and inject as APISERVER_IP env var

Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:

Conventional Commits Format:

  • feat: add new feature - for new features
  • fix: resolve bug in component - for bug fixes
  • docs: update README - for documentation changes
  • refactor: improve code structure - for refactoring
  • test: add unit tests - for test additions
  • chore: remove dead code - for maintenance tasks
  • chore(deps): update dependencies - for updating dependencies
  • ci: update build pipeline - for CI/CD changes

Guidelines:

  • Use lowercase for the type and description
  • Keep the description concise but descriptive
  • Use imperative mood (e.g., "add" not "adds" or "added")
  • Don't end with a period

Examples:

  • Valid: feat(windows): add secure TLS bootstrapping for Windows nodes
  • Valid: fix: resolve kubelet certificate rotation issue
  • Valid: docs: update installation guide
  • Invalid: Added new feature
  • Invalid: Fix bug.
  • Invalid: Update docs

Please update your PR title and the lint check will run again automatically.

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the Linux CSE secure-tls-bootstrap setup so it can pre-resolve the API server IP at provisioning time and inject it into the STLS client via an APISERVER_IP environment variable, avoiding infinite bootstrap retries when node-local DNS is unhealthy.

Changes:

  • Resolve API server IP (IMDS tag for privatelink, else getent v4/v6) in configureAndStartSecureTLSBootstrapping and write it to /etc/default/secure-tls-bootstrap when non-empty.
  • Extend ShellSpec coverage for the new resolution branches and ensure existing tests don’t depend on live DNS/IMDS.
  • Add an e2e smoke test validating that APISERVER_IP= is written for an STLS-enabled Ubuntu 22.04 scenario.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_config.sh | Adds best-effort API server IP resolution and writes APISERVER_IP into the STLS systemd EnvironmentFile. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds unit tests for each resolution branch; stubs resolver commands by default for determinism. |
| e2e/scenario_test.go | Adds an e2e scenario to assert the APISERVER_IP line is emitted into /etc/default/secure-tls-bootstrap. |

Comment on lines +564 to +567
APISERVER_IP=$(curl -sSL -m 5 -H "Metadata: true" \
"http://169.254.169.254/metadata/instance/compute/tags?api-version=2019-03-11&format=text" 2>/dev/null \
| tr ';' '\n' \
| awk -F: 'tolower($1) == "aksapiserveripaddress" { print $2; exit }')

djsly commented Jun 8, 2026

Collaborator Author

AgentBaker Linux PR gate — E2E failure (needs deeper triage)

  • Run: 167080536 (failed)
  • Failed task: Run AgentBaker E2E → AzureCLI → exit 1 (DONE 460 tests, 95 skipped, 3 failures in 1501.866s)

Failing leaves (all 3 on the same parent test):

  • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/default (2.76s) — test_helpers.go:227 🔴 (early-fail, empty error string in build log)
  • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/scriptless_nbc (0.24s) — same
  • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated (root container, 0.10s)

Signal — strongly correlated with sibling PR #8653: Build 167080476 (PR #8653) for the same author and overlapping STLS workstream shows the same Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/{default,scriptless_nbc} failure shape with empty error strings at test_helpers.go:227. Two PRs failing the same network-isolated scenario at sub-3s with empty error strings points away from a PR-specific behavioural regression and toward either (a) a shared early-setup error in this scenario (likely test-helper precondition / VMSS setup raising before the scenario body runs) or (b) a transient infra issue in the NetworkIsolated subnet path during this window.

This PR's changes:

  • e2e/scenario_test.go (+35), parts/linux/cloud-init/artifacts/cse_config.sh (+46, adds resolveAPIServerIP + APISERVER_IP env var injection), spec/.../cse_config_spec.sh (+145 ShellSpec).
  • The PR does NOT modify any image-pull or NetworkIsolated test wiring; the APISERVER_IP env injection runs in cse_config.sh before kubelet bootstrap and shouldn't gate image-pull identity binding.

Confidence: Medium-low on root cause without the actual error message at test_helpers.go:227 (build log shows the marker 🔴 FAIL: with an empty body — the real reason is in the per-scenario VM/log artifact). High that this is not caused by the APISERVER_IP injection in cse_config.sh (no path from APISERVER_IP env to ImagePullIdentityBinding test fixture).

Strongest alternative (likely the truth): shared early-setup failure in Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated (test fixture / VMSS-with-private-cluster precondition) — refuted as "PR-caused" by the same shape on the sibling PR #8653 that touches completely different STLS code.

Recommended next action (owner: PR author):

  1. Open the scenario-logs artifact for Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/default. test_helpers.go:227 is the per-scenario error printer; the real assertion is one frame above. The early <3s timing strongly suggests a pre-VMSS precondition (e.g. private-cluster networking setup, ACR-with-private-endpoint, identity binding RBAC).
  2. Rerun the failing job once. If it reproduces with the same sub-3s shape on both this PR and #8653 (Cap per-VM STLS retry attempts, AB#38327355), file a flake/infra tracker on the ImagePullIdentityBinding_NetworkIsolated scenario rather than blocking either PR.
  3. PR-quality note (not a fix request): the APISERVER_IP injection in cse_config.sh looks unrelated to this failure.

Posted by Clawpilot AgentBaker gate detective.
