stls: resolve apiserver IP locally and inject as APISERVER_IP env var #8654
djsly wants to merge 2 commits into
When DNS is broken at node provisioning time (CoreDNS not ready,
systemd-resolved race, custom DNS misconfig), the STLS client retries
the bootstrap gRPC dial forever with the error:
name resolver error: produced zero addresses
This was the root cause of the Jun 5-6 stuck-VM incident (VMId
b409a057-44a0-4b06-a6de-2e24e59a90a5, 53+ hours of retries).
configureAndStartSecureTLSBootstrapping now resolves the apiserver
IP locally and writes it into /etc/default/secure-tls-bootstrap as
APISERVER_IP=<addr>. The companion STLS client change reads this env
var and dials the IP literal via grpc passthrough://, bypassing the
gRPC dns:/// resolver entirely.
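A minimal sketch of how the EnvironmentFile write could look, assuming the `APISERVER_IP` line is simply left out when no address was found. The function name `write_stls_env_file` and its arguments are hypothetical, and the real contents of the other variables in the file are not shown in this PR.

```shell
# Hypothetical helper, not the code from cse_config.sh.
# The body runs in a subshell so its variables stay local in plain sh.
write_stls_env_file() (
    env_file="$1"         # e.g. /etc/default/secure-tls-bootstrap
    bootstrap_flags="$2"  # placeholder for whatever flags the unit needs
    apiserver_ip="$3"     # may be empty when resolution failed

    {
        printf 'BOOTSTRAP_FLAGS=%s\n' "$bootstrap_flags"
        # Emit the line only when an address was actually found.
        if [ -n "$apiserver_ip" ]; then
            printf 'APISERVER_IP=%s\n' "$apiserver_ip"
        fi
    } > "$env_file"
)
```

Calling `write_stls_env_file /tmp/stls.env "--example-flag" "10.0.0.1"` would produce a two-line file, while an empty third argument produces only the `BOOTSTRAP_FLAGS` line.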
Resolution order, all best-effort:
1. If API_SERVER_NAME is already an IPv4 literal, use as-is.
2. For *.privatelink.* hosts, query the IMDS aksAPIServerIPAddress
tag (same source reconcile-private-hosts.sh uses; works even
when cluster DNS is already broken at CSE time).
3. getent ahostsv4, then ahostsv6.
4. Final sanity strip: any value not matching [0-9a-fA-F:.] is
discarded.
If every step fails, APISERVER_IP stays empty, the env var line is
omitted, and the STLS client falls back to its existing FQDN dial
path. CSE never fails on this resolution.
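The resolution order above can be sketched roughly as follows. This is an illustration under assumptions, not the code added to `cse_config.sh`: the function name `resolve_apiserver_ip` is hypothetical, and taking the first `getent` result is a guess at how a single address is picked. The IMDS pipeline is the one quoted in the review comment on this PR.

```shell
# Hypothetical helper. Prints one address for the API server,
# or an empty line if nothing could be resolved. Never fails.
resolve_apiserver_ip() (
    name="$1"
    ip=""

    # 1. Already an IPv4 literal: use it as-is.
    if printf '%s\n' "$name" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'; then
        ip="$name"
    fi

    # 2. Private-link hosts: read the aksAPIServerIPAddress tag from IMDS.
    if [ -z "$ip" ]; then
        case "$name" in
            *.privatelink.*)
                ip=$(curl -sSL -m 5 -H "Metadata: true" \
                    "http://169.254.169.254/metadata/instance/compute/tags?api-version=2019-03-11&format=text" 2>/dev/null \
                    | tr ';' '\n' \
                    | awk -F: 'tolower($1) == "aksapiserveripaddress" { print $2; exit }') || ip=""
                ;;
        esac
    fi

    # 3. Local resolver: IPv4 first, then IPv6 (first result only, an assumption).
    if [ -z "$ip" ]; then
        ip=$(getent ahostsv4 "$name" 2>/dev/null | awk 'NR == 1 { print $1 }') || ip=""
    fi
    if [ -z "$ip" ]; then
        ip=$(getent ahostsv6 "$name" 2>/dev/null | awk 'NR == 1 { print $1 }') || ip=""
    fi

    # 4. Sanity strip: drop any value containing a character outside [0-9a-fA-F:.].
    case "$ip" in
        *[!0-9a-fA-F:.]*) ip="" ;;
    esac

    printf '%s\n' "$ip"
)
```

Every step is best-effort: a failing `curl` or `getent` only leaves `ip` empty, so the caller can decide to omit the env var line instead of failing provisioning.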
Backward / forward compat (all four cells safe):
- Old CSE + old client: unchanged (FQDN dial).
- Old CSE + new client: APISERVER_IP unset, FQDN fallback.
- New CSE + old client: ignores the env var, FQDN dial.
- New CSE + new client: dials IP, no DNS at gRPC time.
Covers both legacy CSE and aks-node-controller paths (ANC execs
cse_main.sh -> cse_config.sh, so one shell fix covers both).
Refs: AB#38327357
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… config
Adds Test_Ubuntu2204_SecureTLSBootstrapping_APIServerIPEnvVar, which
provisions an STLS-enabled Ubuntu 22.04 node and verifies that
/etc/default/secure-tls-bootstrap contains APISERVER_IP=. This guards
the new resolver block in configureAndStartSecureTLSBootstrapping
against regressions.
The full DNS-blackhole e2e test (assert STLS bootstrap succeeds with
DNS broken) requires the companion STLS client binary baked into the
VHD and is tracked as a follow-up.
Refs: AB#38327357
Companion STLS client PR: Azure/aks-secure-tls-bootstrap#180
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR Title Lint Failed ❌
Your PR title doesn't follow the expected format. Please update your PR title to follow one of the accepted patterns, such as the Conventional Commits format. The lint check will run again automatically.
Pull request overview
This PR updates the Linux CSE secure-tls-bootstrap setup so it can pre-resolve the API server IP at provisioning time and inject it into the STLS client via an APISERVER_IP environment variable, avoiding infinite bootstrap retries when node-local DNS is unhealthy.
Changes:
- Resolve API server IP (IMDS tag for privatelink, else `getent` v4/v6) in `configureAndStartSecureTLSBootstrapping` and write it to `/etc/default/secure-tls-bootstrap` when non-empty.
- Extend ShellSpec coverage for the new resolution branches and ensure existing tests don't depend on live DNS/IMDS.
- Add an e2e smoke test validating that `APISERVER_IP=` is written for an STLS-enabled Ubuntu 22.04 scenario.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/cse_config.sh | Adds best-effort API server IP resolution and writes APISERVER_IP into the STLS systemd EnvironmentFile. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds unit tests for each resolution branch; stubs resolver commands by default for determinism. |
| e2e/scenario_test.go | Adds an e2e scenario to assert the APISERVER_IP line is emitted into /etc/default/secure-tls-bootstrap. |
    APISERVER_IP=$(curl -sSL -m 5 -H "Metadata: true" \
        "http://169.254.169.254/metadata/instance/compute/tags?api-version=2019-03-11&format=text" 2>/dev/null \
        | tr ';' '\n' \
        | awk -F: 'tolower($1) == "aksapiserveripaddress" { print $2; exit }')
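To make the parsing step in the snippet above easier to review, here is a small offline sketch that feeds the same `tr`/`awk` pipeline from a sample string instead of IMDS. The sample tag names and values are made up, and `parse_apiserver_ip_tag` is a hypothetical wrapper; the only assumption about IMDS is that `format=text` returns `name:value` pairs joined by `;`.

```shell
# Made-up payload in the shape IMDS uses for tags with format=text.
sample_tags='env:dev;aksAPIServerIPAddress:10.1.2.3;owner:team-a'

# Same tr/awk pipeline as the reviewed snippet, reading from stdin.
parse_apiserver_ip_tag() {
    tr ';' '\n' | awk -F: 'tolower($1) == "aksapiserveripaddress" { print $2; exit }'
}

# Prints the value of the aksAPIServerIPAddress tag from the sample.
printf '%s' "$sample_tags" | parse_apiserver_ip_tag
```

Because the field separator is `:`, only the text up to the next colon is printed. That is fine for an IPv4 value, but a tag value that itself contains colons (for example an IPv6 address) would be cut short by this pipeline.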
AgentBaker Linux PR gate: E2E failure (needs deeper triage)
Failing leaves (all 3 on the same parent test):
Signal (strongly correlated with sibling PR #8653): Build 167080476 (PR #8653) for the same author and overlapping STLS workstream shows the same …
This PR's changes:
Confidence: Medium-low on root cause without the actual error message at …
Strongest alternative (likely the truth): shared early-setup failure in …
Recommended next action (owner: PR author):
Posted by Clawpilot AgentBaker gate detective.
Summary
Eliminates infinite STLS bootstrap retries caused by node-local DNS failures by injecting a pre-resolved API server IP into the STLS client via an `APISERVER_IP` environment variable.
Tracked by AB#38327357 (parent feature AB#34681743, Cameron Meissner).
Background
When DNS is broken on a node at STLS bootstrap time (CoreDNS not ready, systemd-resolved race, custom DNS misconfig), the STLS client emits `name resolver error: produced zero addresses` and retries forever (`retry.WithMax(math.MaxUint)` in the gRPC interceptor). Root cause: `client/internal/bootstrap/grpc.go::getServiceClient` builds the dial target as `fmt.Sprintf("%s:443", cfg.APIServerFQDN)`, so gRPC's built-in `dns` resolver re-resolves the FQDN on every retry attempt.
Reference incident: VMId `b409a057-44a0-4b06-a6de-2e24e59a90a5`, 53+ hours of retries on Jun 5–6.
What this PR does
`configureAndStartSecureTLSBootstrapping` in `parts/linux/cloud-init/artifacts/cse_config.sh` now resolves the API server IP at provisioning time and writes it as `APISERVER_IP=<addr>` into `/etc/default/secure-tls-bootstrap` (the systemd `EnvironmentFile` for `secure-tls-bootstrap.service`).
Resolution order:
1. `API_SERVER_NAME` is already an IPv4 literal → use it as-is.
2. `*.privatelink.*` hosts → IMDS instance tag `aksAPIServerIPAddress` (same source as `reconcile-private-hosts.sh`; survives cluster DNS failures).
3. `getent ahostsv4 ${API_SERVER_NAME}` (IPv4 preferred).
4. `getent ahostsv6 ${API_SERVER_NAME}` (IPv6 fallback).
5. Final sanity strip: any result not matching `[0-9a-fA-F:.]` is discarded.
If every step fails (e.g., DNS already down on a public cluster at CSE time), `APISERVER_IP` stays empty, the line is not emitted, and STLS falls back to its existing FQDN-dial behaviour, so there is no regression.
The companion Azure/aks-secure-tls-bootstrap#180 PR:
- Adds `APIServerIP` to `bootstrap.Config`, populated from `os.Getenv("APISERVER_IP")`.
- Switches `getServiceClient` to a `passthrough:///` dial target when the IP is set, with `tls.Config.ServerName = cfg.APIServerFQDN` and `grpc.WithAuthority(net.JoinHostPort(cfg.APIServerFQDN, "443"))` so TLS SAN validation and HTTP/2 `:authority` still resolve against the FQDN. The gRPC `dns` resolver is bypassed entirely.
Backward / forward compatibility (6-month VHD window)
- Old CSE + new client: `APISERVER_IP` unset → STLS falls back to FQDN dial. Identical to status quo.
All four cells are safe: the fix only activates when both sides are new.
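The client-side choice between the two dial paths can be illustrated with a short sketch. The real logic is Go code in the STLS client and is not part of this PR; the version below is written in shell only to match the other snippets here. The function name `dial_target`, the exact `passthrough:///<ip>:443` form, and the bracketing of IPv6 literals are assumptions, and the TLS `ServerName` and `:authority` overrides are not modelled.

```shell
# Illustration only: picks a gRPC dial target from the FQDN and an optional IP.
dial_target() (
    fqdn="$1"
    ip="$2"   # value of APISERVER_IP, possibly empty

    if [ -z "$ip" ]; then
        # Env var unset or empty: existing FQDN dial path (DNS at gRPC time).
        printf '%s:443\n' "$fqdn"
    else
        case "$ip" in
            # An IPv6 literal needs brackets before the port is appended.
            *:*) printf 'passthrough:///[%s]:443\n' "$ip" ;;
            *)   printf 'passthrough:///%s:443\n' "$ip" ;;
        esac
    fi
)
```

With an empty second argument the function returns the FQDN target, which corresponds to the fallback cells of the compatibility matrix; a non-empty IP yields a `passthrough:///` target that skips name resolution.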
Why this lives in `cse_config.sh` and not RP / parser
STLS is started at `cse_main.sh` line ~390, before the cluster-validation `nslookup` at line ~515. We cannot rely on a side effect of that later DNS call. Resolving and writing the env var inside `configureAndStartSecureTLSBootstrapping`, right before the `EnvironmentFile` write, is the only place that gives us both:
- … `getent` can run).
- … `APISERVER_IP` and `BOOTSTRAP_FLAGS` into the same `EnvironmentFile`.
No RP-side / `pkg/agent/datamodel/types.go` / proto changes: the IP is resolved locally by CSE.
aks-node-controller parity
The ANC provisioning path invokes the same `configureAndStartSecureTLSBootstrapping` function via `cse_main.sh`, so no additional ANC code change is required. Confirmed by grep: no separate STLS bootstrap path exists in `aks-node-controller/`.
Testing
Unit (ShellSpec)
`spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh` adds 6 new `It` blocks covering every resolver branch:
- `API_SERVER_NAME` is an IPv4 literal → echoed as-is.
- `getent ahostsv4` returns an IPv4 → captured, written.
- `ahostsv4` fails, `ahostsv6` returns an IPv6 → captured.
- `*.privatelink.*` host + IMDS returns valid tag → IP from IMDS, no `getent` call.
- `*.privatelink.*` host + IMDS returns no tag → falls through to DNS.
- Every resolver fails → `APISERVER_IP=` line absent, CSE does not fail.
Full suite: 97/97 green under podman + the standard `aksdataplanedev.azurecr.io/shellspec` base image.
E2E
New test `Test_Ubuntu2204_SecureTLSBootstrapping_APIServerIPEnvVar` in `e2e/scenario_test.go` provisions an STLS-enabled Ubuntu 22.04 node and validates that `/etc/default/secure-tls-bootstrap` contains `APISERVER_IP=`. This is a smoke test of the AgentBaker side.
The full DNS-blackhole e2e (assert STLS bootstrap succeeds with DNS broken) requires the new STLS client binary baked into the VHD and will be added in a follow-up PR once #180 lands.
Lint / generate
- `make validate-shell`: no new shellcheck issues introduced (pre-existing SC3014 warnings unchanged).
- `make generate`: no snapshot drift (verified via `GOPROXY=https://proxy.golang.org,direct GENERATE_TEST_DATA="true" go test ./pkg/agent/...`).
Risks / open items
- DNS already down at CSE time on a public cluster: `getent` fails, `APISERVER_IP=""`, STLS falls back to FQDN dial, same as today. Not worse, but the fix doesn't help this narrow case. A future RP-side enhancement is the only complete fix; out of scope.
- `kube-apiserver-proxy` literal naming (from the ADO item): the string does not appear in either repo. The real dial target is `${APIServerFQDN}`. Flagging for cross-check with @cameronmeissner before landing.
Related work items
🤖 Generated by GitHub Copilot