Skip to content

test: (scriptless) Enable scriptless phase 3 in AB e2es#8453

Open
lilypan26 wants to merge 57 commits into
mainfrom
lily/scriptless/phase-3-e2e
Open

test: (scriptless) Enable scriptless phase 3 in AB e2es#8453
lilypan26 wants to merge 57 commits into
mainfrom
lily/scriptless/phase-3-e2e

Conversation

@lilypan26
Copy link
Copy Markdown
Contributor

@lilypan26 lilypan26 commented May 5, 2026

What this PR does / why we need it:

  • Enables scriptless phase 3 in ab e2es under tag EnableScriptlessANC
    • This tag means both ANC and NBC cse cmd will be provided, tests should validate that there are no diffs between the generated provisioning env vars
    • For now, NBC cse cmd will be used for provisioning until there are no more diffs, after which we can switch to using ANC
  • Adds CustomDataPhase3 to provide both ANC and NBC cse cmd to AKS node controller

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings May 5, 2026 16:25
@lilypan26 lilypan26 changed the title Lily/scriptless/phase 3 e2e test(scriptless): Enable scriptless phase 3 in AB e2es May 5, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Enables “scriptless phase 3” coverage in the AgentBaker e2e suite by adding a new scriptless_anc subtest path that provisions nodes using AKSNodeConfig/aks-node-controller, plus wiring many existing scenarios to provide an AKSNodeConfigMutator.

Changes:

  • Added a new scriptless_anc subtest variant and runtime flag (EnableScriptlessANC) to drive scriptless phase-3 execution.
  • Refactored/expanded the e2e “aks-node-controller hack” customData generation to optionally include AKSNodeConfig and/or an nbc-cmd script.
  • Updated many scenarios to set equivalent AKSNodeConfigMutator fields alongside existing NBC mutators.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
e2e/vmss.go Refactors customData hack generation and wires scriptless ANC + NBC cmd hack paths into VMSS creation.
e2e/types.go Adds EnableScriptlessANC and adjusts kubelet-config-file detection logic for scriptless ANC scenarios.
e2e/test_helpers.go Adds scriptless_anc subtest generation and new gating helper.
e2e/scenario_test.go Adds AKSNodeConfigMutator coverage across many existing scenarios.

Comment thread e2e/vmss.go Outdated
Comment thread e2e/vmss.go Outdated
Comment thread e2e/vmss.go Outdated
Comment thread e2e/vmss.go
Comment thread e2e/test_helpers.go
Comment thread e2e/vmss.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 29 changed files in this pull request and generated 3 comments.

Comment thread aks-node-controller/pkg/nodeconfigutils/utils.go Outdated
Comment thread e2e/vmss.go Outdated
Comment thread e2e/test_helpers.go
Copilot AI review requested due to automatic review settings June 5, 2026 05:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 27 changed files in this pull request and generated 2 comments.

Comment thread aks-node-controller/pkg/nodeconfigutils/utils_test.go Outdated
Comment thread e2e/test_helpers.go
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 25 changed files in this pull request and generated 4 comments.

Comment thread pkg/agent/baker.go Outdated
Comment thread e2e/test_helpers.go
Comment thread aks-node-controller/parser/parser.go
Comment thread aks-node-controller/app.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 26 changed files in this pull request and generated 3 comments.

Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Comment thread aks-node-controller/parser/helper.go
Comment on lines 325 to +339
var diffs []string
for _, key := range sortedKeys {
pcVal, inPC := pcEnv[key]
nbcVal, inNBC := nbcEnv[key]
switch {
case inPC && !inNBC:
diffs = append(diffs, fmt.Sprintf("only-in-pc: %s", key))
diffs = append(diffs, fmt.Sprintf("only-in-pc: %s=%s", key, pcVal))
case !inPC && inNBC:
diffs = append(diffs, fmt.Sprintf("only-in-nbc: %s", key))
case pcVal != nbcVal:
diffs = append(diffs, fmt.Sprintf("differs: %s", key))
if !isExpectedDiffCSEVar(key) {
diffs = append(diffs, fmt.Sprintf("only-in-nbc: %s=%s", key, nbcVal))
}
case !envValsEqual(pcVal, nbcVal):
if !isExpectedDiffCSEVar(key) {
diffs = append(diffs, fmt.Sprintf("differs: %s pc=%s nbc=%s", key, pcVal, nbcVal))
}
@djsly
Copy link
Copy Markdown
Collaborator

djsly commented Jun 8, 2026

AgentBaker Linux PR gate — E2E mass-failure (change-caused, HIGH confidence)

  • Run: 167125948 (failed)
  • Failed task: Run AgentBaker E2E → AzureCLI → exit 1
  • Net new failures: ~13 leaves, all under Test_LocalDNSHostsPlugin*.

Two failure shapes, same root cause:

  1. Loud / fast (1 leaf): Test_LocalDNSHostsPlugin_Scriptless/ACL/default (155.83s) → e2e/validation.go:52 🔴 expected no env var differences between provision-config and nbc-cmd, but found differences: — the new compareEnvs validator this PR added fires with a non-empty diff list.
  2. Slow / derivative (7 leaves): Test_LocalDNSHostsPlugin{,_Scriptless}/{Ubuntu2204,Ubuntu2404,AzureLinuxV3,ACL}/{default,scriptless_nbc}kube.go:189 after ~742s 🔴 "<pod>" haven't appeared in k8s API server: context deadline exceeded. Identical signature across every distro × variant → systemic node-bring-up degradation, not infra.

Likely root cause ((b) + (c) confirmed, (a) ruled out):

  • This PR flips nbc.EnableScriptlessNBCCSECmd = true for every non-Scriptless scenario in e2e/node_config.go, and e2e/scenario_test.go strips Tags: { Scriptless: true } from each _Scriptless test and adds a BootstrapConfigMutator — so every LocalDNS scenario now provisions via the new scriptless NBC-CSE-Cmd path, including the default (bash-CSE) variants.
  • parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh was changed from if/elif to two independent ifs, so when both files exist --provision-config and --nbc-cmd are both passed to aks-node-controller provision, which then runs compareEnvs.
  • e2e/node_config.go::nbcToAKSNodeConfigV1 has no LocalDNSProfile mapping — every LocalDNS occurrence in the PR diff is the unrelated GenerateLocalDNSCoreFile template. The localdns systemd unit therefore never gets configured/started on nodes provisioned via the converter, DNS on the node breaks, debug pods can't reach the API server → the kube.go:189 timeout.
  • The 2 cosmetic localdns.toml.gtpl diff lines (blank-line removal + trailing newline) are not the cause.

Build-vs-test: test-code-caused (e2e converter + scenario tagging) with a tightly-coupled product-code enabler (baker.go scriptless plumbing + aks-node-controller-wrapper.sh). Not a main regression.

Confidence: HIGH — three independent indicators converge: PR's own validator firing loudly on ACL/default, the EnableScriptlessNBCCSECmd flip mechanism, and the perfectly-symmetric distro×variant timeout pattern.

Strongest alternative (less likely): sysctlTemplateString rewrite in baker.go degrading DNS/conntrack reliability — refuted because every branch still sets tcp_retries2=8, LocalDNS scenarios don't customize sysctls, and a sysctl change wouldn't produce a 100% deterministic per-distro pattern lined up with files this PR touched.

Side-channel (not the cause, FYI): build2204arm64gen2containerd and build2404gen2containerd flagged succeededWithIssues with CIS regressions detected (1) — same 6.1.4.1 pattern recurring across renovate PRs the last 48h (#8652, #8294); non-gating here.

Recommended next action (owner: @lilypan26):

  1. Look at the ACL/default validation.go:52 env-var diff list first — it names the exact env keys the converter mis-maps.
  2. Audit e2e/node_config.go::nbcToAKSNodeConfigV1 for missing NBC field coverage — start with LocalDNSProfile, then any field that drives a systemd unit / cloud-init artifact. Use pkg/agent/baker.go rendering and aks-node-controller/parser as the reference.
  3. Decide if enabling EnableScriptlessNBCCSECmd = true for default variants of every scenario is intended in Phase 3 — if yes, converter coverage must be exhaustive before merge; if not, scope the flip to opted-in scenarios.
  4. Re-run the LocalDNS slice (Ubuntu2204/2404 + AzureLinuxV3 + ACL × {default, scriptless_nbc}) before re-requesting review.

Posted by Clawpilot AgentBaker gate detective.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 26 changed files in this pull request and generated 2 comments.

Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Comment thread parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants