Skip to content

test(e2e): prewarm environment pool to hide per-spec deploy latency#425

Closed
AlinsRan wants to merge 1 commit into
api7:masterfrom
AlinsRan:feat/e2e-prewarm-pool
Closed

test(e2e): prewarm environment pool to hide per-spec deploy latency#425
AlinsRan wants to merge 1 commit into
api7:masterfrom
AlinsRan:feat/e2e-prewarm-pool

Conversation

@AlinsRan

@AlinsRan AlinsRan commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Each APISIX-standalone spec deploys a full environment (data plane, controller, httpbin, port-forwards) synchronously in BeforeEach, so the ~30-80s deploy/readiness latency sits on every spec's critical path.

What this does

Adds a per-process prewarm pool: a small double-buffer of ready environments is provisioned by background workers, so while one spec runs, the next environment is built concurrently. BeforeEach then picks up a ready environment instead of deploying inline.

  • Per-spec namespace isolation is preserved.
  • The Deployer interface is untouched.
  • Only the default profile is pooled; webhook/custom-keyed environments and any provisioning failure fall back to synchronous deploy.
  • Stream-route specs (Gateway API TCP/TLS/UDP and ApisixRoute stream routes) also fall back to synchronous deploy: a stream data plane served from a prewarmed environment is sensitive to the background provisioning contention.
  • Tunnels are (re)created at acquire time against the fully-provisioned data plane (reusing createDataplaneTunnels/createAdminTunnel), so a port-forward is never opened before its listener is serving and then left idle in the buffer.

Knobs

  • E2E_PREWARM — on by default; set false to disable (kill switch).
  • E2E_PREWARM_DEPTH — pool depth per profile per process, default 1 (double buffer).

The cluster footprint is governed by E2E_NODES × (1 + depth); at the default depth this adds only one in-flight environment per process.

Also included

  • Pod-readiness polling now uses a fixed 2s interval instead of an exponential backoff that polled sparsely (7.5/15.5/31.5s) exactly during the 10-30s window pods become ready — worth up to ~15s per wait on its own.

Files

  • scaffold/envpool.go (new) — generic, provider-agnostic pool.
  • scaffold/apisix_prewarm.go (new) — default-profile provisioning + load onto scaffold.
  • scaffold/apisix_deployer.goBeforeEach prewarm fast path + sync fallback.
  • scaffold/scaffold.go, apisix/e2e_test.go — const + AfterSuite pool teardown.
  • framework/k8s.go — readiness polling fix + EnsureServiceReadyE.

Validation

go build ./test/..., go vet, gofmt all clean. Live-cluster behavior and peak resource usage need this PR's CI run to validate; E2E_PREWARM=false is the kill switch.

Summary by CodeRabbit

  • Tests
    • Implemented background prewarming pools to accelerate e2e test startup and reduce wait times
    • Added automatic cleanup of test environments and resources upon suite completion
    • Enhanced service readiness monitoring with improved error handling and configurable retry strategies
    • Optimized test setup with environment pooling support and configurable pool depth
    • Strengthened test isolation and infrastructure resource management

Each APISIX-standalone spec currently deploys a full environment (data plane,
controller, httpbin, port-forwards) synchronously in BeforeEach, so the
~30-80s deploy/readiness latency sits on every spec's critical path.

Add a per-process prewarm pool: a small double-buffer of ready environments is
provisioned by background workers, so while one spec runs the next environment
is built concurrently. BeforeEach picks up a ready environment instead of
deploying inline. Per-spec namespace isolation is preserved and the Deployer
interface is untouched.

Details:
- envpool.go: generic, provider-agnostic pool (channel + workers, panic-safe
  background provisioning via a minimal terratest TestingT, AfterSuite cleanup
  registered by the suite root rather than the shared scaffold package).
- apisix_prewarm.go: error-style provisioning of the default profile and
  loading it onto the scaffold. Tunnels are (re)created at acquire time against
  the fully-provisioned data plane, reusing createDataplaneTunnels/
  createAdminTunnel, so a port-forward is never opened before its listener is
  serving and then left idle in the buffer.
- Only the default profile is pooled; webhook/custom-keyed environments and any
  provisioning failure fall back to synchronous deploy. Stream-route specs
  (TCP/TLS/UDP and ApisixRoute stream routes) also fall back, since a stream
  data plane served from a prewarmed environment is sensitive to the background
  provisioning contention.
- Knobs: E2E_PREWARM (default on), E2E_PREWARM_DEPTH (default 1).
- k8s.go: poll pod readiness at a fixed 2s interval instead of an exponential
  backoff that polled sparsely during the 10-30s window pods become ready.
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Foo Bar seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Introduces a background environment prewarming pool for e2e tests. A new envpool.go manages worker goroutines that provision complete APISIX Kubernetes environments ahead of time. apisix_prewarm.go adds eligibility checks, provisioning functions, and deployer wiring. BeforeEach gains a fast path to acquire pooled environments. A suite-level AfterSuite hook calls ShutdownAllPools for cleanup.

Changes

E2E Environment Prewarming Pool

Layer / File(s) Summary
Pool infrastructure: envPool, workers, and lifecycle
test/e2e/scaffold/envpool.go
Defines envPool with background provisioning workers, panic-to-error wrapping via safeProvision, namespace-deletion cleanup in destroyPooledEnv, a global profile-keyed pool registry via getOrStartPool, exported ShutdownAllPools, and E2E_PREWARM/E2E_PREWARM_DEPTH env-var configuration helpers.
Prewarm eligibility checks and environment provisioning
test/e2e/scaffold/apisix_prewarm.go
Adds specPrewarmable (excludes stream-route/Gateway TCP/TLS/UDP specs by label and container hierarchy), isPoolable/profileKey/formatRegistry helpers, and the full provisioning chain: provisionAPISIXEnvprovisionDataplane (with optional etcd apply) + provisionIngress (webhook disabled, controller-manager pod wait) + provisionHTTPBIN (deployment apply, endpoint readiness check).
Deployer BeforeEach fast path and loadPooledEnv
test/e2e/scaffold/apisix_deployer.go, test/e2e/scaffold/apisix_prewarm.go
BeforeEach attempts pool acquisition and calls loadPooledEnv on success, falling back to beforeEachSync on failure. loadPooledEnv copies controller/admin/dataplane/httpbin settings into the deployer, clears finalizers, and creates tunnels.
Framework support, defaultProfileName, and suite teardown
test/e2e/framework/k8s.go, test/e2e/scaffold/scaffold.go, test/e2e/apisix/e2e_test.go
Adds EnsureServiceReadyE (error-returning service readiness used by provisionHTTPBIN), changes waitExponentialBackoff to a fixed 2s/90-step schedule, updates NewScaffold to use the defaultProfileName constant, and registers the AfterSuite hook calling ShutdownAllPools.

Sequence Diagram(s)

sequenceDiagram
  participant BeforeEach as APISIXDeployer.BeforeEach
  participant getOrStartPool
  participant envPool
  participant provisionAPISIXEnv
  participant loadPooledEnv
  participant AfterSuite

  rect rgba(100, 149, 237, 0.5)
    Note over BeforeEach,loadPooledEnv: Fast path (prewarm enabled, spec eligible)
    BeforeEach->>getOrStartPool: isPoolable + specPrewarmable check
    getOrStartPool->>envPool: start workers if new profile
    envPool->>provisionAPISIXEnv: background goroutine provisions env
    provisionAPISIXEnv-->>envPool: pooledEnv buffered in channel
    BeforeEach->>envPool: acquire() → pooledEnv
    BeforeEach->>loadPooledEnv: wire env into deployer (tunnels, settings)
    loadPooledEnv-->>BeforeEach: return (test proceeds)
  end

  rect rgba(255, 160, 122, 0.5)
    Note over AfterSuite,envPool: Suite teardown
    AfterSuite->>ShutdownAllPools: registered via Ginkgo AfterSuite
    ShutdownAllPools->>envPool: shutdown() for each pool
    envPool->>destroyPooledEnv: delete namespace of buffered envs
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: introducing a prewarm environment pool for e2e tests to hide per-spec deployment latency, which matches the core objective of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
E2e Test Quality Review ✅ Passed PR introduces test infrastructure optimization (e2e pool prewarming), not new tests. Code has comprehensive error handling, proper concurrency safety (mutex/waitgroup/channels), correct fallback lo...
Security Check ✅ Passed Comprehensive security review of all 7 categories completed. No sensitive data exposure, secrets handling is appropriate for test infrastructure, proper resource isolation with unique namespaces, a...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/scaffold/apisix_deployer.go (1)

111-111: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Admin API keys logged in plaintext at multiple locations. Both locations log the APISIX admin API key without redaction. Per coding guidelines, API keys should not be logged without redaction, even in test infrastructure, as logs may be accessible in CI systems.

  • test/e2e/scaffold/apisix_deployer.go#L111-L111: Redact or remove the admin key log in beforeEachSync
  • test/e2e/scaffold/apisix_deployer.go#L395-L395: Redact or remove the admin key log in CreateAdditionalGatewayWithOptions
Suggested fix
-	s.Logf("apisix admin api key: %s", s.runtimeOpts.APISIXAdminAPIKey)
+	// Redact middle portion for debugging while protecting the full key
+	key := s.runtimeOpts.APISIXAdminAPIKey
+	if len(key) > 8 {
+		s.Logf("apisix admin api key: %s...%s", key[:4], key[len(key)-4:])
+	}

As per coding guidelines: "Do not log, serialize, or return API keys, tokens... without redaction".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scaffold/apisix_deployer.go` at line 111, API keys are being logged
in plaintext at two locations in test/e2e/scaffold/apisix_deployer.go, violating
coding guidelines. In the `beforeEachSync` function at line 111-111, redact or
remove the s.runtimeOpts.APISIXAdminAPIKey from the Logf call. Similarly, in the
`CreateAdditionalGatewayWithOptions` function at line 395-395, redact or remove
the admin API key from the corresponding log statement. Either remove these log
lines entirely or redact the sensitive key value (for example, by logging only a
partial hash or masked value) before logging to ensure API keys are never
exposed in plaintext in test logs.

Source: Coding guidelines

🧹 Nitpick comments (2)
test/e2e/framework/k8s.go (1)

299-310: 💤 Low value

Function name waitExponentialBackoff no longer matches behavior.

The function now uses a fixed 2s interval rather than exponential backoff. Consider renaming to waitWithBackoff or waitFixedBackoff to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/framework/k8s.go` around lines 299 - 310, The function
`waitExponentialBackoff` is misleadingly named since it no longer implements
exponential backoff behavior. The backoff configuration uses Factor: 1, which
means the duration stays fixed at 2 seconds throughout all 90 steps, rather than
growing exponentially. Rename the function `waitExponentialBackoff` to a name
that accurately reflects its fixed-interval behavior, such as `waitWithBackoff`
or `waitFixedBackoff`. Make sure to update the function name definition and any
call sites that reference this function.
test/e2e/scaffold/envpool.go (1)

152-159: 💤 Low value

Consider logging namespace deletion failures for debugging.

destroyPooledEnv discards the error from RunKubectlE. While this is cleanup code where failures aren't actionable, logging the error would help diagnose orphaned namespace issues in CI.

 func destroyPooledEnv(env *pooledEnv) {
 	if env == nil {
 		return
 	}
 	if env.namespace != "" && env.kubectlOptions != nil {
-		_ = k8s.RunKubectlE(&bgTestingT{}, env.kubectlOptions, "delete", "namespace", env.namespace, "--wait=false")
+		if err := k8s.RunKubectlE(&bgTestingT{}, env.kubectlOptions, "delete", "namespace", env.namespace, "--wait=false"); err != nil {
+			fmt.Printf("prewarm: failed to delete namespace %s: %v\n", env.namespace, err)
+		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scaffold/envpool.go` around lines 152 - 159, The `destroyPooledEnv`
function discards the error returned by `k8s.RunKubectlE` using the blank
identifier, which prevents visibility into namespace deletion failures. Instead
of ignoring this error, capture the error return value from the `RunKubectlE`
call and log it when deletion fails. This will help diagnose orphaned namespace
issues during CI debugging, even though the error may not be actionable during
cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/e2e/scaffold/apisix_deployer.go`:
- Line 111: API keys are being logged in plaintext at two locations in
test/e2e/scaffold/apisix_deployer.go, violating coding guidelines. In the
`beforeEachSync` function at line 111-111, redact or remove the
s.runtimeOpts.APISIXAdminAPIKey from the Logf call. Similarly, in the
`CreateAdditionalGatewayWithOptions` function at line 395-395, redact or remove
the admin API key from the corresponding log statement. Either remove these log
lines entirely or redact the sensitive key value (for example, by logging only a
partial hash or masked value) before logging to ensure API keys are never
exposed in plaintext in test logs.

---

Nitpick comments:
In `@test/e2e/framework/k8s.go`:
- Around line 299-310: The function `waitExponentialBackoff` is misleadingly
named since it no longer implements exponential backoff behavior. The backoff
configuration uses Factor: 1, which means the duration stays fixed at 2 seconds
throughout all 90 steps, rather than growing exponentially. Rename the function
`waitExponentialBackoff` to a name that accurately reflects its fixed-interval
behavior, such as `waitWithBackoff` or `waitFixedBackoff`. Make sure to update
the function name definition and any call sites that reference this function.

In `@test/e2e/scaffold/envpool.go`:
- Around line 152-159: The `destroyPooledEnv` function discards the error
returned by `k8s.RunKubectlE` using the blank identifier, which prevents
visibility into namespace deletion failures. Instead of ignoring this error,
capture the error return value from the `RunKubectlE` call and log it when
deletion fails. This will help diagnose orphaned namespace issues during CI
debugging, even though the error may not be actionable during cleanup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 81665ee1-05a8-4dab-9236-a9367709a5d3

📥 Commits

Reviewing files that changed from the base of the PR and between ecbb5d5 and dc63f19.

📒 Files selected for processing (6)
  • test/e2e/apisix/e2e_test.go
  • test/e2e/framework/k8s.go
  • test/e2e/scaffold/apisix_deployer.go
  • test/e2e/scaffold/apisix_prewarm.go
  • test/e2e/scaffold/envpool.go
  • test/e2e/scaffold/scaffold.go

@AlinsRan

Copy link
Copy Markdown
Contributor Author

Reopening as a same-repository PR: fork-based PRs cannot access the registry login secret, so every e2e job fails at the "Login to Registry" setup step before any test runs. Superseded by a branch pushed directly to this repo.

@AlinsRan AlinsRan closed this Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants