| 1 | +# CI E2E Parallel Execution Design |
| 2 | + |
| 3 | +## Problem |
| 4 | + |
| 5 | +The `API7EE E2E Test` CI workflow (`e2e-test.yml`) takes ~1 hour, twice the 30-minute target. The bottleneck is test execution: the ~199 E2E tests are split across 3 matrix jobs (apisix.apache.org ~89 tests, networking.k8s.io ~81 tests, webhook ~28 tests), and each job runs its tests serially. |
| 6 | + |
| 7 | +## Approach |
| 8 | + |
| 9 | +Enable ginkgo parallel execution (`--nodes=2`) within each matrix job by converting `BeforeSuite`/`AfterSuite` to ginkgo's `SynchronizedBeforeSuite`/`SynchronizedAfterSuite`. This ensures the API7EE control plane is deployed only once per job while each ginkgo node independently manages its own dashboard connection. |
| 10 | + |
| 11 | +Scope: `e2e-test.yml` only. Other workflows are not changed. |
| 12 | + |
| 13 | +## Architecture |
| 14 | + |
| 15 | +### Current (serial) |
| 16 | + |
| 17 | +``` |
| 18 | +Kind Cluster |
| 19 | +├── api7-ee-e2e namespace (BeforeSuite, once per job) |
| 20 | +│ ├── api7ee3-dashboard (NodePort 7080/7443) |
| 21 | +│ ├── api7ee3-dp-manager (7943) |
| 22 | +│ └── api7-postgresql |
| 23 | +└── ingress-apisix-e2e-tests-default-{ns} (each It() BeforeEach) |
| 24 | + ├── api7ee3-apisix-gateway-mtls (data plane pod) |
| 25 | + ├── api7-ingress-controller |
| 26 | + └── httpbin |
| 27 | + |
| 28 | +Test process → kubectl port-forward → fixed local:{NodePort} → dashboard:7080 |
| 29 | +``` |
| 30 | + |
| 31 | +### Target (parallel, --nodes=2) |
| 32 | + |
| 33 | +``` |
| 34 | +Kind Cluster |
| 35 | +├── api7-ee-e2e namespace (SynchronizedBeforeSuite node 1, once per job) |
| 36 | +│ ├── api7ee3-dashboard / dp-manager / postgresql |
| 37 | +├── ingress-apisix-e2e-tests-default-{nsA} (node 1 BeforeEach) |
| 38 | +│ ├── gateway pod, ingress controller, httpbin |
| 39 | +│ └── API7EE Gateway Group A (UUID) |
| 40 | +└── ingress-apisix-e2e-tests-default-{nsB} (node 2 BeforeEach, concurrent) |
| 41 | + ├── gateway pod, ingress controller, httpbin |
| 42 | + └── API7EE Gateway Group B (UUID) |
| 43 | + |
| 44 | +ginkgo node 1 process → kubectl port-forward → local:{autoPort1} → dashboard:7080 |
| 45 | +ginkgo node 2 process → kubectl port-forward → local:{autoPort2} → dashboard:7080 |
| 46 | +``` |
| 47 | + |
| 48 | +## Concurrency Safety Analysis |
| 49 | + |
| 50 | +| Resource | Isolation mechanism | Safe? | |
| 51 | +|---|---|---| |
| 52 | +| k8s namespace | `ingress-apisix-e2e-tests-{name}-{nanosecond}` | ✅ | |
| 53 | +| GatewayClass | name = namespace name (unique) | ✅ | |
| 54 | +| API7EE Gateway Group | `uuid.NewString()` | ✅ | |
| 55 | +| Ingress controller | controllerName contains namespace | ✅ | |
| 56 | +| Dashboard API calls | idempotent (UploadLicense, GetAdminKey) | ✅ | |
| 57 | +| Dashboard tunnel | fixed NodePort used as local port → conflict | ❌ **needs fix** | |
| 58 | + |
| 59 | +## Code Changes |
| 60 | + |
| 61 | +### 1. `test/e2e/e2e_test.go` |
| 62 | + |
| 63 | +Replace `BeforeSuite`/`AfterSuite` with the synchronized variants: |
| 64 | + |
| 65 | +```go |
| 66 | +SynchronizedBeforeSuite(f.DeployAPI7EE, f.InitNodeConnections) |
| 67 | +SynchronizedAfterSuite(f.CloseNodeConnections, f.TeardownInfrastructure) |
| 68 | +``` |
| 69 | + |
| 70 | +- `DeployAPI7EE` (node 1 only): deploys the API7EE control plane once. |
| 71 | +- `InitNodeConnections` (all nodes): each node creates its own dashboard port-forward tunnel. |
| 72 | +- `CloseNodeConnections` (all nodes): each node closes its own tunnel. |
| 73 | +- `TeardownInfrastructure` (node 1 only): no-op for now (cluster is torn down by CI). |
| 74 | + |
| 75 | +### 2. `test/e2e/framework/api7_framework.go` |
| 76 | + |
| 77 | +Split `BeforeSuite` into: |
| 78 | + |
| 79 | +**`DeployAPI7EE() []byte`** (node 1 only): |
| 80 | +- Init `API7EELicense` and `dashboardVersion` from env |
| 81 | +- Delete and recreate `api7-ee-e2e` namespace |
| 82 | +- Helm install `api7ee3` chart |
| 83 | +- Wait for pods to start (a fixed `time.Sleep(1 * time.Minute)`; readiness is not checked) |
| 84 | +- Create a temporary tunnel (with `findFreePort()`) |
| 85 | +- Call `UploadLicense()` and `setDpManagerEndpoints()` |
| 86 | +- Close the temporary tunnel |
| 87 | +- Return `[]byte("ready")` |
| 88 | + |
| 89 | +**`InitNodeConnections(_ []byte)`** (all nodes): |
| 90 | +- Init `API7EELicense` from env (needed for per-test `UploadLicense` calls) |
| 91 | +- Call `f.newDashboardTunnel()` to create a per-node tunnel |
| 92 | + |
| 93 | +**`CloseNodeConnections()`** (all nodes): |
| 94 | +- Call `f.shutdownDashboardTunnel()` |
| 95 | + |
| 96 | +**`TeardownInfrastructure()`** (node 1 only): |
| 97 | +- No-op (Kind cluster is deleted by `make kind-down` or CI teardown) |
| 98 | + |
| 99 | +### 3. Fix dashboard tunnel port conflict |
| 100 | + |
| 101 | +`newDashboardTunnel()` currently uses the k8s NodePort value as the local bind port. With parallel processes on the same machine, this causes `address already in use` errors. |
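
The failure mode is easy to reproduce in isolation. This minimal sketch binds a loopback port and then tries to bind the same address again while the first listener is still open, which is what a second `kubectl port-forward` on the same fixed local port would run into. `secondBindFails` is a hypothetical name used only for illustration.

```go
package main

import (
	"fmt"
	"net"
)

// secondBindFails opens a listener on an OS-assigned loopback port, then tries
// to bind the same address again while the first listener is still open.
// It reports whether the second bind was rejected.
func secondBindFails() (bool, error) {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, fmt.Errorf("first listen: %w", err)
	}
	defer first.Close()

	second, err := net.Listen("tcp", first.Addr().String())
	if err != nil {
		return true, nil // typically "address already in use"
	}
	_ = second.Close()
	return false, nil
}

func main() {
	failed, err := secondBindFails()
	if err != nil {
		panic(err)
	}
	fmt.Println("second bind rejected:", failed)
}
```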
| 102 | + |
| 103 | +Replace fixed-port logic with a `findFreePort()` helper: |
| 104 | + |
| 105 | +```go |
| 106 | +func findFreePort() int { |
| 107 | + ln, err := net.Listen("tcp", ":0") |
| 108 | + if err != nil { |
| 109 | + panic(fmt.Sprintf("finding free port: %v", err)) |
| 110 | + } |
| 111 | + port := ln.Addr().(*net.TCPAddr).Port |
| 112 | + _ = ln.Close() |
| 113 | + return port |
| 114 | +} |
| 115 | +``` |
| 116 | + |
| 117 | +Use `findFreePort()` for both HTTP and HTTPS tunnels: |
| 118 | + |
| 119 | +```go |
| 120 | +localHTTPPort := findFreePort() |
| 121 | +localHTTPSPort := findFreePort() |
| 122 | +_dashboardHTTPTunnel = k8s.NewTunnel(..., localHTTPPort, httpPort) |
| 123 | +_dashboardHTTPSTunnel = k8s.NewTunnel(..., localHTTPSPort, httpsPort) |
| 124 | +``` |
| 125 | + |
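| 126 | +Note: there is a small TOCTOU window between `ln.Close()` and `kubectl port-forward` binding the port, during which another process could claim it. Because each listener is closed before the next call, two consecutive `findFreePort()` calls could also, in principle, return the same port. On a dedicated CI runner both collisions are unlikely; if they occur, retry logic can be added. |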
| 127 | + |
| 128 | +### 4. `Makefile` |
| 129 | + |
| 130 | +Add a `ginkgo-api7ee-e2e-test` target. `E2E_NODES` is supplied by the caller; this snippet does not define a default: |
| 131 | + |
| 132 | +```makefile |
| 133 | +.PHONY: ginkgo-api7ee-e2e-test |
| 134 | +ginkgo-api7ee-e2e-test: adc |
| 135 | + @ginkgo -cover -coverprofile=coverage.txt -r --randomize-all --randomize-suites \ |
| 136 | + --trace --nodes=$(E2E_NODES) --label-filter="$(TEST_LABEL)" ./test/e2e/ |
| 137 | +``` |
| 138 | + |
| 139 | +### 5. `.github/workflows/e2e-test.yml` |
| 140 | + |
| 141 | +Add an `install-ginkgo` step and replace `make e2e-test` with the parallel ginkgo invocation: |
| 142 | + |
| 143 | +```yaml |
| 144 | +- name: Install ginkgo |
| 145 | + run: make install-ginkgo |
| 146 | + |
| 147 | +- name: Run E2E test suite |
| 148 | + env: |
| 149 | + API7_EE_LICENSE: ${{ secrets.API7_EE_LICENSE }} |
| 150 | + PROVIDER_TYPE: api7ee |
| 151 | + TEST_LABEL: ${{ matrix.cases_subset }} |
| 152 | + TEST_ENV: CI |
| 153 | + run: | |
| 154 | + if [[ "${{ matrix.cases_subset }}" == "webhook" ]]; then |
| 155 | + E2E_NODES=1 make ginkgo-api7ee-e2e-test |
| 156 | + else |
| 157 | + E2E_NODES=2 make ginkgo-api7ee-e2e-test |
| 158 | + fi |
| 159 | +``` |
| 160 | + |
| 161 | +## Expected Outcome |
| 162 | + |
| 163 | +| Job | Tests | Before | After (N=2) | |
| 164 | +|---|---|---|---| |
| 165 | +| apisix.apache.org | ~89 | ~45 min | ~22 min | |
| 166 | +| networking.k8s.io | ~81 | ~40 min | ~20 min | |
| 167 | +| webhook | ~28 | ~10 min | ~10 min (serial) | |
| 168 | +| **Total (longest)** | | **~60 min** | **~22-25 min** | |
| 169 | + |
| 170 | +If these estimates hold, the longest job finishes within the 30-minute target. |
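
The "After" column corresponds to dividing each job's duration by the node count. A slightly more cautious model keeps a fixed, non-parallel portion (cluster creation, control plane deployment) and divides only the remainder. The sketch below uses the table's "Before" values; the 5-minute fixed overhead is a hypothetical figure chosen for illustration, not a measurement.

```go
package main

import "fmt"

// estimate returns the projected job duration in minutes when only the
// test-execution portion of the job is spread across nodes.
func estimate(beforeMin, fixedMin float64, nodes int) float64 {
	return fixedMin + (beforeMin-fixedMin)/float64(nodes)
}

func main() {
	jobs := []struct {
		name   string
		before float64
	}{
		{"apisix.apache.org", 45},
		{"networking.k8s.io", 40},
	}
	for _, j := range jobs {
		fmt.Printf("%s: ideal %.1f min, with 5 min fixed overhead %.1f min\n",
			j.name, estimate(j.before, 0, 2), estimate(j.before, 5, 2))
	}
}
```

Under this model the longest job lands between 22.5 and 25 minutes, which leaves some margin below 30 minutes as long as the two nodes do not slow each other down.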
| 171 | + |
| 172 | +## Risk and Rollback |
| 173 | + |
| 174 | +**Risk**: Resource contention on GitHub-hosted runners (2 CPUs, 7GB RAM). With 2 parallel test stacks (2 gateway pods + 2 ingress controllers + 2 httpbins) plus the shared API7EE control plane, memory usage may approach limits. |
| 175 | + |
| 176 | +**Mitigation**: Start with `E2E_NODES=2`. Monitor CI run times and failure rates. Roll back to `E2E_NODES=1` (equivalent to current behavior) if instability is observed. |
| 177 | + |
| 178 | +**Rollback**: A single-line change in the workflow: set `E2E_NODES=1` for all matrix jobs, which runs every suite serially, as the current `make e2e-test` does. |
| 179 | + |
| 180 | +## Testing |
| 181 | + |
| 182 | +After implementation, verify: |
| 183 | +1. All 3 matrix jobs pass with the new ginkgo invocation |
| 184 | +2. Each parallel node creates an independent namespace and gateway group |
| 185 | +3. No port conflicts in dashboard tunnel creation |
| 186 | +4. Serial mode (`E2E_NODES=1`) still works for webhook tests |