Update routes and sg rules within eks terraform script #5256

Open
LordAbhishek wants to merge 1 commit into main from abhishek/debug-vault-test-on-eks
LordAbhishek commented Apr 24, 2026

1. Issue:
Terraform failed in the EKS workflow while creating the Kubernetes StorageClasses, with this error:
Get "http://localhost/apis/storage.k8s.io/v1/storageclasses/gp3": dial tcp 127.0.0.1:80: connect: connection refused.

data.aws_iam_policy.csi-driver-policy: Read complete after 4s [id=arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
aws_iam_role_policy_attachment.csi[0]: Refreshing state... [id=terraform-20260330114946235000000027-2026033011494697440000002b]
│ Error: Get "http://localhost/apis/storage.k8s.io/v1/storageclasses/gp3": dial tcp 127.0.0.1:80: connect: connection refused
│   with kubernetes_storage_class.ebs_gp3_cluster0,
│   on main.tf line 184, in resource "kubernetes_storage_class" "ebs_gp3_cluster0":
│  184: resource "kubernetes_storage_class" "ebs_gp3_cluster0" {

Root cause:
During refresh/apply, the Kubernetes provider can be evaluated before cluster access is fully established. When provider authentication or endpoint resolution is incomplete at that point, the provider may default to localhost, causing the connection refusal.

Fix:
Configured per-cluster Kubernetes providers with exec-based client authentication (aws eks get-token) so Terraform obtains fresh EKS credentials at runtime and talks to the intended EKS API endpoint instead of falling back to localhost.
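
For illustration, a minimal sketch of one such per-cluster provider is shown here. It is not the PR's actual main.tf: the `cluster0` alias, the `cluster0_name` variable, the data source label, and the StorageClass provisioner and parameters are assumptions. Only the resource name `ebs_gp3_cluster0`, the StorageClass name `gp3`, and the `aws eks get-token` command come from the description above. The sketch also assumes the EKS cluster already exists when the data source is read; a script that creates the cluster in the same apply would more likely take these values from the cluster resource or module outputs.

```hcl
# Hypothetical input: the EKS cluster name for the first cluster.
variable "cluster0_name" {
  type = string
}

# Look up the endpoint and CA of an existing cluster (assumed to exist already).
data "aws_eks_cluster" "cluster0" {
  name = var.cluster0_name
}

# Aliased provider for this cluster. With host and CA set explicitly and a
# token fetched by the exec plugin at runtime, the provider has no reason to
# fall back to its default of localhost.
provider "kubernetes" {
  alias                  = "cluster0"
  host                   = data.aws_eks_cluster.cluster0.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster0.certificate_authority[0].data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", var.cluster0_name]
  }
}

# The resource from the error message, pinned to the aliased provider.
# Provisioner and parameters are assumptions for an EBS CSI gp3 class.
resource "kubernetes_storage_class" "ebs_gp3_cluster0" {
  provider = kubernetes.cluster0

  metadata {
    name = "gp3"
  }

  storage_provisioner = "ebs.csi.aws.com"
  parameters = {
    type = "gp3"
  }
}
```

The same pattern would be repeated with a second alias for the other cluster, so that each Kubernetes resource names the provider it belongs to.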

2. Issue:
TestVault_Partitions (vault_partitions_test.go) is currently failing in the EKS workflows.
Logs:

vault_partitions_test.go:411: 2026-03-25T18:08:29Z Waiting 20m0s for pods with label "release=test-bqt7gj" to be ready.
retry.go:219: retry.go:39: 3 pods are not ready: test-bqt7gj-consul-client-58n7c,test-bqt7gj-consul-client-674gj,test-bqt7gj-consul-client-8w9rq

Root Cause:
The failure is caused by network segmentation between the two peered VPCs used by TestVault_Partitions.
This test follows the "Single Consul Datacenter Across Multiple Kubernetes Clusters" architecture: it runs Consul servers in one EKS cluster and Consul clients in another. Each client agent needs to establish Serf LAN gossip membership with the external servers, which requires direct pod-to-pod reachability across the clusters (a flat network).

Although VPC peering was enabled, the routes covered only the public subnets. Because EKS worker nodes and pods run in private subnets, there was no route between the private subnets of the two clusters. As a result, cross-cluster pod-to-pod traffic failed: the client agents' Serf LAN pings to the server timed out, the clients marked the Consul server as failed and removed it from their server pool, and the client agents never became healthy.

Please note: no mesh gateway is used in this test scenario, which is why only this test requires a flat network.

Logs:

2026-04-17T11:32:30.160Z [INFO]  agent: Started gRPC listeners: port_name=grpc_tls address=[::]:8502 network=tcp
2026-04-17T11:32:30.161Z [TRACE] agent: [core][Server #2 ListenSocket #3] ListenSocket created
2026-04-17T11:32:30.160Z [DEBUG] agent: starting file watcher
2026-04-17T11:32:30.160Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce hcp k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2026-04-17T11:32:30.161Z [INFO]  agent: Joining cluster...: cluster=LAN
2026-04-17T11:32:30.161Z [INFO]  agent: (LAN) joining: lan_addresses=["ad3d98004013e40c1a82b1cb09e57229-197484693.us-west-2.elb.amazonaws.com"]
2026-04-17T11:32:30.161Z [INFO]  agent: started state syncer
2026-04-17T11:32:30.161Z [INFO]  agent: Consul agent running!
2026-04-17T11:32:30.161Z [WARN]  agent.router.manager: No servers available
2026-04-17T11:32:30.162Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2026-04-17T11:32:30.160Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="No known Consul servers" index=13
2026-04-17T11:32:30.161Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="No known Consul servers" index=13
2026-04-17T11:32:30.168Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  44.237.26.223:8301
2026-04-17T11:32:30.172Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: test-vybsbs-consul-server-0 10.0.3.44
2026-04-17T11:32:30.172Z [INFO]  agent.client: adding server: server="test-vybsbs-consul-server-0 (Addr: tcp/10.0.3.44:8300) (DC: dc1)"
2026-04-17T11:32:30.172Z [TRACE] agent: [core][Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": "dc1-10.0.3.44:8300",
      "ServerName": "test-vybsbs-consul-server-0",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "dc1-10.0.3.44:8300",
          "ServerName": "test-vybsbs-consul-server-0",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2026-04-17T11:32:30.172Z [TRACE] agent.grpc.balancer: adding server address: target=consul://dc1.a03c97e4-1eca-baae-77e5-f98414058351/server.dc1 address=dc1-10.0.3.44:8300
2026-04-17T11:32:30.172Z [DEBUG] agent.grpc.balancer: switching server: target=consul://dc1.a03c97e4-1eca-baae-77e5-f98414058351/server.dc1 from=<none> to=dc1-10.0.3.44:8300
2026-04-17T11:32:30.172Z [TRACE] agent: [core][Channel #1 SubChannel #4] Subchannel created
2026-04-17T11:32:30.172Z [TRACE] agent: [core][Channel #1] Channel Connectivity change to CONNECTING
2026-04-17T11:32:30.172Z [TRACE] agent: [core][Channel #1 SubChannel #4] Subchannel Connectivity change to CONNECTING
2026-04-17T11:32:30.172Z [TRACE] agent: [core][Channel #1 SubChannel #4] Subchannel picks a new address "dc1-10.0.3.44:8300" to connect
2026-04-17T11:32:30.173Z [TRACE] agent.grpc.balancer: sub-connection state changed: target=consul://dc1.a03c97e4-1eca-baae-77e5-f98414058351/server.dc1 server=dc1-10.0.3.44:8300 state=CONNECTING
2026-04-17T11:32:30.173Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  44.253.46.184:8301
2026-04-17T11:32:30.175Z [INFO]  agent: (LAN) joined: number_of_nodes=2
2026-04-17T11:32:30.175Z [DEBUG] agent: systemd notify failed: error="No socket"
2026-04-17T11:32:30.175Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=2
2026-04-17T11:32:31.572Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed UDP ping: test-vybsbs-consul-server-0 (timeout reached)
2026-04-17T11:32:32.072Z [INFO]  agent.client.memberlist.lan: memberlist: Suspect test-vybsbs-consul-server-0 has failed, no acks received
2026-04-17T11:32:33.572Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed UDP ping: test-vybsbs-consul-server-0 (timeout reached)
2026-04-17T11:32:35.071Z [INFO]  agent.client.memberlist.lan: memberlist: Suspect test-vybsbs-consul-server-0 has failed, no acks received
2026-04-17T11:32:35.572Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed UDP ping: test-vybsbs-consul-server-0 (timeout reached)
2026-04-17T11:32:36.072Z [INFO]  agent.client.memberlist.lan: memberlist: Marking test-vybsbs-consul-server-0 as failed, suspect timeout reached (0 peer confirmations)
2026-04-17T11:32:36.073Z [INFO]  agent.client.serf.lan: serf: EventMemberFailed: test-vybsbs-consul-server-0 10.0.3.44
2026-04-17T11:32:36.073Z [INFO]  agent.client: removing server: server="test-vybsbs-consul-server-0 (Addr: tcp/10.0.3.44:8300) (DC: dc1)"
2026-04-17T11:32:36.073Z [TRACE] agent: [core][Channel #1] Resolver state updated: {
  "Addresses": null,
  "Endpoints": [],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned an empty address list)
2026-04-17T11:32:36.073Z [TRACE] agent.grpc.balancer: removing server address: target=consul://dc1.a03c97e4-1eca-baae-77e5-f98414058351/server.dc1 address=dc1-10.0.3.44:8300
2026-04-17T11:32:36.073Z [DEBUG] agent.grpc.balancer: switching server: target=consul://dc1.a03c97e4-1eca-baae-77e5-f98414058351/server.dc1 from=dc1-10.0.3.44:8300 to=<none>
abhishek@window-ubuntu consul-k8s % k2 get all
NAME                                                           READY   STATUS        RESTARTS   AGE
pod/test-noidde-vault-agent-injector-5fb95b774-kblvs           1/1     Running       0          32m
pod/test-vybsbs-consul-client-wjskm                            1/2     Terminating   0          29m
pod/test-vybsbs-consul-connect-injector-685756fbf8-qxbnj       2/2     Terminating   0          29m
pod/test-vybsbs-consul-webhook-cert-manager-67cb487cf8-6686d   1/1     Terminating   0          29m

Fix:
I have added routes for the private subnets and updated the security group rules accordingly.
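
A rough sketch of what such a change can look like for one side of the peering is given here. It is not the PR's diff: every variable and resource name is hypothetical, and the ports are an assumption based on the logs above (8301 for Serf LAN over TCP and UDP, 8300 for server RPC). The actual rules in the script may be broader or narrower.

```hcl
# Hypothetical inputs describing one side of the VPC peering.
variable "peering_connection_id" {
  type = string
}

variable "private_route_table_ids" {
  type        = list(string)
  description = "Route tables of this cluster's private subnets"
}

variable "peer_vpc_cidr" {
  type        = string
  description = "CIDR block of the peered VPC"
}

variable "node_security_group_id" {
  type        = string
  description = "Security group attached to this cluster's worker nodes"
}

# Send traffic for the peer VPC over the peering connection from every
# private route table, not only the public ones.
resource "aws_route" "private_to_peer" {
  count                     = length(var.private_route_table_ids)
  route_table_id            = var.private_route_table_ids[count.index]
  destination_cidr_block    = var.peer_vpc_cidr
  vpc_peering_connection_id = var.peering_connection_id
}

# Allow Serf LAN gossip (assumed port 8301) from the peer VPC over TCP and UDP.
resource "aws_security_group_rule" "serf_lan_from_peer" {
  for_each          = toset(["tcp", "udp"])
  type              = "ingress"
  from_port         = 8301
  to_port           = 8301
  protocol          = each.value
  cidr_blocks       = [var.peer_vpc_cidr]
  security_group_id = var.node_security_group_id
}

# Allow server RPC (assumed port 8300) from the peer VPC.
resource "aws_security_group_rule" "server_rpc_from_peer" {
  type              = "ingress"
  from_port         = 8300
  to_port           = 8300
  protocol          = "tcp"
  cidr_blocks       = [var.peer_vpc_cidr]
  security_group_id = var.node_security_group_id
}
```

The same resources would need to be mirrored in the other VPC with the CIDR and IDs swapped, since gossip traffic flows in both directions.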
Test:
I have tested it locally against EKS. The test passes after the change.

go test -run '^TestVault_Partitions$' -v -p 1 -timeout 2h -failfast -enable-enterprise=true -debug-directory='./' -enable-multi-cluster=true -kubeconfigs='/Users/abhishek/.kube/consul-k8s-1012464252,/Users/abhishek/.kube/consul-k8s-1426345587' -kube-contexts='vaulttestc1,vaulttestc2' -consul-image=hashicorppreview/consul-enterprise:1.23.0-dev -consul-k8s-image=hashicorppreview/consul-k8s-control-plane:1.10.0-dev-nightly-8541e06c6b3cbc7d2829ae7eb38e6a24a81fe1f0
[Truncated output....]
vaulttestc1 --kubeconfig /Users/abhishek/.kube/consul-k8s-1012464252 --namespace default test-fozadj]
    command.go:185: 2026-04-24T13:56:27+05:30 release "test-fozadj" uninstalled
--- PASS: TestVault_Partitions (360.79s)
PASS
ok      github.com/hashicorp/consul-k8s/acceptance/tests/vault  361.260s
abhishek@window-ubuntu consul-k8s % k2 get pods
NAME                                                       READY   STATUS    RESTARTS   AGE
test-fozadj-vault-agent-injector-b9674c7d6-7lzs9           1/1     Running   0          3m14s
test-gs1blc-consul-client-b9x69                            2/2     Running   0          39s
test-gs1blc-consul-connect-injector-5f8b9dc94d-nz28z       2/2     Running   0          39s
test-gs1blc-consul-webhook-cert-manager-56c85cfc7c-clxm7   1/1     Running   0          39s

LordAbhishek added the pr/no-changelog, pr/no-backport, and do-not-merge labels Apr 24, 2026
LordAbhishek changed the title from "[Do not merge] update routes and sg rules within eks terraform script" to "Update routes and sg rules within eks terraform script" Apr 28, 2026
LordAbhishek added the backport/1.7.x, backport/1.8.x, and backport/1.9.x labels and removed the do-not-merge and pr/no-backport labels Apr 28, 2026
LordAbhishek marked this pull request as ready for review April 28, 2026 05:45
LordAbhishek requested review from a team as code owners April 28, 2026 05:45
LordAbhishek added the backport/2.0.x label and removed the backport/1.7.x label May 5, 2026

Labels

backport/1.8.x, backport/1.9.x, backport/2.0.x, pr/no-changelog
