Pod Identity Association also subject to cache race condition? #264

@joshfrench

What happened:
Similar to #174, but specific to pod identity associations: the expected AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE env var is absent when a service account and pod are created within a short window. A typical sequence looks like this:

  • Programmatically create a pod identity association, a service account annotated with an IAM role, and a pod in quick succession.
  • The pod comes up, but AWS operations fail with: An error occurred (InvalidIdentityToken) when calling the AssumeRoleWithWebIdentity operation: No OpenIDConnect provider found in your account for https://oidc.eks...
  • Examine the pod env: AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE is missing, but AWS_WEB_IDENTITY_TOKEN_FILE is set.
  • Restart the pod.
  • AWS_WEB_IDENTITY_TOKEN_FILE is now replaced with AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE, and the pod operates as expected.
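
To make the "examine the pod env" step repeatable, here is a minimal sketch of a check that reports which credential variable ended up in the pod spec. The function name classify_injection and its output labels are hypothetical, made up for illustration. It only looks at env var names in the pod spec and assumes the pod and namespace names from the reproduction below.

```shell
# Hypothetical helper: reads env var names (one per line) on stdin and
# prints which credential mutation the pod appears to have received.
classify_injection() {
  names=$(cat)
  if printf '%s\n' "$names" | grep -qxF 'AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE'; then
    echo "pod-identity"   # Pod Identity mutation was applied
  elif printf '%s\n' "$names" | grep -qxF 'AWS_WEB_IDENTITY_TOKEN_FILE'; then
    echo "irsa"           # only the web identity (role-arn annotation) mutation was applied
  else
    echo "none"           # neither variable is present
  fi
}

# Example usage against a live cluster (assumes test-pod in the default namespace):
#   kubectl get pod test-pod -n default \
#     -o jsonpath='{range .spec.containers[*].env[*]}{.name}{"\n"}{end}' \
#     | classify_injection
```

In the failing case described above this would print irsa, and after a pod restart it should print pod-identity.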

What you expected to happen:
If all the prerequisites are satisfied, pods should get the correct pod identity association mutation regardless of timing.

How to reproduce it (as minimally and precisely as possible):

  1. Create an IAM role with the correct pod identity association trust policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:TagSession",
                "sts:AssumeRole"
            ]
        }
    ]
}
  2. Create an EKS cluster, enabling the EKS Pod Identity Agent add-on.
  3. Run aws eks update-kubeconfig --name my-cluster
  4. Run:
$ aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace default \
  --service-account test-sa \
  --role-arn arn:aws:iam::111111111111:role/test-role && \
sleep 0.75s && \
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: default
  name: test-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/test-role
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: default
spec:
  serviceAccountName: test-sa
  containers:
    - name: test
      image: amazon/aws-cli:latest
      imagePullPolicy: IfNotPresent
      command:
        - aws
        - sts
        - get-caller-identity
EOF

With a delay of ~750ms or less between creating the association and submitting the SA and pod, the pod consistently gets the incorrect AWS_WEB_IDENTITY_TOKEN_FILE mutation. A delay above ~1s seems to be reliably sufficient to get the correct mutation.
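
As a stopgap while the race exists, a pipeline could insert a settle delay between creating the association and applying the manifests. This is only a sketch: the helper name apply_after_settle and the SETTLE_SECONDS variable are hypothetical, and the 2-second default is a guess with some margin over the ~1s observed here, not a documented guarantee.

```shell
# Hypothetical helper: wait for a settle period, then run the given command.
# SETTLE_SECONDS defaults to 2 (an assumption; ~1s was enough in this report).
apply_after_settle() {
  sleep "${SETTLE_SECONDS:-2}"
  "$@"
}

# Example usage (manifest.yaml is a placeholder for the SA + Pod manifest):
#   aws eks create-pod-identity-association \
#     --cluster-name my-cluster \
#     --namespace default \
#     --service-account test-sa \
#     --role-arn arn:aws:iam::111111111111:role/test-role && \
#   apply_after_settle kubectl apply -f manifest.yaml
```

A fixed delay only hides the timing dependency; it does not address the cache miss itself.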

Anything else we need to know?:
I'm wondering if something like #236 and/or #252 should be applied to the FileConfig, either to give the cache some time to catch up or to provide a fallback in case of a cache miss. Creating a service account and a pod within a short timeframe is common in CI/CD and infrastructure-as-code workflows.

Environment:

  • AWS Region: us-east-2
  • EKS Platform version: eks.6
  • Kubernetes version: 1.32
  • Webhook Version: ¯\_(ツ)_/¯ whatever EKS is running under the hood
