From time to time the prefetcher pods themselves get stuck waiting indefinitely for an image pull to complete.
One way to avoid this would be a "preseed" job that runs one pod per node and sets activeDeadlineSeconds, so that a stalled pull is killed instead of hanging indefinitely.
This job could launch very early after the cluster is created, so that it has hopefully finished by the time the actual prefetcher runs.
apiVersion: batch/v1
kind: Job
metadata:
  name: preseed
spec:
  completions: $NUM_NODES
  parallelism: $NUM_NODES
  backoffLimit:
  template:
    metadata:
      labels:
        job-name: preseed
    spec:
      # The pod will be forcefully killed if it runs past 30 seconds, even in image pull phase
      activeDeadlineSeconds: 30
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: job-name
                    operator: In
                    values:
                      - preseed
              topologyKey: kubernetes.io/hostname
      containers:
        - name: task
          image: busybox # this is for testing, the actual job would use the prefetcher image and exit 0 immediately
          # Calculate a random time between 20 and 35, echo it, and sleep
          command:
            - "sh"
            - "-c"
            - |
              SLEEP_TIME=$(( (RANDOM % 16) + 20 ))
              echo "Sleeping for $SLEEP_TIME seconds..."
              sleep $SLEEP_TIME
      restartPolicy: Never # to play nice with activeDeadlineSeconds
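
The manifest leaves $NUM_NODES as a placeholder, so something has to fill it in before the job is created. The following is a rough sketch of one way to do that, not a tested procedure. The file name preseed.yaml and the helper names render_preseed and apply_preseed are hypothetical, and it assumes that every node reported by kubectl should receive one pod (which may not hold if some nodes are tainted or unschedulable).

```shell
#!/bin/sh
# Sketch only: preseed.yaml is a hypothetical file name for the manifest above.

# Replace only the literal $NUM_NODES placeholder in a manifest read from stdin.
# $1 is the node count. Other shell variables inside the manifest, such as
# $SLEEP_TIME, are left untouched.
render_preseed() {
  sed 's/\$NUM_NODES/'"$1"'/g'
}

# Count the nodes, render the manifest, and create the job.
# Assumption: every node listed by kubectl should get exactly one pod.
apply_preseed() {
  num_nodes=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
  render_preseed "$num_nodes" < preseed.yaml | kubectl apply -f -
}
```

Calling apply_preseed once the cluster is reachable would create the job. The substitution is deliberately limited to $NUM_NODES: a blanket environment substitution over the whole file would also rewrite $SLEEP_TIME inside the container command.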