Describe the bug
The error we got
spec.initContainers{_service_}: Error: failed to create containerd container: error unpacking image: apply layer error for "_place_holder_for_our_container_image_":
failed to extract layer sha256:82c60ccaf916322916d16bcdb4223f93acc1f68e2087dba4ddf64990b1dc27fb:
failed to get reader from content store: content digest sha256:fa8ae93e2b3a7478248483e942ff665efa7219c6cd72d7a03c775372076e98dc: not found
We deployed a patched version of kata-deploy on GKE which adds discard_unpacked_layers = false to the drop-in config file (/opt/kata/containerd/config.d/kata-deploy.toml). The initial tests looked promising with images like ubuntu, nginx, postgres, alpine, etc. However, we saw the above error when deploying our own images.
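One way to confirm which value containerd actually ended up with is to look at the merged configuration. The sketch below assumes that `containerd config dump` on the node prints the effective config with drop-in files merged in. The helper name `discard_setting` is hypothetical, and it does a plain text match rather than real TOML parsing.

```shell
#!/bin/sh
# Hypothetical helper: read containerd config text on stdin and print the
# last value assigned to discard_unpacked_layers ("unset" if it never appears).
# This is a plain text match, not a TOML parser.
discard_setting() {
    value=$(sed -n 's/^[[:space:]]*discard_unpacked_layers[[:space:]]*=[[:space:]]*\([a-z]*\).*/\1/p' | tail -n 1)
    echo "${value:-unset}"
}

# On a node (assumption: the dump reflects merged drop-in files):
#   containerd config dump | discard_setting
```

If this still prints true after kata-deploy has run, the drop-in has not taken effect yet, for example because containerd has not been restarted.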
After some debugging, I narrowed the problem down to distroless-based images. Why distroless? That seemed arbitrary and mysterious. After more digging, the answer turned out to be peer-pods-webhook. Its Dockerfile (cloud-api-adaptor/src/webhook/Dockerfile, line 26 in ab0b4f9) has:
FROM --platform=$TARGETPLATFORM gcr.io/distroless/static@sha256:e3f945647ffb95b5839c07038d64f9811adf17308b9121d8a2b87b6a22a80a39
It is based on distroless, so it shares some common layers with our image.
So basically, kata-deploy and peer-pods-webhook are deployed to a node simultaneously. The latter is much smaller, so its image can be pulled and unpacked first, before kata-deploy updates the containerd config and restarts containerd. That is a problem because its image layers can be discarded while discard_unpacked_layers is still true, which matches the "not found" error above for the shared layer.
More broadly speaking, this is not specific to peer-pods-webhook. If there are other DaemonSets in the GKE cluster, or CoCo components are deployed to a long-running node, images may already have been pulled and some of their layers discarded before the containerd config is updated, whether by kata-deploy or manually. We'd like to get some guidance here, thank you.
How to reproduce
Deploy CoCo v0.19.0 to a GKE cluster.
# ssh into the worker node
gcloud compute ssh ${vm_name} --project={your_project} --zone={zone_name}
sudo su - root
cd /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256
# Grab the image digest and get the digest for amd64 linux
# ctr -n k8s.io image ls | grep peer-pods-webhook
# ctr -n k8s.io image ls | grep distroless
# 6f3f2123de90d2e7998b8161a2838433ec32560a827d07bcab339dacbf0cf16f is the digest for distroless image
jq -r '.layers[].digest' 6f3f2123de90d2e7998b8161a2838433ec32560a827d07bcab339dacbf0cf16f | cut -d: -f2 | xargs ls -l
ls: cannot access 'fa8ae93e2b3a7478248483e942ff665efa7219c6cd72d7a03c775372076e98dc': No such file or directory
ls: cannot access 'c172f21841dff4c8cf45cde46589c1c2616cefe7e819965e92e6d3475c428aa0': No such file or directory
ls: cannot access 'b4e6f1bfce0a1fba2b5421041552f4a897aada9cd5680926580f9e2c6247a7ae': No such file or directory
ls: cannot access 'b4242723c53fe4e094eb78569a2c15b6aafb8eb42aa9c3c2666130654a316ae2': No such file or directory
ls: cannot access 'd6b1b89eccacc15c2420b2776d72c1dae334a00805ed9af54bf2f71e4d536f28': No such file or directory
ls: cannot access 'b839dfae01f66e15c6a8b63520557ed315bdfe036342fa7a0c537259f10d7a9a': No such file or directory
-r--r--r-- 1 root root 67 Apr 29 22:32 2780920e5dbfbe103d03a583ed75345306e572ec5a48cb10361f046767d9f29a
-r--r--r-- 1 root root 123 Apr 29 22:32 3214acf345c0cc6bbdb56b698a41ccdefc624a09d6beb0d38b5de0b2303ecaf4
-r--r--r-- 1 root root 162 Apr 29 22:32 52630fc75a18675c530ed9eba5f55eca09b03e91bd5bc15307918bbc1a7e7296
-r--r--r-- 1 root root 188 Apr 29 22:32 7c12895b777bcaa8ccae0605b4de635b68fc32d60fa08f421dc3818bf55ee212
-r--r--r-- 1 root root 136993 Apr 29 22:32 bdfd7f7e5bf6fc27e70b59101db21c3d8284d283884419dd5fe7020583bb79ca
-r--r--r-- 1 root root 80 Apr 29 22:32 dd64bf2dd177757451a98fcdc999a339c35dee5d9872d8f4dc69c8f3c4dd0112
-r--r--r-- 1 root root 311 Apr 29 22:32 ebddc55facdc6b1f7e0f30816a5fc7cc62f38abdf76c0a8b0a0ce52085754795
We can see that some of the layers are already gone. In particular, layer fa8ae93e2b3a7478248483e942ff665efa7219c6cd72d7a03c775372076e98dc is the one from the error message at the beginning.
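The same check can be wrapped up so that it prints only the missing layers. This is a minimal sketch, assuming the blob directory layout shown above. The function name `missing_blobs` is hypothetical, and it expects one `sha256:<hex>` digest per line on stdin.

```shell
#!/bin/sh
# Hypothetical helper: read layer digests ("sha256:<hex>", one per line) on
# stdin and print those that have no blob file under the given directory.
missing_blobs() {
    blob_dir=$1
    while IFS= read -r digest; do
        hex=${digest#sha256:}          # strip the algorithm prefix
        [ -n "$hex" ] || continue      # skip empty lines
        [ -e "$blob_dir/$hex" ] || echo "$digest"
    done
}

# On a node, using the manifest digest found in the steps above:
#   cd /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256
#   jq -r '.layers[].digest' <manifest-digest> | missing_blobs .
```

Running it against the manifest of each image already present on a node would show whether any layers were discarded before the config change took effect.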
CoCo version information
CoCo v0.19.0
What TEE are you seeing the problem on
TDX
Failing command and relevant log output