GCP: Error unpacking image, failed to extract layer sha256:82c60ccaf916322916d16bcdb4223f93acc1f68e2087dba4ddf64990b1dc27fb #3033

@Qiang-Xu

Describe the bug

We got the following error:

spec.initContainers{_service_}: Error: failed to create containerd container: error unpacking image: apply layer error for "_place_holder_for_our_container_image_": 
failed to extract layer sha256:82c60ccaf916322916d16bcdb4223f93acc1f68e2087dba4ddf64990b1dc27fb: 
failed to get reader from content store: content digest sha256:fa8ae93e2b3a7478248483e942ff665efa7219c6cd72d7a03c775372076e98dc: not found

We deployed a patched version of kata-deploy on GKE that adds discard_unpacked_layers = false to the drop-in config file (/opt/kata/containerd/config.d/kata-deploy.toml). Initial tests looked promising with images such as ubuntu, nginx, postgres, and alpine. However, we saw the error above when deploying our own images.
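To check which value the drop-in actually yields, one option is to look at the merged configuration. The following is a minimal sketch, not part of kata-deploy: discard_setting is a hypothetical helper name, and it assumes the setting appears on its own line as `discard_unpacked_layers = <value>`. Note that `containerd config dump` reads the config from disk, so it shows what containerd would load on its next start, not necessarily what the running process is using.

```shell
# Hypothetical helper: print every value of discard_unpacked_layers
# found in a containerd config supplied on stdin.
discard_setting() {
  sed -n 's/^[[:space:]]*discard_unpacked_layers[[:space:]]*=[[:space:]]*\([a-z]*\).*$/\1/p'
}

# Usage on the worker node (as root):
#   containerd config dump | discard_setting
```

If this prints true after kata-deploy has run, the drop-in was not picked up; if it prints false, layers discarded before the restart are still missing and need to be re-fetched.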

After some debugging, I narrowed the problem down to distroless-based images. Why distroless? That seemed arbitrary at first. After more digging, the answer turned out to be peer-pods-webhook, whose Dockerfile contains:

FROM --platform=$TARGETPLATFORM gcr.io/distroless/static@sha256:e3f945647ffb95b5839c07038d64f9811adf17308b9121d8a2b87b6a22a80a39
It is based on distroless, so it shares some layers with our image.

kata-deploy and peer-pods-webhook are deployed to a node at the same time. The peer-pods-webhook image is much smaller, so it can be pulled and unpacked first, before kata-deploy updates the containerd config and restarts containerd. That is a problem because discard_unpacked_layers is still true at that point, so the image's layers are discarded after unpacking.

More broadly, this is not specific to peer-pods-webhook. If other DaemonSets run in the GKE cluster, or CoCo components are deployed to a long-standing node, images may already have been pulled and some of their layers discarded before kata-deploy updates the containerd config (or before we update it manually). We'd like some guidance here, thank you.
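To find out which images on a node are already affected, `ctr images check` can report whether all content for each image is still present locally. The sketch below filters that report; incomplete_images is a hypothetical helper name, and it assumes the STATUS value is the fourth whitespace-separated column and starts with the word "complete" for intact images, which may differ between ctr versions.

```shell
# Hypothetical helper: read `ctr images check` output on stdin and print
# the refs of images whose STATUS column does not start with "complete".
# Assumes a header row followed by: REF TYPE DIGEST STATUS ...
incomplete_images() {
  awk 'NR > 1 && $4 != "complete" { print $1 }'
}

# Usage on the worker node (as root):
#   ctr -n k8s.io images check | incomplete_images
```

Any image listed this way shares the problem described above and would need its content fetched again once discard_unpacked_layers is false.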

How to reproduce

Deploy CoCo v0.19.0 to a GKE cluster.

# ssh into the worker node
gcloud compute ssh ${vm_name} --project={your_project} --zone={zone_name}

sudo su - root
cd /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256

# Grab the image digest and get the digest for amd64 linux
# ctr -n k8s.io image ls | grep peer-pods-webhook
# ctr -n k8s.io image ls | grep distroless

# 6f3f2123de90d2e7998b8161a2838433ec32560a827d07bcab339dacbf0cf16f is the digest for the distroless image
jq -r '.layers[].digest' 6f3f2123de90d2e7998b8161a2838433ec32560a827d07bcab339dacbf0cf16f | cut -d: -f2 | xargs ls -l
ls: cannot access 'fa8ae93e2b3a7478248483e942ff665efa7219c6cd72d7a03c775372076e98dc': No such file or directory
ls: cannot access 'c172f21841dff4c8cf45cde46589c1c2616cefe7e819965e92e6d3475c428aa0': No such file or directory
ls: cannot access 'b4e6f1bfce0a1fba2b5421041552f4a897aada9cd5680926580f9e2c6247a7ae': No such file or directory
ls: cannot access 'b4242723c53fe4e094eb78569a2c15b6aafb8eb42aa9c3c2666130654a316ae2': No such file or directory
ls: cannot access 'd6b1b89eccacc15c2420b2776d72c1dae334a00805ed9af54bf2f71e4d536f28': No such file or directory
ls: cannot access 'b839dfae01f66e15c6a8b63520557ed315bdfe036342fa7a0c537259f10d7a9a': No such file or directory
-r--r--r-- 1 root root     67 Apr 29 22:32 2780920e5dbfbe103d03a583ed75345306e572ec5a48cb10361f046767d9f29a
-r--r--r-- 1 root root    123 Apr 29 22:32 3214acf345c0cc6bbdb56b698a41ccdefc624a09d6beb0d38b5de0b2303ecaf4
-r--r--r-- 1 root root    162 Apr 29 22:32 52630fc75a18675c530ed9eba5f55eca09b03e91bd5bc15307918bbc1a7e7296
-r--r--r-- 1 root root    188 Apr 29 22:32 7c12895b777bcaa8ccae0605b4de635b68fc32d60fa08f421dc3818bf55ee212
-r--r--r-- 1 root root 136993 Apr 29 22:32 bdfd7f7e5bf6fc27e70b59101db21c3d8284d283884419dd5fe7020583bb79ca
-r--r--r-- 1 root root     80 Apr 29 22:32 dd64bf2dd177757451a98fcdc999a339c35dee5d9872d8f4dc69c8f3c4dd0112
-r--r--r-- 1 root root    311 Apr 29 22:32 ebddc55facdc6b1f7e0f30816a5fc7cc62f38abdf76c0a8b0a0ce52085754795

We can see that some of the layers are already gone. In particular, layer fa8ae93e2b3a7478248483e942ff665efa7219c6cd72d7a03c775372076e98dc is the one named in the error message at the beginning.
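The manual check above can be wrapped into a small function that prints only the missing blobs. This is a sketch under stated assumptions: missing_blobs is a hypothetical name, and instead of jq it greps every sha256 digest referenced in the manifest file, so the image config blob is checked along with the layers.

```shell
# Hypothetical helper: list digests referenced by a manifest blob that
# no longer exist in the content store directory.
#   $1: blob directory (e.g. the blobs/sha256 directory used above)
#   $2: manifest digest without the "sha256:" prefix
missing_blobs() {
  blob_dir=$1
  manifest=$2
  # Extract every referenced digest (config and layers), de-duplicated.
  grep -oE 'sha256:[0-9a-f]{64}' "$blob_dir/$manifest" | cut -d: -f2 | sort -u |
  while read -r digest; do
    # Print the digest only if its blob file is absent.
    [ -e "$blob_dir/$digest" ] || printf '%s\n' "$digest"
  done
}

# Usage on the worker node (as root):
#   missing_blobs /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256 \
#     6f3f2123de90d2e7998b8161a2838433ec32560a827d07bcab339dacbf0cf16f
```

An empty result means every referenced blob is still in the content store.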

CoCo version information

CoCo v0.19.0

What TEE are you seeing the problem on

TDX

Failing command and relevant log output
