Self-study material for Bonus Lab 12. Read before starting the lab.
Lecture 7 opened with Liz Rice's observation: "A container is just a Linux process with a particularly fancy set of namespaces and cgroups." The implication: containers share the host kernel. A kernel CVE, or a runc CVE like CVE-2024-21626 ("Leaky Vessels"), is a container escape risk that the application code, the image, and the network controls can't prevent.
For the vast majority of workloads, the shared-kernel model is fine. The trade-off is:
| Property | Container (runc) | VM (KVM) |
|---|---|---|
| Boot time | ~50ms | ~1-5s |
| Memory overhead | ~5MB | ~50-300MB |
| CPU overhead | <1% | ~3-10% |
| Isolation | Process-level (kernel-shared) | Hardware-virtualized |
| Tooling ecosystem | Mature, container-native | Heavyweight, traditional ops |
VMs cost real performance; containers carry real isolation risk. Sandboxed runtimes like Kata Containers try to give you container ergonomics plus VM isolation at some cost: typically a cold start of about 1-2s instead of tens of milliseconds, and 5-20% I/O overhead.
This reading walks the landscape: Kata, gVisor, Firecracker, and the emerging Confidential Computing frontier.
- Kata Containers has been hosted at the OpenStack Foundation (now the OpenInfra Foundation) since 2017, formed by merging Intel Clear Containers and Hyper.sh runV
- Written in Go (the runtime shim) and Rust (the agent, which runs inside the VM)
- Latest: Kata Containers v3.x (April 2026)
- Used in production by: Ant Group, Baidu, Adobe (some workloads), DigitalOcean Kubernetes as an opt-in runtime
```mermaid
flowchart TB
    Pod[K8s Pod / OCI Container] --> Runtime[containerd]
    Runtime -->|runtime_type=kata| Shim[containerd-shim-kata-v2]
    Shim -->|hypervisor call| QEMU[QEMU/Cloud-Hypervisor/Firecracker]
    QEMU -->|boot| MicroVM[Lightweight Linux VM]
    MicroVM -->|kata-agent over vsock| AgentProcess[Container Process]
    style MicroVM fill:#FF9800,color:#fff
    style AgentProcess fill:#4CAF50,color:#fff
```
Key details:
- One micro-VM per container (or per pod, depending on config). The micro-VM has its own kernel.
- The hypervisor is QEMU by default; Cloud-Hypervisor (Rust, no legacy code) and Firecracker (AWS, minimal) are alternatives.
- kata-agent runs inside the VM and exposes the container lifecycle over vsock (virtio socket: a fast, in-host-only socket between hypervisor and guest).
- OCI compatible: `nerdctl run --runtime=kata` looks like normal container ops. K8s `RuntimeClass` lets you select Kata per workload.
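Because each micro-VM boots its own kernel, a quick sanity check is to compare the kernel release reported on the host with the one reported inside the container. The helper below is a minimal sketch of that comparison; the function name is an illustration, and the release strings in the usage note are made-up examples, not values taken from a real Kata install.

```python
import platform


def looks_like_separate_kernel(host_release: str, container_release: str) -> bool:
    """Return True when the two kernel release strings differ.

    A differing release is strong evidence of a guest kernel (Kata).
    Identical strings are what runc produces, although a guest kernel
    built from the same version as the host would also match.
    """
    return host_release.strip() != container_release.strip()


if __name__ == "__main__":
    # Release of whichever kernel this interpreter is running on.
    print("kernel release here:", platform.release())
```

To use it, run the script (or `uname -r`) once on the host and once inside the container, then pass both strings in, for example `looks_like_separate_kernel("6.8.0-45-generic", "6.12.13")`.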
What Kata buys you:
- Kernel CVE class blocked: a Dirty Pipe, Leaky Vessels, or kernel race condition that escapes the container only escapes into the throwaway micro-VM, not the host
- Untrusted multi-tenant workloads safer: running customer code (CI/CD runners, code-execution sandboxes, ML model inference for many tenants)
- Compliance gains: some regulators treat "container + VM" as equivalent to two-tier isolation; can be argued for HIPAA / FedRAMP audits

What it costs:
- ~5× cold-start (microVM boot vs runc exec)
- 5-20% I/O overhead depending on workload (virtio-fs/virtiofsd improvements help a lot)
- Higher memory per container (each microVM has its own kernel, ~50-100MB resident)
- Some kernel features unavailable (host network namespaces, host PID, certain device pass-through)
Kata exposes itself as a RuntimeClass:
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
# handler must match the runtime handler name configured in containerd
handler: kata-runtime
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: my-image:tag
```

Operators usually mix and match: trusted infrastructure pods stay on runc; tenant workloads route to kata; the policy enforcement (which pods get which runtime) lives in admission control (Lecture 9 Conftest territory).
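The routing rule itself is small enough to sketch. The Python below is a stand-in for the kind of check an admission policy would make, not a Conftest/Rego policy and not any project's real API: the function name, the `untrusted_namespaces` parameter, and the `tenant-a` namespace are all assumptions made for illustration.

```python
def runtime_violations(pod: dict, untrusted_namespaces: set[str]) -> list[str]:
    """Return policy violations for one Pod manifest (empty list means admit)."""
    namespace = pod.get("metadata", {}).get("namespace", "default")
    runtime = pod.get("spec", {}).get("runtimeClassName")
    violations = []
    # Tenant namespaces must opt in to the sandboxed runtime explicitly.
    if namespace in untrusted_namespaces and runtime != "kata":
        violations.append(
            f"pod in namespace {namespace!r} must set runtimeClassName: kata"
        )
    return violations


# Hypothetical manifests, already parsed from YAML into dicts.
tenant_pod = {
    "metadata": {"name": "sandboxed-app", "namespace": "tenant-a"},
    "spec": {"runtimeClassName": "kata"},
}
forgotten_pod = {
    "metadata": {"name": "plain-app", "namespace": "tenant-a"},
    "spec": {},
}
```

With `untrusted_namespaces={"tenant-a"}`, the first pod is admitted and the second is rejected with one violation. A pod in any other namespace passes regardless of runtime, which matches the "trusted pods stay on runc" half of the pattern.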
Google's gVisor takes a different approach: instead of a real VM, run a user-space kernel that intercepts every syscall the container makes.
```mermaid
flowchart LR
    Container[Container] -->|syscall| Sentry[gVisor Sentry<br/>user-space kernel]
    Sentry -->|filtered subset| Host[Host kernel]
    Sentry -.->|emulates everything else| Done[Returned to container]
    style Sentry fill:#9C27B0,color:#fff
```
- Sentry = the user-space kernel. It intercepts every syscall the container makes, emulates most of them in user space (covering ~80% of the Linux ABI), and reaches the host kernel only through a small subset filtered by seccomp.
- Gofer = handles file-system I/O on behalf of the container, running as a separate process.
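The interception idea can be modelled in a few lines. This is a toy illustration of the three outcomes described above (emulate, pass a filtered call to the host, or refuse), not gVisor's implementation: the syscall tables are hypothetical and far smaller than anything real.

```python
import errno

# Hypothetical tables for illustration only; gVisor's real lists differ.
EMULATED = {
    "getpid": lambda state: state["pid"],
    "uname": lambda state: "sentry-kernel",
}
HOST_ALLOWLIST = {"futex", "mmap"}


def dispatch(name: str, state: dict) -> tuple[str, object]:
    """Classify one intercepted syscall into one of three outcomes."""
    if name in EMULATED:
        # Handled entirely in user space; the host kernel never sees it.
        return ("emulated", EMULATED[name](state))
    if name in HOST_ALLOWLIST:
        # A real sandbox would issue the host call here, under seccomp.
        return ("host", None)
    # Anything else is refused instead of reaching the host.
    return ("rejected", -errno.ENOSYS)
```

The point of the model is the default: a syscall that is neither emulated nor explicitly allowed never reaches the host kernel, which is where both the isolation and the compatibility gaps come from.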
| Aspect | gVisor | Kata |
|---|---|---|
| Isolation mechanism | User-space syscall interception + seccomp | Hardware virtualization (KVM) |
| Host requirements | Any Linux | KVM-enabled Linux |
| Cold start | ~100ms | ~1-2s |
| Syscall compat | ~80% of Linux ABI (gaps exist) | 100% (real kernel inside the VM) |
| CPU overhead | 10-30% (every syscall is intercepted) | 1-5% (native CPU once booted) |
| Memory overhead | Low (no extra kernel) | High (one mini-kernel per container) |
| Mature for | CI runners, edge functions | Multi-tenant SaaS, sensitive workloads |
Picking between them: gVisor when boot time matters most (per-invocation FaaS, ephemeral CI tasks). Kata when compatibility matters most (full Linux ABI, weird kernel features).
Firecracker is AWS's contribution to the field, open-sourced in 2018. It powers AWS Lambda and AWS Fargate.
- Written in Rust (~50k LOC), vs QEMU's ~1.5M LOC
- Boots a microVM in ~125ms (the figure given in the project's own specification)
- Minimal device model: virtio-net, virtio-block, virtio-vsock, serial console, keyboard. No PCI, no USB, no graphics.
Lambda runs ~100M function invocations per minute (2025 numbers). Each one is a new microVM. Firecracker's stripped-down design is what makes that economically viable.
For your purposes:
- Fly.io uses Firecracker as their app runtime (each app is a microVM)
- Kata Containers can use Firecracker as its hypervisor (`hypervisor=firecracker` in the kata config)
- Direct use: you can run a Firecracker microVM with a JSON API; minimal but workable

Firecracker is not a "container runtime" by itself; it's a hypervisor. Pair with a control plane (Lambda's, Fly's, or your own) to get a usable system.
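To make "direct use" concrete, the sketch below builds the sequence of API requests that configures and starts one microVM. The endpoint paths and field names follow the Firecracker getting-started material as best I recall it, so check them against the API specification for your version; the file paths and the function name are placeholders. The code only builds the requests. Sending them to the API's Unix socket (for example with `curl --unix-socket`) is left out.

```python
import json


def firecracker_boot_requests(kernel_path: str, rootfs_path: str,
                              vcpus: int = 1,
                              mem_mib: int = 128) -> list[tuple[str, str, str]]:
    """Build the (method, path, JSON body) sequence that boots one microVM."""
    steps = [
        # Size the VM before anything else.
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        # Point the VMM at an uncompressed guest kernel image.
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        }),
        # Attach the root filesystem as a virtio block device.
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": False,
        }),
        # Nothing runs until this final action is sent.
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]
    return [(method, path, json.dumps(body)) for method, path, body in steps]
```

Calling `firecracker_boot_requests("/tmp/vmlinux", "/tmp/rootfs.ext4")` returns four requests in order. Everything a control plane adds (image preparation, networking, scheduling, cleanup) sits on top of this handful of calls.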
The 2026 frontier. Confidential Containers (CoCo) uses hardware trusted execution environments with CPU-level memory encryption (Intel TDX, AMD SEV-SNP, Arm CCA; AWS Nitro Enclaves pursues a similar goal through Nitro hypervisor isolation) to ensure that even the host OS, kernel, and hypervisor cannot read the container's memory.
```mermaid
flowchart LR
    Workload[Workload] -->|encrypted at rest + in memory| CPU[CPU TEE: Intel TDX / AMD SEV-SNP]
    HostKernel[Host kernel] -.cannot read.-> CPU
    Hypervisor[Hypervisor] -.cannot read.-> CPU
    style CPU fill:#F44336,color:#fff
```
Traditional sandbox: protects the host from the container. CoCo protects the container from the host. Two different threats:
| Sandbox | Protects | From |
|---|---|---|
| Kata / gVisor | Host kernel | Compromised container |
| Confidential containers | Container data | Compromised host / cloud provider |
Use cases (real, 2026):
- Healthcare PHI processing on untrusted cloud infrastructure
- Multi-cloud failover where you don't fully trust one of the clouds
- Cross-organization data sharing (different orgs, same compute pool)
CoCo is young. As of April 2026:
- Intel TDX is widely available on Sapphire Rapids / Granite Rapids CPUs
- AMD SEV-SNP is widely available on Milan / Genoa EPYC CPUs
- Azure Confidential Containers is GA; AWS Nitro Enclaves is GA but narrower scope
- K8s support via the CoCo project (CNCF Sandbox) is alpha-to-beta
Don't expect to use this in Lab 12. Do expect interview questions about it by 2027-2028.
The numbers below are typical 2026 measurements (Intel Xeon Sapphire Rapids, 64GB RAM, NVMe SSD). Your mileage varies.
| Workload | runc | Kata (QEMU) | Kata (Cloud-Hyp.) | gVisor | Firecracker |
|---|---|---|---|---|---|
| Cold start (empty alpine) | 0.05s | 2.1s | 1.2s | 0.13s | 0.13s |
| Boot to ready (nginx) | 0.30s | 2.5s | 1.5s | 0.50s | 0.50s |
| CPU-bound (5M-iter loop) | 8.2s | 8.4s | 8.3s | 9.1s | 8.4s |
| Sequential write 100MB (dd) | 12.5 GB/s | 1.2 GB/s | 4.0 GB/s | 0.4 GB/s | 4.5 GB/s |
| Random small read (fio 4k) | 320k IOPS | 45k IOPS | 95k IOPS | 25k IOPS | 100k IOPS |
| Memory overhead (per container) | 5 MB | 80 MB | 60 MB | 25 MB | 50 MB |
Reading the table:
- CPU-bound workloads: sandbox overhead is small. Kata and Firecracker land within about 1-3% of runc (8.3-8.4s vs 8.2s); gVisor adds about 11% (9.1s).
- I/O-bound workloads: Kata's virtio adds material overhead; Cloud-Hypervisor + virtiofs improvements help significantly
- Cold-start sensitive workloads: Firecracker and gVisor win for sub-second boots; Kata is for longer-lived containers
The honest answer: measure your workload. CPU-pure microservices barely notice Kata; I/O-pure workloads notice immediately.
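When you collect your own numbers, normalising them against runc makes the comparison easier to read. The sketch below does that for two rows copied from the table above; the dictionary keys and the function name are illustrative choices.

```python
# Cold-start seconds (lower is better), copied from the table above.
COLD_START_S = {
    "runc": 0.05, "kata-qemu": 2.1, "kata-clh": 1.2,
    "gvisor": 0.13, "firecracker": 0.13,
}
# fio 4k random-read IOPS (higher is better), copied from the table above.
RANDOM_READ_IOPS = {
    "runc": 320_000, "kata-qemu": 45_000, "kata-clh": 95_000,
    "gvisor": 25_000, "firecracker": 100_000,
}


def relative_to_runc(metric: dict[str, float]) -> dict[str, float]:
    """Express every runtime's value as a multiple of the runc value."""
    baseline = metric["runc"]
    return {name: round(value / baseline, 2) for name, value in metric.items()}


if __name__ == "__main__":
    print("cold start (x runc):", relative_to_runc(COLD_START_S))
    print("random read (x runc):", relative_to_runc(RANDOM_READ_IOPS))
```

For time-based rows a ratio above 1 means slower than runc; for throughput rows a ratio below 1 means slower. Swap in the measurements from your own Lab 12 run and keep the `runc` key as the baseline.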
- You're running untrusted code
  - CI runners (GitHub Actions self-hosted, GitLab Runners with arbitrary repos)
  - Code-execution sandboxes (Replit, CodePen, Jupyter at scale)
  - ML inference for many tenants on shared GPUs
- You have a regulatory requirement
  - HIPAA: PHI containers wrapped in Kata = stronger isolation argument to auditors
  - FedRAMP High: VM-tier isolation is sometimes mandated
  - Customer contractual: many Fortune-500 SaaS contracts require Kata-class isolation
- You've had an incident
  - Post-runc-CVE-2024-21626 (Lecture 7 + 8), many orgs added Kata for "highest-risk" workloads (any container running customer code or pulling unverified images)

You don't deploy a sandbox for:
- General application workloads where you control the image
- Performance-sensitive systems where every ms matters
- Workloads with operational complexity already maxed (Kata adds runtime+config complexity)

What a sandbox costs you:
- Operational complexity: managing two runtimes, monitoring two stacks, debugging issues at the VM layer (where `kubectl logs` is less informative)
- Performance: 5× cold start is brutal for FaaS-style workloads (Firecracker mitigates; Kata's slower cold-start matters)
- Compatibility: some kernel features (host PID, host network, certain device passthrough, GPU passthrough) don't work or work poorly in Kata
- Vendor support: many K8s managed services (EKS, GKE, AKS) only offer Kata as an opt-in or via a separate node pool. Operator skill in two runtimes is a real burden.

Honest assessment: in 2026, ~5% of K8s production workloads run sandboxed. The 95% accept the runc-CVE risk because the workloads are trusted enough.
Common pattern: one node pool for runc (general workloads), one for Kata (tenant code). K8s nodeSelector + tolerations route pods to the right pool.
```yaml
# Untrusted workload pod
spec:
  runtimeClassName: kata
  nodeSelector:
    workload-class: untrusted
  tolerations:
    - key: dedicated-untrusted
      operator: Exists
```

Sandboxed pods need more memory (kernel overhead) and slightly more CPU. Plan capacity assuming ~30% bigger nodes for the Kata pool.
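A rough way to see where a figure like ~30% comes from is to count how many pods fit in a node's memory once each pod pays a fixed runtime overhead. The sketch below is back-of-envelope arithmetic under stated assumptions: a hypothetical 256 MiB request per pod, the 5 MB (runc) and 80 MB (Kata on QEMU) per-container overheads from the performance table treated as MiB, and no allowance for system-reserved memory.

```python
def pods_per_node(node_mem_mib: int, pod_request_mib: int, overhead_mib: int) -> int:
    """Pods that fit in node memory when each pays a fixed runtime overhead."""
    return node_mem_mib // (pod_request_mib + overhead_mib)


NODE_MEM_MIB = 64 * 1024   # 64 GiB node, as in the benchmark host
POD_REQUEST_MIB = 256      # hypothetical per-pod memory request

runc_pods = pods_per_node(NODE_MEM_MIB, POD_REQUEST_MIB, overhead_mib=5)
kata_pods = pods_per_node(NODE_MEM_MIB, POD_REQUEST_MIB, overhead_mib=80)

if __name__ == "__main__":
    print(f"runc: {runc_pods} pods, kata: {kata_pods} pods")
    print(f"extra capacity needed for the same pod count: "
          f"{runc_pods / kata_pods - 1:.0%}")
```

Under these assumptions the node holds 251 runc pods but 195 Kata pods, so the Kata pool needs roughly 29% more memory for the same pod count. The gap shrinks as per-pod requests grow, because the fixed overhead becomes a smaller share of each pod. Kubernetes can account for this overhead at scheduling time through the RuntimeClass `overhead.podFixed` field.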
Inside the VM (Kata), traditional tools (prometheus-node-exporter, vmstat) work the same. From the host side, you see one process per microVM β your existing per-container monitoring needs to follow the VM boundary.
Most OCI images "just work" in Kata. Exceptions:
- Images requiring host PID/networking (e.g., debugging tools that join host namespaces; these defeat Kata's purpose anyway)
- Images relying on GPU passthrough (PCI passthrough through Kata is possible but operational pain)
| Resource | Why |
|---|---|
| Kata Containers documentation: https://katacontainers.io/docs/ | Architecture, install, ops |
| Container Security, Liz Rice (O'Reilly, 2020) | Ch. 4-6 on isolation; everything you need before reading Kata internals |
| gVisor architecture guide: https://gvisor.dev/docs/architecture_guide/ | Sentry / Gofer / syscall interception explained |
| Firecracker design paper: Agache et al., NSDI 2020 | The "why Lambda needs this" paper. Short, technical, worth the 20 minutes |
| Confidential Computing Consortium: https://confidentialcomputing.io/ | The CoCo umbrella + standards bodies |
| CoCo K8s docs: https://confidentialcontainers.org/ | The CNCF Sandbox project for TEE-backed containers |
For the hypervisor internals: The Definitive Guide to KVM Virtualization on Linux, Christopher Negus (Apress, 2024). A heavier read; helpful if you want to debug Kata-with-KVM issues.
For confidential computing: Intel's TDX whitepapers + Microsoft Azure CC blog series (search "confidential computing 2026"). The field changes fast; books are out of date within a year.
| Lab 12 task | This reading section |
|---|---|
| Task 1.1 Install Kata | "Kata Containers: How It Works" |
| Task 1.2 Kernel inside container | "What Kata Buys You" + the table |
| Task 1.3 Run Juice Shop on both | (Lab 7 hardened image as the test workload) |
| Task 2.1 Isolation test | "Threat Model" + comparison |
| Task 2.2 Performance benchmark | "Performance Reality Check" table |
| Task 2.3 Trade-off analysis | "When You'd Actually Deploy a Sandbox" |
Read this first. Run Lab 12. Re-read when you hit a pitfall.
"You don't pay for the VM until something bad happens. The hard part is convincing the budget that the bad thing is probable enough." (paraphrased from an Adobe SecOps talk at KubeCon EU 2024)