- Intro
- Installing Docker inside the Container
- Running Docker inside the Container
- Preloading Inner Container Images
- Persistence of Inner Docker Images
- Inner Docker Privileged Containers
- Limitations for the Inner Docker
Sysbox has support for running Docker inside containers (aka Docker-in-Docker).
Unlike all other alternatives, Sysbox enables users to do this easily and securely, without resorting to complex Docker run commands and container images, and without using privileged containers or bind-mounting the host's Docker socket into the container. The inner Docker is totally isolated from the Docker on the host.
This is useful for Docker sandboxing, testing, and CI/CD use cases.
The easiest way is to use a system container image that has Docker preinstalled in it.
You can find a few such images in the Nestybox DockerHub repo. The Dockerfiles for the images are here.
Alternatively, you can always deploy a baseline system container image (e.g., ubuntu or alpine) and install Docker in it just as you would on a physical host or VM.
In fact, the system container images that come with Docker preinstalled are created with a Dockerfile that does just that.
You do this just as you would on a physical host or VM (e.g., by executing
docker run -it alpine inside the container).
The Sysbox Quickstart Guide has several examples showing how to run Docker inside a system container.
Sysbox enables you to easily create system container images that come preloaded with inner Docker container images.
This way, when you deploy the system container, the inner Docker images are already present, and void the need to pull them from the network.
There are two ways to do this:
-
Using
docker build -
Using
docker commit
See the User Guide System Container Images Chapter for a full explanation on how to do this.
Inner container images that are preloaded inside the system container image are persistent: they are present every time a new system container is created.
However, inner container images that are pulled into the system container at
runtime are not by default persistent: they get destroyed when the system
container is removed (e.g., via docker rm).
But it's easy to make these runtime inner container images (and even inner containers) persistent too.
You do this by simply mounting a host directory into the system container's
/var/lib/docker directory (i.e., where the inner Docker stores its container
images).
The Sysbox Quick Start Guide has examples here and here.
A couple of caveats though:
-
A given host directory mounted into a system container's
/var/lib/dockermust only be mounted on a single system container at any given time. This is a restriction imposed by the inner Docker daemon, which does not allow its image cache to be shared concurrently among multiple daemon instances. Sysbox will check for violations of this rule and report an appropriate error during system container creation. -
Do not mount the host's
/var/lib/dockerto a system container's/var/lib/docker. Doing so breaks container isolation since the system container can now inspect all sibling containers on the host. Furthermore, as mentioned in bullet (1) above, you cannot share/var/lib/dockeracross multiple container instances, so you can't share the host's Docker cache with a Docker instance running inside a sysbox container.
Inside a system container you can deploy privileged Docker containers (e.g.,
by issuing the following command to the inner Docker: docker run --privileged ...).
NOTE: due to a bug in Docker, this requires the inner Docker to be version 19.03 or newer.
The ability to run privileged containers inside a system container is useful when deploying inner containers that require full privileges (typically containers for system services such as Kubernetes control-plane pods).
Note however that a privileged container inside a system container is only privileged within the context of the system container, but has no privileges on the underlying host.
For example, when running a privileged container inside a system container, the
procfs (i.e., /proc) mounted inside the privileged container only allows
access to resources associated with the system container. It does not allow
access to all host resources.
This is a unique and key security feature of Sysbox: it allows you to run privileged containers inside a system container without risking host security.
Starting with Sysbox v0.6.7, and in hosts with Linux kernel 6.7 or later, it's possible for a Docker instance running inside a Sysbox container to build multi-arch images.
This is because Linux kernel 6.7 introduces user-namespacing on the binfmt_misc filesystem,
thanks to this kernel commit by the excellent Christian Brauner.
This means that each Sysbox container has a dedicated binfmt_misc filesystem that is
independent and isolated from the binfmt_misc filesystems on the host or in other
Sysbox containers.
Sysbox v0.6.7 detects if the kernel supports binfmt_misc namespacing and if so
it auto-mounts it inside each Sysbox container at /proc/sys/fs/binfmt_misc.
This in turn enables applications such as Docker running inside the Sysbox container to leverage this filesystem for multi-arch use cases, for example building an arm64 image on an amd64 host as in the example below.
- Start a Sysbox container that comes with Docker inside:
$ docker run --runtime=sysbox-runc -it --rm nestybox/ubuntu-jammy-systemd-docker
Now, inside the Sysbox container:
- Tell the inner Docker to install QEMU for emulation:
$ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
- Verify the installation succeeded:
$ docker buildx inspect --bootstrap
Name: default
Driver: docker
Nodes:
Name: default
Endpoint: default
Status: running
Platforms: linux/amd64, linux/386, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/arm/v7, linux/arm/v6
- Tell the inner Docker to build the image (assumes you have a Dockerfile somewhere):
$ docker buildx build --platform linux/arm64 -t alpine-arm-test:arm64 .
- Run the newly built image:
$ docker run --rm --platform linux/arm64 alpine-arm-test:arm64 uname -m
aarch64
Note that steps (2) -> (5) above occur inside the Sysbox container, and they are exactly the same steps you would do on a normal Linux host. This makes sense since the environment inside the Sysbox container resembles a Linux VM.
Also, the installation of QEMU for emulation occurs entirely inside the Sysbox
container, and it does not affect the host or other containers (because
binfmt_misc is namespaced per Sysbox container).
Finally, as noted above, this requires a Linux kernel version 6.7 or later. On
earlier Linux kernels, Sysbox will not auto-mount binfmt_misc inside the
container and in fact trying to mount it manually inside the container
(e.g., mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc) will
result in a "permission denied" error.
Most Docker functionality works perfectly inside the system container.
However, there are some limitations at this time. This section describes these limitations.
The inner Docker must store it's images at the usual /var/lib/docker. This
directory is known as the Docker "data-root".
While it's possible to configure the inner Docker to store it's images at some
other location within the system container (via the Docker daemon's
--data-root option), Sysbox does not currently support this (i.e., the
inner Docker won't work).
The inner Docker must not be configured with userns-remap.
Enabling userns-remap on the inner Docker would cause the Linux user-namespace to be used on the inner containers, further isolating them from the rest of the software in the system container.
This is useful and we plan to support it in the future.
Note however that even without the inner Docker userns remap, inner containers are already well isolated from the host by the system container itself, since the system container uses the Linux user namespace.
