An Overview of Security, Pseudo-Capabilities, Boxes and Namespaces (WIP) #100
KaiNorberg announced in Announcements
It's been a while since the last update. The foundations of PatchworkOS's security model, which turned out to be quite complex, have now been finalized. The core idea is done, but the details and implementation are still a work in progress and subject to change.
Included below is an overview of where we are currently, followed by a discussion on what comes next.
Security
In PatchworkOS, there are no Access Control Lists, user IDs or similar mechanisms. Instead, PatchworkOS uses a pseudo-capability security model based on per-process mountpoint namespaces and containerization. This means that there is no global filesystem view; each process has its own view of the filesystem, defined by which directories and files have been mounted or bound into its namespace.
For a basic example, say we have a process A which creates a child process B. Process A has access to a secret directory /secret that it does not want process B to access. To prevent process B from accessing the /secret directory, process A can create a new empty namespace for process B and simply not mount or bind the /secret directory into process B's namespace.
Alternatively, process A could mount a new empty tmpfs instance in its own namespace over the /secret directory using the ":private" flag. This prevents a child namespace from inheriting the mountpoint, and process A could store whatever it wanted there.
The namespace system allows for a composable, transparent, pseudo-capability security model. Processes can be given access to any combination of files and directories without needing hidden permission bits or similar mechanisms. Since everything is a file, this applies to practically everything in the system, including devices, IPC mechanisms, etc. For example, if you wish to prevent a process from using sockets, you could simply not mount or bind the /net directory into its namespace.
It would even be possible to implement a multi-user-like system entirely in user space using namespaces by having the init process bind different directories depending on the user logging in.
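To make the idea concrete, here is a minimal, self-contained model of per-process namespaces in plain C. It is not the PatchworkOS API: the names ns_t, ns_bind and ns_can_see are hypothetical, and a namespace is simplified to a list of visible path prefixes. It only illustrates that a path which was never bound simply does not exist from the process's point of view.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_BINDS 8

/* Hypothetical model: a namespace is just a list of visible path prefixes. */
typedef struct {
    const char* binds[MAX_BINDS];
    size_t count;
} ns_t;

/* Make `path` and everything beneath it visible. Returns false if the table is full. */
bool ns_bind(ns_t* ns, const char* path)
{
    assert(ns != NULL && path != NULL);
    if (ns->count == MAX_BINDS) {
        return false;
    }
    ns->binds[ns->count++] = path;
    return true;
}

/* A path is reachable only if it equals a bound path or lies beneath one. */
bool ns_can_see(const ns_t* ns, const char* path)
{
    assert(ns != NULL && path != NULL);
    for (size_t i = 0; i < ns->count; i++) {
        const char* bind = ns->binds[i];
        size_t len = strlen(bind);
        if (strcmp(bind, "/") == 0) {
            /* Binding the root exposes every absolute path. */
            if (path[0] == '/') {
                return true;
            }
            continue;
        }
        /* Require a component boundary so "/secret" does not match "/secrets". */
        if (strncmp(path, bind, len) == 0 && (path[len] == '\0' || path[len] == '/')) {
            return true;
        }
    }
    return false;
}
```

In this model, process A would bind /secret into its own ns_t and build a separate ns_t for process B containing only what B needs. Any lookup of /secret or /net through B's namespace then fails, with no permission check involved.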
Namespace Documentation
Userspace IO API Documentation
Hiding Dentries
For complex use cases, relying on mountpoints alone quickly becomes unwieldy. As such, the Virtual File System allows a filesystem to dynamically hide directories and files using the revalidate() dentry operation.
For example, in "procfs", a process can see all the /proc/[pid]/ files of processes in its namespace and in child namespaces, but for processes in parent namespaces certain files will appear to not exist in the filesystem hierarchy. The "netfs" filesystem works similarly, making sure that only processes in the namespace that created a socket can see its directory.
Process Filesystem Documentation
Networking Filesystem Documentation
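The visibility rule described above for "procfs" can be sketched as a small ancestry check. This is only a model under stated assumptions: the ns_node_t type and both function names are hypothetical, and the real revalidate() operation and its signature are not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical namespace node; each namespace only knows its parent. */
typedef struct ns_node {
    const struct ns_node* parent; /* NULL for the root namespace */
} ns_node_t;

/* True if `target` is `viewer` itself or one of its descendants. */
bool ns_is_same_or_child(const ns_node_t* viewer, const ns_node_t* target)
{
    assert(viewer != NULL);
    /* Walk from the target up towards the root, looking for the viewer. */
    for (const ns_node_t* n = target; n != NULL; n = n->parent) {
        if (n == viewer) {
            return true;
        }
    }
    return false;
}

/* Stand-in for a revalidate()-style decision on a restricted entry:
 * the entry appears to exist only if its owner lives in the viewer's
 * namespace or in one of the viewer's child namespaces. */
bool restricted_entry_visible(const ns_node_t* viewer, const ns_node_t* owner)
{
    return ns_is_same_or_child(viewer, owner);
}
```

With a root namespace, a child and a grandchild, a viewer in the root sees restricted entries of all three, while a viewer in the child sees nothing restricted that belongs to the root or to a sibling namespace.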
Share and Claim
To securely send file descriptors from one process to another, we introduce two new system calls: share() and claim(). These act as a replacement for SCM_RIGHTS in UNIX domain sockets.
The share() system call generates a one-time-use key which remains valid for a limited time. Since the key generated by this system call is a string, it can be sent to any other process using conventional IPC.
After a process receives a shared key, it can use the claim() system call to retrieve a file descriptor to the same underlying file object that was originally shared.
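The semantics described above (one-time use, limited lifetime, string keys) can be modelled in a few lines of ordinary C. This is a user-space simulation, not the real system calls: model_share and model_claim are hypothetical names, an int stands in for the shared file object, and time is passed in explicitly as an arbitrary tick count.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 16
#define KEY_LEN 32

typedef struct {
    char key[KEY_LEN];
    int object;      /* stand-in for the shared file object */
    long expires_at; /* absolute expiry time in arbitrary ticks */
    int in_use;
} key_entry_t;

static key_entry_t table[MAX_KEYS];
static unsigned next_id = 1;

/* Model of share(): register `object` and write a fresh key into `out`.
 * Returns 0 on success, -1 if no slot is free. */
int model_share(int object, long now, long timeout, char out[KEY_LEN])
{
    assert(out != NULL && timeout > 0);
    for (int i = 0; i < MAX_KEYS; i++) {
        if (table[i].in_use && table[i].expires_at > now) {
            continue; /* slot still holds a live key */
        }
        /* NOTE: a counter is predictable; a real key would have to be unguessable. */
        snprintf(table[i].key, KEY_LEN, "key-%u", next_id++);
        table[i].object = object;
        table[i].expires_at = now + timeout;
        table[i].in_use = 1;
        memcpy(out, table[i].key, KEY_LEN);
        return 0;
    }
    return -1;
}

/* Model of claim(): return the object for a valid key exactly once, else -1. */
int model_claim(const char* key, long now)
{
    assert(key != NULL);
    for (int i = 0; i < MAX_KEYS; i++) {
        if (!table[i].in_use || strcmp(table[i].key, key) != 0) {
            continue;
        }
        table[i].in_use = 0; /* one-time use: the key is consumed on first match */
        return now < table[i].expires_at ? table[i].object : -1;
    }
    return -1;
}
```

A sender would call model_share, pass the resulting string over any IPC channel, and the receiver would call model_claim. A second claim with the same key, or a claim after the timeout, fails.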
Key Documentation
Userspace IO API Documentation
Boxes
In userspace, PatchworkOS provides a simple containerization mechanism to isolate processes from the rest of the system. We call such an isolated process a "box".
Each box is stored in a /box/[box_name] directory containing a /box/[box_name]/manifest ini-style configuration file. This file defines what files and directories the box is allowed to access. Manifests are parsed by the boxd daemon, which is responsible for spawning and managing boxes.
Going over the entire box system is well beyond the scope of this discussion, so we will limit ourselves to one example box and to how the box system is used by a user.
Documentation
The DOOM Box
As an example, PatchworkOS includes a box for running DOOM using the doomgeneric port, stored at /box/doom. Its manifest file can be found here.
First, the manifest file defines the box's metadata such as its version, author, license, etc. and information about the executable such as its path (within the box's namespace) and its desired scheduling priority.
After that it defines the box's "sandbox", which specifies how the box should be configured. In this case, it specifies the "empty" profile, meaning that boxd will create a completely empty namespace and mount a tmpfs instance at its root. It also specifies that the box is a foreground box (more on that later).
Finally, it specifies a list of default environment variables and the most important section, the "namespace" section.
The namespace section specifies a list of files and directories to bind into the box's namespace, which is what ultimately controls what the box can access. In this case, doom is given extremely limited access, only binding four directories:
/box/doom/bin to /app/bin, allowing it to access its own executable stored in /box/doom/bin/doom.
/box/doom/data to /app/data, allowing it to access any WAD files or save files stored in /box/doom/data.
/net/local to itself, to allow it to create sockets to communicate with the Desktop Window Manager.
/dev/const to itself, to allow it to use the /dev/const/zero file to map/allocate memory.
The doom box cannot see or access user files, system configuration files, devices or anything else outside its bound directories. It can't even create pipes or shared memory, as the /dev/pipe/new and /dev/shmem/new files do not exist in its namespace.
Using Boxes
Containerization and capability models often introduce friction. In PatchworkOS, using boxes should be seamless to the point that a user should not even need to know that they are using a box.
In PatchworkOS there are only two directories for executables: /sbin for essential system binaries such as init, and /base/bin for everything else.
Within the /base/bin directory is the boxspawn binary, which is used via symlinks. For example, there is a symlink at /base/bin/doom pointing to boxspawn. When a user runs /base/bin/doom (or just doom if /base/bin is in the shell's PATH), the boxspawn binary will be executed, but the first argument passed to it will be /base/bin/doom due to the behavior of symlinks. The first argument is used to resolve the box name, doom in this case, and send a request to the boxd daemon to spawn the box.
All this means that from a user's perspective, running a containerized box is as simple as running any other binary; running doom from the shell will work as expected.
Foreground and Background Boxes
Boxes can be either foreground or background boxes. When a foreground box is spawned, boxd will perform additional setup such that the box will appear to be a child of the process that spawned it, setting up its stdio, process group, allowing the spawning process to retrieve its exit status, etc. This allows for a system where using containerized boxes can be indistinguishable from using a regular binary from a user perspective.
A background box on the other hand is intended for daemons and services that do not need to interact with the user. When a background box is spawned, it will run detached from the spawning process, without any stdio or similar.
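Returning to the symlink dispatch described under Using Boxes, the step of resolving a box name from the first argument could look roughly like the sketch below. It is an assumption about how such a lookup might be written, not the actual boxspawn source, and box_name_from_argv0 is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the final path component of the first argument,
 * so "/base/bin/doom" and plain "doom" both resolve to "doom". */
const char* box_name_from_argv0(const char* argv0)
{
    assert(argv0 != NULL);
    const char* slash = strrchr(argv0, '/');
    return slash != NULL ? slash + 1 : argv0;
}
```

Because the name is taken from the path the program was invoked through rather than from the binary itself, a single binary can serve any number of boxes, with one symlink per box.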
Documentation
Future Plans
The immediate next step is most likely the implementation of "File Servers" via a FUSE-like or 9P-like system. This means that a user-space process could implement its own file systems, either to back actual file systems or to create servers by implementing virtual file systems. In the same way that the kernel implements "devfs", boxd could implement "boxfs" or similar, which would fit far more cleanly into our security model and "everything is a file" philosophy. Once this is implemented, significant sections of user space will need to be reimplemented.
Currently, share() and claim() are not ideal: they are vulnerable if the generated key, which resides in user space, were to leak. However, it is a very convenient way to pass file descriptors, so the idea won't be abandoned entirely. Instead, the current idea is to add another parameter to specify the PID of the intended target, ensuring that even if the key leaks only the target can claim it. To avoid refactoring systems twice, this will only be added once file servers have been implemented.
There is currently a vulnerability in that file systems can be mounted by anyone, such that even if /net is not mounted into a box's namespace, the box could simply mount netfs on its own and bypass the restriction. Solving this wouldn't be too difficult; it could be as simple as saying that netfs can only be mounted once. It's more a question of deciding what the best way of solving it is, which is why the issue still exists.
It was slightly hinted at earlier, but we will be implementing multi-user support by having either the init process or boxd mount different directories depending on who is logging in. There may be some additional mechanisms in boxd itself, perhaps a specific "user namespace" which boxes could be started within, or similar. To some extent this work has already begun, as the reference implementation of Argon2, the winner of the Password Hashing Competition (PHC), has already been ported to PatchworkOS to be used for password hashing.