A New Architecture, Async Direct I/O, True Capability Security, New Root Directory Concept and Userspace Components #117
KaiNorberg
announced in
Announcements
It's been quite a while since the last update, and that's because a significant portion of PatchworkOS has been rewritten. So far the core architecture of the new design is more or less finalized, with a focus on async I/O, improving the capability security model, reducing ambient authority, stealing a few microkernel concepts to move policy into userspace (such as userspace process loading and module management) and moving userspace into a flexible component based system.
The rewrite is far from done, but below is an overview of what has been done so far. This can also be found in the README.
Philosophy
There are a few concepts that form the core of PatchworkOS: "everything is a file", asynchronous I/O, capability based security and others.
Everything is a File
The "everything is a file" philosophy means that almost all kernel resources are exposed as files, where a file is defined as an object that can be interacted with "like a file", as in it can be opened, read, written, and closed.
This can often result in unorthodox APIs that seem overcomplicated at first, but the goal is to provide a simple, consistent and most importantly composable interface for all kernel subsystems. The core argument is not that each individual API is better than its POSIX counterpart, but that they combine to form a system that is greater than the sum of its parts, allowing for behavior that was never explicitly designed for.
Plus, it's fun.
I/O
The I/O system is designed with several modern I/O concepts in mind. For example, all open/walk operations use `openat()` semantics, all I/O is vectored (uses scatter-gather lists), asynchronous and dispatched via an I/O Ring supporting timeouts and cancellation. All I/O is direct and (more or less) zero-copy.
There are two components to asynchronous I/O, the I/O Ring and I/O Request Packets.
The I/O Ring acts as the user-kernel space boundary and is made up of two circular queues mapped into userspace. The first queue is used by the userspace to submit I/O requests to the kernel. The second queue is used by the kernel to return the result of the I/O request. This system also features a virtual register system, allowing I/O Requests to store the result of their operation to a virtual register, which another I/O Request can read from into their arguments, allowing for several operations that may rely on the result of previous operations to be executed asynchronously.
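To make the two-queue layout and the virtual registers more concrete, here is a minimal single-threaded sketch. It is not the PatchworkOS implementation: every name in it (`ring_t`, `sqe_t`, `cqe_t`, `OP_INC`, `REG_NONE` and the three functions) is hypothetical, the only "operation" is a toy increment, and a real ring shared between userspace and the kernel would also need atomics and memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 8u     /* must be a power of two */
#define RING_REGS 4u
#define REG_NONE UINT32_MAX /* marker for "no virtual register" */

/* Hypothetical toy operation: result = argument + 1. */
enum { OP_INC = 1 };

typedef struct {
    uint32_t op;
    uint64_t arg;        /* immediate argument, used when arg_reg == REG_NONE */
    uint32_t arg_reg;    /* virtual register to load the argument from */
    uint32_t result_reg; /* virtual register to store the result into */
} sqe_t;

typedef struct {
    uint64_t result;
} cqe_t;

typedef struct {
    sqe_t sq[RING_ENTRIES]; /* submission queue: userspace -> kernel */
    uint32_t sq_head, sq_tail;
    cqe_t cq[RING_ENTRIES]; /* completion queue: kernel -> userspace */
    uint32_t cq_head, cq_tail;
    uint64_t regs[RING_REGS];
} ring_t;

/* "Userspace" side: queue a request, failing if the submission queue is full. */
static bool ring_submit(ring_t *ring, sqe_t entry)
{
    assert(ring != NULL);
    if (ring->sq_tail - ring->sq_head == RING_ENTRIES)
        return false;
    ring->sq[ring->sq_tail++ % RING_ENTRIES] = entry;
    return true;
}

/* "Kernel" side: run queued requests in order and post their completions. */
static size_t ring_process(ring_t *ring)
{
    assert(ring != NULL);
    size_t handled = 0;
    while (ring->sq_head != ring->sq_tail &&
           ring->cq_tail - ring->cq_head != RING_ENTRIES) {
        sqe_t entry = ring->sq[ring->sq_head++ % RING_ENTRIES];
        uint64_t arg = entry.arg;
        if (entry.arg_reg != REG_NONE)
            arg = ring->regs[entry.arg_reg % RING_REGS];
        uint64_t result = (entry.op == OP_INC) ? arg + 1 : 0;
        if (entry.result_reg != REG_NONE)
            ring->regs[entry.result_reg % RING_REGS] = result;
        ring->cq[ring->cq_tail++ % RING_ENTRIES].result = result;
        handled++;
    }
    return handled;
}

/* "Userspace" side: pop one completion if one is available. */
static bool ring_reap(ring_t *ring, cqe_t *out)
{
    assert(ring != NULL && out != NULL);
    if (ring->cq_head == ring->cq_tail)
        return false;
    *out = ring->cq[ring->cq_head++ % RING_ENTRIES];
    return true;
}
```

Submitting one request that stores its result in register 0 and a second that reads its argument from register 0 lets the second depend on the first without a round trip to userspace in between, which is the idea behind the virtual register system.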
The I/O Request Packet (IRP) is a self-contained structure that contains all the information needed to perform an I/O operation. When the kernel receives a submission queue entry, it will parse it and create an I/O Request Packet. The I/O Request Packet will then be sent to the appropriate vnode (file system, device, etc.) for processing. Once the I/O Request is completed, the kernel will write the result of the operation into the completion queue.
If the target vnode can't complete the IRP immediately, it simply returns a "PENDING" status and the kernel continues without blocking.
For reading or writing, the I/O Request Packet uses a Scatter Gather List, which is an array of entries, each containing a page frame number, offset and length. Since the kernel identity maps all of physical memory into its address space, it can directly read from or write to any buffers provided by userspace without needing to copy them into kernel space or map them.
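As a rough illustration of such a list, the sketch below splits a physical address range into per-page entries and then writes through them using a pointer that stands in for the kernel's identity mapping. The names `sgl_entry_t`, `sgl_build()`, `sgl_write()` and the 4 KiB `SGL_PAGE_SIZE` are assumptions made for this example, not the actual kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SGL_PAGE_SIZE 4096u /* assumed page size */

typedef struct {
    uint64_t pfn;    /* page frame number */
    uint32_t offset; /* byte offset within the page */
    uint32_t length; /* number of bytes described by this entry */
} sgl_entry_t;

/* Split [addr, addr + length) at page boundaries.
 * Returns the number of entries written, or 0 if `out` is too small. */
static size_t sgl_build(uint64_t addr, size_t length, sgl_entry_t *out, size_t max_entries)
{
    assert(out != NULL);
    size_t count = 0;
    while (length > 0) {
        if (count == max_entries)
            return 0;
        uint32_t offset = (uint32_t)(addr % SGL_PAGE_SIZE);
        size_t chunk = SGL_PAGE_SIZE - offset;
        if (chunk > length)
            chunk = length;
        out[count].pfn = addr / SGL_PAGE_SIZE;
        out[count].offset = offset;
        out[count].length = (uint32_t)chunk;
        count++;
        addr += chunk;
        length -= chunk;
    }
    return count;
}

/* Copy `length` bytes from `src` into the pages described by the list.
 * `identity_base` plays the role of the identity mapping of physical memory.
 * Returns the number of bytes copied. */
static size_t sgl_write(uint8_t *identity_base, const sgl_entry_t *sgl, size_t count,
                        const void *src, size_t length)
{
    assert(identity_base != NULL && sgl != NULL && src != NULL);
    const uint8_t *in = src;
    size_t done = 0;
    for (size_t i = 0; i < count && done < length; i++) {
        size_t chunk = sgl[i].length;
        if (chunk > length - done)
            chunk = length - done;
        memcpy(identity_base + sgl[i].pfn * SGL_PAGE_SIZE + sgl[i].offset, in + done, chunk);
        done += chunk;
    }
    return done;
}
```

Because each entry names a page frame rather than a virtual address, the consumer never needs the submitting process's page tables, only the identity mapping.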
Built on top of this system are several layers of abstractions. For example, the `iowrite()` function is a simple synchronous wrapper around the I/O ring and `fwrite()` (provided by ANSI C) is a wrapper around `iowrite()` that works as expected. Many helper functions are also provided, for example `iowritep()` is a version of `iowrite()` that will use the virtual register system to perform a walk, write and drop using a single system call.
The "error" handling or status system allows for certain optimizations. For example, if a read is performed on a file such that no more data remains, the returned status will be an informational `EOF` status. In certain cases, this means we can skip an additional read to check for EOF, potentially saving us a system call.
The combination of this system and our "everything is a file" philosophy means that, since files are interacted with via async I/O and everything is a file, practically all operations can be asynchronous and dispatched via an I/O Ring.
Security
In PatchworkOS, there are no Access Control Lists, user IDs or similar mechanisms. Instead, PatchworkOS uses a capability security model based on file descriptors.
A process can only access files that have been passed to it via file descriptors, and since everything is a file, this applies to practically everything in the system, including devices, IPC mechanisms, etc.
The `..` Operator
The `..` or dotdot operator is a major security concern in a capability based system. Consider that if we pass a directory to a process, that process could use `..` to access the parent directory and then the parent of the parent and so on. This vulnerability would effectively make capability based security meaningless.
A tempting solution, used by other capability based systems, is to ban the use of `..` entirely and instead use string parsing to normalize paths (i.e. turning `a/b/..` into `a`). There are two primary issues with this solution, relative paths and symlinks.
As an example of a relative path, consider the path `./..`. The issue we encounter is that we may not know what `.` refers to, requiring us to track the current working directory as a string. This is not only complex but also inefficient as we would need to parse the entire path string for every traversal and also start any path traversal from the root directory instead of being able to start from any file descriptor.
We are still left with the problem of symlinks. Consider the path `a/b/../c`. If `b` is a symlink, and we proceed to normalize this path to `a/c`, we will end up accessing `c` within `a`, not within whatever directory the symlink points to, making symlinks pointless.
There are alternative and superior solutions to this problem. However, hopefully the point is clear: simply banning `..` is not a good solution due to the complexity and inefficiency it introduces, and the fact that symlinks become difficult to impossible to implement. That's not even mentioning the potential for race conditions when normalizing paths.
The solution proposed by PatchworkOS comes from the realization that `..` is not inherently dangerous. Instead, it is only dangerous when it can be used to grant additional capabilities.
As such we allow `..` if the process can prove that it already has a capability to reach the parent directory. For example, say we have a directory `a` containing a directory `b`, which in turn contains a file `c`.
Now let's say we have a process that wishes to open the `b` directory and that has two file descriptors, one to the `c` file (as in it has the capability to access `c`) and one to the `a` directory (as in it has the capability to access `a`, the contents of `a` and the contents of all subdirectories).
In this case, if we disallow the process from using `..` from `c` to access `b`, we are not meaningfully preventing the process from accessing `b`, since it can just use the `a` file descriptor to access `b` directly. From this perspective, using `..` from `c` is merely a more convenient way to access `b`; doing so does not actually grant any new capabilities to the process.
However, if the process didn't have a file descriptor to `a` then allowing it to use `..` from `c` would grant additional capabilities and as such should not be allowed.
All of this does however hinge on the ability for a process to prove that it has a capability to access the parent directory. The way this is done is closely tied to how PatchworkOS handles the "root directory."
The Root Directory
In PatchworkOS there is no global root or even local root. Instead, when a process walks a path it must always specify some file descriptor to be considered the root for that specific operation.
This root file descriptor has three purposes. First, it is used to implement paths starting with `/`, letting paths start from the root file descriptor.
Second, it is used by a process to provide the proof discussed in the `..` section. If the process tries to use `..` to access the parent directory, the kernel will check if the specified root can reach that parent directory. If it can, then `..` acts as expected, otherwise `..` becomes a no-op to replicate expected POSIX-like behavior (e.g. `/../../` is equivalent to `/`).
Finally, the root file descriptor stores bindings. Within PatchworkOS, there is no namespace or per-process mountpoints. Instead, each file object stores a table of bindings. These bindings act as one would expect within POSIX, allowing a file to appear at a different path than its actual location within the filesystem hierarchy. When a bind is performed, that bind will only apply when walking paths from the file object whose binding table the bind was added to.
In this system one can consider binding a file to be nothing more than a convenient way to pass multiple capabilities (file descriptors) within a single file descriptor, by binding paths within its binding table. It does also allow all the expected benefits of bindings or mounts from POSIX-like systems but from a different perspective.
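The `..` rule described above (follow `..` only when the specified root can already reach the parent, otherwise treat it as a no-op) can be sketched on a toy in-memory tree. This is only an illustration under simplifying assumptions: `node_t`, `reachable_from()` and `walk_dotdot()` are hypothetical names, reachability is modelled purely through parent pointers, and bindings are ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy directory entry that only knows its parent. */
typedef struct node {
    const char *name;
    struct node *parent;
} node_t;

/* True if `target` is `root` itself or lies somewhere below it. */
static bool reachable_from(const node_t *root, const node_t *target)
{
    for (const node_t *n = target; n != NULL; n = n->parent) {
        if (n == root)
            return true;
    }
    return false;
}

/* Resolve ".." from `current` for a walk that uses `root` as its root.
 * The parent is returned only if the root can already reach it,
 * otherwise ".." is a no-op and `current` is returned. */
static const node_t *walk_dotdot(const node_t *root, const node_t *current)
{
    assert(root != NULL && current != NULL);
    const node_t *parent = current->parent;
    if (parent != NULL && reachable_from(root, parent))
        return parent;
    return current;
}
```

With `a` containing `b` containing `c`, a walk rooted at `a` can step from `c` up to `b`, while a walk rooted at `a` that is already at `a` stays where it is, matching the `/../../` behavior mentioned above.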
Standard Library
The standard library (libstd) is a superset of the ANSI C standard library, meaning that headers such as `<stdio.h>` and `<stdlib.h>` are included while POSIX headers such as `<unistd.h>` are not. Instead, the `sys` directory provides a set of PatchworkOS-specific headers such as `<sys/io.h>` and `<sys/proc.h>`.
Overall, an attempt is made to reuse and integrate our extensions cleanly without duplicating the ANSI sections of the standard library, for example the C11 `<threads.h>` header provides threading with `<sys/proc.h>` intentionally mirroring its API.
Practical Examples
Included below are some practical examples of how to use the APIs provided by PatchworkOS, and how they differ from their POSIX counterparts. These examples are not meant to be comprehensive, but rather to provide an instinct and intuition for how PatchworkOS works.
Basic File I/O
For a basic example of file I/O, let's say we wanted to open a file, write "Hello, World!" to it and then close it.
In a POSIX system, we might write:
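A minimal sketch of the POSIX side, using only standard `open()`, `write()` and `close()`. The helper name `write_hello()` and the choice of `O_CREAT | O_TRUNC` with mode `0644` are assumptions made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Open `path` for reading and writing, write the greeting and close the file.
 * Returns 0 on success and -1 on any failure. */
static int write_hello(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *message = "Hello, World!";
    size_t length = strlen(message);
    ssize_t written = write(fd, message, length);
    int result = (written == (ssize_t)length) ? 0 : -1;

    if (close(fd) < 0)
        result = -1;
    return result;
}
```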
Using the synchronous I/O wrappers in PatchworkOS, we would write:
We first open the file using `iowalk()`, specifying the default current working directory and root directory along with a path. Within the path we specify that we want "read and write" permissions (`:rw`, see [Path Flags and Payloads](#path-flags-and-payloads)).
Note that the `FDCWD` and `FDROOT` constants are just standard file descriptors like `STDIN`, `STDOUT` and `STDERR` (called `FDIN`, `FDOUT` and `FDERR` respectively). In PatchworkOS, the current working directory and root directory are just file descriptors like any other; an agreed upon convention that allows other processes to easily inherit them as needed.
Then we write to the file using `iowrite()`, passing the file descriptor, a buffer containing the data to write (the `iowrite()` function actually expects an array of `iovec_t` which the `IOBUF()` macro creates on the stack for convenience) and the offset to write at (in this case `IOCUR` to write at the current offset).
Finally, we close the file using `iodrop()`.
The `iowritet()`, `ioreadt()` and `iowalkt()` functions are also provided that expect an additional `clock_t timeout` argument. There is also an event loop based abstraction around the I/O Ring itself provided via macros with the `Q` suffix.
Path Flags and Payloads
A path can contain two additional optional segments, path flags and a payload. Path flags are appended after a `:` character and can be written in two forms, either in full form or in short form with the short form being a single letter that can be specified in groups. For example, `/my/path:read:write:execute` could also be written as `/my/path:rwx`. Note that the order of the letters and duplicates are ignored.
Beyond simple permission flags we have behavior flags such as `:append`, `:parents`, `:truncate`, etc. and creation flags such as `:create`, `:directory`, `:symlink` and `:hardlink`, with a plain `:create` creating a regular file.
The payload of a path is specified after a `?` character. This payload is treated as a raw string and will be passed to the underlying filesystem, allowing it to handle the payload in any way it chooses. However, typically filesystems will expect an options list in the form of key-value pairs separated by `&` characters.
For example, we could create a symlink by walking the path `/my/path/to/source:symlink?/my/path/to/target`.
Another example is concatfs which is used to concatenate the contents of several directories into a single directory. It expects the targets of the concatenation to be specified in the payload to its clone file, for example `/sys/fs/concatfs/clone?targets=1,2,3,4` where 1, 2, 3 and 4 are file descriptors.
The primary intent behind the use of the flags and payload system is to allow for greater composability. With this system, any environment that can open a file, a Lua script, a shell, etc. can create any file, directory, symlink or hardlink with any permissions and flags without needing to rely on custom "PatchworkOS extensions".
One can as an exercise imagine the potential of a basic "touch" shell utility with this system.
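To show how a path of this shape breaks apart, here is a small parsing sketch. It is not the kernel's parser: `parsed_path_t`, `parse_path()` and the `PERM_*` bits are hypothetical, only the three permission flags are recognized (in full form or as grouped single letters), and it assumes the path segment itself contains no `:` character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { PERM_READ = 1 << 0, PERM_WRITE = 1 << 1, PERM_EXECUTE = 1 << 2 };

typedef struct {
    char path[128];      /* the path segment, without flags or payload */
    unsigned perms;      /* combination of PERM_* bits */
    const char *payload; /* points into the input, NULL if there is no payload */
} parsed_path_t;

static bool flag_is(const char *flag, size_t len, const char *name)
{
    return len == strlen(name) && memcmp(flag, name, len) == 0;
}

/* Accepts "read", "write", "execute" or a group of the letters r, w and x. */
static bool parse_flag(const char *flag, size_t len, unsigned *perms)
{
    if (flag_is(flag, len, "read")) { *perms |= PERM_READ; return true; }
    if (flag_is(flag, len, "write")) { *perms |= PERM_WRITE; return true; }
    if (flag_is(flag, len, "execute")) { *perms |= PERM_EXECUTE; return true; }
    for (size_t i = 0; i < len; i++) {
        switch (flag[i]) {
        case 'r': *perms |= PERM_READ; break;
        case 'w': *perms |= PERM_WRITE; break;
        case 'x': *perms |= PERM_EXECUTE; break;
        default: return false; /* unknown flag in this sketch */
        }
    }
    return true;
}

static bool parse_path(const char *input, parsed_path_t *out)
{
    assert(input != NULL && out != NULL);
    /* Everything after the first '?' is the raw payload. */
    const char *query = strchr(input, '?');
    size_t head_len = query != NULL ? (size_t)(query - input) : strlen(input);
    const char *end = input + head_len;
    out->payload = query != NULL ? query + 1 : NULL;
    out->perms = 0;

    /* The first ':' before the payload ends the path segment. */
    const char *colon = memchr(input, ':', head_len);
    size_t path_len = colon != NULL ? (size_t)(colon - input) : head_len;
    if (path_len >= sizeof(out->path))
        return false;
    memcpy(out->path, input, path_len);
    out->path[path_len] = '\0';

    /* Each remaining ':'-separated piece is one flag or one letter group. */
    for (const char *flag = colon; flag != NULL;) {
        flag++; /* skip the ':' */
        const char *next = memchr(flag, ':', (size_t)(end - flag));
        size_t len = (size_t)((next != NULL ? next : end) - flag);
        if (!parse_flag(flag, len, &out->perms))
            return false;
        flag = next;
    }
    return true;
}
```

Since the bits are simply OR-ed together, the order of flags and any duplicates have no effect, which mirrors the behavior described above.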
Process Creation
Let's say we wanted to create a process, redirect its standard I/O to a set of file descriptors and then execute a program.
In a POSIX system, we might write:
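As a minimal sketch of the usual POSIX pattern, built from `fork()`, `dup2()`, `execvp()` and `waitpid()`. The helper names `spawn_redirected()` and `wait_for_exit()` are made up for this example, the caller is assumed to have created the descriptors (for instance with `pipe()`), and a fuller version would also close the descriptors the child does not need.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, redirect the child's stdin and stdout to the given descriptors and
 * execute `program`. Returns the child's pid, or -1 if fork() fails. */
static pid_t spawn_redirected(const char *program, char *const argv[], int in_fd, int out_fd)
{
    assert(program != NULL && argv != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace standard input and output, then run the program. */
        if (dup2(in_fd, STDIN_FILENO) < 0 || dup2(out_fd, STDOUT_FILENO) < 0)
            _exit(127);
        execvp(program, argv);
        _exit(127); /* only reached if execvp() failed */
    }

    return pid;
}

/* Wait for the child and return its exit code, or -1 if it did not exit normally. */
static int wait_for_exit(pid_t pid)
{
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```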
Using the synchronous I/O wrappers in PatchworkOS, we would write:
We first create two pipes by opening the special file `/dev/pipe/clone` twice.
Then we create a new process using the `proc_create()` function. This function takes several arguments. First come the root and current working directory to use when resolving paths, followed by a `proc_args_t` structure containing the command line arguments for the process, which we construct using the `PROC_ARGS()` helper. Next is an array of `proc_fd_t` structures allowing us to pass file descriptors to the child, where each `proc_fd_t` structure contains a parent file descriptor and a child file descriptor, followed by the size of this array, which we compute using the `ARRAY_SIZE()` helper. After that come the process's priority and flags, and the final argument is an output pointer for a file descriptor to the child's proc directory containing files for manipulating the child.
We could optimize the pipe creation by walking to the second pipe relative to the first one. This optimization can be applied any time we wish to open the same file multiple times:
It's important to note that `proc_create()` is not a system call; it's a wrapper around the `/proc/clone` file which when opened returns the root of the new process's proc directory. The kernel does nothing more than provide an empty address space that `proc_create()` fills using the `mem` file in the child's proc directory.
A process will be freed when its reference count reaches zero, as such "killing" a process is merely a matter of freeing its threads so that they drop their references to the process.
Environment Variables
Environment variables are typically a set of key-value pairs that provide a simple way to configure programs. This concept of environment variables maps cleanly to a directory containing files, where the name of the file is the key and its contents are the value. As such, environment variables are provided via a binding in the `/env` directory. This directory could either be a real directory, allowing the user to manage environment variables via the filesystem, or one could create a tmpfs instance and use that as the `/env` directory.
Notes/Signals
Notes are PatchworkOS's equivalent to POSIX signals which asynchronously send strings to processes.
In POSIX, if a page fault were to occur in a process running in some form of shell, we would usually receive a `SIGSEGV`, which is not very helpful. The core limitation is that signals are just integers, so we can't receive any additional information.
In PatchworkOS, a note is a string where the first word of the string is the note type and the rest is arbitrary data. As such, a page fault note might look like:
All that happened is that the shell printed the exit status of the process, which is also a string and in this case is set to the note that killed the process.
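Since a note is just "type word, then arbitrary data", splitting one apart takes very little code. The sketch below is an assumption-laden illustration: `note_t` and `note_parse()` are hypothetical names, the type and data are assumed to be separated by spaces, and any note string used with it here is made up rather than taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char type[32];    /* first word of the note */
    const char *data; /* rest of the note, points into the original string */
} note_t;

/* Split a note into its type (first word) and the remaining data.
 * Returns false if the note has no type or the type does not fit. */
static bool note_parse(const char *note, note_t *out)
{
    assert(note != NULL && out != NULL);
    size_t length = strcspn(note, " ");
    if (length == 0 || length >= sizeof(out->type))
        return false;

    memcpy(out->type, note, length);
    out->type[length] = '\0';

    /* Skip the separating spaces so `data` starts at the payload. */
    out->data = note + length;
    while (*out->data == ' ')
        out->data++;
    return true;
}
```

A receiver could then dispatch on `type` with `strcmp()` and hand `data` to whatever code understands that note type.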
Mounting a Filesystem
There is no `mount()` system call in PatchworkOS; instead filesystems are exposed via files which are used in combination with the `fdbind()` function to mount filesystems.
Filesystem files are exposed by "sysfs" as directories, for example, `/sys/fs/tmpfs` is the filesystem directory for the tmpfs filesystem. Within these directories are "clone" files. Opening one of these clone files (for example `/sys/fs/tmpfs/clone`) gives us a file descriptor containing the root of a new instance of that filesystem (for more complex filesystems, for example a disk based one, additional parameters might be needed within the payload specified in `iowalk()` when opening the filesystem file).
Then we can use `fdbind()` to bind the root of the filesystem instance into our desired target:
Components
In PatchworkOS, userspace is made up of "components". These components can be anything: executable programs, libraries, headers, or just data files.
Each component is stored in a `/comp/<name>` directory. Within each component's directory are version directories written in the form `<x>.<y>.<z>` (major.minor.patch).
The actual component files are stored within the version directories, usually within subdirectories like `bin/`, `lib/`, `include/`, etc. In addition, there is a manifest file which describes the component, its dependencies, and what capabilities it requires.
These manifests are written in a simple markup language made for PatchworkOS called S-expression CONfig (SCON). A parser is provided in libstd with the purpose of standardizing any configuration files used throughout the OS.
Included below is an example manifest file:

    (component
        (description "An example component.")
        (author "Kai Norberg")
        (license MIT)
        (launch bin/example)
        (dependencies
            (libstd 1.0.0)
        )
        (capabilities
            /dev/fb
            /dev/kbd
        )
    )

Launching Components
Any process can launch a component using the `comp_launch()` function from libstd. This function will construct a new root file descriptor for the component, with all the directories and files within the component's directory and the directories of any dependencies being concatenated via concatfs into a set of standard directories such as `/bin`, `/lib`, etc. and with any additional files specified via the capabilities being bound to the expected locations.
Let's take the component described above as an example. The libstd component provides a `lib/libstd.so` file and let's also say that a second dependency, libother, provides a `lib/libother.so` file. In this case, the launched process would then find both `libstd.so` and `libother.so` in `/lib`. It would also be able to access `/dev/fb` and `/dev/kbd`, as those were specified directly.
The `comp_launch()` function will automatically handle versioning via Minimum Version Selection, inspired by Go. This means that the system will always choose the lowest possible version of each component that satisfies all dependencies, so the version specified in a manifest might not be the version that's loaded. Instead, the version specified is the minimum version.
This ensures that the system is reproducible, that any updates have to be explicit (so an update never breaks the system) and that rollbacks are effortless (with the potential for some auto update system in the future), as the same set of dependencies will always result in the same environment, given the same manifests.
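The core of Minimum Version Selection can be sketched in a few lines: once every manifest's requirements have been collected, the version chosen for a component is the highest of the stated minimums, which is also the lowest version that satisfies all of them. The sketch below assumes the requirements have already been gathered into a flat array and ignores major-version compatibility. `version_t`, `requirement_t`, `version_compare()` and `mvs_select()` are hypothetical names, not part of `comp_launch()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A version in <x>.<y>.<z> (major.minor.patch) form. */
typedef struct {
    unsigned x, y, z;
} version_t;

/* One "needs at least this version" entry taken from some manifest. */
typedef struct {
    const char *name;
    version_t minimum;
} requirement_t;

/* Returns a negative value, zero or a positive value, like strcmp(). */
static int version_compare(version_t a, version_t b)
{
    if (a.x != b.x) return a.x < b.x ? -1 : 1;
    if (a.y != b.y) return a.y < b.y ? -1 : 1;
    if (a.z != b.z) return a.z < b.z ? -1 : 1;
    return 0;
}

/* Select the version of `name`: the highest minimum that any requirement asks
 * for. Returns false if nothing requires `name`. */
static bool mvs_select(const requirement_t *reqs, size_t count, const char *name,
                       version_t *selected)
{
    assert(reqs != NULL && name != NULL && selected != NULL);
    bool found = false;
    for (size_t i = 0; i < count; i++) {
        if (strcmp(reqs[i].name, name) != 0)
            continue;
        if (!found || version_compare(reqs[i].minimum, *selected) > 0)
            *selected = reqs[i].minimum;
        found = true;
    }
    return found;
}
```

Because the result depends only on the set of requirements, the same manifests always produce the same selection, which is where the reproducibility described above comes from.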
This all has one rather large limitation, in that the parent process must have all the capabilities to be passed to the child. If the child needs a capability that the parent does not have, the `comp_launch()` function will fail.
The Init Process
The one exception to this rule is the init process, which is special in that it is the only process "loaded" by the kernel (it is actually loaded by the bootloader and the kernel simply copies the executable into memory) since executable loading is handled in userspace. The init process is granted an `FDROOT` file descriptor to the root of `sysfs` from which it can acquire all capabilities. It uses these capabilities to load the RAM disk and set up userspace.
This means that the security model forms a tree-like structure, with init having all capabilities and all child processes having some subset of those capabilities.
Modules
PatchworkOS uses a "modular" kernel design, meaning that instead of having one big kernel binary, the kernel is split into several smaller "modules" that can be loaded and unloaded at runtime.
This is highly convenient for development, but it also has practical advantages, for example, there is no need to load a driver for a device that is not attached to the system, saving memory.
The Module Manager (modman)
The module manager is a userspace component, no different from any other component, that is granted two capabilities: the `/dev/announce` file and the `/sys/mod` directory.
The `/dev/announce` file allows the kernel to provide userspace with a stream of messages describing device state changes, usually a device being attached or detached. For example:
When the module manager receives a message like the one above, it will look inside the `/comp/.index/devices` directory, in which there are subdirectories named after each device type, in this case this would be the `/comp/.index/devices/PNP0303` directory. Inside that subdirectory is a series of symlinks to components that provide kernel modules that are able to handle that device type.
Some modules may return a "DEFERRED" status. This causes the module manager to defer the loading of that module and try again when any new device is attached.
Make your own Module
Making a module is intended to be as straightforward as possible. For the sake of demonstration, we will create a simple "Hello, World!" module.
Since kernel modules are just components, we must first create a new component. We begin by creating the `comp/hello` directory, in which we must create a `manifest.scon` file for our component, to which we write the following code:

    (component
        (description "Example Hello World module.")
        (author "Your name here")
        (license MIT)
        (module mod/hello.ko)
    )

This file specifies basic metadata about our component, most importantly that it provides a kernel module which the module manager can find at `mod/hello.ko` within our component's directory.
Now we can create a `hello.mk` file, in the same directory as the manifest file, to which we write the following code:
This `.mk` file describes our component to the build system, giving it its name, version, type and most importantly what devices it can handle. In this case we specify `BOOT_ALWAYS` which is a special device that the module manager will pretend was attached during boot, allowing modules that specify it to always be loaded.
We are now able to write the actual module. We will create a `src` directory within `comp/hello` within which we create a `hello.c` file containing the included code:
The final directory structure should look something like this:
We can now run the `make all run` command and should see a "Hello, World!" message within the kernel's logs during boot.
If this didn't work, or bugs are encountered, please open an issue.
Future Plans
Now that the core of the new architecture is more or less complete, the next steps will be fleshing out userspace, starting with user accounts, some form of login manager and a new GUI.
User accounts should be rather straightforward. Each account will have its own "home" and "env" directories which will be bound into the first process of that account, along with whatever other capabilities that account has been given.
The ideas for the login manager are still not finalized, but since Argon2id has been ported to PatchworkOS we will be using that for password hashing.
Finally, there have been many ideas regarding the GUI. The core limitation that we have to grapple with is that implementing GPU support is, to put it mildly, wildly outside the scope of a project like this, which limits us to CPU or software rendering.
Using CPU rendering forces us to design the "aesthetic" of our GUI around what a CPU can render most efficiently. This is the primary reason for the Windows 9x inspired GUI we had before the rewrite, as it consists of simple opaque rectangular shapes which are very efficient to render on a CPU.
So far anything more complex has been deliberately avoided to not end up in a situation where only a high-end CPU can run this OS simply because of how slow the GUI is.
However, it seems possible that we could implement a GNOME inspired design if we were to accept the cost of implementing transparency (via SIMD), which would be a not-insignificant cost but could most likely be optimized to an acceptable level. Such a design also consists of relatively simple, although rounded, shapes, which could be cached. Effects like shadows would be a simple partially transparent gradient that would also be cached.
Only testing will decide if such an approach will be considered acceptably performant.
Of course, as always, if you have any questions, find issues, or anything else, please leave a comment or open an issue on GitHub.