|
| 1 | +--- |
| 2 | +title: "TClone: Decoupling Fast Branch Creation from Durable Checkpointing for Computer-Use Agents" |
| 3 | +date: 2026-05-17 |
| 4 | +draft: false |
| 5 | +hideToc: false |
| 6 | +tags: ["Sys for ML", "CUAs", "Checkpoint/Restore"] |
| 7 | +summary: " |
| 8 | +Computer-use agents increasingly want to *branch*: |
| 9 | +try several actions in parallel, keep the best, and roll back the rest. |
| 10 | +But a branch of a desktop is an entire running workspace, |
| 11 | +and cloning it with today's tools means a synchronous checkpoint/restore on |
| 12 | +the critical path of every speculative step. We built **TClone**, |
| 13 | +a workspace-versioning substrate that makes a branch runnable immediately |
| 14 | +by sharing memory and filesystem state copy-on-write and pushing durable |
| 15 | +checkpointing off the fast path. TClone clones a live workspace |
| 16 | +**up to 4.9x faster** than VM snapshots and **3.4x faster** than stock CRIU, |
| 17 | +and cuts end-to-end agent task latency by **up to 3.7x**. |
| 18 | +<br/><br/> |
| 19 | +[Read More...](https://mlsys.wuklab.io/posts/tclone/) |
| 20 | +" |
| 21 | + |
| 22 | +--- |
| 23 | +Authors: Yutong Huang, Vikranth Srivatsa, Alex Asch, Hansin Tushar Patwa, and Yiying Zhang |
| 24 | + |
| 25 | +**TLDR**: Computer-use agents increasingly want to *branch*: |
| 26 | +explore several actions in parallel, keep the best trajectory, |
| 27 | +and roll back the rest. But a branch of a desktop is an entire running workspace (processes, memory, files, GUI), and |
| 28 | +cloning it with today's tools means a synchronous checkpoint-and-restore on the critical path of every speculative step. |
| 29 | +We built **TClone**, a workspace-versioning substrate built on a modified Linux kernel and an extended CRIU. Its one |
| 30 | +architectural principle is that *online branch creation is separated from durable checkpointing*: a branch becomes |
| 31 | +runnable through copy-on-write sharing without copying a single page on the critical path, while serialization to disk |
| 32 | +proceeds asynchronously and off that path. TClone clones a live workspace **up to 4.9x faster** than KVM snapshots and |
| 33 | +**3.4x faster** than stock CRIU. |
| 34 | + |
| 35 | +## Computer-Use Agents Want to Branch |
| 36 | + |
| 37 | +Computer-use agents (CUAs) drive a real desktop the way a person would. The strongest agent loops do not commit to one |
| 38 | +line of action: they run *best-of-N*, *beam search*, and rollback-heavy exploration, forking the current state, trying |
| 39 | +several next actions in parallel, keeping the winner, and discarding the rest. They also need a *personal* workspace the |
| 40 | +user keeps using, with speculative agent branches forked off it and either committed back or thrown away. |
| 41 | + |
| 42 | + |
| 43 | +<p align="center"><em>A versioned personal workspace: the live PC environment is branched into an agent search tree, |
| 44 | +branches are snapshotted and rolled back, and the selected result is committed back to the human's workspace.</em></p> |
| 45 | + |
| 46 | +The catch is that each fork is not a lightweight object. It is an entire live workspace: a process tree of a |
| 47 | +hundred-plus processes, gigabytes of memory, an open filesystem, and a running GUI stack. The usual ways to clone a |
| 48 | +running container, checkpoint/restore (CRIU) or a VM snapshot, serialize that whole workspace to storage and rebuild it, |
| 49 | +and that cost lands directly on the agent's critical path at *every* branch point. |
| 50 | + |
| 51 | + |
| 52 | +<p align="center"><em>Per-branch cost for one OSWorld step. The clone segment (red) is 18.2s under KVM and 12.8s under |
| 53 | +CRIU, but only 3.7s under TClone; everything else is model-bound and identical.</em></p> |
| 54 | + |
| 55 | +Everything except that clone segment is model-bound and identical across systems. Stock CRIU spends 12.8s and a KVM |
| 56 | +snapshot 18.2s cloning a single branch; TClone spends 3.7s. Because the overhead recurs on every fork and rollback, it |
| 57 | +compounds: the more an agent branches, the more it pays. |
| 58 | + |
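The compounding effect is simple arithmetic. The sketch below uses only the three clone-segment times quoted above; the 20-fork run and the assumption that forks are paid serially are illustrative, not measurements from the post.

```python
# Clone-segment times in seconds for one OSWorld step, as quoted above.
CLONE_SECONDS = {"KVM": 18.2, "CRIU": 12.8, "TClone": 3.7}


def speedup(baseline: str, system: str = "TClone") -> float:
    """How many times faster `system` clones a workspace than `baseline`."""
    return CLONE_SECONDS[baseline] / CLONE_SECONDS[system]


def clone_overhead(system: str, forks: int) -> float:
    """Total clone time if every fork pays the full clone segment.

    Assumption: forks are paid one after another with no overlap.
    """
    return CLONE_SECONDS[system] * forks


if __name__ == "__main__":
    for name in ("KVM", "CRIU"):
        print(f"{name} vs TClone: {speedup(name):.2f}x per clone")
    # Hypothetical run with 20 fork/rollback points.
    for name in CLONE_SECONDS:
        print(f"{name}: {clone_overhead(name, 20):.0f}s spent cloning")
```

The per-clone ratios come out at about 4.92 and 3.46, in line with the headline numbers.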
| 59 | +## The Big Idea: Decouple Fast Branching From Durable Checkpointing |
| 60 | + |
| 61 | +Why is checkpoint/restore so slow? It conflates two things that do not need to happen together. To use a branch, the |
| 62 | +runtime needs a *runnable copy* of the workspace right now. To support rollback later, it needs a *durable image* on |
| 63 | +disk. Stock CRIU does both synchronously: dump the whole workspace, then restore it, with the branch blocked on both. |
| 64 | + |
| 65 | +TClone's entire design follows from one principle: **online branch creation is separated from durable checkpointing**. A |
| 66 | +branch is made runnable purely by copy-on-write (CoW) sharing, without copying a single page on the critical path; |
| 67 | +writing the durable image happens asynchronously, off that path, so nothing waits for the disk. |
| 68 | + |
| 69 | + |
| 70 | +<p align="center"><em>The fork pipeline. The source is frozen just long enough to capture metadata and set up CoW |
| 71 | +sharing, then resumes; a background daemon writes the durable image while the new branch is rebuilt in fresh |
| 72 | +namespaces.</em></p> |
| 73 | + |
| 74 | +Concretely, the source is frozen only long enough to record process metadata (the tree topology, namespaces, open files, |
| 75 | +registers) and set up CoW memory sharing. No page contents move while it is stopped, so the freeze time is independent |
| 76 | +of how big the workspace is. The source then resumes immediately while serialization and branch reconstruction finish in |
| 77 | +the background. |
| 78 | + |
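The ordering of that pipeline can be sketched in user space. This is a toy model, not TClone's kernel or CRIU code: `Workspace`, `Branch`, `fork_branch`, and `image_store` are hypothetical names, a Python dict stands in for memory, and copy-on-write protection itself is not modelled. The point is only that the freeze window touches metadata and references, while the durable image is written by a background thread.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Workspace:
    metadata: dict          # process tree, namespaces, open files, registers
    pages: dict             # page number -> contents (stand-in for memory)
    frozen: bool = False


@dataclass
class Branch:
    name: str
    metadata: dict
    pages: dict             # shared with the source; CoW is not modelled here
    durable: threading.Event = field(default_factory=threading.Event)


def fork_branch(src: Workspace, name: str, image_store: dict) -> Branch:
    """Make a branch runnable first; persist its image in the background."""
    src.frozen = True                       # freeze window starts
    try:
        metadata = dict(src.metadata)       # small, independent of memory size
        shared = src.pages                  # take a reference, copy no pages
    finally:
        src.frozen = False                  # source resumes immediately

    branch = Branch(name, metadata, shared)

    def flush() -> None:
        # Durable image is serialized off the critical path.
        image_store[name] = {"metadata": metadata, "pages": dict(shared)}
        branch.durable.set()

    threading.Thread(target=flush, daemon=True).start()
    return branch
```

A caller can start using the returned branch right away and only wait on `branch.durable` when it actually needs the on-disk image, for example before a rollback.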
| 79 | +## Rebuilding the Tree and Sharing Its Memory |
| 80 | + |
| 81 | +A workspace is a *tree* of processes, and the topology is itself part of the state: which process is whose parent, who |
| 82 | +shares a thread group, who lives in which namespace. You cannot rebuild it with plain `fork()`. If you individually fork |
| 83 | +processes *A*, *B*, and *C*, the copies all become children of the originals instead of one another, so the tree comes |
| 84 | +out wrong. TClone instead replays the recorded topology inside a fresh container, recreating each task with its original |
| 85 | +namespace-local identity, so the application-visible process tree is preserved exactly while host-level process IDs |
| 86 | +differ across branches and never collide. |
| 87 | + |
| 88 | + |
| 89 | +<p align="center"><em>Naive per-process forking (left) makes each copy a child of its source, producing the wrong tree. |
| 90 | +TClone reconstructs the original topology (right) inside an isolated container and shares memory copy-on-write across |
| 91 | +the container boundary.</em></p> |
| 92 | + |
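The difference between the two panels can be reproduced with a small simulation. It does not create real processes; `Task`, `naive_fork`, and `replay_topology` are hypothetical names, and PIDs are plain integers. It is only meant to show why replaying the recorded parent links preserves the tree while per-process forking does not.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Task:
    ns_pid: int     # PID seen inside the container's PID namespace
    ns_ppid: int    # parent's namespace-local PID (0 marks the root)
    host_pid: int   # PID seen by the host


def naive_fork(tasks, fresh_pids):
    """Fork each task individually: every copy is a child of its original."""
    copies = []
    for t in tasks:
        pid = next(fresh_pids)
        copies.append(Task(ns_pid=pid, ns_ppid=t.ns_pid, host_pid=pid))
    return copies


def replay_topology(tasks, fresh_host_pids):
    """Recreate tasks parent-first, keeping namespace-local identities."""
    children = {}
    for t in tasks:
        children.setdefault(t.ns_ppid, []).append(t)
    rebuilt, queue = [], deque(children.get(0, []))
    while queue:
        t = queue.popleft()
        rebuilt.append(Task(t.ns_pid, t.ns_ppid, next(fresh_host_pids)))
        queue.extend(children.get(t.ns_pid, []))
    return rebuilt


def edges(tasks):
    """The application-visible tree as a set of (parent, child) pairs."""
    return {(t.ns_ppid, t.ns_pid) for t in tasks}


# A -> B -> C chain, as in the example above.
source = [Task(1, 0, 1000), Task(2, 1, 1001), Task(3, 2, 1002)]
naive = naive_fork(source, count(100))
replayed = replay_topology(source, count(2000))
```

Here `edges(replayed)` equals `edges(source)` while the host PIDs are disjoint, whereas `edges(naive)` hangs each copy under its original.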
| 93 | +Memory is where the real savings are. A workspace is heavy, but a speculative branch usually touches only a little of it |
| 94 | +before it is kept or thrown away, so TClone shares memory between branches and makes a private copy of a page only when |
| 95 | +a branch writes to it. This is exactly what `fork()` already does for a parent and child. The catch, visible on the |
| 96 | +right of the figure, is that a branch is *not* a child: it lives in a separately isolated container, so a plain `fork()` |
| 97 | +cannot link them. TClone reuses the kernel's own CoW machinery but extends it across the container boundary, so a branch |
| 98 | +starts out sharing all of the source's memory for free and diverges only page by page, and many branches at once cost no |
| 99 | +more per branch than one. State that cannot be shared safely, such as the fast-changing GUI framebuffer, is kept |
| 100 | +branch-local, which is also what stops a broken or malicious branch from reaching back into the user's workspace. |
| 101 | + |
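As a rough illustration of page-by-page divergence, the following sketch models memory as a dict of pages shared read-only by every branch, with a private overlay filled in on first write. The names `Branch` and `fork_branches` are hypothetical, writes by the source after the fork are not modelled, and nothing here reflects how the kernel change is actually implemented.

```python
from types import MappingProxyType


class Branch:
    """A branch that shares the source's pages and copies one only on write."""

    def __init__(self, shared_pages):
        self._shared = shared_pages     # read-only view, same for all branches
        self._private = {}              # pages this branch has diverged on

    def read(self, page_no):
        if page_no in self._private:
            return self._private[page_no]
        return self._shared[page_no]

    def write(self, page_no, data):
        # First write to a page gives this branch its own private copy.
        self._private[page_no] = data

    @property
    def private_page_count(self):
        return len(self._private)


def fork_branches(source_pages, n):
    """Create n branches over one shared view; no page contents are copied."""
    shared = MappingProxyType(source_pages)
    return [Branch(shared) for _ in range(n)]


pages = {0: b"code", 1: b"heap", 2: b"stack"}
branches = fork_branches(pages, 4)
branches[0].write(1, b"HEAP")
```

After the write, only `branches[0]` holds a private page; the other three branches and the source still see the original contents, and creating four branches copied nothing up front.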
| 102 | +## Sharing Files Without Copying Them |
| 103 | + |
| 104 | +Files get the same copy-on-write treatment, with one subtlety most people miss. Even on a filesystem like btrfs or ZFS |
| 105 | +that shares data blocks on disk between a snapshot and its origin, the *page cache* (the in-memory copy of file data) is |
| 106 | +tracked per file, not per block, so once a snapshot gets its own identity the kernel caches the very same bytes twice in |
| 107 | +RAM. Block-level CoW deduplicates on disk but not in memory. TClone closes that gap by making the page cache |
| 108 | +copy-on-write too: branches share sealed, read-only cached pages, and the first write to one copies it out privately |
| 109 | +into that branch alone. The result is a private, writable filesystem per branch that still shares all the read-mostly |
| 110 | +state (binaries, libraries, unchanged documents) for free. |
| 111 | + |
| 112 | + |
| 113 | +<p align="center"><em>Lazy page-cache CoW. Forking a branch records a sealed read-only layer shared by sibling branches; |
| 114 | +a write copies out a private page and leaves the layer intact; a further fork derives a new layer so siblings share |
| 115 | +post-write state while ancestors keep the earlier version.</em></p> |
| 116 | + |
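The three behaviors in the caption (share a sealed layer, copy out on write, derive a new layer on a further fork) can be mimicked with a chain of read-only layers. This is a simplified model under stated assumptions: `SealedLayer` and `BranchCache` are hypothetical names, pages are keyed by `(file, index)`, and eviction, writeback, and disk reads are left out.

```python
from types import MappingProxyType


class SealedLayer:
    """An immutable set of cached pages, optionally derived from a parent."""

    def __init__(self, pages, parent=None):
        self.pages = MappingProxyType(dict(pages))
        self.parent = parent

    def lookup(self, key):
        layer = self
        while layer is not None:
            if key in layer.pages:
                return layer.pages[key]
            layer = layer.parent
        return None  # not cached; a real kernel would go to disk here


class BranchCache:
    """Per-branch view: private pages on top of a shared sealed layer."""

    def __init__(self, layer):
        self.layer = layer
        self.private = {}

    def read(self, key):
        if key in self.private:
            return self.private[key]
        return self.layer.lookup(key)

    def write(self, key, data):
        # Copy out privately; the sealed layer underneath is untouched.
        self.private[key] = data

    def fork(self):
        # Seal any private pages into a derived layer so the new sibling
        # shares post-write state; older layers keep the earlier version.
        if self.private:
            self.layer = SealedLayer(self.private, parent=self.layer)
            self.private = {}
        return BranchCache(self.layer)


key = ("libfoo.so", 0)
base = SealedLayer({key: b"v1"})
a = BranchCache(base)
b = a.fork()          # a and b share the base layer
b.write(key, b"v2")   # b diverges privately
c = b.fork()          # b's write is sealed; b and c share it
```

In this model `a` and `base` still read `v1`, while `b` and `c` both read `v2` from the same derived layer.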
| 117 | +## Keeping Branches Contained |
| 118 | + |
| 119 | +A forked agent branch should not inherit the full authority of the user's workspace. Most agent tasks need only a narrow |
| 120 | +slice of the environment: a browser or editor task rarely needs SSH keys, cloud credentials, or unrelated project files. |
| 121 | +TClone attaches a security profile to each branch at fork time, derived offline by recording what an application touches |
| 122 | +while a *human* operates it and enforced with standard sandboxing (seccomp and an LSM such as SELinux). The agent never |
| 123 | +participates in shaping its own profile, so it cannot widen it. |
| 124 | + |
| 125 | + |
| 126 | +<p align="center"><em>Reachable attack surface across OSWorld applications. Per-application profiles cut the syscall and |
| 127 | +file surface far below bare metal; most of what remains is shared between the human and the agent, and the agent-only |
| 128 | +delta is small.</em></p> |
| 129 | + |
| 130 | +The effect is that protection moves out of prompts and agent behavior and into the substrate: a branch is forked with |
| 131 | +only the resources its task requires, so most secrets are simply unreachable in the first place. |
| 132 | + |
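The profile-derivation step can be pictured as building an allowlist from recorded human sessions. The sketch below is an assumption-heavy toy: the trace format, `Profile`, and `derive_profile` are hypothetical, paths are matched exactly, and real enforcement would be done by seccomp and an LSM rather than by Python checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Immutable allowlist attached to a branch at fork time."""

    syscalls: frozenset
    paths: frozenset

    def allows_syscall(self, name: str) -> bool:
        return name in self.syscalls

    def allows_path(self, path: str) -> bool:
        return path in self.paths


def derive_profile(human_traces) -> Profile:
    """Build a profile from (syscall, path_or_None) records of human use."""
    syscalls, paths = set(), set()
    for syscall, path in human_traces:
        syscalls.add(syscall)
        if path is not None:
            paths.add(path)
    return Profile(frozenset(syscalls), frozenset(paths))


# Hypothetical trace of a person editing one document.
traces = [
    ("openat", "/home/user/report.odt"),
    ("read", None),
    ("write", None),
]
profile = derive_profile(traces)
```

With this profile, `/home/user/report.odt` is reachable but `/home/user/.ssh/id_rsa` is not, and a syscall such as `ptrace` that never appeared in the human trace is denied. Because the dataclass is frozen and built only from human traces, code running inside the branch has no way to widen it.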
| 133 | +## How Well Does It Work? |
| 134 | + |
| 135 | +We evaluate TClone on two agent setups, AgentLoop on GTA and Agent S3 on OSWorld, with the workload, trajectories, and |
| 136 | +model held fixed so only the versioning substrate varies. We compare against stock CRIU and a KVM snapshot baseline. |
| 137 | + |
| 138 | +<div style="display:flex;flex-wrap:wrap;gap:2%;justify-content:center;align-items:flex-start;"> |
| 139 | +<div style="flex:1 1 340px;max-width:49%;"> |
| 140 | +<img src="/images/tclone/e2e_osworld.png" alt="OSWorld end-to-end latency CDF" style="width:100%;" /> |
| 141 | +<p align="center"><em>End-to-end task-latency CDF for Agent S3 on OSWorld. TClone shifts the whole distribution left of CRIU |
| 142 | +and KVM.</em></p> |
| 143 | +</div> |
| 144 | +<div style="flex:1 1 340px;max-width:49%;"> |
| 145 | +<img src="/images/tclone/osworld_category.png" alt="OSWorld latency by task category" style="width:100%;" /> |
| 146 | +<p align="center"><em>The same OSWorld latency split by task category. The gap is widest on multi-application |
| 147 | +workspaces.</em></p> |
| 148 | +</div> |
| 149 | +</div> |
| 150 | + |
| 151 | +End-to-end, TClone shifts the entire latency distribution left of CRIU and KVM, not just the median: up to **2.3x faster |
| 152 | +than CRIU and 3.7x faster than KVM** on the heavier OSWorld desktop tasks. The gains are largest on short tasks and |
| 153 | +multi-application workspaces, exactly where unchanged memory and read-mostly files are shared instead of re-materialized |
| 154 | +per branch. |
| 155 | + |
| 156 | +<div style="display:flex;flex-wrap:wrap;gap:2%;justify-content:center;align-items:flex-start;"> |
| 157 | +<div style="flex:1 1 340px;max-width:49%;"> |
| 158 | +<img src="/images/tclone/scalability_latency.png" alt="Clone latency vs. number of concurrent clones" style="width:100%;" /> |
| 159 | +<p align="center"><em>Clone latency as concurrent clones increase. CRIU exceeds 50s at 16 clones; TClone stays near |
| 160 | +10s.</em></p> |
| 161 | +</div> |
| 162 | +<div style="flex:1 1 340px;max-width:49%;"> |
| 163 | +<img src="/images/tclone/scalability_memory.png" alt="Memory footprint vs. number of concurrent clones" style="width:100%;" /> |
| 164 | +<p align="center"><em>Memory footprint vs. clone count. TClone stays around 9 GB where CRIU reaches 14 GB at 16 |
| 165 | +clones.</em></p> |
| 166 | +</div> |
| 167 | +</div> |
| 168 | + |
| 169 | +The setting that matters most is keeping many branches alive at once for beam search and rollback. At 16 concurrent |
| 170 | +clones, CRIU needs over 50s and 14 GB while TClone stays near 10s and 9 GB, and both gaps widen as the clone count grows. The reason is |
| 171 | +simply that TClone shares unchanged state instead of recreating the whole workspace each time, so its cost grows only |
| 172 | +with what a branch actually changes, not with how big the workspace is. That is what lets an agent run a much wider |
| 173 | +search before time or memory runs out. |
| 174 | + |
| 175 | +## Our Vision |
| 176 | + |
| 177 | +Agent loops are becoming search procedures over a real machine, and search means branching. We think the workspace |
| 178 | +substrate, not the agent framework, is where that branching should be made cheap. TClone's takeaway is that creating a |
| 179 | +runnable branch and writing a durable checkpoint are two different jobs that today's tools force into one blocking step: |
| 180 | +separate them, and a branch becomes just copy-on-write sharing plus a background flush, and the cost of speculation |
| 181 | +drops from a full-workspace checkpoint to just the working set you actually diverge on. |
| 182 | + |
| 183 | +<hr> |
| 184 | + |
| 185 | +*This work is currently under submission.* |