Commit 0e86e9f

Merge pull request #32 from WukLab/tclone_blog
add tclone blog post
2 parents 2db204e + 0739127 commit 0e86e9f

11 files changed

Lines changed: 185 additions & 0 deletions

content/posts/tclone.md

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
---
title: "TClone: Decoupling Fast Branch Creation from Durable Checkpointing for Computer-Use Agents"
date: 2026-05-17
draft: false
hideToc: false
tags: ["Sys for ML", "CUAs", "Checkpoint/Restore"]
summary: "
Computer-use agents increasingly want to *branch*:
try several actions in parallel, keep the best, and roll back the rest.
But a branch of a desktop is an entire running workspace,
and cloning it with today's tools means a synchronous checkpoint/restore on
the critical path of every speculative step. We built **TClone**,
a workspace-versioning substrate that makes a branch runnable immediately
by sharing memory and filesystem state copy-on-write and pushing durable
checkpointing off the fast path. TClone clones a live workspace
**up to 4.9x faster** than VM snapshots and **3.4x faster** than stock CRIU,
and cuts end-to-end agent task latency by **up to 3.7x**.
<br/><br/>
[Read More...](https://mlsys.wuklab.io/posts/tclone/)
"

---
Authors: Yutong Huang, Vikranth Srivatsa, Alex Asch, Hansin Tushar Patwa, and Yiying Zhang

**TLDR**: Computer-use agents increasingly want to *branch*:
explore several actions in parallel, keep the best trajectory,
and roll back the rest. But a branch of a desktop is an entire running workspace (processes, memory, files, GUI), and
cloning it with today's tools means a synchronous checkpoint-and-restore on the critical path of every speculative step.
We built **TClone**, a workspace-versioning substrate built on a modified Linux kernel and an extended CRIU. Its one
architectural principle is that *online branch creation is separated from durable checkpointing*: a branch becomes
runnable through copy-on-write sharing without copying a single page on the critical path, while serialization to disk
proceeds asynchronously and off that path. TClone clones a live workspace **up to 4.9x faster** than KVM snapshots and
**3.4x faster** than stock CRIU.

## Computer-Use Agents Want to Branch

Computer-use agents (CUAs) drive a real desktop the way a person would. The strongest agent loops do not commit to one
line of action: they run *best-of-N*, *beam search*, and rollback-heavy exploration, forking the current state, trying
several next actions in parallel, keeping the winner, and discarding the rest. They also need a *personal* workspace the
user keeps using, with speculative agent branches forked off it and either committed back or thrown away.

![overview](/images/tclone/overview.png)
<p align="center"><em>A versioned personal workspace: the live PC environment is branched into an agent search tree,
branches are snapshotted and rolled back, and the selected result is committed back to the human's workspace.</em></p>

The catch is that each fork is not a lightweight object. It is an entire live workspace: a process tree of a
hundred-plus processes, gigabytes of memory, an open filesystem, and a running GUI stack. The usual ways to clone a
running container, checkpoint/restore (CRIU) or a VM snapshot, serialize that whole workspace to storage and rebuild it,
and that cost lands directly on the agent's critical path at *every* branch point.

![motivation_latency](/images/tclone/motivation_latency.png)
<p align="center"><em>Per-branch cost for one OSWorld step. The clone segment (red) is 18.2s under KVM and 12.8s under
CRIU, but only 3.7s under TClone; everything else is model-bound and identical.</em></p>

Everything except that clone segment is model-bound and identical across systems. Stock CRIU spends 12.8s and a KVM
snapshot 18.2s cloning a single branch; TClone spends 3.7s. Because the overhead recurs on every fork and rollback, it
compounds: the more an agent branches, the more it pays.

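
To make the compounding concrete, here is a back-of-the-envelope sketch that multiplies the per-branch clone times quoted above by a number of branch points. The branch count of 20 is a hypothetical example, not a figure from the evaluation, and the model assumes every branch point pays the same clone cost serially, which is a simplification.

```python
# Per-branch clone cost in seconds, as quoted in the text above.
CLONE_COST_S = {"KVM": 18.2, "CRIU": 12.8, "TClone": 3.7}


def total_clone_overhead(cost_per_branch_s: float, branch_points: int) -> float:
    """Clone overhead if each branch point pays the same cost serially (assumption)."""
    return cost_per_branch_s * branch_points


if __name__ == "__main__":
    branch_points = 20  # hypothetical; not a number from the evaluation
    for system, cost in CLONE_COST_S.items():
        total = total_clone_overhead(cost, branch_points)
        print(f"{system}: {total:.0f}s of clone overhead over {branch_points} branch points")
```

Under these assumptions the clone overhead alone runs to minutes for the baselines, while the model-bound work stays the same in all three cases.
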
## The Big Idea: Decouple Fast Branching From Durable Checkpointing

Why is checkpoint/restore so slow? It conflates two things that do not need to happen together. To use a branch, the
runtime needs a *runnable copy* of the workspace right now. To support rollback later, it needs a *durable image* on
disk. Stock CRIU does both synchronously: dump the whole workspace, then restore it, with the branch blocked on both.

TClone's entire design follows from one principle: **online branch creation is separated from durable checkpointing**. A
branch is made runnable purely by copy-on-write (CoW) sharing, without copying a single page on the critical path;
writing the durable image happens asynchronously, off that path, so nothing waits for the disk.

![clone_procedure](/images/tclone/clone_procedure.png)
<p align="center"><em>The fork pipeline. The source is frozen just long enough to capture metadata and set up CoW
sharing, then resumes; a background daemon writes the durable image while the new branch is rebuilt in fresh
namespaces.</em></p>

Concretely, the source is frozen only long enough to record process metadata (the tree topology, namespaces, open files,
registers) and set up CoW memory sharing. No page contents move while it is stopped, so the freeze time is independent
of how big the workspace is. The source then resumes immediately while serialization and branch reconstruction finish in
the background.

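
The ordering described above can be sketched as a user-space toy in Python. This is not TClone's implementation, which lives in a modified kernel and an extended CRIU; `Workspace`, `fork_branch`, and the pickle-based image are hypothetical stand-ins. In this toy the frozen section copies a dict of references, which still scales with the page count; the only point it illustrates is that no page contents move and no disk I/O happens while the source is stopped.

```python
import pickle
import tempfile
import threading
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Workspace:
    """Toy stand-in for a live workspace (hypothetical, not TClone's data model)."""
    metadata: dict   # process tree, namespaces, open files, registers
    pages: dict      # page number -> immutable bytes
    lock: threading.Lock = field(default_factory=threading.Lock)

    def write(self, page_no: int, data: bytes) -> None:
        # Rebinding the slot leaves any other holder of the old bytes untouched,
        # which is the copy-on-write effect in miniature.
        with self.lock:
            self.pages[page_no] = data


def fork_branch(source: Workspace, image_path: Path):
    """Return (branch, flusher). The branch is usable before the image is on disk."""
    # Critical path: "freeze" the source, record metadata, share page references.
    with source.lock:
        meta = dict(source.metadata)
        snapshot = dict(source.pages)  # copies references only, never page bytes
    # The source has resumed here. Everything below is off its critical path.

    def flush() -> None:
        with open(image_path, "wb") as f:
            pickle.dump({"metadata": meta, "pages": snapshot}, f)

    flusher = threading.Thread(target=flush)
    flusher.start()
    branch = Workspace(metadata=dict(meta), pages=dict(snapshot))
    return branch, flusher


if __name__ == "__main__":
    src = Workspace({"init_pid": 1}, {0: b"A" * 4096, 1: b"B" * 4096})
    with tempfile.TemporaryDirectory() as d:
        branch, flusher = fork_branch(src, Path(d) / "image.bin")
        branch.write(0, b"C" * 4096)            # the branch diverges on one page
        assert src.pages[0] == b"A" * 4096      # the source is unaffected
        assert branch.pages[1] is src.pages[1]  # the untouched page is still shared
        flusher.join()                          # wait only when durability is needed
```

The caller joins the flusher thread only when it actually needs the durable image, for example before relying on it for rollback.
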
## Rebuilding the Tree and Sharing Its Memory

A workspace is a *tree* of processes, and the topology is itself part of the state: which process is whose parent, who
shares a thread group, who lives in which namespace. You cannot rebuild it with plain `fork()`. If you fork processes
*A*, *B*, and *C* individually, each copy becomes a child of its own original rather than of the matching copy, so the
tree comes out wrong. TClone instead replays the recorded topology inside a fresh container, recreating each task with
its original namespace-local identity, so the application-visible process tree is preserved exactly while host-level
process IDs differ across branches and never collide.

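
The difference between the two approaches can be shown with a small model of the process tree. This sketch only manipulates dictionaries; it does not create real processes, and `naive_fork_each`, `replay_topology`, and the host PID counter are hypothetical names introduced for illustration.

```python
from itertools import count

# Namespace-local view of a workspace: pid -> parent pid (None marks the root).
SOURCE_TREE = {1: None, 2: 1, 3: 2}  # A(1) -> B(2) -> C(3)


def naive_fork_each(tree: dict) -> dict:
    """Model fork() applied per process: every copy's parent is its own original."""
    return {f"copy-of-{pid}": pid for pid in tree}


def replay_topology(tree: dict, host_pids) -> dict:
    """Recreate tasks parent-first; return local pid -> (local parent, host pid)."""
    rebuilt = {}
    pending = dict(tree)
    while pending:
        ready = [pid for pid, parent in pending.items()
                 if parent is None or parent in rebuilt]
        if not ready:
            raise ValueError("tree has a cycle or a missing parent")
        for pid in ready:
            rebuilt[pid] = (pending.pop(pid), next(host_pids))
    return rebuilt


if __name__ == "__main__":
    host_pids = count(5000)  # hypothetical host-level pid allocator
    branch_a = replay_topology(SOURCE_TREE, host_pids)
    branch_b = replay_topology(SOURCE_TREE, host_pids)
    print(naive_fork_each(SOURCE_TREE))  # copies hang off the originals
    print(branch_a)                      # same local tree, fresh host pids
    print(branch_b)                      # same local tree again, different host pids
```

Both replayed branches see the same namespace-local PIDs and parent links as the source, while their host-level PIDs are drawn from one allocator and so never overlap.
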
![process_fork](/images/tclone/process_fork.png)
<p align="center"><em>Naive per-process forking (left) makes each copy a child of its source, producing the wrong tree.
TClone reconstructs the original topology (right) inside an isolated container and shares memory copy-on-write across
the container boundary.</em></p>

Memory is where the real savings are. A workspace is heavy, but a speculative branch usually touches only a little of it
before it is kept or thrown away, so TClone shares memory between branches and makes a private copy of a page only when
a branch writes to it. This is exactly what `fork()` already does for a parent and child. The catch, visible on the
right of the figure, is that a branch is *not* a child: it lives in a separately isolated container, so a plain `fork()`
cannot link them. TClone reuses the kernel's own CoW machinery but extends it across the container boundary, so a branch
starts out sharing all of the source's memory for free and diverges only page by page, and creating many branches at
once costs no more per branch than creating one. State that cannot be shared safely, such as the fast-changing GUI
framebuffer, is kept branch-local, which is also what stops a broken or malicious branch from reaching back into the
user's workspace.

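
As a rough illustration of why per-branch cost tracks writes rather than workspace size, the toy below counts how many pages are actually copied when several branches share one base. The `Branch` class and `fork_many` helper are hypothetical; the real mechanism is the kernel's page-level CoW extended across containers, which this sketch does not attempt to reproduce.

```python
from typing import Dict, List


class Branch:
    """Toy address space (hypothetical): shared base pages plus private copies."""

    def __init__(self, base: Dict[int, bytes]):
        self._base = base                     # shared by every branch, never mutated here
        self._private: Dict[int, bytes] = {}  # pages this branch has written

    def read(self, page_no: int) -> bytes:
        if page_no in self._private:
            return self._private[page_no]
        return self._base[page_no]

    def write(self, page_no: int, data: bytes) -> None:
        # A write lands in the branch's private map; the shared base is untouched.
        self._private[page_no] = data

    @property
    def copied_pages(self) -> int:
        return len(self._private)


def fork_many(base: Dict[int, bytes], n: int) -> List[Branch]:
    """Creating a branch copies no page data, so n branches cost n small objects."""
    return [Branch(base) for _ in range(n)]


if __name__ == "__main__":
    base = {i: bytes(4096) for i in range(1024)}  # a 4 MiB toy workspace
    branches = fork_many(base, 16)
    branches[0].write(7, b"\x01" * 4096)
    # Total pages privately held across all 16 branches after one write.
    print(sum(b.copied_pages for b in branches))
```

Only the branch that writes pays for a page, and it pays for exactly the pages it writes; the other branches keep reading the shared base.
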
## Sharing Files Without Copying Them

Files get the same copy-on-write treatment, with one subtlety most people miss. Even on a filesystem like btrfs or ZFS
that shares data blocks on disk between a snapshot and its origin, the *page cache* (the in-memory copy of file data) is
tracked per file, not per block, so once a snapshot gets its own identity the kernel caches the very same bytes twice in
RAM. Block-level CoW deduplicates on disk but not in memory. TClone closes that gap by making the page cache
copy-on-write too: branches share sealed, read-only cached pages, and the first write to one copies it out privately
into that branch alone. The result is a private, writable filesystem per branch that still shares all the read-mostly
state (binaries, libraries, unchanged documents) for free.

![file_cache](/images/tclone/file_cache.png)
<p align="center"><em>Lazy page-cache CoW. Forking a branch records a sealed read-only layer shared by sibling branches;
a write copies out a private page and leaves the layer intact; a further fork derives a new layer so siblings share
post-write state while ancestors keep the earlier version.</em></p>

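
The layering in the caption can be modeled in a few lines. This is a minimal sketch of the idea for a single file's cached pages, with hypothetical names (`SealedLayer`, `BranchCache`); it says nothing about how the kernel change is actually structured, and it makes no attempt to collapse empty layers.

```python
from typing import Dict, Optional


class SealedLayer:
    """Immutable set of cached pages, chained to the layer it was derived from."""

    def __init__(self, pages: Dict[int, bytes], parent: Optional["SealedLayer"] = None):
        self._pages = dict(pages)
        self._parent = parent

    def lookup(self, index: int) -> Optional[bytes]:
        layer = self
        while layer is not None:
            if index in layer._pages:
                return layer._pages[index]
            layer = layer._parent
        return None


class BranchCache:
    """Per-branch view of one file's cache: private dirty pages over a sealed layer."""

    def __init__(self, layer: SealedLayer):
        self._layer = layer
        self._dirty: Dict[int, bytes] = {}

    def read(self, index: int) -> Optional[bytes]:
        if index in self._dirty:
            return self._dirty[index]
        return self._layer.lookup(index)

    def write(self, index: int, data: bytes) -> None:
        self._dirty[index] = data  # copy-out: the sealed layer is untouched

    def fork(self) -> "BranchCache":
        # Seal this branch's dirty pages into a new layer shared by both sides.
        sealed = SealedLayer(self._dirty, parent=self._layer)
        self._layer, self._dirty = sealed, {}
        return BranchCache(sealed)


if __name__ == "__main__":
    root = BranchCache(SealedLayer({0: b"v1"}))
    a, b = root.fork(), root.fork()
    a.write(0, b"v2")
    child = a.fork()
    print(b.read(0), child.read(0), a.read(0))
```

Here `b` still sees the pre-write page, while `a` and the branch forked from it after the write both see the new one, which mirrors the sibling and ancestor behavior described in the caption.
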
## Keeping Branches Contained

A forked agent branch should not inherit the full authority of the user's workspace. Most agent tasks need only a narrow
slice of the environment: a browser or editor task rarely needs SSH keys, cloud credentials, or unrelated project files.
TClone attaches a security profile to each branch at fork time, derived offline by recording what an application touches
while a *human* operates it and enforced with standard sandboxing (seccomp and an LSM such as SELinux). The agent never
participates in shaping its own profile, so it cannot widen it.

![attacksurface](/images/tclone/attacksurface.png)
<p align="center"><em>Reachable attack surface across OSWorld applications. Per-application profiles cut the syscall and
file surface far below bare metal; most of what remains is shared between the human and the agent, and the agent-only
delta is small.</em></p>

The effect is that protection moves out of prompts and agent behavior and into the substrate: a branch is forked with
only the resources its task requires, so most secrets are simply unreachable in the first place.

133+
## How Well Does It Work?
134+
135+
We evaluate TClone on two agent setups, AgentLoop on GTA and Agent S3 on OSWorld, with the workload, trajectories, and
136+
model held fixed so only the versioning substrate varies. We compare against stock CRIU and a KVM snapshot baseline.
137+
138+
<div style="display:flex;flex-wrap:wrap;gap:2%;justify-content:center;align-items:flex-start;">
139+
<div style="flex:1 1 340px;max-width:49%;">
140+
<img src="/images/tclone/e2e_osworld.png" alt="OSWorld end-to-end latency CDF" style="width:100%;" />
141+
<p align="center"><em>End-to-end task-latency CDF on Agent-S3/OSWorld. TClone shifts the whole distribution left of CRIU
142+
and KVM.</em></p>
143+
</div>
144+
<div style="flex:1 1 340px;max-width:49%;">
145+
<img src="/images/tclone/osworld_category.png" alt="OSWorld latency by task category" style="width:100%;" />
146+
<p align="center"><em>The same OSWorld latency split by task category. The gap is widest on multi-application
147+
workspaces.</em></p>
148+
</div>
149+
</div>
150+
151+
End-to-end, TClone shifts the entire latency distribution left of CRIU and KVM, not just the median: up to **2.3x faster
152+
than CRIU and 3.7x faster than KVM** on the heavier OSWorld desktop tasks. The gains are largest on short tasks and
153+
multi-application workspaces, exactly where unchanged memory and read-mostly files are shared instead of re-materialized
154+
per branch.
155+
156+
<div style="display:flex;flex-wrap:wrap;gap:2%;justify-content:center;align-items:flex-start;">
157+
<div style="flex:1 1 340px;max-width:49%;">
158+
<img src="/images/tclone/scalability_latency.png" alt="Clone latency vs. number of concurrent clones" style="width:100%;" />
159+
<p align="center"><em>Clone latency as concurrent clones increase. CRIU exceeds 50s at 16 clones; TClone stays near
160+
10s.</em></p>
161+
</div>
162+
<div style="flex:1 1 340px;max-width:49%;">
163+
<img src="/images/tclone/scalability_memory.png" alt="Memory footprint vs. number of concurrent clones" style="width:100%;" />
164+
<p align="center"><em>Memory footprint vs. clone count. TClone stays around 9 GB where CRIU reaches 14 GB at 16
165+
clones.</em></p>
166+
</div>
167+
</div>
168+
169+
The setting that matters most is keeping many branches alive at once for beam search and rollback. At 16 concurrent
170+
clones CRIU needs over 50s and 14 GB while TClone stays near 10s and 9 GB, and both gaps keep widening. The reason is
171+
simply that TClone shares unchanged state instead of recreating the whole workspace each time, so its cost grows only
172+
with what a branch actually changes, not with how big the workspace is. That is what lets an agent run a much wider
173+
search before time or memory runs out.
174+
## Our Vision

Agent loops are becoming search procedures over a real machine, and search means branching. We think the workspace
substrate, not the agent framework, is where that branching should be made cheap. TClone's takeaway is that creating a
runnable branch and writing a durable checkpoint are two different jobs that today's tools force into one blocking step:
separate them, and a branch becomes just copy-on-write sharing plus a background flush, and the cost of speculation
drops from a full-workspace checkpoint to just the working set you actually diverge on.

<hr>

*This work is currently under submission.*