Commit fcc6ca8 — chore: RivetKit perf fixes (#4523)

Authored by NathanFlurry, NicholasKissel, and claude.

* feat: US-001 - Scaffold module structure and package exports
* chore: temporarily remove agent-os local link deps to unblock website builds
* fix: gracefully handle missing agent-os examples in website build
* feat(website): add AgentOS use cases page — new page at /agent-os/use-cases showcasing who AgentOS is for: background agents, production agents, efficient evals, browser-based LLM interfaces, programming agents, personal agents, multi-agent systems, customer support agents, DevOps agents, data processing agents, content creation agents, and workflow automation
* fix(website): restore refactoring changes lost during rebase

Co-authored-by: Nicholas Kissel <Nicholas.kissel@icloud.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

1 parent 8330cbd

187 files changed: +25244 −1895 lines

.agent/notes/workflow-replay-review.md — 374 additions, 0 deletions
.agent/research/isolation-model-comparison.md — 521 additions, 0 deletions
Research file below — 185 additions, 0 deletions
# JuiceFS Metadata Schema and Versioning Research

## Overview

JuiceFS does NOT support file versioning or filesystem-level snapshots. It offers clone (a metadata-only copy-on-write fork) and a trash/recycle bin for deleted files. This document covers the exact internal metadata schema and how it compares to our planned VFS design.

## Metadata Schema

JuiceFS supports three metadata backends (Redis, SQL via xorm, TKV) with the same logical schema. Below is the SQL representation.

### Core Tables

**`node`** (inode table, 19 fields):

```
Inode        uint64  PK
Type         uint8   -- 1=file, 2=dir, 3=symlink, 4=FIFO, 5=blockdev, 6=chardev, 7=socket
Flags        uint8   -- FlagImmutable, FlagAppend, FlagSkipTrash
Mode         uint16  -- Unix permission bits
Uid          uint32
Gid          uint32
Atime        int64   -- microseconds
Mtime        int64   -- microseconds
Ctime        int64   -- microseconds
Atimensec    int16   -- sub-microsecond nanosecond remainder
Mtimensec    int16
Ctimensec    int16
Nlink        uint32
Length       uint64  -- file size in bytes
Rdev         uint32  -- device number (for device nodes)
Parent       Ino     -- parent inode (0 for hardlinked files)
AccessACLId  uint32  -- FK to acl table
DefaultACLId uint32  -- FK to acl table (directories only)
Tier         uint8   -- storage tier ID
```

**`edge`** (directory entries):

```
Id     int64  PK (bigserial)
Parent Ino    UNIQUE(edge)
Name   []byte UNIQUE(edge), varbinary(255)
Inode  Ino    INDEX
Type   uint8
```

**`chunk`** (file chunk-to-slices mapping):

```
Id     int64  PK (bigserial)
Inode  Ino    UNIQUE(chunk)
Indx   uint32 UNIQUE(chunk) -- chunk index (file offset / 64MB)
Slices []byte blob          -- packed array of 24-byte slice records
```

**`sliceRef`** (table name: `chunk_ref`, reference counting):

```
Id   uint64 PK (chunkid / slice id)
Size uint32
Refs int    INDEX
```

**`symlink`**:

```
Inode  Ino    PK
Target []byte varbinary(4096)
```

### The 24-Byte Slice Record

Each slice within a chunk's `Slices` blob is packed as:

```
pos  uint32 -- offset within the chunk (0 to 64MB)
id   uint64 -- globally unique slice ID
size uint32 -- total size of the object in object storage
off  uint32 -- offset within that object where this slice's data starts
len  uint32 -- length of data this slice covers
```
Slices are appended in write order. Newer slices override older ones at the same byte positions.
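The 24-byte layout (4 + 8 + 4 + 4 + 4 bytes) can be sketched with Python's `struct` module. The byte order here is an assumption for illustration only; the real on-disk encoding is internal to JuiceFS.

```python
import struct

# pos u32, id u64, size u32, off u32, len u32 = 24 bytes.
# Big-endian is assumed here purely for illustration.
SLICE_FMT = ">IQIII"
assert struct.calcsize(SLICE_FMT) == 24

def pack_slice(pos, slice_id, size, off, length):
    """Pack one slice record into its 24-byte form."""
    return struct.pack(SLICE_FMT, pos, slice_id, size, off, length)

def unpack_slices(blob):
    """Split a chunk's Slices blob into 24-byte records, in write order."""
    return [struct.unpack_from(SLICE_FMT, blob, i)
            for i in range(0, len(blob), 24)]
```

Appending a new write is then just concatenating another 24-byte record onto the blob.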
### Supporting Tables

- **`xattr`**: Extended attributes (inode, name, value)
- **`acl`**: POSIX ACL rules (owner, group, mask, other, named users/groups)
- **`flock`**: BSD-style file locks
- **`plock`**: POSIX range locks
- **`session2`**: Client sessions (sid, expire, info JSON)
- **`sustained`**: Open file handles preventing deletion (sid, inode)
- **`delfile`**: Files pending deletion (unlinked but still open)
- **`delslices`**: Delayed slice deletion queue (for trash)
- **`dirStats`**: Per-directory usage statistics
- **`dirQuota`**: Per-directory quotas
- **`setting`**: Key-value config (volume format JSON)
- **`counter`**: Named counters (nextInode, nextChunk, usedSpace, totalInodes)

### Redis Key Schema

```
i{inode}        -> binary Attr
d{inode}        -> hash { name -> packed(inode, type) }
p{inode}        -> hash { parent_ino -> count }
c{inode}_{indx} -> list of 24-byte packed Slice records
s{inode}        -> target string
x{inode}        -> hash { name -> value }
lockf{inode}    -> hash { {sid}_{owner} -> ltype }
lockp{inode}    -> hash { {sid}_{owner} -> packed Plock }
sessions        -> sorted set { sid -> heartbeat }
session{sid}    -> set [ inode ]
delfiles        -> sorted set { {inode}:{length} -> seconds }
sliceRef        -> hash { k{sliceId}_{size} -> refcount }
```
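For illustration, the per-inode key naming above can be expressed as a few hypothetical helpers (value encodings are simplified; JuiceFS stores packed binary structs behind these keys):

```python
# Hypothetical helpers mirroring the Redis key scheme; not JuiceFS APIs.
def inode_key(ino):        return f"i{ino}"          # -> binary Attr
def dentry_key(ino):       return f"d{ino}"          # hash: name -> (inode, type)
def chunk_key(ino, indx):  return f"c{ino}_{indx}"   # list of 24-byte slice records
def symlink_key(ino):      return f"s{ino}"          # -> target string
def xattr_key(ino):        return f"x{ino}"          # hash: name -> value
```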
## Slice Lifecycle

### Write Path

1. Allocate a new slice ID from the `nextChunk` counter.
2. Write data to object storage keyed by slice ID.
3. Append a 24-byte slice record to the chunk's `Slices` blob.
4. Create a `sliceRef` entry with `refs=1`.
5. Update inode `Length`, `Mtime`, `Ctime`.
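The five steps above can be sketched against an in-memory stand-in for the metadata store and object storage. All names here (`next_chunk`, `object_store`, `meta`, `write`) are illustrative, not JuiceFS APIs:

```python
import itertools, time

CHUNK = 64 << 20                          # 64MB chunk size
next_chunk = itertools.count(1)           # stand-in for the "nextChunk" counter
object_store = {}                         # slice_id -> bytes (stand-in for S3)
meta = {"chunks": {}, "slice_refs": {}, "nodes": {}}

def write(ino, indx, pos, data):
    sid = next(next_chunk)                        # 1. allocate slice ID
    object_store[sid] = data                      # 2. write data object
    rec = (pos, sid, len(data), 0, len(data))     # 3. append slice record
    meta["chunks"].setdefault((ino, indx), []).append(rec)
    meta["slice_refs"][sid] = 1                   # 4. refcount starts at 1
    node = meta["nodes"].setdefault(ino, {"length": 0})
    node["length"] = max(node["length"], indx * CHUNK + pos + len(data))
    node["mtime"] = node["ctime"] = time.time_ns() // 1000   # 5. microseconds
```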
### Read Path (Resolving Overlaps)

`buildSlice()` uses an interval-tree approach:

1. Process slices in write order (oldest first).
2. Each new slice cuts/splits any existing slices that overlap it.
3. Later writes always win at any byte position.
4. A final in-order traversal yields a non-overlapping resolved slice list.
5. Gaps (regions with `id == 0`) are zeros/holes.
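The "later writes win" rule can be demonstrated with a naive byte-map sketch. The real `buildSlice()` cuts intervals and is far more efficient; this O(bytes) version only illustrates that applying slices oldest-first produces the same resolved runs:

```python
def resolve(slices, length):
    """Naive overlap resolution: later slices overwrite earlier ones.

    slices: (pos, id, size, off, len) tuples in write order.
    Returns (start, run_length, slice_id) runs; id 0 marks a hole.
    """
    owner = [0] * length                        # owning slice id per byte
    for pos, sid, size, off, ln in slices:      # oldest first
        for b in range(pos, min(pos + ln, length)):
            owner[b] = sid                      # newer slice overrides older
    runs, start = [], 0                         # collapse into contiguous runs
    for i in range(1, length + 1):
        if i == length or owner[i] != owner[start]:
            runs.append((start, i - start, owner[start]))
            start = i
    return runs
```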
### Compaction

Triggered when a chunk accumulates many slices (every 100th slice, forced at 350+, and also on read if 5+ slices).

1. Read all slices for the chunk.
2. Skip leading large contiguous slices (no need to rewrite them).
3. Build the resolved slice view; trim leading/trailing zeros.
4. Read the resolved data and write it as a single new object.
5. Atomic compare-and-swap: replace the compacted slices with one new slice record.
6. Decrement refs on the old slices (or queue them to `delslices` if trash is enabled).

Constants: `maxCompactSlices = 1000`, `maxSlices = 2500`, `ChunkSize = 64MB`.
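One simplified reading of those trigger thresholds, as a hypothetical predicate (the actual JuiceFS logic is more involved than this):

```python
def should_compact(n_slices, on_read=False):
    """Sketch of the compaction triggers described above (assumed reading)."""
    if n_slices >= 350:                # forced once the chunk piles up
        return True
    if on_read and n_slices >= 5:      # opportunistic compaction on read
        return True
    return n_slices > 0 and n_slices % 100 == 0   # every 100th slice
```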
## What JuiceFS Has Instead of Versioning

### Trash / Recycle Bin

- Controlled by the `TrashDays` setting.
- Deleted files are moved to `.trash/` (reserved inode `0x7FFFFFFF10000000`).
- Sub-directories per hour: `.trash/2024-01-15-14/`.
- Entries are named `{parent_ino}-{file_ino}-{original_name}`.
- A background job cleans entries older than `TrashDays`.
- Files with `FlagSkipTrash` bypass trash.
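The naming scheme above, sketched as a hypothetical path helper (`trash_entry` is illustrative, not a JuiceFS function):

```python
from datetime import datetime, timezone

TRASH_INODE = 0x7FFFFFFF10000000  # reserved inode for .trash

def trash_entry(parent_ino, file_ino, name, when=None):
    """Build the trash path for a deleted file: hourly dir + composite name."""
    when = when or datetime.now(timezone.utc)
    hour_dir = when.strftime("%Y-%m-%d-%H")
    return f".trash/{hour_dir}/{parent_ino}-{file_ino}-{name}"
```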
### Clone (metadata-only COW fork)

`juicefs clone SRC DST`:

- Creates new inodes for all entries in the source tree.
- Copies chunk slice arrays verbatim to the new inodes.
- Increments `sliceRef.Refs` for every referenced slice.
- Redirect-on-write: subsequent writes to either copy create new slices; unmodified regions share data blocks.
- Fast regardless of data size (a metadata-only operation).
- NOT a reversible snapshot; it is a one-time fork.

## Comparison: JuiceFS vs Our Planned Design

| Aspect | JuiceFS | Our Design |
|--------|---------|------------|
| Versioning | None. Trash + clone only. | Native per-file versioning via `inode_versions` table. |
| Snapshots | No filesystem snapshots. Clone is a one-time fork. | Point-in-time snapshots by recording `{ino -> version}` mappings. Instant, metadata-only. |
| Slice model | Packed 24-byte records in a blob column. Overlap resolution via interval tree. | Similar concept, but our "slices" are only needed for chunked-mode large files. Small files use inline SQLite or single S3 objects. |
| Metadata engines | Redis, PostgreSQL, MySQL, SQLite, TiKV, etcd | SQLite primary. Interface allows Redis, Postgres, etc. |
| Block store | Any S3-compatible object storage | Same, plus inline SQLite for tiny files. |
| File size tiers | All files use the chunk/slice/block model | Three tiers: inline SQLite (<64KB), single S3 object (64KB-8MB), chunked (>8MB) |
| Small-file optimization | None. Even 1-byte files get a slice + S3 object. | Inline in SQLite. Zero S3 round-trips for tiny files. |
| Reference counting | `sliceRef` table tracks refs per slice. Clone increments refs; compaction decrements. | Version-based. Old versions kept until GC. Simpler model since we don't need clone/COW. |
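The three-tier routing in the "Our Design" column can be sketched as follows, using the thresholds stated in the table (function and tier names are illustrative):

```python
def tier_for(size):
    """Route a file to a storage tier by size: <64KB inline, <=8MB single
    S3 object, larger files chunked (per the planned design above)."""
    if size < 64 * 1024:
        return "inline-sqlite"
    if size <= 8 * 1024 * 1024:
        return "single-s3-object"
    return "chunked"
```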
### Key Differences

1. **JuiceFS is designed for shared multi-client POSIX workloads.** It needs session tracking, distributed locks, sustained inodes, and compaction under concurrent access. We are single-client (one VM per filesystem instance), so we can skip all of that complexity.

2. **JuiceFS uses the slice model for ALL files.** Every byte written creates a slice record and an S3 object, even for a 10-byte config file. Our tiered approach avoids S3 round-trips for small files entirely.

3. **JuiceFS has no versioning because it wasn't designed for it.** The slice model technically contains historical data (old slices exist until compaction), but there's no way to query "what did this file look like 5 minutes ago?" Our `inode_versions` table makes this a first-class operation.

4. **Our versioning is cheaper than JuiceFS clone.** Clone duplicates the entire metadata tree; our versioning just increments a version number and keeps the old S3 key/inline content around. Rolling back = updating `current_version` on the inode.
