# JuiceFS Metadata Schema and Versioning Research

## Overview

JuiceFS does NOT support file versioning or filesystem-level snapshots. It offers clone (a metadata-only copy-on-write fork) and a trash/recycle bin for deleted files. This document covers the internal metadata schema in detail and compares it with our planned VFS design.

## Metadata Schema

JuiceFS supports three metadata backends (Redis, SQL via xorm, TKV) with the same logical schema. Below is the SQL representation.

### Core Tables

**`node`** (inode table, 19 fields):
```
Inode        uint64  PK
Type         uint8   -- 1=file, 2=dir, 3=symlink, 4=FIFO, 5=blockdev, 6=chardev, 7=socket
Flags        uint8   -- FlagImmutable, FlagAppend, FlagSkipTrash
Mode         uint16  -- Unix permission bits
Uid          uint32
Gid          uint32
Atime        int64   -- microseconds
Mtime        int64   -- microseconds
Ctime        int64   -- microseconds
Atimensec    int16   -- sub-microsecond nanosecond remainder
Mtimensec    int16
Ctimensec    int16
Nlink        uint32
Length       uint64  -- file size in bytes
Rdev         uint32  -- device number (for device nodes)
Parent       Ino     -- parent inode (0 for hardlinked files)
AccessACLId  uint32  -- FK to acl table
DefaultACLId uint32  -- FK to acl table (directories only)
Tier         uint8   -- storage tier ID
```

**`edge`** (directory entries):
```
Id     int64   PK (bigserial)
Parent Ino     UNIQUE(edge)
Name   []byte  UNIQUE(edge), varbinary(255)
Inode  Ino     INDEX
Type   uint8
```

**`chunk`** (file chunk-to-slices mapping):
```
Id     int64   PK (bigserial)
Inode  Ino     UNIQUE(chunk)
Indx   uint32  UNIQUE(chunk) -- chunk index (file offset / 64MB)
Slices []byte  blob -- packed array of 24-byte slice records
```

**`sliceRef`** (table name: `chunk_ref`, reference counting):
```
Id   uint64  PK (chunkid / slice id)
Size uint32
Refs int     INDEX
```

**`symlink`**:
```
Inode  Ino     PK
Target []byte  varbinary(4096)
```

### The 24-Byte Slice Record

Each slice within a chunk's `Slices` blob is packed as:
```
pos  uint32  -- offset within the chunk (0 to 64MB)
id   uint64  -- globally unique slice ID
size uint32  -- total size of the object in object storage
off  uint32  -- offset within that object where this slice's data starts
len  uint32  -- length of data this slice covers
```

Slices are appended in write order. Newer slices override older ones at the same byte positions.

### Supporting Tables

- **`xattr`**: Extended attributes (inode, name, value)
- **`acl`**: POSIX ACL rules (owner, group, mask, other, named users/groups)
- **`flock`**: BSD-style file locks
- **`plock`**: POSIX range locks
- **`session2`**: Client sessions (sid, expire, info JSON)
- **`sustained`**: Open file handles preventing deletion (sid, inode)
- **`delfile`**: Files pending deletion (unlinked but still open)
- **`delslices`**: Delayed slice deletion queue (for trash)
- **`dirStats`**: Per-directory usage statistics
- **`dirQuota`**: Per-directory quotas
- **`setting`**: Key-value config (volume format JSON)
- **`counter`**: Named counters (nextInode, nextChunk, usedSpace, totalInodes)

### Redis Key Schema

```
i{inode}        -> binary Attr
d{inode}        -> hash { name -> packed(inode, type) }
p{inode}        -> hash { parent_ino -> count }
c{inode}_{indx} -> list of 24-byte packed Slice records
s{inode}        -> target string
x{inode}        -> hash { name -> value }
lockf{inode}    -> hash { {sid}_{owner} -> ltype }
lockp{inode}    -> hash { {sid}_{owner} -> packed Plock }
sessions        -> sorted set { sid -> heartbeat }
session{sid}    -> set [ inode ]
delfiles        -> sorted set { {inode}:{length} -> seconds }
sliceRef        -> hash { k{sliceId}_{size} -> refcount }
```

## Slice Lifecycle

### Write Path

1. Allocate a new slice ID from the `nextChunk` counter.
2. Write the data to object storage, keyed by slice ID.
3. Append a 24-byte slice record to the chunk's `Slices` blob.
4. Create a `sliceRef` entry with `refs=1`.
5. Update the inode's `Length`, `Mtime`, `Ctime`.

### Read Path (Resolving Overlaps)

`buildSlice()` uses an interval-tree approach:
1. Process slices in write order (oldest first).
2. Each new slice cuts/splits any existing slices that it overlaps.
3. Later writes always win at any byte position.
4. A final in-order traversal yields a non-overlapping, resolved slice list.
5. Gaps (regions with `id == 0`) are zeros/holes.

### Compaction

Compaction is triggered when a chunk accumulates many slices: it is considered on every 100th appended slice, forced at 350+ slices, and also triggered on read when a read spans 5+ slices.

1. Read all slices for the chunk.
2. Skip leading large contiguous slices (no need to rewrite them).
3. Build the resolved slice view; trim leading/trailing zeros.
4. Read the resolved data and write it as a single new object.
5. Atomic compare-and-swap: replace the compacted slices with one new slice record.
6. Decrement refs on the old slices (or queue them to `delslices` if trash is enabled).

Constants: `maxCompactSlices = 1000`, `maxSlices = 2500`, `ChunkSize = 64MB`.

## What JuiceFS Has Instead of Versioning

### Trash / Recycle Bin

- Controlled by the `TrashDays` setting.
- Deleted files are moved to `.trash/` (reserved inode `0x7FFFFFFF10000000`).
- Sub-directories per hour: `.trash/2024-01-15-14/`.
- Entries are named `{parent_ino}-{file_ino}-{original_name}`.
- A background job cleans entries older than `TrashDays`.
- Files with `FlagSkipTrash` bypass the trash.

### Clone (metadata-only COW fork)

`juicefs clone SRC DST`:
- Creates new inodes for all entries in the source tree.
- Copies chunk slice arrays verbatim to the new inodes.
- Increments `sliceRef.Refs` for every referenced slice.
- Redirect-on-write: subsequent writes to either copy create new slices; unmodified regions share data blocks.
- Fast regardless of data size (metadata-only operation).
- NOT a reversible snapshot; it is a one-time fork.

## Comparison: JuiceFS vs Our Planned Design

| Aspect | JuiceFS | Our Design |
|--------|---------|------------|
| Versioning | None. Trash + clone only. | Native per-file versioning via `inode_versions` table. |
| Snapshots | No filesystem snapshots. Clone is a one-time fork. | Point-in-time snapshots by recording `{ino -> version}` mappings. Instant, metadata-only. |
| Slice model | Packed 24-byte records in a blob column. Overlap resolution via interval tree. | Similar concept but our "slices" only needed for chunked-mode large files. Small files use inline SQLite or single S3 objects. |
| Metadata engines | Redis, PostgreSQL, MySQL, SQLite, TiKV, etcd | SQLite primary. Interface allows Redis, Postgres, etc. |
| Block store | Any S3-compatible object storage | Same. Plus inline SQLite for tiny files. |
| File size tiers | All files use chunk/slice/block model | Three tiers: inline SQLite (<64KB), single S3 object (64KB-8MB), chunked (>8MB) |
| Small file optimization | None. Even 1-byte files get a slice + S3 object. | Inline in SQLite. Zero S3 round-trips for tiny files. |
| Reference counting | `sliceRef` table tracks refs per slice. Clone increments refs. Compaction decrements. | Version-based. Old versions kept until GC. Simpler model since we don't need clone/COW. |

### Key Differences

1. **JuiceFS is designed for shared multi-client POSIX workloads.** It needs session tracking, distributed locks, sustained inodes, and compaction under concurrent access. We are single-client (one VM per filesystem instance), so we can skip all of that complexity.

2. **JuiceFS uses the slice model for ALL files.** Every byte written creates a slice record and an S3 object, even for a 10-byte config file. Our tiered approach avoids S3 round-trips for small files entirely.

3. **JuiceFS has no versioning because it wasn't designed for it.** The slice model technically contains historical data (old slices exist until compaction), but there's no way to query "what did this file look like 5 minutes ago." Our `inode_versions` table makes this a first-class operation.

4. **Our versioning is cheaper than JuiceFS clone.** Clone duplicates the entire metadata tree. Our versioning just increments a version number and keeps the old S3 key/inline content around. Rolling back = updating `current_version` on the inode.