Commit d56dc78

chunking doc update
1 parent aaf302c commit d56dc78

13 files changed

Lines changed: 624 additions & 266 deletions

docs/chunking/cdc.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
---
title: 'CDC Chunking'
---

Content-Defined Chunking: chunk boundaries are determined by file content, not by fixed byte offsets.

## What CDC Is

CDC stands for Content-Defined Chunking.
Unlike fixed-size chunking, CDC scans the file content and places chunk boundaries based on data patterns (rolling hash fingerprints).

Because boundaries are content-driven, when a large file changes only locally (for example, a small region is inserted or deleted in the middle,
or data is appended at the end), most unchanged regions are still cut into the same chunks as before.
Those identical chunks are reused from existing storage, which significantly improves the deduplication rate.

CDC makes no assumptions about the internal structure of the file.
It works for any kind of local modification: insertion, deletion, or in-place update at any position.

## FastCDC

FastCDC is a specific algorithm implementing CDC, and the one adopted by Prime Backup.
It was first described in a [paper at USENIX ATC 2016](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf),
with further improvements published in a [2020 follow-up](https://ieeexplore.ieee.org/document/9055082).

At its core, FastCDC uses a gear hash, a lightweight rolling hash that processes data one byte at a time
via a simple table lookup and bit shift. A chunk boundary is declared wherever the hash value satisfies a bitmask condition.

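The gear-hash idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the code Prime Backup or `pyfastcdc` runs: the table seed, the `mask_bits` parameter, and the names `gear_chunks` and `chunk_digests` are all assumptions made for the example, and there is no minimum or maximum chunk size. The demo at the bottom inserts ten bytes into the middle of a buffer and counts how many chunks are new; because boundaries follow the content, the count should stay very small.

```python
import hashlib
import random

# Illustrative gear table: 256 pseudo-random 64-bit integers.
# The seed and table contents are assumptions, not the values pyfastcdc uses.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def gear_chunks(data: bytes, mask_bits: int = 10):
    """Yield chunks of `data`, cutting where the top `mask_bits` hash bits are zero.

    With random-looking input a cut is expected about every 2**mask_bits bytes.
    """
    # The top bits of a gear hash mix the most recent bytes (up to 64 of them).
    mask = ((1 << mask_bits) - 1) << (64 - mask_bits)
    start, fp = 0, 0
    for i, byte in enumerate(data):
        # Rolling update: one shift, one table lookup, one addition.
        fp = ((fp << 1) + GEAR[byte]) & MASK64
        if fp & mask == 0:
            yield data[start:i + 1]
            start, fp = i + 1, 0
    if start < len(data):
        yield data[start:]


def chunk_digests(data: bytes) -> set:
    """Return the set of SHA-256 digests of all chunks of `data`."""
    return {hashlib.sha256(chunk).digest() for chunk in gear_chunks(data)}


if __name__ == "__main__":
    rng = random.Random(1)
    original = bytes(rng.getrandbits(8) for _ in range(200_000))
    # Insert ten bytes in the middle; later boundaries realign with the content.
    edited = original[:100_000] + b"0123456789" + original[100_000:]
    old, new = chunk_digests(original), chunk_digests(edited)
    print(f"{len(new - old)} of {len(new)} chunks are new after the insertion")
```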
What sets FastCDC apart from earlier CDC algorithms is its normalized chunking technique.
Rather than applying a single hash mask throughout, it uses a stricter mask while the current chunk is below the target average size and a more permissive one once it grows past that size,
nudging the chunk size distribution toward the desired average without sacrificing the content-adaptive nature of CDC.

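A rough sketch of the two-mask idea follows, again in plain Python and under stated assumptions: the gear table seed, the default sizes, the `level` parameter (how many bits the masks differ from the average), and the helper names are hypothetical and are not taken from `pyfastcdc`. The stricter mask has more bits and is therefore harder to satisfy; the looser mask has fewer. Hashing starts only after `min_size` bytes, and a cut is forced at `max_size`.

```python
import random

# Illustrative gear table; the seed is an assumption for this sketch.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def top_bits_mask(bits: int) -> int:
    """Mask selecting the top `bits` bits of a 64-bit fingerprint."""
    return ((1 << bits) - 1) << (64 - bits)


def normalized_chunks(data: bytes, min_size: int = 2048, avg_size: int = 8192,
                      max_size: int = 65536, level: int = 2):
    """Yield chunks using a strict mask below `avg_size` and a loose one above it."""
    avg_bits = avg_size.bit_length() - 1      # log2 of the target average size
    strict = top_bits_mask(avg_bits + level)  # more bits: cuts are rarer
    loose = top_bits_mask(avg_bits - level)   # fewer bits: cuts are more likely
    start = 0
    while start < len(data):
        end = min(start + max_size, len(data))
        cut = end                             # forced cut at max_size or end of data
        fp = 0
        # Skip the first min_size bytes: no boundary is allowed there.
        for i in range(start + min_size, end):
            fp = ((fp << 1) + GEAR[data[i]]) & MASK64
            mask = strict if i - start < avg_size else loose
            if fp & mask == 0:
                cut = i + 1
                break
        yield data[start:cut]
        start = cut


if __name__ == "__main__":
    rng = random.Random(2)
    sample = bytes(rng.getrandbits(8) for _ in range(300_000))
    sizes = [len(chunk) for chunk in normalized_chunks(sample)]
    print(f"{len(sizes)} chunks, mean size {sum(sizes) // len(sizes)} bytes")
```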
Prime Backup uses [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc), a Cython-accelerated Python implementation of FastCDC 2020
that delivers near-native chunking throughput.

## Available Algorithms

| Algorithm      | Avg Chunk Size | Min Chunk Size | Max Chunk Size |
|----------------|----------------|----------------|----------------|
| `fastcdc_32k`  | 32 KiB         | 8 KiB          | 256 KiB        |
| `fastcdc_128k` | 128 KiB        | 64 KiB         | 1 MiB          |

`fastcdc_32k` is the default and works well for most use cases.
`fastcdc_128k` uses a coarser granularity and is better suited to very large files (10 GiB or more), where the per-chunk metadata overhead of `fastcdc_32k` becomes noticeable.

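To get a feel for the metadata difference, the expected number of chunk records can be estimated by dividing the file size by the average chunk size. The snippet below is only back-of-the-envelope arithmetic with a hypothetical helper name; real chunk counts depend on the data, since CDC chunk sizes vary between the minimum and maximum.

```python
KIB = 1024
GIB = 1024 ** 3


def estimated_chunks(file_size: int, avg_chunk_size: int) -> int:
    """Estimate the chunk count as file size over average chunk size, rounded up."""
    return -(-file_size // avg_chunk_size)  # ceiling division


for name, avg in (("fastcdc_32k", 32 * KIB), ("fastcdc_128k", 128 * KIB)):
    print(f"{name}: about {estimated_chunks(10 * GIB, avg):,} chunks for a 10 GiB file")
# fastcdc_32k: about 327,680 chunks for a 10 GiB file
# fastcdc_128k: about 81,920 chunks for a 10 GiB file
```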
Both algorithms use FastCDC with normalized chunking and a fixed seed (`0`) for reproducibility.

## Good Candidates

CDC works well whenever most backups only change part of a file, for example:

- large database files with local row-level updates
- large log files that grow by appending at the end and need to be backed up
- any large file that is frequently modified in a local, non-global manner

## Poor Candidates

CDC is usually not a good fit when:

- the file is completely rewritten on every save (no local structure is preserved)
- the file is a compressed or encrypted container, where any small content change scrambles a large byte region

Also note that the first backup containing a file still needs to write all chunks,
so CDC benefits only become visible on later backups with high chunk reuse.

## Dependencies

CDC chunking requires the optional Python library [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc).
You can install it directly, or install the optional dependency bundle:

```bash
pip3 install pyfastcdc
# or install all optional dependencies at once
pip3 install -r requirements.optional.txt
```

docs/chunking/cdc.zh.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
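---
title: 'CDC Chunking'
---

Content-Defined Chunking: chunk boundaries are determined by file content rather than by fixed byte offsets.

## What CDC Is

CDC is short for Content-Defined Chunking, usually rendered in Chinese as "内容定义分块".
Unlike fixed-size chunking, CDC scans the file content and uses rolling hash fingerprints to identify chunk boundaries.

Because boundaries are determined by content, when a large file changes only locally (for example, a small amount of data is inserted or deleted in the middle,
or content is only appended at the end), most of the unchanged content is still cut into the same chunks as before.
These identical chunks can be reused directly from existing storage, which significantly improves deduplication.

CDC makes no assumptions about the internal structure of the file.
It is effective for any kind of local modification: insertion, deletion, or in-place overwrite at any position.
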
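## FastCDC

FastCDC is a specific algorithm that implements CDC, and the one adopted by Prime Backup.
It was first proposed in [a research paper](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf) presented at USENIX ATC 2016,
and further refined in a [2020 follow-up paper](https://ieeexplore.ieee.org/document/9055082).

At the core of FastCDC is the gear hash, a lightweight rolling hash that scans data byte by byte
and needs only one table lookup and one shift per step. Chunk boundaries are detected by applying a bitmask condition to the hash value.

What distinguishes FastCDC from earlier CDC algorithms is its normalized chunking technique.
Instead of a single uniform hash mask, it uses a stricter mask while the current position is below the target average chunk size and switches to a looser mask once past it,
so that the chunk size distribution converges tightly toward the target average while the content-adaptive nature of CDC is fully preserved.

Prime Backup uses [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc), a Cython-accelerated Python implementation of FastCDC 2020
that achieves chunking throughput close to native code.
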
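## Available Algorithms

| Algorithm      | Avg Chunk Size | Min Chunk Size | Max Chunk Size |
|----------------|----------------|----------------|----------------|
| `fastcdc_32k`  | 32 KiB         | 8 KiB          | 256 KiB        |
| `fastcdc_128k` | 128 KiB        | 64 KiB         | 1 MiB          |

`fastcdc_32k` is the default option and suits most use cases.
`fastcdc_128k` uses a coarser granularity and is better suited to very large files (10 GiB or more), where it reduces the relative metadata overhead that each chunk record incurs at `fastcdc_32k` granularity.

Both algorithms use FastCDC with normalized chunking enabled and a fixed seed (`0`) to ensure reproducibility.
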
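## Good Candidates

CDC is usually effective whenever most backups involve only local changes to a file, for example:

- large database files where each update touches only some rows
- large log files that need to be backed up and are written mainly by appending at the end
- any large file that is frequently modified in a local, non-global manner

## Poor Candidates

CDC usually performs poorly when:

- the file is completely rewritten on every save (no local structure is preserved)
- the file is a compressed archive or encrypted container, where any small content change alters a large range of bytes

In addition, the first time a file enters a backup, all of its chunks still have to be written in full.
The benefits of CDC show up mainly in later backups where many chunks can be reused.
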
## Dependencies

CDC chunking depends on the optional Python library [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc).
You can install it on its own, or install the optional dependency bundle:

```bash
pip3 install pyfastcdc
# or install all optional dependencies at once
pip3 install -r requirements.optional.txt
```

docs/chunking/fixed_size.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
---
title: 'Fixed-Size Chunking'
---

!!! warning "Alpha Status"

    Fixed-size chunking is currently a proof of concept and is in alpha status.
    It is not recommended for production use.
    Use [CDC Chunking](cdc.md) instead unless you have a specific reason to use fixed-size chunking.

Fixed-size chunking splits files at predictable byte-offset boundaries, with every chunk being exactly the configured size
(the last chunk may be smaller if the file size is not a multiple of the chunk size).

## What Fixed-Size Chunking Is

Fixed-size chunking is conceptually simple: the file is divided into equal-sized pieces from start to end.
Each piece is hashed and stored independently, just like CDC chunks.

Unlike CDC, chunk boundaries sit at fixed byte offsets and do not follow the content when data is inserted or deleted in the middle of the file.
An in-place edit changes the hash of every chunk it touches, and an insertion or deletion shifts all subsequent bytes across the boundaries,
so every later chunk gets new content and a large number of previously stored chunks can be invalidated.

This means fixed-size chunking is generally inferior to CDC for files with arbitrary edits.
Its benefit is only realized when the file's write pattern is well aligned with chunk boundaries.

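The difference between an in-place edit and an insertion can be checked with a small experiment. The sketch below is illustrative only: the helper names are hypothetical, SHA-256 is an assumed hash choice, and it does not reflect how Prime Backup stores chunks. It splits a buffer into 4 KiB pieces, then compares how many chunk digests survive a one-byte overwrite and a one-byte insertion at the same position.

```python
import hashlib
import random


def fixed_chunk_digests(data: bytes, chunk_size: int) -> list:
    """Split `data` at multiples of `chunk_size` and hash each piece."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


def reusable(old: list, new: list) -> int:
    """Count chunks of the new version whose digest is already stored."""
    stored = set(old)
    return sum(1 for digest in new if digest in stored)


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(10 * 4096))  # ten 4 KiB chunks
old = fixed_chunk_digests(data, 4096)

# In-place edit: flip one byte inside the second chunk.
overwritten = data[:5000] + bytes([data[5000] ^ 0xFF]) + data[5001:]
# Insertion: add one byte at the same position, shifting everything after it.
inserted = data[:5000] + b"\x00" + data[5000:]

print(reusable(old, fixed_chunk_digests(overwritten, 4096)))  # 9 of 10 chunks reused
print(reusable(old, fixed_chunk_digests(inserted, 4096)))     # only the first of 11 chunks reused
```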
For example, with `fixed_4k` applied to a Minecraft region file:

```
+----------------------------------------------------------------------+
|                        file (e.g. r.0.0.mca)                         |
+------+------+------+------+------+------+------+------+------+-- - --+
| 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB |  ...  |
|  c1  |  c2  |  c3  |  c4  |  c5  |  c6  |  c7  |  c8  |  c9  |       |
+------+------+------+------+------+------+------+------+------+-- - --+
```

Each 4 KiB chunk corresponds to one internal page of the region file.
When only a few game chunks change between backups, only the corresponding pages are dirtied,
and the rest of the chunks are identical to those already stored.

## Available Algorithms

| Algorithm    | Chunk Size | Typical Use Case |
|--------------|------------|------------------|
| `fixed_4k`   | 4 KiB      | Minecraft region files (`.mca`): each region file is organized in 4 KiB pages, so changing one game chunk only invalidates the 4 KiB pages it occupies |
| `fixed_32k`  | 32 KiB     | General intermediate granularity |
| `fixed_128k` | 128 KiB    | Append-only files: growth at the tail only creates new trailing chunks, leaving all previous full chunks intact |

### fixed_4k

The 4 KiB chunk size aligns with the internal page structure of Minecraft's Anvil region files (`.mca`).
In theory, modifying a small number of chunks in the game only dirties a limited number of 4 KiB pages,
making `fixed_4k` capable of the finest-grained deduplication for region files.

However, `fixed_4k` has serious practical drawbacks:

- extremely high metadata overhead: a 1 GiB file requires 262 144 chunk records
- poor I/O performance: each chunk requires a separate read-write cycle during backup

Unless the file is very large and only a tiny number of pages change per backup, `fixed_4k` is unlikely to be worth the cost.

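The record count is simple arithmetic: file size divided by chunk size, rounded up. The snippet below works it out for all three fixed sizes; the helper name and the dictionary are illustrative and not part of Prime Backup.

```python
KIB = 1024
GIB = 1024 ** 3

FIXED_CHUNK_SIZES = {"fixed_4k": 4 * KIB, "fixed_32k": 32 * KIB, "fixed_128k": 128 * KIB}


def chunk_records(file_size: int, chunk_size: int) -> int:
    """Number of chunks needed for a file, counting a partial last chunk."""
    return -(-file_size // chunk_size)  # ceiling division


for name, size in FIXED_CHUNK_SIZES.items():
    print(f"{name}: {chunk_records(1 * GIB, size):,} records per GiB")
# fixed_4k: 262,144 records per GiB
# fixed_32k: 32,768 records per GiB
# fixed_128k: 8,192 records per GiB
```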
### fixed_32k

A middle-ground option. Metadata overhead is 8× lower than `fixed_4k` (32 KiB / 4 KiB = 8), but granularity is also much coarser.

### fixed_128k

The 128 KiB chunk size is well suited to files that grow by appending data at the end.
When new data is appended, only the trailing chunks change (the partial last chunk and any new ones); all preceding full chunks retain the same hash and are reused.

This makes `fixed_128k` a reasonable alternative to CDC for pure append-write files.

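The append-only case can be illustrated with a short experiment. This is a sketch with assumed names and an assumed SHA-256 digest, not Prime Backup's storage code: a 300 KiB buffer stands in for a log file, 100 KiB is appended, and the chunk digests before and after are compared.

```python
import hashlib
import os

CHUNK = 128 * 1024  # chunk size used by fixed_128k


def fixed_chunk_digests(data: bytes, chunk_size: int = CHUNK) -> list:
    """Split `data` at multiples of `chunk_size` and hash each piece."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


log_v1 = os.urandom(300 * 1024)           # two full chunks plus a 44 KiB tail
log_v2 = log_v1 + os.urandom(100 * 1024)  # 100 KiB appended at the end

old, new = fixed_chunk_digests(log_v1), fixed_chunk_digests(log_v2)
print(new[:2] == old[:2])  # True: the full chunks are untouched
print(new[2] == old[2])    # False: the partial tail chunk was filled up
print(len(old), len(new))  # 3 4
```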
## Poor Candidates

Fixed-size chunking is a poor choice for:

- files that are frequently modified in the middle or beginning (insertion/deletion shifts all subsequent chunks)
- files with completely unpredictable byte-level change patterns
- files where the chunk size does not align with any meaningful internal structure

## No Extra Dependencies

Fixed-size chunking has no additional Python dependency requirements.
It is available as long as Prime Backup is installed.

docs/chunking/fixed_size.zh.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
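---
title: 'Fixed-Size Chunking'
---

!!! warning "Alpha Status"

    Fixed-size chunking is currently a proof-of-concept implementation and is in the alpha stage.
    It is not recommended for production use.
    Unless you have a specific need, please use [CDC Chunking](cdc.zh.md) instead.

Fixed-size chunking splits files at predictable byte-offset boundaries, with every chunk being exactly the size given in the configuration
(if the file size is not a multiple of the chunk size, the last chunk may be smaller).
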
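## What Fixed-Size Chunking Is

Fixed-size chunking is conceptually very simple: the file is cut into equal-sized pieces from start to end.
Each chunk is hashed and stored independently, in the same way as CDC chunks.

Unlike CDC, the boundaries of fixed-size chunks are not affected by file content.
When data is inserted or deleted in the middle of a file, the chunk boundaries do not move with it,
but the hash of each modified chunk changes completely, and an insertion or deletion also shifts the content of every chunk after it,
which may invalidate a large number of stored chunks.

As a result, for files with arbitrary edit patterns, fixed-size chunking usually performs worse than CDC.
It only shows its advantage when the file's write pattern is aligned with chunk-size boundaries.
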
Taking `fixed_4k` applied to a Minecraft region file as an example:

```
+----------------------------------------------------------------------+
|                        file (e.g. r.0.0.mca)                         |
+------+------+------+------+------+------+------+------+------+-- - --+
| 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB |  ...  |
|  c1  |  c2  |  c3  |  c4  |  c5  |  c6  |  c7  |  c8  |  c9  |       |
+------+------+------+------+------+------+------+------+------+-- - --+
```

Each 4 KiB chunk corresponds to one internal page of the region file.
When only a few game chunks change between two backups, only the corresponding pages are modified,
and the remaining chunks are identical to the stored content and can be reused directly.

## Available Algorithms

| Algorithm    | Chunk Size | Typical Use Case |
|--------------|------------|------------------|
| `fixed_4k`   | 4 KiB      | Minecraft region files (`.mca`): region files are organized internally in 4 KiB pages, so modifying a few game chunks only dirties a limited number of 4 KiB pages |
| `fixed_32k`  | 32 KiB     | General medium-granularity scenarios |
| `fixed_128k` | 128 KiB    | Append-only files: data appended at the tail only produces new trailing chunks, and all previous chunks stay unchanged |

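### fixed_4k

The 4 KiB chunk size aligns with the internal page structure of Minecraft Anvil-format region files (`.mca`).
In theory, modifying a small number of chunks in the game only dirties a limited number of 4 KiB pages,
so `fixed_4k` can in theory achieve the finest-grained deduplication for region files.

However, `fixed_4k` has serious practical drawbacks:

- very high metadata overhead: a 1 GiB file requires 262 144 chunk records
- poor I/O performance: every chunk needs its own read and write operation during backup

Unless the file is very large and only a tiny number of pages change per backup, the cost of `fixed_4k` is likely to far outweigh its benefit.

### fixed_32k

A middle-ground option. Metadata overhead is 8× lower than `fixed_4k` (32 KiB / 4 KiB = 8), but the granularity is also much coarser.

### fixed_128k

The 128 KiB chunk size is well suited to files written by appending at the end.
When new data is appended to the end of the file, only the trailing chunks change; all earlier chunks keep the same hash and can be reused directly.

For purely append-written files, `fixed_128k` is a reasonable alternative to CDC.
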
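## Poor Candidates

Fixed-size chunking usually performs poorly when:

- the file is frequently modified in the middle or at the beginning (insertion/deletion shifts all subsequent chunks)
- the file's byte-level change pattern is completely unpredictable
- the file's internal structure does not align with the chunk size

## No Extra Dependencies

Fixed-size chunking has no additional Python dependency requirements.
It is available as long as Prime Backup is installed.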
