diff --git a/docs/chunking/cdc.md b/docs/chunking/cdc.md
diff --git a/docs/chunking/cdc.zh.md b/docs/chunking/cdc.zh.md
diff --git a/docs/chunking/fixed_size.md b/docs/chunking/fixed_size.md
diff --git a/docs/chunking/fixed_size.zh.md b/docs/chunking/fixed_size.zh.md
@@ -0,0 +1,74 @@
+---
+title: 'CDC Chunking'
+---
+
+Content-Defined Chunking: chunk boundaries are determined by file content, not fixed byte offsets
+
+## What CDC Is
+
+CDC stands for Content-Defined Chunking.
+Unlike fixed-size chunking, CDC scans the file content and identifies chunk boundaries based on data patterns (rolling hash fingerprints)
+
+Because boundaries are content-driven, when a large file is only changed locally — such as inserting or deleting a small region in the middle,
+or appending data at the end — many unchanged regions can still be cut into the same chunks as before.
+Those identical chunks are reused from existing storage, which improves deduplication rate significantly
+
+CDC makes no assumptions about the internal structure of the file.
+It works for any kind of local modification: insertion, deletion, or in-place update in any position
+
+## FastCDC
+
+FastCDC is a specific algorithm implementing CDC, and the one adopted by Prime Backup.
+It was first described in a [paper at USENIX ATC 2016](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf),
+with further improvements published in a [2020 follow-up](https://ieeexplore.ieee.org/document/9055082)
+
+At its core, FastCDC uses a gear hash — a lightweight rolling hash that processes data one byte at a time
+via a simple table lookup and bit shift — to detect chunk boundaries based on a bitmask condition on the hash value
+
+What sets FastCDC apart from earlier CDC algorithms is its normalized chunking technique.
+Rather than applying a single hash mask throughout, it uses a stricter mask below the average size target and a more permissive one above it,
+nudging the chunk size distribution toward the desired average without sacrificing the content-adaptive nature of CDC
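
The gear hash and the two-mask rule described above can be sketched in plain Python. This is a simplified illustration, not the `pyfastcdc` implementation: the size limits, the mask widths, the gear table, and the names `find_cut` and `chunk_data` are all assumptions made for the example, and real FastCDC spreads its mask bits differently.

```python
import random
from typing import List

# Illustrative parameters only; not the values used by Prime Backup or pyfastcdc.
MIN_SIZE, AVG_SIZE, MAX_SIZE = 2 * 1024, 8 * 1024, 64 * 1024
MASK_STRICT = (1 << 15) - 1  # more bits must be zero: cuts are rarer
MASK_LOOSE = (1 << 11) - 1   # fewer bits must be zero: cuts are more likely
U64 = (1 << 64) - 1

# Gear table: one pseudo-random 64-bit value per byte value (fixed seed).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def find_cut(data: bytes, start: int) -> int:
    """Return the exclusive end offset of the chunk that begins at `start`."""
    remaining = len(data) - start
    if remaining <= MIN_SIZE:
        return len(data)
    end = start + min(remaining, MAX_SIZE)
    normal = start + min(remaining, AVG_SIZE)
    h = 0
    # Bytes below the minimum chunk size are skipped entirely.
    for i in range(start + MIN_SIZE, end):
        # Gear hash step: one shift, one table lookup, one addition.
        h = ((h << 1) + GEAR[data[i]]) & U64
        # Normalized chunking: strict mask before the average size, loose after.
        mask = MASK_STRICT if i < normal else MASK_LOOSE
        if h & mask == 0:
            return i + 1
    return end  # no boundary found: force a cut at the maximum size


def chunk_data(data: bytes) -> List[bytes]:
    """Split `data` into content-defined chunks."""
    chunks, pos = [], 0
    while pos < len(data):
        cut = find_cut(data, pos)
        chunks.append(data[pos:cut])
        pos = cut
    return chunks
```

Because each cut decision depends only on the bytes scanned since the previous cut, every chunk that ends before an edit is produced identically on the next run, which is what allows unchanged regions to be reused.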
+
+Prime Backup uses [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc), a Cython-accelerated Python implementation of FastCDC 2020
+that delivers near-native chunking throughput
+
+## Available Algorithms
+
+| Algorithm      | Avg Chunk Size | Min Chunk Size | Max Chunk Size |
+|----------------|----------------|----------------|----------------|
+| `fastcdc_32k`  | 32 KiB         | 8 KiB          | 256 KiB        |
+| `fastcdc_128k` | 128 KiB        | 64 KiB         | 1 MiB          |
+
+`fastcdc_32k` is the default and works well for most use cases.
+`fastcdc_128k` uses a coarser granularity and is better suited for very large files (10 GiB or more) where the per-chunk metadata overhead of `fastcdc_32k` becomes noticeable
+
+Both algorithms use FastCDC with normalized chunking and a fixed seed (`0`) for reproducibility
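
To get a feel for the metadata difference between the two algorithms, the expected number of chunk records can be estimated from the average chunk size in the table above. This is a rough sketch: the helper name `estimated_chunk_count` is made up for the example, and real chunk counts vary with file content.

```python
KIB = 1024
GIB = 1024 ** 3

# Average chunk sizes taken from the table above.
AVG_CHUNK_SIZE = {
    "fastcdc_32k": 32 * KIB,
    "fastcdc_128k": 128 * KIB,
}


def estimated_chunk_count(file_size: int, algorithm: str) -> int:
    """Estimate the chunk count as file size divided by average chunk size."""
    avg = AVG_CHUNK_SIZE[algorithm]
    return (file_size + avg - 1) // avg  # ceiling division


for name in AVG_CHUNK_SIZE:
    print(name, estimated_chunk_count(10 * GIB, name))
```

For a 10 GiB file this estimate gives about 327,680 chunks with `fastcdc_32k` and about 81,920 with `fastcdc_128k`, a 4× difference in chunk records.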
+
+## Good Candidates
+
+CDC works well whenever most backups only change part of a file, for example:
+
+- large database files with local row-level updates
+- large log files that are appended to at the end and need to be backed up
+- any large file that is frequently modified in a local, non-global manner
+
+## Poor Candidates
+
+CDC is usually not a good fit when:
+
+- the file is completely rewritten on every save (no local structure is preserved)
+- the file is a compressed or encrypted container, where any small content change scrambles a large byte region
+
+Also note that the first backup containing a file still needs to write all chunks,
+so CDC benefits only become visible on later backups with high chunk reuse
+
+## Dependencies
+
+CDC chunking requires the optional Python library [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc).
+You can install it directly, or install the optional dependency bundle:
+
+```bash
+pip3 install pyfastcdc
+# or install all optional dependencies at once
+pip3 install -r requirements.optional.txt
+```
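
After installing, one quick way to confirm that the library is visible to the interpreter is to look it up without importing it. This sketch assumes the import name matches the package name `pyfastcdc`; the helper `cdc_available` is hypothetical and not part of Prime Backup.

```python
import importlib.util


def cdc_available(module_name: str = "pyfastcdc") -> bool:
    """Return True if the given top-level module can be found."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


print("CDC chunking available:", cdc_available())
```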
@@ -0,0 +1,74 @@
+---
+title: 'CDC 分块'
+---
+
+内容定义分块：数据块边界由文件内容决定，而非固定的字节偏移
+
+## 什么是 CDC
+
+CDC 是 Content-Defined Chunking 的缩写，通常译作"内容定义分块"。
+与固定大小分块不同，CDC 通过扫描文件内容，利用滚动哈希指纹来识别数据块边界
+
+由于边界由内容决定，当大文件仅发生局部变化时——例如在中间插入或删除了少量数据，
+或仅在尾部追加内容——大量未变化的内容仍会被切分成与之前相同的数据块。
+这些相同的数据块可直接复用已有存储，显著提升去重效果
+
+CDC 对文件的内部结构没有任何假设。
+它对任意类型的局部修改均有效：无论是任意位置的插入、删除，还是原地覆盖写入
+
+## FastCDC
+
+FastCDC 是实现 CDC 的一种具体算法，也是 Prime Backup 所采用的方案。
+它最初由 2016 年 USENIX ATC 会议上发表的[一篇研究论文](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf)提出，
+于 [2020 年的后续论文](https://ieeexplore.ieee.org/document/9055082)中进一步完善
+
+FastCDC 的核心是 gear hash（齿轮哈希）——一种轻量级的滚动哈希，逐字节扫描数据，
+每步仅需一次查表操作与一次位移运算——通过对哈希值施加位掩码条件来检测分块边界
+
+FastCDC 区别于早期 CDC 算法的核心特性在于其 normalized chunking（归一化分块）技术。
+它不采用统一的哈希掩码，而是在当前位置低于目标平均块大小时使用更严格的掩码，超过后则切换为更宽松的掩码，
+使分块大小分布向目标平均值紧密收敛，同时完整保留了 CDC 的内容自适应特性
+
+Prime Backup 使用 [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc)，这是一个基于 Cython 加速的 FastCDC 2020 Python 实现，
+可达到接近原生代码的分块处理吞吐量
+
+## 可用算法
+
+| 算法             | 平均块大小   | 最小块大小  | 最大块大小   |
+|----------------|---------|--------|---------|
+| `fastcdc_32k`  | 32 KiB  | 8 KiB  | 256 KiB |
+| `fastcdc_128k` | 128 KiB | 64 KiB | 1 MiB   |
+
+`fastcdc_32k` 是默认选项，适合大多数使用场景。
+`fastcdc_128k` 采用更粗的粒度，更适合超大型文件（10 GiB 以上），可减少 `fastcdc_32k` 粒度下每条数据块记录带来的相对元数据开销
+
+两种算法均使用 FastCDC，启用了 normalized chunking，并固定种子（`0`）以保证可重现性
+
+## 适用场景
+
+只要大多数备份只涉及文件的局部变化，CDC 通常都能发挥效果，例如：
+
+- 体积较大的数据库文件，每次更新仅涉及局部行
+- 需要纳入备份、且主要以尾部追加方式写入的大型日志文件
+- 任何经常以局部、非全局方式进行修改的大文件
+
+## 不适用场景
+
+以下情况 CDC 通常效果不佳：
+
+- 文件每次保存时都被完整重写（无法保留局部结构）
+- 文件是压缩包或加密容器，任何一点内容变化都会导致大范围字节变动
+
+另外，某个文件首次进入备份时，所有数据块仍需完整写入。
+CDC 的收益主要体现在后续那些数据块可大量复用的备份上
+
+## 依赖
+
+CDC 分块依赖于可选的 Python 库 [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc)。
+你可以单独安装它，也可以安装可选依赖包：
+
+```bash
+pip3 install pyfastcdc
+# 或一次性安装全部可选依赖
+pip3 install -r requirements.optional.txt
+```
@@ -0,0 +1,85 @@
+---
+title: 'Fixed-Size Chunking'
+---
+
+!!! warning "Alpha Status"
+
+    Fixed-size chunking is currently a proof of concept and is in alpha status.
+    It is not recommended for production use.
+    Use [CDC Chunking](cdc.md) instead unless you have a specific reason to use fixed-size chunking.
+
+Fixed-size chunking splits files at predictable byte-offset boundaries, with every chunk being exactly the configured size
+(the last chunk may be smaller if the file size is not a multiple of the chunk size).
+
+## What Fixed-Size Chunking Is
+
+Fixed-size chunking is conceptually simple: the file is divided into equal-sized pieces from start to end.
+Each piece is hashed and stored independently, just like CDC chunks.
+
+Unlike CDC, chunk boundaries stay at fixed byte offsets and do not follow the content when data is inserted or deleted in the middle of the file.
+Any edit inside a chunk changes that chunk's hash entirely, and any insertion or deletion shifts all following data relative to the boundaries, changing every subsequent chunk
+and potentially invalidating a large number of previously stored chunks.
+
+This means fixed-size chunking is generally inferior to CDC for files with arbitrary edits.
+Its benefit is only realized in scenarios where the file's write pattern is well-aligned to chunk boundaries.
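
The difference between an in-place overwrite and an insertion can be illustrated with a small sketch. The helpers `fixed_chunk_digests` and `reused_chunk_count` are made up for this example, and SHA-256 is only an assumed stand-in for whatever hash Prime Backup is configured to use.

```python
import hashlib
from typing import List


def fixed_chunk_digests(data: bytes, chunk_size: int) -> List[str]:
    """Hash each fixed-size piece of `data`; the last piece may be shorter."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def reused_chunk_count(old: bytes, new: bytes, chunk_size: int) -> int:
    """Count chunks of `new` whose hash already exists among chunks of `old`."""
    stored = set(fixed_chunk_digests(old, chunk_size))
    return sum(d in stored for d in fixed_chunk_digests(new, chunk_size))


# 16 KiB of non-repeating data: four distinct 4 KiB chunks.
original = b"".join(i.to_bytes(4, "big") for i in range(4096))

# In-place overwrite of one byte: only the chunk containing it changes.
overwritten = original[:5000] + b"\xff" + original[5001:]
print(reused_chunk_count(original, overwritten, 4096))  # 3 of 4 reused

# Insertion of one byte: every chunk from that point on is shifted.
inserted = original[:5000] + b"\xff" + original[5000:]
print(reused_chunk_count(original, inserted, 4096))  # 1 of 5 reused
```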
+
+For example, with `fixed_4k` applied to a Minecraft region file:
+
+```
++----------------------------------------------------------------------+
+|                    file (e.g. r.0.0.mca)                             |
++------+------+------+------+------+------+------+------+------+-- - --+
+| 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB |  ...  |
+|  c1  |  c2  |  c3  |  c4  |  c5  |  c6  |  c7  |  c8  |  c9  |       |
++------+------+------+------+------+------+------+------+------+-- - --+
+```
+
+Each 4 KiB chunk corresponds to one internal page of the region file.
+When only a few game chunks change between backups, only the corresponding pages are dirtied,
+and the rest of the chunks are identical to those already stored.
+
+## Available Algorithms
+
+| Algorithm    | Chunk Size | Typical Use Case                                                                                                                       |
+|--------------|------------|----------------------------------------------------------------------------------------------------------------------------------------|
+| `fixed_4k`   | 4 KiB      | Minecraft region files (`.mca`): each region file is organized in 4 KiB pages, so changes to one game chunk only dirty its 4 KiB pages |
+| `fixed_32k`  | 32 KiB     | General intermediate granularity                                                                                                       |
+| `fixed_128k` | 128 KiB    | Append-only files: growth at the tail only creates new trailing chunks, leaving all previous chunks intact                             |
+
+### fixed_4k
+
+The 4 KiB chunk size aligns with the internal page structure of Minecraft's Anvil region files (`.mca`).
+In theory, modifying a small number of chunks in the game only dirties a limited number of 4 KiB pages,
+making `fixed_4k` capable of the finest-grained deduplication for region files.
+
+However, `fixed_4k` has serious practical drawbacks:
+
+- extremely high metadata overhead: a 1 GiB file requires roughly 262 144 chunk records
+- poor I/O performance: each chunk requires a separate read-write cycle during backup
+
+Unless the file is very large and only a tiny number of pages change per backup, `fixed_4k` is unlikely to be worth the cost.
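
The chunk record count quoted above follows directly from dividing the file size by the chunk size. A minimal sketch of that arithmetic for all three fixed-size algorithms (the helper name `chunk_record_count` is made up for the example):

```python
KIB = 1024
GIB = 1024 ** 3

# Chunk sizes taken from the algorithm table.
FIXED_CHUNK_SIZE = {
    "fixed_4k": 4 * KIB,
    "fixed_32k": 32 * KIB,
    "fixed_128k": 128 * KIB,
}


def chunk_record_count(file_size: int, algorithm: str) -> int:
    """Number of chunks for a file, counting a shorter final chunk as one."""
    size = FIXED_CHUNK_SIZE[algorithm]
    return (file_size + size - 1) // size  # ceiling division


for name in FIXED_CHUNK_SIZE:
    print(name, chunk_record_count(1 * GIB, name))
```

For a 1 GiB file this gives 262,144 records with `fixed_4k`, 32,768 with `fixed_32k`, and 8,192 with `fixed_128k`.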
+
+### fixed_32k
+
+A middle-ground option. Metadata overhead is 8× lower than `fixed_4k` but granularity is also much coarser.
+
+### fixed_128k
+
+The 128 KiB chunk size is well-suited for files that grow by appending data at the end.
+When new data is appended, only the trailing chunks change; all preceding chunks retain the same hash and are reused.
+
+This makes `fixed_128k` a reasonable alternative to CDC for pure append-write files.
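
The append-only behavior can be checked with a short sketch. The helper `fixed_chunk_digests` is hypothetical, and SHA-256 is only an assumed stand-in for the configured hash.

```python
import hashlib
from typing import List

CHUNK_SIZE = 128 * 1024  # fixed_128k


def fixed_chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Hash each fixed-size piece of `data`; the last piece may be shorter."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


# A 300 KiB "log" of non-repeating data, then the same log after 100 KiB more.
before = b"".join(i.to_bytes(4, "big") for i in range(300 * 1024 // 4))
after = before + b"".join(i.to_bytes(4, "big") for i in range(100 * 1024 // 4))

old_digests = fixed_chunk_digests(before)  # 128 KiB, 128 KiB, 44 KiB
new_digests = fixed_chunk_digests(after)   # 128 KiB x 3, 16 KiB

# The two full leading chunks are untouched by the append and can be reused.
print(old_digests[:2] == new_digests[:2])
# The formerly partial trailing chunk is filled up, so its hash changes.
print(old_digests[2] != new_digests[2])
```

Only the previously partial last chunk and the newly created trailing chunks need to be written; every full chunk before them keeps its hash.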
+
+## Poor Candidates
+
+Fixed-size chunking is a poor choice for:
+
+- files that are frequently modified in the middle or beginning (insertion/deletion shifts all subsequent chunks)
+- files with completely unpredictable byte-level change patterns
+- files where the chunk size does not align with any meaningful internal structure
+
+## No Extra Dependencies
+
+Fixed-size chunking has no additional Python dependency requirements.
+It is available as long as Prime Backup is installed.
+
@@ -0,0 +1,85 @@
+---
+title: '固定大小分块'
+---
+
+!!! warning "Alpha 状态"
+
+    固定大小分块目前是概念验证性质的实现，处于 alpha 阶段。
+    不建议在生产环境中使用。
+    除非你有特定需求，否则请使用 [CDC 分块](cdc.zh.md)。
+
+固定大小分块在可预测的字节偏移边界处切分文件，每个数据块的大小恰好等于配置中指定的值
+（若文件大小不是块大小的整数倍，则最后一个数据块可能更小）
+
+## 什么是固定大小分块
+
+固定大小分块在概念上非常简单：从文件开头到结尾，按照相等的大小依次切分。
+每个数据块单独计算哈希并存储，与 CDC 数据块的存储方式相同
+
+与 CDC 不同，固定大小分块的边界不受文件内容影响。
+在文件中间插入或删除数据时，数据块边界不会随之位移，
+但被修改的数据块的哈希会完全改变，插入或删除操作还会导致其后所有数据相对于块边界发生偏移，
+可能使大量已存储的数据块失效
+
+因此，对于有任意编辑模式的文件，固定大小分块的表现通常不如 CDC。
+只有当文件的写入模式与块大小边界对齐时，固定大小分块才能发挥其优势
+
+以 `fixed_4k` 作用于 Minecraft region 文件为例：
+
+```
++------------------------------------------------------------------------+
+|                    文件（如 r.0.0.mca）                                  |
++------+------+------+------+------+------+------+------+------+-- - --+
+| 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB |  ...  |
+|  c1  |  c2  |  c3  |  c4  |  c5  |  c6  |  c7  |  c8  |  c9  |       |
++------+------+------+------+------+------+------+------+------+-- - --+
+```
+
+每个 4 KiB 的数据块对应 region 文件的一个内部页。
+当两次备份之间只有少量游戏区块发生变化时，只有对应的页会被修改，
+其余数据块与已存储的内容完全一致，可直接复用
+
+## 可用算法
+
+| 算法           | 块大小     | 典型适用场景                                                                        |
+|--------------|---------|-------------------------------------------------------------------------------|
+| `fixed_4k`   | 4 KiB   | Minecraft region 文件（`.mca`）：region 文件以 4 KiB 页为内部组织单位，修改少量游戏区块只会脏化有限的 4 KiB 页 |
+| `fixed_32k`  | 32 KiB  | 一般性的中等粒度场景                                                                    |
+| `fixed_128k` | 128 KiB | 追加写文件：尾部追加的数据只会产生新的末尾数据块，之前的所有数据块保持不变                                         |
+
+### fixed_4k
+
+4 KiB 的块大小与 Minecraft Anvil 格式 region 文件（`.mca`）的内部页结构对齐。
+理论上，游戏中修改少量区块只会脏化有限数量的 4 KiB 页面，
+因此 `fixed_4k` 在理论上能够对 region 文件实现最细粒度的去重
+
+然而，`fixed_4k` 存在严重的实际缺陷：
+
+- 元数据开销极大：一个 1 GiB 的文件需要约 262 144 条数据块记录
+- I/O 性能很差：备份时每个数据块都需要单独的读写操作
+
+除非文件非常大且每次备份只有极少数页面发生变化，否则 `fixed_4k` 的代价很可能远大于收益
+
+### fixed_32k
+
+一种折中选项。元数据开销比 `fixed_4k` 低 8 倍，但粒度也粗糙得多
+
+### fixed_128k
+
+128 KiB 的块大小非常适合以尾部追加方式写入的文件。
+当新数据被追加到文件末尾时，只有末尾的数据块会发生变化；之前的所有数据块保持相同的哈希，可直接复用
+
+对于纯追加写入的文件，`fixed_128k` 是 CDC 的一个合理替代选项
+
+## 不适用场景
+
+固定大小分块在以下情况通常效果不佳：
+
+- 文件频繁在中间或开头被修改（插入/删除会导致后续所有数据块偏移）
+- 文件的字节级变化模式完全不可预测
+- 文件内部结构与块大小不对齐
+
+## 无需额外依赖
+
+固定大小分块没有额外的 Python 依赖要求。
+只要安装了 Prime Backup，该功能即可使用