Commit 126cd0e

Add fixed_1m and fastcdc_1m chunk method, adjust fastcdc_128k max size
1 parent f43fda1 commit 126cd0e

5 files changed

Lines changed: 27 additions & 5 deletions

docs/chunking/cdc.md

Lines changed: 2 additions & 0 deletions
@@ -38,9 +38,11 @@ that delivers near-native chunking throughput
 |----------------|----------------|----------------|----------------|
 | `fastcdc_32k` | 32 KiB | 8 KiB | 256 KiB |
 | `fastcdc_128k` | 128 KiB | 64 KiB | 1 MiB |
+| `fastcdc_1m` | 1 MiB | 512 KiB | 8 MiB |

 `fastcdc_32k` is the default and works well for most use cases
 `fastcdc_128k` uses a coarser granularity and is better suited for very large files (10 GiB or more) where the per-chunk metadata overhead of `fastcdc_32k` becomes noticeable
+`fastcdc_1m` uses an even coarser granularity with a 1 MiB average chunk size, minimizing metadata overhead for extremely large files at the cost of deduplication granularity

 Both algorithms use FastCDC with normalized chunking and a fixed seed (`0`) for reproducibility

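The metadata trade-off described in this hunk can be made concrete with a rough chunk-count estimate. The sketch below is illustrative only: it assumes every chunk lands exactly on the nominal average size from the table, which real FastCDC output will only approximate, and `expected_chunk_count` is a hypothetical helper, not part of Prime Backup.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Nominal average chunk sizes, taken from the documentation table in this hunk.
AVG_CHUNK_SIZE = {
    "fastcdc_32k": 32 * KIB,
    "fastcdc_128k": 128 * KIB,
    "fastcdc_1m": 1 * MIB,
}


def expected_chunk_count(file_size: int, avg_chunk_size: int) -> int:
    """Rough chunk count, assuming chunks match the nominal average size."""
    return -(-file_size // avg_chunk_size)  # ceiling division


if __name__ == "__main__":
    for name, avg in AVG_CHUNK_SIZE.items():
        count = expected_chunk_count(10 * GIB, avg)
        print(f"{name}: about {count:,} chunk records for a 10 GiB file")
```

Under that assumption, a 10 GiB file needs about 327,680 chunk records at 32 KiB, 81,920 at 128 KiB, and 10,240 at 1 MiB, so each step up in average size cuts the record count by a factor of 4 or 8.
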
docs/chunking/cdc.zh.md

Lines changed: 6 additions & 4 deletions
@@ -34,13 +34,15 @@ Prime Backup uses [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc),

 ## Available Algorithms

-| Algorithm | Average chunk size | Min chunk size | Max chunk size |
-|----------------|---------|--------|---------|
-| `fastcdc_32k` | 32 KiB | 8 KiB | 256 KiB |
-| `fastcdc_128k` | 128 KiB | 64 KiB | 1 MiB |
+| Algorithm | Average chunk size | Min chunk size | Max chunk size |
+|----------------|---------|---------|---------|
+| `fastcdc_32k` | 32 KiB | 8 KiB | 256 KiB |
+| `fastcdc_128k` | 128 KiB | 64 KiB | 1 MiB |
+| `fastcdc_1m` | 1 MiB | 512 KiB | 8 MiB |

 `fastcdc_32k` is the default and suits most use cases.
 `fastcdc_128k` uses a coarser granularity and is better suited to very large files (10 GiB or more), reducing the relative metadata overhead that each chunk record incurs at `fastcdc_32k` granularity
+`fastcdc_1m` uses an even coarser 1 MiB average chunk size, further reducing metadata overhead for extremely large files at the cost of coarser deduplication granularity

 Both algorithms use FastCDC with normalized chunking enabled and a fixed seed (`0`) for reproducibility

docs/chunking/fixed_size.md

Lines changed: 8 additions & 0 deletions
@@ -45,6 +45,7 @@ and the rest of the chunks are identical to those already stored
 | `fixed_4k` | 4 KiB | Minecraft region files (`.mca`): each region file is organized in 4 KiB pages, so changes in one chunk only invalidate that 4 KiB page |
 | `fixed_32k` | 32 KiB | General intermediate granularity |
 | `fixed_128k` | 128 KiB | Append-only files: growth at the tail only creates new trailing chunks, leaving all previous chunks intact |
+| `fixed_1m` | 1 MiB | Very large append-only files: even lower metadata overhead than `fixed_128k`, useful when fine-grained deduplication is not required |
 | `fixed_auto` | 128 KiB / 4 KiB | Adaptive fixed-size strategy that uses the previous backup's same-path chunk layout to limit metadata growth while keeping some 4 KiB reuse |

 ### fixed_4k
@@ -71,6 +72,13 @@ When new data is appended, only the trailing chunks change; all preceding chunks

 This makes `fixed_128k` a reasonable alternative to CDC for pure append-write files

+### fixed_1m
+
+The 1 MiB chunk size further reduces metadata overhead compared to `fixed_128k`, at the cost of coarser deduplication granularity.
+It is suitable for extremely large append-only files where even the 128 KiB metadata overhead becomes a concern
+
+For most use cases, `fixed_128k` or CDC variants are preferred. Consider `fixed_1m` only when the file is very large and write patterns are exclusively append-only
+
 ### fixed_auto

 `fixed_auto` walks the file in 128 KiB windows.

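The append-only condition attached to `fixed_1m` can be illustrated with a small experiment. This is a minimal sketch of generic fixed-size chunking, not Prime Backup's chunker: `fixed_size_chunk_hashes` and `reused_chunk_count` are hypothetical helpers, SHA-256 is an arbitrary choice of hash, and chunks are compared position by position.

```python
import hashlib
import random

MIB = 1024 * 1024


def fixed_size_chunk_hashes(data: bytes, chunk_size: int) -> list[str]:
    """Hash each fixed-size chunk; the final chunk may be shorter."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


def reused_chunk_count(old: bytes, new: bytes, chunk_size: int) -> int:
    """Count positions where the old and new files share an identical chunk."""
    old_hashes = fixed_size_chunk_hashes(old, chunk_size)
    new_hashes = fixed_size_chunk_hashes(new, chunk_size)
    return sum(1 for a, b in zip(old_hashes, new_hashes) if a == b)


# Deterministic pseudo-random test data: 3.5 MiB, i.e. three full chunks plus a partial one.
rng = random.Random(0)
original = rng.randbytes(3 * MIB + MIB // 2)
appended = original + rng.randbytes(MIB)  # growth at the tail only
prepended = b"\x00" + original            # a one-byte insert at the front

reused_after_append = reused_chunk_count(original, appended, MIB)
reused_after_prepend = reused_chunk_count(original, prepended, MIB)

if __name__ == "__main__":
    print("reused after append: ", reused_after_append)
    print("reused after prepend:", reused_after_prepend)
```

With these inputs, the append keeps the three full leading chunks and only the previously partial tail chunk changes, while the one-byte insert at the front shifts every boundary and leaves nothing reusable.
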
docs/chunking/fixed_size.zh.md

Lines changed: 8 additions & 0 deletions
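@@ -46,6 +46,7 @@ title: 'Fixed-Size Chunking'
 | `fixed_4k` | 4 KiB | Minecraft region files (`.mca`): region files are organized internally in 4 KiB pages, so modifying a few game chunks dirties only a limited number of 4 KiB pages |
 | `fixed_32k` | 32 KiB | General-purpose intermediate granularity |
 | `fixed_128k` | 128 KiB | Append-only files: data appended at the tail only produces new trailing chunks, leaving all earlier chunks unchanged |
+| `fixed_1m` | 1 MiB | Very large append-only files: lower metadata overhead than `fixed_128k`, suitable when fine-grained deduplication is not required |
 | `fixed_auto` | 128 KiB / 4 KiB | Adapts to the chunk layout of the same-path file in the previous backup, limiting metadata growth while keeping some 4 KiB reuse |

 ### fixed_4k
@@ -72,6 +73,13 @@ title: 'Fixed-Size Chunking'

 For purely append-written files, `fixed_128k` is a reasonable alternative to CDC

+### fixed_1m
+
+The 1 MiB chunk size further reduces metadata overhead relative to `fixed_128k`, at the cost of coarser deduplication granularity.
+It suits very large append-only files where fine-grained deduplication is not much needed
+
+In most cases, `fixed_128k` or the CDC variants are recommended first. Consider `fixed_1m` only when the file is extremely large and the write pattern is strictly append-only
+
 ### fixed_auto

 `fixed_auto` walks the file in 128 KiB windows.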

prime_backup/types/chunk_method.py

Lines changed: 3 additions & 1 deletion
@@ -11,12 +11,14 @@
 class ChunkMethod(enum.Enum):
     # Content-Defined Chunking with FastCDC
     fastcdc_32k = FastCDCChunkerDefinition(avg_size=32 * 1024, min_size=8 * 1024, max_size=256 * 1024)
-    fastcdc_128k = FastCDCChunkerDefinition(avg_size=128 * 1024, min_size=64 * 1024, max_size=1024 * 1024)
+    fastcdc_128k = FastCDCChunkerDefinition(avg_size=128 * 1024, min_size=32 * 1024, max_size=1024 * 1024)
+    fastcdc_1m = FastCDCChunkerDefinition(avg_size=1024 * 1024, min_size=256 * 1024, max_size=4 * 1024 * 1024)

     # Fixed-Size Chunking
     fixed_4k = FixedSizeChunkerDefinition(4 * 1024)
     fixed_32k = FixedSizeChunkerDefinition(32 * 1024)
     fixed_128k = FixedSizeChunkerDefinition(128 * 1024)
+    fixed_1m = FixedSizeChunkerDefinition(1024 * 1024)
     fixed_auto = FixedAutoChunkerDefinition()

     if TYPE_CHECKING:

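To compare the byte arithmetic in this hunk with the KiB/MiB figures used in the documentation tables, the FastCDC parameters on the new side of the diff can be rendered in the same units. The sketch below is an assumption-laden illustration: `FastCDCParams` and `format_size` are hypothetical stand-ins, not Prime Backup's `FastCDCChunkerDefinition`, and only the numeric values are copied from the hunk.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * KIB


@dataclass(frozen=True)
class FastCDCParams:
    """Hypothetical stand-in holding the three size parameters from the diff."""
    avg_size: int
    min_size: int
    max_size: int


# Values copied from the new side of the chunk_method.py hunk.
FASTCDC_PARAMS = {
    "fastcdc_32k": FastCDCParams(avg_size=32 * 1024, min_size=8 * 1024, max_size=256 * 1024),
    "fastcdc_128k": FastCDCParams(avg_size=128 * 1024, min_size=32 * 1024, max_size=1024 * 1024),
    "fastcdc_1m": FastCDCParams(avg_size=1024 * 1024, min_size=256 * 1024, max_size=4 * 1024 * 1024),
}


def format_size(num_bytes: int) -> str:
    """Render a byte count in the largest binary unit that divides it evenly."""
    if num_bytes % MIB == 0:
        return f"{num_bytes // MIB} MiB"
    if num_bytes % KIB == 0:
        return f"{num_bytes // KIB} KiB"
    return f"{num_bytes} B"


def table_row(name: str, params: FastCDCParams) -> str:
    """Format one row in the column order used by the documentation table."""
    sizes = (params.avg_size, params.min_size, params.max_size)
    return f"| `{name}` | " + " | ".join(format_size(s) for s in sizes) + " |"


if __name__ == "__main__":
    for name, params in FASTCDC_PARAMS.items():
        # Sanity check: a chunker definition should satisfy min < avg < max.
        assert params.min_size < params.avg_size < params.max_size
        print(table_row(name, params))
```

The printed rows use the column order average, minimum, maximum, so they can be read side by side with the rows in `docs/chunking/cdc.md`.
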