Add fixed_1m and fastcdc_1m chunk methods, adjust fastcdc_128k min size
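
The docs in this commit describe the larger chunk sizes as a trade of deduplication granularity for lower metadata overhead. A rough sketch of that trade-off follows, assuming metadata cost scales with the number of chunk records. `CHUNK_SIZES` and `estimate_chunk_count` are hypothetical names used only for illustration and are not part of Prime Backup. For the FastCDC methods the value is the average chunk size, so the result is only an estimate.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Average (FastCDC) or exact (fixed-size) chunk sizes, taken from the docs tables
# touched by this commit. The dict itself is illustrative, not a project API.
CHUNK_SIZES = {
    'fastcdc_32k': 32 * KIB,
    'fastcdc_128k': 128 * KIB,
    'fastcdc_1m': 1 * MIB,
    'fixed_128k': 128 * KIB,
    'fixed_1m': 1 * MIB,
}


def estimate_chunk_count(file_size: int, chunk_size: int) -> int:
    """Approximate number of chunk records for a file (hypothetical helper).

    Uses integer ceiling division so a trailing partial chunk is counted.
    """
    return (file_size + chunk_size - 1) // chunk_size


if __name__ == '__main__':
    for name, size in CHUNK_SIZES.items():
        count = estimate_chunk_count(10 * GIB, size)
        print(f'{name}: about {count} chunks for a 10 GiB file')
```

Under this assumption, moving a 10 GiB file from a 128 KiB to a 1 MiB chunk size cuts the number of chunk records by a factor of eight, while any change inside a chunk then invalidates up to 1 MiB instead of 128 KiB.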

Fallen-Breath · Fallen-Breath · commit 126cd0e3c8f9 · 2026-05-14T02:26:45.000+08:00
diff --git a/docs/chunking/cdc.md b/docs/chunking/cdc.md
@@ -38,9 +38,11 @@ that delivers near-native chunking throughput
 |----------------|----------------|----------------|----------------|
 | `fastcdc_32k`  | 32 KiB         | 8 KiB          | 256 KiB        |
 | `fastcdc_128k` | 128 KiB        | 64 KiB         | 1 MiB          |
+| `fastcdc_1m`   | 1 MiB          | 512 KiB        | 8 MiB          |
 
 `fastcdc_32k` is the default and works well for most use cases
 `fastcdc_128k` uses a coarser granularity and is better suited for very large files (10 GiB or more) where the per-chunk metadata overhead of `fastcdc_32k` becomes noticeable
+`fastcdc_1m` uses an even coarser granularity with a 1 MiB average chunk size, minimizing metadata overhead for extremely large files at the cost of deduplication granularity
 
 All three algorithms use FastCDC with normalized chunking and a fixed seed (`0`) for reproducibility
 
diff --git a/docs/chunking/cdc.zh.md b/docs/chunking/cdc.zh.md
@@ -34,13 +34,15 @@ Prime Backup 使用 [`pyfastcdc`](https://github.com/Fallen-Breath/pyfastcdc)，
 
 ## 可用算法
 
-| 算法             | 平均块大小   | 最小块大小  | 最大块大小   |
-|----------------|---------|--------|---------|
-| `fastcdc_32k`  | 32 KiB  | 8 KiB  | 256 KiB |
-| `fastcdc_128k` | 128 KiB | 64 KiB | 1 MiB   |
+| 算法             | 平均块大小   | 最小块大小   | 最大块大小   |
+|----------------|---------|---------|---------|
+| `fastcdc_32k`  | 32 KiB  | 8 KiB   | 256 KiB |
+| `fastcdc_128k` | 128 KiB | 64 KiB  | 1 MiB   |
+| `fastcdc_1m`   | 1 MiB   | 512 KiB | 8 MiB   |
 
 `fastcdc_32k` 是默认选项，适合大多数使用场景。
 `fastcdc_128k` 采用更粗的粒度，更适合超大型文件（10 GiB 以上），可减少 `fastcdc_32k` 粒度下每条数据块记录带来的相对元数据开销
+`fastcdc_1m` 采用更粗的 1 MiB 平均块大小，进一步降低极大型文件的元数据开销，但代价是去重粒度更低
 
 所有算法均使用 FastCDC，启用了 normalized chunking，并固定种子（`0`）以保证可重现性
 
diff --git a/docs/chunking/fixed_size.md b/docs/chunking/fixed_size.md
@@ -45,6 +45,7 @@ and the rest of the chunks are identical to those already stored
 | `fixed_4k`   | 4 KiB           | Minecraft region files (`.mca`): each region file is organized in 4 KiB pages, so changes in one chunk only invalidate that 4 KiB page      |
 | `fixed_32k`  | 32 KiB          | General intermediate granularity                                                                                                            |
 | `fixed_128k` | 128 KiB         | Append-only files: growth at the tail only creates new trailing chunks, leaving all previous chunks intact                                  |
+| `fixed_1m`   | 1 MiB           | Very large append-only files: even lower metadata overhead than `fixed_128k`, useful when fine-grained deduplication is not required        |
 | `fixed_auto` | 128 KiB / 4 KiB | Adaptive fixed-size strategy that uses the previous backup's same-path chunk layout to limit metadata growth while keeping some 4 KiB reuse |
 
 ### fixed_4k
@@ -71,6 +72,13 @@ When new data is appended, only the trailing chunks change; all preceding chunks
 
 This makes `fixed_128k` a reasonable alternative to CDC for pure append-write files
 
+### fixed_1m
+
+The 1 MiB chunk size further reduces metadata overhead compared to `fixed_128k`, at the cost of coarser deduplication granularity.
+It is suitable for extremely large append-only files where even the per-chunk metadata overhead of `fixed_128k` becomes a concern
+
+For most use cases, `fixed_128k` or CDC variants are preferred. Consider `fixed_1m` only when the file is very large and write patterns are exclusively append-only
+
 ### fixed_auto
 
 `fixed_auto` walks the file in 128 KiB windows.
diff --git a/docs/chunking/fixed_size.zh.md b/docs/chunking/fixed_size.zh.md
@@ -46,6 +46,7 @@ title: '固定大小分块'
 | `fixed_4k`   | 4 KiB           | Minecraft region 文件（`.mca`）：region 文件以 4 KiB 页为内部组织单位，修改少量游戏区块只会脏化有限的 4 KiB 页 |
 | `fixed_32k`  | 32 KiB          | 一般性的中等粒度场景                                                                    |
 | `fixed_128k` | 128 KiB         | 追加写文件：尾部追加的数据只会产生新的末尾数据块，之前的所有数据块保持不变                                         |
+| `fixed_1m`   | 1 MiB           | 超大型追加写文件：比 `fixed_128k` 更低的元数据开销，适用于不需要细粒度去重的场景                               |
 | `fixed_auto` | 128 KiB / 4 KiB | 根据上一次备份中同路径文件的分块布局自适应，在控制元数据增长的同时保留部分 4 KiB 复用能力                              |
 
 ### fixed_4k
@@ -72,6 +73,13 @@ title: '固定大小分块'
 
 对于纯追加写入的文件，`fixed_128k` 是 CDC 的一个合理替代选项
 
+### fixed_1m
+
+1 MiB 的块大小在 `fixed_128k` 的基础上进一步降低了元数据开销，代价是去重粒度更粗。
+适用于超大型追加写文件，且对细粒度去重需求不高的场景
+
+大多数情况下，推荐优先使用 `fixed_128k` 或 CDC 变体。仅在文件体积极大且写入模式严格为追加写时才考虑 `fixed_1m`
+
 ### fixed_auto
 
 `fixed_auto` 会按 128 KiB 窗口遍历文件。
diff --git a/prime_backup/types/chunk_method.py b/prime_backup/types/chunk_method.py
@@ -11,12 +11,14 @@
 class ChunkMethod(enum.Enum):
 	# Content-Defined Chunking with FastCDC
 	fastcdc_32k = FastCDCChunkerDefinition(avg_size=32 * 1024, min_size=8 * 1024, max_size=256 * 1024)
-	fastcdc_128k = FastCDCChunkerDefinition(avg_size=128 * 1024, min_size=64 * 1024, max_size=1024 * 1024)
+	fastcdc_128k = FastCDCChunkerDefinition(avg_size=128 * 1024, min_size=32 * 1024, max_size=1024 * 1024)
+	fastcdc_1m = FastCDCChunkerDefinition(avg_size=1024 * 1024, min_size=512 * 1024, max_size=8 * 1024 * 1024)
 
 	# Fixed-Size Chunking
 	fixed_4k = FixedSizeChunkerDefinition(4 * 1024)
 	fixed_32k = FixedSizeChunkerDefinition(32 * 1024)
 	fixed_128k = FixedSizeChunkerDefinition(128 * 1024)
+	fixed_1m = FixedSizeChunkerDefinition(1024 * 1024)
 	fixed_auto = FixedAutoChunkerDefinition()
 
 	if TYPE_CHECKING: