| 1 | +--- |
| 2 | +title: 'Fixed-Size Chunking' |
| 3 | +--- |
| 4 | + |
| 5 | +!!! warning "Alpha Status" |
| 6 | + |
| 7 | + Fixed-size chunking is currently a proof of concept and is in alpha status. |
| 8 | + It is not recommended for production use. |
| 9 | + Use [CDC Chunking](chunking_cdc.md) instead unless you have a specific reason to use fixed-size chunking. |
| 10 | + |
| 11 | +Fixed-size chunking splits files at predictable byte-offset boundaries, with every chunk being exactly the configured size |
| 12 | +(the last chunk may be smaller if the file size is not a multiple of the chunk size). |
| 13 | + |
| 14 | +## What Fixed-Size Chunking Is |
| 15 | + |
| 16 | +Fixed-size chunking is conceptually simple: the file is divided into equal-sized pieces from start to end. |
| 17 | +Each piece is hashed and stored independently, just like CDC chunks. |
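
The splitting step described above can be sketched in a few lines of Python. This is only an illustration, not Prime Backup's actual implementation: the function names `split_fixed` and `chunk_hashes` are hypothetical, and SHA-256 is used as a stand-in for whatever hash the backup is configured with.

```python
import hashlib
from typing import Iterator, List


def split_fixed(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive chunk_size-byte pieces; the last one may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def chunk_hashes(data: bytes, chunk_size: int) -> List[str]:
    # SHA-256 is an assumption for this sketch, not necessarily the real hash.
    return [hashlib.sha256(c).hexdigest() for c in split_fixed(data, chunk_size)]


data = bytes(10_000)  # 10,000 zero bytes, not a multiple of 4096
sizes = [len(c) for c in split_fixed(data, 4096)]
hashes = chunk_hashes(data, 4096)
print(sizes)  # [4096, 4096, 1808]
```

The two full chunks have identical content and therefore identical hashes, so a deduplicating store would keep only one copy of them.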
| 18 | + |
| 19 | +Unlike CDC, chunk boundaries sit at fixed byte offsets and do not move with the content. |
| 20 | +An in-place overwrite changes only the hashes of the chunks it touches, but an insertion or deletion shifts all subsequent data across the boundaries, |
| 21 | +potentially invalidating a large number of previously stored chunks. |
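
A small experiment can make the difference between an in-place edit and an insertion concrete. The sketch below is hypothetical (the helper names and the SHA-256 stand-in hash are assumptions, not Prime Backup code). It counts how many chunks of a modified file already exist among the chunks of the original.

```python
import hashlib
from typing import List


def chunk_hashes(data: bytes, chunk_size: int) -> List[str]:
    # SHA-256 is a stand-in hash chosen for this sketch.
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def count_reused(old: bytes, new: bytes, chunk_size: int) -> int:
    """Count chunks of `new` whose hash already exists among chunks of `old`."""
    known = set(chunk_hashes(old, chunk_size))
    return sum(1 for h in chunk_hashes(new, chunk_size) if h in known)


CHUNK = 4096
# Eight chunks, each filled with a different byte value.
original = b"".join(bytes([i]) * CHUNK for i in range(8))

# Overwrite one byte in place: the file length is unchanged.
overwritten = original[:5000] + b"\xff" + original[5001:]
# Insert one byte at the same offset: everything after it shifts by one.
inserted = original[:5000] + b"\xff" + original[5000:]

reused_after_overwrite = count_reused(original, overwritten, CHUNK)
reused_after_insert = count_reused(original, inserted, CHUNK)
print(reused_after_overwrite)  # 7 of 8 chunks reused
print(reused_after_insert)     # 1 of 9 chunks reused
```

With the overwrite, only the chunk containing offset 5000 changes. With the insertion, only the chunk before the insertion point survives.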
| 22 | + |
| 23 | +This means fixed-size chunking is generally inferior to CDC for files with arbitrary edits. |
| 24 | +Its benefit is only realized in scenarios where the file's write pattern is well-aligned to chunk boundaries. |
| 25 | + |
| 26 | +For example, with `fixed_4k` applied to a Minecraft region file: |
| 27 | + |
| 28 | +``` |
| 29 | ++----------------------------------------------------------------------+ |
| 30 | +| file (e.g. r.0.0.mca) | |
| 31 | ++------+------+------+------+------+------+------+------+------+-- - --+ |
| 32 | +| 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | ... | |
| 33 | +| c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | | |
| 34 | ++------+------+------+------+------+------+------+------+------+-- - --+ |
| 35 | +``` |
| 36 | + |
| 37 | +Each 4 KiB chunk corresponds to one internal page of the region file. |
| 38 | +When only a few game chunks change between backups, only the corresponding pages are dirtied, |
| 39 | +and the rest of the chunks are identical to those already stored. |
| 40 | + |
| 41 | +## Available Algorithms |
| 42 | + |
| 43 | +| Algorithm | Chunk Size | Typical Use Case | |
| 44 | +|--------------|------------|----------------------------------------------------------------------------------------------------------------------------------------| |
| 45 | +| `fixed_4k` | 4 KiB | Minecraft region files (`.mca`): each region file is organized in 4 KiB pages, so changes in one chunk only invalidate that 4 KiB page | |
| 46 | +| `fixed_32k` | 32 KiB | General intermediate granularity | |
| 47 | +| `fixed_128k` | 128 KiB | Append-only files: growth at the tail only creates new trailing chunks, leaving all previous chunks intact | |
| 48 | + |
| 49 | +### fixed_4k |
| 50 | + |
| 51 | +The 4 KiB chunk size aligns with the internal page structure of Minecraft's Anvil region files (`.mca`). |
| 52 | +In theory, modifying a small number of chunks in the game only dirties a limited number of 4 KiB pages, |
| 53 | +making `fixed_4k` capable of the finest-grained deduplication for region files. |
| 54 | + |
| 55 | +However, `fixed_4k` has serious practical drawbacks: |
| 56 | + |
| 57 | +- extremely high metadata overhead: a 1 GiB file requires 262,144 chunk records |
| 58 | +- poor I/O performance: each chunk requires a separate read-write cycle during backup |
| 59 | + |
| 60 | +Unless the file is very large and only a tiny number of pages change per backup, `fixed_4k` is unlikely to be worth the cost. |
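
The record count follows directly from dividing the file size by the chunk size and rounding up. As a rough sketch (the `chunk_count` helper is hypothetical and only the arithmetic matters), the three chunk sizes compare as follows for a 1 GiB file:

```python
KIB = 1024
GIB = 1024 ** 3


def chunk_count(file_size: int, chunk_size: int) -> int:
    """Number of fixed-size chunks needed, counting a partial last chunk."""
    return (file_size + chunk_size - 1) // chunk_size


counts = {
    "fixed_4k": chunk_count(GIB, 4 * KIB),
    "fixed_32k": chunk_count(GIB, 32 * KIB),
    "fixed_128k": chunk_count(GIB, 128 * KIB),
}
for name, n in counts.items():
    print(f"{name}: {n} chunks")
# fixed_4k: 262144 chunks
# fixed_32k: 32768 chunks
# fixed_128k: 8192 chunks
```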
| 61 | + |
| 62 | +### fixed_32k |
| 63 | + |
| 64 | +A middle-ground option. Metadata overhead is 8× lower than `fixed_4k`, but granularity is also much coarser. |
| 65 | + |
| 66 | +### fixed_128k |
| 67 | + |
| 68 | +The 128 KiB chunk size is well-suited for files that grow by appending data at the end. |
| 69 | +When new data is appended, only the trailing chunks change; all preceding chunks retain the same hash and are reused. |
| 70 | + |
| 71 | +This makes `fixed_128k` a reasonable alternative to CDC for pure append-write files. |
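
The append-only behaviour can be checked with a short sketch. The helper and the SHA-256 stand-in hash are assumptions made for illustration, not Prime Backup internals. The example grows a file of 2.5 chunks by 100 KiB and compares chunk hashes position by position.

```python
import hashlib
from typing import List


def chunk_hashes(data: bytes, chunk_size: int) -> List[str]:
    # SHA-256 is a stand-in hash chosen for this sketch.
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


CHUNK = 128 * 1024
# Two full chunks plus half a chunk, each region filled with a different byte.
old = bytes([0]) * CHUNK + bytes([1]) * CHUNK + bytes([2]) * (CHUNK // 2)
# Append 100 KiB at the tail.
new = old + bytes([9]) * (100 * 1024)

old_hashes = chunk_hashes(old, CHUNK)
new_hashes = chunk_hashes(new, CHUNK)

# Chunks at the same position with the same hash can be reused as-is.
unchanged = sum(1 for a, b in zip(old_hashes, new_hashes) if a == b)
print(len(old_hashes), len(new_hashes), unchanged)  # 3 4 2
```

Both full chunks are reused. The partial last chunk of the old file is rewritten because the appended data fills it up, and one new trailing chunk is created.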
| 72 | + |
| 73 | +## Poor Candidates |
| 74 | + |
| 75 | +Fixed-size chunking is a poor choice for: |
| 76 | + |
| 77 | +- files that are frequently modified in the middle or beginning (insertion/deletion shifts all subsequent chunks) |
| 78 | +- files with completely unpredictable byte-level change patterns |
| 79 | +- files where the chunk size does not align with any meaningful internal structure |
| 80 | + |
| 81 | +## No Extra Dependencies |
| 82 | + |
| 83 | +Fixed-size chunking has no additional Python dependency requirements. |
| 84 | +It is available as long as Prime Backup is installed. |
| 85 | + |