
Commit a214fc1

Add fixed_auto chunk method
1 parent 0476b60 commit a214fc1

13 files changed

Lines changed: 313 additions & 47 deletions

docs/chunking/fixed_size.md

Lines changed: 17 additions & 5 deletions
@@ -40,11 +40,12 @@ and the rest of the chunks are identical to those already stored
 
 ## Available Algorithms
 
-| Algorithm | Chunk Size | Typical Use Case |
-|--------------|------------|----------------------------------------------------------------------------------------------------------------------------------------|
-| `fixed_4k` | 4 KiB | Minecraft region files (`.mca`): each region file is organized in 4 KiB pages, so changes in one chunk only invalidate that 4 KiB page |
-| `fixed_32k` | 32 KiB | General intermediate granularity |
-| `fixed_128k` | 128 KiB | Append-only files: growth at the tail only creates new trailing chunks, leaving all previous chunks intact |
+| Algorithm | Chunk Size | Typical Use Case |
+|--------------|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------|
+| `fixed_4k` | 4 KiB | Minecraft region files (`.mca`): each region file is organized in 4 KiB pages, so changes in one chunk only invalidate that 4 KiB page |
+| `fixed_32k` | 32 KiB | General intermediate granularity |
+| `fixed_128k` | 128 KiB | Append-only files: growth at the tail only creates new trailing chunks, leaving all previous chunks intact |
+| `fixed_auto` | 128 KiB / 4 KiB | Adaptive fixed-size strategy that uses the previous backup's same-path chunk layout to limit metadata growth while keeping some 4 KiB reuse |
 
 ### fixed_4k
 
@@ -70,6 +71,17 @@ When new data is appended, only the trailing chunks change; all preceding chunks
 
 This makes `fixed_128k` a reasonable alternative to CDC for pure append-write files
 
+### fixed_auto
+
+`fixed_auto` walks the file in 128 KiB windows.
+For each full window, it checks the previous backup's same-path chunk layout at the same offset:
+
+- if the previous window was one 128 KiB chunk and the current content is unchanged, it keeps one 128 KiB chunk
+- if the previous window was one 128 KiB chunk and the current content changed, it stores the current window as thirty-two 4 KiB chunks
+- if the previous window was thirty-two 4 KiB chunks, it compares the 4 KiB hashes first; when none changed, it stores one 128 KiB chunk, otherwise it keeps thirty-two 4 KiB chunks
+
+Missing previous data, direct blobs, irregular previous layouts, and incomplete tail windows are stored as one chunk for that window.
+
 ## Poor Candidates
 
 Fixed-size chunking is a poor choice for:
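
Reviewer note: the per-window rules documented for `fixed_auto` above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the implementation from this commit: `PrevWindow`, `choose_layout`, and the use of SHA-256 are hypothetical stand-ins, and the caller is assumed to pass `None` when the previous file is missing or was stored as a direct blob.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

WINDOW = 128 * 1024                 # 128 KiB window
PAGE = 4 * 1024                     # 4 KiB chunk
PAGES_PER_WINDOW = WINDOW // PAGE   # 32


def sha256_hex(data: bytes) -> str:
    # Assumption: the real hash method may differ; SHA-256 is used only for illustration.
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class PrevWindow:
    """Hypothetical view of the previous backup's chunks for one window, in offset order."""
    hashes: List[str]
    lengths: List[int]


def choose_layout(window: bytes, prev: Optional[PrevWindow]) -> List[int]:
    """Return the chunk lengths to store for the current window."""
    single = [len(window)]
    # Missing previous data (or a direct blob) and incomplete tail windows: one chunk.
    if prev is None or len(window) != WINDOW:
        return single

    split = [PAGE] * PAGES_PER_WINDOW
    if prev.lengths == [WINDOW]:
        # Previous window was one 128 KiB chunk: keep it if unchanged, otherwise split.
        return single if prev.hashes[0] == sha256_hex(window) else split

    if prev.lengths == split:
        # Previous window was thirty-two 4 KiB chunks: compare the page hashes first.
        pages = [window[i:i + PAGE] for i in range(0, WINDOW, PAGE)]
        changed = sum(1 for h, p in zip(prev.hashes, pages) if h != sha256_hex(p))
        return single if changed == 0 else split

    # Irregular previous layout: one chunk.
    return single
```

Under these assumptions, a window that was split on one backup and stays untouched on the next is merged back into a single 128 KiB chunk, which is how the strategy keeps the number of chunk records from growing without bound.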

docs/chunking/fixed_size.zh.md

Lines changed: 17 additions & 5 deletions
@@ -41,11 +41,12 @@ title: 'Fixed-Size Chunking'
 
 ## Available Algorithms
 
-| Algorithm | Chunk Size | Typical Use Case |
-|--------------|---------|-------------------------------------------------------------------------------|
-| `fixed_4k` | 4 KiB | Minecraft region files (`.mca`): region files are organized internally in 4 KiB pages, so modifying a few game chunks only dirties a limited number of 4 KiB pages |
-| `fixed_32k` | 32 KiB | General medium-granularity scenarios |
-| `fixed_128k` | 128 KiB | Append-write files: data appended at the tail only produces new trailing chunks, while all earlier chunks stay unchanged |
+| Algorithm | Chunk Size | Typical Use Case |
+|--------------|-----------------|-------------------------------------------------------------------------------|
+| `fixed_4k` | 4 KiB | Minecraft region files (`.mca`): region files are organized internally in 4 KiB pages, so modifying a few game chunks only dirties a limited number of 4 KiB pages |
+| `fixed_32k` | 32 KiB | General medium-granularity scenarios |
+| `fixed_128k` | 128 KiB | Append-write files: data appended at the tail only produces new trailing chunks, while all earlier chunks stay unchanged |
+| `fixed_auto` | 128 KiB / 4 KiB | Adapts to the chunk layout of the same-path file in the previous backup, keeping some 4 KiB reuse while limiting metadata growth |
 
 ### fixed_4k
 
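@@ -71,6 +72,17 @@ title: 'Fixed-Size Chunking'
 
 For pure append-write files, `fixed_128k` is a reasonable alternative to CDC
 
+### fixed_auto
+
+`fixed_auto` walks the file in 128 KiB windows.
+For each full window, it checks the chunk layout of the same-path file in the previous backup at the same offset:
+
+- If the previous window was one 128 KiB chunk and the current content is unchanged, it keeps using one 128 KiB chunk
+- If the previous window was one 128 KiB chunk but the current content has changed, it splits the current window into 32 chunks of 4 KiB
+- If the previous window was 32 chunks of 4 KiB, it compares the 4 KiB hashes first; when the number of changes is 0, it stores one 128 KiB chunk, otherwise it keeps using 32 chunks of 4 KiB
+
+When the previous data is missing, the previous version is a direct blob, the previous layout is irregular, or the current window is an incomplete tail, the window is stored as a single chunk.
+
 ## Poor Candidates
 
 Fixed-size chunking usually performs poorly in the following cases: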

docs/chunking/index.md

Lines changed: 11 additions & 2 deletions
@@ -35,9 +35,17 @@ The default configuration is:
     "chunking_rules": [
         {
             "algorithm": "fastcdc_32k",
-            "file_size_threshold": 104857600,
+            "file_size_threshold": 20971520,
             "patterns": [
-                "**/*.db"
+                "**/*.db",
+                "**/*.log"
+            ]
+        },
+        {
+            "algorithm": "fixed_auto",
+            "file_size_threshold": 262144,
+            "patterns": [
+                "**/*.mca"
             ]
         }
     ]
@@ -105,6 +113,7 @@ The benefit becomes apparent on subsequent backups where many chunks can be reus
 | `fixed_4k` | Fixed | 4 KiB | MC region files (matches 4 KiB page boundaries); note: causes severe metadata bloat |
 | `fixed_32k` | Fixed | 32 KiB | medium fixed-size use cases |
 | `fixed_128k` | Fixed | 128 KiB | append-write files with predictable end-growth |
+| `fixed_auto` | Fixed | 128 KiB / 4 KiB | adaptive fixed-size chunks based on the previous same-path backup |
 
 See the detailed pages for each approach:
 
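
Reviewer note: the "metadata bloat" trade-off in the table above comes down to how many chunk records each strategy produces. The arithmetic below is a rough sketch; the 8 MiB file size and the helper names `chunk_count` and `fixed_auto_chunk_count` are hypothetical, and the `fixed_auto` figure assumes that only changed full windows are stored as 4 KiB chunks.

```python
KIB = 1024
WINDOW = 128 * KIB
PAGE = 4 * KIB


def chunk_count(file_size: int, chunk_size: int) -> int:
    """Number of fixed-size chunks needed to cover the file (ceiling division)."""
    return -(-file_size // chunk_size)


def fixed_auto_chunk_count(file_size: int, dirty_full_windows: int) -> int:
    """Chunk records when `dirty_full_windows` full windows are split into 4 KiB chunks."""
    full_windows = file_size // WINDOW
    tail = 1 if file_size % WINDOW else 0  # an incomplete tail window is one chunk
    if dirty_full_windows > full_windows:
        raise ValueError("more dirty windows than full windows")
    clean = full_windows - dirty_full_windows
    return clean + dirty_full_windows * (WINDOW // PAGE) + tail


size = 8 * 1024 * KIB  # hypothetical 8 MiB region file
print(chunk_count(size, PAGE))          # fixed_4k
print(chunk_count(size, WINDOW))        # fixed_128k
print(fixed_auto_chunk_count(size, 3))  # fixed_auto with 3 changed windows
```

For this hypothetical file, `fixed_4k` needs 2048 chunk records and `fixed_128k` needs 64, while the adaptive layout with three changed windows lands at 157.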

docs/chunking/index.zh.md

Lines changed: 11 additions & 2 deletions
@@ -35,9 +35,17 @@ title: 'File Chunking'
     "chunking_rules": [
         {
             "algorithm": "fastcdc_32k",
-            "file_size_threshold": 104857600,
+            "file_size_threshold": 20971520,
             "patterns": [
-                "**/*.db"
+                "**/*.db",
+                "**/*.log"
+            ]
+        },
+        {
+            "algorithm": "fixed_auto",
+            "file_size_threshold": 262144,
+            "patterns": [
+                "**/*.mca"
             ]
         }
     ]
@@ -105,6 +113,7 @@ Prime Backup still creates one data object (blob) record for the whole file, but
 | `fixed_4k` | Fixed | 4 KiB | MC region files (aligned with 4 KiB page boundaries); note: causes severe metadata bloat |
 | `fixed_32k` | Fixed | 32 KiB | Medium-granularity fixed-size scenarios |
 | `fixed_128k` | Fixed | 128 KiB | Files that are mainly append-written |
+| `fixed_auto` | Fixed | 128 KiB / 4 KiB | Adaptive chunking based on the previous same-path backup |
 
 See the separate pages for details on each approach:
 

docs/config.md

Lines changed: 13 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -229,9 +229,17 @@ Configs on how the backup is made
229229
"chunking_rules": [
230230
{
231231
"algorithm": "fastcdc_32k",
232-
"file_size_threshold": 104857600,
232+
"file_size_threshold": 20971520,
233233
"patterns": [
234-
"**/*.db"
234+
"**/*.db",
235+
"**/*.log"
236+
]
237+
},
238+
{
239+
"algorithm": "fixed_auto",
240+
"file_size_threshold": 262144,
241+
"patterns": [
242+
"**/*.mca"
235243
]
236244
}
237245
],
@@ -478,6 +486,7 @@ Each rule contains the following fields:
478486
| `fixed_4k` | Fixed (alpha) | 4 KiB chunks; aligns with MC region file pages, but causes heavy metadata overhead |
479487
| `fixed_32k` | Fixed (alpha) | 32 KiB chunks; intermediate fixed-size option |
480488
| `fixed_128k` | Fixed (alpha) | 128 KiB chunks; well-suited for append-write files |
489+
| `fixed_auto` | Fixed (alpha) | Adaptive 128 KiB / 4 KiB chunks based on the previous backup's same-path chunk layout |
481490

482491
CDC algorithms determine chunk boundaries from file content, so local insertions, deletions, or in-place edits leave many chunks unchanged for reuse.
483492
See [CDC Chunking](chunking/chunking_cdc.md) for details.
@@ -487,7 +496,7 @@ Each rule contains the following fields:
487496

488497
!!! warning
489498

490-
Fixed-size algorithms (`fixed_4k`, `fixed_32k`, `fixed_128k`) are in alpha status and not recommended for production use.
499+
Fixed-size algorithms (`fixed_4k`, `fixed_32k`, `fixed_128k`, `fixed_auto`) are in alpha status and not recommended for production use.
491500

492501
!!! note
493502

@@ -502,7 +511,7 @@ Each rule contains the following fields:
502511
- `patterns`: A list of [gitignore flavor](http://git-scm.com/docs/gitignore) pattern strings,
503512
matched against file paths relative to [source_root](#source_root)
504513

505-
The default value contains one rule that applies `fastcdc_32k` CDC chunking to `.db` files larger than 100 MiB.
514+
The default value contains two rules: `fastcdc_32k` CDC chunking for `.db` and `.log` files larger than 20 MiB, and `fixed_auto` chunking for `.mca` files larger than 256 KiB.
506515
It is recommended to keep the rules narrow and only cover large files that are often modified locally and really need to be backed up
507516

508517
Changing this option only affects files newly stored in future backups.
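
Reviewer note: the new defaults can be read as "first rule whose size threshold and pattern both match". The sketch below illustrates that reading only. `pick_algorithm` is a hypothetical helper, the strict `>` comparison and first-match-wins order are assumptions, and the basename `fnmatch` test is a simplification that is only valid for patterns of the form `**/<glob>`; it is not a gitignore-flavor matcher.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Dict, List, Optional

# The two default rules introduced by this commit.
DEFAULT_RULES: List[Dict] = [
    {"algorithm": "fastcdc_32k", "file_size_threshold": 20971520, "patterns": ["**/*.db", "**/*.log"]},
    {"algorithm": "fixed_auto", "file_size_threshold": 262144, "patterns": ["**/*.mca"]},
]


def pick_algorithm(rel_path: str, size: int, rules: Optional[List[Dict]] = None) -> Optional[str]:
    """Return the algorithm of the first matching rule, or None if the file is not chunked."""
    name = PurePosixPath(rel_path).name
    for rule in (DEFAULT_RULES if rules is None else rules):
        # Assumption: "larger than" in the docs means a strict comparison.
        if size <= rule["file_size_threshold"]:
            continue
        # Simplification: compare the basename against the part after the last "/".
        if any(fnmatchcase(name, pattern.rsplit("/", 1)[-1]) for pattern in rule["patterns"]):
            return rule["algorithm"]
    return None


# The byte thresholds correspond to the sizes named in the prose.
assert 20 * 1024 * 1024 == 20971520   # 20 MiB
assert 256 * 1024 == 262144           # 256 KiB
```

For example, `pick_algorithm("world/region/r.0.0.mca", 300 * 1024)` selects `fixed_auto` under these assumptions, while a 1 KiB log file matches no rule.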

docs/config.zh.md

Lines changed: 13 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -229,9 +229,17 @@ Prime Backup 在创建备份时的操作时序如下:
229229
"chunking_rules": [
230230
{
231231
"algorithm": "fastcdc_32k",
232-
"file_size_threshold": 104857600,
232+
"file_size_threshold": 20971520,
233233
"patterns": [
234-
"**/*.db"
234+
"**/*.db",
235+
"**/*.log"
236+
]
237+
},
238+
{
239+
"algorithm": "fixed_auto",
240+
"file_size_threshold": 262144,
241+
"patterns": [
242+
"**/*.mca"
235243
]
236244
}
237245
],
@@ -479,6 +487,7 @@ Prime Backup 会检查文件的如下这些信息。下述这些信息完全一
479487
| `fixed_4k` | 固定大小(alpha) | 4 KiB 数据块;与 MC region 文件的页边界对齐,但元数据开销极大 |
480488
| `fixed_32k` | 固定大小(alpha) | 32 KiB 数据块;中等粒度的固定大小选项 |
481489
| `fixed_128k` | 固定大小(alpha) | 128 KiB 数据块;适合追加写入为主的文件 |
490+
| `fixed_auto` | 固定大小(alpha) | 基于上一次备份中同路径文件的分块布局,在 128 KiB 与 4 KiB 粒度间自适应 |
482491

483492
CDC 算法根据文件内容确定数据块边界,因此局部插入、删除或原地修改不会影响其他数据块的哈希,这些数据块可直接复用。
484493
详见 [CDC 分块](chunking/chunking_cdc.zh.md)
@@ -488,7 +497,7 @@ Prime Backup 会检查文件的如下这些信息。下述这些信息完全一
488497

489498
!!! warning
490499

491-
固定大小算法(`fixed_4k`、`fixed_32k`、`fixed_128k`)处于 alpha 状态,不建议在生产环境中使用
500+
固定大小算法(`fixed_4k`、`fixed_32k`、`fixed_128k`、`fixed_auto`)处于 alpha 状态,不建议在生产环境中使用
492501

493502
!!! note
494503

@@ -503,7 +512,7 @@ Prime Backup 会检查文件的如下这些信息。下述这些信息完全一
503512
- `patterns`:一个 [gitignore 风格](http://git-scm.com/docs/gitignore) 的模板串列表,
504513
匹配对象是相对于 [source_root](#source_root) 的文件路径
505514

506-
默认值中包含一条规则,对大于 100 MiB 的 `.db` 文件启用 `fastcdc_32k` CDC 分块。
515+
默认值中包含两条规则:对大于 20 MiB 的 `.db` `.log` 文件启用 `fastcdc_32k` CDC 分块;对大于 256 KiB 的 `.mca` 文件启用 `fixed_auto` 分块。
507516
建议将规则控制得尽量精确,只包含那些体积大、经常发生局部修改、且确实需要备份的文件
508517

509518
修改此选项只会影响后续备份中新写入的文件。

prime_backup/action/create_backup_action.py

Lines changed: 53 additions & 8 deletions
@@ -18,11 +18,13 @@
 from prime_backup.db import schema
 from prime_backup.db.access import DbAccess
 from prime_backup.db.session import DbSession
-from prime_backup.db.values import FileRole
+from prime_backup.db.values import FileRole, BlobStorageMethod
 from prime_backup.exceptions import UnsupportedFileFormat
 from prime_backup.types.backup_info import BackupInfo
 from prime_backup.types.backup_tags import BackupTags
 from prime_backup.types.blob_info import BlobDeltaSummary
+from prime_backup.types.chunk_method import ChunkMethod
+from prime_backup.types.chunker import PrettyChunk
 from prime_backup.types.operator import Operator
 from prime_backup.types.units import ByteCount
 from prime_backup.utils import sqlalchemy_utils
@@ -60,6 +62,8 @@ class _PreCalculationResult:
     stats: Dict[Path, os.stat_result] = dataclasses.field(default_factory=dict)
     hashes_and_chunks: Dict[Path, BlobPrecalculateResult] = dataclasses.field(default_factory=dict)
     reused_files: Dict[Path, schema.File] = dataclasses.field(default_factory=dict)
+    previous_backup_files: Dict[str, schema.File] = dataclasses.field(default_factory=dict)
+    previous_file_chunks: Dict[Path, List[PrettyChunk]] = dataclasses.field(default_factory=dict)
 
 
 class CreateBackupAction(Action[BackupInfo]):
@@ -154,12 +158,19 @@ def __pre_calculate_stats(self, scan_result: _ScanResult):
         for file_entry in scan_result.all_files:
             stats[file_entry.path] = file_entry.stat
 
-    def __reuse_unchanged_files(self, session: DbSession, scan_result: _ScanResult):
+    def __load_previous_backup_files(self, session: DbSession):
+        previous_backup_files = self.__pre_calc_result.previous_backup_files
+        previous_backup_files.clear()
         with self.__time_costs.measure_time_cost(CreateBackupTimeCostKey.kind_db):
             backup = session.get_last_backup()
         if backup is None:
             return
 
+        with self.__time_costs.measure_time_cost(CreateBackupTimeCostKey.kind_db):
+            for file in session.get_backup_files(backup):
+                previous_backup_files[file.path] = file
+
+    def __reuse_unchanged_files(self, scan_result: _ScanResult):
         @dataclasses.dataclass(frozen=True)
         class StatKey:
             path: str
@@ -169,11 +180,8 @@ class StatKey:
             gid: int
             mtime_ns: int
 
-        with self.__time_costs.measure_time_cost(CreateBackupTimeCostKey.kind_db):
-            backup_files = session.get_backup_files(backup.id)
-
         stat_to_files: Dict[StatKey, schema.File] = {}
-        for file in backup_files:
+        for file in self.__pre_calc_result.previous_backup_files.values():
             if stat.S_ISREG(file.mode):
                 if file.uid is None or file.gid is None or file.mtime is None:
                     raise AssertionError('file {!r} with ISREG mode has missing fields'.format(file))
@@ -200,6 +208,32 @@ class StatKey:
                 if (file_opt := stat_to_files.get(key)) is not None:
                     self.__pre_calc_result.reused_files[file_entry.path] = file_opt
 
+    def __cache_previous_chunks_for_fixed_auto(self, session: DbSession, scan_result: _ScanResult):
+        previous_file_chunks = self.__pre_calc_result.previous_file_chunks
+        previous_file_chunks.clear()
+
+        for file_entry in scan_result.all_files:
+            if not file_entry.is_file() or file_entry.path in self.__pre_calc_result.reused_files:
+                continue
+
+            rel_path = file_entry.path.relative_to(self.__source_path)
+            if ChunkMethod.get_for_file(rel_path, file_entry.stat.st_size) != ChunkMethod.fixed_auto:
+                continue
+
+            previous_file = self.__pre_calc_result.previous_backup_files.get(rel_path.as_posix())
+            if (
+                previous_file is None or
+                previous_file.blob_id is None or
+                previous_file.blob_storage_method != BlobStorageMethod.chunked.value
+            ):
+                continue
+
+            with self.__time_costs.measure_time_cost(CreateBackupTimeCostKey.kind_db):
+                previous_file_chunks[file_entry.path] = [
+                    PrettyChunk(offset=offset_chunk.offset, length=offset_chunk.chunk.raw_size, hash=offset_chunk.chunk.hash)
+                    for offset_chunk in session.get_blob_chunks(previous_file.blob_id)
+                ]
+
     def __pre_calculate_hash_and_chunks(self, session: DbSession, blob_allocator: BlobAllocator, scan_result: _ScanResult):
         hashes_and_chunks = self.__pre_calc_result.hashes_and_chunks
         hashes_and_chunks.clear()
@@ -220,7 +254,12 @@ def __pre_calculate_hash_and_chunks(self, session: DbSession, blob_allocator: Bl
         def hash_worker(pth: Path, pth_size: int):
             rel_path = pth.relative_to(self.__source_path)
             try:
-                result = BlobPrecalculateResult.from_file(pth, rel_path, pth_size)
+                result = BlobPrecalculateResult.from_file(
+                    pth,
+                    rel_path,
+                    pth_size,
+                    previous_chunks=self.__pre_calc_result.previous_file_chunks.get(pth),
+                )
             except BlobPrecalculateResult.SizeMismatched:
                 return # the file keeps changing, so it's not good to create a pre-calc result for it
             hashes_and_chunks[pth] = result
@@ -304,13 +343,17 @@ def __create_backup(self, session_context: ContextManager[DbSession], session: D
         def pre_calc_result_getter(src_path: Path) -> Optional[BlobPrecalculateResult]:
             return self.__pre_calc_result.hashes_and_chunks.pop(src_path, None) # one-time use
 
+        def previous_chunks_getter(src_path: Path) -> Optional[List[PrettyChunk]]:
+            return self.__pre_calc_result.previous_file_chunks.get(src_path)
+
         blob_allocator = BlobAllocator(
             session=session,
             time_costs=self.__time_costs,
             blob_recorder=blob_recorder,
             source_path=self.__source_path,
             temp_path=self.__temp_path,
             pre_calc_result_getter=pre_calc_result_getter,
+            previous_chunks_getter=previous_chunks_getter,
         )
 
         self.logger.info('Scanning file for backup creation at path {!r}, targets: {}'.format(
@@ -334,10 +377,12 @@ def pre_calc_result_getter(src_path: Path) -> Optional[BlobPrecalculateResult]:
         ))
 
         self.__pre_calculate_stats(scan_result)
+        self.__load_previous_backup_files(session)
         if self.config.backup.reuse_stat_unchanged_file:
             with self.__time_costs.measure_time_cost(CreateBackupTimeCostKey.stage_reuse_unchanged_files):
-                self.__reuse_unchanged_files(session, scan_result)
+                self.__reuse_unchanged_files(scan_result)
             self.logger.info('Reused {} / {} stat unchanged files'.format(len(self.__pre_calc_result.reused_files), len(scan_result.all_files)))
+        self.__cache_previous_chunks_for_fixed_auto(session, scan_result)
         if self.config.get_effective_concurrency() > 1:
             with self.__time_costs.measure_time_cost(CreateBackupTimeCostKey.stage_pre_calculate_hash):
                 self.__pre_calculate_hash_and_chunks(session, blob_allocator, scan_result)
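
Reviewer note: the cached `previous_file_chunks` entries are flat lists of `(offset, length, hash)` records, so a consumer has to group them by 128 KiB window before it can tell a regular layout from an irregular one. The sketch below shows one way that grouping could look. It is an assumption-laden illustration: `Chunk` is a local stand-in for `PrettyChunk` (only the three fields visible in this diff are used), and `classify_windows` and its labels are hypothetical and not part of the commit.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple

WINDOW = 128 * 1024
PAGE = 4 * 1024


class Chunk(NamedTuple):
    """Stand-in for PrettyChunk, limited to the fields seen in the diff."""
    offset: int
    length: int
    hash: str


def classify_windows(chunks: List[Chunk]) -> Dict[int, str]:
    """Map each window index to 'single', 'pages', or 'irregular'."""
    by_window: Dict[int, List[Chunk]] = defaultdict(list)
    for chunk in chunks:
        by_window[chunk.offset // WINDOW].append(chunk)

    result: Dict[int, str] = {}
    for index, group in by_window.items():
        group.sort(key=lambda c: c.offset)
        start = index * WINDOW
        offsets = [c.offset for c in group]
        lengths = [c.length for c in group]
        if offsets == [start] and lengths == [WINDOW]:
            result[index] = "single"      # one 128 KiB chunk
        elif lengths == [PAGE] * (WINDOW // PAGE) and offsets == list(range(start, start + WINDOW, PAGE)):
            result[index] = "pages"       # thirty-two aligned 4 KiB chunks
        else:
            result[index] = "irregular"   # tail windows and anything else
    return result
```

A window index that is absent from the result corresponds to the "missing previous data" case in the `fixed_auto` documentation, which stores the current window as one chunk.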
