Commit e995ba4

The universal chunking config: chunking_enabled and chunking_rules
1 parent 1636976 commit e995ba4

5 files changed

Lines changed: 113 additions & 94 deletions

docs/config.md

Lines changed: 38 additions & 32 deletions
@@ -225,10 +225,15 @@ Configs on how the backup is made
     ],
     "mutating_file_patterns": [],

-    "cdc_enabled": false,
-    "cdc_file_size_threshold": 104857600,
-    "cdc_patterns": [
-        "**/*.db"
+    "chunking_enabled": false,
+    "chunking_rules": [
+        {
+            "algorithm": "cdc",
+            "file_size_threshold": 104857600,
+            "patterns": [
+                "**/*.db"
+            ]
+        }
     ],

     "hash_method": "blake3",
@@ -441,52 +446,53 @@ and can speed up the processing of such files during backup creation
 - Type: `List[str]`
 - Default: `[]`

-#### cdc_enabled
+#### chunking_enabled

-Whether to enable content-defined chunking (CDC) for large files during backup creation
+Whether to enable file chunking during backup creation

-CDC stands for `Content-Defined Chunking`.
-Unlike fixed-size chunking, CDC determines chunk boundaries from the file content itself,
-so when data is inserted, deleted, or modified locally, many unchanged regions can still be cut into the same chunks and be reused across backups
+When enabled, Prime Backup iterates through [chunking_rules](#chunking_rules) in order for each file.
+The first rule whose `patterns` match the file path and whose `file_size_threshold` is met will be applied.
+If no rule matches, the file is stored as a regular direct blob without chunking

 Changing this option only affects files newly stored in future backups.
 Existing direct blobs or chunked blobs will not be converted automatically

-!!! note
-
-    CDC chunking requires the optional `pyfastcdc` dependency.
-    You can install all optional dependencies with `pip3 install -r requirements.optional.txt`,
-    or install `pyfastcdc` manually
-
 - Type: `bool`
 - Default: `false`

-#### cdc_file_size_threshold
+#### chunking_rules

-The minimum file size in bytes for a file to be considered for CDC chunking
+A list of chunking rules evaluated in order when [chunking_enabled](#chunking_enabled) is `true`

-Files smaller than this threshold will continue to use the regular direct blob storage flow,
-even if [cdc_enabled](#cdc_enabled) is enabled and the path matches [cdc_patterns](#cdc_patterns)
+For each file, Prime Backup walks through this list and applies the first rule whose `patterns` match the file path and whose `file_size_threshold` is met.
+If no rule matches, the file is stored as a regular direct blob

-Changing this option only affects files newly stored in future backups.
-Existing stored data will not be repartitioned automatically
+Each rule contains the following fields:

-- Type: `int`
-- Default: `104857600` (`100 MiB`)
+- `algorithm`: The chunking algorithm to use. Currently only `"cdc"` is available

-#### cdc_patterns
+    CDC stands for Content-Defined Chunking. Unlike fixed-size chunking, CDC determines chunk boundaries from the file content itself,
+    so when data is inserted, deleted, or modified locally, many unchanged regions can still be cut into the same chunks and be reused across backups

-A list of [gitignore flavor](http://git-scm.com/docs/gitignore) pattern strings,
-matched against file paths relative to [source_root](#source_root)
+    !!! note

-CDC chunking will only be applied when the file path matches one of these patterns,
-the file size reaches [cdc_file_size_threshold](#cdc_file_size_threshold),
-and [cdc_enabled](#cdc_enabled) is enabled
+        CDC chunking requires the optional `pyfastcdc` dependency.
+        You can install all optional dependencies with `pip3 install -r requirements.optional.txt`,
+        or install `pyfastcdc` manually

-The default value is `["**/*.db"]`.
-It is recommended to keep this list narrow and only include large files that are often modified locally and really need to be backed up
+- `file_size_threshold`: The minimum file size in bytes for a file to be eligible for this rule.
+    Files smaller than this value will not match this rule, even if their path matches `patterns`

-- Type: `List[str]`
+- `patterns`: A list of [gitignore flavor](http://git-scm.com/docs/gitignore) pattern strings,
+    matched against file paths relative to [source_root](#source_root)
+
+The default value contains one rule that applies CDC chunking to `.db` files larger than 100 MiB.
+It is recommended to keep the rules narrow and only cover large files that are often modified locally and really need to be backed up
+
+Changing this option only affects files newly stored in future backups.
+Existing stored data will not be repartitioned automatically
+
+- Type: `List[ChunkingRule]`

 #### hash_method
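The first-match behaviour documented above can be sketched as follows. This is a minimal illustration, not Prime Backup's code: `Rule` and `first_matching_rule` are hypothetical names, the second rule is invented for the example, and `fnmatchcase` is only a rough stand-in for the gitignore-flavor matching the docs describe (for instance, it will not match a `.db` file at the top level against `**/*.db`).

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class Rule:
    # Hypothetical stand-in for one entry of chunking_rules; field names follow the docs
    algorithm: str
    file_size_threshold: int
    patterns: List[str] = field(default_factory=list)


def first_matching_rule(rules: List[Rule], enabled: bool, path: str, size: int) -> Optional[Rule]:
    """Return the first rule that accepts the file, or None for a regular direct blob."""
    if not enabled or size <= 0:
        return None
    for rule in rules:
        # Assumption: fnmatchcase approximates gitignore-style matching here
        if size >= rule.file_size_threshold and any(fnmatchcase(path, p) for p in rule.patterns):
            return rule
    return None


rules = [
    Rule('cdc', 100 * 1048576, ['**/*.db']),      # mirrors the documented default rule
    Rule('cdc', 1048576, ['backups/*.sqlite']),   # invented extra rule, for illustration only
]

print(first_matching_rule(rules, True, 'world/data.db', 200 * 1048576))
print(first_matching_rule(rules, True, 'world/data.db', 1024))
print(first_matching_rule(rules, True, 'backups/a.sqlite', 2 * 1048576))
```

A file below every threshold, or one whose path matches no pattern, falls through the loop and gets `None`, which corresponds to the "regular direct blob" case in the docs.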

docs/config.zh.md

Lines changed: 40 additions & 30 deletions
@@ -225,10 +225,15 @@ The timing of operations when Prime Backup creates a backup is as follows:
     ],
     "mutating_file_patterns": [],

-    "cdc_enabled": false,
-    "cdc_file_size_threshold": 104857600,
-    "cdc_patterns": [
-        "**/*.db"
+    "chunking_enabled": false,
+    "chunking_rules": [
+        {
+            "algorithm": "cdc",
+            "file_size_threshold": 104857600,
+            "patterns": [
+                "**/*.db"
+            ]
+        }
     ],

     "hash_method": "blake3",
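@@ -441,50 +446,55 @@ Prime Backup checks the following information of a file. When all of the following information is exactly
 - Type: `List[str]`
 - Default: `[]`

-#### cdc_enabled
+#### chunking_enabled

-Whether to enable content-defined chunking (CDC) for large files when creating a backup
+Whether to enable chunked storage for files when creating a backup

-CDC is short for `Content-Defined Chunking`, a chunking scheme that places boundaries according to content.
-Unlike fixed-size chunking, chunk boundaries are determined by the file content, so when a file only has local insertions, deletions, or modifications, much of the unchanged content is still cut into identical chunks, allowing existing chunks to be reused
+When enabled, Prime Backup walks through the rules in [chunking_rules](#chunking_rules) in order for each file,
+and applies the algorithm specified by the first matching rule to that file.
+If no rule matches, the file is stored as a regular direct blob, without chunking

 Changing this option only affects files newly written in subsequent backups.
 Existing direct blobs or chunked blobs will not be converted automatically

-!!! note
-
-    CDC chunking requires the optional dependency `pyfastcdc`.
-    You can install all optional dependencies via `pip3 install -r requirements.optional.txt`,
-    or install `pyfastcdc` separately
-
 - Type: `bool`
 - Default: `false`

-#### cdc_file_size_threshold
+#### chunking_rules

-The minimum size, in bytes, that a file must reach to take part in CDC chunking.
+A list of chunking rules, matched one by one in order when [chunking_enabled](#chunking_enabled) is `true`

-Files smaller than this threshold, even if [cdc_enabled](#cdc_enabled) is enabled and the path matches [cdc_patterns](#cdc_patterns),
-will still use the regular direct blob storage flow
+For each file, Prime Backup walks through this list in order and applies the first rule whose `patterns` match the file path
+and whose `file_size_threshold` is reached by the file size.
+If no rule matches, the file is stored as a regular direct blob

-Changing this option only affects files newly written in subsequent backups.
-Data already stored will not be re-chunked automatically
+Each rule contains the following fields:

-- Type: `int`
-- Default: `104857600` (`100 MiB`)
+- `algorithm`: The algorithm used for chunking. Currently only `"cdc"` is supported

-#### cdc_patterns
+    CDC is short for Content-Defined Chunking (a chunking scheme that places boundaries according to content).
+    Unlike fixed-size chunking, chunk boundaries are determined by the file content, so when a file only has local insertions, deletions, or modifications,
+    much of the unchanged content is still cut into identical chunks, allowing existing chunks to be reused

-A list of [gitignore flavor](http://git-scm.com/docs/gitignore) pattern strings,
-matched against file paths relative to [source_root](#source_root)
+    !!! note

-CDC chunking is only used when the file path matches these patterns, the file size reaches [cdc_file_size_threshold](#cdc_file_size_threshold),
-and [cdc_enabled](#cdc_enabled) is enabled
+        CDC chunking requires the optional dependency `pyfastcdc`.
+        You can install all optional dependencies via `pip3 install -r requirements.optional.txt`,
+        or install `pyfastcdc` separately

-The default value is `["**/*.db"]`.
-It is recommended to keep it as precise as possible, and only include files that are large, frequently modified locally, and really need to be backed up
+- `file_size_threshold`: The minimum size, in bytes, that a file must reach to be eligible for this rule.
+    Files smaller than this value will not match this rule, even if their path matches `patterns`

-- Type: `List[str]`
+- `patterns`: A list of [gitignore flavor](http://git-scm.com/docs/gitignore) pattern strings,
+    matched against file paths relative to [source_root](#source_root)
+
+The default value contains one rule that enables CDC chunking for `.db` files larger than 100 MiB.
+It is recommended to keep the rules as precise as possible, and only cover files that are large, frequently modified locally, and really need to be backed up
+
+Changing this option only affects files newly written in subsequent backups.
+Data already stored will not be re-chunked automatically
+
+- Type: `List[ChunkingRule]`

 #### hash_method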
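The docs above say that content-defined boundaries let unchanged regions map to the same chunks after a local edit. The toy below illustrates that claim under simplifying assumptions: it uses a small Gear-style rolling hash with a random table, no minimum or maximum chunk size, and it is not how `pyfastcdc` or Prime Backup actually chunk data. All names in it are hypothetical.

```python
import random
from typing import List

HASH_BITS = 32                       # width of the rolling hash state
HASH_MASK = (1 << HASH_BITS) - 1
CUT_MASK = (1 << 8) - 1              # cut when the low 8 bits are zero (about one cut per 256 bytes)

# Random per-byte table, seeded so that runs are repeatable
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(HASH_BITS) for _ in range(256)]


def cdc_chunks(data: bytes) -> List[bytes]:
    """Split data wherever the rolling hash of the most recent bytes hits the cut condition."""
    chunks: List[bytes] = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Each shift pushes older bytes towards the top of the state and eventually out of it,
        # so the cut decision depends only on nearby content, not on the absolute offset
        h = ((h << 1) + GEAR[byte]) & HASH_MASK
        if h & CUT_MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 256) -> List[bytes]:
    """Split data into fixed-size pieces, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def reused(old: List[bytes], new: List[bytes]) -> int:
    """Count chunks of the new version that already exist in the old version."""
    known = set(old)
    return sum(1 for chunk in new if chunk in known)


data_rng = random.Random(1)
base = bytes(data_rng.getrandbits(8) for _ in range(64 * 1024))
edited = b'!' + base   # a one-byte insertion at the very start shifts every later offset

cdc_old, cdc_new = cdc_chunks(base), cdc_chunks(edited)
fix_old, fix_new = fixed_chunks(base), fixed_chunks(edited)
print(f'cdc:   {reused(cdc_old, cdc_new)} of {len(cdc_new)} chunks reused')
print(f'fixed: {reused(fix_old, fix_new)} of {len(fix_new)} chunks reused')
```

Because the hash state forgets bytes older than 32 positions, every chunk of the original that starts at offset 32 or later reappears unchanged in the edited version, whereas fixed-size pieces all shift by one byte. Production chunkers such as FastCDC add size limits and tuned masks on top of this idea.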

prime_backup/config/backup_config.py

Lines changed: 22 additions & 22 deletions
@@ -3,12 +3,24 @@
 from mcdreforged.api.utils import Serializable

 from prime_backup.compressors import CompressMethod
+from prime_backup.types.chunk_method import ChunkMethod
 from prime_backup.types.hash_method import HashMethod

 if TYPE_CHECKING:
     import pathspec


+class ChunkingRule(Serializable):
+    algorithm: ChunkMethod
+    file_size_threshold: int
+    patterns: List[str] = []
+
+    @property
+    def patterns_spec(self) -> 'pathspec.GitIgnoreSpec':
+        from prime_backup.utils import pathspec_utils
+        return pathspec_utils.compile_gitignore_spec(self.patterns)
+
+
 class BackupConfig(Serializable):
     # Source
     source_root: str = './server'
@@ -31,20 +43,18 @@ class BackupConfig(Serializable):
     ]
     mutating_file_patterns: List[str] = []

-    # Content-Define-Chunking for Large files
-    cdc_enabled: bool = False
-    cdc_file_size_threshold: int = 100 * 1048576  # 100MiB
-    cdc_patterns: List[str] = [
-        '**/*.db',
+    # Chunking
+    chunking_enabled: bool = False
+    chunking_rules: List[ChunkingRule] = [
+        ChunkingRule(
+            algorithm=ChunkMethod.cdc,
+            file_size_threshold=100 * 1048576,
+            patterns=[
+                '**/*.db'
+            ],
+        ),
     ]

-    # Fixed 4K chunking for .mca region files
-    # f4c_enabled: bool = False
-    # f4c_file_size_threshold: int = 128 * 1024  # 128KiB
-    # f4c_patterns: List[str] = [
-    # '**/*.mca',
-    # ]
-
     # Storage
     hash_method: HashMethod = HashMethod.blake3
     compress_method: CompressMethod = CompressMethod.zstd
@@ -101,16 +111,6 @@ def creation_skip_missing_file_patterns_spec(self) -> 'pathspec.GitIgnoreSpec':
         from prime_backup.utils import pathspec_utils
         return pathspec_utils.compile_gitignore_spec(self.creation_skip_missing_file_patterns)

-    @property
-    def cdc_patterns_spec(self) -> 'pathspec.GitIgnoreSpec':
-        from prime_backup.utils import pathspec_utils
-        return pathspec_utils.compile_gitignore_spec(self.cdc_patterns)
-
-    # @property
-    # def f4c_patterns_spec(self) -> 'pathspec.GitIgnoreSpec':
-    # from prime_backup.utils import pathspec_utils
-    # return pathspec_utils.compile_gitignore_spec(self.f4c_patterns)
-
     @property
     def mutating_file_patterns_spec(self) -> 'pathspec.GitIgnoreSpec':
         from prime_backup.utils import pathspec_utils
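This diff renames the three `cdc_*` options into `chunking_enabled` plus a `chunking_rules` list, and does not show any conversion of existing config files. For a config written with the old keys, a manual conversion could look roughly like the sketch below. `migrate_cdc_options` is a hypothetical helper that operates on a plain dict of the backup section; it is not part of Prime Backup, and the fallback defaults are copied from the removed lines above.

```python
import json
from typing import Any, Dict


def migrate_cdc_options(backup_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: fold old cdc_* keys into chunking_enabled / chunking_rules."""
    cfg = dict(backup_cfg)  # shallow copy, the caller's dict is left untouched
    old_keys = ('cdc_enabled', 'cdc_file_size_threshold', 'cdc_patterns')
    if not any(key in cfg for key in old_keys):
        return cfg
    enabled = cfg.pop('cdc_enabled', False)
    threshold = cfg.pop('cdc_file_size_threshold', 100 * 1048576)
    patterns = cfg.pop('cdc_patterns', ['**/*.db'])
    # Keep any new-style values that are already present
    cfg.setdefault('chunking_enabled', enabled)
    cfg.setdefault('chunking_rules', [{
        'algorithm': 'cdc',
        'file_size_threshold': threshold,
        'patterns': list(patterns),
    }])
    return cfg


old_section = {
    'cdc_enabled': True,
    'cdc_file_size_threshold': 1048576,
    'cdc_patterns': ['**/*.db', '**/*.sqlite'],
}
new_section = migrate_cdc_options(old_section)
print(json.dumps(new_section, indent=4))
```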

prime_backup/types/chunk_method.py

Lines changed: 5 additions & 7 deletions
@@ -19,13 +19,11 @@ def get_for_file(cls, file_path: PathLike, file_size: int) -> Optional['ChunkMet

         if file_size <= 0:
             return None
+        if not backup_config.chunking_enabled:
+            return None

-        # if backup_config.f4c_enabled and file_size >= backup_config.f4c_file_size_threshold:
-        # if backup_config.f4c_patterns_spec.match_file(file_path):
-        # return ChunkMethod.fixed_4k
-
-        if backup_config.cdc_enabled and file_size >= backup_config.cdc_file_size_threshold:
-            if backup_config.cdc_patterns_spec.match_file(file_path):
-                return ChunkMethod.cdc
+        for cfg in backup_config.chunking_rules:
+            if file_size >= cfg.file_size_threshold and cfg.patterns_spec.match_file(file_path):
+                return cfg.algorithm

         return None

tests/test_fuzzy_run.py

Lines changed: 8 additions & 3 deletions
@@ -36,8 +36,10 @@
 from prime_backup.action.validate_files_action import ValidateFilesAction
 from prime_backup.action.validate_filesets_action import ValidateFilesetsAction
 from prime_backup.compressors import CompressMethod
+from prime_backup.config.backup_config import ChunkingRule
 from prime_backup.config.config import Config
 from prime_backup.db.access import DbAccess
+from prime_backup.types.chunk_method import ChunkMethod
 from prime_backup.types.hash_method import HashMethod
 from prime_backup.types.operator import Operator
 from prime_backup.types.tar_format import TarFormat
@@ -495,9 +497,12 @@ def rm_test_files_dirs():
     Config.get().backup.targets = [env_dir.name]
     Config.get().backup.hash_method = HashMethod.xxh128
     Config.get().backup.compress_method = CompressMethod.plain
-    Config.get().backup.cdc_enabled = True
-    Config.get().backup.cdc_file_size_threshold = 1 * 1048756  # 1MiB
-    Config.get().backup.cdc_patterns = ['**']
+    Config.get().backup.chunking_enabled = True
+    Config.get().backup.chunking_rules = [ChunkingRule(
+        algorithm=ChunkMethod.cdc,
+        file_size_threshold=1 * 1048576,  # 1MiB
+        patterns=['**'],
+    )]
     DbAccess.init(create=True, migrate=False)

     with contextlib.ExitStack() as es:
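The removed threshold line used the literal `1048756` next to a `# 1MiB` comment, while the added line uses `1048576`. A quick arithmetic check of the two literals, using only values that appear in the hunk above:

```python
MIB = 1 << 20            # 1 MiB in bytes
old_literal = 1048756    # value on the removed line
new_literal = 1048576    # value on the added line

print(new_literal == MIB)   # True
print(old_literal - MIB)    # 180
```

The old test threshold was therefore 180 bytes above 1 MiB, so the rewritten test now matches its comment.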
