feat(splitter): decouple overlap configuration from context concatenation #140

Merged: jixua merged 1 commit into dev from feature/splitter-enhancement on Jun 6, 2026

@ottercoconut
Collaborator

Related Issue

Closes #138

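Changes

  • Add the CHUNKING_OVERLAP_ENABLED setting to explicitly control whether splitter overlap is enabled.
  • Restrict CHUNKING_OVERLAP_TOKENS to 0..64; an out-of-range value fails validation at startup.
  • Add ChunkOverlapConfig / ChunkOverlapper to handle overlap token extraction and context concatenation in one place.
  • PercentileSemanticChunker and StructuredSemanticChunker reuse the same overlap implementation, so the semantic chunkers no longer carry overlap details.
  • Keep the existing structural chunking, semantic breakpoints, length constraints, and fallback flow unchanged.
  • Sync .env.example, docs/ops/configure.md, and docs/internals/chunking.md.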
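For readers who have not opened the diff, the shared helper could look roughly like the sketch below. It is not the code from this PR: only the class names ChunkOverlapConfig and ChunkOverlapper and the 0..64 range come from the description. The field names `enabled` and `tokens`, the methods `tail` and `apply`, and the whitespace tokenizer are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_OVERLAP_TOKENS = 64  # upper bound stated in the PR description


@dataclass(frozen=True)
class ChunkOverlapConfig:
    """Hypothetical config shape; the field names are assumptions."""

    enabled: bool
    tokens: int

    def __post_init__(self) -> None:
        # Reject out-of-range values as soon as the config is built.
        if not 0 <= self.tokens <= MAX_OVERLAP_TOKENS:
            raise ValueError(
                f"overlap tokens must be in 0..{MAX_OVERLAP_TOKENS}, got {self.tokens}"
            )


class ChunkOverlapper:
    """Illustrative helper: cuts an overlap tail and prepends it to the next chunk."""

    def __init__(self, config: ChunkOverlapConfig) -> None:
        self._config = config

    def tail(self, text: str) -> str:
        """Return the last N whitespace-separated tokens, or '' when overlap is off."""
        if not self._config.enabled or self._config.tokens == 0:
            return ""
        words = text.split()  # stand-in tokenizer; the real one is not shown in the PR text
        return " ".join(words[-self._config.tokens:])

    def apply(self, chunks: List[str]) -> List[str]:
        """Prefix every chunk after the first with the tail of its predecessor."""
        result: List[str] = []
        previous: Optional[str] = None
        for chunk in chunks:
            prefix = self.tail(previous) if previous is not None else ""
            result.append(f"{prefix} {chunk}" if prefix else chunk)
            previous = chunk  # the tail is taken from the original chunk, not the prefixed one
        return result
```

In a layout like this, any chunker can pass its finished chunk list through `apply`, which is one way to keep overlap details out of the chunkers themselves.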

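Acceptance coverage

  • The default configuration stays compatible with the current overlap behavior.
  • With CHUNKING_OVERLAP_ENABLED=false, neither the semantic-chunking overlap nor the neighbor context overlap is appended.
  • With CHUNKING_OVERLAP_TOKENS=0, no overlap is appended.
  • With CHUNKING_OVERLAP_TOKENS=64, startup is allowed and overlap is applied with a 64-token cap.
  • With CHUNKING_OVERLAP_TOKENS < 0 or > 64, configuration validation fails.
  • Unit tests cover the enabled, disabled, 0, 64, and out-of-range configurations.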
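The acceptance cases above can be summarized with a small sketch of the environment parsing. This is an assumption-laden illustration rather than the repository's src/config.py: the function names `parse_overlap_env` and `effective_overlap` are hypothetical, the accepted boolean spellings are a guess, and the defaults are passed in by the caller because the description does not state them.

```python
from typing import Mapping, Tuple

TRUE_VALUES = {"1", "true", "yes", "on"}    # assumed spellings
FALSE_VALUES = {"0", "false", "no", "off"}  # assumed spellings


def parse_overlap_env(
    env: Mapping[str, str], *, default_enabled: bool, default_tokens: int
) -> Tuple[bool, int]:
    """Read the two overlap settings and fail fast on invalid values."""
    raw_enabled = env.get("CHUNKING_OVERLAP_ENABLED")
    if raw_enabled is None:
        enabled = default_enabled
    else:
        value = raw_enabled.strip().lower()
        if value in TRUE_VALUES:
            enabled = True
        elif value in FALSE_VALUES:
            enabled = False
        else:
            raise ValueError(f"invalid CHUNKING_OVERLAP_ENABLED: {raw_enabled!r}")

    raw_tokens = env.get("CHUNKING_OVERLAP_TOKENS")
    tokens = default_tokens if raw_tokens is None else int(raw_tokens)
    if not 0 <= tokens <= 64:
        raise ValueError(f"CHUNKING_OVERLAP_TOKENS must be in 0..64, got {tokens}")
    return enabled, tokens


def effective_overlap(enabled: bool, tokens: int) -> int:
    """Number of overlap tokens actually appended: zero whenever overlap is disabled."""
    return tokens if enabled else 0
```

Under these assumptions, `CHUNKING_OVERLAP_ENABLED=false` and `CHUNKING_OVERLAP_TOKENS=0` both produce an effective overlap of 0, while -1 and 65 are rejected before any chunking runs.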

Verification

  • .venv/bin/pytest tests/unit -q: 360 passed
  • .venv/bin/pytest tests/unit/core/splitter tests/unit/test_config_sparse_vector.py -q: 22 passed
  • .venv/bin/mypy src/core/splitter src/config.py: passed
  • .venv/bin/python scripts/check_docs_sync.py --staged: passed
  • .venv/bin/python scripts/check_docs_sync.py --working: passed

Note: a full .venv/bin/mypy src run still reports pre-existing type errors in the repository; mypy passes for the scope touched by this PR.

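Add the CHUNKING_OVERLAP_ENABLED setting and restrict CHUNKING_OVERLAP_TOKENS to 0..64.

Extract ChunkOverlapper to handle overlap token extraction and concatenation in one place; semantic chunking and the two-stage neighbor context reuse the same implementation, and the existing chunking flow is unchanged.

Add splitter and configuration unit tests, and sync .env.example, the ops configuration docs, and the chunking internals notes.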
@jixua merged commit b2a4e29 into dev on Jun 6, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

feat(splitter): decouple overlap configuration and limit the token range

2 participants