Commit 5d578eb

author
ottercoconut
committed
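feat(splitter): decouple overlap configuration from context concatenation
Add the CHUNKING_OVERLAP_ENABLED setting and restrict CHUNKING_OVERLAP_TOKENS to 0..64. Extract ChunkOverlapper to handle overlap token truncation and concatenation in one place; semantic chunking and the two-stage neighbor context reuse the same implementation, and the existing chunking flow is unchanged. Add unit tests for the splitter and the configuration, and update .env.example, the ops configuration docs, and the internal chunking notes.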
1 parent bdfd540 commit 5d578eb

14 files changed

Lines changed: 390 additions & 122 deletions

.env.example

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -53,6 +53,9 @@ CHUNKING_SEMANTIC_PERCENTILE=95
5353
CHUNKING_SEMANTIC_UNIT=sentence
5454
CHUNKING_MIN_CHUNK_TOKENS=150
5555
CHUNKING_MAX_CHUNK_TOKENS=512
56+
# Whether to enable overlap between adjacent chunks; when disabled, CHUNKING_OVERLAP_TOKENS has no effect
57+
CHUNKING_OVERLAP_ENABLED=true
58+
# Allowed range for the overlap token count: 0-64
5659
CHUNKING_OVERLAP_TOKENS=64
5760
CHUNKING_MIN_DISTANCE_GATE=0.25
5861
CHUNKING_EMBED_BATCH_SIZE=32

docs/internals/chunking.md

Lines changed: 15 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ src/core/splitter/
1111
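├── chunking_engine.py # Markdown parsing and chunking orchestration entry point
1212
├── rule_chunker.py # Rule-based chunking on the Markdown AST
1313
├── semantic_chunker.py # Semantic refinement based on embedding distance
14+
├── overlap.py # Chunk overlap configuration and context concatenation
1415
├── pipeline_chunker.py # Two-stage chunker: structural chunking + semantic refinement
1516
└── embedding_pipeline.py # Batched chunk embedding pipeline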
1617
```
@@ -88,13 +89,24 @@ class BaseChunker(ABC):
8889
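- First split the text into semantic comparison atoms according to the `semantic_unit` setting; the default `sentence` keeps the existing paragraph, line, sentence step-down behavior, while `paragraph` uses whole paragraphs as the unit of similarity computation.
8990
- Call the embedding model to compute the semantic distance between adjacent atoms.
9091
- Use a percentile of the distances as a dynamic threshold to find breakpoints.
91-
- Controlled by `min_chunk_tokens`, `max_chunk_tokens`, and `overlap_tokens`.
92+
- Controlled by `min_chunk_tokens` and `max_chunk_tokens`; overlap is governed by separate configuration but is still appended at the original split positions, so the algorithm flow is unchanged.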
9293

9394
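`paragraph` mode only changes the granularity of the similarity computation: when a single paragraph exceeds `max_chunk_tokens`, breakpoints are not recomputed with sentence-level embeddings, but the final output still goes through a length-based fallback split so that no oversized chunk is produced.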
9495

9596
It is usually not used directly as the primary chunker; instead it is injected into `StructuredSemanticChunker`.
9697

97-
### 3.3 StructuredSemanticChunker
98+
### 3.3 ChunkOverlapper
99+
100+
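`ChunkOverlapper` handles the context overlap between adjacent chunks and does not take part in semantic breakpoint computation.
101+
102+
Configuration:
103+
104+
- `CHUNKING_OVERLAP_ENABLED`: whether overlap is enabled.
105+
- `CHUNKING_OVERLAP_TOKENS`: upper bound on the number of tokens appended when overlap is enabled, range `0..64`.
106+
107+
When `CHUNKING_OVERLAP_ENABLED=false` or `CHUNKING_OVERLAP_TOKENS=0`, no overlap is appended. The default `true + 64` preserves the existing chunking behavior.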
108+
109+
### 3.4 StructuredSemanticChunker
98110

99111
`StructuredSemanticChunker` is a two-stage chunker:
100112

@@ -201,7 +213,7 @@ chunks = engine.process(markdown)
201213
When modifying semantic chunking, check:
202214

203215
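- Whether the token lower and upper bounds are reasonable.
204-
- Whether overlap causes content bloat.
216+
- Whether overlap takes effect according to `CHUNKING_OVERLAP_ENABLED` and `CHUNKING_OVERLAP_TOKENS`, without causing content bloat.
205217
- Whether embedding calls are batched and testable.
206218
- Whether there is a fallback when semantic breakpoint detection fails.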
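Review note: the rule documented in section 3.3 (no overlap when the switch is off or the token count is zero) can be sketched in a few lines. `effective_overlap_tokens` is a hypothetical helper written for this note, not a function in the repository; it only mirrors the documented behavior and the `0..64` range.

```python
def effective_overlap_tokens(enabled: bool, tokens: int) -> int:
    """Return the overlap budget that actually applies (hypothetical helper)."""
    # Mirror the documented range check before resolving the budget.
    if tokens < 0 or tokens > 64:
        raise ValueError("overlap tokens must be between 0 and 64")
    # A disabled switch overrides whatever token count is configured.
    return tokens if enabled else 0


for enabled, tokens in [(True, 64), (True, 0), (False, 64)]:
    print(enabled, tokens, "->", effective_overlap_tokens(enabled, tokens))
```

Only the first combination yields a non-zero budget, which matches the statement that the default `true + 64` keeps the existing behavior.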
207219

docs/ops/configure.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -86,7 +86,8 @@
8686
| --- | --- | --- |
8787
| `CHUNKING_MIN_CHUNK_TOKENS` | 150 | Decrease for short documents |
8888
| `CHUNKING_MAX_CHUNK_TOKENS` | 512 | Increase for long-context models |
89-
| `CHUNKING_OVERLAP_TOKENS` | 64 | Increase to improve recall |
89+
| `CHUNKING_OVERLAP_ENABLED` | `true` | Whether to enable overlap between adjacent chunks |
90+
| `CHUNKING_OVERLAP_TOKENS` | 64 | Overlap token count, range `0..64` |
9091
| `CHUNKING_HEADING_BREAK_LEVEL` | 3 | Decrease to increase structural sensitivity |
9192
| `CHUNKING_SEMANTIC_PERCENTILE` | 95 | Adjusts how strict semantic boundaries are |
9293
| `CHUNKING_SEMANTIC_UNIT` | `sentence` | Granularity of the semantic similarity computation: `sentence` / `paragraph` |

src/config.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -112,6 +112,7 @@ def assemble_redis_url(cls, v: Optional[str], info) -> str:
112112
CHUNKING_SEMANTIC_UNIT: str = "sentence"
113113
CHUNKING_MIN_CHUNK_TOKENS: int = 150
114114
CHUNKING_MAX_CHUNK_TOKENS: int = 512
115+
CHUNKING_OVERLAP_ENABLED: bool = True
115116
CHUNKING_OVERLAP_TOKENS: int = 64
116117
CHUNKING_MIN_DISTANCE_GATE: float = 0.25
117118
CHUNKING_EMBED_BATCH_SIZE: int = 32
@@ -124,6 +125,13 @@ def validate_chunking_semantic_unit(cls, v: str) -> str:
124125
raise ValueError("CHUNKING_SEMANTIC_UNIT must be 'sentence' or 'paragraph'")
125126
return normalized
126127

128+
@field_validator("CHUNKING_OVERLAP_TOKENS")
129+
@classmethod
130+
def validate_chunking_overlap_tokens(cls, v: int) -> int:
131+
if v < 0 or v > 64:
132+
raise ValueError("CHUNKING_OVERLAP_TOKENS must be between 0 and 64")
133+
return v
134+
127135
# ==========================================
128136
# Vector database configuration (Vector Store)
129137
# ==========================================
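Review note: for readers who want to try the new settings outside the project, here is a standard-library sketch of reading the two variables with the same `0..64` bound as the validator in this hunk. `read_overlap_settings` and the accepted boolean spellings are assumptions made for illustration; the project itself relies on its settings class, whose boolean parsing may differ.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Assumed spellings for boolean values; not taken from the project.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def read_overlap_settings(env: Mapping[str, str] | None = None) -> tuple[bool, int]:
    """Read CHUNKING_OVERLAP_ENABLED / CHUNKING_OVERLAP_TOKENS (hypothetical helper)."""
    env = os.environ if env is None else env

    raw_enabled = env.get("CHUNKING_OVERLAP_ENABLED", "true").strip().lower()
    if raw_enabled in _TRUE:
        enabled = True
    elif raw_enabled in _FALSE:
        enabled = False
    else:
        raise ValueError("CHUNKING_OVERLAP_ENABLED must be a boolean")

    tokens = int(env.get("CHUNKING_OVERLAP_TOKENS", "64"))
    # Same bound and message as the validator added in this commit.
    if tokens < 0 or tokens > 64:
        raise ValueError("CHUNKING_OVERLAP_TOKENS must be between 0 and 64")
    return enabled, tokens


print(read_overlap_settings({"CHUNKING_OVERLAP_ENABLED": "false"}))
```

Note that the range check runs even when overlap is disabled, so an out-of-range token count is rejected regardless of the switch.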

src/core/splitter/__init__.py

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -10,12 +10,8 @@
1010
Chunk: the chunk data model
1111
"""
1212

13-
from .models import Chunk, EmbeddedChunk, EmbeddingPipelineStats
1413
from .base import BaseChunker
1514
from .chunking_engine import ChunkingEngine
16-
from .rule_chunker import ASTAwareChunker
17-
from .pipeline_chunker import StructuredSemanticChunker
18-
from .semantic_chunker import PercentileSemanticChunker, SemanticSplitter
1915
from .embedding_pipeline import ChunkEmbeddingPipeline
2016
from .factory import (
2117
LazyEmbeddingClient,
@@ -24,6 +20,11 @@
2420
create_lazy_system_embedding_client,
2521
create_system_embedding_client,
2622
)
23+
from .models import Chunk, EmbeddedChunk, EmbeddingPipelineStats
24+
from .overlap import ChunkOverlapConfig, ChunkOverlapper
25+
from .pipeline_chunker import StructuredSemanticChunker
26+
from .rule_chunker import ASTAwareChunker
27+
from .semantic_chunker import PercentileSemanticChunker, SemanticSplitter
2728

2829
__all__ = [
2930
"Chunk",
@@ -32,6 +33,8 @@
3233
"BaseChunker",
3334
"ChunkingEngine",
3435
"ASTAwareChunker",
36+
"ChunkOverlapConfig",
37+
"ChunkOverlapper",
3538
"StructuredSemanticChunker",
3639
"PercentileSemanticChunker",
3740
"SemanticSplitter",

src/core/splitter/factory.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -133,6 +133,7 @@ def create_chunking_engine() -> ChunkingEngine:
133133
semantic_unit=settings.CHUNKING_SEMANTIC_UNIT,
134134
min_chunk_tokens=settings.CHUNKING_MIN_CHUNK_TOKENS,
135135
max_chunk_tokens=settings.CHUNKING_MAX_CHUNK_TOKENS,
136+
overlap_enabled=settings.CHUNKING_OVERLAP_ENABLED,
136137
overlap_tokens=settings.CHUNKING_OVERLAP_TOKENS,
137138
min_distance_gate=settings.CHUNKING_MIN_DISTANCE_GATE,
138139
)

src/core/splitter/overlap.py

Lines changed: 136 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,136 @@
1+
# -*- coding: utf-8 -*-
2+
"""Chunk overlap configuration and text context utilities."""
3+
4+
from __future__ import annotations
5+
6+
from dataclasses import dataclass
7+
from typing import TYPE_CHECKING, Any
8+
9+
if TYPE_CHECKING:
10+
from src.core.llm.tokenizer import Tokenizer
11+
else:
12+
Tokenizer = Any
13+
14+
15+
@dataclass(slots=True)
16+
class ChunkOverlapConfig:
17+
"""Standalone configuration describing chunk overlap."""
18+
19+
enabled: bool = True
20+
tokens: int = 64
21+
22+
def __post_init__(self) -> None:
23+
if self.tokens < 0 or self.tokens > 64:
24+
raise ValueError("overlap tokens must be between 0 and 64.")
25+
26+
27+
class ChunkOverlapper:
28+
"""Central place for overlap token truncation and context concatenation."""
29+
30+
def __init__(
31+
self,
32+
tokenizer: Tokenizer,
33+
config: ChunkOverlapConfig | None = None,
34+
) -> None:
35+
self.tokenizer = tokenizer
36+
self.config = config or ChunkOverlapConfig()
37+
38+
@property
39+
def effective_tokens(self) -> int:
40+
"""Return the number of overlap tokens currently in effect."""
41+
if not self.config.enabled:
42+
return 0
43+
return self.config.tokens
44+
45+
def count_tokens(self, text: str) -> int:
46+
"""Count the tokens in the text."""
47+
return self.tokenizer.count_tokens(text.strip()) if text else 0
48+
49+
def take_first_tokens(self, text: str, token_limit: int) -> str:
50+
"""Take the given number of tokens from the start of the text."""
51+
if not text or token_limit <= 0:
52+
return ""
53+
truncated, _ = self.tokenizer.truncate_text(text, token_limit)
54+
return truncated.strip()
55+
56+
def take_last_tokens(self, text: str, token_limit: int) -> str:
57+
"""Take the given number of tokens from the end of the text."""
58+
cleaned = text.strip()
59+
if not cleaned or token_limit <= 0:
60+
return ""
61+
if self.count_tokens(cleaned) <= token_limit:
62+
return cleaned
63+
64+
left = 0
65+
right = len(cleaned) - 1
66+
best_start = right
67+
68+
while left <= right:
69+
mid = (left + right) // 2
70+
candidate = cleaned[mid:].lstrip()
71+
tokens = self.count_tokens(candidate)
72+
if tokens <= token_limit:
73+
best_start = mid
74+
right = mid - 1
75+
else:
76+
left = mid + 1
77+
78+
return cleaned[best_start:].lstrip()
79+
80+
def build_next_chunk(
81+
self,
82+
previous_chunk: str,
83+
next_atom: str,
84+
*,
85+
max_chunk_tokens: int,
86+
) -> str:
87+
"""When a split happens, prepend the tail of the previous chunk to the next chunk as overlap."""
88+
overlap_budget = self.effective_tokens
89+
if overlap_budget <= 0:
90+
return next_atom
91+
92+
next_tokens = self.count_tokens(next_atom)
93+
available_for_overlap = max(0, max_chunk_tokens - next_tokens)
94+
if available_for_overlap <= 0:
95+
return next_atom
96+
97+
overlap_tail = self.take_last_tokens(
98+
previous_chunk,
99+
min(overlap_budget, available_for_overlap),
100+
)
101+
if not overlap_tail:
102+
return next_atom
103+
104+
return f"{overlap_tail}\n\n{next_atom}".strip()
105+
106+
def build_neighbor_context(
107+
self,
108+
*,
109+
previous_content: str | None,
110+
current_content: str,
111+
next_content: str | None,
112+
) -> tuple[str, int, int]:
113+
"""Build neighbor context for a final chunk and return the number of tokens actually added before and after it."""
114+
overlap_budget = self.effective_tokens
115+
if overlap_budget <= 0:
116+
return current_content, 0, 0
117+
118+
contextual_parts: list[str] = []
119+
previous_tokens = 0
120+
next_tokens = 0
121+
122+
if previous_content:
123+
previous_context = self.take_last_tokens(previous_content, overlap_budget)
124+
if previous_context:
125+
previous_tokens = self.count_tokens(previous_context)
126+
contextual_parts.append(previous_context)
127+
128+
contextual_parts.append(current_content)
129+
130+
if next_content:
131+
next_context = self.take_first_tokens(next_content, overlap_budget)
132+
if next_context:
133+
next_tokens = self.count_tokens(next_context)
134+
contextual_parts.append(next_context)
135+
136+
return "\n\n".join(contextual_parts).strip(), previous_tokens, next_tokens
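Review note: the tail selection and the overlap budget in `overlap.py` can be exercised without the project tokenizer. The sketch below re-implements `take_last_tokens` and `build_next_chunk` as free functions under one loud assumption: tokens are whitespace-separated words, which stands in for the real `Tokenizer`. The `overlap_tokens` parameter replaces `self.effective_tokens`.

```python
def count_tokens(text: str) -> int:
    """Toy tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def take_last_tokens(text: str, token_limit: int) -> str:
    """Binary-search the smallest character offset whose suffix fits the limit."""
    cleaned = text.strip()
    if not cleaned or token_limit <= 0:
        return ""
    if count_tokens(cleaned) <= token_limit:
        return cleaned

    left, right = 0, len(cleaned) - 1
    best_start = right
    while left <= right:
        mid = (left + right) // 2
        if count_tokens(cleaned[mid:].lstrip()) <= token_limit:
            best_start = mid      # suffix fits; try to start earlier
            right = mid - 1
        else:
            left = mid + 1        # suffix too long; start later
    return cleaned[best_start:].lstrip()


def build_next_chunk(
    previous_chunk: str,
    next_atom: str,
    *,
    overlap_tokens: int,
    max_chunk_tokens: int,
) -> str:
    """Prepend the previous chunk's tail, capped by the room left in the next chunk."""
    if overlap_tokens <= 0:
        return next_atom
    available = max(0, max_chunk_tokens - count_tokens(next_atom))
    if available <= 0:
        return next_atom
    tail = take_last_tokens(previous_chunk, min(overlap_tokens, available))
    if not tail:
        return next_atom
    return f"{tail}\n\n{next_atom}".strip()


previous = "alpha beta gamma delta epsilon"
print(repr(take_last_tokens(previous, 2)))
print(repr(build_next_chunk(previous, "zeta eta theta",
                            overlap_tokens=4, max_chunk_tokens=5)))
```

Because the search runs over character offsets rather than token boundaries, it relies on the suffix token count not increasing as the start offset moves right. With a subword tokenizer the returned tail may begin in the middle of a word, which is worth keeping in mind when reviewing the unit tests.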

src/core/splitter/pipeline_chunker.py

Lines changed: 7 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -151,38 +151,19 @@ def _apply_neighbor_context(self, chunks: list[Chunk]) -> list[Chunk]:
151151
Returns:
152152
list[Chunk]: the list of Chunks with neighbor context appended.
153153
"""
154-
overlap_budget = self.semantic_chunker._resolve_overlap_tokens()
155-
if overlap_budget <= 0 or len(chunks) <= 1:
154+
if self.semantic_chunker.overlapper.effective_tokens <= 0 or len(chunks) <= 1:
156155
return chunks
157156

158157
base_contents = [chunk.content for chunk in chunks]
159158

160159
for index, chunk in enumerate(chunks):
161-
contextual_parts: list[str] = []
162-
previous_tokens = 0
163-
next_tokens = 0
164-
165-
if index > 0:
166-
previous_context = self.semantic_chunker._take_last_tokens(
167-
base_contents[index - 1],
168-
overlap_budget,
160+
chunk.content, previous_tokens, next_tokens = (
161+
self.semantic_chunker.overlapper.build_neighbor_context(
162+
previous_content=base_contents[index - 1] if index > 0 else None,
163+
current_content=base_contents[index],
164+
next_content=base_contents[index + 1] if index + 1 < len(chunks) else None,
169165
)
170-
if previous_context:
171-
previous_tokens = self.semantic_chunker.tokenizer.count_tokens(previous_context)
172-
contextual_parts.append(previous_context)
173-
174-
contextual_parts.append(base_contents[index])
175-
176-
if index + 1 < len(chunks):
177-
next_context = self.semantic_chunker._take_first_tokens(
178-
base_contents[index + 1],
179-
overlap_budget,
180-
)
181-
if next_context:
182-
next_tokens = self.semantic_chunker.tokenizer.count_tokens(next_context)
183-
contextual_parts.append(next_context)
184-
185-
chunk.content = "\n\n".join(contextual_parts).strip()
166+
)
186167
if previous_tokens > 0:
187168
chunk.metadata["context_prev_tokens_applied"] = previous_tokens
188169
if next_tokens > 0:
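Review note: the refactored loop reads neighbors from the `base_contents` snapshot rather than from the chunks being mutated, so context added to one chunk never leaks into the next. The sketch below illustrates that property on plain strings. It assumes whitespace-separated words as tokens, and `apply_neighbor_context`, `first_words`, and `last_words` are hypothetical stand-ins for the `Chunk` objects and the `ChunkOverlapper` methods.

```python
def first_words(text: str, limit: int) -> str:
    """Take up to `limit` words from the start (toy tokenizer assumption)."""
    return " ".join(text.split()[:limit]) if limit > 0 else ""


def last_words(text: str, limit: int) -> str:
    """Take up to `limit` words from the end (toy tokenizer assumption)."""
    return " ".join(text.split()[-limit:]) if limit > 0 else ""


def apply_neighbor_context(contents: list[str], overlap_tokens: int) -> list[str]:
    """Surround each chunk with the tail of its predecessor and head of its successor."""
    if overlap_tokens <= 0 or len(contents) <= 1:
        return list(contents)

    base = list(contents)  # snapshot: neighbors are always read from the originals
    result: list[str] = []
    for index, current in enumerate(base):
        parts: list[str] = []
        if index > 0:
            parts.append(last_words(base[index - 1], overlap_tokens))
        parts.append(current)
        if index + 1 < len(base):
            parts.append(first_words(base[index + 1], overlap_tokens))
        # Skip empty fragments so no stray blank separators are produced.
        result.append("\n\n".join(part for part in parts if part).strip())
    return result


for chunk in apply_neighbor_context(["a b c", "d e f", "g h i"], overlap_tokens=1):
    print(repr(chunk))
```

The middle chunk receives exactly one word from each side, and the first chunk's added suffix does not appear in the context given to the second chunk.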
