Commit 4b27e2e

feat(stdlib): add ChunkingStrategy ABC and built-in chunkers (#923)
* feat(stdlib): add ChunkingStrategy ABC and built-in chunkers

  Adds mellea/stdlib/chunking.py with ChunkingStrategy ABC and three built-in
  implementations: SentenceChunker, WordChunker, ParagraphChunker.
  split(accumulated_text) returns complete chunks, holding back trailing
  fragments for the next call.

  Closes #899
  Part of #891

  Assisted-by: Claude Code
  Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

* fix(test): remove unused pytest import from test_chunking

  Pyright flagged import as unaccessed; no pytest.* calls in the file.

  Assisted-by: Claude Code
  Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

* fix(stdlib): address PR #923 review feedback on ChunkingStrategy

  - Fix SentenceChunker whitespace leak: lstrip() after match.end() so
    double-space / tab separators don't bleed into the next chunk as leading
    whitespace
  - Add end-of-stream contract to ABC docstring (callers responsible for
    trailing fragment after stream terminates)
  - Fix incorrect comment "end-of-string" → "whitespace"
  - Compile _WHITESPACE / _PARA_BOUNDARY / _PARA_BOUNDARY_END at module level
    (consistent with _SENTENCE_BOUNDARY; avoids per-call recompile)
  - Expand SentenceChunker char class to include right curly double/single
    quotes (U+201D / U+2019) for common LLM output patterns
  - Document CRLF limitation on ParagraphChunker
  - Re-export ChunkingStrategy + chunkers from mellea.stdlib.__init__
  - Add __all__ to chunking.py
  - Add tests: closing paren, double-space separator, tab separator,
    abbreviation edge case (known-bad split), WordChunker leading-whitespace

  Assisted-by: Claude Code

* fix(stdlib): address PR #923 nits: comment accuracy and curly-quote test

  - Fix misleading comment on _SENTENCE_BOUNDARY: was "processed by re engine
    as \u escapes" but the file contained literal Unicode chars. Now uses
    chr(0x201d) + chr(0x2019) for Python 3.12 compatibility (U+2019 is treated
    as a string delimiter in single-quoted raw strings on 3.12).
  - Add test_sentence_chunker_curly_quotes to verify U+201D/U+2019 matching.

  Assisted-by: Claude Code
  Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

* fix(stdlib): apply review feedback on ChunkingStrategy (#923)

  - Simplify _SENTENCE_BOUNDARY regex to use \u escapes instead of chr()
    concatenation (cleaner, same semantics, Python 3.12-safe)
  - Document that SentenceChunker discards inter-sentence whitespace via
    lstrip()
  - Add test_chunking_strategy_is_abstract to document the extension-point
    contract

  Assisted-by: Claude Code

---------

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
1 parent 96d221a commit 4b27e2e
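
The hold-back behaviour described in the commit message can be sketched as a consumer loop. This is a minimal sketch, not code from the PR: `new_chunks_per_delta` and `split_paragraphs` are hypothetical names, and the one-line splitter is a simplified stand-in for a real `ChunkingStrategy.split()`. Because `split()` re-splits the full accumulated text on every call, the consumer has to remember how many chunks it has already handled.

```python
from collections.abc import Callable, Iterable


def split_paragraphs(accumulated_text: str) -> list[str]:
    """Simplified stand-in for a strategy's split(): blank-line-delimited chunks.

    The last element of str.split() is the text after the final separator
    (possibly empty), i.e. the trailing fragment, so it is always withheld.
    """
    return [p for p in accumulated_text.split("\n\n")[:-1] if p]


def new_chunks_per_delta(
    deltas: Iterable[str], split: Callable[[str], list[str]]
) -> list[list[str]]:
    """Hypothetical helper: for each delta, list the chunks it completed."""
    accumulated = ""
    seen = 0  # number of chunks already handed downstream
    completed: list[list[str]] = []
    for delta in deltas:
        accumulated += delta
        chunks = split(accumulated)  # stateless: sees the whole text each time
        completed.append(chunks[seen:])
        seen = len(chunks)
    return completed


steps = new_chunks_per_delta(
    ["Intro te", "xt.\n\nSecond pa", "ra.\n", "\nTail"], split_paragraphs
)
print(steps)  # [[], ['Intro text.'], [], ['Second para.']]
```

Note that "Tail" never appears: it is the trailing fragment, and handling it once the stream ends is left to the caller.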

3 files changed

Lines changed: 421 additions & 0 deletions

mellea/stdlib/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -8,4 +8,11 @@
 ``@mify`` decorator for turning ordinary Python objects into components. Import from
 the sub-packages (``mellea.stdlib.components``, ``mellea.stdlib.sampling``, and
 ``mellea.stdlib.session``) for day-to-day use.
+
+Streaming chunking strategies (for use with streaming validation) are available at
+``mellea.stdlib.chunking`` and re-exported here for convenience.
 """
+
+from .chunking import ChunkingStrategy, ParagraphChunker, SentenceChunker, WordChunker
+
+__all__ = ["ChunkingStrategy", "ParagraphChunker", "SentenceChunker", "WordChunker"]
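
Since `ChunkingStrategy` is re-exported as an extension point, a custom strategy only needs to implement `split()`. The following is a minimal sketch, assuming the ABC has the single abstract method shown in this commit. `LineChunker` is a hypothetical example, not part of the PR, and the ABC is redeclared locally so the snippet runs without mellea installed.

```python
from abc import ABC, abstractmethod


class ChunkingStrategy(ABC):
    """Local redeclaration mirroring the shape of the ABC added in this commit."""

    @abstractmethod
    def split(self, accumulated_text: str) -> list[str]:
        """Return complete chunks, withholding any trailing fragment."""


class LineChunker(ChunkingStrategy):
    """Hypothetical strategy: every newline-terminated, non-empty line is a chunk."""

    def split(self, accumulated_text: str) -> list[str]:
        lines = accumulated_text.split("\n")
        # The last element is whatever follows the final newline (possibly ""),
        # so dropping it withholds the trailing fragment.
        return [line for line in lines[:-1] if line]


chunker = LineChunker()
print(chunker.split("alpha\nbeta\ngam"))  # ['alpha', 'beta']

try:
    ChunkingStrategy()  # abstract: cannot be instantiated directly
except TypeError as exc:
    print("TypeError:", exc)
```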

mellea/stdlib/chunking.py

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
+"""ChunkingStrategy ABC and built-in implementations for streaming validation."""
+
+import re
+from abc import ABC, abstractmethod
+
+__all__ = ["ChunkingStrategy", "ParagraphChunker", "SentenceChunker", "WordChunker"]
+
+
+class ChunkingStrategy(ABC):
+    """Abstract base class for text chunking strategies used in streaming validation.
+
+    A chunking strategy receives the full accumulated text so far and returns a
+    list of complete chunks ready for downstream validation. Any trailing fragment
+    that has not yet reached a chunk boundary is withheld; it is not included in
+    the returned list. Each call is stateless and idempotent given the same input.
+
+    End-of-stream contract: ``split()`` always withholds the trailing fragment.
+    When the stream terminates, callers are responsible for processing any remainder:
+    take the full accumulated text, identify everything after the last returned
+    chunk boundary, and handle it appropriately (e.g. pass to a final validator
+    or discard).
+    """
+
+    @abstractmethod
+    def split(self, accumulated_text: str) -> list[str]:
+        """Return complete chunks from accumulated_text, excluding any trailing fragment.
+
+        Args:
+            accumulated_text: The full text accumulated so far, including all
+                previously seen tokens and the latest delta.
+
+        Returns:
+            A list of complete chunks. If no chunk boundary has been reached yet,
+            returns an empty list. Never includes the trailing incomplete fragment.
+        """
+        ...
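
The end-of-stream contract leaves remainder handling to the caller. One way to recover the withheld text is sketched below, under the assumption that every returned chunk is a verbatim substring of the accumulated text and that chunks come back in order (which holds for the three built-in chunkers as written in this diff). `trailing_fragment` is a hypothetical helper, not part of mellea.

```python
def trailing_fragment(accumulated_text: str, chunks: list[str]) -> str:
    """Hypothetical helper: text left over after the last returned chunk."""
    cursor = 0
    for chunk in chunks:
        # Locate each chunk in order, skipping any separator before it.
        start = accumulated_text.find(chunk, cursor)
        if start == -1:
            raise ValueError(f"chunk {chunk!r} not found after offset {cursor}")
        cursor = start + len(chunk)
    # Strip the separator whitespace between the last chunk and the fragment.
    return accumulated_text[cursor:].strip()


text = "First one.  Second one!\tThird is unfi"
# What the sentence-splitting logic in this diff returns for `text`.
chunks = ["First one.", "Second one!"]
print(repr(trailing_fragment(text, chunks)))  # 'Third is unfi'
```

Summing chunk lengths would not work here, because the separators between chunks are discarded and can be more than one character wide.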
+
+
+# Sentence boundary: sentence-ending punctuation, optionally followed by a closing
+# quote or paren, then whitespace.
+# Character class covers: straight double/single quotes, right double/single curly
+# quotes (U+201D, U+2019), and closing paren.
+_SENTENCE_BOUNDARY = re.compile("[.!?][\"'\u201d\u2019)]?\\s")
+
+# Whitespace run separator used by WordChunker.
+_WHITESPACE = re.compile(r"\s+")
+
+# Paragraph boundary patterns used by ParagraphChunker.
+_PARA_BOUNDARY = re.compile(r"\n{2,}")
+_PARA_BOUNDARY_END = re.compile(r"\n{2,}$")
+
+
+class SentenceChunker(ChunkingStrategy):
+    """Splits accumulated text on sentence boundaries.
+
+    Sentence boundaries are detected by ``.``, ``!``, or ``?``, optionally
+    followed by a closing quote (straight or curly) or parenthesis, then
+    whitespace. The final sentence is only returned once it is followed by
+    whitespace or another sentence; a trailing fragment with no following
+    whitespace is withheld. Abbreviations are a known edge case: they will
+    be split on (simple regex, not NLP). Inter-sentence whitespace (including
+    double-space or tab) is discarded and does not appear as leading whitespace
+    in subsequent chunks.
+    """
+
+    def split(self, accumulated_text: str) -> list[str]:
+        """Return complete sentences from accumulated_text.
+
+        Args:
+            accumulated_text: The full text accumulated so far.
+
+        Returns:
+            Complete sentences detected so far. The trailing fragment (if any)
+            is withheld.
+        """
+        if not accumulated_text:
+            return []
+
+        chunks: list[str] = []
+        remaining = accumulated_text
+
+        while True:
+            match = _SENTENCE_BOUNDARY.search(remaining)
+            if match is None:
+                break
+            # Include up to and including the punctuation (and optional quote/paren),
+            # but not the trailing whitespace character.
+            end = match.start() + len(match.group().rstrip())
+            chunks.append(remaining[:end])
+            # Advance past the entire whitespace separator; lstrip() handles
+            # multi-character gaps (double-space, tab, etc.) so they don't
+            # leak into the next chunk as leading whitespace.
+            remaining = remaining[match.end() :].lstrip()
+
+        return chunks
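
The documented edge cases are easy to see by running the same logic on a few inputs. The function below is a standalone restatement of `SentenceChunker.split()` from this diff (same regex, same slicing) so that it runs without mellea; `split_sentences` is a name chosen here for illustration only.

```python
import re

# Same pattern as _SENTENCE_BOUNDARY in the diff.
_SENTENCE_BOUNDARY = re.compile("[.!?][\"'\u201d\u2019)]?\\s")


def split_sentences(accumulated_text: str) -> list[str]:
    chunks: list[str] = []
    remaining = accumulated_text
    while (match := _SENTENCE_BOUNDARY.search(remaining)) is not None:
        # Keep punctuation and optional closing quote/paren; drop the whitespace.
        end = match.start() + len(match.group().rstrip())
        chunks.append(remaining[:end])
        remaining = remaining[match.end():].lstrip()
    return chunks


# The last sentence is withheld until whitespace follows its punctuation.
print(split_sentences("One. Two."))            # ['One.']
print(split_sentences("One. Two. "))           # ['One.', 'Two.']
# Double-space and tab separators do not leak into the next chunk.
print(split_sentences("One.  Two!\tThree "))   # ['One.', 'Two!']
# A closing curly quote (U+201D) stays attached to its sentence.
print(split_sentences("\u201cStop!\u201d she said. "))
# Known limitation: abbreviations are treated as sentence ends.
print(split_sentences("Dr. Smith arrived. "))  # ['Dr.', 'Smith arrived.']
```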
+
+
+class WordChunker(ChunkingStrategy):
+    """Splits accumulated text on whitespace boundaries.
+
+    Each word is a chunk. Trailing text not yet followed by whitespace is
+    withheld.
+    """
+
+    def split(self, accumulated_text: str) -> list[str]:
+        """Return complete words from accumulated_text.
+
+        Args:
+            accumulated_text: The full text accumulated so far.
+
+        Returns:
+            All whitespace-delimited words except the trailing fragment (if any).
+            An empty list is returned when no whitespace boundary has been seen.
+        """
+        if not accumulated_text:
+            return []
+
+        # Split on runs of whitespace; the last token is a trailing fragment
+        # unless accumulated_text ends with whitespace.
+        parts = _WHITESPACE.split(accumulated_text)
+
+        # re.split on leading whitespace produces an empty first element; strip it.
+        if parts and parts[0] == "":
+            parts = parts[1:]
+        if parts and parts[-1] == "":
+            parts = parts[:-1]
+
+        if not parts:
+            return []
+
+        # If the text does not end with whitespace, the last part is a fragment.
+        if not accumulated_text[-1].isspace():
+            return parts[:-1]
+
+        return parts
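
For comparison, the same withholding rule can be written with `str.split()`, which already collapses whitespace runs and drops empty leading and trailing elements. This is a sketch of an alternative formulation, not the PR's implementation. `split_words_regex` restates the diff's logic so the snippet runs standalone, `split_words` is a hypothetical name, and the two are only checked against each other on the sample inputs below.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def split_words_regex(accumulated_text: str) -> list[str]:
    """Standalone restatement of WordChunker.split() from the diff."""
    if not accumulated_text:
        return []
    parts = _WHITESPACE.split(accumulated_text)
    if parts and parts[0] == "":
        parts = parts[1:]
    if parts and parts[-1] == "":
        parts = parts[:-1]
    if not parts:
        return []
    if not accumulated_text[-1].isspace():
        return parts[:-1]
    return parts


def split_words(accumulated_text: str) -> list[str]:
    """Hypothetical shorter variant built on str.split()."""
    words = accumulated_text.split()
    # No trailing whitespace means the last word may still be growing.
    if words and not accumulated_text[-1].isspace():
        return words[:-1]
    return words


samples = ["", "   ", "hello", "hello ", "  hello wor", "a\t b\nc", "hello world "]
for sample in samples:
    assert split_words(sample) == split_words_regex(sample)

print(split_words("  hello wor"))   # ['hello']
print(split_words("hello world "))  # ['hello', 'world']
```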
+
+
+class ParagraphChunker(ChunkingStrategy):
+    r"""Splits accumulated text on double-newline paragraph boundaries.
+
+    Two or more consecutive newline characters are treated as a paragraph
+    separator. The trailing paragraph fragment (text not yet followed by ``\n\n``)
+    is withheld.
+
+    Note: only Unix-style ``\n\n`` separators are recognised. CRLF
+    (``\r\n\r\n``) paragraph separators are not supported.
+    """
+
+    def split(self, accumulated_text: str) -> list[str]:
+        """Return complete paragraphs from accumulated_text.
+
+        Args:
+            accumulated_text: The full text accumulated so far.
+
+        Returns:
+            Complete paragraphs (separated by two or more newlines). The
+            trailing incomplete paragraph is withheld. Returns an empty list
+            if no paragraph boundary has been reached.
+        """
+        if not accumulated_text:
+            return []
+
+        parts = _PARA_BOUNDARY.split(accumulated_text)
+
+        # If the text does not end with \n\n, the last part is a trailing fragment.
+        if not _PARA_BOUNDARY_END.search(accumulated_text):
+            parts = parts[:-1]
+
+        # _PARA_BOUNDARY.split on leading \n\n produces an empty first element.
+        return [p for p in parts if p]
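
The CRLF limitation can be worked around on the caller's side by normalising line endings before splitting. The following is a minimal sketch, assuming that rewriting `\r\n` to `\n` is acceptable for the downstream validator. `split_paragraphs` restates the diff's logic so the snippet runs standalone, and `split_paragraphs_crlf` is a hypothetical wrapper, not part of the PR.

```python
import re

_PARA_BOUNDARY = re.compile(r"\n{2,}")
_PARA_BOUNDARY_END = re.compile(r"\n{2,}$")


def split_paragraphs(accumulated_text: str) -> list[str]:
    """Standalone restatement of ParagraphChunker.split() from the diff."""
    if not accumulated_text:
        return []
    parts = _PARA_BOUNDARY.split(accumulated_text)
    if not _PARA_BOUNDARY_END.search(accumulated_text):
        parts = parts[:-1]
    return [p for p in parts if p]


def split_paragraphs_crlf(accumulated_text: str) -> list[str]:
    """Hypothetical wrapper: treat CRLF line endings like LF before splitting."""
    return split_paragraphs(accumulated_text.replace("\r\n", "\n"))


crlf_text = "First para.\r\n\r\nSecond para.\r\n\r\nTail"
print(split_paragraphs(crlf_text))       # []
print(split_paragraphs_crlf(crlf_text))  # ['First para.', 'Second para.']
```

Because the whole accumulated text is re-split on every call, a delta that ends between `\r` and `\n` does no harm: the lone `\r` stays in the withheld fragment until the next call completes the pair. The returned chunks contain `\n` where the raw stream had `\r\n`, which matters if chunk offsets are later mapped back onto the original text.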
