Commit bfd78b2

fix: handle text too long for spacy issue (#4353)
This PR addresses the issue that very long text can fail to partition because its length exceeds `spacy`'s character limit. `spacy` is used to classify text content. Text too long to fit under the limit is now truncated, and the truncated text represents the full text for classification purposes.

---

## Summary by cubic

Prevents tokenization failures on very long inputs by truncating text that exceeds `spacy`'s `max_length`, keeping partition/classification stable for large documents without affecting normal cases.

- **Bug Fixes**
  - Guard `_process` for inputs over `nlp.max_length`: truncate at the last space within the final 256 characters of the budget (or hard-cut at `max_length` if there is none), log a warning, and avoid `spacy` ValueError E088.
  - Add tests for truncation behavior and for normal processing within the limit.
  - Update version to `0.22.29` and changelog.

Written for commit 1bfdd91.

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
1 parent 238657f commit bfd78b2

4 files changed

Lines changed: 44 additions & 2 deletions

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+## 0.22.29
+
+### Fixes
+
+- **Truncate text if it exceeds `spacy` limit**: add a guard against calling `spacy` tokenizer with very long text. Now long texts are truncated to fit under the character limit.
+
 ## 0.22.28
 
 ### Fixes

test_unstructured/nlp/test_tokenize.py

Lines changed: 23 additions & 0 deletions
@@ -41,3 +41,26 @@ def test_tokenizers_functions_run():
     tokenize.sent_tokenize(sentence)
     tokenize.word_tokenize(sentence)
     tokenize.pos_tag(sentence)
+
+
+def test_process_truncates_text_exceeding_spacy_max_length(caplog):
+    # Build text well above spaCy's default 1,000,000-char limit, like the prod trace.
+    nlp = tokenize._get_nlp()
+    long_text = "This is a sentence. " * ((nlp.max_length // 20) + 10_000)
+    assert len(long_text) > nlp.max_length
+
+    with caplog.at_level("WARNING", logger=tokenize.logger.name):
+        # Must not raise spacy ValueError E088.
+        sents = tokenize.sent_tokenize(long_text)
+
+    assert len(sents) > 0
+    assert any("exceeds spaCy max_length" in rec.message for rec in caplog.records)
+
+
+def test_process_does_not_truncate_text_within_limit():
+    nlp = tokenize._get_nlp()
+    text = "Greetings! I am from outer space."
+    assert len(text) <= nlp.max_length
+    doc = tokenize._process(text)
+    # When no truncation occurs the full text round-trips through spaCy.
+    assert doc.text == text
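
The first test above has to build a string of more than a million characters to cross the real limit. As a rough sketch of a cheaper check, the same guard can be exercised against a stand-in pipeline with a tiny limit. Everything here is hypothetical and not part of the commit: `FakeNlp`, `process`, `ListHandler`, and the logger name `truncation_sketch` are illustrative names, and the stand-in only imitates spaCy's length check by raising a plain `ValueError`.

```python
import logging

logger = logging.getLogger("truncation_sketch")  # hypothetical logger name


class FakeNlp:
    """Stand-in for a spaCy pipeline; it only imitates the length check."""

    max_length = 50  # tiny limit so the test input stays small

    def __call__(self, text: str) -> str:
        if len(text) > self.max_length:
            raise ValueError("text exceeds max_length")
        return text


def process(text: str, nlp: FakeNlp) -> str:
    """Apply the same guard as the diff, with the pipeline passed in explicitly."""
    text = str(text)
    if len(text) > nlp.max_length:
        logger.warning(
            "Input text of length %d exceeds spaCy max_length=%d; truncating.",
            len(text),
            nlp.max_length,
        )
        # Look for a space in the last 256 characters of the budget.
        cut = text.rfind(" ", max(0, nlp.max_length - 256), nlp.max_length)
        text = text[: cut if cut != -1 else nlp.max_length]
    return nlp(text)


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list for inspection."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

# 150 characters, well over the fake limit of 50.
result = process("word " * 30, FakeNlp())
```

With this input the cut lands on the space at index 49, so `result` is 49 characters long and ends on a whole word, and `handler.messages` holds the single warning.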

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.22.28"  # pragma: no cover
+__version__ = "0.22.29"  # pragma: no cover

unstructured/nlp/tokenize.py

Lines changed: 14 additions & 1 deletion
@@ -151,7 +151,20 @@ def _get_nlp() -> spacy.language.Language:
 def _process(text: str) -> spacy.tokens.Doc:
     """Run the spaCy pipeline once. All public functions extract what they need from the Doc."""
     # -- str() handles numpy.str_ from OCR pipelines --
-    return _get_nlp()(str(text))
+    text = str(text)
+    nlp = _get_nlp()
+    if len(text) > nlp.max_length:
+        logger.warning(
+            "Input text of length %d exceeds spaCy max_length=%d; "
+            "truncating for partition heuristics.",
+            len(text),
+            nlp.max_length,
+        )
+        # Prefer to cut at the last whitespace within the budget so we don't split a token.
+        cut = text.rfind(" ", max(0, nlp.max_length - 256), nlp.max_length)
+        truncated = text[: cut if cut != -1 else nlp.max_length]
+        return nlp(truncated)
+    return nlp(text)
 
 
 def sent_tokenize(text: str) -> List[str]:
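
To make the cut rule in `_process` easier to follow, here is a minimal standalone sketch of it with no spaCy dependency. The function name `truncate_to_limit` and the `window` parameter are hypothetical names for illustration (the commit hard-codes 256 inline); the slicing logic follows the diff above.

```python
def truncate_to_limit(text: str, max_length: int, window: int = 256) -> str:
    """Return `text` cut to at most `max_length` characters.

    Prefers to cut at the last space found in the final `window` characters
    of the budget; if there is none, falls back to a hard cut at `max_length`.
    """
    if len(text) <= max_length:
        return text
    # str.rfind returns -1 when no space lies inside [start, end).
    cut = text.rfind(" ", max(0, max_length - window), max_length)
    return text[: cut if cut != -1 else max_length]


# A space exists inside the search window, so the cut lands on a word boundary.
on_boundary = truncate_to_limit("alpha beta gamma", 12, window=8)

# No space anywhere, so the text is hard-cut at the limit.
hard_cut = truncate_to_limit("x" * 20, 12)

# Text already within the limit is returned unchanged.
unchanged = truncate_to_limit("short", 12)
```

Note that only a literal space character counts as a cut point here, as in the diff; tabs and newlines inside the window do not. The result never exceeds `max_length`, which is what keeps the subsequent `nlp(...)` call under the limit.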
