Commit bfd78b2
fix: handle text too long for spacy issue (#4353)
This PR fixes a failure when partitioning very long text whose length
exceeds `spacy`'s character limit. `spacy` is used to classify text
content. When text is too long to fit under the limit, we now truncate
it and use the truncated text to represent the full text for
classification purposes.
<!-- This is an auto-generated description by cubic. -->
---
## Summary by cubic
Prevents tokenization failures on very long inputs by truncating text
that exceeds `spacy`’s `max_length`, keeping partition/classification
stable for large documents without affecting normal cases.
- **Bug Fixes**
  - Guard `_process` for inputs over `nlp.max_length`; truncate at the
    last whitespace within budget, log a warning, and avoid `spacy`
    ValueError E088.
  - Add tests for truncation behavior and for normal processing within the
    limit.
  - Update version to `0.22.29` and changelog.
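
The truncation described above could look roughly like the following. This is a minimal sketch, not the code from this commit: the helper name `truncate_to_budget` and its signature are hypothetical, and the budget is passed in as a plain integer so the example runs without `spacy` installed.

```python
import logging

logger = logging.getLogger(__name__)


def truncate_to_budget(text: str, max_length: int) -> str:
    """Hypothetical helper: return `text` unchanged if it fits within
    `max_length` characters, otherwise cut it at the last whitespace
    inside the budget and log a warning."""
    if len(text) <= max_length:
        return text

    # Scan backwards from the end of the budget for a whitespace character.
    cut = next(
        (i for i in range(max_length - 1, -1, -1) if text[i].isspace()),
        -1,
    )
    # Assumption: if no usable whitespace exists, fall back to a hard cut.
    truncated = text[:cut] if cut > 0 else text[:max_length]

    logger.warning(
        "Text length %d exceeds limit %d; truncated to %d characters.",
        len(text),
        max_length,
        len(truncated),
    )
    return truncated
```

With `spacy`, the budget would presumably be the pipeline's `nlp.max_length`, so a caller might write `nlp(truncate_to_budget(text, nlp.max_length))`. Cutting at whitespace keeps the last token whole, which is likely friendlier to the classifier than a cut in the middle of a word.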
<sup>Written for commit 1bfdd91.</sup>
<!-- End of auto-generated description by cubic. -->
---------
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

1 parent 238657f commit bfd78b2
4 files changed
Lines changed: 44 additions & 2 deletions