
Commit a3172f8

mem: exclude unused spaCy pipeline components to reduce model memory (#4296)
Only `tok2vec`, `tagger`, and sentence splitting are used (`pos_tag` and `sent_tokenize`), so the unused components are excluded when loading `en_core_web_sm`. The shipped change excludes `ner`, `lemmatizer`, and `attribute_ruler` while keeping `parser` for accurate sentence boundary detection, saving ~7 MiB of peak memory per process. The benchmark below measures the more aggressive variant that also excludes `parser` and substitutes a lightweight `sentencizer`, which saves 12.7 MB.

## Benchmark

Measured with [memray](https://github.com/bloomberg/memray) (`memray run` + `memray stats --json`), 3 rounds × 5 texts through `pos_tag()` + `sent_tokenize()` + `word_tokenize()`, Python 3.12.

<img width="1400" alt="bench_spacy_exclude" src="https://raw.githubusercontent.com/codeflash-ai/codeflash/pr-assets/images/bench_spacy_exclude.png" />

```
spaCy en_core_web_sm — component exclusion benchmark
pos_tag + sent_tokenize + word_tokenize | 3 rounds x 5 texts | Python 3.12.12

Configuration                            Peak MB     Saved      %
----------------------------------------------------------------------
All components (default)                 202.1MB     0.0MB      0.0%
Exclude ner/parser/lemma/attr_ruler      189.3MB     12.7MB     6.3%
```
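The effect of the exclude list can be sanity-checked without loading model weights. A minimal sketch; the default component order below is an assumption about `en_core_web_sm`'s `pipe_names` in spaCy 3.x, not something taken from this diff:

```python
# Assumed default en_core_web_sm pipeline (spaCy 3.x pipe_names).
DEFAULT_PIPELINE = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]

# Components this commit excludes at load time.
SPACY_EXCLUDE = ["ner", "lemmatizer", "attribute_ruler"]


def kept_components(pipeline: list[str], exclude: list[str]) -> list[str]:
    """Return the components that survive spacy.load(..., exclude=exclude)."""
    return [name for name in pipeline if name not in exclude]


print(kept_components(DEFAULT_PIPELINE, SPACY_EXCLUDE))
# ['tok2vec', 'tagger', 'parser']
```

Excluded components are not loaded at all (their weights stay on disk), which is why the saving shows up as peak resident memory rather than just import time.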
1 parent b6cf510 commit a3172f8

3 files changed: 15 additions & 4 deletions


CHANGELOG.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -1,3 +1,8 @@
+## 0.22.11
+
+### Enhancements
+
+- **Exclude unused spaCy components**: Exclude `ner`, `lemmatizer`, and `attribute_ruler` when loading `en_core_web_sm`, keeping `parser` for accurate sentence boundaries. Saves ~7 MiB peak memory.
+
 ## 0.22.10
 ### Enhancements
 - **Repeat table headers across continuation chunks**: Add `repeat_table_headers` to basic/title chunking options and table chunking internals so leading header rows are detected once and carried forward when large tables spill across multiple chunks.
```

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-__version__ = "0.22.10" # pragma: no cover
+__version__ = "0.22.11" # pragma: no cover
```

unstructured/nlp/tokenize.py

Lines changed: 9 additions & 3 deletions
```diff
@@ -109,9 +109,15 @@ def _install_spacy_model() -> None:
     logger.info("Installed %s %s", _SPACY_MODEL_NAME, _SPACY_MODEL_VERSION)


+# Only tok2vec, tagger, and parser (sentence boundaries) are used
+# (pos_tag and sent_tokenize). Excluding the remaining components saves ~7 MiB
+# of model weights per process.
+_SPACY_EXCLUDE = ["ner", "lemmatizer", "attribute_ruler"]
+
+
 def _load_spacy_model() -> spacy.language.Language:
     try:
-        return spacy.load(_SPACY_MODEL_NAME)
+        return spacy.load(_SPACY_MODEL_NAME, exclude=_SPACY_EXCLUDE)
     except OSError:
         pass

@@ -122,13 +128,13 @@ def _load_spacy_model() -> spacy.language.Language:
     # Double-check: another process may have installed while we waited.
     importlib.invalidate_caches()
     try:
-        return spacy.load(_SPACY_MODEL_NAME)
+        return spacy.load(_SPACY_MODEL_NAME, exclude=_SPACY_EXCLUDE)
     except OSError:
         pass
     _install_spacy_model()
     importlib.invalidate_caches()
     try:
-        return spacy.load(_SPACY_MODEL_NAME)
+        return spacy.load(_SPACY_MODEL_NAME, exclude=_SPACY_EXCLUDE)
     except OSError as exc:
         raise RuntimeError(
             f"Installed {_SPACY_MODEL_NAME} but spacy.load() still failed. "
```
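The load/install/retry flow in `_load_spacy_model` can be sketched generically. The helper below is hypothetical (`load_with_install` is not part of the codebase); it mirrors the try, install, `importlib.invalidate_caches()`, retry sequence from the diff using stand-in callables instead of spaCy:

```python
import importlib


def load_with_install(load, install):
    """Try to load a model; on OSError, install it, refresh import caches, retry."""
    try:
        return load()
    except OSError:
        pass
    install()
    importlib.invalidate_caches()  # make the freshly installed package importable
    try:
        return load()
    except OSError as exc:
        raise RuntimeError("model installed but load still failed") from exc


# Simulate: the first load fails (model missing), installation fixes it.
state = {"installed": False}


def fake_load():
    if not state["installed"]:
        raise OSError("model not found")
    return "nlp"


def fake_install():
    state["installed"] = True


print(load_with_install(fake_load, fake_install))  # → nlp
```

Invalidating import caches between install and retry matters because the model package did not exist when the interpreter last scanned `sys.path`.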
