Drop nltk #645
❌ Jit Scanner failed - Our team is investigating
Jit Scanner failed - Our team has been notified and is working to resolve the issue. Please contact support if you have any questions. 💡 Need to bypass this check? Comment |
Pull request overview
This PR removes the nltk optional dependency by bundling NLTK-derived stopword lists as static package data and rewiring full-text query code paths to load stopwords from the bundled JSON instead of downloading NLTK corpora at runtime.
Changes:
- Added a bundled stopwords dataset (`redisvl/utils/stopwords.json`) and a new loader API (`redisvl/utils/stopwords.py`: `get_stopwords`).
- Updated query-building consumers to use `get_stopwords()` and removed vestigial NLTK lazy-import paths.
- Removed the `nltk` extra from packaging metadata and dropped the now-unnecessary test fixture guarding NLTK downloads; codespell skips the JSON word list.
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `redisvl/utils/stopwords.json` | New bundled stopword lists to replace NLTK corpus access. |
| `redisvl/utils/stopwords.py` | New loader that reads/caches the bundled JSON and exposes `get_stopwords(language)`. |
| `redisvl/utils/full_text_query_helper.py` | Switches stopword loading from NLTK to the bundled loader. |
| `redisvl/query/query.py` | Switches stopword loading from NLTK to the bundled loader and removes NLTK lazy-import usage. |
| `redisvl/query/aggregate.py` | Removes unused NLTK lazy-imports. |
| `tests/conftest.py` | Removes the xdist stopwords pre-download fixture (no longer needed without NLTK downloads). |
| `pyproject.toml` | Removes the `nltk` extra and drops it from `all`. |
| `.pre-commit-config.yaml` | Adds the stopwords JSON file to codespell’s skip list. |
| `uv.lock` | Removes `nltk` from the lockfile / extras metadata. |
    from redisvl.redis.utils import array_to_buffer
    from redisvl.schema.fields import VectorDataType
    from redisvl.utils.full_text_query_helper import FullTextQueryHelper
    from redisvl.utils.utils import lazy_import
Does lazy_import get used by anything else? If not, happy to gut that out too.
Good question! Just checked: apparently we use it for numpy etc. in a couple of places.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
    Raises:
        ValueError: If no stopwords are bundled for the given language.
    """
Drop NLTK dependency; serve stopwords statically
Why
RedisVL only used NLTK for one thing: default stopword lists in full-text query building. Pulling in the whole `nltk` package for that has cost us repeatedly: packaging/support friction and a recurring source of test flakiness (parallel pytest-xdist workers racing on `nltk.download()`, most recently patched in #644).
The stopword lists are small, static, and change rarely. This PR bundles them directly and removes `nltk` entirely.
What changed
- Added `redisvl/utils/stopwords.json` (33 languages, ~11k words, ~120KB), extracted once from NLTK's `stopwords` corpus so the lists are byte-identical to what NLTK returned. No behavior change for any supported language.
- `redisvl/utils/stopwords.py` exposes `get_stopwords(language)`, which lazily reads and caches the JSON via `importlib.resources`.
- `full_text_query_helper.py` and `query.py` now call `get_stopwords()` instead of lazily importing NLTK and downloading the corpus. The download/race/retry blocks are gone.
- `aggregate.py` had unused `nltk` lazy-imports (it already routes through the helper); deleted.
- Removed the `nltk` extra and its entry in the `all` extra from `pyproject.toml`.
- Removed the `ensure_nltk_stopwords` session fixture from `tests/conftest.py`. The xdist download race it guarded against no longer exists: there is no download.
- Added `stopwords.json` to codespell's `--skip` list (it's a multi-language word-list data file, not prose).
Backward compatibility
The public API is unchanged. All existing usages still work:
- `stopwords="english"` (default), `"german"`, `"french"`, etc.: any of the 33 bundled languages
- `stopwords={"custom", "set"}` / list / tuple
- `stopwords=None` (no filtering)
Error behavior is preserved: an unknown language string raises `ValueError` (now with the list of available languages), and a non-string/non-collection raises `TypeError`.
The only user-visible change: `pip install redisvl[nltk]` no longer resolves (the extra was removed). Worth a changelog note.
Verification
- `german` still yields `der`/`die`/`und`, invalid language → `ValueError`, invalid type → `TypeError`.
- `stopwords.json` ships in the built wheel (`redisvl/utils/stopwords.json` present in `redisvl-*.whl`).
Downstream follow-up (separate, not in this PR)
`redis-ai-resources` has two notebooks (`python-recipes/vector-search/01_redisvl.ipynb`, `02_hybrid_search.ipynb`) that carry `nltk` in their `%pip install` lines and one stale markdown reference. Nothing breaks (they never `import nltk`), but the `nltk` install is now dead weight. That cleanup should land after this release and bump the notebooks' `redisvl` floor to the version that bundles stopwords.
Note
Medium Risk
Touches default full-text tokenization for all string `stopwords` language codes; lists are meant to match NLTK but any drift or packaging omission of `stopwords.json` would change search behavior. Removing the `nltk` extra is a breaking install change for users who relied on it.
Overview
Removes the `nltk` optional dependency and ships default full-text stopword lists as bundled `redisvl/utils/stopwords.json` (~33 languages), loaded via new `get_stopwords()` in `redisvl/utils/stopwords.py`.
Full-text query paths (`query.py`, `full_text_query_helper.py`) now call `get_stopwords()` instead of lazy NLTK imports and runtime `nltk.download()`; unused NLTK imports in `aggregate.py` are removed.
Packaging and CI: `nltk` is dropped from `pyproject.toml` extras/`all` and `uv.lock`; codespell skips `stopwords.json`. The pytest session fixture that pre-downloaded NLTK stopwords for xdist is deleted from `tests/conftest.py`.
Public stopword API behavior for supported languages and custom sets is intended to stay the same; `pip install redisvl[nltk]` no longer applies.
Reviewed by Cursor Bugbot for commit 23a8c33.