Skip to content

Drop nltk#645

Merged
rbs333 merged 2 commits into
mainfrom
feat/rm-nltk
Jul 2, 2026
Merged

Drop nltk#645
rbs333 merged 2 commits into
mainfrom
feat/rm-nltk

Conversation

@rbs333

@rbs333 rbs333 commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Drop NLTK dependency; serve stopwords statically

Why

RedisVL only used NLTK for one thing: default stopword lists in full-text
query building. Pulling in the whole nltk package for that has cost us
repeatedly — packaging/support friction and a recurring source of test
flakiness (parallel pytest-xdist workers racing on nltk.download(), most
recently patched in #644).

The stopword lists are small, static, and change rarely. This PR bundles them
directly and removes nltk entirely.

What changed

  • New static dataredisvl/utils/stopwords.json (33 languages, ~11k
    words, ~120KB), extracted once from NLTK's stopwords corpus so the lists
    are byte-identical to what NLTK returned. No behavior change for any
    supported language.
  • New loaderredisvl/utils/stopwords.py exposes get_stopwords(language),
    which lazily reads and caches the JSON via importlib.resources.
  • Rewired consumersfull_text_query_helper.py and query.py now call
    get_stopwords() instead of lazily importing NLTK and downloading the corpus.
    The download/race/retry blocks are gone.
  • Removed vestigial importsaggregate.py had unused nltk lazy-imports
    (it already routes through the helper); deleted.
  • Dropped the dependency — removed the nltk extra and its entry in the
    all extra from pyproject.toml.
  • Simplified tests — deleted the ensure_nltk_stopwords session fixture
    from tests/conftest.py. The xdist download race it guarded against no
    longer exists: there is no download.
  • Spellcheck — added stopwords.json to codespell's --skip list (it's a
    multi-language word-list data file, not prose).

Backward compatibility

The public API is unchanged. All existing usages still work:

  • stopwords="english" (default), "german", "french", etc. — any of the
    33 bundled languages
  • stopwords={"custom", "set"} / list / tuple
  • stopwords=None (no filtering)

Error behavior is preserved: an unknown language string raises ValueError
(now with the list of available languages), and a non-string/non-collection
raises TypeError.

The only user-visible change: pip install redisvl[nltk] no longer resolves
(the extra was removed). Worth a changelog note.

Verification

  • All stopword / query / aggregation / hybrid unit tests pass (945 passed);
    german still yields der/die/und, invalid language → ValueError,
    invalid type → TypeError.
  • Confirmed stopwords.json ships in the built wheel
    (redisvl/utils/stopwords.json present in redisvl-*.whl).
  • pre-commit (format, isort, mypy, codespell) passes.

Downstream follow-up (separate, not in this PR)

redis-ai-resources has two notebooks
(python-recipes/vector-search/01_redisvl.ipynb,
02_hybrid_search.ipynb) that carry nltk in their %pip install lines and
one stale markdown reference. Nothing breaks (they never import nltk), but
the nltk install is now dead weight. That cleanup should land after this
release and bump the notebooks' redisvl floor to the version that bundles
stopwords.


Note

Medium Risk
Touches default full-text tokenization for all string stopwords language codes; lists are meant to match NLTK but any drift or packaging omission of stopwords.json would change search behavior. Removing the nltk extra is a breaking install change for users who relied on it.

Overview
Removes the nltk optional dependency and ships default full-text stopword lists as bundled redisvl/utils/stopwords.json (~33 languages), loaded via new get_stopwords() in redisvl/utils/stopwords.py.

Full-text query paths (query.py, full_text_query_helper.py) now call get_stopwords() instead of lazy NLTK imports and runtime nltk.download(); unused NLTK imports in aggregate.py are removed.

Packaging and CI: nltk is dropped from pyproject.toml extras/all and uv.lock; codespell skips stopwords.json. The pytest session fixture that pre-downloaded NLTK stopwords for xdist is deleted from tests/conftest.py.

Public stopword API behavior for supported languages and custom sets is intended to stay the same; pip install redisvl[nltk] no longer applies.

Reviewed by Cursor Bugbot for commit 23a8c33. Bugbot is set up for automated code reviews on this repo. Configure here.

@jit-ci

jit-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown

❌ Jit Scanner failed - Our team is investigating

Jit Scanner failed - Our team has been notified and is working to resolve the issue. Please contact support if you have any questions.


💡 Need to bypass this check? Comment @sera bypass to override.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes the nltk optional dependency by bundling NLTK-derived stopword lists as static package data and rewiring full-text query code paths to load stopwords from the bundled JSON instead of downloading NLTK corpora at runtime.

Changes:

  • Added a bundled stopwords dataset (redisvl/utils/stopwords.json) and a new loader API (redisvl/utils/stopwords.py:get_stopwords).
  • Updated query-building consumers to use get_stopwords() and removed vestigial NLTK lazy-import paths.
  • Removed the nltk extra from packaging metadata and dropped the now-unnecessary test fixture guarding NLTK downloads; codespell skips the JSON word list.

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
redisvl/utils/stopwords.json New bundled stopword lists to replace NLTK corpus access.
redisvl/utils/stopwords.py New loader that reads/caches the bundled JSON and exposes get_stopwords(language).
redisvl/utils/full_text_query_helper.py Switches stopword loading from NLTK to the bundled loader.
redisvl/query/query.py Switches stopword loading from NLTK to the bundled loader and removes NLTK lazy-import usage.
redisvl/query/aggregate.py Removes unused NLTK lazy-imports.
tests/conftest.py Removes the xdist stopwords pre-download fixture (no longer needed without NLTK downloads).
pyproject.toml Removes the nltk extra and drops it from all.
.pre-commit-config.yaml Adds the stopwords JSON file to codespell’s skip list.
uv.lock Removes nltk from the lockfile / extras metadata.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread redisvl/utils/stopwords.py

@vishal-bala vishal-bala left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

from redisvl.redis.utils import array_to_buffer
from redisvl.schema.fields import VectorDataType
from redisvl.utils.full_text_query_helper import FullTextQueryHelper
from redisvl.utils.utils import lazy_import

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does lazy_import get used by anything else? If not, happy to gut that out too.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Just checked apparently we use it for numpy etc. in a couple places

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings July 2, 2026 14:18
@jit-ci

jit-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown

❌ Jit Scanner failed - Our team is investigating

Jit Scanner failed - Our team has been notified and is working to resolve the issue. Please contact support if you have any questions.


💡 Need to bypass this check? Comment @sera bypass to override.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.

Comment on lines +22 to +24
Raises:
ValueError: If no stopwords are bundled for the given language.
"""
@rbs333 rbs333 merged commit d026675 into main Jul 2, 2026
144 of 152 checks passed
@rbs333 rbs333 deleted the feat/rm-nltk branch July 2, 2026 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants