Skip to content

Make pandas an optional dependency#1831

Merged
MaxHalford merged 7 commits into
online-ml:mainfrom
zeel2104:remove-hard-pandas-dependency
May 28, 2026
Merged

Make pandas an optional dependency#1831
MaxHalford merged 7 commits into
online-ml:mainfrom
zeel2104:remove-hard-pandas-dependency

Conversation

@zeel2104
Copy link
Copy Markdown
Contributor

Summary

This PR removes pandas as a hard dependency.

Changes made:

  • move pandas from core dependencies to an optional extra
  • lazy-load pandas in batch-related code paths
  • raise a clear ImportError when learn_many, predict_many, transform_many, and related pandas-dependent operations are used without pandas
  • remove unnecessary internal pandas usage in places where it was not required
  • add tests covering behavior when pandas is not installed

Validation

Tested without pandas installed:

  • python -m pytest river\utils\test_pandas.py

Tested with pandas and scikit-learn installed:

  • python -m pytest river\compose\test_.py river\compose\test_product.py river\preprocessing\test_scale.py river\multiclass\test_ovr.py river\naive_bayes\test_naive_bayes.py river\proba\test_gaussian.py river\covariance\test_emp.py river\compat\test_sklearn.py river\anomaly\test_lof.py

Copy link
Copy Markdown
Member

@MaxHalford MaxHalford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nearly there! I just have a couple of questions.

Comment thread river/anomaly/svm.py
super().learn_one(x, y=1)

def learn_many(self, X):
pd = utils.pandas.import_pandas()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good pattern, I like it

Comment thread river/compat/sklearn_to_river.py Outdated
def predict_proba_many(self, X):
pd = utils.pandas.import_pandas()
try:
return pd.Series(self.estimator.predict_proba(self._align_df(X)), columns=self.classes)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that fixes a bug I guess, I'm surprised it didn't get caught before

Comment thread river/utils/pandas.py Outdated
Comment on lines +22 to +28
def requires_pandas(method):
@functools.wraps(method)
def wrapper(*args, **kwargs):
import_pandas()
return method(*args, **kwargs)

return wrapper
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually used?

Copy link
Copy Markdown
Member

@MaxHalford MaxHalford May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — the requires_pandas decorator was unused. Removed in 072e2f7.

Comment thread pyproject.toml
]

[project.optional-dependencies]
pandas = [
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some instructions to the installation docs to explain that mini-batch is an opt-in feature and requires pandas?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an opt-in section to the README, the installation docs, and the mini-batching recipe explaining that mini-batch methods require pip install "river[pandas]". Also added a changelog entry under packaging in docs/releases/unreleased.md.

Resolve conflicts in pyproject.toml (keep main's modernized dev deps
plus pandas) and river/anomaly/lof.py (drop unused functools/utils
imports replaced by euclidean_distance_dict on main, keep lazy pandas
import).
@MaxHalford
Copy link
Copy Markdown
Member

Hello @zeel2104! If that's ok I'm going to finish your PR, as I'd like to get it shipped for the next version :)

- Drop `base.utils.*` references in naive_bayes/{bernoulli,multinomial,complement}.py — mypy --strict does not treat re-exposed submodules as exported. Import `utils` directly instead.
- Add `TYPE_CHECKING` pandas import to river/tree/base.py so `pd.DataFrame` annotation resolves.
- Add type annotations to river/utils/pandas.py helpers and river/utils/test_pandas.py tests.
- Sync uv.lock with new optional-pandas dependency layout.
- Resolve conflict in river/cluster/textclust.py by combining the PR's
  numpy-only distance matrix (no pandas) with main's `_macro_distance`
  refactor and snake_case names.
- Address review: drop unused `requires_pandas` decorator in
  river/utils/pandas.py.
- Address review: document `pandas` as an opt-in extra in the README,
  the installation docs, the mini-batching recipe, and the unreleased
  changelog.
- Auto-fix ruff (drop unused TYPE_CHECKING pandas imports left over by
  the PR; sort imports) and add `import pandas as pd` to doctests that
  used to rely on the module-level eager pandas import
  (compose/pipeline.py, compose/union.py, multiclass/ovr.py).
- Update `MultivariateGaussian.__repr__` doctest expected output: the
  PR's pure-Python formatting drops pandas's positive-sign padding.
- Add a no-pandas CI job (.github/workflows/code-quality.yml) that
  installs the dev environment, uninstalls pandas, and runs the full
  pytest suite. A conftest hook auto-skips test modules and doctest
  sources whose text mentions `import pandas`, `>>> pd.`, or
  `fetch_openml` (the latter routes through pandas inside scikit-learn).
- Lazy-load pandas in river.neural_net.mlp so importing river no longer
  pulls pandas in. The MLP estimator still requires pandas at call time;
  it now goes through utils.pandas.import_pandas() in learn_one /
  learn_many / __call__ / predict_one / predict_many.
- Update river.checks to skip the mini-batch consistency checks when
  pandas is not installed (mini-batch methods inherently need pandas).
- river.utils.pandas.import_pandas now raises an ImportError that
  points the user at `pip install "river[pandas]"`.
- Consolidate the pandas check in river.compat.river_to_sklearn to use
  river.utils.pandas.PANDAS_INSTALLED instead of its own duplicate.
- Drop the obsolete cibuildwheel TODO in pyproject.toml — pandas is now
  optional, so the comment's premise is gone.
Without this step, dataset-using doctests print 'Downloading...' lines
that the doctest framework reports as unexpected output. Mirror the
existing 'ubuntu' job's cache + 'make download-datasets' steps.
- Cache `import_pandas()` with `functools.cache` so repeated calls in
  hot paths (mlp.py, naive_bayes, compose) collapse to a single dict
  lookup after the first call.
- Inline the install-hint string into the ImportError; the named
  constant added no reuse.
- Drop the "Mini-batch checks are skipped..." comment in checks; the
  `if utils.pandas.PANDAS_INSTALLED:` already says it.
- In compat.river_to_sklearn, import pandas directly inside the
  `PANDAS_INSTALLED` branch instead of going through `import_pandas()`.
  This is a module-load-time block already gated by the flag, so the
  indirection only obscured the intent.
@MaxHalford MaxHalford merged commit 942fb7b into online-ml:main May 28, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants