Make pandas an optional dependency#1831
Conversation
MaxHalford
left a comment
There was a problem hiding this comment.
Nearly there! I just have a couple of questions.
| super().learn_one(x, y=1) | ||
|
|
||
| def learn_many(self, X): | ||
| pd = utils.pandas.import_pandas() |
There was a problem hiding this comment.
This is a good pattern, I like it
| def predict_proba_many(self, X): | ||
| pd = utils.pandas.import_pandas() | ||
| try: | ||
| return pd.Series(self.estimator.predict_proba(self._align_df(X)), columns=self.classes) |
There was a problem hiding this comment.
Ok that fixes a bug I guess, I'm surprised it didn't get caught before
| def requires_pandas(method): | ||
| @functools.wraps(method) | ||
| def wrapper(*args, **kwargs): | ||
| import_pandas() | ||
| return method(*args, **kwargs) | ||
|
|
||
| return wrapper |
There was a problem hiding this comment.
Good catch — the requires_pandas decorator was unused. Removed in 072e2f7.
| ] | ||
|
|
||
| [project.optional-dependencies] | ||
| pandas = [ |
There was a problem hiding this comment.
Could you add some instructions to the installation docs to explain that mini-batch is an opt-in feature and requires pandas?
There was a problem hiding this comment.
Added an opt-in section to the README, the installation docs, and the mini-batching recipe explaining that mini-batch methods require pip install "river[pandas]". Also added a changelog entry under packaging in docs/releases/unreleased.md.
Resolve conflicts in pyproject.toml (keep main's modernized dev deps plus pandas) and river/anomaly/lof.py (drop unused functools/utils imports replaced by euclidean_distance_dict on main, keep lazy pandas import).
|
Hello @zeel2104! If that's ok I'm going to finish your PR, as I'd like to get it shipped for the next version :) |
- Drop `base.utils.*` references in naive_bayes/{bernoulli,multinomial,complement}.py — mypy --strict does not treat re-exposed submodules as exported. Import `utils` directly instead.
- Add `TYPE_CHECKING` pandas import to river/tree/base.py so `pd.DataFrame` annotation resolves.
- Add type annotations to river/utils/pandas.py helpers and river/utils/test_pandas.py tests.
- Sync uv.lock with new optional-pandas dependency layout.
- Resolve conflict in river/cluster/textclust.py by combining the PR's numpy-only distance matrix (no pandas) with main's `_macro_distance` refactor and snake_case names. - Address review: drop unused `requires_pandas` decorator in river/utils/pandas.py. - Address review: document `pandas` as an opt-in extra in the README, the installation docs, the mini-batching recipe, and the unreleased changelog. - Auto-fix ruff (drop unused TYPE_CHECKING pandas imports left over by the PR; sort imports) and add `import pandas as pd` to doctests that used to rely on the module-level eager pandas import (compose/pipeline.py, compose/union.py, multiclass/ovr.py). - Update `MultivariateGaussian.__repr__` doctest expected output: the PR's pure-Python formatting drops pandas's positive-sign padding.
- Add a no-pandas CI job (.github/workflows/code-quality.yml) that installs the dev environment, uninstalls pandas, and runs the full pytest suite. A conftest hook auto-skips test modules and doctest sources whose text mentions `import pandas`, `>>> pd.`, or `fetch_openml` (the latter routes through pandas inside scikit-learn). - Lazy-load pandas in river.neural_net.mlp so importing river no longer pulls pandas in. The MLP estimator still requires pandas at call time; it now goes through utils.pandas.import_pandas() in learn_one / learn_many / __call__ / predict_one / predict_many. - Update river.checks to skip the mini-batch consistency checks when pandas is not installed (mini-batch methods inherently need pandas). - river.utils.pandas.import_pandas now raises an ImportError that points the user at `pip install "river[pandas]"`. - Consolidate the pandas check in river.compat.river_to_sklearn to use river.utils.pandas.PANDAS_INSTALLED instead of its own duplicate. - Drop the obsolete cibuildwheel TODO in pyproject.toml — pandas is now optional, so the comment's premise is gone.
Without this step, dataset-using doctests print 'Downloading...' lines that the doctest framework reports as unexpected output. Mirror the existing 'ubuntu' job's cache + 'make download-datasets' steps.
- Cache `import_pandas()` with `functools.cache` so repeated calls in hot paths (mlp.py, naive_bayes, compose) collapse to a single dict lookup after the first call. - Inline the install-hint string into the ImportError; the named constant added no reuse. - Drop the "Mini-batch checks are skipped..." comment in checks; the `if utils.pandas.PANDAS_INSTALLED:` already says it. - In compat.river_to_sklearn, import pandas directly inside the `PANDAS_INSTALLED` branch instead of going through `import_pandas()`. This is a module-load-time block already gated by the flag, so the indirection only obscured the intent.
Summary
This PR removes
pandasas a hard dependency.Changes made:
pandasfrom core dependencies to an optional extrapandasin batch-related code pathsImportErrorwhenlearn_many,predict_many,transform_many, and related pandas-dependent operations are used withoutpandaspandasusage in places where it was not requiredpandasis not installedValidation
Tested without
pandasinstalled:python -m pytest river\utils\test_pandas.pyTested with
pandasandscikit-learninstalled:python -m pytest river\compose\test_.py river\compose\test_product.py river\preprocessing\test_scale.py river\multiclass\test_ovr.py river\naive_bayes\test_naive_bayes.py river\proba\test_gaussian.py river\covariance\test_emp.py river\compat\test_sklearn.py river\anomaly\test_lof.py