Commit 942fb7b

zeel2104 and MaxHalford authored
Make pandas an optional dependency (#1831)
* Make pandas an optional dependency

* Fix mypy errors uncovered by merge

  - Drop `base.utils.*` references in naive_bayes/{bernoulli,multinomial,complement}.py; mypy --strict does not treat re-exposed submodules as exported. Import `utils` directly instead.
  - Add `TYPE_CHECKING` pandas import to river/tree/base.py so `pd.DataFrame` annotation resolves.
  - Add type annotations to river/utils/pandas.py helpers and river/utils/test_pandas.py tests.
  - Sync uv.lock with new optional-pandas dependency layout.

* Verify river works without pandas

  - Add a no-pandas CI job (.github/workflows/code-quality.yml) that installs the dev environment, uninstalls pandas, and runs the full pytest suite. A conftest hook auto-skips test modules and doctest sources whose text mentions `import pandas`, `>>> pd.`, or `fetch_openml` (the latter routes through pandas inside scikit-learn).
  - Lazy-load pandas in river.neural_net.mlp so importing river no longer pulls pandas in. The MLP estimator still requires pandas at call time; it now goes through utils.pandas.import_pandas() in learn_one / learn_many / __call__ / predict_one / predict_many.
  - Update river.checks to skip the mini-batch consistency checks when pandas is not installed (mini-batch methods inherently need pandas).
  - river.utils.pandas.import_pandas now raises an ImportError that points the user at `pip install "river[pandas]"`.
  - Consolidate the pandas check in river.compat.river_to_sklearn to use river.utils.pandas.PANDAS_INSTALLED instead of its own duplicate.
  - Drop the obsolete cibuildwheel TODO in pyproject.toml; pandas is now optional, so the comment's premise is gone.

* Cache and pre-download datasets in the no-pandas CI job

  Without this step, dataset-using doctests print 'Downloading...' lines that the doctest framework reports as unexpected output. Mirror the existing 'ubuntu' job's cache + 'make download-datasets' steps.

* Simplify the optional-pandas helpers

  - Cache `import_pandas()` with `functools.cache` so repeated calls in hot paths (mlp.py, naive_bayes, compose) collapse to a single dict lookup after the first call.
  - Inline the install-hint string into the ImportError; the named constant added no reuse.
  - Drop the "Mini-batch checks are skipped..." comment in checks; the `if utils.pandas.PANDAS_INSTALLED:` already says it.
  - In compat.river_to_sklearn, import pandas directly inside the `PANDAS_INSTALLED` branch instead of going through `import_pandas()`. This is a module-load-time block already gated by the flag, so the indirection only obscured the intent.

---------

Co-authored-by: Max Halford <maxhalford25@gmail.com>
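The commit message describes a cached import helper that raises an `ImportError` carrying an install hint. The body of `river.utils.pandas.import_pandas()` is not part of this excerpt, so the following is only a minimal sketch of that idea. The names `is_installed` and `import_optional` are hypothetical, and the exact wording of River's error message is an assumption.

```python
import functools
import importlib
import importlib.util
import types


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


@functools.cache
def import_optional(module_name: str, extra: str) -> types.ModuleType:
    """Import an optional dependency, or fail with an install hint.

    functools.cache stores the returned module per argument tuple, so repeated
    calls become a dictionary lookup. A raised exception is not cached, so a
    later call retries the import.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Hypothetical message; the real River wording is not shown in this commit excerpt.
        raise ImportError(
            f"{module_name} is required for this feature. "
            f'Install it with: pip install "river[{extra}]"'
        ) from exc
```

A caller would write something like `pd = import_optional("pandas", "pandas")` at the top of a method that needs the optional dependency, which keeps module import time free of it.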
1 parent ddbf95e commit 942fb7b

40 files changed

Lines changed: 345 additions & 103 deletions

.github/workflows/code-quality.yml

Lines changed: 29 additions & 0 deletions
@@ -44,3 +44,32 @@ jobs:
       - name: Run pre-commit
         run: uv run pre-commit run --all-files

+  no-pandas:
+    name: Tests without pandas
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v5
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v7
+
+      - name: Install the project
+        run: uv sync --locked --all-extras --dev
+
+      - name: Cache datasets
+        uses: actions/cache@v4
+        with:
+          path: ~/river_data
+          key: river-data-${{ hashFiles('river/datasets/**/*.py', 'river/bandit/datasets/**/*.py', 'Makefile') }}
+          restore-keys: |
+            river-data-
+
+      - name: Download datasets
+        run: uv run make download-datasets
+
+      - name: Uninstall pandas
+        run: uv pip uninstall pandas
+
+      - name: Run tests
+        run: uv run --no-sync pytest
+
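The `no-pandas` job relies on a conftest hook that the commit message describes but this excerpt does not show. As a rough sketch of how such a hook could look, the snippet below uses pytest's `pytest_ignore_collect` hook (with the `collection_path` parameter available in pytest 7 and later). This is an assumption about the mechanism: ignoring a file drops it from collection rather than reporting it as skipped, and the real hook may work differently. The helper name `needs_pandas` is hypothetical; the three markers are the ones listed in the commit message.

```python
import importlib.util
import pathlib

# Substrings named in the commit message; matching is a plain substring test.
PANDAS_MARKERS = ("import pandas", ">>> pd.", "fetch_openml")


def needs_pandas(source: str) -> bool:
    """Return True if the source text mentions any pandas-related marker."""
    return any(marker in source for marker in PANDAS_MARKERS)


def pytest_ignore_collect(collection_path: pathlib.Path, config):
    """pytest hook: returning True removes the path from collection.

    Returning None lets pytest (and other plugins) decide as usual.
    """
    if importlib.util.find_spec("pandas") is not None:
        return None  # pandas is available, collect everything
    if collection_path.is_file() and collection_path.suffix == ".py":
        text = collection_path.read_text(encoding="utf-8", errors="ignore")
        if needs_pandas(text):
            return True
    return None
```

Placed in a `conftest.py`, the hook is picked up by name; no explicit registration is needed.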

README.md

Lines changed: 6 additions & 0 deletions
@@ -99,6 +99,12 @@ pip install river

 There are [wheels available](https://pypi.org/project/river/#files) for Linux, MacOS, and Windows. This means you most probably won't have to build River from source.

+River's core online interface (`learn_one` / `predict_one`) has no `pandas` dependency. The mini-batch interface (`learn_many`, `predict_many`, `predict_proba_many`, `transform_many`) is built on `pandas` and is opt-in:
+
+```sh
+pip install "river[pandas]"
+```
+
 You can install the latest development version from GitHub as so:

 ```sh

docs/introduction/installation.md

Lines changed: 14 additions & 0 deletions
@@ -25,4 +25,18 @@ pip install git+ssh://git@github.com/online-ml/river.git --upgrade # using SSH

 This method requires having Cython and Rust installed on your machine.

+## Mini-batch support (optional `pandas` extra)
+
+River's core online interface (`learn_one` / `predict_one`) does **not** require `pandas`. The mini-batch interface (`learn_many`, `predict_many`, `predict_proba_many`, `transform_many`) is built on top of `pandas.DataFrame` and `pandas.Series`, so `pandas` is an opt-in dependency.
+
+To install River together with `pandas`:
+
+```sh
+pip install "river[pandas]"
+# or
+uv add "river[pandas]"
+```
+
+If you call a mini-batch method without `pandas` installed, River raises an `ImportError` pointing you to this extra.
+
 Feel welcome to [open an issue on GitHub](https://github.com/online-ml/river/issues/new) if you are having any trouble.

docs/releases/unreleased.md

Lines changed: 5 additions & 0 deletions
@@ -1,5 +1,10 @@
 # Unreleased

+## packaging
+
+- **Breaking:** `pandas` is no longer a hard dependency of River. The core online interface (`learn_one` / `predict_one`) works with `pip install river` alone. The mini-batch interface (`learn_many`, `predict_many`, `predict_proba_many`, `transform_many`) still requires `pandas`; install with `pip install "river[pandas]"`. Calling a `*_many` method without `pandas` raises an `ImportError` pointing to the extra.
+- Added a `no-pandas` CI job that installs River without `pandas` and runs the full test suite. A conftest hook auto-skips test modules and doctest sources that mention `pandas` (or `fetch_openml`, which goes through pandas inside scikit-learn).
+
 ## checks

 - Added ten new global estimator checks to `river.checks`: `check_predict_one_pure` (inference methods are pure), `check_transform_one` (transform_one is exercised and returns a dict), `check_clone_is_independent` (training the original does not mutate clones), `check_predict_many_matches_predict_one` / `check_predict_proba_many_matches_predict_proba_one` / `check_transform_many_matches_transform_one` (mini-batch ↔ one-at-a-time consistency for `base.MiniBatch*` estimators), `check_get_params_matches_signature` (`_get_params()` exposes every `__init__` keyword), `check_predict_one_before_any_learn` (cold-start inference does not crash), `check_repr_roundtrips_clone` (`repr(model) == repr(model.clone())`), `check_clone_with_new_params_applies` (`clone(new_params=...)` applies the overrides), `check_classifier_tracks_seen_labels` (`predict_proba_one` includes every label observed during training), and `check_no_state_aliasing_with_input` (mutating `x` after `learn_one` does not change model state). `_yield_datasets` now also yields a dataset for plain `base.Transformer` / `base.SupervisedTransformer` estimators, which were previously skipped by the dataset-driven checks.

pyproject.toml

Lines changed: 6 additions & 2 deletions
@@ -8,11 +8,15 @@ readme = "README.md"
 license = "BSD-3-Clause"
 dependencies = [
     "scipy>=1.16,<2",
-    "pandas>=2.2,<3",
     "numpy>=2.3.4,<3",
     "altair>=5.0.0",
 ]

+[project.optional-dependencies]
+pandas = [
+    "pandas>=2.2,<3",
+]
+
 [project.urls]
 Homepage = "https://riverml.xyz/"
 Repository = "https://github.com/online-ml/river/"
@@ -30,6 +34,7 @@ dev = [
     "gymnasium>=0.29.0",
     "altair>=5.0.0",
     "mypy>=1.11.1",
+    "pandas>=2.2,<3",
     "pre-commit>=3.5.0",
     "pytest>=9.0.3",
     "ruff>=0.15.8",
@@ -82,7 +87,6 @@ default-groups = [

 [tool.cibuildwheel]
 build-frontend = "build[uv]"
-# TODO: re-enable 32-bit builds once pandas is removed as a dependency
 skip = ["*_i686", "*-win32", "*-musllinux_i686"]
 test-command = "python -c \"import river\""


river/anomaly/lof.py

Lines changed: 4 additions & 2 deletions
@@ -1,13 +1,15 @@
 from __future__ import annotations

 import copy
-
-import pandas as pd
+import typing

 from river import anomaly
 from river.neighbors.base import DistanceFunc
 from river.utils.vectordict import euclidean_distance_dict

+if typing.TYPE_CHECKING:
+    import pandas as pd
+

 def check_equal(x_list: list, y_list: list):
     """

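The `lof.py` change moves the pandas import under `typing.TYPE_CHECKING`, so the name `pd` exists for static type checkers but the module is never imported at runtime. A minimal sketch of the pattern follows. The function `count_rows` is a hypothetical example, and the sketch uses a quoted annotation where `lof.py` relies on `from __future__ import annotations`; both keep the annotation from being evaluated at runtime.

```python
import typing

if typing.TYPE_CHECKING:
    # Evaluated by static type checkers only; skipped when the program runs.
    import pandas as pd


def count_rows(X: "pd.DataFrame") -> int:
    """Return the number of rows in X.

    The annotation is a string, so Python stores it without looking up `pd`.
    """
    return len(X)
```

Because the annotation stays a string, `count_rows` can be defined and called on any sized object even when pandas is absent; only a type checker resolves `pd.DataFrame`.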
river/anomaly/svm.py

Lines changed: 2 additions & 3 deletions
@@ -1,8 +1,6 @@
 from __future__ import annotations

-import pandas as pd
-
-from river import anomaly, linear_model, optim
+from river import anomaly, linear_model, optim, utils


 class OneClassSVM(linear_model.base.GLM, anomaly.base.AnomalyDetector):
@@ -105,6 +103,7 @@ def learn_one(self, x):
         super().learn_one(x, y=1)

     def learn_many(self, X):
+        pd = utils.pandas.import_pandas()
         super().learn_many(X, y=pd.Series(True, index=X.index))

     def score_one(self, x):
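The `svm.py` change defers the pandas import to the one method that needs it, so `learn_one` keeps working without the extra. The toy class below sketches that split under stated assumptions: `RunningMean` is a hypothetical estimator, not part of River, and it inlines the import with its own error message instead of calling `utils.pandas.import_pandas()`.

```python
class RunningMean:
    """Toy estimator: learn_one is pure Python, learn_many needs pandas."""

    def __init__(self) -> None:
        self.n = 0
        self.total = 0.0

    @property
    def mean(self) -> float:
        return self.total / self.n if self.n else 0.0

    def learn_one(self, x: float) -> None:
        self.n += 1
        self.total += x

    def learn_many(self, X) -> None:
        # Deferred import: only this method requires pandas to be installed.
        try:
            import pandas as pd
        except ImportError as exc:
            # Hypothetical wording; River's actual message is not shown here.
            raise ImportError(
                'pandas is required for learn_many. '
                'Install it with: pip install "river[pandas]"'
            ) from exc
        values = pd.Series(X, dtype=float)
        self.n += len(values)
        self.total += float(values.sum())
```

Importing the class and calling `learn_one` never touches pandas; the cost and the failure mode are confined to `learn_many`.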

river/checks/__init__.py

Lines changed: 8 additions & 7 deletions
@@ -146,7 +146,7 @@ def yield_checks(model: Estimator) -> typing.Iterator[typing.Callable]:

     """

-    from river import base
+    from river import base, utils
     from river.anomaly.base import AnomalyDetector
     from river.time_series.base import Forecaster

@@ -188,12 +188,13 @@ def yield_checks(model: Estimator) -> typing.Iterator[typing.Callable]:
     if isinstance(model, (base.Transformer, base.SupervisedTransformer)):
         dataset_checks.append(common.check_transform_one)

-    if isinstance(model, (base.MiniBatchClassifier, base.MiniBatchRegressor)):
-        dataset_checks.append(common.check_predict_many_matches_predict_one)
-    if isinstance(model, base.MiniBatchClassifier):
-        dataset_checks.append(common.check_predict_proba_many_matches_predict_proba_one)
-    if isinstance(model, (base.MiniBatchTransformer, base.MiniBatchSupervisedTransformer)):
-        dataset_checks.append(common.check_transform_many_matches_transform_one)
+    if utils.pandas.PANDAS_INSTALLED:
+        if isinstance(model, (base.MiniBatchClassifier, base.MiniBatchRegressor)):
+            dataset_checks.append(common.check_predict_many_matches_predict_one)
+        if isinstance(model, base.MiniBatchClassifier):
+            dataset_checks.append(common.check_predict_proba_many_matches_predict_proba_one)
+        if isinstance(model, (base.MiniBatchTransformer, base.MiniBatchSupervisedTransformer)):
+            dataset_checks.append(common.check_transform_many_matches_transform_one)

     if hasattr(model, "debug_one"):
         dataset_checks.append(common.check_debug_one)

river/cluster/textclust.py

Lines changed: 12 additions & 10 deletions
@@ -3,7 +3,6 @@
 import math

 import numpy as np
-import pandas as pd

 from river import base

@@ -370,26 +369,28 @@ def _get_distance_matrix(self, clusters):
         ids = list(clusters.keys())

         # initialize all distances to 0
-        distances = pd.DataFrame(np.zeros((num_clusters, num_clusters)), columns=ids, index=ids)
+        distances = np.zeros((num_clusters, num_clusters))
+        positions = {cluster_id: pos for pos, cluster_id in enumerate(ids)}

         for idx, row in enumerate(ids):
             for col in ids[idx + 1 :]:
                 # use the macro-distance metric to calculate the distances to different micro-clusters
                 dist = self._macro_distance.dist(clusters[row], clusters[col], idf)
-                distances.loc[row, col] = dist
-                distances.loc[col, row] = dist
+                row_pos = positions[row]
+                col_pos = positions[col]
+                distances[row_pos, col_pos] = dist
+                distances[col_pos, row_pos] = dist

-        return distances
+        return ids, distances

     # This is a greedy implementation of single linkage agglomerative clustering. In the future we
     # will make this function more flexible
     def _agglomerative_clustering(self, micros, k):
         clusters = []

         ## calculate distance matrix
-        distm = self._get_distance_matrix(micros)
-
-        indices = distm.index
+        indices, distm = self._get_distance_matrix(micros)
+        positions = {cluster_id: pos for pos, cluster_id in enumerate(indices)}

         ## init empty clusters
         for i in range(0, len(micros)):
@@ -406,8 +407,9 @@ def _agglomerative_clustering(self, micros, k):
                     ## iterate over all clusters in sets
                     for c_i in clusters[i]:
                         for c_j in clusters[j]:
-                            if distm[c_i][c_j] < min_dist:
-                                min_dist = distm[c_i][c_j]
+                            dist = distm[positions[c_i], positions[c_j]]
+                            if dist < min_dist:
+                                min_dist = dist
                                 min_pair = (i, j)

             ## now merge
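The `textclust.py` change replaces a label-indexed `pd.DataFrame` with a plain matrix plus a dictionary that maps each cluster id to its row and column position. The sketch below illustrates that id-to-position mapping on toy data. It is not the River code: it uses nested lists instead of `np.zeros` so that it runs without dependencies, the absolute difference stands in for the macro-distance metric, and the function names are hypothetical.

```python
def pairwise_distances(points: dict[str, float]) -> tuple[list[str], list[list[float]]]:
    """Return (ids, matrix) with matrix[i][j] the distance between ids[i] and ids[j]."""
    ids = list(points)
    n = len(ids)
    distances = [[0.0] * n for _ in range(n)]
    # Labels no longer index the matrix directly, so keep an id -> position map.
    positions = {cluster_id: pos for pos, cluster_id in enumerate(ids)}

    for idx, row in enumerate(ids):
        for col in ids[idx + 1:]:
            dist = abs(points[row] - points[col])  # stand-in distance metric
            row_pos, col_pos = positions[row], positions[col]
            distances[row_pos][col_pos] = dist
            distances[col_pos][row_pos] = dist  # keep the matrix symmetric
    return ids, distances


def closest_pair(ids: list[str], distances: list[list[float]]) -> tuple[str, str]:
    """Return the pair of ids with the smallest distance, looked up by position."""
    positions = {cluster_id: pos for pos, cluster_id in enumerate(ids)}
    candidates = ((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return min(candidates, key=lambda pair: distances[positions[pair[0]]][positions[pair[1]]])
```

Returning `ids` alongside the matrix, as the diff does, lets the caller rebuild the same mapping without the matrix having to carry labels.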

river/compat/river_to_sklearn.py

Lines changed: 3 additions & 7 deletions
@@ -4,18 +4,12 @@
 import typing

 import numpy as np
-
-try:
-    import pandas as pd
-
-    PANDAS_INSTALLED = True
-except ImportError:
-    PANDAS_INSTALLED = False
 from sklearn import base as sklearn_base
 from sklearn import pipeline, preprocessing, utils
 from sklearn.utils.validation import validate_data

 from river import base, compose, stream
+from river.utils.pandas import PANDAS_INSTALLED

 __all__ = [
     "convert_river_to_sklearn",
@@ -30,6 +24,8 @@
 STREAM_METHODS: dict[type, typing.Callable] = {np.ndarray: stream.iter_array}

 if PANDAS_INSTALLED:
+    import pandas as pd
+
     STREAM_METHODS[pd.DataFrame] = stream.iter_pandas

 # Params passed to sklearn.utils.check_X_y and sklearn.utils.check_array
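The `river_to_sklearn.py` change keeps a type-to-function registry and adds the pandas entry only when the flag says pandas is available. A small stand-alone sketch of that registration pattern follows. How River computes `PANDAS_INSTALLED` is not shown in this excerpt, so the `find_spec` check is an assumption, and `iter_rows_from_list`, `iter_rows_from_frame`, and `iter_rows` are hypothetical names.

```python
import importlib.util
import typing

# Assumption: availability is probed without importing the package.
PANDAS_INSTALLED = importlib.util.find_spec("pandas") is not None


def iter_rows_from_list(rows: list) -> typing.Iterator[dict]:
    yield from rows


# Registry mapping an input type to the function that streams its rows.
STREAM_METHODS: dict[type, typing.Callable] = {list: iter_rows_from_list}

if PANDAS_INSTALLED:
    import pandas as pd

    def iter_rows_from_frame(frame: "pd.DataFrame") -> typing.Iterator[dict]:
        yield from frame.to_dict(orient="records")

    # Registered only when pandas can actually be imported.
    STREAM_METHODS[pd.DataFrame] = iter_rows_from_frame


def iter_rows(data) -> typing.Iterator[dict]:
    """Dispatch on the input type using whatever entries were registered."""
    for kind, method in STREAM_METHODS.items():
        if isinstance(data, kind):
            return method(data)
    raise TypeError(f"unsupported input type: {type(data).__name__}")
```

Since the block runs at module load time and is already guarded by the flag, a direct `import pandas as pd` inside the branch is enough, which matches the reasoning given in the commit message.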
