
Commit ddbf95e

MaxHalford and claude authored
Add ten estimator checks; fix bugs they uncovered; enable N rules (#1878)
Adds ten new global checks to `river.checks`:

- inference purity (`check_predict_one_pure`)
- transformer coverage (`check_transform_one` + a TrumpApproval dataset for plain `base.Transformer` / `base.SupervisedTransformer`)
- clone independence (`check_clone_is_independent`)
- three mini-batch consistency checks (`check_{predict,predict_proba,transform}_many_matches_*_one`)
- `_get_params` ↔ `__init__` signature parity (`check_get_params_matches_signature`)
- cold-start inference (`check_predict_one_before_any_learn`)
- `repr` round-trip through `clone` (`check_repr_roundtrips_clone`)
- `clone(new_params=...)` override propagation (`check_clone_with_new_params_applies`)
- classifier label tracking (`check_classifier_tracks_seen_labels`)
- no-aliasing-on-input (`check_no_state_aliasing_with_input`)

The existing dataset-driven checks (`check_pickling`, `check_shuffle_features_no_impact`, the three feature-disappearance checks, `check_seeding_is_idempotent`) now dispatch through `_infer` / `_learn` helpers so transformers are exercised on the same paths as classifiers, regressors, and anomaly detectors. As a result, `test_estimators.py` collects ~780 more tests (2,655 → 3,381).

Real bugs the checks surfaced and this commit fixes:

- `cluster.TextClust.__init__` overwrote its own `micro_distance` / `macro_distance` parameters with runtime instances, breaking `clone` / `repr` round-trips. Runtime instances are now stored on `_micro_distance` / `_macro_distance`.
- `preprocessing.RobustScaler.transform_one` crashed with `TypeError` before any `learn_one` (running median returned `None`).
- `neighbors.KNN{Classifier,Regressor}`, `imblearn.HardSampling*`, and `tree.mondrian.MondrianTreeRegressor` stored references to the input feature dict in their buffers/state, so callers mutating `x` after `learn_one` could corrupt the model.
- `naive_bayes.BaseNB.predict_proba_many` is mis-aligned when trained via `learn_one`. This is flagged via `_unit_test_skips` on the BaseNB subclasses for follow-up.

Lint:

- Enabled pep8-naming (`N801`, `N802`, `N804`) in ruff. `N803` and `N806` are intentionally excluded because `X: pd.DataFrame` / `A_numpy = ...` conventions are pervasive in scientific Python.
- Renamed TextClust's internal camelCase identifiers and nested helper classes to snake_case / CapWords.
- **Breaking:** `drift.binary.HDDM_A` → `drift.binary.HDDMA`, `drift.binary.HDDM_W` → `drift.binary.HDDMW`, `tree.iSOUPTreeRegressor` → `tree.ISOUPTreeRegressor`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 6a66793 commit ddbf95e

27 files changed

Lines changed: 646 additions & 188 deletions

docs/releases/unreleased.md

Lines changed: 39 additions & 0 deletions
@@ -1,5 +1,44 @@
11
# Unreleased
22

3+
## checks
4+
5+
- Added ten new global estimator checks to `river.checks`: `check_predict_one_pure` (inference methods are pure), `check_transform_one` (transform_one is exercised and returns a dict), `check_clone_is_independent` (training the original does not mutate clones), `check_predict_many_matches_predict_one` / `check_predict_proba_many_matches_predict_proba_one` / `check_transform_many_matches_transform_one` (mini-batch ↔ one-at-a-time consistency for `base.MiniBatch*` estimators), `check_get_params_matches_signature` (`_get_params()` exposes every `__init__` keyword), `check_predict_one_before_any_learn` (cold-start inference does not crash), `check_repr_roundtrips_clone` (`repr(model) == repr(model.clone())`), `check_clone_with_new_params_applies` (`clone(new_params=...)` applies the overrides), `check_classifier_tracks_seen_labels` (`predict_proba_one` includes every label observed during training), and `check_no_state_aliasing_with_input` (mutating `x` after `learn_one` does not change model state). `_yield_datasets` now also yields a dataset for plain `base.Transformer` / `base.SupervisedTransformer` estimators, which were previously skipped by the dataset-driven checks.
6+
- Refactored the existing dataset-driven checks (`check_pickling`, `check_shuffle_features_no_impact`, `check_emerging_features`, `check_disappearing_features`, `check_radically_disappearing_features`, `check_seeding_is_idempotent`) to dispatch through `_infer` / `_learn` helpers so transformers are exercised on the same code paths as classifiers, regressors, and anomaly detectors.
7+
- `checks.utils.assert_predictions_are_close` now treats two NaN floats as equivalent, so transformers that legitimately return NaN before they have observed any data (e.g. `MinMaxScaler.transform_one` on the first event) no longer trip the shuffle-invariance check.
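The NaN handling described in the last bullet can be illustrated with a small stand-alone comparison. This is only a sketch: `values_are_close` is a hypothetical helper written for this example, and the actual signature and tolerance of `assert_predictions_are_close` are not shown in this commit.

```python
import math


def values_are_close(a, b, rel_tol=1e-9):
    """Hypothetical helper: compare two outputs, treating NaN vs NaN as a match."""
    if isinstance(a, dict) and isinstance(b, dict):
        # Same keys, and every value pair must be close.
        return a.keys() == b.keys() and all(
            values_are_close(a[k], b[k], rel_tol) for k in a
        )
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True  # plain `==` would report False here
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


nan = float("nan")
both_nan = values_are_close({"x": nan}, {"x": nan})
one_nan = values_are_close({"x": nan}, {"x": 0.0})
```

Without the explicit `math.isnan` branch, two runs that both emit NaN for the same feature would be reported as different, because NaN never compares equal to itself.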
8+
9+
## cluster
10+
11+
- Fixed `cluster.TextClust` corrupting its own parameters: `__init__` was overwriting `self.micro_distance` / `self.macro_distance` with runtime distance instances, breaking `clone` and `repr` round-trips. The runtime instances are now stored on `_micro_distance` / `_macro_distance`. Internal camelCase identifiers (`clusterId`, `microToMacro`, `numClusters`, `updateMacroClusters`, `_calculateIDF`) were renamed to snake_case, and the nested helper classes `tfcontainer`, `microcluster`, `distances` were renamed to `TfContainer`, `MicroCluster`, `Distances`.
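To see why overwriting a constructor parameter breaks cloning, consider the toy classes below. They are not River code: `ToyEstimator`, `BrokenEstimator`, `Distance`, and `get_params` are hypothetical names, and the sketch assumes parameters are recovered by reading attributes named after the `__init__` arguments.

```python
import inspect


class Distance:
    """Stand-in for a runtime helper built from a string parameter."""

    def __init__(self, name):
        self.name = name


class ToyEstimator:
    def __init__(self, distance="cosine"):
        self.distance = distance             # parameter kept exactly as passed
        self._distance = Distance(distance)  # runtime instance on a private attribute

    def get_params(self):
        # Read back one attribute per __init__ argument.
        names = inspect.signature(type(self).__init__).parameters
        return {n: getattr(self, n) for n in names if n != "self"}

    def clone(self):
        return type(self)(**self.get_params())


class BrokenEstimator(ToyEstimator):
    def __init__(self, distance="cosine"):
        # Overwrites the parameter with a runtime object, as in the bug described above.
        self.distance = Distance(distance)


good_params = ToyEstimator("jaccard").clone().get_params()
bad_params = BrokenEstimator("jaccard").get_params()
```

`good_params` still holds the string `"jaccard"`, whereas `bad_params["distance"]` is a `Distance` instance, so a clone built from it no longer receives the value the user originally passed.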
12+
13+
## drift
14+
15+
- **Breaking:** Renamed `drift.binary.HDDM_A` → `drift.binary.HDDMA` and `drift.binary.HDDM_W` → `drift.binary.HDDMW` to comply with PEP-8 CapWords class naming.
16+
17+
## imblearn
18+
19+
- Fixed `imblearn.HardSamplingClassifier` / `imblearn.HardSamplingRegressor` storing references to user-supplied feature dictionaries in their buffer; the buffered triplets now hold shallow copies so callers can safely mutate `x` after `learn_one`.
20+
21+
## naive_bayes
22+
23+
- Marked `predict_many`/`predict_proba_many` checks as skipped on `BaseNB` subclasses (`MultinomialNB`, `BernoulliNB`, `ComplementNB`) via `_unit_test_skips`. `joint_log_likelihood_many`'s output is mis-aligned with the input batch when the model is trained via `learn_one` rather than `learn_many`, so the new mini-batch consistency checks fail. Tracked separately.
24+
25+
## neighbors
26+
27+
- Fixed `neighbors.KNNClassifier` / `neighbors.KNNRegressor` storing references to the input feature dicts in their search window; `learn_one` now stores a shallow copy.
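The aliasing problem and its fix can be reproduced with a minimal buffer. The classes below are illustrative stand-ins (`AliasingBuffer` and `CopyingBuffer` are hypothetical names), not the actual KNN search window.

```python
from collections import deque


class AliasingBuffer:
    """Stores the caller's dict by reference."""

    def __init__(self, size=3):
        self.window = deque(maxlen=size)

    def learn_one(self, x, y):
        self.window.append((x, y))


class CopyingBuffer(AliasingBuffer):
    """Stores a shallow copy, so later mutation of `x` is not visible."""

    def learn_one(self, x, y):
        self.window.append((dict(x), y))


x = {"a": 1.0}
aliasing, copying = AliasingBuffer(), CopyingBuffer()
aliasing.learn_one(x, True)
copying.learn_one(x, True)

x["a"] = 99.0  # caller reuses and mutates the same dict

aliased_value = aliasing.window[0][0]["a"]
copied_value = copying.window[0][0]["a"]
```

After the mutation, `aliased_value` reflects the caller's change while `copied_value` keeps the value seen at `learn_one` time. A shallow copy only protects the top-level mapping; nested mutable values would still be shared.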
28+
29+
## preprocessing
30+
31+
- Fixed `preprocessing.RobustScaler.transform_one` crashing with `TypeError` when called before any `learn_one` (the running median returned `None`); transform now passes the value through unchanged when centering statistics are not yet available.
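The pass-through behaviour can be sketched with a toy centering transformer. This is an assumption-laden illustration: `ToyRobustScaler` is a hypothetical class that only subtracts a median computed from stored values, and it does not reproduce River's running statistics or any scaling step.

```python
import statistics


class ToyRobustScaler:
    """Toy median-centering transformer; not River's implementation."""

    def __init__(self):
        self._seen = {}

    def learn_one(self, x):
        for key, value in x.items():
            self._seen.setdefault(key, []).append(value)

    def _median(self, key):
        values = self._seen.get(key)
        # No data yet for this feature: report None instead of a number.
        return statistics.median(values) if values else None

    def transform_one(self, x):
        out = {}
        for key, value in x.items():
            median = self._median(key)
            # Guard against None so cold-start calls pass the value through.
            out[key] = value if median is None else value - median
        return out


scaler = ToyRobustScaler()
cold_start = scaler.transform_one({"a": 2.0})

scaler.learn_one({"a": 1.0})
scaler.learn_one({"a": 3.0})
warmed_up = scaler.transform_one({"a": 5.0})
```

Dropping the `median is None` guard would make the cold-start call evaluate `2.0 - None`, which raises the `TypeError` described above.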
32+
33+
## tree
34+
35+
- Fixed `tree.mondrian.MondrianTreeRegressor.learn_one` storing the input feature dict by reference on `self._x`; it now stores a shallow copy so callers can safely mutate `x` after `learn_one`. Knock-on fix for `forest.AMFRegressor`.
36+
- **Breaking:** Renamed `tree.iSOUPTreeRegressor` → `tree.ISOUPTreeRegressor` to comply with PEP-8 CapWords class naming.
37+
38+
## tooling
39+
40+
- Enabled the pep8-naming ruleset (`N801`, `N802`, `N804`) in ruff so that future class, function, and `classmethod`-first-argument naming violations are caught at lint time. `N803` (argument names) and `N806` (local variable names) were intentionally left out — `X: pd.DataFrame`, `A_numpy = ...`, and similar scientific-Python conventions are pervasive in the codebase.
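As a rough illustration of what the three enabled rules look for, the snippet below uses names that should pass them, plus one argument name that only the excluded `N803` would flag. The class and function names are made up for the example.

```python
class MicroCluster:  # N801: class names use CapWords
    def __init__(self, weight=1.0):
        self.weight = weight

    def update_weight(self, factor):  # N802: function names are lowercase
        self.weight *= factor

    @classmethod
    def with_unit_weight(cls):  # N804: first argument of a classmethod is `cls`
        return cls(weight=1.0)


def column_means(X):  # uppercase `X` would only be flagged by N803, which is not enabled
    n_rows = len(X)
    return [sum(column) / n_rows for column in zip(*X)]


cluster = MicroCluster.with_unit_weight()
cluster.update_weight(0.5)
means = column_means([[1.0, 2.0], [3.0, 4.0]])
```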
41+
342
## docs
443

544
- Fixed corrupted markdown cells in the Hoeffding Trees notebook example that caused blank page titles and invisible sidebar navigation. Fixes [#1847](https://github.com/online-ml/river/issues/1847).

mkdocs.yml

Lines changed: 45 additions & 45 deletions
@@ -197,8 +197,8 @@ nav:
197197
- api/drift/binary/DDM.md
198198
- api/drift/binary/EDDM.md
199199
- api/drift/binary/FHDDM.md
200-
- api/drift/binary/HDDM-A.md
201-
- api/drift/binary/HDDM-W.md
200+
- api/drift/binary/HDDMA.md
201+
- api/drift/binary/HDDMW.md
202202
- datasets:
203203
- api/drift/datasets/AirlinePassengers.md
204204
- api/drift/datasets/Apple.md
@@ -542,7 +542,7 @@ nav:
542542
- api/tree/LASTClassifier.md
543543
- api/tree/SGTClassifier.md
544544
- api/tree/SGTRegressor.md
545-
- api/tree/iSOUPTreeRegressor.md
545+
- api/tree/ISOUPTreeRegressor.md
546546
- base:
547547
- api/tree/base/Branch.md
548548
- api/tree/base/Leaf.md
@@ -746,8 +746,8 @@ nav:
746746
- api/drift/binary/DDM.md
747747
- api/drift/binary/EDDM.md
748748
- api/drift/binary/FHDDM.md
749-
- api/drift/binary/HDDM-A.md
750-
- api/drift/binary/HDDM-W.md
749+
- api/drift/binary/HDDMA.md
750+
- api/drift/binary/HDDMW.md
751751
- datasets:
752752
- api/drift/datasets/AirlinePassengers.md
753753
- api/drift/datasets/Apple.md
@@ -1091,7 +1091,7 @@ nav:
10911091
- api/tree/LASTClassifier.md
10921092
- api/tree/SGTClassifier.md
10931093
- api/tree/SGTRegressor.md
1094-
- api/tree/iSOUPTreeRegressor.md
1094+
- api/tree/ISOUPTreeRegressor.md
10951095
- base:
10961096
- api/tree/base/Branch.md
10971097
- api/tree/base/Leaf.md
@@ -1295,8 +1295,8 @@ nav:
12951295
- api/drift/binary/DDM.md
12961296
- api/drift/binary/EDDM.md
12971297
- api/drift/binary/FHDDM.md
1298-
- api/drift/binary/HDDM-A.md
1299-
- api/drift/binary/HDDM-W.md
1298+
- api/drift/binary/HDDMA.md
1299+
- api/drift/binary/HDDMW.md
13001300
- datasets:
13011301
- api/drift/datasets/AirlinePassengers.md
13021302
- api/drift/datasets/Apple.md
@@ -1640,7 +1640,7 @@ nav:
16401640
- api/tree/LASTClassifier.md
16411641
- api/tree/SGTClassifier.md
16421642
- api/tree/SGTRegressor.md
1643-
- api/tree/iSOUPTreeRegressor.md
1643+
- api/tree/ISOUPTreeRegressor.md
16441644
- base:
16451645
- api/tree/base/Branch.md
16461646
- api/tree/base/Leaf.md
@@ -1844,8 +1844,8 @@ nav:
18441844
- api/drift/binary/DDM.md
18451845
- api/drift/binary/EDDM.md
18461846
- api/drift/binary/FHDDM.md
1847-
- api/drift/binary/HDDM-A.md
1848-
- api/drift/binary/HDDM-W.md
1847+
- api/drift/binary/HDDMA.md
1848+
- api/drift/binary/HDDMW.md
18491849
- datasets:
18501850
- api/drift/datasets/AirlinePassengers.md
18511851
- api/drift/datasets/Apple.md
@@ -2189,7 +2189,7 @@ nav:
21892189
- api/tree/LASTClassifier.md
21902190
- api/tree/SGTClassifier.md
21912191
- api/tree/SGTRegressor.md
2192-
- api/tree/iSOUPTreeRegressor.md
2192+
- api/tree/ISOUPTreeRegressor.md
21932193
- base:
21942194
- api/tree/base/Branch.md
21952195
- api/tree/base/Leaf.md
@@ -2393,8 +2393,8 @@ nav:
23932393
- api/drift/binary/DDM.md
23942394
- api/drift/binary/EDDM.md
23952395
- api/drift/binary/FHDDM.md
2396-
- api/drift/binary/HDDM-A.md
2397-
- api/drift/binary/HDDM-W.md
2396+
- api/drift/binary/HDDMA.md
2397+
- api/drift/binary/HDDMW.md
23982398
- datasets:
23992399
- api/drift/datasets/AirlinePassengers.md
24002400
- api/drift/datasets/Apple.md
@@ -2738,7 +2738,7 @@ nav:
27382738
- api/tree/LASTClassifier.md
27392739
- api/tree/SGTClassifier.md
27402740
- api/tree/SGTRegressor.md
2741-
- api/tree/iSOUPTreeRegressor.md
2741+
- api/tree/ISOUPTreeRegressor.md
27422742
- base:
27432743
- api/tree/base/Branch.md
27442744
- api/tree/base/Leaf.md
@@ -2942,8 +2942,8 @@ nav:
29422942
- api/drift/binary/DDM.md
29432943
- api/drift/binary/EDDM.md
29442944
- api/drift/binary/FHDDM.md
2945-
- api/drift/binary/HDDM-A.md
2946-
- api/drift/binary/HDDM-W.md
2945+
- api/drift/binary/HDDMA.md
2946+
- api/drift/binary/HDDMW.md
29472947
- datasets:
29482948
- api/drift/datasets/AirlinePassengers.md
29492949
- api/drift/datasets/Apple.md
@@ -3287,7 +3287,7 @@ nav:
32873287
- api/tree/LASTClassifier.md
32883288
- api/tree/SGTClassifier.md
32893289
- api/tree/SGTRegressor.md
3290-
- api/tree/iSOUPTreeRegressor.md
3290+
- api/tree/ISOUPTreeRegressor.md
32913291
- base:
32923292
- api/tree/base/Branch.md
32933293
- api/tree/base/Leaf.md
@@ -3491,8 +3491,8 @@ nav:
34913491
- api/drift/binary/DDM.md
34923492
- api/drift/binary/EDDM.md
34933493
- api/drift/binary/FHDDM.md
3494-
- api/drift/binary/HDDM-A.md
3495-
- api/drift/binary/HDDM-W.md
3494+
- api/drift/binary/HDDMA.md
3495+
- api/drift/binary/HDDMW.md
34963496
- datasets:
34973497
- api/drift/datasets/AirlinePassengers.md
34983498
- api/drift/datasets/Apple.md
@@ -3836,7 +3836,7 @@ nav:
38363836
- api/tree/LASTClassifier.md
38373837
- api/tree/SGTClassifier.md
38383838
- api/tree/SGTRegressor.md
3839-
- api/tree/iSOUPTreeRegressor.md
3839+
- api/tree/ISOUPTreeRegressor.md
38403840
- base:
38413841
- api/tree/base/Branch.md
38423842
- api/tree/base/Leaf.md
@@ -4040,8 +4040,8 @@ nav:
40404040
- api/drift/binary/DDM.md
40414041
- api/drift/binary/EDDM.md
40424042
- api/drift/binary/FHDDM.md
4043-
- api/drift/binary/HDDM-A.md
4044-
- api/drift/binary/HDDM-W.md
4043+
- api/drift/binary/HDDMA.md
4044+
- api/drift/binary/HDDMW.md
40454045
- datasets:
40464046
- api/drift/datasets/AirlinePassengers.md
40474047
- api/drift/datasets/Apple.md
@@ -4385,7 +4385,7 @@ nav:
43854385
- api/tree/LASTClassifier.md
43864386
- api/tree/SGTClassifier.md
43874387
- api/tree/SGTRegressor.md
4388-
- api/tree/iSOUPTreeRegressor.md
4388+
- api/tree/ISOUPTreeRegressor.md
43894389
- base:
43904390
- api/tree/base/Branch.md
43914391
- api/tree/base/Leaf.md
@@ -4589,8 +4589,8 @@ nav:
45894589
- api/drift/binary/DDM.md
45904590
- api/drift/binary/EDDM.md
45914591
- api/drift/binary/FHDDM.md
4592-
- api/drift/binary/HDDM-A.md
4593-
- api/drift/binary/HDDM-W.md
4592+
- api/drift/binary/HDDMA.md
4593+
- api/drift/binary/HDDMW.md
45944594
- datasets:
45954595
- api/drift/datasets/AirlinePassengers.md
45964596
- api/drift/datasets/Apple.md
@@ -4933,7 +4933,7 @@ nav:
49334933
- api/tree/LASTClassifier.md
49344934
- api/tree/SGTClassifier.md
49354935
- api/tree/SGTRegressor.md
4936-
- api/tree/iSOUPTreeRegressor.md
4936+
- api/tree/ISOUPTreeRegressor.md
49374937
- base:
49384938
- api/tree/base/Branch.md
49394939
- api/tree/base/Leaf.md
@@ -5137,8 +5137,8 @@ nav:
51375137
- api/drift/binary/DDM.md
51385138
- api/drift/binary/EDDM.md
51395139
- api/drift/binary/FHDDM.md
5140-
- api/drift/binary/HDDM-A.md
5141-
- api/drift/binary/HDDM-W.md
5140+
- api/drift/binary/HDDMA.md
5141+
- api/drift/binary/HDDMW.md
51425142
- datasets:
51435143
- api/drift/datasets/AirlinePassengers.md
51445144
- api/drift/datasets/Apple.md
@@ -5481,7 +5481,7 @@ nav:
54815481
- api/tree/LASTClassifier.md
54825482
- api/tree/SGTClassifier.md
54835483
- api/tree/SGTRegressor.md
5484-
- api/tree/iSOUPTreeRegressor.md
5484+
- api/tree/ISOUPTreeRegressor.md
54855485
- base:
54865486
- api/tree/base/Branch.md
54875487
- api/tree/base/Leaf.md
@@ -5685,8 +5685,8 @@ nav:
56855685
- api/drift/binary/DDM.md
56865686
- api/drift/binary/EDDM.md
56875687
- api/drift/binary/FHDDM.md
5688-
- api/drift/binary/HDDM-A.md
5689-
- api/drift/binary/HDDM-W.md
5688+
- api/drift/binary/HDDMA.md
5689+
- api/drift/binary/HDDMW.md
56905690
- datasets:
56915691
- api/drift/datasets/AirlinePassengers.md
56925692
- api/drift/datasets/Apple.md
@@ -6029,7 +6029,7 @@ nav:
60296029
- api/tree/LASTClassifier.md
60306030
- api/tree/SGTClassifier.md
60316031
- api/tree/SGTRegressor.md
6032-
- api/tree/iSOUPTreeRegressor.md
6032+
- api/tree/ISOUPTreeRegressor.md
60336033
- base:
60346034
- api/tree/base/Branch.md
60356035
- api/tree/base/Leaf.md
@@ -6233,8 +6233,8 @@ nav:
62336233
- api/drift/binary/DDM.md
62346234
- api/drift/binary/EDDM.md
62356235
- api/drift/binary/FHDDM.md
6236-
- api/drift/binary/HDDM-A.md
6237-
- api/drift/binary/HDDM-W.md
6236+
- api/drift/binary/HDDMA.md
6237+
- api/drift/binary/HDDMW.md
62386238
- datasets:
62396239
- api/drift/datasets/AirlinePassengers.md
62406240
- api/drift/datasets/Apple.md
@@ -6577,7 +6577,7 @@ nav:
65776577
- api/tree/LASTClassifier.md
65786578
- api/tree/SGTClassifier.md
65796579
- api/tree/SGTRegressor.md
6580-
- api/tree/iSOUPTreeRegressor.md
6580+
- api/tree/ISOUPTreeRegressor.md
65816581
- base:
65826582
- api/tree/base/Branch.md
65836583
- api/tree/base/Leaf.md
@@ -6781,8 +6781,8 @@ nav:
67816781
- api/drift/binary/DDM.md
67826782
- api/drift/binary/EDDM.md
67836783
- api/drift/binary/FHDDM.md
6784-
- api/drift/binary/HDDM-A.md
6785-
- api/drift/binary/HDDM-W.md
6784+
- api/drift/binary/HDDMA.md
6785+
- api/drift/binary/HDDMW.md
67866786
- datasets:
67876787
- api/drift/datasets/AirlinePassengers.md
67886788
- api/drift/datasets/Apple.md
@@ -7125,7 +7125,7 @@ nav:
71257125
- api/tree/LASTClassifier.md
71267126
- api/tree/SGTClassifier.md
71277127
- api/tree/SGTRegressor.md
7128-
- api/tree/iSOUPTreeRegressor.md
7128+
- api/tree/ISOUPTreeRegressor.md
71297129
- base:
71307130
- api/tree/base/Branch.md
71317131
- api/tree/base/Leaf.md
@@ -7329,8 +7329,8 @@ nav:
73297329
- api/drift/binary/DDM.md
73307330
- api/drift/binary/EDDM.md
73317331
- api/drift/binary/FHDDM.md
7332-
- api/drift/binary/HDDM-A.md
7333-
- api/drift/binary/HDDM-W.md
7332+
- api/drift/binary/HDDMA.md
7333+
- api/drift/binary/HDDMW.md
73347334
- datasets:
73357335
- api/drift/datasets/AirlinePassengers.md
73367336
- api/drift/datasets/Apple.md
@@ -7673,7 +7673,7 @@ nav:
76737673
- api/tree/LASTClassifier.md
76747674
- api/tree/SGTClassifier.md
76757675
- api/tree/SGTRegressor.md
7676-
- api/tree/iSOUPTreeRegressor.md
7676+
- api/tree/ISOUPTreeRegressor.md
76777677
- base:
76787678
- api/tree/base/Branch.md
76797679
- api/tree/base/Leaf.md
@@ -7877,8 +7877,8 @@ nav:
78777877
- api/drift/binary/DDM.md
78787878
- api/drift/binary/EDDM.md
78797879
- api/drift/binary/FHDDM.md
7880-
- api/drift/binary/HDDM-A.md
7881-
- api/drift/binary/HDDM-W.md
7880+
- api/drift/binary/HDDMA.md
7881+
- api/drift/binary/HDDMW.md
78827882
- datasets:
78837883
- api/drift/datasets/AirlinePassengers.md
78847884
- api/drift/datasets/Apple.md
@@ -8221,7 +8221,7 @@ nav:
82218221
- api/tree/LASTClassifier.md
82228222
- api/tree/SGTClassifier.md
82238223
- api/tree/SGTRegressor.md
8224-
- api/tree/iSOUPTreeRegressor.md
8224+
- api/tree/ISOUPTreeRegressor.md
82258225
- base:
82268226
- api/tree/base/Branch.md
82278227
- api/tree/base/Leaf.md

pyproject.toml

Lines changed: 8 additions & 0 deletions
@@ -151,6 +151,14 @@ select = [
151151
"UP",
152152
# isort
153153
"I",
154+
# pep8-naming: class names (N801), function/method names (N802), and the
155+
# cls/self convention on classmethods (N804). N803 (argument names) and
156+
# N806 (local variable names) are deliberately left out — `X: pd.DataFrame`
157+
# and `A_numpy = ...` are conventional in ML/scientific code and would
158+
# produce hundreds of false positives.
159+
"N801",
160+
"N802",
161+
"N804",
154162
]
155163
ignore = ["E501"]
156164
fixable = ["ALL"]

river/base/estimator.py

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@ def _tags(self) -> set[str]:
         return tags
 
     @classmethod
-    def _unit_test_params(self) -> Iterator[dict[str, Any]]:
+    def _unit_test_params(cls) -> Iterator[dict[str, Any]]:
         """Indicates which parameters to use during unit testing.
 
         Most estimators have a default value for each of their parameters. However, in some cases,
