Speed up DBSTREAM by removing deepcopy from cleanup and reclustering #1860
Merged
cProfile of the synthetic_sklearn workload showed `copy.deepcopy` of the entire micro-cluster dict in `_cleanup` accounting for ~91% of `learn_one` time, with a second `deepcopy` in `_generate_clusters_from_labels` dominating the predict path.

- `_cleanup`: collect weak ids in a first pass, then pop them in place.
- `_generate_clusters_from_labels`: group by label in a single pass and build the macro cluster from a fresh `DBSTREAMMicroCluster` with a shallow-copied center dict instead of deep-copying the first member.
- `_update`: hoist `_gaussian_neighborhood(x, center)` out of the per-feature center dict comprehension (it returns the same scalar for every dimension) and replace the nested `try/except KeyError` shared-density update with a plain `dict.get` / lazy-init.

Output is unchanged (all 10 dbstream tests pass, including the v_beta 4-decimal check in `test_dbstream_synthetic_sklearn`). On the 15k-sample synthetic-sklearn workload, `learn_one` is ~6.1x faster (0.516s -> 0.084s) and `learn_one + predict_one` is ~4.3x faster (0.872s -> 0.204s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- cProfile of the `synthetic_sklearn` workload showed `copy.deepcopy(self._micro_clusters)` inside `_cleanup` accounting for ~91% of `learn_one` time, with a second `copy.deepcopy` in `_generate_clusters_from_labels` dominating the predict path.
- The nested `try/except KeyError` shared-density update in `_update` is simplified into a plain `dict.get` / lazy-init.
- Output is unchanged: the `test_dbstream.py` tests pass, including `test_dbstream_synthetic_sklearn` (v_beta checked to 4 decimal places).
- On the 15k-sample `synthetic_sklearn` workload, `learn_one` is ~6.1× faster (0.516 s → 0.084 s) and `learn_one + predict_one` is ~4.3× faster (0.872 s → 0.204 s).

Profile evidence (before)
`_cleanup` was 91% of total `learn_one` time, entirely from `copy.deepcopy(self._micro_clusters)` cloning every micro cluster on every cleanup tick just so a few could be popped. The fix is to collect weak ids in a first pass, then pop in place.

A second `copy.deepcopy` lived inside `_generate_clusters_from_labels`, called from every `_recluster` on the predict path. It was used purely to obtain a writable copy of the first micro cluster before merging the rest into it; building a fresh `DBSTREAMMicroCluster` with a shallow-copied center dict has the same semantics at the cost of a plain dict copy. While there, the O(k·m) double loop over `range(max_label + 1)` × `cluster_labels.items()` is collapsed to a single grouping pass.

Changes
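The collect-then-pop cleanup described in the profile notes can be sketched as follows. This is a minimal illustration, not the actual `_cleanup` code from this PR: `MicroCluster`, `cleanup_in_place`, and `weight_threshold` are hypothetical stand-ins, and the real method may do additional bookkeeping that is omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class MicroCluster:
    """Hypothetical stand-in for a micro cluster: a center and a weight."""

    center: dict = field(default_factory=dict)
    weight: float = 0.0


def cleanup_in_place(micro_clusters: dict, weight_threshold: float) -> list:
    """Drop weak micro clusters without copying the whole dict.

    Popping from a dict while iterating over it raises RuntimeError, so the
    weak ids are collected first and removed in a second pass.
    """
    # Pass 1: read-only scan that collects the ids to remove.
    weak_ids = [
        mc_id
        for mc_id, mc in micro_clusters.items()
        if mc.weight < weight_threshold
    ]
    # Pass 2: mutate the dict; only the weak entries are touched.
    for mc_id in weak_ids:
        micro_clusters.pop(mc_id)
    return weak_ids


# Illustrative data: cluster 1 is below the (assumed) threshold.
clusters = {
    0: MicroCluster({"x": 0.0}, weight=5.0),
    1: MicroCluster({"x": 1.0}, weight=0.1),
    2: MicroCluster({"x": 2.0}, weight=3.0),
}
removed = cleanup_in_place(clusters, weight_threshold=1.0)
print(removed, sorted(clusters))
```

The surviving micro clusters are the same objects as before the call, so the cost scales with the number of entries scanned plus the number removed, rather than with the full size of every cluster's state.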
- `_cleanup`: collect weak ids in one pass, then pop in place. No `deepcopy`.
- `_generate_clusters_from_labels`: group by label in one pass; build the macro cluster as a fresh `DBSTREAMMicroCluster` with a shallow-copied center dict.
- `_update`: hoist `_gaussian_neighborhood(x, center)` out of the per-feature center dict comprehension (it returned the same scalar for every `j` and was being recomputed `n_features` times); replace the nested `try/except KeyError` shared-density update with a `dict.get` / lazy-init.
- Remove the now-unused `import copy`.

Test plan
- `uv run pytest river/cluster/test_dbstream.py river/cluster/dbstream.py`: 10/10 pass (incl. doctest and the synthetic_sklearn v_beta check to 4 decimals)
- On the 15k-sample `synthetic_sklearn` workload: `learn_one` ~6.1× faster, `learn_one + predict_one` ~4.3× faster

🤖 Generated with Claude Code
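As a rough sketch of why the `_update` changes should leave output unchanged, the snippet below compares a per-feature recomputation against a hoisted scalar, and a nested `try/except KeyError` update against a `dict.get` / lazy-init. Everything here is an assumption made for illustration: `gaussian_neighborhood`, `sigma`, `eta`, the center-update formula, and the "increment by one" shared-density semantics are hypothetical and are not taken from the DBSTREAM source.

```python
import math


def gaussian_neighborhood(x: dict, center: dict, sigma: float = 0.5) -> float:
    """Hypothetical stand-in: returns one scalar per (x, center) pair."""
    sq_dist = sum((x[j] - center[j]) ** 2 for j in center)
    return math.exp(-sq_dist / (2 * sigma * sigma))


def move_center_naive(x: dict, center: dict, eta: float) -> dict:
    # Recomputes the same scalar once per feature.
    return {
        j: center[j] + gaussian_neighborhood(x, center) * eta * (x[j] - center[j])
        for j in center
    }


def move_center_hoisted(x: dict, center: dict, eta: float) -> dict:
    # Computes the scalar once and reuses it for every feature.
    g = gaussian_neighborhood(x, center)
    return {j: center[j] + g * eta * (x[j] - center[j]) for j in center}


def bump_nested_try(shared: dict, i: int, j: int) -> None:
    # Nested try/except: each missing level is discovered via KeyError.
    try:
        shared[i][j] += 1.0
    except KeyError:
        try:
            shared[i][j] = 1.0  # row exists, column missing
        except KeyError:
            shared[i] = {j: 1.0}  # row missing as well


def bump_get(shared: dict, i: int, j: int) -> None:
    # dict.get with lazy initialisation of the inner row.
    row = shared.get(i)
    if row is None:
        row = shared[i] = {}
    row[j] = row.get(j, 0.0) + 1.0


x = {"a": 1.0, "b": 2.0}
center = {"a": 0.5, "b": 1.5}
same_center = move_center_naive(x, center, eta=0.1) == move_center_hoisted(
    x, center, eta=0.1
)

shared_a, shared_b = {}, {}
for i, j in [(0, 1), (0, 1), (0, 2), (3, 4)]:
    bump_nested_try(shared_a, i, j)
    bump_get(shared_b, i, j)

print(same_center, shared_a == shared_b)
```

Because the hoisted version evaluates the same floating-point expression in the same order, the two center updates in this sketch agree exactly rather than approximately, which is consistent with the 4-decimal v_beta check passing unchanged.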