
Speed up DBSTREAM by removing deepcopy from cleanup and reclustering #1860

Merged
MaxHalford merged 2 commits into main from speed-up-dbstream on May 28, 2026
Conversation

@MaxHalford
Member

Summary

  • cProfile of the synthetic_sklearn workload showed copy.deepcopy(self._micro_clusters) inside _cleanup accounting for ~91% of learn_one time, with a second copy.deepcopy in _generate_clusters_from_labels dominating the predict path.
  • This PR replaces both deepcopies, hoists the Gaussian neighborhood factor out of the per-feature center update loop (it's a single scalar, not per-dimension), and folds the nested try/except KeyError shared-density update into a plain dict.get / lazy-init.
  • Output is unchanged: all 10 test_dbstream.py tests pass, including test_dbstream_synthetic_sklearn (v_beta checked to 4 decimal places).
  • On the 15k-sample synthetic_sklearn workload, learn_one is ~6.1× faster (0.516 s → 0.084 s) and learn_one + predict_one is ~4.3× faster (0.872 s → 0.204 s).

Profile evidence (before)

       ncalls  tottime  cumtime  filename:lineno(function)
        15000    0.046    2.201  river/cluster/dbstream.py:253(_cleanup)
1920316/15000    0.981    2.147  copy.py:119(deepcopy)

_cleanup accounted for ~91% of total learn_one time, almost all of it (2.147 s of its 2.201 s cumulative time) spent in copy.deepcopy(self._micro_clusters), which cloned every micro cluster on every cleanup tick just so a few could be popped. The fix is to collect the weak ids in a first pass, then pop them in place.
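As a rough illustration of the two-pass pattern, here is a minimal sketch that uses plain dicts in place of micro clusters. The function names, the "weight" key and the weight_threshold parameter are hypothetical stand-ins; the real weakness test in river's _cleanup is not reproduced here.

```python
import copy


def cleanup_with_deepcopy(micro_clusters, weight_threshold):
    """Old pattern: iterate over a deep clone so the original can be mutated."""
    for mc_id, mc in copy.deepcopy(micro_clusters).items():
        if mc["weight"] < weight_threshold:
            micro_clusters.pop(mc_id)


def cleanup_in_place(micro_clusters, weight_threshold):
    """New pattern: read-only pass to collect ids, then pop them in place."""
    # First pass only reads, so iterating the live dict is safe.
    weak_ids = [
        mc_id
        for mc_id, mc in micro_clusters.items()
        if mc["weight"] < weight_threshold
    ]
    # Second pass mutates, but iterates the id list rather than the dict.
    for mc_id in weak_ids:
        micro_clusters.pop(mc_id)
    return weak_ids
```

Both functions leave the dict in the same state; the second one copies only the ids of the clusters being removed rather than every cluster.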

A second copy.deepcopy lived inside _generate_clusters_from_labels, which is called from every _recluster on the predict path. It was used purely to obtain a writable copy of the first micro cluster before merging the rest into it; building a fresh DBSTREAMMicroCluster with a shallow-copied center dict has the same semantics and costs no more than a plain dict copy. While there, the O(k·m) double loop over range(max_label + 1) × cluster_labels.items() is collapsed into a single grouping pass.
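The grouping pass and the shallow copy can be sketched as follows. MicroCluster is a hypothetical stand-in for DBSTREAMMicroCluster, and the weighted-mean merge_into is an assumption made for illustration, not river's actual merge rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MicroCluster:  # hypothetical stand-in for DBSTREAMMicroCluster
    center: dict
    weight: float


def merge_into(target, other):
    """Weighted-mean merge that mutates target (assumed rule, for illustration)."""
    total = target.weight + other.weight
    for key in set(target.center) | set(other.center):
        target.center[key] = (
            target.center.get(key, 0.0) * target.weight
            + other.center.get(key, 0.0) * other.weight
        ) / total
    target.weight = total


def clusters_from_labels(labels, micro_clusters):
    # Single grouping pass over the labels: O(m) instead of O(k * m).
    groups = defaultdict(list)
    for mc_id, label in labels.items():
        groups[label].append(micro_clusters[mc_id])

    macro = {}
    for label, members in groups.items():
        first = members[0]
        # Fresh object with a shallow-copied center. The values are floats
        # (immutable), so a shallow copy is enough to protect the original.
        merged = MicroCluster(center=dict(first.center), weight=first.weight)
        for mc in members[1:]:
            merge_into(merged, mc)
        macro[label] = merged
    return macro
```

Because the center values are immutable floats, mutating the copied dict during the merge cannot leak back into the source micro cluster, which is the property the deepcopy was providing.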

Changes

  • _cleanup: collect weak ids in one pass, then pop in place. No deepcopy.
  • _generate_clusters_from_labels: group by label in one pass; build the macro cluster as a fresh DBSTREAMMicroCluster with a shallow-copied center dict.
  • _update: hoist _gaussian_neighborhood(x, center) out of the per-feature center dict comprehension (it returned the same scalar for every j and was being recomputed n_features times); replace the nested try/except KeyError shared-density update with a dict.get / lazy-init.
  • Drops the now-unused import copy.
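The two _update changes can be sketched in isolation as below. The Gaussian form (sigma = radius / 3), the function names and the nested-dict layout of the shared-density store are assumptions for illustration; river's actual update also handles details such as fading that are omitted here.

```python
import math


def gaussian_neighborhood(x, center, radius):
    """Scalar weight from the Euclidean distance (assumed form)."""
    dist_sq = sum((x[j] - center[j]) ** 2 for j in center)
    sigma = radius / 3
    return math.exp(-dist_sq / (2 * sigma * sigma))


def move_center_naive(x, center, radius):
    # Recomputes the same scalar once per feature.
    return {
        j: center[j] + gaussian_neighborhood(x, center, radius) * (x[j] - center[j])
        for j in center
    }


def move_center_hoisted(x, center, radius):
    # Computes the scalar once, then reuses it for every feature.
    g = gaussian_neighborhood(x, center, radius)
    return {j: center[j] + g * (x[j] - center[j]) for j in center}


def bump_shared_density(s_t, i, j, amount=1.0):
    """dict.get with lazy row creation instead of nested try/except KeyError."""
    row = s_t.get(i)
    if row is None:
        row = s_t[i] = {}
    row[j] = row.get(j, 0.0) + amount
```

The hoisted version returns exactly the same dict as the naive one, since the scalar depends only on x and the pre-update center, not on the feature index.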

Test plan

  • uv run pytest river/cluster/test_dbstream.py river/cluster/dbstream.py — 10/10 pass (incl. doctest and the synthetic_sklearn v_beta check to 4 decimals)
  • cProfile + timing harness on 15k samples (see Summary) — learn_one ~6.1× faster, learn_one + predict_one ~4.3× faster
  • CI
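For reference, a profiling and timing harness of the kind mentioned in the test plan could look roughly like this. It is a generic sketch, not the harness used for the numbers above: profile_calls and time_calls are hypothetical helpers, and fn stands in for something like a model's learn_one.

```python
import cProfile
import io
import pstats
import time


def profile_calls(fn, samples, top=10):
    """Run fn over samples under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for sample in samples:
        fn(sample)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def time_calls(fn, samples):
    """Wall-clock time for one pass over samples, without profiler overhead."""
    start = time.perf_counter()
    for sample in samples:
        fn(sample)
    return time.perf_counter() - start
```

Timing is kept separate from profiling because cProfile adds per-call overhead, so cumulative times in the profile are not directly comparable with wall-clock measurements.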

🤖 Generated with Claude Code

cProfile of the synthetic_sklearn workload showed `copy.deepcopy` of the
entire micro-cluster dict in `_cleanup` accounting for ~91% of `learn_one`
time, with a second `deepcopy` in `_generate_clusters_from_labels` dominating
the predict path.

- `_cleanup`: collect weak ids in a first pass, then pop them in place.
- `_generate_clusters_from_labels`: group by label in a single pass and
  build the macro cluster from a fresh `DBSTREAMMicroCluster` with a
  shallow-copied center dict instead of deep-copying the first member.
- `_update`: hoist `_gaussian_neighborhood(x, center)` out of the per-feature
  center dict comprehension (it returns the same scalar for every dimension)
  and replace the nested `try/except KeyError` shared-density update with a
  plain `dict.get` / lazy-init.

Output is unchanged (all 10 dbstream tests pass, including the v_beta
4-decimal check in `test_dbstream_synthetic_sklearn`). On the 15k-sample
synthetic-sklearn workload, `learn_one` is ~6.1x faster (0.516s -> 0.084s)
and `learn_one + predict_one` is ~4.3x faster (0.872s -> 0.204s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxHalford MaxHalford merged commit a4e4710 into main May 28, 2026
1 check passed
@MaxHalford MaxHalford deleted the speed-up-dbstream branch May 28, 2026 08:55