Skip to content

Feat/clustering models#19

Merged
MechaCritter merged 9 commits into
mainfrom
feat/clustering-models
Jun 13, 2026
Merged

Feat/clustering models#19
MechaCritter merged 9 commits into
mainfrom
feat/clustering-models

Conversation

@MechaCritter

@MechaCritter MechaCritter commented Jun 13, 2026

Copy link
Copy Markdown
Owner

Related Issues

  • Currentl, sklearn objects are exposed to Public API. This is removed, and pyvisim's own clustering objects are used instead to avoid compatibility issues (currently, these are only sklearn wrappers)

Proposed Changes

This PR introduces a pyvisim.clustering sub-package and refactors the encoder API around it. It contains one breaking change.

New: pyvisim/clustering/ module (feat(clustering))

  • Added KMeans, GaussianMixtureModel, and PCA clustering models, each owning the underlying scikit-learn estimator and exposing typed getters for the attributes the encoders need (cluster_centers, weights, means, covariances, n_components, n_features_in, etc.).
  • Models are created unfitted and accept scikit-learn constructor parameters directly, decoupling encoder configuration from raw sklearn objects.

Breaking change: encoder params-at-init API (refactor(encoders)!)

  • VLADEncoder and FisherVectorEncoder no longer accept scikit-learn estimators (kmeans_model, gmm_model, pca) in their constructors.
  • Encoders now build the matching pyvisim.clustering models themselves from n_clusters/n_components and optional kmeans_params/gmm_params/pca_params dicts forwarded verbatim to the underlying sklearn estimators.
  • learn() no longer takes n_clusters/kwargs; it fits models configured at initialization. PCA is now applied (and fitted first if needed) before fitting the clustering model.
  • Default RootSIFT feature extractor moved into ImageEncoderBase.
  • Loading pre-trained KMeansWeights/GMMWeights still works; the loaded estimators are adopted by the corresponding pyvisim models.

New: encoder persistence (feat(encoders))

  • save_to_disk() / load_from_disk() serialize the fitted clustering model, PCA model, and normalization hyperparameters to a versioned .encoder file.
  • Feature extractor and similarity function are provided again at load time; dimension validation runs on restore.
  • Intended as the long-term replacement for KMeansWeights/GMMWeights loading.

Deprecation: KMeansWeights/GMMWeights loading (feat(encoders))

  • Passing weights enums to encoder constructors now emits a DeprecationWarning. The enums and the loading path will be removed in a future release.

Docs (docs)

  • Updated top-level README.md and added pyvisim/encoders/README.md documenting the new params-at-init API, kmeans_params/gmm_params/pca_params dicts, and save_to_disk/load_from_disk with .encoder files.
  • Marked KMeansWeights/GMMWeights as deprecated in docstrings and README.

How did you test it?

  • make test-types and make fmt pass on the branch.
  • Encoder construction, learn(), save_to_disk(), and load_from_disk() exercised via the existing test suite and manually against the getting_started notebook.
  • Deprecation warnings verified by passing weights enums to an encoder constructor.

Notes for the reviewer

  • The breaking change is limited to the encoder constructor signatures. Any caller that previously passed a fitted sklearn object must now pass the equivalent parameter dict and let the encoder build its own clustering model.
  • KMeansWeights/GMMWeights loading is still functional but deprecated — callers should migrate to save_to_disk/load_from_disk.
  • The pyvisim/clustering/_base_clustering.py abstract base holds the shared interface; reviewers may want to verify the getter contracts match what _base_encoder.py consumes.

Checklist

  • I have read the contributors guidelines.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • When applicable: I have reviewed the AI-generated code sections, according to contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

MechaCritter and others added 5 commits June 13, 2026 00:41
Introduce pyvisim/clustering with KMeans, GaussianMixtureModel and PCA,
models that own the underlying scikit-learn estimator and expose the
attributes the encoders need (cluster_centers, weights, means,
covariances, n_components, n_features_in, ...) through typed getters.

The models take the scikit-learn constructor parameters directly and
are created unfitted; this prepares for removing scikit-learn objects
from the encoder constructors.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…earn objects

Breaking change: VLADEncoder and FisherVectorEncoder no longer accept
scikit-learn estimators (kmeans_model/gmm_model/pca) in their
constructors. VLAD always uses K-Means and Fisher Vectors always use a
GMM, so the encoders now build the matching pyvisim.clustering models
themselves from the parameters passed at initialization:
n_clusters/n_components plus the optional kmeans_params/gmm_params and
pca_params dictionaries, whose entries are forwarded verbatim to the
underlying scikit-learn estimators.

- learn() no longer takes n_clusters/kwargs; it fits the models that
  were configured at initialization. A configured PCA is now applied
  (and fitted first if necessary) before fitting the clustering model;
  previously it was silently reset with a warning.
- All scikit-learn attribute access (cluster_centers_, weights_,
  means_, covariances_, n_features_in_, ...) goes through the
  clustering and PCA model getters.
- Dimension validation is skipped for unfitted models and applies once
  the models are fitted.
- The default RootSIFT feature extractor moved into ImageEncoderBase.
- Loading pretrained KMeansWeights/GMMWeights still works; the loaded
  estimators are adopted by the corresponding pyvisim models.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Encoders can now persist their learned state to a versioned .encoder
file (fitted clustering model, PCA model and normalization
hyperparameters) and be restored from it via the load_from_disk
classmethod. The feature extractor and similarity function are not
serialized and are provided again at load time; dimension validation
runs on restore.

This is the designated replacement for loading pretrained models via
the KMeansWeights/GMMWeights enums.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Passing the weights enums to the encoder constructors now emits a
DeprecationWarning; the enums and the loading path will be removed in a
future release in favor of save_to_disk()/load_from_disk() with
.encoder files. The enum docstrings carry the same notice.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Quickstart now configures the encoder from parameters, calls learn()
and shows save_to_disk/load_from_disk with .encoder files. Document the
kmeans_params/gmm_params/pca_params dictionaries in the encoders README
and mark KMeansWeights/GMMWeights loading as deprecated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new pyvisim.clustering subpackage (typed wrappers around scikit-learn estimators) and refactors VLAD/Fisher encoders to configure clustering/PCA at initialization rather than accepting raw sklearn objects, while also adding .encoder persistence and deprecating enum-based pretrained weight loading.

Changes:

  • Added pyvisim.clustering wrappers (KMeans, GaussianMixtureModel, PCA) with a shared fitted-state interface.
  • Refactored VLADEncoder / FisherVectorEncoder to build and fit clustering/PCA models configured via n_clusters/n_components + *_params dicts.
  • Added save_to_disk() / load_from_disk() encoder persistence and updated docs with deprecation guidance for KMeansWeights/GMMWeights.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
README.md Documents .encoder save/load and warns that enum-based pretrained weights are deprecated.
pyvisim/init.py Exposes the new clustering subpackage in the public package surface.
pyvisim/encoders/_base_encoder.py Central refactor: default RootSIFT, new clustering/PCA handling, persistence API, deprecation warning path.
pyvisim/encoders/vlad.py Updates VLAD to configure internal KMeans/PCA wrappers at init and use wrapper getters.
pyvisim/encoders/fisher_vector.py Updates Fisher Vector to configure internal GMM/PCA wrappers at init and use wrapper getters.
pyvisim/encoders/README.md Documents the new params-at-init encoder API and persistence workflow.
pyvisim/clustering/init.py Defines clustering public exports (KMeans, GaussianMixtureModel, PCA, base type).
pyvisim/clustering/_base_clustering.py Adds shared sklearn-backed base with is_fitted, fit, and adoption from sklearn estimators.
pyvisim/clustering/kmeans.py Adds KMeans wrapper exposing n_clusters, cluster_centers, predict.
pyvisim/clustering/gmm.py Adds GMM wrapper exposing weights/means/covariances/predict_proba with diag-covariance constraint.
pyvisim/clustering/pca.py Adds PCA wrapper exposing fitted n_components and transform.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pyvisim/encoders/_base_encoder.py
Comment thread pyvisim/encoders/_base_encoder.py
Comment thread pyvisim/encoders/_base_encoder.py
Comment thread pyvisim/clustering/gmm.py
@MechaCritter MechaCritter merged commit a3e6126 into main Jun 13, 2026
2 of 3 checks passed
@MechaCritter MechaCritter deleted the feat/clustering-models branch June 13, 2026 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants