Feat/clustering models (#19)

MechaCritter · claude · web-flow · commit a3e61260ecb1 · 2026-06-13T13:18:25.000+02:00
* feat(clustering): add clustering and PCA models backed by scikit-learn

Introduce pyvisim/clustering with KMeans, GaussianMixtureModel and PCA,
models that own the underlying scikit-learn estimator and expose the
attributes the encoders need (cluster_centers, weights, means,
covariances, n_components, n_features_in, ...) through typed getters.

The models take the scikit-learn constructor parameters directly and
are created unfitted; this prepares for removing scikit-learn objects
from the encoder constructors.
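
A minimal sketch of the "created unfitted" behaviour, using
scikit-learn directly rather than the pyvisim wrappers. The free
function is_fitted() and the toy data are illustrative assumptions;
check_is_fitted, NotFittedError and KMeans are the real scikit-learn
names that the wrappers in this commit rely on.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def is_fitted(estimator) -> bool:
    """Return True once the scikit-learn estimator has been fitted."""
    try:
        check_is_fitted(estimator)
    except NotFittedError:
        return False
    return True


# Constructor parameters go straight to scikit-learn; nothing is fitted yet.
estimator = KMeans(n_clusters=2, n_init=10, random_state=0)
fitted_before = is_fitted(estimator)

# Two well-separated pairs of points, so two clusters are unambiguous.
features = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
estimator.fit(features)
fitted_after = is_fitted(estimator)

# Fitted-only attributes (trailing underscore) exist only after fit().
centers_shape = estimator.cluster_centers_.shape
```

Guarding the getters this way means an unfitted model fails with a
NotFittedError instead of an AttributeError from a missing attribute.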

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(encoders)!: configure clustering and PCA via params, not sklearn objects

Breaking change: VLADEncoder and FisherVectorEncoder no longer accept
scikit-learn estimators (kmeans_model/gmm_model/pca) in their
constructors. VLAD always uses K-Means and Fisher Vectors always use a
GMM, so the encoders now build the matching pyvisim.clustering models
themselves from the parameters passed at initialization:
n_clusters/n_components plus the optional kmeans_params/gmm_params and
pca_params dictionaries, whose entries are forwarded verbatim to the
underlying scikit-learn estimators.

- learn() no longer takes n_clusters/kwargs; it fits the models that
  were configured at initialization. A configured PCA is now applied
  (and fitted first if necessary) before fitting the clustering model;
  previously it was silently reset with a warning.
- All scikit-learn attribute access (cluster_centers_, weights_,
  means_, covariances_, n_features_in_, ...) goes through the
  clustering and PCA model getters.
- Dimension validation is skipped for unfitted models and applies once
  the models are fitted.
- The default RootSIFT feature extractor moved into ImageEncoderBase.
- Loading pretrained KMeansWeights/GMMWeights still works; the loaded
  estimators are adopted by the corresponding pyvisim models.
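
The ordering described in the list above (PCA fitted first, then the
clustering model fitted on the reduced features) could look roughly
like this. It is a sketch written against scikit-learn directly, not
the encoder's learn() implementation; the variable names and the
random stand-in descriptors are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical configuration, shaped like the encoder arguments.
n_clusters = 4
kmeans_params = {"random_state": 0, "n_init": 10}
pca_params = {"n_components": 8}

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 32))  # stand-in for local descriptors

# Dictionary entries are forwarded verbatim to the scikit-learn constructors.
pca = PCA(**pca_params)
kmeans = KMeans(n_clusters=n_clusters, **kmeans_params)

# Fit the PCA first, then fit the clustering model on the reduced features.
reduced = pca.fit(descriptors).transform(descriptors)
kmeans.fit(reduced)

# A dimension check like this is only possible once both models are fitted.
assert kmeans.n_features_in_ == pca.n_components_
```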

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(encoders): add save_to_disk/load_from_disk with .encoder files

Encoders can now persist their learned state to a versioned .encoder
file (fitted clustering model, PCA model and normalization
hyperparameters) and be restored from it via the load_from_disk
classmethod. The feature extractor and similarity function are not
serialized and are provided again at load time; dimension validation
runs on restore.

This is the designated replacement for loading pretrained models via
the KMeansWeights/GMMWeights enums.
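
The on-disk layout of a .encoder file is not shown in this change, so
the following is only a sketch of how a versioned state file can be
written and checked on load. It assumes a pickle payload; FORMAT_VERSION,
COMPATIBLE_VERSIONS, save_state, load_state and the "alpha" key are all
hypothetical names, not the pyvisim API.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any

FORMAT_VERSION = 1  # hypothetical; the real version constant is not shown
COMPATIBLE_VERSIONS = {1}  # versions this loader knows how to read


def save_state(path: str, state: dict[str, Any]) -> Path:
    """Write ``state`` plus a format version to ``<path>.encoder``."""
    target = Path(path).with_suffix(".encoder")
    payload = {"format_version": FORMAT_VERSION, "state": state}
    with target.open("wb") as fh:
        pickle.dump(payload, fh)
    return target


def load_state(path: Path) -> dict[str, Any]:
    """Read a ``.encoder`` file, rejecting unknown format versions."""
    with Path(path).open("rb") as fh:
        payload = pickle.load(fh)
    version = payload.get("format_version")
    if version not in COMPATIBLE_VERSIONS:
        raise ValueError(f"Unsupported .encoder format version: {version!r}")
    return payload["state"]


with tempfile.TemporaryDirectory() as tmp:
    saved = save_state(str(Path(tmp) / "vlad"), {"alpha": 0.5})
    restored = load_state(saved)
    saved_name = saved.name
```

If pickle is used for such files, they should only be loaded from
trusted sources, since unpickling can execute arbitrary code.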

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(encoders): deprecate loading from KMeansWeights/GMMWeights

Passing the weights enums to the encoder constructors now emits a
DeprecationWarning; the enums and the loading path will be removed in a
future release in favor of save_to_disk()/load_from_disk() with
.encoder files. The enum docstrings carry the same notice.
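
For illustration, a deprecation path of this kind is typically built
on the standard warnings module. ToyWeights and build_encoder below are
hypothetical stand-ins, not the pyvisim enums or constructors.

```python
import warnings
from enum import Enum
from typing import Optional


class ToyWeights(Enum):
    """Hypothetical stand-in for the KMeansWeights/GMMWeights enums."""

    EXAMPLE = "example-weights"


def build_encoder(weights: Optional[ToyWeights] = None) -> str:
    """Return how the toy encoder was configured, warning on the legacy path."""
    if weights is not None:
        warnings.warn(
            "Loading from weights enums is deprecated; use "
            "save_to_disk()/load_from_disk() with .encoder files instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
        return f"pretrained:{weights.value}"
    return "params"


with warnings.catch_warnings(record=True) as caught:
    # DeprecationWarning is often hidden by default, so enable it explicitly.
    warnings.simplefilter("always")
    source = build_encoder(ToyWeights.EXAMPLE)

categories = [w.category for w in caught]
```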

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: update READMEs for the params-at-init encoder API

Quickstart now configures the encoder from parameters, calls learn()
and shows save_to_disk/load_from_disk with .encoder files. Document the
kmeans_params/gmm_params/pca_params dictionaries in the encoders README
and mark KMeansWeights/GMMWeights loading as deprecated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* in image encoder base, dim_reduction_factor is runtime-checked (has to be an integer greater than 0)

* n_components in PCA has to be greater than zero

* added _ENCODER_FILE_FORMAT_VERSION_COMPATIBILITY in case future updates use a different file format

* now raise an error when a covariance type other than `diag` is passed to the GMM (instead of warning and overriding it, as was done before)
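
The checks listed above can be sketched as small validation helpers.
The helper names, the choice of TypeError versus ValueError and the
explicit rejection of bool are assumptions made for this example; only
the covariance_type rule mirrors the GaussianMixtureModel code in this
diff.

```python
from typing import Any


def check_positive_int(name: str, value: Any) -> int:
    """Return ``value`` if it is an integer greater than zero, else raise."""
    # bool is a subclass of int, so it is rejected explicitly here.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{name} must be an integer, got {type(value).__name__}.")
    if value <= 0:
        raise ValueError(f"{name} must be greater than 0, got {value}.")
    return value


def check_covariance_type(params: dict[str, Any]) -> dict[str, Any]:
    """Reject any covariance_type other than 'diag' instead of overwriting it."""
    covariance_type = params.get("covariance_type", "diag")
    if covariance_type != "diag":
        raise ValueError(
            f"only covariance_type='diag' is supported, got {covariance_type!r}."
        )
    # Return a copy so the caller's dictionary is left untouched.
    return {**params, "covariance_type": "diag"}


n_components = check_positive_int("n_components", 64)
gmm_kwargs = check_covariance_type({"random_state": 0})
```

Raising early keeps a misconfigured covariance type from being
replaced behind the caller's back.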

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -57,6 +57,13 @@ similarity_score = encoder.similarity_score(image1, image2)
 
 print(f"Similarity Score: {similarity_score}")
 ```
+
+A fitted encoder can be saved to a `.encoder` file and restored later:
+
+```python
+path = encoder.save_to_disk("vlad_oxford102")  # writes vlad_oxford102.encoder
+encoder = VLADEncoder.load_from_disk(path)
+```
 You can also visit the [introduction notebook](examples/getting_started.ipynb) for more examples.
 
 I also provided various notebooks for different use-cases. Feel free to check them out, and let me know if you
@@ -111,14 +118,19 @@ For more details on the dataset, please refer to the [documentation](pyvisim/dat
 
 ## Pretrained Models
 
+> [!CAUTION]
+> **Deprecated:** Loading pretrained models via the `KMeansWeights`/`GMMWeights` enums is deprecated
+> and will be removed in a future release. Train an encoder with `learn()` and persist it with `save_to_disk()`/
+> `load_from_disk()` (`.encoder` files) instead.
+
 The following pretrained models are provided for clustering and dimensionality reduction. All clustering
 models were trained with `k=256`. The choice of `k` was made arbitrarily
 based on the paper <sup>[5](#references)</sup>, where the authors tested with `k=32`, `64`, `128`, `256`, `512`, and so on.
 Since higher values would take too long, I chose `k=256` as a balance between performance and computational cost.
 
 ### KMeans Models
 
-You can access these weights by importing `KMWeights` from the `pyvisim.encoders` module.
+You can access these weights by importing `KMeansWeights` from the `pyvisim.encoders` module.
 
 | Model Name                             | Features Extracted From | PCA Applied | Feature Dimensions |
 |----------------------------------------|-------------------------|-------------|--------------------|
diff --git a/pyvisim/__init__.py b/pyvisim/__init__.py
@@ -2,4 +2,4 @@
 PyVisim: A Python library for image similarity analysis using Image Encoders and Neural Networks.
 """
 
-__all__ = ["datasets", "encoders", "features", "eval"]
+__all__ = ["clustering", "datasets", "encoders", "features", "eval"]
diff --git a/pyvisim/clustering/__init__.py b/pyvisim/clustering/__init__.py
@@ -0,0 +1,11 @@
+from ._base_clustering import ClusteringModelBase
+from .gmm import GaussianMixtureModel
+from .kmeans import KMeans
+from .pca import PCA
+
+__all__ = [
+    "ClusteringModelBase",
+    "KMeans",
+    "GaussianMixtureModel",
+    "PCA",
+]
diff --git a/pyvisim/clustering/_base_clustering.py b/pyvisim/clustering/_base_clustering.py
@@ -0,0 +1,108 @@
+"""
+Base classes for the scikit-learn-backed models used by the image encoders.
+"""
+
+import abc
+from typing import Any, ClassVar, TypeVar
+
+import numpy as np
+from sklearn.exceptions import NotFittedError
+from sklearn.utils.validation import check_is_fitted
+
+_SklearnModelT = TypeVar("_SklearnModelT", bound="_SklearnModelBase")
+
+
+class _SklearnModelBase(abc.ABC):
+    """
+    Base class for models backed by a scikit-learn estimator.
+
+    :param model: Underlying scikit-learn estimator instance.
+    """
+
+    _sklearn_class: ClassVar[type[Any]]
+
+    def __init__(self, model: Any) -> None:
+        self._model = model
+
+    @property
+    def is_fitted(self) -> bool:
+        """Whether the underlying estimator has been fitted."""
+        try:
+            check_is_fitted(self._model)
+        except NotFittedError:
+            return False
+        return True
+
+    def _check_is_fitted(self) -> None:
+        """
+        Ensures the underlying estimator is fitted before accessing
+        fitted-only attributes.
+
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        if not self.is_fitted:
+            raise NotFittedError(
+                f"This {type(self).__name__} instance is not fitted yet. "
+                "Call 'fit' with appropriate data before using this attribute."
+            )
+
+    @property
+    def n_features_in(self) -> int:
+        """
+        Number of features the fitted estimator expects as input.
+
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        self._check_is_fitted()
+        return int(self._model.n_features_in_)
+
+    def fit(self, features: np.ndarray) -> None:
+        """
+        Fits the underlying estimator on the given feature matrix.
+
+        :param features: Feature matrix of shape (n_samples, n_features).
+        """
+        self._model.fit(features)
+
+    def _validate_sklearn_model(self) -> None:  # noqa: B027
+        """
+        Hook for subclasses to validate (or coerce) an estimator that was
+        passed in directly via :meth:`_from_sklearn`.
+        """
+
+    @classmethod
+    def _from_sklearn(cls: type[_SklearnModelT], model: Any) -> _SklearnModelT:
+        """
+        Creates a model from an existing scikit-learn estimator.
+
+        This is an internal constructor used to adopt pretrained estimators
+        (loaded from legacy weight files).
+
+        :param model: Estimator instance of the underlying scikit-learn class.
+        :return: A model backed by the given estimator.
+        :raises TypeError: If ``model`` is not an instance of the underlying class.
+        """
+        if not isinstance(model, cls._sklearn_class):
+            raise TypeError(
+                f"{cls.__name__} can only be created from instances of "
+                f"{cls._sklearn_class.__name__}, not {type(model).__name__}."
+            )
+        instance = cls.__new__(cls)
+        _SklearnModelBase.__init__(instance, model)
+        instance._validate_sklearn_model()
+        return instance
+
+    def __repr__(self) -> str:
+        return f"{type(self).__name__}(model={self._model!r})"
+
+
+class ClusteringModelBase(_SklearnModelBase):
+    """
+    Base class for clustering models.
+    """
+
+    @property
+    @abc.abstractmethod
+    def n_clusters(self) -> int:
+        """Number of clusters (or mixture components) of the estimator."""
+        raise NotImplementedError
diff --git a/pyvisim/clustering/gmm.py b/pyvisim/clustering/gmm.py
@@ -0,0 +1,91 @@
+"""Gaussian Mixture Model used by the Fisher Vector encoder."""
+
+from typing import Any
+
+import numpy as np
+from sklearn.mixture import GaussianMixture
+
+from ._base_clustering import ClusteringModelBase
+
+
+class GaussianMixtureModel(ClusteringModelBase):
+    """
+    Gaussian Mixture clustering model, used by the Fisher Vector encoder.
+    It is backed by :class:`sklearn.mixture.GaussianMixture`.
+
+    Only diagonal covariance matrices are supported: the Fisher Vector
+    computation relies on them, and training is much faster.
+
+    :param n_components: Number of mixture components.
+    :param gmm_params: Additional keyword arguments forwarded verbatim to
+        :class:`sklearn.mixture.GaussianMixture` (e.g. ``random_state``).
+    :raises ValueError: If a ``covariance_type`` other than ``"diag"`` is requested.
+    """
+
+    _sklearn_class = GaussianMixture
+
+    def __init__(self, n_components: int = 256, **gmm_params: Any) -> None:
+        covariance_type = gmm_params.pop("covariance_type", "diag")
+        if covariance_type != "diag":
+            raise ValueError(
+                f"{type(self).__name__} only supports covariance_type='diag', "
+                f"got {covariance_type!r}."
+            )
+        super().__init__(
+            GaussianMixture(
+                n_components=n_components, covariance_type="diag", **gmm_params
+            )
+        )
+
+    def _validate_sklearn_model(self) -> None:
+        if self._model.covariance_type != "diag":
+            raise ValueError(
+                f"{type(self).__name__} only supports covariance_type='diag', "
+                f"got {self._model.covariance_type!r}."
+            )
+
+    @property
+    def n_clusters(self) -> int:
+        """Number of mixture components of the GMM."""
+        return int(self._model.n_components)
+
+    @property
+    def weights(self) -> np.ndarray:
+        """
+        Mixture weights of each component, shape (n_components,).
+
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        self._check_is_fitted()
+        return np.asarray(self._model.weights_)
+
+    @property
+    def means(self) -> np.ndarray:
+        """
+        Mean of each mixture component, shape (n_components, n_features).
+
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        self._check_is_fitted()
+        return np.asarray(self._model.means_)
+
+    @property
+    def covariances(self) -> np.ndarray:
+        """
+        Diagonal covariance of each component, shape (n_components, n_features).
+
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        self._check_is_fitted()
+        return np.asarray(self._model.covariances_)
+
+    def predict_proba(self, features: np.ndarray) -> np.ndarray:
+        """
+        Evaluates the components' posterior probability for each feature vector.
+
+        :param features: Feature matrix of shape (n_samples, n_features).
+        :return: Posterior probabilities, shape (n_samples, n_components).
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        self._check_is_fitted()
+        return np.asarray(self._model.predict_proba(features))
diff --git a/pyvisim/clustering/kmeans.py b/pyvisim/clustering/kmeans.py
@@ -0,0 +1,50 @@
+"""K-Means clustering class used by the VLAD encoder."""
+
+from typing import Any
+
+import numpy as np
+from sklearn.cluster import KMeans as _SklearnKMeans
+
+from ._base_clustering import ClusteringModelBase
+
+
+class KMeans(ClusteringModelBase):
+    """
+    K-Means clustering model, used by the VLAD encoder. It is
+    backed by :class:`sklearn.cluster.KMeans`.
+
+    :param n_clusters: Number of clusters to form.
+    :param kmeans_params: Additional keyword arguments forwarded verbatim to
+        :class:`sklearn.cluster.KMeans` (e.g. ``random_state``, ``n_init``).
+    """
+
+    _sklearn_class = _SklearnKMeans
+
+    def __init__(self, n_clusters: int = 256, **kmeans_params: Any) -> None:
+        super().__init__(_SklearnKMeans(n_clusters=n_clusters, **kmeans_params))
+
+    @property
+    def n_clusters(self) -> int:
+        """Number of clusters of the K-Means model."""
+        return int(self._model.n_clusters)
+
+    @property
+    def cluster_centers(self) -> np.ndarray:
+        """
+        Coordinates of the cluster centers, shape (n_clusters, n_features).
+
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        self._check_is_fitted()
+        return np.asarray(self._model.cluster_centers_)
+
+    def predict(self, features: np.ndarray) -> np.ndarray:
+        """
+        Predicts the closest cluster for each feature vector.
+
+        :param features: Feature matrix of shape (n_samples, n_features).
+        :return: Cluster index of each sample, shape (n_samples,).
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        self._check_is_fitted()
+        return np.asarray(self._model.predict(features))
diff --git a/pyvisim/clustering/pca.py b/pyvisim/clustering/pca.py
@@ -0,0 +1,46 @@
+"""Principal Component Analysis model used by the image encoders."""
+
+from typing import Any
+
+import numpy as np
+from sklearn.decomposition import PCA as _SklearnPCA
+
+from ._base_clustering import _SklearnModelBase
+
+
+class PCA(_SklearnModelBase):
+    """
+    Principal Component Analysis model, used by the image encoders to
+    reduce the dimensionality of local features. It is backed by
+    :class:`sklearn.decomposition.PCA`.
+
+    :param n_components: Number of components to keep.
+    :param pca_params: Additional keyword arguments forwarded verbatim to
+        :class:`sklearn.decomposition.PCA` (e.g. ``whiten``, ``random_state``).
+    """
+
+    _sklearn_class = _SklearnPCA
+
+    def __init__(self, n_components: int, **pca_params: Any) -> None:
+        super().__init__(_SklearnPCA(n_components=n_components, **pca_params))
+
+    @property
+    def n_components(self) -> int:
+        """
+        Number of components of the fitted PCA.
+
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        self._check_is_fitted()
+        return int(self._model.n_components_)
+
+    def transform(self, features: np.ndarray) -> np.ndarray:
+        """
+        Projects the given features onto the principal components.
+
+        :param features: Feature matrix of shape (n_samples, n_features).
+        :return: Reduced features of shape (n_samples, n_components).
+        :raises NotFittedError: If the underlying estimator is not fitted.
+        """
+        self._check_is_fitted()
+        return np.asarray(self._model.transform(features))
diff --git a/pyvisim/encoders/README.md b/pyvisim/encoders/README.md
@@ -29,6 +29,40 @@ differ in the way they aggregate these descriptors and the underlying clustering
 After the feature extraction step, the local features are aggregated to their respective cluster centers. The final
 encoding matrix is then flattened and normalized to produce the final feature vector representation of the image.
 
+## Configuring Encoders
+
+The encoders build their clustering models internally: VLAD always uses K-Means and the Fisher Vector encoder always
+uses a Gaussian Mixture Model (both implemented in `pyvisim.clustering`).
+
+```python
+from pyvisim.encoders import VLADEncoder, FisherVectorEncoder
+
+vlad = VLADEncoder(
+    n_clusters=256,
+    kmeans_params={"random_state": 42},
+    pca_params={"n_components": 64},
+)
+fisher = FisherVectorEncoder(
+    n_components=256,
+    gmm_params={"random_state": 42},
+)
+```
+
+Calling `learn(images)` fits the configured PCA (if any) and the clustering model. A fitted encoder can be saved to
+disk and restored later:
+
+```python
+vlad.learn(images)
+path = vlad.save_to_disk("vlad")  # writes vlad.encoder
+vlad = VLADEncoder.load_from_disk(path)
+```
+
+The `.encoder` file stores the fitted clustering model, the PCA model and the normalization hyperparameters. The
+feature extractor and the similarity function are not serialized; provide them again when loading.
+
+Loading pretrained models via the `KMeansWeights`/`GMMWeights` enums is deprecated and will be removed in a future
+release.
+
 ## Similarity Metric Pipeline
 The _Pipeline_ class is designed to handle multiple encoders simultaneously to compute feature vectors. It takes
 a list of encoders (instances of the ImageEncoderBase class defined in the '_base_encoder.py' file) and a function
diff --git a/pyvisim/encoders/_base_encoder.py b/pyvisim/encoders/_base_encoder.py
diff --git a/pyvisim/encoders/fisher_vector.py b/pyvisim/encoders/fisher_vector.py
diff --git a/pyvisim/encoders/vlad.py b/pyvisim/encoders/vlad.py