updated docs of the new clustering models APIs (#21)

MechaCritter · web-flow · commit 053d2f5fdacf · 2026-06-13T14:27:56.000+02:00
diff --git a/docs/clustering/README.md b/docs/clustering/README.md
@@ -0,0 +1,70 @@
+# Clustering models
+
+The `pyvisim.clustering` module holds the small models the encoders use to build
+their visual vocabulary and to reduce dimensionality. Each one owns an underlying
+scikit-learn estimator and exposes just the attributes the encoders need through
+typed getters, so the encoders never touch scikit-learn's `*_` fitted attributes
+directly.
+
+| Object | File | Backed by | Used by |
+|--------|------|-----------|---------|
+| `KMeans` | [`kmeans.py`](../../pyvisim/clustering/kmeans.py) | `sklearn.cluster.KMeans` | `VLADEncoder` |
+| `GaussianMixtureModel` | [`gmm.py`](../../pyvisim/clustering/gmm.py) | `sklearn.mixture.GaussianMixture` | `FisherVectorEncoder` |
+| `PCA` | [`pca.py`](../../pyvisim/clustering/pca.py) | `sklearn.decomposition.PCA` | both encoders (optional) |
+
+`KMeans` and `GaussianMixtureModel` are clustering models and share
+[`ClusteringModelBase`](../../pyvisim/clustering/_base_clustering.py). `PCA` is not a
+clustering model, so it sits directly on the shared `_SklearnModelBase` instead.
+
+## How they work
+
+You create a model unfitted, passing the scikit-learn constructor parameters straight
+through:
+
+```python
+from pyvisim.clustering import KMeans, GaussianMixtureModel, PCA
+
+kmeans = KMeans(n_clusters=256, random_state=0)
+gmm = GaussianMixtureModel(n_components=256, random_state=0)
+pca = PCA(n_components=64, whiten=True)
+```
+
+`n_clusters` / `n_components` are explicit; everything else is forwarded verbatim to
+the wrapped estimator. Call `fit(features)` to train, and check `is_fitted` at any
+time. The fitted-only getters (`cluster_centers`, `weights`, `means`, `covariances`,
+`n_features_in`, ...) raise `NotFittedError` if you read them before fitting, so you
+get a clear error instead of an `AttributeError` from scikit-learn.
+
+In normal use you don't build these yourself: the encoders create the matching model
+for you from the parameters passed to their constructors (see
+[encoders/base_encoder.md](../encoders/base_encoder.md)).
+
+## What each model exposes
+
+**`KMeans`**
+- `n_clusters` — number of clusters.
+- `cluster_centers` — `(n_clusters, n_features)` centroid coordinates.
+- `predict(features)` — nearest cluster index per row. This is the hard assignment
+  VLAD uses.
+
+**`GaussianMixtureModel`**
+- `n_clusters` — number of mixture components (the `ClusteringModelBase` name for it).
+- `weights`, `means`, `covariances` — the GMM parameters the Fisher Vector gradients
+  are computed from.
+- `predict_proba(features)` — posterior probability per component, i.e. the soft
+  assignment Fisher Vectors use.
+- Diagonal covariance only. Asking for any other `covariance_type` raises `ValueError`
+  up front, because the Fisher Vector math assumes diagonal covariances (and training
+  is much faster that way).
+
+**`PCA`**
+- `n_components` — number of components of the fitted PCA.
+- `transform(features)` — projects features onto the principal components.
+
+## Adopting a pretrained scikit-learn estimator
+
+There's an internal `_from_sklearn` classmethod used to wrap an already-fitted
+estimator loaded from a legacy `KMeansWeights` / `GMMWeights` pickle. It type-checks
+the estimator (and, for the GMM, re-validates the diagonal covariance) before adopting
+it. You won't call this directly; it backs the deprecated weight-loading path described
+in [encoders/weights.md](../encoders/weights.md).
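
Note on `docs/clustering/README.md`: the fitted-only getters and the up-front diagonal-covariance check described above can be pictured with a small, self-contained sketch. It is an illustration only, not pyvisim's code: `DummyEstimator`, `MixtureWrapper`, and the local `NotFittedError` are hypothetical stand-ins, and the real wrappers delegate to scikit-learn rather than to a dummy class.

```python
from typing import Any, List, Sequence


class NotFittedError(RuntimeError):
    """Stand-in for the library's NotFittedError (hypothetical definition)."""


class DummyEstimator:
    """Stand-in for a scikit-learn estimator: `means_` only exists after fit()."""

    def __init__(self, n_components: int, covariance_type: str = "diag") -> None:
        self.n_components = n_components
        self.covariance_type = covariance_type

    def fit(self, features: Sequence[Sequence[float]]) -> "DummyEstimator":
        dim = len(features[0])
        # Placeholder parameters; a real estimator would learn these.
        self.means_ = [[0.0] * dim for _ in range(self.n_components)]
        return self


class MixtureWrapper:
    """Hypothetical wrapper that owns an estimator and guards its getters."""

    def __init__(self, n_components: int, **params: Any) -> None:
        # Reject unsupported settings before any training happens.
        covariance_type = params.setdefault("covariance_type", "diag")
        if covariance_type != "diag":
            raise ValueError(
                f"only diagonal covariances are supported, got {covariance_type!r}"
            )
        self._model = DummyEstimator(n_components, **params)

    @property
    def is_fitted(self) -> bool:
        # The trailing-underscore attribute appears only once fit() has run.
        return hasattr(self._model, "means_")

    def fit(self, features: Sequence[Sequence[float]]) -> "MixtureWrapper":
        self._model.fit(features)
        return self

    @property
    def means(self) -> List[List[float]]:
        # Raise a clear error instead of leaking an AttributeError.
        if not self.is_fitted:
            raise NotFittedError("call fit() before reading `means`")
        return self._model.means_
```

Callers of such a wrapper only ever see `is_fitted`, `fit`, and the typed getter; the estimator's `means_` attribute stays private to the wrapper.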
diff --git a/docs/encoders/README.md b/docs/encoders/README.md
@@ -12,9 +12,11 @@ vectors are used for retrieval, clustering, and classification.
 
 where `K` is the number of clusters and `D` is the local descriptor dimension.
 
-Shared machinery lives in [`ImageEncoderBase`](base_encoder.md), and pretrained
-clustering/PCA models are exposed through the enums documented in
-[weights.md](weights.md).
+Shared machinery lives in [`ImageEncoderBase`](base_encoder.md). Each encoder builds its
+aggregation model from the [`pyvisim.clustering`](../clustering/README.md) package using
+the parameters you pass at construction, then fits it in `learn`. Trained encoders are
+saved and restored with `save_to_disk` / `load_from_disk`; the older pretrained-weight
+enums in [weights.md](weights.md) still work but are deprecated.
 
 ## VLAD vs Fisher Vector
 
diff --git a/docs/encoders/base_encoder.md b/docs/encoders/base_encoder.md
@@ -12,19 +12,42 @@ A concrete encoder is the combination of:
 
 1. a **feature extractor** (`FeatureExtractorBase`),
 2. an optional **PCA** model,
-3. a **clustering model** (`KMeans` for VLAD, `GaussianMixture` for Fisher),
+3. a **clustering model** (`KMeans` for VLAD, `GaussianMixtureModel` for Fisher; both
+   from [`pyvisim.clustering`](../clustering/README.md)),
 4. a **similarity function**.
 
 The base class wires these together, validates their dimensions, and provides
-`learn`, `encode` (abstract), `generate_encoding_map`, and `similarity_score`.
+`learn`, `save_to_disk`/`load_from_disk`, `encode` (abstract), `generate_encoding_map`,
+and `similarity_score`.
 
 ## Constructing an encoder
 
-Two mutually exclusive ways to supply a clustering model:
+The encoder classes are constructed like this:
 
-- Pass `weights=` (a `KMeansWeights` / `GMMWeights` enum member). The base class loads
-  the pickled model, and if the weight name contains `PCA` it also loads the matching
-  PCA model automatically. When `weights` is given, the `clustering_model` and `pca`
-  arguments are ignored.
-- Pass an explicit `clustering_model=` (and optionally `pca=`) that you trained
-  yourself or loaded elsewhere.
+- `VLADEncoder` takes `n_clusters` plus an optional `kmeans_params` dict.
+- `FisherVectorEncoder` takes `n_components` plus an optional `gmm_params` dict.
+- Both take an optional `pca_params` dict (must include `n_components`) to add a PCA
+  step. Leave it out and no PCA is applied.
+
+Everything in `kmeans_params` / `gmm_params` / `pca_params` is forwarded verbatim to the
+underlying scikit-learn models (see the scikit-learn documentation for `KMeans`, `GaussianMixture`, and `PCA`). See
+[vlad.md](vlad.md) and [fisher_vector.md](fisher_vector.md) for the per-encoder details.
+
+## Training and persistence
+
+The models start unfitted, so you have to train before encoding:
+
+- `learn(images)` extracts features from the images, fits the configured PCA first (if
+  any), then fits the clustering model. Dimension checks against the feature extractor
+  and PCA are deferred until the models are actually fitted.
+- `save_to_disk(path)` writes the fitted clustering model, the PCA model, and the
+  normalization hyperparameters to a versioned `.encoder` file (the `.encoder` suffix is
+  added if you leave it off). It raises `NotFittedError` if you haven't called `learn`
+  yet.
+- `load_from_disk(path)` rebuilds the encoder from that file. The feature extractor and
+  similarity function aren't serialized, so you pass them again here (the feature
+  extractor defaults to `RootSIFT`); its output dimension has to match the saved PCA or
+  clustering model.
+
+This save/load round-trip is the supported way to reuse a trained encoder. The old
+`weights=` enum path still works but is deprecated; see [weights.md](weights.md).
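
Note on `docs/encoders/base_encoder.md`: the save/load contract described above (refuse to save before fitting, append the `.encoder` suffix when it is missing, store a version tag with the state) can be sketched as follows. This is a minimal sketch under stated assumptions: the diff does not say how the `.encoder` file is laid out, so the use of `pickle`, the `FORMAT_VERSION` constant, and the `save_state` / `load_state` helpers are all hypothetical.

```python
import pickle
from pathlib import Path
from typing import Any, Dict, Union

FORMAT_VERSION = 1   # hypothetical version tag stored next to the state
SUFFIX = ".encoder"  # suffix named in the docs


class NotFittedError(RuntimeError):
    """Stand-in for the library's NotFittedError (hypothetical definition)."""


def save_state(path: Union[str, Path], state: Dict[str, Any], fitted: bool) -> Path:
    """Write `state` to a versioned file, adding the suffix if it is missing."""
    if not fitted:
        raise NotFittedError("train the encoder before saving it")
    target = Path(path)
    if target.suffix != SUFFIX:
        target = target.with_name(target.name + SUFFIX)
    with target.open("wb") as fh:
        pickle.dump({"version": FORMAT_VERSION, "state": state}, fh)
    return target


def load_state(path: Union[str, Path]) -> Dict[str, Any]:
    """Read a file written by save_state and check its version tag."""
    with Path(path).open("rb") as fh:
        payload = pickle.load(fh)  # only unpickle files you trust
    if payload.get("version") != FORMAT_VERSION:
        raise ValueError(f"unsupported file version: {payload.get('version')!r}")
    return payload["state"]
```

In this sketch the state dictionary would hold the fitted clustering model, the PCA model, and the normalization settings; the feature extractor and similarity function are left out, matching the docs' statement that they are passed again at load time.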
diff --git a/docs/encoders/fisher_vector.md b/docs/encoders/fisher_vector.md
@@ -7,6 +7,29 @@ The Fisher Vector encodes an image into a vector of shape `(2 * K * D + K,)`, wh
 optional PCA). The `2 * K * D` term comes from the mean and variance gradients, and
 the `+ K` term from the mixture-weight gradients.
 
+## Constructing one
+
+Fisher Vectors always cluster with a Gaussian Mixture Model, configured through the
+encoder:
+
+```python
+from pyvisim.encoders import FisherVectorEncoder
+
+fisher = FisherVectorEncoder(
+    n_components=256,                # number of mixture components
+    gmm_params={"random_state": 0},  # forwarded to sklearn.mixture.GaussianMixture
+    pca_params={"n_components": 64}, # optional; omit for no PCA
+)
+fisher.learn(images)                 # fits the PCA (if any) then the GMM
+```
+
+`n_components` is passed directly, not inside `gmm_params` (doing both raises a
+`ValueError`). The GMM uses diagonal covariances; passing any other `covariance_type`
+in `gmm_params` raises a `ValueError`, since the Fisher Vector math assumes diagonal
+covariances. Save a fitted encoder with `fisher.save_to_disk("fisher")` and reload it
+with `FisherVectorEncoder.load_from_disk("fisher.encoder")`; see
+[base_encoder.md](base_encoder.md).
+
 ## How `encode` works
 
 For each image:
diff --git a/docs/encoders/vlad.md b/docs/encoders/vlad.md
@@ -6,6 +6,27 @@ VLAD (Vector of Locally Aggregated Descriptors) encodes an image into a vector o
 shape `(K * D,)`, where `K` is the number of KMeans clusters and `D` is the local
 descriptor dimension (after optional PCA).
 
+## Constructing one
+
+VLAD always clusters with K-Means, so you configure that model through the encoder:
+
+```python
+from pyvisim.encoders import VLADEncoder
+
+vlad = VLADEncoder(
+    n_clusters=256,                    # number of visual words
+    kmeans_params={"random_state": 0}, # forwarded to sklearn.cluster.KMeans
+    pca_params={"n_components": 64},   # optional; omit for no PCA
+)
+vlad.learn(images)                     # fits the PCA (if any) then K-Means
+```
+
+`n_clusters` is passed directly, not inside `kmeans_params` (doing both raises a
+`ValueError`). Everything else in `kmeans_params` / `pca_params` is handed straight to
+the matching scikit-learn estimator. Once fitted, save with `vlad.save_to_disk("vlad")`
+and reload with `VLADEncoder.load_from_disk("vlad.encoder")`; see
+[base_encoder.md](base_encoder.md).
+
 ## How `encode` works
 
 For each image:
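
Note on `docs/encoders/fisher_vector.md` and `docs/encoders/vlad.md`: the output shapes quoted in those pages, `(2 * K * D + K,)` for Fisher Vectors and `(K * D,)` for VLAD, can be worked through for the configuration used in the examples (256 clusters or components, PCA down to 64 dimensions). The helper names below are hypothetical and exist only for this calculation.

```python
def vlad_length(k: int, d: int) -> int:
    """One D-dimensional residual sum per cluster."""
    return k * d


def fisher_length(k: int, d: int) -> int:
    """Mean and variance gradients (2 * K * D) plus K mixture-weight gradients."""
    return 2 * k * d + k


K, D = 256, 64  # values from the examples: 256 components, PCA to 64 dims

print(vlad_length(K, D))    # 16384
print(fisher_length(K, D))  # 33024
```

For the same `K` and `D`, the Fisher Vector is slightly more than twice as long as the VLAD vector.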
diff --git a/docs/encoders/weights.md b/docs/encoders/weights.md
@@ -2,6 +2,12 @@
 
 File: [`_base_encoder.py`](../../pyvisim/encoders/_base_encoder.py)
 
+> ⚠️ **Deprecated.** Loading pretrained models through the `KMeansWeights` / `GMMWeights`
+> enums now emits a `DeprecationWarning` and will be removed in a future release. Train
+> an encoder with `learn()` and persist it with `save_to_disk()` / `load_from_disk()`
+> instead; see [base_encoder.md](base_encoder.md). The rest of this page documents the
+> legacy path while it's still around.
+
 Pretrained clustering models are exposed as enums so users can select a model by name
 instead of handling file paths. Each enum member's value is a path to a pickled
 scikit-learn model under `pyvisim/res/model_files/`, and the shared `_PretrainedModels`
@@ -21,12 +27,18 @@ each with and without PCA, for example `OXFORD102_K256_ROOTSIFT_PCA`.
 
 A weight name containing `PCA` requires the matching PCA model so descriptors are
 reduced before clustering. `_CLUSTERING_TO_PCA_MAPPING` maps each `_PCA` variant to its
-clustering weight, and `ImageEncoderBase.__init__` loads the PCA automatically when a
+clustering weight, and `_load_pretrained_weights` loads the PCA automatically when a
 `PCA` weight is selected. This is why you never reference `_PCA` directly: choosing a
 `*_PCA` clustering weight pulls in the correct PCA for you.
 
-## Adding your own weights
+The pickled scikit-learn estimators aren't used raw. Each one is adopted into the
+matching [`pyvisim.clustering`](../clustering/README.md) model via its internal
+`_from_sklearn` classmethod, which type-checks the estimator (and re-validates the
+diagonal covariance for the GMM) before wrapping it.
+
+## Using your own trained model
 
-To use a model you trained yourself, skip the enums and pass the fitted model directly
-via the encoder's `kmeans_model` / `gmm_model` (and optional `pca`) arguments. See
-[base_encoder.md](base_encoder.md).
+Don't reach for the enums for this. Configure the encoder from parameters, call
+`learn()` on your images, and save the result with `save_to_disk()`; reload it later
+with `load_from_disk()`. That's the supported replacement for the weight enums, and it
+round-trips the PCA and normalization settings too. See [base_encoder.md](base_encoder.md).
diff --git a/docs/overview.md b/docs/overview.md
@@ -14,6 +14,7 @@ pyvisim/
 ├── _errors.py           Custom exceptions
 ├── eval.py              Retrieval metrics (top-k, mAP, accuracy)
 ├── encoders/            VLAD, Fisher Vector, Pipeline, pretrained weights
+├── clustering/          KMeans, GaussianMixtureModel, PCA
 ├── features/            SIFT, RootSIFT, DeepConvFeature, Lambda
 ├── datasets/            OxfordFlowerDataset
 └── neural_networks/     Siamese network (planned, not yet implemented)
@@ -22,6 +23,8 @@ pyvisim/
 Per-area docs:
 
 - [Encoders](encoders/): how images become vectors.
+- [Clustering](clustering/): the KMeans, GMM, and PCA models the encoders build their
+  vocabulary with.
 - [Features](features/): how local descriptors are extracted from an image.
 - [Neural networks](neural_networks/): planned Siamese network.
 - [Dataset](dataset/): the bundled Oxford Flowers dataset class.
@@ -66,9 +69,10 @@ Everything is built on the two abstract base classes in
   and `(M, D)` arrays and returning an `(N, M)` matrix can be used. On assignment it
   is probed with dummy input; if it does not return the expected shape,
   fall back to a row-by-row loop. Default is cosine similarity.
-- **Pretrained weights are enums.** `KMeansWeights` and `GMMWeights` are enums whose
-  values are file paths to pickled scikit-learn models. PCA models are paired to the
-  PCA weight variants automatically. See [encoders/weights.md](encoders/weights.md).
+- **Trained encoders persist to `.encoder` files.** `save_to_disk` / `load_from_disk`
+  serialize the fitted clustering model, PCA, and normalization settings. This replaces
+  the deprecated `KMeansWeights` / `GMMWeights` enum loading path, which still works for
+  now but warns. See [encoders/weights.md](encoders/weights.md).
 
 ## Evaluation