MechaCritter · Jun 13, 2026
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -5,6 +5,8 @@ Thank you for your interest! Contributions of all kinds are welcome.
 
 ## PR TODO list - your first PR
 
+To understand how the library is structured and the technical details behind it before diving in, read the [developer documentation](docs/overview.md) and/or the docstrings of the modules and classes you are working on.
+
 Use this checklist to stay on track for your first code PR:
 
 - **Clone this repository**: see [Set up developer environment](#set-up-developer-environment) section.

diff --git a/README.md b/README.md
@@ -26,6 +26,8 @@ and the Siamese Neural Networks.
 7. [License](#license)
 8. [References](#references)
 
+For a technical deep-dive into the library internals, see the [developer documentation](docs/overview.md).
+
 ## Why `pyvisim`?
 
 `pyvisim` is designed to provide a simple and efficient way to compare images.

diff --git a/docs/dataset/README.md b/docs/dataset/README.md
@@ -0,0 +1,7 @@
+# Dataset
+
+File: [`datasets/datasets.py`](../../pyvisim/datasets/datasets.py)
+
+The module bundles one dataset class, [`OxfordFlowerDataset`](oxford_flower_dataset.md),
+used for the project's experiments and pretrained weights. It is a standard PyTorch
+`Dataset`, so it works with `DataLoader` and the rest of the Torch ecosystem.
diff --git a/docs/dataset/oxford_flower_dataset.md b/docs/dataset/oxford_flower_dataset.md
@@ -0,0 +1,29 @@
+# OxfordFlowerDataset
+
+File: [`datasets/datasets.py`](../../pyvisim/datasets/datasets.py)
+
+A PyTorch `Dataset` for the [Oxford 102 Flowers](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/)
+dataset (8189 images across 102 categories). Indexing yields a
+`(image, label, image_path)` tuple, where `image` is an RGB NumPy array.
+
+```python
+from pyvisim.datasets import OxfordFlowerDataset
+
+dataset = OxfordFlowerDataset(purpose="train")
+image, label, path = dataset[0]
+```
+
+## The swapped train/test split
+
+The constructor's `purpose` accepts `"train"`, `"validation"`, `"test"`, or a list to
+combine splits (for example `["train", "validation"]`).
+
+Note the deliberate swap: the original dataset ships 1020 training, 1020 validation, and 6149 test images.
+This class maps the original **test** ids to `train` and the original **train** ids to
+`test`, so training has the larger pool. This is more useful for fitting the clustering
+models, which benefit from more data. Keep this in mind if you compare results against
+papers that use the original split.
+
+## TODO
+
+- Implement `transform` method
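The swapped split described in `oxford_flower_dataset.md` can be pictured as a lookup from `purpose` to the original split ids. This is only a sketch under stated assumptions: `ORIGINAL_SPLITS`, `SWAPPED`, and `resolve_ids` are hypothetical names, and the id lists are tiny placeholders rather than the real split files.

```python
# Placeholder ids standing in for the original split files (not real data).
ORIGINAL_SPLITS = {
    "train": [1, 2],                # the small original split (1020 images)
    "validation": [3, 4],
    "test": [5, 6, 7, 8, 9, 10],    # the large original split (6149 images)
}

# The deliberate swap: "train" reads the original test ids and vice versa.
SWAPPED = {"train": "test", "validation": "validation", "test": "train"}


def resolve_ids(purpose):
    """Return the image ids for one purpose or for a list of purposes."""
    purposes = [purpose] if isinstance(purpose, str) else list(purpose)
    ids = []
    for name in purposes:
        if name not in SWAPPED:
            raise ValueError(f"unknown purpose: {name!r}")
        ids.extend(ORIGINAL_SPLITS[SWAPPED[name]])
    return ids
```

With this mapping, `resolve_ids("train")` returns the larger pool and `resolve_ids(["train", "validation"])` concatenates two splits, mirroring the list form of `purpose`.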
diff --git a/docs/encoders/README.md b/docs/encoders/README.md
@@ -0,0 +1,28 @@
+# Encoders
+
+An encoder turns an image into a single fixed-size vector by extracting local
+descriptors and aggregating them against a learned visual vocabulary. The resulting
+vectors are used for retrieval, clustering, and classification.
+
+| Object | File | Aggregation model | Output size |
+|--------|------|-------------------|-------------|
+| [`VLADEncoder`](vlad.md) | [`vlad.py`](../../pyvisim/encoders/vlad.py) | KMeans | `K * D` |
+| [`FisherVectorEncoder`](fisher_vector.md) | [`fisher_vector.py`](../../pyvisim/encoders/fisher_vector.py) | Gaussian Mixture Model | `2 * K * D + K` |
+| [`Pipeline`](pipeline.md) | [`pipeline.py`](../../pyvisim/encoders/pipeline.py) | n/a (composes encoders) | sum of members |
+
+where `K` is the number of clusters and `D` is the local descriptor dimension.
+
+Shared machinery lives in [`ImageEncoderBase`](base_encoder.md), and pretrained
+clustering/PCA models are exposed through the enums documented in
+[weights.md](weights.md).
+
+## VLAD vs Fisher Vector
+
+Both follow the same extract → aggregate → normalize flow and share the same base
+class. They differ in what statistics they capture:
+
+- **VLAD** records only first-order statistics: the sum of residuals (descriptor minus
+  centroid) per cluster. Clustering is hard-assignment via KMeans.
+- **Fisher Vector** records zeroth-, first-, and second-order statistics as gradients of
+  the GMM log-likelihood with respect to its weights, means, and variances. Assignment
+  is soft (posterior probabilities). This makes Fisher vectors larger but more expressive.
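As a worked example of the output sizes in the table of `docs/encoders/README.md`, the arithmetic below uses `K = 256` (the value quoted for the pretrained weights) and `D = 128` (SIFT/RootSIFT without PCA). The helper names are hypothetical and exist only for this illustration.

```python
def vlad_size(k, d):
    """VLAD keeps one D-dimensional residual sum per cluster."""
    return k * d


def fisher_size(k, d):
    """Mean and variance gradients (2 * K * D) plus K weight gradients."""
    return 2 * k * d + k


K, D = 256, 128  # k=256 as in the pretrained weights, D=128 for SIFT/RootSIFT
sizes = {"vlad": vlad_size(K, D), "fisher": fisher_size(K, D)}
# A Pipeline concatenates its members, so its size is their sum.
sizes["pipeline"] = sizes["vlad"] + sizes["fisher"]
print(sizes)  # {'vlad': 32768, 'fisher': 65792, 'pipeline': 98560}
```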
diff --git a/docs/encoders/base_encoder.md b/docs/encoders/base_encoder.md
@@ -0,0 +1,30 @@
+# ImageEncoderBase
+
+File: [`_base_encoder.py`](../../pyvisim/encoders/_base_encoder.py)
+
+`ImageEncoderBase` holds all logic shared by `VLADEncoder` and `FisherVectorEncoder`.
+It implements `SimilarityMetric` and leaves `encode` abstract for subclasses. If you
+add a new aggregation-based encoder, subclass this.
+
+## What it manages
+
+A concrete encoder is the combination of:
+
+1. a **feature extractor** (`FeatureExtractorBase`),
+2. an optional **PCA** model,
+3. a **clustering model** (`KMeans` for VLAD, `GaussianMixture` for Fisher),
+4. a **similarity function**.
+
+The base class wires these together, validates their dimensions, and provides
+`learn`, `encode` (abstract), `generate_encoding_map`, and `similarity_score`.
+
+## Constructing an encoder
+
+There are two ways to supply a clustering model; `weights` takes precedence if both are given:
+
+- Pass `weights=` (a `KMeansWeights` / `GMMWeights` enum member). The base class loads
+  the pickled model, and if the weight name contains `PCA` it also loads the matching
+  PCA model automatically. When `weights` is given, the `clustering_model` and `pca`
+  arguments are ignored.
+- Pass an explicit `clustering_model=` (and optionally `pca=`) that you trained
+  yourself or loaded elsewhere.
diff --git a/docs/encoders/fisher_vector.md b/docs/encoders/fisher_vector.md
@@ -0,0 +1,31 @@
+# FisherVectorEncoder
+
+File: [`fisher_vector.py`](../../pyvisim/encoders/fisher_vector.py)
+
+The Fisher Vector encodes an image into a vector of shape `(2 * K * D + K,)`, where
+`K` is the number of GMM components and `D` is the local descriptor dimension (after
+optional PCA). The `2 * K * D` term comes from the mean and variance gradients, and
+the `+ K` term from the mixture-weight gradients.
+
+## How `encode` works
+
+For each image:
+
+1. Extract local descriptors (default `RootSIFT`) and apply PCA if set.
+2. Compute **soft assignments**: the GMM posterior probability of each descriptor
+   belonging to each component (`predict_proba`).
+3. Accumulate the sufficient statistics (`pp_sum`, `pp_x`, `pp_x_2`) needed for the
+   gradients.
+4. Compute the gradients of the GMM log-likelihood with respect to its parameters:
+   - `d_pi`: gradient w.r.t. mixture weights (zeroth-order, from the soft counts).
+   - `d_mu`: gradient w.r.t. means (first-order).
+   - `d_sigma`: gradient w.r.t. variances (second-order). This is what VLAD lacks.
+5. Apply the analytical diagonal Fisher information normalization (dividing by the
+   square roots involving the mixture weights and covariances).
+6. Concatenate `[d_pi, d_mu, d_sigma]`, then power-normalize and L2-normalize.
+
+## References
+
+- H. Jégou et al. "Aggregating Local Image Descriptors into Compact Codes". In: IEEE
+  Transactions on Pattern Analysis and Machine Intelligence 34.9 (2012), pp. 1704-1716.
+  doi: 10.1109/TPAMI.2011.235.
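The `encode` steps listed in `fisher_vector.md` can be sketched in plain NumPy for a diagonal-covariance GMM. This follows one common formulation and is not necessarily line-for-line what `fisher_vector.py` does: the function names, the explicit `weights` / `means` / `variances` arguments (instead of a fitted `GaussianMixture`), and the default `power=0.5` are assumptions made for the example.

```python
import numpy as np


def gmm_posteriors(x, weights, means, variances):
    """Soft assignments of shape (N, K) for a diagonal-covariance GMM."""
    diff = x[:, None, :] - means[None, :, :]                      # (N, K, D)
    log_gauss = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances)[None] + diff**2 / variances[None], axis=2
    )
    log_joint = np.log(weights)[None, :] + log_gauss              # (N, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)             # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


def fisher_vector(x, weights, means, variances, power=0.5):
    """Encode descriptors x of shape (N, D) into a (2 * K * D + K,) vector."""
    n = x.shape[0]
    pp = gmm_posteriors(x, weights, means, variances)             # step 2
    # Step 3: sufficient statistics, averaged over the N descriptors.
    pp_sum = pp.sum(axis=0)[:, None] / n                          # (K, 1)
    pp_x = pp.T @ x / n                                           # (K, D)
    pp_x_2 = pp.T @ (x**2) / n                                    # (K, D)

    # Step 4: gradients w.r.t. weights, means, and variances.
    d_pi = pp_sum.ravel() - weights
    d_mu = pp_x - pp_sum * means
    d_sigma = -pp_x_2 - pp_sum * means**2 + pp_sum * variances + 2.0 * pp_x * means

    # Step 5: analytical diagonal Fisher information normalization.
    sqrt_w = np.sqrt(weights)
    d_pi = d_pi / sqrt_w
    d_mu = d_mu / (sqrt_w[:, None] * np.sqrt(variances))
    d_sigma = d_sigma / (np.sqrt(2.0) * sqrt_w[:, None] * variances)

    # Step 6: concatenate, power-normalize, then L2-normalize.
    fv = np.concatenate([d_pi, d_mu.ravel(), d_sigma.ravel()])
    fv = np.sign(fv) * np.abs(fv) ** power
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```

For `K = 3` components and `D = 4` dimensions the result has `2 * 3 * 4 + 3 = 27` entries and unit L2 norm.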
diff --git a/docs/encoders/pipeline.md b/docs/encoders/pipeline.md
@@ -0,0 +1,32 @@
+# Pipeline
+
+File: [`pipeline.py`](../../pyvisim/encoders/pipeline.py)
+
+`Pipeline` combines several encoders into one. It encodes an image with every member
+encoder, concatenates the per-encoder vectors, and compares the combined vectors with
+a single similarity function. The goal is a more robust representation that blends,
+for example, VLAD and Fisher Vector encodings.
+
+It implements `SimilarityMetric` (not `ImageEncoderBase`), so it exposes `encode`,
+`generate_encoding_map`, and `similarity_score` but has no clustering model of its own.
+
+## How `encode` works
+
+1. Validate that every member is an `ImageEncoderBase`; any other type is rejected.
+2. Because the input `images` may be a one-shot iterator, it is duplicated with
+   `itertools.tee` so each encoder sees the full sequence.
+3. Each member is temporarily set to `flatten=True`, the images are encoded, and the
+   member's original `flatten` setting is then restored. Flattening is mandatory here
+   because different encoders produce different output sizes, and concatenation needs
+   1D vectors per image.
+4. The per-encoder results are concatenated with `np.hstack` into one wide vector per
+   image.
+
+## Notes
+
+- Member encoders can use different feature extractors and clustering models; the
+  pipeline does not require them to agree, since their outputs are simply concatenated.
+- The similarity function is guarded the same way as in the encoders (probed on
+  assignment, with a row-wise fallback). Default is cosine similarity.
+- A commented-out `fit` method exists in the source; training is done per encoder, not
+  through the pipeline.
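The tee / force-flatten / concatenate flow described in `pipeline.md` can be sketched as follows. `ToyEncoder` and `encode_with_all` are hypothetical stand-ins written for this example; they only mimic the `flatten` attribute and `encode` method, not the real `ImageEncoderBase` interface.

```python
import itertools

import numpy as np


class ToyEncoder:
    """Hypothetical stand-in for a member encoder with a `flatten` attribute."""

    def __init__(self, rows, cols, flatten=False):
        self.rows, self.cols, self.flatten = rows, cols, flatten

    def encode(self, images):
        # One (rows, cols) block per image, filled with the image's mean value.
        out = np.stack(
            [np.full((self.rows, self.cols), float(np.mean(img))) for img in images]
        )
        return out.reshape(len(out), -1) if self.flatten else out


def encode_with_all(encoders, images):
    """Encode `images` with every encoder and concatenate the results per image."""
    copies = itertools.tee(images, len(encoders))   # one iterator per encoder
    parts = []
    for encoder, imgs in zip(encoders, copies):
        original = encoder.flatten
        encoder.flatten = True                      # force 1D vectors per image
        try:
            parts.append(encoder.encode(imgs))
        finally:
            encoder.flatten = original              # restore the member's setting
    return np.hstack(parts)
```

Because the input is duplicated with `itertools.tee`, passing a generator works even though each encoder consumes its copy completely.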
diff --git a/docs/encoders/vlad.md b/docs/encoders/vlad.md
@@ -0,0 +1,28 @@
+# VLADEncoder
+
+File: [`vlad.py`](../../pyvisim/encoders/vlad.py)
+
+VLAD (Vector of Locally Aggregated Descriptors) encodes an image into a vector of
+shape `(K * D,)`, where `K` is the number of KMeans clusters and `D` is the local
+descriptor dimension (after optional PCA).
+
+## How `encode` works
+
+For each image:
+
+1. Extract local descriptors with the feature extractor (default `RootSIFT`).
+2. Apply PCA if one is set.
+3. Hard-assign each descriptor to its nearest KMeans centroid.
+4. For each cluster, accumulate the **residual** `descriptor - centroid`. This is the
+   first-order statistic that defines VLAD.
+5. Power-normalize (`sign(x) * |x|^power_norm_weight`), then L2-normalize per cluster
+   row.
+6. Flatten to `(K * D,)` if `flatten=True`.
+
+A batch returns a stacked `(num_images, K * D)` array.
+
+## References
+
+- R. Arandjelović and A. Zisserman. "All About VLAD". In: 2013 IEEE Conference on
+  Computer Vision and Pattern Recognition. 2013, pp. 1578-1585.
+  doi: 10.1109/CVPR.2013.207.
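Steps 3 to 6 of `vlad.md` can be sketched in NumPy as below. This is an illustrative sketch, not the code in `vlad.py`: the function takes precomputed `descriptors` and `centroids` instead of a feature extractor and a fitted `KMeans`, and the `power=0.5` default is an assumption standing in for `power_norm_weight`.

```python
import numpy as np


def vlad_encode(descriptors, centroids, power=0.5, flatten=True):
    """VLAD sketch: descriptors is (N, D), centroids is (K, D)."""
    k, d = centroids.shape
    # Hard assignment: index of the nearest centroid for each descriptor.
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Accumulate residuals (descriptor - centroid) per cluster.
    vlad = np.zeros((k, d))
    for cluster in range(k):
        members = descriptors[labels == cluster]
        if len(members):
            vlad[cluster] = (members - centroids[cluster]).sum(axis=0)

    vlad = np.sign(vlad) * np.abs(vlad) ** power            # power normalization
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    vlad = vlad / np.where(norms > 0, norms, 1.0)           # L2 per cluster row
    return vlad.ravel() if flatten else vlad
```

Clusters that receive no descriptors keep an all-zero row, which the guarded division leaves untouched.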
diff --git a/docs/encoders/weights.md b/docs/encoders/weights.md
@@ -0,0 +1,32 @@
+# Pretrained weights
+
+File: [`_base_encoder.py`](../../pyvisim/encoders/_base_encoder.py)
+
+Pretrained clustering models are exposed as enums so users can select a model by name
+instead of handling file paths. Each enum member's value is a path to a pickled
+scikit-learn model under `pyvisim/res/model_files/`, and the shared `_PretrainedModels`
+base provides `.load()` to unpickle it with `joblib`.
+
+## Available enums
+
+- `KMeansWeights`: pretrained `KMeans` models for `VLADEncoder`.
+- `GMMWeights`: pretrained `GaussianMixture` models for `FisherVectorEncoder`.
+- `_PCA`: internal; PCA models. Not exported.
+
+All models were trained on the Oxford Flowers dataset with `k=256`. Each enum has the
+same six variants, covering three feature types (VGG16 deep features, RootSIFT, SIFT)
+each with and without PCA, for example `OXFORD102_K256_ROOTSIFT_PCA`.
+
+## Automatic PCA pairing
+
+A weight name containing `PCA` requires the matching PCA model so descriptors are
+reduced before clustering. `_CLUSTERING_TO_PCA_MAPPING` maps each `*_PCA` clustering
+weight to its `_PCA` model, and `ImageEncoderBase.__init__` loads the PCA automatically
+when a `PCA` weight is selected. This is why you never reference `_PCA` directly:
+choosing a `*_PCA` clustering weight pulls in the correct PCA for you.
+
+## Adding your own weights
+
+To use a model you trained yourself, skip the enums and pass the fitted model directly
+via the encoder's `kmeans_model` / `gmm_model` (and optional `pca`) arguments. See
+[base_encoder.md](base_encoder.md).
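The automatic PCA pairing in `weights.md` can be read as a dictionary lookup keyed by the clustering weight. The sketch below uses toy enums to show the idea; `ToyPCA`, `ToyKMeansWeights`, `TOY_CLUSTERING_TO_PCA`, `resolve_pca`, and the file names are all hypothetical, and no model files are loaded.

```python
from enum import Enum


class ToyPCA(Enum):
    # Hypothetical file name, for illustration only.
    ROOTSIFT = "toy_pca_rootsift.pkl"


class ToyKMeansWeights(Enum):
    # Hypothetical file names, for illustration only.
    K256_ROOTSIFT = "toy_kmeans_rootsift.pkl"
    K256_ROOTSIFT_PCA = "toy_kmeans_rootsift_pca.pkl"


# Clustering weight -> PCA model it is paired with.
TOY_CLUSTERING_TO_PCA = {ToyKMeansWeights.K256_ROOTSIFT_PCA: ToyPCA.ROOTSIFT}


def resolve_pca(weights):
    """Return the paired PCA member, or None when the weight uses no PCA."""
    if "PCA" in weights.name:
        return TOY_CLUSTERING_TO_PCA[weights]
    return None
```

Keying the mapping by the clustering weight means callers only ever name the clustering enum member, and the PCA choice follows from it.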
diff --git a/docs/features/README.md b/docs/features/README.md
@@ -0,0 +1,13 @@
+# Features
+
+File: [`_features.py`](../../pyvisim/features/_features.py)
+
+A feature extractor maps one image to an `(N, D)` array of local descriptors. Encoders
+consume these descriptors and aggregate them into a fixed-size vector.
+
+| Object | `output_dim` | Notes |
+|--------|--------------|-------|
+| [`SIFT`](sift.md) | 128 | SIFT descriptors |
+| [`RootSIFT`](rootsift.md) | 128 | SIFT with Hellinger normalization (default extractor) |
+| [`DeepConvFeature`](deep_conv_feature.md) | layer channels (+2) | CNN feature maps, optional spatial coordinates |
+| [`Lambda`](lambda.md) | user-defined | wraps any custom function |
diff --git a/docs/features/deep_conv_feature.md b/docs/features/deep_conv_feature.md
@@ -0,0 +1,25 @@
+# DeepConvFeature
+
+File: [`_features.py`](../../pyvisim/features/_features.py)
+
+Extracts local descriptors from the convolutional feature map of a neural network
+(default VGG16). Each spatial location in the chosen conv layer becomes one descriptor,
+giving a CNN-based alternative to SIFT that plugs into the same encoders.
+
+## Spatial encoding and `output_dim`
+
+Convolutional features alone discard where in the image a descriptor came from.
+Appending normalized coordinates re-injects coarse spatial information, which helps for
+similarity tasks. When `spatial_encoding=True`, `output_dim` is the layer's output
+channels `+ 2`; otherwise just the channel count. For VGG16's last conv layer this is
+`512 + 2 = 514`.
+
+## Selecting the layer
+
+- `list_conv_layers()` enumerates the conv layers as `(index, name, module)`.
+- `layer_index` chooses which to hook; `-1` (default) takes the last conv layer.
+- `target_submodule` restricts the search to one named submodule of the model.
+
+## TODO
+- Handle input ranges and add batch processing; currently
+  one image is processed per call.
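The spatial encoding described in `deep_conv_feature.md` amounts to appending two coordinate columns to the per-location descriptors. The sketch below assumes a `(C, H, W)` feature map and coordinates normalized to `[0, 1]`; both the function name `add_spatial_encoding` and that normalization convention are assumptions, since the document does not spell them out.

```python
import numpy as np


def add_spatial_encoding(feature_map):
    """Turn a (C, H, W) feature map into (H * W, C + 2) descriptors."""
    c, h, w = feature_map.shape
    descriptors = feature_map.reshape(c, h * w).T            # (H*W, C)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Normalize coordinates to [0, 1]; max(..., 1) avoids dividing by zero.
    coords = np.stack(
        [xs.ravel() / max(w - 1, 1), ys.ravel() / max(h - 1, 1)], axis=1
    )
    return np.hstack([descriptors, coords])
```

For a 512-channel map this yields 514 columns, matching the `512 + 2 = 514` figure quoted for VGG16's last conv layer.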
diff --git a/docs/features/lambda.md b/docs/features/lambda.md
@@ -0,0 +1,21 @@
+# Lambda
+
+File: [`_features.py`](../../pyvisim/features/_features.py)
+
+`Lambda` wraps any user-defined function as a feature extractor, so you can plug a
+custom descriptor into the encoders without writing a new `FeatureExtractorBase`
+subclass.
+
+## Usage
+
+```python
+from pyvisim.features import Lambda
+
+extractor = Lambda(func=my_descriptor_fn, output_dim=64)
+```
+
+- `func` must take a single image (NumPy array) and return an
+  `(N, output_dim)` array of descriptors.
+- `output_dim` is supplied explicitly because, unlike SIFT or a CNN layer, an arbitrary
+  function has no inspectable descriptor size. The encoders need this value to validate
+  against PCA and the clustering model.
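To make the `func` contract in `lambda.md` concrete, here is one possible `my_descriptor_fn` that returns 64-dimensional descriptors. It is a made-up example (flattened 8x8 grayscale patches), chosen only because its output shape is easy to check; it is not a descriptor shipped with the library.

```python
import numpy as np


def my_descriptor_fn(image, patch=8):
    """Return one 64-dim descriptor per non-overlapping 8x8 grayscale patch."""
    gray = image.mean(axis=2) if image.ndim == 3 else image
    # Crop so that height and width are multiples of the patch size.
    h = gray.shape[0] // patch * patch
    w = gray.shape[1] // patch * patch
    blocks = gray[:h, :w].reshape(h // patch, patch, w // patch, patch)
    # Bring the two patch axes together, then flatten each patch to one row.
    return blocks.transpose(0, 2, 1, 3).reshape(-1, patch * patch).astype(np.float32)
```

A function like this could then be passed as `Lambda(func=my_descriptor_fn, output_dim=64)`, with `output_dim` matching `patch * patch`.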
diff --git a/docs/features/rootsift.md b/docs/features/rootsift.md
@@ -0,0 +1,29 @@
+# RootSIFT
+
+File: [`_features.py`](../../pyvisim/features/_features.py)
+
+RootSIFT is SIFT with Hellinger-kernel normalization. It is the **default feature
+extractor** for both `VLADEncoder` and `FisherVectorEncoder`.
+
+## What it does differently from SIFT
+
+After computing standard SIFT descriptors, each descriptor is:
+
+1. L1-normalized (divided by the sum of its elements plus a small epsilon), then
+2. element-wise square-rooted.
+
+Comparing these transformed vectors with the Euclidean/dot-product operations the
+encoders use is equivalent to comparing the original descriptors under the Hellinger
+kernel. This is a well-established, near-free improvement over raw SIFT for retrieval,
+which is why it is the default.
+
+## Notes
+
+- `output_dim` is `128`, same as SIFT.
+- Same empty-result handling as SIFT: no keypoints yields an empty `(0, 128)` array.
+
+## References
+
+- R. Arandjelović and A. Zisserman. "Three things everyone should know to improve
+  object retrieval". In: 2012 IEEE Conference on Computer Vision and Pattern
+  Recognition. 2012, pp. 2911-2918. doi: 10.1109/CVPR.2012.6248018.
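The two-step transform in `rootsift.md` is short enough to sketch directly. The function name `root_sift` and the `eps` default are assumptions; the input is taken to be an `(N, 128)` array of non-negative SIFT descriptors.

```python
import numpy as np


def root_sift(descriptors, eps=1e-7):
    """Apply the RootSIFT transform to an (N, 128) array of SIFT descriptors."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    # Step 1: L1-normalize each row (eps guards against division by zero).
    l1 = np.abs(descriptors).sum(axis=1, keepdims=True) + eps
    # Step 2: element-wise square root.
    return np.sqrt(descriptors / l1)
```

After the transform each row has (approximately) unit L2 norm, and the dot product of two transformed rows equals the Hellinger kernel of the L1-normalized originals, which is the equivalence the section refers to.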
diff --git a/docs/features/sift.md b/docs/features/sift.md
@@ -0,0 +1,19 @@
+# SIFT
+
+File: [`_features.py`](../../pyvisim/features/_features.py)
+
+Scale-Invariant Feature Transform descriptors via OpenCV's `cv2.SIFT`. SIFT was the
+original local descriptor used for VLAD and Fisher Vector encoding, so it is the
+baseline handcrafted extractor here.
+
+- `output_dim` is `128` (standard SIFT descriptor length).
+
+For most uses prefer [`RootSIFT`](rootsift.md), which normalizes these descriptors and
+usually improves retrieval at no extra cost.
+
+## References
+
+- D. G. Lowe. "Distinctive Image Features from Scale-Invariant Keypoints". In:
+  International Journal of Computer Vision 60.2 (2004), pp. 91-110. issn: 1573-1405.
+  doi: 10.1023/B:VISI.0000029664.99615.94.
+  url: https://doi.org/10.1023/B:VISI.0000029664.99615.94.
diff --git a/docs/neural_networks/README.md b/docs/neural_networks/README.md
@@ -0,0 +1,10 @@
+# Neural networks
+
+File: [`neural_networks/`](../../pyvisim/neural_networks/)
+
+This module is a placeholder for a **Siamese network** and is **not yet implemented**.
+
+## Planned direction
+
+TODO: define the architecture, the dependencies (the backbone will most likely come
+from `torchvision`), and the training procedure.