2 changes: 2 additions & 0 deletions CONTRIBUTING.md
@@ -5,6 +5,8 @@ Thank you for your interest! Contributions of all kinds are welcome.

## PR TODO list - your first PR

To understand how the library is structured, as well as the technical details, before diving in, first read the [developer documentation](docs/overview.md) and/or the docstrings of the modules and classes that you are working on.

Use this checklist to stay on track for your first code PR:

- **Clone this repository**: see [Set up developer environment](#set-up-developer-environment) section.
2 changes: 2 additions & 0 deletions README.md
@@ -26,6 +26,8 @@ and the Siamese Neural Networks.
7. [License](#license)
8. [References](#references)

For a technical deep-dive into the library internals, see the [developer documentation](docs/overview.md).

## Why `pyvisim`?

`pyvisim` is designed to provide a simple and efficient way to compare images.
7 changes: 7 additions & 0 deletions docs/dataset/README.md
@@ -0,0 +1,7 @@
# Dataset

File: [`datasets/datasets.py`](../../pyvisim/datasets/datasets.py)

The module bundles one dataset class, [`OxfordFlowerDataset`](oxford_flower_dataset.md),
used for the project's experiments and pretrained weights. It is a standard PyTorch
`Dataset`, so it works with `DataLoader` and the rest of the Torch ecosystem.
29 changes: 29 additions & 0 deletions docs/dataset/oxford_flower_dataset.md
@@ -0,0 +1,29 @@
# OxfordFlowerDataset

File: [`datasets/datasets.py`](../../pyvisim/datasets/datasets.py)

A PyTorch `Dataset` for the [Oxford 102 Flowers](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/)
dataset (8189 images across 102 categories). Indexing yields a
`(image, label, image_path)` tuple, where `image` is an RGB NumPy array.

```python
from pyvisim.datasets import OxfordFlowerDataset

dataset = OxfordFlowerDataset(purpose="train")
image, label, path = dataset[0]
```

## The swapped train/test split

The constructor's `purpose` argument accepts `"train"`, `"validation"`, `"test"`, or a
list that combines splits (for example `["train", "validation"]`).

Note the deliberate swap: the original dataset ships 1020 training and 6149 test images.
This class maps the original **test** ids to `train` and the original **train** ids to
`test`, so training has the larger pool. This is more useful for fitting the clustering
models, which benefit from more data. Keep this in mind if you compare results against
papers that use the original split.
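
As a rough illustration of the swap, the sketch below maps each `purpose` to the id array it would draw from and adds up the pool sizes. The sizes are those of the official split (`trnid`, `valid`, and `tstid` are the arrays in the dataset's `setid.mat`). The lookup table and `pool_size` helper are hypothetical names, not the library's code, and the assumption that `"validation"` maps to the unchanged `valid` ids is ours.

```python
# Official split sizes, keyed by the id arrays in the dataset's setid.mat file.
ORIGINAL_SPLIT_SIZES = {"trnid": 1020, "valid": 1020, "tstid": 6149}

# Hypothetical lookup table illustrating the swap; not the library's own code.
# Assumption: "validation" keeps the original validation ids.
PURPOSE_TO_ORIGINAL = {"train": "tstid", "validation": "valid", "test": "trnid"}


def pool_size(purpose):
    """Return how many images one purpose, or a list of purposes, selects."""
    purposes = [purpose] if isinstance(purpose, str) else list(purpose)
    return sum(ORIGINAL_SPLIT_SIZES[PURPOSE_TO_ORIGINAL[p]] for p in purposes)


print(pool_size("train"))                  # 6149
print(pool_size("test"))                   # 1020
print(pool_size(["train", "validation"]))  # 7169
```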

## TODO

- Implement `transform` method
28 changes: 28 additions & 0 deletions docs/encoders/README.md
@@ -0,0 +1,28 @@
# Encoders

An encoder turns an image into a single fixed-size vector by extracting local
descriptors and aggregating them against a learned visual vocabulary. The resulting
vectors are used for retrieval, clustering, and classification.

| Object | File | Aggregation model | Output size |
|--------|------|-------------------|-------------|
| [`VLADEncoder`](vlad.md) | [`vlad.py`](../../pyvisim/encoders/vlad.py) | KMeans | `K * D` |
| [`FisherVectorEncoder`](fisher_vector.md) | [`fisher_vector.py`](../../pyvisim/encoders/fisher_vector.py) | Gaussian Mixture Model | `2 * K * D + K` |
| [`Pipeline`](pipeline.md) | [`pipeline.py`](../../pyvisim/encoders/pipeline.py) | n/a (composes encoders) | sum of members |

where `K` is the number of clusters and `D` is the local descriptor dimension.
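
To make the output sizes concrete, here is a small worked calculation. The helper names are illustrative only; `K = 256` matches the pretrained weights, and `D = 128` assumes SIFT-style descriptors without PCA.

```python
def vlad_size(k, d):
    """VLAD keeps one D-dimensional residual sum per cluster."""
    return k * d


def fisher_vector_size(k, d):
    """Fisher Vector keeps mean and variance gradients (2 * K * D) plus K weight gradients."""
    return 2 * k * d + k


K, D = 256, 128
print(vlad_size(K, D))                             # 32768
print(fisher_vector_size(K, D))                    # 65792
# A pipeline that concatenates both encoders yields the sum of its members.
print(vlad_size(K, D) + fisher_vector_size(K, D))  # 98560
```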

Shared machinery lives in [`ImageEncoderBase`](base_encoder.md), and pretrained
clustering/PCA models are exposed through the enums documented in
[weights.md](weights.md).

## VLAD vs Fisher Vector

Both follow the same extract → aggregate → normalize flow and share the same base
class. They differ in what statistics they capture:

- **VLAD** records only first-order statistics: the sum of residuals (descriptor minus
centroid) per cluster. Clustering is hard-assignment via KMeans.
- **Fisher Vector** records first- and second-order statistics as gradients of the GMM
log-likelihood with respect to its weights, means, and variances. Assignment is soft
(posterior probabilities). This makes Fisher vectors larger but more expressive.
30 changes: 30 additions & 0 deletions docs/encoders/base_encoder.md
@@ -0,0 +1,30 @@
# ImageEncoderBase

File: [`_base_encoder.py`](../../pyvisim/encoders/_base_encoder.py)

`ImageEncoderBase` holds all logic shared by `VLADEncoder` and `FisherVectorEncoder`.
It implements `SimilarityMetric` and leaves `encode` abstract for subclasses. If you
add a new aggregation-based encoder, subclass this.

## What it manages

A concrete encoder is the combination of:

1. a **feature extractor** (`FeatureExtractorBase`),
2. an optional **PCA** model,
3. a **clustering model** (`KMeans` for VLAD, `GaussianMixture` for Fisher),
4. a **similarity function**.

The base class wires these together, validates their dimensions, and provides
`learn`, `encode` (abstract), `generate_encoding_map`, and `similarity_score`.

## Constructing an encoder

There are two ways to supply a clustering model. If both are given, `weights` takes precedence:

- Pass `weights=` (a `KMeansWeights` / `GMMWeights` enum member). The base class loads
the pickled model, and if the weight name contains `PCA` it also loads the matching
PCA model automatically. When `weights` is given, the `clustering_model` and `pca`
arguments are ignored.
- Pass an explicit `clustering_model=` (and optionally `pca=`) that you trained
yourself or loaded elsewhere.
31 changes: 31 additions & 0 deletions docs/encoders/fisher_vector.md
@@ -0,0 +1,31 @@
# FisherVectorEncoder

File: [`fisher_vector.py`](../../pyvisim/encoders/fisher_vector.py)

The Fisher Vector encodes an image into a vector of shape `(2 * K * D + K,)`, where
`K` is the number of GMM components and `D` is the local descriptor dimension (after
optional PCA). The `2 * K * D` term comes from the mean and variance gradients, and
the `+ K` term from the mixture-weight gradients.

## How `encode` works

For each image:

1. Extract local descriptors (default `RootSIFT`) and apply PCA if set.
2. Compute **soft assignments**: the GMM posterior probability of each descriptor
belonging to each component (`predict_proba`).
3. Accumulate the sufficient statistics (`pp_sum`, `pp_x`, `pp_x_2`) needed for the
gradients.
4. Compute the gradients of the GMM log-likelihood with respect to its parameters:
- `d_pi`: gradient w.r.t. mixture weights (zeroth-order: it depends only on the soft-assignment counts).
- `d_mu`: gradient w.r.t. means (first-order).
- `d_sigma`: gradient w.r.t. variances (second-order). This is what VLAD lacks.
5. Apply the analytical diagonal Fisher information normalization (dividing by the
square roots involving the mixture weights and covariances).
6. Concatenate `[d_pi, d_mu, d_sigma]`, then power-normalize and L2-normalize.
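
The steps above can be sketched in plain NumPy for a diagonal-covariance GMM. This is a minimal sketch of the standard improved Fisher Vector formulas, not the library's implementation: the function name and signature are assumptions, the GMM parameters are passed as arrays rather than a fitted `GaussianMixture`, and the statistics are averaged over the number of descriptors. Only the variable names `pp_sum`, `pp_x`, `pp_x_2`, `d_pi`, `d_mu`, and `d_sigma` are taken from the description.

```python
import numpy as np


def fisher_vector(x, weights, means, variances, power_norm_weight=0.5):
    """Encode descriptors x (N, D) against a diagonal GMM with K components.

    weights: (K,), means: (K, D), variances: (K, D). Returns a (2*K*D + K,) vector.
    """
    n = x.shape[0]

    # Step 2: soft assignments (posteriors), computed in log space for stability.
    diff = x[:, None, :] - means[None, :, :]                           # (N, K, D)
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_prob                              # (N, K)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Step 3: sufficient statistics, averaged over the N descriptors.
    pp_sum = post.mean(axis=0)                                         # (K,)
    pp_x = post.T @ x / n                                              # (K, D)
    pp_x_2 = post.T @ (x ** 2) / n                                     # (K, D)

    # Steps 4 and 5: gradients with the diagonal Fisher information normalization.
    sqrt_w = np.sqrt(weights)[:, None]
    d_pi = (pp_sum - weights) / np.sqrt(weights)
    d_mu = (pp_x - pp_sum[:, None] * means) / (sqrt_w * np.sqrt(variances))
    d_sigma = (pp_x_2 - 2 * pp_x * means
               + pp_sum[:, None] * (means ** 2 - variances)) / (np.sqrt(2) * sqrt_w * variances)

    # Step 6: concatenate, power-normalize, then L2-normalize.
    fv = np.concatenate([d_pi, d_mu.ravel(), d_sigma.ravel()])
    fv = np.sign(fv) * np.abs(fv) ** power_norm_weight
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv


# Toy data: K = 4 components, D = 8 dimensions, N = 50 descriptors.
rng = np.random.default_rng(0)
K, D, N = 4, 8, 50
x = rng.normal(size=(N, D))
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
variances = np.ones((K, D))

fv = fisher_vector(x, weights, means, variances)
print(fv.shape)  # (68,)
```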

## References

- H. Jégou et al. "Aggregating Local Image Descriptors into Compact Codes". In: IEEE
Transactions on Pattern Analysis and Machine Intelligence 34.9 (2012), pp. 1704-1716.
doi: 10.1109/TPAMI.2011.235.
32 changes: 32 additions & 0 deletions docs/encoders/pipeline.md
@@ -0,0 +1,32 @@
# Pipeline

File: [`pipeline.py`](../../pyvisim/encoders/pipeline.py)

`Pipeline` combines several encoders into one. It encodes an image with every member
encoder, concatenates the per-encoder vectors, and compares the combined vectors with
a single similarity function. The goal is a more robust representation that blends,
for example, VLAD and Fisher Vector encodings.

It implements `SimilarityMetric` (not `ImageEncoderBase`), so it exposes `encode`,
`generate_encoding_map`, and `similarity_score` but has no clustering model of its own.

## How `encode` works

1. Validate that every member is an `ImageEncoderBase`; any other type is rejected.
2. Because the input `images` may be a one-shot iterator, it is duplicated with
`itertools.tee` so each encoder sees the full sequence.
3. Each member's `flatten` setting is temporarily forced to `True`, the images are
encoded, and the member's original setting is then restored. Flattening is mandatory
here because different encoders produce different output sizes, and concatenation
needs 1D vectors per image.
4. The per-encoder results are concatenated with `np.hstack` into one wide vector per
image.
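
The iterator duplication and concatenation in steps 2 and 4 can be sketched as follows. To stay self-contained, the sketch uses plain functions as stand-in encoders, so it omits the type validation and the `flatten` toggling; `encode_with_all` and the two toy encoders are hypothetical names.

```python
import itertools

import numpy as np


def encode_with_all(images, encoders):
    """Encode every image with every encoder and concatenate the results per image."""
    # tee gives each encoder its own copy, so a one-shot generator is not exhausted
    # by the first encoder.
    copies = itertools.tee(images, len(encoders))
    per_encoder = [
        np.stack([encode(image) for image in copy])  # (num_images, size_of_this_encoder)
        for encode, copy in zip(encoders, copies)
    ]
    # hstack joins the blocks column-wise: one wide vector per image.
    return np.hstack(per_encoder)


def mean_encoder(image):
    """Stand-in encoder producing a 1-element vector."""
    return np.array([image.mean()])


def range_encoder(image):
    """Stand-in encoder producing a 2-element vector."""
    return np.array([image.min(), image.max()])


images = (np.full((2, 2), float(i)) for i in range(3))  # one-shot generator
combined = encode_with_all(images, [mean_encoder, range_encoder])
print(combined.shape)  # (3, 3)
```

Without `itertools.tee`, the second encoder would receive an already exhausted generator and produce no rows.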

## Notes

- Member encoders can use different feature extractors and clustering models; the
pipeline does not require them to agree, since their outputs are simply concatenated.
- The similarity function is guarded the same way as in the encoders (probed on
assignment, with a row-wise fallback). Default is cosine similarity.
- A commented-out `fit` method exists in the source; training is done per encoder, not
through the pipeline.
28 changes: 28 additions & 0 deletions docs/encoders/vlad.md
@@ -0,0 +1,28 @@
# VLADEncoder

File: [`vlad.py`](../../pyvisim/encoders/vlad.py)

VLAD (Vector of Locally Aggregated Descriptors) encodes an image into a vector of
shape `(K * D,)`, where `K` is the number of KMeans clusters and `D` is the local
descriptor dimension (after optional PCA).

## How `encode` works

For each image:

1. Extract local descriptors with the feature extractor (default `RootSIFT`).
2. Apply PCA if one is set.
3. Hard-assign each descriptor to its nearest KMeans centroid.
4. For each cluster, accumulate the **residual** `descriptor - centroid`. This is the
first-order statistic that defines VLAD.
5. Power-normalize (`sign(x) * |x|^power_norm_weight`), then L2-normalize per cluster
row.
6. Flatten to `(K * D,)` if `flatten=True`.

A batch returns a stacked `(num_images, K * D)` array.
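
Steps 3 to 6 can be sketched in NumPy for a single image. This is a minimal sketch under stated assumptions, not the library's code: the function takes raw centroids instead of a fitted `KMeans` model, feature extraction and PCA are assumed to have happened already, and the small epsilon guarding empty clusters is our choice.

```python
import numpy as np


def vlad(x, centroids, power_norm_weight=0.5, flatten=True):
    """Encode descriptors x (N, D) against centroids (K, D)."""
    k, d = centroids.shape

    # Step 3: hard-assign each descriptor to its nearest centroid.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    labels = dists.argmin(axis=1)

    # Step 4: accumulate residuals per cluster. np.add.at handles repeated labels.
    v = np.zeros((k, d))
    np.add.at(v, labels, x - centroids[labels])

    # Step 5: power-normalize, then L2-normalize each cluster row.
    v = np.sign(v) * np.abs(v) ** power_norm_weight
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    v = v / np.maximum(norms, 1e-12)  # empty clusters stay all-zero

    # Step 6: flatten to (K * D,) if requested.
    return v.ravel() if flatten else v


centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
x = np.array([[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]])
encoding = vlad(x, centroids)
print(encoding.shape)  # (4,)
```

In this toy case the first cluster accumulates the residual `[1, 1]` and the second `[1, 0]`, so after row-wise L2 normalization each non-empty row has unit length.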

## References

- R. Arandjelović and A. Zisserman. "All About VLAD". In: 2013 IEEE Conference on
Computer Vision and Pattern Recognition. 2013, pp. 1578-1585.
doi: 10.1109/CVPR.2013.207.
32 changes: 32 additions & 0 deletions docs/encoders/weights.md
@@ -0,0 +1,32 @@
# Pretrained weights

File: [`_base_encoder.py`](../../pyvisim/encoders/_base_encoder.py)

Pretrained clustering models are exposed as enums so users can select a model by name
instead of handling file paths. Each enum member's value is a path to a pickled
scikit-learn model under `pyvisim/res/model_files/`, and the shared `_PretrainedModels`
base provides `.load()` to unpickle it with `joblib`.

## Available enums

- `KMeansWeights`: pretrained `KMeans` models for `VLADEncoder`.
- `GMMWeights`: pretrained `GaussianMixture` models for `FisherVectorEncoder`.
- `_PCA`: internal; PCA models. Not exported.

All models were trained on the Oxford Flowers dataset with `k=256`. Each enum has the
same six variants, covering three feature types (VGG16 deep features, RootSIFT, SIFT)
each with and without PCA, for example `OXFORD102_K256_ROOTSIFT_PCA`.

## Automatic PCA pairing

A weight name containing `PCA` requires the matching PCA model so that descriptors are
reduced before clustering. `_CLUSTERING_TO_PCA_MAPPING` pairs each `*_PCA` clustering
weight with its `_PCA` model, and `ImageEncoderBase.__init__` loads the PCA
automatically when a `PCA` weight is selected. This is why you never reference `_PCA`
directly: choosing a `*_PCA` clustering weight pulls in the correct PCA for you.
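
One way to express this name-based pairing is sketched below with stand-in enums. Everything here is illustrative: the class names, the `matching_pca` helper, the member `OXFORD102_K256_ROOTSIFT`, and the file names are placeholders, and no model files are loaded.

```python
from enum import Enum


class ClusteringWeights(Enum):
    """Stand-in for KMeansWeights / GMMWeights; values are placeholder file names."""
    OXFORD102_K256_ROOTSIFT = "kmeans_rootsift.pkl"
    OXFORD102_K256_ROOTSIFT_PCA = "kmeans_rootsift_pca.pkl"


class PCAWeights(Enum):
    """Stand-in for the internal _PCA enum."""
    OXFORD102_ROOTSIFT_PCA = "pca_rootsift.pkl"


# Hypothetical pairing table: clustering weight -> PCA model it was trained with.
CLUSTERING_TO_PCA = {
    ClusteringWeights.OXFORD102_K256_ROOTSIFT_PCA: PCAWeights.OXFORD102_ROOTSIFT_PCA,
}


def matching_pca(weights):
    """Return the paired PCA member, or None when the weight was trained without PCA."""
    if "PCA" not in weights.name:
        return None
    return CLUSTERING_TO_PCA[weights]


print(matching_pca(ClusteringWeights.OXFORD102_K256_ROOTSIFT))      # None
print(matching_pca(ClusteringWeights.OXFORD102_K256_ROOTSIFT_PCA))  # PCAWeights.OXFORD102_ROOTSIFT_PCA
```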

## Adding your own weights

To use a model you trained yourself, skip the enums and pass the fitted model directly
via the encoder's `kmeans_model` / `gmm_model` (and optional `pca`) arguments. See
[base_encoder.md](base_encoder.md).
13 changes: 13 additions & 0 deletions docs/features/README.md
@@ -0,0 +1,13 @@
# Features

File: [`_features.py`](../../pyvisim/features/_features.py)

A feature extractor maps one image to a `(N, D)` array of local descriptors. Encoders
consume these descriptors and aggregate them into a fixed-size vector.

| Object | `output_dim` | Notes |
|--------|--------------|-------|
| [`SIFT`](sift.md) | 128 | SIFT descriptors |
| [`RootSIFT`](rootsift.md) | 128 | SIFT with Hellinger normalization (default extractor) |
| [`DeepConvFeature`](deep_conv_feature.md) | layer channels (+2) | CNN feature maps, optional spatial coordinates |
| [`Lambda`](lambda.md) | user-defined | wraps any custom function |
25 changes: 25 additions & 0 deletions docs/features/deep_conv_feature.md
@@ -0,0 +1,25 @@
# DeepConvFeature

File: [`_features.py`](../../pyvisim/features/_features.py)

Extracts local descriptors from the convolutional feature map of a neural network
(default VGG16). Each spatial location in the chosen conv layer becomes one descriptor,
giving a CNN-based alternative to SIFT that plugs into the same encoders.

## Spatial encoding and `output_dim`

Convolutional features alone discard where in the image a descriptor came from.
Appending normalized coordinates re-injects coarse spatial information, which helps for
similarity tasks. When `spatial_encoding=True`, `output_dim` is the layer's output
channels `+ 2`; otherwise just the channel count. For VGG16's last conv layer this is
`512 + 2 = 514`.
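
The effect on the descriptor shape can be sketched with NumPy. This is an illustration only: the function name is hypothetical, the feature map is a dummy array rather than a real VGG16 activation, and normalizing coordinates to `[0, 1]` with `(x, y)` ordering is an assumption about the scheme.

```python
import numpy as np


def feature_map_to_descriptors(feature_map, spatial_encoding=True):
    """Turn a (C, H, W) feature map into (H * W, C) or (H * W, C + 2) descriptors."""
    c, h, w = feature_map.shape
    # One descriptor per spatial location, in row-major (y, then x) order.
    descriptors = feature_map.reshape(c, h * w).T
    if spatial_encoding:
        ys, xs = np.meshgrid(np.linspace(0.0, 1.0, h), np.linspace(0.0, 1.0, w), indexing="ij")
        coords = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (H * W, 2), same order
        descriptors = np.hstack([descriptors, coords])
    return descriptors


dummy_map = np.zeros((512, 7, 7))  # 512 channels on a 7 x 7 grid
print(feature_map_to_descriptors(dummy_map).shape)                          # (49, 514)
print(feature_map_to_descriptors(dummy_map, spatial_encoding=False).shape)  # (49, 512)
```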

## Selecting the layer

- `list_conv_layers()` enumerates the conv layers as `(index, name, module)`.
- `layer_index` chooses which to hook; `-1` (default) takes the last conv layer.
- `target_submodule` restricts the search to one named submodule of the model.

## TODO

- Input range handling and batch processing; currently one image is processed per call.
21 changes: 21 additions & 0 deletions docs/features/lambda.md
@@ -0,0 +1,21 @@
# Lambda

File: [`_features.py`](../../pyvisim/features/_features.py)

`Lambda` wraps any user-defined function as a feature extractor, so you can plug a
custom descriptor into the encoders without writing a new `FeatureExtractorBase`
subclass.

## Usage

```python
from pyvisim.features import Lambda

extractor = Lambda(func=my_descriptor_fn, output_dim=64)
```

- `func` must take a single image (NumPy array), and return a
`(N, output_dim)` array of descriptors.
- `output_dim` is supplied explicitly because, unlike SIFT or a CNN layer, an arbitrary
function has no inspectable descriptor size. The encoders need this value to validate
against PCA and the clustering model.
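
For illustration, here is one possible `my_descriptor_fn` that satisfies this contract with `output_dim=64`. It is a toy descriptor (flattened 8x8 grayscale patches) chosen only to show the expected input and output shapes; it is not part of the library.

```python
import numpy as np


def my_descriptor_fn(image, patch=8):
    """Return flattened non-overlapping patch x patch grayscale tiles, shape (N, patch * patch)."""
    gray = image.mean(axis=2) if image.ndim == 3 else image
    h, w = gray.shape
    h, w = h - h % patch, w - w % patch  # crop so the image tiles evenly
    tiles = gray[:h, :w].reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch * patch).astype(np.float32)


image = np.zeros((32, 40, 3), dtype=np.uint8)
descriptors = my_descriptor_fn(image)
print(descriptors.shape)  # (20, 64)
```

The second dimension of the returned array must equal the `output_dim` passed to `Lambda`, here `8 * 8 = 64`.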
29 changes: 29 additions & 0 deletions docs/features/rootsift.md
@@ -0,0 +1,29 @@
# RootSIFT

File: [`_features.py`](../../pyvisim/features/_features.py)

RootSIFT is SIFT with Hellinger-kernel normalization. It is the **default feature
extractor** for both `VLADEncoder` and `FisherVectorEncoder`.

## What it does differently from SIFT

After computing standard SIFT descriptors, each descriptor is:

1. L1-normalized (divided by the sum of its elements, plus a small epsilon), then
2. element-wise square-rooted.

Comparing these transformed vectors with the Euclidean/dot-product operations the
encoders use is equivalent to comparing the original descriptors under the Hellinger
kernel. This is a well-established, near-free improvement over raw SIFT for retrieval,
which is why it is the default.
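
The transform and the Hellinger equivalence can be checked with a few lines of NumPy. This is a minimal sketch: the function name and the epsilon value are assumptions, and the inputs are toy non-negative vectors rather than real SIFT descriptors.

```python
import numpy as np


def root_sift(descriptors, eps=1e-7):
    """L1-normalize each row, then take the element-wise square root."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    l1_normalized = descriptors / (descriptors.sum(axis=1, keepdims=True) + eps)
    return np.sqrt(l1_normalized)


a = np.array([[1.0, 3.0, 0.0, 4.0]])
b = np.array([[2.0, 2.0, 4.0, 0.0]])

# Dot product of the transformed descriptors...
dot_of_roots = (root_sift(a) @ root_sift(b).T).item()
# ...matches the Hellinger kernel of the L1-normalized originals (up to eps).
hellinger = np.sum(np.sqrt((a / a.sum()) * (b / b.sum())))

print(np.isclose(dot_of_roots, hellinger, atol=1e-6))  # True
```

Because each transformed row has (almost exactly) unit L2 norm, Euclidean distance between RootSIFT vectors is a monotone function of this kernel value.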

## Notes

- `output_dim` is `128`, same as SIFT.
- Same empty-result handling as SIFT: no keypoints yields an empty `(0, 128)` array.

## References

- R. Arandjelović and A. Zisserman. "Three things everyone should know to improve
object retrieval". In: 2012 IEEE Conference on Computer Vision and Pattern
Recognition. 2012, pp. 2911-2918. doi: 10.1109/CVPR.2012.6248018.
19 changes: 19 additions & 0 deletions docs/features/sift.md
@@ -0,0 +1,19 @@
# SIFT

File: [`_features.py`](../../pyvisim/features/_features.py)

Scale-Invariant Feature Transform descriptors via OpenCV's `cv2.SIFT`. SIFT was the
original local descriptor used for VLAD and Fisher Vector encoding, so it is the
baseline handcrafted extractor here.

- `output_dim` is `128` (standard SIFT descriptor length).

For most uses prefer [`RootSIFT`](rootsift.md), which normalizes these descriptors and
usually improves retrieval at no extra cost.

## References

- D. G. Lowe. "Distinctive Image Features from Scale-Invariant Keypoints". In:
International Journal of Computer Vision 60.2 (2004), pp. 91-110. issn: 1573-1405.
doi: 10.1023/B:VISI.0000029664.99615.94.
url: https://doi.org/10.1023/B:VISI.0000029664.99615.94.
10 changes: 10 additions & 0 deletions docs/neural_networks/README.md
@@ -0,0 +1,10 @@
# Neural networks

File: [`neural_networks/`](../../pyvisim/neural_networks/)

This module is a placeholder for a **Siamese network** and is **not yet implemented**.

## Planned direction

TODO: define the architecture, the dependencies (the backbone will most likely come
from `torchvision`), and the training procedure.